DS-200 Study Guide
Begin Your Journey to Data Science

Recommended Cloudera Training Courses
  Cloudera Developer Training for Apache Hadoop
  Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop
  Introduction to Data Science: Building Recommender Systems

Practice Test
  DS-200 Practice Test Subscription

Online Resources
  New to Data Science: tutorials, papers, meetups, books, etc.
  Data Processing & Analytics: Hadoop resources and materials listed by function
  New to Hadoop: introductory topics from Cloudera's Developer Center
  Quora.com: Data Science topic

Helpful Books
  Hadoop: The Definitive Guide: essential text by Tom White (Chapters 4, 7, 12, 15, 16)
  Hadoop In Practice: by Alex Holmes (Chapters 2, 3, 8, 9, 10)
  Programming Collective Intelligence: by Toby Segaran
  Algorithms of the Intelligent Web: by Haralambos Marmanis and Dmitry Babenko
  Mahout In Action: by Sean Owen, et al.
  Data-Intensive Text Processing with MapReduce: by Jimmy Lin, et al. (Chapter 6)
  Beautiful Data: by Toby Segaran and Jeff Hammerbacher (Chapter 5)
  Hadoop In Action: by Chuck Lam (Chapter 12)
  Introduction to Data Science: online textbook
  Pattern Recognition and Machine Learning
  A Programmer's Guide to Data Mining

Useful Blogs
  Cloudera's blogs on data science
  Hilary Mason's blog
  Math Babe
  FiveThirtyEight
  Kaggle competitions blog
  Alex Holmes's blog
  O'Reilly Radar
  Flowing Data
  Gapminder
  MLcomp

Exam Sections
  Data Acquisition
  Data Evaluation
  Data Transformation
  Machine Learning Basics
  Clustering
  Classification
  Collaborative Filtering
  Model/Feature Selection
  Probability
  Visualization
  Optimization

Data Acquisition
  Objectives
    Access and load data from a variety of sources into a Hadoop cluster, including from databases and systems such as OLTP and OLAP as well as log files and documents
    Deploy a variety of data acquisition techniques, including database integration and working with APIs
    Use command line tools such as wget and curl
    Use Hadoop tools such as Sqoop and Flume
  Study Resources
    Apache Sqoop
    Cloudera's blogs on Apache Sqoop
    Aaron Kimball on Sqoop
    Apache Flume
    Cloudera's blogs on Apache Flume
    Cloudera's blogs on data collection
    HDFS File System Shell Guide
    Hadoop: The Definitive Guide, 3rd Edition: Chapter 15
    Hadoop In Practice: Chapter 2

Data Evaluation
  Objectives
    Knowledge of the file types commonly used for input and output and the advantages and disadvantages of each
    Methods for working with various file formats including binary files, JSON, XML, and .csv
    Tools, techniques, and utilities for evaluating data from the command line and at scale
    An understanding of sampling and filtering techniques
    A familiarity with Hadoop SequenceFiles and serialization using Avro
  Study Resources
    Hadoop: The Definitive Guide, 3rd Edition: Chapter 4
    Hadoop In Practice: Chapter 3
    Apache Avro
    Cloudera's blogs on Apache Avro

Data Transformation
  Objectives
    Write a map-only Hadoop Streaming job
    Write a script that receives records on stdin and writes them to stdout
    Invoke Unix tools to convert file formats
    Join data sets
    Write scripts to anonymize data sets
    Write a Mapper using Python and invoke it via Hadoop Streaming
    Write a custom subclass of FileOutputFormat
    Write records into a new format such as AvroOutputFormat or SequenceFileOutputFormat
  Study Resources
    Hadoop Streaming
    Hadoop Streaming wiki
    Apache Hive
    Hive tutorial
    Hive language manual
    Hive joins documentation
    Apache Pig
    Pig's relational operators
    Cloudera blog on Python frameworks for Hadoop
    Hadoop: The Definitive Guide, 3rd Edition: Chapters 7, 12
    Hadoop In Practice: Chapters 8, 10

Machine Learning Basics
  Objectives
    Understand how to use Mappers and Reducers to create predictive models
    Understand the different kinds of machine learning, including supervised and unsupervised learning
    Recognize appropriate uses of the following: parametric/non-parametric algorithms, support vector machines, kernels, neural networks, clustering, dimensionality reduction, and recommender systems
  Study Resources
    Apache Mahout
    Apache Mahout wiki
    Cloudera's blogs on Apache Mahout
    Hadoop In Practice: Chapter 9
    Hadoop: The Definitive Guide, 3rd Edition: Chapter 16
    Algorithms of the Intelligent Web: Chapter 7
    A Programmer's Guide to Data Mining

Clustering
  Objectives
    Define clustering and identify appropriate use cases
    Identify appropriate uses of various models including centroid, distribution, density, group, and graph
    Describe the value and use of similarity metrics including Pearson correlation, Euclidean distance, and block distance
    Identify the algorithms applicable to each model (k-means, SVD/PCA, etc.)
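As a study aid for the similarity metrics named in the Clustering objectives, here is a minimal sketch in plain Python. The function names are illustrative only and are not taken from the exam or from any of the listed books. Block distance is also commonly called Manhattan or city-block distance. The sketch assumes both inputs are equal-length lists of numbers.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def block_distance(a, b):
    """Block (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson_correlation(a, b):
    """Pearson correlation coefficient, in the range [-1, 1].

    Returns 0.0 when either vector has zero variance (an assumption made
    here to avoid dividing by zero; other conventions exist).
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    print(euclidean_distance([0, 0], [3, 4]))
    print(block_distance([0, 0], [3, 4]))
    print(pearson_correlation([1, 2, 3], [2, 4, 6]))
```

Note that the two distances grow as items become less alike, while Pearson correlation grows as they become more alike, so a clustering routine has to treat them differently (for example by using 1 minus the correlation as a distance).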
  Study Resources
    Programming Collective Intelligence: Chapter 3
    Algorithms of the Intelligent Web: Chapter 4
    Mahout In Action: Part 2

Classification
  Objectives
    Describe the steps for training a set of data in order to identify new data based on known data
    Identify the use cases for logistic regression and Bayes' theorem
    Define classification techniques and formulas
  Study Resources
    Programming Collective Intelligence: Chapters 6, 7, 8, 9, 12
    Algorithms of the Intelligent Web: Chapters 5, 6
    Mahout In Action: Part 3

Collaborative Filtering
  Objectives
    Identify the use of user-based and item-based collaborative filtering techniques
    Describe the limitations and strengths of collaborative filtering techniques
    Given a scenario, determine the appropriate collaborative filtering implementation
    Given a scenario, determine the metrics one should use to evaluate the accuracy of a recommender system
  Study Resources
    Recommendation engines with Apache Mahout
    Programming Collective Intelligence: Chapter 2
    Algorithms of the Intelligent Web: Chapter 3
    Mahout In Action: Part 1

Model/Feature Selection
  Objectives
    Describe the role and function of feature selection
    Analyze a scenario and determine the appropriate features and attributes to select
    Analyze a scenario and determine the methods to deploy for optimal feature selection
  Study Resources
    Programming Collective Intelligence: Chapter 10
    Pattern Recognition and Machine Learning: Chapter 1.3

Probability
  Objectives
    Analyze a scenario and determine the likelihood of a particular outcome
    Determine sample percentiles
    Determine a range of items based on a sample probability density function
    Summarize a distribution of sample numbers
  Study Resources
    Programming Collective Intelligence: Chapter 8
    Pattern Recognition and Machine Learning: Chapter 2
    BetterExplained.com on Probability, Statistics, Bayes' Theorem

Visualization
  Objectives
    Determine the most effective visualization for a given problem
    Analyze a data visualization and interpret its meaning
  Study Resources
    Data Visualization: modern approaches
    Data Visualization basics
    Sample Visualizations
    DataVisualization.ch
    Data Visualization for human perception

Optimization
  Objectives
    Understand optimization methods
    Identify 1st order and 2nd order optimization techniques
    Determine the learning rate for a particular algorithm
    Determine the sources of errors in a model
  Study Resources
    Leon Bottou on stochastic learning from Advanced Lectures on Machine Learning
    Leon Bottou on online algorithms and stochastic approximations
    Programming Collective Intelligence: Chapter 5
    Data-Intensive Text Processing with MapReduce: Chapter 6
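
To make the Optimization objectives on 1st order versus 2nd order techniques and on learning rate concrete, the following sketch contrasts gradient descent with Newton's method on a toy problem. The objective f(x) = (x - 3)^2, the starting point, and all function names are assumptions chosen for illustration; none of them come from the exam materials.

```python
def grad(x):
    """First derivative of the toy objective f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


# Second derivative of f; constant because f is quadratic.
SECOND_DERIVATIVE = 2.0


def gradient_descent(x0, learning_rate, steps):
    """1st order method: uses only the gradient, scaled by a learning rate."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


def newton_step(x):
    """2nd order method: rescales the gradient by the second derivative."""
    return x - grad(x) / SECOND_DERIVATIVE


if __name__ == "__main__":
    # A small learning rate converges toward the minimum at x = 3.
    print(gradient_descent(0.0, learning_rate=0.1, steps=100))
    # A learning rate that is too large overshoots further on every step.
    print(gradient_descent(0.0, learning_rate=1.1, steps=20))
    # On a quadratic, a single Newton step lands on the minimum.
    print(newton_step(0.0))
```

On this quadratic, each gradient descent step multiplies the distance to the minimum by (1 - 2 * learning_rate), so any rate between 0 and 1 converges and anything above 1 diverges. Real models rarely have such a clean bound, which is why choosing the learning rate is listed as a separate objective.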