About the content
Machine Learning is a first-class ticket to the most exciting careers in data analysis today. As data sources proliferate along with the computing power to process them, going straight to the data is one of the most straightforward ways to quickly gain insights and make predictions. Machine learning brings together computer science and statistics to harness that predictive power. It’s a must-have skill for all aspiring data analysts and data scientists, or anyone else who wants to wrestle all that raw data into refined trends and predictions. This is a class that will teach you the end-to-end process of investigating data through a machine learning lens. It will teach you how to extract and identify useful features that best represent your data, a few of the most important machine learning algorithms, and how to evaluate the performance of your machine learning algorithms. This course is also a part of our Data Analyst Nanodegree.
You’ll learn how to start with a question and/or a dataset, and use machine learning to turn them into insights.
Lessons 1-4: Supervised Classification
**Naive Bayes:** We jump in headfirst, learning perhaps the world’s greatest algorithm for classifying text.
**Support Vector Machines (SVMs):** One of the top 10 algorithms in machine learning, and a must-try for many classification tasks. What makes it special? The ability to generate new features independently and on the fly.
**Decision Trees:** Extremely straightforward, often just as accurate as an SVM but (usually) way faster. The launch point for more sophisticated methods, like random forests and boosting.
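To make the text-classification idea concrete, here is a minimal sketch of a Naive Bayes classifier in plain Python. It is not taken from the course materials: the function names, the add-one smoothing choice, and the four tiny example documents are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocabulary = set()
    for text, label in zip(documents, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return word_counts, class_counts, vocabulary


def predict(text, word_counts, class_counts, vocabulary):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Start from the log-prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # "Naive" step: treat each word as independent given the class.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, invented for this sketch.
documents = [
    "meeting schedule for monday",
    "quarterly budget meeting",
    "win a free prize now",
    "free money win big",
]
labels = ["work", "work", "spam", "spam"]

model = train_naive_bayes(documents, labels)
print(predict("free prize", *model))
print(predict("budget meeting", *model))
```

Working in log-probabilities avoids multiplying many small numbers together, which would otherwise underflow on longer documents.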
Lesson 5: Datasets and Questions
Behind any great machine learning project is a great dataset that the algorithm can learn from. We were inspired by a treasure trove of email and financial data from the Enron corporation, which would normally be strictly confidential but became public when the company went bankrupt in a blizzard of fraud. Follow our lead as we wrestle this dataset into a machine-learning-ready format, in anticipation of trying to predict cases of fraud.
Lessons 6-7: Regressions and Outliers
Regressions are some of the most widely used machine learning algorithms, and rightly share prominence with classification. What’s a fast way to make mistakes in regression, though? Have troublesome outliers in your data. We’ll tackle how to identify and clean away those pesky data points.
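One common way to handle outliers in a regression, sketched below in plain Python, is to fit once, discard the points with the largest residuals, and fit again. The helper names, the 10% cutoff, and the toy data are assumptions for illustration, not necessarily the procedure used in the lessons.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def drop_largest_residuals(xs, ys, fraction=0.1):
    """Discard the given fraction of points that sit farthest from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    n_keep = len(xs) - int(len(xs) * fraction)
    # Keep the indices with the smallest residuals, then restore the original order.
    keep = sorted(sorted(range(len(xs)), key=lambda i: residuals[i])[:n_keep])
    return [xs[i] for i in keep], [ys[i] for i in keep]


# Toy data on the line y = 2x + 1, with one corrupted measurement.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[7] = 100

print("before cleaning:", fit_line(xs, ys))
clean_xs, clean_ys = drop_largest_residuals(xs, ys)
print("after cleaning: ", fit_line(clean_xs, clean_ys))
```

A single bad point drags the first fit well away from the true line; after it is removed, the second fit recovers a slope close to 2 and an intercept close to 1.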
Lesson 8: Unsupervised Learning
**K-Means Clustering:** The flagship algorithm when you don’t have labeled data to work with, and a quick method for pattern-searching when approaching a dataset for the first time.
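As a rough sketch of how k-means works, the plain-Python version below alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. Seeding the centroids with the first k points is an assumption made here to keep the example deterministic (real implementations usually pick starting points more carefully), and the data is invented.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, max_iter=100):
    # Assumption for this sketch: seed the centroids with the first k points.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Two visibly separate groups of unlabeled 2-D points (made-up data).
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```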
Lessons 9-12: Features, Features, Features
**Feature Creation:** Taking your human intuition about the world and turning it into data that a computer can use.
**Feature Selection:** Einstein said it best: make everything as simple as possible, and no simpler. In this case, that means identifying the most important features of your data.
**Principal Component Analysis:** A more sophisticated take on feature selection, and one of the crown jewels of unsupervised learning.
**Feature Scaling:** Simple tricks for making sure your data and your algorithm play nicely together.
**Learning from Text:** More information is in text than any other format, and there are some effective but simple tools for extracting that information.
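One of the simple feature-scaling tricks is min-max rescaling, which maps every value of a feature into the range 0 to 1 so that features measured in very different units become comparable. The sketch below is a plain-Python illustration; the function name and the salary figures are made up.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical salaries in dollars; after scaling they are unit-free.
salaries = [20000, 50000, 80000]
print(min_max_scale(salaries))
```

Scaling matters most for algorithms that compare distances across features, such as SVMs and k-means, where an unscaled feature with large values would otherwise dominate.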
Lessons 13-14: Validation and Evaluation
**Training/testing data split:** How do you know that what you’re doing is working? You don’t, unless you validate. The train-test split is simple to do, and the gold standard for understanding your results.
**Cross-validation:** Take the training/testing split and put it on steroids. Validate your machine learning results like a pro.
**Precision, recall, and F1 score:** After all this data-driven work, quantify your results with metrics tailored to what is most important to you.
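The two ideas above can be sketched in a few lines of plain Python: splitting sample indices into k folds so that every sample is used for testing exactly once, and computing precision, recall, and F1 from true and predicted labels. The function names and the example labels are assumptions for illustration only.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Each of the 5 samples lands in a test fold exactly once.
folds = list(k_fold_indices(5, 2))
print(folds)

# Made-up labels: 1 marks the class of interest, 0 everything else.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(precision_recall_f1(y_true, y_pred))
```

Precision asks how many of the flagged cases were right, while recall asks how many of the real cases were found; F1 is their harmonic mean, which is useful when the positive class is rare and plain accuracy would be misleading.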
Lesson 15: Wrapping It All Up
We take a step back and review what we’ve learned, and how it all fits together.
Mini-project at the end of each lesson
**Final project:** searching for signs of corporate fraud in Enron data
- Sebastian Thrun - Sebastian Thrun is the founder of Udacity. He is a Research Professor of Computer Science at Stanford University and a member of the National Academy of Engineering and the German Academy of Sciences. He was a Google Vice President and Fellow, where he founded Google X, which produced the self-driving car, Google Glass, and Loon, among other projects. Thrun is best known for his research in robotics, artificial intelligence, and machine learning.
Udacity is a for-profit educational organization, founded by Sebastian Thrun, David Stavens, and Mike Sokolsky, that offers massive open online courses (MOOCs). According to Thrun, the name Udacity comes from the company's desire to be "audacious for you, the student". While it originally focused on university-style courses, it now focuses more on vocational courses for professionals.