DATA 602 Introduction to Data Analysis and Machine Learning

Description: This course provides a broad introduction to the practical side of machine learning and data analysis. Topics covered include supervised vs. unsupervised learning, decision trees, logistic regression, linear discriminant analysis, linear and non-linear regression, and support vector machines. An introduction to neural networks is provided toward the end of the class.

Prerequisite: Students must be enrolled in the Data Science Program. Other students may be admitted with instructor permission.

Course Learning Objectives: Upon completion, students will

  • Understand the basic concepts of machine learning, such as hypothesis spaces, probability, classifiers, dimensionality reduction, and cross-validation.
  • Be introduced to basic unsupervised learning methods, such as clustering.
  • Learn key supervised learning techniques including decision trees, linear and logistic regression, Bayesian classifiers, and support vector machines.
  • Be introduced to neural networks, deep learning, and reinforcement learning.
  • Apply the techniques learned to an analytics problem through a project.

References

  • Introduction to Machine Learning with Python: A Guide for Data Scientists by Andreas C. Müller and Sarah Guido (2016)
  • Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron

Recommended Software
The course will use Python 3 with the following libraries: numpy, scikit-learn (sklearn), pandas, matplotlib, and Jupyter. If you would like to install the environment locally, Anaconda is a Python distribution that includes all the required libraries. The recommended option is Google Colab, which is available through your UMBC account.
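To confirm that a local or Colab environment has the libraries listed above, a short check along the following lines may help. This is a minimal sketch, not an official course script: the helper name check_libraries and the COURSE_LIBRARIES list are hypothetical names introduced here for illustration, and Jupyter is left out because it is the notebook environment rather than a library imported in code.

```python
import importlib.util
import sys

# Import names for the libraries named in this syllabus (assumed list);
# note that scikit-learn is imported under the name "sklearn".
COURSE_LIBRARIES = ["numpy", "sklearn", "pandas", "matplotlib"]


def check_libraries(names):
    """Return a dict mapping each import name to True if it can be located."""
    # find_spec looks a module up without importing it, so a missing
    # library yields None instead of raising ImportError.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    for name, found in check_libraries(COURSE_LIBRARIES).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Any library reported as missing can be installed through Anaconda or the package manager of your choice before the first assignment.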

Course Format and Assignments
Students will complete 3-5 homework assignments, a semester-long project, a midterm, and a final exam. The assignments will give students an opportunity to gain practical experience with specific machine learning methods. The project will give students the opportunity to practice the full life cycle and processing pipeline of machine learning tasks for data science applications.

Tentative Syllabus
Week 1 – Course overview: What is Machine Learning?

Week 2 – Overview: Hypothesis Spaces, Linear Algebra, Probability and Statistics

Week 3 – Supervised Learning: Linear vs. Logistic Regression

Week 4 – Decision Trees and Naive Bayes

Week 5 – Model Validation: Cross-Validation, Performance Measures, Diagnosing Over/Underfitting

Week 6 – Feature Engineering: text, categorical data, binning

Week 7 – Support Vector Machines, Nearest Neighbor, Linear Discriminant Analysis

Week 8 – Bagging, Boosting and Ensemble Methods, and Random Forests

Week 9 – Experiment Design, Decisions in Model Selection, Productizing Models

Week 10 – Unsupervised Learning: agglomerative and divisive clustering, k-means, DBSCAN

Week 11 – Dimensionality Reduction and Visualization with Principal Component Analysis

Week 12 – Bayesian Networks

Week 13 – Introduction to Neural Networks and Deep Learning

Week 14 – Introduction to Reinforcement Learning

Week 15 – Final Exam/Project presentations