Description: The goal of this course is to introduce methods, technologies, and computing platforms for performing data analysis at scale. Topics include theory and techniques for data acquisition, cleansing, aggregation, and processing; management of large heterogeneous data collections; and information and knowledge extraction. Students are introduced to map-reduce, streaming, and external-memory algorithms and their implementations using Hadoop and its ecosystem (HBase, Hive, Pig, and Spark). Students will gain practical experience in analyzing large existing databases.
Prerequisite: DATA 601. Enrollment in the Data Science Program. Other students may be admitted with instructor permission.
Course Learning Objectives: Upon completion, students will:
- Be able to analyze large datasets using modern tools
- Be familiar with the open-source big data tool chain, including Hadoop and its ecosystem (HBase, Hive, Pig, Spark)
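To give a flavor of the map-reduce programming model covered in the course, below is a minimal word-count sketch in plain Python that simulates the map, shuffle, and reduce phases. This is an illustrative sketch only; it runs on a single machine with no Hadoop or Spark installation, and the function names (`map_phase`, `shuffle`, `reduce_phase`) are chosen here for exposition rather than taken from any framework API.

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data analysis at scale"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In Hadoop or Spark the same three stages run in parallel across a cluster, with the shuffle moving intermediate pairs between machines; the single-machine version above only shows the data flow.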
References
- Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale by Tom White (2015)
- Data Analytics with Hadoop: An Introduction for Data Scientists by Benjamin Bengfort and Jenny Kim (2015)
- Hadoop Application Architectures: Designing Real-World Big Data Applications by Mark Grover and Ted Malaska (2015)
- Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau and Andy Konwinski (2015)
- Advanced Analytics with Spark: Patterns for Learning from Data at Scale by Sandy Ryza and Uri Laserson (2015)
- Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem by Douglas Eadline (2015)
- The Stratosphere platform for big data analytics by A. Alexandrov et al., The VLDB Journal (The International Journal on Very Large Data Bases), pp. 939-964 (2014)
- Selected articles
Course Format and Assignments
Students will complete 3-5 homework assignments, a semester-long project, a midterm, and a final exam. The homework assignments give students hands-on experience with specific big data methods. The project gives students the opportunity to practice big data management by developing a processing application on a common big data platform.
Tentative Syllabus
- Week 1 – Distributed computing overview
- Week 2 – Hadoop File System
- Week 3 – MapReduce Design Patterns
- Week 4 – Data Ingest
- Week 5 – Spark (core)
- Week 6 – Spark (SQL)
- Week 7 – Spark (Streaming)
- Week 8 – Scalable Machine Learning
- Week 9 – Apache Hive
- Week 10 – HBase
- Week 11 – YARN
- Week 12 – Stratosphere and MonetDB
- Week 13 – Amazon EC2 and Workflow Management
- Week 14 – Project presentations