DATA 603 Platforms for Big Data Processing

Description: The goal of this course is to introduce methods, technologies, and computing platforms for performing data analysis at scale. Topics include the theory and techniques of data acquisition, cleansing, aggregation, and processing; the management of large heterogeneous data collections; and information and knowledge extraction. Students are introduced to map-reduce, streaming, and external-memory algorithms and their implementations using Hadoop and its ecosystem (HBase, Hive, Pig, and Spark). Students will gain practical experience in analyzing large existing databases.
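
To illustrate the map-reduce pattern mentioned above, the following is a minimal word-count sketch in PySpark, one of the ecosystem tools listed; the input path "input.txt" and the application name are illustrative placeholders rather than course materials.

    # Minimal map-reduce sketch in PySpark (assumes a local Spark installation;
    # "input.txt" is a hypothetical input file used only for illustration).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
    sc = spark.sparkContext

    # Map phase: split each line into words and emit (word, 1) pairs.
    pairs = sc.textFile("input.txt") \
              .flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1))

    # Reduce phase: sum the counts for each distinct word.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Print a small sample of the results.
    for word, count in counts.take(10):
        print(word, count)

    spark.stop()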

Prerequisite: DATA 601 and enrollment in the Data Science Program; other students may be admitted with instructor permission.

Course Learning Objectives: Upon completion, students will:

  • Have the ability to analyze large datasets using modern tools
  • Be familiar with open-source big data tools and pipelines such as Hadoop and its ecosystem (HBase, Hive, Pig, and Spark)

References

  • Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale by Tom White (2015)
  • Data Analytics with Hadoop: An Introduction for Data Scientists by Benjamin Bengfort and Jenny Kim (2015)
  • Hadoop Application Architectures: Designing Real-World Big Data Applications by Mark Grover and Ted Malaska (2015)
  • Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau and Andy Konwinski (2015)
  • Advanced Analytics with Spark: Patterns for Learning from Data at Scale by Sandy Ryza and Uri Laserson (2015)
  • Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem by Douglas Eadline (2015)
  • The Stratosphere platform for big data analytics by A. Alexandrov et al., The International Journal on Very Large Data Bases, pp. 939–964 (2014)
  • Select articles

Course Format and Assignments
Students will complete 3-5 homework assignments, a semester-long project, a midterm exam, and a final exam. The homework assignments provide hands-on practice with specific big data methods. The project gives students the opportunity to practice big data management by developing a big data processing application on a common big data platform.

Tentative Syllabus

  • Week 1 – Distributed computing overview
  • Week 2 – Hadoop Distributed File System (HDFS)
  • Week 3 – MapReduce Design Patterns
  • Week 4 – Data Ingest
  • Week 5 – Spark (core)
  • Week 6 – Spark (SQL)
  • Week 7 – Spark (Streaming)
  • Week 8 – Scalable Machine Learning
  • Week 9 – Apache Hive
  • Week 10 – HBase
  • Week 11 – YARN
  • Week 12 – Stratosphere and MonetDB
  • Week 13 – Amazon EC2 and Workflow Management
  • Week 14 – Project presentations