DATA 606 Capstone in Data Science

Course Description: This is a semi-independent online course that gives graduate Data Science students an opportunity to apply the knowledge, skills, and tools they have learned to a real-world data science project. Students are expected to form data science teams whose members’ strengths complement each other so that they can tackle challenging problems. Although real data sets are recommended, students may also use synthetic data sets to experience the entire lifecycle of a data science project. Typically, this lifecycle includes collecting, cleansing, and transforming the data; choosing the best methods to solve the problem or test the hypothesis; implementing those methods; and quantifying the robustness and accuracy of the resulting model. The project can be conducted with industry, government, and academic partners, who can provide a data set.

Prerequisite: Completion of DATA 601, 602, 603, 604, and 605.

Course Learning Objectives: Students are expected to demonstrate proficiency and competence in

  • Managing a full life-cycle data science project
  • Communicating and presenting effectively
  • Preparing insightful visualizations
  • Writing a report summarizing their hypothesis, datasets, models implemented, and outcomes of their work.

Required Readings: None

Suggested Reading:

  • Ryan Hodson, Ry’s Git Tutorial, 2014, ASIN: B00QFIA5OC.
  • Robert de Graaf, Managing Your Data Science Projects, Springer, 2019, ISBN 9781484249079.
  • Cole Nussbaumer Knaflic, Storytelling with Data: A Data Visualization Guide for Business Professionals, Wiley, 2015, ISBN 9781119002253.
  • Brian Godsey, Think Like a Data Scientist, Manning, 2017, ISBN 9781633430273.
  • Kenneth S. Rubin, Essential Scrum: A Practical Guide to the Most Popular Agile Process, Addison-Wesley, 2012, ISBN 9780137043293.
  • Ralph Hughes, Agile Data Warehousing Project Management: Business Intelligence Systems Using Scrum, Morgan Kaufmann, 2012, ISBN 9780123964632.

Course Format and Assignments

  • This is an online course, where all the materials prepared by the instructor and students will be shared electronically.
  • The instructor and students will meet weekly or bi-weekly to discuss project progress.
  • Students will work in teams.
  • After teams are formed, the project proceeds in three main phases:
    • Project Pitch
    • EDA & Model Construction
    • Execution and Interpretation
  • Each group will have a team repository on GitHub. This repo will include all code, datasets, results, and generated figures, with proper explanations and well-organized documentation (see https://docs.github.com/en/organizations/organizing-members-into-teams).
  • If a dataset file is larger than 25 MB but smaller than 1 GB, students should split it into smaller files (about 23 MB each) after data cleansing. If the file is larger than 1 GB, students should instead include a short script in their repository that downloads the dataset to a local drive.
  • Each team will have a website for its project, also hosted on GitHub (see https://pages.github.com/). This site needs to include the title of the project, team members and roles, overview and aim of the project, information about the dataset(s), references, etc.
  • Each group will prepare and give presentations to the class demonstrating the progress of its project.
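
The dataset-size guidance above (split files between 25 MB and 1 GB into parts of about 23 MB, and script the download for anything larger) could be handled along the following lines. This is only a minimal sketch, not a required implementation: the function names, the ".partNNNN" naming scheme, and the reading of "~23 MB" as 23 * 1024 * 1024 bytes are all assumptions made for illustration. The split is byte-based, so a part may end in the middle of a record and the parts must be rejoined before the data is read.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: "~23 MB" is read here as 23 * 1024 * 1024 bytes.
CHUNK_BYTES = 23 * 1024 * 1024


def split_file(src, out_dir, chunk_bytes=CHUNK_BYTES):
    """Split src into numbered parts of at most chunk_bytes bytes each."""
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with src.open("rb") as f:
        index = 0
        while True:
            block = f.read(chunk_bytes)
            if not block:
                break
            # Zero-padded index keeps the parts in order when sorted by name.
            part = out_dir / f"{src.name}.part{index:04d}"
            part.write_bytes(block)
            parts.append(part)
            index += 1
    return parts


def join_files(parts, dest):
    """Concatenate the parts (sorted by name) back into a single file."""
    dest = Path(dest)
    with dest.open("wb") as out:
        for part in sorted(parts):
            out.write(Path(part).read_bytes())
    return dest


def download_dataset(url, dest):
    """Stream a dataset from url to dest (for files too large for the repo)."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

A team might call split_file("data/clean.csv", "data/parts") once after cleansing and commit the parts, then call join_files on them at the start of a notebook. For a dataset over 1 GB, download_dataset would be given the URL supplied by the data provider; the file paths here are hypothetical.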