DATA 690 Special Topics: Statistical Analysis and Visualization with Python

Course Description: DATA 690 aims to provide (i) an introduction to Python programming, (ii) an overview of the fundamental statistical concepts and methods frequently used in exploratory data analysis, and (iii) practice in carrying out these analyses and visualizing the findings in Python.

Prerequisites: Enrollment in the Data Science program.

Recommended References:

  • Python Programming: An Introduction to Computer Science by John Zelle
  • Programming in Python 3: A Complete Introduction to the Python Language by M. Summerfield
  • Think Stats, 2nd edition, by Allen B. Downey (O’Reilly)
  • Statistics for Business & Economics by Anderson et al.
  • Python package user guides (e.g., the NumPy User Guide and the Pandas Reference Guide)

Main Python modules we will be using:

  • Analysis: NumPy and SciPy
  • Handling Data: Pandas
  • Plotting: Matplotlib and Seaborn

Tentative Schedule

  • Week 1 – Introduction to Python and Notebooks. Python Basics: Collections.
  • Week 2 – Python Basics: Conditionals and Loops. Python Functions.
  • Week 3 – Python Input/Output and Files. Computing with NumPy.
  • Week 4 – Basic Plotting with Matplotlib. Analyzing Structured Data with Pandas.
  • Week 5 – Advanced Plotting. Time Series.
  • Week 6 – Introduction to Statistics.
  • Week 7 – Descriptive Statistics (Part 1)
  • Week 8 – Descriptive Statistics (Part 2)
  • Week 9 – Introduction to Probability
  • Week 10 – Discrete Probability Distributions
  • Week 11 – Continuous Probability Distributions
  • Week 12 – Sampling, Distribution, and Interval Estimation
  • Week 13 – Hypothesis Testing
  • Week 14 – Inference

Important Notes to Students

Please install Python 3 and Anaconda on your laptop before the first lecture. Bring your laptop and power cable to each class.

If your main goal is to learn Python, please start TODAY. There are many free online resources. For example, there is a series of videos showing the use of Python 3 (these videos do not use Jupyter Notebooks, though). Similarly, there are online Python courses offered by the University of Michigan. Another resource provides a very nice list of Python tutorials and annotated analyses.

There are also several free (open-source) textbooks and websites on coding in Python 3. Automate the Boring Stuff with Python and Learning to Program with Python by Halterman are just two examples. Some resources teach Python with a special focus on data science (e.g., Python Data Science Handbook), exploratory data analysis (e.g., Think Stats), and modeling (e.g., Modeling and Simulation in Python).

About Python Packages: There are more than 100k Python packages available. In this course, we will mainly use the following:

  • Pandas: data manipulation and analysis
  • NumPy: fundamental package for “numerical” computing
  • Matplotlib: basic plotting library
  • Seaborn: statistical data visualization
  • Statsmodels: statistical modeling, tests, and data exploration
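To give a feel for how the first two packages in this list divide the work, here is a minimal sketch, not taken from the course materials. It assumes NumPy and Pandas are installed (both ship with Anaconda); the simulated exam scores, the column names, and the passing threshold of 60 are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 simulated exam scores (not a course data set).
rng = np.random.default_rng(seed=0)
scores = rng.normal(loc=75, scale=10, size=200)

# Pandas wraps the NumPy array in a labeled table.
df = pd.DataFrame({"score": scores})

# Add a derived column; the threshold of 60 is an arbitrary assumption.
df["passed"] = df["score"] >= 60

# Descriptive statistics: count, mean, std, min, quartiles, max.
summary = df["score"].describe()
print(summary)

# Share of simulated students at or above the threshold.
print("Pass rate:", df["passed"].mean())
```

NumPy handles the numerical work (here, random number generation), while Pandas adds labels and summary methods on top of the same array. Plotting the scores with Matplotlib or Seaborn would be a natural next step.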

Other packages that are very useful for data science include:

  • SciPy: fundamental package for “scientific” computing
  • Plotly: interactive plotting library
  • Scikit-learn: machine learning
  • PyTorch: Torch-based machine learning
  • NLTK: natural language processing
  • Pillow: image analysis and manipulation
  • BeautifulSoup: pulling data out of HTML and XML files
  • Scrapy: web crawling
  • Dask: parallel computing with task scheduling

Data Set Resource