DATA 608: Probability and Statistics for Data Science

Course Description: Data science relies heavily on probability theory and inferential statistics to extract meaningful insight from complex datasets. DATA 608 introduces the essential concepts and tools of probability and statistics that form the backbone of data-driven decision-making. The course combines theoretical tools with application-oriented analysis so that students can apply statistical methods effectively in real-world data science scenarios. The course consists of two major parts. The first part introduces the key concepts of probability theory: probability spaces, distribution functions, probability mass functions and densities, random variables, expectation values and moments, variance and covariance, conditional probability, independence, Bayes' formula, the laws of large numbers, and the central limit theorem. The second part covers the basic concepts of statistical inference, including sampling methods, confidence intervals, hypothesis testing, and one-way and two-way ANOVA.

 

Outline of the Course:

Part I: Probability Theory

Week 1: Introduction to Basics of Probability

  • Course overview, syllabus review, and outline of the topics covered in the course
  • Sample spaces and events
  • Counting methods and definition of probability
  • Basic rules of probability and definition of probability distribution
  • Inclusion-exclusion principle
  • Independence (of events)

Week 2: Random Variables and Probability Distributions

  • Discrete and continuous random variables
  • Distribution functions, probability mass functions, and densities
  • Expectation values of random variables
  • Variance and covariance of random variables

Week 3: Probability Distributions for Discrete Random Variables

  • Uniform and geometric distributions
  • Bernoulli and binomial distributions
  • Poisson distribution
  • Hypergeometric distribution

Week 4: Probability Distributions for Continuous Random Variables

  • Normal distribution
  • Approximating a binomial distribution with a normal distribution
  • Gamma distribution
  • Exponential distribution
  • Calculation of moments of continuous distributions

Week 5: Conditional Probability and Bayes' Formula

  • Conditional probability: definitions, examples, and calculations
  • Bayes' theorem, likelihoods, and priors
  • Applications of Bayes' theorem in calculating probabilities

Week 6: Useful Inequalities, Convergence, and Limit Theorems

  • Markov inequality
  • Chebyshev inequality
  • Sequences of random variables
  • Convergence in probability
  • Weak law of large numbers
  • Central limit theorem

Week 7: Midterm Exam on Probability Concepts

 

Part II: Inferential Statistics

Week 8: Statistics and Sampling Distributions

  • Distributions of sample totals, means, and proportions
  • Distributions based on normal random samples

Week 9: Point Estimation and Confidence Intervals

  • Concept and criteria for point estimation
  • Confidence intervals and their basic properties
  • One-sample confidence interval for the mean
  • One-sample confidence interval for proportions

Week 10: Hypothesis Testing (Based on a Single Sample)

  • Hypotheses and test procedures
  • p-values
  • Hypothesis testing for a population mean
  • Hypothesis testing for a population proportion

Week 11: Analysis of Variance (ANOVA)

  • Hypotheses and procedures
  • Single-factor ANOVA
  • Two-factor ANOVA with and without replication

Week 12: Continuous Targets and Regression Analysis

  • Linear regression model
  • Estimating parameters of linear regression
  • Multiple regression analysis
  • Correlation
  • Regression with matrices

Week 13: Analysis of Categorical Target Variables

  • Chi-squared tests and goodness of fit
  • Two-way contingency tables
  • Chi-squared test for homogeneity
  • Chi-squared test for independence

Week 14: Student Presentations