Welcome to Statistical Learning

Dr. D’Agostino McGowan

👋

Lucy D’Agostino McGowan

  mcgowald@wfu.edu
  Thurs 10a-11a

bit.ly/sta-363-s23

Intros

  • Name
  • Major
  • Fun OR boring fact

Statistical Learning Problems

  • Identify risk factors for breast cancer

Statistical Learning Problems

  • Identify risk factors for breast cancer
  • Customize an email spam detection system
  • Data: 4601 labeled emails sent to George who works at HP Labs
  • Input features: frequencies of words and punctuation
george you hp free ! edu remove
spam 0.00 2.26 0.02 0.52 0.51 0.01 0.28
email 2.27 1.27 0.90 0.07 0.11 0.29 0.01

Statistical Learning Problems

  • Identify risk factors for breast cancer
  • Customize an email spam detection system
  • Identify numbers in handwritten zip code

DEMO

Statistical Learning Problems

  • Identify risk factors for breast cancer
  • Customize an email spam detection system
  • Identify numbers in handwritten zip code
  • Establish the relationship between variables in population survey data

Income survey data for males from the central Atlantic region of US, 2009

Statistical Learning Problems

  • Identify risk factors for breast cancer
  • Customize an email spam detection system
  • Identify numbers in handwritten zip code
  • Establish the relationship between variables in population survey data
  • Classify pixels of an image

Usage \(\in\) {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil}

✌️ types of statistical learning

  • Supervised Learning
  • Unsupervised Learning

Supervised Learning

  • outcome variable: \(Y\), (dependent variable, response, target)
  • predictors: vector of \(p\) predictors, \(X\), (inputs, regressors, covariates, features, independent variables)
  • In the regression problem, \(Y\) is quantitative (e.g price, blood pressure)
  • In the classification problem, \(Y\) takes values in a finite, unordered set (survived/died, digit 0-9, cancer class of tissue sample)
  • We have training data \((x_1, y_1), \dots, (x_N, y_N)\). These are observations (examples, instances) of these measurements

Supervised Learning

What do you think are some objectives here?

Objectives

  • Accurately predict unseen test cases
  • Understand which inputs affect the outcome, and how
  • Assess the quality of our predictions and inferences

Unsupervised Learning

  • No outcome variable, just a set of predictors (features) measured on a set of samples
  • objective is more fuzzy – find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation
  • difficult to know how well your are doing
  • different from supervised learning, but can be useful as a pre-processing step for supervised learning

Let’s take a tour - class website

  • Concepts introduced:
    • How to find slides
    • How to find assignments
    • How to find RStudio Cloud
    • How to get help
    • How to find policies