Exploratory Data Analysis

Lucy D’Agostino McGowan

Learning objectives

Identify data types and match them to appropriate visualization techniques
Explain the utility of exploratory data analysis
Conduct exploratory data analysis on a new dataset

Why explore your data?

Why EDA?

Check whether everything is as expected in your data
- Are there as many rows as you expected in the data set?
- Is any data missing?
- Are there data entry errors?
- Are there outliers?
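
These checks can be sketched in a few lines of R. This is a minimal sketch, assuming the penguins data from the palmerpenguins package (used later in these slides); the names n_rows and missing_per_column are illustrative, not part of any package.

```r
library(palmerpenguins)

# How many rows did we get? Compare this against what you expected.
n_rows <- nrow(penguins)

# How many missing values are in each column?
missing_per_column <- colSums(is.na(penguins))
missing_per_column

# A numeric summary can flag data entry errors or outliers,
# for example an impossible minimum or maximum.
summary(penguins$body_mass_g)
```

A summary like this only hints at problems; plotting the variable is usually the next step.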

Why EDA?

Check whether everything is as expected in your data
Check whether the assumptions of your modeling choice are met
- Do the data types match the analysis method?
- If doing simple linear regression, is the relationship linear?
- If doing multiple linear regression, is the functional form modeled correctly?
- Do any points have a strong influence on the model results?
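
One possible way to look at linearity and influential points is sketched below. It assumes the penguins data from the palmerpenguins package and an illustrative model of body mass on flipper length; neither the model nor the object names (fit, influence) come from the slides.

```r
library(palmerpenguins)

# Fit a simple linear regression (lm() drops rows with missing values by default).
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Residuals vs. fitted values: a curved pattern suggests the
# relationship is not well described by a straight line.
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")

# Cook's distance: larger values mark points with more influence on the fit.
influence <- cooks.distance(fit)
head(sort(influence, decreasing = TRUE), 3)
```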

Reading in Data

Reading in data

Some datasets come bundled with R packages. To access one, load the package and then call the data() function, like this:

library(datasauRus)
data(datasaurus_dozen)

Reading in data

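Other times you’ll have data in a file, like a .csv or Excel file. For delimited files you can use the read_* functions from readr, which is loaded when you load the tidyverse package. (Excel files need the readxl package, which is installed with the tidyverse but has to be loaded separately.) For example, to read in a .csv file, you could run: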

library(tidyverse)  # attaches readr, which provides read_csv()
movie_data <- read_csv("movie_data.csv")

Note that movie_data.csv would need to be saved in your RStudio project folder for this code to run. We will practice this in a few weeks.
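
To try read_csv() without having a file on hand, the sketch below writes a tiny made-up .csv to a temporary location and reads it back. The column names and values are hypothetical and only stand in for a real movie_data.csv.

```r
library(readr)

# Write a small, made-up CSV to a temporary file (hypothetical contents).
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("title,year,rating",
    "Movie A,1999,7.5",
    "Movie B,2005,6.1"),
  csv_path
)

# Read it back in; show_col_types = FALSE silences the column-type message.
movie_data <- read_csv(csv_path, show_col_types = FALSE)
movie_data
```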

Checking your data

`glimpse` at your data

glimpse(datasaurus_dozen)

Rows: 1,846
Columns: 3
$ dataset <chr> "dino", "dino", "dino", "dino", "dino", "d…
$ x       <dbl> 55.3846, 51.5385, 46.1538, 42.8205, 40.769…
$ y       <dbl> 97.1795, 96.0256, 94.4872, 91.4103, 88.333…

How many rows are in this dataset? How many columns?

00:30

`glimpse` at your data

library(palmerpenguins)
glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, …
$ island            <fct> Torgersen, Torgersen, Torgersen,…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6…
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650…
$ sex               <fct> male, female, female, NA, female…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 20…

What type of variable is species? How many numeric variables are there?

00:30
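
The answers read off the glimpse() output can also be checked in code. This is a small sketch using base R helpers; n_numeric is an illustrative name.

```r
library(palmerpenguins)

# Number of rows and columns, in that order.
dim(penguins)

# What type of variable is species?
class(penguins$species)

# Count the columns that R treats as numeric (integer or double).
n_numeric <- sum(sapply(penguins, is.numeric))
n_numeric
```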

Visualizing data

What are we looking for?

The “shape” of the data
Patterns
Outliers (strange points / data errors)

Data

Let’s grab one of the datasaurus_dozen datasets.

x_data <- datasaurus_dozen |>
  filter(dataset == "x_shape")

What does filter do? Why ==?

00:30
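
A minimal sketch of what is happening: == compares each row's value to "x_shape" and returns TRUE or FALSE, and filter() keeps the rows where the result is TRUE. A single = is used for assignment and for naming arguments, not for comparison. The name is_x_shape is illustrative.

```r
library(dplyr)
library(datasauRus)

# `==` asks a question of every row: "is dataset equal to 'x_shape'?"
is_x_shape <- datasaurus_dozen$dataset == "x_shape"
head(is_x_shape)

# filter() keeps only the rows where that answer is TRUE.
x_data <- datasaurus_dozen |>
  filter(dataset == "x_shape")

nrow(x_data)
```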

One continuous variable

Histogram

The geom_* functions in ggplot2 describe the type of plot you want to create. Which one do you think would create a histogram?

00:30

`geom_histogram`

Histogram

ggplot(x_data, aes(x = x)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What does this warning mean? How do you think we can get rid of it?
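
One way to address the message is to choose the binning yourself, either with binwidth (the width of each bar in the units of x) or with bins (the number of bars). The values 5 and 20 below are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)
library(dplyr)
library(datasauRus)

x_data <- datasaurus_dozen |>
  filter(dataset == "x_shape")

# Option 1: set the width of each bin.
p_binwidth <- ggplot(x_data, aes(x = x)) +
  geom_histogram(binwidth = 5)

# Option 2: set the number of bins.
p_bins <- ggplot(x_data, aes(x = x)) +
  geom_histogram(bins = 20)

p_binwidth
```

It is worth trying a few values, since the apparent shape of the data can change with the bin width.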

Histogram

What does this plot tell us about the shape of this data?

00:30

Density plot

What geom_ do you think would create a density plot?

00:30

geom_density

Density plots

ggplot(x_data, aes(x = x)) + 
  geom_density()

Density plots

[Figure: density plots of x with adjust = 0.1, adjust = 1, and adjust = 2]
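
The adjust argument of geom_density() multiplies the default bandwidth: values below 1 give a wigglier curve and values above 1 give a smoother one. A sketch of how plots like these could be produced (the object names are illustrative):

```r
library(ggplot2)
library(dplyr)
library(datasauRus)

x_data <- datasaurus_dozen |>
  filter(dataset == "x_shape")

# Smaller bandwidth: follows the data closely, more wiggles.
p_wiggly <- ggplot(x_data, aes(x = x)) +
  geom_density(adjust = 0.1)

# Larger bandwidth: smoother, may hide real features.
p_smooth <- ggplot(x_data, aes(x = x)) +
  geom_density(adjust = 2)

p_wiggly
```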

Boxplot

What geom_ do you think would create a boxplot?

00:30

`geom_boxplot`

Boxplot

Does this give us as much information as the histogram?

00:30

Boxplot

ggplot(x_data, aes(x = x, y = 1)) +
  geom_boxplot() + 
  geom_jitter()

Always show your data!
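
A possible refinement, offered as a sketch rather than the slide's own code: by default geom_jitter() adds random noise in both directions, which slightly moves points along the x axis too, and geom_boxplot() draws outliers that the jittered points then repeat. The settings below jitter only vertically and hide the boxplot's own outlier markers; the values 0.2 and 0.5 are arbitrary.

```r
library(ggplot2)
library(dplyr)
library(datasauRus)

x_data <- datasaurus_dozen |>
  filter(dataset == "x_shape")

p_box <- ggplot(x_data, aes(x = x, y = 1)) +
  # Hide the boxplot's outlier points; the raw data are drawn below.
  geom_boxplot(outlier.shape = NA) +
  # width = 0 leaves the x values untouched; only spread points vertically.
  geom_jitter(width = 0, height = 0.2, alpha = 0.5)

p_box
```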

Relationship between two continuous variables

Scatterplot

ggplot(x_data, aes(x = x, y = y)) +
  geom_point()

Hex plot

ggplot(x_data, aes(x = x, y = y)) +
  geom_hex()

One categorical variable

Barplot

ggplot(datasaurus_dozen, aes(x = dataset)) + 
  geom_bar()

What does this plot tell us?

00:30
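
The bar heights are counts of rows per category, so the same information can be tabulated directly. A short sketch using count() from dplyr; dataset_counts is an illustrative name.

```r
library(dplyr)
library(datasauRus)

# One row per dataset, with n giving the number of observations in each.
dataset_counts <- datasaurus_dozen |>
  count(dataset)

dataset_counts
```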

Barplot

ggplot(datasaurus_dozen, aes(x = dataset)) + 
  geom_bar() + 
  coord_flip()

Flip the coordinates to make it easier to read

Relationship between continuous and categorical variables

Histogram

ggplot(datasaurus_dozen, aes(x = x, fill = dataset)) + 
  geom_histogram(bins = 30, alpha = 0.5)

Histogram

ggplot(datasaurus_dozen, aes(x = x)) + 
  geom_histogram(bins = 30) + 
  facet_wrap(~dataset)

Histogram

smaller_data <- datasaurus_dozen |>
  filter(dataset %in% c("slant_down", "x_shape"))

What does %in% do?

00:30
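
A small base R sketch of the idea: %in% checks, for each element on the left, whether it appears anywhere in the vector on the right. The vector candidates is made up for illustration.

```r
# Three dataset names to test (illustrative values).
candidates <- c("dino", "x_shape", "slant_down")

# TRUE where the name is one of the two we want to keep.
keep <- candidates %in% c("slant_down", "x_shape")
keep
# FALSE TRUE TRUE
```

Inside filter(), this keeps every row whose dataset value matches either name, which a single == comparison cannot do.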

Histogram

ggplot(smaller_data, aes(x = x, fill = dataset)) + 
  geom_histogram(bins = 30, alpha = 0.5)

Ridge plots

library(ggridges)
ggplot(datasaurus_dozen, aes(x = x, y = dataset, fill = dataset)) +
  geom_density_ridges(alpha = 0.4, bandwidth = 2)

Boxplot

ggplot(datasaurus_dozen, aes(x = x, y = dataset)) + 
  geom_boxplot()

What is missing?

00:30

Boxplot

ggplot(datasaurus_dozen, aes(x = x, y = dataset)) + 
  geom_boxplot() + 
  geom_jitter()

How can we make this more legible?

00:30

Boxplot

ggplot(datasaurus_dozen, aes(x = x, y = dataset, color = dataset)) + 
  geom_boxplot() + 
  geom_jitter(alpha = 0.5)
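
One further tweak that may help, sketched here as an assumption rather than part of the slides: because the y axis already names each dataset, the color legend repeats the same information and can be dropped with theme().

```r
library(ggplot2)
library(datasauRus)

p_no_legend <- ggplot(datasaurus_dozen, aes(x = x, y = dataset, color = dataset)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.5) +
  # The y axis already labels each dataset, so the legend is redundant.
  theme(legend.position = "none")

p_no_legend
```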

How will we use this?

Plot every outcome variable before performing an analysis
- Be sure to include labels and titles on all plots for full points
Plot important features
Be sure to note any missing data patterns
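
Labels and titles can be added with labs(). The sketch below uses the x_shape data from earlier in the slides; the title and axis text are placeholder wording.

```r
library(ggplot2)
library(dplyr)
library(datasauRus)

x_data <- datasaurus_dozen |>
  filter(dataset == "x_shape")

p_labeled <- ggplot(x_data, aes(x = x)) +
  geom_histogram(bins = 30) +
  # Placeholder wording: replace with text that describes your own data.
  labs(
    title = "Distribution of x in the x_shape dataset",
    x = "x",
    y = "Number of observations"
  )

p_labeled
```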

`Application Exercise`

Open the Welcome Penguins folder from the previous application exercise
Create a boxplot examining the relationship between penguins' body mass and their species.
Add jittered points to this plot
Add labels and a title to this plot

08:00
