fage: father’s age in years.
mage: mother’s age in years.
mature: maturity status of mother.
weeks: length of pregnancy in weeks.
premie: whether the birth was classified as premature (premie) or full-term.
visits: number of hospital visits during pregnancy.
marital: whether mother is married or not married at time of birth.
gained: weight gained by mother during pregnancy in lbs.
weight: weight of the baby at birth in pounds.
lowbirthweight: whether baby was classified as low birthweight (low) or not (not low).
gender: biological sex of the baby, limited to f or m.
habit: status of the mother as a nonsmoker or a smoker.
whitemom: whether mom identifies as white or not white.
The visdat package is a great way to visualize key information about your dataset.
Use vis_dat() on your dataframe to see the column types and explore missingness.
Use the vis_miss() function to learn more about the missing data.
vis_dat()
Which column has the most missing data?
vis_miss()
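As a numeric complement to the plots, missingness can also be tallied per column. A minimal sketch in base R, assuming a small made-up data frame `toy` (hypothetical values, not the nc data) so that it runs without extra packages:

```r
# Hypothetical toy data standing in for nc; the values are made up.
toy <- data.frame(
  fage   = c(NA, 31, NA, 28, 35),
  mage   = c(13, 29, 22, 27, 33),
  gained = c(38, NA, 20, 34, 27)
)

# Count and percentage of missing values in each column.
n_missing   <- colSums(is.na(toy))
pct_missing <- round(100 * colMeans(is.na(toy)), 1)

# Name of the column with the most missing values.
most_missing <- names(which.max(n_missing))

print(data.frame(n_missing, pct_missing))
print(most_missing)
```

The same idea should carry over to the real data by swapping `toy` for `nc`.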
There are three typical approaches: complete case analysis, available case analysis, and imputation.
Complete case analysis
Advantage: It is easy!
NOTE: It is important that you hold on to the original data somewhere when doing complete case analysis; we may need to go back and look at the rows we dropped.
When doing complete case analysis, it is important to always note how many observations are dropped.
What issues might there be with simply deleting 20% of your sample?
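A minimal base R sketch of complete case analysis, using a made-up data frame `toy` (hypothetical, not the nc data). It keeps the original data, stores the dropped rows, and reports how many observations were removed:

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  fage   = c(NA, 31, NA, 28, 35),
  mage   = c(13, 29, 22, 27, 33),
  gained = c(38, NA, 20, 34, 27)
)

# TRUE for rows with no missing value in any column.
is_complete <- complete.cases(toy)

# Keep `toy` untouched; build the complete-case version separately.
toy_complete <- toy[is_complete, ]

# Hold on to the dropped rows so they can be inspected later.
dropped_rows <- toy[!is_complete, ]
n_dropped    <- nrow(dropped_rows)

message(n_dropped, " of ", nrow(toy), " rows dropped")
```

In this toy example more than half of the rows disappear, which illustrates how quickly complete case analysis can shrink a sample.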
Available case analysis
Advantages: Same as complete case analysis (fast and easy!)
Disadvantages: Same as complete case analysis (the variables we are including must be missing completely at random)
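A minimal base R sketch of available case analysis on a made-up data frame `toy` (hypothetical, not the nc data). It assumes the analysis only needs `mage` and `gained`, so rows are dropped only when one of those two is missing:

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  fage   = c(NA, 31, NA, 28, 35),
  mage   = c(13, 29, 22, 27, 33),
  gained = c(38, NA, 20, 34, 27)
)

# Variables the (hypothetical) analysis actually uses.
vars_needed <- c("mage", "gained")

# Drop a row only if one of the needed variables is missing.
keep          <- complete.cases(toy[, vars_needed])
toy_available <- toy[keep, ]

n_dropped <- nrow(toy) - nrow(toy_available)
message(n_dropped, " of ", nrow(toy), " rows dropped")
```

Rows with a missing `fage` are retained here because `fage` is not part of this analysis, so fewer observations are lost than under complete case analysis.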
Okay, but what about when neither complete case analysis nor available case analysis is appropriate?
In these situations, we are going to consider imputation.
Imputation is the process of estimating the missing values to create a completed version of the data set.
To impute all the missing values in this data set, we are going to
nc data set, but with no missing data.
There are a lot of techniques we can use for imputation.
The hardest step in this process is deciding which technique might be appropriate.
We are going to explore a few commonly used techniques, and discuss the pros and cons of each.
fage: Father’s Age
# A tibble: 1 × 13
fage mage mature weeks premie visits marital gained weight lowbi…¹ gender
<int> <int> <fct> <int> <fct> <int> <fct> <int> <dbl> <fct> <fct>
1 NA 13 younger … 39 full … 10 married 38 7.63 not low male
# … with 2 more variables: habit <fct>, whitemom <fct>, and abbreviated
# variable name ¹lowbirthweight
For our variable of interest, fage, we can conduct unconditional mean imputation (UMI) by replacing each missing value with the mean of the observed values of fage.
Behind the scenes (in practice you don’t need the code below, just showing what the recipe is doing)
# A tibble: 1 × 1
mean_fage
<dbl>
1 30.3
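# Replace each missing fage with the observed mean (30.25573)
# and flag which rows were imputed.
nc_mean_impute <- nc |>
  mutate(
    imputed = is.na(fage),
    fage = case_when(
      imputed ~ 30.25573,
      TRUE ~ as.numeric(fage)
    )
  )
# Look at the first row, whose fage was originally missing.
nc_mean_impute |>
  slice(1)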
# A tibble: 1 × 14
fage mage mature weeks premie visits marital gained weight lowbi…¹ gender
<dbl> <int> <fct> <int> <fct> <int> <fct> <int> <dbl> <fct> <fct>
1 30.3 13 younger … 39 full … 10 married 38 7.63 not low male
# … with 3 more variables: habit <fct>, whitemom <fct>, imputed <lgl>, and
# abbreviated variable name ¹lowbirthweight
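The same idea can be sketched in base R without hard-coding the mean. This is a minimal illustration on a made-up data frame `toy` (hypothetical, not the nc data). It also compares the spread before and after imputation, since filling every gap with one value tends to shrink the standard deviation:

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  fage = c(NA, 31, NA, 28, 35),
  mage = c(13, 29, 22, 27, 33)
)

# Mean of the observed values only.
mean_fage <- mean(toy$fage, na.rm = TRUE)

# Impute on a copy, flagging which rows were filled in.
toy_umi <- toy
toy_umi$imputed <- is.na(toy$fage)
toy_umi$fage[toy_umi$imputed] <- mean_fage

# Quick check: the mean is preserved, but the spread gets smaller.
sd_before <- sd(toy$fage, na.rm = TRUE)
sd_after  <- sd(toy_umi$fage)
print(c(sd_before = sd_before, sd_after = sd_after))
```

The `imputed` flag makes it easy to pick out the filled-in rows in later plots or summaries.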
Once we have completed an imputation process, our next step is ALWAYS to check
fage and mage
fage might be related to another variable present in the data set.
Pros: It is fast and easy to compute, and you can use all observed data.
Cons:
Behind the scenes (in practice you don’t need the code below, just showing what the recipe is doing)
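# Fit a linear regression of fage on mage, predict for every row,
# then use the prediction wherever fage is missing.
nc_cmean_impute <- fit(linear_reg(),
                       fage ~ mage,
                       data = nc) |>
  predict(new_data = nc) |>
  bind_cols(nc) |>
  mutate(
    imputed = is.na(fage),
    fage = case_when(
      imputed ~ .pred,
      TRUE ~ as.numeric(fage)
    )
  )
nc_cmean_impute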
# A tibble: 1,000 × 15
.pred fage mage mature weeks premie visits marital gained weight lowbi…¹
<dbl> <dbl> <int> <fct> <int> <fct> <int> <fct> <int> <dbl> <fct>
1 17.8 17.8 13 younger … 39 full … 10 married 38 7.63 not low
2 18.6 18.6 14 younger … 42 full … 15 married 20 7.88 not low
3 19.5 19 15 younger … 37 full … 11 married 38 6.63 not low
4 19.5 21 15 younger … 41 full … 6 married 34 8 not low
5 19.5 19.5 15 younger … 39 full … 9 married 27 6.38 not low
6 19.5 19.5 15 younger … 38 full … 19 married 22 5.38 low
7 19.5 18 15 younger … 37 full … 12 married 76 8.44 not low
8 19.5 17 15 younger … 35 premie 5 married 15 4.69 low
9 20.3 20.3 16 younger … 38 full … 9 married NA 8.81 not low
10 20.3 20 16 younger … 37 full … 13 married 52 6.94 not low
# … with 990 more rows, 4 more variables: gender <fct>, habit <fct>,
# whitemom <fct>, imputed <lgl>, and abbreviated variable name
# ¹lowbirthweight
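For comparison, the same conditional mean imputation can be sketched in base R with `lm()`. This is a minimal illustration on a made-up data frame `toy` (hypothetical, not the nc data), assuming `mage` has no missing values:

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  fage = c(NA, 31, NA, 28, 35, 24),
  mage = c(13, 29, 22, 27, 33, 21)
)

# lm() drops rows with a missing fage when fitting (default na.action).
fit_fage <- lm(fage ~ mage, data = toy)

# Predict for every row, then fill in only where fage was missing.
toy_cmi <- toy
toy_cmi$imputed <- is.na(toy$fage)
toy_cmi$.pred   <- predict(fit_fage, newdata = toy)
toy_cmi$fage    <- ifelse(toy_cmi$imputed, toy_cmi$.pred, toy_cmi$fage)

print(toy_cmi)
```

Observed values of `fage` are left untouched; only the flagged rows take the model prediction.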
What if fage is actually conditional on more than just mage?
The tidymodels default is to try to impute values using all other predictors.
recipe_nc <- recipe(lowbirthweight ~ ., data = nc) |>
  step_impute_linear(fage, impute_with = imp_vars(all_predictors()))
wf <- workflow() |>
  add_recipe(recipe_nc) |>
  add_model(logistic_reg())
fit(wf, data = nc)
Warning:
There were missing values in the predictor(s) used to impute;
imputation did not occur.
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_impute_linear()
── Model ───────────────────────────────────────────────────────────────────────
Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept) fage mage matureyounger mom
-1499.5824 -2.1897 5.1293 44.8299
weeks premiepremie visits maritalnot married
5.4982 15.0877 -2.6857 2.3088
gained weight gendermale habitsmoker
-0.3773 221.9656 -6.9561 -12.8116
whitemomwhite
-11.0732
Degrees of Freedom: 799 Total (i.e. Null); 787 Residual
(200 observations deleted due to missingness)
Null Deviance: 493.3
Residual Deviance: 7.287e-07 AIC: 26
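The first warning says the predictors used to impute fage had missing values of their own. A minimal base R sketch of the same problem, on a made-up data frame `toy` (hypothetical, not the nc data), together with one possible workaround: restrict the imputation model to predictors that are fully observed. In the recipe this would presumably mean narrowing `impute_with`, for example to `imp_vars(mage)`, assuming `mage` is complete.

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  fage   = c(NA, 31, NA, 28, 35, 24, 40),
  mage   = c(13, 29, 22, 27, 33, 21, 38),
  gained = c(38, NA, NA, 34, 27, 30, 25)
)
needs_impute <- is.na(toy$fage)

# Imputation model using every other column, including `gained`,
# which has missing values of its own.
fit_all  <- lm(fage ~ mage + gained, data = toy)
pred_all <- predict(fit_all, newdata = toy)

# Rows where fage is missing AND the prediction is also missing.
n_still_missing <- sum(is.na(pred_all[needs_impute]))

# Workaround: use only the fully observed predictor.
fit_mage  <- lm(fage ~ mage, data = toy)
pred_mage <- predict(fit_mage, newdata = toy)

print(n_still_missing)
print(anyNA(pred_mage[needs_impute]))
```

Unlike this sketch, which only fails for the affected rows, the recipe step skips imputation entirely, as the warning above states. Another option is to impute the other predictors first.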
The mice package in R is a nice one for this.
Dr. Lucy D’Agostino McGowan, adapted from Nicole Dalzell’s slides