fage: father’s age in years.
mage: mother’s age in years.
mature: maturity status of mother.
weeks: length of pregnancy in weeks.
premie: whether the birth was classified as premature (premie) or full-term.
visits: number of hospital visits during pregnancy.
marital: whether mother is married or not married at time of birth.
gained: weight gained by mother during pregnancy in lbs.
weight: weight of the baby at birth in pounds.
lowbirthweight: whether baby was classified as low birthweight (low) or not (not low).
gender: biological sex of the baby, limited to f or m.
habit: status of the mother as a nonsmoker or a smoker.
whitemom: whether mom identifies as white or not white.
The visdat package is a great way to visualize key information about your dataset.
Use vis_dat() on your dataframe to see the column types and explore missingness.
Use the vis_miss() function to learn more about the missing data.
vis_dat()
Which column has the most missing data?
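As a numeric complement to the plots, the share of missing values in each column can be computed directly. This is a minimal sketch in base R; the data frame `toy` is made up for illustration (it is not the nc data), and its column names only mimic the ones above.

```r
# Toy data frame standing in for nc (values are made up for illustration)
toy <- data.frame(
  fage   = c(NA, 31, 25, NA, 40),
  mage   = c(13, 29, 24, 35, 38),
  gained = c(38, NA, 20, 15, 30)
)

# Proportion of missing values in each column, largest first
prop_missing <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(prop_missing)

# Name of the column with the most missing data
most_missing <- names(prop_missing)[1]
print(most_missing)
```

The same two lines should work on any data frame, which makes them a quick cross-check on what the plots suggest.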
vis_miss()
There are three typical approaches: complete case analysis, available case analysis, and imputation.
Advantage: It is easy!
NOTE: It is important that you hold on to the original data somewhere when doing complete case analysis; we may need to go back and look at the rows we dropped.
When doing complete case analysis, it is important to always note how many observations are dropped.
What issues might there be with simply deleting 20% of your sample?
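A minimal base R sketch of complete case analysis that follows both pieces of advice above: the original data is left untouched, and the number of dropped rows is reported. The data frame `nc_original` and its values are made up for illustration.

```r
# Toy data (made up); nc_original plays the role of the untouched data
nc_original <- data.frame(
  fage   = c(NA, 31, 25, NA, 40, 28),
  mage   = c(13, 29, 24, 35, 38, 27),
  weight = c(7.63, 7.88, NA, 8.00, 6.38, 7.10)
)

# Keep the original; store the complete cases in a separate object
keep <- complete.cases(nc_original)
nc_complete <- nc_original[keep, ]
nc_dropped  <- nc_original[!keep, ]   # rows we may want to revisit later

# Always report how many observations were dropped
n_dropped <- nrow(nc_original) - nrow(nc_complete)
cat("Dropped", n_dropped, "of", nrow(nc_original), "rows\n")
```

Keeping `nc_dropped` around makes it easy to compare the dropped rows with the retained ones later.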
Advantages: Same as complete case analysis (fast and easy!)
Disadvantages: Same as complete case analysis (the variables we are including must be missing completely at random)
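A minimal base R sketch of available case analysis, assuming the usual meaning of the term: each analysis uses every row that is complete for the variables it needs, even if other columns are missing. The helper `available_cases()` and the data are hypothetical, made up for illustration.

```r
# Toy data (made up for illustration)
toy <- data.frame(
  fage   = c(NA, 31, 25, NA, 40, 28),
  mage   = c(13, 29, 24, 35, 38, 27),
  weight = c(7.63, 7.88, NA, 8.00, 6.38, 7.10)
)

# Hypothetical helper: rows that are complete for the listed variables only
available_cases <- function(data, vars) {
  data[complete.cases(data[, vars, drop = FALSE]), , drop = FALSE]
}

# An analysis of weight and mage ignores missingness in fage
weight_mage <- available_cases(toy, c("weight", "mage"))
n_available <- nrow(weight_mage)           # rows usable for this analysis
n_complete  <- sum(complete.cases(toy))    # rows usable under complete case analysis
cat("Available cases:", n_available, " Complete cases:", n_complete, "\n")
```

Because each analysis can end up with a different set of rows, it helps to report the sample size alongside every result.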
Okay, but what about when neither complete case analysis nor available case analysis is appropriate?
In these situations, we are going to consider imputation.
Imputation is the process of estimating the missing values to create a completed version of the data set.
To impute all the missing values in this data set, we are going to
nc data set, but with no missing data.
There are a lot of techniques we can use for imputation.
The hardest step in this process is deciding which technique might be appropriate.
We are going to explore a few commonly used techniques, and discuss the pros and cons of each.
fage: Father’s Age
# A tibble: 1 × 13
   fage  mage mature    weeks premie visits marital gained weight lowbi…¹ gender
  <int> <int> <fct>     <int> <fct>   <int> <fct>    <int>  <dbl> <fct>   <fct> 
1    NA    13 younger …    39 full …     10 married     38   7.63 not low male  
# … with 2 more variables: habit <fct>, whitemom <fct>, and abbreviated
#   variable name ¹lowbirthweight
For our variable of interest, fage, we can conduct unconditional mean imputation (UMI) by replacing each missing value of fage with the mean of its observed values.
Behind the scenes (in practice you don’t need the code below; it just shows what the recipe is doing)
# A tibble: 1 × 1
  mean_fage
      <dbl>
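1      30.3
nc_mean_impute <- nc |>
  mutate(
    # flag the rows whose fage was missing before imputing
    imputed = is.na(fage),
    # 30.25573 is the observed mean of fage shown above
    fage = case_when(
      imputed ~ 30.25573,
      TRUE ~ as.numeric(fage)
    )
  )
nc_mean_impute |>
  slice(1)
# A tibble: 1 × 14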
   fage  mage mature    weeks premie visits marital gained weight lowbi…¹ gender
  <dbl> <int> <fct>     <int> <fct>   <int> <fct>    <int>  <dbl> <fct>   <fct> 
1  30.3    13 younger …    39 full …     10 married     38   7.63 not low male  
# … with 3 more variables: habit <fct>, whitemom <fct>, imputed <lgl>, and
#   abbreviated variable name ¹lowbirthweight
Once we have completed an imputation process, our next step is ALWAYS to check
fage and mage
fage might be related to another variable present in the data set.
Pros: It is fast and easy to compute, and you can use all observed data.
Cons:
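One check that can be run after single-value imputation is to compare the variable's distribution before and after. The base R sketch below uses a made-up vector rather than the nc data. It illustrates a known drawback of mean imputation: the mean is preserved, but the standard deviation shrinks, because every imputed value sits exactly at the mean.

```r
# Toy father's ages with missing values (made up for illustration)
fage <- c(NA, 31, 25, NA, 40, 28, 35, NA)

# Unconditional mean imputation: replace each NA with the observed mean
imputed  <- is.na(fage)
fage_umi <- ifelse(imputed, mean(fage, na.rm = TRUE), fage)

# Check: the mean is unchanged, but the spread shrinks
mean_before <- mean(fage, na.rm = TRUE)
mean_after  <- mean(fage_umi)
sd_before   <- sd(fage, na.rm = TRUE)
sd_after    <- sd(fage_umi)
cat("mean:", mean_before, "->", mean_after, "\n")
cat("sd:  ", sd_before, "->", sd_after, "\n")
```

The more values are imputed this way, the more the variability is understated, which in turn can make standard errors look smaller than they should.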
Behind the scenes (in practice you don’t need the code below; it just shows what the recipe is doing)
# regress fage on mage, predict for every row, then fill in only the missing fage values
nc_cmean_impute <- fit(linear_reg(),
                       fage ~ mage,
                       data = nc) |>
  predict(new_data = nc) |>
  bind_cols(nc) |>
  mutate(
    imputed = is.na(fage),
    fage = case_when(
      imputed ~ .pred,
      TRUE ~ as.numeric(fage)
    )
  )
nc_cmean_impute
# A tibble: 1,000 × 15
   .pred  fage  mage mature    weeks premie visits marital gained weight lowbi…¹
   <dbl> <dbl> <int> <fct>     <int> <fct>   <int> <fct>    <int>  <dbl> <fct>  
 1  17.8  17.8    13 younger …    39 full …     10 married     38   7.63 not low
 2  18.6  18.6    14 younger …    42 full …     15 married     20   7.88 not low
 3  19.5  19      15 younger …    37 full …     11 married     38   6.63 not low
 4  19.5  21      15 younger …    41 full …      6 married     34   8    not low
 5  19.5  19.5    15 younger …    39 full …      9 married     27   6.38 not low
 6  19.5  19.5    15 younger …    38 full …     19 married     22   5.38 low    
 7  19.5  18      15 younger …    37 full …     12 married     76   8.44 not low
 8  19.5  17      15 younger …    35 premie      5 married     15   4.69 low    
 9  20.3  20.3    16 younger …    38 full …      9 married     NA   8.81 not low
10  20.3  20      16 younger …    37 full …     13 married     52   6.94 not low
# … with 990 more rows, 4 more variables: gender <fct>, habit <fct>,
#   whitemom <fct>, imputed <lgl>, and abbreviated variable name
#   ¹lowbirthweight
What if fage is actually conditional on more than just mage?
The tidymodels default is to try to impute values using all other predictors.
recipe_nc <- recipe(lowbirthweight ~ ., data = nc) |>
  step_impute_linear(fage, impute_with = imp_vars(all_predictors()))
wf <- workflow() |>
  add_recipe(recipe_nc) |>
  add_model(logistic_reg())
fit(wf, data = nc)
Warning: 
          There were missing values in the predictor(s) used to impute;
          imputation did not occur.
        Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_impute_linear()
── Model ───────────────────────────────────────────────────────────────────────
Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
Coefficients:
       (Intercept)                fage                mage   matureyounger mom  
        -1499.5824             -2.1897              5.1293             44.8299  
             weeks        premiepremie              visits  maritalnot married  
            5.4982             15.0877             -2.6857              2.3088  
            gained              weight          gendermale         habitsmoker  
           -0.3773            221.9656             -6.9561            -12.8116  
     whitemomwhite  
          -11.0732  
Degrees of Freedom: 799 Total (i.e. Null);  787 Residual
  (200 observations deleted due to missingness)
Null Deviance:      493.3 
Residual Deviance: 7.287e-07    AIC: 26
The mice package in R is a nice one for this.
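The warning in the workflow output above reports that imputation did not occur because the variables used to impute had missing values themselves, so the model was still fit with 200 observations deleted. One way around this, sketched below in base R on made-up data, is to impute fage from a predictor that is fully observed. In the recipe above, the analogous change would presumably be to narrow `impute_with` to such variables, for example `imp_vars(mage)`, assuming mage has no missing values in nc.

```r
# Toy data (made up): mage is fully observed, fage and gained are not
toy <- data.frame(
  fage   = c(NA, 31, 25, NA, 40, 28, 35, 22),
  mage   = c(13, 29, 24, 35, 38, 27, 33, 21),
  gained = c(38, NA, 20, 15, 30, 25, NA, 40)
)

# Fit on the rows where fage is observed (lm drops rows with NA by default)
fit_fage <- lm(fage ~ mage, data = toy)

# Predict for every row, then fill in only the missing fage values
pred <- predict(fit_fage, newdata = toy)
toy$imputed <- is.na(toy$fage)
toy$fage <- ifelse(toy$imputed, pred, toy$fage)

print(toy)
```

Because the only predictor in the imputation model is complete, every missing fage receives a prediction; missing values in gained are left alone and would need their own treatment.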
Dr. Lucy D’Agostino McGowan adapted from Nicole Dalzell’s slides