\[
\begin{aligned}
f(X) &= \beta_0 + \sum_{k=1}^K \beta_k h_k(X) \\
&= \beta_0 + \sum_{k=1}^K \beta_k g\left(w_{k0} + \sum_{j=1}^p w_{kj} X_j\right)
\end{aligned}
\]
Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient.
Here is a simple example of gradient descent: we have a feature \(x\) and an outcome \(y\), and we want to find the line that best fits the data.
The “loss function” can be the mean squared error (MSE): \(L = \frac{1}{n}\sum_{i=1}^{n}(y_i - (\hat\beta_0 + \hat\beta_1x_i))^2\)
We can use gradient descent to find the values of \(\hat\beta_0\) and \(\hat\beta_1\) that minimize the MSE.
The gradient of the loss function with respect to the parameter \(\beta_0\) is: \(\frac{\partial L}{\partial \beta_0} = \frac{-2}{n}\sum_{i=1}^{n}(y_i - (\beta_0 + \beta_1x_i))\)
And for \(\beta_1\): \(\frac{\partial L}{\partial \beta_1} = \frac{-2}{n}\sum_{i=1}^{n}x_i(y_i - (\beta_0 + \beta_1x_i))\)
We can then update the values of \(\hat\beta_0\) and \(\hat\beta_1\) using the following update rule: \(\beta_{i,\text{new}} = \beta_i - \alpha \frac{\partial L}{\partial \beta_i}\), where \(\alpha\) is the learning rate.
Here’s an example of gradient descent with a small dataset:
| x | y |
|---|---|
| 1 | 1 |
| 2 | 3 |
| 3 | 5 |
| 4 | 7 |
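As a concrete sketch, gradient descent with these update equations can be run on the dataset above in a few lines of R (the learning rate and number of iterations here are illustrative choices, not values from the slides):

# Gradient descent for simple linear regression on the small dataset above
x <- c(1, 2, 3, 4)
y <- c(1, 3, 5, 7)

beta0 <- 0       # starting values
beta1 <- 0
alpha <- 0.05    # learning rate

for (iter in 1:1000) {
  resid <- y - (beta0 + beta1 * x)
  grad0 <- -2 * mean(resid)       # dL/d(beta0)
  grad1 <- -2 * mean(x * resid)   # dL/d(beta1)
  beta0 <- beta0 - alpha * grad0
  beta1 <- beta1 - alpha * grad1
}

c(beta0, beta1)   # converges toward the least-squares fit: beta0 = -1, beta1 = 2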
Application Exercise (02:00)

- Run install.packages("keras") once in the console
- Run keras::install_keras() once in the console (type Y when prompted)
Application Exercise (05:00)

- Use step_naomit() (set skip = FALSE to make sure it does this when we are prepping the data)
- Use step_normalize()
library(tidymodels)   # recipes, rsample, etc.
library(ISLR)         # Hitters data (also available in ISLR2)

# Preprocessing recipe: drop rows with missing Salary, impute any remaining
# missing predictors, dummy-code the factors, and normalize all predictors
rec <- recipe(Salary ~ ., data = Hitters) |>
  step_naomit(Salary, skip = FALSE) |>
  step_impute_knn(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_predictors())
set.seed(1)
splits <- initial_split(Hitters, prop = 2/3)
train <- training(splits)
test <- testing(splits)

training_processed <- prep(rec) |> bake(new_data = train)
testing_processed <- prep(rec) |> bake(new_data = test)

# keras expects plain numeric matrices for the predictors
training_matrix_x <- training_processed |> select(-Salary) |> as.matrix()
testing_matrix_x <- testing_processed |> select(-Salary) |> as.matrix()
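A quick check (assuming the standard Hitters data, where the three two-level factors League, Division, and NewLeague each become a single dummy column): the processed predictor matrix has 19 columns, which is where input_shape = 19 in the models below comes from.

ncol(training_matrix_x)   # 19 predictor columns after dummy coding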
Application Exercise (04:00)
library(keras)

# One hidden layer with 50 ReLU units and 40% dropout, then a single output
# unit for the regression on Salary
mod <- keras_model_sequential() |>
  layer_dense(units = 50, activation = "relu", input_shape = 19) |>
  layer_dropout(rate = 0.4) |>
  layer_dense(units = 1) |>
  compile(loss = "mse",
          metrics = list("mse")) |>
  fit(training_matrix_x,
      training_processed$Salary,
      epochs = 100,
      batch_size = 32,
      validation_data = list(testing_matrix_x, testing_processed$Salary))
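Because the pipe ends in fit(), mod holds the returned training history rather than the model object itself; one minimal way to inspect the result is to plot the loss curves:

plot(mod)   # training vs. validation MSE by epoch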
Application Exercise (06:00)
# A smaller hidden layer, heavier dropout, and many more epochs
mod <- keras_model_sequential() |>
  layer_dense(units = 40, activation = "relu", input_shape = 19) |>
  layer_dropout(rate = 0.5) |>
  layer_dense(units = 1) |>
  compile(loss = "mse",
          metrics = list("mse")) |>
  fit(training_matrix_x,
      training_processed$Salary,
      epochs = 1000,
      batch_size = 30,
      validation_data = list(testing_matrix_x, testing_processed$Salary))
How many parameters does this model have in total?
The model for the 10-class digit problem is fit by minimizing the negative multinomial log-likelihood (cross-entropy):

\[ -\sum_{i=1}^n \sum_{m=0}^{9} y_{im} \log\big(f_m(x_i)\big) \]

where \(y_{im} = 1\) if the \(i\)th image is digit \(m\) and 0 otherwise.
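As a small numeric illustration (the probabilities below are made up, not from the slides), here is the cross-entropy contribution of a single image whose true digit is 3:

probs <- c(0.02, 0.01, 0.05, 0.80, 0.02, 0.03, 0.02, 0.02, 0.02, 0.01)  # hypothetical softmax output f(x_i)
y <- as.numeric(0:9 == 3)   # one-hot label for digit 3
-sum(y * log(probs))        # this image's term in the sum: -log(0.80), about 0.22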
# Two hidden layers (256 and 128 ReLU units) with dropout, then a 10-unit
# softmax output layer; 784 = 28 x 28 pixel inputs
model <- keras_model_sequential() |>
  layer_dense(units = 256, activation = "relu", input_shape = 784) |>
  layer_dropout(rate = 0.4) |>
  layer_dense(units = 128, activation = "relu") |>
  layer_dropout(rate = 0.3) |>
  layer_dense(units = 10, activation = "softmax")
summary(model)
Model: "sequential"
________________________________________________________________________________
Layer (type)                        Output Shape                   Param #
================================================================================
dense_2 (Dense)                     (None, 256)                    200960
dropout_1 (Dropout)                 (None, 256)                    0
dense_1 (Dense)                     (None, 128)                    32896
dropout (Dropout)                   (None, 128)                    0
dense (Dense)                       (None, 10)                     1290
================================================================================
Total params: 235,146
Trainable params: 235,146
Non-trainable params: 0
________________________________________________________________________________
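The parameter counts in the summary can be reproduced by hand: each dense layer has (number of inputs × number of units) weights plus one bias per unit, and the dropout layers add no parameters.

\[
\underbrace{(784 \times 256 + 256)}_{200{,}960} + \underbrace{(256 \times 128 + 128)}_{32{,}896} + \underbrace{(128 \times 10 + 10)}_{1{,}290} = 235{,}146
\]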
Dr. Lucy D’Agostino McGowan, adapted from slides by Hastie & Tibshirani