[Python] Section 4: Bias, Variance, and Hyperparameters from HarvardX

Published: 2024-05-06

1. Bias and Variance

1.1 Test Error and Generalization

We know to evaluate models on both training and test data because models can do well on training data but poorly on new data. When models do well on new data (test data), we call it generalization. Models that “generalize well” are good with new data.

  • High Noise Level: If the data is very noisy we will not be able to generalize very well and we will get a high test error due to this noise.
  • Underfitting: The model is not complex enough to capture the patterns in the data.
  • Overfitting: The model focuses too much on the training data and does not generalize to the test data.

Three graphs. High noise level: the data is very scattered. Underfitting: a straight line through curved data. Overfitting: a very, very complex line through simple data.
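The three situations can be reproduced with a small experiment. The sketch below is only an illustration under assumed settings (a quadratic true function, Gaussian noise, and polynomial degrees 1, 2, and 15 are all choices made here, not taken from the course): it fits each model on a small training set and compares training and test MSE.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed true relationship (hypothetical, for illustration only)
    return x ** 2

def make_data(n, noise_sd=0.1):
    x = rng.uniform(-1, 1, n)
    y = true_f(x) + rng.normal(0, noise_sd, n)
    return x, y

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

x_train, y_train = make_data(30)
x_test, y_test = make_data(1000)

# degree -> (train MSE, test MSE)
results = {}
for degree in (1, 2, 15):
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        mse(y_train, np.polyval(coefs, x_train)),
        mse(y_test, np.polyval(coefs, x_test)),
    )
    print(f"degree {degree:2d}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

The straight line should show high error on both sets (underfitting), while the degree-15 fit should show a training error below that of the quadratic but a worse test error (overfitting).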

Let's focus in on noise.

1.2 Irreducible and Reducible Error

There are two ways that “noise” can contribute to the generalization error:

  • Irreducible error: This is due to the noise in the data. We can’t do anything to decrease this kind of error. This is also known as aleatoric error.
  • Reducible error: This is due to the model. We can decrease the error due to overfitting and underfitting by improving the model. This kind of error is also known as epistemic error.
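A quick way to see the irreducible part is to score the true function itself on noisy data. In this sketch the sine function, the noise level, and the sample size are assumptions chosen for illustration: even a "perfect" model is left with an MSE close to the noise variance, while a poor model adds reducible error on top of it.

```python
import numpy as np

rng = np.random.default_rng(1)
noise_sd = 0.5  # assumed noise level; its square is the irreducible error

def true_f(x):
    # Hypothetical true function used only for this illustration
    return np.sin(x)

x = rng.uniform(0, 6, 100_000)
y = true_f(x) + rng.normal(0, noise_sd, x.size)

# Predicting with the true function leaves only the noise.
mse_perfect = float(np.mean((y - true_f(x)) ** 2))

# A deliberately poor model (always predict the mean) adds reducible error.
mse_constant = float(np.mean((y - y.mean()) ** 2))

print(f"noise variance : {noise_sd ** 2:.3f}")
print(f"perfect model  : {mse_perfect:.3f}")
print(f"constant model : {mse_constant:.3f}")
```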

1.3 Model Complexity and Reducible Error

Reducible error comes from either underfitting or overfitting. Simple, less complex models are more likely to underfit. However, as the complexity of our models increases, we are more likely to overfit.

On the left we see a simple linear model, a low complexity model that is underfit to the data. While the data points roughly form a curved parabola, the simple linear model predicts a straight line through the center of the plot. On the right we see a high degree polynomial model, a very complex model that is overfit to the data. The data points form the same U-shaped curve as before. This time, the high degree polynomial model does predict along the center of the curve, but the line is very jagged, shooting up and down at unexpected points.

1.4 Bias and Variance

As the model complexity increases, the error on our training set will decrease. This error reflects the model's bias: how far its predictions are, on average, from the true values. Said another way, the bias will decrease as the model complexity increases. A low-bias model will have its predictions centered around the true values according to the training set.

Model variance, on the other hand, is the variability among multiple fits of the same model on different training sets. You can think of variance as indicating how sensitive a model is to changes in its training data. More complex models are more sensitive, and so variance will increase as the model complexity increases.

1.5 Bias vs Variance: Variance of a Simple model

To visualize variance, let's look at a simple model. Each orange line is the same simple linear model but trained using a different split of the training data. Note that there is not much variation between prediction lines. Even when we fit on 500 different samples, all prediction lines are pretty close to each other.

1.6 Bias vs Variance: Variance of a Complex model

Now let's do the same thing but using a more complex degree-8 polynomial model. Observe how the predictions now vary wildly between samples used to fit the model. When we plot the predictions of 500 different sample fits, the predictions may generally center around the line of the true function but the variation between predictions creates a mess of lines far above and below the true function line.

You may also notice that you can almost pick out the data points selected for the training sample in the first three plots. This is because the prediction line passes through those points exactly. This is an indication of the low bias of the model.
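The comparison between the simple and the complex model can be approximated numerically. The following is a sketch under assumed settings (a sine-shaped true function, 20 training points per fit, Gaussian noise; none of these come from the course plots): it refits each model on 500 fresh training samples and measures how much the predictions spread across fits.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    # Hypothetical true function for the simulation
    return np.sin(2 * np.pi * x)

x_grid = np.linspace(0, 1, 50)  # points where predictions are compared

def prediction_stats(degree, n_fits=500, n_train=20, noise_sd=0.3):
    """Return (squared bias, variance), each averaged over x_grid."""
    preds = np.empty((n_fits, x_grid.size))
    for i in range(n_fits):
        # Each fit sees a different random training sample.
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, noise_sd, n_train)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x_grid)
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return float(bias_sq), float(variance)

bias_simple, var_simple = prediction_stats(degree=1)
bias_complex, var_complex = prediction_stats(degree=8)

print(f"degree 1: bias^2 {bias_simple:.3f}, variance {var_simple:.3f}")
print(f"degree 8: bias^2 {bias_complex:.3f}, variance {var_complex:.3f}")
```

The linear model should show a small variance but a clearly non-zero squared bias, while the degree-8 model should show a much larger variance. The squared-bias estimate for the complex model is itself noisy because it is averaged over highly variable fits.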

1.7 Bias vs Variance: the Trade-off

Now we see that the bias decreases as our model complexity increases, while the variance of the model increases. This is where we see the trade-off: past a certain point, we cannot lower one without increasing the other. We want to find the balance between the two that gives the lowest combined error.
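The trade-off can be made precise with the standard decomposition of the expected squared error at a point x. The symbols below are assumptions introduced for illustration and are not from the original notes: f is the true function, \widehat{f} is the model fitted on a random training set, and \sigma^2 is the variance of the noise in y=f(x)+\epsilon. The expectation is taken over training sets and noise.

```latex
E\left[\left(y-\widehat{f}(x)\right)^2\right]
=\underbrace{\left(E\left[\widehat{f}(x)\right]-f(x)\right)^2}_{\text{bias}^2}
+\underbrace{E\left[\left(\widehat{f}(x)-E\left[\widehat{f}(x)\right]\right)^2\right]}_{\text{variance}}
+\underbrace{\sigma^2}_{\text{irreducible error}}
```

The first two terms make up the reducible error; the last term is the irreducible error from section 1.2.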

Based on the image below, the point where the lines intersect would be a good balance between bias and variance.

You can roughly think of bias as how accurate a model is and variance as how precise it is.

Consider the diagram below:

In the case of high bias and low variance the model shows very similar predictions regardless of what training data is used to fit it. However, these predictions are systematically off; predictions are wrong but all wrong in roughly the same way. It is precise but not accurate.

If we have low bias and high variance then the model is very complex and so very sensitive to changes in the training data. The predictions across the different fits are correct on average, but there is large variability or "spread" in the predictions. It is accurate but not precise.

We ideally want low bias and low variance. This would be a model whose predictions when fit on different training sets would be both accurate (centered on the true value) and precise (low spread).

If we have high bias and high variance, then the model is neither accurate nor precise. Its predictions are systematically off and there is also a lot of spread. If this is the case then we have a very poor model indeed. We would do well to reassess the steps and assumptions of our modeling approach in this situation.

1.8 Bias vs Variance

Consider this plot of 2,000 best-fit simple linear regression models, each fitted on a different 20-point training set. Note that there is not much variation among the different fits' predictions.

Now consider this plot of 2,000 best-fit degree-20 polynomial models, each fitted on a different 20-point training set. Note the wild variation among the predictions of the different model fits.

1.9 Bias vs Variance (coefficients)

Let's look at the range of the coefficient values for these different models. For the 2,000 different simple linear regression models, we see that there is some variability, but very little when compared to the polynomial fits.

These are the first 10 coefficient values for the 2,000 degree-20 polynomial fits. Be sure to notice the change in axis scale between this plot and the last! The coefficients visualized here vary much more between fits than did those for the simple model, and some of the coefficient values become quite extreme. This means that a small variation in the training data can make a huge change in the resulting model's coefficients and thus its predictions.
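The spread of the coefficients can be checked with a small simulation. This sketch makes several assumptions that differ from the plots described above: the data come from a hypothetical quadratic with Gaussian noise, and the complex model is a degree-8 polynomial rather than degree-20, so that each 20-point fit stays well determined.

```python
import numpy as np

rng = np.random.default_rng(3)

def coef_spread(degree, n_fits=2000, n_train=20, noise_sd=0.3):
    """Standard deviation of each fitted coefficient across many training sets."""
    coefs = np.empty((n_fits, degree + 1))
    for i in range(n_fits):
        x = rng.uniform(-1, 1, n_train)
        y = x ** 2 + rng.normal(0, noise_sd, n_train)  # assumed true function
        coefs[i] = np.polyfit(x, y, degree)
    return coefs.std(axis=0)

spread_simple = coef_spread(degree=1)
spread_complex = coef_spread(degree=8)

print("linear fit, largest coefficient std  :", spread_simple.max())
print("degree-8 fit, largest coefficient std:", spread_complex.max())
```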

1.10 Model Selection

Model selection is the application of a principled method to determine the complexity of the model, e.g., choosing a subset of predictors, choosing the degree of the polynomial model, etc.

A strong motivation for performing model selection is to avoid overfitting, which we saw can happen when:

  1. there are too many predictors, because:
    • the feature space has high dimensionality
    • the polynomial degree is too high
    • too many interaction terms are considered
  2. the coefficients' values are too extreme

We've already seen ways to address the problem of choosing predictors and polynomial degree using greedy algorithms and cross-validation. But what about the second potential source of overfitting? How do we discourage extreme coefficient values in the model parameters?

1.11 Regularization

What we want is low model error. We've been using mean squared error for our model's loss function:

\frac{1}{n}\sum_{i=1}^{n}\left | y_i-\beta^Tx_i \right |^2

We also want to discourage extreme parameter values. We could also create a loss which is a function of the magnitudes of the model's parameters. We'll call this L_{reg}. We could do this in several ways. For example, we could sum the squares of the parameters or their absolute values.

L_{reg}=\left\{\begin{matrix} \sum_{j=1}^{J}\beta_j^2\\ \sum_{j=1}^{J}\left | \beta_j \right | \end{matrix}\right.

Note that the summation index starts at 1. The model is not penalized for its \beta_0, which can be interpreted as the intercept.

Now we can combine these two loss functions into a single loss function for our model using regularization.

L=\frac{1}{n}\sum_{i=1}^{n}\left | y_i-\beta^Tx_i \right |^2+\lambda L_{reg}

\lambda is the regularization parameter. It controls the relative importance between model error and the regularization term.

  • \lambda =0 : equivalent to a regression model using no regularization.
  • \lambda =\infty : yields a model where all \beta_j for j=1,...,J are 0 (the intercept \beta_0 is not penalized).
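The combined loss can be written as a short function. This is a minimal sketch: the function name regularized_loss, its arguments, and the toy numbers are all hypothetical, and beta[0] is taken to be the intercept so that it is left out of the penalty.

```python
import numpy as np

def regularized_loss(beta, X, y, lam, penalty="l2"):
    """MSE plus lam * L_reg.

    beta[0] is the intercept (not penalized); X has shape (n, J) with no
    column of ones. All names here are illustrative assumptions.
    """
    residuals = y - (beta[0] + X @ beta[1:])
    mse = np.mean(residuals ** 2)
    if penalty == "l2":
        reg = np.sum(beta[1:] ** 2)       # sum of squared parameters
    elif penalty == "l1":
        reg = np.sum(np.abs(beta[1:]))    # sum of absolute values
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return float(mse + lam * reg)

# Toy data that beta fits perfectly, so the MSE part is zero
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
beta = np.array([0.5, 1.0, -2.0])
y = beta[0] + X @ beta[1:]

loss_none = regularized_loss(beta, X, y, lam=0.0)
loss_l2 = regularized_loss(beta, X, y, lam=0.1, penalty="l2")
loss_l1 = regularized_loss(beta, X, y, lam=0.1, penalty="l1")
print(loss_none, loss_l2, loss_l1)
```

With a perfect fit, the only contribution left is the penalty: 0.1 * (1 + 4) for the squared version and 0.1 * (1 + 2) for the absolute-value version.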

But how do we determine which value of \lambda to use? The answer is with cross-validation! We will try many different values of \lambda and pick the one that gives us the lowest cross-validation loss.

1.12 Regularization: LASSO Regression

LASSO regression: minimize L_{LASSO} with respect to \betas.

L_{LASSO}=\frac{1}{n}\sum_{i=1}^{n}\left | y_i-\beta^Tx_i \right |^2+\lambda \sum_{j=1}^{J}\left | \beta_j \right |

Note that \sum_{j=1}^{J}\left | \beta_j \right | is the L_1 norm of the \beta vector.

There's no need to regularize the bias term (the intercept), \beta_0 , since it is not connected to the predictors.

1.13 Regularization: Ridge Regression

Ridge regression: minimize L_{RIDGE} with respect to \betas.

L_{RIDGE}=\frac{1}{n}\sum_{i=1}^{n}\left | y_i-\beta^Tx_i \right |^2+\lambda \sum_{j=1}^{J}\beta_j^2

Note that \sum_{j=1}^{J}\beta_j^2 is the squared L_2 norm of the \beta vector.

Again we do not regularize the bias, \beta_0 .

1.14 Ridge regularization with single validation set vs with cross-validation

To emphasize the usefulness of cross-validation, compare these two plots demonstrating ridge regularization using a single validation set and using cross-validation. Note how by taking the average of the 5 folds we can get more reliable results than relying on just one single validation split.

2. Ridge and LASSO

2.1 Ridge and LASSO - Computational complexity

Solution to ridge regression:

\beta=(X^TX+\lambda I)^{-1}X^TY
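The closed-form expression translates almost directly into NumPy. The sketch below is one possible implementation, not the course's code: the helper name ridge_fit and the synthetic data are assumptions, and the predictors and response are centered first so that the intercept stays unpenalized.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Return (intercept, coefs) using (X^T X + lam * I)^{-1} X^T y on centered data."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering keeps the intercept out of the penalty
    coefs = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ coefs
    return intercept, coefs

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 50)

fits = {lam: ridge_fit(X, y, lam) for lam in (0.0, 10.0, 1000.0)}
for lam, (b0, b) in fits.items():
    print(f"lambda {lam:7.1f}: intercept {b0:.3f}, coefs {np.round(b, 3)}")
```

With \lambda =0 this reduces to ordinary least squares, and the coefficients shrink as \lambda grows. Note that the expression is used as written above; if the loss is the mean (1/n) of the squared errors, the matching closed form has n\lambda in place of \lambda.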

LASSO, on the other hand, has no conventional analytical solution, as the L1 norm has no derivative at zero. We can, however, use the concept of subdifferential or subgradient to find a manageable expression.
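One common way to turn the subgradient idea into an algorithm is cyclic coordinate descent with soft-thresholding. The sketch below is an illustration of that technique, not the solver used in the course: the function names, the fixed number of sweeps, and the synthetic data are assumptions. It minimizes the LASSO loss from section 1.12 with an unpenalized intercept.

```python
import numpy as np

def soft_threshold(rho, t):
    # Shrink rho toward zero by t; values within [-t, t] become exactly zero.
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_fit(X, y, lam, n_iter=200):
    """Minimize (1/n) * ||y - b0 - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, n_pred = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean      # centering handles the intercept
    col_sq = (Xc ** 2).sum(axis=0)       # assumes no constant columns
    b = np.zeros(n_pred)
    for _ in range(n_iter):
        for j in range(n_pred):
            # Residual with predictor j's current contribution added back
            r_j = yc - Xc @ b + Xc[:, j] * b[j]
            rho = Xc[:, j] @ r_j
            b[j] = soft_threshold(rho, n * lam / 2) / col_sq[j]
    return y_mean - x_mean @ b, b

# Synthetic data where only two of five predictors matter (assumed)
rng = np.random.default_rng(8)
X = rng.normal(size=(100, 5))
y = 1.0 + X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(0, 0.1, 100)

b0, b = lasso_fit(X, y, lam=0.5)
print("intercept:", round(float(b0), 3))
print("coefs    :", np.round(b, 3))
```

The threshold n * lam / 2 follows from setting the subgradient of the loss with respect to one coefficient to zero. Coefficients whose correlation with the residual stays below that threshold are set to exactly zero.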

2.2 Ridge Visualized

The ridge estimator is where the constraint and the loss intersect.

The values of the coefficients decrease as lambda increases, but they are not nullified.

2.3 LASSO visualized

The LASSO estimator tends to zero out parameters, as the OLS loss can easily intersect with the constraint on one of the axes.

The values of the coefficients decrease as lambda increases, and many are quickly driven to exactly zero.

Variable Selection as Regularization

What are the pros and cons of the two approaches?

Since LASSO regression tends to produce zero estimates for a number of model parameters - we say that LASSO solutions are sparse - we consider LASSO to be a variable selection method.

Ridge is faster to compute, but many people prefer LASSO because it performs variable selection as well as suppressing extreme parameter values, which makes the resulting model easier to interpret.

2.4 Ridge regularization with validation only: step by step

Here we will go through ridge regularization using a single validation set, with MSE as our loss.
For ridge regression there exists an analytical solution for the coefficients:

  1. Split the data into train, validation, and test sets, (X,Y)_{train},(X,Y)_{validation},(X,Y)_{test}
  2. Iterate over a range of \lambda values for \lambda in \lambda_{min}...\lambda _{max}:
    • Determine the \beta that minimizes the L_{ridge} using the train data, \beta_{ridge}(\lambda )=(X^TX+\lambda I)^{-1}X^TY
    • Record the MSE loss for this \lambda using the validation data, L_{MSE}(\lambda ) .
  3. Select the \lambda that minimizes the MSE loss on the validation data, \lambda _{ridge}=argmin_\lambda L_{MSE}(\lambda )
  4. Refit the model using both train and validation data combined, (X,Y)_{train} and (X,Y)_{validation}, now using \lambda _{ridge}, resulting in \widehat{\beta}_{ridge}(\lambda _{ridge})
  5. Report the MSE on the test set, (X,Y)_{test}, given the \widehat{\beta}_{ridge}(\lambda _{ridge}).
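The five steps above can be sketched in NumPy as follows. Everything specific here is an assumption made for illustration: the synthetic data, the 60/20/20 split, the grid of \lambda values, and the helper names ridge_fit and mse. The intercept is handled by centering so that it is not penalized.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data (assumed)
n, J = 200, 5
X = rng.normal(size=(n, J))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(0, 1.0, n)

def ridge_fit(X, y, lam):
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coefs = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ coefs, coefs

def mse(X, y, intercept, coefs):
    return float(np.mean((y - intercept - X @ coefs) ** 2))

# 1. Split into train, validation, and test sets
idx = rng.permutation(n)
train, val, test = idx[:120], idx[120:160], idx[160:]

# 2. Fit on train and record validation MSE for each lambda
lambdas = np.logspace(-3, 3, 13)
val_mse = [mse(X[val], y[val], *ridge_fit(X[train], y[train], lam))
           for lam in lambdas]

# 3. Select the lambda with the lowest validation MSE
lam_best = float(lambdas[int(np.argmin(val_mse))])

# 4. Refit on train and validation data combined
train_val = np.concatenate([train, val])
b0, b = ridge_fit(X[train_val], y[train_val], lam_best)

# 5. Report the MSE on the test set
test_mse = mse(X[test], y[test], b0, b)
print("selected lambda:", lam_best)
print("test MSE       :", test_mse)
```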

2.5 LASSO regularization with validation only: step by step

Here we will go through LASSO regularization using a single validation set, with MSE as our loss.

The steps are largely the same as with Ridge regression except that there is no analytical solution for the coefficients in Lasso regression, so we use a solver.

  1. Split the data into train, validation, and test sets, (X,Y)_{train},(X,Y)_{validation},(X,Y)_{test}
  2. Iterate over a range of \lambda values for \lambda in \lambda _{min}...\lambda _{max}:
    • Determine the \beta that minimizes the L_{lasso}, \beta_{lasso}(\lambda ), using the train data. This is done using a solver.
    • Record the MSE loss for this \lambda using the validation data, L_{MSE}(\lambda ) .
  3. Select the \lambda that minimizes the MSE loss on the validation data, \lambda _{lasso}=argmin_\lambda L_{MSE}(\lambda )
  4. Refit the model using both train and validation data combined, (X,Y)_{train} and (X,Y)_{validation}, now using \lambda _{lasso}, resulting in \widehat{\beta}_{lasso}(\lambda _{lasso})
  5. Report the MSE on the test set, (X,Y)_{test}, given the \widehat{\beta}_{lasso}(\lambda _{lasso}).
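The same procedure with a solver in place of the closed form might look like the sketch below. The solver here is a small hand-written coordinate descent routine, chosen only to keep the example self-contained; the synthetic data, split sizes, \lambda grid, and the names lasso_fit and mse are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic data (assumed)
n, J = 200, 5
X = rng.normal(size=(n, J))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(0, 1.0, n)

def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for (1/n) * ||y - b0 - X b||^2 + lam * ||b||_1."""
    n_obs, n_pred = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    col_sq = (Xc ** 2).sum(axis=0)
    b = np.zeros(n_pred)
    for _ in range(n_iter):
        for j in range(n_pred):
            rho = Xc[:, j] @ (yc - Xc @ b + Xc[:, j] * b[j])
            # Soft-thresholding update for coefficient j
            b[j] = np.sign(rho) * max(abs(rho) - n_obs * lam / 2, 0.0) / col_sq[j]
    return y_mean - x_mean @ b, b

def mse(X, y, intercept, coefs):
    return float(np.mean((y - intercept - X @ coefs) ** 2))

# 1. Split into train, validation, and test sets
idx = rng.permutation(n)
train, val, test = idx[:120], idx[120:160], idx[160:]

# 2. Solve on train and record validation MSE for each lambda
lambdas = np.logspace(-3, 1, 9)
val_mse = [mse(X[val], y[val], *lasso_fit(X[train], y[train], lam))
           for lam in lambdas]

# 3. Select the lambda with the lowest validation MSE
lam_best = float(lambdas[int(np.argmin(val_mse))])

# 4. Refit on train and validation data combined
train_val = np.concatenate([train, val])
b0, b = lasso_fit(X[train_val], y[train_val], lam_best)

# 5. Report the MSE on the test set
test_mse = mse(X[test], y[test], b0, b)
print("selected lambda:", lam_best)
print("coefficients   :", np.round(b, 3))
print("test MSE       :", test_mse)
```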

2.6 Ridge regularization with CV: step by step

Lastly, let us go through ridge regularization using cross-validation, with MSE as our loss.

  1. Set aside the test set, (X,Y)_{test}. The remaining data is used for training and validation.
  2. Split the remaining data into K folds, (X,Y)_{train}^{-k},(X,Y)_{validation}^k
  3. Iterate over these K folds for k in 1,...,K
  4. Within each fold, iterate over a range of \lambda values for \lambda in \lambda_0,...,\lambda _n:
    • Determine the \beta that minimizes the L_{ridge}, \beta_{ridge}(\lambda ,k)=(X^TX+\lambda I)^{-1}X^TY, using the train data of the fold, (X,Y)_{train}^{-k}
    • Record L_{MSE}(\lambda ,k) using the validation data of the fold, (X,Y)_{validation}^k

At this point we have a 2-D matrix, rows are for different k, and columns are for different \lambda values.

  5. Average the L_{MSE}(\lambda ,k) over the K folds for each \lambda , giving \overline{L}_{MSE}(\lambda )
  6. Find the \lambda that minimizes the \overline{L}_{MSE}(\lambda ), resulting in \lambda _{ridge}.
  7. Refit the model using the full training data (all K folds, i.e. (X,Y)_{train} and (X,Y)_{validation} combined), resulting in \widehat{\beta}_{ridge}(\lambda _{ridge})
  8. Report the MSE on the test set, (X,Y)_{test}, given the \widehat{\beta}_{ridge}(\lambda _{ridge})
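The cross-validated version can be sketched the same way. As before, the synthetic data, K = 5, the \lambda grid, and the helper names are assumptions. The sketch treats everything outside the test set as the training data that is split into folds, and it builds the 2-D matrix of losses with one row per fold and one column per \lambda.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data (assumed)
n, J, K = 200, 5, 5
X = rng.normal(size=(n, J))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(0, 1.0, n)

def ridge_fit(X, y, lam):
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coefs = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return y_mean - x_mean @ coefs, coefs

def mse(X, y, intercept, coefs):
    return float(np.mean((y - intercept - X @ coefs) ** 2))

# Set aside the test set, then split the rest into K folds
idx = rng.permutation(n)
test, rest = idx[:40], idx[40:]
folds = np.array_split(rest, K)

# Loss matrix: rows are folds k, columns are lambda values
lambdas = np.logspace(-3, 3, 13)
losses = np.empty((K, lambdas.size))
for k in range(K):
    val_k = folds[k]
    train_k = np.concatenate([folds[m] for m in range(K) if m != k])
    for i, lam in enumerate(lambdas):
        b0_k, b_k = ridge_fit(X[train_k], y[train_k], lam)
        losses[k, i] = mse(X[val_k], y[val_k], b0_k, b_k)

# Average over folds and pick the best lambda
mean_loss = losses.mean(axis=0)
lam_best = float(lambdas[int(np.argmin(mean_loss))])

# Refit on all non-test data and report the test MSE
b0, b = ridge_fit(X[rest], y[rest], lam_best)
test_mse = mse(X[test], y[test], b0, b)
print("selected lambda:", lam_best)
print("test MSE       :", test_mse)
```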