
This post presents methods for finding a balance between underfitting and overfitting a model. Underfitting occurs when the model is a poor predictor of the response. With linear regression, underfitting is largely addressed through the diagnostic checks covered in a previous post. A linear model is overfit when it includes more predictors than are needed to represent its relationship to the response variable. Appropriately reducing the complexity of the model improves its ability to make predictions on new data, and it helps with interpretability.

There are three general approaches to reducing model complexity: dimension reduction, variable selection, and regularization. Dimension reduction is beyond the scope of this post and will not be covered. This post presents two methods of variable selection (testing- and criterion-based methods) and regularization through lasso regression.

Testing-Based Methods

Testing-based methods are the easiest to implement but should only be considered when there are just a few predictors. The idea is simple. In forward selection, we start with a linear model with no predictors, add them one at a time, and keep only those predictors with a low p-value. Backward elimination is just the opposite: we start with a linear model that contains all predictors (including interactions, if suspected), remove the predictor with the highest p-value, build a new linear model with the reduced set of predictors, and continue that process until only predictors with low p-values remain.

We’ll use the teengamb dataset from the faraway package to demonstrate backward elimination. This dataset contains survey results from a study of teenage gambling in Britain. The response variable is gamble, which is the expenditure on gambling in pounds per year. The predictors are information regarding each survey respondent, such as gender and income.

library(faraway)
data(teengamb)
head(teengamb)
##   sex status income verbal gamble
## 1   1     51   2.00      8    0.0
## 2   1     28   2.50      8    0.0
## 3   1     37   2.00      6    0.0
## 4   1     28   7.00      4    7.3
## 5   1     65   2.00      8   19.6
## 6   1     61   3.47      6    0.1

A linear model with all predictors is as follows (we’ll assume this model passes all of the required diagnostic checks):

tg = lm(gamble~., data=teengamb)
summary(tg)
##
## Call:
## lm(formula = gamble ~ ., data = teengamb)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -51.082 -11.320  -1.451   9.452  94.252
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.55565   17.19680   1.312   0.1968    
## sex         -22.11833    8.21111  -2.694   0.0101 *  
## status        0.05223    0.28111   0.186   0.8535    
## income        4.96198    1.02539   4.839 1.79e-05 ***
## verbal       -2.95949    2.17215  -1.362   0.1803    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.69 on 42 degrees of freedom
## Multiple R-squared:  0.5267,	Adjusted R-squared:  0.4816
## F-statistic: 11.69 on 4 and 42 DF,  p-value: 1.815e-06

Since the p-value for status is the highest, we remove it first.

tg = update(tg, . ~ . -status) # remove status
summary(tg)
##
## Call:
## lm(formula = gamble ~ sex + income + verbal, data = teengamb)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -50.639 -11.765  -1.594   9.305  93.867
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24.1390    14.7686   1.634   0.1095    
## sex         -22.9602     6.7706  -3.391   0.0015 **
## income        4.8981     0.9551   5.128 6.64e-06 ***
## verbal       -2.7468     1.8253  -1.505   0.1397    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.43 on 43 degrees of freedom
## Multiple R-squared:  0.5263,	Adjusted R-squared:  0.4933
## F-statistic: 15.93 on 3 and 43 DF,  p-value: 4.148e-07

Then we remove verbal.

tg = update(tg, . ~ . -verbal) # remove verbal
summary(tg)
##
## Call:
## lm(formula = gamble ~ sex + income, data = teengamb)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -49.757 -11.649   0.844   8.659 100.243
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.041      6.394   0.632  0.53070    
## sex          -21.634      6.809  -3.177  0.00272 **
## income         5.172      0.951   5.438 2.24e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.75 on 44 degrees of freedom
## Multiple R-squared:  0.5014,	Adjusted R-squared:  0.4787
## F-statistic: 22.12 on 2 and 44 DF,  p-value: 2.243e-07

Notice that even though we eliminated half of the predictors from the model, the adjusted $R^{2}$ dropped only slightly. The simpler model explains almost as much variance in the response with half as many predictors. Something to keep in mind when conducting forward selection or backward elimination is that the p-value cutoff for dropping a predictor does not have to be 0.05. You could choose something higher, even up to around 0.15 or 0.20, if predictive performance is the goal. For example, note that the p-value for verbal in the second model was 0.14, and that model had the highest adjusted $R^{2}$ of the three. The coefficient for verbal was also negative, which is what we’d expect: teens with higher verbal scores spend less money on gambling. We should therefore consider keeping verbal in the model. As you can see, there’s a bit of an art to this method.
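For completeness, here’s a minimal sketch of the forward direction using base R’s add1(), which reports a test for each candidate predictor not yet in the model; the stopping rule (the p-value cutoff) is up to you.

fwd = lm(gamble ~ 1, data=teengamb)  # start from an intercept-only model

# F-tests for each candidate predictor not yet in the model
add1(fwd, scope = ~ sex + status + income + verbal, test="F")

# add the strongest candidate (income here) and repeat until no candidate
# clears your chosen p-value cutoff
fwd = update(fwd, . ~ . + income)
add1(fwd, scope = ~ sex + status + income + verbal, test="F")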

Criterion-Based Methods

As previously stated, testing-based procedures should only be considered when there are just a few predictors to consider. The more potential predictors in your model, the greater the chance that you’ll miss the optimal combination. We saw in the previous section that we have two competing goals: model simplicity versus model fit. Akaike (1974) developed a measure of this balance between simplicity and fit called the Akaike Information Criterion (AIC), which takes the form:

\[AIC = 2(p+1) - 2\ln(\hat{L})\]

where,

  • $p$ is the number of predictors, and
  • $\hat{L}$ is the maximized likelihood for the predictive model.

We then choose the model with the lowest AIC. The Bayesian Information Criterion (BIC) is an alternative to AIC that replaces $2(p+1)$ with $\ln(n)(p+1)$, where $n$ is the number of observations. Since $\ln(n) > 2$ for $n \geq 8$, BIC penalizes the number of predictors more heavily than AIC, and increasingly so for larger datasets. Which criterion you use can therefore depend on the dataset you’re working with.
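To make the formulas concrete, we can check them against the full model from the previous section. Note that R’s built-in AIC() and BIC() also count the error variance as an estimated parameter, so the absolute values differ by a constant from the formula above (and from what bestglm reports below), but the model rankings are unaffected. A quick sketch (tg_full is just an illustrative name):

# refit the full model, since tg was reduced in the previous section
tg_full = lm(gamble ~ ., data=teengamb)

# built-in criteria
AIC(tg_full)
BIC(tg_full)

# the same quantities by hand: -2*logLik plus a penalty per estimated parameter
# (the coefficients, including the intercept, plus the error variance)
df = length(coef(tg_full)) + 1
-2*as.numeric(logLik(tg_full)) + 2*df                    # AIC
-2*as.numeric(logLik(tg_full)) + log(nrow(teengamb))*df  # BIC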

Another common estimator of error is Mallows’ $C_p$, which is defined as:

\[C_{p}=\frac{1}{n}(RSS+2p\hat{\sigma}^{2})\]

where,

  • $RSS$ is the residual sum of squares,
  • $p$ is the number of predictors, and
  • $\hat{\sigma}^{2}$ is an estimate of the variance of the error, $\varepsilon$, in the linear regression equation.

As with AIC and BIC, the penalty term (in this case $2p\hat{\sigma}^{2}$) increases as the number of predictors in the model increases, which is intended to balance the corresponding decrease in $RSS$. With each of these methods, as we vary $p$, we get an associated criterion value from which we select the minimum as the best model. In R, we can calculate AIC and BIC with the bestglm() function from the bestglm package. Be aware that bestglm() expects the data to be in a dataframe with the response variable in the last column.

library(bestglm)
# Note that bestglm is picky about how your dataset is structured
# It expects a dataframe with the response variable in the last column
# and all other columns are predictors. Don't include any other "extra"
# columns. Fortunately, teengamb is already set up that way.

tg.AIC = bestglm(teengamb, IC="AIC")

# this will provide the best model
tg.AIC
## AIC
## BICq equivalent for q in (0.672366796081496, 0.87054246206156)
## Best Model:
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  24.138972 14.7685884  1.634481 1.094591e-01
## sex         -22.960220  6.7705747 -3.391177 1.502436e-03
## income        4.898090  0.9551179  5.128256 6.643750e-06
## verbal       -2.746817  1.8252807 -1.504874 1.396672e-01

Notice that verbal is included in the best fit model even though its p-value is > 0.05. Using summary(), we get a likelihood-ratio test for the best model compared to the null model.

summary(tg.AIC)
## Fitting algorithm:  AIC-leaps
## Best Model:
##            df deviance
## Null Model 43 21641.54
## Full Model 46 45689.49
##
## 	likelihood-ratio test - GLM
##
## data:  H0: Null Model vs. H1: Best Fit AIC-leaps
## X = 24048, df = 3, p-value < 2.2e-16

To get the best model as an lm() object:

tg.AIC$BestModel
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
##     drop = FALSE], y = y))
##
## Coefficients:
## (Intercept)          sex       income       verbal  
##      24.139      -22.960        4.898       -2.747

We can also see a comparison of the best model (model 1) to the next 4 best models.

tg.AIC$BestModels
##     sex status income verbal Criterion
## 1  TRUE  FALSE   TRUE   TRUE  294.2145
## 2  TRUE  FALSE   TRUE  FALSE  294.6268
## 3  TRUE   TRUE   TRUE   TRUE  296.1758
## 4  TRUE   TRUE   TRUE  FALSE  296.2086
## 5 FALSE   TRUE   TRUE   TRUE  301.6659

We can also see the best model (row 3 below, marked with an asterisk) and its subsets. Row 0 contains just the y-intercept, and each successive row adds one predictor to the model.

tg.AIC$Subsets
##    (Intercept)   sex status income verbal logLikelihood      AIC
## 0         TRUE FALSE  FALSE  FALSE  FALSE     -161.6677 323.3354
## 1         TRUE FALSE  FALSE   TRUE  FALSE     -150.1678 302.3356
## 2         TRUE  TRUE  FALSE   TRUE  FALSE     -145.3134 294.6268
## 3*        TRUE  TRUE  FALSE   TRUE   TRUE     -144.1072 294.2145
## 4         TRUE  TRUE   TRUE   TRUE   TRUE     -144.0879 296.1758

Using BIC, however, verbal is excluded from the best fit model.

tg.BIC = bestglm(teengamb, IC="BIC")
tg.BIC
## BIC
## BICq equivalent for q in (0.0507226962510261, 0.672366796081496)
## Best Model:
##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)   4.040829  6.3943499  0.6319374 5.306977e-01
## sex         -21.634391  6.8087973 -3.1774174 2.717320e-03
## income        5.171584  0.9510477  5.4377755 2.244878e-06

For Mallows’ $C_p$, we can use the leaps package.

library(leaps)

# leaps expects x and y to be passed separately
tg.cp = leaps(x=teengamb[-5], y=teengamb$gamble, method="Cp")
tg.cp
## $which
##       1     2     3     4
## 1 FALSE FALSE  TRUE FALSE
## 1  TRUE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE  TRUE
## 1 FALSE  TRUE FALSE FALSE
## 2  TRUE FALSE  TRUE FALSE
## 2 FALSE  TRUE  TRUE FALSE
## 2 FALSE FALSE  TRUE  TRUE
## 2  TRUE  TRUE FALSE FALSE
## 2  TRUE FALSE FALSE  TRUE
## 2 FALSE  TRUE FALSE  TRUE
## 3  TRUE FALSE  TRUE  TRUE
## 3  TRUE  TRUE  TRUE FALSE
## 3 FALSE  TRUE  TRUE  TRUE
## 3  TRUE  TRUE FALSE  TRUE
## 4  TRUE  TRUE  TRUE  TRUE
##
## $label
## [1] "(Intercept)" "1"           "2"           "3"           "4"          
##
## $size
##  [1] 2 2 2 2 3 3 3 3 3 3 4 4 4 4 5
##
## $Cp
##  [1] 11.401283 30.984606 41.445676 45.517426  3.248323 12.003293 12.276400
##  [8] 25.967108 26.743051 42.897591  3.034526  4.856329 10.256053 26.416920
## [15]  5.000000

It takes a little finagling to get the predictors that we should include in the best model. We want the index of the minimum value in $Cp, and we use that index to find the corresponding row in $which, which tells us the predictors that should remain in the model. Columns 1, 3, and 4 correspond to sex, income, and verbal, which matches the AIC result.

tg.cp$which[which.min(tg.cp$Cp), ]
##     1     2     3     4
##  TRUE FALSE  TRUE  TRUE
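To avoid reading the column numbers off by eye, here’s a small sketch that maps the logical vector back to predictor names and refits the corresponding model (best_preds is just an illustrative name):

# translate the winning row of $which into predictor names
best_preds = names(teengamb)[-5][tg.cp$which[which.min(tg.cp$Cp), ]]
best_preds

# build the corresponding formula and refit with lm()
lm(reformulate(best_preds, response="gamble"), data=teengamb)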

Cross Validation

An alternative approach to using AIC, BIC, or Cp is to use cross validation (CV) to select the best model. The idea is that we randomly divide our data into a training set and a test set. An 80/20 split between the training set and test set is common but will depend on your sample size. For very large sample sizes (in the millions), the training set can contain a larger percentage, while for relatively small sample sizes, the split may be closer to 50/50.

The training set is further randomly divided into $k$ subsets (also called folds), and one of these folds is withheld as the validation set. We fit a model to the remaining folds and then measure the prediction error on the validation set. For a quantitative response variable, the prediction error is typically measured by the mean squared error (MSE). We repeat this process by cycling through each of the folds, holding each one out in turn as the validation set. The cross validated error (CV error) is then the average prediction error across the $k$ folds.
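As a concrete illustration of the mechanics (not the bestglm implementation we’ll use below), here’s a minimal hand-rolled k-fold CV that estimates the MSE of one candidate model. For simplicity it uses the full teengamb dataset rather than a separate training set:

set.seed(1)
k = 5
# randomly assign each observation to one of k folds
folds = sample(rep(1:k, length.out = nrow(teengamb)))

# for each fold: fit on the other folds, predict the held-out fold, compute MSE
cv_mse = sapply(1:k, function(i) {
  fit  = lm(gamble ~ sex + income, data = teengamb[folds != i, ])
  pred = predict(fit, newdata = teengamb[folds == i, ])
  mean((teengamb$gamble[folds == i] - pred)^2)
})

mean(cv_mse)  # the CV error for this candidate model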

The website for the scikit-learn module for Python has a good visualization of these various data sets and a good explanation of this and other cross validation methods. A more thorough, academic treatment of cross validation may be found in Section 7.10 of Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.

Once the CV process is complete, we re-combine each of the folds into a single training set for a final evaluation against the test set. With this approach, we can compare multiple CV methods and choose the method with the best performance.

Notice that we are not using an information criterion (IC) anywhere in this method. Another difference is that with criterion-based methods we chose the model with the lowest IC score, but with CV we don’t simply choose the model with the lowest CV error. Instead, we calculate the standard deviation ($\sigma$) of the CV error for each model size and then choose the smallest model whose CV error is within one standard error of the lowest, where the standard error is $se = \sigma/\sqrt{k}$. This is best shown graphically, which you’ll see below.

CV techniques are particularly useful for datasets with many predictors, but for consistency we’ll stick with the teengamb dataset. Below, we’ll perform k-fold cross validation on the teengamb dataset, once again using bestglm(). We’ll use an 80/20 train/test split.

set.seed(2)
test_set = sample(47, 10, replace=FALSE)  # randomly select row indices
tg_test = teengamb[test_set, ]              # create test set
tg_train = teengamb[-test_set, ]            # create training set

The training set has only 37 observations, so if we further partition it into a large number of folds, we’ll have very few observations in each validation fold. For this example, we’ll use just 3 folds. In the bestglm() function, we specify CV as the IC and pass three arguments to specify the cross validation parameters. As mentioned, there are a variety of cross validation methods to choose from. For the method described above, we specify Method="HTF", which you might have noticed are the first letters of the last names of the authors of the “Elements of Statistical Learning” reference above. K=3 specifies the number of folds, and REP sets the number of repetitions. Remember that cross validation randomly partitions the data into folds, so if we want to repeat the CV process with different random partitions, we increase the REP value. Due to the small sample size and number of folds, we’ll do 10 repetitions.

tg.cv = bestglm(tg_train, IC="CV", CVArgs=list(Method="HTF", K=3, REP=10))
tg.cv
## CV(K = 3, REP = 10)
## BICq equivalent for q in (0.000199326484859652, 0.329344259543028)
## Best Model:
##              Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -6.132874   6.892883 -0.8897401 3.796805e-01
## income       5.877955   1.149221  5.1147292 1.134028e-05

The model above is the smallest model whose CV error is within one standard error of the model with the lowest CV error. To illustrate how this model was determined, we can get the CV errors and the standard errors from the tg.cv object and plot them.
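Here’s one way to do that, assuming (as in the bestglm vignette) that tg.cv$Subsets contains CV and sdCV columns; check names(tg.cv$Subsets) if your version differs:

cv_err = tg.cv$Subsets[, "CV"]
sd_cv  = tg.cv$Subsets[, "sdCV"]
n_pred = 0:(length(cv_err) - 1)  # number of predictors in each subset

# CV error for each subset size with +/- one standard error bars
plot(n_pred, cv_err, type="b", xlab="Number of predictors", ylab="CV error",
     ylim=range(cv_err - sd_cv, cv_err + sd_cv))
arrows(n_pred, cv_err - sd_cv, n_pred, cv_err + sd_cv,
       angle=90, code=3, length=0.05)

# one-standard-error rule: the dashed line is one SE above the minimum CV error;
# the best model is the smallest one whose CV error falls below it
abline(h = min(cv_err) + sd_cv[which.min(cv_err)], lty=2)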

What About The Test Set?

The CV method selected income as the only predictor in its best model, while the BIC approach also retained sex, and the coefficient for income differs between the two models as well. Now we can bring in the test set and compare their predictions. For a fair comparison with the CV results, we’ll find the best model using BIC on the training set only.

# get the BIC model on the training set only
tg_train.BIC = bestglm(tg_train, IC="BIC")
bic_preds = predict(tg_train.BIC$BestModel, newdata = data.frame(tg_test[, -5]))

print("BIC predictors included are:")
## [1] "BIC predictors included are:"
print(tg_train.BIC$BestModel$coefficients)
## (Intercept)         sex      income
##    3.515245  -19.116151    5.362915

Now we’ll get the CV model.

# based on the CV results, only income should be included as a factor
cv.glm = glm(gamble~income, data=tg_train)
cv_preds = predict(cv.glm, newdata = data.frame(tg_test[, -5]))

We’ll use mean absolute error as our measure of error.

# calculate and compare mean absolute error
print(paste("BIC mean absolute error:", round(mean(abs(bic_preds - tg_test$gamble)), 1)))
## [1] "BIC mean absolute error: 10.9"
print(paste("CV mean absolute error:", round(mean(abs(cv_preds - tg_test$gamble)), 1)))
## [1] "CV mean absolute error: 16.3"

Using mean absolute error, BIC out-performed the cross-validated model. This result shouldn’t be too surprising given that the BIC model contained additional predictor variables that appeared to be statistically significant.

Lasso Regression

Ridge and lasso regression are closely related regularization techniques for reducing model complexity. The primary difference between the two is that ridge regression shrinks coefficients close to (but not exactly) zero, while lasso regression can shrink coefficients all the way to zero, which makes it useful for reducing model complexity by eliminating predictors.
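Both penalties are available from the glmnet package used below; the alpha argument selects between them, with alpha = 1 (the default) giving the lasso penalty and alpha = 0 giving ridge, and values in between giving the elastic net. A minimal sketch, with illustrative variable names:

library(glmnet)

# predictors as a numeric matrix and the response as a vector
x_all = data.matrix(teengamb[-5])
y_all = teengamb$gamble

ridge_fit = glmnet(x_all, y_all, alpha = 0)  # ridge: coefficients shrink toward zero
lasso_fit = glmnet(x_all, y_all, alpha = 1)  # lasso: coefficients can reach exactly zero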

Background Reading

For the theoretical framework, please refer to this article. Don’t worry about the Python code if you’re not familiar with it. Just read the text portions of the article that explain how ridge and, more importantly, lasso regression work.

Lasso Regression In R

Lasso regression is particularly useful when a dataset has many predictors, but we’ll continue to use the teengamb data so we can compare the results with the variable selection methods above. Performing lasso regression with the glmnet package is straightforward. The glmnet() function has two required arguments, x and y, where x contains the predictor data (note that x must be a data.matrix, not a data.frame) and y is the response as a vector. By default, glmnet automatically scales and centers the data, and then converts the results back to the original scale. If we plot the fitted object, we get the following.

library(glmnet)

# glmnet needs a numeric matrix of predictors; data.matrix handles the
# conversion from a data frame (factor columns become numeric codes)
x = data.matrix(tg_train[-5])
y = tg_train$gamble

tg.lasso = glmnet(x, y)
plot(tg.lasso, xvar="lambda", label=TRUE)

Each of the above lines represents a predictor. The number next to each line on the left side of the plot refers to the column number in the x matrix. The vertical axis represents the factor coefficient. The bottom x axis is $log(\lambda)$, and the top x axis is the associated number of predictors included in the model.

So how do we interpret this plot? At the far right, we can see that the coefficient for every predictor is zero. In other words, this is the null model. As $\lambda$ decreases, predictors are added one at a time to the model. Since predictor #3 (income) is the first to have a non-zero coefficient, it is the most significant. sex (predictor #1) is the next non-zero coefficient followed by verbal (predictor #4) and then status (predictor #2). If we compare this order with the p-values from the best fit linear model, we see that there is consistency. Note that income was the first non-zero coefficient, and it has the lowest p-value in the linear model. Also note that the maximum coefficients in the lasso regression plot are also consistent with the linear model coefficients.

Our task now is to find the model that has good predictive power while including only the most significant predictors. In other words, we need a method to find the right $\lambda$ value. Before we get to how we identify that $\lambda$, let’s look at some other useful information from tg.lasso. If we print our glmnet object, we see (going by columns from left to right) the number of predictors included in the model (Df, not to be confused with the degrees of freedom in a linear model summary), the percent of null deviance explained, and the associated $\lambda$ value.

print(tg.lasso)
##
## Call:  glmnet(x = x, y = y)
##
##    Df  %Dev  Lambda
## 1   0  0.00 21.9800
## 2   1  7.26 20.0300
## 3   1 13.29 18.2500
## 4   1 18.30 16.6300
## 5   1 22.45 15.1500
## 6   1 25.90 13.8100
## 7   1 28.77 12.5800
## 8   1 31.15 11.4600
## 9   2 34.07 10.4400
## 10  2 36.78  9.5160
## 11  2 39.03  8.6710
## 12  2 40.90  7.9010
## 13  2 42.46  7.1990
## 14  2 43.75  6.5590
## 15  2 44.82  5.9770
## 16  2 45.71  5.4460
## 17  3 46.46  4.9620
## 18  3 47.38  4.5210
## 19  3 48.14  4.1190
## 20  3 48.77  3.7530
## 21  3 49.30  3.4200
## 22  3 49.73  3.1160
## 23  3 50.10  2.8390
## 24  3 50.40  2.5870
## 25  3 50.65  2.3570
## 26  3 50.85  2.1480
## 27  3 51.03  1.9570
## 28  3 51.17  1.7830
## 29  3 51.29  1.6250
## 30  3 51.39  1.4800
## 31  3 51.47  1.3490
## 32  3 51.54  1.2290
## 33  3 51.59  1.1200
## 34  3 51.64  1.0200
## 35  3 51.68  0.9298
## 36  3 51.71  0.8472
## 37  3 51.74  0.7719
## 38  3 51.76  0.7033
## 39  3 51.78  0.6409
## 40  3 51.79  0.5839
## 41  3 51.80  0.5320
## 42  4 51.83  0.4848
## 43  4 51.85  0.4417
## 44  4 51.87  0.4025
## 45  4 51.89  0.3667
## 46  4 51.90  0.3341
## 47  4 51.91  0.3045
## 48  4 51.92  0.2774
## 49  4 51.93  0.2528
## 50  4 51.93  0.2303
## 51  4 51.94  0.2099
## 52  4 51.94  0.1912
## 53  4 51.95  0.1742
## 54  4 51.95  0.1587
## 55  4 51.95  0.1446
## 56  4 51.95  0.1318
## 57  4 51.96  0.1201
## 58  4 51.96  0.1094
## 59  4 51.96  0.0997
## 60  4 51.96  0.0908
## 61  4 51.96  0.0828
## 62  4 51.96  0.0754
## 63  4 51.96  0.0687
## 64  4 51.96  0.0626

We can also see the coefficient values for any given $\lambda$ with coef(). Small values of $\lambda$ include more predictors and so correspond to the left side of the plot above. If we choose the smallest value of $\lambda$ from the output above, we get:

# Note that we specify lambda with s
coef(tg.lasso, s=0.0626)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  18.03799848
## sex         -18.24900253
## status        0.08230075
## income        5.20255057
## verbal       -2.69223049

Now we can compare these coefficients more directly to the full linear model coefficients. Recall that we withheld a test set prior to performing lasso regression, so the lasso coefficients are close to, but not equal to, the coefficients of the linear model fit to the full dataset.

sumary(lm(gamble~., data=teengamb))
##               Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  22.555651  17.196803  1.3116   0.19677
## sex         -22.118330   8.211115 -2.6937   0.01011
## status        0.052234   0.281112  0.1858   0.85349
## income        4.961979   1.025392  4.8391 1.792e-05
## verbal       -2.959493   2.172150 -1.3625   0.18031
##
## n = 47, p = 5, Residual SE = 22.69034, R-Squared = 0.53

If we choose a $\lambda$ associated with 2 Df, we see that only two predictors have non-zero coefficients.

coef(tg.lasso, s=5.9770)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                    s1
## (Intercept)  5.858393
## sex         -8.912086
## status       .       
## income       4.039764
## verbal       .

To find the optimal value for $\lambda$, we use cross validation once again. We can have glmnet perform cross validation by prepending cv. to the function call as shown below. The default number of folds in cv.glmnet() is 10, which is fine for this example, and there’s a built-in method for plotting the results.

tg.cv = cv.glmnet(x, y)
plot(tg.cv)

What we get is the cross validation curve (red dots) and two values for $\lambda$ (vertical dashed lines). The left dashed line is the value of $\lambda$ that gives the minimum mean cross-validated error. The right dashed line is the largest value of $\lambda$ whose error is within one standard error of that minimum. This is the $\lambda$ we’ve been after. We can get the coefficients associated with this $\lambda$ by specifying s = "lambda.1se". Our cross validated best fit lasso regression model is shown below.

coef(tg.cv, s = "lambda.1se")
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                    s1
## (Intercept) 12.864028
## sex          .       
## status       .       
## income       1.826509
## verbal       .
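Since we still have the held-out test set from earlier, a reasonable final check (a sketch reusing the mean absolute error comparison from the BIC/CV section above) is to see how this lasso model predicts the test observations:

# predict gamble for the test set at the one-standard-error lambda
lasso_preds = predict(tg.cv, newx = data.matrix(tg_test[-5]), s = "lambda.1se")

# mean absolute error on the test set, comparable to the BIC and CV values above
mean(abs(lasso_preds - tg_test$gamble))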

For a more thorough discussion of the glmnet package, including its use with non-Gaussian data, refer to the vignette written by Trevor Hastie and Junyang Qian.

Parting Thought

In this post, we have seen that different methods for model selection can produce different “best” models, which might make you leery about the whole thing. Remember the George Box quote:

All models are wrong…

We’re just trying to find one that’s useful.
