Cross-validation is a way of checking your model choice and parameters, but final training should be done on the entire training set
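This workflow can be sketched with scikit-learn: score a candidate model with cross-validation, then refit it on the full training set. The data here is hypothetical, generated just for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 200 samples, 5 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression()

# Cross-validation checks the model choice and parameters...
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# ...but the final model is fit on the entire training set
model.fit(X, y)
```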
How do we know if stratification is necessary?
Stratification is used to mitigate sampling bias
The binomial distribution can be used to model the probability of observing a given number of members of a class in a random sample
Suppose we randomly sample 100 people. What is the probability of fewer than 75 or more than 85 cilantro lovers?
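This can be computed directly from the binomial pmf. The population proportion of cilantro lovers is not stated here, so `p = 0.80` below is a hypothetical assumption for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 100   # sample size
p = 0.80  # assumed population share of cilantro lovers (hypothetical)

# P(X < 75) + P(X > 85)
prob = (sum(binom_pmf(k, n, p) for k in range(75))
        + sum(binom_pmf(k, n, p) for k in range(86, n + 1)))
print(round(prob, 4))
```

With a different assumed `p`, only the two summation bounds' tail masses change; the structure of the calculation is the same.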
Here we've defined an "unbiased sample" as one whose class distribution matches that of the population
The need for stratification depends on sample size, distribution of stratification category, and how much bias you're willing to accept
| | Small Sample Size | Large Sample Size |
|---|---|---|
| Unbalanced Classes | Stratify | Maybe |
| Balanced Classes | Maybe | Not necessary |
Stratification categories can be the target variable, or a predictor
Goal is to have the same class distribution in both testing and training
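One way to do this is scikit-learn's `train_test_split` with its `stratify` argument, which preserves the class distribution across both splits. The imbalanced data below is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 90% class 0, 10% class 1
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 900 + [1] * 100)

# stratify=y keeps the same class proportions in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())  # both close to 0.10
```

Stratification can also be done on a binned predictor rather than the target, by passing that binned column as `stratify` instead of `y`.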
Now that we've got a test set safely stashed, we can ask questions about the data and use visualizations and statistics to answer them. Some examples:
A few things to tweak that can make visualizations easier to read:
Transparency (alpha)
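For example, lowering alpha on a dense scatter plot makes overlapping regions darker, so density becomes visible. A minimal matplotlib sketch with hypothetical data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripting
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical dense, overlapping data
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x + rng.normal(scale=0.5, size=2000)

fig, ax = plt.subplots()
# alpha < 1 makes each point translucent; overlaps stack into darker regions
pts = ax.scatter(x, y, alpha=0.2, s=10)
fig.savefig("scatter_alpha.png")
```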