Lecture 7: Interaction effects
Topic overview#
- Definitions and description of interaction effects
- Detecting interaction effects
- A brief discussion on feature selection
Definition#
Two or more predictors are said to interact if their combined effect is different (less or greater) than what we would expect if we were to add the impact of each of their effects when considered alone. – Feature Engineering, Ch 7
- Interactions matter if they affect the outcome
- Features may have a relationship without having an interaction effect
Example: Stroke data from week 1
Mathematical representation#
A linear model is trying to fit:
$$\hat{y} = w_0 + w_1x_1 + w_2x_2 + \cdots + w_k x_k$$
To consider interaction effects, we add the product of features. For a model with only features $x_1$ and $x_2$:
$$\hat{y} = w_0 + w_1x_1 + w_2 x_2 + w_3x_1x_2$$
This is similar to a basis expansion transformation that adds polynomial terms
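A minimal sketch of fitting the model above with scikit-learn, using synthetic data (the coefficient values are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
# Synthetic outcome with a true interaction term: w0=1, w1=2, w2=-1, w3=3
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 3.0 * x1 * x2

# Augment the design matrix with the product feature x1*x2
X = np.column_stack([x1, x2, x1 * x2])
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # recovers w0 and (w1, w2, w3)
```

Because the interaction product is just another column in the design matrix, the model stays linear in its parameters and can be fit with ordinary least squares.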
Which interactions to include?#
First, use your domain knowledge! Then consider guiding principles:
- Hierarchy principle: The higher the degree of interaction, the less likely the interaction will explain variation in the response
- Effect sparsity: only a fraction of the possible interactions are responsible for variation in the response
- Heredity principle: for interaction term $x_1x_2$ to be considered:
- $x_1$ AND $x_2$ must be significant (strong heredity), or
- $x_1$ OR $x_2$ must be significant (weak heredity)
Brute-force approach#
- Try all second order interactions and see if they improve the model, e.g. using scikit-learn’s PolynomialFeatures
- Based on guiding principles, this results in many extraneous features
- Feature selection can be used to prune them back
Feature selection#
- Not just for interaction terms!
- Fields such as bioinformatics can end up afflicted by the Curse of Dimensionality or the $p \gg n$ problem
- Models tend to behave poorly when there are more features than samples:
- Risk of overfitting
- Multicollinearity issues
- Can negatively impact performance
Goal of feature selection: Reduce the number of predictors as far as possible without compromising predictive performance
Performing feature selection#
- Many different ways to choose which features to keep
- Intrinsic methods: some models effectively ignore irrelevant features
- Filter methods: filter features based on some criteria (e.g. correlation)
- Wrapper methods: select subset based on model results, then iterate
- As usual, Scikit-learn can help with this
A subtle source of data leakage: performing feature selection on the entire training set first, then cross-validating — the held-out folds have already influenced which features were kept
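One way to avoid this leakage is to put the selection step inside a `Pipeline`, so it is re-fit on each training fold only. A sketch on synthetic $p \gg n$ data (the dataset and `k=10` are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# p >> n: 100 samples, 500 features, only a handful informative
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=5, random_state=0)

# Because SelectKBest is inside the Pipeline, cross_val_score re-runs the
# selection on each training fold; the held-out fold never sees it
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Running `SelectKBest` on all of `X` before calling `cross_val_score` would give optimistically biased scores, since every fold's "unseen" data would have helped choose the features.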
Summary#
- Interactions between features can be considered by adding new features with their product
- This can cause a dimensionality explosion
- Particularly for small datasets, feature selection is then needed to avoid adverse effects of irrelevant features
As always, don’t make feature selection decisions on test data!
Overall Processing Order#
In general, the recommended order is:
Numeric Features#
- Impute any missing features
- Compose interaction terms
- Transform if necessary
- Rescale
Categorical features#
- Encode + Impute
- Compose interaction terms
- Rescale if necessary (e.g. high cardinality ordinal)
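The ordering above can be sketched as a `ColumnTransformer` with one sub-pipeline per feature type. Column names here are hypothetical, and the optional "transform if necessary" step is omitted:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (OneHotEncoder, PolynomialFeatures,
                                   StandardScaler)

# Hypothetical column names -- substitute your dataset's own
numeric_cols = ["age", "bmi"]
categorical_cols = ["smoking_status"]

# Numeric: impute -> compose interactions -> rescale
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("interact", PolynomialFeatures(degree=2, interaction_only=True,
                                    include_bias=False)),
    ("scale", StandardScaler()),
])

# Categorical: impute + encode (one-hot, so no rescaling needed here)
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])
```

Keeping every step inside one pipeline also means the whole preprocessing chain is fit per cross-validation fold, avoiding the leakage issue discussed earlier.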
Coming up next#
- Reading week!!
- Midterm practice (both in lab and in class)
- I will post practice questions during reading week as well
- After midterm: text wrangling