Lecture 6: Missing and weird data

HTML Slides │ PDF Slides │ Demo code on GitHub

Topic overview#

  • What to do with missing data
  • Detecting and handling outliers

Resources used:

The problem#

  • As you’ve seen, real-world data is messy
  • Missing values are common, other values don’t make sense
  • We need to decide how to deal with these problems

What examples have we seen so far? Why might data be missing or weird*?

*I’m using “weird” as an informal catch-all for unexpected or outlier values

Missing data#

When data are missing in the features we have a few options:

  1. Do nothing! Some algorithms (e.g. Decision Trees) can handle missing values
  2. Remove features with missing values
  3. Remove samples with missing values
  4. Invent a new value to represent “missingness”
  5. Impute a value based on other data

Most important: understand why data are missing (more EDA!)
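A quick way to start that EDA is to count missing values by feature and by sample. This is only a sketch on made-up data: the column names and values below are hypothetical and not from any dataset used in the course.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 41, 33],
    "income": [np.nan, np.nan, 52000, np.nan],
    "city": ["Calgary", None, "Banff", "Calgary"],
})

# Fraction of missing values per feature, highest first.
missing_by_feature = df.isna().mean().sort_values(ascending=False)

# Number of missing features per sample: do the same rows keep showing up?
missing_by_sample = df.isna().sum(axis=1)

print(missing_by_feature)
print(missing_by_sample)
```

The per-feature view helps with the "remove features" decision, and the per-sample view helps with the "remove samples" decision.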

Option 1: removing features#

  • If a feature has:
    • A high proportion of missing values, and
    • Little apparent relationship to the target, or
    • Information that is redundant with other features
  • It may be reasonable to remove it entirely. You can drop it, e.g.:
    df.drop(columns=['feature_name'], inplace=True)
    or (probably more reliable) just not select it when building your pipeline
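One way to "not select" a feature is to list the columns you want explicitly in a ColumnTransformer, so anything unlisted never reaches the model. A minimal sketch, assuming a scikit-learn pipeline; the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical data: "mostly_missing" is the feature we decided not to use.
X = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 70.0, 80.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

keep = ["height", "weight"]

# remainder="drop" discards every column that is not explicitly listed.
preprocessor = ColumnTransformer(
    [("num", StandardScaler(), keep)],
    remainder="drop",
)
X_prepared = preprocessor.fit_transform(X)
print(X_prepared.shape)
```

The original DataFrame is left untouched, so the same selection is applied automatically to new data at inference.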

Option 2: removing samples#

  • If only a small number of samples have missing values, you can drop them from the training data:
    train.dropna(subset=["features", "we", "care", "about"], inplace=True)
  • Good idea if the same samples have missing values from multiple features
  • Still useful to explore why data are missing
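It is worth checking how many samples you lose before committing to a drop. A small sketch with hypothetical column names; note that `subset` takes a flat list of column labels, and missing values outside that list are ignored.

```python
import numpy as np
import pandas as pd

# Hypothetical training data; "notes" is a column we do not model with.
train = pd.DataFrame({
    "age": [25, np.nan, 41, 33, 29],
    "income": [40000, np.nan, 52000, 61000, np.nan],
    "notes": [None, None, "ok", None, None],
})

features = ["age", "income"]

n_before = len(train)
# Drop rows missing any of the features we care about; "notes" is not checked.
train_clean = train.dropna(subset=features)
n_dropped = n_before - len(train_clean)

print(f"Dropped {n_dropped} of {n_before} samples")
```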

What should we do for inference?

Option 3: invent a new value#

  • Categorical features: add a new category for “missing”
  • Add a new binary feature indicating whether the value was missing
  • I have seen advice to use extreme values for numerical features, like the -1 income in the OKCupid dataset, but I’m not convinced this is a good idea
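The first two ideas can be sketched in plain pandas. The data and column names below are hypothetical; the key detail is recording the indicator before any later imputation overwrites the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in one categorical and one numeric feature.
df = pd.DataFrame({
    "diet": ["vegan", None, "anything", None],
    "income": [40000.0, np.nan, 52000.0, np.nan],
})

# Categorical: treat "missing" as its own category.
df["diet"] = df["diet"].fillna("missing")

# Numeric: add a binary flag recording that the value was absent.
df["income_was_missing"] = df["income"].isna().astype(int)

print(df)
```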

Case study: where missingness is informative

Option 4: impute missing values#

  • Fill in the missing values with an “educated guess”
  • Replace missing value with:
    • constant
    • mean, median, or mode (most_frequent)
  • Use other features to infer missing value:
    • K-nearest neighbours
    • simple models to predict missing values
  • Can be combined with option 3 to indicate missing features
  • How much to impute? Feat.Engineering suggests no more than 20%
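Here is a rough sketch of two of these strategies using scikit-learn's SimpleImputer and KNNImputer on a tiny made-up array. The values are chosen only so the result is easy to check by hand.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data: the second feature is missing for the second sample.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [4.0, 40.0],
])

# Median imputation; add_indicator=True appends a "was missing" column,
# which combines imputation with a missingness indicator.
median_imputer = SimpleImputer(strategy="median", add_indicator=True)
X_median = median_imputer.fit_transform(X)

# KNN imputation: average the feature over the 2 most similar samples.
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X)

print(X_median)
print(X_knn)
```

In a real workflow you would fit the imputer on the training data only and then call `transform` on validation and test data.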

Choosing an imputation strategy#

Strategy              │ When to use
Constant              │ When there is a reasonable default value
Mean                  │ Numeric features with normal distribution
Median                │ Numeric features with extreme outliers
KNN                   │ Relationship with other features
Missingness indicator │ If missingness seems informative

Outliers#

An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism. – D. M. Hawkins

  • There are many entire books dedicated to outlier detection
  • Useful for anomaly detection, e.g:
  • Our focus is on dealing with outliers in preprocessing

Where we left off on February 5#

Detecting outliers#

  • Visually as part of EDA
  • Statistically, e.g. values more than $3\sigma$ from the mean
  • Algorithmically, e.g. Isolation Forests
  • As usual, context and domain knowledge are essential
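The statistical and algorithmic approaches can be sketched as follows. The data is synthetic (200 draws from a normal distribution plus one planted extreme value), and the Isolation Forest settings are scikit-learn defaults rather than tuned choices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 "normal" values plus one planted extreme value.
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50, scale=5, size=200), 500.0)

# Statistical rule: flag values more than 3 standard deviations from the mean.
z = (values - values.mean()) / values.std()
z_flags = np.abs(z) > 3

# Algorithmic approach: Isolation Forest labels outliers as -1.
forest = IsolationForest(random_state=0)
forest_flags = forest.fit_predict(values.reshape(-1, 1)) == -1

print("3-sigma flags:", z_flags.sum())
print("Isolation Forest flags:", forest_flags.sum())
```

The two methods will generally not agree on how many points to flag, which is one reason context and domain knowledge still matter.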

Context matters!#

  • Look at the relationship between outliers and target (training dataset, of course)

What do the dots on box plots represent?
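Assuming the plot was drawn with matplotlib or seaborn defaults (an assumption, since the figure itself is not reproduced here), the dots are points lying more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule on made-up values:

```python
import numpy as np

# Made-up values with one obvious extreme.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Default box plot "fences": 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
fliers = values[(values < lower) | (values > upper)]

print(lower, upper, fliers)
```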

What to do about outliers?#

  1. Data transformations
  2. Drop the samples
  3. Encode them somehow
  4. Leave them alone

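As usual, very data- and model-dependent. Tree-based methods are relatively robust to outliers in the features, since splits depend on the ordering of values; linear and distance-based models tend to be more affected.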

Any other ideas?

Nonlinear transformations#

  • Transforming the data does not actually remove the outlier
  • Can help make the relationship less extreme
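A log transform is one common nonlinear choice. In this sketch (hypothetical income values), the extreme sample is still the largest after transforming, but it sits much closer to the rest.

```python
import numpy as np

# Hypothetical incomes with one very large value.
income = np.array([30_000.0, 45_000.0, 60_000.0, 1_000_000.0])

# log1p computes log(1 + x), which is also safe for zeros.
log_income = np.log1p(income)

# How far is the maximum from the median, before and after?
ratio_raw = income.max() / np.median(income)
ratio_log = log_income.max() / np.median(log_income)

print(f"raw: {ratio_raw:.1f}x the median, log: {ratio_log:.2f}x the median")
```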

Dropping the samples#

  • In general, not a thing I love to do
  • If you drop a sample from training, you need to decide what to do at inference
  • My opinion: Only do it if you’re confident it’s an error in the dataset

    Can you think of an example?

  • What can you do at inference time when outliers are encountered?

Encoding outliers#

A few other options that might fall into “encoding”:

  • Just like with missing values, binary column indicating outlier/inlier
  • Remove the values and convert them to missing
  • Bin or impose a cap (floor/ceiling) on the value
  • Replace the numeric value with a rank or quantile bucket
  • Probably other things!
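These options can be sketched in a few lines of pandas. The threshold here is the upper 1.5 × IQR fence, which is just one possible choice, and the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up feature with one extreme value.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])

# Illustrative threshold: the upper 1.5 * IQR fence.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)

# Binary outlier/inlier indicator.
is_outlier = (s > upper).astype(int)

# Convert outliers to missing so they can be imputed later.
as_missing = s.mask(s > upper)

# Impose a ceiling on the value.
capped = s.clip(upper=upper)

# Replace the value with a quantile bucket (0 = lowest fifth).
bucket = pd.qcut(s, q=5, labels=False)

print(pd.DataFrame({"raw": s, "flag": is_outlier, "capped": capped, "bucket": bucket}))
```

Whatever threshold you choose should be computed on the training data and reused unchanged at inference.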

Leave them alone#

If your outliers are:

  • Real values (not data entry or other errors)
  • Representative of things that might happen during inference

Then you probably want to keep them!

Consider using a RobustScaler to scale your features if you have lots of outliers
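To see why, compare scikit-learn's RobustScaler (centres on the median, scales by the interquartile range) with StandardScaler (mean and standard deviation) on made-up data containing one extreme value:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up single feature with one extreme value.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0], [1000.0]])

# RobustScaler: (x - median) / IQR, so the outlier barely affects the scaling.
robust = RobustScaler().fit_transform(X)

# StandardScaler: (x - mean) / std, both of which are pulled by the outlier.
standard = StandardScaler().fit_transform(X)

# Spread of the eight "ordinary" values after each scaling.
robust_spread = robust[7, 0] - robust[0, 0]
standard_spread = standard[7, 0] - standard[0, 0]
print(robust_spread, standard_spread)
```

With the standard scaler the ordinary values get squashed into a tiny range, while the robust scaler keeps them spread out. The outlier itself is still extreme in both cases.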

Missing values and outliers in the target#

  • If you have missing values in the target, you probably want to avoid imputing
  • This is a good case for dropping samples from training!
  • Outliers are trickier – again, check if they’re real or mistakes
  • There may be a case for transforming the target
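A sketch of both ideas: drop training samples with a missing target, then fit on a log-transformed target using scikit-learn's TransformedTargetRegressor. The data is made up (roughly y = e^x), and the log transform assumes a strictly positive target.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Made-up data: y grows roughly like exp(x), with one missing target.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.7, 7.4, np.nan, 54.6, 148.4, 403.4],
})

# Drop samples with a missing target rather than imputing it.
train = df.dropna(subset=["y"])

# Fit on log(y); predictions are mapped back with exp automatically.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(train[["x"]], train["y"])

pred = model.predict(pd.DataFrame({"x": [3.0]}))
print(pred)
```

Wrapping the transform this way means predictions come back on the original scale, so metrics stay interpretable.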

Coming up next#

  • Interactions between variables
  • Feature selection
  • Midterm stuff
  • Reading week!!!!