Lecture 4: Categorical data

HTML Slides html │ PDF Slides PDF  │ Demo code on GitHub GitHub

Topic overview#

  • Exploring categorical data
  • Categorical data encoding strategies

Resources used:

What is categorical data?#

  • Samples can take on one of several discrete values or groups
    • Nominal: no particular order to the groups
    • Ordinal: groups relate to each other in a specific order
  • Categories can be represented as strings or numeric types
    • Domain knowledge is necessary!

Let’s take a few minutes to brainstorm some examples

Exploring categorical data#

We’ve already done some of this, but some ideas to consider:

  • pandas.DataFrame.value_counts() - how many of each category?
  • Use category to group, then compute summary stats
  • Plot color per category
  • Scatter plot with jitter

Representing categorical data#

  • Tree-based models can handle string-based categories as-is Not in Scikit-Learn
  • Consider:
    • Ordinal or nominal?
    • How many possible categories (cardinality)?
    • Any chance new ones might show up?

How could we encode the examples? What are the benefits and drawbacks of each method?

Ordinal encoding#

bg left fit

CategoryFeature
Nil0
Low1
Moderate2
High3
Very High4
Extreme5

Nominal categories: one-hot encoding#

  • Categories have no natural relationship
  • Create $k$ new features from $k$ categories, very sparse matrix
Animalcatdograbbit
cat100
dog010
rabbit001

Another approach: target encoding#

  • Basic concept: replace the category with the mean of the target
  • Essential to avoid data leakage!
  • Example: predicting weight of animal
Animalmean_kg
cat4.1
dog15.4
rabbit2.2

Where we left off on January 22#

Getting fancy with feature hashing#

  • Good if you have too many categories, or combinations of categories
  • Converts each category into a fixed-length feature vector
AnimalA_0A_1A_2A_16
cat1001
dog0101
rabbit1010

Even fancier methods exist, like supervised encoding methods which basically train a model-within-a-model

From numbers to categories: discretization#

  • Sometimes numeric data is better represented as categorical
  • Still needs encoding strategy
  • Can introduce nonlinear relationships
  • Like everything, contextual

Advantages/disadvantages of various methods#

Ordinal#

  • Compact (1 column/feature)
  • Only useful for naturally ordered data
  • Imposes (linear) relationship
  • Difficult to deal with novel categories

One-hot#

  • No assumptions made about relationships
  • Novel features implied by 0
  • Memory intensive
  • Not good for high cardinality
  • Potential collinearity issues

Advantages/disadvantages continued#

Target encoding#

  • Powerful if strong predictor
  • Compact
  • Doesn’t work for unsupervised
  • Central tendency poor measure for categories with few samples
  • Difficult to deal with novel categories

Feature hashing#

  • Natively handles novel categories
  • Compact (fixed columns/features)
  • Good for high cardinality
  • Risk of hash collision
  • Loss of interpretability/meaning

Coming up next#

  • Assignment 1 January 30th
  • Lab: practice with modelling process

Feature Engineering Chapter 5