DATA 3464: Fundamentals of Data Processing

Numeric Data Transformations

Charlotte Curtis
January 27, 2026

Topic overview

  • Why transformations are necessary
  • Common transformations
  • Dimensionality reduction

Resources used:

Common 1:1 transformations

"Most models work best when each feature (and in regression also the target) is loosely Gaussian distributed" -- Introduction to Machine Learning with Python

  • Scaling: normalization or standardization
  • Nonlinear transforms: log, square root, polynomial
  • Fancier methods: Box-Cox, Yeo-Johnson

A brief intro to gradient descent

  • Many linear models minimize some cost function through gradient descent

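  • The gradient is a vector of partial derivatives

    $\nabla f(\mathbf{x}) = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]^T$

    for some scalar-valued function $f(\mathbf{x})$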

Descending the gradient

For a loss (or cost) function $J(\theta)$:

  1. Start with a random $\theta$

  2. Calculate the gradient $\nabla J(\theta)$ for the current $\theta$

  3. Update as $\theta \leftarrow \theta - \eta \nabla J(\theta)$

  4. Repeat 2-3 until some stopping criterion is met

    where $\eta$ is the learning rate, or the size of step to take in the direction opposite the gradient.
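
The four steps above can be sketched in NumPy. This is a minimal illustration, assuming a mean squared error loss for a linear model; the synthetic data, the learning rate `eta`, and the fixed iteration count are illustrative choices rather than anything prescribed in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 3*x + 2 plus a little noise
X = rng.uniform(-1, 1, size=(200, 1))
y = 3 * X[:, 0] + 2 + rng.normal(scale=0.1, size=200)
Xb = np.c_[np.ones(len(X)), X]           # prepend a bias column

def mse_gradient(theta, Xb, y):
    """Gradient of the mean squared error with respect to theta."""
    residual = Xb @ theta - y
    return 2 / len(y) * Xb.T @ residual

eta = 0.1                                 # learning rate (assumed value)
theta = rng.normal(size=2)                # 1. start with a random theta
for _ in range(1000):                     # 4. repeat (fixed count as a simple stopping criterion)
    grad = mse_gradient(theta, Xb, y)     # 2. gradient at the current theta
    theta = theta - eta * grad            # 3. step opposite the gradient

print(theta)  # should land close to [2, 3]
```

A fixed number of iterations is the simplest stopping criterion; checking whether the gradient norm has fallen below a tolerance is a common alternative.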

Visualizing in 2D

  • The gradient has $n$ dimensions, where $n$ is the number of features
  • The step size $\eta$ is a scalar parameter

Main takeaway: features should be more or less on the same scale

Figure 4-7 from Hands-on Machine Learning

Approaches to scaling

Standardize: $x' = \frac{x - \mu}{\sigma}$

Normalize: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
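
A short NumPy sketch of both approaches, assuming "normalize" here means min-max scaling to the range [0, 1]. The small array is made up for illustration.

```python
import numpy as np

# Two features on very different scales (made-up values)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Standardize: zero mean and unit standard deviation per feature (column)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalize (min-max): rescale each feature to the range [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_norm.min(axis=0), X_norm.max(axis=0))
```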

Nonlinear transforms

  • Common case: count data
  • Example: Ask 1000 students how often they checked D2L that day
  • Not a Gaussian distribution!
  • What about the central limit theorem?
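
A small sketch of how a log transform can tame skewed count data. The counts are simulated (the negative binomial draw is an assumption standing in for the D2L survey, not real data), and `skewness` is a helper defined here for illustration. `np.log1p` computes log(1 + x), so zero counts are handled.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated right-skewed counts for 1000 students (illustrative only)
counts = rng.negative_binomial(n=1, p=0.1, size=1000)

def skewness(x):
    """Sample skewness: the third standardized moment."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

log_counts = np.log1p(counts)  # log(1 + x), defined for zero counts

# Compare skewness before and after the transform (0 would be symmetric)
print(skewness(counts), skewness(log_counts))
```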

Where we left off on January 27

Transformations in training vs inference

  • Define functions, e.g.
    def standardize(X, mu, sigma):
        return (X - mu) / sigma
    
  • Compute scaling parameters on the training data, then stash them somewhere:
    mu = X_train.mean()
    sigma = X_train.std()
    # ... later on, during inference
    X = standardize(X, mu, sigma)
    

    What would happen if standardize instead computed values on the fly?
    What else am I missing here?

Manual approach in the wild

You may run across magic numbers, e.g., from the PyTorch tutorials:

import torch
from torchvision import transforms, datasets

data_transform = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])

This really should have a comment! The values are the per-channel (RGB) means and standard deviations derived from ImageNet.

An alternative solution: Scikit-learn Pipelines

  • Hard-coding scaling (and other) parameters is okay, provided you can justify the choice and document where they came from
  • Scikit-learn has a handy Pipeline class that handles this for you
  • Each step in the pipeline has a fit and transform method
    • fit computes parameters from the training data
    • transform applies the transformation
    • fit_transform does both -- only use on training data!
  • You can call these functions on the whole pipeline to fit or apply all in one go

Pipeline

  • The Pipeline class chains things together in a linear fashion
  • Output from one step is input to the next
  • make_pipeline is slightly simpler syntax (no names needed)
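    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline, Pipeline
    from sklearn.preprocessing import StandardScaler

    pipeline = make_pipeline(
        StandardScaler(),
        SGDClassifier()
    )
    # or, equivalently (but with explicit step names):
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('sgd', SGDClassifier())
    ])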
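
A sketch of fitting and applying the whole pipeline in one go. The iris dataset, the train/test split, and the `random_state` values are assumptions chosen only to make the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))

# fit() fits the scaler on X_train, transforms X_train, then fits the classifier
pipeline.fit(X_train, y_train)

# score() applies the already-fitted scaler to X_test before classifying
accuracy = pipeline.score(X_test, y_test)

# make_pipeline names each step after its lowercased class name
print(list(pipeline.named_steps))
print(accuracy)
```

Because the scaler is only ever fitted inside `pipeline.fit`, the test data never influences the scaling parameters.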
    

ColumnTransformer

  • The ColumnTransformer selects specific features and processes in parallel
  • Most mixed datasets (e.g. numeric plus categorical features) need this
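
A minimal `ColumnTransformer` sketch on a toy mixed dataset. The DataFrame and its column names are hypothetical, and one-hot encoding is used only as a stand-in for whatever categorical processing is appropriate.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy mixed-type dataset (hypothetical columns)
df = pd.DataFrame({
    "age": [21, 35, 48, 29],
    "income": [30_000, 62_000, 85_000, 41_000],
    "program": ["data", "cs", "data", "math"],
})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),   # scale the numeric columns
    ("cat", OneHotEncoder(), ["program"]),          # one-hot encode the categorical column
])

X = preprocessor.fit_transform(df)
print(X.shape)  # 2 scaled numeric columns + 3 one-hot columns
```

A `ColumnTransformer` can itself be a step in a `Pipeline`, so the branching preprocessing and the model are fitted together.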

1:many transformations

  • So far our numeric transformations have been 1:1
  • 1:many creates multiple features from 1, like binning
  • Basis expansion introduces nonlinearity to a continuous feature, e.g.:

    $x \rightarrow [x, x^2, x^3]$

  • More exciting is the spline version, where you can define different polynomials for different ranges of $x$
Full disclosure: I have never used this approach

Where we left off on January 29

Dimensionality reduction methods

many:many transformations

  • Linear projection methods create new features:

    $Z = XW$

    where $W$ is the projection matrix and $Z$ is the transformed data
  • We can use this for dimensionality reduction by only keeping some of the columns of $W$
  • Example: Principal component analysis (PCA) finds the set of orthogonal vectors that explain the most variance

Variance and covariance - Review (?)

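  • The variance of a single feature $x$ is:

    $\sigma_x^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$

  • The covariance between two features $x$ and $y$ is:

    $\sigma_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$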

  • Independent variables will have low (ideally zero) covariance, but low covariance does not necessarily mean independence!

    Or divide by n-1 for an unbiased estimate of the population variance

Covariance matrix

  • The covariance matrix between all features in a matrix $X$ is:

    $C = \frac{1}{n} (X - \bar{X})^T (X - \bar{X})$

    where $\bar{X}$ holds the mean of each column

  • Just like regular variance, covariance scales with the data

  • If you normalize this, you get the correlation matrix
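
A NumPy sketch of the covariance and correlation matrices, assuming synthetic data with two correlated features on very different scales. It divides by n, so it is compared against `np.cov` with `bias=True`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two correlated features on very different scales (synthetic)
a = rng.normal(size=500)
b = 100 * (0.8 * a + 0.6 * rng.normal(size=500))
X = np.column_stack([a, b])

# Covariance matrix from the definition, dividing by n
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(X)

# np.cov divides by n-1 by default; bias=True divides by n instead
assert np.allclose(C, np.cov(X, rowvar=False, bias=True))

# Normalizing by the standard deviations gives the correlation matrix
sigma = np.sqrt(np.diag(C))
R = C / np.outer(sigma, sigma)

print(C)  # entries scale with the units of the data
print(R)  # entries are bounded between -1 and 1
```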

Back to PCA

  • Principal component analysis, proposed in 1901 by Karl Pearson, is a linear projection of multidimensional data onto the axes of maximum variance
  • The axes are found by eigendecomposition of the covariance matrix:

    $C = W \Lambda W^T$

  • The eigenvectors (columns of $W$) form the new orthogonal basis for the covariance matrix $C$, while the eigenvalues (diagonal of $\Lambda$) represent the amount of variance "explained"
  • can be calculated as:

Dimensionality reduction with PCA

  • Sort the eigenvectors in order of decreasing eigenvalue ("amount of variance explained")
  • Keep only the first $k$ eigenvectors to form $W_k$
  • Project the (mean-centered) data onto this smaller basis:

    $Z_k = X W_k$

  • Implemented in Scikit-learn by passing n_components to PCA(), e.g.
    from sklearn.decomposition import PCA
    pca = PCA(n_components=2) # just two axes
    X_reduced = pca.fit_transform(X)
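
The sort, keep, and project steps can also be written out by hand and compared with Scikit-learn. This is a sketch assuming the iris dataset and `k = 2`. It uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
Xc = X - X.mean(axis=0)                     # PCA works on mean-centered data

# Eigendecomposition of the covariance matrix
C = Xc.T @ Xc / len(X)
eigenvalues, W = np.linalg.eigh(C)

# eigh returns eigenvalues in ascending order, so re-sort in decreasing order
order = np.argsort(eigenvalues)[::-1]
eigenvalues, W = eigenvalues[order], W[:, order]

# Keep the first k eigenvectors and project
k = 2
Z = Xc @ W[:, :k]

# Compare with Scikit-learn; the sign of each component is arbitrary
Z_sklearn = PCA(n_components=k).fit_transform(X)
print(np.allclose(np.abs(Z), np.abs(Z_sklearn)))
print(eigenvalues[:k] / eigenvalues.sum())  # fraction of variance explained
```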
    

Linear discriminant analysis (LDA)

  • Similar to PCA, but supervised
  • Finds a basis that maximizes class separability
  • Number of projection vectors must be <= min(n_classes - 1, n_features)
  • Still uses projections and eigenvalues, but adds some statistical magic
  • Key assumptions:
    • Features are normally distributed within a class
    • Classes have identical covariance matrices
  • Implemented in Scikit-learn as LinearDiscriminantAnalysis
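
A minimal usage sketch, assuming the iris dataset (4 features, 3 classes), which allows at most 2 projection vectors.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)        # 4 features, 3 classes

# At most min(n_classes - 1, n_features) = min(2, 4) = 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)          # supervised: needs the labels y

print(X_lda.shape)
```

Unlike `PCA`, the call to `fit_transform` takes `y`, which is what makes the method supervised.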

Other methods

  • It seems crazy, but random projections can work to reduce dimensionality
  • Another many:many transformation that has gained popularity in the "Deep Learning era" is the autoencoder
  • These are (typically) neural networks that learn a lower-dimensional representation by training the output to match the input
  • The "bottleneck" layer in the middle is the reduced representation
  • During inference, the model is chopped in half
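
A sketch of the random projection idea using Scikit-learn's `GaussianRandomProjection`, assuming synthetic high-dimensional data. It checks how well pairwise distances survive the projection. The sizes are arbitrary choices.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))         # 100 samples, 1000 features (synthetic)

projector = GaussianRandomProjection(n_components=200, random_state=0)
X_small = projector.fit_transform(X)

# Ratio of pairwise distances after vs. before projection
d_before = pairwise_distances(X)
d_after = pairwise_distances(X_small)
mask = ~np.eye(len(X), dtype=bool)       # ignore the zero self-distances
ratio = d_after[mask] / d_before[mask]

# A mean ratio near 1 means distances are roughly preserved
print(X_small.shape, ratio.mean(), ratio.std())
```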

Pros and cons of dimensionality reduction

Pros

  • Can improve model by:
    • removing noise
    • reducing collinearity
  • Lower complexity/compute cost
  • Can help with visualization
  • Can be informative about features

Cons

  • Throwing out information
  • Harder to interpret results
  • May not actually improve anything

The list of pros is longer, but still... don't do it unless it's demonstrably beneficial

Summary

  • Numerical data often needs to be transformed to fit model assumptions
  • Standardizing (and maybe normalizing) are very common
  • Nonlinear transformations can also be beneficial
  • Dimensionality reduction can help with modelling or visualization
  • As usual, let the data be your guide

Coming up next

  • Assignment 1 - due Friday, ish (Monday is fine, and let me know if you need more than that) - Already passed!
  • Lab: practice with transformation pipelines - Already passed!
  • Next topic: Missing data, outliers, and interaction effects

draw on the board

Discussion time