Lecture 1. Introduction

HTML Slides │ PDF Slides │ Demo code on GitHub

Meet your instructor#

Name: Charlotte Curtis

Pronouns: She/her

Office: B102-4

Email: ccurtis@mtroyal.ca

Office hours: Book here

My Background#

Another new class!#

This course introduces techniques for ethically and responsibly wrangling and manipulating datasets to make them appropriate for addressing the question at hand. Topics may include cleaning and transforming data, integrity and quality measures, common file formats, feature selection and engineering, and generating features from unstructured sources such as text and images.

Grade Assessment#

Component | Weight
Tutorial exercises | 10%
Assignments | 30%
Midterm exam | 25%
Final exam | 35%

Bonus marks may be awarded for substantial corrections to the course materials, submitted as pull requests.

Source repo: https://github.com/mru-data3464/w26

Rendered at: https://mru-data3464.github.io/w26

Textbook(s)#

Don’t just rely on AI summaries!

Speaking of AI…#

In this course (and others, and your career), you will need to know:

  • What to do, and why
  • How to do it

(also when and who)

Which of these things seem appropriate for AI assistance?

The plan - before Reading Week#

Week | Topic | Chapter (ish)
1 | Review and overview | 1-2
2 | Exploring data, sampling, splitting | 3-4
3 | Representing categorical data | 5
4 | Numeric transformations, dimensionality reduction | 6
5 | Dealing with missing values | 7-8
6 | Feature selection | 10

The plan - after Reading Week#

Week | Topic
7 | Midterm
8 | Extracting data from text
9 | Image representation and processing
10 | Data labelling and augmentation
11 | Processing pipelines
12 | Supervised and unsupervised learning
13 | Project presentations, buffer time

Core courses so far#

[Figure: flowchart of core courses so far]

What do you know about…#

  • Various probability distributions
  • Linear and logistic regression
  • Data quality measures
  • Data stewardship best practices
  • Document parsing, web scraping, audio/video feature detection
  • Linear algebra and array programming
  • Prediction tasks: classification and regression
  • Clustering and anomaly detection
  • Evaluation metrics
  • Basic data visualization (scatter plots, histograms, etc)

What do you want to know about?#

Examples of Subject Matter#

  • Finance
  • Real estate
  • Transportation
  • Climate
  • Politics
  • Biology
  • Chemistry
  • Malware

Examples of Data types#

  • Unstructured text
  • Structured text (e.g. csv, HTML)
  • PDF
  • Word documents
  • Images
  • Audio
  • Video

Where we left off on January 6#

Case study: risk of ischemic stroke#

Chapter 2: http://www.feat.engineering/stroke-tour

  • Arterial stenosis can predict risk
  • Plaque composition plays a role
  • Features extracted from CT images
  • Other risk factors (demographics, lifestyle) added to dataset

Many decisions in the data analysis process are subjective, so I will often make different decisions than the textbook does.

From data to prediction#

  1. Understand the problem and define the task
  2. Collect, anonymize and organize the data
  3. Extract features
  4. Explore the dataset
  5. Select a model and preprocess
  6. Train the model
  7. Evaluate, fine-tune, iterate
  8. Deploy and maintain your system
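
As a rough illustration of how steps 4 to 7 might look in code, here is a minimal sketch using scikit-learn. Everything specific in it is an assumption made for the example: the data is synthetic (generated by `make_classification`, not the stroke dataset), and the model, the split sizes, and the variable names are arbitrary choices rather than the ones used in the lecture demo.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for steps 2-3: synthetic features and labels (not real data).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

# Hold back a test set first, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

# Steps 5-6: bundle preprocessing with the model so that the scaler's
# statistics are learned from the training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Step 7: evaluate on the validation set; the test set stays untouched
# until the modelling decisions are final.
val_accuracy = model.score(X_val, y_val)
print(f"train={len(X_train)} val={len(X_val)} test={len(X_test)}")
print(f"validation accuracy: {val_accuracy:.3f}")
```

Splitting twice gives a 60/20/20 division here. The exact proportions are a judgment call that depends on how much data is available.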

Applied to the stroke example#

  1. What is the problem? What do we need to do?
  2. (Collect, anonymize and organize the data) - Done for us
  3. (Extract features) - Done for us
  4. Explore the dataset
    • A critically important component, DO NOT OFFLOAD TO AI
    • This can even be where the data sciencing stops and we jump straight to visualizations and communicating insights!
    • Check out Data for Good case studies
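
A first pass at exploration often starts with a few one-liners. The sketch below assumes pandas and uses a tiny made-up table: the column names (`max_stenosis_pct`, `age`, `sex`, `stroke`) and all of the values are hypothetical and are not taken from the textbook's stroke data.

```python
import pandas as pd

# Made-up toy records for illustration only; column names are hypothetical.
df = pd.DataFrame(
    {
        "max_stenosis_pct": [70.0, 85.0, None, 40.0, 90.0, 55.0],
        "age": [61, 72, 68, 55, 80, 63],
        "sex": ["F", "M", "M", "F", "F", "M"],
        "stroke": [0, 1, 0, 0, 1, 0],
    }
)

# Summary statistics for the numeric columns.
summary = df.describe()

# How many values are missing in each column?
missing_per_column = df.isna().sum()

# How balanced is the outcome we want to predict?
class_balance = df["stroke"].value_counts(normalize=True)

# Does a candidate predictor differ between outcome groups?
# (mean() skips missing values by default)
mean_by_outcome = df.groupby("stroke")["max_stenosis_pct"].mean()

print(summary)
print(missing_per_column)
print(class_balance)
print(mean_by_outcome)
```

None of these numbers mean anything on their own. The point is to look at them, ask whether they make sense for the problem, and decide what to check next.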

5. Select a model and preprocess#

7. Evaluate, fine-tune, and iterate#

  • In my example, I jumped straight to testing on the held-back test set
  • This is a terrible idea! Without a validation step, we have no confidence that the model actually works. We could be:
    • overfitting to the training data
    • making incorrect assumptions about the data
    • applying inappropriate transformations, or missing some
    • using the wrong model altogether

Validation needs to happen before the final testing
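
One common way to validate before touching the test set is k-fold cross-validation on the training portion. The sketch below is one possible setup, not the approach from the lecture demo: it assumes scikit-learn, uses synthetic regression data from `make_regression`, and picks a ridge model and 5 folds arbitrarily.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the stroke dataset).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Hold back the test set and do not look at it during model development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Preprocessing lives inside the pipeline, so it is refit within each fold
# and never sees that fold's held-out rows.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# 5-fold cross-validation on the training data only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
print(f"CV R^2: mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}")

# Only once the modelling choices are settled: fit on all training data
# and score a single time on the held-back test set.
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"test R^2: {test_r2:.3f}")
```

If the cross-validation scores are poor or vary widely between folds, that is the signal to revisit the transformations or the model choice while the test set is still unseen.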

Coming up next#

  • Lab: basic regression, show me where you’re at
  • Lectures: exploratory data analysis
    • Summary statistics
    • Basic visualizations
    • When and how to split your dataset