Lecture 8: Wrangling text

Topic overview
- Text file formats and character encodings
- Extracting features from (un)structured text
- Regular expressions
What is a text file?
- Any file is just a sequence of bytes
- The file extension is only a hint: renaming a file does not change its bytes
- Text files contain only human-readable characters
- Binary files are everything else, including:
  - Images
  - PDFs
  - Word documents*
  - Executables
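To make the "sequence of bytes" point concrete, here is a minimal Python sketch. The helper `looks_like_text` and the file name `notes.jpg` are hypothetical, introduced only for illustration, and the check itself (no NUL bytes, decodes as UTF-8) is a rough heuristic rather than a definitive test for "text".

```python
import os
import tempfile


def looks_like_text(data: bytes) -> bool:
    """Rough heuristic (an assumption, not a standard): no NUL bytes
    and the bytes decode cleanly as UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Save plain CSV text under a misleading ".jpg" extension.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.jpg")
    with open(path, "wb") as f:
        f.write(b"name,age\nAda,36\n")
    with open(path, "rb") as f:  # "rb" reads raw bytes, no decoding
        raw = f.read()

print(list(raw[:4]))         # [110, 97, 109, 101], the bytes for "name"
print(looks_like_text(raw))  # True, despite the .jpg extension

# The first eight bytes of every PNG file (its signature) are not valid UTF-8.
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])
print(looks_like_text(png_signature))  # False
```

The extension never enters the check: only the bytes themselves decide whether the content can be read as text.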
Character encodings

- To properly read a text file, we need to know its encoding
- Character encodings define how bytes are interpreted
- This turns out to be surprisingly complicated
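As a small illustration of why the encoding matters, the Python sketch below decodes the same bytes three ways. The sample word "café" is an arbitrary choice for the example.

```python
# Encode one string to bytes, then interpret those bytes under different encodings.
raw = "café".encode("utf-8")
print(raw)       # b'caf\xc3\xa9'
print(len(raw))  # 5 bytes for 4 characters: "é" takes two bytes in UTF-8

as_utf8 = raw.decode("utf-8")      # correct interpretation
as_latin1 = raw.decode("latin-1")  # wrong encoding: no error, just wrong text
print(as_utf8)    # café
print(as_latin1)  # cafÃ©

# Some encodings reject the bytes outright instead of producing garbage.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)  # True
```

When reading files, the same choice is made through the `encoding` argument of `open`, for example `open(path, encoding="utf-8")`. If it is omitted, Python falls back to a platform-dependent default.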
ASCII
- American Standard Code for Information Interchange (1963)
- 7-bit encoding (128 characters)
- The first 32 codes (0 to 31) are control characters (e.g., newline, tab)
- Then punctuation, digits, uppercase letters, lowercase letters
- Most computers use 8-bit bytes, so another 128 values (128 to 255) are “left over”
This is a very English-centric character set!
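The layout described above can be explored directly in Python. This is a minimal sketch: the helper `is_ascii` is a hypothetical name for illustration (recent Python versions also offer `str.isascii()`).

```python
# Code points of a few characters, and two of the control characters.
print(ord("0"), ord("A"), ord("a"))  # 48 65 97
print(repr(chr(9)), repr(chr(10)))   # '\t' '\n'

# All 128 ASCII characters fit in 7 bits.
ascii_chars = [chr(i) for i in range(128)]
assert all(ord(c) < 2**7 for c in ascii_chars)

# Printable characters run from space (32) through "~" (126).
printable = [c for c in ascii_chars if c.isprintable()]
print(len(printable))  # 95


def is_ascii(text: str) -> bool:
    """Return True if every character of text can be encoded as ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(is_ascii("naive"))  # True
print(is_ascii("naïve"))  # False: "ï" has no ASCII code
```

The last line shows the English-centric limitation in practice: even a single accented letter falls outside the 128 codes.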
