Lab 8: Bash and Regular Expressions#
Objective#
Learn about:
- Navigating a computer on the command line
- Basic bash syntax
- Extracting text features with regular expressions
Setup#
There is no starter code this week; instead, I’d like you to create a lab08 subdirectory and put your files in it.
While I can’t tell how you actually make a directory, this is a good excuse to practice with bash! Try doing the following:
- Navigate to your labs repository using cd
- Create a new directory using mkdir
- cd into it so that you’re working in the right place
Basic Bash#
Open up Git Bash (if you’re on Windows), or Terminal on macOS, or some flavour of terminal on Linux. If you’re using Git Bash, then the C: drive is mapped to /c/. On the lab computers, you’re probably starting off in your H: drive, or /h/.
There are many useful bash commands for interacting with your file system. I’m not going to go through them in detail, but here are a few that might be useful. Try them!
- echo: displays whatever arguments you give it
- pwd: display the name of the current working directory
- ls: list the contents of a directory
- cp: copy a file
- mv: copy a file, remove the original; also used for renaming
- cat, less, head, tail: display the contents of a file in various ways
- rm: remove a file
- mkdir: create a new directory
- rmdir: remove an empty directory
- cd: change current working directories
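Here is one way to try most of these in a single short session. It is only a sketch: the directory and file names (notes, a.txt and so on) are made up, and everything happens in a throwaway directory created with mktemp so nothing important gets touched.

```shell
# Work in a throwaway directory (mktemp -d creates one and prints its path).
scratch=$(mktemp -d)
cd "$scratch" || exit 1
pwd                               # show where we are now

mkdir notes                       # create a directory
echo "hello bash" > notes/a.txt   # echo, with > sending the text into a file
cp notes/a.txt notes/b.txt        # copy: a.txt and b.txt now both exist
mv notes/b.txt notes/c.txt        # rename: b.txt is gone, c.txt appears
ls notes                          # lists a.txt and c.txt
cat notes/c.txt                   # print the whole file
head -n 1 notes/a.txt             # print only the first line

rm notes/a.txt notes/c.txt        # remove both files
rmdir notes                       # succeeds only because notes is now empty
```

Note that rm does not ask for confirmation and there is no recycle bin, which is why the sketch stays inside a scratch directory.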
Some bash commands have mandatory arguments. These are separated by spaces. For example, you can’t mkdir without telling it the name of the directory you want to make:
$ mkdir some_new_dir
Since spaces separate arguments, filenames with spaces can cause trouble. Try the following:
$ mkdir some new dir
What did this do?
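Once you have looked at the result, the usual ways to get a single name that contains spaces are quoting or backslash-escaping. A minimal sketch, run in a scratch directory, with made-up directory names:

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1

mkdir "my new dir"         # quotes keep the three words together as one argument
mkdir another\ new\ dir    # a backslash before each space does the same thing
ls -1                      # one entry per line, so names with spaces are easy to read

cd "my new dir" || exit 1  # the same quoting is needed whenever the name is used
```

In practice it is often simpler to avoid spaces in file names and use underscores instead, as in some_new_dir.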
curl: transfer data with URLs#
First, make sure you are working in your lab08 subdirectory. Then, try downloading a webpage (any webpage) with:
$ curl -O url
where url is the address of the webpage (you can copy this from your browser).
-O (uppercase) tells curl to save the output to a file named after the last part of the source URL.
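The sketch below contrasts -O with its lowercase cousin -o, which lets you choose the local file name yourself. So that it runs without a network connection, it points curl at a local file:// URL standing in for a web page; that stand-in file and the names page.html and copy.html are assumptions for illustration, and an https:// address would be used the same way.

```shell
# Stand-in "web page" so the sketch works offline (hypothetical file).
src_dir=$(mktemp -d)
out_dir=$(mktemp -d)
printf '<html><body>hello</body></html>\n' > "$src_dir/page.html"
url="file://$src_dir/page.html"   # swap in a real https:// address here

cd "$out_dir" || exit 1
# -O saves under the remote file name (page.html).
# -sS hides the progress meter but still shows error messages.
curl -sS -O "$url"
# -o (lowercase) saves under a name you pick.
curl -sS -o copy.html "$url"
ls
```

Quoting "$url" is a good habit, since real addresses often contain characters such as & or ? that bash would otherwise interpret.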
Repeat#
One of the main advantages to using the command line is the ability to loop over files. A basic for loop might be:
for file in *; do
    echo "$file"
done
which just loops over every file in the current directory and prints its name.
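The * can be narrowed to a pattern, and the loop body can do more than echo. The sketch below, using made-up files in a scratch directory, loops over only the .txt files and reports how many lines each one has:

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1
printf 'one\ntwo\n' > a.txt
printf 'three\n' > "b c.txt"      # a name with a space, on purpose
touch ignore.md                   # not matched by *.txt

count=0
for file in *.txt; do
    # Quoting "$file" keeps a name with spaces together as one argument.
    lines=$(wc -l < "$file" | tr -d ' ')
    echo "$file has $lines line(s)"
    count=$((count + 1))
done
echo "looped over $count files"
```

Without the quotes around $file, the name "b c.txt" would be split into two arguments, b and c.txt, which is the same problem as the mkdir example earlier.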
You can also define lists of numbers to loop over, e.g.:
for n in {1..3}; do
    echo "$n"
done
Task 1: Download a set of Hansards#
I showed you this briefly in class last Thursday, but it’s a handy dataset so we’ll do it again.
Parliamentary debates are recorded in a document called a Hansard. These are published online at the website ourcommons.ca in structured XML, but accessing each day’s record is kind of annoying and involves a lot of clicking. Fortunately, the name of each XML document is predictable, and is currently of the form (as of March 13, 2026):
https://www.ourcommons.ca/Content/House/[PPS]/Debates/[HHH]/HAN[HHH]-[L].XML
The parts in square brackets are placeholders:
- [PPS]: 2 digits for the parliament and 1 for the session
- [HHH]: 3 digits for the Hansard number
- [L]: 1 character for language (E or F)
For example, the Hansard for March 12th, 2026 is number 95 of the 1st session of the 45th parliament. The English language Hansard can be retrieved from:
https://www.ourcommons.ca/Content/House/451/Debates/095/HAN095-E.XML
Using this information, combined with the for loop syntax above, download a selection of 10 Hansards.
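One wrinkle is that the Hansard number has to be zero-padded to 3 digits, and a plain {1..3} loop produces 1, 2, 3. The sketch below shows one possible way to handle that with printf. It is a dry run that only prints URLs and downloads nothing; hansard_url is a hypothetical helper name, and the range 93 to 95 is picked arbitrarily around the example above.

```shell
# Hypothetical helper: build the URL for one Hansard from its three parts.
hansard_url() {
    local pps=$1 number=$2 lang=$3
    local hhh
    # %03d pads with leading zeros: 7 -> 007, 95 -> 095.
    # Pass the number unpadded; printf would read 095 as (invalid) octal.
    hhh=$(printf '%03d' "$number")
    echo "https://www.ourcommons.ca/Content/House/${pps}/Debates/${hhh}/HAN${hhh}-${lang}.XML"
}

for n in {93..95}; do
    hansard_url 451 "$n" E    # prints the URL only
done
```

Printing the URLs first is a cheap way to check the pattern before pointing curl at it.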
Task 2: Extract information#
The goal in this section is to create a text file containing only the sentences from the Hansards that reference the economy (or economics, or economist, etc). At the end of this lab, you should have two text files: one containing your incremental regex development, and the other the final result.
As you do each step, record the command that you used in a text or Markdown file. Formatting doesn’t matter.
For quick and easy iterative regex development, try using https://regex101.com
1. Using grep, find and print to the terminal all instances of the word econ in the Hansards, taking care not to match words incidentally containing econ such as second. Use the flag -E for “extended” regex (which is what we’ve been discussing in class).
2. Modify your regex to be case-insensitive (hint: it’s another flag to pass to grep)
3. Modify your regex to match the entire word that has econ in it, and limit the terminal output to just those words (yet another flag).
4. Modify your regex to match the entire sentence that has econ in it.
5. Finally, run your grep command and instead of printing the result to the terminal, redirect the output to a text file.
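To show how the flags fit together without giving away the answer, here is a sketch of the same kind of workflow on a different word (budget) and a tiny made-up sample file. The file names are hypothetical, and \b (word boundary) is an extension that GNU grep supports; other grep implementations may differ.

```shell
demo_dir=$(mktemp -d)
cat > "$demo_dir/sample.txt" <<'EOF'
The Budget was tabled today. Members debated the budgetary outlook.
Nobody mentioned the debudgeting proposal.
EOF

# Plain search: matches both lines, including the incidental "debudgeting".
grep -E 'budget' "$demo_dir/sample.txt"

# \b anchors the match to the start of a word; -i ignores case.
# Only the first line matches now.
grep -E -i '\bbudget' "$demo_dir/sample.txt"

# [a-z]* extends the match to the end of the word;
# -o prints only the matched text, one match per line.
# > sends the output to a file instead of the terminal.
grep -E -i -o '\bbudget[a-z]*' "$demo_dir/sample.txt" > "$demo_dir/matches.txt"
cat "$demo_dir/matches.txt"
```

Note that > overwrites the target file each time; >> appends instead, which can be handy when collecting results from several commands.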
Submit#
Add the two files that you created in Task 2 to your lab08 subdirectory, then commit and push them to GitHub. You do not need to commit the Hansard XML files. If you didn’t get all the way to the sentence extractor, that’s okay; as usual, it’s a completion-based lab mark.