Lab 4: Numeric Data Transformations#
Objective#
Learn a bit about:
- Transforming numeric data
- Building preprocessing pipelines
Since you’ve already done a lot of painful wrangling in your first assignment to combine data into a useful tabular form, I’ve done this bit for you in this lab. We’re also going to use the same (sort of) housing assessment data from lab 2, so the dataset should be familiar.
If you are unable to fetch data from the City: I’ve put a copy of a (200MB+) CSV version at “I:\Labs\CompSci\Resources\DATA 3464\housing_data_pre_split.csv”, accessible either through the lab computers or WebFiles at gp.mtroyal.ca
Setup#
Merge in the new pull request from your labs repo to get the starter code. This time you should be able to enter your credentials and hit go, and then it’ll fetch the data and do some preliminary cleaning. In the end, you’ll be left with a train/test/validation split of residential housing assessments with the following columns.
```text
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 487996 entries, 0 to 490511
Data columns (total 11 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   ADDRESS               487996 non-null  object
 1   RE_ASSESSED_VALUE     487996 non-null  float64
 2   COMM_CODE             487996 non-null  object
 3   COMM_NAME             487996 non-null  object
 4   LAND_SIZE_SM          487996 non-null  float64
 5   SUB_PROPERTY_USE      487996 non-null  object
 6   MULTIPOLYGON          487996 non-null  object
 7   YEAR_OF_CONSTRUCTION  484288 non-null  float64
 8   PROPERTY_USE_DESC     480509 non-null  object
 9   longitude             487996 non-null  float64
 10  latitude              487996 non-null  float64
dtypes: float64(5), object(6)
memory usage: 44.7+ MB
```

I ran this on Friday, January 30th. The numbers may have been updated a little since then - this is why I chose to use the hashing method to split the dataset.
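The point of the hashing method is that each row's split assignment depends only on a hash of a stable identifier, so new rows added by the City don't reshuffle existing rows between splits. A common version of the trick looks roughly like this (the starter code's actual implementation and identifier column may differ):

```python
from zlib import crc32

import pandas as pd

def in_test_set(identifier: str, test_ratio: float) -> bool:
    # A given identifier always hashes to the same value, so a row stays
    # in the same split even after the City adds or updates other rows.
    return crc32(identifier.encode()) / 2**32 < test_ratio

def hash_split(df: pd.DataFrame, id_column: str, test_ratio: float = 0.2):
    mask = df[id_column].map(lambda v: in_test_set(str(v), test_ratio))
    return df[~mask], df[mask]

# Tiny demo with made-up addresses standing in for the real data
demo = pd.DataFrame({"ADDRESS": [f"{i} Example St NW" for i in range(1000)]})
train, test = hash_split(demo, "ADDRESS")
print(len(train), len(test))  # roughly an 80/20 split
```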
Some of these columns are redundant: COMM_CODE and COMM_NAME carry the same information, with COMM_NAME just providing a human-readable name. Similarly, PROPERTY_USE_DESC just explains what the SUB_PROPERTY_USE categories are. Others, like MULTIPOLYGON, should probably be ignored entirely.
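In code, dropping the redundant columns might look like this (a toy one-row frame stands in for the real data here):

```python
import pandas as pd

# Toy frame with the same (relevant) columns as the lab data
df = pd.DataFrame({
    "COMM_CODE": ["ABC"], "COMM_NAME": ["Somewhere"],
    "SUB_PROPERTY_USE": ["SF"], "PROPERTY_USE_DESC": ["Single family"],
    "MULTIPOLYGON": ["..."], "LAND_SIZE_SM": [450.0],
})

# Keep the machine-readable codes, drop their human-readable twins
# and the geometry column
df = df.drop(columns=["COMM_NAME", "PROPERTY_USE_DESC", "MULTIPOLYGON"])
print(list(df.columns))  # ['COMM_CODE', 'SUB_PROPERTY_USE', 'LAND_SIZE_SM']
```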
Preprocessing applied#
Feel free to read through the starter code to see the preprocessing that was done. I’ve tried to add comments to explain my decisions, but I didn’t include the exploration that led to them. That exploration included things like inspecting duplicate addresses and missing assessment values, as well as looking at the value_counts of SUB_PROPERTY_USE.
In the end, there’s a test/train/validation split applied, with RE_ASSESSED_VALUE assigned to the target y variable.
Your task#
Assume that we are going to use this data with some kind of linear model that benefits from having roughly standard normal data. This means you must build a preprocessing pipeline that:
- applies any necessary transformations to make the relationship between the feature and target more linear
- applies any necessary transformation to make the numeric features more normally distributed
- rescales the numeric features
As you’ve already seen, the land size doesn’t make much sense for multi-unit buildings. This would probably benefit from considering interaction effects, but that can be ignored for now; just try to get the numeric features on the same scale and roughly normal. You can do this with scikit-learn pipelines, or by defining your own functions to compute the transforms - just make sure to calculate any parameters from the training set only.
If this is all the energy you have for this lab, you’re done! Apply your transform to the validation set and plot the numeric features after your transformations to make sure it worked.
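A quick way to eyeball the result is to histogram each transformed feature; the names in the commented usage line (`preprocess`, `X_valid`, `numeric_cols`) are placeholders for whatever you called your own fitted transformer and validation features:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_transformed(X_t, names):
    """Histogram each transformed feature; if the pipeline worked, they
    should look roughly bell-shaped and centred near zero."""
    fig, axes = plt.subplots(1, len(names), figsize=(4 * len(names), 3))
    for ax, col, name in zip(np.atleast_1d(axes), np.asarray(X_t).T, names):
        ax.hist(col, bins=50)
        ax.set_title(name)
    fig.tight_layout()
    return fig

# Hypothetical usage with a fitted transformer and validation frame:
# plot_transformed(preprocess.transform(X_valid), numeric_cols)
```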
Extras, if you like#
Try preparing the data further by encoding the categorical predictors and combining the results with the numeric predictors. Next, try training a regression model (like plain old LinearRegression) to see if you can predict the assessment values.