In Exercise 04, you will write some utility functions for wrangling data from a raw CSV data file into representations more conducive to computing. In the process, you will gain experience and comfort working with list data structures.
In this exercise, you will move through a very common first set of steps when working with a new data set:
- Read the data
- Transform it to be in a “shape” that is easier to work with
- Preview and select just the parts of the dataset you are interested in
- Run (simple, in this notebook) analyses
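The four steps above can be sketched end to end on a tiny in-memory table. This is only an illustration under stated assumptions: the function names `read_rows` and `to_columns` and the three sample rows are hypothetical, not the names or data the exercise requires. Only the column names `subject_sex` and `search_performed` come from this write-up.

```python
import csv
import io
from collections.abc import Iterable

# Hypothetical three-row sample; the real CSV in data/ is larger.
SAMPLE_CSV = """subject_age,subject_sex,search_performed
33,male,FALSE
25,female,FALSE
41,male,TRUE
"""


def read_rows(lines: Iterable[str]) -> list[dict[str, str]]:
    """Read CSV text into a row-oriented table: one dict per row."""
    return [dict(row) for row in csv.DictReader(lines)]


def to_columns(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Transform rows into a column-oriented table: one list per column."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


rows = read_rows(io.StringIO(SAMPLE_CSV))            # 1. read the data
columns = to_columns(rows)                           # 2. reshape it
selected = {"subject_sex": columns["subject_sex"]}   # 3. select a column
male_stops = selected["subject_sex"].count("male")   # 4. run a simple analysis
print(male_stops)
```

The column-oriented shape is the one that makes step 4 easy: every value for a single column already sits in one list.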
The sample data set provided alongside this exercise is police stop data from Durham, as compiled by the Stanford Open Policing Project. This is a very small (348 rows out of the many millions the paper authors compiled) subset of the data behind the recent reading assignment.
Due Sunday 3/28 at 11:59pm - Start Early! Try completing at least one function per day to be on pace.
0. Pull the skeleton code
You will find the starter files needed by “pulling” from the course workspace repository. Before beginning, be sure to:
- Confirm you are in your course workspace. Open the file explorer and you should see your work for the course. If you do not, open your course workspace through File > Open Recent.
- Open the Source Control view by clicking the graph icon in your sidebar (three circles connected by lines) or by opening the Command Palette and searching for Source Control.
- Click the ellipsis in the Source Control pane and select “Pull, Push” from the drop-down menu, then select “Pull from…”. A box will appear and you should select either “origin” or “upstream”, but not “backup”. This will begin the pulling process from the course repository. It should silently succeed. (If you are on macOS and no longer see anything in Source Control, it may be because of a macOS update. You can typically resolve this by opening a new Terminal, typing the command xcode-select --install, pressing enter, and following its instructions. You will also need to restart VSCode after doing this.)
- Return to the File Explorer pane and open the exercises directory. You should see it now contains the directory named ex04. If you expand that directory, you will see the starter files: data_wrangling.ipynb. Additionally, you will see a CSV file in the data/ directory with the sample traffic stops from one week in Durham in 2015.
1. Starter Files
Your work in this exercise will be completed across two files:
- data_utils.py: This is the Python module where you will implement utility functions for working with data.
- data_wrangling.ipynb: This is the Jupyter Notebook file that makes use of the utility functions.
The descriptions of the functions you will need to implement, as well as example code that makes use of those functions once they are correctly implemented, can be found in the notebook.
If your screen is large enough, you are encouraged to open these files side-by-side in VSCode by dragging the tab of one to the right side of VSCode so that it changes to a split pane view. Closing your file explorer can help give you additional horizontal space.
Be sure to save your work in data_utils.py before reevaluating cells in the notebook.
Unfortunately, we cannot publish corrections to the data_wrangling.ipynb file without creating conflicts with anyone who has begun work on this exercise. As such, any corrections we need to issue will be posted here.
In the code cell, the if statement should compare data_cols_head.keys() != data_cols.keys().
If the argument passed to head for the number of rows to include is greater than the number of rows in the table, you should return the entire table.
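One way to satisfy this correction is sketched below. It assumes head takes a column-oriented table (a dict mapping column names to lists of values) and a non-negative row count; that signature, the parameter names, and the sample values are assumptions for illustration, not the required solution.

```python
def head(table: dict[str, list[str]], n: int) -> dict[str, list[str]]:
    """Return a new column table holding only the first n rows.

    Assumes n >= 0. Slicing past the end of a list is safe in Python, so
    values[:n] yields a copy of the whole column when n exceeds its length.
    """
    result: dict[str, list[str]] = {}
    for name, values in table.items():
        result[name] = values[:n]
    return result


# Hypothetical sample data for illustration only.
data_cols = {
    "subject_sex": ["male", "female", "male"],
    "search_performed": ["FALSE", "FALSE", "TRUE"],
}
data_cols_head = head(data_cols, 10)  # asks for more rows than exist

if data_cols_head.keys() != data_cols.keys():
    print("head dropped or renamed a column")
```

Because the slice builds a new list for every column, the original table is left unmodified.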
There is a typo in the write-up that references “3 columns” where only “2 columns” are expected.
The counts that are listed as expected are slightly incorrect because we changed the demo to use the subject_sex column instead of the search_performed column. You can ignore the search_counts line of the expected output.
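For reference, counting the values of a column such as subject_sex can be sketched as a simple tally. The function name `count`, its signature, and the sample values here are assumptions for illustration, and the result does not correspond to the real data set.

```python
def count(values: list[str]) -> dict[str, int]:
    """Tally how many times each distinct value appears in a column."""
    tallies: dict[str, int] = {}
    for value in values:
        if value in tallies:
            tallies[value] += 1
        else:
            tallies[value] = 1
    return tallies


# Hypothetical column values for illustration only.
sex_counts = count(["male", "female", "male"])
print(sex_counts)
```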
Log in to Gradescope and select the assignment named “EX04 - Data Wrangling”. You’ll see an area to upload a zip file. To produce a zip file for autograding, return to Visual Studio Code.
If you do not see a Terminal at the bottom of your screen, open the Command Palette and search for “View: Toggle Integrated Terminal”.
To produce a zip file for ex04, type the following command (all on a single line):
python -m tools.submission exercises/ex04
In the file explorer pane, look for the zip file named “21.mm.dd-hh.mm-exercises-ex04.zip”. The “mm”, “dd”, and so on are timestamp fields for the current month, day, hour, and minute. If you right click on this file and select “Reveal in File Explorer” on Windows or “Reveal in Finder” on Mac, the zip file’s location on your computer will open. Upload this file to Gradescope to submit your work for this exercise.
Autograding will take a few moments to complete. For this exercise, some points will be manually graded for style: using meaningful variable names and snake_case. If there are issues reported, you are encouraged to try to resolve them and resubmit. If for any reason you aren’t receiving full credit and aren’t sure what to try next, come visit us in office hours!