EX04 - Data Wrangling


Overview

In Exercise 04, you will write some utility functions for wrangling data from a raw CSV data file into representations more conducive to computing. In the process, you will gain experience and comfort working with dict and list data structures.

In this exercise, you will move through a very common first set of steps when working with a new data set:

  1. Read the data
  2. Transform it to be in a “shape” that is easier to work with
  3. Preview and select just the parts of the dataset you are interested in
  4. Run analyses (simple ones, in this notebook)
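
The four steps above can be sketched in a few lines of Python. This is only an illustration, not the exercise's reference solution: the function names read_rows and columnar and the inline sample rows are assumptions made for this example. Only the subject_sex and search_performed column names come from the data set.

```python
import csv
import io

# Hypothetical sample rows standing in for the CSV file in the data/ directory.
SAMPLE_CSV = """subject_sex,search_performed
male,FALSE
female,FALSE
male,TRUE
"""


def read_rows(csv_text: str) -> list[dict[str, str]]:
    """Step 1: read each CSV row into a dict keyed by column name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


def columnar(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Step 2: reshape row-oriented data into one list of values per column."""
    table: dict[str, list[str]] = {}
    for row in rows:
        for name, value in row.items():
            table.setdefault(name, []).append(value)
    return table


rows = read_rows(SAMPLE_CSV)
table = columnar(rows)

# Step 3: select just the column of interest.
sexes = table["subject_sex"]

# Step 4: run a simple analysis, here the share of stops that included a search.
searched = table["search_performed"].count("TRUE") / len(rows)
```

The column-oriented shape makes per-column questions (step 4) a single lookup, which is why the transform step comes before the analysis.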

The sample data set provided alongside this exercise is police stop data from Durham, as compiled by the Stanford Open Policing Project. This is a very small (348 rows out of the many millions the paper authors compiled) subset of the data behind the recent reading assignment.

Due Sunday 3/28 at 11:59pm - Start Early! Try completing at least one function per day to be on pace.

0. Pull the skeleton code

You will find the starter files by “pulling” from the course workspace repository. To do so:

  1. Be sure you are in your course workspace. Open the file explorer and you should see your work for the course. If you do not, open your course workspace through File > Open Recent.
  2. Open the Source Control View by clicking the graph icon (three circles connected by lines) in your sidebar, or by opening the Command Palette and searching for Source Control.
  3. Click the ellipsis in the Source Control pane and select “Pull, Push” from the drop-down menu, then select “Pull from…” A box will appear; select either “origin” or “upstream”, but not “backup”. This begins pulling from the course repository and should silently succeed. (If you are on macOS and no longer see anything in Source Control, a macOS update may be the cause. You can typically resolve this by opening a new Terminal, typing the command xcode-select --install, pressing Enter, and following its instructions. You will also need to restart VSCode afterward.)
  4. Return to the File Explorer pane and open the exercises directory. You should see that it now contains a directory named ex04. If you expand that directory, you will see the starter files: data_utils.py and data_wrangling.ipynb. Additionally, you will see a CSV file in the data/ directory with the sample traffic stops from one week in Durham in 2015.

1. Starter Files

Your work in this exercise will be completed across two files:

  1. data_utils.py - This is the Python module where you will implement utility functions for working with data.

  2. data_wrangling.ipynb - This is the Jupyter Notebook file that makes use of the utility functions.

The descriptions of the functions you will need to implement, as well as example code that makes use of those functions once they are correctly implemented, can be found in the data_wrangling.ipynb file.

If your screen is large enough, you are encouraged to open these files side-by-side in VSCode by dragging the tab of one to the right side of VSCode so that it changes to a split pane view. Closing your file explorer can help give you additional horizontal space.

Be sure to save your work in data_utils.py before reevaluating cells in data_wrangling.ipynb.

Corrections

Unfortunately, we cannot publish corrections to the data_wrangling.ipynb file without creating conflicts with anyone who has begun work on this exercise. As such, any corrections we need to issue will be posted here.

1.0) head Function

In the code cell, the if statement should compare data_cols_head.keys() != data_cols.keys().

If the argument passed to head for the number of rows to include is greater than the number of rows in the table, you should return the entire table.
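
One way to get this behavior is sketched below. It assumes the table is stored as a dict mapping each column name to a list of values; the signature is an assumption and may differ from the one described in the notebook. Python slicing stops at the end of a list, so taking the first n values of each column yields the whole column when n exceeds the row count.

```python
def head(table: dict[str, list[str]], n: int) -> dict[str, list[str]]:
    """Return a new table with only the first n rows of each column."""
    result: dict[str, list[str]] = {}
    for column, values in table.items():
        # values[:n] does not raise when n > len(values); it copies the whole list.
        result[column] = values[:n]
    return result


table = {"subject_sex": ["male", "female", "male"]}
first_two = head(table, 2)
everything = head(table, 10)  # more rows requested than exist
```

Here everything holds all three rows, and it is a new dict rather than the original table object.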

1.1) select Function

There is a typo in the write-up that references “3 columns” where only “2 columns” are expected.

2.0) count Function

The counts listed as expected are slightly incorrect because we changed the demo to use the subject_sex column instead of the search_performed column. You can ignore the search_counts line of the expected output.
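
For orientation, a frequency count over one column's values might look like the sketch below. The signature and the sample values are assumptions made for illustration; the description of count in the notebook is authoritative.

```python
def count(values: list[str]) -> dict[str, int]:
    """Map each distinct value to the number of times it appears."""
    counts: dict[str, int] = {}
    for value in values:
        if value in counts:
            counts[value] += 1
        else:
            counts[value] = 1
    return counts


# Hypothetical values standing in for the subject_sex column.
sex_counts = count(["male", "female", "male"])
```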

Autograding

Log in to Gradescope and select the assignment named “EX04 - Data Wrangling”. You’ll see an area to upload a zip file. To produce a zip file for autograding, return to Visual Studio Code.

If you do not see a Terminal at the bottom of your screen, open the Command Palette and search for “View: Toggle Integrated Terminal”.

To produce a zip file for ex04, type the following command (all on a single line):

python -m tools.submission exercises/ex04

In the file explorer pane, look for the zip file named “21.mm.dd-hh.mm-exercises-ex04.zip”. The “mm”, “dd”, and so on, form a timestamp with the current month, day, hour, and minute. If you right-click on this file and select “Reveal in File Explorer” on Windows or “Reveal in Finder” on Mac, the zip file’s location on your computer will open. Upload this file to Gradescope to submit your work for this exercise.

Autograding will take a few moments to complete. For this exercise, some points will be manually graded for style: using meaningful variable names and snake_case. If there are issues reported, you are encouraged to try to resolve them and resubmit. If for any reason you aren’t receiving full credit and aren’t sure what to try next, come give us a visit in office hours!