In this assignment, you are tasked with reading a data science paper investigating whether there is widespread evidence of racial bias in police forces across the nation. The authors compiled a dataset of nearly 100 million traffic stops across the United States and performed several analyses of the policing decisions. The paper was published in the journal Nature Human Behaviour in May 2020 and led to the Stanford Open Policing Project, which aims to make this data more accessible for the general public’s use.
In the next project, you will reproduce some of the analysis performed in the paper’s section on “Assessing Bias in Search Decisions.” To do so, we will need to learn how to read the datasets produced by this research group into our Python programs. Reading the assigned sections of the paper first will establish a conceptual framework and explain the motivation for applying those techniques.
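As a preview of what reading stop data into a Python program might look like, here is a minimal sketch using only the standard library. The column names `subject_race` and `search_conducted`, the `TRUE`/`FALSE` encoding, and every row in the sample are assumptions made up for illustration; the real files and the approach used in the next project may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: column names and rows are invented for illustration.
SAMPLE_CSV = """subject_race,search_conducted
white,FALSE
black,TRUE
white,TRUE
hispanic,FALSE
black,FALSE
white,FALSE
"""


def search_rates(csv_file):
    """Return {race: fraction of stops in which a search was conducted}."""
    stops = defaultdict(int)
    searches = defaultdict(int)
    for row in csv.DictReader(csv_file):
        race = row["subject_race"]
        stops[race] += 1
        # Assumes searches are recorded as the text TRUE or FALSE.
        if row["search_conducted"].strip().upper() == "TRUE":
            searches[race] += 1
    return {race: searches[race] / stops[race] for race in stops}


# io.StringIO lets the sample text stand in for an opened file.
rates = search_rates(io.StringIO(SAMPLE_CSV))
for race, rate in sorted(rates.items()):
    print(f"{race}: {rate:.2f}")
```

With a real file, you would pass the result of `open(path, newline="")` in place of the `io.StringIO` object.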
Before you begin reading
You will probably find that this paper is more technically dense than the previous one. Many of the statistical methods and mathematical derivations are beyond the scope of this class, and frankly, beyond our area of expertise, too. The goal of reading this paper is not to come away being able to explain each detail, but rather to gain exposure to an impactful application of data science and computing to social issues. This paper demonstrates how computer scientists are able to investigate evidence of racially biased policing practices through large-scale data analysis.
To aid in your reading of this paper, we have compiled a short list of terms below that may be useful to have definitions of. This is not an exhaustive list, so Googling and searching around is encouraged if there is a term or concept you want to learn more about.
It may also be helpful to follow along with the guided reading questions on Gradescope as you complete the reading.
- Normalizing data: Getting everything on a common scale
- Logistic regression: Used to model the probability of a class or event that is normally binary, such as dead/alive, true/false, or positive/negative
- Spline: A smooth curve built by joining polynomial pieces (typically of low degree), often used for interpolation; interpolation is a way of estimating values within the range of a discrete set of points
- Degrees of freedom: The number of independent values that are free to vary
- Confidence intervals: A range of plausible values for an estimated quantity, paired with a level of confidence in the method that produced the range (in this paper, we see 95% confidence)
- P-Value: Probability of seeing a result at least as extreme as the one observed, assuming that the null hypothesis is true
- Null hypothesis: The default hypothesis (usually "no effect" or "no difference") that the analysis tests and may reject
- Statistically significant: If something is very unlikely to have occurred given a null hypothesis, it was probably not by chance, so the results are statistically significant
- Bayesian: Generally, Bayesian models treat probability as a degree of belief that is updated as data is observed, instead of as a long-run frequency
- Benchmark analysis: Measuring a specific indicator for one group and comparing it against a reference baseline (the benchmark) or against other groups
- Credible interval: Bayesian equivalent of a confidence interval; a range in which an unobserved parameter value falls with a particular probability
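To make a few of these terms concrete, the sketch below works through normalizing, a 95% confidence interval, a null hypothesis, a p-value, and statistical significance on a toy example. All counts are hypothetical and invented for illustration. The two-proportion z-test used here is a simple textbook technique chosen for readability; it is not the method the paper uses.

```python
import math

# Hypothetical counts, invented purely for illustration (not from the paper).
searches_a, stops_a = 300, 10_000  # group A
searches_b, stops_b = 450, 10_000  # group B

# Normalizing: turn raw counts into rates so the groups share a common scale.
rate_a = searches_a / stops_a
rate_b = searches_b / stops_b


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# 95% confidence interval for group A's search rate (normal approximation).
se_a = math.sqrt(rate_a * (1 - rate_a) / stops_a)
ci_a = (rate_a - 1.96 * se_a, rate_a + 1.96 * se_a)

# Null hypothesis: both groups have the same underlying search rate.
pooled = (searches_a + searches_b) / (stops_a + stops_b)
se_diff = math.sqrt(pooled * (1 - pooled) * (1 / stops_a + 1 / stops_b))
z = (rate_b - rate_a) / se_diff

# P-value: chance of a gap at least this large if the null hypothesis is true.
p_value = 2 * (1 - normal_cdf(abs(z)))

# Statistically significant at the conventional 5% level?
significant = p_value < 0.05

print(f"rate A = {rate_a:.3f}, rate B = {rate_b:.3f}")
print(f"95% CI for rate A = ({ci_a[0]:.4f}, {ci_a[1]:.4f})")
print(f"z = {z:.2f}, p-value = {p_value:.2e}, significant = {significant}")
```

A significant difference in rates does not by itself establish bias, since the groups could differ in other ways; that limitation is part of why the paper goes beyond simple comparisons like this one.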
Who is the paper’s primary author?
Beginning in summer 2021, I will be an assistant professor of computer science at Cornell Tech, Cornell University’s new NYC campus. I am currently a senior researcher at Microsoft Research New England. I develop data science and machine learning methods to study two broad areas: inequality and healthcare. Link to Emma Pierson’s Homepage.
Where should you ask questions on readings?
Is there a term or concept used in the paper that you’re confused by? Please direct your questions to this form and we will respond to you, either directly or via a running Frequently Asked Questions page that responds to questions that arise multiple times.
Any questions asked on readings in office hours, outside of logistical questions, will be redirected to the form above.
Read the Paper, Reflect, and Respond
You can find a PDF copy of the paper on a co-author’s home page: https://5harad.com/papers/100M-stops.pdf
Please read the following sections:
- Introduction (page 736)
- Compiling a national database of traffic stops (page 737)
- Assessing bias in traffic stop decisions (pages 737-738)
- Assessing bias in search decisions (pages 738-739)
There is an assignment open on Gradescope with some guided reading questions. Please respond to these based on your reading of the paper by 3/17/21 at 11:59pm.