A major article widely cited as proof of racial bias in the criminal justice system is actually anything but.
According to ProPublica’s May article “Machine Bias,” a popular risk assessment software program, used across the country to evaluate criminal defendants’ risk of reoffending, has extremely biased outcomes against black defendants compared to white ones.
The article’s findings have started to take off. Immediately after publication, it was cited on sites like The Daily Dot and Fusion. On June 25, The New York Times cited it as proof that both computer algorithms and law enforcement are substantially biased against blacks. Last week, it was touted by FiveThirtyEight’s science reporter.
But ProPublica’s conclusions are deeply flawed, representing a major error in statistical reasoning on the part of its staff.
1. How ProPublica could find bias
ProPublica’s investigation focused on the COMPAS tool from Northpointe, Inc., which predicts a criminal defendant’s likelihood of reoffending after release based on inputs for over 100 different variables (race is explicitly excluded). COMPAS rates defendants on a 1-10 scale, with 1 being the lowest risk and 10 being the highest. COMPAS is used nationwide in pretrial, sentencing, and parole hearings to assist in a variety of judicial decisions.
If COMPAS genuinely had a racial bias, it wouldn’t be too difficult to find evidence. Functionally, it’s just a risk assessment tool, taking the details of criminal defendants and then producing a rating from 1-10 concerning their chances of reoffending. In a phone call, Northpointe co-founder and chief scientist Tim Brennan said a risk rating of 1 represents a 1-2 percent chance of reoffending, while a rating of 10 represents a risk approaching 80 percent.
Finding bias in COMPAS, then, is straightforward. For COMPAS to be a valid and fair test, it must do two things. First, it must predict recidivism with a useful level of accuracy, and second, it must handle different races equally in assessing their risk level. If the program were unfair toward members of a particular race, then members of that race would receive unjustifiably high risk ratings. A biased COMPAS, for instance, might rate a black male defendant as a 4 when he should be a 2, as a 7 when he should be a 5, or as a 10 when he should be an 8.
If COMPAS were showing this kind of bias, it would become apparent based on the recidivism of certain groups. For instance, if blacks rated an 8 reoffended 60 percent of the time, while whites rated an 8 reoffended 75 percent of the time, it would indicate blacks are wrongly receiving substantially higher scores than they should, and are therefore wrongly being treated as more of a crime risk than they really are.
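The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the record layout (group, score, reoffended) are hypothetical, and the 100 records per group are invented to mirror the 60 percent and 75 percent figures in the example, not drawn from any real data set.

```python
from collections import defaultdict

def reoffense_rate_by_score(records):
    """Return {(group, score): share of defendants in that cell who reoffended}."""
    counts = defaultdict(lambda: [0, 0])  # [number who reoffended, total]
    for group, score, reoffended in records:
        cell = counts[(group, score)]
        cell[0] += int(reoffended)
        cell[1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}

# Hypothetical records that reproduce the example in the text:
# 100 defendants per group, all rated an 8.
records = (
    [("black", 8, True)] * 60 + [("black", 8, False)] * 40
    + [("white", 8, True)] * 75 + [("white", 8, False)] * 25
)

rates = reoffense_rate_by_score(records)

# A large gap at the same score would suggest the score means
# different things for the two groups.
gap = rates[("white", 8)] - rates[("black", 8)]
print(f"black, score 8: {rates[('black', 8)]:.0%}")
print(f"white, score 8: {rates[('white', 8)]:.0%}")
print(f"gap: {gap:.0%}")
```

In this invented scenario the same score of 8 corresponds to very different reoffense rates, which is the pattern one would look for as evidence of biased scoring.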
2. What ProPublica actually found
ProPublica was commendably open about its research methods, as they not only wrote a separate article just on their methodology, but also posted all of their data and computer code on GitHub. The transparency makes it easy to follow how ProPublica analyzed the data and reached its conclusions.
In its methodology, ProPublica does check for the kind of bias described above. And the results are encouraging: There’s no evidence it exists.
To assess COMPAS, ProPublica used a data set of 18,610 people who received a COMPAS score in Broward County, Florida in 2013 and 2014. From this pool, it screened out all Hispanic, Asian, American Indian, and “Other” defendants so that it could make calculations using only black and white defendants. Then, it took these defendants and assessed them, looking at their COMPAS scores along with their actual recidivism rates within two years. When conducting its assessments, ProPublica simplified COMPAS’s 1-10 rating system, so that defendants rated 1-4 were “low risk,” those rated 5-7 were “medium risk,” and those rated 8-10 were “high risk” (COMPAS also uses these divisions). For many of its assessments (including those described below), ProPublica further simplified the ratings by merging the medium and high-risk groups into a single unified “high-risk” category of its own creation.
ProPublica ran a Cox regression (a statistical test) on its data set, and the results, by their own admission, were very clear.
“The predictive accuracy of the COMPAS recidivism score was consistent between races in our study – 62.5 percent for white defendants vs. 62.3 percent for black defendants,” they write about midway through their methodology piece, accompanied by a handy chart. “Across every risk category, black defendants recidivated at higher rates.”
In other words, COMPAS is admirably fair in racial terms. Within each risk group, recidivism rates are very similar between races, and when gaps exist they are actually favorable towards blacks. Thirty-five percent of low-risk blacks recidivated, compared to 29 percent of whites, while 63 percent of high-risk blacks recidivated, compared to 59 percent of whites. Critically, being labeled “high-risk” means approximately the same thing for whites as it does for blacks. Blacks also have a slightly higher recidivism rate at each risk level, the exact opposite of what would be expected if COMPAS were wrongly placing them in higher risk categories.
So, how did ProPublica’s team of four reporters arrive at the conclusion COMPAS is biased against blacks? By working backward, a big statistical no-no.
Their conclusion stems from focusing on cases where COMPAS was “wrong,” where high-risk individuals did not reoffend or low-risk individuals actually committed new crimes. Out of black defendants who ultimately did not reoffend, 44.9 percent of them came from the high-risk pool, compared to just 23.5 percent of whites. Meanwhile, among whites who reoffended, 47.7 percent were originally labeled low-risk, compared to just 28 percent of black reoffenders. Or, as ProPublica sums it up, “the algorithm is more likely to misclassify a black defendant as higher risk than a white defendant.”
Here’s where ProPublica goes wrong: A person classified as high-risk who doesn’t recidivate wasn’t necessarily “misclassified” by COMPAS, they were simply a high-risk individual who thankfully didn’t reoffend. As even ProPublica acknowledges, black defendants were substantially more likely than whites to reoffend (overall, 51.4 percent of all blacks reoffended, compared to 39.4 percent of whites).
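The two sets of figures quoted so far, the error rates ProPublica emphasizes and the recidivism rates within each risk group, describe the same data from opposite directions, and Bayes’ rule converts one into the other. The sketch below is a rough consistency check only: the function names are hypothetical, and the inputs are the rounded percentages quoted in this article, so the results are approximate.

```python
def high_risk_reoffense_share(base_rate, false_positive_rate, false_negative_rate):
    """Share of defendants labeled high-risk who went on to reoffend."""
    flagged_reoffenders = (1.0 - false_negative_rate) * base_rate
    flagged_non_reoffenders = false_positive_rate * (1.0 - base_rate)
    return flagged_reoffenders / (flagged_reoffenders + flagged_non_reoffenders)

def low_risk_reoffense_share(base_rate, false_positive_rate, false_negative_rate):
    """Share of defendants labeled low-risk who went on to reoffend."""
    missed_reoffenders = false_negative_rate * base_rate
    cleared_non_reoffenders = (1.0 - false_positive_rate) * (1.0 - base_rate)
    return missed_reoffenders / (missed_reoffenders + cleared_non_reoffenders)

# Rounded figures quoted in the article:
# (overall reoffense rate, false positive rate, false negative rate)
quoted = {
    "black": (0.514, 0.449, 0.280),
    "white": (0.394, 0.235, 0.477),
}

results = {
    group: (high_risk_reoffense_share(*rates), low_risk_reoffense_share(*rates))
    for group, rates in quoted.items()
}

for group, (high, low) in results.items():
    print(f"{group}: high-risk {high:.0%}, low-risk {low:.0%}")
```

With these inputs the check lands on roughly 63 percent and 59 percent for the high-risk groups and roughly 35 percent and 29 percent for the low-risk groups, matching the recidivism rates quoted earlier. The unequal error rates and the near-equal predictive accuracy are therefore two views of one data set, linked by the different overall reoffense rates.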
COMPAS is designed to identify the traits of criminals who reoffend. Blacks, by recidivating more often, are also more likely to have the traits of recidivating criminals, and those traits are being detected by COMPAS. As long as it is equally accurate for all races, COMPAS is going to justifiably classify a higher percentage of blacks as high-risk, because they have more risk factors as a group. By extension, it’s inevitable blacks will have a higher percentage of “false positives,” because a higher percentage of them are placed in the only category where a “false positive” is possible.
Black defendants having a higher proportion of “false positives” isn’t a sign of bias in COMPAS, it’s actually proof the system is working exactly as it should. In other words, ProPublica is accusing COMPAS of bias over something that actually indicates a lack of bias.
To illustrate the point further, imagine if there were a computer program that estimated a person’s “risk” of eating a Philly cheesesteak sandwich in the next two weeks, based on factors like whether they’ve eaten them in the past, how many cheesesteak restaurants are in their neighborhood, etc. Then imagine there are two populations, one of 1,000 Philadelphians and another of 1,000 New Yorkers. 700 Philadelphians and 300 New Yorkers are classified as having a high cheesesteak-eating risk, while 300 Philadelphians and 700 New Yorkers are classified as having a low cheesesteak-eating risk. Ninety percent of those labeled “high-risk” in both pools end up eating a cheesesteak, while 90 percent of those labeled low-risk do not.
In this hypothetical, the Philadelphians would have a “false positive” rate of 20.6 percent (70 false positives out of 340), while New Yorkers would have a false positive rate of just 4.5 percent (30 false positives out of 660). The disparity doesn’t make the test biased against Philadelphians, it just means the test was accurately classifying more Philadelphians as high-risk in total.
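The arithmetic behind the hypothetical can be reproduced with a short Python sketch. The function name and parameter names are illustrative; the counts and the 90 percent accuracy figures come straight from the example.

```python
def false_positive_rate(high_risk, low_risk, high_accuracy=0.9, low_accuracy=0.9):
    """False positives as a share of everyone who did NOT eat a cheesesteak.

    high_accuracy: share of the high-risk group that does eat one.
    low_accuracy:  share of the low-risk group that does not.
    """
    false_positives = high_risk * (1 - high_accuracy)  # high-risk, did not eat
    true_negatives = low_risk * low_accuracy           # low-risk, did not eat
    return false_positives / (false_positives + true_negatives)

philadelphia = false_positive_rate(high_risk=700, low_risk=300)
new_york = false_positive_rate(high_risk=300, low_risk=700)

print(f"Philadelphia: {philadelphia:.1%}")  # 70 out of 340, about 20.6 percent
print(f"New York: {new_york:.1%}")          # 30 out of 660, about 4.5 percent
```

The test treats both cities identically, with the same 90 percent accuracy in each category. The only input that differs is how many people fall into the high-risk group, and that alone produces the gap in false positive rates.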
3. ProPublica presented its findings in a misleading way
Despite ProPublica’s commitment to open data, the way ProPublica presented its data is heavy on spin. Its main article, titled “Machine Bias,” goes easy on stats in favor of readability. The piece is heavy on anecdotes designed to portray COMPAS as very unreliable (when ProPublica knows this is untrue), and it features just a handful of charts. Two of the charts show the radically different distribution in risk ratings for whites and blacks.
These charts are presented in a vacuum, creating an impression of a major anti-black bias. Nowhere in the main article do ProPublica’s writers mention their own finding, which is that the risk scores accurately correspond to the reoffense risk of whites and blacks. In fact, the entire article never mentions that blacks have a higher recidivism rate at all.
In the article’s only other table, ProPublica presents the data regarding “false positives” and “false negatives” in a decidedly misleading way.
The chart’s wording (which emphasizes the risk rating, and then whether a person reoffended) strongly points the reader toward a specific interpretation, namely that 44.9 percent of higher-risk blacks don’t reoffend while just 23.5 percent of higher-risk whites do the same. Meanwhile, it suggests that 47.7 percent of lower-risk whites reoffended, while just 28 percent of blacks do the same.
But the chart shows nothing of the sort. Instead, the chart is simply displaying the higher “false positive” rate for blacks, and as described above, that higher false positive rate is not proof of bias. Despite ProPublica’s phrasing, there is absolutely no predictive failure shown on this chart; as their own numbers show, COMPAS is almost equally accurate in its predictions for white and black defendants. The statistical accuracy across races would only be clear, though, to people who bother taking a close look at ProPublica’s in-depth methodology (contained inside a separate post on their site).
The article uses unjustified or misleading rhetoric in other ways. For instance, it describes COMPAS as “remarkably unreliable” in estimating the risk of violent crime, because “only 20 percent of the people predicted to commit violent crimes actually went on to do so.” But COMPAS doesn’t “predict” whether somebody will commit a violent crime; it rates their risk of doing so relative to others. Violent crime recidivism rates are already low in general, so unsurprisingly even high-risk ratings represent a sub-50 percent chance of actually committing another violent offense.
4. The Aftermath
Co-author Jeff Larson dismissed TheDCNF’s questions, denied his team’s analysis was misleading, and responded by suggesting TheDCNF was simply racist.
“It seems to me that you’re arguing, because black defendants have a higher recidivism rate … [that] innocent black defendants, the ones that cleaned up their lives after being arrested, somehow deserved it, because other people who looked like them had a higher recidivism rate.”
At another point, Larson claimed that Northpointe’s system was unconstitutional on the grounds that individuals labeled high-risk who didn’t reoffend were losing their due process right to be treated as innocent until proven guilty. His allegation is bizarre; COMPAS does not declare anybody guilty, it forecasts a person’s relative chance of reoffending in order to guide decisions on bail, sentencing, parole, and more. Even without computer programs, the justice system weighs risks all the time. A person’s perceived flight risk is integral to granting bail; the risk of reoffending is central to the decision of any parole board. Computer programs didn’t create the risk assessment process; they’ve removed the human being from it. As previously noted, in the case of COMPAS, the program has even removed race as a factor.
Larson offered other, more substantive responses as well.
He pointed out that ProPublica’s reporting team had conducted a logistic regression on the data, which allowed them to control for factors such as age and prior offenses. After conducting these controls, he said, the team found blacks were about 45 percent more likely than whites to receive a high risk score.
Notably, Larson said ProPublica didn’t just correct for potential inputs (such as age), but also the output (whether a person did reoffend in the end).
Faye Taxman, a criminology professor at George Mason University, told TheDCNF that’s a huge no-no in statistics.
“You can’t control for the thing you’re predicting,” Taxman said. “It messes up the equations.”
Taxman also criticized what she considered to be ethical issues with ProPublica’s reporting. Although ProPublica obtained its data set through a Florida open-records request, Taxman said it was inappropriate for them to publish the entire data set with the name of every defendant intact.
Tim Brennan, who co-founded Northpointe and still works as its chief scientist, was considerably harsher in defending his work.
“If the ProPublica thing had to be peer reviewed, it would be thrown out,” he told TheDCNF. “Their analysis would be undoubtedly rejected.”
Regarding Brennan’s peer review criticism, Larson said “many, many experts” had in fact reviewed ProPublica’s work. He declined to identify who those experts were, though, saying all of them had asked not to be named.
Several other experts, though, openly voiced criticism of ProPublica’s work.
“This is a really confused and confusing analysis, in my view, that does not show what it is they are trying to show,” University of California professor and risk assessment expert Jennifer Skeem told The Washington Post, in an unfinished article that was accidentally posted on Wonkblog two weeks ago (it was taken down, but preserved on other websites). In a separate email to TheDCNF, Skeem reiterated her criticism, saying that after several additional weeks it was “clear now” that ProPublica’s assertions were “deeply flawed.”
Miguel Marino, a biostatistician at Oregon Health and Science University, echoed Brennan by telling the Post the piece “could have really benefited from peer review.”
Northpointe said it would be releasing a more in-depth criticism of ProPublica’s methods in the near future. Brennan expressed sadness over the accusations of bias brought against his company, saying he views COMPAS as a way to remove bias and help keep people out of prison.
“The whole system of risk assessment is aiming to do good things for the criminal justice system,” he said. “These systems are more accurate than judges, [and] the particular way these things are accurate is, they have a lower rate of false positives.”
“ProPublica … is trying to reform the system by attacking risk assessment, [but] they’re attacking a tool that can mitigate racism,” he added. “They’re shooting themselves in the foot.”
Content created by The Daily Caller News Foundation is available without charge to any eligible news publisher that can provide a large audience. For licensing opportunities of our original content, please contact firstname.lastname@example.org.