Dear Data Community: Your Data is Not Color Blind
by Maria T. Khan and Eva Pereira
Data is often thought of as an unbiased representation of facts, but the reality is that data and the algorithms built atop it are subject to bias and can amplify underlying disparities in our society. As a community of data scientists, we have a responsibility to be transparent about biases in the sources of data we use: how the data is collected, who it represents, and the deep-seated racism in our institutions that affects how we use it. Without these considerations, bias can unintentionally creep into the data collection and decision-making process.
Algorithms used for decision-making that are marketed as race-blind or race-neutral can sometimes amplify underlying disparities.
ProPublica illustrates this with a study of risk assessment tools used in courtrooms across the country. The tool, which predicts a defendant’s likelihood of committing future crimes, was twice as likely to falsely flag Black defendants as “high risk” for recidivism as it was white defendants. In addition, the algorithm was 77 percent more likely to label Black defendants as higher risk for committing a future violent crime, and 45 percent more likely to predict they would commit a future crime of any kind. These predictions had very real implications in determining sentencing and bail for Black defendants.
Another example of how bias gets introduced into algorithms is covered in this Wired article on Allegheny County’s use of a predictive model to determine which children are most at risk for child abuse. Researchers have pointed to blind spots in the algorithm: it is much more likely to assign high scores to children living in poverty because the model is only fed data from public programs. “Allegheny County has an extraordinary amount of information about the use of public programs. But the county has no access to data about people who do not use public services. Parents accessing private drug treatment, mental health counseling, or financial support are not represented in DHS data. Because variables describing their behavior have not been defined or included in the regression, crucial pieces of the child maltreatment puzzle are omitted from the AFST.” They also detected racial bias in the referral system: child abuse and neglect hotlines are called to report Black and biracial families three and a half times more often than white families.
In the field of healthcare, NBC News documented how algorithms are used to predict and rank which patients would benefit the most from additional care. Researchers identified issues with using prior medical costs in the model because of underlying racial disparities in access to healthcare. “Black patients spent $1,800 less in medical costs per year than white patients with the same chronic conditions, leading the algorithm to conclude incorrectly that the black patients must be healthier since they spend less on health care.” Recognizing the bias at play, the researchers were able to take steps to fix the model, leading to an 84 percent reduction in bias. With the adjustment, the percentage of Black patients served by the algorithm would increase from 17.5 percent to 46.5 percent.
The United States has a history of racist policies and practices that continue to disadvantage minority groups, and that history shows up as bias in the underlying data being fed into algorithms.
For example, the Fair Housing Act outlawed redlining in 1968, but, according to the Washington Post, “A recent study by Redfin found that the typical home in a redlined neighborhood gained $212,023 [in value] or 52 percent less than one in a ‘greenlined’ neighborhood over the past 40 years.” Unsurprisingly, Black families are five times more likely to own a home in a historically redlined neighborhood than in a greenlined area. Advancement Project CA also finds that in California, Black families are the most likely to have their mortgage applications denied.
Housing is just one systemic example of why any data we collect needs to be interpreted with caution, with attention to the racist history driving the numbers we see.
Denied mortgage applications
Lenders who assess risk solely by applicants’ socioeconomic status and credit scores disproportionately deny mortgages to people of color, because people of color are less likely than white people to have access to mainstream banking services and wealth-building assets.
California has a total of 474,844 denied mortgage applications, a denial rate of 24.2%.
As organizations increasingly lean on algorithm-based decisions, the impact scales to affect more people than ever before. Machine learning is present in more and more systems that impact human life, and artificial intelligence researchers worry that this paves the way for a new era of scientific racism, according to Data Feminism. When data analysis is driving human decisions and algorithms are producing decision-making outputs, data scientists have a responsibility to center racial equity in every research project.
Three ways to strive for equity
To avoid developing algorithms and conclusions that misrepresent people, equity needs to be a core value in the work all data scientists do. Three simple ways to strive for equity in any data project are:
1. Embed equity and ethics-focused questions throughout your workflow.
How is the data collected? Who is represented and who is not represented in the data? Find a list of questions from Dr. Timnit Gebru and her colleagues’ work on “Datasheets for Datasets.”
2. Explore different ways to disaggregate your data that give voice to smaller and underrepresented populations.
Is your sample distribution aligned with the population distribution? Check out this matrix developed by We All Count to help determine the most appropriate methodology for your work, and see the sketch after this list for one simple way to run this kind of check.
3. Whenever possible, highlight the rates for racial groups in your dataset.
Race needs to be a part of every data conversation we’re in. Start learning about the role race plays in California by visiting https://www.racecounts.org/, developed and updated by Advancement Project CA.
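To make points 2 and 3 concrete, here is a minimal sketch in Python of what disaggregation can look like in practice. The records, column names, and population shares below are made up purely for illustration; in real work you would use your project’s actual data and pull the population distribution from a source like the Census.

```python
import pandas as pd

# Hypothetical example: one row per mortgage application, with the
# applicant's self-reported race and whether the application was denied.
# (These records and the population shares below are illustrative only.)
applications = pd.DataFrame({
    "race":   ["Black", "White", "Latino", "Black", "White", "Asian", "White", "Black"],
    "denied": [1, 0, 0, 1, 0, 0, 1, 0],
})

# Point 3: report the rate for each racial group, not just the overall rate.
overall_rate = applications["denied"].mean()
denial_rates_by_group = applications.groupby("race")["denied"].mean()
print(f"Overall denial rate: {overall_rate:.1%}")
print(denial_rates_by_group)

# Point 2: compare the sample's racial distribution with the population's
# to see which groups are under- or overrepresented in your data.
sample_share = applications["race"].value_counts(normalize=True)
population_share = pd.Series(  # placeholder shares; use Census figures in practice
    {"White": 0.36, "Latino": 0.39, "Asian": 0.15, "Black": 0.06}
)
comparison = pd.DataFrame(
    {"sample": sample_share, "population": population_share}
).fillna(0)
comparison["gap"] = comparison["sample"] - comparison["population"]
print(comparison.sort_values("gap"))
```

Even a quick comparison like this can surface groups that are underrepresented in, or missing entirely from, your data, which is exactly the kind of blind spot the Allegheny County example illustrates.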
Last but not least, sign up for LA Tech4Good’s Data Equity + Ethics workshop series and connect with others engaging in equitable data practices. For more information about upcoming workshops, see this overview, and for more data ethics and equity materials, check out these resources. Maria and Eva completed the series as part of the first cohort in December 2020.
About the authors
Maria T. Khan
Maria T. Khan is a data storyteller and public policy advocate in Los Angeles, CA. She’s currently working at Advancement Project CA, a racial equity and justice advocacy nonprofit, as a research and data analyst. Maria holds an M.S. in Public Policy and Management from Carnegie Mellon University. Learn more about her work at mariatkhan.weebly.com
Eva Pereira
Eva Pereira is the Deputy Chief Data Officer for the Mayor’s Office at the City of Los Angeles. Her team leads the delivery of a wide range of data projects around topics like racial equity, census outreach, COVID response, and more. The team recently won the 2020 City Innovator award from Harvard University’s Technology & Entrepreneurship Center and was recognized as an Open Cities Index award winner for the City’s open data program.