We have set up several demonstrations to help you get hands-on practice identifying bias using current state-of-the-art detection and debiasing tools.

We use the parity-fairness Python package to illustrate biases in publicly available datasets from fields such as human resources, health, education, and policing.

  • Historical Bias in Hiring

    Let us consider a human resources dataset from a male-dominated industry. The data contains attributes representing the skills of candidates applying for a pilot position, and the model predicts whether a candidate will proceed to the next stage of interviewing. A hiring algorithm trained on this dataset tends to favor male candidates, predicting more frequently that they will advance to the next recruitment stage. By two of the five fairness metrics the Parity package evaluates, the model is biased against women candidates (the dataset contains no non-binary candidates at all).

    View data source
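
    A minimal sketch of the comparison behind those metrics is shown below. The dataframe and column names are hypothetical stand-ins, not the demo dataset itself, and the check is written by hand rather than through the Parity package's own API.

```python
import pandas as pd

# Hypothetical hiring data: 'gender' is the protected attribute and
# 'advance' is the model's prediction (1 = proceed to the next stage).
df = pd.DataFrame({
    "gender":  ["male", "male", "male", "male", "female", "female"],
    "advance": [1, 1, 1, 0, 1, 0],
})

# Selection rate per group: P(advance = 1 | gender).
rates = df.groupby("gender")["advance"].mean()

# Two common group-fairness measures built from those rates:
spd = rates["female"] - rates["male"]   # statistical parity difference (~0 is fair)
di = rates["female"] / rates["male"]    # disparate impact ratio (the "80% rule")

print(rates)
print(f"statistical parity difference: {spd:.2f}")
print(f"disparate impact ratio: {di:.2f}")
```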

  • Historical Bias in Recidivism Risk Assessment

    COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a commercial algorithm used by judges and parole officers to score criminal defendants' likelihood of reoffending (recidivism). Criminal defendants respond to a COMPAS questionnaire. Their answers are fed into the COMPAS tool to generate a Risk of Recidivism score.

    ProPublica, an independent non-profit newsroom that produces investigative journalism in the public interest, looked at more than 10,000 criminal defendants in Broward County, Florida, and compared their predicted recidivism rates with the rates that actually occurred over a two-year period. In ProPublica's data, Black defendants (3,175) outnumber white defendants (2,103). For this demo, we looked at how unfair biases can be detected in the same data ProPublica used.

    The results show that, with white defined as the privileged group for the race attribute, the model fails two of the Parity fairness metrics: the algorithm favors white defendants, who are less likely than non-white defendants to be labeled as likely to reoffend.

    View ProPublica's analysis
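
    The disparity ProPublica highlighted is easiest to see in error rates: defendants who did not reoffend but were still labeled high risk. A rough sketch of that check follows; it assumes a local copy of ProPublica's two-year recidivism file and its published column names (race, decile_score, two_year_recid), which may need adjusting.

```python
import pandas as pd

# Assumes a local copy of ProPublica's two-year recidivism data; the
# column names below follow their published analysis and may differ.
df = pd.read_csv("compas-scores-two-years.csv")

# Treat decile scores of 5 and above (Medium/High) as "high risk".
df["high_risk"] = (df["decile_score"] >= 5).astype(int)

# False positive rate per race: the share of defendants who did NOT
# reoffend within two years but were still labeled high risk.
no_recid = df[df["two_year_recid"] == 0]
fpr_by_race = no_recid.groupby("race")["high_risk"].mean()

print(fpr_by_race.sort_values(ascending=False))
```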

  • Representation Bias in Credit Worthiness

    Credit-issuing companies tend to favor older applicants on the assumption that they are more likely to be employed, have more years of work experience, and have a longer track record of paying on time. However, this fails to recognize that younger people may be just as capable of repaying their loans even though they have not lived long enough to build up a comparable record.

    With this in mind, we looked for lending bias in the credit scoring dataset from the UCI Machine Learning Repository, which consists of customer records from Taiwan. Most of the clients in the dataset are between 31 and 40 years old, and our analysis reveals representation bias due to this imbalance. The algorithm trained on this dataset failed four out of five Parity fairness metrics, which indicates an advantage for the 31-40 age group compared to other age groups.

    View data source
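
    Representation bias can be checked before any model is trained, simply by looking at how the age groups are distributed. A sketch is below; the file name and column names are assumptions based on the UCI "default of credit card clients" data and may need adjusting.

```python
import pandas as pd

# Assumes a local CSV export of the UCI "default of credit card clients"
# dataset with an AGE column; adjust names as needed.
df = pd.read_csv("default_of_credit_card_clients.csv")

# Bucket ages into the groups used in the demo's discussion.
bins = [18, 30, 40, 50, 60, 120]
labels = ["18-30", "31-40", "41-50", "51-60", "60+"]
df["age_group"] = pd.cut(df["AGE"], bins=bins, labels=labels)

# Representation: a single dominant age bucket is the imbalance that the
# downstream fairness metrics end up reflecting.
print(df["age_group"].value_counts(normalize=True).sort_index())

# With model predictions in hand (hypothetical 'approved' column), the
# same groupby gives per-group approval rates to compare:
# print(df.groupby("age_group")["approved"].mean())
```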

  • Aggregation Bias in Predicting Diabetes

    We tested for aggregation bias in diagnosing diabetic patients, whose characteristics differ markedly across ethnicities and genders. HbA1c levels are widely used in the diagnosis and monitoring of diabetes, but they vary in complicated ways across genders and ethnicities. To examine this, we use a dataset obtained from the UCI Machine Learning Repository.

    Let us assume white is the privileged group for the race attribute. Our classifier modeled on this data failed one of the five Parity fairness metrics. This indicates that the model tends to make more accurate predictions for white patients than for non-white patients, which may lead doctors who rely on the algorithm to misdiagnose non-white patients.

    View data source
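
    Aggregation bias is exactly what a single headline accuracy hides, so the basic check is to disaggregate performance by group. The sketch below uses a tiny hypothetical evaluation table rather than the UCI data itself.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical evaluation table: true labels, model predictions, and
# each patient's recorded race.
results = pd.DataFrame({
    "race":   ["Caucasian", "Caucasian", "Caucasian",
               "AfricanAmerican", "AfricanAmerican", "Asian"],
    "y_true": [1, 0, 0, 1, 0, 1],
    "y_pred": [1, 0, 0, 0, 1, 0],
})

# One model, one overall number...
overall = accuracy_score(results["y_true"], results["y_pred"])

# ...but disaggregating by race shows whether that single model is
# equally accurate for every group.
per_group = results.groupby("race").apply(
    lambda g: accuracy_score(g["y_true"], g["y_pred"])
)

print(f"overall accuracy: {overall:.2f}")
print(per_group)
```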

  • Evaluation Bias in Facial Recognition

    The Adience dataset is used as a standard benchmark for evaluating the performance of face recognition models. However, research by Buolamwini and Gebru found the Adience benchmark to be heavily skewed with respect to skin color and gender: only 4.4% of the faces in the dataset are of dark-skinned women. If a model underperforms on those samples, the effect on its overall evaluation metrics is negligible, so the benchmark fails to detect and penalize this kind of bias.

    In our own audit of a deep learning model trained on the Adience dataset, we found that the model misclassifies images of women's faces more often than men's.

    View data source
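
    The core of such an audit is to look at the benchmark's composition and at error rates per subgroup rather than a single overall score. The sketch below uses a hypothetical audit table, not the Adience data itself.

```python
import pandas as pd

# Hypothetical audit table: one row per benchmark image, its subgroup,
# and whether the model classified it correctly.
audit = pd.DataFrame({
    "gender":  ["female"] * 3 + ["male"] * 7,
    "correct": [0, 1, 0] + [1] * 7,
})

# Composition of the benchmark: if a subgroup is a small slice of the
# test set, its errors barely move the headline accuracy.
print(audit["gender"].value_counts(normalize=True))

# Per-subgroup error rates are what expose evaluation bias.
print(1 - audit.groupby("gender")["correct"].mean())
```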

  • Interpretation Bias in Student Performance

    One common mistake is to confuse correlation with causation. For example, if student performance is correlated with romantic status, a school administrator may wrongly conclude that girls with romantic partners are failing their classes because of those relationships. A policy based on this misinterpretation (e.g., discouraging girls from having romantic partners) may completely miss the root causes of poor performance (e.g., undiagnosed learning differences) and cause lasting, sexist harm.
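
    A small, entirely synthetic simulation (not the student data itself) makes the trap concrete: a hidden factor can drive both romantic status and failing grades, producing a correlation even though neither causes the other.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000

# Hidden confounder, e.g. an undiagnosed learning difference. In this
# toy model it raises the chance of failing AND the chance of being in
# a relationship, but romantic status itself never affects grades.
confounder = rng.binomial(1, 0.2, n)
romantic = rng.binomial(1, 0.2 + 0.4 * confounder)
failing = rng.binomial(1, 0.1 + 0.5 * confounder)

df = pd.DataFrame({"romantic": romantic, "failing": failing,
                   "confounder": confounder})

# Naive view: failing looks clearly correlated with romantic status...
print(df.groupby("romantic")["failing"].mean())

# ...but conditioning on the confounder makes the gap disappear, because
# the relationship was never the cause.
print(df.groupby(["confounder", "romantic"])["failing"].mean())
```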