
ԳԻՏԱԿԱՆ ԱՐՑԱԽ SCIENTIFIC ARTSAKH НАУЧНЫЙ АРЦАХ № 2(9), 2021

ՏՆՏԵՍԱԳԻՏՈՒԹՅՈՒՆ, ECONOMICS, ЭКОНОМИКА

DEFINING AND DETECTING FAIRNESS BIAS FOR BINARY CLASSIFICATION PROBLEM IN FINANCIAL ANALYSIS*

UDC 330.4 DOI: 10.52063/25792652-2021.2-183

GEVORG GHALACHYAN

Yerevan State University,

Faculty of Economics and Management,

Ph.D. Student,

Yerevan, Republic of Armenia
gevghalachvan@gmail.com

This article aims to present fairness bias in artificial intelligence models. First, it introduces use cases and legislative constraints of automated decision making with respect to sensitive features. Then, using academic datasets, the historical human bias, measures of dataset fairness, and an effective way of choosing the respective metric are presented. Last, different AI models are estimated to show how decision bias is replicated from data into models.

The design of the research is observational; academic datasets have been used. For the quantitative analysis both descriptive and inferential statistics are applied.

The analysis was done for the problem of binary classification, mainly focusing on decision making in finance. The phenomenon of unequal decisions towards unprivileged demographic groups was shown and quantified: in the examples given, the bias between groups averages 8-20%, and it persists even in the most accurate models (85% and 90% AUC score).

Key words: artificial intelligence, machine learning, binary classification, algorithmic fairness, disparate impact, equalized odds, representation error, fairness bias.

Introduction

There is a common belief that using an automated system makes decisions more objective and fairer. Yet AI (artificial intelligence) algorithms are not always as objective as we expect them to be, and the main reason for biased algorithms is that they generally learn from historical data, and thus learn the historical biases, too. In the following research, we show that "well-performing" machine learning algorithms replicate human bias as they imitate human behavior, and we apply the proposed methodology to two predictive modeling problems in finance, one for income prediction and the other for predicting an individual's access to a bank account. We also provide guidance on selecting a proper metric of fairness.

Though there have been different definitions of algorithmic fairness, the conference talk by Arvind Narayanan summarized many different opinions on the topic and, most importantly, applied a domain-specific approach, since definitions of human and algorithmic fairness in jurisprudence differ from fairness in text analysis (Narayanan 00:00:01 - 00:55:20). Such topics have become increasingly popular at conferences on fairness, accountability and transparency (Katell; Kaminski and Malgieri).

* The article was submitted on 19.05.2021, reviewed on 19.06.2021 and accepted for publication on 30.06.2021.


Speaking of transparency, it is necessary to mention some of the legislative regulations (General Data Protection Regulation; California Consumer Privacy Act) that mandate fair use of client data, violations of which have already cost some major companies enormous fines.

Our research mainly focuses on historical and algorithmic fairness for socio-economic applications. Since many automated decisions affect human lives (job applications, loan applications, medication, bail), there is an ethical demand, and sometimes a legal one, to create unbiased AI algorithms or to mitigate the bias in existing ones. Here are some use cases from different domains.

• An algorithm used by the United States criminal justice system falsely predicted future criminality among African-Americans at twice the rate it did for white people (Angwin et al.).

• Amazon discovered that their AI hiring system was discriminating against female candidates, particularly for software development and technical positions. One suspected reason for this is that most recorded historical data were for male software developers.

• Google’s ad-targeting algorithm showed higher-paying executive jobs to men more often than to women.

• A face detection system by Nikon was falsely classifying Asian faces as blinking.

In the theoretical part of the research we summarize the methodologies and techniques suggested by various authors. The practical part presents use cases and applications supporting the observational design of the study.

The research follows these steps. First, a theoretical overview of the problem and the corresponding mathematical formulations are presented, and previously suggested methods are grouped by similarity of use. Second, two application cases from financial management are shown: academic and open-source datasets are introduced and the respective exploratory data analysis, including fairness reporting, is carried out. Then we train the suggested models, choosing the best hyperparameters: grid search with 5-fold cross-validation is applied to 80% of the data, and the remaining 20% holdout is used for the final model evaluation with accuracy, AUROC and AUPR. Last, fairness metrics are computed and interpreted.
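As an illustration of this pipeline, the sketch below wires up the grid search with 5-fold cross-validation on an 80/20 split and reports accuracy, AUROC and AUPR on the holdout. The synthetic data and the hyperparameter grid are placeholders for illustration, not the exact configuration used in the study.

```python
# Minimal sketch of the training/evaluation pipeline described above.
# The synthetic data stands in for the encoded dataset features; the
# hyperparameter grid is illustrative, not the one used in the study.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, average_precision_score

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.75], random_state=0)

# 80% for grid search with 5-fold cross-validation, 20% holdout for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

param_grid = {"n_estimators": [100, 300], "max_depth": [2, 3, 5]}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

proba = search.predict_proba(X_test)[:, 1]  # scores for AUROC / AUPR
print("accuracy:", accuracy_score(y_test, search.predict(X_test)))
print("AUROC:   ", roc_auc_score(y_test, proba))
print("AUPR:    ", average_precision_score(y_test, proba))
```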

To implement the suggested methodology, a set of open-source toolkits supporting Python 3.9 was used. The scikit-learn library was used for feature engineering and model evaluation, AIF360 for fairness metric estimation on both datasets and models, and Matplotlib for visualization. A reproducible source code link for the analysis can be provided on request.
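For reference, a minimal sketch of how the dataset-level fairness metrics reported later can be obtained with AIF360; the toy DataFrame, the label column "income" and the protected attribute "sex" (1 = male, privileged) are illustrative assumptions rather than the actual study data.

```python
# Sketch: dataset-level fairness metrics with AIF360 (hypothetical toy data).
# "income" is the binary label, "sex" the protected attribute (1 = male, privileged).
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

toy = pd.DataFrame({"age":    [25, 41, 33, 52, 29, 46],
                    "sex":    [0, 1, 0, 1, 0, 1],
                    "income": [0, 1, 0, 1, 0, 0]})

dataset = BinaryLabelDataset(df=toy,
                             label_names=["income"],
                             protected_attribute_names=["sex"],
                             favorable_label=1,
                             unfavorable_label=0)

metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=[{"sex": 0}],
                                  privileged_groups=[{"sex": 1}])

print("disparate impact:             ", metric.disparate_impact())
print("statistical parity difference:", metric.statistical_parity_difference())
```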

Defining Fairness

For the further discussion we consider the following problem (Dwork et al. 217): we have historical data on credit loan applications, and we need to build a model that predicts whether a new applicant will be granted a loan or not. In the historical dataset we have a set of features X, both discrete and continuous, which will be used for training and prediction, e.g. age, educational level, gender, race, monthly income, and a prediction label, a binary feature used for supervision during training and as the outcome variable for prediction, e.g. whether the application was successful. A label whose value corresponds to an outcome that provides an advantage to the recipient (such as receiving a loan) is called the favorable label and is denoted Y = 1. An attribute from X that partitions the dataset into groups whose outcomes may hypothetically have parity is called a protected attribute, S, e.g. the gender of the applicant, and a protected attribute value indicating a group that has historically been at a systemic advantage is called the privileged value (group) of the protected attribute, S = 1 (here we assume males as the privileged group).
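To make this notation concrete, here is a minimal toy table for the loan scenario (the column names and values are hypothetical): the favorable label Y = 1 means the loan was granted, the protected attribute S is gender, and S = 1 (male) is taken as the privileged value.

```python
import pandas as pd

# Toy loan-application data illustrating the notation:
#   Y = "approved" (favorable label, Y = 1 means the loan was granted)
#   S = "gender"   (protected attribute; 1 = male, the privileged group)
df = pd.DataFrame({
    "age":      [25, 41, 33, 52, 29, 46],
    "income":   [1200, 3400, 2100, 2800, 1500, 3100],
    "gender":   [0, 1, 0, 1, 0, 1],
    "approved": [0, 1, 0, 1, 1, 1],
})
```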


Individual vs Group

Depending on the purpose of the analysis, a proper metric of fairness bias must be chosen. When the fairness of a decision concerns a specific subject, here the applicant, an individual fairness metric must be chosen. Individual fairness requires similar subjects to be treated similarly, that is, to have approximately equal conditional probabilities of being classified with the same label. We write this as follows:

$$\left| P\big(\hat{y}(i) = y \mid X(i)\big) - P\big(\hat{y}(j) = y \mid X(j)\big) \right| < \varepsilon, \quad \text{if } d(i, j) \approx 0$$

where d(i, j) is the distance between the two observations. The selection of a distance measure is problem-specific: in the general case, with few features, we choose the Euclidean distance, which assumes independence of the features; the Hamming distance is the alternative once we have many categorical features. Individual fairness is useful for case reports, but it can also be summarized as a descriptive statistic over a specific group of subjects. Note that the protected attribute is simply treated as any other feature.
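A rough sketch of an individual-fairness audit under this definition: for pairs of subjects closer than a distance threshold d_max, the predicted probabilities should differ by less than ε. The fitted model, the feature matrix X and both thresholds are illustrative assumptions.

```python
# Individual fairness audit sketch: similar subjects (small pairwise
# Euclidean distance, protected attribute treated like any other feature)
# should receive nearly equal predicted probabilities.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def individual_fairness_violations(model, X, d_max=0.5, eps=0.1):
    proba = model.predict_proba(X)[:, 1]
    dist = squareform(pdist(X, metric="euclidean"))
    violations = []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if dist[i, j] <= d_max and abs(proba[i] - proba[j]) > eps:
                violations.append((i, j))
    return violations
```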

In most cases, group fairness metrics are computed either as descriptive statistics or as model performance measures. Here we subset the dataset by the possible values of the protected attribute. Since the methodology depends strongly on the objectives of the research, we classify such metrics into the following groups.

Group Fairness: Data vs Algorithm

Every time algorithmic fairness is mentioned, we intuitively think of an algorithm being applied fairly, that is, of the production stage of an application. However, we can also measure the historical human bias. For example, we may need to know how fairly our loan office has acted and whether there has been a systematic bias towards a specific group of applicants. We can find this by computing the difference between the conditional priors, $P(Y = 1 \mid S \neq 1) - P(Y = 1 \mid S = 1)$. Yet not all of the metrics can be computed as dataset descriptions, thus hereafter we present the metrics for model fairness, additionally denoting $\hat{Y}$ as the outcome vector of the model.
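A minimal pandas sketch of this dataset-level check, reusing the toy loan table from the earlier sketch: it computes the conditional priors P(Y = 1 | S) per group, then their difference and ratio.

```python
# Historical (dataset-level) bias: conditional prior of the favorable
# label per protected group, plus difference and ratio between groups.
base_rates = df.groupby("gender")["approved"].mean()   # P(Y = 1 | S)
unpriv, priv = base_rates[0], base_rates[1]

print("statistical parity difference:", unpriv - priv)  # P(Y=1|S!=1) - P(Y=1|S=1)
print("disparate impact:             ", unpriv / priv)
```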

Group Representations vs Group Errors

Different fairness analyses have different requirements. Some analyses address the overall fairness of the model or the dataset: for such purposes we use representation metrics. Disparate impact (Barocas et al. 671) is the ratio of positive outcome probabilities between the unprivileged and privileged groups:

$$\frac{P(\hat{Y} = 1 \mid S \neq 1)}{P(\hat{Y} = 1 \mid S = 1)} \geq 1 - \varepsilon$$

This ratio shows the proportional fairness gap between the groups; empirically, ε should be at most 0.2, that is, the disparate impact metric should be at least 0.8. However, a typical case is a dataset imbalanced by the classification label, where the proportional metric can be misleading; a metric of absolute difference must then be applied: demographic (statistical) parity.

$$\left| P(\hat{Y} = 1 \mid S \neq 1) - P(\hat{Y} = 1 \mid S = 1) \right| < \varepsilon$$

Whenever using group representations, we assume the "we are all equal" rule (Moritz et al. 4). A disadvantage of this approach is that it aggregates over the classification groups, which


leads us to the "what you see is what you get" approach. Group error metrics target a specific classification group: for the loan application problem, we are more concerned about approving loans to subjects who will default than about rejecting those who would not, that is, we concentrate on false positives rather than false negatives. From the perspective of algorithmic fairness, we need to ensure that the false positive rates are similar between groups. The confusion matrix, both total and group-wise, is used to compute group errors.

                    Predicted positive      Predicted negative
Actual positive     True Positive (TP)      False Negative (FN)
Actual negative     False Positive (FP)     True Negative (TN)

Table 1: Confusion matrix of a binary classification model

For the current problem, as a loan officer we are interested in the False Discovery Rate and the False Positive Rate (fall-out) and their parity, since our decisions are punitive (Kleinberg et al.).

$$FDR = \frac{FP}{FP + TP}, \qquad FPR = \frac{FP}{FP + TN}$$

$$FDR\,Parity = FDR_{unpriv} - FDR_{priv}, \qquad FPR\,Parity = FPR_{unpriv} - FPR_{priv}$$

For assistive cases, such as judicial decisions and recidivism prediction, where the aim is to assist society, we select different metrics of performance and fairness, such as the False Omission Rate and the False Negative Rate.

$$FOR = \frac{FN}{FN + TN}, \qquad FNR = \frac{FN}{FN + TP}$$

$$FOR\,Parity = FOR_{unpriv} - FOR_{priv}, \qquad FNR\,Parity = FNR_{unpriv} - FNR_{priv}$$
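A sketch of how these group-error parities can be computed directly from group-wise confusion matrices; the label vectors y_true, y_pred and the protected-attribute vector s are assumed inputs for illustration.

```python
# Group-error parity sketch: FDR, FPR, FOR, FNR per protected group and
# their unprivileged-minus-privileged differences.
import numpy as np
from sklearn.metrics import confusion_matrix

def group_error_rates(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "FDR": fp / (fp + tp) if fp + tp else np.nan,
        "FPR": fp / (fp + tn) if fp + tn else np.nan,
        "FOR": fn / (fn + tn) if fn + tn else np.nan,
        "FNR": fn / (fn + tp) if fn + tp else np.nan,
    }

def error_parities(y_true, y_pred, s, privileged_value=1):
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))
    priv = group_error_rates(y_true[s == privileged_value], y_pred[s == privileged_value])
    unpriv = group_error_rates(y_true[s != privileged_value], y_pred[s != privileged_value])
    return {k: unpriv[k] - priv[k] for k in priv}   # e.g. FPR parity
```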

These parity metrics, known as equalized odds, along with many other metrics, target specific groups of interest; therefore any problem should be considered in more detail and discussed with a domain expert. As a summary of the methodology, we present the following decision tree for reference.

Figure 1: Decision rule for fairness metric selection


Datasets and Baseline Models

Adult Dataset (adult)

The first dataset we used for the analysis is the "Adult Data Set" from the UCI Machine Learning Repository. It consists of records of individuals and their yearly incomes; other features include age, race, education, gender, etc., and the task is to predict whether the individual has an income higher than 50,000 US dollars. For the fairness analysis we take gender as a protected attribute with male as the privileged group, in addition to race with white as the privileged group. For the baseline classification we selected 4 different models, 1 linear and 3 tree-based: Logistic Regression (LR), Decision Tree (DT), Random Forest (RF) and Gradient Boosting (GB), with the training pipeline presented in the source code reference. GB clearly outperforms the other models on both ROC and PR scores.

Figure 2: ROC and PR curves for baseline classification models (adult dataset, train and test splits)

Financial Inclusion in Africa (finincl)

The second dataset, Financial Inclusion in Africa, is a survey of respondents from 4 different African countries; the prediction task is to classify subjects by whether or not they have a bank account. Again, we take gender as the protected attribute. Ensemble methods show better performance for this dataset, too.

Figure 3: ROC and PR curves for baseline classification models (finincl dataset, train and test splits)

Results

As mentioned in the methodology, we first implement a full exploratory analysis of the dataset. The adult dataset has a significant class imbalance towards the unfavorable label, 75% of all records, while for the other dataset this phenomenon is milder, 59%. One method we


suggest for visualizing the relation between two categorical features (here the outcome variable and the protected attribute) is the mosaic plot. The tile areas of a mosaic plot represent each subgroup's percentage of the total records.

Figure 4: Mosaic plots for adult-gender, adult-race, finincl-gender dataset-attribute pairs
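As an illustration, such a mosaic plot can be produced with statsmodels roughly as follows, reusing the toy loan table from the earlier sketch (the actual figures use the adult and finincl data):

```python
# Mosaic plot of protected attribute vs outcome label; tile areas are
# proportional to each subgroup's share of all records.
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

mosaic(df, ["gender", "approved"])   # df: toy loan table defined earlier
plt.title("Outcome label by protected attribute")
plt.show()
```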

To support the visual representation of the historical fairness bias, we evaluate the previously introduced dataset metrics. The Disparate Impact (DI) values for all 3 cases are below 0.8, which is empirically considered a reasonable threshold. Note that for the finincl dataset, despite the low DI, the Statistical Parity (SP) difference equals only -0.08; taken together, these signal that we are working with small target subgroups of the favorable label. After examining the historical bias, we can proceed with the model evaluation.

metric                           adult_gender   adult_race   finincl_gender
disparate_impact                     0.36           0.60          0.56
statistical_parity_difference       -0.20          -0.10         -0.08

Table 2: Dataset fairness metrics

The accuracy metric supports the AUROC and AUPR metrics in choosing the GB model over the other three. The group representation fairness metrics of the models are not better than the dataset metrics, that is, we have shown that the historical bias is replicated and reproduced.

As for group errors, we see that for some model-target pairs there is even a reverse fairness bias (highlighted in red in Table 3). For the finincl dataset, the gradient boosting model does not reproduce the fairness bias for the positives. Further work may address the statistical significance of these metrics and the analysis of confidence intervals.
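The model-level group errors in Table 3 below are computed with AIF360; a rough sketch of that computation is shown here, where test_dataset (a BinaryLabelDataset for the holdout) and the fitted model are assumptions of the sketch rather than objects defined above.

```python
# Sketch: group-wise error rates of a fitted model with AIF360.
from aif360.metrics import ClassificationMetric

pred_dataset = test_dataset.copy(deepcopy=True)
pred_dataset.labels = model.predict(test_dataset.features).reshape(-1, 1)

clf_metric = ClassificationMetric(test_dataset, pred_dataset,
                                  unprivileged_groups=[{"sex": 0}],
                                  privileged_groups=[{"sex": 1}])

fpr_unpriv = clf_metric.false_positive_rate(privileged=False)
fpr_priv = clf_metric.false_positive_rate(privileged=True)
print("false positive rate difference:", fpr_unpriv - fpr_priv)
print("false positive rate ratio:     ", fpr_unpriv / fpr_priv)
```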


adult_gender                        LR     DT     RF     GB
accuracy                           0.82   0.76   0.77   0.87
disparate_impact                   0.30   0.26   0.24   0.31
statistical_parity_difference     -0.28  -0.41  -0.42  -0.17
false_discovery_rate_difference    0.01  -0.08  -0.10  -0.01
false_discovery_rate_ratio         1.02   0.84   0.79   0.98
false_positive_rate_difference    -0.17  -0.34  -0.33  -0.06
false_positive_rate_ratio          0.24   0.17   0.15   0.23
false_omission_rate_difference    -0.05  -0.03  -0.02  -0.08
false_omission_rate_ratio          0.47   0.57   0.65   0.40
false_negative_rate_difference     0.17   0.18   0.21   0.11
false_negative_rate_ratio          1.86   2.94   3.36   1.33

adult_race                          LR     DT     RF     GB
accuracy                           0.82   0.76   0.77   0.87
disparate_impact                   0.65   0.69   0.64   0.59
statistical_parity_difference     -0.11  -0.14  -0.16  -0.08
false_discovery_rate_difference    0.09   0.10   0.08   0.09
false_discovery_rate_ratio         1.24   1.20   1.16   1.44
false_positive_rate_difference    -0.05  -0.08  -0.10  -0.01
false_positive_rate_ratio          0.71   0.72   0.65   0.73
false_omission_rate_difference    -0.04  -0.03  -0.03  -0.05
false_omission_rate_ratio          0.49   0.47   0.49   0.56
false_negative_rate_difference     0.01   0.01   0.01   0.04
false_negative_rate_ratio          1.03   1.05   1.13   1.11

finincl_gender                      LR     DT     RF     GB
accuracy                           0.81   0.79   0.81   0.89
disparate_impact                   0.47   0.55   0.51   0.52
statistical_parity_difference     -0.19  -0.17  -0.17   0.04
false_discovery_rate_difference    0.01   0.01   0.01   0.11
false_discovery_rate_ratio         1.01   1.02   1.02   1.46
false_positive_rate_difference    -0.15  -0.15  -0.14   0.01
false_positive_rate_ratio          0.44   0.51   0.48   0.70
false_omission_rate_difference    -0.01  -0.03  -0.01   0.04
false_omission_rate_ratio          0.82   0.60   0.74   0.65
false_negative_rate_difference     0.17   0.07   0.13   0.10
false_negative_rate_ratio          1.82   1.31   1.61   1.15

Table 3: Model fairness metrics (color coding in the original: yellow - best performing model; white - no fairness bias detected; blue - fairness bias against the privileged group detected; red - fairness bias in favor of the privileged group detected)

Conclusions

In contrast to the idealistic view that automated systems are reaching peak performance and surpassing humans on all frontiers, our research shows a shortcoming of the "learning from data" method: most specifically, it reproduces human bias. Different automation problems, however, require specific approaches to measuring algorithmic fairness: here we present a decision tree for choosing proper fairness metrics, along with their mathematical formulations and references.

The application of the suggested methodology has shown that the historical systematic bias in favor of the privileged group is reflected in the baseline models. For all the datasets, the disparate impact and statistical parity metrics are roughly the same for the datasets and the models, that is, subjects will be treated in the same biased way as a result of the automation. For the adult dataset we have no predefined automation problem, which means group representation metrics are preferred; in contrast, for the financial inclusion dataset, where we planned to minimize false positives, group error metrics are important. The false positive rates of the male and female groups differ by a factor of about two, which is considered a serious algorithmic bias.

Next, we intend to explore different bias mitigation algorithms, apply them to problems of the same scope and report a comparison of the metrics. Such techniques are currently popular in automated systems for judicial decisions, and we plan to apply them to financial modeling. Further steps also include building an efficient accuracy-fairness frontier for a specific class of models and an optimization path from the starting model.

Works Cited

1. “California Consumer Privacy Act”, State of California Department of Justice Office of the Attorney General, 2018, https://oag.ca.gov/privacy/ccpa#sectionf

2. “General Data Protection Regulation”, European Parliament and Council of the European Union, 2018, https://gdpr-info.eu/art-5-gdpr/

3. Angwin, Julia, et al. “Machine Bias at ProPublica” 2016, https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

4. Barocas, Solon, et al. "Big Data's Disparate Impact." Calif. L. Rev. 104, 2016.

5. Dwork, Cynthia, et al. “Fairness through Awareness. ” Proceedings of the 3rd innovations in theoretical computer science conference. ACM, 2012, 214–226

6. Kaminski, Margot, and Malgieri, Gianclaudio. "Multi-Layered Explanations from Algorithmic Impact Assessments in the GDPR." Proceedings of the 2020 Conference on


Fairness, Accountability, and Transparency. 2020.

https://dl.acm.org/doi/10.1145/3351095.3372875

7. Katell, Michael, et al. "Toward Situated Interventions for Algorithmic Equity: Lessons from the Field." Proceedings of the 2020 conference on fairness, accountability, and transparency. 2020. https://dl.acm.org/doi/10.1145/3351095.3372874.

8. Kleinberg, Jon, et al. “Inherent Trade-Offs in the Fair Determination of Risk Scores". In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz Zentrum fuer Informatik.

9. Moritz, Hardt, et al. “Equality of Opportunity In Supervised Learning". Advances in neural information processing systems. 2016

10. Narayanan, Arvind. "21 Fairness Definitions and their Politics." Youtube, uploaded by Arvind Narayanan, 1 March 2018, https://www.youtube.com/watch?v=jIXIuYdnyyk

