Original dataset: https://www.consumerfinance.gov/data-research/hmda/historic-data/?geo=ny&records=all-records&field_descriptions=labels
Preprocessing applied so far to the original dataset:¶
- Selected "no co-applicant" records from the column "co_applicant_ethnicity_name"
- Filtered using the columns:
- action_taken
- denial_reason_name_1
- Created the column "action_taken":
- df['action_taken'] = df['action_taken_name'].replace({
})
- First feature selection:
- df = df[
['loan_type_name',
'property_type_name',
'loan_purpose_name',
'loan_amount_000s',
'action_taken',
'msamd_name',
'applicant_ethnicity_name',
'applicant_race_name_1',
'applicant_sex_name',
'applicant_income_000s',
'denial_reason_name_1',
'denial_reason_name_2',
'denial_reason_name_3',
'rate_spread',
'lien_status_name',
'minority_population',
'hud_median_family_income',
'tract_to_msamd_income']
]
- Excluded "Credit application incomplete" records from the column "denial_reason_name_1"
- New column "ethnicity_race_sex"
- Miscellaneous preprocessing:
- Dropped 'msamd_name'
- Dropped the 20 records where property_type_name is 'One-to-four family dwelling (other than manufactured housing)' and tract_to_msamd_income is null
- Created mirror (missing-value indicator) columns for 'tract_to_msamd_income', 'minority_population', 'hud_median_family_income', and
- Filled those nulls with the median of each group in the column "ethnicity_race_sex" (sketched just after this list)
- New column "loan_to_income_ratio"
- Filled missing values with
- "0" for rate_spread
- "unknown" for denial_reason_name_1, denial_reason_name_2, denial_reason_name_3
- Addressed outliers in applicant_income_000s and loan_amount_000s
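The "mirror column + group-median fill" step above can be sketched as follows. This is a minimal illustration of the idea, not the original preprocessing code; it assumes a raw HMDA frame named raw (hypothetical name) that still contains the columns listed above.

for col in ['tract_to_msamd_income', 'minority_population', 'hud_median_family_income']:
    # Mirror (indicator) column flagging rows that were originally missing
    raw[col + '_missing'] = raw[col].isnull().astype(int)
    # Fill nulls with the median of the row's ethnicity_race_sex group
    raw[col] = raw[col].fillna(raw.groupby('ethnicity_race_sex')[col].transform('median'))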
In [1]:
!pip install shap
Successfully installed shap-0.46.0 slicer-0.0.8
In [2]:
!pip install aif360
Successfully installed aif360-0.6.1
In [3]:
pip install 'aif360[Reductions]'
Successfully installed fairlearn-0.10.0
In [4]:
pip install 'aif360[inFairness]'
Successfully installed POT-0.9.4 inFairness-0.2.3 skorch-1.0.0
In [5]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import lightgbm as lgbm
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report, roc_auc_score, log_loss
from scipy.stats import skew
from imblearn.over_sampling import SMOTE
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.datasets import BinaryLabelDataset
from aif360.algorithms.preprocessing import Reweighing
import joblib
/usr/local/lib/python3.10/dist-packages/dask/dataframe/__init__.py:42: FutureWarning: Dask dataframe query planning is disabled because dask-expr is not installed. You can install it with `pip install dask[dataframe]` or `conda install dask`. This will raise in a future version. warnings.warn(msg, FutureWarning) /usr/local/lib/python3.10/dist-packages/inFairness/utils/ndcg.py:37: FutureWarning: We've integrated functorch into PyTorch. As the final step of the integration, `functorch.vmap` is deprecated as of PyTorch 2.0 and will be deleted in a future version of PyTorch >= 2.3. Please use `torch.vmap` instead; see the PyTorch 2.0 release notes and/or the `torch.func` migration guide for more details https://pytorch.org/docs/main/func.migrating.html vect_normalized_discounted_cumulative_gain = vmap( /usr/local/lib/python3.10/dist-packages/inFairness/utils/ndcg.py:48: FutureWarning: We've integrated functorch into PyTorch. As the final step of the integration, `functorch.vmap` is deprecated as of PyTorch 2.0 and will be deleted in a future version of PyTorch >= 2.3. Please use `torch.vmap` instead; see the PyTorch 2.0 release notes and/or the `torch.func` migration guide for more details https://pytorch.org/docs/main/func.migrating.html monte_carlo_vect_ndcg = vmap(vect_normalized_discounted_cumulative_gain, in_dims=(0,))
In [6]:
# Load the Drive helper and mount
from google.colab import drive
# This will prompt for authorization.
drive.mount('/content/drive')
%ls "/content/drive/My Drive/Colab Notebooks/hmdaNY_02092024_1603_Ready_2.csv"
Mounted at /content/drive '/content/drive/My Drive/Colab Notebooks/hmdaNY_02092024_1603_Ready_2.csv'
In [7]:
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/hmdaNY_02092024_1603_Ready_2.csv')
print(df.shape,'\n')
print(df.dtypes)
(146718, 15)

action_taken                          object
applicant_income_000s                float64
ethnicity_race_sex                    object
hud_median_family_income             float64
hud_median_family_income_missing       int64
lien_status_name                      object
loan_amount_000s                       int64
loan_purpose_name                     object
loan_type_name                        object
loan_to_income_ratio                 float64
minority_population                  float64
minority_population_missing            int64
property_type_name                    object
tract_to_msamd_income                float64
tract_to_msamd_income_missing          int64
dtype: object
A) Creation of Binary target¶
In [8]:
# Creation of a binary column for approvals / denials (1 = denied, 0 = approved)
df['action_taken_binary'] = df['action_taken'].map({'denied': 1,'approved': 0})
# Drop the original 'action_taken' column
df = df.drop(columns=['action_taken'])
print(df['action_taken_binary'].value_counts())
action_taken_binary
0    116408
1     30310
Name: count, dtype: int64
B) Splitting data¶
- We use a 70/30 train/test split.
In [9]:
# Split the data
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)
train_df.to_csv('/content/drive/My Drive/Colab Notebooks/hmdaNY_03092024_1603_train_df.csv', index=False)
test_df.to_csv('/content/drive/My Drive/Colab Notebooks/hmdaNY_03092024_1603_test_df.csv', index=False)
In [10]:
print(train_df.shape)
print(test_df.shape)
(102702, 15)
(44016, 15)
C) Feature Engineering¶
In [11]:
# Log transformation
# Scaling
# 'loan_to_income_ratio'
plt.hist(train_df['loan_to_income_ratio'], bins=100, edgecolor='black')
plt.title('Distribution of loan_to_income_ratio')
plt.xlabel('loan_to_income_ratio')
plt.ylabel('Frequency')
plt.show()
# Calculating the skewness of loan_to_income_ratio
original_skewness = skew(train_df['loan_to_income_ratio'])
print(f"Skewness: {original_skewness}")
Skewness: 4.0683291922765275
In [12]:
# Log transformation: yes
# Scaling: yes
# applicant_income_000s
plt.hist(train_df['applicant_income_000s'], bins=30, edgecolor='black')
plt.title('Distribution of Applicant Income (in 000s)')
plt.xlabel('Applicant Income (000s)')
plt.ylabel('Frequency')
plt.show()
# Calculating the skewness of applicant_income_000s
original_skewness = skew(train_df['applicant_income_000s'])
print(f"Skewness: {original_skewness}")
Skewness: 3.39818358160144
In [13]:
# Log transformation: No
# Scaling: yes
# hud_median_family_income
plt.hist(train_df['hud_median_family_income'], bins=30, edgecolor='black')
plt.title('Distribution of hud_median_family_income')
plt.xlabel('hud_median_family_income')
plt.ylabel('Frequency')
plt.show()
# Calculating the skewness of hud_median_family_income
original_skewness = skew(train_df['hud_median_family_income'])
print(f"Skewness: {original_skewness}")
Skewness: 1.2479818056447094
In [14]:
# Log transformation: yes
# scaling: yes
# loan_amount_000s
plt.hist(train_df['loan_amount_000s'], bins=30, edgecolor='black')
plt.title('Distribution of loan_amount_000s')
plt.xlabel('loan_amount_000s')
plt.ylabel('Frequency')
plt.show()
# Calculating the skewness of loan_amount_000s
original_skewness = skew(train_df['loan_amount_000s'])
print(f"Skewness: {original_skewness}")
Skewness: 1.824575061311811
In [15]:
# ONLY scaling.
# minority_population
plt.hist(train_df['minority_population'], bins=30, edgecolor='black')
plt.title('Distribution of minority_population')
plt.xlabel('minority_population')
plt.ylabel('Frequency')
plt.show()
# Calculating the skewness of minority_population
original_skewness = skew(train_df['minority_population'])
print(f"Skewness: {original_skewness}")
Skewness: 1.1907000593514665
In [16]:
# Log transformation: optional
# Scaling: yes
# tract_to_msamd_income
plt.hist(train_df['tract_to_msamd_income'], bins=30, edgecolor='black')
plt.title('Distribution of tract_to_msamd_income')
plt.xlabel('tract_to_msamd_income')
plt.ylabel('Frequency')
plt.show()
# Calculating the skewness of tract_to_msamd_income
original_skewness = skew(train_df['tract_to_msamd_income'])
print(f"Skewness: {original_skewness}")
Skewness: 1.9468038744343557
C.2) Log Transformation¶
'loan_to_income_ratio',
'tract_to_msamd_income',
'loan_amount_000s',
'applicant_income_000s'
(a short note on log1p follows below)
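These columns are transformed with np.log1p in the next cell. np.log1p maps x to ln(1 + x), which compresses the long right tails seen in the histograms above while staying defined at x = 0; a quick illustration (not part of the original notebook):

import numpy as np
print(np.log1p([0, 9, 99]))   # [0.         2.30258509 4.60517019]  i.e. ln(1), ln(10), ln(100)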
In [17]:
columns_to_log = ['loan_to_income_ratio', 'tract_to_msamd_income', 'loan_amount_000s', 'applicant_income_000s']
train_df[columns_to_log] = train_df[columns_to_log].apply(np.log1p)
test_df[columns_to_log] = test_df[columns_to_log].apply(np.log1p)
C.3) Scaling¶
Standardization (Z-score normalization)¶
- We want to avoid any leakage from the test data into the preprocessing, hence we fit the scaler and the one-hot encoder on train_df only and apply them to both train_df and test_df (a toy illustration follows this list)
- We first fit the preprocessing steps on the training data only.
- Then we apply the preprocessing to both sets: we use the parameters learned from the training data to transform both the training and test sets.
This approach ensures that:
- The test set remains truly unseen during the training process.
- Both training and test data are in the same format for model training and evaluation.
- No information from the test set leaks into the preprocessing steps.
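Concretely, standardization maps each value to z = (x - mean_train) / std_train, with both statistics estimated from the training rows only; the fitted scaler then applies exactly these training statistics to the test rows. A toy illustration with made-up numbers (not part of the original notebook):

import numpy as np
from sklearn.preprocessing import StandardScaler
toy = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))   # "training" data: mean 2.0, std ~0.816
print(toy.transform(np.array([[4.0]])))                        # "test" value scaled with the training stats -> ~2.449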
In [18]:
# Features to Scale
columns_to_scaling = ['loan_to_income_ratio',
'applicant_income_000s',
'hud_median_family_income',
'loan_amount_000s',
'minority_population',
'tract_to_msamd_income'
]
# Fitting preprocessing on training data
scaler = StandardScaler()
scaler.fit(train_df[columns_to_scaling])
## Applying preprocessing to both sets
train_df[columns_to_scaling] = scaler.transform(train_df[columns_to_scaling])
test_df[columns_to_scaling] = scaler.transform(test_df[columns_to_scaling])
In [19]:
# Check the mean and standard deviation
print("Means of scaled features:")
print(train_df[columns_to_scaling].mean())
print("\nStandard deviations of scaled features:")
print(train_df[columns_to_scaling].std())
Means of scaled features:
loan_to_income_ratio         2.184859e-16
applicant_income_000s       -2.292788e-16
hud_median_family_income    -2.480970e-16
loan_amount_000s            -1.053202e-15
minority_population         -7.131233e-17
tract_to_msamd_income       -2.276737e-15
dtype: float64

Standard deviations of scaled features:
loan_to_income_ratio        1.000005
applicant_income_000s       1.000005
hud_median_family_income    1.000005
loan_amount_000s            1.000005
minority_population         1.000005
tract_to_msamd_income       1.000005
dtype: float64
In [20]:
# Checking the range
print("\nMin values:")
print(train_df[columns_to_scaling].min())
print("\nMax values:")
print(train_df[columns_to_scaling].max())
Min values:
loan_to_income_ratio        -2.212567
applicant_income_000s       -2.513814
hud_median_family_income    -1.002217
loan_amount_000s            -3.041207
minority_population         -1.023697
tract_to_msamd_income      -11.541474
dtype: float64

Max values:
loan_to_income_ratio        5.389856
applicant_income_000s       3.714564
hud_median_family_income    2.033801
loan_amount_000s            2.021953
minority_population         2.410389
tract_to_msamd_income       3.007542
dtype: float64
C.4) One-hot encoding¶
We follow the same approach as above:
- We fit the encoder on the training data only.
- We then apply it to both sets, using the categories learned from the training data.
This ensures that the test set remains truly unseen, that both sets end up with the same columns, and that no information from the test set leaks into the preprocessing steps.
OneHotEncoder vs get_dummies:¶
OneHotEncoder and get_dummies are both used for one-hot encoding, but they have some differences:
- OneHotEncoder (from scikit-learn):
- Can be fit on training data and applied to test data separately
- Handles unseen categories in test data (with 'handle_unknown' parameter)
- Part of scikit-learn's preprocessing module, integrates well with pipelines
- Can handle non-string categorical data easily
- get_dummies (from pandas):
- Simpler to use for quick encoding of pandas DataFrames
- Applies encoding immediately to the entire dataset
- Doesn't handle unseen categories in new data by default
- Works directly on pandas DataFrames and returns a DataFrame
- The main difference (and the reason we use OneHotEncoder here):
- OneHotEncoder is better at preventing data leakage and at handling unseen categories in test data (see the short sketch below)
- get_dummies is more convenient for quick, one-time encoding of all the data
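A minimal sketch of that difference (illustrative only; the toy column and categories below are made up):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
train_toy = pd.DataFrame({'loan_type': ['Conventional', 'FHA-insured']})
test_toy = pd.DataFrame({'loan_type': ['VA-guaranteed']})   # category never seen in training
enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore').fit(train_toy)
print(enc.transform(test_toy))    # [[0. 0.]] -> unknown category encoded as all zeros, columns unchanged
print(pd.get_dummies(test_toy))   # creates a 'loan_type_VA-guaranteed' column the training frame never had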
In [21]:
# Features to encode
categorical_columns = ['loan_type_name',
'loan_purpose_name',
'property_type_name',
'lien_status_name',
'ethnicity_race_sex']
# Fitting preprocessing on training data only
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_df[categorical_columns])
# Applying preprocessing on both sets
train_df_LogEnco = encoder.transform(train_df[categorical_columns])
test_df_LogEnco = encoder.transform(test_df[categorical_columns])
# Creating new column names for the encoded features
new_column_names = encoder.get_feature_names_out(categorical_columns)
# Converting the encoded arrays to DataFrames
train_df_LogEnco = pd.DataFrame(train_df_LogEnco, columns=new_column_names, index=train_df.index)
test_df_LogEnco = pd.DataFrame(test_df_LogEnco, columns=new_column_names, index=test_df.index)
# Dropping the original categorical columns and add the encoded columns
train_df = train_df.drop(columns=categorical_columns).join(train_df_LogEnco)
test_df = test_df.drop(columns=categorical_columns).join(test_df_LogEnco)
#https://medium.com/@vinodkumargr/11-column-transformer-in-ml-sklearn-column-transformer-in-machine-learning-48479f8cb48f#:~:text=Scikit%2DLearn's%20Column%20Transformer%20is,transformer%20should%20be%20applied%20to.
#https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
In [22]:
print(train_df)
applicant_income_000s hud_median_family_income \ 93790 -0.560030 -0.621935 3731 0.380770 -0.279059 58545 2.404752 -0.279059 142437 -0.189761 2.033801 21819 0.030613 -0.279059 ... ... ... 110268 -1.733987 -0.603233 119879 -1.167283 -0.603233 103694 -0.716095 -0.621935 131932 0.091232 2.033801 121958 0.009887 0.020179 hud_median_family_income_missing loan_amount_000s \ 93790 0 -0.235946 3731 0 0.774233 58545 0 1.897857 142437 0 0.573865 21819 0 0.066945 ... ... ... 110268 0 -2.694126 119879 0 -0.268873 103694 0 -0.311898 131932 0 0.679899 121958 0 -1.136912 loan_to_income_ratio minority_population \ 93790 -0.117323 1.438200 3731 0.707355 -0.335849 58545 0.781324 -0.119502 142437 0.898603 1.477005 21819 -0.136738 0.080362 ... ... ... 110268 -1.862718 1.527486 119879 0.354152 -0.740041 103694 -0.104487 2.104069 131932 0.817267 -0.266481 121958 -1.473303 -0.368130 minority_population_missing tract_to_msamd_income \ 93790 0 -1.883378 3731 0 0.883041 58545 0 0.680940 142437 0 -1.704510 21819 0 0.811845 ... ... ... 110268 0 -1.092325 119879 0 -0.024931 103694 0 -1.151373 131932 0 -0.389044 121958 0 0.302802 tract_to_msamd_income_missing action_taken_binary ... \ 93790 0 0 ... 3731 0 0 ... 58545 0 0 ... 142437 0 0 ... 21819 0 0 ... ... ... ... ... 110268 0 1 ... 119879 0 0 ... 103694 0 0 ... 131932 0 1 ... 121958 0 0 ... ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female \ 93790 0.0 3731 0.0 58545 0.0 142437 0.0 21819 0.0 ... ... 110268 0.0 119879 0.0 103694 0.0 131932 0.0 121958 0.0 ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male \ 93790 0.0 3731 0.0 58545 0.0 142437 0.0 21819 0.0 ... ... 110268 0.0 119879 0.0 103694 0.0 131932 0.0 121958 0.0 ethnicity_race_sex_not hispanic or latino_asian_female \ 93790 0.0 3731 1.0 58545 0.0 142437 0.0 21819 0.0 ... ... 110268 0.0 119879 0.0 103694 0.0 131932 0.0 121958 0.0 ethnicity_race_sex_not hispanic or latino_asian_male \ 93790 0.0 3731 0.0 58545 0.0 142437 0.0 21819 1.0 ... ... 110268 0.0 119879 0.0 103694 0.0 131932 0.0 121958 0.0 ethnicity_race_sex_not hispanic or latino_black or african american_female \ 93790 0.0 3731 0.0 58545 0.0 142437 0.0 21819 0.0 ... ... 110268 0.0 119879 0.0 103694 1.0 131932 0.0 121958 0.0 ethnicity_race_sex_not hispanic or latino_black or african american_male \ 93790 0.0 3731 0.0 58545 0.0 142437 0.0 21819 0.0 ... ... 110268 1.0 119879 0.0 103694 0.0 131932 0.0 121958 0.0 ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female \ 93790 0.0 3731 0.0 58545 0.0 142437 0.0 21819 0.0 ... ... 110268 0.0 119879 0.0 103694 0.0 131932 0.0 121958 0.0 ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male \ 93790 0.0 3731 0.0 58545 0.0 142437 0.0 21819 0.0 ... ... 110268 0.0 119879 0.0 103694 0.0 131932 0.0 121958 0.0 ethnicity_race_sex_not hispanic or latino_white_female \ 93790 0.0 3731 0.0 58545 0.0 142437 1.0 21819 0.0 ... ... 110268 0.0 119879 0.0 103694 0.0 131932 0.0 121958 0.0 ethnicity_race_sex_not hispanic or latino_white_male 93790 1.0 3731 0.0 58545 1.0 142437 0.0 21819 0.0 ... ... 110268 0.0 119879 1.0 103694 0.0 131932 0.0 121958 1.0 [102702 rows x 43 columns]
In [23]:
print(train_df.isnull().sum())
applicant_income_000s 0 hud_median_family_income 0 hud_median_family_income_missing 0 loan_amount_000s 0 loan_to_income_ratio 0 minority_population 0 minority_population_missing 0 tract_to_msamd_income 0 tract_to_msamd_income_missing 0 action_taken_binary 0 loan_type_name_Conventional 0 loan_type_name_FHA-insured 0 loan_type_name_FSA/RHS-guaranteed 0 loan_type_name_VA-guaranteed 0 loan_purpose_name_Home improvement 0 loan_purpose_name_Home purchase 0 loan_purpose_name_Refinancing 0 property_type_name_Manufactured housing 0 property_type_name_Multifamily dwelling 0 property_type_name_One-to-four family dwelling (other than manufactured housing) 0 lien_status_name_Not secured by a lien 0 lien_status_name_Secured by a first lien 0 lien_status_name_Secured by a subordinate lien 0 ethnicity_race_sex_hispanic or latino_american indian or alaska native_female 0 ethnicity_race_sex_hispanic or latino_american indian or alaska native_male 0 ethnicity_race_sex_hispanic or latino_asian_female 0 ethnicity_race_sex_hispanic or latino_asian_male 0 ethnicity_race_sex_hispanic or latino_black or african american_female 0 ethnicity_race_sex_hispanic or latino_black or african american_male 0 ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female 0 ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male 0 ethnicity_race_sex_hispanic or latino_white_female 0 ethnicity_race_sex_hispanic or latino_white_male 0 ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female 0 ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male 0 ethnicity_race_sex_not hispanic or latino_asian_female 0 ethnicity_race_sex_not hispanic or latino_asian_male 0 ethnicity_race_sex_not hispanic or latino_black or african american_female 0 ethnicity_race_sex_not hispanic or latino_black or african american_male 0 ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female 0 ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male 0 ethnicity_race_sex_not hispanic or latino_white_female 0 ethnicity_race_sex_not hispanic or latino_white_male 0 dtype: int64
D) Define features and target variable:¶
We separate the features (X) and target variable (y) for both train and test sets.
In [24]:
features =[ 'action_taken_binary',
# 'applicant_income_000s',
'hud_median_family_income',
'hud_median_family_income_missing',
# 'loan_amount_000s',
'loan_to_income_ratio',
'minority_population',
'minority_population_missing',
'tract_to_msamd_income',
'tract_to_msamd_income_missing',
'loan_type_name_Conventional',
'loan_type_name_FHA-insured',
'loan_type_name_FSA/RHS-guaranteed',
'loan_type_name_VA-guaranteed',
'loan_purpose_name_Home improvement',
'loan_purpose_name_Home purchase',
'loan_purpose_name_Refinancing',
'property_type_name_Manufactured housing',
'property_type_name_Multifamily dwelling',
'property_type_name_One-to-four family dwelling (other than manufactured housing)',
'lien_status_name_Not secured by a lien',
'lien_status_name_Secured by a first lien',
'lien_status_name_Secured by a subordinate lien',
"ethnicity_race_sex_hispanic or latino_american indian or alaska native_female",
"ethnicity_race_sex_hispanic or latino_american indian or alaska native_male",
"ethnicity_race_sex_hispanic or latino_asian_female",
"ethnicity_race_sex_hispanic or latino_asian_male",
"ethnicity_race_sex_hispanic or latino_black or african american_female",
"ethnicity_race_sex_hispanic or latino_black or african american_male",
"ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female",
"ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male",
"ethnicity_race_sex_hispanic or latino_white_female",
"ethnicity_race_sex_hispanic or latino_white_male",
"ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female",
"ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male",
"ethnicity_race_sex_not hispanic or latino_asian_female",
"ethnicity_race_sex_not hispanic or latino_asian_male",
"ethnicity_race_sex_not hispanic or latino_black or african american_female",
"ethnicity_race_sex_not hispanic or latino_black or african american_male",
"ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female",
"ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male",
"ethnicity_race_sex_not hispanic or latino_white_female",
'ethnicity_race_sex_not hispanic or latino_white_male'
]
train_df = train_df[features]
test_df = test_df[features]
In [25]:
# Train set
X_train = train_df.drop('action_taken_binary', axis=1)
y_train = train_df['action_taken_binary']
# Test Set
X_test = test_df.drop('action_taken_binary', axis=1)
y_test = test_df['action_taken_binary']
In [26]:
# Checking the shapes of the resulting splits
print(f'X_train shape: {X_train.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_test shape: {y_test.shape}')
X_train shape: (102702, 40)
y_train shape: (102702,)
X_test shape: (44016, 40)
y_test shape: (44016,)
E) Bias measurement:¶
We measure biases BEFORE SMOTE with AIF360, using Disparate Impact as the measure.¶
- First, we'll use the 20 one-hot encoded ethnicity-race-sex columns as our protected attributes.
- We'll use the BinaryLabelDataset from AIF360, specifying 'action_taken_binary' as the target and our new categorical column as the protected attribute.
- We'll use BinaryLabelDatasetMetric to compute metrics for each group, focusing on disparate impact and statistical parity difference.
- We'll examine the metrics for each ethnicity-race-sex group and identify any groups facing significantly higher rejection rates.
Privileged class:¶
Since we are interested in disparities within our dataset, we will use the most populous class (the privileged class) as the benchmark.
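To make these metrics concrete: Disparate Impact is the approval rate of the unprivileged groups divided by the approval rate of the privileged group, and Statistical Parity Difference is the same comparison expressed as a difference. A small pandas sketch of that definition on the training set, which should closely mirror the aggregate AIF360 numbers computed below (this cross-check is not part of the original notebook):

priv_col = 'ethnicity_race_sex_not hispanic or latino_white_male'
approved = (train_df['action_taken_binary'] == 0)   # favorable outcome
is_priv = (train_df[priv_col] == 1)
p_priv = approved[is_priv].mean()                   # P(approved | privileged group)
p_unpriv = approved[~is_priv].mean()                # P(approved | any unprivileged group)
print('Disparate Impact:', p_unpriv / p_priv)
print('Statistical Parity Difference:', p_unpriv - p_priv)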
In [27]:
print(train_df['ethnicity_race_sex_not hispanic or latino_white_male'])
93790 1.0 3731 0.0 58545 1.0 142437 0.0 21819 0.0 ... 110268 0.0 119879 1.0 103694 0.0 131932 0.0 121958 1.0 Name: ethnicity_race_sex_not hispanic or latino_white_male, Length: 102702, dtype: float64
E.1) Dataset creation¶
In [28]:
# Define variables:
protected_attribute_names=[
"ethnicity_race_sex_hispanic or latino_american indian or alaska native_female",
"ethnicity_race_sex_hispanic or latino_american indian or alaska native_male",
"ethnicity_race_sex_hispanic or latino_asian_female",
"ethnicity_race_sex_hispanic or latino_asian_male",
"ethnicity_race_sex_hispanic or latino_black or african american_female",
"ethnicity_race_sex_hispanic or latino_black or african american_male",
"ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female",
"ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male",
"ethnicity_race_sex_hispanic or latino_white_female",
"ethnicity_race_sex_hispanic or latino_white_male",
"ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female",
"ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male",
"ethnicity_race_sex_not hispanic or latino_asian_female",
"ethnicity_race_sex_not hispanic or latino_asian_male",
"ethnicity_race_sex_not hispanic or latino_black or african american_female",
"ethnicity_race_sex_not hispanic or latino_black or african american_male",
"ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female",
"ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male",
"ethnicity_race_sex_not hispanic or latino_white_female",
'ethnicity_race_sex_not hispanic or latino_white_male' # notice we include the privileged group
]
favorable_label = 0 # loan approved
unfavorable_label = 1 # loan denied
# First, we create the dataset
aif_dataset = BinaryLabelDataset(
df=train_df,
label_names=['action_taken_binary'],
protected_attribute_names=protected_attribute_names,
favorable_label = favorable_label, # loan approved
unfavorable_label = unfavorable_label # loan denied
)
#https://www.rdocumentation.org/packages/aif360/versions/0.1.0/topics/aif_dataset
#https://aif360.readthedocs.io/en/latest/
#
E.2) Defining groups¶
In [29]:
# Defining privileged group directly
privileged_groups = [{'ethnicity_race_sex_not hispanic or latino_white_male': 1}]
# Defining unprivileged groups using a loop
unprivileged_groups = []
for attribute in protected_attribute_names:
    if attribute != 'ethnicity_race_sex_not hispanic or latino_white_male':
        unprivileged_groups.append({attribute: 1})
# Checking the groups
print("Privileged group:", privileged_groups)
print("Number of unprivileged groups:", len(unprivileged_groups))
print("First few unprivileged groups:", unprivileged_groups[:3])
Privileged group: [{'ethnicity_race_sex_not hispanic or latino_white_male': 1}]
Number of unprivileged groups: 19
First few unprivileged groups: [{'ethnicity_race_sex_hispanic or latino_american indian or alaska native_female': 1}, {'ethnicity_race_sex_hispanic or latino_american indian or alaska native_male': 1}, {'ethnicity_race_sex_hispanic or latino_asian_female': 1}]
E.3) Metrics¶
In [30]:
# Calculating metrics
metric = BinaryLabelDatasetMetric(aif_dataset,
unprivileged_groups=unprivileged_groups,
privileged_groups=privileged_groups)
In [31]:
# Printing metrics
print(f"Disparate Impact: {metric.disparate_impact():.4f}")
print(f"Statistical Parity Difference: {metric.statistical_parity_difference():.4f}")
# We calculate and print the mean difference in label predictions
print(f"Mean difference in label predictions: {metric.mean_difference():.4f}")
# Calculate group-specific metrics
for group in unprivileged_groups:
    group_metric = BinaryLabelDatasetMetric(aif_dataset,
                                            unprivileged_groups=[group],
                                            privileged_groups=privileged_groups)
    group_name = list(group.keys())[0]
    print(f"\nGroup: {group_name}")
    print(f"Disparate Impact: {group_metric.disparate_impact():.4f}")
    print(f"Statistical Parity Difference: {group_metric.statistical_parity_difference():.4f}")
Disparate Impact: 0.9534
Statistical Parity Difference: -0.0380
Mean difference in label predictions: -0.0380

Group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_female | Disparate Impact: 0.6647 | Statistical Parity Difference: -0.2732
Group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_male | Disparate Impact: 0.5467 | Statistical Parity Difference: -0.3694
Group: ethnicity_race_sex_hispanic or latino_asian_female | Disparate Impact: 0.8863 | Statistical Parity Difference: -0.0927
Group: ethnicity_race_sex_hispanic or latino_asian_male | Disparate Impact: 0.7809 | Statistical Parity Difference: -0.1785
Group: ethnicity_race_sex_hispanic or latino_black or african american_female | Disparate Impact: 0.7363 | Statistical Parity Difference: -0.2149
Group: ethnicity_race_sex_hispanic or latino_black or african american_male | Disparate Impact: 0.7181 | Statistical Parity Difference: -0.2297
Group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female | Disparate Impact: 0.5137 | Statistical Parity Difference: -0.3963
Group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male | Disparate Impact: 0.5764 | Statistical Parity Difference: -0.3452
Group: ethnicity_race_sex_hispanic or latino_white_female | Disparate Impact: 0.9077 | Statistical Parity Difference: -0.0753
Group: ethnicity_race_sex_hispanic or latino_white_male | Disparate Impact: 0.9181 | Statistical Parity Difference: -0.0667
Group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female | Disparate Impact: 0.8057 | Statistical Parity Difference: -0.1584
Group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male | Disparate Impact: 0.7504 | Statistical Parity Difference: -0.2034
Group: ethnicity_race_sex_not hispanic or latino_asian_female | Disparate Impact: 1.0054 | Statistical Parity Difference: 0.0044
Group: ethnicity_race_sex_not hispanic or latino_asian_male | Disparate Impact: 0.9994 | Statistical Parity Difference: -0.0005
Group: ethnicity_race_sex_not hispanic or latino_black or african american_female | Disparate Impact: 0.8262 | Statistical Parity Difference: -0.1416
Group: ethnicity_race_sex_not hispanic or latino_black or african american_male | Disparate Impact: 0.8220 | Statistical Parity Difference: -0.1451
Group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female | Disparate Impact: 0.8331 | Statistical Parity Difference: -0.1360
Group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male | Disparate Impact: 0.8433 | Statistical Parity Difference: -0.1277
Group: ethnicity_race_sex_not hispanic or latino_white_female | Disparate Impact: 0.9999 | Statistical Parity Difference: -0.0001
F) SMOTE¶
We apply SMOTE to our imbalanced training data. SMOTE synthesizes new minority-class (denied) examples by interpolating between existing minority examples and their nearest neighbors; it is applied to the training set only, so the test distribution stays untouched.
In [32]:
#Initializing SMOTE
smote = SMOTE(random_state=42)
# Application of SMOTE only on the training set to balance the classes
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
# Checking the distribution of the target variable after SMOTE
print("Before SMOTE:", y_train.value_counts())
print("\n After SMOTE:", y_train_smote.value_counts())
Before SMOTE: action_taken_binary
0    81489
1    21213
Name: count, dtype: int64

After SMOTE: action_taken_binary
0    81489
1    81489
Name: count, dtype: int64
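One caveat worth checking (not part of the original notebook): because SMOTE builds synthetic rows by interpolation, the one-hot ethnicity_race_sex indicators are not guaranteed to stay exactly 0 or 1 on the synthetic rows, which matters because the next step defines groups by those columns equalling 1. A quick sanity check:

ohe_cols = [c for c in X_train_smote.columns if c.startswith('ethnicity_race_sex_')]
# True means every value is still exactly 0 or 1; False means some synthetic rows hold fractional values
print(X_train_smote[ohe_cols].isin([0.0, 1.0]).all().all())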
F.1) Bias AFTER SMOTE¶
We measure biases after applying SMOTE.
F.1.1) First¶
We reassemble the resampled arrays into a single DataFrame.
In [33]:
# Converting the resampled X_train and y_train into a DataFrame
X_train_s = pd.DataFrame(X_train_smote, columns=X_train.columns) # We retained original column names
y_train_s = pd.DataFrame(y_train_smote, columns=['action_taken_binary'])
# Combining X and y into a single DataFrame
train_df_smote = pd.concat([X_train_s, y_train_s], axis=1)
F.1.2) Second¶
We use the resampled dataset to apply the same process used earlier and assess any differences.
In [34]:
# Defining variables:
protected_attribute_names=[
"ethnicity_race_sex_hispanic or latino_american indian or alaska native_female",
"ethnicity_race_sex_hispanic or latino_american indian or alaska native_male",
"ethnicity_race_sex_hispanic or latino_asian_female",
"ethnicity_race_sex_hispanic or latino_asian_male",
"ethnicity_race_sex_hispanic or latino_black or african american_female",
"ethnicity_race_sex_hispanic or latino_black or african american_male",
"ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female",
"ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male",
"ethnicity_race_sex_hispanic or latino_white_female",
"ethnicity_race_sex_hispanic or latino_white_male",
"ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female",
"ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male",
"ethnicity_race_sex_not hispanic or latino_asian_female",
"ethnicity_race_sex_not hispanic or latino_asian_male",
"ethnicity_race_sex_not hispanic or latino_black or african american_female",
"ethnicity_race_sex_not hispanic or latino_black or african american_male",
"ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female",
"ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male",
"ethnicity_race_sex_not hispanic or latino_white_female",
'ethnicity_race_sex_not hispanic or latino_white_male' # notice we include the privileged group
]
favorable_label = 0 # loan approved
unfavorable_label = 1 # loan denied
# Creating the dataset
aif_dataset = BinaryLabelDataset(
df=train_df_smote,
label_names=['action_taken_binary'],
protected_attribute_names=protected_attribute_names,
favorable_label = favorable_label, # loan approved
unfavorable_label = unfavorable_label # loan denied
)
In [35]:
# Defining the privileged group directly
privileged_groups = [{'ethnicity_race_sex_not hispanic or latino_white_male': 1}]
# Defining unprivileged groups using a loop
unprivileged_groups = []
for attribute in protected_attribute_names:
    if attribute != 'ethnicity_race_sex_not hispanic or latino_white_male':
        unprivileged_groups.append({attribute: 1})
# Checking the groups for verification
print("Privileged group:", privileged_groups)
print("Number of unprivileged groups:", len(unprivileged_groups))
print("First few unprivileged groups:", unprivileged_groups[:3])
Privileged group: [{'ethnicity_race_sex_not hispanic or latino_white_male': 1}]
Number of unprivileged groups: 19
First few unprivileged groups: [{'ethnicity_race_sex_hispanic or latino_american indian or alaska native_female': 1}, {'ethnicity_race_sex_hispanic or latino_american indian or alaska native_male': 1}, {'ethnicity_race_sex_hispanic or latino_asian_female': 1}]
In [36]:
# Calculating metrics
metric = BinaryLabelDatasetMetric(aif_dataset,
unprivileged_groups=unprivileged_groups,
privileged_groups=privileged_groups)
In [37]:
# Printing metrics
print(f"Disparate Impact: {metric.disparate_impact():.4f}")
print(f"Statistical Parity Difference: {metric.statistical_parity_difference():.4f}")
# We calculate and print the mean difference in label predictions
print(f"Mean difference in label predictions: {metric.mean_difference():.4f}")
# Calculating group-specific metrics
for group in unprivileged_groups:
    group_metric = BinaryLabelDatasetMetric(aif_dataset,
                                            unprivileged_groups=[group],
                                            privileged_groups=privileged_groups)
    group_name = list(group.keys())[0]
    print(f"\nGroup: {group_name}")
    print(f"Disparate Impact: {group_metric.disparate_impact():.4f}")
    print(f"Statistical Parity Difference: {group_metric.statistical_parity_difference():.4f}")
Disparate Impact: 0.9054
Statistical Parity Difference: -0.0506
Mean difference in label predictions: -0.0506

Group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_female | Disparate Impact: 0.9000 | Statistical Parity Difference: -0.0535
Group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_male | Disparate Impact: 0.4300 | Statistical Parity Difference: -0.3049
Group: ethnicity_race_sex_hispanic or latino_asian_female | Disparate Impact: 1.3501 | Statistical Parity Difference: 0.1873
Group: ethnicity_race_sex_hispanic or latino_asian_male | Disparate Impact: 1.1136 | Statistical Parity Difference: 0.0608
Group: ethnicity_race_sex_hispanic or latino_black or african american_female | Disparate Impact: 0.6992 | Statistical Parity Difference: -0.1609
Group: ethnicity_race_sex_hispanic or latino_black or african american_male | Disparate Impact: 0.6863 | Statistical Parity Difference: -0.1678
Group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female | Disparate Impact: 0.4948 | Statistical Parity Difference: -0.2702
Group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male | Disparate Impact: 0.6231 | Statistical Parity Difference: -0.2016
Group: ethnicity_race_sex_hispanic or latino_white_female | Disparate Impact: 0.8179 | Statistical Parity Difference: -0.0974
Group: ethnicity_race_sex_hispanic or latino_white_male | Disparate Impact: 0.8335 | Statistical Parity Difference: -0.0891
Group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female | Disparate Impact: 0.8476 | Statistical Parity Difference: -0.0815
Group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male | Disparate Impact: 0.7295 | Statistical Parity Difference: -0.1447
Group: ethnicity_race_sex_not hispanic or latino_asian_female | Disparate Impact: 1.0254 | Statistical Parity Difference: 0.0136
Group: ethnicity_race_sex_not hispanic or latino_asian_male | Disparate Impact: 1.0085 | Statistical Parity Difference: 0.0045
Group: ethnicity_race_sex_not hispanic or latino_black or african american_female | Disparate Impact: 0.6568 | Statistical Parity Difference: -0.1836
Group: ethnicity_race_sex_not hispanic or latino_black or african american_male | Disparate Impact: 0.6555 | Statistical Parity Difference: -0.1843
Group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female | Disparate Impact: 1.0247 | Statistical Parity Difference: 0.0132
Group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male | Disparate Impact: 0.9997 | Statistical Parity Difference: -0.0002
Group: ethnicity_race_sex_not hispanic or latino_white_female | Disparate Impact: 0.9981 | Statistical Parity Difference: -0.0010
G) Reweighting¶
In [38]:
# AIF360 dataset
aif_dataset = BinaryLabelDataset(
df=train_df_smote, # Using the SMOTE Dataset
label_names=['action_taken_binary'],
protected_attribute_names=protected_attribute_names,
favorable_label=favorable_label, # loan approved
unfavorable_label=unfavorable_label # loan denied
)
# Defining the privileged and unprivileged groups
privileged_groups = [{'ethnicity_race_sex_not hispanic or latino_white_male': 1}]
unprivileged_groups = []
for attribute in protected_attribute_names:
    if attribute != 'ethnicity_race_sex_not hispanic or latino_white_male':
        unprivileged_groups.append({attribute: 1})
# Applying the Reweighing algorithm
RW = Reweighing(unprivileged_groups=unprivileged_groups,
privileged_groups=privileged_groups)
# Fitting the "reweighing model". Transforming the dataset
reweighted_dataset = RW.fit_transform(aif_dataset)
# Turning the reweighted data back to pandas to use with our logistic regression model
reweighted_df = reweighted_dataset.convert_to_dataframe()[0]
# Re-defining variables
X_train_reweighted = reweighted_df.drop(columns=['action_taken_binary'])
y_train_reweighted = reweighted_df['action_taken_binary']
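One caveat worth making explicit: Reweighing does not alter the feature values; its output is a per-row instance weight stored on the transformed dataset, and convert_to_dataframe() above does not carry those weights into X_train_reweighted / y_train_reweighted. A hedged sketch of how they could be retrieved and forwarded to the LightGBM fit in the next section (this is not what the original notebook does):

# Per-row weights produced by Reweighing (one weight per training row)
sample_weights = reweighted_dataset.instance_weights
print(sample_weights.shape, sample_weights.min(), sample_weights.max())
# They could then be passed along during model fitting, e.g.:
# random_search.fit(X_train_reweighted, y_train_reweighted, sample_weight=sample_weights)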
H) LightGBM¶
In [39]:
# Defining the parameter grid for tuning
param_grid = {
'n_estimators': [100, 200, 300, 500],
'learning_rate': [0.01, 0.05, 0.1],
'num_leaves': [31, 62, 127],
'max_depth': [-1, 5, 10, 15],
'min_child_samples': [20, 50, 100],
'subsample': [0.6, 0.8, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0],
'reg_alpha': [0, 0.1, 0.5],
'reg_lambda': [0, 0.1, 0.5]
}
# Setting the lgbm_model and randomizedSearch
lgbm_model = lgbm.LGBMClassifier(random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# We use joblib's threading backend because a previous run got stuck with the default backend
with joblib.parallel_backend('threading'):  # added to avoid JAX threading issues
    random_search = RandomizedSearchCV(
        estimator=lgbm_model,
        param_distributions=param_grid,
        n_iter=50,
        scoring='roc_auc',
        cv=skf,
        verbose=2,
        random_state=42,
        n_jobs=-1
    )
    # Fitting the randomized search
    random_search.fit(X_train_reweighted, y_train_reweighted)
# Getting the best lgbm_model and parameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_
print("Best parameters:", best_params)
# Making predictions
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]
Fitting 5 folds for each of 50 candidates, totalling 250 fits
reg_lambda=0.1, subsample=0.6; total time= 10.8s [CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=500, num_leaves=31, reg_alpha=0.5, reg_lambda=0.1, subsample=0.6; total time= 9.6s [CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=500, num_leaves=31, reg_alpha=0.5, reg_lambda=0.1, subsample=0.6; total time= 11.3s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0.5, subsample=1.0; total time= 5.5s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0.5, subsample=1.0; total time= 4.0s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0.5, subsample=1.0; total time= 4.1s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0.5, subsample=1.0; total time= 4.5s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0.5, subsample=1.0; total time= 4.2s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 9.4s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 9.5s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 7.1s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 7.7s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=200, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 9.4s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 13.4s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 14.3s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 14.4s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 14.5s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 14.3s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 8.2s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 10.0s [CV] END 
colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 9.7s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 8.0s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 9.2s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=300, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 12.7s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=300, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 11.3s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=300, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 12.5s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=300, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 12.6s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 2.8s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 3.2s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=300, num_leaves=62, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 11.8s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 4.6s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 3.4s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 2.8s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time= 14.3s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time= 14.4s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time= 13.7s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time= 14.3s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time= 13.3s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time= 13.0s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=500, num_leaves=62, 
reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time= 14.4s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time= 13.7s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time= 14.9s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time= 14.5s [CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 6.8s [CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 7.2s [CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 4.6s [CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 5.1s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 4.0s [CV] END colsample_bytree=1.0, learning_rate=0.1, max_depth=-1, min_child_samples=50, n_estimators=200, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 4.9s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 6.5s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 6.0s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 4.0s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 4.0s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time= 26.0s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time= 26.1s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time= 26.9s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time= 27.5s [CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 5.0s [CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 3.2s [CV] END colsample_bytree=0.6, 
learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 3.2s [CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 3.2s [CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 5.4s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time= 25.9s [CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time= 11.0s [CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time= 12.9s [CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time= 12.9s [CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time= 12.1s [CV] END colsample_bytree=1.0, learning_rate=0.05, max_depth=15, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time= 13.1s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time= 14.6s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time= 14.2s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time= 13.3s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time= 14.0s [CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 3.4s [CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 3.4s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0.1, subsample=0.6; total time= 14.2s [CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 4.2s [CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 4.8s [CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=10, min_child_samples=50, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 5.0s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=10, min_child_samples=100, n_estimators=500, 
num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time= 12.5s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time= 11.9s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time= 14.0s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time= 12.3s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 4.1s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 2.9s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 2.5s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time= 13.5s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 2.5s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 2.6s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=-1, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time= 3.1s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=-1, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time= 4.3s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=-1, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time= 5.4s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=-1, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time= 3.9s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=-1, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0, subsample=0.6; total time= 3.1s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 3.0s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 2.9s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 3.0s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 2.9s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=50, n_estimators=100, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=0.6; total time= 3.1s [CV] END colsample_bytree=1.0, 
learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 14.0s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 13.7s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 14.3s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 14.3s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 3.9s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 4.6s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 3.2s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=31, reg_alpha=0, reg_lambda=0, subsample=1.0; total time= 14.5s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 3.2s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=100, num_leaves=31, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 3.1s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 11.0s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 11.0s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 8.6s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 9.2s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time= 10.4s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=20, n_estimators=200, num_leaves=127, reg_alpha=0.5, reg_lambda=0, subsample=1.0; total time= 12.0s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time= 11.9s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time= 11.9s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time= 9.7s [CV] END colsample_bytree=0.8, learning_rate=0.05, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=127, 
reg_alpha=0.1, reg_lambda=0.1, subsample=1.0; total time= 10.6s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 17.6s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 18.0s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 17.5s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 19.3s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=10, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 18.8s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=15, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 20.1s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=15, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 18.5s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=15, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 17.9s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=15, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 18.5s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=15, min_child_samples=20, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 19.6s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time= 14.0s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time= 13.9s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time= 14.8s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time= 13.6s [CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.8; total time= 6.4s [CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.8; total time= 4.2s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0, subsample=0.8; total time= 14.2s [CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.8; total time= 4.3s [CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.8; total time= 4.9s [CV] END 
colsample_bytree=0.6, learning_rate=0.05, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0, reg_lambda=0, subsample=0.8; total time= 6.4s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 19.1s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 18.4s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 18.1s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 17.7s [CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 2.6s [CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 2.6s [CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 3.9s [CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 3.3s [CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=-1, min_child_samples=20, n_estimators=100, num_leaves=31, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 2.6s [CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=-1, min_child_samples=50, n_estimators=500, num_leaves=62, reg_alpha=0.1, reg_lambda=0.5, subsample=1.0; total time= 17.9s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=0.8; total time= 3.0s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=0.8; total time= 3.0s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=0.8; total time= 2.8s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=0.8; total time= 3.1s [CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=5, min_child_samples=20, n_estimators=100, num_leaves=62, reg_alpha=0.5, reg_lambda=0.1, subsample=0.8; total time= 4.4s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time= 23.6s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time= 21.5s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time= 21.3s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, 
n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time= 21.0s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time= 6.7s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time= 4.7s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time= 4.7s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=15, min_child_samples=100, n_estimators=500, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.8; total time= 24.0s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time= 7.0s [CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=5, min_child_samples=100, n_estimators=200, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=0.8; total time= 4.5s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=1.0; total time= 10.7s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=1.0; total time= 12.1s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=1.0; total time= 12.4s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=1.0; total time= 10.7s [CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=15, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 6.0s [CV] END colsample_bytree=0.6, learning_rate=0.1, max_depth=5, min_child_samples=100, n_estimators=500, num_leaves=62, reg_alpha=0.5, reg_lambda=0.5, subsample=1.0; total time= 12.3s [CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=15, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 4.3s [CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=15, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 4.3s [CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=15, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 4.4s [CV] END colsample_bytree=0.6, learning_rate=0.05, max_depth=15, min_child_samples=20, n_estimators=100, num_leaves=127, reg_alpha=0.1, reg_lambda=0.5, subsample=0.6; total time= 4.4s [LightGBM] [Warning] Found whitespace in feature_names, replace with underlines [LightGBM] [Info] Number of positive: 81489, number of negative: 81489 [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.064936 seconds. You can set `force_row_wise=true` to remove the overhead. And if memory is not enough, you can set `force_col_wise=true`. 
[LightGBM] [Info] Total Bins 3976 [LightGBM] [Info] Number of data points in the train set: 162978, number of used features: 38 [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000 [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with 
positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: 
-inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No 
further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive 
gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf Best parameters: {'subsample': 0.8, 'reg_lambda': 0.5, 'reg_alpha': 0.1, 'num_leaves': 127, 'n_estimators': 500, 'min_child_samples': 100, 'max_depth': 15, 'learning_rate': 0.1, 'colsample_bytree': 0.6}
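For reference, here is a minimal sketch (not part of the original notebook) of a randomized search that matches the logged run: 50 candidates × 5 folds over the parameter values visible in the [CV] lines above. The scoring metric, random_state, and the X_train / y_train names are assumptions, and y_pred / y_pred_proba used in the evaluation cells below would have been produced along these lines.
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV

# Parameter values read off the [CV] log above
param_distributions = {
    'n_estimators': [100, 200, 300, 500],
    'num_leaves': [31, 62, 127],
    'max_depth': [-1, 5, 10, 15],
    'learning_rate': [0.01, 0.05, 0.1],
    'min_child_samples': [20, 50, 100],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'reg_alpha': [0, 0.1, 0.5],
    'reg_lambda': [0, 0.1, 0.5],
}

random_search = RandomizedSearchCV(
    estimator=LGBMClassifier(random_state=42),  # random_state assumed
    param_distributions=param_distributions,
    n_iter=50,          # 50 candidates, as logged
    cv=5,               # 5 folds, as logged
    scoring='roc_auc',  # assumed scoring metric
    n_jobs=-1,
    verbose=2,
    random_state=42,
)
random_search.fit(X_train, y_train)  # training split from earlier cells (assumed names)

print("Best parameters:", random_search.best_params_)
best_model = random_search.best_estimator_              # refit on the full training set
y_pred = best_model.predict(X_test)                     # hard predictions used below
y_pred_proba = best_model.predict_proba(X_test)[:, 1]   # probability of class 1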
In [40]:
# Model performance evaluation:
# Accuracy
accuracy = accuracy_score(y_test, y_pred)
# Precision
precision = precision_score(y_test, y_pred)
# Recall
recall = recall_score(y_test, y_pred)
# F1 Score
f1 = f1_score(y_test, y_pred)
# Calculate ROC-AUC
roc_auc = roc_auc_score(y_test, y_pred_proba)
# Log loss (negative log-likelihood)
neg_log_loss = log_loss(y_test, y_pred_proba)
# Results
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"ROC-AUC: {roc_auc:.4f}")
print(f"Negative Log Loss: {neg_log_loss:.4f}")
Accuracy: 0.7421
Precision: 0.3914
Recall: 0.4465
F1 Score: 0.4172
ROC-AUC: 0.7008
Negative Log Loss: 0.5332
In [41]:
# For each intersectional group we compute and print the evaluation metrics
# Adding the predicted values to a new test set for evaluation
X_test_with_pred = X_test.copy()
X_test_with_pred['y_test'] = y_test # Columns for y_test and y_pred.
X_test_with_pred['y_pred'] = y_pred
# Predicted probabilities for class 1 (loan denied)
X_test_with_pred['y_pred_proba'] = best_model.predict_proba(X_test)[:, 1]
# List of intersectional group columns
ethnicity_race_sex_cols = [col for col in X_test_with_pred.columns if col.startswith('ethnicity_race_sex')]
# Creation of a column where each row represents an intersectional group
X_test_with_pred['intersectional_group'] = X_test_with_pred[ethnicity_race_sex_cols].idxmax(axis=1)
# Group by the intersectional group to evaluate model performance for each group
grouped = X_test_with_pred.groupby('intersectional_group')
# Evaluation of the metrics for each group
for group_name, group_data in grouped:
accuracy = accuracy_score(group_data['y_test'], group_data['y_pred'])
precision = precision_score(group_data['y_test'], group_data['y_pred'], zero_division=0)
recall = recall_score(group_data['y_test'], group_data['y_pred'], zero_division=0)
f1 = f1_score(group_data['y_test'], group_data['y_pred'])
roc_auc = roc_auc_score(group_data['y_test'], group_data['y_pred_proba'])
neg_log_loss = log_loss(group_data['y_test'], group_data['y_pred_proba'])
# Showing results
print(f"# Group: {group_name}")
print(f" Accuracy: {accuracy:.4f}")
print(f" Precision: {precision:.4f}")
print(f" Recall: {recall:.4f}")
print(f" F1 Score: {f1:.4f}")
print(f" ROC-AUC: {roc_auc:.4f}")
print(f" Negative Log Loss: {neg_log_loss:.4f}\n")
| Intersectional group (ethnicity_race_sex_…) | Accuracy | Precision | Recall | F1 Score | ROC-AUC | Negative Log Loss |
|---|---|---|---|---|---|---|
| hispanic or latino_american indian or alaska native_female | 0.6897 | 0.6875 | 0.7333 | 0.7097 | 0.8190 | 0.5684 |
| hispanic or latino_american indian or alaska native_male | 0.6585 | 0.5926 | 0.8421 | 0.6957 | 0.8038 | 0.6024 |
| hispanic or latino_asian_female | 0.8235 | 0.6667 | 0.8000 | 0.7273 | 0.8000 | 0.5010 |
| hispanic or latino_asian_male | 0.4815 | 0.2222 | 0.2222 | 0.2222 | 0.4444 | 0.7215 |
| hispanic or latino_black or african american_female | 0.6167 | 0.4821 | 0.6136 | 0.5400 | 0.6932 | 0.7674 |
| hispanic or latino_black or african american_male | 0.7228 | 0.6271 | 0.8605 | 0.7255 | 0.8360 | 0.5922 |
| hispanic or latino_native hawaiian or other pacific islander_female | 0.8667 | 0.7143 | 1.0000 | 0.8333 | 0.8600 | 0.5357 |
| hispanic or latino_native hawaiian or other pacific islander_male | 0.5625 | 0.6000 | 0.7895 | 0.6818 | 0.6518 | 0.7241 |
| hispanic or latino_white_female | 0.6888 | 0.4000 | 0.5641 | 0.4681 | 0.6906 | 0.6104 |
| hispanic or latino_white_male | 0.6948 | 0.3974 | 0.5578 | 0.4642 | 0.7020 | 0.6046 |
| not hispanic or latino_american indian or alaska native_female | 0.7315 | 0.6889 | 0.6739 | 0.6813 | 0.7763 | 0.5611 |
| not hispanic or latino_american indian or alaska native_male | 0.6525 | 0.4571 | 0.4211 | 0.4384 | 0.6322 | 0.6571 |
| not hispanic or latino_asian_female | 0.7576 | 0.3316 | 0.4235 | 0.3720 | 0.6509 | 0.5311 |
| not hispanic or latino_asian_male | 0.7572 | 0.3837 | 0.4219 | 0.4019 | 0.6914 | 0.5279 |
| not hispanic or latino_black or african american_female | 0.6140 | 0.4294 | 0.6771 | 0.5255 | 0.6788 | 0.6956 |
| not hispanic or latino_black or african american_male | 0.6205 | 0.4610 | 0.6549 | 0.5411 | 0.6841 | 0.6689 |
| not hispanic or latino_native hawaiian or other pacific islander_female | 0.6739 | 0.3077 | 0.4000 | 0.3478 | 0.6028 | 0.5880 |
| not hispanic or latino_native hawaiian or other pacific islander_male | 0.6667 | 0.6522 | 0.5769 | 0.6122 | 0.6663 | 0.7676 |
| not hispanic or latino_white_female | 0.7672 | 0.3787 | 0.3790 | 0.3789 | 0.6992 | 0.4978 |
| not hispanic or latino_white_male | 0.7597 | 0.3607 | 0.3736 | 0.3670 | 0.6889 | 0.5089 |
I) Bias measurement:¶
We use Disparate Impact (computed with AIF360) to measure bias across the intersectional groups.
To do so, we re-join X_test with the model's predictions y_pred.
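For reference, Disparate Impact is the ratio of favorable-outcome rates, P(approved | unprivileged) / P(approved | privileged); values below 1 mean the unprivileged group is approved less often than the privileged one. A minimal pandas sketch of that ratio (the helper name is ours; it assumes one-hot group columns and the predicted action_taken_binary column built in the next cell, with 0 = approved, and relies on the pandas import already loaded above):
def disparate_impact(df, group_col, privileged_col,
                     label_col='action_taken_binary', favorable=0):
    # Ratio of favorable-outcome rates: unprivileged group / privileged group
    unpriv_rate = (df.loc[df[group_col] == 1, label_col] == favorable).mean()
    priv_rate = (df.loc[df[privileged_col] == 1, label_col] == favorable).mean()
    return unpriv_rate / priv_rate
AIF360's BinaryLabelDatasetMetric.disparate_impact() below returns the same kind of ratio, pooling all unprivileged groups against the privileged one.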
In [42]:
# We got many empty rows due to index misalignment, so we reset the index of both DataFrames:
# X_test and y_pred
X_test_reset = X_test.reset_index(drop=True)
y_pred_reset = pd.DataFrame(y_pred, columns=['action_taken_binary']).reset_index(drop=True)
# Concatenating the two DataFrames
lgbm_trained_df = pd.concat([X_test_reset, y_pred_reset], axis=1)
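As a quick guard against the index-misalignment issue mentioned above, a small sanity check (ours, not part of the original notebook) confirms that the concatenation introduced no padded NaN rows:
# Sanity check (hypothetical): lengths match and concat did not pad rows with NaNs
assert len(lgbm_trained_df) == len(X_test_reset) == len(y_pred_reset)
assert not lgbm_trained_df['action_taken_binary'].isna().any()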
In [43]:
# Defining variables:
protected_attribute_names=[
"ethnicity_race_sex_hispanic or latino_american indian or alaska native_female",
"ethnicity_race_sex_hispanic or latino_american indian or alaska native_male",
"ethnicity_race_sex_hispanic or latino_asian_female",
"ethnicity_race_sex_hispanic or latino_asian_male",
"ethnicity_race_sex_hispanic or latino_black or african american_female",
"ethnicity_race_sex_hispanic or latino_black or african american_male",
"ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female",
"ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male",
"ethnicity_race_sex_hispanic or latino_white_female",
"ethnicity_race_sex_hispanic or latino_white_male",
"ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female",
"ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male",
"ethnicity_race_sex_not hispanic or latino_asian_female",
"ethnicity_race_sex_not hispanic or latino_asian_male",
"ethnicity_race_sex_not hispanic or latino_black or african american_female",
"ethnicity_race_sex_not hispanic or latino_black or african american_male",
"ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female",
"ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male",
"ethnicity_race_sex_not hispanic or latino_white_female",
'ethnicity_race_sex_not hispanic or latino_white_male'  # note that we include the privileged group
]
favorable_label = 0 # loan approved
unfavorable_label = 1 # loan denied
# Creating the dataset
aif_dataset = BinaryLabelDataset(
df=lgbm_trained_df,
label_names=['action_taken_binary'],
protected_attribute_names=protected_attribute_names,
favorable_label = favorable_label, # loan approved
unfavorable_label = unfavorable_label # loan denied
)
In [44]:
# Defining the privileged group directly
privileged_groups = [{'ethnicity_race_sex_not hispanic or latino_white_male': 1}]
# Defining unprivileged groups using a loop
unprivileged_groups = []
for attribute in protected_attribute_names:
if attribute != 'ethnicity_race_sex_not hispanic or latino_white_male':
unprivileged_groups.append({attribute: 1})
# Checking groups
print("Privileged group:", privileged_groups)
print("Number of unprivileged groups:", len(unprivileged_groups))
print("First few unprivileged groups:", unprivileged_groups[:3])
Privileged group: [{'ethnicity_race_sex_not hispanic or latino_white_male': 1}]
Number of unprivileged groups: 19
First few unprivileged groups: [{'ethnicity_race_sex_hispanic or latino_american indian or alaska native_female': 1}, {'ethnicity_race_sex_hispanic or latino_american indian or alaska native_male': 1}, {'ethnicity_race_sex_hispanic or latino_asian_female': 1}]
In [45]:
# Calculating metrics
metric = BinaryLabelDatasetMetric(aif_dataset,
unprivileged_groups=unprivileged_groups,
privileged_groups=privileged_groups)
In [46]:
# Printing metrics
print(f"Disparate Impact: {metric.disparate_impact():.6f}")
print(f"Statistical Parity Difference: {metric.statistical_parity_difference():.6f}")
# We calculate and print the mean difference in label predictions
print(f"Mean difference in label predictions: {metric.mean_difference():.6f}")
# Calculating group-specific metrics
for group in unprivileged_groups:
group_metric = BinaryLabelDatasetMetric(aif_dataset,
unprivileged_groups=[group],
privileged_groups=privileged_groups)
group_name = list(group.keys())[0]
print(f"\nGroup: {group_name}")
print(f"Disparate Impact: {group_metric.disparate_impact():.4f}")
print(f"Statistical Parity Difference: {group_metric.statistical_parity_difference():.4f}")
Disparate Impact: 0.905503
Statistical Parity Difference: -0.076246
Mean difference in label predictions: -0.076246

Per-group metrics (each group vs. the privileged group):

| Group (`ethnicity_race_sex_` prefix omitted) | Disparate Impact | Statistical Parity Difference |
|---|---|---|
| hispanic or latino_american indian or alaska native_female | 0.5556 | -0.3586 |
| hispanic or latino_american indian or alaska native_male | 0.4232 | -0.4654 |
| hispanic or latino_asian_female | 0.8020 | -0.1598 |
| hispanic or latino_asian_male | 0.8263 | -0.1402 |
| hispanic or latino_black or african american_female | 0.6610 | -0.2735 |
| hispanic or latino_black or african american_male | 0.5154 | -0.3910 |
| hispanic or latino_native hawaiian or other pacific islander_female | 0.6610 | -0.2735 |
| hispanic or latino_native hawaiian or other pacific islander_male | 0.2711 | -0.5881 |
| hispanic or latino_white_female | 0.8151 | -0.1492 |
| hispanic or latino_white_male | 0.8271 | -0.1395 |
| not hispanic or latino_american indian or alaska native_female | 0.7230 | -0.2235 |
| not hispanic or latino_american indian or alaska native_male | 0.8718 | -0.1035 |
| not hispanic or latino_asian_female | 0.9711 | -0.0233 |
| not hispanic or latino_asian_male | 0.9759 | -0.0195 |
| not hispanic or latino_black or african american_female | 0.6223 | -0.3047 |
| not hispanic or latino_black or african american_male | 0.6380 | -0.2921 |
| not hispanic or latino_native hawaiian or other pacific islander_female | 0.8891 | -0.0895 |
| not hispanic or latino_native hawaiian or other pacific islander_male | 0.7393 | -0.2104 |
| not hispanic or latino_white_female | 1.0071 | 0.0057 |
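A common reading of these ratios is the four-fifths rule: a Disparate Impact below 0.8 is typically flagged as potentially adverse, which applies to several of the groups above. A short sketch that reuses the objects already defined to list the flagged groups (the di_df name and the hard 0.8 cut-off are our choices):
# Collect per-group Disparate Impact and flag groups under the 0.8 (four-fifths) threshold
di_rows = []
for group in unprivileged_groups:
    m = BinaryLabelDatasetMetric(aif_dataset,
                                 unprivileged_groups=[group],
                                 privileged_groups=privileged_groups)
    di_rows.append({'group': list(group.keys())[0],
                    'disparate_impact': m.disparate_impact()})
di_df = pd.DataFrame(di_rows).sort_values('disparate_impact')
print(di_df[di_df['disparate_impact'] < 0.8])  # groups below the rule-of-thumb threshold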
J) SHAP¶
J.1) Overall model assessment¶
In [47]:
# lgbm_model.fit(X_train_smote, y_train_smote)
# lgbm_model = lgb.LGBMClassifier(random_state=42)
# Creating SHAP explainer for LightGBM
explainer = shap.TreeExplainer(best_model, X_train_reweighted)
# Calculating SHAP values for the test set
shap_values = explainer.shap_values(X_test)
# Calculating mean absolute SHAP values for each feature
mean_abs_shap = np.abs(shap_values).mean(axis=0)
# Creating a DataFrame with feature names and mean absolute SHAP values
shap_importance = pd.DataFrame({
'feature': X_test.columns,
'importance': mean_abs_shap
})
# Sorting by importance
shap_importance_sorted = shap_importance.sort_values(by='importance', ascending=False)
100%|===================| 43998/44016 [26:42<00:00]
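Passing all of X_train_reweighted as background data makes the interventional TreeExplainer expensive (the run above took roughly 26 minutes for about 44,000 test rows). If runtime becomes an issue, a smaller random background usually gives very similar attributions; a possible variant (the 1,000-row sample size is an arbitrary choice of ours):
# Hypothetical speed-up: use a random subsample of the training data as the SHAP background
background = shap.sample(X_train_reweighted, 1000, random_state=42)
fast_explainer = shap.TreeExplainer(best_model, background)
fast_shap_values = fast_explainer.shap_values(X_test)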
In [48]:
# Bar Plot for Top 20 Feature Importances
plt.figure(figsize=(12, 8))
sns.barplot(x='importance', y='feature', data=shap_importance_sorted.head(20))
plt.title('Top 20 Feature Importances (based on absolute SHAP values)')
plt.tight_layout()
plt.show()
# Summary Plot
plt.figure(figsize=(12, 8))
shap.summary_plot(shap_values, X_test, plot_type="bar")
plt.show()
# Detailed Summary Plot
plt.figure(figsize=(12, 8))
shap.summary_plot(shap_values, X_test)
plt.show()
/usr/local/lib/python3.10/dist-packages/shap/plots/_beeswarm.py:950: UserWarning: Tight layout not applied. The left and right margins cannot be made large enough to accommodate all axes decorations. pl.tight_layout()
J.2) Waterfall plot per group¶
In [49]:
# List of the intersectional group columns
ethnicity_race_sex_cols = [col for col in X_test_with_pred.columns if col.startswith('ethnicity_race_sex')]
# Creating a column that represents each intersectional group
X_test_with_pred['intersectional_group'] = X_test_with_pred[ethnicity_race_sex_cols].idxmax(axis=1)
# Resetting the index of X_test_with_pred and X_test so that they match shap_values (we had issues earlier without resetting)
X_test_with_pred = X_test_with_pred.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
# Grouping by intersectional group to evaluate SHAP values for each group
grouped = X_test_with_pred.groupby('intersectional_group')
# Loop through each intersectional group
for group_name, group_data in grouped:
print(f"Generating SHAP Waterfall plot for group: {group_name}")
# Getting the subset of the data for this intersectional group
X_group = X_test.loc[group_data.index]
# Creation of a subset for the SHAP values for this group
    shap_values_group = shap_values[group_data.index]  # This ensures the SHAP values match X_group
# Picking a specific row to explain for the waterfall plot
row_to_explain = group_data.index[0]
# We generate the SHAP waterfall plot for a single prediction in this group
shap.waterfall_plot(
shap.Explanation(
values=shap_values[row_to_explain], # SHAP values for the specific row
base_values=explainer.expected_value, # Base value for the SHAP model
data=X_test.iloc[row_to_explain, :], # Input data for the specific row
feature_names=X_test.columns # Feature names
)
)
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_asian_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_asian_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_black or african american_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_black or african american_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_white_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_white_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_asian_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_asian_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_white_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_white_male
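Each waterfall above explains a single row, so it is worth checking how many test observations stand behind each group before generalising from one example; a quick check (ours, not part of the original notebook):
# Test-set size of each intersectional group; single-row waterfalls from tiny groups are anecdotal
print(X_test_with_pred['intersectional_group'].value_counts())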
J.3) Summary per group¶
In [50]:
for group_name, group_data in grouped:
print(f"Generating SHAP Summary plot for group: {group_name}")
    # Getting the subset of the data for this intersectional group
X_group = X_test.loc[group_data.index]
# Subsetting the SHAP values for this group
shap_values_group = shap_values[group_data.index]
# Generating the SHAP summary plot for the entire group
shap.summary_plot(shap_values_group, X_group, feature_names=X_test.columns)
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_female
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_male
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_asian_female
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_asian_male
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_black or african american_female
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_black or african american_male
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_white_female
Generating SHAP Summary plot for group: ethnicity_race_sex_hispanic or latino_white_male
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_asian_female
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_asian_male
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_female
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_male
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_white_female
Generating SHAP Summary plot for group: ethnicity_race_sex_not hispanic or latino_white_male
In [51]:
# Alternative to the per-group summary plots above, since some of those charts rendered with a distorted layout.
for group_name, group_data in grouped:
print(f"Generating SHAP Custom Summary plot for group: {group_name}")
# Getting the subset of the data for this intersectional group
X_group = X_test.loc[group_data.index]
# Subsetting the SHAP values for this group
shap_values_group = shap_values[group_data.index]
# Calculating mean absolute SHAP values for the group
mean_abs_shap = np.abs(shap_values_group).mean(axis=0)
# Creating a DataFrame with feature names and mean absolute SHAP values
shap_importance = pd.DataFrame({
'feature': X_test.columns,
'importance': mean_abs_shap
})
# Sorting df by importance
shap_importance_sorted = shap_importance.sort_values(by='importance', ascending=False)
# Plotting the top 20 features
plt.figure(figsize=(12, 8))
sns.barplot(x='importance', y='feature', data=shap_importance_sorted.head(20))
plt.title(f'Top 20 Feature Importances for Group: {group_name}')
plt.tight_layout()
plt.show()
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_asian_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_asian_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_black or african american_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_black or african american_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_white_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_hispanic or latino_white_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_asian_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_asian_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_white_female
Generating SHAP Custom Summary plot for group: ethnicity_race_sex_not hispanic or latino_white_male
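The per-group bar charts are easier to compare when the mean absolute SHAP values are gathered into one table; a sketch of that aggregation (group_importance and the top-10 selection are our additions, reusing the numpy/pandas imports already loaded):
# Mean |SHAP| per feature and per intersectional group, collected into a single table
rows = []
for group_name, group_data in grouped:
    group_mean = np.abs(shap_values[group_data.index]).mean(axis=0)
    rows.append(pd.Series(group_mean, index=X_test.columns, name=group_name))
group_importance = pd.DataFrame(rows)              # groups x features
top_features = group_importance.mean().nlargest(10).index
print(group_importance[top_features].round(3))     # top 10 features compared across groups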
J.4) Summary per group (average)¶
In [52]:
for group_name, group_data in grouped:
print(f"Generating SHAP Waterfall plot for group: {group_name}")
# Getting the subset of the data for this intersectional group
X_group = X_test.loc[group_data.index]
# Subsetting the SHAP values for this group
    shap_values_group = shap_values[group_data.index]  # This ensures the SHAP values match X_group
# Calculating the average SHAP values for the group
avg_shap_values = shap_values_group.mean(axis=0)
    # Generating the SHAP waterfall plot for the group's AVERAGE prediction (not a single applicant)
shap.waterfall_plot(
shap.Explanation(
values=avg_shap_values, # Average SHAP values for the group
base_values=explainer.expected_value, # Base value for the SHAP model
data=X_group.mean(axis=0), # Average input data for the group
feature_names=X_test.columns # Feature names
)
)
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_american indian or alaska native_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_asian_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_asian_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_black or african american_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_black or african american_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_white_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_hispanic or latino_white_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_american indian or alaska native_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_asian_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_asian_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_black or african american_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_native hawaiian or other pacific islander_male
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_white_female
Generating SHAP Waterfall plot for group: ethnicity_race_sex_not hispanic or latino_white_male
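As a cross-check on these averaged waterfalls, SHAP's additivity property implies that the base value plus the sum of a group's average SHAP values should approximately equal the group's average raw model output (the log-odds margin for LightGBM). A sketch of that check for one group, assuming best_model is the fitted LGBMClassifier used above (variable names are ours):
# Additivity check (sketch): expected_value + mean row-wise SHAP sum ~= group's mean raw margin
group_name, group_data = next(iter(grouped))
idx = group_data.index
reconstructed = explainer.expected_value + shap_values[idx].sum(axis=1).mean()
raw_margin = best_model.predict(X_test.loc[idx], raw_score=True).mean()
print(group_name, round(float(reconstructed), 4), round(float(raw_margin), 4))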