HMDA. Mortgage Data Analysis and Modeling Predictions.¶
The Dataset: https://www.consumerfinance.gov/data-research/hmda/¶
The Home Mortgage Disclosure Act (HMDA) requires many financial institutions to maintain, report, and publicly disclose loan-level information about mortgages. These data help show whether lenders are serving the housing needs of their communities; they give public officials information that helps them make decisions and policies; and they shed light on lending patterns that could be discriminatory. The public data are modified to protect applicant and borrower privacy.
HMDA was originally enacted by Congress in 1975 and is implemented by Regulation C.
The Goal:¶
We will use the models Logistic Regression and LightGBM. to assess the outcomes, we will use ROC-AUC and Log Loss. The purpouse will be:
- Find out if the are any intersectional biases in model's predictions.
- Which features influenced the most on model's outcomes. We will use SHAP Values.
Steps:¶
Step 1: Setup of working environment and libraries.
Step 2: Overview
A) Understanding features: Know what each column represents and whether it is useful.
B) Understanding the data, shape and structure:
- df.shape
- df.columns
- df.dtypes
- df.head()
- Step 3: Data cleaning
A) Excluding irrelevant records from columns.
- Quality issues: in "edit_status_name" there are 190,210 records with quality issues. We will ommit them.
- Only idividuals: we will focus on individuals hence from 'co_applicant_ethnicity_name' we only keep 'No co-applicant'
B) Excluding irrelevant records from:
- applicant_ethnicity_name
- applicant_race_name_1
- applicant_sex_name
- action_taken_name
C) Addressing missing values:
- Filling relevant empty cells with "Unknown"
- Deleting irrelevant records from other columns.
D) Mapping action_taken_name into "pproved" & "denied"
E) Feature engineering; Feature creation
- Creation of the column 'ethnicity_race_sex'
- Step 4: Exploratory Data Analysis
A) What is the proportion of "denials" to "approvals" for each record-group from the column "ethnicity_race_sex"?
B) What is the total % of applications for each ethnicity_race_sex group?
C) What are the approval rates?
D) Outliers
- Step 5: Feature engineering
A) Feature creation
B) Feature selection
C) Feature transformation
Final Considerations: Please note that at the bottom of some coding blocks we acknowledge external original sources of the code used, if any source is mentioned the creation is original.
Step 1: Setup of working environment and libraries.¶
Note:
!git init is for Google Colab environment, we should use "subprocess.run(["git", "init"])" which is more in line with PEP 8 doc. Style.
We only use once "git init", when starting a new project. The purpose of this code is to create things related to the project in a .git directory in the project folder.
subprocess.run(["git", "init"])
# Mount google drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# Setting up GitHub
import os
import subprocess
from getpass import getpass #Secure token storage
# Importing working space
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Username and email:
subprocess.run(["git", "config", "--global", "user.name", "AJLR888"], check=True)
subprocess.run(["git", "config", "--global", "user.email", "roldan.analytics@gmail.com"], check=True)
# Storing GitHub token and repository details
GITHUB_TOKEN = getpass("Enter GitHub Token:")
REPO_OWNER = "AJLR888"
REPO_NAME = "hmda-ny-2007-loan-default"
BRANCH_NAME = "main"
#Setting GitHub remot URL with authentcation
GIT_REMOTE_URL = f"https://{GITHUB_TOKEN}@github.com/{REPO_OWNER}/{REPO_NAME}.git"
os.system(f"git remote set-url origin {GIT_REMOTE_URL}")
Enter GitHub Token:··········
0
# Loading the data
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/hmda_2007_ny_all-records_labels.csv')
# Checking data size
df = pd.DataFrame(df)
<ipython-input-176-100f7bdf4485>:2: DtypeWarning: Columns (34,36,38,42,44,46,48) have mixed types. Specify dtype option on import or set low_memory=False. df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/hmda_2007_ny_all-records_labels.csv')
Step 2: General overview¶
A) Understanding of each feature. What does each feature describe?¶
as_of_year: The year the mortgage was reported.
respondent_id: The lender.
agency_name: The regulatory agency responsible for overseeing the financial institution that reported the mortgage.
agency_abbr: Agency abrebiation name.
agency_code: The regulator that supervises the regulated lender that report the data.
loan_type_name:The type of covered loan or application (if covered). By covered mean, there is a mechanism to protect the lender.
property_type_name:
- One to four-family (other than manufactured housing)
- Manufactured housing
- Multifamily
loan_purpose_name: What was the loan used for.
owner_occupancy_name: Whether the owner intend to live in the property or not.
loan_amount_000s
preapproval_name: Preapproval only applies to home purchase loans, not refinancing or home improvement loans.
action_taken_name: The final outcome of a loan application.
msamd_name: MSA stands for Metropolitan Statistical Area, is a region with a high population density at its core and close economic ties throught the area; MD stands for Metropolitan division, is a sub-region within a large MSA (used when the area is particularly populous).
- Considerations:
- If the property is outside of the MSA/MD area it may be blank/empty.
- Helps analyze lending patterns, approval rates, and borrower demographics in specific regions.
- msamd is broader, city or regional-level in comparison with census_tract_number. msamd is better use for housing trends or market segmentation for instance.
census_tract_number: is a small area. Designed to be socially and economically homogeneous. In other words, they are kind of neighbourhoods.
purchaser_type_name: Describes who purchased the loan in the secondary market after it was originated by the reporting institution.
denial_reason_name_1: Important to consider that if NA the loan was not denied.
rate_spread: The exra interest that a borrower pay compared to the "best-qualified" borrowers. Generally speaking it means higher risk although there are nuances to consider. e.g. someone can have a higher rate spread due to a bad credit score which doesn't reflect their ability and will to pay back.
hoepa_status_name: hoepa stands for Home Ownership and Equity Protection Act. is a law disgned to protect borrowers from predatory lending.
- Considerations:
- It can be used to track if high-cost loans are disproportionally targeted at vulnerable groups.
lien_status_name: Basically, it indicates whether the mortgage is secured or not.
edit_status_name: Are rules to assist filers in checking the accuracy of HMDA data prior to submission. Idicate if there are any accuracy(quality)issues.
- Note: In our dataset we have almost 200,000 records as "Quality edit failure only" which is indicative of some potentail data inconsistency.
minority_population: % of minority population to total tract population.
hud_median_family_income: Median family income for the MSA (metropolitan statistical area) or MD (metropolitan division)
- Note: hud(Housing & Urban development).
- List item
population: total population in tract (tract ~ neighbourhood).
tract_to_msamd_income: Indicates how wealthy or poor a tract is compared to its surroundings areas. compares the median family income of the census tract to the median family income of the corresponding metropolitan statistical area MSA OR metropolitan division MD. Let's remember that a tract is inside of a MSA/MD.
number_of_owner_occupied_units: Total number of housing units in the census tract that are occupied by their owners rather than rented.
number_of_1_to_4_family_units: Indicates the number of housing units that are classified as single-family homes. The higher the number the less population density.
Check the following sources for further undertanding:
Check filing instructions guide for further information: https://ffiec.cfpb.gov/documentation/fig/2024/overview
Original source: https://www.consumerfinance.gov/data-research/hmda/historic-data/?geo=nationwide&records=all-records&field_descriptions=labels
Check "lar_record_codes.pdf" from my GitHub for further Inf.
B) Understanding the data¶
# Checking content in the columns
pd.set_option('display.max_columns', None)
print(df.head(5))
as_of_year respondent_id agency_name \ 0 2007 20-2096530 Office of the Comptroller of the Currency 1 2007 2085300005 Department of Housing and Urban Development 2 2007 20-2096530 Office of the Comptroller of the Currency 3 2007 0000003970 Office of Thrift Supervision 4 2007 2085300005 Department of Housing and Urban Development agency_abbr agency_code loan_type_name loan_type \ 0 OCC 1 Conventional 1 1 HUD 7 Conventional 1 2 OCC 1 Conventional 1 3 OTS 4 Conventional 1 4 HUD 7 Conventional 1 property_type_name property_type \ 0 One-to-four family dwelling (other than manufa... 1 1 One-to-four family dwelling (other than manufa... 1 2 One-to-four family dwelling (other than manufa... 1 3 One-to-four family dwelling (other than manufa... 1 4 One-to-four family dwelling (other than manufa... 1 loan_purpose_name loan_purpose owner_occupancy_name \ 0 Home improvement 2 Owner-occupied as a principal dwelling 1 Refinancing 3 Owner-occupied as a principal dwelling 2 Home improvement 2 Owner-occupied as a principal dwelling 3 Refinancing 3 Owner-occupied as a principal dwelling 4 Refinancing 3 Owner-occupied as a principal dwelling owner_occupancy loan_amount_000s preapproval_name preapproval \ 0 1 54 Not applicable 3 1 1 60 Not applicable 3 2 1 56 Not applicable 3 3 1 374 Not applicable 3 4 1 158 Not applicable 3 action_taken_name action_taken \ 0 Application denied by financial institution 3 1 Application denied by financial institution 3 2 Application denied by financial institution 3 3 Loan purchased by the institution 6 4 Application denied by financial institution 3 msamd_name msamd state_name state_abbr state_code \ 0 Binghamton - NY 13780.0 New York NY 36 1 Binghamton - NY 13780.0 New York NY 36 2 Binghamton - NY 13780.0 New York NY 36 3 Nassau, Suffolk - NY 35004.0 New York NY 36 4 Binghamton - NY 13780.0 New York NY 36 county_name county_code census_tract_number applicant_ethnicity_name \ 0 Tioga County 107.0 206.0 Not Hispanic or Latino 1 Broome County 7.0 17.0 Hispanic or Latino 2 Tioga County 107.0 206.0 Not Hispanic or Latino 3 Nassau County 59.0 4086.0 Not applicable 4 Broome County 7.0 125.0 Not Hispanic or Latino applicant_ethnicity co_applicant_ethnicity_name co_applicant_ethnicity \ 0 2 No co-applicant 5 1 1 No co-applicant 5 2 2 No co-applicant 5 3 4 Not applicable 4 4 2 Not Hispanic or Latino 2 applicant_race_name_1 applicant_race_1 applicant_race_name_2 \ 0 White 5 NaN 1 White 5 NaN 2 White 5 NaN 3 Not applicable 7 NaN 4 White 5 NaN applicant_race_2 applicant_race_name_3 applicant_race_3 \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN applicant_race_name_4 applicant_race_4 applicant_race_name_5 \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN applicant_race_5 co_applicant_race_name_1 co_applicant_race_1 \ 0 NaN No co-applicant 8 1 NaN No co-applicant 8 2 NaN No co-applicant 8 3 NaN Not applicable 7 4 NaN White 5 co_applicant_race_name_2 co_applicant_race_2 co_applicant_race_name_3 \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN co_applicant_race_3 co_applicant_race_name_4 co_applicant_race_4 \ 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN co_applicant_race_name_5 co_applicant_race_5 \ 0 NaN NaN 1 NaN NaN 2 NaN NaN 3 NaN NaN 4 NaN NaN applicant_sex_name applicant_sex \ 0 Male 1 1 Information not provided by applicant in mail,... 3 2 Male 1 3 Not applicable 4 4 Information not provided by applicant in mail,... 3 co_applicant_sex_name co_applicant_sex \ 0 No co-applicant 5 1 No co-applicant 5 2 No co-applicant 5 3 Not applicable 4 4 Information not provided by applicant in mail,... 3 applicant_income_000s purchaser_type_name \ 0 16.0 Loan was not originated or was not sold in cal... 1 36.0 Loan was not originated or was not sold in cal... 2 66.0 Loan was not originated or was not sold in cal... 3 NaN Commercial bank, savings bank or savings assoc... 4 62.0 Loan was not originated or was not sold in cal... purchaser_type denial_reason_name_1 denial_reason_1 \ 0 0 Collateral 4.0 1 0 NaN NaN 2 0 Credit application incomplete 7.0 3 6 NaN NaN 4 0 NaN NaN denial_reason_name_2 denial_reason_2 denial_reason_name_3 denial_reason_3 \ 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN rate_spread hoepa_status_name hoepa_status lien_status_name \ 0 NaN Not a HOEPA loan 2 Secured by a first lien 1 NaN Not a HOEPA loan 2 Secured by a first lien 2 NaN Not a HOEPA loan 2 Secured by a subordinate lien 3 NaN Not a HOEPA loan 2 Not applicable 4 NaN Not a HOEPA loan 2 Secured by a first lien lien_status edit_status_name edit_status sequence_number population \ 0 1 NaN NaN 13711 7424.0 1 1 NaN NaN 10815 4302.0 2 2 NaN NaN 13712 7424.0 3 4 NaN NaN 174127 4497.0 4 1 NaN NaN 10823 6421.0 minority_population hud_median_family_income tract_to_msamd_income \ 0 2.25 52800.0 91.790001 1 14.16 52800.0 94.230003 2 2.25 52800.0 91.790001 3 11.12 93800.0 97.110001 4 2.59 52800.0 93.949997 number_of_owner_occupied_units number_of_1_to_4_family_units \ 0 2259.0 3030.0 1 1033.0 1823.0 2 2259.0 3030.0 3 1341.0 1459.0 4 1926.0 2669.0 application_date_indicator 0 0 1 0 2 0 3 2 4 0
# Force to dislplay all rows
pd.set_option('display.max_rows', None)
# Data types
print(df.dtypes)
as_of_year int64 respondent_id object agency_name object agency_abbr object agency_code int64 loan_type_name object loan_type int64 property_type_name object property_type int64 loan_purpose_name object loan_purpose int64 owner_occupancy_name object owner_occupancy int64 loan_amount_000s int64 preapproval_name object preapproval int64 action_taken_name object action_taken int64 msamd_name object msamd float64 state_name object state_abbr object state_code int64 county_name object county_code float64 census_tract_number float64 applicant_ethnicity_name object applicant_ethnicity int64 co_applicant_ethnicity_name object co_applicant_ethnicity int64 applicant_race_name_1 object applicant_race_1 int64 applicant_race_name_2 object applicant_race_2 float64 applicant_race_name_3 object applicant_race_3 float64 applicant_race_name_4 object applicant_race_4 float64 applicant_race_name_5 object applicant_race_5 float64 co_applicant_race_name_1 object co_applicant_race_1 int64 co_applicant_race_name_2 object co_applicant_race_2 float64 co_applicant_race_name_3 object co_applicant_race_3 float64 co_applicant_race_name_4 object co_applicant_race_4 float64 co_applicant_race_name_5 object co_applicant_race_5 float64 applicant_sex_name object applicant_sex int64 co_applicant_sex_name object co_applicant_sex int64 applicant_income_000s float64 purchaser_type_name object purchaser_type int64 denial_reason_name_1 object denial_reason_1 float64 denial_reason_name_2 object denial_reason_2 float64 denial_reason_name_3 object denial_reason_3 float64 rate_spread float64 hoepa_status_name object hoepa_status int64 lien_status_name object lien_status int64 edit_status_name object edit_status float64 sequence_number int64 population float64 minority_population float64 hud_median_family_income float64 tract_to_msamd_income float64 number_of_owner_occupied_units float64 number_of_1_to_4_family_units float64 application_date_indicator int64 dtype: object
# Null values
print(df.isnull().sum())
as_of_year 0 respondent_id 0 agency_name 0 agency_abbr 0 agency_code 0 loan_type_name 0 loan_type 0 property_type_name 0 property_type 0 loan_purpose_name 0 loan_purpose 0 owner_occupancy_name 0 owner_occupancy 0 loan_amount_000s 0 preapproval_name 0 preapproval 0 action_taken_name 0 action_taken 0 msamd_name 92920 msamd 92920 state_name 0 state_abbr 0 state_code 0 county_name 1007 county_code 1007 census_tract_number 1370 applicant_ethnicity_name 0 applicant_ethnicity 0 co_applicant_ethnicity_name 0 co_applicant_ethnicity 0 applicant_race_name_1 0 applicant_race_1 0 applicant_race_name_2 1006430 applicant_race_2 1006430 applicant_race_name_3 1009259 applicant_race_3 1009259 applicant_race_name_4 1009369 applicant_race_4 1009369 applicant_race_name_5 1009386 applicant_race_5 1009386 co_applicant_race_name_1 0 co_applicant_race_1 0 co_applicant_race_name_2 1008514 co_applicant_race_2 1008514 co_applicant_race_name_3 1009393 co_applicant_race_3 1009393 co_applicant_race_name_4 1009417 co_applicant_race_4 1009417 co_applicant_race_name_5 1009427 co_applicant_race_5 1009427 applicant_sex_name 0 applicant_sex 0 co_applicant_sex_name 0 co_applicant_sex 0 applicant_income_000s 88875 purchaser_type_name 0 purchaser_type 0 denial_reason_name_1 828980 denial_reason_1 828980 denial_reason_name_2 971378 denial_reason_2 971378 denial_reason_name_3 1002497 denial_reason_3 1002497 rate_spread 939632 hoepa_status_name 0 hoepa_status 0 lien_status_name 0 lien_status 0 edit_status_name 819241 edit_status 819241 sequence_number 0 population 1466 minority_population 1477 hud_median_family_income 1418 tract_to_msamd_income 1745 number_of_owner_occupied_units 3093 number_of_1_to_4_family_units 2965 application_date_indicator 0 dtype: int64
# Unique values for categorical variables
categorical_columns = df.select_dtypes(include=['object']).columns.drop("respondent_id")
for col in categorical_columns:
print(f"Value counts for: {col}")
print(df[col].value_counts(), "\n")
print("\n" + "=+"*50 + "\n")
Value counts for: agency_name agency_name Office of the Comptroller of the Currency 363756 Department of Housing and Urban Development 214451 Office of Thrift Supervision 199948 Federal Reserve System 173886 National Credit Union Administration 34779 Federal Deposit Insurance Corporation 22631 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: agency_abbr agency_abbr OCC 363756 HUD 214451 OTS 199948 FRS 173886 NCUA 34779 FDIC 22631 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: loan_type_name loan_type_name Conventional 966120 FHA-insured 39128 VA-guaranteed 3573 FSA/RHS-guaranteed 630 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: property_type_name property_type_name One-to-four family dwelling (other than manufactured housing) 990336 Manufactured housing 13427 Multifamily dwelling 5688 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: loan_purpose_name loan_purpose_name Refinancing 490921 Home purchase 393021 Home improvement 125509 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: owner_occupancy_name owner_occupancy_name Owner-occupied as a principal dwelling 930351 Not owner-occupied as a principal dwelling 73039 Not applicable 6061 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: preapproval_name preapproval_name Not applicable 841011 Preapproval was not requested 148534 Preapproval was requested 19906 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: action_taken_name action_taken_name Loan originated 398639 Application denied by financial institution 265860 Loan purchased by the institution 125325 Application withdrawn by applicant 98667 Application approved but not accepted 88070 File closed for incompleteness 32846 Preapproval request denied by financial institution 39 Preapproval request approved but not accepted 5 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: msamd_name msamd_name New York, White Plains, Wayne - NY, NJ 373622 Nassau, Suffolk - NY 205507 Rochester - NY 61609 Albany, Schenectady, Troy - NY 61400 Buffalo, Niagara Falls - NY 58655 Poughkeepsie, Newburgh, Middletown - NY 51484 Syracuse - NY 40105 Utica, Rome - NY 17609 Binghamton - NY 13557 Kingston - NY 12881 Glens Falls - NY 11108 Elmira - NY 5087 Ithaca - NY 3907 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: state_name state_name New York 1009451 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: state_abbr state_abbr NY 1009451 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: county_name county_name Suffolk County 119892 Queens County 106680 Nassau County 85631 Kings County 85527 Westchester County 49237 Erie County 47096 Monroe County 41702 New York County 41171 Bronx County 34588 Richmond County 32395 Orange County 29681 Onondaga County 28400 Dutchess County 21820 Albany County 18913 Saratoga County 16946 Rockland County 16734 Oneida County 13488 Ulster County 13010 Schenectady County 12071 Niagara County 11564 Rensselaer County 11077 Broome County 10491 Ontario County 7473 Putnam County 7351 Oswego County 7215 Jefferson County 6461 Sullivan County 6357 Chautauqua County 6328 Wayne County 6204 Warren County 5702 Washington County 5419 Chemung County 5088 Steuben County 5076 Cayuga County 4923 Clinton County 4829 Madison County 4494 Fulton County 4355 St. Lawrence County 4353 Greene County 4315 Columbia County 4201 Herkimer County 4124 Cattaraugus County 4105 Otsego County 4051 Tompkins County 3910 Livingston County 3628 Montgomery County 3538 Delaware County 3189 Tioga County 3069 Genesee County 3034 Chenango County 2935 Essex County 2904 Franklin County 2746 Orleans County 2637 Cortland County 2615 Schoharie County 2410 Allegany County 2285 Wyoming County 2169 Seneca County 2021 Yates County 1535 Lewis County 1518 Schuyler County 1285 Hamilton County 478 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: applicant_ethnicity_name applicant_ethnicity_name Not Hispanic or Latino 701712 Information not provided by applicant in mail, Internet, or telephone application 163345 Hispanic or Latino 83748 Not applicable 60646 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: co_applicant_ethnicity_name co_applicant_ethnicity_name No co-applicant 601274 Not Hispanic or Latino 269823 Information not provided by applicant in mail, Internet, or telephone application 66060 Not applicable 47780 Hispanic or Latino 24514 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: applicant_race_name_1 applicant_race_name_1 White 610172 Information not provided by applicant in mail, Internet, or telephone application 171080 Black or African American 110814 Not applicable 60199 Asian 46654 American Indian or Alaska Native 5342 Native Hawaiian or Other Pacific Islander 5190 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: applicant_race_name_2 applicant_race_name_2 White 2051 Black or African American 422 Native Hawaiian or Other Pacific Islander 261 Asian 173 American Indian or Alaska Native 114 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: applicant_race_name_3 applicant_race_name_3 White 82 Black or African American 76 Native Hawaiian or Other Pacific Islander 18 American Indian or Alaska Native 13 Asian 3 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: applicant_race_name_4 applicant_race_name_4 Native Hawaiian or Other Pacific Islander 58 White 17 Asian 7 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: applicant_race_name_5 applicant_race_name_5 White 56 Native Hawaiian or Other Pacific Islander 5 Asian 2 American Indian or Alaska Native 2 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: co_applicant_race_name_1 co_applicant_race_name_1 No co-applicant 601274 White 248732 Information not provided by applicant in mail, Internet, or telephone application 68896 Not applicable 46507 Black or African American 27264 Asian 13513 American Indian or Alaska Native 1652 Native Hawaiian or Other Pacific Islander 1613 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: co_applicant_race_name_2 co_applicant_race_name_2 White 665 Black or African American 101 Native Hawaiian or Other Pacific Islander 82 Asian 56 American Indian or Alaska Native 33 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: co_applicant_race_name_3 co_applicant_race_name_3 White 23 Black or African American 22 Native Hawaiian or Other Pacific Islander 10 American Indian or Alaska Native 3 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: co_applicant_race_name_4 co_applicant_race_name_4 Native Hawaiian or Other Pacific Islander 14 White 12 Asian 5 Black or African American 3 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: co_applicant_race_name_5 co_applicant_race_name_5 White 15 Asian 4 Native Hawaiian or Other Pacific Islander 3 American Indian or Alaska Native 2 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: applicant_sex_name applicant_sex_name Male 548444 Female 313848 Information not provided by applicant in mail, Internet, or telephone application 86910 Not applicable 60249 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: co_applicant_sex_name co_applicant_sex_name No co-applicant 601274 Female 235906 Male 86243 Not applicable 47369 Information not provided by applicant in mail, Internet, or telephone application 38659 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: purchaser_type_name purchaser_type_name Loan was not originated or was not sold in calendar year covered by register 690764 Affiliate institution 79082 Fannie Mae (FNMA) 75053 Freddie Mac (FHLMC) 47531 Other type of purchaser 34667 Life insurance company, credit union, mortgage bank, or finance company 24912 Private securitization 21975 Commercial bank, savings bank or savings association 20277 Ginnie Mae (GNMA) 15139 Farmer Mac (FAMC) 51 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: denial_reason_name_1 denial_reason_name_1 Credit history 42911 Debt-to-income ratio 31689 Credit application incomplete 30550 Collateral 29969 Other 29472 Unverifiable information 11298 Insufficient cash (downpayment, closing costs) 2672 Employment history 1748 Mortgage insurance denied 162 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: denial_reason_name_2 denial_reason_name_2 Other 10141 Credit history 8133 Debt-to-income ratio 6406 Collateral 4863 Unverifiable information 2918 Credit application incomplete 2445 Insufficient cash (downpayment, closing costs) 2314 Employment history 716 Mortgage insurance denied 137 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: denial_reason_name_3 denial_reason_name_3 Other 2508 Credit history 1007 Collateral 977 Unverifiable information 624 Debt-to-income ratio 579 Insufficient cash (downpayment, closing costs) 567 Credit application incomplete 535 Employment history 103 Mortgage insurance denied 54 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: hoepa_status_name hoepa_status_name Not a HOEPA loan 1009337 HOEPA loan 114 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: lien_status_name lien_status_name Secured by a first lien 694718 Secured by a subordinate lien 159095 Not applicable 125325 Not secured by a lien 30313 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ Value counts for: edit_status_name edit_status_name Quality edit failure only 190210 Name: count, dtype: int64 =+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
Step 3: Data cleaning¶
A) Excluding irrelevant records from columns¶
# Records/columns
df.shape
(1009451, 78)
# Quality issues: There are 190210 records with quality issues. We will ommit them.
df = df[~df["edit_status_name"].isin(["Quality edit failure only"])]
# Our focus is on individuals hence we will focus on data with "No co-applicant"
df = df[df['co_applicant_ethnicity_name'] == 'No co-applicant']
# Records/shape
df.shape
(472701, 78)
B) Excluding irrelevant records from the following columns:¶
- applicant_ethnicity_name
- applicant_race_name_1
- applicant_sex_name
- action_taken_name
df = df[
~df["action_taken_name"].isin([
"File closed for incompleteness",
"Preapproval request denied by financial institution",
"Preapproval request approved but not accepted",
"Application withdrawn by applicant"
]) &
~df["applicant_ethnicity_name"].isin([
"Information not provided by applicant in mail, Internet, or telephone application",
"Not applicable"
]) &
~df["applicant_race_name_1"].isin([
"Information not provided by applicant in mail, Internet, or telephone application",
"Not applicable"
]) &
~df["applicant_sex_name"].isin([
"Information not provided by applicant in mail, Internet, or telephone application",
"Not applicable"
])
]
print(df.shape)
(323044, 78)
C) Addressing missing values¶
print(df.isnull().sum())
as_of_year 0 respondent_id 0 agency_name 0 agency_abbr 0 agency_code 0 loan_type_name 0 loan_type 0 property_type_name 0 property_type 0 loan_purpose_name 0 loan_purpose 0 owner_occupancy_name 0 owner_occupancy 0 loan_amount_000s 0 preapproval_name 0 preapproval 0 action_taken_name 0 action_taken 0 msamd_name 34069 msamd 34069 state_name 0 state_abbr 0 state_code 0 county_name 357 county_code 357 census_tract_number 460 applicant_ethnicity_name 0 applicant_ethnicity 0 co_applicant_ethnicity_name 0 co_applicant_ethnicity 0 applicant_race_name_1 0 applicant_race_1 0 applicant_race_name_2 321711 applicant_race_2 321711 applicant_race_name_3 322955 applicant_race_3 322955 applicant_race_name_4 323019 applicant_race_4 323019 applicant_race_name_5 323027 applicant_race_5 323027 co_applicant_race_name_1 0 co_applicant_race_1 0 co_applicant_race_name_2 323044 co_applicant_race_2 323044 co_applicant_race_name_3 323044 co_applicant_race_3 323044 co_applicant_race_name_4 323044 co_applicant_race_4 323044 co_applicant_race_name_5 323044 co_applicant_race_5 323044 applicant_sex_name 0 applicant_sex 0 co_applicant_sex_name 0 co_applicant_sex 0 applicant_income_000s 2719 purchaser_type_name 0 purchaser_type 0 denial_reason_name_1 251827 denial_reason_1 251827 denial_reason_name_2 308294 denial_reason_2 308294 denial_reason_name_3 320397 denial_reason_3 320397 rate_spread 295313 hoepa_status_name 0 hoepa_status 0 lien_status_name 0 lien_status 0 edit_status_name 323044 edit_status 323044 sequence_number 0 population 507 minority_population 511 hud_median_family_income 484 tract_to_msamd_income 585 number_of_owner_occupied_units 1092 number_of_1_to_4_family_units 967 application_date_indicator 0 dtype: int64
Filling and dropping¶
df = df.assign(
msamd_name=df['msamd_name'].fillna("Unknown"),
denial_reason_name_1=df['denial_reason_name_1'].fillna("Unknown"),
rate_spread=df['rate_spread'].fillna(0)
)
df = df.dropna(subset=['county_name',
'census_tract_number',
'applicant_income_000s',
'minority_population',
'hud_median_family_income',
'tract_to_msamd_income',
'number_of_owner_occupied_units',
'number_of_1_to_4_family_units'
]
)
print(df.isnull().sum())
as_of_year 0 respondent_id 0 agency_name 0 agency_abbr 0 agency_code 0 loan_type_name 0 loan_type 0 property_type_name 0 property_type 0 loan_purpose_name 0 loan_purpose 0 owner_occupancy_name 0 owner_occupancy 0 loan_amount_000s 0 preapproval_name 0 preapproval 0 action_taken_name 0 action_taken 0 msamd_name 0 msamd 33468 state_name 0 state_abbr 0 state_code 0 county_name 0 county_code 0 census_tract_number 0 applicant_ethnicity_name 0 applicant_ethnicity 0 co_applicant_ethnicity_name 0 co_applicant_ethnicity 0 applicant_race_name_1 0 applicant_race_1 0 applicant_race_name_2 317520 applicant_race_2 317520 applicant_race_name_3 318752 applicant_race_3 318752 applicant_race_name_4 318816 applicant_race_4 318816 applicant_race_name_5 318824 applicant_race_5 318824 co_applicant_race_name_1 0 co_applicant_race_1 0 co_applicant_race_name_2 318841 co_applicant_race_2 318841 co_applicant_race_name_3 318841 co_applicant_race_3 318841 co_applicant_race_name_4 318841 co_applicant_race_4 318841 co_applicant_race_name_5 318841 co_applicant_race_5 318841 applicant_sex_name 0 applicant_sex 0 co_applicant_sex_name 0 co_applicant_sex 0 applicant_income_000s 0 purchaser_type_name 0 purchaser_type 0 denial_reason_name_1 0 denial_reason_1 248033 denial_reason_name_2 304184 denial_reason_2 304184 denial_reason_name_3 316221 denial_reason_3 316221 rate_spread 0 hoepa_status_name 0 hoepa_status 0 lien_status_name 0 lien_status 0 edit_status_name 318841 edit_status 318841 sequence_number 0 population 0 minority_population 0 hud_median_family_income 0 tract_to_msamd_income 0 number_of_owner_occupied_units 0 number_of_1_to_4_family_units 0 application_date_indicator 0 dtype: int64
D) Mapping action_taken_name into "approved" & "denied"¶
print(df["action_taken_name"].value_counts())
action_taken_name Loan originated 148863 Application denied by financial institution 108294 Application approved but not accepted 31152 Loan purchased by the institution 30532 Name: count, dtype: int64
atn_map = {
"Loan originated" : 'approved',
"Application approved but not accepted" : 'approved',
"Loan purchased by the institution" : 'approved',
"Application denied by financial institution" : 'denied'
}
df["action_taken_name"] = df["action_taken_name"].map(atn_map)
print(df["action_taken_name"].value_counts())
action_taken_name approved 210547 denied 108294 Name: count, dtype: int64
print(df.columns)
Index(['as_of_year', 'respondent_id', 'agency_name', 'agency_abbr', 'agency_code', 'loan_type_name', 'loan_type', 'property_type_name', 'property_type', 'loan_purpose_name', 'loan_purpose', 'owner_occupancy_name', 'owner_occupancy', 'loan_amount_000s', 'preapproval_name', 'preapproval', 'action_taken_name', 'action_taken', 'msamd_name', 'msamd', 'state_name', 'state_abbr', 'state_code', 'county_name', 'county_code', 'census_tract_number', 'applicant_ethnicity_name', 'applicant_ethnicity', 'co_applicant_ethnicity_name', 'co_applicant_ethnicity', 'applicant_race_name_1', 'applicant_race_1', 'applicant_race_name_2', 'applicant_race_2', 'applicant_race_name_3', 'applicant_race_3', 'applicant_race_name_4', 'applicant_race_4', 'applicant_race_name_5', 'applicant_race_5', 'co_applicant_race_name_1', 'co_applicant_race_1', 'co_applicant_race_name_2', 'co_applicant_race_2', 'co_applicant_race_name_3', 'co_applicant_race_3', 'co_applicant_race_name_4', 'co_applicant_race_4', 'co_applicant_race_name_5', 'co_applicant_race_5', 'applicant_sex_name', 'applicant_sex', 'co_applicant_sex_name', 'co_applicant_sex', 'applicant_income_000s', 'purchaser_type_name', 'purchaser_type', 'denial_reason_name_1', 'denial_reason_1', 'denial_reason_name_2', 'denial_reason_2', 'denial_reason_name_3', 'denial_reason_3', 'rate_spread', 'hoepa_status_name', 'hoepa_status', 'lien_status_name', 'lien_status', 'edit_status_name', 'edit_status', 'sequence_number', 'population', 'minority_population', 'hud_median_family_income', 'tract_to_msamd_income', 'number_of_owner_occupied_units', 'number_of_1_to_4_family_units', 'application_date_indicator'], dtype='object')
E) Feature engineering; Feature creation¶
# Creation of ethnicity_race_sex column:
df['ethnicity_race_sex'] = df['applicant_ethnicity_name'].str.lower() + "_" + df['applicant_race_name_1'].str.lower() + "_" + df['applicant_sex_name'].str.lower()
# Checking column created
print(df[['ethnicity_race_sex']].value_counts())
ethnicity_race_sex not hispanic or latino_white_male 126540 not hispanic or latino_white_female 84557 not hispanic or latino_black or african american_female 27181 not hispanic or latino_black or african american_male 22053 hispanic or latino_white_male 17879 not hispanic or latino_asian_male 13608 hispanic or latino_white_female 11132 not hispanic or latino_asian_female 8975 not hispanic or latino_american indian or alaska native_male 1116 hispanic or latino_black or african american_male 937 not hispanic or latino_native hawaiian or other pacific islander_male 922 hispanic or latino_black or african american_female 873 not hispanic or latino_native hawaiian or other pacific islander_female 741 not hispanic or latino_american indian or alaska native_female 725 hispanic or latino_native hawaiian or other pacific islander_male 436 hispanic or latino_american indian or alaska native_male 385 hispanic or latino_native hawaiian or other pacific islander_female 242 hispanic or latino_american indian or alaska native_female 241 hispanic or latino_asian_male 175 hispanic or latino_asian_female 123 Name: count, dtype: int64
Mapping "ethnicity_race_sex" for better readability:¶
# Mapping "ethnicity_race_sex" for better readitability:
ers_map = {
"not hispanic or latino_white_male": "NH_White_M",
"not hispanic or latino_white_female": "NH_White_F",
"not hispanic or latino_black or african american_female": "NH_Black_F",
"not hispanic or latino_black or african american_male": "NH_Black_M",
"hispanic or latino_white_male": "H_White_M",
"not hispanic or latino_asian_male": "NH_Asian_M",
"hispanic or latino_white_female": "H_White_F",
"not hispanic or latino_asian_female": "NH_Asian_F",
"not hispanic or latino_american indian or alaska native_male": "NH_Indigenous_M",
"hispanic or latino_black or african american_male": "H_Black_M",
"not hispanic or latino_native hawaiian or other pacific islander_male": "NH_PacificIslander_M",
"hispanic or latino_black or african american_female": "H_Black_F",
"not hispanic or latino_native hawaiian or other pacific islander_female": "NH_PacificIslander_F",
"not hispanic or latino_american indian or alaska native_female": "NH_Indigenous_F",
"hispanic or latino_native hawaiian or other pacific islander_male": "H_PacificIslander_M",
"hispanic or latino_american indian or alaska native_male": "H_Indigenous_M",
"hispanic or latino_native hawaiian or other pacific islander_female": "H_PacificIslander_F",
"hispanic or latino_american indian or alaska native_female": "H_Indigenous_F",
"hispanic or latino_asian_male": "H_Asian_M",
"hispanic or latino_asian_female": "H_Asian_F"
}
df["ethnicity_race_sex"] = df["ethnicity_race_sex"].map(ers_map)
print(df.shape, df["ethnicity_race_sex"].value_counts())
(318841, 79) ethnicity_race_sex NH_White_M 126540 NH_White_F 84557 NH_Black_F 27181 NH_Black_M 22053 H_White_M 17879 NH_Asian_M 13608 H_White_F 11132 NH_Asian_F 8975 NH_Indigenous_M 1116 H_Black_M 937 NH_PacificIslander_M 922 H_Black_F 873 NH_PacificIslander_F 741 NH_Indigenous_F 725 H_PacificIslander_M 436 H_Indigenous_M 385 H_PacificIslander_F 242 H_Indigenous_F 241 H_Asian_M 175 H_Asian_F 123 Name: count, dtype: int64
Step 4: Exploratory Data Analysis¶
# horizontal barplot
def value_counts_bar(col, title):
vc = col.value_counts()
plt.figure(figsize=(6, len(vc.keys())/2))
plt.barh(vc.keys(), vc.values)
plt.title(title, fontsize=15)
plt.xticks(rotation='vertical', fontsize=10)
plt.show()
# https://q-shick.github.io/portfolio/proj_hmda.html by By Kyoosik Kim
value_counts_bar(df.action_taken_name, "Action Taken")
value_counts_bar(df.loan_purpose_name, "Loan Purpose")
What is the proportion of "denials" to "approvals" for each record-group from the column "ethnicity_race_sex"?¶
# Calculating and plotting the distribution of action_taken for each ethnicity_race_sex group
action_distribution = df.groupby(['ethnicity_race_sex', 'action_taken_name']).size().unstack(fill_value=0)
action_distribution_pct = action_distribution.div(action_distribution.sum(axis=1), axis=0)
# Sort the data by 'approved' column
sorted_index = action_distribution_pct['approved'].sort_values(ascending=True).index
action_distribution_pct_sorted = action_distribution_pct.loc[sorted_index]
# Display
plt.figure(figsize=(20, 10))
action_distribution_pct_sorted.plot(kind='barh', stacked=True)
plt.title('Applications approved - denied')
plt.xlabel('Proportion')
plt.ylabel('Ethnicity-Race-Sex')
plt.legend(title='Action Taken', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.subplots_adjust(right=0.8)
plt.show()
#https://stackoverflow.com/questions/25068384/bbox-to-anchor-and-loc-in-matplotlib
#https://www.geeksforgeeks.org/python-pandas-dataframe-div/
<Figure size 2000x1000 with 0 Axes>
What is the total % of applications for each ethnicity_race_sex group?¶
# Total number of applications (approved + denied) for each ethnicity_race_sex group
total_applications = df.groupby('ethnicity_race_sex')['action_taken_name'].count()
# % distribution of total applications
total_applications_pct = (total_applications / total_applications.sum()) * 100
# Sorting data
total_applications_pct_sorted = total_applications_pct.sort_values(ascending=True)
# Plotting results
plt.figure(figsize=(20, 10))
ax = total_applications_pct_sorted.plot(kind='barh', color='firebrick')
# Labels
plt.title('Distribution of Loan Applications by Ethnicity-Race-Sex')
plt.xlabel('Percentage of Total Applications')
plt.ylabel('Ethnicity-Race-Sex')
# % on bars
for i, v in enumerate(total_applications_pct_sorted):
ax.text(v + 0.5, i, f'{v:.1f}%', va='center')
plt.tight_layout()
plt.show()
# Filtering out "not hispanic or latino_white_female" and "not hispanic or latino_white_male"
total_applications = df.groupby('ethnicity_race_sex')['action_taken_name'].count()
# Total number of applications (approved + denied) for each ethnicity_race_sex group
total_applications_pct = (total_applications / total_applications.sum()) * 100
# Filtering out "not hispanic or latino_white_female" and "not hispanic or latino_white_male" from visualization
visualization_pct = total_applications_pct.drop(['NH_White_M', 'NH_White_F'])
# Sortting data
visualization_pct_sorted = visualization_pct.sort_values(ascending=True)
# Plotting
plt.figure(figsize=(20, 10))
ax = visualization_pct_sorted.plot(kind='barh', color='indianred')
# Labels
plt.title('Distribution of Loan Applications by Ethnicity-Race-Sex (Excluding NH_White_M, NH_White_F')
plt.xlabel('Percentage of Total Applications')
plt.ylabel('Ethnicity-Race-Sex')
# % for each bar
for i, v in enumerate(visualization_pct_sorted):
ax.text(v + 0.1, i, f'{v:.1f}%', va='center')
plt.tight_layout()
plt.show()
What are the approval rates?¶
df['approved'] = (df['action_taken_name'] == 'approved').astype(int)
approval_rates = df.groupby('ethnicity_race_sex')['approved'].mean().sort_values(ascending=True)
plt.figure(figsize=(20, 10))
ax = approval_rates.plot(kind='barh')
plt.title('Approval Rates by Ethnicity-Race-Sex Intersections')
plt.xlabel('Approval Rate')
plt.ylabel('Ethnicity-Race-Sex')
plt.xlim(0, 1) # Set x-axis limits from 0 to 1 for percentage
# Adding percentage labels
for i, v in enumerate(approval_rates):
ax.text(v, i, f' {v:.1%}', va='center')
plt.tight_layout()
plt.show()
#https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.text.html
#https://github.com/DJ4seasons/Python_Basic4beginners
Outliers¶
categorical_columns = ['loan_type_name', 'loan_purpose_name', 'property_type_name', 'lien_status_name']
numerical_columns = ['loan_amount_000s', 'applicant_income_000s', 'minority_population',
'hud_median_family_income', 'tract_to_msamd_income']
target_column = 'action_taken_name'
# Plotting categorical columns
for col in categorical_columns:
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x=col, hue=target_column)
plt.title(f'Distribution of {col} against {target_column}')
plt.xticks(rotation=45)
plt.show()
# Plotting numerical columns
for col in numerical_columns:
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x=target_column, y=col)
plt.title(f'Distribution of {col} against {target_column}')
plt.xticks(rotation=45)
plt.show()
#https://seaborn.pydata.org/generated/seaborn.countplot.html
#https://seaborn.pydata.org/generated/seaborn.boxplot.html
# Percentile bounds for applicant_income_000s
income_lower_bound = np.percentile(df['applicant_income_000s'], 1)
income_upper_bound = np.percentile(df['applicant_income_000s'], 99)
# Percentile bounds for loan_amount_000s
loan_lower_bound = np.percentile(df['loan_amount_000s'], 1)
loan_upper_bound = np.percentile(df['loan_amount_000s'], 99)
# Percentile bounds for hud_median_family_income
hud_income_lower_bound = np.percentile(df['hud_median_family_income'], 1)
hud_income_upper_bound = np.percentile(df['hud_median_family_income'], 99)
# Apply filtering to all three
df = df[
(df['applicant_income_000s'] >= income_lower_bound) &
(df['applicant_income_000s'] <= income_upper_bound) &
(df['loan_amount_000s'] >= loan_lower_bound) &
(df['loan_amount_000s'] <= loan_upper_bound) &
(df['hud_median_family_income'] >= hud_income_lower_bound) &
(df['hud_median_family_income'] <= hud_income_upper_bound)
]
Step 5: Feature engineering¶
A) Feature creation¶
# Creation of loan_to_income_ratio column:
df['loan_to_income_ratio'] = (df['loan_amount_000s'] / df['applicant_income_000s']).round(2)
# Checking
print(df['loan_to_income_ratio'].head())
# Note: Figures below represent the times of their annual income.
# E.g. 3.10, means that the loan amount is 3.10 times their annual income.
0 3.38 2 0.85 5 3.07 6 1.76 7 1.33 Name: loan_to_income_ratio, dtype: float64
B) Feature selection¶
df = df[[
"loan_type_name",
"property_type_name",
"loan_purpose_name",
"loan_amount_000s",
"action_taken_name",
"msamd_name",
"census_tract_number",
"hoepa_status_name",
"applicant_income_000s",
"denial_reason_name_1",
"rate_spread",
"lien_status_name",
"hud_median_family_income",
"tract_to_msamd_income",
"minority_population",
"number_of_owner_occupied_units",
"ethnicity_race_sex",
'loan_to_income_ratio'
]]
df['income_sqrt'] = np.sqrt(df['applicant_income_000s'])
df['loan_amount_sqrt'] = np.sqrt(df['loan_amount_000s'])
# https://q-shick.github.io/portfolio/proj_hmda.html
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for action in df['action_taken_name'].unique():
sns.kdeplot(
data=df[df['action_taken_name'] == action],
x='income_sqrt',
ax=axes[0],
label=action
)
sns.kdeplot(
data=df[df['action_taken_name'] == action],
x='loan_amount_sqrt',
ax=axes[1],
label=action
)
axes[0].set_title("Income (sqrt) by Loan Decision")
axes[1].set_title("Loan Amount (sqrt) by Loan Decision")
axes[1].legend(bbox_to_anchor=(1.05, 0.6))
plt.tight_layout()
plt.show()
# https://q-shick.github.io/portfolio/proj_hmda.html
dist_0 = df[df['action_taken_name']=='approved']['income_sqrt']
dist_1 = df[df['action_taken_name']=='denied']['income_sqrt']
stat, p_value = stats.ttest_ind(dist_0, dist_1, equal_var=False)
print("Stat/P-value: (", round(stat, 2), ',', p_value, ")")
Stat/P-value: ( 25.6 , 2.3018618003693452e-144 )
from scipy import stats
# Split the data
dist_0 = df[df['action_taken_name'] == 'approved']['income_sqrt']
dist_1 = df[df['action_taken_name'] == 'denied']['income_sqrt']
# Run Welch’s t-test (assumes unequal variances)
stat, p_value = stats.ttest_ind(dist_0, dist_1, equal_var=False)
# Print results
print(f"Stat: {stat:.2f}, P-value: {p_value:.4f}")
Stat: 25.60, P-value: 0.0000