# Lending Club Data Analysis

LendingClub is a US peer-to-peer lending company headquartered in San Francisco, California, that has helped over 2.5 million customers simplify their finances in the last 10 years. LendingClub improves the loan process for borrowers by offering a fast and easy online application. For investors, the company offers historical returns of 3% to 8%, and anyone can invest with as little as \$1000.

Because LendingClub relies heavily on technology to evaluate its borrowers, getting an accurate risk analysis for each applicant requires systems that can quickly assess the applications and, upon approval, offer these loans to interested investors at a given interest rate. Of the \$7.9 billion loaned in 2018, \$233 million was written off as defaulted loans. Although this is only about 2.9% of the total, it represents risks and losses that investors and the company would prefer to avoid. To mitigate risk, lending companies traditionally apply a fitting interest rate to each loan. For example, loans for a home or a car may have lower interest rates because the risk is reduced by directly related collateral. Conversely, someone with a poor credit history or a declared bankruptcy may be charged a higher interest rate because of the inherent risk of history repeating.

LendingClub provides an anonymized data set of all their current and completed loans, available for download on their website. Our goal was to use the data set to understand which data points may contribute to the interest rate assigned to each loan. We reviewed the data set of 107,000+ observations (Appendix 1) and the accompanying data dictionary (Appendix 2) of the ~120 data categories included for each loan. We decided to build a model that included 8 categories, resulting in 9 independent variables because home ownership is represented by two dummy variables, to test whether there is any relationship between these independent variables and the interest rate on the loan. This relationship can be described by the basic model in Equation (1) below:

Int_rate = b0 – b1(loan_amount) + b2(funded_amount) – b3(annual_income) – b4(rent) – b5(last_payment_amount) + b6(debt_to_income_ratio) + b7(open_accounts) – b8(total_accounts) – b9(mortgage) (1)

Table 1: Independent variables include

Note: Home ownership was divided into two dummy variables, mortgage and rent. Therefore, if both values are zero, the observation is for an individual who owns property outright.
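The dummy-variable encoding described in the note can be sketched as follows. This is a minimal illustration only: the function name and the status labels `"MORTGAGE"`, `"RENT"`, and `"OWN"` are assumptions for the example, not values confirmed from the data set.

```python
from typing import Tuple


def encode_home_ownership(status: str) -> Tuple[int, int]:
    """Return (mortgage, rent) dummy values for one observation.

    Assumed labels: "MORTGAGE", "RENT", and "OWN" (hypothetical).
    Any status other than mortgage or rent maps to (0, 0), which is
    the baseline: the borrower owns the property outright.
    """
    normalized = status.strip().upper()
    mortgage = 1 if normalized == "MORTGAGE" else 0
    rent = 1 if normalized == "RENT" else 0
    return mortgage, rent


# Encode a few example statuses.
for label in ("MORTGAGE", "RENT", "OWN"):
    print(label, encode_home_ownership(label))
```

Using two dummies for three ownership states avoids perfect collinearity with the intercept; the omitted state acts as the reference group.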

Table 2: Results Summary

Analysis of results

Looking at the overall results, we can take away a few points. After running our regression analysis on the data set, we found an R² of 0.0308 using the stated independent variables. This is not a strong R² value for cross-sectional data; in general, the acceptable to excellent range for this type of data would be from 0.3 to 0.7. However, we can take a deeper look at the results to see if we can deduce any other information. Next, we look at the F-test, which tells us whether the independent variables, as a group, explain a statistically significant share of the variation in the dependent variable. Our results included a calculated F value, and the test is shown below:

F-TEST:

Null hypothesis H0: R² = 0

Alternative hypothesis HA: R² > 0

The null can be stated alternatively as the model has no explanatory power; the alternative is then that the model has explanatory power.

If the value of F Calculated is greater than or equal to the F Critical value of 1.88 (from the F table, 9 DoF x ∞ DoF), we can reject H0. If the value of F Calculated is less than F Critical, we fail to reject H0.

In looking at the results, we find a calculated F value of 379 which is greater than the table value of 1.88. Therefore, we can reject H0. This implies that the model has explanatory power; however, we must look at several other factors before validating the model.
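As a rough cross-check, the F statistic can be recovered from R² using the standard relationship F = (R²/k) / ((1 − R²)/(n − k − 1)). The sketch below assumes n = 107,000 observations (the report only states "107,000+") and k = 9 independent variables, so the result should land close to, but not exactly on, the reported value of 379.

```python
def f_statistic(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Overall F statistic of a linear regression, computed from R^2."""
    explained = r_squared / n_predictors
    unexplained = (1.0 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained


F_CRITICAL = 1.88  # table value quoted in the text (9 DoF x infinite DoF)

# n_obs is an assumption: the exact observation count is not given.
f_calculated = f_statistic(r_squared=0.0308, n_obs=107_000, n_predictors=9)
reject_h0 = f_calculated >= F_CRITICAL

print(round(f_calculated, 1), reject_h0)
```

This also shows why a very low R² can still pass the F-test: with over 100,000 observations, the denominator becomes extremely small.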

T-TEST:

Our next step is to look at each independent variable and its relationship to interest rate. This can be validated through a T-test, as shown below. These calculated t values shown in Table 3 are compared to the threshold value of 2.262.

Table 3: Estimated regression coefficients of interest rate relating to various variables

Null hypothesis H0: βi = 0

Alternative Hypothesis HA: βi ≠ 0

t Calculated = (coefficient of variable being checked) / (Std Error of coefficient)

If the absolute value of t Calculated is greater than or equal to the t Critical value of 2.262 (from the t table, 9 DoF x 0.05), we can reject H0. If the absolute value of t Calculated is less than t Critical, we fail to reject H0.

Comparing the t Critical value to each independent variable, we can see that we would reject H0 for annual income, home ownership status (rent/mortgage), last payment amount, debt to income ratio and both open and total accounts. On the other hand, we would fail to reject H0 for the variables loan amount and funded amount. These are calculated at the 95% confidence level.
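The decision rule above can be sketched in a few lines. The coefficient and standard-error pairs below are hypothetical placeholders chosen only to show one rejection and one failure to reject; they are not values from Table 3.

```python
T_CRITICAL = 2.262  # threshold used in the text (9 DoF, 0.05)


def t_statistic(coefficient: float, std_error: float) -> float:
    """t Calculated: the estimated coefficient divided by its standard error."""
    return coefficient / std_error


def rejects_null(coefficient: float, std_error: float,
                 t_critical: float = T_CRITICAL) -> bool:
    """True when |t Calculated| >= t Critical, i.e. H0: beta_i = 0 is rejected."""
    return abs(t_statistic(coefficient, std_error)) >= t_critical


# Hypothetical (coefficient, standard error) pairs, for illustration only.
examples = {
    "variable_a": (-0.50, 0.10),  # t = -5.0
    "variable_b": (0.21, 0.10),   # t is about 2.1
}

for name, (coef, se) in examples.items():
    print(name, round(t_statistic(coef, se), 2), rejects_null(coef, se))
```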

P-TEST:

A third test that can be used to validate the model is the P test.

Null hypothesis H0: βi = 0

Alternative Hypothesis HA: βi ≠ 0

If the P value is less than or equal to the P critical value of 0.05, we can reject H0. If the P value is greater than 0.05, we fail to reject H0.

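In reviewing the P values in Table 3, we can see that they all fall below the threshold of 0.05, corresponding to a confidence level of 95% or above. This implies that all variables are relevant, contradicting some of our results from the T test shown above. Because each P value is derived from the same t statistic, the two tests should agree when they use the same degrees of freedom. The discrepancy most likely arises because the t Critical value of 2.262 is based on 9 DoF, whereas P values in regression output are normally based on the residual degrees of freedom (here over 100,000, for which t Critical is approximately 1.96).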
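One way to see how the T test and P test can disagree is to convert a t value into a two-sided P value. The sketch below assumes the P values come from a t distribution with very many degrees of freedom, which is approximated here by the standard normal distribution. The t values in the loop are hypothetical, not taken from Table 3.

```python
import math


def two_sided_p_value(t_value: float) -> float:
    """Two-sided P value under a standard normal approximation.

    Assumption: the degrees of freedom are large enough that the
    t distribution is effectively normal.
    """
    return math.erfc(abs(t_value) / math.sqrt(2.0))


P_CRITICAL = 0.05
T_CRITICAL_9_DOF = 2.262  # threshold used in the text

# Hypothetical t values, for illustration only.
for t in (1.5, 2.1, 3.0):
    p = two_sided_p_value(t)
    passes_p_test = p <= P_CRITICAL
    passes_t_test = abs(t) >= T_CRITICAL_9_DOF
    print(t, round(p, 4), passes_p_test, passes_t_test)
```

Under this approximation, any variable with |t| between roughly 1.96 and 2.262 passes the P test but fails a T test that uses the 9 DoF threshold.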

ELASTICITIES:

Elasticities were calculated to determine the magnitude of the effect of each independent variable on the dependent variable. Elasticity is defined as the percent change in the dependent variable that results from a one percent change in the independent variable. The elasticities and formula are shown below in Table 4.

Table 4: Elasticities

According to Table 4 above, the loan amount and funded amount have the greatest impact on the interest rate, as measured by the largest absolute elasticity. The anomaly is that the loan amount has a negative elasticity while the funded amount has a positive one. This does not make sense given that most of the observations have the same value for these two variables. Taken on its own, a negative elasticity for loan amount is plausible because a large loan generally leans towards something that contains equity, such as a car or house. These types of loans typically have lower interest rates and lower risk due to the availability of collateral. By contrast, an account such as a credit card generally has a small balance and a high interest rate due to high risk and no collateral. Other interesting trends are the elasticities associated with open accounts and total accounts. A positive elasticity on open accounts is intuitive because if a person has many open accounts (large amounts of debt), risk is increased. Conversely, a person with a large number of total accounts is assumed to have a long credit history and may be considered less of a risk, therefore showing a negative elasticity.
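For readers who want to reproduce an elasticity, here is a minimal sketch. Since the formula from Table 4 is not reproduced in this text, the sketch assumes the common point elasticity evaluated at the sample means, that is, coefficient × mean(x) / mean(y). The coefficient and data values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def elasticity_at_means(coefficient: float,
                        x_values: Sequence[float],
                        y_values: Sequence[float]) -> float:
    """Point elasticity of y with respect to x, evaluated at the sample means.

    Assumed formula (not confirmed against Table 4):
        elasticity = coefficient * mean(x) / mean(y)
    """
    return coefficient * mean(x_values) / mean(y_values)


# Hypothetical data: debt-to-income ratios and interest rates (percent).
dti = [10.0, 20.0, 30.0]
int_rate = [10.0, 12.0, 14.0]
hypothetical_coefficient = 0.06

print(elasticity_at_means(hypothetical_coefficient, dti, int_rate))
```

An elasticity computed this way is unit-free, which is what allows variables measured on very different scales, such as loan amount and number of accounts, to be compared by absolute value.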

Conclusion

In conclusion, the model described above has some characteristics that are intuitive given the input variables used; however, the overall model is weak. The R² value is very low for cross-sectional data despite the model having passed the F-test. All of the variables pass the P-test, but two of them, loan amount and funded amount, do not pass the T-test. The model could be improved if other variables were available to test. Variables that might impact the interest rate include credit score and the purpose of the loan. We did have access to credit “grade”, which we believe is related to credit score, but it showed collinearity with the other variables when included in the model. Additionally, the data for “purpose of the loan” was available in the data set, yet the inputs were not uniform. To include this in the model, we would need to adjust the observation values for each of the 107,000 data points. Overall, we have a valid model with plenty of room for improvement.

Appendix 1

Screenshot of example data set to be used.

Appendix 2

Included below is a sample of the data set, along with the metadata, explaining the fields, and the descriptions.