
DATA146

Project 5, Part I

Setup

After downloading and importing the anonymized dataset persons.csv and preprocessing the data by converting all values to int and filtering out the NaN values, I performed three different regressions (linear, Ridge, and Lasso) and computed the mean squared error (MSE) and R^2 value to compare the performance of each one.

First, I used the variable wealthC as my target, and then changed it to wealthI to determine which of the two is better predicted by the features, and thus a better indicator of wealth, in the unknown West African country.

For each regression model, I used the raw data and then the standardized data to determine the correlation between the features (a list of 60 different descriptive factors for people in the country, including the potability of the person’s water, their age and gender, whether they own a car, the type of electricity and cooking mechanisms present in the home, and more) and the target (either wealthC or wealthI).

All of the reported R^2 values and MSEs come from the testing data, which was used to evaluate the models after they were trained on a randomly selected subset of the data, and thus indicate the strength of each model’s ability to predict unseen data.
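As a point of reference, a minimal sketch of the preprocessing and train/test split described above might look like the following. The column handling, test-set size, and random seed are assumptions for illustration, not the exact settings used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the anonymized dataset, drop NaN rows, and convert everything to int.
pns = pd.read_csv('persons.csv').dropna().astype(int)

# Separate the ~60 descriptive features from the two possible targets.
X = pns.drop(columns=['wealthC', 'wealthI']).values
y = pns['wealthC'].values  # swap in pns['wealthI'] for the second target

# Hold out a randomly selected testing set (size and seed are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=146)

# Standardized copies of the features for the "standardized data" runs.
ss = StandardScaler()
X_train_std = ss.fit_transform(X_train)
X_test_std = ss.transform(X_test)
```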

Target Variable: wealthC

Linear Regression

First, I performed a linear regression using both the raw data and the standardized data. Shown below, the MSE and R^2 values indicate a strong correlation between the features and the wealthC target. The standardized MSE and R^2 are quite close to those of the raw data, indicating that standardization does not have a significant effect on the overall measure of correlation.

|     | Raw Data | Standardized Data |
|-----|----------|-------------------|
| MSE | 0.44281  | 0.45205           |
| R^2 | 0.73582  | 0.73031           |
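The fits behind these numbers could be reproduced with a short sketch like the one below, reusing the arrays from the setup sketch above (the variable names are assumptions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin = LinearRegression()

# Raw features
lin.fit(X_train, y_train)
print(mean_squared_error(y_test, lin.predict(X_test)))  # testing MSE
print(lin.score(X_test, y_test))                        # testing R^2

# Standardized features
lin.fit(X_train_std, y_train)
print(mean_squared_error(y_test, lin.predict(X_test_std)))
print(lin.score(X_test_std, y_test))
```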

The values of R^2 for the two datasets are similar, though not identical, and this is reflected in the correlation coefficients given by the code snippets below as well.

The correlation matrices were nearly identical, with only small differences several decimal places in.

import pandas as pd

# Correlation matrix of the standardized features plus the target
tform_df = pd.DataFrame(data=tform_X)
tform_df['wealthC'] = y
tform_df.corr()

# Correlation matrix of the raw features plus the target
df = pd.DataFrame(data=X)
df['wealthC'] = y
df.corr()

tform_X and X are the standardized and raw feature data, respectively, and y represents the target data.

Ridge Regression

Similarly to the linear regression model, the Ridge regression performed well, with a strong correlation reflected in an R^2 value of around 0.73. The standardized data reflects a very slight improvement over the raw data, and the results overall are comparable to the linear regression model’s.

| Raw Data R^2 | Standardized Data R^2 |
|--------------|-----------------------|
| 0.73478      | 0.73505               |
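A minimal sketch of the Ridge fit on the standardized data, assuming the split from the setup sketch above and an illustrative alpha value (the actual value comes from the alpha sweep plotted at the end of Part I):

```python
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)  # placeholder alpha, not the tuned value
ridge.fit(X_train_std, y_train)
print(ridge.score(X_test_std, y_test))  # testing R^2
```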

Lasso Regression

The Lasso regression, too, indicates a strong correlation between target and features, essentially the same as in the previous two models. The standardized data shows a slight improvement in correlation.

| Raw Data R^2 | Standardized Data R^2 |
|--------------|-----------------------|
| 0.73386      | 0.73502               |
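The Lasso fit follows the same pattern; the sketch below again assumes the earlier split, and the alpha shown is only a placeholder (Lasso typically wants a much smaller alpha than Ridge on standardized data):

```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.001)  # placeholder alpha, not the tuned value
lasso.fit(X_train_std, y_train)
print(lasso.score(X_test_std, y_test))  # testing R^2
```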

Target Variable: wealthI

The features clearly correlate with the target variable wealthC, with an overall R^2 of around 0.73 across all three regression models.

Next, I analyzed the same features with the same regression models, using the new target variable wealthI.

Linear Regression

The linear regression for this target variable is even more favorable than for the previous one. The R^2 values, again, are highly comparable and indicate almost no difference between standardized and raw data.

The correlation matrices are comparable yet again, and the features show a strong ability to predict wealthI.

| Raw Data R^2 | Standardized Data R^2 |
|--------------|-----------------------|
| 0.82383      | 0.82281               |

Ridge Regression

The Ridge regression for this target is a slight improvement on the linear regression, as shown in the table below.

| Raw Data R^2 | Standardized Data R^2 |
|--------------|-----------------------|
| 0.82463      | 0.82520               |

Lasso Regression

Finally, I performed a Lasso regression as in my consideration of wealthC, and again found that wealthI has a stronger correlation with the features.

| Raw Data R^2 | Standardized Data R^2 |
|--------------|-----------------------|
| 0.82522      | 0.82501               |

Analysis and Conclusions

Overall, it is clear that wealthI is a stronger measure of wealth in this anonymous West African country than wealthC: every regression model achieved an R^2 of roughly 0.82 for wealthI, versus roughly 0.73 for wealthC.

Plot 1: R^2 vs. alpha value for the Ridge regression, using the standardized testing data with the wealthC target.

Plot 2: R^2 vs. alpha value for the Ridge regression, using the standardized testing data with the wealthI target.
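The alpha sweep behind these plots could be generated with something like the sketch below; the alpha range and step count are assumptions rather than the exact values used.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

alphas = np.linspace(0.1, 100, 200)  # illustrative search range
test_scores = []
for a in alphas:
    model = Ridge(alpha=a)
    model.fit(X_train_std, y_train)
    test_scores.append(model.score(X_test_std, y_test))

plt.plot(alphas, test_scores)
plt.xlabel('alpha')
plt.ylabel('testing R^2')
plt.show()
```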

The Ridge regression seemed best overall: although the Lasso R^2 was slightly higher for the wealthI target, it was nearly identical to the Ridge value, and for wealthC the Ridge model performed slightly better than the Lasso. Both the Lasso and Ridge regressions performed better than the linear regression, though the improvement was so slight that any of the three can be reliably utilized as a predictor.

Project 5, Part II

Setup

In Part II, I analyze a slightly different dataset, city_persons.csv, drawn from a larger city than before in a West African country. After importing the data, I again preprocessed it by excluding the null values indicated by NaN and converting two of the columns to int so that all of the data shares the same type.

Whereas last time I considered both wealthC and wealthI, this time I only analyze data using wealthC as the target.

This time, my analytical focus is classification: identifying into which wealth class each person in the dataset falls, using three different classification models for comparison.

Classification: K-Nearest Neighbors

The first classification method I tried is K-nearest neighbors (KNN), which assigns a classification to a particular data point according to its “k” nearest (for some chosen integer k) neighboring points.

The optimal k-value I found for this data was 95 neighbors, and the testing score was 0.54417, indicating moderate predictive power.

KNN can be performed using “weights,” which control how much influence each neighboring point has on a classification. In this case, after testing the data without specifying a weight, I told KNN to weight the neighbors by distance (i.e., closer points are considered more important than farther points in informing a data point’s classification).

After adding the distance weight, the optimal k-value turned out to be 128 neighbors, and the testing score was slightly lower than before at 0.50756, indicating that the data was slightly more reliably modelled without distance weighting.
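A minimal sketch of both KNN fits, assuming a train/test split analogous to the one in Part I (the variable names are placeholders):

```python
from sklearn.neighbors import KNeighborsClassifier

# Unweighted KNN with the best k found above.
knn = KNeighborsClassifier(n_neighbors=95)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))

# Distance-weighted KNN: closer neighbors count for more.
knn_w = KNeighborsClassifier(n_neighbors=128, weights='distance')
knn_w.fit(X_train, y_train)
print(knn_w.score(X_test, y_test))
```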

Classification: Logistic Regression

Next, I performed a logistic regression on the dataset, after scaling the data using StandardScaler.

| Training R^2 | Testing R^2 |
|--------------|-------------|
| 0.54979      | 0.54677     |
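A sketch of this step, with the scaling applied before fitting; the max_iter setting is an assumption added only to ensure convergence:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train_s = ss.fit_transform(X_train)
X_test_s = ss.transform(X_test)

log = LogisticRegression(max_iter=1000)
log.fit(X_train_s, y_train)
print(log.score(X_train_s, y_train))  # training score
print(log.score(X_test_s, y_test))    # testing score
```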

The logistic regression performed better than the weighted KNN model, and slightly better than (though very close to) the unweighted KNN model.

Since the training score is slightly higher than the testing score, we can see that the model is slightly overfit, but overall the values are quite close and comparable.

Logistic regression results from CompareClasses():

Figure 1: Comparison matrix output from CompareClasses(), with accuracy 0.55, sorting the data into classes and indicating the accuracy of that sorting.

Classification: Random Forest Model

Next, I executed a random forest model and ran it four times, with 100, 500, 1000, and 5000 trees. In all of these trials, I found the model was noticeably overfit, as the average training score was over 0.7 in each trial while the testing scores stayed around 0.5.

| Trees | Testing R^2 |
|-------|-------------|
| 100   | 0.50464     |
| 500   | 0.49487     |
| 1000  | 0.49780     |
| 5000  | 0.50512     |
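A sketch of the tree-count sweep, again assuming a split like the one in Part I; the random_state is an assumption added for reproducibility:

```python
from sklearn.ensemble import RandomForestClassifier

# Sweep over the tree counts reported in the table above.
for n in (100, 500, 1000, 5000):
    rf = RandomForestClassifier(n_estimators=n, random_state=146)
    rf.fit(X_train, y_train)
    print(n, rf.score(X_train, y_train), rf.score(X_test, y_test))
```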

Each of these performed worse than the KNN and logistic models, so I tried standardizing the data.

After standardization, the new table (below) shows results that are slightly better than before, though overall very similar.

The most accurate of all of these trials is the 5000-tree forest trained on the unscaled data.

| Trees | Testing R^2 | Scaled Testing R^2 |
|-------|-------------|--------------------|
| 100   | 0.48804     | 0.49195            |
| 500   | 0.49244     | 0.49927            |
| 1000  | 0.49732     | 0.50122            |
| 5000  | 0.48902     | 0.50220            |

Merging classes 2 and 3

Since the predictive power of these models is weak, I combined classes 2 and 3 into a single outcome. The new correlation coefficients improved, indicating that the initial classes didn’t fit the data very well. With the new classifications, we see much stronger predictive power in our models, as shown below.

|     | KNN     | Logistic Regression | Random Forest |
|-----|---------|---------------------|---------------|
| R^2 | 0.60615 | 0.59688             | 0.60827       |
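The merge itself can be done with a one-line relabeling before refitting the three models; mapping class 3 onto class 2 is an assumption about how the combination was implemented:

```python
import numpy as np

# Treat classes 2 and 3 as a single outcome by relabeling 3 as 2.
y_train_merged = np.where(y_train == 3, 2, y_train)
y_test_merged = np.where(y_test == 3, 2, y_test)
```

Each of the KNN, logistic regression, and random forest models is then refit and scored against the merged labels.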

Conclusions and Analysis

Overall, the models performed best when classes 2 and 3 were treated as a single classification. Under these conditions, the models that performed best were KNN and Random Forest. Their correlation coefficients were very close, but overall the Random Forest model performed slightly better because, while both models were somewhat overfit (their testing scores fall well below their training scores, indicating lower predictive power when the model is confronted with new information), its discrepancy between training and testing coefficients was smaller.

|               | Training | Testing |
|---------------|----------|---------|
| KNN           | 0.79631  | 0.60615 |
| Random Forest | 0.696235 | 0.60827 |

From the plots below, we can see visual confirmation that the class combination improved our model, as the training and testing scores were much closer together.

Plot 1: Plot of R^2 vs. K-value for K-nearest neighbors model.

Plot 2: Plot of R^2 vs. K-value for the K-nearest neighbors model with classes 2 and 3 merged.