In this project, I applied several supervised learning algorithms to model individuals' income using data collected from the 1994 U.S. Census. After reviewing preliminary results, I selected the best-performing candidate algorithm and tuned it further. My goal was to construct a model that accurately predicts whether an individual earns more than $50,000. This type of analysis is useful in a non-profit setting, where organizations rely on donations: by understanding an individual's income, a non-profit can better estimate the size of a potential donation, or even decide whether to reach out at all. While directly determining someone's income from public data can be challenging, I demonstrate how income brackets can be inferred from other publicly available features.
Exploring the Data¶
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from time import time
from IPython.display import display # Allows the use of display() for DataFrames
# Import supplementary visualization code visuals.py
import visuals as vs
# Pretty display for notebooks
%matplotlib inline
# Load the Census dataset
data = pd.read_csv("census.csv")
# Success - Display the first record
display(data.head(n=1))
| | age | workclass | education_level | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | 13.0 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174.0 | 0.0 | 40.0 | United-States | <=50K |
Implementation: Data Exploration¶
# TODO: Total number of records
n_records = len(data)
# Number of records where individual's income is more than $50,000
n_greater_50k = (data['income'] == '>50K').sum()
# Number of records where individual's income is at most $50,000
n_at_most_50k = (data['income'] == '<=50K').sum()
# Percentage of individuals whose income is more than $50,000
greater_percent = (n_greater_50k / n_records) * 100
# Print the results
print("Total number of records: {}".format(n_records))
print("Individuals making more than $50,000: {}".format(n_greater_50k))
print("Individuals making at most $50,000: {}".format(n_at_most_50k))
print("Percentage of individuals making more than $50,000: {}%".format(greater_percent))
Total number of records: 45222
Individuals making more than $50,000: 11208
Individuals making at most $50,000: 34014
Percentage of individuals making more than $50,000: 24.78439697492371%
Featureset Exploration
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
Preparing the Data¶
Transforming Skewed Continuous Features¶
# Split the data into features and target label
income_raw = data['income']
features_raw = data.drop('income', axis = 1)
# Visualize skewed continuous features of original data
vs.distribution(data)
# Log-transform the skewed features
skewed = ['capital-gain', 'capital-loss']
features_log_transformed = pd.DataFrame(data = features_raw)
features_log_transformed[skewed] = features_raw[skewed].apply(lambda x: np.log(x + 1))
# Visualize the new log distributions
vs.distribution(features_log_transformed, transformed = True)
Normalizing Numerical Features¶
# Import sklearn.preprocessing.MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler() # default=(0, 1)
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
features_log_minmax_transform = pd.DataFrame(data = features_log_transformed)
features_log_minmax_transform[numerical] = scaler.fit_transform(features_log_transformed[numerical])
# Show an example of a record with scaling applied
display(features_log_minmax_transform.head(n = 5))
| | age | workclass | education_level | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.301370 | State-gov | Bachelors | 0.800000 | Never-married | Adm-clerical | Not-in-family | White | Male | 0.667492 | 0.0 | 0.397959 | United-States |
| 1 | 0.452055 | Self-emp-not-inc | Bachelors | 0.800000 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0.000000 | 0.0 | 0.122449 | United-States |
| 2 | 0.287671 | Private | HS-grad | 0.533333 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0.000000 | 0.0 | 0.397959 | United-States |
| 3 | 0.493151 | Private | 11th | 0.400000 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0.000000 | 0.0 | 0.397959 | United-States |
| 4 | 0.150685 | Private | Bachelors | 0.800000 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0.000000 | 0.0 | 0.397959 | Cuba |
Implementation: Data Preprocessing¶
# One-hot encode the 'features_log_minmax_transform' data using pandas.get_dummies()
features_final = pd.get_dummies(features_log_minmax_transform)
# Encode the 'income_raw' data to numerical values
income_convert = {'<=50K': 0, '>50K': 1}
income = income_raw.map(income_convert)
# Print the number of features after one-hot encoding
encoded = list(features_final.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))
# Uncomment the following line to see the encoded feature names
# print(encoded)
103 total features after one-hot encoding.
Shuffle and Split Data¶
# Import train_test_split
from sklearn.model_selection import train_test_split
# Split the 'features' and 'income' data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_final,
                                                    income,
                                                    test_size = 0.2,
                                                    random_state = 0)
# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
Training set has 36177 samples.
Testing set has 9045 samples.
Evaluating Model Performance¶
Naive Predictor Performance¶
Evaluation
- A model that always predicts an individual makes more than $50,000 would achieve an accuracy of 0.2478 and an F-score of 0.2917. These naive-predictor values serve as the baseline that the trained models must beat.
# Counting the ones as this is the naive case.
# Note that 'income' is the 'income_raw' data encoded to numerical values done in the data preprocessing step.
TP = np.sum(income)
FP = income.count() - TP
TN = 0 # The naive model never predicts the negative class, so no true negatives
FN = 0 # ...and for the same reason, no false negatives
# Calculate accuracy, precision and recall
accuracy = (TP + TN)/(TP + FP + TN + FN )
recall = (TP)/(TP + FN)
precision = (TP)/(TP + FP)
# Calculate F-score using the formula above for beta = 0.5 and correct values for precision and recall.
beta = 0.5
beta_square = beta**2
fscore = (1 + beta_square) * (precision * recall) / ((beta_square * precision) + recall)
# Print the results
print("Naive Predictor: [Accuracy score: {:.4f}, F-score: {:.4f}]".format(accuracy, fscore))
Naive Predictor: [Accuracy score: 0.2478, F-score: 0.2917]
Supervised Learning Models¶
Chosen Supervised Learning Models
Model Choice #1: Logistic Regression
- Logistic Regression is a strong choice for our scenario, as it is widely used in the finance industry as a predictive tool. For example, financial institutions regularly use this model to assess the likelihood of a customer defaulting on a loan, analyzing key indicators such as credit scores, income levels, and employment history to make a well-informed assessment. Its efficiency in handling such tasks makes it a strong candidate for predicting individual income levels in our project.
- A key strength of this model is that its coefficients can be interpreted as indicators of feature importance: each coefficient reflects a feature's impact on the predicted outcome, providing insight into how different variables influence the probability of an individual belonging to a specific group.
- On the flip side, logistic regression does not handle non-linear relationships well. This means that if the data has complex patterns that do not follow traditional linear trends, the model will struggle to accurately capture and represent the underlying relationships without extra transformation techniques.
- Due to the scenario of predicting whether an individual earns over $50K, logistic regression will provide probabilistic outputs that indicate the likelihood of an individual belonging to the high-income category. These probabilities can then be mapped to binary classifications, allowing for clear decision-making and interpretability in income prediction analysis.
Reference: Information from GeeksForGeeks.org
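The probability-to-label mapping described above can be sketched on toy data. The feature, labels, and 0.5 threshold below are illustrative assumptions, not taken from the census set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: a single feature loosely standing in for years of
# education, where larger values are more likely to be labeled 1 (> $50K).
rng = np.random.RandomState(0)
X = rng.uniform(0, 16, size=(200, 1))
y = (X[:, 0] + rng.normal(scale=3.0, size=200) > 10).astype(int)

clf = LogisticRegression().fit(X, y)

# predict_proba yields P(<=50K) and P(>50K) per sample; thresholding the
# positive-class probability at 0.5 gives the binary classification.
probs = clf.predict_proba(X)[:, 1]
labels = (probs >= 0.5).astype(int)
print(labels[:10])
```

The same probability column could also be thresholded differently if, say, CharityML wanted to contact only individuals who are very likely to be high earners.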
Model Choice #2: Ensemble Methods
- For our second option, Ensemble Methods would be a strong choice for building an accurate prediction model. They are widely used in the healthcare and finance industries due to their ability to combine multiple models to improve predictive accuracy. By leveraging techniques like bagging and boosting, ensemble methods reduce variance and bias, making them highly effective for complex classification tasks like predicting income levels.
- Ensemble methods can be applied to a wide range of data and problem types, from classification to regression tasks, making them a versatile and powerful tool for our analysis. Some methods, like Random Forests, can also highlight the significance of individual features, allowing us to identify which variables play the most crucial role in predicting income levels.
- The increased complexity of ensemble methods is a common weakness among machine learning algorithms that prioritize accuracy over interpretability. Since these models aggregate multiple learners, they often produce complicated results that are difficult to interpret, making it challenging to understand how individual features contribute to the final prediction.
- Ensemble methods, such as Random Forest and AdaBoost, are well-suited for our problem statement because of their ability to handle more complex relationships inside data that would not be effectively captured with simpler models like Logistic Regression or Decision Trees. The data we have is also a mix of numerical and categorical values, which ensemble methods can effectively pull patterns and insights from without requiring extensive preprocessing.
Reference: Information from DataScienceDojo.com
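As a rough illustration of the boosting claim above, the snippet below compares a single weak learner against a boosted ensemble of fifty of them on synthetic data (the dataset and parameters are invented for demonstration, not the census features):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical synthetic stand-in for a tabular classification problem.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)

# One weak learner (a depth-1 "stump") versus a boosted ensemble of 50.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

print("stump:", stump.score(X, y))
print("boost:", boost.score(X, y))

# Tree-based ensembles also expose per-feature importances (summing to 1),
# which is the mechanism used later to rank the census features.
print(boost.feature_importances_.round(2))
```

On data like this the boosted ensemble scores well above the single stump, at the cost of the interpretability noted above.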
Model Choice #3: Stochastic Gradient Descent Classifier (SGDC)
- My final choice of predictive model is the SGDClassifier (SGDC), a linear classifier trained with stochastic gradient descent (SGD). SGD itself is the workhorse optimizer for deep neural networks, particularly in applications within the technology and healthcare industries, driving optimization in natural language processing, computer vision, and reinforcement learning. By efficiently training large-scale networks, SGD is a fundamental component of frameworks like TensorFlow and PyTorch.
- A key advantage of SGDC is its efficiency and scalability, making it ideal for large datasets. Unlike traditional gradient descent, which processes the entire dataset at once, SGDC updates model parameters incrementally, allowing for faster convergence. Its memory efficiency also enables it to handle large-scale data without excessive computational resources.
- Despite its efficiency and scalability, a noticeable disadvantage of SGDC is its noisy convergence, as it updates model parameters based on a single data point or small batch. This can cause fluctuations in the cost function, making convergence less stable.
- SGDC is well-suited for our project due to its efficiency, scalability, and ability to handle large datasets. With 45,222 records and 103 features after encoding, SGDC is an ideal choice as it processes batches of data incrementally, reducing computational cost and memory usage compared to traditional gradient descent.
Reference: Information from GeeksForGeeks.org
Website Links to References:
- https://www.geeksforgeeks.org/what-are-some-applications-of-logistic-regression/
- https://www.geeksforgeeks.org/advantages-and-disadvantages-of-logistic-regression/
- https://datasciencedojo.com/blog/ensemble-methods-in-machine-learning/
- https://www.geeksforgeeks.org/ml-stochastic-gradient-descent-sgd/
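The incremental-update behavior described for SGDC can be sketched with `SGDClassifier.partial_fit`, which consumes one mini-batch at a time; the synthetic data and batch size here are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical synthetic data: 2,000 rows, linearly separable with noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# partial_fit updates the model one mini-batch at a time, so the full
# dataset never has to sit in memory at once -- the scalability property
# described above.
clf = SGDClassifier(random_state=0)
for start in range(0, len(X), 200):               # 200-row mini-batches
    clf.partial_fit(X[start:start + 200], y[start:start + 200],
                    classes=np.array([0, 1]))

print(clf.score(X, y))
```

Note that `classes` must be supplied on the first `partial_fit` call so the model knows the full label set before seeing all the data.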
Implementation - Creating a Training and Predicting Pipeline¶
# Import two metrics from sklearn - fbeta_score and accuracy_score
from sklearn.metrics import fbeta_score, accuracy_score
from time import time
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the size of samples (number) to be drawn from training set
       - X_train: features training set
       - y_train: income training set
       - X_test: features testing set
       - y_test: income testing set
    '''
    results = {}
    # Fit the learner to the first 'sample_size' training samples
    start = time() # Get start time
    learner.fit(X_train[:sample_size], y_train[:sample_size])
    end = time() # Get end time
    # Calculate the training time
    results['train_time'] = end - start
    # Get predictions on the test set (X_test),
    # then predictions on the first 300 training samples using .predict()
    start = time() # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time() # Get end time
    # Calculate the total prediction time
    results['pred_time'] = end - start
    # Compute accuracy on the first 300 training samples
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
    # Compute accuracy on the test set using accuracy_score()
    results['acc_test'] = accuracy_score(y_test, predictions_test)
    # Compute F-score on the first 300 training samples using fbeta_score()
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)
    # Compute F-score on the test set
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5)
    # Success
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
    # Return the results
    return results
Implementation: Initial Model Evaluation¶
# Import the three supervised learning models from sklearn
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
# Initialize the three models
clf_A = LogisticRegression(random_state=0)
clf_B = AdaBoostClassifier(algorithm='SAMME', random_state=0)
clf_C = SGDClassifier(random_state=0)
# Calculate the number of samples for 1%, 10%, and 100% of the training data
# samples_100 is the entire training set, samples_10 is 10% of samples_100, and samples_1 is 1% of samples_100
samples_100 = len(y_train)
samples_10 = int(0.1 * samples_100)
samples_1 = int(0.01 * samples_100)
# Collect results on the learners
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = \
            train_predict(clf, samples, X_train, y_train, X_test, y_test)
# Run metrics visualization for the three supervised learning models chosen
vs.evaluate(results, accuracy, fscore)
LogisticRegression trained on 361 samples.
LogisticRegression trained on 3617 samples.
LogisticRegression trained on 36177 samples.
AdaBoostClassifier trained on 361 samples.
AdaBoostClassifier trained on 3617 samples.
AdaBoostClassifier trained on 36177 samples.
SGDClassifier trained on 361 samples.
SGDClassifier trained on 3617 samples.
SGDClassifier trained on 36177 samples.
Improving Results¶
Choosing the Best Model
- F-Score Assessment: Looking at the F-score on the testing set, the AdaBoostClassifier (dark blue) achieves the highest F-score when using 100% of the training data, indicating it strikes the best balance between precision and recall. However, SGDClassifier and Logistic Regression are not far behind, trailing by only about 0.02 to 0.03. AdaBoost also has the highest accuracy of the three models, though again only by a narrow margin.
- Training Time Assessment: AdaBoost has the highest training time with a significant increase of approximately 1.7 to 2.0 seconds compared to the other models when using 100% of the training data. SGDClassifier (green) appears to be the fastest to train, making it a good option for workloads that need a lot of scalability.
- Prediction Time Assessment: Like the training time, AdaBoost has the highest prediction time, making it the slowest option out of the models. SGDClassifier and LogisticRegression are much faster, which is good for real-time applications that rely on rapid decision-making and low-latency predictions.
- Final Decision: After going over the results and the requests of this project, I believe AdaBoostClassifier is the best choice for what the company is looking for. My reasoning for this is because AdaBoost has the highest accuracy and F-score among all the models, and the company specifically asked for a model that can best identify individuals earning more than $50,000. Additionally, since this dataset is static and offline, the longer training and prediction times are not a major concern. Since real-time predictions are not required, the primary focus is maximizing accuracy rather than minimizing computational cost.
Describing the Model in Layman's Terms
AdaBoost, short for Adaptive Boosting, is a method that combines multiple simple models to create a stronger, more accurate model. It achieves higher accuracy because each new model learns from the previous model's mistakes. By the end, AdaBoost combines all the weak models into a strong, final prediction that is much more accurate than any single model alone.
To learn from the previous models, AdaBoost starts by assigning equal weight to all data points and trains an initial weak model (Phase B1), which creates a decision boundary but misclassifies some data points. To correct this, AdaBoost gives higher weights to these misclassified points so the next model focuses more on them, resulting in a refined decision boundary with fewer errors (Phase B2). The process continues with further adjustments (Phase B3), where newly misclassified points again receive increased importance, leading to a more accurate classifier. Finally, all the weak models (Phases B1, B2, and B3) are combined into a strong final model (Phase B4). By aggregating multiple classifiers, AdaBoost significantly improves accuracy, making it the most reliable choice for predicting individuals' incomes to help CharityML better identify potential donors.
Reference: Information from GeeksForGeeks.org
Website Link to Reference:
https://www.geeksforgeeks.org/boosting-in-machine-learning-boosting-and-adaboost/
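The weight-update step behind Phases B1 and B2 can be worked through by hand on a toy example. The six labels and the deliberately imperfect "weak model" below are invented for illustration; the update rule is the standard AdaBoost one:

```python
import numpy as np

# Labels in {-1, +1}, and a weak model that gets points 2 and 5 wrong --
# the analogue of Phase B1 above.
y_true = np.array([ 1,  1,  1, -1, -1, -1])
y_weak = np.array([ 1,  1, -1, -1, -1,  1])

w = np.full(6, 1 / 6)                          # equal starting weights
err = w[y_weak != y_true].sum()                # weighted error of weak model
alpha = 0.5 * np.log((1 - err) / err)          # this model's say in the vote

w = w * np.exp(-alpha * y_true * y_weak)       # up-weight the mistakes
w = w / w.sum()                                # renormalize to sum to 1

print(np.round(w, 3))
# The two misclassified points now carry twice the weight of the others,
# so the Phase B2 model will focus on getting them right.
```

Repeating this update for each round and summing `alpha`-weighted votes of all weak models gives the final strong classifier (Phase B4).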
Implementation: Model Tuning¶
# Import 'GridSearchCV', 'make_scorer', and any other necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.ensemble import AdaBoostClassifier
# Initialize the classifier
clf = AdaBoostClassifier(algorithm="SAMME", random_state=0)
# Create the parameters list you wish to tune, using a dictionary if needed.
# parameters = {'parameter_1': [value1, value2], 'parameter_2': [value1, value2]}
parameters = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1.0]
}
# Make an fbeta_score scoring object using make_scorer()
scorer = make_scorer(fbeta_score, beta=0.5)
# Perform grid search on the classifier using 'scorer' as the scoring method using GridSearchCV()
grid_obj = GridSearchCV(estimator=clf, param_grid=parameters, scoring=scorer, cv=3, n_jobs=-1, verbose=1)
# Fit the grid search object to the training data and find the optimal parameters using fit()
grid_fit = grid_obj.fit(X_train, y_train)
# Get the estimator
best_clf = grid_fit.best_estimator_
# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)
# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
Fitting 3 folds for each of 9 candidates, totalling 27 fits
Unoptimized model
------
Accuracy score on testing data: 0.8483
F-score on testing data: 0.7029
Optimized Model
------
Final accuracy score on the testing data: 0.8507
Final F-score on the testing data: 0.7132
Final Model Evaluation¶
Results:¶
| Metric | Unoptimized Model | Optimized Model |
|---|---|---|
| Accuracy Score | 0.8483 | 0.8507 |
| F-score | 0.7029 | 0.7132 |
Results Evaluation
- My scores on the optimized testing data were 0.8507 for Accuracy and 0.7132 for F-score.
- This shows that the optimized model slightly outperforms the unoptimized one: the F-score rises from 0.7029 to 0.7132 (about 1.5% relative improvement) and accuracy from 0.8483 to 0.8507 (about 0.3% relative). Though the improvement is marginal, it indicates that the tuning measurably improved predictions.
- Compared to the naive predictor, which had an accuracy of 0.2478 and an F-score of 0.2917, the optimized model shows a substantial accuracy improvement from 24.78% to 85.07%, a 60.29 percentage point increase. Additionally, the F-score increased from 0.2917 to 0.7132, a gain of roughly 42 percentage points. This demonstrates the new model's effectiveness in distinguishing individuals by their income.
Feature Relevance Observation¶
Most Important Features for Prediction
- Education: I consider education to be the most important feature in our prediction model, as higher education is often correlated with higher income levels. While the feature Education-Num could also serve as a strong indicator, the number of years spent in school does not always guarantee that an individual has earned a degree.
- Occupation: This feature is important as it directly impacts salary levels. It is well known that certain professions consistently yield higher earnings, while others are generally linked to lower wages.
- Capital-Gain: Wealth accumulation serves as a clear indicator of an individual's financial capacity. Those with high capital gains are more likely to invest, making this a strong predictor of potential donors.
- Hours-Per-Week: This feature could serve as a strong predictor when analyzed alongside other variables. Higher working hours are generally correlated with larger paychecks. Additionally, this metric can help differentiate between part-time and full-time employees based on their total hours worked.
- Age: Older individuals tend to have more accumulated wealth and years of experience compared to their younger counterparts, and therefore are more likely to earn higher incomes and have greater financial stability. However, this is ranked last because age alone does not directly determine income level. Compared to the other listed features, such as education, occupation, and capital gains, age is a less direct predictor of earning potential and is more effective when evaluated in conjunction with other factors.
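One quick way to sanity-check intuitions like these is to compare feature means across the two income groups. The toy rows below are invented for illustration; on the project data, the same pattern would be `data.groupby('income')[numerical].mean()`:

```python
import pandas as pd

# Invented toy rows; on the census data this would use the real DataFrame.
toy = pd.DataFrame({
    'age':            [25, 52, 31, 47, 38, 60],
    'hours-per-week': [20, 50, 40, 45, 40, 55],
    'income':         ['<=50K', '>50K', '<=50K', '>50K', '<=50K', '>50K'],
})

# Mean of each numerical feature within each income group; features whose
# group means separate sharply are promising predictors.
group_means = toy.groupby('income')[['age', 'hours-per-week']].mean()
print(group_means)
```

This is only a heuristic; the feature importances extracted from the trained model below are the more reliable ranking.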
Implementation - Extracting Feature Importance¶
# TODO: Import a supervised learning model that has 'feature_importances_'
model = AdaBoostClassifier(algorithm='SAMME', random_state=0)
# TODO: Train the supervised model on the training set using .fit(X_train, y_train)
model.fit(X_train, y_train)
# TODO: Extract the feature importances using .feature_importances_
importances = model.feature_importances_
# Plot
vs.feature_plot(importances, X_train, y_train)
Extracting Feature Importance¶
Feature Evaluation
- Comparing my choices to the model's top five, three of my picks (hours-per-week, age, capital-gain) aligned with the model's top predictors, while my other picks (education and occupation) were displaced by education-num and capital-loss.
- Initially, I believed education would be a better choice due to uncertainty of education-num. However, after reevaluating the variables, education-num proves to be a more precise numerical representation of education level, making it a stronger predictor of income. Additionally, it eliminates the ambiguity of categorical labels, making it more practical for machine analysis.
- Similarly, capital-loss initially caught me off guard, but I can recognize it as a strong indicator of someone who actively invests and has disposable income.
- Based on the model's selections so far, I can understand why occupation isn't considered a key feature. After reviewing the Census data, many occupation titles, such as 'Sales,' are quite vague. The algorithm may find these broad categories too vague and inefficient to serve as a strong predictor of income.
- My predictions about hours-per-week, capital-gain, and age were accurate, as they all have distinct numerical values that provide clear, measurable insights into work patterns, financial status, and experience levels.
Feature Selection¶
# Import functionality for cloning a model
from sklearn.base import clone
# Reduce the feature space
X_train_reduced = X_train[X_train.columns.values[(np.argsort(importances)[::-1])[:5]]]
X_test_reduced = X_test[X_test.columns.values[(np.argsort(importances)[::-1])[:5]]]
# Train on the "best" model found from grid search earlier
clf = (clone(best_clf)).fit(X_train_reduced, y_train)
# Make new predictions
reduced_predictions = clf.predict(X_test_reduced)
# Report scores from the final model using both versions of data
print("Final Model trained on full data\n------")
print("Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
print("\nFinal Model trained on reduced data\n------")
print("Accuracy on testing data: {:.4f}".format(accuracy_score(y_test, reduced_predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, reduced_predictions, beta = 0.5)))
Final Model trained on full data
------
Accuracy on testing data: 0.8507
F-score on testing data: 0.7132

Final Model trained on reduced data
------
Accuracy on testing data: 0.8263
F-score on testing data: 0.6533
Effects of Feature Selection¶
Final Evaluation
Previous Model: Accuracy: 0.8507 | F-score: 0.7132
Refined Model: Accuracy: 0.8263 | F-score: 0.6533
- The reduced model drops by about 2.4 percentage points in accuracy (0.8507 to 0.8263) and 6.0 points in F-score (0.7132 to 0.6533), indicating some loss in predictive performance.
- If training time were a significant constraint, I would consider using the reduced model: the drop in scores is small relative to the computational savings of training on five features instead of 103. Under normal circumstances, however, I would still prefer the full model, as it offers higher accuracy, which CharityML considers a critical factor in effectively identifying potential donors and maximizing outreach success.