The goal of this project is to develop a predictive model to classify whether a patient is likely to be diagnosed with diabetes based on various medical and demographic factors.
Proposed Solution Framework
We propose to build and evaluate two machine learning models: a Random Forest classifier and Logistic Regression. The models will be trained on preprocessed data from the Pima Indians Diabetes Dataset, sourced from the UCI Machine Learning Repository, and their performance will be compared using ROC AUC, classification reports, and confusion matrices.
Explanation of Each Step
Data Exploration
We performed an initial train-test split to reserve 20% of the data for testing.
The training set comprises 614 samples (the remaining 80% of the data), and we explored its structure and summary statistics.
A correlation heatmap was generated to visualize relationships between features.
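The split and correlation steps above can be sketched as follows. This is a minimal illustration, not the notebook's code: the data is a synthetic stand-in with the dataset's shape (768 rows, 8 features), and the column names, the `stratify` option, and `random_state=42` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset (assumption): 768 rows, 8 features, binary target.
rng = np.random.default_rng(0)
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]  # assumed names
df = pd.DataFrame(rng.normal(size=(768, 8)), columns=columns)
df["Outcome"] = rng.integers(0, 2, size=768)

X = df.drop(columns="Outcome")
y = df["Outcome"]

# Reserve 20% for testing; stratifying on the target is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Pairwise Pearson correlations on the training data only.
# This matrix is what a heatmap function such as seaborn.heatmap would draw.
corr = X_train.join(y_train).corr()

print(X_train.shape, X_test.shape)
```

With 768 rows, a 20% test split leaves 614 training samples, which matches the figure reported above. Computing the correlations on the training portion only keeps the test set untouched during exploration.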
Data Preprocessing
Missing values (represented as zeros) were imputed using the median of each feature.
Numerical features were standardized to have a mean of 0 and a standard deviation of 1.
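One way to express these two preprocessing steps is a scikit-learn pipeline, sketched below on a small toy matrix. The values are made up for illustration, and treating every zero as missing is an assumption: in practice this should be applied only to columns where zero is not a valid measurement.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy matrices (made-up values); 0 marks a missing measurement.
X_train = np.array([[148.0, 72.0, 33.6],
                    [85.0, 0.0, 26.6],
                    [183.0, 64.0, 0.0],
                    [89.0, 66.0, 28.1]])
X_test = np.array([[0.0, 40.0, 43.1]])

preprocess = Pipeline([
    # Replace zeros with the per-feature median of the non-zero training values.
    ("impute", SimpleImputer(missing_values=0, strategy="median")),
    # Rescale each feature to mean 0 and standard deviation 1.
    ("scale", StandardScaler()),
])

# Fit on the training data only, then reuse the learned medians and scales on the test data.
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)

print(preprocess.named_steps["impute"].statistics_)
```

Fitting the pipeline on the training set and only transforming the test set avoids leaking test-set statistics into the medians and scaling parameters.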
Model Training and Evaluation
Random Forest Classifier:
Hyperparameters such as the number of estimators, maximum depth, and minimum samples split were tuned using GridSearchCV.
The tuned model was evaluated on the test set and achieved an ROC AUC score of 0.8172; the classification report and confusion matrix give the detailed per-class metrics.
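The Random Forest tuning and evaluation can be sketched as below. The grid values, the 3-fold cross-validation, the `roc_auc` scoring choice, and the synthetic data from `make_classification` are all assumptions for illustration, so the printed score will not match the 0.8172 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption), same size as the real dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Illustrative grid over the three hyperparameters named in the text.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # assumed selection metric
    cv=3,
)
search.fit(X_train, y_train)

# ROC AUC needs scores, so use the predicted probability of the positive class.
rf_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(rf_auc, 4))
```

`GridSearchCV` refits the best configuration on the full training set by default, so `search` can be used directly for test-set predictions.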
Logistic Regression:
Hyperparameters such as the regularization parameter C and solver were tuned using GridSearchCV.
The tuned model was evaluated on the test set and achieved an ROC AUC score of 0.8163; the classification report and confusion matrix give the detailed per-class metrics.
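A comparable sketch for Logistic Regression follows. As before, the candidate values for `C`, the two solvers, `max_iter=1000`, the 5-fold cross-validation, and the synthetic data are assumptions rather than the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption), same size as the real dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Smaller C means stronger regularization; both solvers support the default L2 penalty.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "solver": ["liblinear", "lbfgs"],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",  # assumed selection metric
    cv=5,
)
search.fit(X_train, y_train)

lr_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
report = classification_report(y_test, search.predict(X_test))
print(search.best_params_, round(lr_auc, 4))
print(report)
```

The classification report lists precision, recall, and F1 per class, which complements the single-number ROC AUC summary.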
Model Comparison
Confusion matrices for both models were plotted to compare their performance. On ROC AUC the two models are nearly tied (0.8172 for Random Forest versus 0.8163 for Logistic Regression).
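To show how two confusion matrices are read against each other, here is a small worked example. The labels and the two prediction vectors (`pred_a`, `pred_b`) are hypothetical toy values, not output from the models above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions from two classifiers.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
pred_a = [0, 0, 0, 1, 1, 1, 0, 1]
pred_b = [0, 0, 1, 1, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm_a = confusion_matrix(y_true, pred_a)
cm_b = confusion_matrix(y_true, pred_b)

print(cm_a)  # [[3 1]
             #  [1 3]]
print(cm_b)  # [[2 2]
             #  [0 4]]
```

In this toy case, classifier B misses no positive cases but raises more false alarms than classifier A. For a diabetes screening task, that trade-off between false negatives and false positives is usually more informative than a single aggregate score. For plotting, scikit-learn's `ConfusionMatrixDisplay` can render such matrices.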
Source Code
The complete source code is provided in this notebook.