
Classification: Heart Disease
In this project, I will be using the classification method for modeling. The goal is to classify the presence of heart disease in an individual from personal information such as age, sex, and bodily metrics.
I will be implementing a decision tree classification model in this project.
The following are the Python libraries I have utilized in the project:
Pandas
Matplotlib
Seaborn
Numpy
Scikit-learn (train_test_split, confusion_matrix, tree, classification_report, accuracy_score, DecisionTreeClassifier)
Introduction to the data
The dataset I have chosen to explore is a single CSV file from kaggle.com, published by M Yasser H, an AI and ML engineer. The dataset is named “Heart Disease Dataset” and can be publicly accessed.
The following are attributes in the dataset:
age - age in years
sex - sex (1 = male; 0 = female)
cp - chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
trestbps - resting blood pressure (in mm Hg on admission to the hospital)
chol - serum cholesterol in mg/dl
fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg - resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy)
thalach - maximum heart rate achieved
exang - exercise induced angina (1 = yes; 0 = no)
oldpeak - ST depression induced by exercise relative to rest
slope - the slope of the peak exercise ST segment (1 = up-sloping; 2 = flat; 3 = downsloping)
ca - number of major vessels (0-3) colored by fluoroscopy
thal - thallium stress test result (3 = normal; 6 = fixed defect; 7 = reversible defect)
target - 0 = healthy individual; 1 = heart-disease patient
Pre-Processing the Data
For this stage, my goal was to check for any NaN values in the dataset. Most columns have integer or float datatypes because they record numeric metrics, and since most of the data is numerical, we can make visualizations using the seaborn library. I used seaborn to check that the target variable was balanced, so the model would not be biased towards one outcome because of its frequency. The dataset itself did not need much cleaning: there were no NaN values, and the categorical variables had already been encoded as discrete integer values for modeling.
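The checks above can be sketched as follows. This is a minimal example using a small synthetic DataFrame standing in for the Kaggle CSV (with the real file, the frame would come from pd.read_csv on the downloaded dataset); the column names follow the dataset's schema.

```python
import pandas as pd

# Small synthetic stand-in for the Kaggle CSV; column names follow the dataset
df = pd.DataFrame({
    "age":    [63, 37, 41, 56],
    "sex":    [1, 1, 0, 1],
    "chol":   [233, 250, 204, 236],
    "target": [1, 1, 1, 0],
})

# 1) NaN check: total missing cells across the frame
n_missing = df.isna().sum().sum()

# 2) dtype check: all columns should be numeric for modeling
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)

# 3) class balance: frequency of each target class
balance = df["target"].value_counts()

print(n_missing, all_numeric, balance.to_dict())
```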
Data Understanding/Visualization
The script below visualizes the frequency of both target classes.
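A minimal sketch of that class-frequency plot with seaborn's countplot, using a toy target column in place of the real data:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# Toy target column standing in for the dataset's target variable
df = pd.DataFrame({"target": [0, 1, 1, 0, 1, 0, 1, 1]})

# One bar per class; roughly equal heights indicate a balanced target
ax = sns.countplot(x="target", data=df)
ax.set_xlabel("target (0 = healthy, 1 = heart disease)")
ax.set_ylabel("count")
plt.tight_layout()
```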


The 3 graphs below show the frequency of the target class across 3 of the features, to check for any clean splits in classification. These graphs show that there is no clear split between the predictors.
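One of those per-feature plots can be sketched as below, again on synthetic stand-in columns; with the real data, the same call would be repeated for the other two predictors.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in columns; the real data uses the same names
df = pd.DataFrame({
    "cp":     [1, 2, 3, 4, 1, 2, 3, 4],
    "target": [0, 1, 1, 0, 1, 1, 0, 1],
})

# Target frequency within each chest-pain type; overlapping bars across
# classes mean the feature alone does not split the target cleanly
ax = sns.countplot(x="cp", hue="target", data=df)
ax.set_title("target frequency by chest-pain type")
plt.tight_layout()
```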



Modeling
For this project, I chose the decision tree model. The goal is to classify whether an individual has heart disease or not. I chose this particular classification method because it is intuitive and its findings are easier to explain to potential beneficiaries.
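The train/test split and tree fitting can be sketched as follows. Synthetic data from make_classification stands in for the heart-disease predictors and target; with the real data, X and y would come from the DataFrame's feature columns and the target column.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in: 13 features, matching the dataset's attribute count
X, y = make_classification(n_samples=500, n_features=13, random_state=42)

# Hold out a test set for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the decision tree and score it on the held-out data
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")
```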


Evaluation
The following is the confusion matrix of the model.

The true positive and true negative counts are 48 and 65 respectively; false positives are 22 and false negatives are 17.
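The matrix itself comes from sklearn's confusion_matrix. A sketch on toy labels (the report's counts come from the actual test split, not these vectors):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only; in the project these are y_test and the
# tree's predictions on the held-out data
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```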
The accuracy of our model is 74.3%, which is not the best accuracy score. But because the predictor attributes are generic and relatively easy to measure for an individual, predicting whether an individual has heart disease with ~75% accuracy makes this a moderately good model.

All metrics are greater than 65%, which is good for a health-related model. The model performs better at classifying whether a person has heart disease (class 1).
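Those per-class metrics come from sklearn's classification_report. A sketch on toy labels (with the real model, the inputs are y_test and the tree's predictions):

```python
import numpy as np
from sklearn.metrics import classification_report

# Illustrative labels only
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 1])

# Precision, recall, and F1 per class, plus overall accuracy
print(classification_report(y_true, y_pred,
                            target_names=["healthy", "disease"]))
```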
Storytelling
This project was very interesting and insightful. I learned about creating decision trees, interpreting the model, and evaluating its performance. For this particular model, an accuracy of ~75% is relatively good because the data for the predictors can be obtained fairly easily. Of course, with a 75% accuracy score this model cannot be used as a primary indicator of whether an individual has heart problems, but it is a useful factor for initial recognition in patients. The initial split happens on the number of major vessels colored by fluoroscopy (ca). The next splits happen on cp and thal, and after many more, the last split happens on thalach levels to complete the decision tree. In conclusion, this model can make a difference in the medical area: because of its moderately high performance, it is a good tool for early recognition, starting with a fluoroscopy, since that is where the initial split happens and the most information is gained at the root.
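The split structure described above can be inspected with sklearn's export_text. This sketch fits a shallow tree on synthetic features labeled with the dataset's column names; on the real data, the printed root split is the one on ca reported above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# The dataset's 13 predictor names, used to label the synthetic features
feature_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# A shallow tree keeps the printed rules readable
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Text rendering of the tree: the first line is the root split
print(export_text(clf, feature_names=feature_names))
```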
References:
https://www.kaggle.com/yasserh/heart-disease-dataset
https://blogs.oracle.com/ai-and-datascience/post/a-simple-guide-to-building-a-confusion-matrix
https://towardsdatascience.com/understanding-decision-trees-once-and-for-all-2d891b1be579