Project Reference: CS97

Student’s Name: Abdullah Chaudhry

Project title: Using Machine Learning To Predict Stroke

Course Title: Computer Science

Supervisor’s Name: Abdel-Rahman Tawil

The aim of my project is to predict the probability that an observation belongs to one of two groups (stroke / no stroke), and from this to predict whether a patient is at risk of having a stroke.

Machine learning has become a popular tool for analyzing and addressing medical and clinical problems because of its promising results. This study examines four different algorithms that can assist hospitals, health industries, surgeons and medical professionals in predicting stroke based on a patient’s data. The report is divided into four parts. In the first part we discuss the importance, motives, and main aims and objectives of this study. An interactive dashboard in Power BI was also created to better understand the dataset and bring out useful insights. In the second part, 12 selected features and their importance to the target value were visualized using various graphs and charts. In the third part, four different algorithms, Random Forest (RF), Decision Tree (DT), K-Nearest Neighbors (KNN) and Support Vector (SV) classifiers, were built, trained, and then tested. In this part we also oversampled the dataset using SMOTE, and used Grid Search CV to find the combination of parameters that gives the highest model accuracy. In the final part we visualize and discuss the results obtained from these algorithms and their effectiveness, using model evaluation methods such as the confusion matrix, classification report and ROC curve. Lastly, we conclude with the chosen and recommended algorithm and provide recommendations which could inform future studies and research. The report explores a dataset of 5110 patients and 12 different features. The results show that Random Forest and Decision Tree have the highest accuracy in predicting the risk of stroke, followed by K-Nearest Neighbors and then Support Vector. Random Forest achieves an ROC of 99% and an F1-score of 94% and 95% after hyperparameter tuning. On the oversampled data, Random Forest reached an accuracy of 94%, with precision, recall and F1 all above 90%.
Decision Tree was found to be the faster model to train before hyperparameter tuning. After applying Grid Search CV its accuracy was 0.04% higher than that of Random Forest, but its F1-score, recall and precision were all lower than those of Random Forest.
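
The evaluation measures named above (accuracy, precision, recall and F1) can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch in plain Python, not the code used in the report. The function names and the ten example labels are hypothetical and are not taken from the study's data or results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1          # stroke correctly predicted
        elif pred == positive:
            fp += 1          # stroke predicted, but patient had none
        elif truth == positive:
            fn += 1          # stroke missed by the model
        else:
            tn += 1          # no stroke, correctly predicted
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for illustration only: 1 = stroke, 0 = no stroke.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this made-up example the accuracy is higher than the precision, recall and F1, which illustrates why the report compares models on all four measures rather than on accuracy alone when the classes are imbalanced.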

This study examines four different algorithms that can assist hospitals, health industries, surgeons and medical professionals in predicting stroke based on a patient’s data. The Random Forest algorithm achieved an ROC score of 99%, which means that it can readily distinguish between the stroke and non-stroke patient classes. In this project we experimented with oversampling to balance the imbalanced data and with hyperparameter tuning to optimize model parameters. Accuracy and model performance were compared before and after oversampling, and before and after hyperparameter tuning.
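
The core idea behind SMOTE-style oversampling is to create synthetic minority-class points by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below is a simplified, standard-library illustration of that idea, not the library implementation used in the project. The function name `smote_like_oversample`, the two-feature points and the class counts are all hypothetical.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)          # fixed seed for repeatable output
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()             # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical, already-scaled numeric features for the minority (stroke) class.
stroke = [(0.80, 0.60), (0.90, 0.70), (0.85, 0.90), (0.70, 0.75)]
no_stroke_count = 10                   # assumed size of the majority class

new_points = smote_like_oversample(stroke, n_new=no_stroke_count - len(stroke))
balanced_stroke = stroke + new_points
print(len(balanced_stroke))
```

Oversampling of this kind is normally applied to the training split only, so that the test set keeps the original class balance and the before/after comparison stays fair.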