Research Article

A Mathematical Model of Analytical Supervised Learning Algorithms for Stroke Prediction Using PySpark: Precision, Dispersion and Random Noise Fluctuation Analysis

Ogoegbulem Ozioma, Suit Patrick OGHENERHORO, Angela Okwuolise OKONYE
✉️ Ozioma.Ogoegbulem@dou.edu.ng
IJMS, Volume 1, Issue 1, 2026, pp. 1-20
Jun 02, 2026
How to Cite
Ogoegbulem Ozioma, Suit Patrick OGHENERHORO, Angela Okwuolise OKONYE. (2026). A Mathematical Model of Analytical Supervised Learning Algorithms for Stroke Prediction Using PySpark: Precision, Dispersion and Random Noise Fluctuation Analysis. Ktrend - International Journal of Mathematics and Statistics (IJMS), 1(1), 1-20. https://doi.org/10.5281/zenodo.20510833

Abstract

<p>Stroke prediction is a significant problem in computational medicine because stroke occurrence is influenced by nonlinear interactions among demographic, physiological, and lifestyle risk variables. This paper develops a mathematical model of analytical supervised learning algorithms for stroke prediction using PySpark. The study treats stroke prediction as a binary classification problem and formulates the learning pipeline using empirical risk minimization, logistic probability maps, impurity-based recursive partitioning, ensemble aggregation, gradient boosting updates, separating hyperplanes, confusion matrices, receiver operating characteristic (ROC) curves, and cross-validation. In addition to the conventional machine learning pipeline, the paper introduces a precision-dispersion and random-noise fluctuation framework for studying the stability of medical predictors. This extension is motivated by recent work on precision and dispersion analysis of interacting simulated data with random noise fluctuation, and it is used to quantify how feature variability may influence model reliability. The model is supported by the following graphical components: a TikZ analytical workflow, a performance comparison chart, a feature-importance chart, conceptual ROC curves, a three-dimensional stroke-risk surface, a precision-dispersion plot, a random-noise fluctuation plot, cross-validation graphics, and confusion-matrix heatmaps. The comparative results indicate that Random Forest and Gradient Boosted Trees provide the strongest predictive behaviour among the five supervised classifiers considered. Random Forest achieved a testing AUC of 92.41%, an accuracy of 86.64%, and an F1 score of 87.20% before cross-validation, and maintained a testing AUC of 92.26% with an F1 score of 87.74% after cross-validation.
Feature-importance and risk-surface analysis indicate that age, body-mass index, average glucose level, hypertension, and heart disease are dominant predictive factors. The paper concludes that PySpark-based ensemble learning, when supplemented with precision, dispersion, and noise-fluctuation analysis, provides a scalable mathematical framework for interpretable stroke-risk prediction. However, any clinical deployment requires external validation, privacy protection, fairness auditing, and professional medical oversight.</p>