Abstract:
In top-tier football, particularly in high-demand leagues such as the English Premier League (EPL) and the Spanish LaLiga, player injuries continue to significantly affect performance and revenue. Using multidimensional data on 669 professional football players, this paper introduces a novel ensemble-based machine learning system that estimates the likelihood of an individual player suffering an injury in the 2024/25 season. The dataset comprises match workload, GPS-derived locomotor measures, physiological measures, recovery and wellness measures, and historical injury data. We evaluated LightGBM, XGBoost, CatBoost, Random Forest, Logistic Regression, and a Soft Voting Ensemble. The ensemble of tuned gradient-boosting models achieved the best discrimination (AUC = 0.883) and the best-calibrated probabilities, outperforming every individual model. In contrast, the baseline LightGBM model produced overconfident risk estimates, which limits its usefulness for medical decision-making in practice. SHAP analysis identifies the variables that contribute most to predicted injury risk, including the number of seasons, the frequency of previous injuries, stride length, sleep quality, and high-intensity accelerations. The proposed model gives elite football clubs a powerful, intuitive means of forecasting injury risk in real time, and it supports evidence-based load management and medical interventions in the EPL and LaLiga.