Unsupervised Learning of Cardiovascular Aging Trajectories from Cross-Sectional Population Data Reveals Critical Early-Adulthood Divergence (2021)

Artificial Intelligence-Data Science-Healthcare and Medicine-Machine Learning

Author:
Mohammad Motaghianfar


Abstract

Heart disease risk varies widely as people age, but following patients over time is costly and complex. Cross-sectional data offers a cheaper alternative, yet tools to uncover long-term patterns from it are lacking. We applied Gaussian Mixture Models to a dataset of 68,628 patients to find distinct heart health profiles, treating age as a proxy for time. With minimal data cleanup, we identified eight distinct groups whose heart disease prevalence ranged from 42.6% to 65.2%. A key finding was that these groups diverged most between ages 35–40, marking this as a critical window for heart health. After 40, group patterns stabilized, which suggests that risk profiles are largely shaped earlier in life. This simple AI approach reveals actionable patterns from snapshot data and can guide early, tailored prevention strategies.

Keywords: cardiovascular aging, unsupervised learning, phenotype discovery, trajectory analysis, preventive cardiology


1. Introduction

Cardiovascular disease is the world’s leading cause of death, with aging as a major driver. Long-term studies like Framingham provide deep insights but take decades and heavy resources. Cross-sectional data, covering diverse ages, could reveal similar patterns more efficiently, but methods to do so are underdeveloped.

Most AI in heart research focuses on predicting risks, not uncovering natural patient groups. Our study uses unsupervised learning to find these groups and track how they evolve with age. We aim to show that a simple AI model can pull long-term heart health trends from snapshot data, pinpointing when risks take shape for better prevention.


2. Methods

Data and Participants

We used the Cardiovascular Disease Dataset from Kaggle, with 70,000 patients. After quality checks, we included 68,628 patients aged 30–65 with full clinical and lifestyle data.

Preprocessing

We kept preprocessing light: (1) converted age to years, (2) calculated BMI, (3) removed impossible values (e.g., blood pressure <80/40 or >250/150 mmHg), and (4) standardized numerical features. Only 1.96% of records were excluded.
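The four steps above can be sketched in Python with pandas and scikit-learn. This is a minimal sketch under stated assumptions, not the code used in the study: the column names (age stored in days, height in cm, weight in kg, ap_hi and ap_lo for systolic and diastolic pressure) are assumed to match the public Kaggle file, the blood pressure limits are read as systolic 80–250 and diastolic 40–150 mmHg, and the function name preprocess is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumed column names; adjust them to the actual file.
NUMERIC = ["age_years", "height", "weight", "bmi", "ap_hi", "ap_lo"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four light preprocessing steps to a raw patient table."""
    out = df.copy()
    # (1) Convert age to years (age is assumed to be stored in days).
    out["age_years"] = out["age"] / 365.25
    # (2) BMI = weight (kg) / height (m)^2, with height assumed in cm.
    out["bmi"] = out["weight"] / (out["height"] / 100.0) ** 2
    # (3) Remove impossible blood pressure readings (limits are inclusive).
    plausible = out["ap_hi"].between(80, 250) & out["ap_lo"].between(40, 150)
    out = out[plausible].copy()
    # (4) Standardize numerical features to zero mean and unit variance,
    #     stored in new columns with a "_z" suffix.
    scaled = StandardScaler().fit_transform(out[NUMERIC])
    for i, col in enumerate(NUMERIC):
        out[col + "_z"] = scaled[:, i]
    return out
```

Standardizing after the filter keeps implausible readings from distorting the means and standard deviations used for scaling.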

Phenotype Discovery

We used Gaussian Mixture Models with full covariance to find patient groups. Features included gender, height, weight, BMI, blood pressure, cholesterol, glucose, smoking, alcohol, and activity levels. We chose the number of groups that minimized the Bayesian Information Criterion (BIC).
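Model selection by BIC can be illustrated with scikit-learn's GaussianMixture. The sketch below is illustrative only: the candidate range of 2 to 12 components, the random seed, and the function name fit_gmm_by_bic are assumptions, since the text does not state which values were searched.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_by_bic(X, k_values=range(2, 13), seed=0):
    """Fit one full-covariance GMM per candidate k; keep the lowest-BIC model.

    Returns (best_model, bics) where bics maps each k to its BIC value.
    """
    X = np.asarray(X, dtype=float)
    bics, best = {}, None
    for k in k_values:
        gmm = GaussianMixture(
            n_components=k, covariance_type="full", random_state=seed
        ).fit(X)
        bics[k] = gmm.bic(X)  # lower BIC is better
        if best is None or bics[k] < bics[best.n_components]:
            best = gmm
    return best, bics

# Usage (X is the standardized feature matrix):
#   best, bics = fit_gmm_by_bic(X)
#   phenotype = best.predict(X)   # hard group assignment per patient
```

BIC penalizes each extra component for its added mean, covariance, and weight parameters, so it favors the smallest model that still fits the data well.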

Trajectory Analysis

We grouped ages into 5-year bins (30–35, 35–40, …, 60–65). For each phenotype, we tracked average feature values across bins to see how they evolved with age. We measured how distinct the phenotypes were in each age bin by the variance of their feature values.
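One way to compute such trajectories is sketched below with pandas. Several choices here are assumptions rather than details from the text: bins are treated as half-open ([30, 35), [35, 40), and so on) because the listed bin edges overlap, distinctness is taken as the variance of phenotype means across phenotypes, summed over features, and the column names age_years and phenotype are hypothetical.

```python
import pandas as pd

def phenotype_trajectories(df, features, age_col="age_years",
                           label_col="phenotype"):
    """Return (means, spread) for 5-year age bins between 30 and 65.

    means:  mean of each feature per (phenotype, age bin)
    spread: per age bin, variance of phenotype means across phenotypes,
            summed over features (an assumed distinctness measure)
    """
    in_range = df[(df[age_col] >= 30) & (df[age_col] < 65)]
    # Label each row by the lower edge of its bin: 30 for [30, 35), etc.
    age_bin = (in_range[age_col] // 5 * 5).astype(int).rename("age_bin")
    means = in_range.groupby([in_range[label_col], age_bin])[features].mean()
    spread = means.groupby(level="age_bin").var().sum(axis=1)
    return means, spread
```

Under these assumptions, the age bin where phenotypes differ most is spread.idxmax().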

Evaluation Metrics

We used silhouette scores to check clustering quality and compared results to known aging patterns for clinical validity.
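A silhouette check of this kind can be sketched with scikit-learn's silhouette_score. Scoring a random subsample is an assumption made here to keep the pairwise distance computation cheap on tens of thousands of patients; the text does not say whether the full dataset or a sample was scored, and the function name clustering_quality is hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def clustering_quality(X, labels, sample_size=10000, seed=0):
    """Mean silhouette coefficient, estimated on a random subsample.

    Values near 1 mean compact, well-separated groups; values near 0
    mean the groups overlap heavily.
    """
    labels = np.asarray(labels)
    n = len(labels)
    return silhouette_score(
        X, labels, sample_size=min(sample_size, n), random_state=seed
    )
```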


3. Results

Gaussian Mixture Models found eight distinct heart health groups (phenotypes). Phenotype 6 (43.3% of patients) had the lowest risk (42.6% with heart disease), while Phenotypes 1 and 3 had high cholesterol and high risk (65.2% and 64.2%, respectively). Phenotype 2, with extreme obesity (BMI 52.3), had 57.7% risk. The silhouette score was 0.054, which indicates that the groups overlap substantially, as is common in complex health data.
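Per-phenotype figures like those above (share of patients and percentage with heart disease) can be obtained with a simple group-by. This is a sketch only: the column names phenotype and cardio (a 0/1 heart disease flag) and the function name phenotype_summary are assumptions.

```python
import pandas as pd

def phenotype_summary(df, label_col="phenotype", outcome_col="cardio"):
    """Share of patients and heart disease percentage per phenotype."""
    grouped = df.groupby(label_col)[outcome_col]
    return pd.DataFrame({
        "share_pct": 100 * grouped.size() / len(df),
        # Mean of a 0/1 flag is the fraction of patients with the outcome.
        "risk_pct": 100 * grouped.mean(),
    })
```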

Group patterns stabilized after age 40, with the biggest differences at 35–40 years (variance 20.38), suggesting this is when heart health paths lock in. This highlights early adulthood as a key time for intervention.


4. Discussion

Our simple AI model pulled meaningful heart health trends from cross-sectional data, pinpointing ages 35–40 as a turning point for heart disease risk. This matches research showing early-life factors shape heart health long-term. Unlike costly longitudinal studies, our approach is efficient and keeps results easy for doctors to understand.

High-risk groups, like Phenotypes 1 and 3 with high cholesterol, could benefit from early cholesterol-focused treatments. The stability after 40 suggests prevention must start earlier. Our method’s simplicity makes it practical for real-world use.

Limitations include the lack of causal insights from cross-sectional data and missing factors like genetics. Future work should test these groups in other datasets, add more factors, and link to health records for real-time use.


5. Conclusion

Our minimalist AI clustering revealed eight heart health groups from snapshot data, showing ages 35–40 as a critical window for heart disease risk. This offers a practical, scalable way to guide early prevention and tailor heart care, supporting personalized medicine when long-term follow-up data are not available.


Acknowledgments

We thank Prasad (2022) for the “Healthcare Dataset” on Kaggle, providing de-identified patient records that enabled this study.

