Sam Rusk, BS1 • Fred Turkington, BS1 • Chris R. Fernandez, MS1 • Yoav N. Nygate, MSc1 • Nick Glattard, MS1 • Melania Abrahamian, PhD1 • Tom Vanasse, PhD1 • Dana Richardson, BSN, MAB2 • Tim Bartholow, MD2 • Nathaniel Watson, MD, MSc3
Introduction
Healthcare insurance claims contain an unrecognized wealth of structured data that can be leveraged to investigate epidemiologic and economic relationships in health and disease. We studied the feasibility of using machine learning algorithms to improve screening for obstructive and central sleep apnea (SA) at the population health level from existing health insurance claims data.
Methods
A logistic regression model was trained to predict the presence or absence of SA from an aggregated healthcare insurance claims dataset. The dataset was composed of medical and pharmacy claims from 2016 to 2020 in the Wisconsin All-Payor claims database, which covered >4,000,000 patients, >10,000 ICD codes, and >$50 billion in medical spending.
The pharmacy portion of the dataset comprised 91.5 million claims covering 1,870,000 patients and 39,712 unique federal drug identification codes. Input features were constructed by counting the total number of claims for each unique drug for each patient, resulting in a patient-level feature vector of 39,712 drug frequencies.
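The feature construction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `claims` records, the patient IDs, the drug codes, and the function name `build_drug_frequency_features` are all hypothetical placeholders.

```python
from collections import Counter, defaultdict

# Hypothetical pharmacy claims: one (patient_id, drug_code) pair per claim.
claims = [
    ("p1", "drugA"),
    ("p1", "drugA"),
    ("p1", "drugB"),
    ("p2", "drugC"),
    ("p3", "drugA"),
    ("p3", "drugC"),
    ("p3", "drugC"),
]


def build_drug_frequency_features(claims):
    """Count claims per (patient, drug) and return one count vector per patient."""
    counts = defaultdict(Counter)
    for patient_id, drug_code in claims:
        counts[patient_id][drug_code] += 1

    # Fix a column order so every patient vector has the same layout.
    drug_codes = sorted({drug for _, drug in claims})
    patient_ids = sorted(counts)
    # A Counter returns 0 for drugs a patient never filled.
    matrix = [[counts[p][d] for d in drug_codes] for p in patient_ids]
    return patient_ids, drug_codes, matrix


patient_ids, drug_codes, matrix = build_drug_frequency_features(claims)
for patient_id, row in zip(patient_ids, matrix):
    print(patient_id, row)
```

With tens of thousands of drug columns and most counts equal to zero, a sparse matrix representation would likely be preferable to the dense lists used here for readability.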
The positive SA population was defined as individuals who had both at least one medical claim for a sleep apnea diagnosis (ICD codes G4733/G4731) and a claim for an appropriate sleep test (CPT codes 9580*/9581*). The logistic regression model was evaluated using randomized 10-fold cross-validation, and performance was reported using ROC-AUC statistics and a top-10 feature importance analysis.
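As a rough sketch of the labeling rule and evaluation protocol above, the following uses scikit-learn on synthetic data. Several things here are assumptions rather than details from the study: the data are randomly generated stand-ins, the CPT wildcard is interpreted as a prefix match, the folds come from a shuffled `KFold`, and the model uses scikit-learn's default regularization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

SA_ICD_CODES = {"G4733", "G4731"}
SLEEP_TEST_CPT_PREFIXES = ("9580", "9581")  # assumed reading of 9580*/9581*


def is_sa_positive(icd_codes, cpt_codes):
    """Positive only if an SA diagnosis AND a sleep test both appear."""
    has_diagnosis = any(code in SA_ICD_CODES for code in icd_codes)
    has_sleep_test = any(
        code.startswith(SLEEP_TEST_CPT_PREFIXES) for code in cpt_codes
    )
    return has_diagnosis and has_sleep_test


# Synthetic stand-in for the patient-by-drug count matrix (not real data).
rng = np.random.default_rng(0)
n_patients, n_drugs = 400, 20
y = rng.integers(0, 2, size=n_patients)
rates = np.full((n_patients, n_drugs), 0.5)
rates[:, :3] += 1.5 * y[:, None]  # first three drugs are more frequent in positives
X = rng.poisson(rates)

# Randomized 10-fold cross-validation scored by ROC-AUC.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)
fold_aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
mean_auc = fold_aucs.mean()
print(f"mean ROC-AUC over {len(fold_aucs)} folds: {mean_auc:.3f}")
```

The AUC printed here reflects only the artificial signal injected into the synthetic data and says nothing about the performance reported in this abstract.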
Results
The logistic regression model detecting SA based solely on observed medication frequencies produced an ROC-AUC of 0.77. In a feature importance analysis, three of the top-10 most discriminative features were medications for the treatment of diabetes, hypertension, and hyperlipidemia.
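The abstract does not specify how feature importance was computed. One common approach for a logistic regression model, shown here purely as an assumption, is to rank features by the absolute value of their fitted coefficients. The data and the `drug_names` labels below are synthetic and hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not real claims); drug names are hypothetical.
rng = np.random.default_rng(1)
n_patients, n_drugs = 500, 15
drug_names = [f"drug_{i:02d}" for i in range(n_drugs)]
y = rng.integers(0, 2, size=n_patients)
rates = np.full((n_patients, n_drugs), 0.5)
rates[:, [2, 7]] += 2.0 * y[:, None]  # two drugs carry the injected signal
X = rng.poisson(rates)

model = LogisticRegression(max_iter=1000).fit(X, y)
coefs = model.coef_[0]

# Rank features by coefficient magnitude and keep the ten largest.
top_idx = np.argsort(-np.abs(coefs))[:10]
top_features = [(drug_names[i], float(coefs[i])) for i in top_idx]
for name, coef in top_features:
    print(f"{name}: {coef:+.3f}")
```

Coefficient magnitudes are only comparable when features share a similar scale, so in practice the counts might be standardized or transformed before ranking.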
We hypothesize that this drug-frequency-based model functions by exploiting the strong correlation of SA with specific clusters of known comorbid conditions and their corresponding medication regimens.
Conclusions
We demonstrate that health insurance claims records contain predictive information that can aid in more systematic screening for undiagnosed conditions like SA.
Furthermore, in a statistical analysis of feature importance, we observed medications indicative of comorbidities with known associations with SA.
These findings are useful to clinicians and payers, including those responsible for value-based payment models, in identifying undiagnosed SA populations.