
Detecting mislabels in healthcare data, unsupervised.

An unsupervised error-detection method, applied to ICU sepsis records. Published in Bioengineering (MDPI), Vol. 11, No. 8, 2024.

Bioengineering 2024 · eICU dataset · Pattern Discovery & Disentanglement · +38% vs K-Means · +4% supervised lift

why this matters

Medical datasets are imbalanced and contain errors due to subjective test results and clinical variability. Poor data quality affects the accuracy and reliability of every downstream classifier. Surfacing abnormal samples lets clinicians make better decisions, but most error-detection methods require ground-truth labels we don’t have.

what we did

  • Discover statistically significant association patterns. We applied the Pattern Discovery and Disentanglement (PDD) model, from the lab's earlier work, to the eICU Collaborative Research Database for sepsis risk assessment.
  • Cluster samples and flag anomalies. PDD generates an interpretable knowledge base, clusters samples in an unsupervised manner, and identifies abnormal samples whose pattern signature is inconsistent with their assigned label.
  • Re-train downstream classifiers. We removed the flagged samples, re-trained multiple supervised classifiers, and compared their accuracy against the same models trained on the noisy original data.
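The cluster-and-flag step can be sketched with a simplified stand-in: here a record is suspect when its chart label disagrees with its cluster's majority label. (PDD itself matches association-pattern signatures rather than plain cluster majorities; the function name and toy data below are illustrative only.)

```python
from collections import Counter

def flag_label_disagreements(cluster_ids, labels):
    """Flag records whose recorded label disagrees with the majority
    label of their cluster -- a simplified stand-in for PDD's
    mislabeled bucket (PDD uses pattern signatures, not majorities)."""
    majority = {}
    for cluster in set(cluster_ids):
        votes = Counter(l for c, l in zip(cluster_ids, labels) if c == cluster)
        majority[cluster] = votes.most_common(1)[0][0]
    return [i for i, (c, l) in enumerate(zip(cluster_ids, labels))
            if l != majority[c]]

# toy example: record 3 carries a label its cluster-mates contradict
clusters = [0, 0, 0, 0, 1, 1, 1]
labels   = [1, 1, 1, 0, 0, 0, 0]
print(flag_label_disagreements(clusters, labels))  # [3]
```

The flagged indices would then be held out before re-training, mirroring the third step above.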
Figure: PDD organizes ICU records into a Knowledge Space (groups and subgroups) and a Pattern Space (the feature signature of each group). From there, each record is bucketed as an outlier, undecided, or mislabeled based on how its features sit relative to the cluster it was assigned to. The paper's two-part table shows (a) the Knowledge Space and Pattern Space columns that group records by feature values (GCS, blood pressure, heart rate, etc.) and split each group into outlier, undecided, and mislabeled buckets, and (b) specific record IDs from each bucket with their underlying clinical features.

what we found

PDD beat K-Means at unsupervised clustering and made every downstream classifier more accurate.

  • +38% vs K-Means · full dataset
  • +47% vs K-Means · reduced dataset
  • +4% avg supervised accuracy gain

Figure: per-classifier metrics before (red) and after (blue) removing the samples PDD flagged as abnormal. Six radar charts, one per classifier (Random Forest, SVM, Neural Net, Logistic Regression, LightGBM, XGBoost), each plot Recall, Accuracy, Precision, Balanced Accuracy, and F1-Score; every classifier's blue (after) polygon sits on or outside its red (before) polygon. The boost was most pronounced for LightGBM, XGBoost, and the neural net.
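For reference, the five metrics on each radar axis all come straight from the binary confusion matrix (sepsis = 1). A minimal pure-Python sketch:

```python
def radar_metrics(y_true, y_pred):
    """Compute the five radar-chart metrics from binary labels/predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn)              # sensitivity: caught sepsis / true sepsis
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / len(y_true)
    specificity = tn / (tn + fp)         # needed for balanced accuracy
    balanced_accuracy = (recall + specificity) / 2
    f1 = 2 * precision * recall / (precision + recall)
    return {"Recall": recall, "Accuracy": accuracy, "Precision": precision,
            "Balanced Accuracy": balanced_accuracy, "F1-Score": f1}

m = radar_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(m)  # every metric is 2/3 on this symmetric toy example
```

Balanced accuracy matters here because, as noted above, sepsis datasets are imbalanced: it averages sensitivity and specificity instead of letting the majority class dominate.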

On unsupervised clustering, PDD outperformed K-Means by 38% on the full dataset and 47% on a reduced dataset. When we removed the samples PDD flagged as abnormal and re-trained multiple supervised classifiers, their accuracy improved by an average of 4%. The flagged samples also serve as a review queue: a clinician can decide case-by-case whether to relabel or drop, instead of trusting noisy chart labels uniformly.
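The review-queue idea can be sketched as a ranking: surface first the flagged records whose clusters are most label-pure, since disagreeing with a near-unanimous cluster is the strongest mislabel signal. (The function name and toy data are hypothetical; PDD's actual ordering comes from its pattern signatures.)

```python
from collections import Counter

def review_queue(flagged, cluster_ids, labels):
    """Order flagged record indices for clinician review: records that
    disagree with a highly label-pure cluster come first."""
    def cluster_purity(i):
        members = [l for c, l in zip(cluster_ids, labels)
                   if c == cluster_ids[i]]
        return Counter(members).most_common(1)[0][1] / len(members)
    return sorted(flagged, key=cluster_purity, reverse=True)

clusters = [0, 0, 0, 0, 1, 1, 1, 1]
labels   = [1, 1, 1, 0, 0, 1, 0, 1]
# record 3 disagrees with a 75%-pure cluster; 5 and 7 with a 50/50 one
print(review_queue([5, 3, 7], clusters, labels))  # [3, 5, 7]
```

A clinician working down this list would hit the clearest candidates for relabeling first and the ambiguous ties last.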