Clustering Data

Aman Kansal and Sarah Scharber


Making a Case for “Why Cluster?”

Patients’ analyte values can be useful predictors, or features, in machine learning and statistical models. Automated models, while powerful, are sensitive to their input, and improperly labeled input can produce misinformed observations and predictions. Our group first encountered this problem while working on the Chronic Kidney Disease Project: when trying to follow creatinine, for example, we found more than 15 distinct test names corresponding to creatinine, with no way to group them into more meaningful categories. We encountered the problem again in our Sepsis Watch™ Project. From July to September 2014, the component name “report” corresponded to the order description “culture, blood”. Unfortunately, physicians had difficulty locating “report” in a patient’s chart and often reordered the blood culture, producing a spike in orders during those three months. After September, the component name changed to “culture blood (bkr)”, which was more easily found. Because those two component names were not clustered under a standardized “common name”, patient safety was endangered and our model received incomplete input, leading to incorrect predictions.

The Dynamic Data Wrinkle

When trying to come up with a solution, we realized that even if analytes are standardized at a single point in time, there is no guarantee they will not change later. So, could we develop a way to automate the clustering process? To tackle this problem, we used analyte data collected from October 2014 through October 2017. We included pertinent raw fields, such as “component name”, “reference unit”, and “value”, to develop an algorithm that groups similar analytes under standard common names. At first, we tried supervised and unsupervised algorithms. However, there were not enough informative features to train a reliable supervised learning model, and unsupervised learning models did not reach the accuracy threshold required for clinical decisions. We then explored the Bhattacharyya distance, a measure of similarity between probability distributions. We compared each analyte’s value distribution to that of a gold-standard analyte for each common name and ranked analytes from smallest to largest distance, yielding a recommendation engine to be used in conjunction with physician curation. We tested our algorithm, with some success, on 13 common names in particular: bicarbonate, creatinine, glucose, hematocrit, lactate, magnesium, potassium, PCO2, platelets, PO2, sodium, troponin, and white blood cell.
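The ranking step above can be sketched as follows. This is a minimal illustration, not our production code: it assumes each analyte’s values have already been binned into a histogram on a shared grid, and the function and variable names are ours for this example only.

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete distributions.

    p and q are histograms over the same bins; they are normalized
    here so each sums to 1. The distance is -log of the Bhattacharyya
    coefficient, sum(sqrt(p_i * q_i)); identical distributions give 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient, in (0, 1]
    return -np.log(bc)

def rank_candidates(gold_hist, candidate_hists):
    """Rank candidate analytes by distance to the gold-standard
    distribution, smallest (most similar) first.

    candidate_hists maps analyte name -> histogram on the same bins
    as gold_hist. Returns a list of (name, distance) pairs.
    """
    scored = [(name, bhattacharyya_distance(gold_hist, hist))
              for name, hist in candidate_hists.items()]
    return sorted(scored, key=lambda pair: pair[1])
```

A physician reviewing the output would see the closest-matching analytes first and confirm or reject each suggested grouping.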

A Path Forward

Clustering helps make sense of, and trend, valuable predictors for machine learning models. We have proposed a way to retroactively group analytes into more meaningful data. Furthermore, we propose a prototype workflow for prospectively grouping analytes: 1) separate clustering into two levels of grouping – the highest level corresponding to the common name and the more granular level corresponding to categories within the common name (e.g., outpatient vs. inpatient vs. OR, percentage vs. raw); 2) monitor new analyte names; 3) assign new analytes to appropriate existing or new common names. Establishing a strong foundation for understanding and grouping analytes is crucial to future success.
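The three-step workflow might be sketched as below. The two-level registry structure, the example entries, and all names here are illustrative assumptions, not a description of an implemented system.

```python
# Hypothetical two-level registry: common name -> category -> analyte names.
# The example entries are placeholders, not real mappings from our data.
registry = {
    "creatinine": {
        "inpatient": {"creatinine"},
        "outpatient": {"creatinine, serum"},
    },
}

def known_names(reg):
    """All analyte names already assigned somewhere in the registry."""
    return {name
            for categories in reg.values()
            for names in categories.values()
            for name in names}

def triage_new_analytes(incoming, reg, recommend):
    """Steps 2-3: flag unseen analyte names and pair each with ranked
    common-name suggestions for physician review.

    `recommend` is any callable returning candidate common names for an
    analyte name (e.g., the distance-based recommendation engine).
    """
    seen = known_names(reg)
    return [(name, recommend(name)) for name in incoming if name not in seen]
```

Under this sketch, a clinician would review each flagged name and either file it under a suggested common name or create a new one, keeping the registry current as analyte names change.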