Assessing the Accuracy of GYANT’s Artificial Intelligence Diagnosis Model in a Real-World Telemedicine Clinic

Rohan Bhandari, Ph.D.*, Kirill Kireyev, Ph.D.*, and Rick Wolf, Ph.D.*
*GYANT

Abstract

An unfortunate side effect of the medical community’s transition to electronic health records is that medical providers have been burdened with time-consuming clerical work that takes away from actual patient engagement. This medical history collection, however, can be automated and analyzed to provide a preliminary differential diagnosis1 to providers, making them more efficient and accurate in their clinical decision making. This study assesses the accuracy of GYANT’s artificial intelligence platform, which performs patient intake to generate a history of present illness and uses a machine learning model to analyze the relevant information and output a preliminary differential diagnosis. Using a dataset of 5,379 patient encounters collected from GYANT’s real-world telemedicine clinic, we found the GYANT diagnosis model to have a top-3 accuracy of 92%. Further study of the model’s decision making indicates that the model utilizes medically relevant information in a manner consistent with general medical expertise. We conclude that GYANT’s chatbot and diagnosis model provide state-of-the-art automated medical intake and diagnostic performance.

Introduction

Since the medical community transitioned to the electronic health record (EHR), medical providers have found themselves tasked with additional clerical work during patient encounters and with far less time to actually interact with patients. One study found that 37% of the time doctors spend in the exam room with a patient is dedicated to EHR and clerical work [1]. This increased burden on medical providers has led to physician burnout, with 59% of physicians reporting bureaucratic tasks as the main contributor to their burnout [2]. At the same time, it has resulted in inflated patient wait times that impact patient satisfaction and perceived quality of care [3]. Software tools designed with an emphasis on the needs and workflows of doctors, however, can help shift more of their time from data entry tasks towards interacting with patients.

In a typical encounter a provider tries to generate a history of the present illness (HPI) and uses that information to determine a preliminary hypothesis regarding the nature of the illness. Once this hypothesis is generated, the provider uses it to guide their subsequent information collection and testing, eventually arriving at a diagnosis. While this entire process requires substantial medical expertise, the information collection generally proceeds from broad to narrow as the nuances of the patient’s specific illness are elucidated. GYANT aims to reduce physician burden by automating the standard aspects of HPI questioning and using machine learning to generate the initial preliminary hypothesis regarding likely illnesses. In this modified workflow, a provider’s time is not tied up with prompting a patient with predictable self-report questions and is instead freed to focus more on the patient-specific aspects of symptom collection.

A key aspect of this approach is a reliable tool that generates an initial set of condition recommendations for the provider to further explore, as well as a digestible summary of the automated HPI collection. This requires that the underlying machine learning model be both highly accurate and transparent. A high-accuracy model indicates that it is able to leverage information within the HPI to make intelligent medical decisions and provide meaningful assistance to providers. It is equally important for the model to be transparent so that a provider can understand what information it used in making its decision, allowing the provider to trust the model’s recommendations and incorporate them into their clinical reasoning. Lastly, both high accuracy and transparency are needed to help providers differentiate between conditions that may have similar presentations but very different urgencies. The current study aims to quantitatively assess the accuracy of GYANT’s decision support system, as well as validate the content of its medical decision making.

This study used the GYANT platform to collect standard HPI information from real-life patients, along with their corresponding diagnoses as determined by licensed medical providers during text-based telemedicine encounters. Using this collected data, the diagnostic accuracy of GYANT’s machine learning model was assessed to validate its ability to utilize this information to provide useful condition recommendations to medical providers.

Methods

Data Collection

The data for this study was collected through patient encounters on the GYANT platform and obtained in the following manner: Patients entering the platform are prompted to provide a free-text initial complaint. The complaint is then matched to one or more series of questions called “flows”. These flows are clinical protocols consisting of about 20 questions, which were created and reviewed by medical experts and based on established clinical triage protocols, e.g. [4, 5, 6, 7, 8]. Patients are most commonly prompted to respond to questions by selecting a “Yes”, “No”, or “Don’t Know” button, though there can be additional response options, such as locations or qualifiers of pain. These questions are structured to correspond to established symptoms, such as “headache” or “cough”, and are represented by GYANT’s internal taxonomy, which maps to SNOMED CT codes [9]. As questions are answered, GYANT dynamically narrows in on the appropriate flow or jumps to new flows that are more relevant for the patient’s symptom presentation. Users located in the state of California additionally have the option to transfer their case to a California-licensed medical provider working on behalf of GYANT to receive a diagnosis. After the patient completes the questions, the provider receives a list of the 3 conditions2 that are most consistent with the reported symptoms, as well as a general case summary (Figure 2). At this point, the provider is able to chat with the patient if necessary to collect any further information and ultimately provide a diagnosis, effectively labeling the case for the purposes of this study.
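To make the structure of this collected evidence concrete, the sketch below shows one way the button responses could be encoded into the symptom feature values described later in the Results section (1 for reported, -1 for denied, 0 for unassessed or “Don’t Know”). The helper function and symptom names are purely illustrative assumptions; GYANT’s actual internal taxonomy and SNOMED CT mapping are not shown here.

```python
# Hypothetical encoding of chatbot button responses into model feature values.
RESPONSE_VALUES = {"Yes": 1, "No": -1, "Don't Know": 0}

def encode_responses(responses, symptom_taxonomy):
    """Build a symptom feature vector: 1 = reported, -1 = denied, 0 = unassessed or unknown."""
    return [RESPONSE_VALUES.get(responses.get(symptom), 0) for symptom in symptom_taxonomy]

# Illustrative flow answers and a tiny stand-in for the full symptom taxonomy.
features = encode_responses(
    {"headache": "Yes", "cough": "No"},
    ["headache", "cough", "fever", "photophobia"],
)
# features == [1, -1, 0, 0]
```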

To increase the number of cases included in the study, GYANT providers also labelled historical data in which the patient did not interact with a provider at the time of the encounter. In these cases, users gave a free-text initial complaint, answered questions from the relevant flow, and received a list of possible conditions. The collected data was displayed at a later time to providers in a format similar to how they would see a live case (though providers were aware that these were not live cases). The provider was allowed to label the case with a condition in circumstances where they felt confident, or to skip the case if they were either unsure of the likely diagnosis or felt they would need to chat with the patient to collect supplemental information. To handle the scenario where multiple providers labelled the same case, only cases with a majority agreement among the provider-determined labels were used for model training and testing, with the ground-truth diagnosis set to the majority label.
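A minimal sketch of how this majority-agreement filter could work is shown below. The data structure (a mapping from case identifiers to lists of provider labels) and the example labels are hypothetical; the sketch only illustrates the rule of keeping cases with a strict majority label.

```python
from collections import Counter

def majority_label(provider_labels):
    """Return the majority diagnosis for a case, or None if no strict majority exists."""
    label, votes = Counter(provider_labels).most_common(1)[0]
    return label if votes > len(provider_labels) / 2 else None

# Hypothetical provider labels for two historical cases.
labels_by_case = {
    "case_001": ["acute sinusitis", "acute sinusitis", "viral pharyngitis"],
    "case_002": ["migraine headache", "tension-type headache"],
}

ground_truth = {
    case_id: label
    for case_id, labels in labels_by_case.items()
    if (label := majority_label(labels)) is not None
}
# case_001 is kept with label "acute sinusitis"; case_002 has no majority and is dropped.
```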

This data collection process provided highly structured data consisting of the collected medical evidence along with auxiliary information (i.e., age, sex, and month of encounter), resulting in a dataset of 2,486 features with which to build a machine learning model. The medical diagnoses determined by GYANT’s medical providers3 served as a curated, well-labelled dataset that reflects the noise and intricacies inherent in real-world patient data. Using this highly structured data allowed a more interpretable machine learning algorithm to be used while still maintaining high performance, which is especially necessary in a medical setting where decisions need to be auditable.

Figure 1: Chief complaint collection (left) and evidence collection (right) in the GYANT app.

Model Building

The data used for model training, validation, and testing consisted of the data collected (as described above) through the GYANT platform from January 2018 – March 2019. Additional filtering was done to improve the quality of the dataset: only cases with patients at least 18 years of age were included, cases containing a piece of medical evidence that had been deprecated due to the refining of medical content were removed, and cases where the user endorsed three or fewer symptoms as being present were removed, due to insufficient information with which to make an accurate prediction. Lastly, cases where the provider gave a patient multiple diagnoses were excluded, as predicting multiple diagnoses from a single set of symptoms is outside the scope of the current approach. This resulted in a dataset of 5,379 encounters and their resulting diagnoses (2,934 from live encounters and 2,445 from the labelling of historical encounters).
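The sketch below illustrates how such a filter might look over a tabular export of the encounters; the file name and column names are assumptions for illustration, not GYANT’s actual schema.

```python
import pandas as pd

# Hypothetical per-encounter table; the file and column names are illustrative only.
encounters = pd.read_csv("encounters_2018-01_2019-03.csv")

filtered = encounters[
    (encounters["age"] >= 18)                          # adults only
    & (encounters["deprecated_evidence_count"] == 0)   # drop cases with deprecated evidence
    & (encounters["endorsed_symptom_count"] > 3)       # more than three symptoms reported present
    & (encounters["diagnosis_count"] == 1)             # single provider diagnosis
]
```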

Using this data, a random forest classifier was built to predict relevant medical conditions from a symptom presentation [10]. The classifier was trained for a multi-class classification task in which each condition is a target class. The training set consisted of 80% (4,303) of the cases, with model validation done using the out-of-bag score of the model. The remaining 20% (1,076) of cases were used only for the final assessment of the model. The training-test split was done in a stratified manner to ensure both that the model saw all conditions during training and that its performance was tested on all conditions.
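A minimal scikit-learn sketch of this setup (stratified 80/20 split and a random forest validated with its out-of-bag score) is shown below. The variable names and hyperparameters such as n_estimators are assumptions; the paper does not report the actual configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumed inputs: the filtered encounter table from the previous sketch, with one
# column per feature and a provider-assigned diagnosis label.
X = filtered.drop(columns=["diagnosis"]).values
y = filtered["diagnosis"].values

# Stratified 80/20 split so every condition appears in both training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=500,  # assumed; not reported in the paper
    oob_score=True,    # out-of-bag score used for validation
    random_state=0,
)
clf.fit(X_train, y_train)
print("Out-of-bag validation score:", clf.oob_score_)
```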

Figure 2: Example patient chart generated through information captured by the GYANT chatbot as presented to medical providers.

Method of Evaluation

GYANT’s goal is to assist physicians by reducing clerical work and providing clinical decision support. Thus the model was optimized for high accuracy over other metrics such as precision and was assessed by its “top-3 accuracy”, i.e. the number of cases where the correct diagnosis was contained in the model’s three most likely conditions, divided by the total number of cases. Additionally, since many upper respiratory tract infections (URTI) have identical symptom presentations and treatment plans, the model was not penalized for predicting one URTI when the true diagnosis was another URTI. For example, a prediction of acute nasopharyngitis was considered correct if the true diagnosis was viral pharyngitis. These conditions were not grouped together at the model training stage so that the model can give providers as much information as possible. However, given their clinical interchangeability, it would be unduly harsh to penalize the model for predicting one over the other.
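A sketch of how this metric could be computed for the random forest above is shown below. The set of URTI labels used for grouping is illustrative; the paper does not list the exact grouped conditions.

```python
import numpy as np

# Illustrative set of clinically interchangeable URTI diagnoses (assumed, not the
# paper's exact grouping).
URTI_GROUP = {"acute nasopharyngitis", "viral pharyngitis", "acute upper respiratory infection"}

def grouped_match(predicted, actual):
    """A prediction counts as correct if it matches exactly, or if both labels are URTIs."""
    return predicted == actual or (predicted in URTI_GROUP and actual in URTI_GROUP)

def top3_accuracy(clf, X_test, y_test):
    proba = clf.predict_proba(X_test)
    # The three highest-probability conditions for each case.
    top3_labels = clf.classes_[np.argsort(proba, axis=1)[:, -3:]]
    hits = [
        any(grouped_match(pred, actual) for pred in preds)
        for preds, actual in zip(top3_labels, y_test)
    ]
    return float(np.mean(hits))
```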

Table 1: Top-3 accuracy of the model and test-set size for five frequently seen and five rarely seen conditions. The data shown for upper respiratory tract infections are after grouping, as described above.

Results

Model Performance

The model scored a top-3 accuracy of 92% on the test set. While the distribution of conditions in the test set matched that seen in the GYANT clinic, this metric is susceptible to bias: the model could perform well on only the most common conditions while failing to understand the symptom presentations of rarer conditions. To evaluate this, the average of the per-condition top-3 accuracies, i.e. the macro top-3 accuracy, was also computed in order to give each condition equal weight. This macro top-3 accuracy was 87%, indicating that the model learns both frequent and rare conditions well. The top-3 accuracies for 5 commonly and 5 rarely seen conditions are shown below in Table 1.
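A short sketch of this macro variant of the metric, reusing grouped_match and the model from the earlier sketches, is shown below; as before, the variable names are illustrative.

```python
from collections import defaultdict

import numpy as np

def macro_top3_accuracy(clf, X_test, y_test):
    """Average per-condition top-3 accuracies so each condition carries equal weight."""
    proba = clf.predict_proba(X_test)
    top3_labels = clf.classes_[np.argsort(proba, axis=1)[:, -3:]]

    hits_by_condition = defaultdict(list)
    for preds, actual in zip(top3_labels, y_test):
        hit = any(grouped_match(pred, actual) for pred in preds)  # URTI grouping as above
        hits_by_condition[actual].append(hit)

    return float(np.mean([np.mean(hits) for hits in hits_by_condition.values()]))
```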

Model Evaluation

In a clinical setting, it is not only necessary to have a high performing model, but also a model that makes its predictions using medically relevant information and not artifacts of the data (e.g. noise, medically irrelevant correlations, label leakage). One way the “decision making” of the model was assessed is through the ELI5 Python package, which shows how much each piece of evidence contributed to a predicted condition for a particular case [11]. Three representative prediction explanations, selected from real patient encounters, are shown below for cases of a migraine headache, amenorrhea, and gastroenteritis.

The prediction explanation tables can be read as follows: The top row of each table shows the predicted condition and its probability4, while the remaining rows show the top-10 contributing features for the corresponding prediction. The “Feature” column describes the piece of evidence. The “Value” of a feature is “1” if the patient reported the symptom, “-1” if the patient denied the symptom, and “0” if the symptom is unassessed or the patient responded “Don’t Know” to the question. For age, month, and sex, “Value” refers to the age of the patient in years, the month of the year in which the encounter occurred, and the sex of the patient (0 for female, 1 for male). The “Contribution” column displays how much the probability increased or decreased due to the value of that feature (for example, in Figure 3, the presence of phono-photophobia increased the probability of Migraine Headache by 0.083). The deeper green the row, the more that feature’s value increased the likelihood of the condition; the deeper red the row, the more it decreased the likelihood of the condition. To clarify nomenclature, the term evidence refers to the symptom and auxiliary information collected for a particular case, while the term feature refers to a single type of evidence (such as a specific symptom) across cases.
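The explanations in Figures 3–5 were generated with ELI5; the snippet below is a minimal sketch of how such an explanation could be produced for one test-set case using the random forest from the earlier sketches. The feature_names variable is an assumed list of GYANT’s evidence labels.

```python
import eli5

# Explain a single prediction from the random forest; feature_names is assumed to be
# the list of evidence labels corresponding to the columns of the feature matrix.
explanation = eli5.explain_prediction(
    clf,
    X_test[0],                   # one encounter's feature vector
    feature_names=feature_names,
    top=10,                      # show the ten largest per-feature contributions
)
print(eli5.format_as_text(explanation))
```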

Case 1 (Migraine Headache)

A female patient, age 24, entered the chat complaining of a headache. After evidence collection through the GYANT chatbot, the model correctly predicted Migraine Headache as the most probable condition. Figure 3 shows the “reasoning” the model used in coming to this conclusion: the most significant evidence was the presence of phono-photophobia, recurrent headaches, and a previous diagnosis of migraine headaches. The table also shows the model eliminating the possibility of a more severe medical issue, as the absence of recent head trauma also increases the likelihood of a migraine headache. Lastly, one can also see the model differentiating a migraine headache from a tension-type headache, as the presence of “severe eye pain” and “diagnosis of migraine” both decrease the likelihood of a tension-type headache. After a provider endorsed the diagnosis of a migraine headache, the patient received a medical treatment plan as well as warning signs that may indicate a more urgent condition. The entire exchange, from the patient entering the app to receiving the model prediction, lasted 3 minutes and 36 seconds.

Figure 3: Prediction explanation for a case of a migraine headache. A female patient, age 24, complained of a headache. Note this example only has two predictions as all other conditions were below the probability threshold to be displayed.

Case 2 (Amenorrhea)

A female patient, age 35, entered the GYANT app with a chief complaint of not having menstruated for multiple months. Given that amenorrhea is often a sign of another health problem, e.g. polycystic ovary syndrome, rather than a disease itself, it is necessary to have a complete medical intake process to rule out more serious conditions, especially as a physical exam is not possible in this setting. This case demonstrates the GYANT chatbot’s ability to generate a comprehensive HPI to help providers discriminate between non-urgent and urgent cases. The left column of Figure 4 shows that the model considered various pieces of collected evidence (e.g. previous history of tubal surgery, infertility, pelvic inflammatory disease) that could indicate a more severe underlying issue. As none of this evidence was reported as present, the model found Amenorrhea to be the most relevant condition. The provider agreed with the model and diagnosed the patient with Amenorrhea. The patient was then given a treatment plan and warning signs, and informed that they should see their primary care physician or gynecologist for a more complete examination.

Figure 4: Prediction explanation for a case of amenorrhea. A female patient, age 35, complained of not having menstruated for multiple months.

Case 3 (Gastroenteritis)

A male patient, age 21, entered the GYANT app complaining of stomach pain and having vomited multiple times. Here, the model predicted Gastroenteritis as the most relevant condition, based primarily on the patient having diarrhea and vomiting triggered by eating certain foods. However, there are also a few signs that this could be a more serious issue, such as appendicitis, due to the patient endorsing abdominal pain that worsened over time, anorexia, and fever. In this case, the differential reminds the provider that while gastroenteritis is the most likely condition (probability=0.473), it could also be a more urgent condition that requires immediate attention, though with lower likelihood (probability=0.061).

Figure 5: Prediction explanation for a case of gastroenteritis. A male patient, age 21, complained of stomach pain and having vomited multiple times.

Conclusion

GYANT’s diagnosis model was assessed and found to have a top-3 accuracy of 92%, with a condition-averaged top-3 accuracy of 87%. This assessment was performed using 5,379 patient encounters collected through a real-world telemedicine clinic. Additionally, a detailed analysis of case studies showed that the GYANT platform captures the relevant information needed to produce comprehensive HPIs and can utilize that information to discriminate between non-urgent and urgent cases to provide a medically relevant differential diagnosis. GYANT’s model offers transparent clinical decision support that addresses the most common conditions encountered by ambulatory care providers.

This full process of HPI generation to condition recommendation can be as short as 3 minutes and is entirely automated, which allows GYANT to provide benefit to patients, providers, and healthcare systems. Patients could easily complete their HPI while in the waiting room of a care clinic, making better use of patients’ wait times and allowing the provider to spend more time directly interacting with patients during encounters. For providers, a high-performance diagnostic model can serve as a useful clinical decision support system to help them treat patients more effectively and efficiently. For healthcare systems, it can provide a significant improvement in patient triage over traditional online symptom checkers, which up to 59% of American adults use as their first source of medical information [12] despite their limited accuracy [13]. GYANT presents an opportunity to help all facets of healthcare service by making more accurate information more accessible to patients, providers, and healthcare systems.

Footnotes

1. The term “preliminary differential diagnosis” refers to conditions and/or diseases relevant to a set of symptoms and does not constitute a medical diagnosis. The only medical diagnoses in this study are provided by licensed medical providers.
2. In cases where the model predicts a condition with a very high probability, it is possible that the second- and/or third-most likely prediction has a probability less than that of random guessing. In this case, those predictions were not returned to the user to prevent displaying very unlikely conditions.
3. While the providers are employees of GYANT, they were not informed of this study and remained completely independent of any part of the study, aside from the diagnosis of patients.
4. Since this is a multi-class classification task, the probability does not represent the absolute relevance of the condition, as it would if a one-vs-rest classifier were built for each condition (i.e. the sum of all probabilities ranges from 0 to the number of conditions). The probability here refers to the relative relevance of the condition compared to all other conditions in the model (i.e. the sum of all probabilities is 1).

References

[1] C. Sinsky, L. Colligan, L. Li, M. Prgomet, S. Reynolds, L. Goeders, J. Westbrook, M. Tutty, and G. Blike. Allocation of Physician Time in Ambulatory Practice: A Time and Motion Study in 4 Specialties. Annals of Internal Medicine, 165(11):753, Sep 2016.
[2] Medscape National Physician Burnout, Depression & Suicide Report 2019. Accessed on Fri, April 19, 2019.
[3] C. Bleustein, D.B. Rothschild, A. Valen, E. Valatis, L. Schweitzer, and R. Jones. Wait times, patient satisfaction scores, and the perception of care. Am J Manag Care, 20:393–400, May 2014.
[4] D.A. Thompson. Adult Telephone Protocols. Office Version, 3rd Edition. American Academy of Pediatrics, 2013.
[5] D.L. Kasper, A.S. Fauci, S.L. Hauser, D.L. Longo, J.L. Jameson, and J. Loscalzo. Harrison’s Principles of Internal Medicine. McGraw Hill Education, 19th edition, 2015.
[6] J.E. Dains, L.C. Baumann, and P. Scheibel. Advanced Health Assessment & Clinical Diagnosis in Primary Care. Mosby, 2015.
[7] A.H. Goroll and A.G. Mulley. Primary Care Medicine: Office Evaluation and Management of the Adult Patient, 6th Edition. LWW, 2009.
[8] American Psychiatric Association, DSM-5 Task Force. Diagnostic and statistical manual of mental disorders: DSM-5. American Psychiatric Publishing, Inc, 5th edition, 2013.
[9] SNOMED CT. https://www.nlm.nih.gov/healthit/snomedct/index.html, 2019. Accessed on Wed, April 17, 2019.
[10] L. Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
[11] ELI5. https://github.com/TeamHG-Memex/eli5.
[12] S. Fox. Health Information is a Popular Pursuit Online. https://www.pewinternet.org/2011/02/01/health-information-is-a-popular-pursuit-online/, 2011. Accessed on Tue, April 9, 2019.
[13] H.L. Semigran, J.A. Linder, C. Cidengil, and A. Mehrotra. Evaluation of symptom checkers for self diagnosis and triage: audit study. BMJ, page h3480, Jul 2015.