Multifaceted Evaluation of Large Language Models in Clinical Decisions on Diagnoses, Procedures, Lab Tests Orders and Prescriptions

¹Department of Computer Science, ²Department of Physiology, ³Department of Neurology
University of California, Los Angeles


The integration of Artificial Intelligence (AI), especially Large Language Models (LLMs), into the clinical diagnosis process offers significant potential to improve the efficiency and accessibility of medical care. While LLMs have shown some promise in the medical domain, their application in clinical diagnosis remains underexplored, especially in real-world clinical practice, where highly sophisticated, patient-specific decisions need to be made. Current evaluations of LLMs in this field are often narrow in scope, focusing on specific diseases or specialties and employing simplified diagnostic tasks.

To bridge this gap, we introduce CliBench, a novel benchmark developed from the MIMIC-IV dataset, offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. The benchmark covers diagnosis across a diverse range of medical cases and specialties, and it also incorporates three further tasks of clinical significance: treatment procedure identification, lab test ordering, and medication prescription. Because every task's output is grounded in a structured ontology, CliBench enables precise, multi-granular evaluation, giving an in-depth view of LLMs' capabilities on diverse clinical tasks at the desired level of granularity. We conduct a zero-shot evaluation of leading LLMs to assess their proficiency in clinical decision-making. Our preliminary results shed light on the potential and limitations of current LLMs in clinical settings, providing valuable insights for future advancements in LLM-powered healthcare.

CliBench Dataset

Task 1: discharge diagnoses. Diagnosis is defined as the identification of a disease, condition, or injury based on a patient's health evidence. The task aims to provide a set of diagnoses given the patient profile, the medical record at admission, the lab test results and radiology results obtained during the admission, and the history diagnoses. Each diagnosis is represented as an International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) code or an equivalent concept. ICD-10-CM is a coding system used by healthcare providers to classify all diagnoses for claims processing. The history diagnoses are included for completeness, because diagnoses made in previous admissions or by other service departments may be carried over.
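Because the output is a set of ontology codes, a prediction can be scored at more than one level of detail. The sketch below is a minimal illustration, not CliBench's actual scoring code: it assumes set-level precision, recall and F1, and it assumes that a coarser granularity is obtained by keeping only the first characters of each ICD-10-CM code (the first three characters form the code's category). The function names and the example codes are chosen purely for illustration.

```python
from typing import Iterable, Optional, Tuple


def truncate_icd10cm(code: str, n_chars: Optional[int]) -> str:
    """Drop the dot, uppercase, and keep the first n_chars characters (all if None)."""
    normalized = code.replace(".", "").upper()
    return normalized if n_chars is None else normalized[:n_chars]


def set_prf(predicted: Iterable[str], gold: Iterable[str],
            n_chars: Optional[int] = None) -> Tuple[float, float, float]:
    """Set-level precision, recall and F1 after truncating every code to n_chars."""
    pred = {truncate_icd10cm(c, n_chars) for c in predicted}
    true = {truncate_icd10cm(c, n_chars) for c in gold}
    tp = len(pred & true)  # codes present in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative codes only; they are not taken from the benchmark.
gold = ["I50.9", "E11.9", "N17.9"]
predicted = ["I50.22", "E11.9", "J18.9"]

full_code_scores = set_prf(predicted, gold)            # exact full-code match
category_scores = set_prf(predicted, gold, n_chars=3)  # 3-character category match
```

In this example the prediction I50.22 misses the gold code I50.9 at full-code level but matches it at category level, so the coarser score is higher. This is the kind of distinction a multi-granular evaluation is meant to expose.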

Task 2: procedures. Procedures are specific courses of action taken to intervene in the patient's health status. The task aims to identify the first batch of procedure decisions after the patient is admitted. The input contains the patient profile and the medical record at admission. The expected output is a set of ICD-10 Procedure Coding System (ICD-10-PCS) codes or equivalent concepts. Within an admission, procedure decisions, lab test orders and prescriptions can be made at any time, and later decisions are made with knowledge of the outcomes and results of earlier procedures or lab tests. Ground truth for non-initial decisions is hard to obtain: the actions could have been taken in different temporal orders, but only the outcomes of the order that actually occurred are available. This motivates us to predict only the earliest batch of decisions.
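One way to read "first batch" is as the set of decisions that share the earliest recorded time within an admission. The sketch below follows that reading. The event layout (a list of timestamp and code pairs), the function name and the placeholder codes are all assumptions made for illustration; the text does not specify how the batch boundary is drawn in the benchmark.

```python
from datetime import datetime
from typing import List, Set, Tuple

Event = Tuple[datetime, str]  # (time the decision was recorded, ontology code)


def first_batch(events: List[Event]) -> Set[str]:
    """Return the codes of all decisions recorded at the earliest timestamp."""
    if not events:
        return set()
    earliest = min(ts for ts, _ in events)
    return {code for ts, code in events if ts == earliest}


# Placeholder codes, not real ICD-10-PCS entries.
events = [
    (datetime(2024, 1, 1, 9, 0), "PROC_A"),
    (datetime(2024, 1, 1, 9, 0), "PROC_B"),
    (datetime(2024, 1, 2, 14, 30), "PROC_C"),  # later decision, excluded
]
initial_decisions = first_batch(events)
```

Here PROC_C is left out of the target set because it was decided after the results of the earlier actions could have been known.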

Task 3: lab test orders. Given the same input as the procedure task, this task aims to produce the set of initial lab items ordered after the patient is admitted, to facilitate downstream diagnosis and treatment. Each lab item is identified by a unique Logical Observation Identifiers Names and Codes (LOINC) code.

Task 4: prescriptions. Given the same input as the procedure task, the prescription task yields the set of initial medications to be prescribed for the patient after admission. Each medication is coded in the Anatomical Therapeutic Chemical (ATC) classification system.
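ATC codes are hierarchical, so a predicted medication can be compared with the reference at several depths. A full ATC code has seven characters, and its five levels correspond to prefixes of length 1, 3, 4, 5 and 7. The sketch below truncates a code to a chosen level. The function name is hypothetical, and which levels CliBench actually reports is not stated here, so the level choice is an assumption.

```python
# Prefix length of each ATC level for a full 7-character code.
ATC_LEVEL_LENGTH = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def atc_at_level(code: str, level: int) -> str:
    """Return the prefix of a full ATC code that identifies the given level (1-5)."""
    normalized = code.strip().upper()
    if len(normalized) != 7:
        raise ValueError(f"expected a 7-character ATC code, got {code!r}")
    if level not in ATC_LEVEL_LENGTH:
        raise ValueError(f"ATC level must be between 1 and 5, got {level}")
    return normalized[:ATC_LEVEL_LENGTH[level]]


# Two loop diuretics that differ only at the chemical-substance level.
gold_code = "C03CA01"       # furosemide
predicted_code = "C03CA02"  # bumetanide

same_subgroup = atc_at_level(gold_code, 4) == atc_at_level(predicted_code, 4)
same_substance = atc_at_level(gold_code, 5) == atc_at_level(predicted_code, 5)
```

With these two codes the prediction agrees with the reference at level 4 (same chemical subgroup) but not at level 5 (different substance), which a single exact-match score would record only as a miss.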

Experimental Results Leaderboard

Data Distribution



      title={CliBench: Multifaceted Evaluation of Large Language Models in Clinical Decisions on Diagnoses, Procedures, Lab Tests Orders and Prescriptions},
      author={Mingyu Derek Ma and Chenchen Ye and Yu Yan and Xiaoxuan Wang and Peipei Ping and Timothy Chang and Wei Wang},