Extracting Social Determinants of Health from Clinical Notes using LLMs
Social determinants of health (SDOH) are non-medical factors that significantly impact patient health outcomes. Incorporating patients' SDOH can substantially improve risk prediction, but this information is typically found only in unstructured, narrative clinical notes. In this study, we employed large language models (LLMs) to extract SDOH information from patients' clinical notes. Specifically, we used the Llama-3-8B, Llama-3-70B, and Mixtral 8x7B models, applying them to the MIMIC dataset for both inference and fine-tuning.
We focused on two main inference tasks: note-level and span-level. The note-level task involved identifying which SDOH categories were present in a given note, while the span-level task required determining the spans or phrases corresponding to SDOH events, triggers, and argument values in a given note. To enhance model performance, we performed prompt engineering and explored various approaches, including zero-shot and few-shot prompts, as well as different prompting formats and example types. For the span-level inference task, we also performed LoRA fine-tuning using LLaMA-Factory.
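As an illustration of the few-shot prompting approach for the note-level task, the sketch below builds a prompt from demonstration notes and their category labels. The category list and example notes are placeholders for illustration, not taken from the MIMIC data or the actual prompts used.

```python
# Illustrative SDOH categories; the study's actual category set may differ.
SDOH_CATEGORIES = ["Employment", "Housing", "Tobacco", "Alcohol", "Drug"]

def build_note_level_prompt(note, examples):
    """Build a few-shot prompt asking which SDOH categories appear in a note.

    examples: list of (example_note, list_of_category_labels) pairs.
    """
    lines = [
        "Identify which of the following SDOH categories are present "
        f"in the clinical note: {', '.join(SDOH_CATEGORIES)}.",
        "Answer with a comma-separated list of categories.",
        "",
    ]
    for example_note, labels in examples:  # few-shot demonstrations
        lines.append(f"Note: {example_note}")
        lines.append(f"Categories: {', '.join(labels) if labels else 'None'}")
        lines.append("")
    lines.append(f"Note: {note}")
    lines.append("Categories:")  # the model completes this line
    return "\n".join(lines)

demo = [("Patient is a retired welder who quit smoking 10 years ago.",
         ["Employment", "Tobacco"])]
prompt = build_note_level_prompt("Patient lives alone and drinks socially.", demo)
print(prompt)
```

The same template works zero-shot by passing an empty example list, which is one way to compare the two settings under identical instructions.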
Evaluation was conducted using the BRAT scoring Python package, which considers the event types, trigger types, argument values, and index locations of identified spans to generate the evaluation metrics for the span-level task. Before evaluation, we needed to perform additional data cleaning to remove output lines that did not match the required BRAT standoff format.
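The cleaning step can be sketched as a simple line filter: keep only lines that look like valid BRAT standoff annotations (text-bound "T" lines and event "E" lines) and drop any conversational filler the model emitted. The regular expressions below are simplified assumptions about the expected format, not the exact rules used in the study.

```python
import re

# Simplified patterns for BRAT standoff annotation lines.
TEXTBOUND = re.compile(r"^T\d+\t\S+ \d+ \d+\t.+$")      # e.g. T1<TAB>Tobacco 10 17<TAB>smoking
EVENT = re.compile(r"^E\d+\t\S+:T\d+( \S+:[TE]\d+)*$")  # e.g. E1<TAB>Tobacco:T1 Status:T2

def clean_brat_output(raw):
    """Keep only model output lines that match the BRAT standoff format."""
    kept = []
    for line in raw.splitlines():
        line = line.rstrip()
        if TEXTBOUND.match(line) or EVENT.match(line):
            kept.append(line)
    return kept

raw = "Here are the annotations:\nT1\tTobacco 10 17\tsmoking\nE1\tTobacco:T1\nThanks!"
print(clean_brat_output(raw))  # only the two annotation lines survive
```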
For the note-level inference task, we achieved an F1 score of 0.93 with the Llama-3-70B model. For the span-level inference task, however, we discovered that evaluation scores were initially low because the LLMs output incorrect character indices even when the spans they identified were correct. To address this, we performed an additional index-correction step: for each identified trigger span, we searched the input file for its occurrences and replaced the model's reported index with the index of the closest matching span. With this correction, we achieved an F1 score of 0.76 for the span-level task using the fine-tuned Mixtral 8x7B model.
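The index-correction step described above can be sketched as follows: when the model's character offsets are wrong but the span text is correct, relocate the span to the occurrence in the source note whose start position is closest to the predicted index. This is a minimal sketch of the idea, assuming exact string matching; the function name and tie-breaking behavior are illustrative.

```python
def correct_index(note, span_text, predicted_start):
    """Return the start index of the occurrence of span_text in note
    that is closest to the model's predicted start index, or None if
    the span text does not appear in the note at all."""
    occurrences = []
    start = note.find(span_text)
    while start != -1:              # collect every occurrence of the span
        occurrences.append(start)
        start = note.find(span_text, start + 1)
    if not occurrences:
        return None                 # span text not found; leave the line as-is
    return min(occurrences, key=lambda i: abs(i - predicted_start))

note = "Patient denies smoking. Wife reports smoking cessation in 2010."
# Suppose the model reported that the trigger "smoking" starts at index 40;
# the nearest real occurrence actually starts at index 37.
print(correct_index(note, "smoking", 40))  # → 37
```

When the span text is not found at all (e.g. the model paraphrased it), the function returns None, which corresponds to leaving that output line for the cleaning step to discard.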