How to Teach a Large Language Model to Tell Us About Childhood Development
Using 133.2 million notes on 1.5 million patients from the Children's Hospital of Philadelphia's electronic health record, this project trains a large language model (based on Meta's Llama 3 8B Instruct model) to answer questions about childhood development: extracting developmental milestone ages, summarizing developmental trajectories, and citing the corresponding notes. The work is motivated by the prevalence of developmental concerns among American youth, the current workflow pediatricians use for developmental surveillance, and the potential of large language models to comprehend free text, extract details, and summarize in a narrative style. We produce a 6-page electronic health record annotation guide for childhood development, reviewed by pediatricians and health informatics specialists; 84 annotated patients, each with 5 extracted milestones, a 50-600 word developmental summary, and accompanying note citations; 1,008 dataset rows for training; and Python scripts for annotation assistance, data pipelines, fine-tuning, and inference. Outputs generated after fine-tuning show improvements in style and in the relevance of included information, but model hallucinations persist.
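To illustrate the kind of dataset construction described above, the sketch below converts one annotated patient (milestones, a summary, and note citations) into instruction-style training rows. The field names and helper function here are hypothetical illustrations under assumed annotation structure, not the project's actual code.

```python
# Hypothetical sketch: building (prompt, response) training rows from one
# patient's annotation, covering milestone extraction and summarization.

def build_training_rows(annotation):
    """Turn one patient's annotation dict into prompt/response rows."""
    rows = []
    # One row per extracted milestone age, with its citing note.
    for m in annotation["milestones"]:
        rows.append({
            "prompt": f"At what age did the patient achieve '{m['name']}'?",
            "response": f"{m['age']} (see note {m['note_id']})",
        })
    # One row for the narrative developmental summary, citing all notes.
    cited = ", ".join(m["note_id"] for m in annotation["milestones"])
    rows.append({
        "prompt": "Summarize the patient's developmental trajectory.",
        "response": f"{annotation['summary']} [Notes: {cited}]",
    })
    return rows

example = {
    "milestones": [
        {"name": "walking independently", "age": "13 months", "note_id": "N-102"},
        {"name": "first words", "age": "12 months", "note_id": "N-087"},
    ],
    "summary": "Gross motor and language milestones were met on time.",
}
rows = build_training_rows(example)
print(len(rows))  # 3 rows: two milestone extractions plus one summary
```

A real pipeline would read annotations from files and emit rows in whatever format the fine-tuning scripts expect; this sketch only shows the one-annotation-to-many-rows shape implied by 84 patients yielding 1,008 rows.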