NLP/LLM Interest Group
This session features two talks:
"A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data Curation" by Hyunjae Kim, PhD - Postdoctoral Associate in Biomedical Informatics and Data Science
Abstract: Medical multimodal learning is often hindered by the scarcity of high-quality image-text data. While scientific literature is a vast resource, extracting clinically relevant, deconstructed, and aligned image-caption pairs at scale remains challenging. We propose MedPMC, an automated five-stage pipeline to curate high-fidelity medical datasets from large-scale biomedical repositories. The framework features task-specialized components for precise image filtering and sophisticated separation of multi-panel figures and captions to ensure accurate image-text alignment. Leveraging this pipeline, we curated 12 million medical image-text pairs. A CLIP-style model trained on this dataset surpassed state-of-the-art performance across 20+ benchmarks in six clinical specialties, including radiology and pathology. Furthermore, a multimodal LLM that integrated our model outperformed baselines on medical QA tasks by 3.6%. Crucially, MedPMC-trained models enhance performance on internal clinical data, underscoring their utility in real-world settings. This scalable framework establishes a new paradigm for transforming biomedical literature into continuously updatable, clinically grounded training resources.
"S-index – A Refined Data Sharing Index to Promote and Reward Biomedical Data Reuse" by Kalpana Raja, PhD, MRSB, CSci - Instructor of Biomedical Informatics and Data Science
Abstract: Data sharing has become increasingly recognized as essential for accelerating scientific discovery, enhancing transparency, and maximizing the return on research investments. Efforts such as the FAIR principles (Findable, Accessible, Interoperable, and Reusable) and NIH policies on Data Management and Sharing have underscored the importance of making datasets widely available to the scientific community. Despite these advances, current practices lack quantitative metrics that accurately reflect researchers' contributions to data sharing, particularly the downstream reuse of their datasets by the broader scientific community. Existing citation metrics, such as the H-index, predominantly measure scholarly impact through publications, neglecting the critical role of dataset creation and reuse. Consequently, there is a pressing need for a novel index to quantify and incentivize dataset reuse, fostering a robust culture of open and impactful scientific data sharing. To address this, we propose the Data Sharing index (S-index), a refined metric specifically designed to quantify a researcher’s contribution to reusable data. We have built an end-to-end workflow for S-index computation and a web-based interface for visualization, and we have demonstrated feasibility using a real-world repository (OpenNeuro).