AI in Medicine Seminar Series
"LLMs for Science: From Literature-Grounded Evaluation to Insights on Scientific Reasoning"
Abstract:
Large Language Models are increasingly integrated into scientific workflows and promise to transform discovery, but progress depends on reliable ways to evaluate and interpret their capabilities. In this talk, I will present two complementary efforts on evaluating and understanding LLMs for science. First, I will introduce SciArena, an open, collaborative platform for comparing LLMs on literature-grounded scientific tasks. SciArena provides a robust basis for assessing retrieval-augmented agents, and its extension, SciArena-Eval, enables benchmarking of automated evaluation methods. Second, I will discuss our study of scientific reasoning, in which we investigate the respective roles of knowledge and reasoning using KRUX, a new probing framework. Together, these works reveal key limitations of current models: bottlenecks in retrieving task-relevant knowledge from model parameters, and the need for explicit reasoning to surface domain insights. I will conclude with lessons learned from combining large-scale open evaluation with fine-grained probing, and outline opportunities for building the next generation of trustworthy scientific LLMs.
Bio:
Arman Cohan, PhD, is an Assistant Professor of Computer Science at Yale University. His research interests are in large language models, natural language processing, and machine learning, with a focus on understanding, evaluating, and interpreting large language models. He is also interested in developing techniques to enhance LLM capabilities for complex and long-tail phenomena, with the goal of building robust, adaptable LLM systems that can address challenging real-world scenarios.