Kelson Zawack

Associate Research Scientist of Biomedical Informatics and Data Science


My focus is on the engineering aspects of data science. Just as a civil engineer brings together physics, geology, and materials science to build a bridge connecting two sides of a river, I am interested in how we bring together statistics, software development, and computing systems to address real-world data analysis challenges. My work has four main focuses: health disparities epidemiology, applied machine learning, engineering reproducible software, and designing data storage architectures amenable to machine-learning-scale analyses.

Previous work has identified disparities in health outcomes based on a vast array of socioeconomic and demographic factors. The Veterans Health Administration is a unique setting in which to investigate these disparities because of the large number of patients it covers, the comprehensiveness of the care it provides, and the accessibility of that care. My work has used generalized linear mixed models to explore the effects of national origin (whether a Veteran was born domestically, in a U.S. territory, or abroad) on hypertension, and the effects of race and ethnicity on the prescription of continuous glucose monitors.

While generalized linear models are a powerful tool for exploring relationships between risk factors and outcomes, their rigid structure limits their predictive power. The more complex decision surfaces explored by machine learning methods offer the potential to close this gap. Recently I have worked on imputing Gulf War illness status from surveys done as part of the Million Veteran Program (MVP). During the Gulf War, Veterans were exposed to a vast array of toxins that are believed to be related to an increase in a set of physical and cognitive symptoms. To better characterize Gulf War illness, a survey was constructed and administered to a subset of MVP enrollees. To increase the sample size for downstream analyses, we are working to use other MVP data to impute Gulf War illness status for enrollees who did not receive the survey.

Data analysis projects present a unique software engineering challenge because of their highly nonlinear and exploratory nature. Often a project begins with an idea, which is then tested, and the result is used to generate the next hypothesis. Because the next hypothesis may be radically different from the original, the programming abstractions that made sense at the outset may no longer be workable. Through my work on applied projects, I am interested in how to create flexible abstractions that withstand this highly dynamic development environment.

Just as data analysis places unique demands on software design, it also places unique demands on data storage and organization. Data architecture design often occurs separately from data analysis and has separate concerns. Data storage designs typically focus on minimizing space usage and access times. This results in designs where data must be re-coded in the process of analysis. This re-coding is feasible for targeted analyses that involve only a few well-known variables, but it becomes prohibitive as untargeted analyses scale to machine learning scope. A core question I have focused on is how to store the data so that it is still efficient in terms of space and access time but is more directly computable.