# Yuval Kluger, PhD

## Research & Publications

### Research Summary

The research activities in our computational biology and bioinformatics laboratory span diverse areas of data science such as spectral analysis, machine learning, AI, deep learning, signal processing and statistics of high-dimensional data.

We analyze data from high-throughput experiments such as single-cell RNA sequencing (scRNA-seq), spatial proteomics and transcriptomics, Exome-seq, ChIP-seq, cytometry, and chromosome conformation capture sequencing, as well as other multiplexed modalities.

Our group develops computational methods that fall into several categories: a) pre-processing tasks such as denoising, removal of batch effects and imputation of missing values, b) scalable dimensionality reduction algorithms for compression and visualization of very large genomics datasets, c) differential analysis tasks for detecting differences across samples with different phenotypes, states or conditions, with the aim of discovering biomarkers, d) bi-clustering and co-organization of large tabulated datasets, e) inference of regulation and communication within and between cell types, f) signal analysis tools for data from spatial transcriptomics and proteomics modalities, and g) tree-based models in the context of cancer and phylogeny.

We combine our methodological research with practical solutions to analytical tasks emerging in our collaborative projects with basic, translational and clinical researchers. Our collaborations include characterizing the immune system at the single-cell level, molecular profiling of melanoma, kidney, breast and lung cancers, interrogating the cellular landscape of brain tissue from donors with HIV and substance use disorders at the single-cell level, and studying hair follicle development and regeneration.

### Extensive Research Description

We have been working in the broad fields of bioinformatics and data science. Our main contributions to date all relate to the development of spectral, machine learning and statistical methods for the analysis of various types of data in genomics, proteomics, and biomarker discovery.

**Spectral and graph-based methods for unsupervised & supervised learning:** Over the past two decades, a common approach to data analysis has been to first represent the data as a graph and then apply spectral methods to it. In some applications, the data is originally given as a graph (as in the connectivity of Facebook users, or a similarity graph between different proteins). Fundamental theoretical and practical questions are how such data should be analyzed, what the properties are of the various spectral methods suggested in the literature, and how multi-scale representations can be developed and applied to such data. We develop state-of-the-art unsupervised spectral methodologies suited to numerous applications. The first set of methods (Refs. 1-2) allows identification of complex patterns in large data tables by simultaneous organization of rows and columns. Our second set of spectral methods is concerned with ranking and combining multiple predictors without labeled data. This approach provides fundamental results in unsupervised ensemble learning and crowdsourcing (Refs. 3-6) and offers a principled way to rank or combine computational genomics pipelines. This is useful for numerous computational genomics tasks, because a substantial fraction of the biological results inferred by different pipelines are often in disagreement, which confuses end-users. Our third set of spectral approaches is concerned with efficient methods for dimensionality reduction of Big Data (BD) matrices (Refs. 7-9). More recently we utilized spectral approaches to address challenges concerning the presence of heteroskedastic noise (Ref. 10), estimation of the rank of count matrices (Ref. 11), detecting locations in the feature space where two high-dimensional densities f1 and f0 differ significantly in the combined sample, that is, where f1>f0 or f1<f0 (Ref. 12), inferring the tree structure of large-scale phylogenetic datasets (Refs. 13-14), and phenotypic classification of samples measured in multiplexed spatial omics modalities (Ref. 15).

1. Kluger Y, Basri R, Chang JT, and Gerstein MB. Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions, Genome Research 2003; 13: 703-716. PMCID: PMC317287

2. Mishne, G., Talmon, R., Cohen, I., Coifman, R.R., Kluger, Y., Data-Driven Tree Transforms and Metrics, IEEE Transactions on Signal and Information Processing over Networks. 2017 Aug 23;4(3):451-66.

3. Parisi, F., Strino, F., Nadler, B., and Kluger, Y., Ranking and combining multiple predictors without labeled Data, PNAS (2014) 111(4): 1253-1258; PMID: 24474744; PMCID: PMC3910607

4. Jaffe, A., Nadler, B., Kluger, Y., Estimating the Accuracies of Multiple Classifiers Without Labeled Data, In Artificial Intelligence and Statistics, pp. 407-415. 2015

5. Jaffe, A., Fetaya, E., Nadler, B., Jiang, T., Kluger, Y., Unsupervised Ensemble Learning with Dependent Classifiers, In Artificial Intelligence and Statistics, pp. 351-360. 2016.

6. Shaham, U., Cheng, X., Dror, O., Jaffe, A., Nadler, B., Chang, J., and Kluger, Y. A Deep Learning Approach to Unsupervised Ensemble Learning. Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48

7. Linderman, G.C., Rachh, M., Hoskins, J.G., Steinerberger, S., and Kluger, Y., Fast Interpolation-based t-SNE for Improved Visualization of Single-Cell RNA-Seq Data, Nature Methods 2019 Mar;16(3):243-5; PMID: 30742040

8. Li, H., Linderman, G.C., Szlam, A., Stanton, K.P., Kluger, Y., Tygert, M., Algorithm 971: An implementation of a randomized algorithm for principal component analysis, ACM Transactions on Mathematical Software (TOMS) 43.3 (2017): 28. PMCID: PMC5625842.

9. Shaham, U., Stanton, K.P., Li, H., Basri, R., Nadler, B., Kluger, Y., SpectralNet: Spectral Clustering Using Deep Neural Networks. ICLR 2018

10. B. Landa, R. R. Coifman, and Y. Kluger, "Doubly Stochastic Normalization of the Gaussian Kernel Is Robust to Heteroskedastic Noise," SIAM Journal on Mathematics of Data Science, vol. 3, pp. 388-413, 2021.

11. B. Landa, T. T. Zhang, and Y. Kluger, "Biwhitening Reveals the Rank of a Count Matrix," arXiv preprint arXiv:2103.13840, 2021.

12. B. Landa, R. Qu, J. Chang, and Y. Kluger, "Local Two-Sample Testing over Graphs and Point-Clouds by Random-Walk Distributions," arXiv preprint arXiv:2011.03418, 2020

13. A. Jaffe, N. Amsel, Y. Aizenbud, B. Nadler, J. T. Chang, and Y. Kluger, "Spectral Neighbor Joining for Reconstruction of Latent Tree Models," SIAM Journal on Mathematics of Data Science, vol. 3, pp. 113-141, 2021

14. Y. Aizenbud, A. Jaffe, M. Wang, A. Hu, N. Amsel, B. Nadler, J. T. Chang, and Y. Kluger, "Spectral Top-Down Recovery of Latent Tree Models," arXiv preprint arXiv:2102.13276, 2021.

15. Y.-W. E. Lin, T. Shnitzer, R. Talmon, F. Villarroel-Espindola, S. Desai, K. Schalper, and Y. Kluger, "Graph of graphs analysis for multiplexed data with application to imaging mass cytometry," PLOS Computational Biology, vol. 17, p. e1008741, 2021
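The unsupervised ranking of predictors described above (Refs. 3-6) can be illustrated with a small sketch. It is not the lab's implementation. It assumes binary classifiers with outputs in {-1, +1} that make conditionally independent errors, in which case the off-diagonal entries of their covariance matrix are approximately rank one and the leading eigenvector orders the classifiers by balanced accuracy. The function names `spectral_rank` and `simulate_classifiers`, the iterative diagonal fill, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def spectral_rank(predictions, n_iter=50):
    """Rank binary classifiers without labels (illustrative sketch).

    predictions: array of shape (m, n) with entries in {-1, +1};
    row i holds the outputs of classifier i on n unlabeled samples.
    Needs m >= 3 classifiers. Returns (scores, ranking, ensemble).
    """
    R = np.cov(predictions)  # m x m sample covariance between classifiers
    for _ in range(n_iter):
        eigvals, eigvecs = np.linalg.eigh(R)  # eigenvalues in ascending order
        lam, v = eigvals[-1], eigvecs[:, -1]
        # Only off-diagonal entries follow the rank-one model, so replace
        # the diagonal with the current rank-one fit and re-estimate.
        np.fill_diagonal(R, lam * v**2)
    if v.sum() < 0:  # assumption: most classifiers beat random guessing
        v = -v
    ranking = np.argsort(-v)             # best classifier first
    ensemble = np.sign(v @ predictions)  # score-weighted vote
    return v, ranking, ensemble

def simulate_classifiers(accuracies, n_samples, seed=0):
    """Hypothetical test data: flip true labels independently per classifier."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n_samples)
    acc = np.asarray(accuracies)[:, None]
    flips = rng.random((len(accuracies), n_samples)) > acc
    return y, np.where(flips, -y, y)

y, preds = simulate_classifiers([0.9, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55], 4000)
scores, ranking, ensemble = spectral_rank(preds)
print("estimated ranking:", ranking)
print("ensemble accuracy:", np.mean(ensemble == y))
```

The true labels `y` are used only to evaluate the result; the ranking itself is computed from the predictions alone. The weighted vote here uses the eigenvector entries directly as weights, which is a simplification.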

**Cell-specific regulatory networks:** In a series of papers (Refs. 16-19), our lab addressed the question of identifying cell-specific regulatory networks and assessed differential transcriptional activity of known pathways. We were among the first to develop methods to generate condition-specific regulatory networks and the first group to use supervised learning techniques to monitor key transcriptional circuitry alterations. Our work on pathway analysis preceded the popular Gene Set Enrichment Analysis tool and highlighted the limitations of interpreting pathway-based statistical analysis. We introduced a novel approach to examining differences between cell types: analyzing the activity status of regulator-gene pairs, as well as more complex topological relationships between genes, rather than the expression level of individual genes. We were able to identify key transcriptional circuitry alterations by finding pairs of regulating-regulated genes whose coordinated expression activities undergo the most substantial modification from one class of patients to another.

16. Kluger Y, Tuck, DP, Chang, JT, Nakayama, Y, Poddar, R, Kohya, N, Lian, Z, Abdelhakim Ben Nasr H, Halaban, R, Krause, DS, Zhang X, Newburger PE, Weissman SM. Lineage Specificity of Gene Expression Patterns. PNAS 2004; 101:6508-6513. PMID:15096607; PMCID: PMC404075

17. Kim, H., Hu, W., Kluger, Y. Unraveling condition specific gene transcriptional regulatory networks in Saccharomyces cerevisiae, BMC Bioinformatics 2006, 7:165. PMID:16551355; PMCID: PMC1488875

18. Tuck, D., Kluger, H., Kluger, Y., Characterizing disease states from topological properties of transcriptional regulatory networks, BMC Bioinformatics 2006, 7:236. PMID: 16670008; PMCID: PMC1482723

19. Zhou Y., Ferguson J., Chang J.T., Kluger Y., Inter and intra combinatorial regulation by transcription factors and MicroRNAs, BMC Genomics 2007;8(1):396. PMID: 17971223; PMCID: PMC2206040
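The idea of scoring regulator-gene pairs rather than individual genes, as described above, can be sketched as follows. The cited papers (Refs. 16-19) define their own activity measures and use supervised learning; this example substitutes a plain difference in Pearson correlation between two patient classes as a stand-in, and the function names, matrix layout (genes in rows, samples in columns) and synthetic data are assumptions for illustration only.

```python
import numpy as np

def pair_correlation(expr, pairs):
    """Pearson correlation for each (regulator, target) pair of row indices."""
    return np.array([np.corrcoef(expr[r], expr[t])[0, 1] for r, t in pairs])

def rank_altered_pairs(expr_a, expr_b, pairs):
    """Order regulator-target pairs by how much their co-expression changes.

    expr_a, expr_b: genes x samples expression matrices for two classes.
    Returns a list of ((regulator, target), delta) sorted by |delta|,
    where delta = correlation in class B minus correlation in class A.
    """
    delta = pair_correlation(expr_b, pairs) - pair_correlation(expr_a, pairs)
    order = np.argsort(-np.abs(delta))
    return [(pairs[i], float(delta[i])) for i in order]

# Synthetic example: gene 1 tracks gene 0 only in class A,
# while gene 3 tracks gene 2 in both classes.
rng = np.random.default_rng(1)
n = 200
class_a = rng.normal(size=(4, n))
class_a[1] = class_a[0] + 0.1 * rng.normal(size=n)
class_a[3] = class_a[2] + 0.1 * rng.normal(size=n)
class_b = rng.normal(size=(4, n))
class_b[3] = class_b[2] + 0.1 * rng.normal(size=n)

for pair, delta in rank_altered_pairs(class_a, class_b, [(0, 1), (2, 3)]):
    print(pair, round(delta, 3))
```

In this toy setting the pair (0, 1) is ranked first because its coupling is lost in class B, even though neither gene changes its average expression level between the classes.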

**Algorithms for analyzing genomics and epigenomics sequencing data:** In recent years, we were involved in sequencing projects and publications in the fields of cancer genomics, epigenetics, transcriptional regulation and nuclear organization. Our work on picking peak detectors for ChIP-seq data analysis (Refs. 20-21) provides top-performing algorithms that are more specific and sensitive than approaches used in the ENCODE project. We also developed an approach for organizing repositories of epigenetic marks using harmonic analysis. This organization reveals a variety of binding patterns (Ref. 22).

Models of cancer evolution assume that among all random mutations there are necessary aberrations that trigger tumor onset, metastatic processes and relapse. Recent efforts to provide a complete genealogical perspective of cancer evolution using experimental techniques have been limited to a small number of fluorescent markers or a small number of single cells. Computational methods can help overcome these limitations. In contrast to typical phylogeny problems, where species are observed and measured separately, and to the problem of identifying the common cancer genealogy from a panel of samples, our lab addressed the problem of deconvolving a single aggregate signal from a single tumor sample into its subclonal components. Our algorithmic tool is among the very first algorithms addressing this difficult problem (Ref. 23). It can be used not only in the context of cancer data but also in immunology or for mixed cell populations with phylogenetic relationships.

Experiments involving chromosome conformation capture techniques provide support for simultaneous promoter activation, as enhancers often form contacts with each other and with the target gene in the same cell. We introduced a novel bioinformatics approach for 4C-seq studies that allows us to detect not only pairwise interactions between different genomic loci but also multi-loci interactions from the same cell (Ref. 24).

20. M. Micsinai, F. Parisi, F. Strino, P. Asp, B.D. Dynlacht, and Y. Kluger, Picking Peak Detectors for Analyzing ChIP-seq experiments, NAR 2012, 1-16, PMID: 22307239; PMCID: PMC3351193

21. Stanton, K.P., Jin, J., Lederman, R.R., Weissman, S.M. and Kluger, Y. Ritornello: high fidelity control-free chromatin immunoprecipitation peak calling, NAR 2017; 45: e173. doi: 10.1093/nar/gkx799. PMID: 28981893.

22. Stanton, K., Parisi, F., Strino, F., Rabin, N., Asp, P. and Kluger, Y., Arpeggio: Harmonic compression of ChIP-seq data reveals protein-chromatin interaction signatures, NAR 2013; 41(16):e161; doi: 10.1093/nar/gkt627. PMID: 23873955. PMCID: PMC3763565

23. Strino, F., Parisi, F., Micsinai, M., and Kluger, Y., TrAp: a Tree Approach for Fingerprinting Subclonal Tumor Composition, NAR 2013; 41(17):e165; doi: 10.1093/nar/gkt641. PMID: 23892400. PMCID: PMC3783191

24. Jiang, T., Raviram, R., Snetkova, V., Rocha, P.P., Proudhon, C., Badri, S., Bonneau, R., Skok, J.A., and Kluger, Y., Identification of multi-loci hubs from 4C-seq demonstrates the functional importance of simultaneous interactions, Nucleic Acids Research 2016; doi: 10.1093/nar/gkw568. PMID: 27439714

**Algorithms for analyzing omics and single-cell sequencing data:** Development of methods for analyzing high-dimensional data is an important area of biomedical research. We developed methods for preprocessing high-throughput data, including methodologies for data calibration. In recent years, we developed methods for, and were involved in projects on, analyzing multidimensional proteomic data from tumors. In these projects, extracting relevant features is challenging because of sample size and noise level considerations, a problem we have addressed in a series of papers. Examples include:

25. Shaham, U., Stanton, K.P., Zhao, J., Li, H., Raddassi, K., Montgomery, R., Kluger, Y., Removal of Batch Effects using Distribution-Matching Residual Networks, Bioinformatics (2017): btx196. PMID: 28419223.

26. Li, H., Shaham, U., Yao, Y., Montgomery, R. and Kluger, Y., Gating Mass Cytometry Data by Deep Learning. Bioinformatics (2017): btx448. PMID: 29036374

27. Yamada, Y., Lindenbaum, O., Negahban, S. and Kluger, Y., 2020. “Feature Selection using Stochastic Gates”, Proceedings of the 37th International Conference on Machine Learning (ICML), Vienna, Austria, PMLR 119, 2020

28. Katzman J, Shaham U, Bates J, Cloninger A, Jiang T, Kluger Y. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology 2018; 18(1): 24. PMID: 29482517; PMCID: PMC5828433

29. J. Zhao, A. Jaffe, H. Li, O. Lindenbaum, E. Sefik, R. Jackson, X. Cheng, R. Flavell, and Y. Kluger, "Detection of differentially abundant cell subpopulations in scRNA-seq data," PNAS June 1, 2021 118 (22) e2100293118

30. G. C. Linderman, J. Zhao, and Y. Kluger, "Zero-preserving imputation of scRNA-seq data using low-rank approximation," bioRxiv, p. 397588, 2018.

31. J. Yang, O. Lindenbaum, and Y. Kluger, "Locally Sparse Networks for Interpretable Predictions" arXiv:2106.06468, 2021.

### Research Interests

Artificial Intelligence; Classification; Hemic and Immune Systems; Immune System Diseases; Neoplasms; Neural Networks, Computer; Computational Biology; Data Compression; Machine Learning; Deep Learning; Data Science; Data Visualization

### Selected Publications

- Linderman GC, Rachh M, Hoskins JG, Steinerberger S, Kluger Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature Methods 2019, 16: 243-245. PMID: 30742040, PMCID: PMC6402590, DOI: 10.1038/s41592-018-0308-4.
- Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology 2018, 18: 24. PMID: 29482517, PMCID: PMC5828433, DOI: 10.1186/s12874-018-0482-1.
- Stanton KP, Jin J, Lederman RR, Weissman SM, Kluger Y. Ritornello: high fidelity control-free chromatin immunoprecipitation peak calling. Nucleic Acids Research 2017, 45: e173. PMID: 28981893, PMCID: PMC5716106, DOI: 10.1093/nar/gkx799.
- Li H, Shaham U, Stanton KP, Yao Y, Montgomery RR, Kluger Y. Gating mass cytometry data by deep learning. Bioinformatics (Oxford, England) 2017, 33: 3423-3430. PMID: 29036374, PMCID: PMC5860171, DOI: 10.1093/bioinformatics/btx448.
- Li H, Linderman GC, Szlam A, Stanton KP, Kluger Y, Tygert M. Algorithm 971: An Implementation of a Randomized Algorithm for Principal Component Analysis. ACM Transactions On Mathematical Software. Association For Computing Machinery 2017, 43: 28. PMID: 28983138, PMCID: PMC5625842, DOI: 10.1145/3004053.
- Shaham U, Stanton KP, Zhao J, Li H, Raddassi K, Montgomery R, Kluger Y. Removal of batch effects using distribution-matching residual networks. Bioinformatics (Oxford, England) 2017, 33: 2539-2546. PMID: 28419223, PMCID: PMC5870543, DOI: 10.1093/bioinformatics/btx196.
- Jiang T, Raviram R, Snetkova V, Rocha PP, Proudhon C, Badri S, Bonneau R, Skok JA, Kluger Y. Identification of multi-loci hubs from 4C-seq demonstrates the functional importance of simultaneous interactions. Nucleic Acids Research 2016, 44: 8714-8725. PMID: 27439714, PMCID: PMC5062970, DOI: 10.1093/nar/gkw568.
- Parisi F, Strino F, Nadler B, Kluger Y. Ranking and combining multiple predictors without labeled data. Proceedings Of The National Academy Of Sciences Of The United States Of America 2014, 111: 1253-8. PMID: 24474744, PMCID: PMC3910607, DOI: 10.1073/pnas.1219097111.
- Strino F, Parisi F, Micsinai M, Kluger Y. TrAp: a tree approach for fingerprinting subclonal tumor composition. Nucleic Acids Research 2013, 41: e165. PMID: 23892400, PMCID: PMC3783191, DOI: 10.1093/nar/gkt641.
- Tuck DP, Kluger HM, Kluger Y. Characterizing disease states from topological properties of transcriptional regulatory networks. BMC Bioinformatics 2006, 7: 236. PMID: 16670008, PMCID: PMC1482723, DOI: 10.1186/1471-2105-7-236.
- Kluger Y, Tuck DP, Chang JT, Nakayama Y, Poddar R, Kohya N, Lian Z, Ben Nasr A, Halaban HR, Krause DS, Zhang X, Newburger PE, Weissman SM. Lineage specificity of gene expression patterns. Proceedings Of The National Academy Of Sciences Of The United States Of America 2004, 101: 6508-13. PMID: 15096607, PMCID: PMC404075, DOI: 10.1073/pnas.0401136101.
- Kluger Y, Basri R, Chang JT, Gerstein M. Spectral biclustering of microarray data: coclustering genes and conditions. Genome Research 2003, 13: 703-16. PMID: 12671006, PMCID: PMC430175, DOI: 10.1101/gr.648603.
- Lin YE, Shnitzer T, Talmon R, Villarroel-Espindola F, Desai S, Schalper K, Kluger Y. Graph of graphs analysis for multiplexed data with application to imaging mass cytometry. PLoS Computational Biology 2021, 17: e1008741. PMID: 33780435, PMCID: PMC8032202, DOI: 10.1371/journal.pcbi.1008741.
- Zhao J, Jaffe A, Li H, Lindenbaum O, Sefik E, Jackson R, Cheng X, Flavell RA, Kluger Y. Detection of differentially abundant cell subpopulations in scRNA-seq data. Proceedings Of The National Academy Of Sciences Of The United States Of America 2021, 118: e2100293118. PMID: 34001664, PMCID: PMC8179149, DOI: 10.1073/pnas.2100293118.
- Jaffe A, Amsel N, Aizenbud Y, Nadler B, Chang JT, Kluger Y. Spectral neighbor joining for reconstruction of latent tree Models. SIAM Journal On Mathematics Of Data Science 2021, 3: 113-141. PMID: 34124606, PMCID: PMC8194222, DOI: 10.1137/20m1365715.
- Landa B, Coifman RR, Kluger Y. Doubly Stochastic Normalization of the Gaussian Kernel Is Robust to Heteroskedastic Noise. SIAM Journal On Mathematics Of Data Science 2021, 3: 388-413. PMID: 34124607, PMCID: PMC8194191, DOI: 10.1137/20M1342124.
- Jaffe A, Kluger Y, Lindenbaum O, Patsenker J, Peterfreund E, Steinerberger S. The Spectral Underpinning of word2vec. Frontiers In Applied Mathematics And Statistics 2020, 6. PMID: 34504892, PMCID: PMC8425479, DOI: 10.3389/fams.2020.593406.
- Linderman GC, Zhao J, Roulis M, Bielecki P, Flavell RA, Nadler B, Kluger Y. Zero-preserving imputation of single-cell RNA-seq data. Nature Communications 2022, 13: 192. PMID: 35017482, PMCID: PMC8752663, DOI: 10.1038/s41467-021-27729-z.
- Shaham U, Lindenbaum O, Svirsky J, Kluger Y. Deep unsupervised feature selection by discarding nuisance and correlated features. Neural Networks : The Official Journal Of The International Neural Network Society 2022, 152: 34-43. PMID: 35500458, DOI: 10.1016/j.neunet.2022.04.002.
- Ren J, Qu R, Rahman NT, Lewis JM, King ALO, Liao X, Mirza FN, Carlson KR, Huang Y, Gigante S, Evans B, Rajendran BK, Xu S, Wang G, Foss FM, Damsky W, Kluger Y, Krishnaswamy S, Girardi M. Integrated transcriptome and trajectory analysis of cutaneous T-cell lymphoma identifies putative precancer populations. Blood Advances 2022. PMID: 35947128, DOI: 10.1182/bloodadvances.2022008168.

### Clinical Trials

Conditions | Study Title |
---|---|
Diseases of the Nervous System; HIV/AIDS; Infectious Diseases; COVID-19 Inpatient; COVID-19 Outpatient | HIV Associated Reservoirs and Comorbidities (The HARC Plus Study) |
HIV/AIDS | Evaluating the role of opioid medication assisted therapies in HIV-1 Persistence |