Research & Publications
The research activities in our computational biology and bioinformatics laboratory span diverse areas of data science such as spectral analysis, machine learning, AI, deep learning, signal processing and statistics of high-dimensional data.
We analyze data from high throughput experiments such as single cell RNA sequencing (scRNA-seq), spatial proteomics and transcriptomics, Exome-seq, ChIP-seq, cytometry, chromosome conformation capture sequencing as well as other multiplexed modalities.
Our group develops computational methods that fall into several categories: a) pre-processing tasks such as denoising, removal of batch effects and imputation of missing values, b) scalable dimensionality reduction algorithms for compression and visualization of very large genomics datasets, c) differential analysis tasks for detecting differences across samples with different phenotype/state/condition with the aim of discovering biomarkers, d) bi-clustering and co-organization of large tabulated datasets, e) intra- and inter-cellular regulation and communication between different cell types, f) signal analysis tools for analyzing data from spatial transcriptomics and proteomics modalities, and g) tree-based models in the context of cancer and phylogeny.
We combine our methodological research with practical solutions to analytical tasks emerging in our collaborative projects with basic, translational and clinical researchers. Our collaborations include characterization of the immune system at the single cell level, molecular profiling of melanoma, kidney, breast and lung cancers, interrogating the cellular landscape of brain tissue from donors with HIV and substance use disorders at the single cell level, and studying hair follicle development and regeneration.
Extensive Research Description
We have been working in the broad fields of bioinformatics and data science. Our main contributions to date all relate to the development of spectral, machine learning and statistical methods for the analysis of various types of data in genomics, proteomics, and biomarker discovery.
Spectral and graph-based methods for unsupervised & supervised learning: Over the past two decades, a common approach to data analysis has been to first represent the data as a graph and then apply spectral methods to it. In some applications, the data is originally given as a graph (as in the connectivity of Facebook users, or a similarity graph between different proteins). Fundamental theoretical and practical questions include how such data should be analyzed, what the properties of the various spectral methods suggested in the literature are, and how multi-scale representations can be developed and applied to such data. We develop state-of-the-art unsupervised spectral methodologies suited to numerous applications. The first set of methods (Refs. 1-2) allows identification of complex patterns in large data tables by simultaneous organization of rows and columns. Our second set of spectral methods is concerned with ranking and combining multiple predictors without labeled data. This approach provides fundamental results in unsupervised ensemble learning and crowdsourcing (Refs. 3-6). It offers a principled way to rank or combine computational genomics pipelines, which is useful for numerous computational genomics tasks: a substantial fraction of the biological results inferred by different pipelines are often in disagreement, and a principled ranking can reduce the resulting confusion among end-users. Our third set of spectral approaches is concerned with efficient methods for dimensionality reduction of Big Data (BD) matrices (Refs. 7-9). More recently we used spectral approaches to address challenges concerning the presence of heteroskedastic noise (Ref. 10), estimation of the rank of count matrices (Ref. 11), detecting locations in the feature space where two high-dimensional densities f1 and f0, observed through a combined sample, differ significantly (f1>f0 or f1<f0) (Ref. 12), inferring the tree structure of large-scale phylogenetic datasets (Refs. 13-14), and phenotypic classification of samples measured in multiplexed spatial omics modalities (Ref. 15).
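The idea of ranking predictors without labeled data can be illustrated with a small simulation. The sketch below is a simplified illustration in the spirit of Refs. 3-4, not the published implementation. It assumes binary predictions in {-1, +1}, classifiers that err independently given the true label, and that most classifiers are better than random. Under these assumptions the off-diagonal entries of the classifiers' covariance matrix are approximately rank one, and the entries of the leading eigenvector order the classifiers by accuracy. The function names, the iterative fill-in of the diagonal, and the simulated accuracies are all illustrative choices.

```python
import numpy as np

def spectral_rank_classifiers(predictions, n_iter=50):
    """Score m classifiers from an (m, n) array of {-1, +1} predictions.

    No true labels are used. Returns one weight per classifier;
    a larger weight suggests a more accurate classifier.
    """
    preds = np.asarray(predictions, dtype=float)
    cov = np.cov(preds)                # (m, m) covariance between classifiers
    rank_one = cov.copy()
    np.fill_diagonal(rank_one, 0.0)    # the diagonal does not follow the rank-one model
    for _ in range(n_iter):
        eigvals, eigvecs = np.linalg.eigh(rank_one)
        lam, v = eigvals[-1], eigvecs[:, -1]   # leading eigenpair
        # Replace the diagonal with the value implied by the rank-one fit.
        np.fill_diagonal(rank_one, lam * v ** 2)
    # Resolve the sign ambiguity, assuming most classifiers beat random guessing.
    if v.sum() < 0:
        v = -v
    return v

def simulate_predictions(n=20000, accuracies=(0.9, 0.8, 0.7, 0.6, 0.55), seed=0):
    """Hypothetical test data: independent classifiers with known accuracies."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)            # hidden true labels
    preds = np.empty((len(accuracies), n))
    for i, acc in enumerate(accuracies):
        correct = rng.random(n) < acc
        preds[i] = np.where(correct, y, -y)
    return y, preds

y_true, preds = simulate_predictions()
weights = spectral_rank_classifiers(preds)
print("estimated ranking (best first):", np.argsort(-weights))
```

In this simulation the hidden labels are generated only to create the predictions; the ranking itself is computed from the predictions alone. The resulting weights could also be used to form a weighted vote across classifiers.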
1. Kluger Y, Basri R, Chang JT, and Gerstein MB. Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions, Genome Research 2003; 13: 703-716. PMCID: PMC317287
2. Mishne, G., Talmon, R., Cohen, I., Coifman, R.R., Kluger, Y., Data-Driven Tree Transforms and Metrics, IEEE Transactions on Signal and Information Processing over Networks. 2017 Aug 23;4(3):451-66.
3. Parisi, F., Strino, F., Nadler, B., and Kluger, Y., Ranking and combining multiple predictors without labeled Data, PNAS (2014) 111(4): 1253-1258; PMID: 24474744; PMCID: PMC3910607
4. Jaffe, A., Nadler, B., Kluger, Y., Estimating the Accuracies of Multiple Classifiers Without Labeled Data, In Artificial Intelligence and Statistics, pp. 407-415. 2015
5. Jaffe, A., Fetaya, E., Nadler, B., Jiang, T., Kluger, Y., Unsupervised Ensemble Learning with Dependent Classifiers, In Artificial Intelligence and Statistics, pp. 351-360. 2016.
6. Shaham, U., Cheng, X., Dror, O., Jaffe, A., Nadler, B., Chang, J., and Kluger, Y. A Deep Learning Approach to Unsupervised Ensemble Learning. Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48
7. Linderman, G.C., Rachh, M., Hoskins, J.G., Steinerberger, S., and Kluger, Y., Fast Interpolation-based t-SNE for Improved Visualization of Single-Cell RNA-Seq Data, Nature Methods 2019 Mar;16(3):243-5; PMID: 30742040
8. Li, H., Linderman, G.C., Szlam, A., Stanton, K.P., Kluger, Y., Tygert, M., Algorithm 971: An implementation of a randomized algorithm for principal component analysis, ACM Transactions on Mathematical Software (TOMS) 43.3 (2017): 28. PMCID: PMC5625842.
9. Shaham, U., Stanton, K.P., Li, F., Basri, R., Nadler, B., Kluger, Y., Spectralnet: Spectral Clustering Using Deep Neural Networks. ICLR 2018
10. B. Landa, R. R. Coifman, and Y. Kluger, "Doubly Stochastic Normalization of the Gaussian Kernel Is Robust to Heteroskedastic Noise," SIAM Journal on Mathematics of Data Science, pp. 388-413, 2021.
11. B. Landa, T. T. Zhang, and Y. Kluger, "Biwhitening Reveals the Rank of a Count Matrix," arXiv preprint arXiv:2103.13840, 2021.
12. B. Landa, R. Qu, J. Chang, and Y. Kluger, "Local Two-Sample Testing over Graphs and Point-Clouds by Random-Walk Distributions," arXiv preprint arXiv:2011.03418, 2020
13. A. Jaffe, N. Amsel, Y. Aizenbud, B. Nadler, J. T. Chang, and Y. Kluger, "Spectral Neighbor Joining for Reconstruction of Latent Tree Models," SIAM Journal on Mathematics of Data Science, vol. 3, pp. 113-141, 2021
14. Y. Aizenbud, A. Jaffe, M. Wang, A. Hu, N. Amsel, B. Nadler, J. T. Chang, and Y. Kluger, "Spectral Top-Down Recovery of Latent Tree Models," arXiv preprint arXiv:2102.13276, 2021.
15. Y.-W. E. Lin, T. Shnitzer, R. Talmon, F. Villarroel-Espindola, S. Desai, K. Schalper, and Y. Kluger, "Graph of graphs analysis for multiplexed data with application to imaging mass cytometry," PLOS Computational Biology, vol. 17, p. e1008741, 2021
Cell-specific regulatory networks: In a series of papers (Refs. 16-19) our lab addressed the question of identifying cell-specific regulatory networks and assessed differential transcriptional activity of known pathways. We were among the first to develop methods to generate condition-specific regulatory networks and the first group to use supervised learning techniques to monitor key transcriptional circuitry alterations. Our work on pathway analysis preceded the popular Gene Set Enrichment Analysis tool and highlighted the limitations of interpreting pathway-based statistical analysis. We introduced a novel approach of looking at differences between cell types by analyzing the activity status of regulator-gene pairs, as well as more complex topological relationships between genes, rather than the expression level of individual genes. We were able to identify key transcriptional circuitry alterations by finding pairs of regulating-regulated genes whose coordinated expression activities undergo the most substantial modification from one class of patients to another.
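As a rough illustration of comparing regulator-gene pairs rather than individual genes, the sketch below scores each pair by how much its expression correlation changes between two classes of samples. This is a generic differential co-expression score chosen for illustration, not the method of Refs. 16-19. The function name, the use of Pearson correlation, and the simulated expression data are all assumptions.

```python
import numpy as np

def rank_pairs_by_coordination_change(expr_a, expr_b, pairs):
    """Rank (regulator, target) index pairs by change in correlation.

    expr_a, expr_b: (genes, samples) expression arrays for two classes.
    pairs: iterable of (regulator_index, target_index) tuples.
    Returns a list of (score, (regulator, target)), largest change first.
    """
    corr_a = np.corrcoef(expr_a)   # gene-by-gene correlation in class A
    corr_b = np.corrcoef(expr_b)   # gene-by-gene correlation in class B
    scored = [(abs(corr_a[r, t] - corr_b[r, t]), (r, t)) for r, t in pairs]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

# Hypothetical data: gene 0 regulates gene 1 in class A only; gene 2 is noise.
rng = np.random.default_rng(0)
n_samples = 200
regulator_a = rng.normal(size=n_samples)
expr_a = np.vstack([
    regulator_a,
    regulator_a + 0.1 * rng.normal(size=n_samples),  # tightly coupled target
    rng.normal(size=n_samples),
])
expr_b = rng.normal(size=(3, n_samples))             # no coupling in class B

ranked = rank_pairs_by_coordination_change(expr_a, expr_b, [(0, 1), (0, 2)])
for score, pair in ranked:
    print(pair, round(float(score), 3))
```

A pair can rank highly here even when neither gene changes its mean expression between classes, which is the kind of alteration a per-gene differential expression test would miss.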
16. Kluger Y, Tuck, DP, Chang, JT, Nakayama, Y, Poddar, R, Kohya, N, Lian, Z, Abdelhakim Ben Nasr H, Halaban, R, Krause, DS, Zhang X, Newburger PE, Weissman SM. Lineage Specificity of Gene Expression Patterns. PNAS 2004; 101:6508-6513. PMID:15096607; PMCID: PMC404075
17. Kim, H., Hu, W., Kluger, Y. Unraveling condition specific gene transcriptional regulatory networks in Saccharomyces cerevisiae, BMC Bioinformatics 2006, 7:165. PMID:16551355; PMCID: PMC1488875
18. Tuck, D., Kluger, H., Kluger, Y., Characterizing disease states from topological properties of transcriptional regulatory networks, BMC Bioinformatics 2006, 7:236. PMID: 16670008; PMCID: PMC1482723
19. Zhou Y., Ferguson J., Chang J.T., Kluger Y., Inter and intra combinatorial regulation by transcription factors and MicroRNAs, BMC Genomics 2007;8(1):396. PMID: 17971223; PMCID: PMC2206040
Algorithms for analyzing genomics and epigenomics sequencing data: In recent years, we have been involved in sequencing projects and publications in the fields of cancer genomics, epigenetics, transcriptional regulation and nuclear organization. Our work on picking peak detectors for ChIP-seq data analysis (Refs. 20-21) provides top-performing algorithms that are more specific and sensitive than the approaches used in the ENCODE project. We also developed an approach for organizing repositories of epigenetic marks using harmonic analysis. This organization reveals a variety of binding patterns (Ref. 22).
Models of cancer evolution assume that among all random mutations there are necessary aberrations that trigger tumor onset, metastatic processes and relapse. Recent efforts to provide a complete genealogical perspective of cancer evolution using experimental techniques have been limited to a small number of fluorescent markers or a small number of single cells. Computational methods can help overcome these limitations. In contrast to typical phylogeny problems, where species are observed and measured separately, and to the problem of identifying the common cancer genealogy from a panel of samples, our lab addressed the problem of deconvolving a single aggregate signal from a single tumor sample into its subclonal components. Our algorithmic tool is among the very first algorithms addressing this difficult problem (Ref. 23). It can be used not only in the context of cancer data but also in immunology or for mixed cell populations with phylogenetic relationships.
Experiments involving chromosome conformation capture techniques provide support for simultaneous promoter activation, as enhancers often form contacts between each other and the target gene in the same cell. We introduced a novel bioinformatics approach for 4C-seq studies that allows us to detect not only pairwise interactions between different genomic loci but also multi-loci interactions from the same cell (Ref. 24).
20. M. Micsinai, F. Parisi, F. Strino, P. Asp, B.D. Dynlacht, and Y. Kluger, Picking Peak Detectors for Analyzing ChIP-seq experiments, NAR 2012, 1-16, PMID: 22307239; PMCID: PMC3351193
21. Stanton, K.P., Jin, J., Weissman, S.M. and Kluger, Y. Ritornello: High fidelity control-free chip seq peak calling, NAR (2017): gkx799. PMID: 28981893.
22. Stanton, K., Parisi, F., Strino, F., Rabin, N., Asp, P. and Kluger, Y., Arpeggio: Harmonic compression of ChIP-seq data reveals protein-chromatin interaction signatures, NAR 2013; 41(16):e161; doi: 10.1093/nar/gkt627. PMID: 23873955. PMCID: PMC3763565
23. Strino, F., Parisi, F., Micsinai, M., and Kluger, Y., TrAp: a Tree Approach for Fingerprinting Subclonal Tumor Composition, NAR 2013;41(17):e165 doi: 10.1093/nar/gkt641. PMID: 23892400. PMCID: PMC3783191
24. Jiang, T., Raviram, R., Snetkova, V., Rocha, P.P., Proudhon, C., Badri, S., Bonneau, R., Skok, J.A., and Kluger, Y., Identification of multi-loci hubs from 4C-seq demonstrates the functional importance of simultaneous interactions, Nucleic Acids Research 2016; doi: 10.1093/nar/gkw568 PMID: 27439714
Algorithms for analyzing omics and single cell sequencing data: Development of methods for analyzing high-dimensional data is an important area of biomedical research. We developed methods for preprocessing high-throughput data, including methodologies for data calibration. In recent years, we have developed methods for, and been involved in projects on, analyzing multidimensional proteomic data from tumors. In these projects, extracting the relevant features is challenging because of sample size and noise level considerations; we have addressed this challenge in a series of papers. Examples include:
25. Shaham, U., Stanton, K.P., Zhao, J., Li, H., Raddassi, K., Montgomery, R., Kluger, Y., Removal of Batch Effects using Distribution-Matching Residual Networks, Bioinformatics (2017): btx196. PMID: 28419223.
26. Li, H., Shaham, U., Yao, Y., Montgomery, R. and Kluger, Y., Gating Mass Cytometry Data by Deep Learning. Bioinformatics (2017): btx448. PMID: 29036374
27. Yamada, Y., Lindenbaum, O., Negahban, S. and Kluger, Y., 2020. “Feature Selection using Stochastic Gates”, Proceedings of the 37th International Conference on Machine Learning (ICML), Vienna, Austria, PMLR 119, 2020
28. Katzman J, Shaham U, Bates J, Cloninger A, Jiang T, Kluger Y. Deep survival: A deep cox proportional hazards network. BMC Medical Research Methodology 18 (1), 24. PMID:29482517 PMCID:PMC5828433
29. J. Zhao, A. Jaffe, H. Li, O. Lindenbaum, E. Sefik, R. Jackson, X. Cheng, R. Flavell, and Y. Kluger, "Detection of differentially abundant cell subpopulations discriminates biological states in scRNA-seq data," bioRxiv, p. 711929, 2020
30. G. C. Linderman, J. Zhao, and Y. Kluger, "Zero-preserving imputation of scRNA-seq data using low-rank approximation," bioRxiv, p. 397588, 2018.
Artificial Intelligence; Classification; Hemic and Immune Systems; Immune System Diseases; Neoplasms; Neural Networks, Computer; Computational Biology; Data Compression; Machine Learning; Deep Learning; Data Science; Data Visualization