Dagstuhl-Seminar 24122
Low-Dimensional Embeddings of High-Dimensional Data: Algorithms and Applications
( 17. Mar – 22. Mar, 2024 )
Permalink
Organisatoren
- Fred Hamprecht (Universität Heidelberg, DE)
- Dmitry Kobak (Universität Tübingen, DE)
- Smita Krishnaswamy (Yale University - New Haven, US)
- Gal Mishne (University of California, San Diego - La Jolla, US)
Kontakt
- Andreas Dolzmann (für wissenschaftliche Fragen)
- Simone Schilke (für administrative Fragen)
Gemeinsame Dokumente
- Dagstuhl Materials Page (Use personal credentials as created in DOOR to log in)
Programm
2D embeddings in science
In recent years, high-dimensional “big” data have become commonplace in multiple academic fields. To give some examples, single-cell transcriptomics routinely produces datasets with sample sizes in hundreds of thousands and dimensionality in tens of thousands [1]; single-cell mass spectrometry deals with millions of samples [2]; genomic datasets quantifying singlenucleotide polymorphisms can deal with many millions of features [3]; behavioural physiology produces high-dimensional datasets with tens of thousands of samples [4]. In neuroscience, calcium imaging allows to record time-series activity of thousands of neurons. Many scientific fields that traditionally did not have to deal with high-dimensional data analysis now face similar challenges; for example, a digital library can yield a dataset with tens of millions of samples and hundreds, if not millions, of features [5].
Such datasets require adequate computational methods for data analysis, including unsupervised data exploration. In fact, exploratory statistical analysis has become an essential tool in many scientific disciplines, allowing researchers to compactly visualise, represent and make sense of their data. It became commonplace to explore low-dimensional embeddings of the data, generated by methods like t-SNE [6] or UMAP [7]. Such visualisation has proven to be a valuable tool for exploring the data, performing quality control, and generating scientific hypotheses (Figure 1).
Similar algorithms are also applied in artificial intelligence research to visualise massive datasets used to train state-of-the-art artificial intelligence models, such as image-based and text-based generative models. This allows researchers to discover biases and gaps in the data, to highlight model limitations, and ultimately to develop better models (Figure 2). A concise overview of the model’s training data can also be helpful for societal oversight and public communication.
Neighbour embedding methods like t-SNE and UMAP create a low-dimensional map of the data based on the k-nearest-neighbour graph. As a result, they are often unable to reproduce large-scale global structure of the data [12], creating potentially misleading visualizations [13]. Acquisition of increasingly high-dimensional data across scientific fields has sparked widespread interest in employing dimensionality reduction and visualisation methods. However, there is a gap between method developers who propose and implement these algorithms, and domain experts who aim to use them. The purpose of this seminar was to bring together machine learning researchers, theoreticians, and practitioners, to address current gaps in theoretical guarantees and evaluation measures for state-of-the-art approaches, highlight practical challenges in applying these techniques in different domains, brainstorm the solutions, and set up new collaborations to tackle open problems in this vibrant field.
Seminar Topics
The overarching purpose of this Dagstuhl Seminar was to brainstorm open problems and challenges in the field of low-dimensional embeddings, as seen by (i) practitioners; (ii) theoreticians and mathematicians; and (iii) machine learning researchers — leading to new collaborations to tackle these problems. The seminar focused on the following open questions, grouped into four areas.
Low-dimensional embeddings in actual practice
Single-cell biology, working with large quantities of high-dimensional data and interested in exploratory research, became a field heavily relying on low-dimensional embeddings. But embeddings of texts [5], of genomes [14], of graph nodes [15], of chemical structures [16], etc., are also rapidly gaining popularity. Seminar participants discussed and brainstormed which fields in the coming years are likely to generate data amenable for embedding methods, and compared challenges raised by each of these application fields.
Neighbour embeddings have a number of well-known limitations [12]: for example, they can strongly distort the global structure of the data and are unable to represent highdimensional (i) which limitations can be addressed by the new generation of algorithms; (ii) how to diagnose misleading aspects of any given embedding; and (iii) what evaluation metrics are necessary and sufficient for comparing different visualisation techniques.
Moreover, two-dimensional embeddings have been recently criticised as being dangerously misleading [13]. At the same time, they are widely used across many disciplines and can be helpful in actual scientific practice, if used with care [12]. In several talks and multiple discussions, seminar participants talked about specific examples of how and where the embeddings are useful, and which best practices can help to avoid them being misleading.
Common themes across state-of-the-art algorithms and relevant trade-offs
One common theme in multiple talks and discussions was trade-offs between various embedding algorithms.
First, methods like t-SNE or UMAP are typically used to produce 2D or 3D embeddings, while spectral methods like Laplacian eigenmaps [17] produce low-dimensional embeddings that are often used with more embedding dimensions. This is less suitable for visualisation but may be better suited for downstream data analysis. Several seminar participants reported successfully applying UMAP to intermediate dimensionality too, with particular benefits for downstream density-based clustering (using HDBSCAN algorithm).
Second, all neighbour embedding algorithms operate on the kNN graph of the data but use different loss functions and different attractive/repulsive forces to arrive at the final layout. This yields various trade-offs in the quality of global/local structure preservation [18].
Third, neighbour embedding algorithms are typically run on a kNN graph constructed using pairwise Euclidean distances, but in principle any other metric can be used as well. Specifically, metric design can be useful for incorporating domain knowledge and statistical priors on the data [19, 20]. We discussed what kinds of data can profit from using non- Euclidean distance metrics, or from kNN graph post-processing, such as diffusion-based smoothing.
Fourth, more generally, neighbour embeddings are known to be related to the selfsupervised learning approach known as contrastive learning [21]. However, despite substantial progress in each of these two fields, they stayed largely disconnected from each other. Seminar participants argued that both contrastive learning and neighbour embedding research can benefit from each other’s state-of-the-art approaches, and in particular can be combined to develop new algorithms for visualising textual and/or graph-based data.
Fifth, while neighbour embeddings only aim to preserve nearest neighbours, methods based on MDS aim to preserve all pairwise distances including the large ones. In Isomap [22] and PHATE [23], pairwise distances are obtained as graph distances on the kNN graph. Isomap uses short path distance, while PHATE uses diffusion-based distance called potential distance. LDLE [24] uses bottom-up manifold learning to align low-distortion local embeddings to a global embedding. We discussed to what extent these approaches can capture both the local and the global structure of the data, and what the advantages and the disadvantages of aiming to preserve global aspects of the data are.
Interactive embeddings
Another extensively discussed topic was interactive visualizations of 2D embeddings (in particular see abstracts by Benjamin M. Schmidt and B. P. F. Lelieveldt). While most often low-dimensional embeddings are depicted as static images, they can be powerful tools for interactive data explorers. NomicAI has been developing software for in-browser interactive explorers, while the group of B. P. F. Lelieveldt has been working on stand-alone software for interactive explorerd of RNA-sequencing data.
References
- Nature Communications, 10(1): 1–14, 2019.
- Belkina, Anna C and Ciccolella, Christopher O and Anno, Rina and others Automated optimized parameters for T-distributed stochastic neighbor embedding improve visualization and analysis of large datasets, Nature Communications, 10(1): 5415, 2019.
- Diaz-Papkovich, Alex and Anderson-Trocmé, Luke and Ben-Eghan, Chief and Gravel, Simon UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts, PLoS genetics, 15(11): e1008432, 2019.
- Kollmorgen, Sepp and Hahnloser, Richard HR and Mante, Valerio Nearest neighbours reveal fast and slow components of motor learning, Nature, 577(7791): 526–530, 2020.
- Schmidt, Benjamin Stable random projection: Lightweight, general-purpose dimensionality reduction for digitized libraries, Journal of Cultural Analytics, 3(1), 2018.
- van der Maaten, Laurens and Hinton, Geoffrey Visualizing data using t-SNE, Journal of Machine Learning Research, 9(11), 2008.
- McInnes, Leland and Healy, John and Melville, James UMAP: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426, 2018.
- Yao, Zizhen and Van Velthoven, Cindy TJ and Nguyen, Thuc Nghi and others A taxonomy of transcriptomic cell types across the isocortex and hippocampal formation, Cell, 184(12): 3222–3241, 2021.
- Kanton, Sabina and Boyle, Michael James and He, Zhisong and others Organoid single-cell genomic atlas uncovers human-specific features of brain development, Nature, 2019.
- Triana, Sergio and Vonficht, Dominik and Jopp-Saile, Lea and others Single-cell proteogenomic reference maps of the hematopoietic system enable the purification and massive profiling of precisely defined cell states, Nature Immunology, 22(12): 1577–1589, 2021.
- Anand, Yuvanesh and Nussbaum, Zach and Treat, Adam and other GPT4All: An Ecosystem of Open Source Compressed Language Models, arXiv preprint arXiv:2311.04931, 2023.
- Wattenberg, Martin and Viégas, Fernanda and Johnson, Ian How to Use t-SNE Effectively, Distill, 2016, 10.23915/distill.00002.
- Chari, Tara and Pachter, Lior The specious art of single-cell genomics, PLoS Computational Biology, 19(8): e1011288, 2023.
- Diaz-Papkovich, Alex and Anderson-Trocmé, Luke and Gravel, Simon A review of UMAP in population genetics, Journal of Human Genetics, 66(1): 85–91, 2021.
- Hu, Yifan Efficient, high-quality force-directed graph drawing, Mathematica journal, 10(1): 37–71, 2005.
- Probst, Daniel and Reymond, Jean-Louis Visualization of very large high-dimensional data sets as minimum spanning trees, Journal of Cheminformatics, 12(1): 12, 2020.
- Belkin, Mikhail and Niyogi, Partha Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation, 15(6): 1373–1396, 2003.
- Böhm, Jan Niklas and Berens, Philipp and Kobak, Dmitry Attraction-Repulsion Spectrum in Neighbor Embeddings, Journal of Machine Learning Research, 23(95), 2022.
- Talmon, Ronen and Coifman, Ronald R Empirical intrinsic geometry for nonlinear modeling and time series filtering, Proceedings of the National Academy of Sciences, 110(31): 12535– 12540, 2013.
- Mishne, Gal and Talmon, Ronen and Meir, Ron and others Hierarchical coupled-geometry analysis for neuronal structure and activity pattern discovery, IEEE Journal of Selected Topics in Signal Processing, 10(7): 1238–1253, 2016.
- Damrich, Sebastian and Böhm, Niklas and Hamprecht, Fred A and Kobak, Dmitry From t-SNE to UMAP with contrastive learning, In The Eleventh International Conference on Learning Representations, 2023.
- Tenenbaum, Joshua B and Silva, Vin de and Langford, John C A global geometric framework for nonlinear dimensionality reduction, Science, 290(5500): 2319–2323, 2000.
- Moon, Kevin R and Van Dijk, David and Wang, Zheng and others Visualizing structure and transitions in high-dimensional biological data, Nature Biotechnology, 37(12): 1482–1492, 2019.
- Kohli, Dhruv and Cloninger, Alexander and Mishne, Gal LDLE: Low distortion local eigenmaps, Journal of machine learning research, 22(282): 1–64, 2021.
Low-dimensional embeddings are widely used for unsupervised data exploration across many scientific fields, from single-cell biology to artificial intelligence. These fields routinely deal with high-dimensional characterization of millions of objects, and the data often contain rich structure with hierarchically organised clusters, progressions, and manifolds. Researchers increasingly use 2D embeddings (t-SNE, UMAP, autoencoders, etc.) to get an intuitive understanding of their data and to generate scientific hypotheses or follow-up analysis plans. With so many scientific insights hinging on these visualisations, it becomes urgent to examine the current state of these techniques mathematically and algorithmically.
This Dagstuhl Seminar intends to bring together machine learning researchers working on algorithm development, mathematicians interested in provable guarantees, and practitioners applying embedding methods in biology, chemistry, humanities, social science, etc. Our aim is to bring together the world's leading experts to (i) survey the state of the art; (ii) identify critical shortcomings of existing methods; (iii) brainstorm ideas for the next generation of methods; and (iv) forge collaborations to help make these a reality.
This seminar should lay the groundwork for future methods that rise to the challenge of visualising high-dimensional data sets while emphasising their idiosyncrasies and scaling to tens, hundreds, and potentially thousands of millions of data points.
Seminar topics:
- Manifold assumption and manifold learning.
- Spectral methods, diffusion, Laplacian methods, etc.
- Relationships and trade-offs between different embedding algorithms.
- Limitations and shortcomings of low-dimensional embeddings. Danger of over-interpretation.
- Local, global, and hierarchical structure preservation.
- Non-Euclidean embeddings, such as hyperbolic or spherical.
- Low- (~2) vs. mid-range- (~256) dimensional embeddings: unique challenges.
- Low-dimensional embeddings in actual practice: embeddings of cells, molecules, texts, graph nodes, images, etc. Data modalities and their challenges.
- Scaling up for larger datasets, runtime considerations.
- Self-supervised embeddings via contrastive learning.
- Theoretical guarantees and mathematical properties of unsupervised and self-supervised embeddings.
- Topological data analysis in the embedding space; topological constraints on embeddings.
- Michael Bleher (Universität Heidelberg, DE)
- Kerstin Bunte (University of Groningen, NL) [dblp]
- Corinna Coupette (MPI für Informatik - Saarbrücken, DE)
- Sebastian Damrich (Universität Tübingen, DE)
- Cyril de Bodt (University of Louvain, BE)
- Alex Diaz-Papkovich (Brown University - Providence, US)
- Laleh Haghverdi (Max-Delbrück-Centrum - Berlin, DE)
- Fred Hamprecht (Universität Heidelberg, DE) [dblp]
- Ágnes Horvát (Northwestern University - Evanston, US)
- Dmitry Kobak (Universität Tübingen, DE)
- Dhruv Kohli (University of California - San Diego, US)
- Smita Krishnaswamy (Yale University - New Haven, US)
- John Aldo Lee (UC Louvain-la-Neuve, BE) [dblp]
- B.P.F. Lelieveldt (Leiden University Medical Center, NL)
- Leland McInnes (Tutte Institute for Mathematics&Computing - Ottawa, CA)
- Gal Mishne (University of California, San Diego - La Jolla, US)
- Ian Nabney (University of Bristol, GB) [dblp]
- Maximilian Noichl (Utrecht University, NL)
- Pavlin Policar (University of Ljubljana, SI)
- Bastian Rieck (Helmholtz Zentrum München, DE) [dblp]
- Enrique Fita Sanmartin (Universität Heidelberg, DE)
- Benjamin M. Schmidt (Nomic AI - New York, US)
- Ingo Scholtes (Universität Würzburg, DE) [dblp]
- Guy Wolf (University of Montreal, CA & MILA - Montreal, CA)
- Miguel Á. Carreira-Perpiñán (University of California - Merced, US) [dblp]
Klassifikation
- Data Structures and Algorithms
- Machine Learning
Schlagworte
- dimensionality reduction
- visualization
- high-dimensional