With an ever-increasing variety of fluorochromes available, and a parallel increase in flow cytometer detection capabilities, high-parameter flow cytometry has become an incredibly powerful technology capable of generating large amounts of data from smaller and smaller amounts of sample. Automated tools have been created for analysis of these large data sets, and one of the most successful and widely used in flow cytometry today is t-SNE, or t-distributed Stochastic Neighbor Embedding. In this blog post, I’ll demonstrate some of the basics of using t-SNE to visualize and characterize your entire cell population within a two-dimensional space. For a video tutorial on how to make t-SNE plots in FlowJo, check out this blog post.
What is t-SNE? How does it compare to PCA?
t-SNE is an algorithm used for arranging high-dimensional data points in a two-dimensional space so that events which are highly related by many variables are most likely to neighbor each other.
t-SNE differs from the more historically used Principal Component Analysis (PCA). PCA identifies the components (combinations of variables) along which the data vary the most, which maximizes the separation between data points that are very different from each other.
This is extremely useful for identifying variables which contribute the most to distinguishing data points. However, the resolution of data which are closely related can be compromised, and similar data points can collapse on each other.
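To make the idea of a component that gives the greatest separation concrete, here is a minimal pure-Python sketch that finds the first principal component of a tiny two-marker data set by power iteration on the covariance matrix. The intensity values and the function name `first_principal_component` are made up for illustration; in practice you would rely on a library implementation.

```python
import math

def first_principal_component(points, iterations=100):
    """Return the unit vector along which 2-D points vary the most."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Power iteration converges to the eigenvector with the largest eigenvalue
    vx, vy = 1.0, 0.0
    for _ in range(iterations):
        wx = sxx * vx + sxy * vy
        wy = sxy * vx + syy * vy
        norm = math.hypot(wx, wy)
        vx, vy = wx / norm, wy / norm
    return vx, vy

# Hypothetical intensities for two markers that rise together
events = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
pc1 = first_principal_component(events)
print(pc1)
```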
Conversely, t-SNE arranges closely related data points so that they neighbor each other in space, increasing the ability to resolve differences between data points that are quite similar. An important effect of this, however, is that physical distances between clusters on a t-SNE map do not reliably indicate how closely or distantly related those data are.
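One way to see why distances between clusters carry little meaning is to look at the neighbor probabilities that t-SNE starts from. The sketch below computes the conditional probability that one event would pick each other event as its neighbor, using a Gaussian kernel with a fixed bandwidth `sigma`. Real implementations tune the bandwidth for each point from the perplexity setting; the fixed `sigma`, the function name, and the toy coordinates are simplifying assumptions.

```python
import math

def neighbor_probabilities(points, i, sigma=1.0):
    """Conditional probabilities p(j|i) that point i picks point j as a neighbor."""
    weights = {}
    for j, point in enumerate(points):
        if j == i:
            continue  # a point is never its own neighbor
        squared_distance = sum((a - b) ** 2 for a, b in zip(points[i], point))
        weights[j] = math.exp(-squared_distance / (2 * sigma ** 2))
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

# Toy events: 0 and 1 are close, 2 is far away, 3 is twice as far as 2
events = [(0.0, 0.0), (0.5, 0.0), (10.0, 0.0), (20.0, 0.0)]
probs = neighbor_probabilities(events, i=0)
for j, p in sorted(probs.items()):
    print(f"p({j}|0) = {p:.3g}")
```

Both distant events receive essentially zero probability, so the algorithm has almost no information about whether one is twice as far away as the other. That is why the spacing between clusters on the map should not be over-interpreted.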
There are important uses for both PCA and t-SNE, but t-SNE has emerged as the preferred algorithm for characterizing flow cytometry data in recent years. The visualizations produced by t-SNE using flow cytometry data are exceptional, and t-SNE excels at maximizing the resolution between populations which don’t stain brightly or discretely.
How can I use a t-SNE map to analyze flow cytometry data?
Note: For the remainder of this post, I’ll demonstrate the generation of various t-SNE plots with flow cytometry data that are publicly available from ImmPort [1] under study accession SDY702, ‘Human T Cell Profile’. The data were part of a study originally published by Drs. Joseph Thome and Donna Farber at Columbia University [2]. Thank you to Drs. Thome and Farber and the NIH, NIAID, and DAIT for making these data available for public use.
Generating a t-SNE visualization of your flow cytometry data can help you see all of your data points and how they cluster, or relate to each other, in one two-dimensional plot. An example of a t-SNE visualization looks like this:
This is a pseudocolor smooth density plot of a t-SNE map generated in FlowJo. In red are cell clusters of high density, and blue shows areas of low density. You can detect numerous discrete clusters (I can count at least 7), which correspond with unique cell populations, using a t-SNE map.
FlowJo has a nice feature that allows you to apply a heatmap of the different parameters in your dataset in order to characterize which cells are located where on your map. As a side note, you can also apply heatmaps to your t-SNE plots in R, Python, and Cytobank.
Here’s what the map looks like when I apply a heatmap of the different markers used in this sample:
From these heatmaps, you can get an idea of what types of cells are found in each cluster based on the antigen which they are highly expressing. Locate the small cluster of CD103+ cells and appreciate how that group of cells also highly expresses CD8.
A second way to characterize the populations present in your t-SNE map is to manually gate the populations you’re interested in, and then overlay them on the map to see where they’re located and the frequency at which they occur in relation to other cells in the sample, like this:
In this example, the B cells, CD4+ T cells, and CD8+ T cells are clearly discriminated from each other. In this next visualization, I’ve done the same thing but gated down a bit further:
Here I can feel pretty confident that I’ve been able to identify most of the major clusters by manually gating the different populations shown on the right. I can also see quantitatively that B cells are the most frequent population in this sample (39.5%), and CD4+CD103+ cells are the least frequent (0.13%).
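The frequencies shown for overlaid gates are simply each population's share of all events in the map. Here is a minimal sketch of that arithmetic, using made-up gate labels and counts rather than the values from this data set; the function name `population_frequencies` is hypothetical.

```python
from collections import Counter

def population_frequencies(labels):
    """Return each population's percentage of all events."""
    counts = Counter(labels)
    total = len(labels)
    return {name: 100.0 * count / total for name, count in counts.items()}

# Hypothetical per-event gate assignments (not the SDY702 data)
labels = (
    ["B cell"] * 40
    + ["CD4+ T cell"] * 35
    + ["CD8+ T cell"] * 20
    + ["ungated"] * 5
)
frequencies = population_frequencies(labels)
for name, percent in sorted(frequencies.items(), key=lambda item: -item[1]):
    print(f"{name}: {percent:.1f}%")
```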
My favorite way to analyze t-SNE maps and identify the different clusters that they represent is to put the map on a density plot, and then draw gates around the clusters that I can visually identify, like this:
Since I don’t know what the cell types are in each cluster, I label them 1 – 9, as shown.
The next step is to overlay the gates on top of the t-SNE map, and then characterize each population based on the expression of each antigen as shown on their corresponding histogram, like this:
Histograms are extremely instructive in determining the characteristics of each population, and looking at only one population at a time (as opposed to all of them together as shown here) makes it easy to characterize each population.
An alternative way to identify each population is by pulling the MFI data from each population and creating a heatmap, as I’ve done below using the ‘gplots’ package in R [3,4]. This method takes a bit more time than overlaying histograms, but it is a powerful and quantitative way to clearly label and identify what each population is, as well as how the populations relate to each other when a dendrogram clustering component is included. You can exclude dendrogram clustering if you just want to list populations 1-9 in numerical order rather than rearranged to show relationships, which is the way I’ve done it below:
Using this heatmap, I can quickly identify which antigens are present and how brightly they are expressed on each population. Notably, this heatmap shows that population 1 corresponds to a cell type that is high for CD103, and that CD103 is expressed only on cells with very bright CD8 staining, whereas the rest of the CD8+ T cell population has dimmer CD8 expression. This was also observed earlier in this post, when we characterized the populations using heatmaps of each selected antigen on the t-SNE map itself.
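The table behind a heatmap like this is just one MFI value per population per marker. The figure in this post was built with gplots in R; the sketch below is a pure-Python illustration of the underlying table only, using the median as the MFI statistic and made-up intensities. The function name `mfi_table` and the data layout are assumptions.

```python
from collections import defaultdict
from statistics import median

def mfi_table(events, markers):
    """Median intensity of each marker within each population.

    events: list of (population_label, {marker: intensity}) pairs.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for population, intensities in events:
        for marker in markers:
            grouped[population][marker].append(intensities[marker])
    return {
        population: {marker: median(values[marker]) for marker in markers}
        for population, values in grouped.items()
    }

# Hypothetical events from two gated clusters
events = [
    (1, {"CD8": 900.0, "CD103": 750.0}),
    (1, {"CD8": 1100.0, "CD103": 820.0}),
    (1, {"CD8": 1000.0, "CD103": 790.0}),
    (2, {"CD8": 300.0, "CD103": 12.0}),
    (2, {"CD8": 340.0, "CD103": 8.0}),
]
table = mfi_table(events, ["CD8", "CD103"])
for population in sorted(table):
    print(population, table[population])
```

Each row of the resulting table would become one row of the heatmap, with one column per marker.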
For a video tutorial to learn to replicate these plots in FlowJo, check out this blog post.
A Word of Caution
An important caveat to using t-SNE for flow cytometry analysis is that the maps are based on mean fluorescence intensity (MFI). Therefore, if you’re looking at longitudinal data, any shifts in the MFI between time points will bias your results. It is thus critically important to use conventional methods to manually confirm what the algorithm has produced and discovered. The idea is not to confirm this for each population every time (which would eliminate the utility of the algorithm), but rather to confirm data that stand out to you as interesting and reportable.
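One simple sanity check along these lines is to compare a marker's MFI for the same manually gated population across time points before trusting differences on the map. The sketch below flags markers whose MFI changed by more than a chosen relative tolerance. The 20% threshold, the function name `flag_mfi_shifts`, and the values are illustrative assumptions, not a validated criterion.

```python
def flag_mfi_shifts(baseline, followup, tolerance=0.20):
    """Return markers whose MFI changed by more than `tolerance` (relative)."""
    flagged = {}
    for marker, base_value in baseline.items():
        relative_change = (followup[marker] - base_value) / base_value
        if abs(relative_change) > tolerance:
            flagged[marker] = relative_change
    return flagged

# Hypothetical MFI values for one gated population at two time points
baseline = {"CD4": 1000.0, "CD8": 800.0, "CD103": 200.0}
followup = {"CD4": 1040.0, "CD8": 1200.0, "CD103": 190.0}
shifts = flag_mfi_shifts(baseline, followup)
print(shifts)
```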
I hope these visualizations have helped you to understand t-SNE and how it can be used to help you develop unbiased, high-parameter flow cytometry analyses. FlowJo, R, Python, and Cytobank are all excellent tools for creating these visualizations, and two of them (R and Python) are free. Learning to use these tools will become increasingly important as flow cytometry data become more and more complex.
1. Bhattacharya S, Dunn P, Thomas CG, Smith B, Schaefer H, Chen J, Hu Z, Zalocusky KA, Shankar RD, Shen-Orr SS, Thomson E, Wiser J, Butte AJ. ImmPort, toward repurposing of open access immunological assay data for translational and clinical research. Sci Data. 2018 Feb 27;5:180015. doi: 10.1038/sdata.2018.15.
2. Thome JJ, Yudanin N, Ohmura Y, Kubota M, Grinshpun B, Sathaliyawala T, Kato T, Lerner H, Shen Y, Farber DL. Spatial map of human T cell compartmentalization and maintenance over decades of life. Cell. 2014 Nov 6;159(4):814-28. doi: 10.1016/j.cell.2014.10.026. PubMed PMID: 25417158; PubMed Central PMCID: PMC4243051.
3. Gregory R. Warnes, Ben Bolker, Lodewijk Bonebakker, Robert Gentleman, Wolfgang Huber, Andy Liaw, Thomas Lumley, Martin Maechler, Arni Magnusson, Steffen Moeller, Marc Schwartz and Bill Venables (2019). gplots: Various R Programming Tools for Plotting Data. R package version 184.108.40.206. https://CRAN.R-project.org/package=gplots
4. R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.