This is Part 2 of Case Study 1, where I’m outlining the process of semi-automated flow cytometry analysis from the beginning (pre-processing) to the end (clustering and interpretation). Part 1 introduced the data set that I’m using, how to identify problematic samples that should not be included in your analyses, and how to decide whether your experiment is appropriately designed to proceed with analysis. In Part 2, I’ll be explaining the process of data normalization: aligning peaks of data between samples to make gating quicker and easier, and to make the data suitable for clustering analyses.
Part 2: Data Normalization – a Prerequisite for Quick & Easy (and Automated!) Gating and Cluster Analyses
In general, data normalization refers to the process of transforming variables to fit a limited range of values and create consistency between samples. In some cases, including in the case of flow cytometry, it will change the underlying values associated with your dataset. Thus, it should be used only when appropriate (we’ll talk about this) and not universally.
One of the biggest advantages of normalizing data, particularly flow cytometry data, is that achieving consistency of values among your datasets greatly increases the efficiency of analysis, especially when analyzing large numbers of samples. This is because you can apply template gates that require little (if any) changing between samples, because the data from different sets will be closely aligned with each other.
A second major advantage of normalizing data is that it is an absolute requirement for downstream automated clustering analyses, which is of particular importance when analyzing high-parameter data sets. Clustering analyses largely rely on MFI, or median fluorescence intensity, and this can differ drastically depending on the day that the sample is run. The computer running the automated clustering analysis can’t tell that these day-to-day differences are technical, and assumes that they are all real, biological, and reproducible differences between samples. Therefore, interpreting MFI data that has not been normalized first can be incredibly misleading.
There are several important caveats to consider before normalizing data, however. The data should already be fairly consistent between runs, as a result of optimized instrument settings used with each run, as well as consistent sample processing and preparation.
Any data that’s wildly outside the normal range (based on visual inspection) should be discarded. Examples of data outside the normal range are samples that exhibit extremely poor separation, or that do not stain for markers that you know should be present in the sample.
Furthermore, there are important biological implications to consider prior to normalization. For example, several markers on cells do not exhibit simple, clear, well-separated positive and negative populations (a bimodal distribution). Instead, small shifts in the MFI of the total cell population represent real biological changes, and you would not want to normalize such samples to each other, because then these real changes would no longer be observable.
Reference controls help to solve these issues
The best way to normalize data is to utilize a reference control with every single flow cytometry run. Reference controls are samples drawn from a very large pool or supply that (ideally) express all of the markers you have in your panel. These controls have no relation to the objective of the study. They are utilized solely for the purpose of ensuring consistency among your flow cytometry data from run to run.
In the best case, you have enough of the reference control available to utilize it with each run, and this sample should (theoretically) look identical between runs, since it comes from the same pool.
When you use reference controls, you can be confident that your antibodies are working, that the instrument is working, and that you executed your protocol correctly. Even if all of your study samples are negative for every color, as long as your reference controls look normal, you know that the unexpected results are real and biological, and not the result of an issue with the assay.
Reference controls can also tell you whether a large shift in MFI among some or all of your colors is true and biological, or whether that large shift occurred among all of your samples and is not a reflection of a biological process. Thus, even if you don’t intend to utilize batch gating, automated gating, or cluster analyses, reference controls should still be used between runs as an indicator of whether observations you see are biological or an artefact of the assay.
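As a rough illustration of the run-to-run check described above, here is a minimal Python sketch. It compares the MFI of the same reference control across runs and flags runs that drift from a chosen baseline run. The function names, the intensity values, and the 20% tolerance are all hypothetical and only there to make the idea concrete; in practice you would pick a threshold that suits your own panel and instrument.

```python
from statistics import median


def reference_mfi_shift(reference_by_run, baseline_run):
    """Return each run's reference-control MFI as a ratio to the baseline run.

    reference_by_run maps a run name to the fluorescence values of the
    reference control for one marker in that run (hypothetical layout).
    """
    baseline = median(reference_by_run[baseline_run])
    return {run: median(values) / baseline
            for run, values in reference_by_run.items()}


def flag_drifting_runs(shifts, tolerance=0.2):
    """List runs whose reference MFI differs from baseline by more than tolerance.

    The 0.2 (20%) default is an arbitrary illustrative cut-off.
    """
    return sorted(run for run, ratio in shifts.items()
                  if abs(ratio - 1.0) > tolerance)


# Hypothetical CD4 intensities of the same reference control on three days.
reference_cd4 = {
    "run_1": [980, 1000, 1010, 1020, 990],
    "run_2": [1000, 1030, 1050, 1040, 1010],
    "run_3": [1480, 1500, 1530, 1510, 1490],
}

shifts = reference_mfi_shift(reference_cd4, baseline_run="run_1")
drifting = flag_drifting_runs(shifts)
print(shifts)
print("Runs to inspect before comparing study samples:", drifting)
```

If a run is flagged here, a matching shift in its study samples is more likely to be technical than biological, because the reference control comes from the same pool every time.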
What does data look like after normalization?
The best way for me to show you what I mean is visually. Below are two figures: one showing CD4 expression among CD3+ cells, and a second showing CD8 expression among CD3+ cells.
You can see that the bimodal peaks don’t line up with each other. After normalizing, this is what the data looks like:
The normalization algorithm isn’t perfect: you can see that the bottom few peaks of the CD4 populations still aren’t lined up. Because those peaks remain so different from the rest of the samples, those samples would not be suitable for clustering analyses that rely on MFI.
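To make the idea of "lining up the peaks" more concrete, here is a small Python sketch of one simple way it could be done for a clearly bimodal marker. This is not necessarily the algorithm used for the figures in this post. It assumes the data are already on a transformed (for example arcsinh or logicle) scale, estimates the negative and positive populations with a one-dimensional two-means split, and then linearly rescales a sample so its two landmarks match those of a reference. All names and the simulated values are hypothetical.

```python
import numpy as np


def estimate_landmarks(values, n_iter=50):
    """Estimate the centres of the negative and positive populations.

    Splits the values into a low and a high group with a 1-D two-means
    procedure, then returns the median of each group. This only makes
    sense for markers with two well-separated populations.
    """
    values = np.asarray(values, dtype=float)
    centres = np.percentile(values, [10, 90])  # starting guesses
    for _ in range(n_iter):
        is_high = np.abs(values - centres[1]) < np.abs(values - centres[0])
        if is_high.all() or not is_high.any():
            raise ValueError("could not split the data into two populations")
        new_centres = np.array([values[~is_high].mean(), values[is_high].mean()])
        converged = np.allclose(new_centres, centres)
        centres = new_centres
        if converged:
            break
    return float(np.median(values[~is_high])), float(np.median(values[is_high]))


def align_to_target(values, target_low, target_high):
    """Linearly rescale values so their two landmarks match the targets."""
    low, high = estimate_landmarks(values)
    scale = (target_high - target_low) / (high - low)
    return target_low + (np.asarray(values, dtype=float) - low) * scale


# Simulated, already-transformed CD4 values: a reference sample and a
# sample from another day whose peaks sit higher for technical reasons.
rng = np.random.default_rng(0)


def simulate_sample(neg_centre, pos_centre, n=2000):
    return np.concatenate([rng.normal(neg_centre, 0.3, n),
                           rng.normal(pos_centre, 0.3, n)])


reference = simulate_sample(1.0, 5.0)
shifted = simulate_sample(1.8, 6.4)

ref_low, ref_high = estimate_landmarks(reference)
aligned = align_to_target(shifted, ref_low, ref_high)
aligned_low, aligned_high = estimate_landmarks(aligned)
print("reference peaks:", ref_low, ref_high)
print("aligned peaks:  ", aligned_low, aligned_high)
```

Note that this kind of alignment forces the two populations into the same place in every sample, which is exactly why it should not be applied to markers where a small shift of the whole population is the biological signal.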
Next: Use an MDS plot to quickly assess global differences
Once your samples are normalized, the next step prior to deep clustering would be to use a simple MDS (multidimensional scaling) plot to determine how different (or alike) the samples are to each other. In our example, we’re using a set of samples which have been stimulated with SEB, antigen, or media alone. They also represent two different populations: those with chronic Chagas’ disease, and healthy controls. I’d expect the samples to diverge and cluster based on both disease and stimulation status.
In the MDS plot below, I’m showing only a subset of the patients for simplicity.
What we can tell from this plot is that although the samples differ from each other based on stimulation status within one patient, there is no obvious pattern standing out between patients.
You can see, however, that Patient 1, whose unique CD4 pattern could not be fully normalized, does stand apart from the rest.
Again, it would probably be best to exclude this patient’s samples from further automated analyses. They show up as a large difference between patients, but we know from looking at the data that this was not a biological difference and was most likely a technical issue.
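For readers who want to try the MDS step themselves, here is a minimal Python sketch of classical (Torgerson) MDS using only NumPy. The post does not depend on this particular implementation. As an assumption, each sample is summarized by the median intensity of a few markers after normalization; the sample labels and numbers below are made up for illustration.

```python
import numpy as np


def classical_mds(features, n_components=2):
    """Classical (Torgerson) MDS on Euclidean distances between samples.

    features: one row per sample, one column per summary value
    (here, hypothetical per-marker medians).
    Returns one row of coordinates per sample.
    """
    features = np.asarray(features, dtype=float)
    # Squared Euclidean distance between every pair of samples.
    diffs = features[:, None, :] - features[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=-1)
    # Double-centre the squared distances to get a Gram matrix.
    n = sq_dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq_dist @ centering
    # Keep the directions with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Hypothetical samples: two patients, three stimulation conditions each.
labels = ["P1_media", "P1_antigen", "P1_SEB",
          "P2_media", "P2_antigen", "P2_SEB"]
marker_medians = np.array([
    [1.0, 0.9, 0.2],
    [1.1, 1.0, 0.8],
    [1.0, 1.1, 2.5],
    [1.2, 0.8, 0.3],
    [1.1, 0.9, 0.9],
    [1.2, 1.0, 2.4],
])

coords = classical_mds(marker_medians, n_components=2)
for label, (x, y) in zip(labels, coords):
    print(f"{label}: ({x:.2f}, {y:.2f})")
```

The two columns of `coords` can be drawn as a scatter plot and colored by patient or by stimulation to look for the kind of grouping discussed above. The signs of the axes are arbitrary, so only the relative distances between points are meaningful.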
At this point, you could proceed further with more complicated downstream analyses or confirm your results with the standard box plots.
We’ll do both in Part 3: Data Validation with Standard Analyses
Thanks for reading!