Case Study 1 Part 1: Introduction, Data Cleaning and Pre-processing


Case Study 1 uses publicly available data from flowrepository.org [1] (data set FR-FCM-ZY36) that identifies CD4 and CD8 T cells responding to peptides derived from Trypanosoma cruzi, the causative agent of Chagas disease. I used the FCS files kindly provided by the authors to work through a detailed analysis of their data, from data cleaning to sophisticated cluster-based data extraction.

Part 1: Data Cleaning and Pre-Processing: A Critical (and often-missed!) Step

Achieving reproducibility in your studies is a key component of being a good scientist, and too many researchers publish results without validating the data. This has contributed to a well-known reproducibility problem: in a 2016 Nature survey, more than 70% of researchers reported having tried and failed to reproduce another scientist’s experiments! This is an obvious waste of time and resources, and less obviously, can damage an author’s scientific integrity, too.

Flow cytometry is a hugely important tool in immunology. Unfortunately, it is also one of the easiest places to introduce bias into your experiment, because sample processing and acquisition involve so many steps, and because gating is often subjective when it should not be. In fact, many people consider flow cytometry an art, but those people are doing it wrong. Flow cytometry is a science, period, and the responsible researcher should treat it as such. Would you consider an ELISA, or a qPCR, an art?

To reduce bias and increase the chances that your flow cytometry data can be reproduced in someone else’s hands, your flow cytometry experiment must be designed with care and executed with consistency. However, even samples run with absolute rigor can produce strange, and often inconvenient, results. With the understanding that many samples are precious and we’d like to keep as many as possible, let’s talk about how to optimize data quality, keep as many samples as possible, and still maintain the tenets of rigor and reproducibility.

Where Do Outliers and Variability Come From?

Understanding where outliers and variability are introduced in your data sets is key to improvement and not repeating the same mistakes. Fixing the issues will save you from having to throw away precious samples and data during the downstream analyses.

That said, the flow cytometry process from start to finish includes a multitude of steps, each of which can introduce a great deal of variability. Below are all of the different steps and the areas where consistency may be compromised:

When staining the sample, variability can be introduced by sample collection, sample processing, sample storage and thawing, antibody lot and concentration, how long the antibodies have been stored, deviations in staining protocols, and the storage conditions of fully stained samples before acquisition. The person doing all of these tasks may also do them differently from another member of the lab.

When acquiring samples on the instrument, variability can be introduced depending on the day, whether QC is run, how voltages are set, the speed at which samples are run, and whether there were any technical issues with the instrument, such as microbubbles or microclogs, that were not easily observed.

Interestingly, the biggest source of variability actually comes downstream of acquisition, during the analysis! Setting gates can be challenging when the appropriate controls are not used and the populations are not clearly separated, which is often the reality with flow cytometry. However, analysis choices don’t change the quality of the underlying data, which is determined by how carefully the samples were prepared and acquired.

“My samples are already acquired. How can I identify bad samples?”

There are some simple preliminary steps that you should take to detect problematic samples prior to any analysis. Without this, you’re in danger of reporting bad data as true results.

When talking about flow cytometry, unclean data is data that contains outliers that are kept in the experiment in error, or data with wild variability between duplicates or samples that should be similar.

One of the quickest and most accurate ways to identify whether there are data quality problems in your flow cytometry samples is by analyzing the consistency of your data across batches. This can be done quickly using programs in R where you can simply import your FCS files and request that your data be visualized on certain graphs. Alternatively, you can use FlowJo overlaid histograms to compare samples.

When checking for consistency, what you’re looking for is whether the major cell types from all of your samples fall at similar fluorescence intensities. Although you can expect small batch-to-batch shifts in the locations of your populations, on average these shifts should not exceed 1/2 log in MFI.
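If you’d rather put a number on that half-log rule than eyeball it, here is a minimal sketch in Python (used purely for illustration; the same idea carries over to R). It assumes you have already extracted the intensities of one population, say CD4+ cells, for each batch. The function names, batch labels, and intensity values are all hypothetical. Comparing each batch against the median batch is an assumption about what the “average” position means, and MFI is taken as the median here.

```python
import math
from statistics import median

def log10_mfi(intensities):
    """Median fluorescence intensity of one population, on a log10 scale."""
    return math.log10(median(intensities))

def flag_batch_shifts(population_by_batch, max_log_shift=0.5):
    """Return batches whose log10 MFI sits more than max_log_shift
    away from the median batch (assumed reference point)."""
    log_mfi = {batch: log10_mfi(vals) for batch, vals in population_by_batch.items()}
    reference = median(log_mfi.values())
    return {
        batch: round(value - reference, 2)
        for batch, value in log_mfi.items()
        if abs(value - reference) > max_log_shift
    }

# Hypothetical CD4+ intensities (arbitrary units) from three batches.
batches = {
    "batch_A": [9000, 10000, 11000],
    "batch_B": [12000, 13000, 14000],
    "batch_C": [90000, 100000, 110000],
}
print(flag_batch_shifts(batches))  # {'batch_C': 0.89}
```

Here batch_C sits almost a full log above the others, so it is flagged, while the small difference between batch_A and batch_B is tolerated.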

A great way to check both the consistency of your data and the locations of your cell populations is to generate line-density plots of your samples from different batches, overlaid on top of each other in an automated way. This is especially useful when you’ve got markers for major cell populations including CD3, CD4, CD8, CD20, HLA-DR, CD14, and so on. These markers tend to produce plots that are clearly bimodal and should show only two peaks. The peaks should also “align” with each other from batch to batch in terms of fluorescence intensity.
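To make the idea of “two peaks that align” concrete, here is a rough sketch of what sits behind a line-density plot: bin the log10 intensities of each sample, then look for local maxima. It is a simplification under stated assumptions. The bin range, the min_height cutoff, and the toy intensity values are all made up for illustration, and real data would normally be transformed (for example with a logicle or arcsinh transform) and smoothed before peak finding.

```python
import math

LO, HI, BINS = 1.0, 6.0, 20           # log10 axis range and bin count (assumed)
WIDTH = (HI - LO) / BINS

def density_curve(intensities):
    """Fraction of events falling in each log10-intensity bin."""
    counts = [0] * BINS
    for x in intensities:
        if x <= 0:
            continue                   # a log axis cannot show non-positive values
        idx = math.floor((math.log10(x) - LO) / WIDTH)
        if 0 <= idx < BINS:
            counts[idx] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

def peak_centers(curve, min_height=0.1):
    """Log10 centers of bins that are local maxima above min_height (naive peak finder)."""
    peaks = []
    for i, height in enumerate(curve):
        left = curve[i - 1] if i > 0 else 0.0
        right = curve[i + 1] if i < len(curve) - 1 else 0.0
        if height >= min_height and height > left and height >= right:
            peaks.append(LO + (i + 0.5) * WIDTH)
    return peaks

# Toy samples: same negative population, but sample_b's positive peak is shifted up.
negatives = [80, 120, 150, 120, 200]
sample_a = negatives + [8000, 12000, 15000, 12000, 20000]
sample_b = negatives + [40000, 60000, 80000, 90000, 120000]

print(peak_centers(density_curve(sample_a)))             # [2.125, 4.125]
print(peak_centers(density_curve(sample_b)))             # [2.125, 4.875]
print(peak_centers(density_curve(sample_a + sample_b)))  # [2.125, 4.125, 4.875]
```

Each toy sample is bimodal on its own, but because the positive peaks sit at different intensities, pooling or overlaying them produces three peaks instead of two.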

In the graphs below, each line represents data from a unique FCS file, acquired at different times and under different stimulation conditions. Each plot represents the density distribution of all cells expressing different fluorescent intensities of each indicated marker. Below is an example of a situation where peaks of cells with the same markers definitely do not align with each other between batches, resulting in plots that have three peaks, when they should only have two:

Figure: Line plots of all FCS files analyzed for CD4 and CD8 expression

These plots should both be distinctly bimodal and on top of each other, and they’re not. Both plots have (at least) three distinct peaks.

However, in this study the authors stimulated cells with peptides as well as Staphylococcal Enterotoxin B (SEB) as a positive control. Therefore, it’s possible that one of the extra peaks reflects the influence of stimulation. We check that next, by coloring the lines by stimulation group:

Figure: Line plots of all FCS files colored by stimulation group

We can see from the graphs that the three peaks are not explained by stimulation, as all peaks contain a mix of the three stimulation groups. Ideally, we would then try to explain the discrepancies by looking at acquisition time points, to see if there was one particular time when the data looked a bit different. However, we don’t have access to that information, so we use the next best thing: patient IDs. Assuming that each patient’s set of samples (three per patient, corresponding to the stimulations) was run at the same time, we can color the lines by patient and see if any stand out.

Figure: Line plots of all FCS files colored by patient

We can now see that samples from the same patient cluster together, which is particularly evident in the CD4 plot. In the CD8 plot, they mostly cluster together, with at least two samples from the same patient falling in the same location on the graph.

This is a strong indication that the variability came from acquiring the data at different time points, rather than from biological variability that may have been introduced by the stimulation conditions.
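The same comparison can be made numerically instead of by color. The sketch below is only an illustration with made-up numbers, not values from this data set. It assumes you have one positive-peak position (in log10 units) per FCS file, and it compares how tightly those positions group by patient versus by stimulation. The record layout and the function name within_group_spread are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def within_group_spread(records, key):
    """Average standard deviation of peak positions inside each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec["peak"])
    return mean([pstdev(peaks) for peaks in groups.values()])

# Hypothetical log10 positions of the CD4+ peak: 3 patients x 3 stimulations.
records = [
    {"patient": "P1", "stim": "unstim",  "peak": 4.0},
    {"patient": "P1", "stim": "peptide", "peak": 4.1},
    {"patient": "P1", "stim": "SEB",     "peak": 4.0},
    {"patient": "P2", "stim": "unstim",  "peak": 4.5},
    {"patient": "P2", "stim": "peptide", "peak": 4.5},
    {"patient": "P2", "stim": "SEB",     "peak": 4.6},
    {"patient": "P3", "stim": "unstim",  "peak": 5.0},
    {"patient": "P3", "stim": "peptide", "peak": 4.9},
    {"patient": "P3", "stim": "SEB",     "peak": 5.0},
]

by_patient = within_group_spread(records, "patient")
by_stim = within_group_spread(records, "stim")
print(round(by_patient, 2), round(by_stim, 2))  # 0.05 0.38
```

A small spread within patients combined with a large spread within stimulation groups is the numerical version of what the colored plots show: the peaks move with the patient (and presumably the acquisition run), not with the stimulation.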

In this case, you would have several options moving forward:

  1. Analyze dot plots and discard samples that do not have a clear bimodal distribution or good resolution.
  2. After discarding those samples, you can still analyze the effects of stimulation, because the unstimulated samples give you the appropriate controls. Data inconsistency should not be a problem for this.
  3. You can still gate on CD4 and CD8 T lymphocyte populations, if the populations are distinctly separated.
  4. You can still gate on populations that are not clearly separated, but only if you have the appropriate FMO (fluorescence-minus-one) controls.

As you can see, even when there are data consistency problems, you can still use the data as long as you have all of the necessary controls in place. However, without these controls, you would have serious problems moving forward. If, for example, you did not have an unstimulated sample, or you did not have FMOs, you would absolutely not be able to define your positive gates. In that case, you would need to discard all samples that were not highly consistent, and even then, defining your gates would still be a challenge without the proper controls (and we don’t advocate for this!).

Following Option 1, we now come back to the dot plots to see if there are any samples that we should discard based on the inability to discern clear populations:

Figure: Flow cytometry dot plots, patients 1-9
Figure: Flow cytometry dot plots, patients 10-18

We can see visually that Patients 6, 7, 15, 16, 17, and 18 (and maybe also 12) do not have clearly defined CD4 lymphocyte subsets. We would need to exclude these samples when analyzing stimulation effects on CD4 lymphocytes, unless we had a negative or (ideally) FMO control for CD4. However, the CD8 lymphocytes are clearly distinguished and well separated, so we could proceed with our analysis of CD8 lymphocytes for these patients.
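The call above was made by eye. If you wanted a rough numerical stand-in for “clearly defined,” one possibility is to compare the depth of the valley between the two tallest peaks of a marker’s density curve with the height of the shorter peak. This is a hypothetical metric sketched for illustration, not the method used here, and the toy curves and any cutoff you pick are assumptions.

```python
def valley_ratio(curve):
    """Lowest point between the two tallest local maxima, divided by the
    shorter of those maxima. Near 0 means well separated, near 1 means
    the populations run together. Returns None with fewer than two peaks."""
    peaks = [
        i for i in range(len(curve))
        if curve[i] > (curve[i - 1] if i > 0 else 0)
        and curve[i] >= (curve[i + 1] if i < len(curve) - 1 else 0)
    ]
    if len(peaks) < 2:
        return None
    tallest_two = sorted(peaks, key=lambda i: curve[i], reverse=True)[:2]
    first, second = sorted(tallest_two)
    valley = min(curve[first:second + 1])
    return valley / min(curve[first], curve[second])

# Toy density curves (event counts per intensity bin).
well_separated = [0, 5, 30, 5, 0, 0, 4, 20, 4, 0]
smeared = [0, 5, 20, 16, 15, 17, 18, 6, 0, 0]
single_peak = [0, 3, 10, 3, 0]

print(valley_ratio(well_separated))     # 0.0
print(round(valley_ratio(smeared), 2))  # 0.83
print(valley_ratio(single_peak))        # None
```

A score like this would only help you triage which samples to look at first. The dot plots and proper controls remain the deciding factor.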

Interestingly, if we reconstruct the original line graphs without these patient samples, the result looks like this:

Figure: Line plots of selected FCS files by patient

As you can see, even after removing the data where we couldn’t see two clearly defined peaks, the plots are still not bimodal and do not line up with each other. That’s because even the samples whose dot plots showed well-separated populations still differ in fluorescence intensity. Although we can now analyze the data manually, this precludes analyzing it in batches, and it creates problems for more sophisticated analyses that rely on consistent fluorescence intensity.

Inconsistency Still Creates “Bigger” Problems

A major issue that data inconsistency creates is the inability to analyze samples in large batches. For data such as these, the gates would have to be drawn differently for every single sample, because the samples express the different parameters at quite different intensities. This workload becomes infeasible as the number of samples, or of parameters you are analyzing, grows.

A second big issue that the inconsistency creates is that you cannot run algorithms and next-gen analyses based on fluorescence intensity, because the MFI differs so drastically. This limits you to simplistic manual analysis and only being able to draw gates on 2D plots.

One way to work around this problem is by normalizing the data, so that the peaks align with each other, and then applying batch gating. Instead of fully-gating each sample, you can quickly go through and confirm that the gates are all in the correct place, because normalizing the populations aligns the peaks to one reference sample.

I’ll show you how to do this in Case Study 1: Part 2.
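In the meantime, here is a toy sketch of what “aligning the peaks to one reference sample” can mean. It is a simple two-landmark linear rescaling in log10 space, chosen only because it is easy to follow. It is not necessarily the approach used in Part 2, and the function name, peak positions, and intensities are hypothetical.

```python
import math

def align_to_reference(intensities, sample_peaks, reference_peaks):
    """Rescale intensities so the sample's negative and positive peaks
    (given as log10 positions) land on the reference sample's peaks."""
    s_neg, s_pos = sample_peaks
    r_neg, r_pos = reference_peaks
    if s_pos == s_neg:
        raise ValueError("sample needs two distinct peaks to be aligned")
    scale = (r_pos - r_neg) / (s_pos - s_neg)
    return [
        10 ** (r_neg + (math.log10(x) - s_neg) * scale)
        for x in intensities
        if x > 0                      # non-positive values have no log10
    ]

# Hypothetical landmarks: the shifted sample's positive peak sits a full log too high.
reference_peaks = (2.0, 4.0)
shifted_peaks = (2.0, 5.0)
shifted_sample = [100, 100000]        # one event at each peak of the shifted sample

aligned = align_to_reference(shifted_sample, shifted_peaks, reference_peaks)
print([round(v) for v in aligned])    # [100, 10000]
```

Once every sample’s peaks sit where the reference sample’s do, a single gate drawn on the reference is at least in the right neighborhood for all of them, which is what makes batch gating practical.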

Good luck and thanks for reading!


  1. Spidlen J, Breuer K, Rosenberg C, Kotecha N and Brinkman RR. FlowRepository – A Resource of Annotated Flow Cytometry Datasets Associated with Peer-reviewed Publications. Cytometry A. 2012 Sep; 81(9):727-31.
