The management of flow cytometry data can be improved in a number of areas, including ways for users and flow cytometry facility staff to help maximize the usefulness and longevity of the data.
Most flow cytometry experiments are highly structured in ways that are not conveyed by simply listing the cells and reagents contained in each sample: for example, some samples are controls; other samples may constitute a time series or a reagent titration series; or a number of samples may come from different tissues of the same mouse.
A special designation should be given to samples containing fluorescence compensation standards. In systems using analog fluorescence compensation, single-dye-labeled samples are needed to set the fluorescence compensation adjustments or to check standard compensation settings. Data on these standard samples should be recorded to check the settings used in the rest of the experiment. When computed off-line compensation is used, the necessary coefficients are derived from an analysis of the single-dye samples.
The standard format for flow cytometry data files, the Flow Cytometry Standard (FCS; unit 10.2), is single sample-oriented and is not structured to describe the relationships among different samples in an experiment or to relate one experiment to another. At present these must be recorded separately. Some efforts have been made to systematize the organization and use of experiment-level information within particular software systems. For example, software under development at Stanford University (Treister et al., 1996) organizes all information about the samples in an experiment into one "workspace" document. The user can define and name groups of related samples within a workspace and apply complex analysis specifications to all the samples in a group in a single step.
In a different approach to encoding experiment-level information, a group at Purdue has defined a Tube Identification Parameter (TIP) and used it to discriminate different cell samples within a single composite FCS file (Robinson et al., 1991, 1992; Durack et al., 1991). The TIP is included as a parameter in each cell record in the listmode file and is incremented between cell samples. Gatings or other analyses common to the whole set of samples can be done in a single operation on all of the data in the file. By gating on the TIP, individual samples or groups of samples can be selected from the set. This method retains the structural information within the FCS file format, but the inclusion of different samples in one FCS file will often invalidate standard FCS keywords (such as the ones naming the reagents in each fluorescence channel) that are not common to all the samples in the set. These keywords represent the key distinguishing features of the samples for normal FCS-compliant software, so although the system has proven useful, it also illustrates the need to incorporate higher-level information into the Flow Cytometry Standard in a systematic way.
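As a rough illustration of the TIP idea (the event values and column layout below are hypothetical, not the Purdue implementation), gating on the TIP column pulls one sample out of a composite listmode array:

```python
import numpy as np

# Hypothetical composite listmode array: one row per event; columns are
# FSC, FL1, and the Tube Identification Parameter (TIP), which is
# incremented between cell samples.
events = np.array([
    [120, 340, 1],
    [130, 355, 1],
    [400,  90, 2],
    [410,  85, 2],
    [ 55, 600, 3],
])

def select_sample(data, tip):
    """Gate on the TIP column to select one sample from the composite file."""
    return data[data[:, -1] == tip]

tube2 = select_sample(events, 2)   # only the events recorded for tube 2
```

Analyses common to the whole set operate on `events` directly; per-sample analyses gate on the TIP first.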
Unit 10.2 describes the latest update of the standard format for flow cytometry data files, FCS 3.0. The next major advance in flow cytometry data standards should be to develop experiment- and project-level data specifications, but as of early 1997 there seem to be no plans to move in this direction within the FCS 3.0 project.
It is very important to be consistent in annotating samples to ensure that data are accessible and interpretable. Consistency is required both in the recording of relevant features and in the specification of a particular cell, reagent, or condition. Use of synonyms and different abbreviations from one experiment to the next makes it much harder to connect related data.
Ideally all cell and reagent specifications should be drawn from a database. This ensures consistency in naming and provides a central place to keep information, such as the titers of different reagent lots, or data, such as the number of antibody molecules needed to produce one unit of fluorescence. With access to the latter information, a data analysis program can draw plots with axes based on antibody molecules per cell.
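As a sketch of how such a calibration could be applied (all parameter values here are invented placeholders, not drawn from any particular system), a conversion from a log-amplifier channel number to antibody molecules per cell might look like:

```python
def channel_to_molecules(channel, decades=4.0, channels=512,
                         molecules_per_unit=150.0):
    """Convert a log-amplifier channel number to antibody molecules per
    cell. The calibration factor (molecules per fluorescence unit) is
    the kind of value a reagent database would supply; 150.0 is a
    made-up placeholder, as are the 4-decade/512-channel scale values."""
    fluorescence = 10 ** (decades * channel / channels)
    return fluorescence * molecules_per_unit
```

With this conversion available, plot axes can be relabeled in antibody molecules per cell rather than raw channel numbers.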
Instrument Conditions and Time-of-Collection Annotations
In general, all of the computer-readable instrument settings used during data collection should be recorded so that they can be retrieved in the future (Parks and Bigos, 1990). It is redundant to record this fully in every FCS file, but at present there is no generally accepted way to store this information more efficiently and still ensure that a cell data sample remains linked to the instrument condition record.
Instrument settings and test particle measurements are important data for verifying the cytometer measurements of cells. In addition, text annotations of observations during the measurement and, where needed, notes clarifying the data (e.g., explaining different sets of data collected to document a sort) should be recorded.
The time of collection is an important part of the data. This makes it possible to check the order in which samples were run and provides a simple link to time-stamped log files. The data rate or duration of the data collection can also be useful in evaluating unexpected data results.
For logarithmically amplified immunofluorescence measurements, 9-bit data (steps of ~2% per channel) are adequate. Higher resolution may be useful for cell cycle analysis.
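The ~2% figure can be checked directly: for a log scale spanning a given number of decades, the ratio between adjacent channels is 10^(decades/channels). A minimal calculation, assuming a typical 4-decade amplifier (an assumption, since the decade span is not stated above):

```python
def percent_step_per_channel(decades, bits):
    """Percent increase in signal between adjacent channels of a
    logarithmically binned scale covering the given number of decades."""
    channels = 2 ** bits
    return (10 ** (decades / channels) - 1) * 100

step = percent_step_per_channel(4, 9)   # ~1.8% per channel for 9-bit data
```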
Measurements used to gate the recorded data events should be included in the data collection. Such gating is usually used to exclude dead cells or debris, but it is a good idea to set "loose" gates and refine them in the data analysis process. Thus, the values used for gating should be recorded.
The number of data events that should be collected depends on the frequencies of populations to be evaluated and on the required accuracy of those evaluations (Parks and Bigos, 1997). In general, the lower the frequency of the population of interest, the more data events are needed.
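A quick estimate follows from Poisson counting statistics, in which the coefficient of variation of a count of n events is about 1/√n (this rule of thumb is an assumption here, not necessarily the exact treatment in the cited reference):

```python
import math

def events_needed(frequency, target_cv):
    """Estimate the total events required so that the count of a
    subpopulation at the given frequency reaches the target coefficient
    of variation, assuming Poisson statistics (CV of n counts ~ 1/sqrt(n))."""
    subpop_events = (1.0 / target_cv) ** 2
    return math.ceil(subpop_events / frequency)

# e.g., a population at 0.1% frequency measured to 5% CV needs on the
# order of 400,000 total events
n = events_needed(0.001, 0.05)
```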
Most software allows data storage in one- or two-dimensional plots as well as in list form. When computer storage was expensive and data were limited to two or three measurements per cell, histograms provided an adequate alternative; however, with ample computer storage currently available, there is little reason to record primary data in anything but listmode (see unit 10.3 for more extensive discussion of this point). It is still valuable to keep the lists as compact as possible, both to make the best use of storage space and to minimize transmission time during data transfer.
The use of bit packing rather than full bytes is advantageous, particularly for 9- to 12-bit data, as it takes only 56% (9-bit) to 75% (12-bit) as much space. Bit-packed data conforms to the FCS. Unfortunately, it appears that most commercial flow cytometry software does not support it. For large data files with lower bit resolution and nonuniform distributions of data values, Huffman encoding can be more efficient than packed bits (Bigos and Moore, 1996). Tested on an eight-dimensional 50,000-cell data set, Huffman encoding gave 18% more compression than simple packed-bit storage at 8-bit resolution but only 6% at 12 bits.
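A minimal sketch of the packed-bits idea for 12-bit data (the nibble layout below is one plausible choice, not the layout any particular FCS implementation requires): two 12-bit values fit into three bytes, which is the source of the 75% figure quoted above.

```python
def pack12(values):
    """Pack an even-length list of 12-bit values, two values per three
    bytes. The byte layout is an illustrative choice."""
    out = bytearray()
    for i in range(0, len(values), 2):
        a, b = values[i], values[i + 1]
        out += bytes([a >> 4, ((a & 0xF) << 4) | (b >> 8), b & 0xFF])
    return bytes(out)

def unpack12(data):
    """Recover the original 12-bit values from the packed bytes."""
    values = []
    for i in range(0, len(data), 3):
        a = (data[i] << 4) | (data[i + 1] >> 4)
        b = ((data[i + 1] & 0xF) << 8) | data[i + 2]
        values += [a, b]
    return values

# Four 12-bit values occupy 6 packed bytes instead of 8 bytes as
# 16-bit words: 75% of the space.
```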
Data Storage, Maintenance, and Recovery
At present it seems that most flow cytometry facilities rely on the underlying structure of their computer file system to manage their data. With this arrangement consistent organization of data files into experiments and projects is not ensured. To facilitate consistent data management, each instance of data collection needs to have a unique identifier assigned so that any outputs (e.g., plots or tables) derived from that data can be tagged with the identifier and traced back to the original data source.
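One possible identifier scheme (an illustration, not a standard or any facility's actual practice) combines a timestamp with a random component, so that the tag is generated once at collection time and stamped onto every derived plot or table:

```python
import datetime
import uuid

def collection_id():
    """Assign a unique identifier to one data collection. The
    date-plus-UUID format is an illustrative choice; the point is that
    the tag is generated once and propagated to all derived outputs."""
    stamp = datetime.date.today().isoformat()
    return f"{stamp}-{uuid.uuid4().hex[:8]}"
```

The date prefix keeps identifiers human-sortable, while the random suffix guarantees that two collections on the same day can never be confused.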
The FCS (unit 10.2; Data File Standards Committee of the Society for Analytical Cytometry, 1990) has been defined to provide consistent data forms and facilitate data interchange. Files written to this standard should be readable by any program that supports FCS (as almost all commercial analysis programs claim to do). At this time, however, full intercompatibility has not been realized. That will require production of standard coding/decoding packages that implement the full FCS and can be incorporated into all flow cytometry software, an effort now being undertaken by the FCS Committee.
Current data collections often include eight to ten primary fluorescence and light-scatter measurements. As the ability to delineate low-frequency cell populations increases, the average number of cells that must be measured also increases. This increase in the number of measurements and the number of cells measured has led to the production of very large data files.
These days, however, data storage is the simplest aspect of data management. Large, inexpensive disks make it easy to keep a large quantity of data online, and high-volume tape storage is inexpensive and compact. Recovery of data from tape, however, is slow, and the life expectancy of data tapes is shorter than the minimal 10 years that is appropriate for flow cytometry data storage. Optical compact disks are now coming into routine use and are much better than tape in these two respects. Their only limitation is that they generally are not rewritable (see unit 10.3), so the original data files cannot be updated with analysis specifications and results. It is possible to keep all experiment and sample information, including the analysis specifications and results, online while storing the numerical data themselves elsewhere. In this case the numerical data are separated from the online data, which must include information for recovery of the numerical data.
Routine data analysis includes data gating, specification of plots to display signal distributions, and computation of statistical results. Records of these analyses should be retained with the rest of the information about the sample. Graphs and tables derived from data (e.g., fluorescence signal medians for export to a spreadsheet program) should include sufficient information to trace back to the original flow cytometer data records.
Computed data transformations can be an important part of the data analysis record. Analog fluorescence compensation is discussed in Parks (1997). Off-line fluorescence compensation, resulting in the computation of new data dimensions, can be useful when stored fluorescence data are uncompensated or undercompensated. In multiple-laser systems, a dye may be excited by more than one laser, leading to unwanted signal contributions that cannot be corrected by ordinary analog fluorescence compensation. In this case, off-line compensation may be required prior to analysis of cell populations and extraction of numerical results. Analysis of data from singly stained cells of each type provides the subtraction coefficients used to correct mixed dye results. When off-line fluorescence compensation has been done, the matrix that specifies the transformation should be recorded as part of the analysis. This documents the transformation and allows comparisons between the adjustments required in different experiments.
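A minimal sketch of off-line compensation (the two-dye spillover values below are invented for illustration): each row of the spillover matrix comes from a singly stained sample, and its inverse, the matrix that should be recorded with the analysis, maps measured detector signals back to per-dye signals.

```python
import numpy as np

# Hypothetical spillover matrix derived from singly stained samples:
# row i gives the fraction of dye i's signal seen in each detector.
spillover = np.array([
    [1.00, 0.15],   # dye 1: 15% of its signal appears in detector 2
    [0.05, 1.00],   # dye 2: 5% of its signal appears in detector 1
])

def compensate(raw, spill):
    """Solve raw = true @ spill for the per-dye signals. The inverse of
    the spillover matrix is the transformation to record with the
    analysis."""
    return raw @ np.linalg.inv(spill)

raw = np.array([[1150.0, 250.0]])      # measured detector signals
per_dye = compensate(raw, spillover)   # compensated values
```

Recording `np.linalg.inv(spillover)` alongside the results both documents the transformation and allows the coefficients required in different experiments to be compared.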
FCS 3.0 includes space for adding analysis results to the data file (unit 10.2; http://nucleus.immunol.washington.edu/ISAC.html). Because there is no fixed format for analysis results, they are recorded as unformatted text. Therefore, no standard method of retrieving specific analysis results is available.
One general method for recording cell populations is to define new 1-bit list parameters that identify which cells fall into particular gated populations. One-bit lists may be an effective way to represent and store the results of an analysis, particularly when the analysis identifying a population is complex, such as that derived from a cluster analysis.
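A small sketch of the 1-bit list idea (hypothetical values; a plain threshold stands in for whatever complex analysis actually defined the population):

```python
import numpy as np

# Hypothetical FL1 values for six events.
fl1 = np.array([10, 500, 620, 15, 480, 30])
in_gate = fl1 > 100                    # the 1-bit list parameter

# Packed, membership costs one bit per event regardless of how
# complicated the gating that produced it was.
packed = np.packbits(in_gate)
restored = np.unpackbits(packed)[: fl1.size].astype(bool)
```

However the population was derived, the stored result is the same compact boolean list, so even a cluster-analysis assignment can be replayed exactly without rerunning the clustering.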
At present flow cytometric databases have several useful roles. They provide a flexible way for the user to organize information; thus, database interactions may well become the predominant way in which flow cytometry users interface with the data. For example, rather than starting a data analysis program and naming a file for it to operate on, the database could be asked for experiments from the previous month under a specific name. An analysis program could then be launched on one of these experiments.
If experiment and sample information have been entered fully and consistently, a database will be able to provide useful responses to queries such as "Who has been doing staining for CD192 recently?" or "Which experiments used SJL mice?"
Flow cytometry service facilities and larger laboratories should organize their current data collections with the expectation that they will soon be porting them into a database. Even if that does not happen, they will certainly benefit from better-organized data.
It is good practice for a service facility to retain a copy of all data it generates. However, well-documented data also make access control a critical issue, since well-organized results are easier for unauthorized parties to exploit. Therefore, a center with central data management has a responsibility to prevent unauthorized use of data. Although most flow cytometry data should be publicly available after publication, at certain times users should be allowed to restrict access to their data.
Another major aspect of data security is the prevention of accidental or intentional data loss. Users should be allowed to "delete" data that they believe to be incorrect or obsolete; however, the system should retain these data but add a "deleted" annotation so that they do not appear in routine data listings.
An example of a comprehensive data management system for flow cytometry is the DESK system developed at Stanford University. DESK is a comprehensive system for designing flow cytometry experiments and for collecting, annotating, managing, and analyzing flow cytometry data (Moore, 1984, 1987; Moore and Bigos, 1990). It has been in use since the mid-1980s at Stanford and in a small number of other laboratories around the world. The data management aspects of DESK have proven to be very valuable in organizing large volumes of data and in keeping them usable and accessible over a period of years.
Based on this experience, a number of conclusions and recommendations can be made:
1. The experiment design editor in DESK provides for one-time entry of cell and reagent information for an experiment and makes it easy to specify cell and reagent combinations for all samples in the experiment. This is useful for ensuring consistency in cell and reagent names among all samples in the experiment. This consistency can be maintained over a series of related experiments by copying and re-editing the original design document. Ideally, the reagent labels should be obtained from a database that would contain full information on each reagent. The experiment design editor makes it possible to code related samples into parallel positions in a grid array, but future software should provide explicit ways to group samples and specify other relationships within an experiment.
2. For each cytometer the DESK system maintains a log of the instrument configurations and adjustments and of the results of the instrument standardization run that precedes each data collection run. The log is valuable for retrospective investigation when users obtain questionable data, and it can be used to monitor trends and abrupt changes in the operating conditions of the instrument.
3. Each data collection record is automatically assigned a unique identifier, which assures that different data sets can never be confused with each other even if all the user-supplied labels are the same.
4. Automatic data identification means that the user never has to deal with actual file names in the computer system. When a user specifies analyses on a sample whose numerical data are not online, a request for the appropriate data storage tape is automatically dispatched. After the requested tape is loaded, the required data are transferred to a disk, the requested analysis is performed, and the results are returned to the user's DESKtop file automatically.
5. For each cell sample the user's DESKtop retains a record of all analysis gating, specifications for each plot produced, and the results of numerical analyses. This makes it possible to reconstruct previous analyses exactly. Printouts of plots and numerical results include user, date, experiment, sample, cell, and reagent labels as well as gating information leading to the data subset plotted. Thus, data outputs can always be traced back to the data from which they were produced.
6. The DESK data archive lists all experiments in the system by user name, date, and title. The lists can be browsed and used to gain access to the data in archived experiments. This archive consists of lists and is not a general database, and it only includes experiment level information. Data systems should include a real database that contains both the experiment level and sample level information.