Data Processing

After image acquisition, raw data is transferred from its local site to a high-performance computing (HPC) cluster through an infrastructure developed by Yale’s HPC Resource Center. There, the data is converted into sequence reads by Illumina’s Pipeline software package, which consists of three executables: Firecrest, Bustard, and Gerald.

Firecrest performs image analysis. For each raw TIF file, clusters are identified, intensities are extracted, X,Y positions are determined, and noise levels are calculated. The images are then filtered to sharpen clusters, reduce background noise, and adjust scale.
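The cluster identification and intensity extraction steps can be illustrated with a minimal Python sketch. This is not Firecrest's actual algorithm: the function name find_clusters, the threshold factor, and the choice of the median and median absolute deviation as background and noise estimates are all assumptions made for illustration.

```python
from statistics import median

def find_clusters(image, threshold_factor=3.0):
    """Return (x, y, intensity) for bright local maxima in a 2D intensity grid.

    Hypothetical sketch: background is estimated as the median pixel value and
    noise as the median absolute deviation; neither is taken from Firecrest.
    """
    pixels = [value for row in image for value in row]
    background = median(pixels)
    noise = median([abs(value - background) for value in pixels]) or 1.0
    threshold = background + threshold_factor * noise

    height, width = len(image), len(image[0])
    clusters = []
    for y in range(height):
        for x in range(width):
            value = image[y][x]
            if value <= threshold:
                continue
            # Collect the (up to 8) neighbouring pixels that lie inside the image.
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            # A strict local maximum is treated as the cluster centre.
            if all(value > n for n in neighbours):
                clusters.append((x, y, value - background))
    return clusters

image = [
    [10, 11, 10, 9, 10],
    [10, 40, 100, 40, 10],
    [11, 10, 40, 10, 9],
    [10, 9, 10, 11, 10],
    [10, 10, 10, 10, 60],
]
clusters = find_clusters(image)
```

Requiring a strict local maximum keeps the sketch short but would miss a cluster whose peak spans two equally bright pixels; a production image analysis step would need to handle that case.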

Bustard performs base calling. Using the output from Firecrest, Bustard produces a sequence of bases from each cluster and assigns each base a confidence level. It does so by deconvolving the signal from each cluster and correcting for cross-talk (the overlap between the emission spectra of the fluorophores) and for phasing and pre-phasing (the phenomena whereby a molecule in a given cluster falls behind or runs ahead of the current incorporation cycle, respectively).
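Cross-talk correction can be sketched as solving a small linear system: if the observed channel intensities are a known mixture of the true dye signals, inverting that mixture recovers the signals before the brightest one is called. The sketch below is an illustration under stated assumptions, not Bustard's implementation; the matrix values, the A, C, G, T channel order, and the names CROSSTALK, solve, and call_base are hypothetical.

```python
# Hypothetical cross-talk matrix: entry [i][j] is the fraction of dye j's
# signal that appears in channel i. Values are illustrative only.
CROSSTALK = [
    [1.0, 0.4, 0.0, 0.0],
    [0.2, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.3],
    [0.0, 0.0, 0.5, 1.0],
]
BASES = "ACGT"  # assumed channel order

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("cross-talk matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]

def call_base(observed):
    """Undo the channel mixing, then call the base with the largest signal."""
    corrected = solve(CROSSTALK, observed)
    best = max(range(len(BASES)), key=lambda i: corrected[i])
    return BASES[best], corrected

# The raw A channel is brightest here, yet most of it is spill-over from C.
base, corrected = call_base([100.0, 90.0, 0.0, 0.0])
```

With these illustrative numbers the uncorrected intensities would favor A, while the corrected signals favor C, which is the kind of miscall the correction is meant to prevent.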

Gerald performs sequence analysis. Gerald filters low-quality base calls based on user-defined criteria; aligns sequences to a chosen reference genome; visualizes the results; and generates statistics, diagnostic quality control plots, and summary tables. ELAND (Efficient Large-Scale Alignment of Nucleotide Databases), a Gerald alignment program, efficiently aligns reads of up to 32 bases to a reference genome, allowing up to two errors. PhageAlign, another Gerald alignment program, exhaustively finds all possible alignments up to an arbitrary edit distance; it is accordingly slow and is therefore recommended only for reference genomes of 2 megabases or less.
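The constraint of a read of at most 32 bases with at most two errors can be made concrete with a brute-force Python sketch. It is not ELAND's algorithm, which is far more efficient than a position-by-position scan, and it treats the permitted errors as mismatches only, which is an assumption. The names hamming_within and align_read are hypothetical.

```python
def hamming_within(a, b, limit):
    """Return the number of mismatches between a and b, or None if above limit."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                return None  # stop early once the budget is exceeded
    return mismatches

def align_read(read, reference, max_mismatches=2, max_length=32):
    """List (position, mismatches) for every placement within the mismatch budget."""
    if len(read) > max_length:
        raise ValueError("read is longer than the supported length")
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = hamming_within(read, window, max_mismatches)
        if mismatches is not None:
            hits.append((pos, mismatches))
    return hits

reference = "ACGTACGTTAGC"
exact_hits = align_read("ACGTTAGC", reference)
one_error_hits = align_read("ACGTTTGC", reference)
```

This scan costs time proportional to the genome length for every read, which is why practical aligners index the reads or the reference instead.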

The resulting data, which is voluminous, will be accessible for a designated period through a streamlined web interface designed by Yale’s High Performance Computing Resource Center.