Data Interpretation

Sequence analysis is carried out on Applied Biosystems 3730 capillary instruments. The sequencing reactions use fluorescently labelled dideoxynucleotides (Big Dye Terminators) and Taq FS DNA polymerase in a thermal cycling protocol. The electrophoretic data are returned as a computer-readable sequence file via FTP. For good-quality DNA templates, the average read length is 550-750 bases with >99% accuracy.

Interpretation of the Fluorescent Electrophoretic Data

On average, with good template and primer, Taq/dye-terminator cycle sequencing provides 500-600 bases of sequence with 98-99% accuracy (exceptional template-primer combinations yield 650-750 bases with 98% accuracy). After 600-650 bases, the resolution between peaks decreases and the software has difficulty determining the exact number of bases in runs of the same base. Consequently, the error rate usually increases dramatically and may be as much as 10% at 550-650 bases. Furthermore, because the software uses uniform spacing to call bases, it is slightly biased towards inserting extra bases. One should therefore be somewhat conservative in data interpretation, particularly when designing primers for primer walking. For the best chance of synthesizing a primer with the correct sequence, and to provide sufficient overlap between the two sequencing steps, design the primer (see the Primer Guidelines page) in the region between bases 450 and 550.
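The primer-walking guideline above can be sketched in a few lines of Python. This is only an illustration: the function name `primer_window` and its defaults are assumptions made for this example, not part of any sequencing software, and actual primer selection still has to follow the Primer Guidelines page and be checked against the electropherogram.

```python
def primer_window(read: str, start: int = 450, end: int = 550) -> str:
    """Return bases start..end (1-based, inclusive) of a raw read.

    Hypothetical helper: it only extracts the region recommended for
    primer design; it does not pick or evaluate a primer.
    """
    if len(read) < start:
        # The read never reaches the recommended region.
        return ""
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return read[start - 1:end]


if __name__ == "__main__":
    example_read = "ACGT" * 200  # made-up 800-base read
    window = primer_window(example_read)
    print(len(window), window[:12])
```

A read shorter than 450 bases returns an empty string, which signals that the read should be repeated or the primer placed further upstream.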

Unlike AmpliTaq polymerase, Taq FS polymerase shows greatly reduced discrimination among the four fluorescent dideoxynucleotide terminators during incorporation, which leads to relatively uniform peak sizes. Despite this increased uniformity, Taq FS data do exhibit some recognizable patterns that are useful for sequence interpretation:

  • A G that follows an A gives a weak peak, sometimes so weak that the peak drops out.
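As a rough aid when proofreading a called sequence, the positions where this pattern can occur are easy to list. The sketch below is a hypothetical helper (`weak_g_positions` is a name chosen for this example); it only flags where a G follows an A in the called sequence, and it cannot find a G that has already dropped out of the base calls, so the electropherogram remains the reference.

```python
def weak_g_positions(seq: str) -> list[int]:
    """Return 1-based positions of every G immediately preceded by an A."""
    s = seq.upper()
    return [i + 1 for i in range(1, len(s)) if s[i] == "G" and s[i - 1] == "A"]


if __name__ == "__main__":
    # Made-up sequence; positions 3 and 7 are Gs that follow an A.
    print(weak_g_positions("TAGGCAGA"))
```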


Editing the Sequence File

Sequence data are provided in a computer-readable format. Usually the sequence is in GCG format, ready for use by the University of Wisconsin Genetics Computer Group (UWGCG) programs, which are available on the VAX computer in the Yale Biomedical Computing Unit. Before analyzing the raw sequence data or aligning it with previously determined sequence, use a sequence editing program, with the electropherogram as a guide, to truncate the sequence by removing:

  • unreliable data at the beginning of the sequence (usually the first 10-20 bases), caused by the analysis software starting base calling before a uniform stream of fluorescent peaks is present in the electrophoretic data.
  • any relevant vector sequences.
  • unreliable data at the 3' end of the sequence (beginning in the region of 550-650 bases for ds plasmid DNA and large PCR fragments), caused by the decreasing resolution of large DNA fragments (broadening and overlap of fluorescent peaks).
  • for PCR products, data past the physical end of the PCR fragment.
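The trimming steps listed above can be sketched as a single Python function. Everything here is an assumption for illustration: the name `trim_read`, its parameters, and the default cut-offs are not taken from any editing program, and the vector step uses a plain exact-match search, whereas real vector screening normally relies on alignment. The cut-offs should always be confirmed against the electropherogram.

```python
from typing import Optional


def trim_read(seq: str, lead: int = 20, max_len: int = 550,
              vector: str = "", fragment_len: Optional[int] = None) -> str:
    """Trim a raw read following the four steps listed above.

    lead         -- unreliable bases to drop at the start (usually 10-20)
    max_len      -- read position after which 3' data are discarded
    vector       -- known vector sequence near the start (exact match only)
    fragment_len -- for PCR products, read position of the fragment's end
    """
    start = lead
    if vector:
        idx = seq.find(vector)
        if idx != -1:
            # Drop everything up to and including the vector sequence.
            start = max(start, idx + len(vector))
    end = min(len(seq), max_len)
    if fragment_len is not None:
        # Data past the physical end of a PCR fragment are not real sequence.
        end = min(end, fragment_len)
    return seq[start:end]  # empty string if nothing reliable remains


if __name__ == "__main__":
    raw = "N" * 15 + "GGATCC" + "ACGT" * 200  # made-up read with a vector site
    print(len(trim_read(raw, vector="GGATCC")))
```

Keeping the positions as parameters makes it easy to use a more conservative 3' cut-off for templates that give shorter reliable reads.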

If you need assistance in interpreting your sequence data, please call us at 737-2566 or email dnaseq@yale.edu.