Tuesday, January 8, 2013

Blog Series: WoG, Cesky Krumlov; Day 2: So you want an NGS Sequencer, eh?

Dr. Konrad Paszkiewicz
University of Exeter
Director of Wellcome Trust Biomedical Bioinformatics Hub

Topic: DNA Sequencing Technology: Past, present, future

Good morning bioinformatic campers! Well, afternoon for me, but morning for many of you back in the U.S.

Much of what was covered this morning is going to be redundant with the DNA sequencing preparation blog I wrote previously. So between this entry and that other one you will hopefully get a complete view of the 'state of the union' where sequencing is concerned and prospects for the future. The nice thing is that Konrad tossed in a lot of pro/con lists for different platforms, so for those of you considering NGS in the future, this is a bare-bones, get-you-started guide to what's out there, whether it's 'worth it' for your own research to invest in a platform, and which one to invest in. Again, I still highly recommend Dr. Elaine Mardis' talk that I linked in my DNA sequencing technology prep blog.

First and foremost, if you are familiar with what molecular biology is and what sequencing is and don't know who Fred Sanger is...then you've probably been living in a hole...

Dr. Fred Sanger is the double Nobel laureate who 'invented' Sanger sequencing (hence the name...tada!).

I'm not going to get into the painstaking history of sequencing technology but rather give a brief listing and you can click to learn more as you see fit:

  • Maxam-Gilbert Sequencing: Based on chain breakage of DNA, nasty chemicals, radiolabels and sequencing gels.
  • Cycle Sequencing (Sanger): Chain termination, uses a thermal cycler, heat-stable polymerase, fluorescent dyes. (Applied Biosystems developed the first software to 'call' peaks)
  • 1972: First gene sequenced from RNA
  • 1976: First bacteriophage genome
  • 1995: First whole genome shotgun (Sanger sequencing) of H. influenzae
  • 2004: Birth of 454 pyrosequencing (side note: I entered grad school in 2003.)
  • The Human Genome Project used shotgun Sanger sequencing. The project ran longer than the Apollo program and, as stated in a previous talk, the draft came out in 2000; you can see the pubs from 2001 in Dr. Chris Ponting's talk on mammalian genomes that I posted yesterday.
A little more info on the competing teams:
  1. Nature pub group (see Mammalian Genomes post): publicly funded, shotgun BAC approach.
  2. Science group: A Venter venture (private but data publicly available). Wanted to do the genome faster and for less than the publicly funded group. Different method: shear the DNA, sequence the small fragments and rely on bioinformatics for assembly and scaffolding into a genome. Raised ethical concerns about ownership of genomes and the idea of patenting genes.
Second Generation Sequencing:

Common features of all sequencers nowadays: they use adaptors to fix DNA, some form of PCR amplification/library creation, and fluorescent probes; all can do paired-end reading; most can sequence a human genome in a day; all require post-processing of data for quality control; on average they give shorter read lengths than Sanger; and all are capable of high volume.
  • Illumina HiSeq 2000
  1. Pros: Large volume (300 Gb/run), short runs (<1 day; get 70-80 Gb), straightforward sample prep (bridge-PCR--see Elaine Mardis video), open source software.
  2. Cons: To achieve low cost you have to run LOTS of samples; short read lengths (36-150 bp)
  3. With the upgrade to 2500 you can produce 1 billion reads in 2-9 days using a flowcell, depending on how you use it. (ave length 18-150 bp).
  4. The 2500 is meant for rapid sequencing of limited samples
  5. The 2000 is meant for research and high throughput
  6. It's the difference between obtaining a human genome from 1 sample in 27 hrs (rapid) or 5 samples in 12 days (high throughput).
  7. For us bacteriologists: You could either do 48 genomes spread across all the 'lanes' available in the flowcell (rapid) or do 48 genomes/lane (high throughput but longer).
  • 454 Roche System (We have this at WRAIR)
  1. Pros: long reads (200-1000 bp), multiple samples at once (multiplexing), short run (<1 day)
  2. Cons: expensive (~$10K/run versus $2000/run Illumina), low 'volume' output (100 Mb-1 Gb), complex sample prep involving an emPCR (bead-based) library creation/amplification protocol. Why? Because Illumina patented their bridge-PCR so Roche had to do something else.
  • SOLiD
  1. Pros: No dinucleotides, relies on oligonucleotide primers, reads two bases at a time and therefore effectively sequences every base twice, suggesting higher accuracy.
  2. Cons: Uses emPCR protocol like 454 Roche, only 1 color emits so have to convert colors to sequence, short read lengths.

A note about quality control: Your Phred score (quality score) is only as good as the type of organism the sequencing center programs in as the control. So for instance if your control is a GC rich bacterium and you are sequencing an AT rich virus...how applicable do you think your quality scores are going to be when they are extrapolated??? Give your sequencing center as much information as possible so they can affix a proper control into your system.
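
If you haven't met Phred scores before, here's a minimal Python sketch of what the numbers mean. It uses the standard definition Q = -10 * log10(P), where P is the estimated probability that a base call is wrong. The function names are mine, and the default offset of 33 assumes the Sanger-style FASTQ encoding; older Illumina pipelines used an offset of 64, so check what your sequencing center delivers.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to the estimated error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)


def decode_fastq_quality(qual, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger-style encoding); pass 64 for
    older Illumina files.
    """
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: estimated error probability {phred_to_error_prob(q):.4f}")
```

So a base reported as Q20 is claimed to have a 1-in-100 chance of being wrong, and that claim is only as trustworthy as the control used to calibrate it.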

  • MiSeq
  1. 7.5 Gb/run, $800/run, $100K/instrument plus $50K/2 year servicing contract, no additional wet lab equipment required except something to shear your DNA. Can do 20-30 bacterial genomes/run. Also the libraries created are compatible for HiSeq so if you require more data you can send the libraries off for more sequencing without further processing.
  • 454 Junior System
  1. 100K reads, 700 bp ave. length, 70 Mb/run, mostly for clinical use right now, $1000/run, $100K/instrument.
  • Ion Torrent (We have one of these too)
  1. Doesn't use optics, uses pH instead, 2 hr/run (5+ hrs w/ library prep and run), output depends on chip; highest output chip (318) can do >1 Gb/3 hrs. Also relies on emPCR protocol. $700/run, $50K/instrument + $75K/library prep system. Meant for shorter reads. Unfortunately, libraries are not compatible with Ion Proton.
  • Ion Proton
  1. Meant for longer reads (ie. genome sequencing or assemblies of Mb sized genomes). No optics either, average length 200 bp, 2 hr/run or 8+ hrs with library prep system. 60-80 million reads with P1 chip. $1500/run, $150K/instrument + $75K/library prep system. Not compatible with Ion Torrent system and also has 454 chemistry (emPCR protocol)
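
To put the per-run numbers above side by side, here's a rough Python sketch that turns them into running cost per Mb and an approximate coverage figure. The cost and output values are the ones quoted in this post (taking the upper end of a range where one was given, and 1 Gb for the Ion Torrent 318 chip's '>1 Gb'). The 5 Mb bacterial genome size is my own assumption for illustration, and instrument and service-contract costs are ignored.

```python
# (cost per run in USD, output per run in Mb), as quoted in this post
PLATFORMS = {
    "454 Roche": (10_000, 1_000),            # upper end of 100 Mb-1 Gb
    "MiSeq": (800, 7_500),
    "454 Junior": (1_000, 70),
    "Ion Torrent (318 chip)": (700, 1_000),  # post says >1 Gb; 1 Gb used here
}


def cost_per_mb(cost_usd, output_mb):
    """Running cost in USD per megabase of raw output."""
    return cost_usd / output_mb


def coverage(output_mb, genome_mb, n_genomes=1):
    """Average fold-coverage if one run is split evenly across n_genomes."""
    return output_mb / (genome_mb * n_genomes)


if __name__ == "__main__":
    # Cheapest per Mb first
    for name, (cost, out) in sorted(PLATFORMS.items(),
                                    key=lambda kv: cost_per_mb(*kv[1])):
        print(f"{name:24s} ${cost_per_mb(cost, out):8.2f} per Mb")

    # Assumed 5 Mb genome; 30 genomes on one MiSeq run
    print(f"MiSeq, 30 genomes of 5 Mb: {coverage(7_500, 5, 30):.0f}x each")
```

Under those assumptions, 30 bacterial genomes on one MiSeq run works out to roughly 50x coverage each, which fits the '20-30 bacterial genomes/run' figure above. Raw output isn't usable output though, so treat this as a ceiling.
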
Want to learn more about sequencers?

Problems associated with NGS:
  1. Sequencing is only going to be as good as your sample prep, so if there's contamination or degradation, that's what you'll get out of your sequencer.
  2. When your organism has a high bias toward GC or AT then it becomes more difficult to sequence.
  3. 454 and Ion Torrent have problems with homopolymers. Illumina has this problem too but to a lesser extent, because its sequencing chemistry incorporates 'blocking' via a reversible terminator on each base, so only one base is added per cycle.
  4. Need to be reminded what a homopolymer is? A long stretch of a single nucleotide in a DNA sequence (i.e. AAAAAA). The longer the stretch, the less 'confident' the machine becomes when reading the nucleotides: the signal is 'maxed' out and you end up with varying numbers of that nucleotide in the output that will need to be resolved.
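
To make points 2-4 concrete, here's a small Python sketch that reports GC content and homopolymer runs in a sequence, so you can eyeball how troublesome a region might be. The function names and the minimum run length of 4 are assumptions for illustration, not thresholds any platform publishes.

```python
from itertools import groupby


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for each run of one base >= min_len.

    start is the 0-based position of the first base in the run.
    min_len=4 is an arbitrary illustrative cutoff.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


if __name__ == "__main__":
    example = "ACGTAAAAAAGGTTTTC"
    print(f"GC content: {gc_content(example):.2f}")
    for start, base, length in homopolymer_runs(example):
        print(f"{base} x {length} starting at position {start}")
```
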
Third Generation Sequencers
  • Single molecule sequencing: PacBio has a machine available. Basically the machine is designed to collect absolutely ALL the light given off when a base is read. This system requires library prep (so some bias may still be inevitable, as with current systems). The nifty thing about this system has to do with its potential applications to epigenetics. Because they slow the polymerase reaction, and methylated bases take longer to dissociate than non-methylated bases, they can measure the 'time' and determine which bases are methylated while sequencing. Also you can circularize DNA and sequence the same molecule over and over. Theoretically you can get fragments up to 10 kb, and the process takes 40 min (minus prep). Currently it has about a 15% error rate though, is uber-expensive ($750K) and you only get 10-100 Mb/run.
  • Nanopore Sequencing: This method, developed by Oxford Nanopore, uses (you guessed it) 'a very small pore' and electrical current to detect DNA. Different bases elicit a different electrical signal. No library prep is the goal of this technology, as well as the possibilities for parallelization. However, DNA moves really quickly and they haven't yet found a way to either slow the DNA down enough or make the pore thin enough to force one base through at a time. Currently, they are at 4-5 bases at a time. There is also a lot of electrical noise generated, so teasing out your signal is challenging. They came up with two methods:
  1. Strand Sequencing: A pore that slows DNA down via its design to 4-5 bases at a time.
  2. Exonuclease sequencing: An exonuclease chops the DNA and slows it down so it 'falls' into the pore, hopefully 1 to a few bases at a time. This may pose an indel problem similar to what's afflicting current methods today BUT has prospects for sequencing proteins, polymers and small molecules, and could possibly replace mass spectrometry (though that's a ways off).
  • GridION (possibly around $30K) is said to be able to sequence a human genome in 2 hrs for $1000.
  • MinION (the smaller/USB version) is designed to be a 'throw away' sequencer that works for about 6 hrs. It costs approx. $900 (for a 2000 pore chip) and, assuming it delivers 10 kb reads, could produce 20 Mb. Error rate is stated to be about 4%.
Now the data for much of this has not been officially released...so take all this with a grain or boulder of salt depending on your confidence level in these companies.
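
On the PacBio point about circularizing DNA and reading the same molecule over and over: here's a rough Python sketch of why repeat passes can help with a 15% per-read error rate. This is my own toy model, not anything PacBio publishes. It assumes errors in each pass are independent and, pessimistically, that every wrong call agrees on the same wrong base, so the consensus is wrong whenever more than half the passes are wrong.

```python
from math import comb


def majority_error(per_pass_error, passes):
    """Probability that a majority vote over `passes` reads is wrong.

    Toy model: independent errors per pass, and all wrong calls are
    assumed to agree (worst case). `passes` must be odd to avoid ties.
    """
    if passes < 1 or passes % 2 == 0:
        raise ValueError("passes must be a positive odd number")
    p = per_pass_error
    needed = passes // 2 + 1  # wrong calls needed to outvote the right ones
    return sum(
        comb(passes, k) * p**k * (1 - p) ** (passes - k)
        for k in range(needed, passes + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5, 7, 9):
        print(f"{n} passes: consensus error ~ {majority_error(0.15, n):.4f}")
```

Real errors (especially indels) are not this tidy, so read the trend rather than the exact numbers.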

Dr. Paszkiewicz ended by suggesting a great paper to read that puts all the mass sequencing in perhaps a more realistic light.

It's a study and a reality check on whether personalized sequencing really will be the 'cure-all' for us in the future, and it's important to recognize that we can sequence and sequence and sequence, but without biological and environmental context...it means nothing.

For information pertaining to Exeter University or the Wellcome Trust Bioinformatics Hub or if you just want to chat up some research: k.h.paszkiewicz@exeter.ac.uk