Tuesday, January 8, 2013

Blog Series: WoG, Cesky Krumlov; Day 2: Data Quality Control--no really it's more fun than it sounds...

In truth...to me, sequence data quality control is necessary and OK, and it was fun when I was learning it, but the further in you get the more you want to automate the hell out of it! WELL, lucky for you, you probably aren't at that stage yet so we are going to start fresh thanks to Naiara's talk, slides and exercises à la this evening's lab!

It is time, my fellow command-line/terminal apprentice ninjas...let's DO SCIENCE

Now I won't be going through absolutely everything in this blog entry but I will cover most and offer tips to help the exercises go smoothly for you.

Naiara Rodriguez Ezpeleta
AZTI Tecnalia

Topic: Quality Assessment and Control of NGS data

Go through her slides and learn about fasta and fastq and phred scores if you don't know them already. She has a nice exercise that brings back nightmares of math, which can be frustrating if you don't break it down, so let me offer some tips for figuring out her Quiz 1 and Quiz 2 slides, which are based on the slide entitled Different Scoring Systems.

Slide: Different Scoring Systems

  • Along the bottom of the figure you see a span of characters and letters. Each one represents a quality score; it's easier to represent a quality score as a single character as opposed to the 2 characters a number score would need. So for instance if you see the '!' character that means the score is 33...Wait I'm not done.
  • Depending on the platform you will have to subtract from that number in order to obtain the actual quality score. For instance on the figure find Illumina 1.8+ and Sanger; both state +33, so in order to know what the actual score is (0 being bad, 40+ being good) you have to subtract 33. So if a base quality score = '!' it actually has a quality score of 0 (33 - 33)--so complete crap. For the other platforms you can see on the slide you have to subtract 64. So if you have a base with a score of 'J', then by counting it looks like 'J' has a score of 74 (notice the 'I' before it has a score of 73; those numbers below the characters are meant to help with counting). BUT you have to subtract 64 IF your sequences come from Illumina 1.3, 1.5 or Solexa, meaning you actually have a score of 10 (74 - 64)--once again, complete crap.
  • You see the long strings of letters at the top, SSSSSSSSS etc, LLLLLLLL etc...those are explained in the figure too. See the letter next to the name of the sequencing platform? I.e. S - Sanger. Well the 'span' of those letters is the 'span' of characters used to signify the qualities for that sequencing platform. For example the characters starting at the beginning '!' through the colon character ':' are characters used to describe sequence quality ONLY in Sanger (S) and Illumina 1.8+ (L). Notice that there is a span of characters 'A' to 'I' where they all overlap, meaning if you see qualities that are capital letters between 'A' and 'I' there's no way to tell which sequencing platform they came from.
  • And finally you see the characters at the end...105-126 (lower case 'i' to the tilde symbol '~'); you'll notice no letter designations above those characters...well, the platforms in this figure do not use quality scores in this range.
  • I think that's all you should keep in mind. Look at her slide and try and answer her quiz slides. You can put answers in the comments section if you wish and I'll let you know if you are right.
  • You shouldn't need to look at the ASCII wiki for fastq format but if you'd like...FASTQ format and ASCII (use the Hex column under ASCII printable characters if you want, but you shouldn't need to for these exercises).
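
If you'd rather let the computer do the counting, here is a minimal Python sketch of the subtraction described in the bullets above. It is not from Naiara's slides: the function names `phred_scores` and `guess_offset` are made up for illustration, and the cutoffs in `guess_offset` (anything below ';' means +33, anything above 'J' means +64) are an assumption read off the character spans in the figure, so data from other platform versions may not follow them.

```python
def phred_scores(quality_string, offset=33):
    """Turn a FASTQ quality string into a list of integer quality scores.

    offset is 33 for Sanger / Illumina 1.8+ and 64 for Solexa,
    Illumina 1.3 and Illumina 1.5, as shown on the slide.
    """
    return [ord(ch) - offset for ch in quality_string]


def guess_offset(quality_string):
    """Guess the offset from the characters seen (hypothetical helper).

    Returns 33, 64, or None when every character falls in the range
    where the platforms overlap and you cannot tell them apart.
    """
    codes = [ord(ch) for ch in quality_string]
    if not codes:
        return None
    if min(codes) < 59:   # below ';' : only the +33 platforms go this low
        return 33
    if max(codes) > 74:   # above 'J' : only the +64 platforms go this high
        return 64
    return None


print(phred_scores("!I"))             # [0, 40]
print(phred_scores("J", offset=64))   # [10]
print(guess_offset("BCDEFGHI"))       # None
```

The last line is the overlap case from the third bullet: nothing but capital letters up to 'I', so no way to tell which platform it came from.
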
Alrighty, head through the rest of her slides and get a feel for them, then head to the website exercise. See below for tips and things to know.

FastQC Program

  • Work in the terminal/command line as much as possible. The FastQC program is a GUI program but otherwise try and stay away from your mouse.
  • Arrange your windows so that you have the tutorial exercise window on one side of your monitor view and the terminal on the other, this will make it easier instead of having to bounce back and forth. Better yet, if you have a second monitor, drag one to the other monitor!
  • She has a section entitled 'Getting Started'
  1. Install ALL the programs she suggests and MAKE NOTE of where they are on your computer, you will need to know. They are all free and linked on her webpage.
  2. Unfortunately you will not be able to use the same dataset as we did, as it is only on the virtual machine. If you have NGS data in the form of fastq formatted files, then after reading through the exercise and getting an idea of what we see with this data, have at it and open your own data using the program.
  3. You should be able to download the perl script, if not then walk through the exercise and find a script later that takes a fasta file and qual file and turns them into a fastq file.
  4. Yes, indeed have a gander at the documentation she suggests for the programs to familiarize yourself.
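
If you can't get hold of the perl script, here is a rough Python sketch of the same idea: pair up a fasta file and a qual file and write fastq records. This is NOT her script, just a stand-in under a few assumptions: the qual file uses fasta-style '>' headers followed by space-separated integer scores, both files list the reads in the same order, and you want +33 (Sanger / Illumina 1.8+) encoding. The names `read_records` and `fasta_qual_to_fastq` are made up for illustration.

```python
def read_records(lines):
    """Yield (header, body_lines) pairs from fasta-style lines."""
    header, body = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, body
            header, body = line[1:], []
        else:
            body.append(line)
    if header is not None:
        yield header, body


def fasta_qual_to_fastq(fasta_lines, qual_lines, offset=33):
    """Return fastq text built from matching fasta and qual records."""
    out = []
    pairs = zip(read_records(fasta_lines), read_records(qual_lines))
    for (name, seq_lines), (qual_name, score_lines) in pairs:
        if name.split()[:1] != qual_name.split()[:1]:
            raise ValueError(f"read names differ: {name!r} vs {qual_name!r}")
        seq = "".join(seq_lines)
        scores = [int(s) for s in " ".join(score_lines).split()]
        if len(seq) != len(scores):
            raise ValueError(f"{name}: {len(seq)} bases but {len(scores)} scores")
        quality = "".join(chr(score + offset) for score in scores)
        out.append(f"@{name}\n{seq}\n+\n{quality}\n")
    return "".join(out)


fasta = [">read1", "ACGT"]
qual = [">read1", "40 40 30 0"]
print(fasta_qual_to_fastq(fasta, qual), end="")
```

For real files you would pass open file handles instead of the little lists, e.g. `fasta_qual_to_fastq(open("reads.fasta"), open("reads.qual"))`, where the file names are placeholders for your own.
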
Exercise 1:

We opened an Illumina sequence fastq file using the program FastQC to see what it would look like. Using the screenshots below you should be able to answer her questions 3, 4 and 5.

The Stats of the File we uploaded.
Example of good quality data...all bars and median line are ~36-39
Illustration of AGCT content, this organism is higher in AT content.

Overrepresented sequences in the sample...what does this mean do you think?

Exercise 2

So since you don't have the files you cannot create the Fastq file needed, so the screenshots below will bypass that. BUT if you have your own fasta and qual files try her directions to create a fastq file using the perl script (assuming you were able to download that too).

Fastq file result from her directions on the tutorial loaded onto FastQC
Hmmm...what's going on here???
Exercise 3

You cannot do #1 unless you have your own data but let me tell you the data looks better and you no longer have the adapter, seen in the previous screenshot. Remember what 'adapter' sequence looks like.

Now let's look at mRNA

mRNA file
Quality...where would you 'cut it off' ??? 
What's going on here??? What does the mess in the beginning look like? What does the green peak at the end signify? Remember cellular biology '101'?
If you have the programs and you had barcoded data, you can try splitting it using the fastx_barcode_splitter.pl program.
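
Once you've decided where you would 'cut it off', the cutting itself is simple. Below is a minimal Python sketch that chops low-quality bases off the 3' end of a single read. It is only an illustration of the idea, not the algorithm of any particular trimming tool; the name `trim_3prime` and the default threshold of 20 are assumptions you should change to suit your own data.

```python
def trim_3prime(sequence, quality_string, threshold=20, offset=33):
    """Drop bases from the 3' end while their quality is below threshold.

    Returns the trimmed (sequence, quality_string) pair. Trimming stops
    at the first base, counting back from the end, that meets the threshold.
    """
    keep = len(sequence)
    while keep > 0 and ord(quality_string[keep - 1]) - offset < threshold:
        keep -= 1
    return sequence[:keep], quality_string[:keep]


# 'I' is a score of 40 and '#' is a score of 2 with the +33 offset
print(trim_3prime("ACGTAC", "IIII##"))   # ('ACGT', 'IIII')
```
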

Now let's look at microRNA (DO NOT confuse with mRNA which is above! Keep track of what files you are looking at specifically; that's why I upload the screenshot of the Basic Statistics, because it shows you what file is being looked at).

microRNA fastq file
Hmmm, what can you say about microRNA quality and length?
What do we have here, what does this tell us?
(Above slide): Hints....

  1. You know what adapter sequence looks like.
  2. You know what barcoded sequence looks like. 
  3. You know what sequence looks like when there are tons of sequences that are very diverse; the A, G, C, and T lines are fairly flat.
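
The flat (or not so flat) A, G, C and T lines in hint 3 are just the percentage of each base at each position across all the reads. Here is a small Python sketch of that calculation, assuming you already have the reads as a list of strings; the name `base_content_by_position` is made up, and FastQC's own plot may be computed differently.

```python
from collections import Counter


def base_content_by_position(sequences):
    """Return one dict per read position with the % of A, C, G and T.

    Reads shorter than a given position are skipped for that position.
    Any other letters (e.g. N) count toward the total but are not reported.
    """
    longest = max((len(seq) for seq in sequences), default=0)
    table = []
    for pos in range(longest):
        counts = Counter(seq[pos] for seq in sequences if len(seq) > pos)
        total = sum(counts.values())
        table.append({base: 100.0 * counts[base] / total for base in "ACGT"})
    return table


reads = ["ACGT", "ACGA", "TCGA", "ACTT"]
for position, content in enumerate(base_content_by_position(reads), start=1):
    print(position, content)
```

If every read starts with the same few bases, those positions will sit at or near 100% for one base instead of hovering around a flat mix.
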

Aha...so we have lots of overrepresented sequences! What should we do now?
So, what does having over-represented sequences tell you about microRNA, does it make sense? What's going on during gene expression and generation of mRNA?
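
If you want to poke at over-represented sequences in your own data, counting identical reads gets you most of the way. The Python sketch below is not how FastQC does it internally; the name `overrepresented` and the 0.1% default cutoff are assumptions for illustration, so tune the cutoff to your data.

```python
from collections import Counter


def overrepresented(sequences, min_fraction=0.001):
    """List (sequence, count, fraction) for reads above min_fraction.

    Results are ordered from most to least common.
    """
    total = len(sequences)
    counts = Counter(sequences)
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total > min_fraction
    ]


reads = ["AAA"] * 6 + ["CCC"] * 3 + ["GGG"]
for seq, n, fraction in overrepresented(reads, min_fraction=0.25):
    print(seq, n, fraction)
```
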


No I'm not going to tell you the answer...if you want to know then post what you think the answers are to the above in the comments and I'll comment back.

And workshop attendees reading this, don't cheat and put answers in! Ha!

See, I found this fun! And if you were able to download the programs, use the terminal and work with your own data all the better!

Now in terms of sequence quality you shouldn't ONLY rely on FastQC. Depending on your platform/type of sequencing, sample quality/prep, and overall sequencing results you will have to tweak your protocol. FastQC doesn't answer everything; it simply gives you an overall snapshot of the quality of your data and points out some 'concerns' if there are any.

  • What if you have homopolymers or a lot of indels (insertion/deletions)?
  • What if you sequenced the wrong organism by accident DOH!
  • How do you deal with depth in light of coverage?
  • How do you deal with variants? Are you even interested in variants or just consensus? Because that'll affect your overall QC protocol as well...
Lots to think about indeed.

Many thanks to Naiara who gave a very nice exercise and introduced us to what good versus bad data looks like and other possible data types and how they would look using this OPEN SOURCE HUZZAH Program.

Good Night All....