Thursday, January 10, 2013

Blog Series: WoG, Cesky Krumlov; Day 3: Unix, Part 2--Ninja-ery

Julian Catchen
University of Oregon
Unix Ninja

Topic: Unix Part 2

Unfortunately, as with the quality control post, I can't share the files we practiced on, but Julian's slides are quite good. We learned about pipes and built on what we already knew about the command line. We finished the end of the first slide set, which covered the following commands:

  • man [command you are confused about]: the manual page for a command, listing all of its options. Some man pages are more helpful than others, but if you stare at one long enough it'll start to make sense. There are often usage examples, so pay attention to those.
  • ls, gunzip, more, cat, head, tail, grep, wc...all the ones we learned yesterday, so don't forget them. Today we ended up having to man some of these to learn their options.
  • sort [filename]: just as it sounds. Depending on what kind of sort you want, you have to specify an option (e.g. -n for numeric, -d for dictionary order; see the man page for sort).
  • uniq [filename]: collapses repeated lines that sit next to each other; as with sort, there are lots of options to tease things out of your file.
  • cut [options] [filename]: we learned this before too, BUT today I learned that it only makes sense for files laid out in columns (or fixed character positions).
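  • tr "\t" "," < [filename]: the translate command; this changes all tabs to commas. Note that tr reads from standard input and does not take a filename, so you redirect the file into it with < or pipe it in.
  • To type a tab: press Ctrl+V and then Tab, or write \t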
  • |: this is a 'pipe', which sends the output of one command into the next (read Julian's slides, part 1)
Example: What do you think I just did to this file?

cat batch_1.genotypes_1.loc | tr "\t" "," | grep "^96053"

  1. I grabbed the file = cat
  2. I piped it to the command 'tr' specifying I wanted all tabs changed to commas
  3. Then piped it to another command 'grep' (remember what grep does?)
  4. With grep I asked for every line that starts with 96053 (the hat symbol, ^, means the beginning of the line)
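
If you want to try those steps without the course files, here is a minimal sketch on a tiny made-up file. The file name demo.loc and its contents are invented for illustration, and \t is written out in place of a typed tab:

```shell
# Build a tiny, made-up tab-separated file (the IDs and genotypes are invented).
tmpdir=$(mktemp -d)
printf '96053\taa\tab\n12345\tbb\tab\n96053\tab\tbb\n' > "$tmpdir/demo.loc"

# Same three steps as above: read the file, turn tabs into commas,
# then keep only the lines that start with 96053.
result=$(cat "$tmpdir/demo.loc" | tr '\t' ',' | grep '^96053')
echo "$result"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

The two lines that begin with 96053 come back with commas where the tabs were; the 12345 line is dropped.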

We didn't get through all these slides but they are great so have a gander...

What we did get through:


  • Regular expressions: Text often follows human conventions, so when the patterns are consistent we can write an expression to pull out the data we want. E.g. zip codes are always 5 digits, phone numbers are always XXX-XXX-XXXX, and dates can be formatted like June 13, 1978
  • He has a table in his slides that lists what all the symbols in a regular expression mean
for instance a dot '.' stands for any single character (e.g. '.....' matches any 5 characters, which includes a 5-digit zip code)
[0-9]  any single digit (so 0 or 1 or 2...)
[0-9]+ one or more digits (this covers two-digit, three-digit, etc. numbers)
[a-z] lower case alphabet
[A-Z] upper case alphabet
[a-zA-Z] upper or lower case alphabet
When you write an expression in a search command like grep, use the -E option (it turns on extended regular expressions, which symbols like + need) and put quotes around the pattern you are searching for.
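
Here is a small sketch of a couple of those symbols in action. The sample lines are invented, and the $ anchor (end of line) is my addition; I'm assuming it sits in the slide table alongside ^:

```shell
# Invented sample lines: a zip code, a phone number, a date, and a plain word.
sample=$(printf '97403\n541-555-0100\nJune 13, 1978\nhello\n')

# [0-9]+ matches one or more digits, so every line containing a digit is kept.
with_digits=$(printf '%s\n' "$sample" | grep -E '[0-9]+')

# Anchors (^ and $) force the whole line to fit the pattern:
# capitalized word, space, digits, comma, space, digits.
dates=$(printf '%s\n' "$sample" | grep -E '^[A-Z][a-z]+ [0-9]+, [0-9]+$')

echo "$with_digits"
echo "$dates"
```

The first search keeps the three lines that contain digits; the second keeps only the date line.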

There are a lot of expressions on the slides so take a look...

With all that in mind...what am I doing here in the file record.tsv (you don't need to have the file to know what this command is doing)?

grep -E '[a-zA-Z]' record.tsv

Well, I am searching for every line in the file that contains at least one letter (grep prints the whole matching line, not just the letters).
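
To see that grep works line by line, here is a minimal sketch with an invented stand-in for record.tsv (the real file's contents are not shown in this post, so these three lines are made up):

```shell
# Invented stand-in for record.tsv: two lines with letters, one with digits only.
tmpdir=$(mktemp -d)
printf 'alice\t42\n12345\t678\nBob\t7\n' > "$tmpdir/record.tsv"

# grep prints whole lines, so only the digits-only line is dropped.
matched=$(grep -E '[a-zA-Z]' "$tmpdir/record.tsv")
echo "$matched"

# -c counts the matching lines instead of printing them.
count=$(grep -E -c '[a-zA-Z]' "$tmpdir/record.tsv")
echo "$count"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

The alice and Bob lines are printed in full, numbers and all, and the count is 2.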

Next up was the sed command, which does search and replace with the syntax 's/pattern/replace/' (don't forget the closing slash).

Example Sentence: "I like ballet"
Command: s/ballet/kungfu/
Result: "I like kungfu"
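
You can try that one straight in the terminal. A minimal sketch, using echo to feed the sentence to sed in place of a file:

```shell
# Replace the first "ballet" on the line with "kungfu".
result=$(echo "I like ballet" | sed 's/ballet/kungfu/')
echo "$result"
```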

So now let's combine what we know about grep and what we know about sed

What's this doing with the file record.tsv:

cat record.tsv | sed -E 's/[a-z]+ [a-z]+/foo/'

  • Well...we are 'grabbing' the file record.tsv and sending it (via pipe) to the command sed...
  • What's sed going to do?
  • Well, sed is a 'find and replace' type of command...ok, it looks like we want to be 'general' in what we look for...so let's use a regular expression (be sure to use quotes), hence the -E in the command.
  • Ok...how general? Well, let's find two words in a row...words are made of letters, can be any length, and are separated by a space...so [a-z]+ [a-z]+, there, that's good...
  • What do we want to replace them with? How about the word 'foo', 'cause that's fun.
  • In summary...we have grabbed the file (cat) and sent it to sed (via a pipe) so that, on each line, the first pair of lowercase words separated by a space is replaced with the word foo, using a regular expression (-E 's/[a-z]+ [a-z]+/foo/'). Without a g at the end of the expression, sed only changes the first match on each line.
Congrats you have now 'foo-ed' your file...sweet.
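
One thing worth trying yourself: sed only replaces the first match on each line unless the expression ends in g. Here is a small sketch on an invented sentence (the g variants are my own additions, not from the class exercise):

```shell
line="the quick brown fox"

# No g flag: only the first pair of lowercase words on the line is replaced.
first_only=$(echo "$line" | sed -E 's/[a-z]+ [a-z]+/foo/')

# With g: every non-overlapping pair of words on the line is replaced.
all_pairs=$(echo "$line" | sed -E 's/[a-z]+ [a-z]+/foo/g')

# To foo every single word, match one word at a time and keep the g.
every_word=$(echo "$line" | sed -E 's/[a-z]+/foo/g')

echo "$first_only"   # foo brown fox
echo "$all_pairs"    # foo foo
echo "$every_word"   # foo foo foo foo
```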

Julian has tons of other examples in the slide show...so go find some more 'foo-ing' in the file.

Can't get enough regular expression fun? Head over to Tyghe's blog where he has posted a tutorial/module for regular expressions and python. He has aimed this toward bioinformaticists...so go nutz!

Alrighty, so story time. How about a fairytale, everyone loves fairytales:

    There once was a file...we'll call her tinkerbell.txt. Tinkerbell contains a lot of information related to never-neverland, but I don't want to know everything in the file. There is information in there that starts with an @ sign that I couldn't care less about...in fact, all I want is certain lines of text. So I've executed a huge, long, terrible command on the tinkerbell text file... Can you figure out what I've done to her?

cat tinkerbell.txt | grep -A 1 "@" | grep -v -- "--" | grep -v "@" | cut -b 1-5 | sort -d | uniq -c | sort -n

  • I grabbed tinkerbell and did a search on her, but I was only interested in 1 part of her, which was conveniently located right below the @ symbol (grep -A 1 prints each matching line plus the 1 line after it), and which I'm sure she was trying to hide from me. Now, she gets confused easily, so I'm going to make sure I give her the simplest, dumbest commands, all nicely separated by a big vertical line, and I looked at the manual for every command I could give her, with its options, to be double sure she doesn't screw up! No offense, tinkerbell.txt. So if you don't understand a dash option (e.g. -A), go into your terminal and man the command attached to it (e.g. man grep).
cat tinkerbell.txt | grep -A 1 "@" |
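  • When I yanked what I wanted she protested by giving me double dashes (grep -A prints a -- line between each group of matches), kind of like narrow eyes; well, since I don't like people giving me stink eye (mean looks), I told her to get rid of those double dashes. The lone -- before the pattern tells grep that the options are over, so "--" is read as the pattern and not as an option.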
grep -v -- "--" |
  • Now her @ sign was still there, as I had used it to find and yank what I wanted but hadn't asked her to 'get rid' of it. So let's do that now.
grep -v "@" |
  • So now I have only the pieces of information about her that I want...ha ha ha. Wow, that information is long and I really only want the first 5 characters (bytes) of each line, so let's surgically remove them (don't worry, I anesthetized her).
cut -b 1-5 |
  • Great, now I have a jumble of nonsense, a lot of which is repeated...I'm interested in what's repeated but don't want to count it all one by one, so I'm going to make her do it. Now let's say she's getting fussy and will only compare and count if all the 'similar' data is close together (uniq only compares neighboring lines), so I'll have her sort it out, alphabetically first...
sort -d |
  • then make her count how many times each unique entry shows up
uniq -c |
  • then put them all in numeric order by count for me and write the result to a file...and that's all I'll make her do...for now
sort -n > evilstepmothercommands.txt
  •  Cause I'm mean like that...perfect. Evil stepmothers have nothing on me...
And I lived happily ever after...
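
If you'd like to torment your own tinkerbell, here is a minimal sketch of the whole fairytale on a tiny made-up file. I'm assuming tinkerbell.txt looks like a FASTQ file (four lines per record, with the @ line first); the three reads below are invented:

```shell
# Invented FASTQ-style stand-in for tinkerbell.txt (three made-up reads).
tmpdir=$(mktemp -d)
printf '@read1\nACGTAGGCTA\n+\nIIIIIIIIII\n' >  "$tmpdir/tinkerbell.txt"
printf '@read2\nACGTACCCTA\n+\nIIIIIIIIII\n' >> "$tmpdir/tinkerbell.txt"
printf '@read3\nTTGCAGGCTA\n+\nIIIIIIIIII\n' >> "$tmpdir/tinkerbell.txt"

# The same pipeline as in the story, one stage per line.
result=$(cat "$tmpdir/tinkerbell.txt" |
  grep -A 1 "@" |        # each @ line plus the line right after it
  grep -v -- "--" |      # drop the -- separators grep puts between groups
  grep -v "@" |          # drop the @ lines themselves
  cut -b 1-5 |           # keep the first 5 bytes of each remaining line
  sort -d |              # put identical entries next to each other
  uniq -c |              # count each distinct entry
  sort -n)               # order by count, smallest first
echo "$result"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

With these three reads, TTGCA shows up once and ACGTA twice, so TTGCA is listed first. The exact spacing in front of the counts depends on your version of uniq.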

I encourage you to look up Julian's slides. We didn't get to the use and comparison of the editors emacs and vim; but it's on the slides so have at!

Cheers!
