Mapping with bwa
================

One of the most common operations when dealing with NGS data is mapping sequences
to other sequences, and most commonly, mapping reads to a reference. Because
short read alignment is so common, there is a deluge of programs available for doing
it, and it is well-understood problem. Most of the efficient programs work by first
building an *index* of the sequence that will be mapped to, which allows for
extremely fast lookups, and then running a separate module which uses that index
to align short sequences against the reference. Alignment is used for visualization,
coverage estimation, SNP calling, expression analysis, and a slew of other problems.
For a high-level overview, try `this NCBI review <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2836519/>`__.

Getting the Dependencies
````````````````````````

bwa is one of the many available fast read mappers that uses the Burrows-Wheeler
transform to speed up finding exact and near-exact matches in the database.
A review can be found
`here <http://bioinformatics.oxfordjournals.org/content/25/14/1754.short>`__.

The `original verison of this lesson <http://2013-caltech-workshop.readthedocs.org/en/latest/bwa_mapping.html>`__
had instructions to download, uncompress, make, and install bwa.

We can get bwa on ubuntu with ::

    sudo apt-get install bwa

This may not get us the most recent version of bwa, but the last version that was
bundled for ubuntu.   Now we have bwa, and we didn't have to brave a grumpy sysadmin
to install it for us.  On the cloud, you are your own sysadmin.

Getting the Data
````````````````

First, we will create a new directory to contain our bwa alignments (separate
from our blast alignments in the previous lesson)::

    cd /mnt
    mkdir ecoli
    cd ecoli

We'll be using the data we downloaded during the reads and quality control session;
if you missed that, you'll eventually want to run through it in :doc:`reads_and_qc`,
but for now, you can just grab the quality-controlled data with::

    curl -O https://s3.amazonaws.com/public.ged.msu.edu/SRR390202.pe.qc.fq.gz
    curl -O https://s3.amazonaws.com/public.ged.msu.edu/SRR390202.se.qc.fq.gz

This download takes about ten minutes to put on our instances.

You'll also need a reference genome, which can be acquired with::

     curl -O http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_O104_H4_2011C_3493_uid176127/NC_018658.fna

which downloads the reference genome from NCBI.

Mapping the Reads
`````````````````

To speed up the demonstration, we will just map a subset of the reads rather than
the entire file, which is somewhat large (though small compared to many datasets).
The head command outputs the first n lines of a file, by default 4::

    gunzip -c ../SRR390202.pe.qc.fq.gz | head -400000 > ecoli_pe.fq
    gunzip -c ../SRR390202.se.qc.fq.gz | head -400000 > ecoli_se.fq

We've got our reads and a reference, so we're ready to get started. First, we
build an index of the reference genome using bwa::

    mv ../NC_018658.fna .
    bwa index -a bwtsw NC_018658.fna

The -a flag tells bwa which indexing algorithm to use. The program will automatically
output some files with set extensions, which the main alignment program knows the
format of. Thus, we run the alignment like so::

    bwa mem -p NC_018658.fna ecoli_pe.fq > aln.x.ecoli_NC_018658.sam

which aligns the left and right reads against the reference, and outputs them
to the given SAM file. SAM is a common format for alignments which is understood
by many programs, along with BAM. It's often useful to have both, so we'll use
a utility called samtools to produce a sorted BAM file as well.  First,
install samtools::

    cd /mnt
    curl -O -L http://sourceforge.net/projects/samtools/files/samtools/0.1.18/samtools-0.1.18.tar.bz2
    tar xvfj samtools-0.1.18.tar.bz2
    cd samtools-0.1.18
    make
    cp samtools /usr/local/bin
    cd misc/
    cp *.pl maq2sam-long maq2sam-short md5fa md5sum-lite wgsim /usr/local/bin/
    cd ..
    cd bcftools
    cp *.pl bcftools /usr/local/bin/

Then, run samtools to do the conversion::

    cd /mnt/ecoli
    samtools view -uS aln.x.ecoli_NC_018658.sam > aln.x.ecoli_NC_018658.bam
    samtools sort aln.x.ecoli_NC_018658.bam aln.x.ecoli_NC_018658.bam.sorted
    samtools index aln.x.ecoli_NC_018658.bam.sorted.bam

For additional resources on these tools, check out:

* the bwa `manual <http://bio-bwa.sourceforge.net/bwa.shtml>`__
* info on `samtools <http://samtools.sourceforge.net/>`__
* the SAM `format spec <http://samtools.sourceforge.net/SAM1.pdf>`__

Visualizing your Data with Tablet
`````````````````````````````````

Copy your mapping files to your local machine :

:
    scp -i soils.pem ubuntu@yourinstance:/mnt/ecoli/aln.x.ecoli_NC_018658.bam.sorted.bam*  .
    scp -i soils.pem ubuntu@yourinstance/mnt/ecoli/NC_018658.fna .

Although you can do many things with your alignments, one useful thing is to simply
view them through a graphical interface. To demonstrate this, we'll use a program
called Tablet, which can be downloaded `here <http://bioinf.scri.ac.uk/tablet/download.shtml>`__.

Tablet claims to run on Windows, Linux, and OSX, though I have only tested it out
on Linux. Because this is a GUI-driven program, we'll be running it on our
local machines instead of our EC2 instances. So, go ahead and grab the appropriate
version for your system, and install it.

Once you have it installed, open it up. You'll want to load your mapping file
and reference genome:

.. image:: images/tablet_open.png
    :width: 75%

You'll then get some loading bars, and potentially an error about the indexing file
which can be ignored. You need to select a contig on the left to view; ecoli has
a very good reference, and is only one contig:

.. image:: images/tablet_view.png
    :width: 75%

You can move left and right along the contig, as well as zoom. Tablet can view
other information like gene structure, but we won't get into that.