table of contents
- expected learning outcome
- getting started
- exercise 0: index the reference using BWA
- exercise 1: align the reads to the reference using BWA (optional)
- exercise 2: inspect the reads
- exercise 3: assemble the reads into contigs using ABySS
- exercise 4: align the contigs to the reference using web BLAT
- exercise 5: align the contigs to the reference using BWA-SW
- exercise 6: browse the contig to reference alignments using samtools tview
- exercise 7: browse the contig to reference alignments using IGV
- exercise 8: view the contig to reference alignments SAM file
- exercise 9: call variants of the reads-to-reference alignments using bcftools (optional)
- exercise 10: call variants of the contigs-to-reference alignments using bcftoolss
- exercise 11: call the effects of the SNVs
- exercise 12: compare the assembly variants to the read-alignment variants (optional)
- supplementary exercise 13: align the reads to the contigs using BWA
expected learning outcome
The learning objective of this activity is to understand the basics of the sequence assembly process and some related analysis steps, as well as gain proficiency in some specific software for these analyses. After completion of this activity, you will have learned how to perform basic assembly steps using ABySS, use BWA and BWA-SW to align reads and contigs to a reference genome, use IGV to visualize these alignments, and use bcftools and snpEff to call variants and determine their effect.
The working example is data for a ~200 kbp bacterial artificial chromosome (BAC). The data are paired-end reads from one lane of an Illumina sequencer.
getting started
System requirements
Total run time of the tools alone is ~70 minutes on a 2-core 2 GHz system. 4 GB of RAM and 5 GB of disk space is required. The following software is required:
ABySS 1.3.0: assemble short reads de novo
http://www.bcgsc.ca/platform/bioinfo/software/abyss
BWA 0.5.9: align short reads
http://bio-bwa.sourceforge.net
IGV 2.0.11: visualize a genome
http://www.broadinstitute.org/igv
Java 6: execute Java programs
http://www.java.com
samtools 0.1.16: manipulate SAM/BAM files
http://samtools.sourceforge.net
snpEff 1.9.5: call the effect of SNPs
http://snpeff.sourceforge.net
tabix 0.2.5: index tab-delimited files
http://sourceforge.net/projects/samtools/files/tabix
Set up the environment
Each time you open a new terminal, you will need to run the following two commands.
Change to the activity working directory and source the environment script:
- cd ~/wcg_oct2011/activities/abyss
source environment
Check that the learning activity scripts are in the PATH:
- which run-abyss run-bwa run-bwasw run-snpeff run-bcftools run-bcftools-assembly
The FASTQ files are located in the activities folder:
- ls 30CJCAAXX_4_[12].fq.gz
exercise 0: index the reference using BWA
Index the reference file:
- cd $top
bwa index $ref
8 min, 1 GB RAM, 260 MB disk space
exercise 1: align the reads to the reference using BWA (optional)
Run BWA
- cd $top
- mkdir bwa
- cd bwa
- run-bwa 2>&1 |tee bwa.log
40 min, 250 MB RAM, 3 GB disk space
While this job is running in the background, open a new terminal and continue with the next section. Remember you have to set the environment, in the new terminal window type:
- cd ~/wcg_oct2011/activities/abyss
- source environment
exercise 2: inspect the reads
There are two FASTQ files for this one lane of paired-end Illumina data: one file for the forward reads and one file for the reverse reads.
Look at the first few reads.
- cd $top
- gunzip -c 30CJCAAXX_4_1.fq.gz |head
- How long are the reads?
[show answer]Answer:
gunzip -c 30CJCAAXX_4_1.fq.gz |awk 'NR==2{print length;exit}'
50 - How many lines are there in both files? (hint: use wc -l)
[show answer]Answer:
gunzip -c 30CJCAAXX_4_1.fq.gz 30CJCAAXX_4_2.fq.gz |wc -l
40,869,448 lines - How many lines per read?
[show answer]Answer:
4 lines per read - How many reads are there in both files?
[show answer]Answer:
40869448 / 4 = 10,217,362 reads - How many bases are sequenced?
[show answer]Answer:
10217362*50 = 510,868,100 bp - Assuming the BAC is 200 kbp, what is the depth of coverage?
[show answer]Answer:
510868100/200000 = 2554 fold coverage
exercise 3: assemble the reads into contigs using ABySS
Run the assembly.
- cd $top
On Mac OS X uncompress the read files and use the run-abyss-fq script:
- gunzip -c 30CJCAAXX_4_1.fq.gz >30CJCAAXX_4_1.fq
gunzip -c 30CJCAAXX_4_2.fq.gz >30CJCAAXX_4_2.fq
k=48 run-abyss-fq 2>&1 |tee abyss.log
On Linux use the run-abyss script:
- k=48 run-abyss 2>&1 |tee abyss.log
5 min, 200 MB RAM, 2 MB disk space
While the assembly is running, view the script in a text editor.
- <text editor command> $top/bin/run-abyss
Look at the option -n,--dry-run of abyss-pe. Its output is the
commands that ABySS will run for the assembly.
- k=32 run-abyss -n
The assembly runs in three stages: assemble contigs without paired-end information, align the paired-end reads to the initial assembly, and merge contigs joined by paired-end information. You can instruct ABySS to stop after any of these stages. Use the -n option to see the commands for each stage.
- k=32 run-abyss se-contigs -n
- k=32 run-abyss pe-sam -n
- k=32 run-abyss pe-contigs -n
Once the assembly has completed, view the contigs in a text editor.
- <text editor command> k48/HS0674-contigs.fa
Disabling line wrap makes it easier to browse the file.
- How many contigs are longer than 100 bp?
[show answer]Answer:
5 - What is the length of the longest contig (hint: use awk)?
[show answer]Answer:
awk 'length>x{x=length} END{print x}' k48/HS0674-contigs.fa
121927 bp - What is the N50 of the assembly?
[show answer]Answer:
abyss-fac k48/HS0674-contigs.fa
n n:200 n:N50 min N80 N50 N20 max sum
10 5 1 8044 54747 121861 121861 121861 212397
122 kbp - View the assembly log in a text editor.
- <text editor command> abyss.log
What portion of the reads align to the assembly? (hint: search for "Aligned")
[show answer]Answer:
7173439 of 10217362 reads (70.2083%) - What is the median fragment size and standard deviation of this library? (hint: search for "median")
[show answer]Answer:
median = 204 bp, sd = 18 bp
exercise 4: align the contigs to the reference using web BLAT
Open BLAT in a web browser.
http://genome.ucsc.edu
View the assembled contigs in a text editor.
- <text editor command> k48/HS0674-contigs.fa
Disabling line wrap makes it easier to select the full sequence.
Select the two contigs whose lengths are ~8 and ~16.5 kbp and copy-and-paste their sequence into BLAT.
- What is the exact length of these two contigs?
[show answer]Answer:
8,044 bp and 16,561 bp - Click "browser" for the best alignment and then zoom out 10x. To which chromosome and band do these contigs align?
[show answer]Answer:
chr3q27.3 - What are the nearest two genes?
[show answer]Answer:
SST and RTP2
Set the "Common SNPs" track (in Variation and Repeats) to "pack". A SNV is displayed with a red line. Zoom in on a SNV. Is it in dbSNP?
Zoom in on the gap between the two contigs.
Set the "RepeatMasker" track (in Variation and Repeats) to "full".
- What feature overlaps the gap that likely caused the assembly gap?
[show answer]Answer:
Simple repeat (TA)n - Zoom in to see the sequence of the feature.
Zoom out to see the alignment of both contigs.
Unaligned query sequence is shown with a thin purple line at the end of the alignment. The thin purple line can be difficult to see.
Which contig has unaligned sequence at one end?
[show answer]Answer:
the 16.5 kbp contig - Select that contig, and copy the unaligned sequence to the clipboard. Aligned sequence is shown in blue upper-case characters, and unaligned sequence is shown in black lower-case characters.
Open BLAST in a web browser. http://blast.ncbi.nlm.nih.gov
Select "nucleotide blast".
Select the database "Nucleotide collection (nr/nt)"
Paste the sequence into the query box. Click BLAST.
To what sequence is the best BLAST hit?
[show answer]Answer:
Cloning vector pTARBAC6, complete sequence - What is the cause of this chimeric contig?
[show answer]Answer:
This contig contains the cloning vector as well as the human insert.
exercise 5: align the contigs to the reference using BWA-SW
Warning: do not start BWA-SW until BWA has completed unless your machine has at least 2 GB of RAM.
Run BWA-SW.
- cd $top
- mkdir k48/bwasw
- cd k48/bwasw
- ln -s ../HS0674-contigs.fa .
- run-bwasw
1 min, 800 MB RAM, 1 MB disk space
While the alignment is running, view the script in a text editor.
- <text editor command> $top/bin/run-bwasw
exercise 6: Browse the contig to reference alignments using samtools tview
Run samtools tview.
- cd $top/k48/bwasw
- samtools tview HS0674-contigs.bam $ref
Go to the region chr3:186,648,940
- g chr3:186,648,940
- What variants do you see?
[show answer]Answer:
a T/G SNV and a 24-bp insertion - Go to the region chr3:186,676,730
- g chr3:186,676,730
Notice the Ns in the contig sequence, indicating a scaffold gap.
How many Ns are in the contig?
[show answer]Answer:
19 Ns - How many Ns should there be for the size of the gap to agree with the reference?
[show answer]Answer:
15 Ns
exercise 7: browse the contig to reference alignments using IGV
Start IGV:
- igv
Select View -> Preferences... -> Alignments and change "Visibility range threshold (kb)" to 1000. Close and restart IGV to make this configuration take effect.
Select the reference "Human hg19".
Then select File->Load from File...
~/wcg_oct2011/activities/abyss/k48/bwasw/HS0674-contigs.bam
Go to the region chr3:186,600,000-187,600,000 by entering it into the box labeled "Go".
IGV may take up to a minute to load.
- What genes overlap the contigs?
[show answer]Answer:
ST6GAL1, SST, RTP2 and BCL6 - Bonus: Find a coding SNV. What is its dbSNP rs ID?
[show answer]Answer:
rs1973791 (chr3:187,416,634)
rs11707167(chr3:187,416,719)
rs1056932 (chr3:187,447,032)
Add the dbSNP track.
File->Load from Server... then expand "hg19", "Variation and Repeats"
and select "dbSNP 1.3.1".
Zoom in on a SNV. Is it in dbSNP? Is it coding?
exercise 8: view the contig to reference alignments SAM file
View the SAM file in a text editor.
- cd $top/k48/bwasw
- <text editor command> HS0674-contigs.sam
Disable line wrap.
[show answer]
Answer:
The contig that has alignments starting at chr3:186,644,066 and chr3:187,439,792.
[show answer]
Answer: chr3:186,644,066 (-) and chr3:187,439,792 (+)
[show answer]
Answer: a ~800 kbp inversion
[show answer]
Answer: ST6GAL1 and BCL6
exercise 9: call variants of the reads-to-reference alignments using bcftools (optional)
Check that BWA has completed aligning the reads to the reference.
Run bcftools in this terminal.
- cd $top/bwa
- run-bcftools 2>&1 |tee bcftools.log
20 min, 250 MB RAM, 120 MB disk space
While bcftools is running, view the script in a text editor.
- <text editor command> $top/bin/run-bcftools
exercise 10: call variants of the contigs-to-reference alignments using bcftools
Run bcftools.
- cd $top/k48/bwasw
- run-bcftools-assembly 2>&1 |tee bcftools.log
Browse the variants using IGV.
File->Load from File... k48/bwasw/HS0674-contigs.var.vcf.gz
Right-click on the VCF track and select "Color by Allele".
exercise 11: call the effects of the SNVs
Run snpEff.
- cd $top/k48/bwasw
On Mac OS X uncompress the read files and run the run-snpeff script:
- gunzip -c HS0674-contigs.var.vcf.gz >HS0674-contigs.var.vcf
- run-snpeff HS0674-contigs.var.vcf >HS0674-contigs.var.snpeff
On Linux download the snpEff database and run the run-snpeff script:
- wget sf.net/projects/snpeff/files/databases/v1_9_5/snpEff_v1_9_5_hg37.zip
unzip snpEff_v1_9_5_hg37.zip
mv data ~/wcg_oct2011/software/snpEff_v1_9_5/
- run-snpeff HS0674-contigs.var.vcf.gz >HS0674-contigs.var.snpeff
1 min, 1.0 GB RAM
View the output of snpEff in a text editor.
- <text editor command> HS0674-contigs.var.snpeff
Count the number of SNVs in each category of effect.
- sort -uk1,2 HS0674-contigs.var.snpeff |cut -f16 |cut -d: -f1 |sort |uniq -c |sort -rn
Find all the coding SNVs.
- grep CODING HS0674-contigs.var.snpeff |cut -f1-4,16-18 |uniq
- What is its dbSNP rs ID? (hint: use IGV or the UCSC genome browser)
[show answer]Answer:
rs11707167 - Open dbSNP in a web browser. http://www.ncbi.nlm.nih.gov/projects/SNP/
What is the minor allele frequency (MAF) of this SNP?
[show answer]Answer:
C=0.490
[show answer]
Answer: chr3:187,416,719
exercise 12: compare the assembly variants to the read-alignment variants (optional)
Browse both variants using IGV.
Start IGV.
File->Load from File... k48/bwasw/HS0674-contigs.bam
File->Load from File... k48/bwasw/HS0674-contigs.var.vcf.gz
File->Load from File... bwa/HS0674.var.vcf.gz
Go to the region chr3:186,648,960
- Is the SNV called by both methods?
[show answer]Answer:
Yes. - Is the insertion called by both methods?
[show answer]Answer:
No, the insertion is called only by the assembly. - Hover the mouse cursor over the insertion to see the inserted sequence. What is the inserted sequence?
[show answer]Answer:
TCTGGGTTCCTTCAAATCCTGCCT - Compare the inserted sequence to the reference sequence at this location. What sequence do the inserted sequence and the reference sequence have in common?
[show answer]Answer:
The insertion duplicates this 8-bp sequence: TCTGGGTT
Load the alignments of the reads to the reference.
File->Load from File... bwa/30CJCAAXX_4.bam
The aligned reads do not show the insertion, but the alignments that span the insertion have mismatches at the end of the alignment. Do those mismatched bases agree with the inserted sequence?
[show answer]
Answer: Yes, the mismatched bases show CT at the 5' end andCC at the 3' end, which agree with the inserted sequence.
supplementary exercise 13: align the reads to the contigs using BWA
Run BWA.
-
cd $top
mkdir k48/bwa
cd k48/bwa
bwa index ../HS0674-contigs.fa
ref=../HS0674-contigs.fa run-bwa 2>&1 |tee bwa.log
12 min, 200 MB RAM, 3 GB disk space
Index the assembly FASTA file.
- samtools faidx ../HS0674-contigs.fa
This command may fail with the following error:
- [fai_build_core] line length exceeds 65535 in sequence '94'.
- How long is the longest line? (hint: use awk)
[show answer]Answer:
awk 'length>x{x=length} END{print x}' ../HS0674-contigs.fa121,927 bp
Break the long lines into short lines.
- fold ../HS0674-contigs.fa >HS0674-contigs.fa
Index the FASTA file.
- samtools faidx HS0674-contigs.fa
Browse the BAM file using samtools tview.
- samtools tview 30CJCAAXX_4.bam HS0674-contigs.fa
Browse the BAM file using IGV.
File->Import Genome... k48/HS0674-contigs.fa
File->Load from File... k48/bwa/30CJCAAXX_4.bam
Find a scaffold gap on the largest contig.
Find the two contigs that have consistent mate pairs joining them.