table of contents
- expected learning outcome
- getting started
- exercise 1: alignment of RNA-seq reads
- exercise 2: transcript reconstruction using Scripture
- exercise 3: analysis of the reconstruction with Cufflinks
- exercise 4: working with existing annotations
- exercise 5: ChIP-Seq analysis using Scripture
expected learning outcome
The learning objective of this activity is to understand the basics of RNA-Seq data and to gain proficiency in the use of Scripture and other programs.
getting started
We will analyze a partial dataset from a pathogen stimulated Dendritic Cells (DC) timecourse. Specifically, we will look at reads aligning to chr1:16170000-66500000 from three libraries, an RNA-Seq library from unstimulated cells, an RNA-Seq library from RNA extracted 4 hours post stimulation, and a H3K4me3 Chip-Seq library from unstimulated DCs.
We will work against a ”genome” consisting only of the first 66.55 megabases of mouse chromosome 1 (NCBI build 37).
We will need to uncompress the rnaseq.session.data.tgz file in your ~/wcg_oct2011/activities/Scripture/ folder:
- tar xvfz rnaseq.session.data.tgz
Your data folder (~/wcg_oct2011/activities/Scripture/) should now contain 5 read files (fastq) and the partial chromosome 1 (chr1.fa) which will be the "genome".
On Ubuntu Linux, run (copy/paste) the following commands:
mkdir ~/wcg_oct2011/software/scripture
ln -s ~/wcg_oct2011/local/lib/scripture_alpha2.jar ~/wcg_oct2011/software/scripture/scripture_alpha2.jar
On Mac OS X, run the following command:
- ln -s igvtools_1.5.15 ~/wcg_oct2011/software/IGVTools
exercise 1: alignment of RNA-seq reads
Use TopHat to align RNA-Seq reads. Your data folder should contain 5 read files (fastq) and the partial chromosome 1 (chr1.fa).
1.1 Creating the BW-transformed genome (~ 5 min. )
TopHat depends on Bowtie to map reads and read fragments, and the Bowtie aligner relies on a very efficient compression (Burrows-Wheeler transform) of the genome to perform its searches. It is necessary to create the BW transform of our genome before mapping reads.
Use the bowtie-build program in the Bowtie distribution. Change your directory to the data folder and invoke the BW transform:
- bowtie-build chr1.fa chr1
Our alignment database will be called chr1. If all is successful your data directory should now contain 6 new files:
- chr1.1.ebwt
- chr1.2.ebwt
- chr1.3.ebwt
- chr1.4.ebwt
- chr1.rev.1.ebwt
- chr1.rev.2.ebwt
1.2 Alignment using TopHat
Tophat calls all alignment files ”accepted hits.bam”. It is good practice to create a directory for each alignment run. In your working directory, create two directories:
- mkdir tophat.t0 tophat.t4
To align the reads in the data directory, invoke TopHat:
- tophat -o tophat.t0/ -r 250 chr1 DC.LPS.t0.merged.sorted.chr1.p1.fq DC.LPS.t0.merged.sorted.chr1.p2.fq
And similarly for the 4h reads:
- tophat -o tophat.t4/ -r 250 chr1 DC.LPS.t4.merged.sorted.chr1.p1.fq DC.LPS.t4.merged.sorted.chr1.p2.fq
Each of these commands take about 15 minutes. It is best to launch them in two different shells simultaneously.
1.3 Preparing and viewing alignments (~ 5 min)
Most applications require alignments to be sorted. Scripture and the IGV browser further require the alignments to be indexed. There is currently an incompatibility between samtools sort and Picard (which is used by Scripture); therefore you need to sort the BAM file using Picard and not samtools:
-
java -Xmx2000m -jar ~/wcg_oct2011/software/picard-tools-1.46/SortSam.jar I=tophat.t0/accepted_hits.bam O=t0.sorted.bam SO=coordinate
java -Xmx2000m -jar ~/wcg_oct2011/software/picard-tools-1.46/BuildBamIndex.jar I=t0.sorted.bam O=t0.sorted.bam.bai
Similarly, sort and index the 4hr alignments:
-
java -Xmx2000m -jar ~/wcg_oct2011/software/picard-tools-1.46/SortSam.jar I=tophat.t4/accepted_hits.bam O=t4.sorted.bam SO=coordinate
java -Xmx2000m -jar ~/wcg_oct2011/software/picard-tools-1.46/BuildBamIndex.jar I=t4.sorted.bam O=t4.sorted.bam.bai
1.4 Viewing the alignments
Once sorted and indexed, the alignments can be viewed in the Integrative Genomics Viewer (IGV).
Viewing alignments can be slow and is only useful to understand the specific structure of each gene. To be able to just look at read densities IGV provides a counting function to estimate the number of reads aligning to any window in the genome.
To produce these files for our alignments, use igvtools:
- igvtools count -w 5 t0.sorted.bam t0.sorted.bam.tdf ~/wcg_oct2011/software/IGVTools/genomes/mm9.genome
igvtools count -w 5 t4.sorted.bam t4.sorted.bam.tdf ~/wcg_oct2011/software/IGVTools/genomes/mm9.genome
Start the genome browser, IGV from the command line:
- igv
Select the mm9 genome, and use the File->Load from File... dialog to open both alignments. To zoom in quickly you may type stat1 in the address textbox.
Other interesting genes in the region: Pgap1 with a much longer UTR than that annotated and Obfc2a with long UTR minor isoform.
exercise 2: transcript reconstruction using Scripture
Scripture is designed to reconstruct all transcripts expressed at significant levels in the sample sequenced.
2.1 Looking at the paired-end information (optional)
The two libraries sequenced were prepared using the dUTP method which ensures the cDNA from all transcripts is always on reverse orientation to the original RNA. However, since the alignment contains both pairs, this is not obvious when looking at the alignments.
Scripture creates a new alignment that contains the original inserts in the orientation of one of the pairs. To generate this alignment invoke:
- java -Xmx2000m -jar ~/wcg_oct2011/software/scripture/scripture_alpha2.jar -task pairedfile -pair1 t0.sorted.bam -usePair2Orientation -out t0.sorted.pairs.bam
And similarly for the 4 hour library:
- java -Xmx2000m -jar ~/wcg_oct2011/software/scripture/scripture_alpha2.jar -task pairedfile -pair1 t4.sorted.bam -usePair2Orientation -out t4.sorted.pairs.bam
These alignments need to be sorted and indexed like the original alignments using Picard.
-
java -Xmx2000m -jar ~/wcg_oct2011/software/picard-tools-1.46/SortSam.jar I=t0.sorted.pairs.bam O=t0.sorted.pairs.sorted.bam SO=coordinate
java -Xmx2000m -jar ~/wcg_oct2011/software/picard-tools-1.46/BuildBamIndex.jar I=t0.sorted.pairs.sorted.bam O=t0.sorted.pairs.sorted.bam.bai
And similarly for the 4 hour timepoint:
-
java -Xmx2000m -jar ~/wcg_oct2011/software/picard-tools-1.46/SortSam.jar I=t4.sorted.pairs.bam O=t4.sorted.pairs.sorted.bam SO=coordinate
java -Xmx2000m -jar ~/wcg_oct2011/software/picard-tools-1.46/BuildBamIndex.jar I=t4.sorted.pairs.sorted.bam O=t4.sorted.pairs.sorted.bam.bai
The Bzw1 gene is a good example of a transcript having a pair of spliced exons without spliced alignments, but where paired-end data clearly indicates their connection.
2.2 Run Scripture
Scripture works with tasks; each process corresponds to tasks specified by the -task parameter:
- java -Xmx2000m -jar ~/wcg_oct2011/software/scripture/scripture_alpha2.jar -task <your task> <task parameters>
If you do not specify the task, scripture executes transcript reconstruction, its default transcript reconstruction task.
For a list of tasks, invoke scripture without parameters:
- java -Xmx2000m -jar ~/wcg_oct2011/software/scripture/scripture_alpha2.jar
Run scripture:
- java -Xmx2000m -jar ~/wcg_oct2011/software/scripture/scripture_alpha2.jar -alignment t0.sorted.bam -chr chr1 -chrSequence chr1.fa -out scripture.t0.bed
And similarly for the 4 hour sample:
- java -Xmx2000m -jar ~/wcg_oct2011/software/scripture/scripture_alpha2.jar -alignment t4.sorted.bam -chr chr1 -chrSequence chr1.fa -out scripture.t4.bed
You can directly view the resulting scripture.t0.bed file in the IGV browser. However, in order to use scripture output as input to Cufflinks, we need to convert the output into a GTF file readable by Cufflinks:
- java -Xmx1000m -jar ~/wcg_oct2011/software/scripture/scripture_alpha2.jar -task toGFF -cufflinks -in scripture.t0.bed -source SCRIPTURE -out scripture.t0.gtf -prefix t0
And similarly with the 4 hour sample:
- java -Xmx1000m -jar ~/wcg_oct2011/software/scripture/scripture_alpha2.jar -task toGFF -cufflinks -in scripture.t4.bed -source SCRIPTURE -out scripture.t4.gtf -prefix t4
Stat1 is an interesting gene to explore.
2.3 Running Cufflinks (optional)
To run Cufflinks for transcript reconstruction on the same libraries, invoke the command that takes into account that our library is strand specific. First, create an output directory, and then run Cufflinks:
-
mkdir cufflinks.t0
cufflinks --library-type fr-firststrand -o cufflinks.t0 -m 290 t0.sorted.bam
The resulting transcripts cufflinks.t0/transcripts.gtf can be viewed and compared with those from Scripture in the IGV browser. An interesting difference is the Cnnm4 region, where a duplication makes for an interesting (if likely artificial) transcript.
To look at this region, load cufflinks.t0/transcripts.gtf and scripture.t0.gtf into the IGV browser and navigate to the Cnnm4 gene.
Scripture will try to report most of the isoforms that are supported by the data; many will be spurious. One way to filter out spurious transcripts is by finding those that account for a small fraction of the expressed mRNAs. Cufflinks read assignment strategy can be used to find isoforms that are either artifact or expressed at uninteresting levels.
exercise 3: analysis of the reconstruction with Cufflinks
3.1 Creating a single reconstruction set using cuffmerge
The cuffmerge utility is design to create a single transcript dataset from multiple reconstruction. In this step we use this utility to merge our timepoint reconstructions into a single set:
The input to cuffmerge is a file listing all the different reconstructions to merge, to create it use your favorite text editor (e.g. vi) and type in the two scripture reconstruction .gtf files we previously obtained. The file should contain the following two lines. Save it with the name cuffmerge.input.txt
-
scripture.t0.gtf
scripture.t4.gtf
Then run cuffmerge:
- cuffmerge cuffmerge.input.txt
Cuffmerge will output its merged reconstruction to a merged_asm/merged.gtf file, which we will use throughout this section.
Note: Load the merged_asm/merged.gtf to your IGV session to visualize result pertaining this transcript set. Make sure you also have loaded the alignment data (bam and tdf files) in your session so you can visualize the results we obtain in the next steps.
3.2 Expression analysis with Cuffdiff
The single transcript file we created in the previous section can be readily used by cuffdiff for differential expression analysis at both isoform and gene levels.
Run cuffdiff
- cuffdiff -o cuffdiff.merged --library-type fr-firststrand merged_asm/merged.gtf -L t0,t4 t0.sorted.bam t4.sorted.bam
The --library-type fr-firststrand tells cuffdiff that the library construction method used for this library is strand specific and that the first read comes from the 3' end of the RNA fragment as is the case with the dUTP library construction method used to generate the libraries used in this exercise.
The -L t0,t4 option tells cuffdiff to use the labels t0 for the first sample, and t4 for the second sample.
The -o cuffdiff.merged option forces output into the cuffdiff.merged subdirectory of your current directory.
Cuffdiff has several output files, described elsewhere in more detail. Here we will concentrate on cuffdiff.merged/gene_exp.diff and cuffdiff.merged/isoform_exp.diff files.
These files are tab-separated tables consisting of 13 columns and can be viewed in Excel or OpenOffice.
To quickly analyze these files we will use standard Unix commands:
To view the gene differential expression file use
- less -S cuffdiff.merged/gene_exp.diff
and
- less -S cuffdiff.merged/isoform_exp.diff
on two different terminal windows. While there is a single test_id row and gene_id for each locus in the gene_exp.diff file, the isoform_exp.diff file may contain several test_id rows for each gene_id and, in some cases several gene_ids for a single locus.
Next lets look for the most significant differentially expressed genes according to cuffdiff. For this we can use standard unix tools to sort cuffdiff output according to the significance of the test statistic for differential expression:
- grep -v test_id cuffdiff.merged/gene_exp.diff | sort -nk12 | less -S
The locus chr1:40448452-40522254 is at the top of the list. Look at this locus in the IGV browser by pasting the locus into IGV's location window. The locus corresponds to the Il1rl1 gene, an Il33 putative receptor, which has been shown to a regulator of the inflammatory response and associated with several auto-immune diseases. Interestingly this gene is down-regulated in DCs during the inflammatory response.
We can similarly examine the isoform expression for this gene:
- grep 'chr1:40448452-40522254' cuffdiff.merged/isoform_exp.diff | less -S
Take the time to explore the different isoform usage for the 5 putative isoforms and the intronic segment.
Examine this locus in detail as it is an example of many interesting features.
exercise 4: working with existing annotations
Scripture implements a simple scoring scheme to compute the expression of a given set of transcripts. It does so by finding the set of constituent exons for any given gene and computing RPKM and other expression values.
First download the RefSeq annotations for the region of interest. Go to the UCSC genome browser, select the mouse mm9 genome, and download the RefSeq track of the Gene and Gene prediction track group in BED format (be sure to select full gene when prompted). In the position textbox, input chr1:16170000-66500000 and save the file as refseq.bed.
Then use Scripture to score genes:
- java -Xmx2000m -jar ~/wcg_oct2011/software/scripture/scripture_alpha2.jar -task score -in refseq.bed -alignment t0.sorted.pairs.sorted.bam -out refseq.t0.scored -useConstituentExons
java -Xmx2000m -jar ~/wcg_oct2011/software/scripture/scripture_alpha2.jar -task score -in refseq.bed -alignment t4.sorted.pairs.sorted.bam -out refseq.t4.scored -useConstituentExons
The output consists of the orginal BED file with an extra set of 8 columns. These columns are:
- Family wise error rate (FWER) at the observed score
- Read enrichment (average reads per base over transcript / genome average)
- Total reads aligned transcript
- Average reads per base over transcript
- RPKM
- Genome average reads per base
- Transcript length
- Poisson nominal p-value for the observed count given the genome-wide average
You can use a custom script like this example to build an expression table with either the raw counts, p-values or FPKM values, which can be analyzed using, for example, DESeq.
exercise 5: ChIP-Seq analysis using Scripture
In this exercise we use Scripture to find regions of enrichment (peaks) in a H3K4me3 ChIP-Seq library of unstimulated DCs.
We first align reads to our toy chromosome 1, but first we need to unpack our reads:
- gunzip DC.K4me3.t0.fq.gz
To map the reads we use bowtie:
- bowtie --best -S chr1 DC.K4me3.t0.fq DC.K4me3.t0.sam
Examine the output alignment using a text editor to review the SAM format.
- less -S DC.K4me3.t0.sam
We then sort and index the alignment. Since the alignment is in SAM format, we can either convert it to BAM using the samtools view utility or use Picard (or Unix sort) to directly sort the alignment:
- java -Xmx2000m -jar ~/wcg_oct2011/software/picard-tools-1.46/SortSam.jar I=DC.K4me3.t0.sam O=DC.K4me3.t0.sorted.bam SO=coordinate
and then index:
- java -Xmx2000m -jar ~/wcg_oct2011/software/picard-tools-1.46/BuildBamIndex.jar I=DC.K4me3.t0.sorted.bam O=DC.K4me3.t0.sorted.bam.bai
and produce a count file to view in IGV:
- igvtools count -w 5 DC.K4me3.t0.sorted.bam DC.K4me3.t0.sorted.bam.tdf ~/wcg_oct2011/software/IGVTools/genomes/mm9.genome
Find regions of enrichment using Scripture:
- java -Xmx2000m -jar ~/wcg_oct2011/software/scripture/scripture_alpha2.jar -task chip -alignment DC.K4me3.t0.sorted.bam -chr chr1 -out DC.K4me3.t0.segments.bed -windows 500,100 -trim