Table of contents
- Expected learning outcome
- Exercise 1: Identifying homologs and non-homologs; effects of scoring matrices and algorithms
- Exercise 2: Comparison of protein:protein and translated DNA:protein to DNA:DNA searches - more sensitive DNA searches
- Exercise 3: Confirming statistical estimates with shuffles
- Exercise 4: Significant similarities within sequences - domain duplication
- Conclusion
Expected learning outcome
This activity will guide you through a number of similarity searching exercises using FASTA and BLAST. These exercises use programs on the FASTA WWW Search page and the Molecular Evolution BLAST WWW Search page [pgm].
An alternative (and more compact) version of these exercises is at: fasta.bioch.virginia.edu/mol_evol/.
In Exercise 1, you will search a small database for homologs using FASTA, Smith-Waterman (SSEARCH), or BLAST. The goal is to demonstrate that: (a) In general, all three programs produce about the same results using their default parameters. (b) Similarity statistics are accurate - unrelated sequences have E()-values near 1. (c) Homologous sequences do not always have significant scores, but additional searches (with different queries) can produce significant similarities. In Exercise 2, we show that protein:protein (BLASTP) or translated DNA:protein (BLASTX) searches are far far more sensitive than DNA:DNA searches. Exercise 3 shows how to use shuffling to independently confirm statistical signficance. Excercise 4 shows how to find duplicated domains, and that scoring matrices matter more for getting accurate domain boundaries.
Links below with the [pgm] highlight should take you to the appropriate program, with much of the information filled in (program, query sequence, database). Links with [seq] link to the NCBI page describing the protein or DNA sequence.
Exercise 1: Identifying homologs and non-homologs; effects of scoring matrices and algorithms
Use the FASTA search page [pgm] to compare Drosophila glutathione transferase GSTT1_DROME [seq] (gi|121694) to the PIR1 Annotated protein sequence database. Be sure to press , not .
- Inspect the output.
- How long is the query sequence?
- How many sequences are in the PIR1 database?
- What scoring matrix was used?
- What were the gap penalties?
- What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
- Looking at an alignment, where are the boundaries of the alignment (the best local region)?
Answer: (i) 209 aa; (ii) 13,351; (iii) BLOSUM50; (iv) gap open= -10; extend -2; one residue gap: -12; two residue: -14; (v) library sequence length; (band) optimized (opt) score (raw score), bit score, E()-value, percent identity, fraction identical, alignment length;
- What is the highest scoring non-homolog? (The non-homolog with the highest alignment score, or the lowest E()-value.) How would you confirm that your candidate non-homolog was truly unrelated?
- Note that this drosophila glutathione transferase shares significant similarity with both sequences from bacteria (SSPA_ECO57, stringent starvation protein) and mammals. How might you test whether the stringent starvation protein is homologous to glutathione transferases? (Hint - compare your candidate non-homolog with SwissProt for a more comprehensive test.)
- Compare the expectation (E()) value for the distant relationship between GSTT1_DROME and GSTP1_HUMAN (class-pi). How would you demonstrate that GSTT1_DROME is homologous to GSTP1_HUMAN?
[show answer]Answer: Try searching PIR1 with an intermediate (but significant) mammalian sequence.
- Examine how the expectation value changes with different scoring matrices (BLOSUM62, BlastP62, PAM250) and different gap penalties. (The default scoring matrix for the FASTA programs is BLOSUM50, with gap penalties of -10 to open a gap and -2 for each residue in the gap – e.g. -12 for a one residue gap).
- What happens to the E()-value for the 100% identical sequence with the different matrices and different gap penalties?
- What happens to the E()-value for distant homologs, like GSTA1_RAT with the different matrices and different gap penalties?
- What happens to the E()-value for the highest scoring unrelated sequence with the different matrices?
Best scores BL50 -10/-2 BL50 -12/-2 BL62 -11/-1 PAM250 -10/-2 GSTT1_DROME 1399 1.1e-98 1399 1.5e-109 1108 3.4e-117 1087 1.2e-75 GSTF3_MAIZE 173 3.5e-06 163 1.8e-07 119 3e-07 152 1.1e-05 GSTF1_MAIZE 151 7.1e-05 140 1.4e-05 96 0.0001 118 0.0035 SSPA_ECO57 140 9.7e-05 134 4.2e-05 99 4.8e-05 140 7.8e-05 GSTA1_RAT 139 0.00012 120 0.00064 78 0.011 118 0.037 GSTA4_RAT 97 0.16 84 0.6 55 3.9 86 0.8 GSTM2_RAT 93 0.32 85 0.48 61 0.83 75 5.9 GSTP1_HUMAN 82 2.0 74 3.7 -- -- -- -- Highest unrelated 79 3.5 76 2.6 58 3 93 1.7 - Try the search with ssearch [pgm] (Smith-Waterman). Again, look at the E()-values for distant homologs and the highest scoring unrelated sequence.
Do the same search (121694) using the Course BLAST [pgm] WWW page (BLASTP search instead of FASTA).
- Inspect the output.
- How long is the query sequence?
- How many sequences are in the PIR1 database?
- What scoring matrix was used?
- What were the gap penalties?
- What are the numbers after the description of the library sequence? Which one is best for inferring homology?
- Looking at an alignment, where are the boundaries of the alignment (the best local region)?
- What is the highest scoring non-homolog?
- How do the blastp E()-values compare with the FASTA (blosum62) E()-values for the distantly related mammalian and plant sequences?
| Best scores | BL50 -10/-2 | BL62 -11/-1 | BLASTP | SSEARCH |
|---|---|---|---|---|
| GSTT1_DROME | 1.1e-98 | 3.4e-117 | e-122 | 1.2e-99 |
| GSTF3_MAIZE | 3.5e-07 | 3e-07 | 2e-07 | 1.4e-07 |
| GSTF1_MAIZE | 1.5e-05 | 0.0001 | 0.0003 | 3.7e-05 |
| SSPA_ECO57 | 9.7e-05 | 4.8e-05 | 3e-05 | 0.00025 |
| GSTA1_RAT | 0.00012 | 0.011 | 0.010 | 0.00031 |
| GSTA4_RAT | 0.16 | 3.9 | 0.060 | 0.097 |
| GSTM2_RAT | 0.32 | 0.83 | 1.8 | 0.92 |
| GSTP1_HUMAN | 2 | 2.1 | -- | 6 |
| Highest unrelated | 3.5 | 3 | 4.1 | 3.2 |
Exercise 2: Comparison of protein:protein and translated DNA:protein to DNA:DNA searches - more sensitive DNA
In this exercise, we will try to find GSTT1_DROME homologs in the Arabidopsis genome, using (1) protein:protein (BLASTP), (2) DNA:protein (BLASTX), (3) protein:DNA (TBLASTN), and (4) DNA:DNA (BLASTN) searches.
The BLASTP, BLASTX, etc. links below are pre-set to search Arabidopsis sequences.
- BLASTP. Compare the GSTT1_DROME [seq] (gi|121694) protein sequence to Arabidopsis proteins using BLASTP [pgm].
What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1, and ATGSTU4?
- BLASTX. Try the same search using the GSTT1_DROME cDNA DMGST [seq] (gi|8033) against Arabidopsis proteins using BLASTX [pgm].
What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1, and ATGSTU4?
- TBLASTN. Use GSTT1_DROME [seq] (gi|121694) against translated Arabidopsis DNA using TBLASTN [pgm].
What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1, and ATGSTU4?
- Finally, try the DNA:DNA comparison. Use BLASTN [pgm] to compare dmgst (gi|8033) to the Reference mRNA sequences (refseq_rna) sequences in Arabidopsis.
Are there detectable Arabidopsis homologs?
| BLASTP | BLASTX | TBLASTN | BLASTN | |||||
| ATGSTT1 | 84.0 | 1e-16 | 80.5 | 2e-15 | 83.6 | 2e-16 | -- | -- |
| GSTF10 | 68.6 | 4e-12 | 67.8 | 1e-11 | 69.7 | 3e-12 | -- | -- |
| ATGSTZ1 | 52.4 | 3e-07 | 52.4 | 5e-07 | 52.0 | 6e-07 | -- | -- |
| ATGSTU4 | 40.8 | 0.001 | -- | -- | -- | -- | -- | -- |
Exercise 3: Confirming statistical estimates with shuffles
Use the PRSS shuffle [pgm] program to evaluate the statistical significance of a match.
- Compare GSTT1_DROME [seq] (gi|121694) to GSTA4_RAT [seq] (gi|121714) using PRSS [pgm].
What is the E()-value? What equivalent database size is used to calculate the E()-value? Why isn't the database size 1? What should it be?
[show answer]Answer: E() < 0.001; database size is 10,000; sequence was found from a database search, database size was not 1;
- Use PRSS [pgm] to compare OPSD_HUMAN [seq] (gi|129207) to US27_HCMVA [seq] (gi|137159) Human Herpes GPCR homolog. Compare with uniform or window shuffling. How does the E()-value change? Why?
[show answer]Answer: Window shuffling preserves local composition, making "random" 7TM proteins
- Use the FASTA search page [pgm] to compare rat synapsin-1 SYN1_RAT [seq] (gi|6686305) to the PIR1 Annotated protein sequence database.
- What is the E()-value against the human IL3 receptor IL3RB_HUMAN [seq]?
- Use PRSS [pgm] to compare SYN1_RAT with IL3RB_HUMAN. Now what is the E()-value? What happens with the window shuffle?
- Look at the two sequences using pseg [pgm]. (Select "show low complexity as domains".)
Exercise 4: Significant similarities within sequences - domain duplication
Exploring domains with local alignments – Calmodulin
- Use lalign [pgm] to examine local similarities between calmodulin CALM_HUMAN and itself.
- Use plalign [pgm] to plot the same alignment. How many repeats are present in this sequence?
- If a protein is made up of five copies of a domain, how many off-diagonal (non-identical) alignments should be seen?
- What happens to the domain alignment plot when you use a shallower scoring matrix? (Try BP62 and MD20.)
Answer: (b) 4; (c) n-1=4; (d) BP62 looks similar, but only close relationships are seen with MD20.
Exploring domains with local alignments – Cortactin (SRC8_HUMAN)
- Use lalign [pgm] to examine local similarities between SRC8_HUMAN and itself. Note the average percent identity for the some of the most significant alignments.
- Use plalign [pgm] to plot the same alignment. How many repeats are shown in this alignment? How long do they appear to be?
- Based on the percent identities you saw in step 1, what would the appropriate scoring matrix be to accurately identify the cortactin domains?
- What happens to the domain alignment plot when you use a shallower scoring matrix? (Try BP62 and MD40.) How do the BP62, PAM120, and MD40 alignments differ? Why are they different? Which matrix appears to best identify all the repeats found in the PFAM diagram?
- You can look at the PFAM annotation of this protein at PFAM: SRC8_HUMAN [pfam] (Cortactin). How many repeats are shown in this diagram? How long are they?
Answer:(a) 35 - 40% identical (b) MD40 for 60%, MD70 (not available) for 40% (c) MD40 domains are better defined, BP62 and BL50 alignments drag on. MD40 (d) 7 domains of about 40 residues
For more complex domain alignments, try mwkw, or mouse RNA polymerase (rpb1_mouse residues 1500- ) against itself. Try the rpb1_mouse alignment using the MD20 scoring matrix as well as BLOSUM50.
Conclusion
Where to get the FASTA package: faculty.virginia.edu/wrpearson/fasta/
The "normal" FASTA WWW site