Phycas activity

Table of contents

Expected learning outcome

The goal of this tutorial is to help you become familiar with running Phycas and interpreting its output. In particular, we are going to demonstrate the use of Phycas to calculate the marginal likelihoods for a partitioned versus unpartitioned analysis.

Do not rush things. If you do not finish the in lab, you can always download Phycas from www.phycas.org when you get home and finish it there. The web site and the manual (available at the web site) have instructions for installing Phycas if you are using your own machine.

Getting started

Initial Downloads for the Phycas lab

All participants will need to download phycasWoodsHole2010Demo.zip (a zip archive of the data files and scripts).

Ubuntu users will want to download the Phycas shell script

Mac users with intel machines will need to download the newest release inside Phycas-1.2.0-Py2.5-WoodsHole2010.dmg (depending on your browser, it may automatically mount as an image called "phycasimg", or you might need to double click the image to mount it). Copy Phycas application to your Desktop

General Information

Phycas is a software package for Bayesian phylogenetic analysis written by Paul Lewis, Mark Holder and David Swofford. Phycas is a hybrid program: computationally-intensive tasks are implemented in C++, and the high-level tasks are done in Python.

In fact, when you use Phycas you are actually issuing commands in the Python programming language. This has advantages because Python is a powerful, robust, and versatile interpreter. However, it means that Phycas cannot always be forgiving when it comes to the structure of commands and syntax in general. We have attempted to add as much error checking of input as possible, but be aware that most commands are case-sensitive and you will have to pay attention to the quoting used in commands in this tutorial.

Running Phycas on Ubuntu Linux

The Phycas python libraries have been installed on the Ubuntu Disk images supplied by the course. The easiest way to start an interactive Phycas session is to:

  • Download the Phycas shell script from this link: Phycas
  • Use the command chmod +x Phycas to make it exectuable, and
  • Move it somewhere on your executable search PATH (inside /usr/local/bin is usually a good choice).
    sudo cp Phycas /usr/local/bin

After that point you should be able to simply use the command Phycas to launch IPython and import phycas.

You can run Phycas from any terminal by invoking Python (type ipython from your shell prompt, or use python if ipython is not available) and then importing the phycas package:

ipython
from phycas import *

This requires that Phycas is installed in a standard system-wide location or that you have "told" the shell where to find it. If Python shows an error when you execute this command consult the Phycas manual or ask for help configuring your environment correctly.

The Phycas.app bundle for Mac

For (intel-based) Macs, you can run Phycas.app, a double-clickable program built around the open-source iTerm terminal application (our only changes to iTerm involve telling the Phycas.app application to: launch a terminal window, feed the terminal the python command, and tell python "from phycas import *" so that python knows about Phycas).

When you get a new window in the Phycas.app application you are automatically in Phycas, so you will not need to type in the "from phycas import *" command when you are interacting with Phycas. Note: You still have to use this command at the beginning of standalone Phycas scripts!

We did not change much of iTerm at all to make the Phycas GUI. If you are prompted to install an update of the GUI do NOT install the update -- it will actually download iTerm (not the iTerm4Phycas GUI). Also notice that you can alter the transparency of the GUI window using the View menu >> "Use Transparency" menu item (iTerm's default is an annoyingly transparent window).

IPython (all platforms)

The Ubuntu images distributed for the course come with ipython, and the Phycas.app is configured to run Phycas through IPython rather than Python itself. IPython provides a more user-friendly environment for interactive Python than the "raw" Python interpreter. The main benefits of IPython are:

  • Tab-completion of variable names (start to type the name and then hit the Tab-key to fill in the rest of the word or see a list of possible matches)
  • Native support for a few UNIX commands (most notably ls and cd)
  • Support for any UNIX commands if you start the line with ! (an exclamation point). For example !cat myscript.py | grep mcmc would behave just like the UNIX command cat myscript.py | grep mcmc

On Mac, you may want to tweak the configuration of phycas to tell it whether or not you want to use ipython. [show configuration instructions]

I highly recommend installing IPython, if your version of Phycas does not come with IPython.

A note on scripts used in this activity

The tutorial directory has subirectories for exercise 1, exercise 2, and exercise 3-4. Each subdirectory has the datafiles and scripts described in the tutorial. To use these scripts, you should change your working directory to the same directory as the data files (or you could edit the filenames in the scripts to reflect the relative path to the data files).

Exercise 1: Simple usage

I recommend that you perform this exercise by interactively entering the commands (rather than simply executing the script provided). Locate the directory of example files labeled phycasWoodsHole2010Demo and move it to a convenient location for you, such as the Desktop.

Launch Phycas:

  • Mac users: double-click the Phycas icon (later when you have scripts written, you can launch Phycas by dragging a Python script onto the Phycas application icon).
  • Linux users: invoke the Phycas shell script that you installed above.

After the libraries have loaded you will see a prompt. This is the Python interpreter waiting for you to give it a command. The working directory will be your home directory.

1.1  Changing the working directory

The first step is to set the working directory of Phycas to the directory in which you want to work. If you are using the IPython interface, and you would like to move to a directory called Desktop/phycasWoodsHole2010Demo/ex1basic, issue the command:

cd Desktop/phycasWoodsHole2010Demo/ex1basic
ls

to change working directory, then show the path to the current directory, and finally list the directory contents.

If you are using the raw Python interpreter (instead of the IPython interface), you have to issue the commands in Python:

from phycas import *
os.chdir("Desktop/phycasWoodsHole2010Demo/ex1basic")
os.listdir(".")

1.2  Getting help

A good first step is to get an idea of what actions you can take. Call the help function to see a general message about Phycas.

help()

To see what functions and variables are currently available, you can use the Python dir function

dir()

A very long list of variables is probably not too helpful to you. Fortunately, you can pass one of the listed names as an argument to the help function to see some information about the variable:

help(readFile)
help(mcmc)

1.3  Setting up a basic MCMC analysis

The settings that affect the behavior of an MCMC run are all stored as attributes of a Phycas object with the name mcmc. In Python, objects are collections of data and actions. You address the attributes of an object by putting a . between the name of the object and the name of the attribute. And you change the value of an attribute by using the = character to assign the attribute a new value:

mcmc.current()
mcmc.data_source = "green.nex"
mcmc.current()

If you look through the tables of input settings before and after the assignment to the data_source attribute, you should see that the value goes from None to the message:"Characters from the file green.nex"

Take time to look over the options available for MCMC to get a feeling for what Phycas can do. The current and help functions for Phycas commands will each show two tables. The first table shows settings that control the behavior of the analysis. The second table lists options that control where the output of the command is written.

As with other Bayesian phylogenetic software, the MCMC sampler will produce a collection of files with the trees and parameter values visited during the MCMC simulation. Now let us configure the output settings so that the files generated will have names that will remind us of the analysis done:

mcmc.out.log = "basic.log"
mcmc.out.log.mode = REPLACE
mcmc.out.trees.prefix = "green"
mcmc.out.params.prefix = "green"

Now we will set up the length of the MCMC simulation. It is important to realize that a "cycle" in Phycas is not equivalent to a "generation" in MrBayes (Ronquist and Huelsenbeck 2003). In MrBayes each generation represents an iteration of:

  1. propose a new state,
  2. evaluate the posterior probability density,
  3. accept or reject the proposal

But in Phycas a cycle represents a sweep of proposals over all of the parameters in model – including several proposed changes to the topology. In terms of the number of updates, a cycle in Phycas is more or less equivalent (in terms of the number of updates) to 100 generations in MrBayes (if you use the default settings). Using the mcmc options with the word weight in their name you can alter the number of times a particular type of update is performed in each cycle):

mcmc.ncycles = 2000
mcmc.sample_every = 10

This run will almost certainly not be long enough. In the interest of time, we will be running short MCMC simulations in the lab, please do not interpret the number of cycles used in this tutorial as recommendations for the appropriate length of the MCMC runs. As we have discussed several times in the course MCMC is a beautiful, but fairly dangerous algorithm and it is important that you make an effort to verify that your run is long enough before you publish results.

Before we launch the analysis, let us check the settings according to the MCMC command to see if there were any errors in the settings:

mcmc.current()

If everything looks good we can launch the run by calling the mcmc command as if it were a Python function:

mcmc()

As in MrBayes, we can summarize the trees produced by the mcmc analysis using a sumt command. Unlike MrBayes, the sumt command is not automatically configured to read the files that were just created. Once again, we can use the current function to inspect the state of a command. Let us set the sumt command to read the trees from the green.t file that was just created by the MCMC analysis:

sumt.current()
sumt.trees = "green.t"
sumt.current()

Then we can run the sumt command with the syntax:

sumt()

The sumt.burnin setting in Phycas is specified in terms of the number of trees to skip in the file – not in terms of the generation number. You will see the majority-rule consensus trees and the trees that have the highest posterior probability in the sumt_trees.pdf file.

From within IPython on Mac, the easiest way to open the pdf files is to exclamation point "shell-escape" and use the open command (from ipython this would be
!open sumt_trees.pdf
on Mac, and
!gnome-open sumt_trees.pdf
on Ubuntu).

In the case of this simple data set and this short MCMC, it is entirely possible that all of the splits will have estimated posterior probabilities of 1.0. If that is the case, the sumt command will not produce a pdf file showing plots of the posterior probability over different divisions of the MCMC simulation. We will see examples of these files later in the lab.

Leave this Phycas session with the command:

exit()

The commands for replaying this example are shown in the section basic.py and in the file basic.py in the scripts folder that accompanies the lab.

Exercise 2: Steppingstone sampling

In this exercise, we will use the RAG1 and RAG2 dataset that you used in the GARLI lab to demonstrate how we can use steppingstone sampling to estimate Bayes Factors. We calculate the marginal likelihood under a model in which we partition by codon position (leading to 3 subsets) and the marginal likelihood under a model in which we partition by both codon position and by gene (leading to 6 subsets). We are going to accomplish this by modifying and executing scripts that are in the phycasWoodsHole2010Demo/ex2steppingstone folder.

Because there are a large number of commands used in this exercise, I have written it such that you can modify the scripts as you work through the exercise (rather than writing them from scratch). I'll use this font color for actions that you need to take, so that it is clear when the activity is explaining a line in a script rather than instructing you to do something.

The steppingstone sampling command in Phycas is ss.
It is actually just a wrapper around the mcmc command, so several of the relevant settings are in the mcmc command.

2.1: Understanding the workflow of the steppingstone sampling analysis.

The "workflow" of the analysis that we are going to conduct is:

  1. Read the data into Phycas
  2. Set up a model that partitions by codon position
  3. Conduct steppingstone MCMC sampling to estimate the marginal likelihood under the "by codon position" model
  4. Read the data into Phycas
  5. Set up a model that partitions by codon position and by gene
  6. Conduct steppingstone MCMC sampling to estimate the marginal likelihood under the "by codon position and by gene" model
  7. Use the Bayes Factor to choose our preferred model
  8. Read the data into Phycas
  9. Set up the preferred model based on this Bayes Factor calculation
  10. Conduct an MCMC over trees

Notice that there is a quite a bit of repetition in tasks in this workflow. Because the front-end to Phycas is the python language, we can easily put the reused steps of this analyses into separate script files. This lets us avoid repeating typing the same instructions.

We'll accomplish this workflow using three separate scripts. Change your working directory to the ex2steppingstone subdirectory (using the cd command)

Open the files steppingstone1.py, steppingstone2.py, and steppingstone3.py in a text editor. Note that steppingstone1.py and steppingstone2.py are very simple scripts that piece together the "real action". Each script uses the python function execfile to read a file and perform all of the instructions in the file. steppingstone1.py does the steps 1-3 in the list above; steppingstone2.py does steps 4-6. You'll calculate the Bayes Factor (step 7) based on the output. Finally you'll need to modify steppingstone3.py to use the preferred model and then execute it (to complete steps 8-10 of the workflow).

2.2: Approximating the marginal likelihood for the by-codon-position partition

Open steppingstoneReadData.py in a text editor. The script begins with the importing of Phycas functions into python and the specification of the datafile:

from phycas import * 
mcmc.data_source = 'murphy29.rag1rag2.nex'

After that we see the definition of the character subsets (Phycas will soon support the CharSet and CharPartition commands in NEXUS. Unfortunately it currently only supports specifying the charsets in python using the subset).

Note that the three arguments to this command are the number of the first position (starting at 1 for the first character), the last number included in the subset, the third argument is the "stride". We use a stride of 3 to indicate that we want to include every third position.

We could use the steppingstone method to choose among the class of nucleotide model, but in the interest of time we are just going to use the GTR + Gamma model for all parts of this exercise. The relevant section of the script is:

model.type = "gtr"
model.num_rates = 4
model.gamma_shape = 0.5
model.gamma_shape_prior = Exponential(1.0)
model.update_freqs_separately = False

(The last line tells Phycas to treat the base frequencies as a joint parameter vector during in MCMC. For the time being, this is required when you use steppingstone sampling but this constraint will be relaxed very soon).

2.2.1: Exploring a distribution that could be used as a prior.

In almost all cases, we see transitions having a higher rate than transversions. However the default prior for relative rates in Phycas is a uniform distribution over all combinations of the 6 rates.
The rates are parameterized such that all 6 rates (rate[A ↔ C], rate[A ↔ G], rate[A ↔ T], rate[C ↔ G], rate[C ↔ T], and rate[G ↔ T]) sum to one. Phycas uses a Dirichlet distribution to specify the prior. To get a sense for how much information there is in the prior, you can manipulate an object that represents a Dirichlet distribution.
Launch Phycas and enter the following commands:

params = (100, 200, 100, 100, 200, 100)
x =  Dirichlet(params)
x.getMean()
x.getVar()
x.sample()
x.sample()
x.sample()
params = (1, 2, 1, 1, 2, 1)
y =  Dirichlet(params)
y.getMean()
y.getVar()
y.sample()
y.sample()
y.sample()

A note about the syntax: This code creates two different Dirichlet distribution objects. For each of the distributions, the functions getMean and getVar return the mean and variance, respectively. If you were to ask for help about these attributes of the distribution with help(x.getMean) or help(y.getMean), then you would see that it is a type of "function" or "method" (technically it is an "instancemethod", which is the term that is reported in the help function). That means that in order for you to take the action associated with the attribute you have to add parentheses after it (this is referred to as calling the function).

A note about the statistics:
The mean values for the rates are the same for these distributions. Both state that the second and fifth entries are expected to be twice as large as the other parmeters. Because of the ordering of the relative rates (rate[A ↔ C], rate[A ↔ G], rate[A ↔ T], rate[C ↔ G], rate[C ↔ T], and rate[G ↔ T]) in the GTR model, these distributions both lead to the expectation that transitions are twice common as transversions. But the two distributions differ in terms of how strongly they favor this set of parameter values. The larger the sum of parameters in the Dirichlet distribution, the smaller the variance. Smaller variances are associated with more informative priors. The y.sample() function draws a random value from the distribution y. Note that under, the more informative priors, the drawn samples are clustered very close to their expected values. Sampling from Phycas's distribution objects can help you get a sense of the distributions that are used as priors in a Bayesian analysis.

2.2.2: Setting a prior.

Execute the help(model) command in Phycas. You will notice that the prior for the relative rates in the GTR matrix is specified by the model.relrate_prior attribute.

Reopen steppingstoneReadData.py in a text editor. In the model setting section of the file, specify the prior distribution that you would like to use for the relative rates. The prior depends on your previous knowledge of molecular evolution (it should not be estimated from the data at hand). Based on your explorations of the Dirichlet distributions above, choose your prior by adding a line to the end of the file. For example, you might add:

model.relrate_prior = Dirichlet((1, 2, 1, 1, 2, 1))
model.update_relrates_separately = False

The second of these commands (i.e. model.update_relrates_separately = False) indicates that you want to use a joint (not "separate") prior for the relative rates. If we do not include this line then, Phycas will use its model.relrate_param_prior instead of our model.relrate_prior. (Make sure that keep the model.update_freqs_separately = False line in the model section, too).

2.2.3: Running the steppingstone sampler.

Open setUpByCodonPosPartition.py and steppingstoneDoMCMC.py in a text editor and read the comments. Feel free to ask about any options that you do not understand (or play around with the Phycas' help system to learn more about the options).

Run the analysis from within Phycas using the command execfile('steppingstone1.py')

Lets take a look at the output. Phycas writes a summary of the active model configuration and information about the different MCMC updaters that are active when it starts an MCMC analysis. What happened during the MCMC?

  1. The first part of steppingstone is running MCMC with the posterior distribution as the target distribution for mcmc.ncycles + ss.xcycles iterations. This starts at the line in the output that says "Setting chain boldness to 0 based on beta = 1." The boldness refers to how drastic the MCMC moves are (as we decrease β it is often helpful to make the MCMC proposals more drastic).
  2. The reference distribution is then fit to your samples from the posterior distribution. This happens in the section labelled "Working prior details". You will then see distributions for each of the parameters in the model. These distributions are summaries of the marginal distribution for each of the parameters.
  3. Once the reference distribution has been fit, Phycas can explore the power posteriors that blend the posterior (raised to the power β) and the reference distribution (raised to the power 1-β)
  4. MCMC is then conducted for decreasing β increments down to 0 (the number of increments is controlled by the ss.nbetavals command option).
  5. The sump command then reads the stepping stone .p files and uses importance sampling to estimate the marginal likelihood of the model. It will report information about the effective number of samples for different MCMC samples and the marginal likelihood as "Generalized Stepping Stone method"

You can look at the .p files in Tracer. The file contains samples from the different power posterior distributions that were sampled. You will probably be able to notice the effect of the beta changing by simply looking at the trace of the parameters sampled over the course of the run. In the context of steppingstone sampling, you cannot use Tracer's estimates of ESS, because Tracer will be calculating the autocorrelation across the entire sweep of β values.

Exit Phycas and launch it again to reset it. (you may have to use the cd command again to change Phycas's working directory.)

Conduct the steppingstone analysis for the partitioning of codon position and gene by running steppingstone2.py

2.2.4: Calculating the Bayes Factor.

The logarithm of the Bayes Factor supporting one partitioning scheme over another is simply the difference in the log marginal likelihoods. For the purpose of this exercise, you can simply select the model with the higher marginal likelihood, but if you want to report the Bayes Factor in a paper you can simply perform the subtraction of the numbers reported in the Generalized Stepping Stone method section of the output.

2.2.5: Conducting MCMC over trees

In the section above we were calculating the probability of the data given a model and a tree (we were using MCMC to marginalize over all of the parameter values). Now that we have chosen a model, we'll need to conduct MCMC over the space of all trees. The file steppingstone3.py starts with the commands that are needed to read the data file. You will need to uncomment the line that sets up the appropriate model, and remove the lines that generate the "ERROR" message, and then run the MCMC simulation.

Remember that the number of cycles chosen for this lab is much too small to provide reliable estimates of the marginal likelihood – the ncycles was chosen so that the analysis would complete in a reasonable amount of time.

I will explain the plots found in the pdf that Phycas's sumt command produces in the second half of the lab. Some are representations of the trees of interest (the majority-rule consensus tre and the trees with the highest posterior probabilities). The splits pdfs show the posteriors for different bipartitions of taxa through the run. These can be helpful in diagnosing poor MCMC mixing (though if you do very long Phycas runs you may have to disable the splits output by setting sumt.out.splits to None to avoid generating massive pdf files).

As within any MCMC analysis, it is a good idea to thoroughly check your results for signs of failure of the runs to converge. Check out the .p files in Tracer. If this were a real analysis, you should run multiple simulations to look for signs that the simulation is not exploring the surface well. By examining the traces of the parameters can you tell which parameters are mixing well? Is there any correlation between the mixing and which parameters are updated with slice sampling?.

Exercise 3: Polytomy priors

One flavor of Bayesian analysis that is implemented in Phycas, but not available in many other Bayesian phylogenetic programs is support for prior distributions that place non-zero probability on trees that are not fully-resolved (also referred to as trees with polytomies or trees with multifurcations -- you can also check out p4). The details of the algorithm (and justifications for the approach) can be found in Lewis et al. (2005). We will be using the same data set that was used in that paper which is a set of algal sequences published by Shoup and Lewis (2003). This time we will pay more attention to the model and MCMC settings than we did in the basic example above. Start Phycas from a clean slate:

  1. open a new window in the Phycas application
  2. change the working directory to a convenient location for working with the ShoupLewis.nex file.

Read the file into memory:

file_contents = readFile("ShoupLewis.nex")
mcmc.data_source = file_contents.characters

If you print the file_contents.characters attribute you should see that the matrix has 17 taxa and 3,341 characters.

3.1  Setting up the model of evolution
Now we will configure the model of character evolution. Use the model.help() invocation to see the current settings of the model and the explanations of the attributes. Note that the model.type is a string that can be "gtr", "hky" or "jc" and this controls whether some of the other settings are used (for instance the settings associated with kappa are only used if the model.type is "hky").

Because we are going to perform an analysis that uses a pseudorandom number generator, it is a good idea to explicitly set the seed so that we can repeat the analysis. You can specify any positive number as this "seed":

setMasterSeed(13957)

Let us configure Phycas to use the HKY model with Γ-distributed rate heterogeneity:

model.type = "hky"
model.num_rates = 4

The model.num_rates is the number of variable rates that are used to approximate the Gamma distribution. If you were to set model.num_rates = 1 then you would be selecting a single-rate model that does not use Γ-distributed rate heterogeneity at all.

We can set the value of the shape parameter of the Gamma distribution, but this is just the value that the sampler will start at. It is more important (in the context of setting up an MCMC run) to set the prior distribution for the shape parameter of the gamma distribution. We can do this by choosing from among the continuous probability distributions over non-negative numbers. An Exponential distribution will be fine for the purpose of the lab (The Gamma and Inverse-Gamma distributions are also possible choices):

model.gamma_shape_prior = Exponential(2.0)

This has just set the prior to be an exponential distribution with mean of 0.5.

NOTE: Phycas parameterizes the exponential distribution with a rate parameter that is the reciprocal of the mean of the distribution! If you are unsure, you can always "ask" a distribution what its mean is:

model.gamma_shape_prior.getMean()

Kappa (Κ), the transition/transversion rate ratio, is usually much greater than 1, so let us use a fairly vague prior with a mean of 4:

model.kappa_prior = Exponential(0.25)

The only other parameters in the HKY substitution model are the base frequencies1:

model.state_freq_param_prior = Gamma(1.0, 1.0)

3.1.1  The priors on the tree
The tree and branch lengths are also important parts of our model and they need to have priors assigned to them. Currently Phycas only supports a uniform prior over all tree topologies (within a class of trees with the same number of internal nodes), so we do not have to set a prior on the shapes of different resolved trees. However, we do need to say something about the branch lengths.

We could use an Exponential (or Gamma distribution) for the prior on all of the branch lengths, but we will also have to choose a mean value for the prior distribution. Choosing a reasonable prior is important for every parameter in a Bayesian analysis, but the branch length prior merits extra care because:

  1. estimating branch lengths accurately is often crucial to getting the topology correct,
  2. the prior will be applied to a large number of parameters (because there are so many branches in the tree).

Suchard et al. (2001) suggested the use of a hierarchical model in which the branch lengths are distributed as exponential variables with a mean of μ. We refer to μ as a "hyperparameter", and we have to place a prior distribution (a "hyperprior") on it. This has the advantage of producing a vague prior distribution on the branch length in which the mean of the prior distribution for the branch lengths is allowed to tune-itself to the data at hand. In Phycas, if you have the model.edgelen_hyperprior attribute set to a distribution (rather than None), you will be using this hierarchical model suggested by Suchard et al. (2001). For example:

model.edgelen_hyperprior = InverseGamma(2.10000, 0.90909)

uses an Inverse Gamma distribution with a shape parameter of 2.1 and a scale parameter of 0.90909 for the hyperparameter, μ. This distribution has mean 1 and variance 10.

3.1.2  MCMC over fully-resolved trees

Let us perform an initial run in the "standard" mode in which polytomies are not considered. We tell the mcmc command that we do not want to sample trees with polytomies:

mcmc.allow_polytomies = False
mcmc.starting_tree_source = randomtree()

Now we can configure the MCMC to run a short simulation:

mcmc.ncycles = 2000
mcmc.sample_every = 10

and save files with names that will remind us that they are from the no-polytomies run:

mcmc.out.trees.prefix = "no_p_trees"
mcmc.out.params.prefix = "no_p_params"
mcmc.out.log.prefix = "no_p_output"

Finally, we call mcmc() which starts the MCMC analysis:

mcmc()

After it has completed we can summarize the results. We can control the behavior of the summarization with:

sumt.outgroup_taxon = "Oedogonium cardiacum"
sumt.trees = "no_p_trees.t"
sumt.burnin = 101

We can specify the names of the files using the sumt.out attributes:

sumt.out.log.prefix = "no_p_sumt_output"
sumt.out.trees.prefix = "no_p_sumt_trees"
sumt.out.splits.prefix = "no_p_sumt_splits"
sumt()

Take a look at the summaries (they should be in the files no_p_sumt_splits.pdf and no_p_sumt_trees.pdf) and the strength of support for the clades.

Note that a couple of the internal branches have very short lengths but also display very high posterior probabilities.

Which groupings in the tree do you think will display posterior probabilities which are very sensitive to the prior over tree topologies?
[show answer]

3.2  Allowing polytomies
Let us redo the same analysis but allow for polytomies.

Even if you are writing your own scripts for this tutorial, you probably want to just copy the script from the NoPolytomy.py analysis and save the file as Polytomy.py2. Then we can just change the lines that differ between the two styles of analysis.

Here are the changes that we need to make to enable sampling over unresolved trees:

mcmc.allow_polytomies = True
mcmc.polytomy_prior = True
mcmc.topo_prior_C = 2.72

You must make these changes to the script before the line that says mcmc(). The first line tells the MCMC sampler to consider polytomies and the next two lines set up the prior distribution over unresolved trees.

Phycas supports priors on the trees based such that different "resolution classes" of trees have a particular prior probability, but the easiest prior to understand (and one that seems to work well in practice) is the "polytomy prior" discussed by Lewis et al. (2005). With this prior, you specify the ratio of the prior probability of a tree over the probability of another tree that is identical to it except that it has one fewer internal branch. This is the ratio associated with collapsing a branch in the tree (the ratio that favors the acceptance of a delete-edge form of the "bush move" in Phycas). Using a value for this prior ratio of e (the base of the natural logarithm) was suggested by Lewis et al. (2005). This value has no particular significance or justification in this context (there is no theory saying that you should use e), but it is aesthetically pleasing in that a tree with one more branch has to be more than one log-likelihood unit better in order to be favored over the tree with a polytomy. The mcmc.topo_prior_C option mentioned above is the prior ratio.

Make sure to rename your output files so that we do not overwrite the files that we just produced in the run that did not consider polytomies:

mcmc.out.trees.prefix = "polytomy_trees"
sumt.trees = "polytomy_trees.t"

mcmc.out.params.prefix = "polytomy_params"
mcmc.out.log.prefix = "polytomy_output"
sumt.out.log.prefix = "polytomy_sumt_output"
sumt.out.trees.prefix = "polytomy_sumt_trees"
sumt.out.splits.prefix = "polytomy_sumt_splits"

Now we have changed the command script to run the file. Save it as Polytomy.py in the same directory as your other files for this activity.

3.3  Executing a script
How do we tell Phycas to execute a script? There are actually three ways to do this:

3.3.1  Drag-and-drop
The easiest way to do this on a Mac with the Phycas.app bundle is to simply drag the Polytomy.py file onto the Phycas icon (in the dock if Phycas is running or on the to the application icon if it is not running). This will set the working directory for the Phycas run to be the directory that contains the script that was dragged onto the icon (so you may have to use absolute paths to data files in the script if the data files are not in the same directory).

3.3.2  Executing a script if you are not using the double-clickable application

If you are not running the double-clickable Phycas, then you can launch a new Terminal (or CMD.EXE application on Windows) session. If Phycas is not installed in a standard location you will have to execute the export PHYCAS_ROOT and source run_phycas_env.sh commands discussed at the beginning of the lab. From a properly initialized shell, you can then execute the following commands:

cd 
python -i Polytomy.py

3.3.3  Executing a script from within the same Phycas session
If you already have a Phycas session underway and do not want to start a new session, you can execute the command:

execfile("Polytomy.py")

to execute a Python script (in this case Polytomy.py).

3.4  Comparing the results
If all goes well, you should have the files created by the run: polytomy_params.p, polytomy_sumt_splits.pdf, polytomy_sumt_trees.pdf, polytomy_trees.t, and polytomy_sumt_trees.tre. Look at polytomy_sumt_splits.pdf for evidence of poor mixing during the MCMC; and compare the majority-rule consensus trees and the maximum a posteriori trees shown in polytomy_trees.pdf to those found in no_p_sumt_trees.pdf. Tracer and AWTY are also highly recommended for diagnosing lack of convergence in MCMC samples.

Is the maximum a posteriori tree topology a fully resolved tree?
[show answer]

Is the majority-rule consensus tree the same as the consensus from the "no polytomies" run?
[show answer]

Exercise 4: Using different priors for internal and terminal edge lengths

Yang and Rannala (2005) (and Yang 2007) suggested using prior probabilities on the internal edges of the tree that strongly favor short lengths as a means of reducing the strength of support for some groupings in a Bayesian phylogenetic analysis. In fact, they suggest that the prior be used to prefer shorter internal branches when you have a large number of characters in your data set. These suggestions are alternatives to the polytomy prior approach, and you can try it out from within Phycas ( see also Ziheng Yang's software site). It will be easiest to copy the NoPolytomy.py script that you created earlier and modify it.

In particular we need to tell Phycas that we do not want to use the hierarchical model for branch lengths. Then we need to set different prior distributions for the internal and terminal branch lengths.

By calling model.help() and mcmc.help() you should be able to find the commands that you will have to type to have an internal branch length prior with a mean of 0.0001 and a terminal edge prior with a mean of 0.1 (you can use Exponential distributions for both, but pass in different rate parameters for the Exponential).

The important settings to verify have been changed are:

model.edgelen_hyperprior = None
model.internal_edgelen_prior = Exponential(10000.0)
model.external_edgelen_prior = Exponential(10.0)

mcmc.allow_polytomies = False
mcmc.out.trees.prefix = "int_ext_trees"
mcmc.out.params.prefix = "int_ext_params"
mcmc.out.log.prefix = "int_ext_output"
sumt.trees = "int_ext_trees.t"
sumt.out.log.prefix = "int_ext_sumt_output"
sumt.out.trees.prefix = "int_ext_sumt_trees"
sumt.out.splits.prefix = "int_ext_sumt_splits"

Run the analysis, and see how the clade posterior probabilities compare to the previous analyses.

Do the results seem comparable to the other analyses?
[show answer]

Note that Yang and Rannala (2005) do not give specific recommendations about the exact values of the means of the priors to use. They discuss the way the prior mean should change as a function of the number of characters sampled, but do not advocate specific prior means. Thus, it is possible that the prior mean of 0.0001 is simply too low (and that higher means would result in significantly improved performance). For some data sets, the priors suggested by Yang and Rannala will result in a majority-rule consensus tree that is less resolved than the Bayesian analysis with polytomies forbidden and a vague prior over branch lengths.

Back to top

Conclusion and thanks

That is it for this activity! Thanks for trying out Phycas and providing feedback!

Please feel free to try out other aspects of Phycas, and let us know if there were parts of the activity or the Phycas user-interface that you found confusing.

If you are interested in programming in Python feel free to ask us about how you can use Phycas as a library of phylogenetic functions.

Funding sources include the CIPRES grant (NSF-DBI-0306047) to Dave Swofford and a NESCent sabbatical fellowship to Paul Lewis. Phycas is built using 5 open source libraries: boost, ncl (the NEXUS Class Library), iTerm, IPython, and (of course) Python.

Footnotes

  • 1 Internally, Phycas uses base frequency parameters that are unnormalized (0, ∞) variables. The base frequencies are obtained by dividing each of the parameters by the sum of the four. We place priors on the unnormalized versions, which is why we use a GammaDistribution in this example. Note that putting Gamma(a, 1) priors on the unnormalized versions of the parameters is equivalent to placing a Dirichlet(a, a, a, a) on the frequencies themselves. So using 4 Gamma(1.0, 1.0) priors for the base frequency parameters is equivalent to a Dirichlet(1, 1, 1, 1) prior on the base frequencies (and is also a uniform prior on the frequencies). [back]
  • 2 You can also use the history command in IPython to see the previous commands. This can be useful if you are using Phycas interactively and would like to turn your commands into a script. [back]