| Documentation |
 |
 |
main
PHYLIP
Phylogeny Inference Package
Version 3.66
July, 2006
by Joseph Felsenstein
Department of Genome Sciences and Department of Biology
University of Washington
Box 357730
Seattle, WA 98195-7730
USA
|
E-mail address: joe (at) gs.washington.edu
Contents of This Document
Contents of This Document
A Brief Description of the Programs
Copyright Notice for PHYLIP
The Documentation Files and How to Read Them
What The Programs Do
Running the Programs
A word about input files
Running the programs on a Windows machine
Running the programs on a Macintosh with Mac OS X
Running the programs on a Macintosh with Mac OS 8 or 9
Running the programs on a Unix or Linux system
Running the programs in MSDOS
Running the programs in background or under control of a command file
Preparing Input Files
Input and output files
Data file format
The Menu
The Output File
The Tree File
The Options and How To Invoke Them
Common options in the menu
The U (User tree) option
The G (Global) option
The J (Jumble) option
The O (Outgroup) option
The T (Threshold) option
The M (Multiple data sets) option
The W (Weights) option
The option to write out the trees into a tree file
The (0) terminal type option
The Algorithm for Constructing Trees
Local rearrangements
Global rearrangements
Multiple jumbles
Saving multiple tied trees
Strategy for finding the best tree
A Warning on Interpreting Results
Relative Speed of Different Programs and Machines
Relative speed of the different programs
Speed with different numbers of species
Relative speed of different machines
General Comments on Adapting the Package to Different Computer Systems
Compiling the programs
Unix and Linux
On Windows systems
Compiling with Cygnus Gnu C++
Compiling with Microsoft Visual C++
Compiling with Borland C++
Compiling with Metrowerks Codewarrior for Windows
Macintosh
Compiling with GCC on Mac OS X with our Makefile
Compiling with GCC on Mac OS X with X Windows
Compiling with Metrowerks Codewarrior
Creating the Metrowerks project file.
VMS VAX systems
Parallel computers
Other computer systems
Frequently Asked Questions
How to make it do various things
Background information needed:
Questions about distribution and citation:
Questions about documentation
Additional Frequently Asked Questions, or: "Why didn't it occur to you to ...
(Fortunately) obsolete questions
New Features in This Version
Coming Attractions, Future Plans
Endorsements
From the pages of Cladistics
... in the pages of other journals:
... and in the comments made by users when they register:
References for the Documentation Files
Credits
Other Phylogeny Programs Available Elsewhere
PAUP*
MacClade
MrBayes
MEGA
PAML
TREE-PUZZLE
DAMBE
Hennig86
RnA
NONA
TNT
How You Can Help Me
In Case of Trouble
A Brief Description of the Programs
PHYLIP, the Phylogeny Inference Package, is a package of programs for
inferring phylogenies (evolutionary trees). It has been distributed since
1980, and has over 20,000 registered users, making it the most widely
distributed package of phylogeny programs. It is available free, from
its web site:
PHYLIP is available as source code in C, and also as executables for
some common computer systems. It can infer phylogenies by parsimony,
compatibility, distance matrix methods, and likelihood. It can also
compute consensus trees, compute distances between trees, draw trees,
resample data sets by bootstrapping or jackknifing, edit trees, and
compute distance matrices. It can handle data that are nucleotide
sequences, protein sequences, gene frequencies, restriction sites,
restriction fragments, distances, discrete characters, and continuous
characters.
|
Copyright Notice for PHYLIP
The following copyright notice is intended to cover all source code, all
documentation, and all executable programs of the PHYLIP package.
© Copyright 1980-2006. University of Washington. All
rights reserved. Permission is granted to reproduce, perform, and modify
these programs and documentation files. Permission is granted to distribute
or provide access to these
programs provided that this copyright notice is not removed, the programs are
not integrated with or called by any product or service that generates
revenue, and that your distribution of these documentation files and programs
are free. Any modified
versions of these materials that are distributed or accessible shall indicate
that they are based on these program. Institutions of higher education are
granted permission to distribute this material to their students and staff
for a fee to recover distribution costs. Permission requests for any other
distribution of this program should be directed to license (at) u.washington.edu .
|
The Documentation Files and How to Read Them
PHYLIP comes with an extensive set of documentation files. These
include the main documentation file (this one), which you should read
fairly completely. In addition there are files for groups of programs,
including ones for the molecular sequence
programs, the distance matrix
programs, the
gene frequency and continuous characters
programs, the discrete characters programs,
and the tree drawing programs. Finally,
each program has its own documentation file. References for the
documentation files are all gathered together in this main documentation
file. A good strategy is to:
- Read this main documentation file.
- Tentatively decide which programs are of interest to you.
- Read the documentation files for the groups of programs that
contain those.
- Read the documentation files for those individual programs.
There is an excellent guide to using PHYLIP 3.6 also available. It was written by
Jarno Tuimala of the Center for Scientific Computing in Espoo, Finland and is
available as a PDF here.
What The Programs Do
Here is a short description of each of the programs. For more detailed
discussion you should definitely read the documentation file for the
individual program and the documentation file for the group of programs
it is in. In this list the name of each program is a link which will
take you to the documentation file for that program. Note that there is no
program in the PHYLIP package called PHYLIP.
- Protpars
- Estimates phylogenies from protein sequences (input using the
standard one-letter code for amino acids) using the parsimony method, in
a variant which counts only those nucleotide changes that change the amino
acid, on the assumption that silent changes are more easily accomplished.
- Dnapars
- Estimates phylogenies by the parsimony method using nucleic acid
sequences. Allows use the full IUB ambiguity codes, and estimates
ancestral nucleotide states. Gaps treated as a fifth nucleotide state.
It can also fo transversion parsimony. Can cope with multifurcations,
reconstruct ancestral states, use 0/1 character weights, and infer
branch lengths.
- Dnamove
- Interactive construction of phylogenies from nucleic acid
sequences, with their evaluation by parsimony and compatibility and the
display of reconstructed ancestral bases. This can be used to find
parsimony or compatibility estimates by hand.
- Dnapenny
- Finds all most parsimonious phylogenies for nucleic acid
sequences by branch-and-bound search. This may not be practical (depending
on the data) for more than 10-11 species or so.
- Dnacomp
- Estimates phylogenies from nucleic acid sequence data using
the compatibility criterion, which searches for the largest number of sites
which could have all states (nucleotides) uniquely evolved on the same
tree. Compatibility is particularly appropriate when sites vary greatly in
their rates of evolution, but we do not know in advance which are the less
reliable ones.
- Dnainvar
- For nucleic acid sequence data on four species, computes
Lake's and Cavender's phylogenetic invariants, which test alternative tree
topologies. The program also tabulates the frequencies of occurrence of the
different nucleotide patterns. Lake's invariants are the method which he
calls "evolutionary parsimony".
- Dnaml
- Estimates phylogenies from nucleotide sequences by maximum
likelihood. The model employed allows for unequal expected frequencies of
the four nucleotides, for unequal rates of transitions and transversions,
and for different (prespecified) rates of change in different categories of
sites, and also use of a Hidden Markov model of rates, with
the program inferring which sites have which rates. This also
allows gamma-distribution and gamma-plus-invariant sites distributions of
rates across sites.
- Dnamlk
- Same as Dnaml but assumes a molecular clock. The use of the
two programs together permits a likelihood ratio test of the
molecular clock hypothesis to be made.
- Proml
- Estimates phylogenies from protein amino acid sequences by maximum
likelihood. The PAM, JTT, or PMB models can be employed, and also use
of a Hidden Markov model of rates, with the program inferring which sites
have which rates. This also allows gamma-distribution and
gamma-plus-invariant sites distributions of rates across sites.
It also allows different rates of change at known sites.
- Promlk
- Same as Proml but assumes a molecular clock. The use of the
two programs together permits a likelihood ratio test of the
molecular clock hypothesis to be made.
- Dnadist
- Computes four different distances between species from nucleic acid
sequences. The distances can then be used in the distance matrix programs.
The distances are the Jukes-Cantor formula, one based on Kimura's 2-
parameter method, the F84 model used in Dnaml, and the LogDet distance.
The distances can also be corrected for gamma-distributed and
gamma-plus-invariant-sites-distributed rates of change in different sites.
Rates of evolution can vary among sites in a prespecified way, and also
according to a Hidden Markov model. The program can also make a table of
percentage similarity among sequences.
- Protdist
- Computes a distance measure for protein sequences, using
maximum likelihood estimates based on the Dayhoff PAM matrix,
the JTT matrix model, the PBM model, Kimura's 1983 approximation to these,
or a model based on the genetic code plus a constraint on changing to a
different category of amino acid. The distances can also be corrected for
gamma-distributed and gamma-plus-invariant-sites-distributed rates of change
in different sites. Rates of evolution can vary among sites in a
prespecified way, and also according to a Hidden Markov model. The program
can also make a table of percentage similarity among sequences. The
distances can be used in the distance matrix programs.
- Restdist
- Distances calculated from restriction sites data or
restriction fragments data. The restriction sites option is the one to
use to also make distances for RAPDs or AFLPs.
- Restml
- Estimation of phylogenies by maximum likelihood using
restriction sites data (not restriction fragments but presence/absence of
individual sites). It employs the Jukes-Cantor symmetrical model of
nucleotide change, which does not allow for differences of rate between
transitions and transversions. This program is very slow.
- Seqboot
- Reads in a data set, and produces multiple data sets from
it by bootstrap resampling. Since most programs in the current version of
the package allow processing of multiple data sets, this can be used
together with the consensus tree program Consense to do bootstrap (or
delete-half-jackknife) analyses with most of the methods in this package.
This program also allows the Archie/Faith technique of permutation of
species within characters. It can also rewrite a data set to convert
it from between the PHYLIP Interleaved and Sequential forms, and into
a preliminary version of a new XML sequence alignment format
which is under development and which is described in the
Seqboot documentation web page.
- Fitch
- Estimates phylogenies from distance matrix data under the
"additive tree model" according to which the distances are expected to
equal the sums of branch lengths between the species. Uses the
Fitch-Margoliash criterion and some related least squares criteria, or
the Minimum Evolution distance matrix method. Does
not assume an evolutionary clock. This program will be useful with
distances computed from molecular sequences, restriction sites or fragments
distances, with DNA hybridization measurements, and with genetic distances
computed from gene frequencies.
- Kitsch
- Estimates phylogenies from distance matrix data under the
"ultrametric" model which is the same as the additive tree model except
that an evolutionary clock is assumed. The Fitch-Margoliash criterion and
other least squares criteria, or the Minimum Evolution criterion are
possible. This program will be useful with
distances computed from molecular sequences, restriction sites or
fragments distances, with distances from DNA hybridization measurements,
and with genetic distances computed from gene frequencies.
- Neighbor
- An implementation by Mary Kuhner and John Yamato of Saitou and
Nei's "Neighbor Joining Method," and of the UPGMA (Average Linkage
clustering) method. Neighbor Joining is a distance matrix method producing
an unrooted tree without the assumption of a clock. UPGMA does assume a
clock. The branch lengths are not optimized by the least squares criterion
but the methods are very fast and thus can handle much larger data sets.
- Contml
- Estimates phylogenies from gene frequency data by maximum
likelihood under a model in which all divergence is due to genetic drift in
the absence of new mutations. Does not assume a molecular clock. An
alternative method of analyzing this data is to compute Nei's genetic
distance and use one of the distance matrix programs.
This program can also do maximum likelihood analysis of continuous
characters that evolve by a Brownian Motion model, but it assumes that
the characters evolve at equal rates and in an uncorrelated fashion, so
that it does not take into account the usual correlations of characters.
- Gendist
- Computes one of three different genetic distance formulas
from gene frequency data. The formulas are Nei's genetic distance, the
Cavalli-Sforza chord measure, and the genetic distance of Reynolds et. al.
The former is appropriate for data in which new mutations occur in an
infinite isoalleles neutral mutation model, the latter two for a model
without mutation and with pure genetic drift. The distances are written to
a file in a format appropriate for input to the distance matrix programs.
- Contrast
- Reads a tree from a tree file, and a data set with continuous
characters data, and produces the independent contrasts for those
characters, for use in any multivariate statistics package. Will also
produce covariances, regressions and correlations between characters for
those contrasts. Can also correct for within-species sampling variation
when individual phenotypes are available within a population.
- Pars
- Multistate discrete-characters parsimony method. Up to 8 states
(as well as "?") are allowed. Cannot do Camin-Sokal or Dollo Parsimony.
Can cope with multifurcations, reconstruct ancestral states, use character
weights, and infer branch lengths.
- Mix
- Estimates phylogenies by some parsimony methods for discrete
character data with two states (0 and 1). Allows use of the
Wagner parsimony method, the Camin-Sokal parsimony method, or arbitrary
mixtures of these. Also reconstructs ancestral states and allows weighting
of characters (does not infer branch lengths).
- Move
- Interactive construction of phylogenies from discrete character
data with two states (0 and 1). Evaluates parsimony and compatibility
criteria for those phylogenies and displays reconstructed states throughout
the tree. This can be used to find parsimony or compatibility estimates by
hand.
- Penny
- Finds all most parsimonious phylogenies for discrete-character
data with two states, for the Wagner, Camin-Sokal, and mixed parsimony
criteria using the branch-and-bound method of exact search. May be
impractical (depending on the data) for more than 10-11 species.
- Dollop
- Estimates phylogenies by the Dollo or polymorphism parsimony
criteria for discrete character data with two states (0 and 1). Also
reconstructs ancestral states and allows weighting of characters. Dollo
parsimony is particularly appropriate for restriction sites data; with
ancestor states specified as unknown it may be appropriate for restriction
fragments data.
- Dolmove
- Interactive construction of phylogenies from discrete
character data with two states (0 and 1) using the Dollo or polymorphism
parsimony criteria. Evaluates parsimony and compatibility criteria for
those phylogenies and displays reconstructed states throughout the tree.
This can be used to find parsimony or compatibility estimates by hand.
- Dolpenny
- Finds all most parsimonious phylogenies for
discrete-character data with two states, for the Dollo or polymorphism
parsimony criteria using the branch-and-bound method of exact search. May
be impractical (depending on the data) for more than 10-11 species.
- Clique
- Finds the largest clique of mutually compatible characters, and
the phylogeny which they recommend, for discrete character data with two
states. The largest clique (or all cliques within a given size range of
the largest one) are found by a very fast branch and bound search method.
The method does not allow for missing data. For such cases the T
(Threshold) option of Pars or Mix may be a useful alternative.
Compatibility methods are particular useful when some characters are of
poor quality and the rest of good quality, but when it is not known in
advance which ones are which.
- Factor
- Takes discrete multistate data with character state trees and
produces the corresponding data set with two states (0 and 1). Written by
Christopher Meacham. This program was formerly used to accomodate
multistate characters in Mix, but this is less necessary now that PARS is
available.
- Drawgram
- Plots rooted phylogenies, cladograms, circular trees and phenograms in a
wide variety of user-controllable formats. The program is interactive and
allows previewing of the tree on PC, Macintosh, or X Windows screens,
or on Tektronix or Digital graphics terminals. Final output can be
to a file formatted for one of the drawing programs, for a ray-tracing or
VRML browser, or one at can be sent to a laser printer (such as Postscript
or PCL-compatible printers), on graphics screens or terminals, on pen
plotters or on dot matrix printers capable of graphics.
- Drawtree
- Similar to Drawgram but plots unrooted phylogenies.
- Treedist
- Computes the Branch Score distance between trees, which allows for
differences in tree topology and which also makes use of branch lengths. Also
computes the Robinson-Foulds symmetric difference distance between trees, which
allows for differences in tree topology but does not use branch lengths.
- Consense
- Computes consensus trees by the majority-rule consensus tree
method, which also allows one to easily find the strict consensus tree.
Is not able to compute the Adams consensus tree. Trees are input in a tree
file in standard nested-parenthesis notation, which is produced by many of
the tree estimation programs in the package. This program can be used as
the final step in doing bootstrap analyses for many of the methods in the
package.
- Retree
- Reads in a tree (with branch lengths if necessary) and allows
you to reroot the tree, to flip branches, to change species names and
branch lengths, and then write the result out. Can be used to convert
between rooted and unrooted trees, and to write the tree into a
preliminary version of a new XML tree file format which is under
development and which is described in the
Retree documentation web page.
Running the Programs
This section assumes that you have obtained PHYLIP as compiled executables
(for Windows, Mac OS X, Mac OS 9 or Linux), or have obtained the source code
and compiled it yourself (for Linux, Unix, or OpenVMS). For machines for
which compiled executables are available, there will usually be no need for
you to have a compiler or compile the programs yourself. This section
describes how to run the programs. Later in this document we will
discuss how to download and install PHYLIP (in case you are
reading this without yet having done that). Normally you will only read
your copy of this document after downloading and installing PHYLIP.
A word about input files.
For all of these types of machines, it is
important to have the input files for the programs (typically data files)
prepared in advance. They can be prepared in any editor, but it is important
that they be saved in Text Only ("flat ASCII") format, not in the format that
word processors such as Microsoft Word want to write. It is up to you to read
the PHYLIP documentation files which describe the files formats that are
needed. There is a partial description in the next section of this document.
The input files can also be obtained by running a program that
produces output files in PHYLIP format (some of these programs do, and so do
programs by others such as sequence alignment programs such as ClustalW and
sequence format conversion programs such as Readseq). There is not any
input file editor available in any program in PHYLIP (you should not
simply start running one of the programs and then expect to click a mouse
somewhere to start creating a data file).
When they start running, the programs look first for input files with
particular names (such as infile, treefile, intree, or fontfile).
Exactly which file names they look for varies a bit from program to program,
and you should read the documentation file for the particular program to
find out. If you have files with those names the programs will use them
and not ask you for the file name. If they do not find files of those
names, the programs will say that they cannot find a file of that name, and
ask you to type in the file name.
For example, if DnaML looks
for the file infile and does not find one of that name,
it prints the message:
dnaml: can't find input file "infile"
Please enter a new file name>
|
This does not mean that an error
has occurred. All you need to do is to type in the name of the file.
The program looks for the input files in the same folder that the
program is in (a folder is the same thing as a "directory"). In Windows,
Mac OS X, Linux, or Unix, if you are asked for the
file name you can type in the path to the file, as part of the name (thus,
if the file is in the folder containing the current folder, you can type in
a file name such as ../myfile.dna). If you do not know what a
"folder" is, or what "above" means, then you are a member of the new
generation who just clicks the mouse and assumes that a list of file names
will magically appear. (Typically members of this generation have no idea
where the files are on their system, and accumulate enormous amounts of
unnecessary clutter in their file systems.) In this case you should ask
someone to explain folders to you.
Running the programs on a Windows machine.
Double-click on the icon for
the program. A window should open with a menu in it. Further dialog with the
program occurs
by typing on the keyboard in response to what you see in the window. The
programs can be interrupted either by typing Control-C (which means to
press down on the Ctrl key while typing the letter C), or by using
the mouse to open the File menu in the upper-left corner of the program's
window area and then select Quit. Other than this, most PHYLIP programs
make no use of the mouse. The tree-drawing programs Drawtree and Drawgram
do allow use of the mouse to select some options.
The programs open a window for their menus. This window may be too small for
your tastes. They can be resized by tugging on the lower-right corner of the
window. In addition, the font may be too small. On most versions of Windows,
you can click on the small icon
symbol at the upper-left corner of the window, and choose the Properties menu choice there. One of its options is to change the font and size of
the print. I prefer large font sizes such as 16x12. Some versions of Windows
wuch as Windows XP also ask you if you want to apply this choice to all
such windows, not just to the current one.
The programs can also be run in a Command window under Windows, in much the
same way as they were under the MSDOS operating system, which is what the
Command window emulates. Type the name of the program
in lower-case letters (such as dnaml). To interrupt the program while
it is running, type Control-C (which means to press down on the Ctrl key
while typing the letter C).
Running the programs on a Macintosh with Mac OS X
We have provided a Mac OS X version of the executables. The programs can be
run by clicking on their icons. It is also possible to run them using a
Terminal window by typing the correct name, but this is harder.
The executables open a Terminal window in which
the menus appear and in which responses can be typed.
The programs can be interrupted by typing a control-C (press down the "control"
key in the lower-left corner of the keyboard and then type "c"), or by
closing the Terminal window by using the red button in the upper-left
corner of the window.
On Mac OS X systems the Mac OS 8 or 9 executables can also be run
using the Classic environment
included with it which allows executables for earlier versions of
Mac OS to run.
If your Mac OS X system has the X windows windowing system installed,
you could also make excutables that will work with that. See below, under
"Compiling with GCC on Mac OS X with X Windows" for instructions on how to do
this.
One problem we have often encountered using Mac OS X is that it is possible for
data files to have the wrong kind of characters at the ends of their lines.
They may have carriage-return (ASCII/ISO 13 or control-M) characters at the
ends of their lines when they should instead have the Unix newline character
(ASCII/ISO 10 or control-J) there. This can happen with files transferred
from other operating systems or files produced in some word processors.
It results in segmentation-fault or memory errors. If you encounter these,
check this possibility carefully.
Running the programs on a Macintosh with Mac OS 8 or 9
Double-click on the icon for
the program. A window should open. Further dialog with the program occurs
by typing on the keyboard in response to what you see in the window. The
programs can be interrupted by using
the mouse to open the File menu in the upper-left corner of the program's
window area and then select Quit. Alternatively, you can use the
Command-Q key combination.
When you use Quit, the program will ask you whether you want to save
a file whose name is the program name (often followed by .out -- for
example, if you are using Dnaml it will ask you if you want to save file
Dnaml.out. This file is simply a record of everything that
displayed on the program window, and you usually will not want to save it.
Pressing the Enter key or selecting the Do Not Save button with
the mouse will keep this from being saved.
If you encounter memory limitations on a Mac OS 8 or 9 Macintosh,
and determine that
this is not due to a problem with the format of the input file, as it
often will be, you may be able to solve it by raising the limits of the
stack and heap sizes of the program. To do this click on the program
and then select Get Info from the Finder File menu.
This will open a window which can be made to show the memory limits
of the program. These can be changed by selecting them and typing in
larger numbers. This may relieve nagging memory problems. If it does
not, consult your local documentation and suspect problems with your
input file format.
Running the programs on a Unix or Linux system.
Type the name of the program
in lower-case letters (such as dnaml). To interrupt the program while
it is running, type Control-C (which means to press down on the Ctrl key
while typing the letter C).
On some systems you may need to type ./ before the program name,
so that in the above case it would be ./dnaml. This is mostly
needed if the users PATH does not include their current directory, something
which is often done as a security precaution.
Running the programs in background or under control of a command file
In running the programs, you may sometimes want to put them in background
so you can proceed with other work. On systems with a windowing environment
they can be put in their own window, and commands like the Unix and Linux
nice command used to make
them have lower priority so that they do not interfere with interactive
applications in other windows. This part of the discussion will
assume either a Windows system or a Unix or Linux system. I will
note when the commands work on one of these systems but not the other.
Mac OS X is actually Unix (surprise! surprise!) and you can
run PHYLIP programs in background on any Mac OS X system by simply following
the instructions for Unix, using a terminal window to do so if necessary.
Running jobs in background on Mac OS 8 or 9 systems is an arcane art into whose
mysteries I have not been initiated (or perhaps no one has been initiated).
If there is no windowing
environment, on a Unix or Linux system you will want to use an
ampersand (&) after the command file name when invoking it to put the
job in the background. You will have to put all the responses to the
interactive menu of the program into a file and tell the background job
to take its input from that file.
On Windows systems there is no & or nice command
but input and output redirection and command files work fine in a Commmand
window. A command file can either be invoked by clicking on its icon or
by typing its name from a Command window. The a file of commands must have
a name ending in .bat, such as foofile.bat. You can
run the batch file from a Command window by typing its name (such as
foofile) without the .bat.
For example: suppose you want to run Dnapars in a background, taking its
input data from a file called sequences.dat, putting its interactive
output to file called screenout, and using a file called input as
the place to store the interactive input. The file input need only
contain two lines:
which is what you would have typed to run the program interactively, in
response to the program's request for an input file name if it did not
find a file named infile, in in response the the menu.
To run the program in background, in Unix or Linux you would simply give the command:
dnapars < input > screenout &
These run the program with input responses coming from input and
interactive output being put into file screenout. The usual output
file and tree file will also be created by this run (keep that in mind
as if you run any other PHYLIP program from the same directory while
this one is running in background you may overwrite the output file from
one program with that from the other!).
If you wanted to give the program lower priority, so that it would
not interfere with other work, and you have Berkeley Unix type job control
facilities in your Unix or Linux (and you usually do), you can use the
nice command:
nice +10 dnapars < input > screenout &
which lowers the priority of the run. To also time the run and put the
timing at the end of screenout, you can do this:
nice +10 ( time dnapars < input ) >& screenout &
which I will not attempt to explain.
On Unix or Linux systems
you may also want to explore putting the interactive output into the
null file /dev/null so as to not be bothered with it (but then you
cannot look at it to see why something went wrong). If you have problems
with creating output files that are too large, you may want to
explore carefully the turning off of options in the programs you run.
If you are doing several runs in one, as for example when you do a
bootstrap analysis using Seqboot, Dnapars (say), and Consense, you
can use an editor to create a "command file" with these commands:
seqboot < input1 > screenout
mv outfile infile
dnapars < input2 >> screenout
mv outtree intree
consense < input3 >> screenout
|
This is the Unix or Linux version -- in the Windows version, the renaming
of files and the appending of output to the file screenout is
handled differently.
On Unix or Linux the command file might be named something like
foofile, and on Windows systems might be named foofile.bat.
On Unix or Linux the command file must be given
execute permission by using the command chmod +x foofile followed
by the command rehash. The job that foofile describes
can be run in background on Unix or Linux by giving the command
foofile &
On Windows systems it can be run by
clicking on the icon of the command file. Its icon will have a little gear
symbol.
Note that you must also have the interactive input
commands for Seqboot (including the random number seed), Dnapars, and
Consense in the separate files input1, input2, and input3.
Note also that when PHYLIP programs attempt to open a new output file (such as
outfile, outtree, or plotfile, if they see
a file of that name already in existence they will ask you if you want to
overwrite it, and offer alternatives including writing to another file,
appending information to that file, or quitting the program without writing to
the file. This means that in writing batch files it is important to know
whether there will be a prompt of this sort. You must know in advance
whether the file will exist. You may want to put in your batch file a
command that tests for the existence of a pre-existing output file and
if so, removes it. You might even want to put in a command that creates a
file of that name, so that you can be sure it is there! Either way,
you will then know whether to put into your file of keyboard responses the
proper response to the inquiry about overwriting that output file.
Preparing Input Files
The input files for PHYLIP programs must be prepared separately - there is
no data editor within PHYLIP. You can use a word processor (or text
editor) to prepare them yourself, or you can use a program that produces
a PHYLIP-format output. Sequence alignment programs such as ClustalW
commonly have an option to produce PHYLIP files as output, and some
other phylogeny programs, such as MacClade and TreeView, are capable of
producing a PHYLIP-format file.
The format of the input files is discussed below, and you should also
read the other PHYLIP documentation relevant to the particular type of
data that you are using, and the particular programs you want to run, as
there will be more details there.
It is very important that the input files be in "Text Only" or "flat
ASCII" format. This means that they contain only printable ASCII/ISO
characters, and not any unprintable characters. Many word processors such
as Microsoft Word save their files in a format that contains unprintable
characters, unless you tell them not to. For Microsoft Word you can
select Save As from its File menu, and choose Text Only
as the file format. This can also be done in WordPad utility in Windows .
Other word processors will have equivalent
options. Text editors such as the vi and emacs editors on
Unix and Linux, Windows Notepad, the SimpleText editor in Mac OS, or the pico
editor that comes with the pine
mailer program, produce their files in Text Only format and should not
cause any trouble.
Input and output files
For most of the PHYLIP programs, information comes from a series of
input files, and ends up in a series of output files:
-------------------
| |
infile ---------> | |
| |
intree ---------> | | -----------> outfile
| |
weights --------> | program | -----------> outtree
| |
categories -----> | | -----------> plotfile
| |
fontfile -------> | |
| |
-------------------
|
The programs interact with the user by presenting a menu. Aside from the
user's choices from the menu, they read
all other input from files. These files have default names. The program
will try to find a file of that name - if it does not, it will ask the
user to supply the name of that file.
Input data such as DNA sequences
comes from a file whose default name is infile. If the user
supplies a tree, this is in a file whose default name is intree.
Values of weights for the characters are in weights, and the
tree plotting program need some digitized fonts which are supplied in
fontfile (all these are default names).
For example, if DnaML looks
for the file infile and does not find one of that name,
it prints the message:
dnaml: can't find input file "infile"
Please enter a new file name>
|
This simply means that it wants you to type in the name of the
input file.
Data file format
I have tried to adhere to a rather stereotyped input and output
format. For the parsimony, compatibility and maximum likelihood programs,
excluding the distance matrix methods, the simplest version of the input
data file looks something like this:
6 13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC
|
The first line of the input file contains the number of species and the
number of characters (in this case sites). These are in free format, separated
by blanks. The information for each species follows, starting with a
ten-character species name (which can include blanks and some punctuation
marks), and continuing with the characters for that species. The name should
be on the same line as the first character of the data for that species.
(I will use the term "species" for the tips of the trees, recognizing
that in some cases these will actually be populations or individual gene
sequences).
The name should be ten characters in length, filled out to the full
ten characters by blanks if shorter. Any printable ASCII/ISO character is
allowed in the name, except for parentheses ("(" and ")"), square
brackets ("[" and "]"), colon (":"), semicolon (";") and comma (",").
If you forget to extend the names to ten characters in length by blanks,
the program will get out of synchronization with the contents of the data
file, and an error message will result.
Note that Tab characters count as only one character in the species names.
Their inclusion can cause trouble. The name will appear to you to be filled
out to a full ten characters, but it may be shorter than that to the program.
If you make the data file in a word processor such as Word, you would be
well-advised to make sure the names contain no Tab characters. You can check
for their presence by advancing through the names with your cursor keys, and
looking for sudden jumps of two or more characters. It is best to fill the
names out with blanks, not Tabs.
In the
discrete-character programs, DNA sequence programs and protein sequence
programs the characters are each a
single letter or digit, sometimes separated by blanks. In
the continuous-characters programs they are real numbers with decimal points,
separated by blanks:
Latimeria 2.03 3.457 100.2 0.0 -3.7
The conventions about continuing the data beyond one line per species are
different between the molecular sequence programs and the others. The
molecular sequence programs can take the data in "aligned" or "interleaved"
format, in which we first have some lines giving the first part of each of the
sequences, then some
lines giving the next part of each, and so on. Thus the sequences might
look like this:
6 39
Archaeopt CGATGCTTAC CGCCGATGCT
HesperorniCGTTACTCGT TGTCGTTACT
BaluchitheTAATGTTAAT TGTTAATGTT
B. virginiTAATGTTCGT TGTTAATGTT
BrontosaurCAAAACCCAT CATCAAAACC
B.subtilisGGCAGCCAAT CACGGCAGCC
TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC
|
Note that in these sequences we have a blank every
ten sites to make them easier to read: any such blanks are allowed. The blank
line which separates the two groups of lines (the ones
containing sites 1-20 and ones containing sites 21-39) may or may not
be present, but if it is, it should be a line of zero length and not contain
any extra blank
characters (this is because of a limitation of the current versions
of the programs). It is important that the number of sites in each
group be the same for all species (i.e., it will not be possible to run
the programs successfully if the first species line contains 20 bases, but
the first line for the second species contains 21 bases).
Alternatively, an option can be selected in the menu to take the data in
"sequential" format, with all of the data for the first species,
then all of the characters for the next species, and so on. This is also
the way that the discrete characters programs and the gene frequencies
and quantitative characters programs want to read the data. They do not
allow the interleaved format.
In the sequential format, the character data can run on to a new line at any
time (except in the middle of a species name or, in the case of continuous
character and distance matrix programs where you cannot go to a new line in
the middle of a real number). Thus it is legal to have:
Archaeopt 001100
1101
or even:
Archaeopt
0011001101
though note that the full ten characters of the species name must
then be present: in the above case there must be a blank after the "t". In all
cases it is possible to put internal blanks between any of the character
values, so that
Archaeopt 0011001101 0111011100
is allowed.
Note that you can convert molecular sequence data between the interleaved
and the sequential data formats by using the Rewrite option of the J
menu item in Seqboot.
If you make an error in the format of the input file, the programs can
sometimes detect that
they have been fed an illegal character or illegal numerical value and issue
an error message such as BAD CHARACTER STATE:, often printing out the
bad value, and sometimes the number of the species and character in which it
occurred. The program will then stop shortly after. One of the things which
can lead to a bad value is the omission of something earlier in the file, or
the insertion of something superfluous, which cause the reading of the file to
get out of synchronization. The program then starts reading things it
didn't expect, and concludes that they are in error. So if you see this error
message, you may also want
to look for the earlier problem that may have led to the program becoming
confused about what it is reading.
Some options are described below, but you should also read the documentation
for the groups of the programs and for the individual programs.
The Menu
The menu is straightforward. It typically looks like this (this one is for
Dnapars):
DNA parsimony algorithm, version 3.6
Setting for this run:
U Search for best tree? Yes
S Search option? More thorough search
V Number of trees to save? 10000
J Randomize input order of sequences? No. Use input order
O Outgroup root? No, use as outgroup species 1
T Use Threshold parsimony? No, use ordinary parsimony
N Use Transversion parsimony? No, count all steps
W Sites weighted? No
M Analyze multiple data sets? No
I Input sequences interleaved? Yes
0 Terminal type (IBM PC, ANSI, none)? ANSI
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Print out steps in each site No
5 Print sequences at all nodes of tree No
6 Write out trees onto tree file? Yes
Y to accept these or type the letter for one to change
|
If you want to accept the default settings (they are shown in the above case)
you can simply type Y followed by pressing on the Enter key.
If you want to change any of the options, you should type the letter
shown to the left of its entry in the menu. For example, to set a threshold
type T. Lower-case letters will also work. For many of the options
the program will ask for supplementary information, such as the value of
the threshold.
Note the Terminal type entry, which you will find on all menus. It
allows you to specify which type of terminal your screen is. The options
are an IBM PC screen, an ANSI standard terminal, or none.
Choosing zero (0) toggles
among these three options in cyclical order, changing each time the 0
option is chosen. If one of them is right for your terminal the screen will be
cleared before the menu is displayed. If none works, the none option
should probably be chosen. The programs should start with a terminal option
appropriate for your computer, but if they do not, you can change the
terminal type manually. This is particularly important in program Retree
where a tree is displayed on the screen - if the terminal type is set to the
wrong value, the tree can look very strange.
The other numbered options control which information the program will
display on your screen or on the output files. The option to Print
indications of progress of run will show information such as the names of
the species as they are successively added to the tree, and the
progress of rearrangements. You will usually want to see these as
reassurance that the program is running and to help you estimate how long
it will take. But if you are running the program "in background" as can be
done on multitasking and multiuser systems, and do not have the
program running in its own window, you may want to turn this option off so
that it does not disturb your use of the computer while the program is
running. Note also menu option 3, "Print out tree". This can be useful
when you are running many data sets, and will be using the resulting trees
from the output tree file. It may be helpful to turn off the printing out
of the trees in that case, particularly if those files would be too big.
The Output File
Most of the programs write their output onto a file called (usually) outfile, and a representation of the trees found onto a file called
outtree.
The exact contents of the output file vary from program to program and also
depend on which menu options you have selected. For many programs, if you
select all possible output information, the output will consist of
(1) the name of the program and its
version number, (2) some of the input information printed out, and (3) a series of
phylogenies, some with associated information indicating how much change
there was in each character or on each part of the tree. A typical rooted tree
looks like this:
+-------------------Gibbon
+----------------------------2
! ! +------------------Orang
! +------4
! ! +---------Gorilla
+-----3 +--6
! ! ! +---------Chimp
! ! +----5
--1 ! +-----Human
! !
! +-----------------------------------------------Mouse
!
+------------------------------------------------Bovine
|
The interpretation of the tree is fairly straightforward: it "grows"
from left to right. The numbers at the forks are arbitrary and are used (if
present) merely to identify the forks. For many of the programs the tree
produced is unrooted. Rooted and unrooted trees are printed in nearly the
same form, but the unrooted ones are accompanied by the
warning message:
remember: this is an unrooted tree!
to indicate that this is an unrooted tree and to warn against
taking the position of its root too seriously. Mathematicians still call
an unrooted tree a tree, though some systematists unfortunately use the term
"network" for an unrooted tree. This conflicts with standard mathematical
usage, which reserves the name "network" for a completely different kind of
graph). The root of this tree could be anywhere, say on the line leading
immediately to Mouse. As an exercise,
see if you can tell whether the following tree is or is not a different
one from the above:
+-----------------------------------------------Mouse
!
+---------4 +------------------Orang
! ! +------3
! ! ! ! +---------Chimp
---6 +----------------------------1 ! +----2
! ! +--5 +-----Human
! ! !
! ! +---------Gorilla
! !
! +-------------------Gibbon
!
+-------------------------------------------Bovine
remember: this is an unrooted tree!
|
(it is not different). It is important also to realize that the
lengths of the segments of the printed tree may not be significant: some
may actually represent branches of zero length, in the sense that there is no
evidence that
those branches are nonzero in length. Some of the diagrams of trees attempt
to print branches approximately proportional to estimated
branch lengths, while in others the lengths are purely conventional and
are presented just to make the topology visible. You will have to look closely
at the documentation that accompanies each program to see what it presents
and what is known about the lengths of the branches on the tree. The above
tree attempts to represent branch lengths approximately in the diagram. But
even in those cases, some of the smaller branches are likely to be
artificially lengthened to make the tree topology clearer. Here is what
a tree from Dnapars looks like, when no attempt is made to make the
lengths of branches in the diagram proportional to estimated branch
lengths:
+--Human
+--5
+--4 +--Chimp
! !
+--3 +-----Gorilla
! !
+--2 +--------Orang
! !
+--1 +-----------Gibbon
! !
--6 +--------------Mouse
!
+-----------------Bovine
remember: this is an unrooted tree!
|
When a tree has branch lengths, it will be accompanied by a table showing
for each branch the numbers (or names) of the nodes at each end of the
branch, and the length of that branch. For the first tree shown above,
the corresponding table is:
Between And Length Approx. Confidence Limits
------- --- ------ ------- ---------- ------
1 Bovine 0.90216 ( 0.50346, 1.30086) **
1 Mouse 0.79240 ( 0.42191, 1.16297) **
1 2 0.48553 ( 0.16602, 0.80496) **
2 3 0.12113 ( zero, 0.24676) *
3 4 0.04895 ( zero, 0.12668)
4 5 0.07459 ( 0.00735, 0.14180) **
5 Human 0.10563 ( 0.04234, 0.16889) **
5 Chimp 0.17158 ( 0.09765, 0.24553) **
4 Gorilla 0.15266 ( 0.07468, 0.23069) **
3 Orang 0.30368 ( 0.18735, 0.41999) **
2 Gibbon 0.33636 ( 0.19264, 0.48009) **
* = significantly positive, P < 0.05
** = significantly positive, P < 0.01
|
Ignoring the asterisks and the approximate confidence limits, which will be
described in the documentation file for Dnaml, we can see that the table
gives a more precise idea of what the lengths of all the branches are.
Similar tables exist in distance matrix and likelihood programs, as well
as in the parsimony programs Dnapars and Pars.
Some of the parsimony programs in the package can print out a table
of the number of steps that different characters (or sites) require on
the tree. This table may not be obvious at first. A typical example looks like
this:
steps in each site:
0 1 2 3 4 5 6 7 8 9
*-----------------------------------------
0! 2 2 2 2 1 1 2 2 1
10! 1 2 3 1 1 1 1 1 1 2
20! 1 2 2 1 2 2 1 1 1 2
30! 1 2 1 1 1 2 1 3 1 1
40! 1
|
The numbers across the top and down the side indicate which site
is being referred to. Thus site 23 is column "3" of row "20"
and has 1 step in this case.
There are many other kinds of information that can appear in the
output file, They vary from program to program, and we leave their
description to the documentation files for the specific programs.
The Tree File
In output from most programs,
a representation of the tree is also written into the tree file
outtree. The tree is specified by nested pairs
of parentheses, enclosing
names and separated by commas. We will describe how this works
below. If there are any blanks in the names,
these must be replaced by the underscore character "_". Trailing blanks
in the name may be omitted. The pattern of the parentheses indicates
the pattern of the tree by having each pair of parentheses enclose all
the members of a monophyletic group. The tree file could look like this:
((Mouse,Bovine),(Gibbon,(Orang,(Gorilla,(Chimp,Human)))));
In this tree the first fork separates the lineage leading to
Mouse and Bovine from the lineage leading to the rest. Within the
latter group there is a fork separating Gibbon from the rest, and so on.
The entire tree is enclosed in an outermost pair of parentheses. The tree ends
with a semicolon. In some programs such as Dnaml, Fitch, and Contml,
the tree will be unrooted. An unrooted tree should have its
bottommost fork have a
three-way split, with three groups separated by two commas:
(A,(B,(C,D)),(E,F));
Here the three groups at the bottom node are A, (B,C,D), and
(E,F). The single three-way split corresponds to one of the interior
nodes of the unrooted tree (it can be any interior node of the tree). The
remaining forks are encountered as you move out from that first node.
In newer programs, some are able to tolerate these other forks being
multifurcations (multi-way splits).
You should check the documentation files
for the particular programs you are using to see in which of these forms
you can expect the user tree to be in. Note that many of the programs
that actually estimate an unrooted tree (such as Dnapars) produce trees in the
treefile in rooted form! This is done for reasons of arbitrary internal bookkeeping. The placement of the root is arbitrary. We are working toward
having all programs be able to read all trees, whether rooted or unrooted,
multifurcating or bifurcating, and having them do the right thing with
them. But this is a long-term goal and it is not yet achieved.
For programs that infer branch lengths, these are given in the trees in the
tree file as real numbers following a colon, and placed immediately
after the group descended from that branch. Here is a typical tree
with branch lengths:
((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,
bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,
seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);
Note that the tree may continue to a new line at any time except in the
middle of a name or the middle of a branch length, although in trees
written to the tree file this will only be done after a comma.
These representations of trees are a subset of the standard adopted
on 24 June 1986 at the annual meetings of the Society for the Study of
Evolution by an informal committee (its final session in Newick's
lobster restaurant - hence its name, the Newick standard)
consisting of Wayne Maddison (author of MacClade), David Swofford (PAUP),
F. James Rohlf (NTSYS-PC), Chris Meacham (COMPROB and the original
PHYLIP tree drawing programs), James Archie,
William H.E. Day, and me. This standard is a generalization of
PHYLIP's format, itself based on a well-known representation of trees in
terms of parenthesis patterns which is due to the famous mathematician
Arthur Cayley, and which has been around for over a century. The
standard is now employed by most phylogeny computer programs but unfortunately
has yet to be decribed in a formal published description. Other
descriptions by me and by Gary Olsen can be accessed using the Web at:
The Options and How To Invoke Them
Most of the programs allow various options that alter the amount of
information the program is provided or what is done with the
information. Options are selected in the menu.
Common options in the menu
A number of the options from the menu, the U (User tree), G (Global),
J (Jumble), O (Outgroup), W (Weights),
T (Threshold), M (multiple data sets), and the tree output options, are used
so widely that it is best to discuss them in this document.
The U (User tree) option. This option toggles between the default
setting, which allows the program to search for the best tree, and the
User tree setting, which reads a tree or trees ("user trees") from the input
tree file and evaluates them. The input tree file's
default name is intree. In many cases the programs will also
tolerate having the trees
be preceded by a line giving the number of trees:
((Alligator,Bear),((Cow,(Dog,Elephant)),Ferret));
((Alligator,Bear),(((Cow,Dog),Elephant),Ferret));
((Alligator,Bear),((Cow,Dog),(Elephant,Ferret)));
|
An initial line with the number of trees was formerly
required, but this now can be omitted. Some programs require rooted
trees, some unrooted
trees, and some can handle multifurcating trees. You should read
the documentation for the particular program to find out which it
requires. Program Retree can be used to convert trees among
these forms (on saving a tree from Retree, you are asked whether
you want it to be rooted or unrooted).
In using the user tree option, check the pattern of parentheses
carefully. The programs do not always detect
whether the tree makes sense, and if it does not there will probably be
a crash (hopefully, but not inevitably, with an error message indicating
the nature of the problem). Trees written out by programs are
typically in the proper form.
The G (Global) option. In the programs which construct trees (except for
Neighbor, the "...penny" programs and Clique, and of course
the "...move" programs where you construct the trees yourself),
after all species have been added to the tree a rearrangements phase
ensues. In most of these programs the rearrangements are automatically
global, which in this case means that subtrees will be removed from the tree
and put back on in all possible ways so as to have a better chance of
finding a better tree. Since this can be time consuming (it roughly
triples the time taken for a run) it is left as an option in some of the
programs, specifically Contml, Fitch, and Dnaml. In these programs
the G menu option toggles between the default of local rearrangement and
global rearrangement. The rearrangements are explained more below.
The J (Jumble) option. In most of the tree construction programs
(except for the "...penny" programs and Clique), the exact
details of the search of different trees depend on the order of input of
species. In these programs J option enables you to tell the program to use
a random number
generator to choose the input order of species. This option is toggled on
and off by
selecting option J in the menu. The program will then prompt you for
a "seed" for the random number generator. The seed should be an integer
between 1 and 232-3 (which is 4,294,967,293), and should be
of form 4n+1,
which means that it must give a remainder of 1 when divided by 4. This can be
judged by looking at the last two digits of the number (for example, in the
upper limit given above, the last two digits are 93, which is of form 4n+1. Each different seed
leads to a different sequence of addition of species. By simply changing the
random number seed and re-running the programs one can look for other, and
better trees. If the seed entered is not odd, the program will not proceed,
but will prompt for another seed.
The Jumble option also causes the program to ask you how many times you
want to restart the process. If you answer 10, the program will
try ten different orders of species in constructing the trees, and the
results printed out will reflect this entire search process (that is,
the best trees found among all 10 runs will be printed out, not the
best trees from each individual run).
Some people have asked what are good values of the random number seed.
The random number seed is used to start a process of choosing "random"
(actually pseudorandom) numbers, which behave as if they were
unpredictably randomly chosen between 0 and 232-1 (which is
4,294,967,295). You could put in the number 133 and find that the
next random number was 221,381,825. As they are effectively
unpredictable, there is no such thing as a choice that is better than
any other, provided that the numbers are of the form 4n+1. However
if you re-use a random number seed, the sequence of random numbers
that result will be the same as before, resulting in exactly the same
series of choices, which may not be what you want.
The O (Outgroup) option. This specifies which species is to
have the root of the tree be on the line leading to it. For example, if the
outgroup is a species "Mouse" then the root of the tree will be placed in the
middle of the branch which is connected to this species, with Mouse branching
off on one side of the root and the lineage leading to the rest of the tree
on the other. This option is toggled on and off by choosing O in the
menu (the alphabetic character O, not the digit 0). When it
is on, the program will then prompt for the
number of the outgroup (the species being taken in the numerical order that
they occur in the input file). Responding by typing 6 and then an
Enter character indicates that the sixth species in the data
(the 6th in the first set of data if there are multiple data sets)
is taken as the outgroup. Outgroup-rooting will not be attempted if the
data have already established a root for the tree from some other
consideration, and may not be if it is a user-defined tree,
despite your invoking the option. Thus programs such as Dollop that
produce only rooted trees do not allow the Outgroup option. It is also
not available in Kitsch, Dnamlk, or Clique. When it is used, the tree as
printed out is still listed as being an
unrooted tree, though the outgroup is connected to the bottommost node
so that it is easy to visually convert the tree into rooted form.
The T (Threshold) option. This sets a threshold forn the
parsimony programs such that if the
number of steps counted in a character is higher than the threshold, it
will be taken to be the threshold value rather than the actual number of
steps. The default is a threshold so high that it will never be
surpassed (in which case the steps whill simply be counted). The T
menu option toggles on and off asking the user to
supply a threshold. The use of thresholds to obtain methods intermediate
between parsimony and compatibility methods is described in my 1981b paper.
When the T option is in force, the program
will prompt for the numerical threshold value. This will be a positive
real number greater than 1. In programs Mix, Move, Penny, Protpars,
Dnapars, Dnamove, and Dnapenny, do not use threshold values less
than or equal to 1.0, as they have no meaning and lead to a tree which
depends only on considerations such as the input order of species and not at
all on the character state data! In programs Dollop, Dolmove, and Dolpenny
the threshold should never be 0.0 or less, for the same
reason. The T option is an
important and underutilized one: it is, for example, the only way in this
package (except for program Dnacomp) to do a compatibility analysis when there
are missing data. It is a method of de-weighting characters that evolve
rapidly. I wish more people were aware of its properties.
The M (Multiple data sets) option. In menu programs there is an
M menu
option which allows one to toggle on the multiple data sets option. The
program will ask you how many data sets it should expect. The data sets
have the same format as the first data set. Here is a (very small) input file
with two five-species data sets:
5 6
Alpha CCACCA
Beta CCAAAA
Gamma CAACCA
Delta AACAAC
Epsilon AACCCA
5 6
Alpha CACACA
Beta CCAACC
Gamma CAACAC
Delta GCCTGG
Epsilon TGCAAT
|
The main use of this option will be to allow all of the methods in these
programs to be bootstrapped. Using the program Seqboot one can take any
DNA, protein, restriction sites, gene frequency or binary character data set and
make multiple data sets by bootstrapping. Trees can be produced for all of
these using the M option. They will be written on the tree output file if
that option is left in force. Then the program Consense can be used with
that tree file as its input file. The result is a majority rule consensus
tree which can be used to make confidence intervals. The present version
of the package allows, with the use of Seqboot and Consense and the M option,
bootstrapping of many of the methods in the package.
Programs Dnaml, Dnapars and Pars can also take multiple weights
instead of multiple data sets. They can then do bootstrapping by
reading in one data set, together with a file of weights that show how
the characters (or sites) are reweighted in each bootstrap sample. Thus a
site that is omitted in a bootstrap sample has effectively been given
weight 0, while a site that has been duplicated has effectively been
given weight 2. Seqboot has a menu selection to produce the file of
weights information automatically, instead of producing a file of
multiple data sets. It can be renamed and used as the input weights file.
The W (Weights) option. This signals the program that, in
addition to the data set, you want to read in a series of weights that
tell how many times each character is to be counted. If the weight
for a character is zero (0) then that character is in effect to
be omitted when the tree is evaluated. If it is (1) the
character is to be counted once. Some programs allow weights greater than
1 as well. These have the effect that the character is counted as
if it were present that many times, so that a weight of 4 means that the
character is counted 4 times.
The values 0-9 give weights 0 through 9, and the
values A-Z give weights 10 through 35. By use of the weights we can
give overwhelming weight to some characters, and drop others from the
analysis. In the molecular sequence programs only two values of the
weights, 0 or 1 are allowed.
The weights are used to analyze subsets of the characters, and also can be
used for resampling of the data as in bootstrap and jackknife resampling.
For those programs that allow weights to be greater than 1, they can also
be used to emphasize information from some characters more strongly than
others. Of course, you must have some rationale for doing this.
The weights are provided as a sequence of digits. Thus they might be
10011111100010100011110001100
The weights are to be provided in an input file
whose default name is weights. The weights in it are
a simple string of digits. Blanks in the weightfile are skipped over and
ignored, and the weights can continue to a new line. In programs such as
Seqboot
that can also output a file of weights, the input weights have a default
file name of inweights, and the output file name has a default
file name of outweights.
Weights can be used to analyze different subsets of characters (by weighting
the rest as zero). Alternatively, in the discrete characters programs
they can be used to force a certain
group to appear on the phylogeny (in effect confining consideration to only
phylogenies containing that group). This is done by adding an imaginary
character that has 1's for the members of the group, and 0's
for all the
other species. That imaginary character is then given the highest weight
possible: the result will be that any phylogeny that does not contain that
group will be penalized by such a heavy amount that it will not (except in
the most unusual circumstances) be considered. Of course, the new character
brings extra steps to the tree, but the number of these can be calculated
in advance and subtracted out of the total when reporting the results. This
use of weights is an important one, and one sadly ignored
by many users who could profit from it. In the case of molecular sequences
we cannot use weights this way, so that to force a given group to appear we
have to add a large extra segment of sites to the molecule, with (say) A's
for that group and C's for every other species.
The option to write out the trees into a tree file. This specifies that you
want the program to write
out the tree not only on its usual output, but also onto a file in
nested-parenthesis notation (as described above). This option is sufficiently
useful that it is turned on by default in all programs that allow it. You
can optionally turn it off if you wish, by typing the appropriate number
from the menu (it varies from program to program). This option is useful for
creating tree files that can be directly read into the programs, including
the consensus tree and tree distance programs, and the tree plotting programs.
The output tree file has a default name of outtree.
The (0) terminal type option . (This is the digit 0, not
the alphabetic character O). The program will default to
one particular assumption about your terminal (ANSI in the case of Linux,
Unix, or Mac OS X, none in the case of Mac OS 9, and IBM PC in the case of
Windows). You can
alternatively select it to be either an IBM PC, or nothing.
This affects the ability of the programs to clear the screen when they
display their menus, and the graphics characters used to display trees
in the programs Dnamove, Move, Dolmove, and Retree. In the case of Windows,
the screen will clear properly with either the IBM PC or the ANSI settings,
but the graphics characters needed by Move, Dnamove, Dolmove, or Retree
will display correctly only with the IBM PC setting.
The Algorithm for Constructing Trees
All of the programs except Factor, Dnadist, Gendist, Dnainvar, Seqboot,
Contrast, Retree, and the plotting and
consensus tree programs act to construct an estimate of a phylogeny. Move,
Dolmove, and Dnamove let you construct it yourself by hand. All of
the rest but Neighbor, the "...penny" programs and Clique make use of
a common approach involving additions and rearrangements. They are
trying to minimize or maximize some quantity over the space of all
possible evolutionary trees. Each program contains a part that, given
the topology of the tree, evaluates the quantity that is being minimized
or maximized. The straightforward approach would be to evaluate all
possible tree topologies one after another and pick the one which,
according to the criterion being used, is best. This would not be
possible for more than a small number of species, since the number of
possible tree topologies is enormous. A review of the literature on the
counting of evolutionary trees will be found one of my papers
(Felsenstein, 1978a) and in my book (Felsenstein, 2004, chapter 3).
Since we cannot search all topologies, these programs are not
guaranteed to always find the best tree, although they seem to do quite
well in practice. The strategy they employ is as follows: the species
are taken in the order in which they appear in the input file. The
first two (in some programs the first three) are taken and a tree
constructed containing only those. There is only one possible topology for
this tree. Then the next species is taken, and we consider where it
might be added to the tree. If the initial tree is (say) a rooted tree
with two species and we want the resulting three-species tree to be a
bifurcating tree, there are only three places where we could add the
third species. Each of these is tried, and each time the resulting tree is
evaluated according to the criterion. The best one is chosen to be the
basis for further operations. Now we consider adding the fourth
species, again at each of the five possible places that would result in
a bifurcating tree. Again, the best of these is accepted. This is
usually known as the Sequential Addition strategy.
Local rearrangements
The process continues in this manner, with one important exception. After
each species is added, and before the next
is added, a number of rearrangements of the tree are tried, in an effort
to improve it. The algorithms move through the tree, making all
possible local rearrangements of the tree. A local rearrangement involves an
internal segment of the tree in the following manner. Each internal
segment of the tree is of this form (where T1, T2, and T3 are subtrees
- parts of the tree that can contain further forks and tips):
T1 T2 T3
\ / /
\ / /
\ / /
\/ /
* /
* /
* /
* /
*
!
!
the segment we are discussing being indicated by the asterisks. A local
rearrangement consists of switching the subtrees T1 and T3 or T2 and T3,
so as to obtain one of the following:
T3 T2 T1 T1 T3 T2
\ / / \ / /
\ / / \ / /
\ / / \ / /
\ / / \ / /
\ / \ /
\ / \ /
\ / \ /
\ / \ /
! !
! !
! !
Each time a local rearrangement is successful in finding a better tree,
the new arrangement is accepted. The phase of local rearrangements does
not end until the program can traverse the entire tree, attempting local
rearrangements, without finding any that improve the tree.
This strategy of adding species and making local rearrangements will look
at about (n-1)x(2n-3) different topologies, though if
rearrangements are frequently successful the number may be larger. I
have been describing the strategy when rooted trees are being
considered. For unrooted trees there is a precisely similar strategy,
though the first tree constructed may be a three-species tree and the
rearrangements may not start until after the addition of the fifth
species.
These local rearrangements have come to be called Nearest Neighbor
Interchanges (NNIs) in the phylogeny literature.
Though we are not guaranteed to have found the best tree topology,
we are guaranteed that no nearby topology (i. e. none accessible by a
single local rearrangement) is better. In this sense we have reached a
local optimum of our criterion. Note that the whole process is
dependent on the order in which the species are present in the input
file. We can try to find a different and better solution by reordering
the species in the input file and running the program again (or, more
easily, by using the J option). If none of
these attempts finds a better solution, then we have some indication
that we may have found the best topology, though we can never be certain
of this.
Note also that a new topology is never accepted unless it is better
than the previous one, so that the rearrangement process can never fall
into an endless loop. This is also the way ties in our criterion are
resolved, namely by sticking with the tree found first. However, the tree
construction programs other than Clique, Contml, Fitch,
and Dnaml do keep a record of all trees found that are tied with the best one
found. This gives you some immediate idea of which parts of the tree can be
altered without affecting the quality of the result.
Global rearrangements
A feature of most of the programs, such as Protpars, Dnapars,
Dnacomp, Dnaml, Dnamlk, Restml, Kitsch, Fitch, Contml, Mix, and Dollop,
is "global" optimization of the tree. In four of these (Contml,
Fitch, Dnaml and Dnamlk) this is an option, G. In the others it
automatically applies. When
it is present there is an additional stage to the search for the best tree.
Each possible subtree is removed from the tree from the tree and added back in
all possible places. This process continues until all subtrees can be removed
and added again without any improvement in the tree. The purpose of this
extra rearrangement is to make it less likely that one or more a species gets
"stuck" in a suboptimal region of the space of all possible trees. The use of
global optimization results in approximately a tripling (3 x ) of the run-time,
which is why I have left it as an option in some of the slower programs.
What PHYLIP calls "global" rearrangements are more properly called
SPR (subtree pruning and regrafting) by Swofford et. al. (1996) as distinct
from the NNI (nearest neighbor interchange) rearrangements that PHYLIP
also uses, and the TBR (tree bisection and reconnection) rearrangements
that it does not use. My book (Felsenstein, 2004, chapter 4) contains
a review of work on these and other rearrangements and search methods.
The programs doing global optimization print out a dot "." after each group is
removed and re-added to the tree, to give the user some sign that the
rearrangements are proceeding. A new line of dots is started whenever a new
round of global rearrangements is started following an improvement in the
tree. On the line before the dots are printed there is printed a bar of
the form "!---------------!" to show how many dots
to expect. The dots will
not be printed out at a uniform rate, but the later dots, which represent
removal of larger groups from the tree and trying them consequently in fewer
places, will print out more quickly. With some compilers each row of dots may
not be printed out until it is complete.
It should be noted that Penny, Dolpenny, Dnapenny and Clique use a more
sophisticated strategy of "depth-first search" with a "branch and bound"
search method that guarantees that all
of the best trees will be found. In the case
of Penny, Dolpenny and Dnapenny there can be a considerable sacrifice of
computer time if the number of species is greater than about ten: it is a
matter for you to consider whether it is worth it for you to guarantee finding
all the most parsimonious trees, and that depends on how much free computer
time you have! Clique finds all largest cliques, and does so without undue
burning of computer time. Although all of these problems that have been
investigated fall into the
category of "NP-hard" problems that in effect do not have a rapid solution,
the cases that cause this trouble for the largest-cliques algorithm in
Clique apparently are not biologically realistic and do not occur in actual
data.
Multiple jumbles
As just mentioned, for most of these programs the search depends on the order
in which the species are entered into the tree. Using the J (Jumble)
option you can supply a random number seed which will allow the program to put
the species in in a random order. Jumbling can be
done multiple times. For example, if you tell the program to do it
10 times, it will go through the tree-building process 10 times, each with a
different random order of adding species. It will keep a record of the trees
tied for best over the whole process. In other words, it does not just
record the best trees from each of the 10 runs, but records the best ones
overall. Of course this is slow, taking 10 times longer than a single run.
But it does give us a much greater chance of finding all of the most
parsimonious trees. In the terminology of Maddison (1991) it
can find different "islands" of trees. The present algorithms do not
guarantee us to find all trees in a given "island" from a single run, so
multiple runs also help explore those "islands" that are found.
Saving multiple tied trees
For the parsimony and compatibility programs, one can have a perfect tie
between two or more trees. In these programs these trees are all
saved. For the newer parsimony programs such as Dnapars and Pars,
global rearrangement is carried out on all of these tied trees. This can
be turned off in the menu.
For trees with criteria which are real numbers, such as the distance
matrix programs Fitch and Kitsch, and the likelihood programs Dnaml,
Dnamlk, Contml, and Restml, it is difficult to get an exact tie between
trees. Consequently these programs save only the single best tree
(even though the others may be only a tiny bit worse).
Strategy for finding the best tree
In practice, it is advisable to use the Jumble option to evaluate many
different orderings of the input species. It is advisable to use the
Jumble option and specify that it be done many times (as many as
different orderings
of the input species). (This is usually not necessary when bootstrapping,
though the programs will then default to doing it once to avoid artifacts
caused by the order in which species are added to the tree.)
People who want a magic "black box" program whose results they do
not have to question (or think about) often are upset that these
programs give results that are dependent on the order in which the species
are entered in the data. To me this property is an advantage, for it
permits you to try different searches for better trees, simply by
varying the input order of species. If you do not use the multiple Jumble
option, but do multiple individual runs instead, you
can easily decide which to pay most attention to - the one or ones that
are best according to the criterion employed (for example, with parsimony,
the one out of the runs that results in the tree with the fewest changes).
In practice, in a single run, it usually seems best to put species that are
likely to be sources of confusion in the topology last, as by the time they are
added the arrangement of the earlier species will have stabilized into a
good configuration, and then the last few species will by fitted into
that topology. There will be less chance this way of a poor initial
topology that would affect all subsequent parts of the search. However,
a variety of arrangements of the input order of species should be tried,
as can be done if the J option is used,
and no species should be kept in a fixed place in the order of input.
Note that the results of the "...penny" programs and Clique
are not sensitive to the input order of species, and Neighbor is only
slightly sensistive to it, so that multiple Jumbling is not possible
with those programs. Note also that with global search, which
is standard in many programs and in others is an
option, each group (including
each individual species) will be removed and re-added in all possible
positions, so that a species causing confusion will have more chance of moving
to a new location than it would without global rearrangement.
A Warning on Interpreting Results
Probably the most important thing to keep in mind while running any of the
parsimony or compatibility programs is not
to overinterpret the result. Many users treat the set of most parsimonious
trees as if it were a confidence interval. If a group appears in all of the
most parsimonious trees then they treat it as well established. Unfortunately
the confidence interval on phylogenies appears to be much
larger than the set of all most parsimonious trees (Felsenstein, 1985b).
Likewise, variation of result among different methods will not be a good
indicator of the size of the confidence interval. Consider a simple data set
in which, out of 100 binary characters, 51 recommend the unrooted tree
((A,B),(C,D)) and 49 the tree ((A,D),(B,C)). Many different
methods will all give the same result on
such a data set: they will estimate the tree as ((A,B),(C,D)).
Nevertheless it is
clear that the 51:49 margin by which this tree is favored is not statistically
significantly different from 50:50. So consistency among different methods
is a poor guide to statistical significance.
Relative Speed of Different
Programs and Machines
Relative speed of the different programs
C compilers differ in efficiency of the code they generate,
and some deal with some features of the language better than with
others. Thus a program which is unusually fast on one computer may be
unusually slow on another. Nevertheless, as a rough guide to relative
execution speeds, I have tested the programs on three data sets, each of
which has 10 species and 40 characters. The first is an imaginary one
in which all characters are compatible - ("The Willi Hennig Memorial
Data Set" as J. S. Farris once called ones like it). The second is the binary
recoded form of the fossil horses data set of Camin and Sokal (1965).
The third data set has data that is completely random: 10 species and 20
characters that have a 50% chance that each character state is 0 or
1 (or A or G). The data sets thus range from a completely
compatible one in which there is no homoplasy (paralellism or convergence),
through the horses data set, which requires 29 steps where the possible
minimum number would be 20, to the random data set, which requires 49 steps.
We can thus see how this increasing messiness of the data affects running
times. The three data sets have all had 20 sites of A's added to the
end of each sequence, so as to prevent likelihood or distance matrix programs
from having infinite branch lengths (the test data sets used for timing
previous versions of PHYLIP were the same except that they lacked these
20 extra sites).
Here are the nucleotide sequence versions of the three data sets:
10 40
A CACACACAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
B CACACAACAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
C CACAACAAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
D CAACAAAACAAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
E CAACAAAAACAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
F ACAAAAAAAACACACAAAACAAAAAAAAAAAAAAAAAAAA
G ACAAAAAAAACACAACAAACAAAAAAAAAAAAAAAAAAAA
H ACAAAAAAAACAACAAAAACAAAAAAAAAAAAAAAAAAAA
I ACAAAAAAAAACAAAACAACAAAAAAAAAAAAAAAAAAAA
J ACAAAAAAAAACAAAAACACAAAAAAAAAAAAAAAAAAAA
|
10 40
MesohippusAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
HypohippusAAACCCCCCCAAAAAAAAACAAAAAAAAAAAAAAAAAAAA
ArchaeohipCAAAAAAAAAAAAAAAACACAAAAAAAAAAAAAAAAAAAA
ParahippusCAAACAACAACAAAAAAAACAAAAAAAAAAAAAAAAAAAA
MerychippuCCAACCACCACCCCACACCCAAAAAAAAAAAAAAAAAAAA
M. secunduCCAACCACCACCCACACCCCAAAAAAAAAAAAAAAAAAAA
Nannipus CCAACCACAACCCCACACCCAAAAAAAAAAAAAAAAAAAA
NeohippariCCAACCCCCCCCCCACACCCAAAAAAAAAAAAAAAAAAAA
Calippus CCAACCACAACCCACACCCCAAAAAAAAAAAAAAAAAAAA
PliohippusCCCACCCCCCCCCACACCCCAAAAAAAAAAAAAAAAAAAA
|
10 40
A CACACAACCAAACAAACCACAAAAAAAAAAAAAAAAAAAA
B AAACCACACACACAAACCCAAAAAAAAAAAAAAAAAAAAA
C ACAAAACCAAACCACCCACAAAAAAAAAAAAAAAAAAAAA
D AAAAACACAACACACCAAACAAAAAAAAAAAAAAAAAAAA
E AAACAACCACACACAACCAAAAAAAAAAAAAAAAAAAAAA
F CCCAAACACCCCCAAAAAACAAAAAAAAAAAAAAAAAAAA
G ACACCCCCACACCCACCAACAAAAAAAAAAAAAAAAAAAA
H AAAACAACAACCACCCCACCAAAAAAAAAAAAAAAAAAAA
I ACACAACAACACAAACAACCAAAAAAAAAAAAAAAAAAAA
J CCAAAAACACCCAACCCAACAAAAAAAAAAAAAAAAAAAA
|
Here are the timings of many of the version 3.6 programs on these three data
sets as run after being compiled by Gnu C (version 3.2) and run on an
AMD Athlon XP 2200+ computer under Linux.
| |
Hennigian Data |
Horses Data |
Random Data |
| Protpars |
0.00500 |
0.00670 |
0.01289 |
| Dnapars |
0.01050 |
0.00940 |
0.00980 |
| Dnapenny |
0.01400 |
0.00860 |
1.71100 |
| Dnacomp |
0.00240 |
0.00250 |
0.00590 |
| Dnaml |
0.17749 |
0.23970 |
0.21350 |
| Dnamlk |
0.21740 |
0.19450 |
0.24400 |
| Proml |
1.3527 |
3.2085 |
2.0055 |
| Promlk |
3.3567 |
8.6078 |
4.4886 |
| Dnainvar |
0.00020 |
0.00020 |
0.00020 |
| Dnadist |
0.00140 |
0.00080 |
0.00150 |
| Protdist |
0.09220 |
0.09210 |
0.09310 |
| Restml |
0.14560 |
0.28810 |
0.21540 |
| Restdist |
0.00110 |
0.00090 |
0.00080 |
| Fitch |
0.00760 |
0.01280 |
0.00880 |
| Kitsch |
0.00180 |
0.00260 |
0.00280 |
| Neighbor |
0.00020 |
0.00050 |
0.00050 |
| Contml |
0.01310 |
0.01500 |
0.01780 |
| Gendist |
0.00070 |
0.00070 |
0.00070 |
| Pars |
0.00780 |
0.00610 |
0.02930 |
| Mix |
0.00360 |
0.00410 |
0.00610 |
| Penny |
0.00190 |
0.00470 |
0.8060 |
| Dollop |
0.00480 |
0.00450 |
0.00820 |
| Dolpenny |
0.00200 |
0.01060 |
1.1270 |
| Clique |
0.00100 |
0.00070 |
0.00130 |
In all cases the programs were run under the default options with optimized
compiler switches (-03 -fomit-frame-pointer), except as
specified here.
The data sets used for the discrete characters programs have 0's and 1's
instead of A's and C's. For Contml the A's and C's
were made into 0.0's and 1.0's and considered as 40 2-allele loci.
For the distance programs 10 x 10 distance matrices were
computed from the three data sets.
For the restriction sites programs A and C were changed into
+ and -. It does not
make much sense to benchmark Move, Dolmove, or Dnamove, although when there
are many characters and many species the response time after each
alteration of the tree should be proportional to the product of the number of
species and the number of characters.
For Dnaml, Dnamlk, and Dnadist the frequencies of the four bases were
set to be equal rather than determined empirically as is the default. For
Restml the number of enzymes was set to 1.
In most cases, the benchmark was made more accurate by analyzing 100 data
sets using the M (Multiple data sets) option and dividing the resulting
time by 100. Times were determined as user times using the Linux time
command. Several patterns will be apparent from this. The algorithms (Mix,
Dollop, Contml, Fitch, Kitsch, Protpars, Dnapars, Dnacomp, and
Dnaml, Dnamlk, Restml) that use the above-described addition strategy have
run times that do not depend strongly on the messiness of the data. The only
exception to this is that if a data set such as the Random data requires
extra rounds of global rearrangements it takes longer. The
programs differ greatly in run time: the protein likelihood programs
Proml and Promlk were very slow, and the other likelihood programs
Restml, Dnaml and
Contml are slower than the rest of the programs. The protein sequence parsimony
program, which has to do a considerable amount of bookkeeping to keep track of
which amino acids can mutate to each other, is also relatively slow.
Another class of algorithms includes Penny, Dolpenny, Dnapenny and Clique.
These are branch-and-bound methods: in principle they should have execution
times that rise exponentially with the number of species and/or
characters, and they might be much more sensitive to messy data. This is
apparent with Penny, Dolpenny, and Dnapenny, which go from being reasonably
fast with clean data to very slow with messy data. Dolpenny is particularly
slow on messy data - this is because this algorithm cannot make use of some of
the lower-bound calculations that are possible with Dnapenny and Penny. Clique
is very fast on all
data sets. Although in theory it should bog down if the number of cliques in
the data is very large, that does not happen with random data, which in
fact has few cliques and those small ones. Apparently the "worst-case"
data sets that cause exponential run time are much rarer for Clique than for
the other branch-and-bound methods.
Neighbor is quite fast compared to Fitch and Kitsch, and should make it
possible to run much larger cases, although the results are expected to be
a bit rougher than with those programs.
Speed with different numbers of species
How will the speed depend on the number of species and the number
of characters? For the sequential-addition algorithms, the speed should
be proportional to somewhere between the cube of the number of species and
the square of the number of species, and to the number
of characters. Thus a case that has, instead of 10 species and 20
characters, 20 species and 50 characters would take (in the cubic case)
2 x 2 x 2 x 2.5 = 20
times as long. This implies that cases with more than 20 species will
be slow, and cases with more than 40 species very slow. This places a
premium on working on small subproblems rather than just dumping a whole
large data set into the programs.
An exception to these rules will be some of the DNA programs that use an
aliasing device to save execution time. In these programs execution time
will not necessarily increase proportional to the number of sites,
as sites that show the same pattern of nucleotides will be detected
as identical and the calculations for them will be done only once, which does
not lead to more execution time. This is particularly
likely to happen with few species and many sites, or with data sets that have
small amounts of evolutionary divergence.
For programs Fitch and Kitsch, the distance matrix is square, so
that when we double the number of species we also double the number of
"characters", so that running times will go up as the fourth power of
the number of species rather than the third power. Thus a 20-species
case with Fitch is expected to run sixteen times more slowly than a 10-species
case.
For programs like Penny and Clique the run times will rise faster
than the cube of the number of species (in fact, they can rise faster
than any power since these algorithms are not guaranteed to work in
polynomial time). In practice, Penny will frequently bog down above 11
species, while Clique easily deals with larger numbers.
For Neighbor the speed should vary only as the cube of the number of
species, so a case twice as large will take only eight times as long. This
will make it an attractive alternative to Fitch and Kitsch for large data
sets.
Suggestion: If you are unsure of how long a program will take, try it first on
a few species, then work your way up until you get a feel for the speed
and for what size programs you can afford to run.
Execution time is not the most important criterion for a program,
particularly as computer time gets much cheaper than your time or a
programmer's time. With workstations on which background jobs can be run
all night, execution speed is not overwhelmingly relevant. Some of us have been
conditioned by an earlier era of computing to consider execution speed
paramount. But ease of use, ease of adaptation to your computer system,
and ease of modification are much more important in practice, and in
these respects I think these programs are adequate. Only if you are
engaged in 1960's style mainframe computing, or if you have very large
amounts of data is minimization of execution
time paramount. If you spent six months getting your data, it may not be
overwhelmingly important whether your run takes 10 seconds or 10 hours.
Nevertheless it would have been nice to have made the programs
faster. The present speeds are a compromise between speed and
effectiveness: by making them slower and trying more rearrangements in the
trees, or by enumerating all possible trees, I could have made the programs
more likely to find the best tree. By trying fewer rearrangements I
could have speeded them up, but at the cost of finding worse trees. I
could also have speeded them up by writing critical sections in assembly
language, but this would have sacrificed ease of distribution to new
computer systems. There are also some options included in these programs that
make it
harder to adopt some of the economies of bookkeeping that make other programs
faster. However to some extent I have simply made the decision not to spend
time trying to speed up program bookkeeping when there were new likelihood and
statistical methods to be developed.
Relative speed of different machines
It is interesting to compare different machines using Dnapars as the
standard task. One can rate a machine on the Dnapars benchmark by summing the
times for all three of the data sets. Here are relative total timings over
all three data sets (done with various versions of Dnapars) for some machines,
taking an AMD Athlon 1.2 GHz computer running Linux with gcc as the
standard. Benchmarks from versions 3.4 and 3.5 of the program are
also included (respectively the Pascal and C versions whose timings are in
parentheses). They are compared only with each other and are scaled to the
rest of the timings using the joint runs on the 386SX and the Pentium MMX 266.
This use of separate standards is necessary not
because of different languages but because different versions of the package
are being compared. Thus, the "Time" is the ratio of the Total to that for
the Pentium, adjusted by the scalings of machines using 3.4 and 3.5 when
appropriate. The Relative Speed is the reciprocal of the Time.
| Machine |
Operating System |
Compiler |
Total |
Time |
Relative Speed |
| Toshiba T1100+ |
MSDOS |
Turbo Pascal 3.01A |
(269) |
10542 |
0.00009486 |
| Apple Mac Plus |
Mac OS |
Lightspeed Pascal 2 |
(175.84) |
6891 |
0.00014511 |
| Toshiba T1100+ |
MSDOS |
Turbo Pascal 5.0 |
(162) |
6349 |
0.00015750 |
| Macintosh Classic |
Mac OS |
Think Pascal 3 |
(160) |
6271 |
0.00015947 |
| Macintosh Classic |
Mac OS |
Think C |
(43.0) |
4771 |
0.0002096 |
| IBM PS2/60 |
MSDOS |
Turbo Pascal 5.0 |
(58.76) |
2303 |
0.0004343 |
| 80286 (12 Mhz) |
MSDOS |
Turbo Pascal 5.0 |
(47.09) |
1845.4 |
0.0005419 |
| Apple Mac IIcx |
Mac OS |
Think Pascal 3 |
(42) |
1645.5 |
0.0006077 |
| Apple Mac SE/30 |
Mac OS |
Think Pascal 3 |
(42) |
1645.6 |
0.0006077 |
| Apple Mac IIcx |
Mac OS |
Lightspeed Pascal 2 |
(39.84) |
1561.6 |
0.0006404 |
| Apple Mac IIcx |
Mac OS |
Lightspeed Pascal 2# |
(39.69) |
1555.0 |
0.00006431 |
| Zenith Z386 (16MHz) |
MSDOS |
Turbo Pascal 5.0 |
(38.27) |
1539.0 |
0.0006498 |
| Macintosh SE/30 |
Mac OS |
Think C |
(13.6) |
1508.4 |
0.0006630 |
| 386SX (16 MHz) |
MSDOS |
Turbo Pascal 6.0 |
(34) |
1333.6 |
0.0007498 |
| 386SX (16 MHz) |
MSDOS |
Microsoft Quick C |
(12.01) |
1333.6 |
0.0007499 |
| Sequent-S81 |
DYNIX |
Silicon Valley Pascal |
(13.0) |
509.0 |
0.0019646 |
| VAX 11/785 |
Unix |
Berkeley Pascal |
(11.9) |
466.3 |
0.002144 |
| 80486-33 |
MSDOS |
Turbo Pascal 6.0 |
(11.46) |
449.0 |
0.02227 |
| Sun 3/60 |
SunOS |
Sun C |
(3.93) |
435.7 |
0.002295 |
| NeXT Cube (68030)< | |