To start MrBayes on the you just have to type
mb
At the MrBayes> prompt type 'help' to see a list of the commands in MrBayes. Most commands allow you to set values for a set of options. If you type 'help <command>', where <command> is any of the listed comamnds, you will see a description of each option and the choices for that option. The help facility also provides a way to see the current settings of the parameters. For instance, typing 'help lset' not only lists each of the options and possible choices, at the end of the list it tells you the current state for each option.
The input file is a standard Nexus file of aligned nucleotide or protein sequences. A Nexus file is a simple text (ASCII) file that begins with NEXUS on the first line, followed by the DATA block. The DATA block begins with 'begin data;' followed by the dimensions statement, the format statement, and matrix. The dimensions statement must contain 'ntax', the number of taxa, and 'nchar', the number of characters in each aligned sequence. The format statement must begin with 'datatype=<DNA or RNA or Protein or Standard>. The format statement may also contain 'gap=-' (or whatever symbol is used for a gap in your alignment), 'missing=?' (or whatever symbol is used for missing data in your file), 'interleave=yes' when the data matrix is interleaved sequences, and =. (or whatever symbol is used for matching characters in the alignment). The format statement is followed by the word 'matrix' on a separate line, followed by the aligned sequences. Each sequence begins with the sequence name sepaarted from the sequence itself by at least one space. The data block is completed by an 'end;' statement. Note that the begin data, dimensions, format and end statements all end in a semicolon. That semicolon is essential and must not be left out.
You can create a Nexus (non-interleaved) format multiple sequence file from any other format of multiple sequence file by using the EMBOSS package and typing the command:
seqret input_file output_file -osformat nexusnon
where 'input_file' is the name of the file containing your existing multiple sequence format and 'output_file' is the name of the Nexus file you will create.
The example below is a Nexus file in block (non-interleaved) format. Non-interleaved is the default, but you can put "interleaved=no' in the format statement if you want to be sure.
#NEXUS begin data; dimensions ntax=9 nchar=33; format interleave=no datatype=RNA gap=-; matrix FR ggcaacggu---gug---gcagacccacgccuc MS2 gggaacgga---gug---ucagauccacgccuc GA gggaacggu---uug---ucagauccgcgacuc SP uca---aauaaagca---ugggauccuagagca NL95 ucg---aauaaagca---ugggauccuagggua M11 ccuuucaauaaagca---ugggauccuagggua MX1 ccuuucaauaaagca---ugggauccuagggua QB ccuuuuaauaaagca---ugggauccuagggca PP7 gacagc---cgguuc---cggaucccugacacg ; end;
To put the data into MrBayes, type:
execute <filename>
at the MrBayes> prompt, where '<filename>' is the name of teh input file. Note: The input file must be located in the same folder (directory) as the MrBayes application and the name of the input file should not have blank spaces.
Any command that can be entered at the command line at the MrBayes> prompt. At a minimum two commands, "lset" and "mcmc" are required. "lset" sets the parameters of the likelihood model, and "mcmc" both sets the parameters of the Markov Chain Mone Carlo (MCMC) analysis and initiates the analysis. Entering commands at the command line provides a way for the user to change parameters on the fly and is useful for quickly exploring the effects of such changes, but it is not the most convenient way to use MrBayes. It is much easier to enter commands via a MrBayes block in the input file.
Any command that can be entered via the command line can also be entered via the input file as a MrBayes statement. To do so add a MrBayes block after the "end;" statement that terminates the data block. A MrBayes block begins with "begin mrbayes;", and ends with "end;". As is the case for the data block, all statements end with a semicolon. Consider the example block below:
begin mrbayes; charset 1stpos = 1-720\3; charset 2ndpos = 2-720\3; charset 3rdpos = 3-720\3; partition bycodon = 3:1stpos,2ndpos,3rdpos; lset nst=6 rates=sitespec sitepartition=bycodon; usertree = (((FR,MS2),GA),((SP,NL95),((M11,MX1),QB)),PP7); mcmc ngen=1000 printfreq=100 samplefreq=10 nchains=4 savebrlens=yes; end;
which was taken from the example file replicase.nex. The first four lines define character sets with the command charset. The line "charset 1stpos = 1-771\3;" defines a character set named "1stpos" as consisting of every third site from 1 through 721. The next two lines similarly define the second and third positions. The format for the charset command is 'charset <name> = <character numbers>'. The next line defines a partition named "bycodon" consisting of three parts: 1stpos, 2ndpos, and 3rdpos. Note that the terms 1stpos, 2ndpos, and 3rdpos that were defined in the first three statements are used in the partition statement. A partition statement has the format 'partition name = <number of partitions>: <sites in 1st partition>, sites in 2nd partition>' etc;. Together these lines have defined codons and the
sites within codons. The next line sets the parameers of the likelihood model with the command "lset". "nst=6" sets the general time reversible model, "rates=sitespecific" sets the model for among sitge variation to being site specific, and "sitepartition=bycodon" sets the name of the site partition to use for site specific rate variation as "bycodon". Note that a partition had to already be defined for this option to be implemented. The next very long line defines the starting tree using the Newick format. If you define a tree it must be strictly bifurcating or MrBayes will return an error message. You don't need to define a tree. If no tree is defined MrBayes will use a random tree as its starting point. If you have a problem with a user tree just use the default random tree option. This may actually be preferable in many cases, particularly when monitoring convergence by comparing several indpendent runs. The final statement mcmc instructs MrBayes to run for 1000 generations (ngen=1000) print its results to the screen once every 100 generations (printfreq=100), to save the current tree to a file once every 10 generations (samplefreq=10), to run four simultaneous MCMC chains, and to save the branch lengths of the trees to the tree file. It also instructs MrBayes to start running the chains. With this block in place typing 'execute <filename>', e.g. 'execute replicase.nex', will produce the following output:
Running Markov chain Starting likelihoods 1 -- -6429.147 -6429.147 2 -- -6388.254 -6388.254 3 -- -6678.446 -6678.446 4 -- -6597.156 -6597.156 100 -- -19215.03 -- (-6045.46) [-5929.31] (-6088.02) (-6238.81) 2s 200 -- -18522.17 -- (-5878.95) [-5819.39] (-5833.47) (-5834.11) 4s (4~2) 300 -- -18292.23 -- [-5732.10] (-5725.53) (-5821.09) (-5810.95) 6s 400 -- -18139.18 -- (-5712.11) (-5703.27) (-5780.78) [-5693.39] 8s 500 -- -18092.13 -- (-5702.02) (-5684.19) (-5762.62) [-5680.81] 9s 600 -- -18036.97 -- (-5696.36) (-5680.07) (-5701.56) [-5671.27] 11s 700 -- -18018.36 -- (-5682.72) (-5679.98) (-5700.30) [-5662.94] 13s (4~1) 800 -- -17981.18 -- (-5679.95) (-5666.39) (-5684.34) [-5649.37] 15s 900 -- -17970.38 -- (-5684.04) (-5663.26) (-5669.62) [-5647.99] 17s 1000 -- -17959.28 -- (-5684.56) [-5657.12] (-5657.73) (-5648.36) 19s (4~2) Continue with chain? (yes/no):
MrBayes is asking whether you want to continue the run. If you type yes (the full word is required) MrBayes will ask for the number of additional generations you want. If you type anything other than yes the run terminates. Why might you want the run to continue? Primarily to ensure that you consider a sufficient number of trees after the likelihoods of the trees have converged on a stable value. In the output, above, the first column shows the generation number. The next column shows the sum of the natural log of the likelihoods for the trees in each of the four chains weighted according to the tempertures of the chains. The default temperature of the heated chains is determined by the temp parameter that is set by the mcmc command. The default setting is temp=0.2. (For an explanation of heated chains see section 2 of this manual) The next four columns show the likelihoods of the trees in each chain with the cold chain being indicated by square brackets. The final (rightmost) column shows the elapsed time in seconds. In three of the generations that were printed to the screen, there was a successful swap between chains, indicated by "(4 2)", "(4 1)", and "(4 2)" in generations 200, 700, and 1000, respectively.
Note that the ln likelihood sum (column 2) starts at -19215.03 and increases with each 100 generations to -17959.28. Eventually that number will converge upon a stable value. Once that stable value is reached the MrBayes is sampling trees according to their posterior probabilities (one hopes). The trees sampled after the log likelihoods reach a plateau can be used to make a consensus tree. Below we will discuss how to decide how many trees to include in that set, but at this point just consider how to answer the question "Continue with chain?".
Since the log likelhood sum has not reached a stable value type "yes". How many additional generations do you want to see? This is the $64,000 question for MCMC analysis. We suggest the following: (1) run multiple chains, starting from random trees if possible and (2) run each chain well past the point where apparent stationarity was reached. For example, if we run the replicase data set again, this time printing every 500th to the screen, we obtain:
500 -- -17993.91 -- [-5651.94] (-5704.28) (-5686.40) (-5658.09) 9s 1000 -- -17938.56 -- (-5653.61) (-5652.86) (-5671.55) [-5644.75] 18s 1500 -- -17911.89 -- (-5642.16) (-5657.28) (-5642.82) [-5643.71] 27s (4~3) 2000 -- -17900.22 -- [-5641.23] (-5642.17) (-5641.69) (-5643.72) 35s (4~2) 2500 -- -17905.10 -- (-5651.80) [-5642.50] (-5642.14) (-5640.15) 44s 3000 -- -17898.56 -- (-5650.13) (-5641.28) [-5635.48] (-5641.77) 53s 3500 -- -17899.50 -- (-5645.09) (-5645.03) [-5637.74] (-5640.57) 62s (1~4) 4000 -- -17888.31 -- [-5635.58] (-5637.56) (-5638.49) (-5643.47) 71s 4500 -- -17901.53 -- [-5638.01] (-5640.68) (-5647.19) (-5646.79) 80s (1~3) 5000 -- -17889.99 -- (-5638.23) (-5640.35) [-5636.40] (-5641.58) 89s (3~2) 5500 -- -17906.17 -- (-5642.21) (-5646.19) [-5648.80] (-5637.32) 98s (1~4) 6000 -- -17910.50 -- (-5642.70) (-5648.94) (-5656.92) [-5637.01] 106s (3~1) 6500 -- -17906.43 -- (-5643.42) [-5638.26] (-5647.00) (-5650.79) 116s 7000 -- -17895.80 -- (-5639.54) [-5634.80] (-5648.19) (-5641.47) 125s 7500 -- -17890.95 -- (-5642.34) (-5639.56) [-5632.56] (-5645.62) 134s 8000 -- -17897.81 -- (-5645.97) (-5641.58) (-5642.21) [-5637.28] 143s (2~3) 8500 -- -17903.34 -- (-5642.63) [-5639.20] (-5644.76) (-5647.57) 152s (1~2) 9000 -- -17892.54 -- (-5644.11) (-5638.13) (-5644.03) [-5634.35] 161s (1~3) 9500 -- -17897.95 -- [-5641.69] (-5637.69) (-5640.14) (-5646.34) 169s (2~1) 10000 -- -17893.30 -- (-5636.77) [-5641.33] (-5647.43) (-5635.27) 178s (4~1) 10500 -- -17891.36 -- (-5642.92) (-5640.35) (-5649.60) [-5629.41] 187s 11000 -- -17896.02 -- (-5648.41) [-5635.74] (-5640.64) (-5641.29) 196s 11500 -- -17907.58 -- (-5642.24) [-5634.30] (-5659.54) (-5646.23) 205s (3~4) 12000 -- -17879.13 -- (-5641.49) (-5635.24) (-5636.55) [-5630.35] 214s (2~3) 12500 -- -17902.92 -- (-5656.38) [-5636.68] (-5648.66) (-5633.30) 223s 13000 -- -17898.09 -- (-5645.50) (-5640.01) (-5645.86) [-5636.92] 232s (2~3) 13500 -- -17889.75 -- (-5647.46) [-5634.11] (-5642.97) (-5634.34) 241s 14000 -- -17889.64 -- (-5641.87) (-5640.95) (-5640.87) [-5633.40] 250s (1~3) 14500 -- -17900.36 -- (-5646.60) (-5643.01) [-5638.79] (-5642.06) 258s (3~4) 15000 -- -17909.86 -- (-5641.49) (-5651.88) (-5654.19) [-5637.49] 267s 15500 -- -17905.15 -- (-5654.59) (-5646.07) (-5643.89) [-5634.62] 276s 16000 -- -17894.56 -- (-5661.10) (-5634.41) (-5644.16) [-5629.49] 285s 16500 -- -17895.79 -- [-5635.66] (-5634.98) (-5647.15) (-5645.50) 294s (3~1) 17000 -- -17887.96 -- (-5635.99) [-5634.48] (-5647.32) (-5636.63) 303s (1~3) 17500 -- -17889.21 -- (-5637.32) (-5636.37) (-5646.93) [-5636.13] 311s 18000 -- -17898.34 -- (-5646.61) (-5639.04) (-5648.28) [-5635.52] 320s (1~3) 18500 -- -17891.88 -- (-5635.24) [-5638.67] (-5648.28) (-5637.44) 329s 19000 -- -17904.45 -- (-5637.69) [-5637.90] (-5647.41) (-5655.37) 338s (2~1) 19500 -- -17902.31 -- (-5646.84) (-5649.32) (-5641.51) [-5636.55] 346s 20000 -- -17891.69 -- [-5635.71] (-5648.45) (-5638.32) (-5637.95) 355s
Note that the chain appeared to reach apparent stationarity by about the 3000th generation. Thus, we discard the states of the chain that were sampled before generation 3000 as the "burn in" of the chain. Inferences of phylogeny could be based upon those trees sampled after the burn in.
At the start of the run MrBayes opened three files: mbout.t, mbout.p, and mbout.bp. These are the default names for the output files, but you can set your own name for the output files using the command 'filename = <filename>' in the mcmc statement The mbout.bp file is used by MrBayes in computing the summary of parameters after the run (see post-run analysis below). mbout.p is a file of the values of various parameters in tab delimited text format; this file can be read by a program such as Excel. mbout.t is the tree file, the file to which MrBayes writes the current tree at the frequency defined in the mcmc statement. That file is the source of the trees used to build a final consensus tree. The trees that are written before the ln likelhoods sum converges on a stable value are less likely, given he data, than are those written after convergence; so it makes sense to base the consensus tree on those written after convergence. The question is: How many trees are necessary to obtain a good consensus tree? Unfortunately there is no good answer to that question other than "it depends on the data". In general, more is better, but how many for how much beter? Are 10000 trees better than 1000? Yes. How much better? One very pragmatic approach is to set ngen at a value that will allow the run to be completed overnight (or in whatever time interval you can live with). For the adh.nex example 16 hours would yield about 480,000 generations or 4800 trees, not an unreasonable number. Since 15000 generations were requried for likelihood convergence it would make sense to discard the first 150 of the 4800 trees as the "burnin".
sumt.- When the run is complete issuing the sumt command will summarize the results concerning the topology and branch lengths. The format is 'sumt <filename> burnin = <number of trees to ignore> contype = <allcompat or halfcompat>'. For the adh.nex example the statement would be 'sumt filename=replicase.nex.t contype=allcompat burnin=300'. Remember, burn in is the number of trees to be ignored, not the number of generations. Contype is the consensus type. Halfcompat is the equivalent of 50% majority rule in PAUP*, while allcompat is the equivalent to the 50% majority rule with "Show frequencies of all observed bipartitions" ticked. If the consensus tree with branch lengths calculated by MrBayes is to be the same as that calculated by PAUP* contype must correspond to the setting in PAUP*.
sumt writes the bipartitions, the frequencies with which they were found, the probabilities of the bipar- titions, and the mean and variance of the branch length (if branch lengths were recorded) to a file called '<filename>.parts'. If branch lengths were recorded it also writes a consensus phylogram based on mean branch lengths. Finally, a consensus tree with clade probabilities is shown on the screen. The consensus tree in <filename>.con can be printed with PAUP*.
You can also construct a consensus tree by importing the tree file into PAUP* and create a consensus tree in PAUP*, including all but the trees discarded as the burn-in trees in the consensus. The numbers at the interior branches of a majority-rule consensus tree are the percent of the time that the clade occurs among the sampled trees; i.e. the (posterior) probability of that clade existing.
As an aside, one of the great advantages of Bayesian inference of phylogeny is its speed. After completing the sumt command and obtaining a consensus tree, you have gone a long ways towards not only finding a good estimate of phylogeny but also finding those parts of the tree that are well supported. Moreover, you have done this using full models of DNA substitution. This is roughly equivalent to doing a maximum likelihood search with bootstrapping.
sump.-The sump command summarizes the information in the file named '<filename>.bp' (default is mbout.bp). The output is to the screen, and provides the mean, variance, and 95% credible inter- val for the parameter. The format is 'sump filename=<filename>.bp burnin=<number of trees to be ignored>'.
If you already know how many generations are requried for convergence you can do the entire run in batch mode by modifying the mrbayes block as shown below. Modifications to the previous example are in boldface:
begin mrbayes; set autoclose=yes; charset 1stpos = 1-720\3; charset 2ndpos = 2-720\3; charset 3rdpos = 3-720\3; partition bycodon = 3:1stpos,2ndpos,3rdpos; lset nst=6 rates=sitespec sitepartition=bycodon; usertree = (((FR,MS2),GA),((SP,NL95),((M11,MX1),QB)),PP7); mcmc ngen=20000 printfreq=500 samplefreq=10 nchains=4 savebrlens=yes; end;
The only unfamiliar command is 'autoclose=yes'. Autoclose prevents MrBayes from asking if you want to continue the run for more generations after it has completed the assigned 20000 generations. That allows it to get on with sumt and sump.
Sometimes it is useful to use a MrBayes block to set most of the parameters, but to vary other parameters from within the program by the command line. You can then use the help command to check out the settings and see how varying them affects things. You can use an mcmcp statement in the mrbayes block instead of an mcmc statement to set the mcmc parameters without starting the run. If you do so, however, all of the post-run statements (supt and sump) must be deleted from the mrbayes block or you will get an error message.
In the semi-batch mode you execute the input file to set the parameters, but issue the mcmc command at the command line to start running the chain.
MrBayes can be interrupted at any time by holding down the Command Key while typing a period (Macintosh) or by typing Control-C (Unix).
The above MrBayes block is a good choice for dna data that is an alignment of coding regions. For non-coding DNA or for RNA everything before the lset statement should be eliminated and lset nst=6 rates=gamma is a good starting point. For protein sequences lset aamodel=jones is a good starting point.
You will quickly discover that MrBayes is a "memory hog". The reason for this is not simply lazy programming, but the result of decisions made early on to make the program as fast as possible, memory-bedamned. It might be useful to know about how much memory will be required before running the program. The formula is
(# chains) W 2 W (2 W # taxa) W (# site patterns) W (# characters states) W (# rate categories) W 8
This will give you the memory requirements in bytes. The number of site patterns is the number of unique patterns in the data. This number is reported before the MCMC analysis. The number of states is either 4 (DNA and RNA), 20 (amino acid), or 61 (codon models). The number of rate categories is the number of rate classes. Usually this is 1 (equal or site specific) and 4 (for the discrete gamma model). The default number of rate categories for gamma-distributed rate variation is ncat=4. This option can be changed using 'lset ncat=<number>'.
The time requried is a function of the processor speed of the computer, the number of taxa, the length of the sequences, the number of parameters that must be estimated in the model determined by lset, and the number of generations for the run. It is therefore almost impossible to predict how long any particular run will take. In general, for any given model, the required time varies linearly with sequence. For the coding sequence data in the adh.nex file, using the model as in lset above, running on a Macintosh G3 processor at 366 MHz, for 100,000 generations, 0.42 hours were required for 10 taxa, 1.05 hours for 20 taxa, 1.7 hours for 30 taxa, 2.4 hours for 40 taxa, and 3.6 hours were required for all 54 taxa. The time required for protein data or codon data is much greater. If the chain seems to be taking a long time, just remember how long it took you to collect the sequences in the first place!
MrBayes can estimate the sequence at one or more internal nodes, i.e. sequences of the taxa that are ancestral to the extant taxa from which the data were obtained. To estimate an ancestral sequence it is necessary to add constraints to the MrBayes block of an existing input file. A constraint is an instruction to always consider a set of taxa as belonging to a clade. The format is "constraint <name> = taxon1 taxon2 taxon3 . . . taxoni;". In the phylogeny below we are interested in estimating the ancestral sequences at the nodes labeled con1 con7.
The tree phylogeny was estimated using the MrBayes block below:
begin mrbayes; charset 1st_pos = 1-867\3; charset 2nd_pos = 2-867\3; charset 3rd_pos = 3-867\3; partition by_codon = 3:1st_pos,2nd_pos,3rd_pos; lset nst=6 rates=sitespec sitepartition=by_codon; set autoclose=yes; mcmc ngen=1000000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes startingtree=random filename=ALLDNA; sumt filename=ALLDNA.t burnin=500 contype=halfcompat; end;
To reconstruct the ancestral sequences the block was modified as shown below in boldface:
begin mrbayes;
charset 1st_pos = 1-867\3;
charset 2nd_pos = 2-867\3;
charset 3rd_pos = 3-867\3;
partition by_codon = 3:1st_pos,2nd_pos,3rd_pos;
constraint con1=TEM1 TEM6 TEM17 TEM54 TEM10 TEM28 TEM70 TEM76 TEM77 TEM79 TEM29 TEM43
TEM2 TEM3 TEM8 TEM24 TEM60 TEM21 TEM22 TEM16 TEM59 TEM12 TEM53 TEM78 TEM20 TEM72
TEM47 TEM68 TEM48;
constraint con2= TEM1 TEM6 TEM17 TEM54 TEM10 TEM28 TEM70 TEM76 TEM77 TEM79 TEM29 TEM43;
constraint con3=TEM2 TEM3 TEM8 TEM24 TEM60 TEM21 TEM22 TEM16 TEM59 TEM12 TEM53 TEM78;
constraint con4=TEM2 TEM3 TEM8 TEM24 TEM60 TEM21 TEM22 TEM16 TEM59;
constraint con5= shv1 shv2 shv5 shv7 shv18 shv9 shv8 shv24 shv26 shv27 shv28 shv2A
shv12 shv15 shv13 shv25;
constraint con6= shv1 shv2 shv5 shv7 shv18 shv9 shv8 shv24 shv26 shv27 shv28;
constraint con7= shv1 shv2 shv5 shv7 shv18 shv9 shv8 shv24;
lset nst=6 rates=sitespec sitepartition=by_codon enforcecon=yes inferanc=yes;
set autoclose=yes;
mcmc ngen=1000000 printfreq=1000 samplefreq=100 nchains=4 savebrlens=yes
startingtree=random filename=ALLDNA;
sumt filename=ALLDNA.t burnin=500 contype=halfcompat;
end;
Each constraint has a name (con1, con2, etc.) and each lists the the taxa that are constrained. Notice that all of the taxa in con7 are also in con6. Each constraint will result in the estimation of the ancestral sequence for that node. In addition to the constraints themselves two additional parameters must be set with lset: enforcecon=yes and inferanc=yes. inferanc tells Mr. Bayes to infer ancestral states, while enforcecon makes MrBayes keep members of a constraint together as a clade. In practice, a tree must first be constructed without constraints, and probabilities of clades must be estimated at this stage. Once that is done the tree must be reconstructed with the constraints of interest. It is not legitimate to estimate probabilities of clades when constraints have been imposed because the constraint has forced the probability at the constrainted node to be 100%. If the fileneame is alldna, then in addition to the alldna.t, alldna.p, alldna.bp, alldna.con and alldna.parts files MrBayes will write a file for each of the constraints named alldna.sum.1, alldna.sum.11, etc The files list the sites and the probabilities of, respectively, A, C, G or T at that site, and the most probable base at that site. For reasons known only to programmers the sites are numbered starting at zero instead of 1, so you will have to add 1 to each site position to make them correspond to standard numbering of a sequence. Summary of ancestral states at constraint "con1"
0 0 -- 0.999942 0.000024 0.000017 0.000017 A 1 1 -- 0.000018 0.000017 0.000019 0.999945 T 2 2 -- 0.000017 0.000017 0.999943 0.000023 G 3 0 -- 0.999942 0.000024 0.000017 0.000017 A 4 3 -- 0.000017 0.000017 0.999948 0.000018 G 5 4 -- 0.000024 0.000019 0.000031 0.999927 T 6 5 -- 0.999944 0.000022 0.000017 0.000017 A
The most probable sequence and the log of the likelihood of that sequence are given at the bottom of each file.
execute -- Executes a file quit -- Quits the program lset -- Sets the parameters of the likelihood prset -- Sets the priors for the parameters props -- Sets MCMC chain parameters mcmc -- Starts Markov chain Monte Carlo mcmcp -- Sets the parameters of a chain (without starting analysis) exclude -- Excludes sites from the analysis include -- Includes sites delete -- Deletes taxa from the analysis restore -- Restores taxa charstat -- Shows status of characters charset -- Assigns a group of sites to a set partition -- Assigns a character partition constraint -- Assigns a constraint on a tree calibration -- Assigns a calibration on a tree outgroup -- Assigns a single species to be the outgroup root -- Roots tree at outgroup species deroot -- Deroots tree showtree -- Shows the current user tree shownodes -- Shows the current user tree as linked nodes sumt -- Summarize the trees in a tree file set -- Sets run conditions of program usertree -- Assigns a user tree
This command executes a file called <file name>. The correct usage is: "execute <file name>" For example, "execute replicase.nex" would execute the file named "replicase.nex". This file must be in the same directory as the executable.
This command quits the program.
The correct usage is: "quit"
This command sets the parameters of the likelihood model. The likelihood function is the probability of observing the data conditional on the phylogenetic model. In order to calculate the likelihood, you must assume a model of character change. This command lets you tailor these assumptions. The correct usage is: "lset <parameter>=<option>" For example, "lset nst=6 rates=gamma" would set the model to a general model of DNA substition (the GTR) with gamma-distributed rate variation across sites.
Options:
nst -- Sets the number of substitution types: "1" is JC69 or
F81, "2" is K80 or HKY85, "6" is GTR, and "12" is
the general nonreversible model.
aamodel -- Sets the model of substitution for amino acid sequence
data. The options are "poisson" (a glorified Jukes-
Cantor), "equalin" (a glorified Felsenstein, 1981),
"jones", "dayhoff", and "gtr".
rates -- Sets the model for among-site rate variation:
* equal -- No rate variation across sites.
* gamma -- Gamma-distributed rate variation.
* adgamma -- Autocorrelated rates across sites. The
marginal rate at a site is gamma, but
adjacent sites have correlated rates.
* invariant -- A proportion of the sites are invariable.
* invgamma -- A proportion of the sites are invariable
while the rate for the remaining sites is
drawn from a gamma distribution.
* sitespec -- Site specific rates where sites are
assigned to partitions and the rate for
each partition estimated separately.
* ssgamma -- Site specific rates but the rates within
a class have rate variation as described
by a gamma distribution.
* ssadgamma -- Site specific rates but the rates within
a class have rate variation as described
by a autocorrelated gamma distribution.
inferrates -- When inferrates=yes, the rate factor of the sites is
estimated.
basefreq -- Sets the base frequencies: "equal" sets the frequency
to 1/4 for each nucleotide, "empirical" sets the freq-
uencies to the observed frequency of each nucleotide in
the alignment, and "estimate" allows the program to
estimate the frequencies. Note that when "nst=12", that
the base frequencies are completely determined by the Q
matrix.
clock -- Sets the model for branch length constraints: "strict"
restrains the branch lengths such that the molecular clock
is obeyed, "relaxed" relaxes the molecular clock
assumption by assuming that rates change on the tree
according to a compound Poisson process, and "uncon-
strained" allows each branch to have its own rate
parameter.
tratio -- Sets the transition/transverion rate ratio. This option
only works when "nst=2".
revmat -- Sets the rates for the GTR model. This option only works
when "nst=6".
nonrevmat -- Sets the rates for the general nonreversible model. This
option only works when "nst=12".
shape -- Sets the shape parameter of the gamma distribution. This
option only works when gamma-distributed rate variation
has been selected ("rates=gamma").
ncat -- Sets the number of rate categories for the gamma distri-
bution. The gamma distribution is a continuous one.
However, it is virtually impossible to calculate likeli-
hoods under the gamma distribution. Hence, an approx-
imation to the continuous gamma is used; the gamma distri-
bution is broken into ncat categories of equal weight
(1/ncat). The mean rate for each category represents the
rate for the entire category. This option allows you to
specify how many rates to use when approximating the
gamma. The approximation is better as ncat is increased.
In practice, "ncat=4" does a reasonable job of approx-
imating the continuous gamma.
coding -- This specifies how characters were sampled. If all
site patterns had the possiblity of being sampled, then
"all" should be specified (the default). Otherwise,
"variable" (only variable characters), "informative"
only parsimony informative), "noabsence" (patterns
with all characters absent), or "nopresence" (patterns
with all characters present) should be specified. These
options only work with morphological (all/variable/
informative) or restriction site (all/variable/noabsence/
nopresence) data.
enforcecodon -- Enforces the use of a 61 X 61 codon model of DNA
substitution. Use of this option allows you to estimate
the nonsynonymous/synonymous rate of substitution (omega).
Make certain that the data are protein coding DNA
sequences that are in frame (i.e., the length of the
sequences should be divisable by three.
code -- Enforces the use of a particular genetic code. The default
is the universal code. Other options include "vertmt"
for vertebrate mitochondrial DNA, "mycoplasma",
"yeast", "ciliates", and "metmt" (for metazoan
mitochondrial DNA, except vertebrates).
omegavar -- Allows the nonsynonymous/synonymous rate ratio (omega) to
vary across codons according to the model of Nielsen and
Yang (1998) when the "ny98" option is enforced. Other-
wise, omega is assumed to be equal across sites.
omegaratio -- Sets the nonsynonymous/synonymous rate ratio.
seqerror -- Sets the probability of a sequencing error at a site in
the aligned sequence.
inferpossel -- Sets the option for inferring positively selected sites.
You must specify the codon model of DNA substitution and
the ny98 model for omega variation.
sitepartition -- Sets the name of the site partition to use for site
specific rate variation. A partition must be defined for
this option to be implemented.
siterates -- Sets the rates for site specific rate variation. A site
partition must be defined.
enforcecon -- Enforces constraints, if defined, when "enforcecon=yes".
enforcecal -- Enforces calibrations.
inferanc -- Causes the states at nodes specified by constraints to be
inferred. Constraints must be defined for this option to
work.
ancfile -- Determines whether all of the ancestral state information
is printed to files (yes) or only a summary file for each
constrained node (no).
Parameter Options Current Setting
nst 1/2/6/12 subModel
aamodel poisson/equalin/jones/dayhoff/gtr poisson
rates equal/gamma/adgamma/invariant/
invgamma/sitespec/ssgamma/
ssadgamma ratesModel
inferrates yes/no yes
basefreq equal/empirical/estimate baseFreqModel
clock unconstrained/relaxed/strict clockModel
tratio <number>/estimate tRatioModel/tRatio
revmat (<number>, ..., <number>)/estimate revRatesModel
nonrevmat (<number>, ..., <number>)/estimate nonRevRatesModel
omega <number>/estimate omegaRatioModel/omegaRatio
omegavar equal/ny98 codonModelType
shape <number>/estimate shapeModel/shape
ncat <number> nRateCats
coding all/variable/informative/
noabsence/nopresence codingType
seqerror <number> seqErrorProb
sitepartition <name> partitionName
siterates (<number>, ..., <number>)/estimate siteRatesModel
enforcecon yes/no yes
inferanc yes/no yes
ancfile yes/no yes
enforcecodon yes/no yes
inferpossel yes/no yes
code universal/vertmt/mycoplasma/ universal
This command excludes characters from the analysis. The correct usage is: "exclude <number> <number> <number> - <number>" For example, "exclude 2 3 10-14 22" excludes sites 2, 3, 10, 11, 12, 13, 14, and 22 from the analysis. Similarly, "exclude all" excludes all of the characters.
This command includes characters that were previously deleted from the analysis. The correct usage is: "include <number> <number> <number> - <number>". For example, "include 2 3 10-14 22" includes sites 2, 3, 10, 11, 12, 13, 14, and 22 in the analysis. Similarly, "include all" includes all of the characters.
This command deletes taxa from the analysis. The correct usage is: "delete <name> <number>". A list of the taxon names and/or numbers can be used (the taxa are numbered from 1 to n in the order in the matrix). For example, "delete 1 2 Homo_sapiens" deletes taxa 1, 2, and the taxon labelled "Homo sapiens" from the analysis.
This command restores taxa to the analysis. The correct usage is: "restore <name> <number>". A list of the taxon names and/or numbers can be used (the taxa are numbered from 1 to n in the order in the matrix). For example, "restore 1 2 Homo_sapiens" restores taxa 1, 2, and the taxon labelled "Homo sapiens" to the analysis.
This command shows the status of all the characters.
The correct usage is: "charstat"
After typing "charstat", the character number, whether it is
excluded or included, and the partition identity are shown.
This command assigns a taxon to the outgroup. The correct usage is: "outgroup <number>/<taxon name>" For example, "outgroup 3" assigns the third taxon in the matrix to be the outgroup. Similarly, "outgroup Homo_sapiens". assigns the taxon labelled "Homo_sapiens" to be the outgroup (assuming that there is a taxon named "Homo_sapiens" in the matrix). Only a single taxon can be assigned to be the outgroup.
This command sets the priors for the phylogenetic model. Remember that in a Bayesian analysis, you must specify a prior probability distribution for the parameters of the likelihood model. In this case, you must specify the priors for the model of DNA substitution. qmatpr sets the prior for the rate matrix; the options are "fixed", "exp", and "uni", which fixes the rate matrix, places an exponential or uniform distribution on the rate matrix, respectively. Other options include "basefreqpr", "brlenpr", "shapepr", and "siteratepr" which specify the prior for the base frequencies, branch lengths, gamma shape parameter, or site specific rates, respectively. The correct usage is: "prset qmatpr=uni(0,100.0) basefreqpr=dir(1.0)", which would set the prior on the rates of the instantaneous rate matrix to a uniform (0,100) and the prior on the base frequencies to a dirichlet.
Options:
Parameter Options Current Setting qmatpr fixed/exp/uni qmatprModel basefreqpr fixed/dir basefreqprModel brlenpr uni/exp/bd brlenprModel shapepr fixed/exp/uni shapeprModel omegapr fixed/exp/uni omegaprModel siteratepr fixed/exp/uni siterateprModel statefreqpr fixed/dir stateFreqPrModel
This command starts the Markov chain Monte Carlo (MCMC) analysis. The posterior probability of phylogenetic trees (and other parameters of the substitution model) cannot be determined analytically. Instead, MCMC is used to approximate the posterior probabilities of trees by drawing (dependent) samples from the posterior distribution. This program can implement a variant of MCMC called "Metropolis-coupled Markov chain Monte Carlo", or MCMCMC for short. Basically, "nchains" are run, with nchains - 1 of them heated. The chains are labelled 1, 2, ..., nchains. The heat that is applied to the i-th chain is B = 1 / (1 + temp X i). B is the power to which the posterior probability is raised. When B = 0, all trees have equal probability and the chain freely visits trees. B = 1 is the "cold" chain (or the distribution of interest). MCMCMC can mix better than ordinary MCMC; after all of the chains have gone through one cycle, two chains are chosen at random and an attempt is made to swap the states (with the probability of a swap being determined by the Metropolis et al. equation). This allows the chain to potentially jump a valley in a single bound. The correct usage is "mcmc ngen=100000 nchains=4 temp=0.5", which performs a MCMCMC analysis with four chains with the temperature set to 0.5. The chains would be run for 100,000 cycles.
Options:
Parameter Options Current Setting seed <number> randomSeed ngen <number> nGen samplefreq <number> sampleFreq printfreq <number> printFreq nchains <number> numChains temp <number> chainTemp filename <file name> outFile burnin <number> mcmcBurn calcpseudobf yes/no yes cyclefreq <number> cycleFreq startingtree random/user startingTreeModel nperts <number> nPerturbations savebrlens yes/no yes
This command sets the parameters of the Markov chain Monte Carlo (MCMC) analysis without actually starting the chain. This command is identical in all respects to mcmc, except that the analysis will not start after this command is issued.
Options:
Parameter Options Current Setting seed <number> randomSeed ngen <number> nGen samplefreq <number> sampleFreq printfreq <number> printFreq nchains <number> numChains temp <number> chainTemp filename <file name> outFile burnin <number> mcmcBurn startingtree random/user startingTreeModel nperts <number> nPerturbations savebrlens yes/no yes
This command roots the tree. If the tree is already rooted, a warning is issued. The tree is rooted at the outgroup species. The correct usage is "root".
This command sets run conditions of the program. Currently, the only option is "autoclose". When autoclose=yes, the program will skip the querry usually issued at the end of a MCMC analysis.
Options:
Parameter Options Current Setting autoclose yes/no yes
This command defines a character set. The format for the charset command is "charset <name> = <character numbers>". For example, "charset first_pos = 1-720\3" defines a character set called "first_pos" that includes every third site from 1 to 720. The character set name cannot have any spaces in it. The slash (\) is a nifty way of telling the program to assign every third (or second, or fifth, or whatever) character to the character set. You can assign up to 32 character sets. This option is best used not from the command line but rather as a line in the mrbayes block of a file.
This command defines a tree constraint. The format for the constraint command is "constraint <name> = <list of taxa>". For example, "constraint apes = human chimp gorilla organg gibbon" defines a constraint consisting of five species. The list of taxa defines a constraint on a tree by specifying a taxon bipartition that must be present in all trees considered by the MCMC algorithm. In effect, by using a constraint, you are specifying a very strong prior on the topology of the tree (the posterior probabilities of all trees that do not contain the taxon bipartition is zero). For the program to actually see a constraint, you must specify enforcecon=yes in the lset command. This option is best used as as a line in the mrbayes block.
This command defines a calibration on a tree. The format for the calibration command is "calibration <name> = (<min age> <+error>) <list of taxa>". For example, "calibration apes = (4, 2) human chimp gorilla organg gibbon" defines a calibration consisting of five species. The list of taxa defines a constraint on a tree (see description of constraint command). The numbers in the parantheses specify two things: the first number specifies the youngest age for the clade containing the listed taxa; the second number specifies how much older than this minimum age the clade can be. The program assumes that the divergence time is uniformly distributed on the interval (x, x+y), where x is the first number and y is the second number specified in the parantheses. In order for the program to see calibrations, you must specify enforcecal=yes in the lset command. If you are interested in inferring speciation times on a tree (as I suspect you must be if you go through the trouble of specifying a calibration), then you should also make certain that you assume the molecular clock (or a relaxed version thereof). Although you can specify a calibration from the command line, it makes more sense to use this option as a line in the mrbayes block.
This command allows you to specify a user tree. The user tree can then be used as a starting tree for a MCMC analysis. The format for the command is "usertree = <tree in Newick format>". For example, "usertree = (A,B,(C,D))" specifies an unrooted tree of four species. Note that the program requires that trees are binary (i.e., strictly bifurcating). Hence, there can be only one three-way split, as shown in the example. If the tree is not binary, the program should return an error.
This command allows you to specify a character partition. The format for this command is "partition <name> = <number of partitions>:<sites in first>, <sites in second>,<sites in third>,<sites in fourth>, ...,<sites in last partition>". For example, "partition by_codon = 3:1st_pos,2nd_pos,3rd_pos" specifies a partition called "by_codon" which consists of three parts (first, second, and third codon positions). Here, I am assuming that the sites in each partition were defined using the charset command. You can specify a partition without using charset as follows: "partition by_codon 3:1 4 6 9 12,2 5 7 10 13,3 6 8 11 14". However, I recommend that you use the charsets to define a set of characters and then use these predefined sets when defining the partition. Also, it makes more sense to define a partition as a line in the mrbayes block than to issue the command from the command line (then again, you may be a masochist, and want to do extra work).
Unfortunately, cladists are beyond my ability to help. I suspect that you cannot help them either.
mcmc savebrlens=no is the default.
This command allows you to control aspects of the MCMC algorithm. The MCMC algorithm updates parameters in blocks. Each proposal mechanism changes the parameter in a specific manner. This command lets you change these proposal mechanisms so that the chain mixes better for your specific data set. You should only be using this command if you have a fairly good understanding of the MCMC algorithm.
Options:
Parameter Current Setting qmatwin qMatWinProp pialpha piAlphaProp ratewin rateWinProp tuning tuningProp clocktune clockTuneProp timetune timeTuneProp mtune mTuneProp extendp extendPProp wormtune wormTuneProp bdwin bdWinProp omegawin omegaWinProp rhowin rhoWinProp omegaalpha omegaAlphaProp invslide invSlideProp switchwin switchWinProp alphawin alphaWinProp reconlim reconLimProp tbrextendp tbrExtendPProp tbrtune tbrTuneProp blTune blTuneProp
This command deroots the tree. If the tree is already unrooted, a warning is issued. The correct usage is "deroot".
This command shows the current user tree. The correct usage is "showtree".
This command shows the current user tree as a linked list of nodes. The nodes are formated as "N w (l=x r=y a=z) <number>", where "w" is the node number, "x" is the node to the left, "y" is the node to the right, and "z" is the ancestral node to node "w", respectively. The last decimal number is the length of the branch leading to node "w". A "-1" indicates that the node does not exist (is null). The correct usage is "shownodes".
This command summarizes the trees in a tree file named "filename". All of the trees are read from the file and the proportion of the time any single taxon bipartition is found is counted. The proportion of the time that the bipartition is found is an approximation of the posterior probability of the bipartition. (Remember that a taxon bipartition is defined by removing a branch on the tree, dividing the tree into those taxa to the left and right of the removed branch. This set is called a taxon bipartition). The branch length of the bipartition is also recorded. The result is a list of the taxon bipartitions found, the frequency with which they were found, the posterior probability of the bipartition and, if branch lengths were recorded, the mean and variance of the length of the branch. The partition information is output to a file called "<filename>.parts". A consensus tree is also printed to a file called "<filename>.con". The consensus tree is either a 50 percent majority rule tree or a majority rule tree with all compatible partitions displayed. Finally, the number of trees in the tree file that are skipped is controlled by the "burnin". The default is 0, but you may want to discard those trees that were sampled while the chain was not at stationarity. You can display the majority rule consensus tree using a program such as PAUP*. You will have to manually put the probabilities of clades on the tree.
Options:
Parameter Options Current Setting filename <name> sumTreeFile burnin <number> burnIn contype halfcompat/allcompat halfcompat
This command summarizes the information in a file named "filename". The program prints three files during a MCMC analysis: one tree file and two parameter files. The parameter files have extensions .p and .bp. This command, sump, summarizes the information in the file <filename.bp>. The output is to the screen, and provides the mean, variance, and 95 percent credible interval for the parameter. You may want to discard a specified number of observations from the chain as the burn-in.
Options:
Parameter Options Current Setting filename <name> sumParmFile burnin <number> burnIn
MrBayes is provided free and as is. It is not guaranteed to be free of bugs. Moreover, we suggest that you run multiple chains in all of your analyses to confirm that the program is converging to the posterior probability of interest. You should know that Markov chain Monte Carlo is a fantastic technology that allows you to calculate posterior probabilities for really hard problems. However, the method is not foolproof, and you should take care to confirm your results.
If you experience any problems with the program, please feel free to email JPH at 'johnh@brahms.biology.rochester.edu' or FR at 'fredrik.ronquist@ebc.uu.se'. If you use the results of the program in a paper, please use the following citation:
Huelsenbeck, J. P, and F. R. Ronquist. In Press. MRBAYES: Bayesian inference of phylogeny. Biometrics.
Please send comments or suggestions about the program to mrbayes@ebc.uu.se. Your message will be forwarded to JPH and FR.
Peter Foster gave us the routines for exponentiating the Q matrix and Ziheng Yang provided the routines for mainipulating gamma distributions. Barry Hall, Johan Nylanger, and Jon Bollback provided feedback on the program and did some initial testing. NSF (DEB-0075406) and the Swedish Natural Science Research Council provided much needed financial support.
Adachi, J., and M. Hasegawa. 1992. Amino acid substitution of proteins coded for in mitochondrial DNA during mammalian evolution. Jpn. J. Genet. 67:187-197.
Adachi, J., and M. Hasegawa. 1996a. MOLPHY: Programs for molecular phylogenetics, I. PROTML. Maximum likelihood inference of protein phylogeny. Comput. Sci. Monogr. 27:1-77.
Adachi, J., and M. Hasegawa. 1996b.MOLPHY, version 2.3: Programs for molecular phylogenetics based on maximum likelihood. Comput. Sci. Monogr. 28:1-150.
Bayes, T. 1763. An essay towrds solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London 53:370-418. Reprinted, E. S. Pearson and M. G. Kendall (eds.). 1970. Pages 131-153 in Studies in the History of Statistics and Probability. Charles Griffin, London.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution 17:368-376.
Felsenstein, J. 1993. PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle.
Geyer, C. J. 1991. Markov chain Monte Carlo maximum likelihood. Pages 156-163 in Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface. (E. M. Keramidas, ed.). Fairfax Station: Interface Foundation.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular Biology and Evolution 11:725-736.
Green, P. J. 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model deter- mination. Biometrika 82:711-732.
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22:160-174.
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16:111-120.
Larget, B., and D. Simon. 1999. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Molecular Biology and Evolution 16:750-759.
Lewis, P. O. 2001. Maximum likelihood phylogenetic inference: Modeling discrete morphological charac- ters. Systematic Biology.
Li, S. 1996. Phylogenetic tree construction using Markov chain Monte carlo. Ph. D. dissertation, Ohio State University, Columbus.
Mau, B. 1996. Bayesian phylogenetic inference via Markov chain Monte carlo methods. Ph. D. dissertation, University of Wisconsin, Madison.
Mau, B., and M. Newton. 1997. Phylogenetic inference for binary data on dendrograms using Markov chain Monte Carlo. Journal of Computational and Graphical Statistics 6:122-131.
Mau, B., M. Newton, and B. Larget. 1999. Bayesian phylogenetic inference via Markov chain Monte carlo methods. Biometrics 55:1-12.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21:1087-1091.
Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynony- mous nucleotide substitution rates with application to the chloroplast genome. Molecular Biology and Evolution 11:715-724.
Rannala, B., and Z. Yang. 1996. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. Journal of Molecular Evolution 43:304-311.
Schoniger, M., and A. von Haeseler. 1994. A stochastic model and the evolution of autocorrelated DNA sequences. Molecular Phylogenetics and Evolution 3:240-247.
Swofford, D. L. 1998. PAUP*: Phylogenetic Analysis Using Parsimony and Other Methods. Sinauer Associates, Sunderland, MA.
Tavare, S. 1986. Some probabilistic and statistical problems on the analysis of DNA sequences. Pages 57-86 in Lectures in Mathematics in the Life Sciences, vol. 17.
Tierney, L. 1994. Markov chains for exploring posterior distributions (with discussion). Annuals of Statistics 22:1701-1762.
Yang, Z. 1993. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Molecular Biology and Evolution 10:1396-1401.
Yang, Z. 1994a. Estimating the pattern of nucleotide substitution. Journal of Molecular Evolution 39:105- 111.
Yang, Z. 1994b. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. Journal of Molecular Evolution 39:306-314.
Yang, Z. 1997. TITLE. CABIOS 15:555.
Yang, Z., and B. Rannala. 1997. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte carlo method. Molecular Biology and Evolution 14:717-724.