Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods

doi:10.1007/BF00160154

J Mol Evol (1994) 39:306-314

Journal

of Molecular

Evolution

Maximum Likelihood Phylogenetic Estimation from DNA Sequences with

Variable Rates over Sites: Approximate Methods

Ziheng Yang

Department of Zoology, The Natural History Museum, London SW7 5BD, United Kingdom

Received: 22 December 1993 / Accepted: 9 March 1994

Abstract.

Two approximate methods are proposed

for maximum likelihood phylogenetic estimation, which

allow variable rates of substitution across nucleotide

sites. Three data sets with quite different characteristics

were analyzed to examine empirically the performance

of these methods. The first, called the "discrete gamma

model," uses several categories of rates to approximate

the gamma distribution, with equal probability for each

category. The mean of each category is used to repre-

sent all the rates falling in the category. The performance

of this method is found to be quite good, and four such

categories appear to be sufficient to produce both an op-

timum, or near-optimum fit by the model to the data, and

also an acceptable approximation to the continuous dis-

tribution. The second method, called "fixed-rates mod-

el," classifies sites into several classes according to

their rates predicted assuming the star tree. Sites in dif-

ferent classes are then assumed to be evolving at these

fixed rates when other tree topologies are evaluated.

Analyses of the data sets suggest that this method can

produce reasonable results, but it seems to share some

properties of a least-squares pairwise comparison; for

example, interior branch lengths in nonbest trees are of-

ten found to be zero. The computational requirements of

the two methods are comparable to that of Felsenstein's

(1981, J Mol Evol 17:368-376) model, which assumes

a single rate for all the sites.

Key words: Phylogeny -- Maximum likelihood --

Rate variation over sites -- The gamma distribution --

Approximate methods

Introduction

Variation of substitution rates across nucleotide sites has

long been recognized as a characteristic of DNA se-

quence evolution, especially for sequences coding for bi-

ological products (e.g., Fitch and Margoliash 1967;

Fitch and Markowitz 1970; Uzzell and Corbin 1971;

Holmquist et al. 1983; Fitch 1986; Kocher and Wilson

1991; Wakeley 1993). There have been many attempts

to account for such rate variation in phylogenetic analy-

sis. Two approaches are taken. The first assumes that

rates over sites are random variables drawn from a con-

tinuous distribution; for example, Nei and Gojobori

(1986), Jin and Nei (1990), Li et al. (1990), and Tamu-

ra and Nei (1993) used the gamma distribution for rates

over sites when they constructed estimators of the dis-

tance between two sequences. The second approach us-

es several categories of rates. The simplest model of this

sort assumes that a proportion of sites are invariable

while others are changing at a constant rate (e.g.,

Hasegawa et al. 1985; Palumbi 1989; Hasegawa and Ho-

rai 1991). In accounting for the extreme rate hetero-

geneity of the control region of the human mtDNA,

Hasegawa et al. (1993) adopted a three-rate-category

model, wherein some sites are assumed to be invariable

while others are either moderately or highly variable.

Biologically, a continuous distribution may seem to

be more reasonable, and indeed, when fitting several

models to the control region of human mtDNAs, Wake-

ley (1993) found that a two-rate-category model could

not fit the data properly, while the fit of a gamma dis-

tribution was statistically acceptable. Recently, the gam-

ma distribution was also incorporated into a joint like-

307

lihood analysis by Yang (1993), a direct extension of

the method of Felsenstein (1981), which assumes a sin-

gle rate for all the sites. However, the algorithm of

Yang (1993) involves intensive computation, making

it impractical for data sets with more than a few spe-

cies.

In this paper, we introduce two approximate methods.

The first, called the "discrete gamma model," uses sev-

eral categories to approximate the continuous gamma

distribution. Rates over sites are regarded as random

variables drawn from a discrete distribution. The ad-

vantage of taking a discrete distribution as an approxi-

mation to the continuous one is that only one extra pa-

rameter is needed. This appears to lead to more efficient

estimation, easier interpretation of results, and also bet-

ter fit of the model to data. The second method, called

the "fixed-rates model," predicts rates at specific sites

by assuming the star tree and the gamma distribution us-

ing the method of Yang and Wang (in press), and then

combines sites into several classes according to these

predicted rates. Sites in different classes are then as-

sumed to be evolving at known (different) rates when

other tree topologies are evaluated.

We will analyze three different data sets to examine

the fit to data of the discrete gamma model, and its ap-

proximation to the continuous distribution, in order to

determine a satisfactory number of categories in the

discrete distribution. The continuous gamma model

(Yang 1993), the discrete gamma model, the fixed-rates

model, and a least-squares method based on pairwise

distance estimates will all be applied to evaluate the pos-

sible tree topologies. The results will give us some idea

about the similarities and differences among those tree

estimation methods. However, the performance of these

methods with regard to tree reconstruction will be ex-

amined more rigorously by using computer simulations.

Methods and Data

of substitution is 1 when the process is in equilibrium. Time t or

branch lengths in the tree are then measured by the expected number

of nucleotide substitutions per site. Parameter K > max(-~y,-r~a) ad-

justs for the transition/transversion rate bias; a K larger than 0 will al-

low transitions to occur with higher probabilities than transversions.

The rate matrix was given by Hasegawa and Kishino (1989), and the

transition probability matrix, P(t) = exp{Qt}, was described by

Thorne et al. (1992). The model will be designated "F84."

When applied to real data, the fit of F84 is very similar to that of

the model of Hasegawa et al. (1985), denoted "HKY85" (results not

shown). The transition/transversion rate ratios in the two models are

roughly related

as KHKY8 5 =

2KF84 -- 1, where

KHKY8 5

is equivalent

to a/J3 in the notation of Hasegawa et al. (1985). Goldman (1993) pro-

vides a more accurate formula for this relationship. There is, howev-

er, a mathematical difference between these two models: while the rate

matrix of the HKY85 model has four distinct eigenroots (Hasegawa

et al. 1985), Q in Eq. 1 has only three, that is, k~ = 0,

~'2 = --~t,

~'3

= L 4 = -(1 + K)~t. The corresponding eigenvectors of Q are the same

as those for HKY85 and are given by Hasegawa et al. (1985). The F84

model therefore involves less computation than HKY85, especially for

the algorithm of Yang (1993). Another result of this difference is that

a simple formula for estimating the distance between two sequences

is available for F84 while there is not for HKY85, for the reasons ex-

plained by Yang (in press).

Assume that rates over sites follow a gamma distribution with (giv-

en or independently estimated) shape parameter ct. Let P and Q be the

proportions of sites with transitional and transversional differences re-

spectively in the two sequences. Following, e.g., Jin and Nei (1990)

or Tamura and Nei (1993), parameter ¢ and sequence divergence

t,

as defined by t = [4J~T~C(I

+

K/Gy) -k 4JIAT~G(I

q-

K/GR) -t- 4gy~R] ' ~t,

can be estimated as follows:

= a/b

- 1 (2)

= [4~T~C(1

+

l(/~y) -~ 4JIa~G(1 + K/JIR)

+ 4~y~R] • b

where

[ -log(A)/2, if ot =

a = (l+K)~tt =

icz(A_l/.

1)/2, if0<a<~

-- [ -log(B)/2, if c~ =

b= ut= la(B -lm-

1)/2, if0<c~<

and

(3)

(4)

(5)

The Model of Nucleotide Substitution.

In this paper we will use the

Markov process model for nucleotide substitution that has been im-

plemented in the DNAML program in J. Felsenstein' s PHYLIP pack-

age since 1984 (Felsenstein personal communication). The rate ma-

trix for this model is

A -

2(~Tg c + J~AJIG ) + 2(gTJIC~lR/g Y + :r~A~GJIy/J~R) " [1 -- Q/(2gy0"lg) ] -- P

2(~T~C/G Y q- JIA~IG/JIR )

(6)

Q _

I (1 • K/JIy)~ C ~A ~G ]

(1 + ~/=Y)=T JIA =a

JIT JIC (1 + K/JIR)~ c '

=T JIC (1 q- K/~R)~ A

(1)

where Qij (i ~ j) represents the rate of substitution from nucleotide i

to j, with the nucleotides ordered T, C, A, G. The diagonals are spec-

ified by the mathematical requirement that row sums of Q are zero.

The equilibrium distribution is given by ~T, ~c, ~A' JIG, with ~y = ~T

+ ~c and ~R = ~A + gC = 1

-

~y. g = 1/[4~c(1 + K/TOy) + 4~A~ G

(1 + K/~R) + 4~y~R] is a scale factor, chosen so that the average rate

B = 1 - Q/(2ZCyJIR) (7)

The frequency parameters ~T' ~C,

~A'

JIG, are estimated using the

averages of the observed frequencies in the two sequences. When c~

= ~, the gamma distribution reduces to the model with a single rate

over sites. In this paper, we estimate the c~ parameter to be used in Eqs.

2 and 3 by the method of Yang (1993) assuming the star tree. Pair-

wise distances estimated this way can then be used in constructing the

least-squares (LS) additive tree (Cavalli-Sforza and Edwards 1967) or

in other distance matrix methods. The maximum likelihood estimates

of K and t are normally found to be very similar to those obtained from

Eqs. 2 and 3 (results not shown).

308

The Discrete Gamma Model.

We use k categories to approximate

the gamma distribution, with equal probability

1/k

in each category.

The density function of the gamma distribution G(c~,[3) is

g(r;c~,[3)=~exp{-~r}.r

~ a, 0<r<~ (8)

with mean

E(r)

= c~/[3 and variance

V(r)

= ct/[32. We note that when

r -- G(c~,[3),

cr ~

G(ct,[3/c). The gamma distribution with 13 = J/z re-

duces to the Z2 distribution, that is, G(ot, I/2) --- X2(2c0. Using these re-

lationships, we can calculate the percentage point (the cutting point)

of the gamma distribution, i.e., the value of z such that Prob{r < z}

= p where r ~ G(c~,13), as follows:

zo(P,'Ct, 13) = z×2(p;2a)/(213)

(9)

where

zx2(p;v)

is the percentage point of the Z 2 distribution with v de-

grees of freedom, which can be calculated by, say, the algorithm of

Best and Roberts (1975).

The range of r, (0,~), is cut into k categories by k - 1 percent-

age points corresponding to p =

1/k, 2/k, ....

(k -

1)/k.

The rate in

each category can then be represented by the mean of the portion of

the gamma distribution falling in the category. Suppose that the two

cutting points of category i are a and b. Then the rate for category i

can be obtained as

4,

3

t(O

2

0 7-- ....

0 0,5 1 1.5 2

r

Fig. 1. Discrete approximation to the gamma distribution G(ct,[3),

with ct = [3 = l/2. Four categories are used to approximate the con-

tinuous distribution, with equal probability for each category. The

three boundaries are 0.1015, 0.4549, and 1.3233, which are the per-

centage points corresponding to p = 1/4,

z14,

3/4. The means of the four

categories are 0.0334, 0.2519, 0.8203, 2.8944. The medians are

0.0247, 0.2389, 0.7870, 2.3535, and these are scaled to get 0.0291,

0.2807, 0.9248, and 2.7654, so that the mean of the discrete distri-

bution is one.

ri = f;r g(r,'a,~) drlf g(r,'c~,13) dr

= a/[3 [I(b13,c~ +

1) -

l(a[3,ct +

1)](l/k) (10)

where

l(z, cQ

= [1/F(c0] f~ exp { -x} •

xa-ldx

is the incomplete gam-

ma ratio, which can be calculated, say, using the algorithm of Bhat-

tacharjee (1970).

If we use the median instead of the mean to represent the average

rate,

r i

can be calculated as the percentage points corresponding to p

= l/(2k), 3/(2k) ..... (2k - l)/(2k). In the current context, the scale

parameter 13 is redundant and can be set equal to c~ so that the mean

of the distribution is one (Yang 1993). The discrete distribution needs

also to be scaled so that the mean is one if the median is used. An ex-

ample is given in Fig. 1, with ct = [3 = 1/2, in which case the gamma

distribution is really the

~2

distribution with one degree of freedom.

When the average rate for each category is determined, the prob-

ability of observing data x at any site can be obtained as

f(x) = -~ .f(xlr = ri)

(11)

i=l

The conditional probability of observing x, given that the rate for the

site is r =

r e

is given by Yang (1993). As the postorder tree traver-

sal algorithm of Felsenstein (1981) can be used to

calculatef(x[r =

ri) ,

the computational requirement of the discrete gamma model is

roughly k times that of Felsenstein's (1981) single-rate model. The

continuous model was represented as, i.e., F84 + F (Yang et al.

1994), and we therefore represent the discrete gamma model as, i.e.,

F84 + dG. The discrete gamma model with k = 4, the value to be rec-

ommended, will be designated F84 + dG4.

It may be pointed out that the value of

r i,

which

maximizesf(xlr

= r~) in Eq. 11, can be used as the best predictor of the rate for the

site.

The Fixed-Rates Model.

With this model, substitution rates at

sites are predicted using the method of Yang and Wang (in press), as-

suming the star tree and the gamma distribution for rates over sites.

This method takes advantage of the observation that parameter esti-

mates and predicted rates are more-or-less stable across tree topolo-

gies (Yang et at. 1994; Yang and Wang in press). Rates, and their cor-

responding sites, are then classified into k = 4 categories, (0,1), (1,1

+ c0, (1 + o, 1 + 2c0, and (1 + 2o,0*), with o = (1/(~) ~2, where

is the estimated shape parameter of the gamma distribution. This

scheme of classification is very poor if taken as an approximation to

the gamma distribution, as the first category covers most of the sites.

It, however, reflects the discrete nature of the data; with a typical da-

ta set, the four site patterns that are represented by identical nu-

cleotides in all the species cover most of the sites, and most often pre-

dicted rates lbr those sites only are less than one, the average. The rate

for the ith category, ~, is obtained by averaging the predicted rates

for sites in the category. The probability of observing data x at a site

from the ith category is calculated as

f(x) = f(Mr = ?i)

(12)

In this formulation, rates at sites are not regarded as random vari-

ables; they are constants or parameters. Biologically, if we knew

which category a site should belong to, such as in the case of the three

codon positions in protein coding sequences, or if we knew whether

a site was located in a highly variable region or in a very conserved

region, such information could be used. When we lack such infor-

mation, a good guess, as provided by the method of Yang and Wang

(in press), may be used. Mathematically, the contribution to f(x) from

categories other than the most probable may be very small and may

therefore be ignored.

Several alternatives seem possible concerning the implementation

of this method. Possibilities and problems concerning the estimation

of the c~ parameter will be discussed later. It is possible to use the rates

obtained from the star tree and the continuous gamma distribution on-

ly to classify the sites, while rates for site classes can be estimated

by the likelihood function based on Eq. 12. Parameter ~: can also be

estimated this way. It is not very clear which options will produce

good performance. In this study, the average rates for site classes, ~,

and parameter ~; are all obtained from the star tree under the contin-

uous gamma model, as this saves computation. Calculation of the like-

lihood under this model, that is, based on Eq. 12, involves roughly the

same amount of computation as that of Felsenstein's (1981) single-

rate model.

Data.

We choose three data sets for which rates of substitution are

clearly variable over sites, while other aspects, such as the amount of

evolution as reflected in branch lengths in the tree, and the transition/

transversion rate bias, are quite different• For sequences such as

pseudogenes or "junk" DNA, for which rates are more-or-less

constant over sites, the methods considered in this paper are not use-

ful.

The mtDNA Sequences from Primates.

The 895-bp mtDNA se-

quences of human, chimpanzee, gorilla, orangutan, and gibbon (Brown

Table 1. Likelihood values and estimates of parameters as functions of

k,

the number of categories in the discrete gamma model a

309

Branch lengths

k f 6~5 6<-->4 6<-->7 7~8 7 ~--'3 8<-->1 8~2 ~

1 --2,

2 -2

3 -2

4 -2

5 -2

6 -2

7 -2

8 -2

9 -2

10 -2

12 -2

15 -2

20 -2

-2

667.08 0.1385 0.0998 0.0531 0.0174 0.0578 0.0412 0.0539 4.344 NA

625.55 0.2471 0.1627 0.0730 0.0225 0.0685 0.0504 0.0651 6.614 0.038

620.90 0.4686 0.2842 0.1062 0.0347 0.0669 0.0559 0.0700 11.239 0.183

621.18 0.4665 0.3010 0.1151 0.0341 0.0738 0.0576 0.0734 11.619 0.212

621.64 0.4482 0.2953 0.1151 0.0335 0.0777 0.0586 0.0750 11.359 0.228

621.98 0.4393 0.2917 0.1142 0.0334 0.0801 0.0594 0.0760 11.198 0.238

622.21 0.4348 0.2897 0.1134 0.0333 0.0816 0.0602 0.0767 11.098 0.243

622.39 0.4325 0.2887 0.1129 0.0334 0.0827 0.0608 0.0772 11.037 0.247

622.51 0.4316 0.2884 0.1126 0.0334 0.0835 0.0613 0.0776 10.999 0.249

622.61 0.4313 0.2884 0.1124 0.0335 0.0842 0.0617 0.0779 10.977 0.251

622.74 0.4319 0.2890 0.1122 0.0336 0.0852 0.0625 0.0785 10.959 0.252

622.86 0.4340 0.2904 0.1123 0.0338 0.0861 0.0632 0.0792 10.960 0.253

622.96 0.4375 0.2928 0.1126 0.0340 0.0871 0.0640 0.0799 10.986 0.253

623.06 0.4532 0.3033 0.1148 0.0349 0.0892 0.0659 0.0821 11.127 0.248

a

The mtDNA data for five species (895bp) were analyzed, assuming the best tree (Fig. 2) and the F84 model of nucleotide substitution. The

averages of nucleotide frequencies across species are wq: = 0.2532, n-r e = 0.3289, VA = 0.3120, and ~r~ = 0.1059 with

~max =

-2,476.97. The

branches are specified by their two nodes in the tree shown in Fig. 2

et al. 1982) were analyzed. This segment of the mitochondrial genome

codes for parts of two proteins at the two ends and three tRNAs in

the middle. Rates are highly variable over sites. The transition/trans-

version rate bias is very high for these sequences, and only a small

amount of evolution is involved.

This data set has been expanded since Brown et al. (1982), and

we added sequences of this region from crab-eating macaque, squir-

rel monkey, tarsier, and lemur to form another, larger, data set. The

sequences were aligned by A. Friday by eye. After sites involving in-

sertions or deletions have been excluded, there are 888 nucleotides in

each sequence. A much greater amount of evolution is now involved

due to the addition of more distantly related sequences. These two da-

ta sets can be distinguished by their numbers of species.

The et- and {3-Globin Genes from Mammals.

The c~- and [3-globin

genes of a primate (human), an artiodactyl (goat for the ct-globin gene

and cow for the f3-globin gene), a lagomorph (rabbit), a rodent (rat),

and a marsupial (the native cat for the ot-globin gene and opossum for

the [3-globin gene) were analyzed. As the evolutionary dynamics of

those two genes appear to be very similar, they were combined into

one data set. Only the first and second codon positions in the coding

regions were used, and there are 570 nucleotides in each combined

sequence (2 × 141 for the c~-globin gene and 2 X 144 for the [3-glo-

bin gene). This data set is characterized by high rate variation over

sites, relatively low transition/transversion rate ratio, and intermedi-

ate branch lengths.

The Small-Subunit rRNAs (ssrRNA).

Small-subunit rRNA se-

quences of

Sulfolobus solfatarius, Halobacterium salinarium, Esch-

erichia coli,

and

Homo sapiens

as analyzed by Navidi et al. (1991)

were used. There are 1,352 nucleotides in the sequence. Rates do not

appear to be highly variable over sites for those sequences, and • is

not large. The sequences are very different. Even nucleotide fre-

quencies are quite different among species, suggesting that the sub-

stitution processes may differ among lineages. For the purpose of

this study, however, this aspect of the inaccuracy of the models is ig-

nored.

Results

The data sets are analyzed by using the discrete gam-

ma model, assuming different numbers of categories (k),

in order to find when the model produces a satisfacto-

ry fit to the data and when the model produces a good

approximation to the continuous gamma distribution, as

reflected by an estimate of the ct parameter that is close

to that obtained under the continuous gamma model.

Obviously, k = 1 is the single-rate model, while k = co

is equivalent to the continuous gamma model. Both the

mean and the median for each category have been used

in the analyses, but as one might expect, the differ-

ences between them are not large. We therefore only

present results obtained from the use of the mean, with

comments given for the use of the median. The F84 +

F, F84 + dG4 models, the fixed-rates model, and the

pairwise least-squares method are also applied to all the

tree topologies for the three data sets. Comparison of re-

sults

concerning tree estimation will give us some feel

about the similarities and differences among these meth-

ods. The frequency parameters in the F84 model are es-

timated by using the averages of the observed frequen-

cies for all the models (methods). We do not assume the

existence of a molecular clock, and as F84 is a re-

versible-process model, only unrooted trees can be iden-

tified (Felsenstein 1981).

The mtDNA Sequences

Results obtained from the 895-bp mtDNA sequences for

five species under the discrete gamma model, F84 +

dG, are presented in Table 1. The relationship among

these species is probably (((human, chimpazee), goril-

la), orangutan, gibbon) (e.g., Hasegawa 1991); this tree

structure is assumed in the analysis (Fig. 2). The discrete

gamma model with k = 1 is equivalent to the single-rate

model. Branch lengths are severely underestimated by

this model as rate variation over sites is ignored. Param-

eter ~c is also underestimated when compared to the es-

310

4.

orangutan

3. gorilla 70.3033

1. human ] 0.0893

0.°"5X, o3491 0.114s

0~08J8

7

2.

chimpanzee

5. gibbon

Fig.

2. The maximum likelihood tree from the 895-bp mtDNA se-

quences for five species. The F84 + F model was assumed. Branch

lengths are measured by the average numbers of nucleotide substitu-

tions per site.

timate obtained from F84 + F (Table 1). Both aspects

were discussed by Yang et al. (1994).

As one would expect, the likelihood increases very

rapidly with k when k is very small. The best fit occurs

at k = 3, with (log-)likelihood ~ = -2,620.90, param-

eter estimates ~ = 11.239, d = 0.183. When k further

increases, the likelihood decreases very slowly, until

= -2623.06 for the continuous gamma model (k =

~). Estimates of parameters K and c~ under F84 + F are

= 11.127 and d = 0.248. To get a good approxima-

tion to the continuous gamma model, as reflected by a

close estimate of c~, a large k seems to be needed. When

the median for each category is used instead of the

mean, the best fit occurs at k = 3, with e = -2,620190,

= 11.239, and d = 0.174; the model fits the data just

as well as the use of the mean, but gives a poorer ap-

proximation to the continuous gamma distribution.

All the 15 bifurcating trees and the five-species star

tree are evaluated using different models (methods).

The maximum likelihood tree under F84 + F is shown

in Fig. 2. This tree can be designated HC-G, as orang-

utan and gibbon are the outgroups. The second and

third best trees are similarly designated CG-H and HG-

C respectively. Likelihood values and parameter esti-

mates obtained from these three best trees are e =

-2623.06, ~ = 11.127, and d = 0.248 for HC-G (Table

1), e = -2,626.60, ~ = 11.926, and d = 0.224 for CG-

H, and ~ = 2,626.62, ~ = 12.288 and d = 0.220 for HG-

C, respectively. Estimates of parameters obtained from

the star tree under F84 + F are ~ = 20.884, and d =

0.151 with e = - 2,630.38; these values of parameters

are used in the fixed-rates model. The discrete gamma

model with k = 4, F84 + dG4 produces similar results

to the F84 + F model; the three best trees are HC-G,

HG-C, and CG-H, with likelihoods -2,621.17,

-2,625.33, and -2,625.54, respectively. The fixed-

rates model also supports HC-G separation. When d =

0.151, which is obtained from the star tree under F84 +

F, is used in Eq. 3 to calculate pairwise distances, the

LS method supports the CG-H separation, although as-

suming a single rate for all the sites (c~ = ~) in Eq. 3

would choose the HC-G separation.

The F84 + dG model is also applied to the mtDNA

data for nine species. No attempt has been made to

search for the maximum likelihood tree, and the tree

topology assumed in the analysis separates the species

in the order human, chimpanzee, gorilla, orangutan,

gibbon, crab-eating macaque, squirrel monkey, tarsier,

and lemur. The best fit of the model occurs at k = 4,

with e = -5,044.92, and ~ = 3.855, d = 0.397. The F84

+ F model (k = ~) was not fitted for computational rea-

sons. The likelihood decreases with k when k > 4 in-

creases, and for k = 20, the results are ~ -- -5,046.64,

with parameter estimates ~ = 3.698 and d = 0.426.

When the median is used, the best fit occurs with k =

5, with ~ = 5,044.95, ~ = 3.704, and d = 0.377. Use

of the median thus fits the data just as well as use of the

mean, but the estimate of c~ is farther from that ob-

tained at k = 20.

Because of the large size of the mtDNA data for

nine species, we also fitted the discrete gamma model

combined with the general reversible-process model of

nucleotide substitution (REV, Yang in press), i.e., REV

+ dG. The REV model involves five rate parameters,

that is, a, b, c, d, e instead of only one 1¢ in the F84 mod-

el. Likelihood values and parameter estimates are list-

ed in Table 2. Although REV + F is not fitted, estimates

obtained from REV + dG with k = 20 can be expect-

ed to be very close. Apparently k = 5 gives the best fit,

with ~ = -5,031.62 and ~ = 0.462. Using the median

instead of the mean gives k = 6 for the best fit, with

= -5,031.57 and d = 0.444.

The (~ and [3 Globin Genes

The phylogenetic relationship among Primates (P), Ar-

tiodactyla (A), Lagomorpha (L), Rodentia (R) is not re-

solved although the marsupial (M) can be safely taken

as the outgroup. The maximum likelihood tree under the

F84 + F model is (((L,R),P),A) with M as the outgroup

(Fig. 3). We use this tree to examine the effects of k, the

number of categories, in the discrete gamma model.

Because parameter estimates are quite stable over tree

topologies, and the likelihoods of the several reasonable

trees are very similar, our conclusion will not be biased

seriously if this tree is not the true tree. The results are

shown in Fig. 4. The single-rate model (k = 1) gives e

= -1,792.06 with ~ = 0.074, while the continuous

gamma model (k = ~), F84 + F, gives e = -1,761.17,

with ~ = 0.116 and & = 0.360. When k increases from

one, the likelihood increases steadily, and the best fit ap-

pears to occur at k = 0% that is, with the continuous

Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods

Citations

MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods

A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.

IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies

BEAST: Bayesian evolutionary analysis by sampling trees

PAML 4: Phylogenetic Analysis by Maximum Likelihood

References

Evolutionary trees from DNA sequences: A maximum likelihood approach

Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees.

Dating of the human-ape splitting by a molecular clock of mitochondrial DNA.

Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions.

Phylogenetic analysis. Models and estimation procedures.

Related Papers (5)

Evolutionary trees from DNA sequences: A maximum likelihood approach

Confidence limits on phylogenies: an approach using the bootstrap.

MRBAYES: Bayesian inference of phylogenetic trees

MODELTEST: testing the model of DNA substitution.

MrBayes 3: Bayesian phylogenetic inference under mixed models