scispace - formally typeset
Open AccessJournal ArticleDOI

Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods

Ziheng Yang
- 01 Sep 1994 - 
- Vol. 39, Iss: 3, pp 306-314
Reads0
Chats0
TLDR
Two approximate methods are proposed for maximum likelihood phylogenetic estimation, which allow variable rates of substitution across nucleotide sites, and one of them uses several categories of rates to approximate the gamma distribution, with equal probability for each category.
Abstract
Two approximate methods are proposed for maximum likelihood phylogenetic estimation, which allow variable rates of substitution across nucleotide sites. Three data sets with quite different characteristics were analyzed to examine empirically the performance of these methods. The first, called the "discrete gamma model," uses several categories of rates to approximate the gamma distribution, with equal probability for each category. The mean of each category is used to represent all the rates falling in the category. The performance of this method is found to be quite good, and four such categories appear to be sufficient to produce both an optimum, or near-optimum fit by the model to the data, and also an acceptable approximation to the continuous distribution. The second method, called "fixed-rates model", classifies sites into several classes according to their rates predicted assuming the star tree. Sites in different classes are then assumed to be evolving at these fixed rates when other tree topologies are evaluated. Analyses of the data sets suggest that this method can produce reasonable results, but it seems to share some properties of a least-squares pairwise comparison; for example, interior branch lengths in nonbest trees are often found to be zero. The computational requirements of the two methods are comparable to that of Felsenstein's (1981, J Mol Evol 17:368-376) model, which assumes a single rate for all the sites.

read more

Content maybe subject to copyright    Report

J Mol Evol (1994) 39:306-314
Journal
of Molecular
Evolution
© Springer-Verlag New York Inc. 1994
Maximum Likelihood Phylogenetic Estimation from DNA Sequences with
Variable Rates over Sites: Approximate Methods
Ziheng Yang
Department of Zoology, The Natural History Museum, London SW7 5BD, United Kingdom
Received: 22 December 1993 / Accepted: 9 March 1994
Abstract.
Two approximate methods are proposed
for maximum likelihood phylogenetic estimation, which
allow variable rates of substitution across nucleotide
sites. Three data sets with quite different characteristics
were analyzed to examine empirically the performance
of these methods. The first, called the "discrete gamma
model," uses several categories of rates to approximate
the gamma distribution, with equal probability for each
category. The mean of each category is used to repre-
sent all the rates falling in the category. The performance
of this method is found to be quite good, and four such
categories appear to be sufficient to produce both an op-
timum, or near-optimum fit by the model to the data, and
also an acceptable approximation to the continuous dis-
tribution. The second method, called "fixed-rates mod-
el," classifies sites into several classes according to
their rates predicted assuming the star tree. Sites in dif-
ferent classes are then assumed to be evolving at these
fixed rates when other tree topologies are evaluated.
Analyses of the data sets suggest that this method can
produce reasonable results, but it seems to share some
properties of a least-squares pairwise comparison; for
example, interior branch lengths in nonbest trees are of-
ten found to be zero. The computational requirements of
the two methods are comparable to that of Felsenstein's
(1981, J Mol Evol 17:368-376) model, which assumes
a single rate for all the sites.
Key words: Phylogeny -- Maximum likelihood --
Rate variation over sites -- The gamma distribution --
Approximate methods
Introduction
Variation of substitution rates across nucleotide sites has
long been recognized as a characteristic of DNA se-
quence evolution, especially for sequences coding for bi-
ological products (e.g., Fitch and Margoliash 1967;
Fitch and Markowitz 1970; Uzzell and Corbin 1971;
Holmquist et al. 1983; Fitch 1986; Kocher and Wilson
1991; Wakeley 1993). There have been many attempts
to account for such rate variation in phylogenetic analy-
sis. Two approaches are taken. The first assumes that
rates over sites are random variables drawn from a con-
tinuous distribution; for example, Nei and Gojobori
(1986), Jin and Nei (1990), Li et al. (1990), and Tamu-
ra and Nei (1993) used the gamma distribution for rates
over sites when they constructed estimators of the dis-
tance between two sequences. The second approach us-
es several categories of rates. The simplest model of this
sort assumes that a proportion of sites are invariable
while others are changing at a constant rate (e.g.,
Hasegawa et al. 1985; Palumbi 1989; Hasegawa and Ho-
rai 1991). In accounting for the extreme rate hetero-
geneity of the control region of the human mtDNA,
Hasegawa et al. (1993) adopted a three-rate-category
model, wherein some sites are assumed to be invariable
while others are either moderately or highly variable.
Biologically, a continuous distribution may seem to
be more reasonable, and indeed, when fitting several
models to the control region of human mtDNAs, Wake-
ley (1993) found that a two-rate-category model could
not fit the data properly, while the fit of a gamma dis-
tribution was statistically acceptable. Recently, the gam-
ma distribution was also incorporated into a joint like-

307
lihood analysis by Yang (1993), a direct extension of
the method of Felsenstein (1981), which assumes a sin-
gle rate for all the sites. However, the algorithm of
Yang (1993) involves intensive computation, making
it impractical for data sets with more than a few spe-
cies.
In this paper, we introduce two approximate methods.
The first, called the "discrete gamma model," uses sev-
eral categories to approximate the continuous gamma
distribution. Rates over sites are regarded as random
variables drawn from a discrete distribution. The ad-
vantage of taking a discrete distribution as an approxi-
mation to the continuous one is that only one extra pa-
rameter is needed. This appears to lead to more efficient
estimation, easier interpretation of results, and also bet-
ter fit of the model to data. The second method, called
the "fixed-rates model," predicts rates at specific sites
by assuming the star tree and the gamma distribution us-
ing the method of Yang and Wang (in press), and then
combines sites into several classes according to these
predicted rates. Sites in different classes are then as-
sumed to be evolving at known (different) rates when
other tree topologies are evaluated.
We will analyze three different data sets to examine
the fit to data of the discrete gamma model, and its ap-
proximation to the continuous distribution, in order to
determine a satisfactory number of categories in the
discrete distribution. The continuous gamma model
(Yang 1993), the discrete gamma model, the fixed-rates
model, and a least-squares method based on pairwise
distance estimates will all be applied to evaluate the pos-
sible tree topologies. The results will give us some idea
about the similarities and differences among those tree
estimation methods. However, the performance of these
methods with regard to tree reconstruction will be ex-
amined more rigorously by using computer simulations.
Methods and Data
of substitution is 1 when the process is in equilibrium. Time t or
branch lengths in the tree are then measured by the expected number
of nucleotide substitutions per site. Parameter K > max(-~y,-r~a) ad-
justs for the transition/transversion rate bias; a K larger than 0 will al-
low transitions to occur with higher probabilities than transversions.
The rate matrix was given by Hasegawa and Kishino (1989), and the
transition probability matrix, P(t) = exp{Qt}, was described by
Thorne et al. (1992). The model will be designated "F84."
When applied to real data, the fit of F84 is very similar to that of
the model of Hasegawa et al. (1985), denoted "HKY85" (results not
shown). The transition/transversion rate ratios in the two models are
roughly related
as KHKY8 5 =
2KF84 -- 1, where
KHKY8 5
is equivalent
to a/J3 in the notation of Hasegawa et al. (1985). Goldman (1993) pro-
vides a more accurate formula for this relationship. There is, howev-
er, a mathematical difference between these two models: while the rate
matrix of the HKY85 model has four distinct eigenroots (Hasegawa
et al. 1985), Q in Eq. 1 has only three, that is, k~ = 0,
~'2 = --~t,
~'3
= L 4 = -(1 + K)~t. The corresponding eigenvectors of Q are the same
as those for HKY85 and are given by Hasegawa et al. (1985). The F84
model therefore involves less computation than HKY85, especially for
the algorithm of Yang (1993). Another result of this difference is that
a simple formula for estimating the distance between two sequences
is available for F84 while there is not for HKY85, for the reasons ex-
plained by Yang (in press).
Assume that rates over sites follow a gamma distribution with (giv-
en or independently estimated) shape parameter ct. Let P and Q be the
proportions of sites with transitional and transversional differences re-
spectively in the two sequences. Following, e.g., Jin and Nei (1990)
or Tamura and Nei (1993), parameter ¢ and sequence divergence
t,
as defined by t = [4J~T~C(I
+
K/Gy) -k 4JIAT~G(I
q-
K/GR) -t- 4gy~R] ' ~t,
can be estimated as follows:
= a/b
- 1 (2)
= [4~T~C(1
+
l(/~y) -~ 4JIa~G(1 + K/JIR)
+ 4~y~R] b
where
[ -log(A)/2, if ot =
a = (l+K)~tt =
icz(A_l/.
1)/2, if0<a<~
-- [ -log(B)/2, if c~ =
b= ut= la(B -lm-
1)/2, if0<c~<
and
(3)
(4)
(5)
The Model of Nucleotide Substitution.
In this paper we will use the
Markov process model for nucleotide substitution that has been im-
plemented in the DNAML program in J. Felsenstein' s PHYLIP pack-
age since 1984 (Felsenstein personal communication). The rate ma-
trix for this model is
A -
2(~Tg c + J~AJIG ) + 2(gTJIC~lR/g Y + :r~A~GJIy/J~R) " [1 -- Q/(2gy0"lg) ] -- P
2(~T~C/G Y q- JIA~IG/JIR )
(6)
Q _
I (1 K/JIy)~ C ~A ~G ]
(1 + ~/=Y)=T JIA =a
JIT JIC (1 + K/JIR)~ c '
=T JIC (1 q- K/~R)~ A
(1)
where Qij (i ~ j) represents the rate of substitution from nucleotide i
to j, with the nucleotides ordered T, C, A, G. The diagonals are spec-
ified by the mathematical requirement that row sums of Q are zero.
The equilibrium distribution is given by ~T, ~c, ~A' JIG, with ~y = ~T
+ ~c and ~R = ~A + gC = 1
-
~y. g = 1/[4~c(1 + K/TOy) + 4~A~ G
(1 + K/~R) + 4~y~R] is a scale factor, chosen so that the average rate
B = 1 - Q/(2ZCyJIR) (7)
The frequency parameters ~T' ~C,
~A'
JIG, are estimated using the
averages of the observed frequencies in the two sequences. When c~
= ~, the gamma distribution reduces to the model with a single rate
over sites. In this paper, we estimate the c~ parameter to be used in Eqs.
2 and 3 by the method of Yang (1993) assuming the star tree. Pair-
wise distances estimated this way can then be used in constructing the
least-squares (LS) additive tree (Cavalli-Sforza and Edwards 1967) or
in other distance matrix methods. The maximum likelihood estimates
of K and t are normally found to be very similar to those obtained from
Eqs. 2 and 3 (results not shown).

308
The Discrete Gamma Model.
We use k categories to approximate
the gamma distribution, with equal probability
1/k
in each category.
The density function of the gamma distribution G(c~,[3) is
g(r;c~,[3)=~exp{-~r}.r
~ a, 0<r<~ (8)
with mean
E(r)
= c~/[3 and variance
V(r)
= ct/[32. We note that when
r -- G(c~,[3),
cr ~
G(ct,[3/c). The gamma distribution with 13 = J/z re-
duces to the Z2 distribution, that is, G(ot, I/2) --- X2(2c0. Using these re-
lationships, we can calculate the percentage point (the cutting point)
of the gamma distribution, i.e., the value of z such that Prob{r < z}
= p where r ~ G(c~,13), as follows:
zo(P,'Ct, 13) = z×2(p;2a)/(213)
(9)
where
zx2(p;v)
is the percentage point of the Z 2 distribution with v de-
grees of freedom, which can be calculated by, say, the algorithm of
Best and Roberts (1975).
The range of r, (0,~), is cut into k categories by k - 1 percent-
age points corresponding to p =
1/k, 2/k, ....
(k -
1)/k.
The rate in
each category can then be represented by the mean of the portion of
the gamma distribution falling in the category. Suppose that the two
cutting points of category i are a and b. Then the rate for category i
can be obtained as
4,
3
t(O
2
0 7-- ....
0 0,5 1 1.5 2
r
Fig. 1. Discrete approximation to the gamma distribution G(ct,[3),
with ct = [3 = l/2. Four categories are used to approximate the con-
tinuous distribution, with equal probability for each category. The
three boundaries are 0.1015, 0.4549, and 1.3233, which are the per-
centage points corresponding to p = 1/4,
z14,
3/4. The means of the four
categories are 0.0334, 0.2519, 0.8203, 2.8944. The medians are
0.0247, 0.2389, 0.7870, 2.3535, and these are scaled to get 0.0291,
0.2807, 0.9248, and 2.7654, so that the mean of the discrete distri-
bution is one.
ri = f;r g(r,'a,~) drlf g(r,'c~,13) dr
= a/[3 [I(b13,c~ +
1) -
l(a[3,ct +
1)](l/k) (10)
where
l(z, cQ
= [1/F(c0] f~ exp { -x}
xa-ldx
is the incomplete gam-
ma ratio, which can be calculated, say, using the algorithm of Bhat-
tacharjee (1970).
If we use the median instead of the mean to represent the average
rate,
r i
can be calculated as the percentage points corresponding to p
= l/(2k), 3/(2k) ..... (2k - l)/(2k). In the current context, the scale
parameter 13 is redundant and can be set equal to c~ so that the mean
of the distribution is one (Yang 1993). The discrete distribution needs
also to be scaled so that the mean is one if the median is used. An ex-
ample is given in Fig. 1, with ct = [3 = 1/2, in which case the gamma
distribution is really the
~2
distribution with one degree of freedom.
When the average rate for each category is determined, the prob-
ability of observing data x at any site can be obtained as
f(x) = -~ .f(xlr = ri)
(11)
i=l
The conditional probability of observing x, given that the rate for the
site is r =
r e
is given by Yang (1993). As the postorder tree traver-
sal algorithm of Felsenstein (1981) can be used to
calculatef(x[r =
ri) ,
the computational requirement of the discrete gamma model is
roughly k times that of Felsenstein's (1981) single-rate model. The
continuous model was represented as, i.e., F84 + F (Yang et al.
1994), and we therefore represent the discrete gamma model as, i.e.,
F84 + dG. The discrete gamma model with k = 4, the value to be rec-
ommended, will be designated F84 + dG4.
It may be pointed out that the value of
r i,
which
maximizesf(xlr
= r~) in Eq. 11, can be used as the best predictor of the rate for the
site.
The Fixed-Rates Model.
With this model, substitution rates at
sites are predicted using the method of Yang and Wang (in press), as-
suming the star tree and the gamma distribution for rates over sites.
This method takes advantage of the observation that parameter esti-
mates and predicted rates are more-or-less stable across tree topolo-
gies (Yang et at. 1994; Yang and Wang in press). Rates, and their cor-
responding sites, are then classified into k = 4 categories, (0,1), (1,1
+ c0, (1 + o, 1 + 2c0, and (1 + 2o,0*), with o = (1/(~) ~2, where
is the estimated shape parameter of the gamma distribution. This
scheme of classification is very poor if taken as an approximation to
the gamma distribution, as the first category covers most of the sites.
It, however, reflects the discrete nature of the data; with a typical da-
ta set, the four site patterns that are represented by identical nu-
cleotides in all the species cover most of the sites, and most often pre-
dicted rates lbr those sites only are less than one, the average. The rate
for the ith category, ~, is obtained by averaging the predicted rates
for sites in the category. The probability of observing data x at a site
from the ith category is calculated as
f(x) = f(Mr = ?i)
(12)
In this formulation, rates at sites are not regarded as random vari-
ables; they are constants or parameters. Biologically, if we knew
which category a site should belong to, such as in the case of the three
codon positions in protein coding sequences, or if we knew whether
a site was located in a highly variable region or in a very conserved
region, such information could be used. When we lack such infor-
mation, a good guess, as provided by the method of Yang and Wang
(in press), may be used. Mathematically, the contribution to f(x) from
categories other than the most probable may be very small and may
therefore be ignored.
Several alternatives seem possible concerning the implementation
of this method. Possibilities and problems concerning the estimation
of the c~ parameter will be discussed later. It is possible to use the rates
obtained from the star tree and the continuous gamma distribution on-
ly to classify the sites, while rates for site classes can be estimated
by the likelihood function based on Eq. 12. Parameter ~: can also be
estimated this way. It is not very clear which options will produce
good performance. In this study, the average rates for site classes, ~,
and parameter ~; are all obtained from the star tree under the contin-
uous gamma model, as this saves computation. Calculation of the like-
lihood under this model, that is, based on Eq. 12, involves roughly the
same amount of computation as that of Felsenstein's (1981) single-
rate model.
Data.
We choose three data sets for which rates of substitution are
clearly variable over sites, while other aspects, such as the amount of
evolution as reflected in branch lengths in the tree, and the transition/
transversion rate bias, are quite different• For sequences such as
pseudogenes or "junk" DNA, for which rates are more-or-less
constant over sites, the methods considered in this paper are not use-
ful.
The mtDNA Sequences from Primates.
The 895-bp mtDNA se-
quences of human, chimpanzee, gorilla, orangutan, and gibbon (Brown

Table 1. Likelihood values and estimates of parameters as functions of
k,
the number of categories in the discrete gamma model a
309
Branch lengths
k f 6~5 6<-->4 6<-->7 7~8 7 ~--'3 8<-->1 8~2 ~
1 --2,
2 -2
3 -2
4 -2
5 -2
6 -2
7 -2
8 -2
9 -2
10 -2
12 -2
15 -2
20 -2
-2
667.08 0.1385 0.0998 0.0531 0.0174 0.0578 0.0412 0.0539 4.344 NA
625.55 0.2471 0.1627 0.0730 0.0225 0.0685 0.0504 0.0651 6.614 0.038
620.90 0.4686 0.2842 0.1062 0.0347 0.0669 0.0559 0.0700 11.239 0.183
621.18 0.4665 0.3010 0.1151 0.0341 0.0738 0.0576 0.0734 11.619 0.212
621.64 0.4482 0.2953 0.1151 0.0335 0.0777 0.0586 0.0750 11.359 0.228
621.98 0.4393 0.2917 0.1142 0.0334 0.0801 0.0594 0.0760 11.198 0.238
622.21 0.4348 0.2897 0.1134 0.0333 0.0816 0.0602 0.0767 11.098 0.243
622.39 0.4325 0.2887 0.1129 0.0334 0.0827 0.0608 0.0772 11.037 0.247
622.51 0.4316 0.2884 0.1126 0.0334 0.0835 0.0613 0.0776 10.999 0.249
622.61 0.4313 0.2884 0.1124 0.0335 0.0842 0.0617 0.0779 10.977 0.251
622.74 0.4319 0.2890 0.1122 0.0336 0.0852 0.0625 0.0785 10.959 0.252
622.86 0.4340 0.2904 0.1123 0.0338 0.0861 0.0632 0.0792 10.960 0.253
622.96 0.4375 0.2928 0.1126 0.0340 0.0871 0.0640 0.0799 10.986 0.253
623.06 0.4532 0.3033 0.1148 0.0349 0.0892 0.0659 0.0821 11.127 0.248
a
The mtDNA data for five species (895bp) were analyzed, assuming the best tree (Fig. 2) and the F84 model of nucleotide substitution. The
averages of nucleotide frequencies across species are wq: = 0.2532, n-r e = 0.3289, VA = 0.3120, and ~r~ = 0.1059 with
~max =
-2,476.97. The
branches are specified by their two nodes in the tree shown in Fig. 2
et al. 1982) were analyzed. This segment of the mitochondrial genome
codes for parts of two proteins at the two ends and three tRNAs in
the middle. Rates are highly variable over sites. The transition/trans-
version rate bias is very high for these sequences, and only a small
amount of evolution is involved.
This data set has been expanded since Brown et al. (1982), and
we added sequences of this region from crab-eating macaque, squir-
rel monkey, tarsier, and lemur to form another, larger, data set. The
sequences were aligned by A. Friday by eye. After sites involving in-
sertions or deletions have been excluded, there are 888 nucleotides in
each sequence. A much greater amount of evolution is now involved
due to the addition of more distantly related sequences. These two da-
ta sets can be distinguished by their numbers of species.
The et- and {3-Globin Genes from Mammals.
The c~- and [3-globin
genes of a primate (human), an artiodactyl (goat for the ct-globin gene
and cow for the f3-globin gene), a lagomorph (rabbit), a rodent (rat),
and a marsupial (the native cat for the ot-globin gene and opossum for
the [3-globin gene) were analyzed. As the evolutionary dynamics of
those two genes appear to be very similar, they were combined into
one data set. Only the first and second codon positions in the coding
regions were used, and there are 570 nucleotides in each combined
sequence (2 × 141 for the c~-globin gene and 2 X 144 for the [3-glo-
bin gene). This data set is characterized by high rate variation over
sites, relatively low transition/transversion rate ratio, and intermedi-
ate branch lengths.
The Small-Subunit rRNAs (ssrRNA).
Small-subunit rRNA se-
quences of
Sulfolobus solfatarius, Halobacterium salinarium, Esch-
erichia coli,
and
Homo sapiens
as analyzed by Navidi et al. (1991)
were used. There are 1,352 nucleotides in the sequence. Rates do not
appear to be highly variable over sites for those sequences, and is
not large. The sequences are very different. Even nucleotide fre-
quencies are quite different among species, suggesting that the sub-
stitution processes may differ among lineages. For the purpose of
this study, however, this aspect of the inaccuracy of the models is ig-
nored.
Results
The data sets are analyzed by using the discrete gam-
ma model, assuming different numbers of categories (k),
in order to find when the model produces a satisfacto-
ry fit to the data and when the model produces a good
approximation to the continuous gamma distribution, as
reflected by an estimate of the ct parameter that is close
to that obtained under the continuous gamma model.
Obviously, k = 1 is the single-rate model, while k = co
is equivalent to the continuous gamma model. Both the
mean and the median for each category have been used
in the analyses, but as one might expect, the differ-
ences between them are not large. We therefore only
present results obtained from the use of the mean, with
comments given for the use of the median. The F84 +
F, F84 + dG4 models, the fixed-rates model, and the
pairwise least-squares method are also applied to all the
tree topologies for the three data sets. Comparison of re-
sults
concerning tree estimation will give us some feel
about the similarities and differences among these meth-
ods. The frequency parameters in the F84 model are es-
timated by using the averages of the observed frequen-
cies for all the models (methods). We do not assume the
existence of a molecular clock, and as F84 is a re-
versible-process model, only unrooted trees can be iden-
tified (Felsenstein 1981).
The mtDNA Sequences
Results obtained from the 895-bp mtDNA sequences for
five species under the discrete gamma model, F84 +
dG, are presented in Table 1. The relationship among
these species is probably (((human, chimpazee), goril-
la), orangutan, gibbon) (e.g., Hasegawa 1991); this tree
structure is assumed in the analysis (Fig. 2). The discrete
gamma model with k = 1 is equivalent to the single-rate
model. Branch lengths are severely underestimated by
this model as rate variation over sites is ignored. Param-
eter ~c is also underestimated when compared to the es-

310
4.
orangutan
3. gorilla 70.3033
1. human ] 0.0893
0.°"5X, o3491 0.114s
0~08J8
7
2.
chimpanzee
5. gibbon
Fig.
2. The maximum likelihood tree from the 895-bp mtDNA se-
quences for five species. The F84 + F model was assumed. Branch
lengths are measured by the average numbers of nucleotide substitu-
tions per site.
timate obtained from F84 + F (Table 1). Both aspects
were discussed by Yang et al. (1994).
As one would expect, the likelihood increases very
rapidly with k when k is very small. The best fit occurs
at k = 3, with (log-)likelihood ~ = -2,620.90, param-
eter estimates ~ = 11.239, d = 0.183. When k further
increases, the likelihood decreases very slowly, until
= -2623.06 for the continuous gamma model (k =
~). Estimates of parameters K and c~ under F84 + F are
= 11.127 and d = 0.248. To get a good approxima-
tion to the continuous gamma model, as reflected by a
close estimate of c~, a large k seems to be needed. When
the median for each category is used instead of the
mean, the best fit occurs at k = 3, with e = -2,620190,
= 11.239, and d = 0.174; the model fits the data just
as well as the use of the mean, but gives a poorer ap-
proximation to the continuous gamma distribution.
All the 15 bifurcating trees and the five-species star
tree are evaluated using different models (methods).
The maximum likelihood tree under F84 + F is shown
in Fig. 2. This tree can be designated HC-G, as orang-
utan and gibbon are the outgroups. The second and
third best trees are similarly designated CG-H and HG-
C respectively. Likelihood values and parameter esti-
mates obtained from these three best trees are e =
-2623.06, ~ = 11.127, and d = 0.248 for HC-G (Table
1), e = -2,626.60, ~ = 11.926, and d = 0.224 for CG-
H, and ~ = 2,626.62, ~ = 12.288 and d = 0.220 for HG-
C, respectively. Estimates of parameters obtained from
the star tree under F84 + F are ~ = 20.884, and d =
0.151 with e = - 2,630.38; these values of parameters
are used in the fixed-rates model. The discrete gamma
model with k = 4, F84 + dG4 produces similar results
to the F84 + F model; the three best trees are HC-G,
HG-C, and CG-H, with likelihoods -2,621.17,
-2,625.33, and -2,625.54, respectively. The fixed-
rates model also supports HC-G separation. When d =
0.151, which is obtained from the star tree under F84 +
F, is used in Eq. 3 to calculate pairwise distances, the
LS method supports the CG-H separation, although as-
suming a single rate for all the sites (c~ = ~) in Eq. 3
would choose the HC-G separation.
The F84 + dG model is also applied to the mtDNA
data for nine species. No attempt has been made to
search for the maximum likelihood tree, and the tree
topology assumed in the analysis separates the species
in the order human, chimpanzee, gorilla, orangutan,
gibbon, crab-eating macaque, squirrel monkey, tarsier,
and lemur. The best fit of the model occurs at k = 4,
with e = -5,044.92, and ~ = 3.855, d = 0.397. The F84
+ F model (k = ~) was not fitted for computational rea-
sons. The likelihood decreases with k when k > 4 in-
creases, and for k = 20, the results are ~ -- -5,046.64,
with parameter estimates ~ = 3.698 and d = 0.426.
When the median is used, the best fit occurs with k =
5, with ~ = 5,044.95, ~ = 3.704, and d = 0.377. Use
of the median thus fits the data just as well as use of the
mean, but the estimate of c~ is farther from that ob-
tained at k = 20.
Because of the large size of the mtDNA data for
nine species, we also fitted the discrete gamma model
combined with the general reversible-process model of
nucleotide substitution (REV, Yang in press), i.e., REV
+ dG. The REV model involves five rate parameters,
that is, a, b, c, d, e instead of only one in the F84 mod-
el. Likelihood values and parameter estimates are list-
ed in Table 2. Although REV + F is not fitted, estimates
obtained from REV + dG with k = 20 can be expect-
ed to be very close. Apparently k = 5 gives the best fit,
with ~ = -5,031.62 and ~ = 0.462. Using the median
instead of the mean gives k = 6 for the best fit, with
= -5,031.57 and d = 0.444.
The (~ and [3 Globin Genes
The phylogenetic relationship among Primates (P), Ar-
tiodactyla (A), Lagomorpha (L), Rodentia (R) is not re-
solved although the marsupial (M) can be safely taken
as the outgroup. The maximum likelihood tree under the
F84 + F model is (((L,R),P),A) with M as the outgroup
(Fig. 3). We use this tree to examine the effects of k, the
number of categories, in the discrete gamma model.
Because parameter estimates are quite stable over tree
topologies, and the likelihoods of the several reasonable
trees are very similar, our conclusion will not be biased
seriously if this tree is not the true tree. The results are
shown in Fig. 4. The single-rate model (k = 1) gives e
= -1,792.06 with ~ = 0.074, while the continuous
gamma model (k = ~), F84 + F, gives e = -1,761.17,
with ~ = 0.116 and & = 0.360. When k increases from
one, the likelihood increases steadily, and the best fit ap-
pears to occur at k = 0% that is, with the continuous

Citations
More filters
Journal ArticleDOI

MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods

TL;DR: The newest addition in MEGA5 is a collection of maximum likelihood (ML) analyses for inferring evolutionary trees, selecting best-fit substitution models, inferring ancestral states and sequences, and estimating evolutionary rates site-by-site.
Journal ArticleDOI

A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.

TL;DR: This work has used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches.
Journal ArticleDOI

IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies

TL;DR: It is shown that a combination of hill-climbing approaches and a stochastic perturbation method can be time-efficiently implemented and found higher likelihoods between 62.2% and 87.1% of the studied alignments, thus efficiently exploring the tree-space.
Journal ArticleDOI

BEAST: Bayesian evolutionary analysis by sampling trees

TL;DR: BEAST is a fast, flexible software architecture for Bayesian analysis of molecular sequences related by an evolutionary tree that provides models for DNA and protein sequence evolution, highly parametric coalescent analysis, relaxed clock phylogenetics, non-contemporaneous sequence data, statistical alignment and a wide range of options for prior distributions.
Journal ArticleDOI

PAML 4: Phylogenetic Analysis by Maximum Likelihood

TL;DR: PAML, currently in version 4, is a package of programs for phylogenetic analyses of DNA and protein sequences using maximum likelihood (ML), which can be used to estimate parameters in models of sequence evolution and to test interesting biological hypotheses.
References
More filters
Journal ArticleDOI

Evolutionary trees from DNA sequences: A maximum likelihood approach

TL;DR: A computationally feasible method for finding such maximum likelihood estimates is developed, and a computer program is available that allows the testing of hypotheses about the constancy of evolutionary rates by likelihood ratio tests.
Journal ArticleDOI

Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees.

TL;DR: In this paper, a new mathematical method for estimating the number of transitional and transversional substitutions per site, as well as the total number of nucleotide substitutions was proposed, taking into account excess transitions, unequal nucleotide frequencies, and variation of substitution rate among different sites.
Journal ArticleDOI

Dating of the human-ape splitting by a molecular clock of mitochondrial DNA.

TL;DR: A new statistical method for estimating divergence dates of species from DNA sequence data by a molecular clock approach is developed, and this dating may pose a problem for the widely believed hypothesis that the bipedal creatureAustralopithecus afarensis, which lived some 3.7 million years ago, was ancestral to man and evolved after the human-ape splitting.
Journal ArticleDOI

Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions.

TL;DR: It is shown that all available methods tend to give an underestimate of the number of nonsynonymous substitutions when the number is large and computer simulation indicates that estimates of synonymous substitutions obtained by the two methods are quite accurate unless the numberof nucleotide substitutions per site is very large.
Journal Article

Phylogenetic analysis. Models and estimation procedures.

TL;DR: This paper shows how suitable evolutionary models can be constructed and applied objectively and how the type of data will affect both the method of treatment and the validity of the results.
Related Papers (5)