MaCH: Using Sequence and Genotype Data to Estimate Haplotypes and Unobserved Genotypes

Yun Li, Cristen J. Willer, Jun Ding, Paul Scheet, and Gonçalo R. Abecasis²,*

Department of Genetics, Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina

Center for Statistical Genetics, Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan

Department of Epidemiology, University of Texas M.D. Anderson Cancer Center, Houston, Texas

Abstract

Genome-wide association studies (GWAS) can identify common alleles that contribute to complex

disease susceptibility. Despite the large number of SNPs assessed in each study, the effects of

most common SNPs must be evaluated indirectly using either genotyped markers or haplotypes

thereof as proxies. We have previously implemented a computationally efficient Markov Chain

framework for genotype imputation and haplotyping in the freely available MaCH software

package. The approach describes sampled chromosomes as mosaics of each other and uses

available genotype and shotgun sequence data to estimate unobserved genotypes and haplotypes,

together with useful measures of the quality of these estimates. Our approach is already widely

used to facilitate comparison of results across studies as well as meta-analyses of GWAS. Here,

we use simulations and experimental genotypes to evaluate its accuracy and utility, considering

choices of genotyping panels, reference panel configurations, and designs where genotyping is

replaced with shotgun sequencing. Importantly, we show that genotype imputation not only

facilitates cross study analyses but also increases power of genetic association studies. We show

that genotype imputation of common variants using HapMap haplotypes as a reference is very

accurate using either genome-wide SNP data or smaller amounts of data typical in fine-mapping

studies. Furthermore, we show the approach is applicable in a variety of populations. Finally, we

illustrate how association analyses of unobserved variants will benefit from ongoing advances

such as larger HapMap reference panels and whole genome shotgun sequencing technologies.

Keywords

imputation; haplotyping; sequencing

Correspondence to: Gonçalo R. Abecasis, Department of Biostatistics, University of Michigan School of Public Health, 1415 Washington Heights, Ann Arbor, MI 48109. goncalo@umich.edu.

Additional Supporting Information may be found in the online version of this article.

NIH Public Access Author Manuscript. Published in final edited form as: Genet Epidemiol. 2010 December; 34(8): 816–834. doi:10.1002/gepi.20533. Author manuscript; available in PMC 2011 September 19.

INTRODUCTION

Most ongoing genome-wide association studies (GWAS) rely on a commercial SNP genotyping panel that directly assays only a small fraction of SNPs in the human genome [Carlson et al., 2003; The International HapMap Consortium, 2005]. In these scans, the majority of SNPs in the genome must be evaluated indirectly using one or more of the

genotyped SNPs as proxies [Barrett and Cardon, 2006; Pe’er et al., 2006]. Despite the ability

of individual genome-wide association scans to identify common alleles that make large

contributions to disease risk and a subset of the loci with smaller effect [Hirschhorn and

Daly, 2005], many alleles that contribute to complex disease can only be identified through

the meta-analysis of multiple genome-wide scans [for specific examples, see Lettre et al.,

2008; Sanna et al., 2008; Willer et al., 2008, 2009]. Although it is possible to assign SNPs

genotyped in each study as proxies for SNPs genotyped in the other studies [Carlson et al.,

2004; de Bakker et al., 2005; Lin et al., 2004; Nicolae, 2006; Zaitlen et al., 2007], meta-

analyses of GWAS conducted in this manner would be cumbersome because of the limited

overlap between the different commercial panels and because different choices of proxies

for a particular SNP might lead to somewhat different conclusions.

GENOTYPE IMPUTATION

A much more attractive approach for cross study analyses is to combine genotypes

generated by the International HapMap Consortium [The International HapMap

Consortium, 2005] with genotypes from individual studies, and then use a haplotyping

algorithm that can handle genome scale data to impute genotypes at untyped markers in each

study [Scheet and Stephens, 2006]. This strategy results in a situation where all studies are

“genotyped” at all the markers examined by the HapMap consortium (albeit some markers

are genotyped using conventional means and others are genotyped in silico [Burdick et al.,

2006]). The approach relies on the intuition that even two apparently “unrelated” individuals

can share short stretches of haplotype inherited from distant common ancestors. Once one of

these stretches is identified using genotypes for a few SNPs, alleles for intervening SNPs

that are measured in one of the individuals, but not the other, can be imputed. Provided

shared haplotype stretches are identified correctly, imputed genotypes will be accurate

unless they have been disrupted by gene conversion or mutation events.

INITIAL EVALUATION OF IMPUTED GENOTYPES AND HAPLOTYPES

Here, we systematically evaluate the genotype imputation approach outlined in the

paragraph above using our Markov Chain Haplotyping algorithm (MaCH 1.0; see Appendix

for implementation details). To estimate haplotypes, our approach starts by randomly

generating a pair of haplotypes that is compatible with observed genotypes for each sampled

individual. These initial haplotype estimates are then refined through a series of iterations. In

each iteration, a new pair of haplotypes is sampled for each individual in turn using a

Hidden Markov Model (HMM) that describes the haplotype pair as an imperfect mosaic of

the other haplotypes. Model parameters that characterize the probability of change in the

mosaic pattern between every pair of consecutive markers and the probability of observing

an imperfection in the mosaic at each specific point are also updated. After many iterations

(typically 20–100), a consensus haplotype can be constructed by merging the haplotypes

sampled in each round.
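
The mosaic idea can be illustrated with a minimal Python sketch. It is not the MaCH implementation: it scores a single haploid sequence rather than a pair of haplotypes, it uses one fixed switch probability and one fixed error rate for all markers (the parameter names `theta` and `eps` are hypothetical), and it only computes a likelihood with the forward algorithm instead of sampling new haplotypes in each iteration.

```python
def mosaic_likelihood(target, reference, theta=0.01, eps=0.01):
    """Forward algorithm for a haploid copying (mosaic) model.

    target    -- list of 0/1 alleles, or None where the allele is unobserved
    reference -- list of reference haplotypes (lists of 0/1 alleles)
    theta     -- assumed probability that the mosaic switches templates
                 between consecutive markers
    eps       -- assumed probability of an imperfection (mismatch) at a marker
    """
    n_ref = len(reference)
    n_markers = len(target)

    def emission(state, marker):
        observed = target[marker]
        if observed is None:          # missing data carries no information
            return 1.0
        return 1.0 - eps if reference[state][marker] == observed else eps

    # Start in any reference haplotype with equal probability.
    forward = [emission(s, 0) / n_ref for s in range(n_ref)]
    for m in range(1, n_markers):
        total = sum(forward)
        # Either stay on the same template or switch to a uniformly chosen one.
        forward = [
            ((1.0 - theta) * forward[s] + theta * total / n_ref) * emission(s, m)
            for s in range(n_ref)
        ]
    return sum(forward)
```

In this sketch a target that matches one reference haplotype along its whole length receives a much higher likelihood than one that requires a switch or a mismatch, which is the property the sampling step exploits.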

HAPLOTYPING

Our approach was inspired by the Markov models commonly used for pedigree analysis [for

examples, see Abecasis et al., 2002; Kruglyak et al., 1996; Lander and Green, 1987] and

shares several features with other HMMs used to describe sampled haplotypes as a mosaic

of a set of reference haplotypes [Daly et al., 2001; Li and Stephens, 2003; Mott et al., 2000;

Stephens and Scheet, 2005a]. In order to evaluate its performance, we simulated two sets of

100 regions of 1 Mb each that mimic the degree of linkage disequilibrium (LD) in the HapMap CEU

and YRI samples [Schaffner et al., 2005]. In each region, we simulated genotypes for ~200

markers, ascertained to mimic HapMap I allele frequency patterns [Marchini et al., 2006], in

90 individuals with 2% of the genotypes missing at random. We then used our method to

reconstruct individual haplotypes and tallied three measures of haplotyping quality

[Marchini et al., 2006]: (1) the number of incorrectly imputed missing genotypes, (2) among

heterozygous sites, the number of consecutive sites that are phased incorrectly with respect

to each other (this is the number of “flips” required to transform estimated haplotypes into

the true haplotypes, after masking incorrectly imputed sites), and (3) the number of perfectly

inferred haplotypes. The three measures were averaged over all 100 regions and the results

are summarized in Table I. For comparison, the table also includes results for PHASE

[Stephens and Scheet, 2005b; Stephens et al., 2001] and fastPHASE [Scheet and Stephens,

2006], two state of the art haplotyping algorithms [Marchini et al., 2006], and for BEAGLE

[Browning, 2006] and PL-EM [Qin et al., 2002], two alternative haplotyping algorithms that

are very computationally efficient. Table I clearly shows that our method is competitive in

all three measures: our method results in slightly fewer incorrectly imputed genotypes,

requires slightly fewer flips to transform imputed haplotypes into the true haplotypes, and

produces slightly more correctly inferred haplotypes over the entire 1 Mb stretch than

PHASE, which was the second best method. Furthermore, note that estimates of haplotypes

and missing genotypes obtained in 5–20 min using our method are comparable in quality to

those produced by PHASE runs averaging ~1 day.
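
The second quality measure, the number of "flips," can be sketched in Python as follows. This is an illustrative reading of the definition given above, not the evaluation code used for Table I; the function name and the input layout (one pair of 0/1 allele lists per individual) are assumptions.

```python
def count_flips(true_pair, est_pair):
    """Count switch errors ("flips") between estimated and true phase.

    Each argument is a pair of haplotypes (lists of 0/1 alleles) for one
    individual. Only sites heterozygous in both the true and the estimated
    pair are compared, which masks incorrectly imputed sites.
    """
    true_a, true_b = true_pair
    est_a, est_b = est_pair
    flips = 0
    previous = None                    # orientation at the last informative site
    for ta, tb, ea, eb in zip(true_a, true_b, est_a, est_b):
        if ta == tb or ea == eb:       # homozygous in truth or estimate: skip
            continue
        orientation = (ea == ta)       # does est_a carry true_a's allele here?
        if previous is not None and orientation != previous:
            flips += 1
        previous = orientation
    return flips
```

Because only the relative orientation of consecutive heterozygous sites is compared, swapping the two estimated haplotypes as a whole does not count as an error.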

GENOTYPE IMPUTATION FOR UNTYPED MARKERS

Encouraged by these initial results, we proceeded to apply our method to impute genotypes

for untyped markers in the Finland United States Investigation of NIDDM genetics

(FUSION) GWAS [Scott et al., 2007]. Since a previous analysis suggested LD patterns in

the HapMap CEU and in FUSION are similar [Willer et al., 2006], we used genotypes for

290,690 autosomal markers with allele frequency >5% in the Illumina 317K SNP chip and

haplotypes for 2.5M polymorphic markers in the phased HapMap CEU chromosomes as

input. After running the haplotyping procedure described above, we estimated the most

likely genotype at each position (taking a majority vote across all iterations) and the

expected number of copies of the minor allele at each position (a fractional value between 0

and 2) for each individual. We obtained similar results running the haplotyping procedure

for 50–100 iterations or using only a smaller number of iterations (10–20) to estimate model

parameters and then calculating maximum likelihood estimates for the missing genotypes

and allele counts. Different chromosomes were analyzed in parallel and, overall, imputing

genotypes for 2,335 unrelated individuals took <2 days for each of the largest chromosomes

on a 2006 vintage 2.40GHz Pentium Xeon processor. In total, we imputed genotypes for

2,266,562 SNPs per individual. On average, our method used stretches of ~150 kb from the

HapMap CEU panel to reconstruct haplotypes for individuals in the FUSION sample.
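
The two per-genotype summaries described above (the majority-vote genotype and the expected number of copies of the minor allele) can be sketched in Python. The function name and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def summarize_samples(sampled_genotypes):
    """Summarize genotypes sampled for one individual at one marker.

    sampled_genotypes -- minor-allele counts (0, 1 or 2), one per iteration
    Returns (most likely genotype, expected minor-allele count).
    """
    counts = Counter(sampled_genotypes)
    # Majority vote; ties are broken in favor of the smaller allele count,
    # which is an arbitrary choice made for this sketch.
    best = min(counts, key=lambda g: (-counts[g], g))
    dosage = sum(sampled_genotypes) / len(sampled_genotypes)
    return best, dosage
```

For example, genotypes sampled as 2, 2, 1, 2 over four iterations give a most likely genotype of 2 and an expected allele count of 1.75.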

IMPUTATION IN THE FUSION GENOMEWIDE ASSOCIATION STUDY

To evaluate the quality of imputed genotypes, we contrasted our estimates of the most likely

genotypes and the expected number of copies of the minor allele with actual genotype data

for three sets of markers: 521 SNP markers in a region of chromosome 14 previously

examined to fine-map a candidate linkage region [Willer et al., 2006], 1,234 SNP markers

selected to augment coverage of the Illumina 317K panel in regions surrounding 222

candidate genes [Gaulton et al., 2008] and 12,702 markers with MAF <5% not included in

the set of 290,690 markers used for imputation. We expected the last two panels of markers

to be harder to impute, because they represent SNPs that are not well tagged by the Illumina

317K SNP chip or that have lower MAF. We observed that 98.60% of imputed alleles

matched actual genotyped alleles in the fine-mapping panel, 96.24% in the candidate gene

panel, and 98.73% in the low MAF SNP panel. Furthermore, the average r² between imputed genotypes and actual genotypes was 90.4, 79.1, and 74.0% in the three SNP panels,

respectively. This represents an improvement of 14–39% compared to the best available

single marker tags, which provided an average r² of 76.5, 52.8, and 35.5% in the three SNP

panels, respectively.
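
The two accuracy measures used in this comparison can be sketched in Python. The sketch assumes genotypes are coded as minor-allele counts (0, 1, or 2) and that r² is the squared Pearson correlation between imputed and experimental values; the function names are hypothetical.

```python
def allelic_concordance(imputed, actual):
    """Fraction of alleles that match, given genotypes coded as 0, 1 or 2."""
    matches = sum(2 - abs(i - a) for i, a in zip(imputed, actual))
    return matches / (2 * len(actual))


def squared_correlation(x, y):
    """Squared Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)
```

Concordance can remain high for rare variants even when r² is low, because most individuals are homozygous for the common allele. This is consistent with the low MAF panel showing the highest concordance and the lowest r².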

MEASURES OF IMPUTATION QUALITY

Our Markov Chain produces three estimates of imputation quality and these can be used to

focus analyses on subsets of high-quality genotypes. First, it produces a quality score that

estimates the accuracy of each imputed genotype and is simply the proportion of iterations

where the final imputed genotype (by taking a majority vote across all iterations) was

selected. Second, it produces an overall measure of the accuracy of imputation for each

marker, which is the genotype quality score averaged across all individuals. Finally, by

comparing the distribution of sampled genotypes in each iteration with the estimated allele

counts that result from averaging over all iterations, it produces an estimate of the r² between imputed and true genotypes (see Methods for more details). Quality measures for

individual genotypes were good predictors of imputation accuracy (Supplementary Figure 1,

Right Panel) and show that most imputed genotypes are called with a high degree of

confidence (Supplementary Figure 1, Left Panel). For example, as measured by their quality

scores, the top 95% of genotypes had average quality scores of 98.9% and actually matched

experimental genotypes 98.6% of the time. Most of the errors affect a single allele so that,

when measured on a per allele basis, concordance increases to 99.3%.
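
The first two quality measures follow directly from their definitions and can be sketched in Python. The third function is an assumption: it uses a variance ratio (observed variance of the expected allele counts divided by the variance expected under Hardy-Weinberg equilibrium), which is one common way to estimate r² without true genotypes; the estimator described in the Methods may differ in detail. All function names are hypothetical.

```python
from collections import Counter


def genotype_quality(sampled_genotypes):
    """Proportion of iterations in which the majority-vote genotype was drawn."""
    counts = Counter(sampled_genotypes)
    return max(counts.values()) / len(sampled_genotypes)


def marker_quality(samples_by_individual):
    """Average of the per-genotype quality scores across individuals."""
    scores = [genotype_quality(s) for s in samples_by_individual]
    return sum(scores) / len(scores)


def variance_ratio_r2(dosages):
    """Assumed r^2 estimate: observed dosage variance / Hardy-Weinberg variance."""
    n = len(dosages)
    mean = sum(dosages) / n
    p = mean / 2.0                          # estimated allele frequency
    expected = 2.0 * p * (1.0 - p)          # variance if genotypes were known
    if expected == 0:
        return 0.0
    observed = sum((d - mean) ** 2 for d in dosages) / n
    return observed / expected
```

When imputation is uninformative, every individual receives nearly the same expected allele count, so the observed variance and the ratio both approach zero.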

To avoid preferential removal of rare genotypes or alleles at each marker, we recommend

using the per marker quality scores to select a subset of imputed SNPs for analysis, instead

of the per genotype quality scores. Overall, we saw a correlation of 0.77 between the

estimated and actual accuracy of imputed genotypes for each marker. We also saw a

correlation of 0.84 between the r² estimated by our method and the actual r² that resulted

from comparing experimentally derived allele counts with their imputed estimates. Figure 1

shows the ROC curve [Pepe, 2003] for the two quality measures, showing that the estimated r² measure is a more effective way to identify poorly imputed markers. In the FUSION GWAS scan [Scott et al., 2007], we used an r² threshold of 0.30 to decide which markers were well imputed and should be included in further analyses, and which were not. At this threshold, we expect to remove 70% of poorly imputed markers (those where r² with experimental genotypes is <20%) but only 0.50% of better imputed markers (those where r²

with experimental genotypes is >50%).
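
The effect of such a threshold can be checked on any set of markers with both estimated and experimentally determined r² values. The Python sketch below is illustrative only; the function name and arguments are hypothetical, and it assumes markers with an estimated r² below the threshold are the ones removed.

```python
def fraction_removed(estimated_r2, actual_r2, keep_threshold, in_class):
    """Share of markers in a class that a filter on estimated r^2 removes.

    in_class -- predicate on the actual r^2 that defines the class of interest
    """
    selected = [est for est, act in zip(estimated_r2, actual_r2) if in_class(act)]
    if not selected:
        return None
    removed = sum(1 for est in selected if est < keep_threshold)
    return removed / len(selected)
```

Calling it once with `lambda r2: r2 < 0.20` and once with `lambda r2: r2 > 0.50` gives the two proportions quoted in the text for a given data set.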

IMPUTATION OF STRONGLY ASSOCIATED SNPS

The results summarized so far compare a variety of imputed genotypes with experimentally

derived counterparts. However, a more interesting comparison focuses on imputed

genotypes that appear to show strong evidence for association, as those might motivate

further downstream experiments. To evaluate the accuracy of imputed genotypes for these

“strongly associated SNPs,” we compared imputed and experimental genotypes in regions

that were only selected for follow-up genotyping after imputation (for example, because

imputed genotypes resulted in strong evidence for association but nearby genotyped markers

did not). Table II summarizes the comparison of allele frequencies, association test statistics,

and individual genotype calls between imputed genotypes and actual genotypes later

determined by genotyping. Overall, it is clear that even among these strongly associated

SNPs imputation provided accurate estimates of the true P-values. The largest observed

discrepancies were for rs17384005, rs11646114, and rs4812831, which were also the three

markers for which our imputation approach estimated lower r² with actual genotypes.

Imputation is particularly useful because it allows evidence for association at SNPs with no

reliable proxies to be evaluated more accurately. For instance, after imputation, average r² increased from 0.22 to 0.66 in the set of SNPs whose best genotyped proxy had r² <0.30 and

from 0.33 to 0.75 in the set of SNPs whose best genotyped proxy had r² <0.5 [for specific

examples of disease susceptibility loci that would be missed without imputation, see Li et

al., 2009b].

USING IMPUTATION TO ESTIMATE PAIRWISE DISEQUILIBRIUM

Remarkably, we observed that imputed genotypes could also be used to obtain very accurate

estimates of LD between pairs of untyped markers, or of LD between a genotyped marker

and an untyped marker. As shown in Figure 2, estimates of LD between two SNPs obtained

using imputed data are much closer to the results obtained by actually genotyping the two

SNPs than estimates obtained by looking up the two markers in the HapMap CEU database

(Supplementary Figure 2 shows a similar comparison for D’ estimates). Even with some

imprecision in estimates of individual genotypes, the increased sample size compensates to

reduce variation in the estimated LD measures.

COMPARISON OF DIFFERENT GENOTYPING PLATFORMS

Our experience with the FUSION GWAS, summarized above, shows that imputation can be

an effective way to estimate unobserved genotypes and/or allele counts. These genotypes

can then be used in a variety of downstream analyses, including logistic regression analyses

for discrete traits and linear regression analyses for quantitative traits, and to facilitate meta-

analysis of studies based on different platforms. A key issue when considering imputation-

based approaches is whether similarly accurate estimates of unobserved data points can be

obtained with different genotyping panels or in different populations [Clark and Li, 2007],

and to evaluate this we conducted two additional experiments.

In the first experiment, we used genotype data generated by the International HapMap

Consortium. We considered each of the HapMap samples in turn and masked available

genotypes so as to mimic an experiment using one of several commercially available chips.

For example, to evaluate the Affymetrix 500K SNP chip, we marked genotypes for all

markers that are not on the chip as missing for the individual being considered. We then

used haplotypes for the remaining individuals on the same HapMap analysis panel (either

YRI, CEU, or JPT+CHB) to impute the missing genotypes. The results are summarized in

Table III and clearly show that a large number of SNPs can be imputed very accurately

using any of the commercially available panels (e.g. with r² >0.80 to experimental

genotypes) and that, compared to relying on single marker tagging, imputation results in

improved coverage of the genome.
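
The masking step of this experiment can be sketched in Python. The marker names and the function names are hypothetical, and the sketch covers only the masking and bookkeeping, not the imputation itself.

```python
def mask_to_chip(genotypes, marker_ids, chip_markers):
    """Return a copy of one individual's genotypes with off-chip markers masked.

    genotypes    -- list of genotypes for one individual, one per marker
    marker_ids   -- marker names, in the same order as genotypes
    chip_markers -- set of marker names assumed to be on the chip
    """
    return [g if m in chip_markers else None
            for g, m in zip(genotypes, marker_ids)]


def masked_positions(masked):
    """Indices that were hidden and can be used to score imputation."""
    return [i for i, g in enumerate(masked) if g is None]
```

Imputed values at the masked positions can then be compared with the genotypes that were hidden, one individual at a time.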

Depending on the commercial panel and population being investigated, coverage of

HapMap SNPs (proportion of SNPs with r² >0.80) increased by 10–30% for low MAF

alleles (MAF<5%) and by 10–20% for more common alleles (MAF>5%). In agreement with

this result, the average r² with each untyped SNP was up to 40% higher when using imputed genotypes than when using the best available single

marker proxy. Imputation remained valuable even for panels with ~1 million directly

genotyped SNPs. In practice, the results shown in Table III are likely to represent an upper

bound on the performance of our method in real settings, because additional errors will

result from discrepancies in genotyping protocols between individual laboratories and the

HapMap and from differences in LD patterns between the HapMap and the samples being

studied. Nevertheless, they suggest our method is likely to be helpful for a variety of

currently available commercial SNP panels.
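
Coverage as defined above (the proportion of SNPs with r² >0.80, reported separately for low MAF and more common alleles) can be computed with a short Python sketch. The function name, the dictionary keys, and the treatment of SNPs with MAF exactly equal to the cutoff are assumptions.

```python
def coverage(r2_values, mafs, threshold=0.80, maf_cutoff=0.05):
    """Proportion of SNPs with r^2 above threshold, split by MAF.

    Returns a dict with keys "low_maf" and "common"; a bin with no SNPs
    is reported as None.
    """
    bins = {"low_maf": [], "common": []}
    for r2, maf in zip(r2_values, mafs):
        bins["low_maf" if maf < maf_cutoff else "common"].append(r2 > threshold)
    return {name: (sum(flags) / len(flags) if flags else None)
            for name, flags in bins.items()}
```

Running it once on r² values from the best single marker proxy and once on r² values from imputed genotypes gives the two coverage figures being compared.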
