MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes

Yun Li¹, Cristen J. Willer², Jun Ding², Paul Scheet³, and Gonçalo R. Abecasis²,*
¹ Department of Genetics, Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina
² Center for Statistical Genetics, Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan
³ Department of Epidemiology, University of Texas M.D. Anderson Cancer Center, Houston, Texas
Abstract
Genome-wide association studies (GWAS) can identify common alleles that contribute to complex
disease susceptibility. Despite the large number of SNPs assessed in each study, the effects of
most common SNPs must be evaluated indirectly using either genotyped markers or haplotypes
thereof as proxies. We have previously implemented a computationally efficient Markov Chain
framework for genotype imputation and haplotyping in the freely available MaCH software
package. The approach describes sampled chromosomes as mosaics of each other and uses
available genotype and shotgun sequence data to estimate unobserved genotypes and haplotypes,
together with useful measures of the quality of these estimates. Our approach is already widely
used to facilitate comparison of results across studies as well as meta-analyses of GWAS. Here,
we use simulations and experimental genotypes to evaluate its accuracy and utility, considering
choices of genotyping panels, reference panel configurations, and designs where genotyping is
replaced with shotgun sequencing. Importantly, we show that genotype imputation not only
facilitates cross study analyses but also increases power of genetic association studies. We show
that genotype imputation of common variants using HapMap haplotypes as a reference is very
accurate using either genome-wide SNP data or smaller amounts of data typical in fine-mapping
studies. Furthermore, we show the approach is applicable in a variety of populations. Finally, we
illustrate how association analyses of unobserved variants will benefit from ongoing advances
such as larger HapMap reference panels and whole genome shotgun sequencing technologies.
Keywords
imputation; haplotyping; sequencing
Published in final edited form as: Genet Epidemiol. 2010 December; 34(8): 816–834. doi:10.1002/gepi.20533.
Author manuscript; available in PMC 2011 September 19. © 2010 Wiley-Liss, Inc.
* Correspondence to: Gonçalo R. Abecasis, Department of Biostatistics, University of Michigan School of Public Health, 1415
Washington Heights, Ann Arbor, MI 48109. goncalo@umich.edu.
Additional Supporting Information may be found in the online version of this article.
INTRODUCTION
Most ongoing genome-wide association studies (GWAS) rely on a commercial SNP
genotyping panel that directly assays only a small fraction of SNPs in the human genome
[Carlson et al., 2003; The International HapMap Consortium 2005]. In these scans, the
majority of SNPs in the genome must be evaluated indirectly using one or more of the
genotyped SNPs as proxies [Barrett and Cardon, 2006; Pe’er et al., 2006]. Despite the ability
of individual genome-wide association scans to identify common alleles that make large
contributions to disease risk and a subset of the loci with smaller effect [Hirschhorn and
Daly, 2005], many alleles that contribute to complex disease can only be identified through
the meta-analysis of multiple genome-wide scans [for specific examples, see Lettre et al.,
2008; Sanna et al., 2008; Willer et al., 2008, 2009]. Although it is possible to assign SNPs
genotyped in each study as proxies for SNPs genotyped in the other studies [Carlson et al.,
2004; de Bakker et al., 2005; Lin et al., 2004; Nicolae, 2006; Zaitlen et al., 2007], meta-
analyses of GWAS conducted in this manner would be cumbersome because of the limited
overlap between the different commercial panels and because different choices of proxies
for a particular SNP might lead to somewhat different conclusions.
GENOTYPE IMPUTATION
A much more attractive approach for cross study analyses is to combine genotypes
generated by the International HapMap Consortium [The International HapMap
Consortium, 2005] with genotypes from individual studies, and then use a haplotyping
algorithm that can handle genome scale data to impute genotypes at untyped markers in each
study [Scheet and Stephens, 2006]. This strategy results in a situation where all studies are
“genotyped” at all the markers examined by the HapMap consortium (albeit some markers
are genotyped using conventional means and others are genotyped in silico [Burdick et al.,
2006]). The approach relies on the intuition that even two apparently “unrelated” individuals
can share short stretches of haplotype inherited from distant common ancestors. Once one of
these stretches is identified using genotypes for a few SNPs, alleles for intervening SNPs
that are measured in one of the individuals, but not the other, can be imputed. Provided
shared haplotype stretches are identified correctly, imputed genotypes will be accurate
unless they have been disrupted by gene conversion or mutation events.
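As a rough illustration of this intuition (a toy sketch, not the MaCH algorithm itself), the snippet below copies untyped alleles from the single reference haplotype that best matches the typed markers. The function name `impute_from_best_match` and the small data set are hypothetical and exist only for this example.

```python
def impute_from_best_match(study_hap, reference_haps):
    """Fill untyped alleles (None) by copying from the reference haplotype
    that agrees with the study haplotype at the most typed markers."""
    typed = [i for i, allele in enumerate(study_hap) if allele is not None]

    def matches(ref):
        # Number of typed markers where the reference carries the same allele.
        return sum(ref[i] == study_hap[i] for i in typed)

    best = max(reference_haps, key=matches)
    return [best[i] if allele is None else allele
            for i, allele in enumerate(study_hap)]


# Hypothetical reference panel of three haplotypes over six markers.
reference_haps = [
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 0, 1, 1, 0],
]
# Markers 1 and 3 were not genotyped in the study sample.
study_hap = [1, None, 0, None, 0, 1]
imputed = impute_from_best_match(study_hap, reference_haps)
```

A real analysis works with unphased genotypes and lets the best-matching template change along the chromosome, which is what the hidden Markov model in the following section provides.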
INITIAL EVALUATION OF IMPUTED GENOTYPES AND HAPLOTYPES
Here, we systematically evaluate the genotype imputation approach outlined in the
paragraph above using our Markov Chain Haplotyping algorithm (MaCH 1.0; see Appendix
for implementation details). To estimate haplotypes, our approach starts by randomly
generating a pair of haplotypes that is compatible with observed genotypes for each sampled
individual. These initial haplotype estimates are then refined through a series of iterations. In
each iteration, a new pair of haplotypes is sampled for each individual in turn using a
Hidden Markov Model (HMM) that describes the haplotype pair as an imperfect mosaic of
the other haplotypes. Model parameters that characterize the probability of change in the
mosaic pattern between every pair of consecutive markers and the probability of observing
an imperfection in the mosaic at each specific point are also updated. After many iterations
(typically 20–100), a consensus haplotype can be constructed by merging the haplotypes
sampled in each round.
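The sampling step can be sketched in simplified form. The code below is a haploid illustration under stated assumptions, not the MaCH implementation: it samples one mosaic path for a single haplotype rather than a haplotype pair, uses one switch probability `theta` and one error probability `eps` for all markers, and does not update these parameters between iterations. All names are hypothetical.

```python
import random


def sample_mosaic(target, refs, theta=0.05, eps=0.01, rng=random):
    """Sample one copying path (which reference haplotype is copied at each
    marker) for a haploid target, conditional on its observed alleles.

    theta: probability that the mosaic switches template between markers
    eps:   probability that the copied allele is observed imperfectly
    """
    n_refs, n_markers = len(refs), len(target)

    def emit(j, m):
        if target[m] is None:  # missing allele carries no information
            return 1.0
        return 1.0 - eps if refs[j][m] == target[m] else eps

    # Forward pass: fwd[m][j] is proportional to P(data up to m, state j at m).
    fwd = [[emit(j, 0) / n_refs for j in range(n_refs)]]
    for m in range(1, n_markers):
        total = sum(fwd[-1])
        row = [((1.0 - theta) * fwd[-1][j] + theta * total / n_refs) * emit(j, m)
               for j in range(n_refs)]
        scale = sum(row)
        fwd.append([x / scale for x in row])  # rescale to avoid underflow

    # Backward sampling: draw the last state, then walk back to marker 0.
    path = [0] * n_markers
    path[-1] = rng.choices(range(n_refs), weights=fwd[-1])[0]
    for m in range(n_markers - 2, -1, -1):
        nxt = path[m + 1]
        weights = [fwd[m][j] * ((1.0 - theta) * (j == nxt) + theta / n_refs)
                   for j in range(n_refs)]
        path[m] = rng.choices(range(n_refs), weights=weights)[0]
    return path


refs = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1]]
target = [0, 0, None, 1, None, 1]
path = sample_mosaic(target, refs, rng=random.Random(1))
sampled_hap = [refs[j][m] for m, j in enumerate(path)]
```

Repeating the draw many times and tallying the alleles in `sampled_hap` at each missing position gives the kind of per-iteration samples from which a consensus can be built.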
HAPLOTYPING
Our approach was inspired by the Markov models commonly used for pedigree analysis [for
examples, see Abecasis et al., 2002; Kruglyak et al., 1996; Lander and Green, 1987] and
shares several features with other HMMs used to describe sampled haplotypes as a mosaic
of a set of reference haplotypes [Daly et al., 2001; Li and Stephens, 2003; Mott et al., 2000;
Stephens and Scheet, 2005a]. In order to evaluate its performance, we simulated two sets of
100 1 Mb regions that mimic the degree of linkage disequilibrium (LD) in the HapMap CEU
and YRI samples [Schaffner et al., 2005]. In each region, we simulated genotypes for ~200
markers, ascertained to mimic HapMap I allele frequency patterns [Marchini et al., 2006], in
90 individuals with 2% of the genotypes missing at random. We then used our method to
reconstruct individual haplotypes and tallied three measures of haplotyping quality
[Marchini et al., 2006]: (1) the number of incorrectly imputed missing genotypes, (2) among
heterozygous sites, the number of consecutive sites that are phased incorrectly with respect
to each other (this is the number of “flips” required to transform estimated haplotypes into
the true haplotypes, after masking incorrectly imputed sites), and (3) the number of perfectly
inferred haplotypes. The three measures were averaged over all 100 regions and the results
are summarized in Table I. For comparison, the table also includes results for PHASE
[Stephens and Scheet, 2005b; Stephens et al., 2001] and fastPHASE [Scheet and Stephens,
2006], two state-of-the-art haplotyping algorithms [Marchini et al., 2006], and for BEAGLE
[Browning, 2006] and PL-EM [Qin et al., 2002], two alternative haplotyping algorithms that
are very computationally efficient. Table I clearly shows that our method is competitive in
all three measures: our method results in slightly fewer incorrectly imputed genotypes,
requires slightly fewer flips to transform imputed haplotypes into the true haplotypes, and
produces slightly more correctly inferred haplotypes over the entire 1 Mb stretch than
PHASE, which was the second best method. Furthermore, note that estimates of haplotypes
and missing genotypes obtained in 5–20 min using our method are comparable in quality to
those produced by PHASE runs averaging ~1 day.
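Measure (2), the number of "flips", can be made concrete with a short sketch. The function below is an illustrative reading of the definition given above, with hypothetical names and data: homozygous sites are skipped, sites whose genotype was estimated incorrectly are masked, and a flip is counted whenever the phase relative to the truth changes between consecutive remaining heterozygous sites.

```python
def count_flips(true_pair, est_pair):
    """Count switch errors between a true and an estimated haplotype pair."""
    t1, t2 = true_pair
    e1, e2 = est_pair
    phases = []
    for a, b, c, d in zip(t1, t2, e1, e2):
        if a == b:            # homozygous site: no phase information
            continue
        if {a, b} != {c, d}:  # genotype estimated incorrectly: mask the site
            continue
        phases.append(c == a)  # True if the first estimated haplotype tracks t1
    # A flip occurs wherever consecutive heterozygous sites disagree in phase.
    return sum(p != q for p, q in zip(phases, phases[1:]))


true_pair = ([0, 1, 0, 1, 0], [1, 0, 0, 0, 1])
est_pair = ([0, 1, 0, 0, 1], [1, 0, 0, 1, 0])
flips = count_flips(true_pair, est_pair)
```

In this example the estimate follows the true phase at the first two heterozygous sites and the opposite phase at the last two, so a single flip restores the true haplotypes.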
GENOTYPE IMPUTATION FOR UNTYPED MARKERS
Encouraged by these initial results, we proceeded to apply our method to impute genotypes
for untyped markers in the Finland United States Investigation of NIDDM genetics
(FUSION) GWAS [Scott et al., 2007]. Since a previous analysis suggested LD patterns in
the HapMap CEU and in FUSION are similar [Willer et al., 2006], we used genotypes for
290,690 autosomal markers with allele frequency >5% in the Illumina 317K SNP chip and
haplotypes for 2.5M polymorphic markers in the phased HapMap CEU chromosomes as
input. After running the haplotyping procedure described above, we estimated the most
likely genotype at each position (taking a majority vote across all iterations) and the
expected number of copies of the minor allele at each position (a fractional value between 0
and 2) for each individual. We obtained similar results running the haplotyping procedure
for 50–100 iterations or using only a smaller number of iterations (10–20) to estimate model
parameters and then calculating maximum likelihood estimates for the missing genotypes
and allele counts. Different chromosomes were analyzed in parallel and, overall, imputing
genotypes for 2,335 unrelated individuals took <2 days for each of the largest chromosomes
on a 2006 vintage 2.40GHz Pentium Xeon processor. In total, we imputed genotypes for
2,266,562 SNPs per individual. On average, our method used stretches of ~150 kb from the
HapMap CEU panel to reconstruct haplotypes for individuals in the FUSION sample.
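The two per-genotype summaries described above (the majority-vote genotype and the expected minor allele count) can be sketched as follows. This is a minimal illustration that assumes the genotypes sampled across iterations are available as a list of minor allele counts; the function name and the sample values are hypothetical.

```python
from collections import Counter


def summarize_samples(sampled_genotypes):
    """Summarize the genotypes (0, 1 or 2 copies of the minor allele) sampled
    for one individual at one marker, one value per iteration."""
    most_likely = Counter(sampled_genotypes).most_common(1)[0][0]  # majority vote
    dosage = sum(sampled_genotypes) / len(sampled_genotypes)       # between 0 and 2
    return most_likely, dosage


# Ten hypothetical iterations for one individual at one untyped marker.
samples = [1, 1, 0, 1, 2, 1, 1, 0, 1, 1]
most_likely, dosage = summarize_samples(samples)
```

The fractional dosage retains uncertainty that the majority vote discards, which is why it is often the more useful input for downstream regression analyses.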
IMPUTATION IN THE FUSION GENOMEWIDE ASSOCIATION STUDY
To evaluate the quality of imputed genotypes, we contrasted our estimates of the most likely
genotypes and the expected number of copies of the minor allele with actual genotype data
for three sets of markers: 521 SNP markers in a region of chromosome 14 previously
examined to fine-map a candidate linkage region [Willer et al., 2006], 1,234 SNP markers
selected to augment coverage of the Illumina 317K panel in regions surrounding 222
candidate genes [Gaulton et al., 2008] and 12,702 markers with MAF <5% not included in
the set of 290,690 markers used for imputation. We expected the last two panels of markers
to be harder to impute, because they represent SNPs that are not well tagged by the Illumina
317K SNP chip or that have lower MAF. We observed that 98.60% of imputed alleles
matched actual genotyped alleles in the fine-mapping panel, 96.24% in the candidate gene
panel, and 98.73% in the low MAF SNP panel. Furthermore, the average r² between
imputed genotypes and actual genotypes was 90.4, 79.1, and 74.0% in the three SNP panels,
respectively. This represents an improvement of 14–39 percentage points compared to the best available
single marker tags, which provided an average r² of 76.5, 52.8, and 35.5% in the three SNP
panels, respectively.
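The two accuracy measures used in this comparison can be computed as in the sketch below. It assumes genotypes are coded as minor allele counts (0, 1, or 2), that allelic concordance compares most likely genotypes, and that r² is the squared Pearson correlation between imputed allele counts and experimental genotypes. The function names and the six example genotypes are hypothetical.

```python
def allelic_concordance(imputed, actual):
    """Fraction of alleles that match between imputed and actual genotypes."""
    # Two unphased genotypes g1 and g2 share 2 - |g1 - g2| alleles.
    matching = sum(2 - abs(i - a) for i, a in zip(imputed, actual))
    return matching / (2 * len(actual))


def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


actual = [0, 1, 2, 1, 0, 0]                 # experimental genotypes
calls = [0, 1, 1, 1, 0, 0]                  # most likely imputed genotypes
dosages = [0.1, 0.9, 1.4, 1.0, 0.0, 0.2]    # expected minor allele counts
concordance = allelic_concordance(calls, actual)
r2 = r_squared(dosages, actual)
```

Concordance is dominated by the common genotype when the minor allele is rare, which is one reason a high concordance can coexist with a modest r² for low MAF markers.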
MEASURES OF IMPUTATION QUALITY
Our Markov Chain produces three estimates of imputation quality and these can be used to
focus analyses on subsets of high-quality genotypes. First, it produces a quality score that
estimates the accuracy of each imputed genotype and is simply the proportion of iterations
where the final imputed genotype (by taking a majority vote across all iterations) was
selected. Second, it produces an overall measure of the accuracy of imputation for each
marker, which is the genotype quality score averaged across all individuals. Finally, by
comparing the distribution of sampled genotypes in each iteration with the estimated allele
counts that result from averaging over all iterations, it produces an estimate of the r²
between imputed and true genotypes (see Methods for more details). Quality measures for
individual genotypes were good predictors of imputation accuracy (Supplementary Figure 1,
Right Panel) and show that most imputed genotypes are called with a high degree of
confidence (Supplementary Figure 1, Left Panel). For example, as measured by their quality
scores, the top 95% of genotypes had average quality scores of 98.9% and actually matched
experimental genotypes 98.6% of the time. Most of the errors affect a single allele so that,
when measured on a per allele basis, concordance increases to 99.3%.
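The first two quality measures follow directly from the sampled genotypes, as the sketch below shows. It assumes the per-iteration samples for each individual at a marker are kept as lists of minor allele counts; the names and values are hypothetical, and the estimated r² measure is not reproduced here because its formula is given in the Methods rather than in this section.

```python
def genotype_quality(samples):
    """Proportion of iterations in which the majority-vote genotype was sampled."""
    best = max(set(samples), key=samples.count)
    return samples.count(best) / len(samples)


def marker_quality(samples_by_individual):
    """Per-marker quality: genotype quality averaged across individuals."""
    scores = [genotype_quality(s) for s in samples_by_individual]
    return sum(scores) / len(scores)


# Hypothetical samples for three individuals at one marker, four iterations each.
marker_samples = [[0, 0, 0, 0], [1, 1, 0, 1], [2, 1, 1, 2]]
per_genotype = [genotype_quality(s) for s in marker_samples]
per_marker = marker_quality(marker_samples)
```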
To avoid preferential removal of rare genotypes or alleles at each marker, we recommend
using the per marker quality scores to select a subset of imputed SNPs for analysis, instead
of the per genotype quality scores. Overall, we saw a correlation of 0.77 between the
estimated and actual accuracy of imputed genotypes for each marker. We also saw a
correlation of 0.84 between the r² estimated by our method and the actual r² that resulted
from comparing experimentally derived allele counts with their imputed estimates. Figure 1
shows the ROC curve [Pepe, 2003] for the two quality measures, showing that the estimated
r² measure is a more effective way to identify poorly imputed markers. In the FUSION
GWAS scan [Scott et al., 2007], we used an r² threshold of 0.30 to decide which markers
were well imputed and should be included in further analyses, and which were not. At this
threshold, we expect to remove 70% of poorly imputed markers (those where r² with
experimental genotypes is <20%) but only 0.50% of better imputed markers (those where r²
with experimental genotypes is >50%).
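The trade-off at a given threshold can be tabulated as in the following sketch. It assumes that markers with estimated r² below the threshold are the ones removed, and that both classes of markers are non-empty. The function name and the seven example markers are hypothetical and do not reproduce the FUSION data.

```python
def filter_summary(estimated_r2, actual_r2, threshold=0.30):
    """Fraction of poorly imputed (actual r2 < 0.20) and better imputed
    (actual r2 > 0.50) markers removed by filtering on estimated r2."""
    pairs = list(zip(estimated_r2, actual_r2))
    poor = [est for est, act in pairs if act < 0.20]
    good = [est for est, act in pairs if act > 0.50]
    poor_removed = sum(est < threshold for est in poor) / len(poor)
    good_removed = sum(est < threshold for est in good) / len(good)
    return poor_removed, good_removed


estimated = [0.10, 0.25, 0.45, 0.80, 0.95, 0.28, 0.60]
actual = [0.05, 0.15, 0.10, 0.85, 0.90, 0.70, 0.40]
poor_removed, good_removed = filter_summary(estimated, actual)
```

Sweeping `threshold` from 0 to 1 and plotting `poor_removed` against `good_removed` traces out the kind of ROC curve shown in Figure 1.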
IMPUTATION OF STRONGLY ASSOCIATED SNPS
The results summarized so far compare a variety of imputed genotypes with experimentally
derived counterparts. However, a more interesting comparison focuses on imputed
genotypes that appear to show strong evidence for association, as those might motivate
further downstream experiments. To evaluate the accuracy of imputed genotypes for these
“strongly associated SNPs,” we compared imputed and experimental genotypes in regions
that were only selected for follow-up genotyping after imputation (for example, because
imputed genotypes resulted in strong evidence for association but nearby genotyped markers
did not). Table II summarizes the comparison of allele frequencies, association test statistics,
and individual genotype calls between imputed genotypes and actual genotypes later
determined by genotyping. Overall, it is clear that even among these strongly associated
SNPs imputation provided accurate estimates of the true P-values. The largest observed
discrepancies were for rs17384005, rs11646114, and rs4812831, which were also the three
markers for which our imputation approach estimated lower r² with actual genotypes.
Imputation is particularly useful because it allows evidence for association at SNPs with no
reliable proxies to be evaluated more accurately. For instance, after imputation, average r²
increased from 0.22 to 0.66 in the set of SNPs whose best genotyped proxy had r² <0.30 and
from 0.33 to 0.75 in the set of SNPs whose best genotyped proxy had r² <0.5 [for specific
examples of disease susceptibility loci that would be missed without imputation, see Li et
al., 2009b].
USING IMPUTATION TO ESTIMATE PAIRWISE DISEQUILIBRIUM
Remarkably, we observed that imputed genotypes could also be used to obtain very accurate
estimates of LD between pairs of untyped markers, or of LD between a genotyped marker
and an untyped marker. As shown in Figure 2, estimates of LD between two SNPs obtained
using imputed data are much closer to the results obtained by actually genotyping the two
SNPs than estimates obtained by looking up the two markers in the HapMap CEU database
(Supplementary Figure 2 shows a similar comparison for D’ estimates). Even with some
imprecision in estimates of individual genotypes, the increased sample size compensates to
reduce variation in the estimated LD measures.
COMPARISON OF DIFFERENT GENOTYPING PLATFORMS
Our experience with the FUSION GWAS, summarized above, shows that imputation can be
an effective way to estimate unobserved genotypes and/or allele counts. These genotypes
can then be used in a variety of downstream analyses, including logistic regression analyses
for discrete traits and linear regression analyses for quantitative traits, and to facilitate meta-
analysis of studies based on different platforms. A key issue when considering imputation-
based approaches is whether similarly accurate estimates of unobserved data points can be
obtained with different genotyping panels or in different populations [Clark and Li, 2007],
and to evaluate this we conducted two additional experiments.
In the first experiment, we used genotype data generated by the International HapMap
Consortium. We considered each of the HapMap samples in turn and masked available
genotypes so as to mimic an experiment using one of several commercially available chips.
For example, to evaluate the Affymetrix 500K SNP chip, we marked genotypes for all
markers that are not on the chip as missing for the individual being considered. We then
used haplotypes for the remaining individuals on the same HapMap analysis panel (either
YRI, CEU, or JPT+CHB) to impute the missing genotypes. The results are summarized in
Table III and clearly show that a large number of SNPs can be imputed very accurately
using any of the commercially available panels (e.g. with r² >0.80 to experimental
genotypes) and that, compared to relying on single marker tagging, imputation results in
improved coverage of the genome.
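The masking design of this experiment can be sketched as follows. The code only sets up the leave-one-out inputs and leaves the imputation step itself to whichever method is being evaluated. The function names, marker identifiers, and genotypes are hypothetical.

```python
def mask_to_chip(genotypes, marker_ids, chip_markers):
    """Copy one individual's genotypes, setting every marker that is not on
    the chip to None (missing)."""
    return [g if m in chip_markers else None
            for g, m in zip(genotypes, marker_ids)]


def leave_one_out(panel, marker_ids, chip_markers):
    """Yield (masked target, remaining individuals) for each individual in turn."""
    for i, target in enumerate(panel):
        others = panel[:i] + panel[i + 1:]
        yield mask_to_chip(target, marker_ids, chip_markers), others


marker_ids = ["m1", "m2", "m3", "m4"]
chip_markers = {"m1", "m3"}                  # markers assayed by the chip
panel = [[0, 1, 2, 1], [1, 1, 0, 2], [2, 0, 1, 0]]
experiments = list(leave_one_out(panel, marker_ids, chip_markers))
```

Each masked target is then imputed from the remaining individuals, and the imputed values at the masked markers are compared with the genotypes that were hidden.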
Depending on the commercial panel and population being investigated, coverage of
HapMap SNPs (proportion of SNPs with r² >0.80) increased by 10–30% for low MAF
alleles (MAF<5%) and by 10–20% for more common alleles (MAF>5%). In agreement with
this result, the average r² with each untyped SNP was up to 40% higher when using imputed
genotypes than when using the best available single marker proxy. Imputation remained
valuable even for panels with ~1 million directly
genotyped SNPs. In practice, the results shown in Table III are likely to represent an upper
bound on the performance of our method in real settings, because additional errors will
result from discrepancies in genotyping protocols between individual laboratories and the
HapMap and from differences in LD patterns between the HapMap and the samples being
studied. Nevertheless, they suggest our method is likely to be helpful for a variety of
currently available commercial SNP panels.
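Coverage as defined above (the proportion of SNPs with r² >0.80) is a simple summary to compute. The sketch below compares tagging and imputation on five invented r² values; the function name and the numbers are hypothetical and are not taken from Table III.

```python
def coverage(r2_values, threshold=0.80):
    """Proportion of SNPs whose r2 with experimental genotypes exceeds the threshold."""
    return sum(r2 > threshold for r2 in r2_values) / len(r2_values)


best_tag_r2 = [0.95, 0.60, 0.30, 0.85, 0.10]   # best single marker proxy
imputed_r2 = [0.98, 0.88, 0.70, 0.93, 0.40]    # imputed genotypes
tag_coverage = coverage(best_tag_r2)
imputation_coverage = coverage(imputed_r2)
```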