Discriminant analysis of principal components: a new method for the analysis of genetically structured populations
TLDR
This paper introduces the Discriminant Analysis of Principal Components (DAPC), a multivariate method designed to identify and describe clusters of genetically related individuals, and reports that it generally performs better than STRUCTURE at characterizing population subdivision.
Abstract:
The dramatic progress in sequencing technologies offers unprecedented prospects for deciphering the organization of natural populations in space and time. However, the size of the datasets generated also poses some daunting challenges. In particular, Bayesian clustering algorithms based on pre-defined population genetics models, such as the STRUCTURE or BAPS software, may not be able to cope with this unprecedented amount of data. Thus, there is a need for less computer-intensive approaches. Multivariate analyses seem particularly appealing as they are specifically devoted to extracting information from large datasets. Unfortunately, currently available multivariate methods still lack some essential features needed to study the genetic structure of natural populations. We introduce the Discriminant Analysis of Principal Components (DAPC), a multivariate method designed to identify and describe clusters of genetically related individuals. When group priors are lacking, DAPC uses sequential K-means and model selection to infer genetic clusters. Our approach extracts rich information from genetic data, providing assignment of individuals to groups, a visual assessment of between-population differentiation, and the contribution of individual alleles to population structuring. We evaluate the performance of our method using simulated data, which were also analyzed using STRUCTURE as a benchmark. Additionally, we illustrate the method by analyzing microsatellite polymorphism in worldwide human populations and hemagglutinin gene sequence variation in seasonal influenza. Analysis of simulated data revealed that our approach performs generally better than STRUCTURE at characterizing population subdivision. The tools implemented in DAPC for the identification of clusters and graphical representation of between-group structures make it possible to unravel complex population structures. Our approach is also faster than Bayesian clustering algorithms by several orders of magnitude, and may be applicable to a wider range of datasets.
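The pipeline the abstract describes (reduce the data with PCA, infer clusters by running K-means for successive values of K and comparing them with a model-selection score, then run a discriminant analysis on the retained principal components) can be illustrated with a small sketch. This is not the authors' implementation: the simulated genotypes, the function names (`pca_scores`, `kmeans`, `discriminant_axes`), the farthest-point initialisation, the number of retained components, and the BIC-style score `n*log(WSS/n) + k*log(n)` are all assumptions made for illustration.

```python
import numpy as np

def pca_scores(X, n_pcs):
    """Project centred data onto its first n_pcs principal components."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_pcs] * S[:n_pcs]

def kmeans(Z, k, n_iter=100):
    """Plain K-means with a deterministic farthest-point start (an assumption)."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([Z[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    wss = ((Z - centers[labels]) ** 2).sum()  # within-cluster sum of squares
    return labels, wss

def discriminant_axes(Z, labels):
    """Axes maximising between-group relative to within-group scatter."""
    grand = Z.mean(axis=0)
    p = Z.shape[1]
    W, B = np.zeros((p, p)), np.zeros((p, p))
    for g in np.unique(labels):
        Zg = Z[labels == g]
        mg = Zg.mean(axis=0)
        W += (Zg - mg).T @ (Zg - mg)
        B += len(Zg) * np.outer(mg - grand, mg - grand)
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(evals.real)[::-1]
    return evals.real[order], evecs.real[:, order]

# Hypothetical data: 3 populations, 30 diploid individuals each, 60 biallelic loci.
rng = np.random.default_rng(1)
pops = np.repeat(np.arange(3), 30)
freqs = np.full((3, 60), 0.1)
for j in range(3):
    freqs[j, j * 20:(j + 1) * 20] = 0.9   # each population has its own common alleles
X = rng.binomial(2, freqs[pops]).astype(float)  # allele counts 0/1/2

n = len(X)
Z = pca_scores(X, n_pcs=5)                # step 1: PCA (5 PCs is an arbitrary choice)

bic, fits = {}, {}
for k in range(1, 6):                     # step 2: K-means for successive K
    fits[k], wss = kmeans(Z, k)
    bic[k] = n * np.log(wss / n) + k * np.log(n)  # assumed BIC-style score

labels = fits[3]                          # K = 3 is the simulated truth here
evals, axes = discriminant_axes(Z, labels)        # step 3: discriminant analysis
coords = (Z - Z.mean(axis=0)) @ axes[:, :2]       # individuals on the first 2 axes
```

In this sketch `bic` holds one score per candidate K, `labels` gives the inferred group of each individual, and `coords` is what would be drawn as a scatterplot to assess between-group differentiation. Running the discriminant step on a handful of principal components rather than on the raw allele counts keeps the within-group scatter matrix invertible even when loci outnumber individuals.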
Citations
Journal Article
adegenet 1.3-1
Thibaut Jombart, Ismaïl Ahmed
TL;DR: New tools implemented in the adegenet 1.3-1 package for handling and analyzing genome-wide single nucleotide polymorphism (SNP) data are introduced, using a bit-level coding scheme for SNP data and parallelized computation.
Journal Article
Inference of population structure using dense haplotype data
TL;DR: A novel inference framework that aims to efficiently capture information on population structure provided by patterns of haplotype similarity and an efficient model-based approach to identify discrete populations using this matrix, which offers advantages over PCA and over existing clustering algorithms in terms of speed, number of separable populations, and sensitivity to subtle population structure.
Journal Article
Genome-wide comparative diversity uncovers multiple targets of selection for improvement in hexaploid wheat landraces and cultivars.
Colin Cavanagh, Shiaoman Chao, Shichen Wang, Bevan Emma Huang, Stuart Stephen, Seifollah Kiani, Kerrie Forrest, Cyrille Saintenac, Gina Brown-Guedira, Alina Akhunova, Deven R. See, Guihua Bai, Michael O. Pumphrey, Luxmi Tomar, Debbie Wong, Stephan Kong, Matthew P. Reynolds, Marta Lopez da Silva, Harold E. Bockelman, Luther E. Talbert, James A. Anderson, Susanne Dreisigacker, Stephen Baenziger, Arron H. Carter, Viktor Korzun, Peter L. Morrell, Jorge Dubcovsky, Matthew K. Morell, Mark E. Sorrells, Matthew J. Hayden, Eduard Akhunov
TL;DR: It is shown that selection likely acts on distinct targets or multiple functionally equivalent alleles in different portions of the geographic range of wheat, suggesting either weak selection pressure or temporal variation in the targets of directional selection during breeding probably associated with changing agricultural practices or environmental conditions.
Journal Article
Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems
TL;DR: A simple extension of a sparse PLS exploratory approach is proposed to perform variable selection in a multiclass classification framework and has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets.
Journal Article
The program structure does not reliably recover the correct population structure when sampling is uneven: subsampling and new estimators alleviate the problem.
TL;DR: Four new supervised methods to detect the number of clusters were developed and tested, and were found to outperform the existing methods on both evenly and unevenly sampled data sets. A subsampling strategy aiming to reduce sampling unevenness between subpopulations is also presented and tested.
References
Journal Article
Estimating the Dimension of a Model
TL;DR: In this paper, the problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion.
Journal Article
Inference of population structure using multilocus genotype data
TL;DR: A model-based clustering method is described for using multilocus genotype data to infer population structure and assign individuals to populations; it can be applied to most of the commonly used genetic markers, provided that they are not closely linked.
Journal Article
Clustal W and Clustal X version 2.0
Mark A. Larkin, Gordon Blackshields, Nigel P. Brown, R. Chenna, Paul A. McGettigan, Hamish McWilliam, Franck Valentin, Iain M. Wallace, Andreas Wilm, Rodrigo Lopez, J.D. Thompson, Toby J. Gibson, Desmond G. Higgins
TL;DR: The Clustal W and Clustal X multiple sequence alignment programs have been completely rewritten in C++ to facilitate the further development of the alignment algorithms; the rewrite has also allowed proper porting of the programs to the latest versions of Linux, Macintosh and Windows operating systems.
Journal Article
Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study.
TL;DR: It is found that in most cases the estimated ‘log probability of data’ does not provide a correct estimation of the number of clusters, K, and using an ad hoc statistic ΔK based on the rate of change in the log probability between successive K values, structure accurately detects the uppermost hierarchical level of structure for the scenarios the authors tested.
Related Papers (5)
Inference of population structure using multilocus genotype data
Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study.
STRUCTURE HARVESTER: a website and program for visualizing STRUCTURE output and implementing the Evanno method
Dent Earl, Bridgett M. vonHoldt