Journal Article

Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection.

TLDR
It is suggested that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
Abstract
One of the most important characteristics of Quantitative Structure-Activity Relationship (QSAR) models is their predictive power, defined as the ability of a model to predict accurately the target property (e.g., biological activity) of compounds that were not used for model development. We suggest that this goal can be achieved by rational division of an experimental SAR dataset into a training set and a test set, which are used for model development and validation, respectively. Given that all compounds are represented by points in multidimensional descriptor space, we argue that training and test sets must satisfy the following criteria: (i) representative points of the test set must be close to those of the training set; (ii) representative points of the training set must be close to representative points of the test set; (iii) the training set must be diverse. For quantitative description of these criteria, we use recently introduced molecular dataset diversity indices (Golbraikh, A., J. Chem. Inf. Comput. Sci., 40 (2000) 414-425). For rational division of a dataset into training and test sets, we use three closely related sphere-exclusion algorithms. Using several experimental datasets, we demonstrate that QSAR models built and validated with our approach have statistically better predictive power than models generated with either random or activity-ranking-based selection of the training and test sets. We suggest that rational approaches to the selection of training and test sets based on diversity principles should be used routinely in all QSAR modeling research.
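The abstract does not spell out the three sphere-exclusion variants, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes Euclidean distance in descriptor space and one hypothetical assignment rule: each sphere centre goes to the training set, and every still-unassigned compound inside the sphere goes to the test set. The names `sphere_exclusion_split`, `max_nearest_distance`, `radius`, and the random `descriptors` data are illustrative assumptions.

```python
import math
import random


def sphere_exclusion_split(points, radius, seed=0):
    """Split compound indices into (train, test) lists.

    Assumed rule (not taken from the paper): a randomly chosen unassigned
    point becomes a sphere centre and joins the training set; all other
    unassigned points within `radius` of it join the test set.
    """
    rng = random.Random(seed)
    unassigned = list(range(len(points)))
    train, test = [], []
    while unassigned:
        centre = unassigned.pop(rng.randrange(len(unassigned)))
        train.append(centre)
        inside = [i for i in unassigned
                  if math.dist(points[i], points[centre]) <= radius]
        test.extend(inside)
        excluded = set(inside)
        unassigned = [i for i in unassigned if i not in excluded]
    return sorted(train), sorted(test)


def max_nearest_distance(points, from_idx, to_idx):
    """Largest distance from a point in from_idx to its nearest point in to_idx."""
    return max(min(math.dist(points[i], points[j]) for j in to_idx)
               for i in from_idx)


# Hypothetical descriptor matrix: 60 compounds, 3 descriptors each.
data_rng = random.Random(42)
descriptors = [[data_rng.random() for _ in range(3)] for _ in range(60)]

train_idx, test_idx = sphere_exclusion_split(descriptors, radius=0.4)

# Criterion (i): how far is the worst-covered test point from the training set?
test_to_train = max_nearest_distance(descriptors, test_idx, train_idx)
# Criterion (ii): how far is the worst-covered training point from the test set?
train_to_test = max_nearest_distance(descriptors, train_idx, test_idx)

print(len(train_idx), len(test_idx),
      round(test_to_train, 3), round(train_to_test, 3))
```

Under this rule every test compound lies within `radius` of a training compound, which addresses criterion (i), and the training centres are pairwise more than `radius` apart, which is one simple reading of criterion (iii). Criterion (ii) is not guaranteed, since an isolated centre may have no test neighbour, so `train_to_test` has to be checked rather than assumed.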


Citations
Journal Article

On Some Aspects of Variable Selection for Partial Least Squares Regression Models

TL;DR: This article explores the optimum variable selection strategy for Partial Least Squares (PLS) regression using a model dataset of cytoprotection data. The compounds were classified by K-means clustering applied to the standardized descriptor matrix, and ten combinations of training and test sets were generated from the resulting clusters.
Journal Article

How not to develop a quantitative structure–activity or structure–property relationship (QSAR/QSPR)

TL;DR: 21 types of error that continue to be perpetrated in the QSAR/QSPR literature are identified and each is discussed, with examples (including some of the authors' own).
Journal Article

Comparative Studies on Some Metrics for External Validation of QSPR Models

TL;DR: This report questions the common "classic" practice of external validation, in which a single test set is used and the predictive quality of a model is judged on the basis of one particular validation metric.
Journal Article

On some aspects of validation of predictive quantitative structure–activity relationship models

TL;DR: This review focuses on the importance of validation of quantitative structure–activity relationship models and different methods of validation.
Book

Molecular, Clinical and Environmental Toxicology

Andreas Luch