An analysis of four missing data treatment methods for supervised learning

doi:10.1080/713827181

Journal Article

Gustavo E. A. P. A. Batista, +1 more

- 01 May 2003 -

Applied Artificial Intelligence

- Vol. 17, pp 519-533

TLDR

This analysis indicates that missing data imputation based on the k-nearest neighbor algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data, and can also outperform mean or mode imputation, a method widely used to treat missing values.

Abstract:

One relevant problem in data quality is missing data. Despite the frequent occurrence and the relevance of the missing data problem, many machine learning algorithms handle missing data in a rather naive way. However, missing data should be treated carefully; otherwise bias might be introduced into the induced knowledge. In this work, we analyze the use of the k-nearest neighbor algorithm as an imputation method. Imputation is a term that denotes a procedure that replaces the missing values in a data set with some plausible values. One advantage of this approach is that the missing data treatment is independent of the learning algorithm used. This allows the user to select the most suitable imputation method for each situation. Our analysis indicates that missing data imputation based on the k-nearest neighbor algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data, and can also outperform the mean or mode imputation method, which is a method broadly used to treat missing ...
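The contrast the abstract draws between mean imputation and k-nearest neighbor imputation can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mean_impute` and `knn_impute`, the toy data set, the use of Euclidean distance over the observed attributes, and the restriction of neighbors to complete rows are all assumptions made for illustration, and only numeric attributes are handled (mode imputation for nominal attributes is omitted).

```python
import math
from statistics import mean

# Missing values are represented as None (an assumption of this sketch).

def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[col_means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def knn_impute(rows, k=2):
    """Replace each None with the column mean over the k nearest complete rows.

    Assumes at least one complete row exists; distance is Euclidean,
    computed only on the attributes observed in the incomplete row.
    """
    complete = [r for r in rows if None not in r]
    result = []
    for r in rows:
        if None not in r:
            result.append(list(r))
            continue
        observed = [j for j, v in enumerate(r) if v is not None]

        def dist(c):
            return math.sqrt(sum((r[j] - c[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=dist)[:k]
        result.append(
            [mean(c[j] for c in neighbours) if v is None else v for j, v in enumerate(r)]
        )
    return result

# Hypothetical toy data: two clusters, one row with a missing second attribute.
data = [
    [1.0, 10.0],
    [1.2, 12.0],
    [5.0, 50.0],
    [5.2, 52.0],
    [1.1, None],
]

print(mean_impute(data)[4])       # [1.1, 31.0]
print(knn_impute(data, k=2)[4])   # [1.1, 11.0]
```

On this toy data the mean fills the gap with a value between the two clusters, while the k-nearest neighbor estimate stays close to the rows that resemble the incomplete one. Because both functions return a completed data set, either can be applied before any learner, which is the independence from the learning algorithm that the abstract points to.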

Citations

Book

Introduction to Machine Learning

Ethem Alpaydin

TL;DR: Introduction to Machine Learning is a comprehensive textbook on the subject, covering a broad array of topics not usually included in introductory machine learning texts, and discusses many methods from different fields, including statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining.

Journal Article

Supervised Machine Learning: A Review of Classification Techniques

Sotiris Kotsiantis

- 01 Jan 2007 -

Informatica (lithuanian Academy of Scien...

TL;DR: The goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features, and the resulting classifier is then used to assign class labels to the testing instances where the values of the predictor features are known, but the value of the class label is unknown.

Journal Article

KEEL: a software tool to assess evolutionary algorithms for data mining problems

Jesús Alcalá-Fdez, +11 more

TL;DR: KEEL is a software tool for assessing evolutionary algorithms on data mining problems of various kinds, including regression, classification, and unsupervised learning; it includes evolutionary learning algorithms based on different approaches: Pittsburgh, Michigan, and IRL.

Journal Article

Machine learning: a review of classification and combining techniques

Sotiris Kotsiantis, +2 more

- 01 Nov 2006 -

Artificial Intelligence Review

TL;DR: Describes various classification algorithms and a recent approach to improving classification accuracy: ensembles of classifiers.

Journal Article

Class noise vs. attribute noise: a quantitative study of their impacts

Xingquan Zhu, +1 more

- 22 Nov 2003 -

Artificial Intelligence Review

TL;DR: A systematic evaluation on the effect of noise in machine learning separates noise into two categories: class noise and attribute noise, and investigates the relationship between attribute noise and classification accuracy, the impact of noise at different attributes, and possible solutions in handling attribute noise.

References

Journal Article

Maximum likelihood from incomplete data via the EM algorithm

Arthur P. Dempster, +2 more

- 01 Sep 1977 -

Journal of the royal statistical society...

Book

C4.5: Programs for Machine Learning

J. Ross Quinlan

TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.

Book

Statistical Analysis with Missing Data

Roderick J. A. Little, +1 more

TL;DR: Topics include maximum likelihood for general patterns of missing data (introduction and theory with ignorable nonresponse) and large-sample inference based on maximum likelihood estimates.

Programs for Machine Learning

Steven L. Salzberg, +1 more

TL;DR: In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments, which will be a welcome addition to the library of many researchers and students.

Journal Article

Statistical Analysis With Missing Data

Nicole A. Lazar

- 01 Nov 2003 -

Technometrics

TL;DR: Generalized Estimating Equations is a good introductory book for analyzing continuous and discrete correlated data using GEE methods and provides good guidance for analyzing correlated data in biomedical studies and survey studies.

Related Papers (5)

Statistical Analysis with Missing Data

Maximum likelihood from incomplete data via the EM algorithm

C4.5: Programs for Machine Learning

Data Mining: Concepts and Techniques

Random Forests