scispace - formally typeset
Journal ArticleDOI

An analysis of four missing data treatment methods for supervised learning

TLDR
This analysis indicates that missing data imputation based on the k-nearest neighbor algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data, and can also outperforms the mean or mode imputation method, which is a method broadly used to treatMissing values.
Abstract
One relevant problem in data quality is missing data. Despite the frequent occurrence and the relevance of the missing data problem, many machine learning algorithms handle missing data in a rather naive way. However, missing data treatment should be carefully treated, otherwise bias might be introduced into the knowledge induced. In this work, we analyze the use of the k-nearest neighbor as an imputation method. Imputation is a term that denotes a procedure that replaces the missing values in a data set with some plausible values. One advantage of this approach is that the missing data treatment is independent of the learning algorithm used. This allows the user to select the most suitable imputation method for each situation. Our analysis indicates that missing data imputation based on the k-nearest neighbor algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data, and can also outperform the mean or mode imputation method, which is a method broadly used to treat missing ...

read more

Content maybe subject to copyright    Report

Citations
More filters
Book

Introduction to Machine Learning

TL;DR: Introduction to Machine Learning is a comprehensive textbook on the subject, covering a broad array of topics not usually included in introductory machine learning texts, and discusses many methods from different fields, including statistics, pattern recognition, neural networks, artificial intelligence, signal processing, control, and data mining.
Journal Article

Supervised Machine Learning: A Review of Classification Techniques

TL;DR: The goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictor features, and the resulting classifier is then used to assign class labels to the testing instances where the values of the predictor features are known, but the value of the class label is unknown.
Journal ArticleDOI

KEEL: a software tool to assess evolutionary algorithms for data mining problems

TL;DR: KEEL as discussed by the authors is a software tool to assess evolutionary algorithms for data mining problems of various kinds including regression, classification, unsupervised learning, etc., which includes evolutionary learning algorithms based on different approaches: Pittsburgh, Michigan and IRL.
Journal ArticleDOI

Machine learning: a review of classification and combining techniques

TL;DR: Various classification algorithms and the recent attempt for improving classification accuracy—ensembles of classifiers are described.
Journal ArticleDOI

Class noise vs. attribute noise: a quantitative study of their impacts

TL;DR: A systematic evaluation on the effect of noise in machine learning separates noise into two categories: class noise and attribute noise, and investigates the relationship between attribute noise and classification accuracy, the impact of noise at different attributes, and possible solutions in handling attribute noise.
References
More filters
Book

C4.5: Programs for Machine Learning

TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and over hitting.
Book

Statistical Analysis with Missing Data

TL;DR: This work states that maximum Likelihood for General Patterns of Missing Data: Introduction and Theory with Ignorable Nonresponse and large-Sample Inference Based on Maximum Likelihood Estimates is likely to be high.

Programs for Machine Learning

TL;DR: In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments, which will be a welcome addition to the library of many researchers and students.
Journal ArticleDOI

Statistical Analysis With Missing Data

TL;DR: Generalized Estimating Equations is a good introductory book for analyzing continuous and discrete correlated data using GEE methods and provides good guidance for analyzing correlated data in biomedical studies and survey studies.