
Theoretical Comparison between the Gini Index and Information Gain Criteria

Abstract
Knowledge Discovery in Databases (KDD) is an active and important research area with the promise of a high payoff in many business and scientific applications. One of the main tasks in KDD is classification, and a particularly efficient method for classification is decision tree induction. The selection of the attribute used to split the data at each node of the tree (the split criterion) is crucial for classifying objects correctly. Different split criteria have been proposed in the literature (Information Gain, Gini Index, etc.), and it is not obvious which of them will produce the best decision tree for a given data set. A large number of empirical tests have been conducted to answer this question, but no conclusive results were found. In this paper we introduce a formal methodology that allows us to compare multiple split criteria. This permits us to present fundamental insights into the decision process. Furthermore, we are able to present a formal description of how to select between split criteria for a given data set. As an illustration, we apply the methodology to two widely used split criteria: Gini Index and Information Gain.
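
For readers unfamiliar with the two criteria named in the abstract, the following is a minimal sketch of how each one scores a candidate split. It uses the standard textbook definitions (Gini impurity as one minus the sum of squared class proportions, and Information Gain as the reduction in Shannon entropy), which are an assumption here and not necessarily the exact formulation used in the paper. The function names and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Return the relative frequency of each class in a list of labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def split_gain(parent, children, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children.

    With impurity=entropy this is Information Gain; with impurity=gini it is
    the Gini-based reduction in impurity.
    """
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# Hypothetical toy data: 10 objects, two classes, two candidate splits.
parent = ["yes"] * 5 + ["no"] * 5
split_a = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
split_b = [["yes"] * 5 + ["no"] * 3, ["no"] * 2]

for name, split in [("split_a", split_a), ("split_b", split_b)]:
    print(name,
          "gini reduction:", round(split_gain(parent, split, gini), 4),
          "information gain:", round(split_gain(parent, split, entropy), 4))
```

On this toy data both criteria rank split_a above split_b. Whether and when the two criteria disagree on a given data set is the question the paper's methodology addresses; this sketch only shows how the scores are computed.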

