Is an Ordinal Class Structure Useful in
Classifier Learning?
Jens C. Hühn and Eyke Hüllermeier
Department of Mathematics and Computer Science
Marburg University, Germany
Hans-Meerwein-Str., 35032 Marburg, Germany
{huehnj,eyke}@informatik.uni-marburg.de
Tracking Number IJDMMM-8118
Published as: E. Hüllermeier and J. Hühn. Is an ordinal class structure useful in classifier learning? International Journal of Data Mining, Modelling and Management 1(1):45–67, 2009.
Corresponding author (phone: ++49 6421 2821569, fax: ++49 6421 2821573)
Abstract
In recent years, a number of machine learning algorithms have been developed for the problem of ordinal classification. These algorithms try to exploit, in one way or the other, the order information of the problem, essentially relying on the assumption that the ordinal structure of the set of class labels is also reflected in the topology of the instance space. The purpose of this paper is to investigate, on an experimental basis, the validity of this assumption. Moreover, we seek to answer the question to what extent existing techniques and learning algorithms for ordinal classification are able to exploit order information, and which properties of these techniques are important in this regard.
Keywords: ordinal classification, binary decomposition, nested dichotomies,
pairwise classification.
1 Introduction
The problem of ordinal classification, also called ordinal regression in statistics, has
received increasing attention in the machine learning field in recent years (Frank and
Hall, 2001; Chu and Keerthi, 2005; Cardoso et al., 2005; Yu et al., 2006; Cardoso and
da Costa, 2007; Babaria et al., 2007). In ordinal classification, the set of class labels
Y = {y_1, ..., y_m} is endowed with a natural (total) order relation: y_1 ≺ y_2 ≺ ··· ≺ y_m.

Figure 1: On the left, the distribution of classes in the instance space is in good agreement with the class order y_1 ≺ y_2 ≺ y_3 ≺ y_4, while this is not the case in the situation on the right.

This distinguishes ordinal from conventional classification, where Y is unordered. As
examples, consider learning to predict the category of a hotel (from 1 to 5 stars),
the priority level of emails, or the customer satisfaction on a discrete scale ranging
from, e.g., poor to excellent.
From a learning point of view, the ordinal structure of Y is additional information
that a learner should of course try to exploit, and this is what existing methods
for ordinal classification essentially seek to do (Frank and Hall, 2001; Cardoso and
da Costa, 2007). The basic assumption in this regard is that the ordinal structure
of Y is also present in the instance space X , where it is reflected by the topology
of the class distributions. Or, stated differently, the ordinal class structure induces
an ordinal instance structure. Fig. 1 illustrates this idea: In the first scenario (left
picture), the topology of X is in good agreement with the ordinal structure of Y,
which is not the case in the second situation (right picture).
The above assumption is most explicitly expressed by ordinal classification methods which are based on binary decomposition techniques, that is, techniques for transforming a polychotomous problem involving m classes into a set of binary problems: Given an ordinal instance structure, it is more reasonable and presumably simpler to discriminate, for example, the "low" classes {y_1, ..., y_k} from the "high" classes {y_{k+1}, ..., y_m} than to discriminate an arbitrary subset of Y from its complement. More generally, it appears reasonable to restrict to "ordered" binary decompositions, where a binary problem involving meta-classes Y_1, Y_2 ⊆ Y is ordered if y_i ≺ y_j for all (y_i, y_j) ∈ Y_1 × Y_2 or y_i ≺ y_j for all (y_j, y_i) ∈ Y_1 × Y_2. One may even argue that this property may provide the basis for a definition of the value of ordinal structure: Roughly, the value of order information equals the (expected) increase in performance when solving the problem for ordered instead of unordered decompositions.
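
To make the notion of an ordered decomposition concrete, here is a minimal sketch in Python (our own illustration, not code from the paper), assuming the classes are encoded by their rank 1, ..., m; the function name is ours.

```python
def is_ordered(meta1, meta2):
    """Return True if the binary problem meta1 vs. meta2 is "ordered":
    every class in one meta-class must precede every class in the other.
    Classes are assumed to be encoded by their rank 1, ..., m."""
    return max(meta1) < min(meta2) or max(meta2) < min(meta1)

# {y_1, y_2} vs. {y_3, y_4} is an ordered split, {y_1, y_3} vs. {y_2, y_4} is not:
print(is_ordered({1, 2}, {3, 4}))  # True
print(is_ordered({1, 3}, {2, 4}))  # False
```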
Of course, it is not at all self-evident that the assumption of an ordinal instance
structure will hold in practice and, therefore, that ordinal classification techniques
are actually effective. By effectiveness, we mean that these techniques are able to

exploit the ordinal information, if any, contained in a problem. The purpose of this
paper is to investigate empirically whether or not ordinal classifiers are effective in
this sense. Our analysis is based on the following key idea: If an ordinal classifier
is effective in the above sense, then its expected performance on the true ordinal
problem should be better than its expected performance on a distorted problem in
which the label set is given by an arbitrary permutation (renaming) of Y.
In this regard, it is important to note that the effectiveness or, say, the degree to
which a classifier benefits from ordinal classification, is likely to depend on the flexi-
bility of the classifier. For example, for a linear classifier it is easy to separate classes
{y_1, y_2} from {y_3, y_4} in the first scenario in Fig. 1 but impossible in the second one.
Therefore, a linear classifier will strongly benefit from the ordinal instance structure.
The benefit of a decision tree, on the other hand, will be much smaller: Indeed, the
first problem is also simpler for this learner, but the second one is still feasible.
The above considerations give rise to the following conjectures that we shall try to
answer by means of suitable experiments:
- Knowledge about the ordinal structure of the label set Y is useful in a classification setting, and ordinal classifiers can effectively exploit this knowledge.
- The degree to which a learner benefits from an ordinal structure depends on its flexibility: Complex methods producing models with flexible decision boundaries will benefit less than methods producing simple decision boundaries.
As mentioned previously, ordinal classifiers based on binary decomposition tech-
niques appear to be especially suitable for analyzing these hypotheses. Therefore,
we shall focus on these techniques, to be surveyed in Section 2. Our experimen-
tal setting will then be outlined in Section 3, and the results will be presented in
Section 4. The paper ends with a summary and discussion in Section 5.
2 Algorithms for Ordinal Classification
This section gives a brief introduction to the learning algorithms that we used in
the experiments. The main purpose is to convey the basic ideas underlying the
approaches. For more detailed information, we give pointers to the literature.
2.1 A Simple Approach To Ordinal Classification
A simple and intuitively appealing approach to ordinal classification has been pro-
posed by Frank and Hall (2001). The idea is to decompose the original problem
involving m classes Y = {y_1, ..., y_m} into m−1 binary problems. The i-th problem is defined by the meta-classes {y_1, ..., y_i} and {y_{i+1}, ..., y_m} playing the role, respectively,
of the negative and positive class in a binary problem.
Let M_i, i = 1, ..., m−1, denote the model learned on the training data for the i-th problem (i.e., considering examples with labels in {y_1, ..., y_i} as negative and the others as positive examples). Given a query instance x, a prediction M_i(x) is interpreted as an estimation of the probability that the class of x, denoted y(x), is in {y_{i+1}, ..., y_m}, that is, an estimation of the probability Pr(y(x) ≻ y_i). Consequently,
the models must guarantee outputs in the unit interval.
From the above probabilities, a probability distribution on Y is then derived as
follows:
Pr(y(x) = y_1) = 1 − Pr(y(x) ≻ y_1)
Pr(y(x) = y_i) = max{ Pr(y(x) ≻ y_{i−1}) − Pr(y(x) ≻ y_i), 0 }
Pr(y(x) = y_m) = Pr(y(x) ≻ y_{m−1})
Eventually, the class with the highest probability is predicted. As mentioned in the
introduction, this approach strongly exploits the idea that “reasonable” decompo-
sitions of the class labels are produced by ordinal splits partitioning Y into a lower
and an upper part.
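
To illustrate the combination step, the following sketch (an illustrative implementation under the assumption that the m−1 binary models are already trained and return probabilities; function and variable names are ours, not from the paper) turns their outputs into a class distribution and predicts the class with the highest probability.

```python
def frank_hall_distribution(p_greater):
    """Combine the outputs of the m-1 binary models into a distribution over
    the m ordered classes. p_greater[i] is the estimate of Pr(y(x) > y_{i+1}),
    i.e., the output of model M_{i+1}, and is assumed to lie in [0, 1]."""
    m = len(p_greater) + 1
    probs = [0.0] * m
    probs[0] = 1.0 - p_greater[0]                        # Pr(y(x) = y_1)
    for i in range(1, m - 1):                            # Pr(y(x) = y_{i+1})
        probs[i] = max(p_greater[i - 1] - p_greater[i], 0.0)
    probs[m - 1] = p_greater[m - 2]                      # Pr(y(x) = y_m)
    return probs

dist = frank_hall_distribution([0.9, 0.4, 0.1])          # a 4-class example
prediction = dist.index(max(dist)) + 1                   # 1-based class index
print(dist, prediction)                                  # ~[0.1, 0.5, 0.3, 0.1], predicts y_2
```

Note that when the clipping at zero becomes active, the values no longer sum exactly to one; since only the class with the highest value is predicted, this does not affect the decision.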
2.2 Ensembles of Nested Dichotomies
A nested dichotomy is a binary tree that partitions the label set Y in a recursive
way. The root of the tree is associated with the whole set Y, while the leaf nodes
correspond to single classes. Moreover, each inner node is associated with a binary
classification problem, namely to discriminate between the two respective meta-
classes of the child nodes. The output of a corresponding model, for a query input
x, is interpreted as a conditional probability of the form
p = Pr(y(x) ∈ Y_2 | y(x) ∈ Y_1 ∪ Y_2),

where Y_1 and Y_2 denote, respectively, the meta-classes of the two child nodes (and hence Y_1 ∪ Y_2 the meta-class of the inner node itself). Consequently, the probabilities of the individual classes y_i can be derived quite elegantly, namely by multiplying the probabilities along the path from the root of the tree to the leaf node for y_i. Nested dichotomies have been investigated for a long time in statistics.
Obviously, there are many ways to partition Y in a recursive way, and indeed, the
prediction accuracy of a model may strongly depend on the choice of the concrete
dichotomy. Frank and Kramer (2004) have therefore combined nested dichotomies
with ensemble techniques. An ensemble of nested dichotomies (END) consists of a
set of randomly generated nested dichotomies, the predictions of which are combined by averaging the respective probability distributions. For this approach, the authors have reported excellent classification accuracy.

Figure 2: Two ordinal dichotomy trees for a 3-class problem: the first splits {y_1, y_2} vs. {y_3} at the root and then y_1 vs. y_2; the second splits {y_1} vs. {y_2, y_3} at the root and then y_2 vs. y_3.

ENDs can be applied to conventional classification problems, and indeed, no restric-
tions are made for the splitting of label sets into subsets. In the case of ordinal
classification, however, it again seems reasonable to restrict to ordered splits; see
Fig. 2 for an example of an ordinal nested dichotomy.
Even though the number of different dichotomies is significantly smaller for the
ordinal than for the general case, it may still become huge for a large number of
classes m. More concretely, it can be shown by simple combinatorial arguments that
the number is (3^m − (2^{m+1} − 1))/2 for the general case, which is reduced to (m^3 − m)/6
for the ordinal case. The computation of all dichotomies may thus become infeasible
for large m. Frank and Kramer (2004) found that averaging over 20 randomly
generated dichotomies is “sufficient to get close to optimum performance”. In our
experiments, we shall stick to this rule of thumb.
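
The following sketch (our own illustration; the tree representation, the node_prob placeholder for a trained binary model, and all names are assumptions, not code from the paper) shows the two ingredients just discussed: drawing a random ordinal nested dichotomy over classes 1, ..., m, and obtaining class probabilities by multiplying the conditional probabilities along the root-to-leaf paths; averaging several such trees gives an END-style prediction.

```python
import random

def random_ordinal_dichotomy(classes):
    """Recursively split a contiguous block of ordered classes at a random
    cut point, yielding an ordinal nested dichotomy as nested tuples."""
    if len(classes) == 1:
        return classes[0]                          # leaf: a single class
    cut = random.randint(1, len(classes) - 1)      # ordered split: lower vs. upper block
    return (random_ordinal_dichotomy(classes[:cut]),
            random_ordinal_dichotomy(classes[cut:]))

def leaves(tree):
    """List the classes below a node."""
    return [tree] if not isinstance(tree, tuple) else leaves(tree[0]) + leaves(tree[1])

def class_probabilities(tree, x, node_prob, mass=1.0, out=None):
    """Distribute the probability mass over the classes by multiplying the
    conditional probabilities Pr(y(x) in Y_2 | y(x) in Y_1 or Y_2) along each
    root-to-leaf path; node_prob(Y_1, Y_2, x) stands in for a trained model."""
    out = {} if out is None else out
    if not isinstance(tree, tuple):                # leaf reached: assign the accumulated mass
        out[tree] = out.get(tree, 0.0) + mass
        return out
    left, right = tree
    p_right = node_prob(leaves(left), leaves(right), x)
    class_probabilities(left, x, node_prob, mass * (1.0 - p_right), out)
    class_probabilities(right, x, node_prob, mass * p_right, out)
    return out

# END-style prediction: average the class distributions of 20 random ordinal dichotomies.
dummy_model = lambda y1, y2, x: 0.5                # placeholder for a trained binary classifier
trees = [random_ordinal_dichotomy(tuple(range(1, 5))) for _ in range(20)]
dists = [class_probabilities(t, x=None, node_prob=dummy_model) for t in trees]
averaged = {c: sum(d.get(c, 0.0) for d in dists) / len(dists) for c in range(1, 5)}
print(averaged)
```

Averaging over 20 random dichotomies mirrors the rule of thumb of Frank and Kramer (2004) cited above.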
2.3 Pairwise Classification
Another popular binarization technique is the all-pairs approach, also called round
robin learning (Fürnkranz, 2002a,b), which trains a separate model M_{i,j} for each pair of classes (y_i, y_j) ∈ Y × Y, 1 ≤ i < j ≤ m; thus, a total number of m(m−1)/2 models is needed. M_{i,j} is intended to discriminate between classes y_i and y_j. At classification time, a query x is submitted to all models, and each prediction M_{i,j}(x) is interpreted as a vote for a label. More specifically, assuming s_{i,j} = M_{i,j}(x) ∈ [0, 1], the weighted voting technique interprets s_{i,j} and 1 − s_{i,j} as weighted votes for classes y_i and y_j, respectively, and predicts the class with the highest sum of weighted votes.
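
As a small illustration of this voting scheme (again our own sketch; the dictionary-of-scores representation and all names are assumptions), the following code sums s_{i,j} as a vote for y_i and 1 − s_{i,j} as a vote for y_j and returns the class with the highest total.

```python
def weighted_voting(scores, m):
    """scores[(i, j)] = s_ij = M_ij(x) in [0, 1] for 1 <= i < j <= m, interpreted
    as a weighted vote for class y_i (and 1 - s_ij as a vote for class y_j)."""
    votes = {k: 0.0 for k in range(1, m + 1)}
    for (i, j), s in scores.items():
        votes[i] += s
        votes[j] += 1.0 - s
    return max(votes, key=votes.get)       # class with the highest sum of weighted votes

# three classes, hence m(m-1)/2 = 3 pairwise scores
scores = {(1, 2): 0.8, (1, 3): 0.6, (2, 3): 0.3}
print(weighted_voting(scores, 3))          # votes: y_1 = 1.4, y_2 = 0.5, y_3 = 1.1 -> predicts 1
```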
For the following reason, pairwise classification is an interesting baseline in our
context: It produces binary problems that are (trivially) ordered and hence “rea-
sonable” from an ordinal classification point of view, and yet it does not exploit any
ordinal information. Fürnkranz (2002b) found that pairwise classification, using decision trees as base learners, is indeed competitive with the approach of Frank and
Hall in terms of classification accuracy. As will be seen later on, our results are in

References

Vapnik, V. N. The Nature of Statistical Learning Theory. Springer.
Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann.
Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 7:1–30, 2006.
Frequently Asked Questions (10)
Q1. What are the contributions in "Is an ordinal class structure useful in classifier learning?" ?

The purpose of this paper is to investigate, on an experimental basis, the validity of this assumption. 

The degree to which a learner benefits from an ordinal structure depends on its flexibility: Complex methods producing models with flexible decision boundaries will benefit less than methods producing simple decision boundaries.

Due to a lack of ordinal benchmark data, several previous studies, including (Frank and Hall, 2001; Fürnkranz, 2002b), have resorted to discretized regression data for experimental purposes.

As the authors have furthermore seen, the flexibility of the base learners is also important for the effectiveness of the meta-techniques investigated in this paper. 

Given a query instance x, a prediction M_i(x) is interpreted as an estimation of the probability that the class of x, denoted y(x), is in {y_{i+1}, ..., y_m}, that is, an estimation of the probability Pr(y(x) ≻ y_i).

In case this hypothesis is rejected, a Nemenyi test (Nemenyi, 1963) was applied as post-hoc test to find significant differences between pairs of methods. 

Since the VOI values are obviously smaller than for the regression data, the results confirm their presumption that discretized regression data exhibits an even more strongly developed ordinal structure than truly ordinal data.

The authors test the statistical (null) hypothesis that r ≤ 0.5 against the (alternative) hypothesis r > 0.5, using a win/loss sign test according to Demšar (2006).

The probabilities of the individual classes y_i can be derived quite elegantly, namely by multiplying the probabilities along the path from the root of the tree to the leaf node for y_i.

Since bigger meta-classes will usually call for more complex models (decision boundaries), flexible classifiers such as decision trees are advantageous for EOND, and even more so for FH.