Is an Ordinal Class Structure Useful in
Classifier Learning?
Jens C. Hühn and Eyke Hüllermeier
Department of Mathematics and Computer Science
Marburg University, Germany
Hans-Meerwein-Str., 35032 Marburg, Germany
{huehnj,eyke}@informatik.uni-marburg.de
Tracking Number IJDMMM-8118
Published as: E. Hüllermeier and J. Hühn. Is an ordinal class structure useful in classifier learning? International Journal of Data Mining, Modelling and Management 1(1):45–67, 2009.
Corresponding author (phone: ++49 6421 2821569, fax: ++49 6421 2821573)
Abstract
In recent years, a number of machine learning algorithms have been developed for the problem of ordinal classification. These algorithms try to exploit, in one way or the other, the order information of the problem, essentially relying on the assumption that the ordinal structure of the set of class labels is also reflected in the topology of the instance space. The purpose of this paper is to investigate, on an experimental basis, the validity of this assumption. Moreover, we seek to answer the question to what extent existing techniques and learning algorithms for ordinal classification are able to exploit order information, and which properties of these techniques are important in this regard.
Keywords: ordinal classification, binary decomposition, nested dichotomies,
pairwise classification.
1 Introduction
The problem of ordinal classification, also called ordinal regression in statistics, has
received increasing attention in the machine learning field in recent years (Frank and
Hall, 2001; Chu and Keerthi, 2005; Cardoso et al., 2005; Yu et al., 2006; Cardoso and
da Costa, 2007; Babaria et al., 2007). In ordinal classification, the set of class labels
Y = {y_1, ..., y_m} is endowed with a natural (total) order relation: y_1 ≺ y_2 ≺ ··· ≺ y_m.

Figure 1: On the left, the distribution of classes in the instance space is in good agreement with the class order y_1 ≺ y_2 ≺ y_3 ≺ y_4, while this is not the case in the situation on the right.

This distinguishes ordinal from conventional classification, where Y is unordered. As
examples, consider learning to predict the category of a hotel (from 1 to 5 stars),
the priority level of emails, or the customer satisfaction on a discrete scale ranging
from, e.g., poor to excellent.
From a learning point of view, the ordinal structure of Y is additional information
that a learner should of course try to exploit, and this is what existing methods
for ordinal classification essentially seek to do (Frank and Hall, 2001; Cardoso and
da Costa, 2007). The basic assumption in this regard is that the ordinal structure
of Y is also present in the instance space X , where it is reflected by the topology
of the class distributions. Or, stated differently, the ordinal class structure induces
an ordinal instance structure. Fig. 1 illustrates this idea: In the first scenario (left
picture), the topology of X is in good agreement with the ordinal structure of Y,
which is not the case in the second situation (right picture).
The above assumption is most explicitly expressed by ordinal classification methods which are based on binary decomposition techniques, that is, techniques for transforming a polychotomous problem involving m classes into a set of binary problems: Given an ordinal instance structure, it is more reasonable and presumably simpler to discriminate, for example, the "low" classes {y_1, ..., y_k} from the "high" classes {y_{k+1}, ..., y_m} than to discriminate an arbitrary subset of Y from its complement. More generally, it appears reasonable to restrict to "ordered" binary decompositions, where a binary problem involving meta-classes Y_1, Y_2 ⊆ Y is ordered if y_i ≺ y_j for all (y_i, y_j) ∈ Y_1 × Y_2 or y_i ≺ y_j for all (y_j, y_i) ∈ Y_1 × Y_2. One may even argue that this property may provide the basis for a definition of the value of ordinal structure: Roughly, the value of order information equals the (expected) increase in performance when solving the problem for ordered instead of unordered decompositions.
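
To make the notion of an ordered decomposition concrete, here is a minimal sketch in Python (our own illustration, not code from the paper), assuming the classes are encoded by their rank 1, ..., m; the function name is ours.

```python
def is_ordered(meta1, meta2):
    """Return True if the binary problem meta1 vs. meta2 is "ordered":
    every class in one meta-class must precede every class in the other.
    Classes are assumed to be encoded by their rank 1, ..., m."""
    return max(meta1) < min(meta2) or max(meta2) < min(meta1)

# {y_1, y_2} vs. {y_3, y_4} is an ordered split, {y_1, y_3} vs. {y_2, y_4} is not:
print(is_ordered({1, 2}, {3, 4}))  # True
print(is_ordered({1, 3}, {2, 4}))  # False
```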
Of course, it is not at all self-evident that the assumption of an ordinal instance
structure will hold in practice and, therefore, that ordinal classification techniques
are actually effective. By effectiveness, we mean that these techniques are able to

exploit the ordinal information, if any, contained in a problem. The purpose of this
paper is to investigate empirically whether or not ordinal classifiers are effective in
this sense. Our analysis is based on the following key idea: If an ordinal classifier
is effective in the above sense, then its expected performance on the true ordinal
problem should be better than its expected performance on a distorted problem in
which the label set is given by an arbitrary permutation (renaming) of Y.
In this regard, it is important to note that the effectiveness or, say, the degree to
which a classifier benefits from ordinal classification, is likely to depend on the flexi-
bility of the classifier. For example, for a linear classifier it is easy to separate classes
{y_1, y_2} from {y_3, y_4} in the first scenario in Fig. 1 but impossible in the second one.
Therefore, a linear classifier will strongly benefit from the ordinal instance structure.
The benefit of a decision tree, on the other hand, will be much smaller: Indeed, the
first problem is also simpler for this learner, but the second one is still feasible.
The above considerations give rise to the following conjectures that we shall try to
answer by means of suitable experiments:
- Knowledge about the ordinal structure of the label set Y is useful in a classification setting, and ordinal classifiers can effectively exploit this knowledge.
- The degree to which a learner benefits from an ordinal structure depends on its flexibility: Complex methods producing models with flexible decision boundaries will benefit less than methods producing simple decision boundaries.
As mentioned previously, ordinal classifiers based on binary decomposition tech-
niques appear to be especially suitable for analyzing these hypotheses. Therefore,
we shall focus on these techniques, to be surveyed in Section 2. Our experimen-
tal setting will then be outlined in Section 3, and the results will be presented in
Section 4. The paper ends with a summary and discussion in Section 5.
2 Algorithms for Ordinal Classification
This section gives a brief introduction to the learning algorithms that we used in
the experiments. The main purpose is to convey the basic ideas underlying the
approaches. For more detailed information, we give pointers to the literature.
2.1 A Simple Approach To Ordinal Classification
A simple and intuitively appealing approach to ordinal classification has been pro-
posed by Frank and Hall (2001). The idea is to decompose the original problem
involving m classes Y = {y_1, ..., y_m} into m−1 binary problems. The i-th problem is defined by the meta-classes {y_1, ..., y_i} and {y_{i+1}, ..., y_m} playing the role, respectively,
of the negative and positive class in a binary problem.
Let M_i, i = 1, ..., m−1, denote the model learned on the training data for the i-th problem (i.e., considering examples with labels in {y_1, ..., y_i} as negative and the others as positive examples). Given a query instance x, a prediction M_i(x) is interpreted as an estimation of the probability that the class of x, denoted y(x), is in {y_{i+1}, ..., y_m}, that is, an estimation of the probability Pr(y(x) ≻ y_i). Consequently,
the models must guarantee outputs in the unit interval.
From the above probabilities, a probability distribution on Y is then derived as
follows:
Pr(y(x) = y_1) = 1 − Pr(y(x) ≻ y_1)
Pr(y(x) = y_i) = max{ Pr(y(x) ≻ y_{i−1}) − Pr(y(x) ≻ y_i), 0 }
Pr(y(x) = y_m) = Pr(y(x) ≻ y_{m−1})
Eventually, the class with the highest probability is predicted. As mentioned in the
introduction, this approach strongly exploits the idea that “reasonable” decompo-
sitions of the class labels are produced by ordinal splits partitioning Y into a lower
and an upper part.
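
To illustrate the combination step, the following sketch (an illustrative implementation under the assumption that the m−1 binary models are already trained and return probabilities; function and variable names are ours, not from the paper) turns their outputs into a class distribution and predicts the class with the highest probability.

```python
def frank_hall_distribution(p_greater):
    """Combine the outputs of the m-1 binary models into a distribution over
    the m ordered classes. p_greater[i] is the estimate of Pr(y(x) > y_{i+1}),
    i.e., the output of model M_{i+1}, and is assumed to lie in [0, 1]."""
    m = len(p_greater) + 1
    probs = [0.0] * m
    probs[0] = 1.0 - p_greater[0]                        # Pr(y(x) = y_1)
    for i in range(1, m - 1):                            # Pr(y(x) = y_{i+1})
        probs[i] = max(p_greater[i - 1] - p_greater[i], 0.0)
    probs[m - 1] = p_greater[m - 2]                      # Pr(y(x) = y_m)
    return probs

dist = frank_hall_distribution([0.9, 0.4, 0.1])          # a 4-class example
prediction = dist.index(max(dist)) + 1                   # 1-based class index
print(dist, prediction)                                  # ~[0.1, 0.5, 0.3, 0.1], predicts y_2
```

Note that when the clipping at zero becomes active, the values no longer sum exactly to one; since only the class with the highest value is predicted, this does not affect the decision.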
2.2 Ensembles of Nested Dichotomies
A nested dichotomy is a binary tree that partitions the label set Y in a recursive
way. The root of the tree is associated with the whole set Y, while the leaf nodes
correspond to single classes. Moreover, each inner node is associated with a binary
classification problem, namely to discriminate between the two respective meta-
classes of the child nodes. The output of a corresponding model, for a query input
x, is interpreted as a conditional probability of the form
p = Pr(y(x) ∈ Y_2 | y(x) ∈ Y_1 ∪ Y_2),

where Y_1 and Y_2 denote, respectively, the meta-classes of the two child nodes (and hence Y_1 ∪ Y_2 the meta-class of the inner node itself). Consequently, the probabilities of the individual classes y_i can be derived quite elegantly, namely by multiplying the probabilities along the path from the root of the tree to the leaf node for y_i. Nested dichotomies have been investigated for a long time in statistics.
Obviously, there are many ways to partition Y in a recursive way, and indeed, the
prediction accuracy of a model may strongly depend on the choice of the concrete
dichotomy. Frank and Kramer (2004) have therefore combined nested dichotomies
with ensemble techniques. An ensemble of nested dichotomies (END) consists of a
set of randomly generated nested dichotomies, the predictions of which are combined by averaging the respective probability distributions. For this approach, the authors have reported excellent classification accuracy.

Figure 2: Two ordinal dichotomy trees for a 3-class problem: the first splits {y_1, y_2} vs. {y_3} at the root and then y_1 vs. y_2; the second splits {y_1} vs. {y_2, y_3} at the root and then y_2 vs. y_3.

ENDs can be applied to conventional classification problems, and indeed, no restric-
tions are made for the splitting of label sets into subsets. In the case of ordinal
classification, however, it again seems reasonable to restrict to ordered splits; see
Fig. 2 for an example of an ordinal nested dichotomy.
Even though the number of different dichotomies is significantly smaller for the
ordinal than for the general case, it may still become huge for a large number of
classes m. More concretely, it can be shown by simple combinatorial arguments that
the number is (3^m − (2^{m+1} − 1))/2 for the general case, which is reduced to (m^3 − m)/6
for the ordinal case. The computation of all dichotomies may thus become infeasible
for large m. Frank and Kramer (2004) found that averaging over 20 randomly
generated dichotomies is “sufficient to get close to optimum performance”. In our
experiments, we shall stick to this rule of thumb.
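
The following sketch (our own illustration; the tree representation, the node_prob placeholder for a trained binary model, and all names are assumptions, not code from the paper) shows the two ingredients just discussed: drawing a random ordinal nested dichotomy over classes 1, ..., m, and obtaining class probabilities by multiplying the conditional probabilities along the root-to-leaf paths; averaging several such trees gives an END-style prediction.

```python
import random

def random_ordinal_dichotomy(classes):
    """Recursively split a contiguous block of ordered classes at a random
    cut point, yielding an ordinal nested dichotomy as nested tuples."""
    if len(classes) == 1:
        return classes[0]                          # leaf: a single class
    cut = random.randint(1, len(classes) - 1)      # ordered split: lower vs. upper block
    return (random_ordinal_dichotomy(classes[:cut]),
            random_ordinal_dichotomy(classes[cut:]))

def leaves(tree):
    """List the classes below a node."""
    return [tree] if not isinstance(tree, tuple) else leaves(tree[0]) + leaves(tree[1])

def class_probabilities(tree, x, node_prob, mass=1.0, out=None):
    """Distribute the probability mass over the classes by multiplying the
    conditional probabilities Pr(y(x) in Y_2 | y(x) in Y_1 or Y_2) along each
    root-to-leaf path; node_prob(Y_1, Y_2, x) stands in for a trained model."""
    out = {} if out is None else out
    if not isinstance(tree, tuple):                # leaf reached: assign the accumulated mass
        out[tree] = out.get(tree, 0.0) + mass
        return out
    left, right = tree
    p_right = node_prob(leaves(left), leaves(right), x)
    class_probabilities(left, x, node_prob, mass * (1.0 - p_right), out)
    class_probabilities(right, x, node_prob, mass * p_right, out)
    return out

# END-style prediction: average the class distributions of 20 random ordinal dichotomies.
dummy_model = lambda y1, y2, x: 0.5                # placeholder for a trained binary classifier
trees = [random_ordinal_dichotomy(tuple(range(1, 5))) for _ in range(20)]
dists = [class_probabilities(t, x=None, node_prob=dummy_model) for t in trees]
averaged = {c: sum(d.get(c, 0.0) for d in dists) / len(dists) for c in range(1, 5)}
print(averaged)
```

Averaging over 20 random dichotomies mirrors the rule of thumb of Frank and Kramer (2004) cited above.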
2.3 Pairwise Classification
Another popular binarization technique is the all-pairs approach, also called round
robin learning (Fürnkranz, 2002a,b), which trains a separate model M_{i,j} for each pair of classes (y_i, y_j) ∈ Y × Y, 1 ≤ i < j ≤ m; thus, a total number of m(m−1)/2 models is needed. M_{i,j} is intended to discriminate between classes y_i and y_j. At classification time, a query x is submitted to all models, and each prediction M_{i,j}(x) is interpreted as a vote for a label. More specifically, assuming s_{i,j} = M_{i,j}(x) ∈ [0, 1], the weighted voting technique interprets s_{i,j} and 1 − s_{i,j} as weighted votes for classes y_i and y_j, respectively, and predicts the class with the highest sum of weighted votes.
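
As a small illustration of this voting scheme (again our own sketch; the dictionary-of-scores representation and all names are assumptions), the following code sums s_{i,j} as a vote for y_i and 1 − s_{i,j} as a vote for y_j and returns the class with the highest total.

```python
def weighted_voting(scores, m):
    """scores[(i, j)] = s_ij = M_ij(x) in [0, 1] for 1 <= i < j <= m, interpreted
    as a weighted vote for class y_i (and 1 - s_ij as a vote for class y_j)."""
    votes = {k: 0.0 for k in range(1, m + 1)}
    for (i, j), s in scores.items():
        votes[i] += s
        votes[j] += 1.0 - s
    return max(votes, key=votes.get)       # class with the highest sum of weighted votes

# three classes, hence m(m-1)/2 = 3 pairwise scores
scores = {(1, 2): 0.8, (1, 3): 0.6, (2, 3): 0.3}
print(weighted_voting(scores, 3))          # votes: y_1 = 1.4, y_2 = 0.5, y_3 = 1.1 -> predicts 1
```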
For the following reason, pairwise classification is an interesting baseline in our
context: It produces binary problems that are (trivially) ordered and hence “rea-
sonable” from an ordinal classification point of view, and yet it does not exploit any
ordinal information. Fürnkranz (2002b) found that pairwise classification, using decision trees as base learners, is indeed competitive with the approach of Frank and
Hall in terms of classification accuracy. As will be seen later on, our results are in

References

Vapnik, V. N. The Nature of Statistical Learning Theory. Springer.
Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann.
Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 7:1–30, 2006.
Frequently Asked Questions (10)
Q1. What are the contributions in "Is an ordinal class structure useful in classifier learning?" ?

The purpose of this paper is to investigate, on an experimental basis, the validity of this assumption. 

The degree to which a learner benefits from an ordinal structure depends on its flexibility: Complex methods producing models with flexible decision boundaries will benefit less than methods producing simple decision boundaries.

Due to a lack of ordinal benchmark data, several previous studies, including (Frank and Hall, 2001; Fürnkranz, 2002b), have resorted to discretized regression data for experimental purposes.

As the authors have furthermore seen, the flexibility of the base learners is also important for the effectiveness of the meta-techniques investigated in this paper. 

Given a query instance x, a prediction M_i(x) is interpreted as an estimation of the probability that the class of x, denoted y(x), is in {y_{i+1}, ..., y_m}, that is, an estimation of the probability Pr(y(x) ≻ y_i).

In case this hypothesis is rejected, a Nemenyi test (Nemenyi, 1963) was applied as post-hoc test to find significant differences between pairs of methods. 

Since the VOI values are obviously smaller than for the regression data, the results confirm their presumption that discretized regression data exhibits an even more strongly developed ordinal structure than truly ordinal data.

The authors test the statistical (null) hypothesis that r ≤ 0.5 against the (alternative) hypothesis r > 0.5, using a win/loss sign test according to Demšar (2006).

The probabilities of the individual classes y_i can be derived quite elegantly, namely by multiplying the probabilities along the path from the root of the tree to the leaf node for y_i.

Since bigger meta-classes will usually call for more complex models (decision boundaries), flexible classifiers such as decision trees are advantageous for EOND, and even more so for FH.