Elsevier

Pattern Recognition Letters

ROC analysis in ordinal regression learning

Abstract

Nowadays the area under the receiver operating characteristic (ROC) curve, which corresponds to the Wilcoxon–Mann–Whitney test statistic, is increasingly used as a performance measure for binary classification systems. In this article we present a natural generalization of this concept to more than two ordered categories, a setting known as ordinal regression. Our extension of the Wilcoxon–Mann–Whitney statistic corresponds to the volume under an r-dimensional surface (VUS) for r ordered categories and differs from extensions recently proposed for multi-class classification. Rather than measuring the error rate, VUS evaluates the ranking returned by an ordinal regression model, an approach that is especially advantageous with skew class or cost distributions. We give theoretical and experimental evidence of the advantages and different behavior of VUS compared to error rate, mean absolute error and other ranking-based performance measures for ordinal regression. The results demonstrate that the models produced by ordinal regression algorithms minimizing the error rate or a preference-learning-based loss do not necessarily impose a good ranking on the data.

Introduction

In multi-class classification, labels are picked from a finite set of unordered categories. In metric regression, labels can take an infinite number of continuous values. Ordinal regression lies between these two learning problems, because here labels are chosen from a finite set of ordered categories. Applications of ordinal regression frequently arise in domains where humans participate in the data generation process. Humans prefer to choose a label from a (usually) small set of alternatives when they assess objects for their beauty, quality, suitability or any other characteristic. In essence, they prefer to quantify objects with ordinal labels instead of continuous scores, although an underlying, unobserved continuous variable is often assumed. Data of this kind is found, for example, in information retrieval, when users of recommender systems express on a scale from one to five stars to what extent they like items such as movies or songs. Machine learning techniques are then applied to predict the ratings of new users on these items or to recommend new items to existing users of the system. Another application is quality control, where human experts frequently evaluate products with linguistic terms, ranging from 'very bad' to 'very good'. Ordinal regression models can also be employed in medicine and the social sciences, where many data sets originate from interaction with humans.

In these applications of ordinal regression one is often primarily interested in a subset of the classes. In many cases these classes of interest are the 'extreme' categories, such as the documents with the highest relevance to a query or the products with the lowest quality, and, consequently, we would like to associate a higher misclassification cost with these objects. Moreover, real-world ordinal regression problems often have an unequal number of training objects for the different categories. In other words, we are often dealing with a skew class and/or cost distribution. The overall classification rate and the mean absolute error, which are commonly used for evaluating ordinal regression models, do not fully capture the desired performance of such a system. In these situations we are more interested in criteria that quantify the ranking the classifier imposes on the data. We explain below how this relates to ROC analysis.

This article aims to discuss possible extensions of ROC analysis for ordinal regression. It is organized as follows. In Section 2 we briefly describe the main concepts of binary and multi-class ROC analysis in the framework of machine learning. This gives us the opportunity to define an extension of the Wilcoxon–Mann–Whitney statistic and its geometrical interpretation in Section 3 and, subsequently, the properties of this performance estimator and related measures are compared. In Section 4 all measures are empirically analyzed on synthetic and real-world data. Finally, in Section 5 we formulate a conclusion and present some ideas for future work.

Section snippets

Overview of existing work

In machine learning one often assumes that examples are drawn independently and identically according to an unknown distribution D over X × Y, with X the object space and Y the set of labels. In binary classification two types of objects are observed and Y = {ȳ₋, ȳ₊}. This extends to Y = {ȳ₁, …, ȳ_r} when more than two types of objects have to be distinguished (r categories). In the case of ordinal regression there is in addition a linear order relation ⪯_Y defined on the elements of Y. When the order relation …
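As a concrete illustration of the binary case discussed here, the area under the ROC curve equals the Wilcoxon–Mann–Whitney statistic: the fraction of (negative, positive) pairs that a scoring function ranks correctly. The sketch below (illustrative Python, not the authors' implementation; the name `auc_wmw` is ours) makes this pairwise definition explicit:

```python
def auc_wmw(scores_neg, scores_pos):
    """AUC as the Wilcoxon-Mann-Whitney statistic: the fraction of
    (negative, positive) score pairs ranked correctly, with ties
    counted as half a correct pair."""
    pairs = [(n, p) for n in scores_neg for p in scores_pos]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for n, p in pairs)
    return correct / len(pairs)

# A scoring function that separates the classes perfectly gives AUC = 1
print(auc_wmw([0.1, 0.3, 0.2], [0.8, 0.6, 0.4]))  # 1.0
```

Note that this estimator depends only on the ordering of the scores, not on their magnitudes, which is exactly the ranking-based perspective the article advocates.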

ROC measures for ordinal regression

Recently, different approaches have been proposed to extend ROC analysis to multi-class classification, see e.g. Hand and Till (2001), Ferri et al. (2003), Flach (2004), Fieldsend and Everson (2006). In the most general case, the volume under the ROC surface (VUS) has to be maximized in multi-class classification. The ROC surface can be seen as a Pareto front, where each objective corresponds to one dimension. When there are more than two classes (say r), the number of objectives depends …
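Under our reading of the extension described here, the ordinal VUS can be estimated as the fraction of r-tuples, one object drawn from each of the r ordered categories, whose scores appear in the correct order. The following sketch is our illustration of that idea, not the article's exact formula (the name `vus_estimate` is hypothetical):

```python
import itertools

def vus_estimate(scores_by_class):
    """Estimate the volume under the ROC surface for r ordered classes
    as the fraction of r-tuples (one object per class) whose scores are
    strictly increasing. scores_by_class is a list of score lists,
    ordered from the lowest to the highest category."""
    tuples = list(itertools.product(*scores_by_class))
    correct = sum(1 for t in tuples
                  if all(a < b for a, b in zip(t, t[1:])))
    return correct / len(tuples)

# Three ordered categories with perfectly separated scores: VUS = 1
print(vus_estimate([[0.1, 0.2], [0.4, 0.5], [0.8, 0.9]]))  # 1.0
```

For r = 2 this reduces to the binary Wilcoxon–Mann–Whitney statistic (up to the handling of ties), which is why the article presents VUS as its natural generalization.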

Experiments

Three kinds of experiments were set up to reveal the characteristics of Û(f, D) and to compare it with its approximations and the standard measures 'mean zero-one error' and 'mean absolute error'. First we show, by means of simulations with synthetic ranking functions, that the relationship between Û(f, D) and its approximations appears to be non-linear. In a second simulation experiment we show that in general there is no monotone relationship between Û(f, D) and other ranking-based …
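The two standard measures used for comparison here can be stated in a few lines (a plain illustration, not the authors' code). Note how mean absolute error, unlike zero-one error, grades a misclassification by how many categories it is off:

```python
def mean_zero_one_error(y_true, y_pred):
    """Fraction of objects assigned to the wrong category."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between true and predicted category
    indices; a prediction two categories off costs twice as much as
    one category off."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 2, 3, 3]
y_pred = [1, 3, 1, 3]
print(mean_zero_one_error(y_true, y_pred))  # 0.5
print(mean_absolute_error(y_true, y_pred))  # 0.75
```

Both measures depend only on the predicted labels, not on the underlying scores, which is why a model can score well on them while imposing a poor ranking on the data.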

Conclusion

In this article we analyzed ranking-based ordinal regression models. We argued that evaluating the ranking returned by an ordinal regression model is often more appropriate than looking at 'mean zero-one error' or 'mean absolute error', especially with skew class or cost distributions. To that end, we extended the concept of expected ranking accuracy to ordinal labeled data and showed that a nonparametric unbiased estimator Û(f, D) of this quantity corresponds to the volume under the ROC …

Acknowledgements

We would like to thank Bruno Bauwens for useful discussions. Willem Waegeman was supported by a grant of the "Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT-Vlaanderen)".

References (23)

  • et al., 2005. Generalization bounds for the area under the ROC curve. J. Machine Learn. Res.
  • Agresti, A., 2002. Categorical Data Analysis, 2nd ed. John Wiley and...
  • Chu, W., et al., 2005. Gaussian processes for ordinal regression. J. Machine Learn. Res.
  • Chu, W., Keerthi, S., 2005. New approaches to support vector ordinal regression. In: Proc. Internat. Conf. on Machine...
  • Cortes, C., et al. AUC optimization versus error rate minimization.
  • Crammer, K., Singer, Y., 2001. Pranking with ranking. In: Proc. Conf. on Neural Inform. Process. Systems, Vancouver,...
  • Cristianini, N., et al., 2000. An Introduction to Support Vector Machines.
  • Dreiseitl, S., et al., 2000. Comparing three-class diagnostic tests by three-way ROC analysis. Med. Decis. Making.
  • Ferri, C., Hernandez-Orallo, J., Salido, M., 2003. Volume under ROC surface for multi-class problems. In: Proc....
  • Fieldsend, J., et al., 2006. Multi-class ROC analysis from a multi-objective optimization perspective. Pattern Recognition Lett.
  • Flach, P., 2003. The geometry of ROC space: Understanding machine learning metrics through ROC isometrics. In: Proc....