Polish Information Processing Society

Annals of Computer Science and Information Systems, Volume 12

Position Papers of the 2017 Federated Conference on Computer Science and Information Systems

Evaluation of classifiers: current methods and future research directions

DOI: http://dx.doi.org/10.15439/2017F530

Citation: Position Papers of the 2017 Federated Conference on Computer Science and Information Systems, M. Ganzha, L. Maciaszek, M. Paprzycki (eds). ACSIS, Vol. 12, pages 37–40


Abstract. This paper reviews the most important aspects of the classifier evaluation process, including the choice of evaluation metrics (scores) and the statistical comparison of classifiers. It presents recommendations, the limitations of the described methods, and promising directions for future research. The article serves as a quick guide to the complexity of the classifier evaluation process and warns the reader about bad habits.

