Imbalanced classification: an objective-oriented review

Abstract

A common issue for classification in scientific research and industry is the existence of imbalanced classes. When sample sizes of different classes are imbalanced in training data, naively implementing a classification method often leads to unsatisfactory prediction results on test data. Multiple resampling techniques have been proposed to address the class imbalance issue. Yet, there is no general guidance on when to use each technique. In this article, we provide an objective-oriented review of the common resampling techniques for binary classification under imbalanced class sizes. The learning objectives we consider include the classical paradigm that minimizes the overall classification error, the cost-sensitive learning paradigm that minimizes a cost-adjusted weighted sum of type I and type II errors, and the Neyman-Pearson paradigm that minimizes the type II error subject to a type I error constraint. Under each paradigm, we investigate the combination of the resampling techniques and a few state-of-the-art classification methods. For each pair of resampling technique and classification method, we use simulation studies to assess the performance under different evaluation metrics. From these extensive simulation experiments, we demonstrate, under each classification paradigm, the complex dynamics among resampling techniques, base classification methods, evaluation metrics, and imbalance ratios. For practitioners, the take-away message is that with imbalanced data, one usually should consider all the combinations of resampling techniques and base classification methods.
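
To make the three learning objectives concrete, the following is a minimal sketch of how each could be evaluated empirically from true and predicted labels. It is an illustration under stated assumptions, not the procedure used in the simulation studies: the function names, the choice of class 0 as the class whose misclassification counts as a type I error, the cost parameters c0 and c1, and the level alpha are all hypothetical. Both classes are assumed to be present in the labels.

```python
import numpy as np


def error_rates(y_true, y_pred):
    """Return (type I error, type II error).

    Assumption: class 0 is the class whose misclassification is the
    type I error; misclassifying class 1 is the type II error.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    type1 = np.mean(y_pred[y_true == 0] == 1)  # class 0 predicted as 1
    type2 = np.mean(y_pred[y_true == 1] == 0)  # class 1 predicted as 0
    return float(type1), float(type2)


def classical_risk(y_true, y_pred):
    """Classical paradigm: overall misclassification rate."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))


def cost_sensitive_risk(y_true, y_pred, c0=1.0, c1=1.0):
    """Cost-sensitive paradigm: cost-weighted sum of the two error types.

    Assumed form: c0 * P(Y=0) * type I + c1 * P(Y=1) * type II,
    with class proportions estimated from y_true.
    """
    y_true = np.asarray(y_true)
    type1, type2 = error_rates(y_true, y_pred)
    p0 = float(np.mean(y_true == 0))
    return c0 * p0 * type1 + c1 * (1.0 - p0) * type2


def np_objective(y_true, y_pred, alpha=0.05):
    """Neyman-Pearson paradigm: type II error if the empirical type I
    error is at most alpha, otherwise infinity (constraint violated)."""
    type1, type2 = error_rates(y_true, y_pred)
    return type2 if type1 <= alpha else float("inf")
```

With c0 = c1 = 1 the cost-sensitive risk in this sketch coincides with the overall classification error, which is one way to see the classical paradigm as a special case. The Neyman-Pearson check here is purely empirical and carries no high-probability guarantee on the population type I error.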