Fast-CBUS: A fast clustering-based undersampling method for addressing the class imbalance problem Academic Article uri icon

abstract

  • Datasets that have imbalanced class distributions pose a challenge for learning and classification algorithms. Imbalanced datasets exist in many domains, such as: fraud detection, sentiment analysis, churn prediction, and intrusion detection in computer networks. To solve the imbalance problem, three main approaches are typically used: data resampling, method adaptation and cost-sensitive learning; of these, data resampling, either oversampling the minority class instances or undersampling the majority class instances, is the most used approach. However, in most cases, when implementing these approaches, there is a trade-off between the predictive performance and the complexity. In this paper we introduce a fast, novel clustering-based undersampling technique for addressing binary-class imbalance problems, which demonstrates high predictive performance, while its time complexity is bound by the size of the minority class instances. During the training phase, the algorithm clusters the minority instances and selects a similar number of majority instances from each cluster. A specific classifier is then trained for each cluster. An unlabeled instance is classified as the majority class if it does not fit into any of the clusters. Otherwise, cluster-specific classifiers are used to return the instance's classification, and the results are weighted by the inverse-distance from the clusters. Our evaluation includes several state-of-the-art methods. We plot the Pareto frontier for various datasets, to consider both computational cost and predictive performance measures. Extensive sets of experiments demonstrate that only the suggested method is always found on the frontier.

publication date

  • January 1, 2017