Abstract
Software refactoring modifies the internal structure of software without changing its external functionality. Several studies have used machine learning classification algorithms to predict refactoring opportunities for developers. However, the problem of imbalanced datasets still needs to be addressed. A dataset is imbalanced when its classes are unequally distributed. One way to balance a dataset is to add instances to the minority class (oversampling) or to delete instances from the majority class (undersampling). In this paper, four popular techniques from these two categories are evaluated for their ability to improve the imbalance ratio of highly imbalanced software refactoring datasets. We used eight publicly available datasets from open-source Java software systems. Four classification algorithms, Naive Bayes, Support Vector Machine (SVM), Random Forest, and Decision Tree, were applied to evaluate the results of each resampling technique. The experimental results show that oversampling techniques performed better than random undersampling. In particular, when the datasets were balanced with the synthetic minority oversampling technique (SMOTE), the Random Forest algorithm obtained higher scores on several evaluation metrics across all datasets.