Document

A comparison of term weighting schemes for thesis classification.

Identifier
Al-Maneiyah, Afra Humaid (2020). A comparison of term weighting schemes for thesis classification. (Master's thesis. Sultan Qaboos University, Muscat, Oman).
Publisher
Sultan Qaboos University
Gregorian
2020
Language
English
English abstract
Term weighting is the basis of text classification and can affect classifier performance. Different term weighting schemes are available, but little evidence has been found of essential differences between them in classification performance. In this research, we investigated three term weighting schemes: the count, term frequency-inverse document frequency (TF-IDF) and term frequency-inverse category frequency (TF-ICF). We compared the three schemes on the classification of theses by postgraduate students of the College of Science at Sultan Qaboos University (SQU), using the multinomial naive Bayes (MNB) and support vector machine (SVM) classification algorithms. The comparison was based on four classification performance metrics: accuracy, recall, precision and F1-score. Our results revealed that the count scheme with MNB gave a higher macro-average recall than the other schemes with SVM. In addition, considering SVM alone, we found that TF-ICF gave a higher macro-average recall than the other two schemes. The findings suggest that the term weighting schemes have different effects on the classification performance metrics. The results show that the count weighting scheme performs better in classifying theses, especially with MNB. The count scheme with SVM, however, could handle class imbalance better than the count scheme with MNB. In addition, TF-ICF with SVM had an advantage over the count and TF-IDF schemes with SVM. Therefore, this study suggests that the students' theses could be classified using the count scheme with MNB or TF-ICF with SVM. We recommend that the College of Science and the main library at SQU integrate term weighting to ease the automated classification of postgraduate theses.
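The three schemes named in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the formulas below are common textbook variants (count times log(N/df) for TF-IDF, count times log(C/cf) for TF-ICF), and the function names, toy documents and category labels are all hypothetical.

```python
import math
from collections import Counter


def count_weights(doc):
    """Count scheme: raw number of occurrences of each term in one document."""
    return dict(Counter(doc))


def tf_idf_weights(docs):
    """TF-IDF (assumed variant): count * log(N / df),
    where df is the number of documents containing the term."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: c * math.log(n_docs / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


def tf_icf_weights(docs, labels):
    """TF-ICF (assumed variant): count * log(C / cf),
    where cf is the number of categories in which the term occurs."""
    n_cats = len(set(labels))
    term_cats = {}
    for doc, label in zip(docs, labels):
        for term in set(doc):
            term_cats.setdefault(term, set()).add(label)
    return [
        {t: c * math.log(n_cats / len(term_cats[t])) for t, c in Counter(doc).items()}
        for doc in docs
    ]


# Hypothetical toy corpus: three tokenized "theses" in two categories.
docs = [
    ["protein", "cell", "cell"],
    ["cell", "gene", "data"],
    ["matrix", "proof", "proof", "data"],
]
labels = ["biology", "biology", "mathematics"]

counts = [count_weights(d) for d in docs]
tfidf = tf_idf_weights(docs)
tficf = tf_icf_weights(docs, labels)
```

In this toy corpus, "cell" and "data" each occur in two of the three documents, so they share the same inverse document frequency. "cell" is confined to one category while "data" appears in both, so under these assumed formulas TF-ICF keeps a positive weight for "cell" and gives "data" a weight of zero.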
Category
Theses and Dissertations