Document
A comparison of term weighting schemes for thesis classification.
Identifier
Al-Maneiyah, Afra Humaid (2020). A comparison of term weighting schemes for thesis classification. (Master's thesis, Sultan Qaboos University, Muscat, Oman).
Publisher
Sultan Qaboos University
Gregorian
2020
Language
English
English abstract
Term weighting is the basis of text classification and can affect the performance
of classifiers. Different term weighting schemes are available, but little
evidence has been found on how the schemes differ in their effect on classification
performance. In this research, we investigated three term weighting schemes: the count,
term frequency-inverse document frequency (TF-IDF) and term frequency-inverse category
frequency (TF-ICF). We compared the three schemes on the classification of theses for the
postgraduate students of the College of Science at Sultan Qaboos University, using the multinomial
naive Bayes (MNB) and the support vector machine (SVM) classification algorithms.
The comparison was based on four classification performance metrics: accuracy, recall,
precision and F1-score. Our results revealed that the count scheme with MNB gave a higher
macro-average recall than the other schemes with SVM. In addition, considering
SVM alone, we found that TF-ICF gave a higher macro-average recall than the
other two schemes. The findings suggest that the term weighting schemes have different
effects on the classification performance metrics. The results show that the count weighting
scheme performs better in classifying theses, especially with MNB. The count scheme with
SVM, however, could handle class imbalance better than the count scheme with MNB.
In addition, the TF-ICF with SVM had an advantage over the count and TF-IDF with SVM.
Therefore, this study suggests that the students' theses could be classified using count with
MNB or TF-ICF with SVM. We recommend that the College of Science and the main library
at SQU integrate term weighting to ease automated classification of postgraduate
theses.
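
The three weighting schemes named in the abstract can be illustrated with a
minimal sketch. The abstract does not give the formulas used in the thesis, so
the unsmoothed textbook forms below are assumptions: IDF as log(N / df) over
documents and ICF as log(C / cf) over categories. The function names and the
toy corpus are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def count_weights(doc_tokens):
    """Count scheme: raw term counts for one tokenized document."""
    return dict(Counter(doc_tokens))


def idf_table(docs):
    """Assumed IDF: log(N / df(t)), with N documents and df(t) the
    number of documents that contain term t."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / d) for term, d in df.items()}


def icf_table(docs, labels):
    """Assumed ICF: log(C / cf(t)), with C categories and cf(t) the
    number of categories in which term t occurs."""
    cats_with_term = {}
    for doc, label in zip(docs, labels):
        for term in set(doc):
            cats_with_term.setdefault(term, set()).add(label)
    n_cats = len(set(labels))
    return {t: math.log(n_cats / len(c)) for t, c in cats_with_term.items()}


def weight(doc_tokens, table):
    """Multiply each term count by its IDF or ICF factor.
    Terms missing from the table get weight 0."""
    return {t: c * table.get(t, 0.0) for t, c in Counter(doc_tokens).items()}


# Hypothetical toy corpus: four tokenized "theses" in two categories.
docs = [
    ["protein", "gene", "analysis"],
    ["gene", "cell", "cell"],
    ["matrix", "proof", "analysis"],
    ["matrix", "theorem", "proof"],
]
labels = ["biology", "biology", "mathematics", "mathematics"]

idf = idf_table(docs)
icf = icf_table(docs, labels)
tfidf_doc1 = weight(docs[1], idf)
tficf_doc1 = weight(docs[1], icf)
```

Under these assumed formulas, a term such as "analysis" that occurs in every
category receives an ICF of zero, while its IDF stays positive because it
occurs in only half of the documents. Library implementations often apply
smoothing or normalization, so their values may differ from this sketch.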
Category
Theses and Dissertations