English abstract
Abstract
The large volume of scientific literature published today calls for new techniques for managing and classifying publications efficiently. There is a wide gap between the data stored in digital libraries' databases and the knowledge that could be extracted from these data. Moreover, scientific publications are often not categorized according to a well-known classification scheme, even though finding documents is more efficient when search results are organized into topical categories than when they are presented as a standard ranked list. There is therefore a need for a solution to this problem. We reviewed the literature and analyzed the existing classification approaches, then developed a solution that classifies a scientific publication into its most closely related classes or categories. We propose three methodologies that use TF-IDF in different forms, based on the 1998 version of the ACM Computing Classification System (ACMC), which has been widely used for classifying scientific publications by leading organizations such as ACM and IEEE. The first approach, PublicationBased, applies TF-IDF to all documents classified under ACMC and stores the resulting keyword weights in every ACMC class that contains those keywords. The second approach, NodeBased, computes TF-IDF within each ACMC class separately, since keyword frequencies vary from class to class and a keyword may be important in only some classes. The third approach, HyperedBased, computes TF-IDF under the assumption that each ACMC class is a single document.
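The difference between the three approaches lies mainly in what counts as the document collection when TF-IDF is computed. The following Python sketch illustrates that difference on a toy corpus. It is an illustration under stated assumptions, not the implementation evaluated in this work: the weighting formula (relative term frequency times log(N / df)), the summing of weights per class, the `classify` scoring rule, the function names, and the four sample publications are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: (count / document length) * log(N / document frequency).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


def group_by_class(pubs):
    """Map each class label to the token lists of its publications."""
    groups = defaultdict(list)
    for cls, tokens in pubs:
        groups[cls].append(tokens)
    return groups


def publication_based(pubs):
    """TF-IDF over all publications; weights are summed into each class."""
    weights = tf_idf([tokens for _, tokens in pubs])
    profile = defaultdict(Counter)
    for (cls, _), w in zip(pubs, weights):
        profile[cls].update(w)
    return dict(profile)


def node_based(pubs):
    """TF-IDF computed separately inside each class (node)."""
    profile = {}
    for cls, docs in group_by_class(pubs).items():
        total = Counter()
        for w in tf_idf(docs):
            total.update(w)
        profile[cls] = total
    return profile


def hypered_based(pubs):
    """TF-IDF where each class is merged into one single document."""
    groups = group_by_class(pubs)
    classes = list(groups)
    merged = [[t for doc in groups[c] for t in doc] for c in classes]
    return {c: Counter(w) for c, w in zip(classes, tf_idf(merged))}


def classify(profile, tokens):
    """Hypothetical scoring rule: pick the class with the largest weight sum."""
    scores = {c: sum(w.get(t, 0.0) for t in tokens) for c, w in profile.items()}
    return max(scores, key=scores.get)


# Toy corpus: (ACM CCS 1998 class label, keywords). The data are invented.
pubs = [
    ("H.2", ["query", "index", "database", "query"]),
    ("H.2", ["transaction", "database", "index"]),
    ("I.2", ["learning", "agent", "search"]),
    ("I.2", ["learning", "planning", "agent", "search"]),
]

for build in (publication_based, node_based, hypered_based):
    print(build.__name__, classify(build(pubs), ["query", "database"]))
```

One consequence is visible even at this scale: in the node-level variant, a keyword that occurs in every publication of a class receives a weight of zero within that class, whereas the other two variants still give it weight as long as it is rare across the rest of the collection.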
We evaluated the three approaches using around 86,000 publications extracted from the ACM Digital Library and stored in a relational database. To measure the precision of each proposed approach, we randomly selected classified publications from our database and evaluated the solution on them. We computed the precision under different factors by considering the number of publications in each classification node. The experimental analysis showed that increasing the number of publications in the classification nodes can improve the precision, while precision decreases for classification nodes that have a smaller number of publications.