English Abstract
Locating scientific publications using search engines and digital libraries is a time-consuming process. Many researchers and postgraduate students spend a very long time searching different search engines and digital libraries and still fail to find many important publications related to their queries. Researchers who conduct a scientific literature review on a given topic must also spend a very long time analyzing the literature and classifying it into various models and techniques. Trend detection and analysis for scientific publications have therefore become very important issues.
Traditional data analytics may not be able to handle large quantities of data within a limited time frame. Therefore, there is a need to utilize high-performance Big Data platforms and techniques to analyze scientific articles efficiently, and to use appropriate Data Mining algorithms to analyze and classify the selected articles. For this purpose, the researchers conducted a survey to learn how members of the research community locate and classify research articles to meet their needs. The survey results showed that Google Scholar is the most widely used search engine for finding relevant scientific publications, that most respondents had difficulty retrieving relevant articles, and that many respondents spent more than two months obtaining and analyzing the literature for their research topics.
The project aims to assist researchers by providing an automatic classification model for scientific articles in a given research topic. The proposed system utilizes Big Data solutions (such as Hadoop and Mahout) to retrieve and analyze the relevant scientific articles from various search engines and digital libraries. It employs Data Mining algorithms and techniques such as Supervised Learning (e.g., VSM Model, Clustering Keywords and/or Feature Selections) and Unsupervised Learning (e.g., k-Means algorithm) to classify research articles into various models, knowledge areas and research trends.
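As an illustration only, the combination of a vector space model with k-means clustering can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the system's actual Hadoop/Mahout implementation: the TF-IDF weighting, the farthest-first centroid initialization, the function names `tfidf_vectors` and `kmeans`, and the sample titles are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document to a unit-length TF-IDF vector (a simple VSM).

    Assumption: whitespace tokenization and smoothed IDF weighting.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Cluster vectors into k groups and return one label per vector.

    Assumption: deterministic farthest-first initialization, chosen here
    only so that the sketch gives repeatable results.
    """
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Pick the vector farthest from its nearest existing centroid.
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    # Hypothetical article titles from two unrelated research areas.
    titles = [
        "hadoop mapreduce cluster scheduling",
        "hadoop mapreduce job scheduling",
        "protein gene sequence alignment",
        "protein gene expression alignment",
    ]
    print(kmeans(tfidf_vectors(titles), k=2))
```

In this sketch the titles stand in for whichever article components are used as features; a production system would replace the in-memory loops with a distributed implementation.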
The researchers developed a prototype system for the proposed approach and evaluated its performance using various metrics. The system was tested with different data sizes and different research topics. The highest performance was obtained when clustering the dataset into two groups (k=2) using the k-means algorithm, with an accuracy of 91.04%, a Precision of 0.5241, a Recall of 0.9112 and an F-Measure of 0.9104.
Finally, the proposed system can be improved in the future by extending the dataset to achieve higher accuracy, by applying other Data Mining techniques (e.g., semi-supervised learning), and by using more components of published articles (e.g., keywords, abstracts, citations and references), alone and in combination, to improve the system's scalability and efficiency.