English Abstract
Abstract
Calculating the similarity between text documents has long attracted special interest because it underpins many tasks, such as copy detection, plagiarism detection, and information retrieval. The problem has drawn the attention of researchers across a wide range of areas and applications. This project examines it through a new approach that combines a text's syntactic structure, namely its Part-Of-Speech (POS) tags, with sequence alignment techniques, such as the Longest Common Subsequence (LCS) algorithm, to analyze and calculate the similarity between text documents. It proposes building better document representations from the syntactic structure of written text, which preserves some of the semantics expressed as order and style. A prototype of the proposed approach was implemented to investigate the idea's potential, applicability, and utility, and experiments were conducted to validate it on different data sets from different domains. In these experiments, POS structures served as the representation for calculating similarities between related documents, and clustering methods together with sequence alignment techniques such as LCS were used to group and rank documents according to their similarity. The similarity between two documents was measured as a normalized score of the length of their LCS. The effect of the POS tag set size on the accuracy of the results was also investigated. Further experiments explored real-life applications of the approach: detecting duplicate and near-duplicate documents within a corpus, and filtering search engine results. The experiments showed the utility of the approach in finding similarities between written documents, demonstrating its ability to capture duplicate and near-duplicate documents and to group similar documents.
An important contribution of this work is the development of two extensions to the LCS algorithm, both of which improve the efficiency of the original algorithm. Experiments demonstrating these improvements were performed, and their results are discussed.
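The similarity measure described above, a normalized LCS length computed over POS tag sequences, can be illustrated with a minimal sketch. This is not the prototype's implementation: the function names `lcs_length` and `pos_similarity` are hypothetical, the abstract does not state the normalization (dividing by the length of the longer sequence is an assumption here), and the code uses the textbook dynamic-programming LCS rather than either of the two extensions. The tag sequences are hand-written Penn Treebank-style tags chosen for illustration; in practice they would come from a POS tagger.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (textbook dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer sequence so each row stays short
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous prefix of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend the common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip one element
        prev = curr
    return prev[-1]


def pos_similarity(tags_a: Sequence[str], tags_b: Sequence[str]) -> float:
    """Normalized LCS score in [0, 1].

    Assumption: normalize by the longer sequence's length. The source only
    says the score is "normalized", so other choices are equally plausible.
    """
    longest = max(len(tags_a), len(tags_b))
    if longest == 0:
        return 1.0  # assumption: two empty documents count as identical
    return lcs_length(tags_a, tags_b) / longest


# Illustrative tag sequences (hypothetical data, not from the experiments).
doc_a = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
doc_b = ["DT", "JJ", "NN", "VBZ", "DT", "NN"]
score = pos_similarity(doc_a, doc_b)
```

Because the comparison runs on tags rather than words, two documents with different vocabulary but the same syntactic pattern score as similar, which is the property the approach relies on. A smaller tag set makes more tags match and so tends to raise scores, which is one way the tag set size could influence the results.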