English abstract
The Internet, and particularly its collection of multimedia services known as the World Wide Web, is growing rapidly. Many Web sites appear within a short period of time, which puts a strain on search engines as their technology tries to keep up with the growth. The information that search engines return in response to users' search goals and queries does not seem to be as relevant as it should be. Search engines need better and more efficient techniques to provide relevant information to users. One approach is to incorporate intelligent techniques such as heuristics and pattern matching into the design of future search engines, giving them the ability to discover user access patterns, build user interest models, and use these models to automatically acquire relevant information from the Web. Moreover, these additional techniques would help search engines conduct theme-based search as opposed to keyword-based search. It is the latter approach to search that prevents the engines from spanning the entire Web when looking for pages.
The reason is that keyword-based search puts considerable strain on the spider because of (1) the very large number of servers and pages on the Web, and (2) the inefficiency of keyword-based search, which breaks down when page designers intentionally include duplicate words to attain a higher ranking from the search engines. This project is an initial step towards future theme-based search engines. The proposed framework presents a design for a clustering system for Web searching that produces a user model. This model can be given to future intelligent search engines (ISE) to automatically fetch pages and documents considered to be of regular interest to the user. The model is constructed using several pattern matching and clustering techniques, along with several heuristics that control the mining of Web pages during the clustering process. To test the usefulness of our framework, we conducted several experimental analyses. The results were encouraging and seem to show that the clustering techniques we used are in line with other similar techniques.