Project: Building a web search engine
Advisor: Dave Musicant
Meeting time: TTh 3:10-4:55
I. Background
Google has made searching the web a snap. If you want to index your own
intranet, however, Google will do so only for a very large fee. A variety of
open source web search engines are available (ht://Dig, WAIS, etc.), though
each seems to be lacking in one way or another. In this project we will
create a new web search engine and test it on the Carleton intranet.
II. The Project
Here is a list of the concepts and technologies that will be necessary.
- Web and database programming. PHP and MySQL will be used
to create a dynamic web interface that is functional and attractive.
- Java. A considerable amount of preprocessing must be done
ahead of time so that the search engine can respond to a user's
request immediately. Writing this preprocessing in Java will make our
tools cross-platform.
- Algorithms. A number of techniques have been published
for ranking web pages based not just on the keywords being searched
for, but also on the pages that link to them. We will try some of
these techniques and compare the results they produce.
- Satisfaction survey. To determine how well our ranking techniques
work relative to each other and relative to the standard Carleton search
engine, we will survey users systematically about the results they
receive.
- Parsing. We will need to parse HTML in order to find
important words and to identify links. Additional functionality
can be provided for parsing non-text file formats such as PDF
and DOC.
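To make the Algorithms item concrete, here is a minimal sketch of one
published link-based ranking technique, PageRank, computed by repeated
iteration over the link graph. It is an illustration only, not the
project's implementation: the class name PageRankSketch, the array
representation of the link graph, the 0.85 damping value, and the even
redistribution of rank from pages without outgoing links are all
assumptions made for this example.

```java
import java.util.Arrays;

/** Illustrative PageRank by power iteration; names and parameters are assumptions. */
class PageRankSketch {

    /**
     * links[i] lists the indices of the pages that page i links to.
     * damping is the probability of following a link rather than jumping
     * to a random page (0.85 is a commonly used value).
     */
    static double[] rank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                 // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0;                  // rank held by pages with no outgoing links
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += rank[i];
                } else {
                    double share = rank[i] / links[i].length;
                    for (int target : links[i]) {
                        next[target] += share;      // a referring page passes on an equal share
                    }
                }
            }
            for (int i = 0; i < n; i++) {
                // random jump + rank received over links + dangling rank spread evenly
                next[i] = (1.0 - damping) / n + damping * (next[i] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical 4-page site: pages 1, 2 and 3 link to page 0; page 0 links to page 1.
        int[][] links = { {1}, {0}, {0}, {0} };
        System.out.println(Arrays.toString(rank(links, 0.85, 50)));
    }
}
```

In the hypothetical four-page site in main, page 0 is referred to by
three pages and ends up ranked highest, while pages 2 and 3, which no
page links to, are ranked lowest. The scores always sum to 1, so they
can be compared across pages.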
The final product will be a complete collection of tools that can be used
to set up a local search engine.
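As a rough picture of what the parsing and preprocessing tools might
do, the sketch below pulls links and words out of HTML and builds an
inverted index (a map from each word to the pages containing it), so
that a keyword lookup at query time is a single map access. Everything
here is an assumption made for illustration: the class name
IndexSketch, the method names, and the use of regular expressions in
place of a real HTML parser, which would be needed to handle malformed
markup, scripts, and character entities.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative link extraction and inverted index; not a robust HTML parser. */
class IndexSketch {

    // Matches href="..." or href='...' inside an <a ...> tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Matches any tag so that it can be stripped before splitting into words.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Returns the link targets found in the page, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Returns the lowercased words of the page with the tags removed. */
    static List<String> extractWords(String html) {
        String text = TAG.matcher(html).replaceAll(" ").toLowerCase(Locale.ROOT);
        List<String> words = new ArrayList<>();
        for (String w : text.split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    /** Maps each word to the sorted set of page URLs that contain it. */
    static Map<String, Set<String>> buildIndex(Map<String, String> pages) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            for (String word : extractWords(page.getValue())) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(page.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical two-page site used only to exercise the sketch.
        Map<String, String> pages = new TreeMap<>();
        pages.put("a.html", "<p>Web search. See <a href=\"b.html\">more</a></p>");
        pages.put("b.html", "<p>Search engine</p>");
        System.out.println(extractLinks(pages.get("a.html")));
        System.out.println(buildIndex(pages).get("search"));
    }
}
```

The links returned by extractLinks are what a link-based ranking step
would consume, while the index returned by buildIndex is what the web
interface would query. In a full system both would presumably be
stored in the database rather than held in memory.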
III. References
L. Page, S. Brin, R. Motwani, and T. Winograd. The
PageRank Citation Ranking: Bringing Order to the Web.
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg,
S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Mining the
link structure of the World Wide Web. IEEE Computer, August 1999.