The WAIS search engine is at the heart of the WAIS Server and Workstation products. It receives a user's question, searches its database for the documents most relevant to the question, and returns a relevance-ranked list of documents to the user. Each document is given a score from 1 to 1000, based on how well it matched the user's question (how many of the question's words it contained, their importance in the document, etc.). A question is an expression containing a combination of natural language, relevant documents, and boolean terms. Other key features of the WAIS search engine include fielded search, right truncation (wildcard searching), and relevance ranking.
The server can be queried using natural language questions. The server does not understand the question; rather, it takes the words and phrases in the question and finds documents that contain them. "Tell me about portable computers." is an example of a natural language question. In this example, the WAIS Server would search for documents containing the words 'portable' and 'computers'. The other words, 'tell', 'me', and 'about', are called "stop words": they are so common that they occur in almost every document, so they are not used for searching.
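The stop-word step described above can be sketched in a few lines of Python. This is only an illustration: the stop list and the function name search_terms are assumptions made for the example, since the actual WAIS stop list is not given here.

```python
import re

# Hypothetical stop list for illustration; the real WAIS list is not given in the text.
STOP_WORDS = {"tell", "me", "about", "the", "a", "an", "of", "is", "to"}

def search_terms(question):
    """Lowercase the question, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return [w for w in words if w not in STOP_WORDS]

terms = search_terms("Tell me about portable computers.")
print(terms)
```

With this stop list, only 'portable' and 'computers' survive as search terms.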
The boolean operators AND, OR, NOT, and ADJ help establish logical relationships between concepts expressed in natural language. These operators are especially useful for narrowing a search.
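One way to picture the four operators is as set operations over a small positional index. The sketch below is not the WAIS implementation. It assumes that "A NOT B" means "contains A but not B" and that "A ADJ B" means "B immediately follows A"; the helper names (build_index, docs_with, adj) and the sample documents are made up for the example.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [word positions]} (a tiny positional index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def docs_with(index, word):
    """Set of document ids that contain the word."""
    return set(index.get(word, {}))

def adj(index, left, right):
    """Documents in which `right` appears immediately after `left` (assumed ADJ meaning)."""
    hits = set()
    for doc_id, left_positions in index.get(left, {}).items():
        right_positions = set(index.get(right, {}).get(doc_id, []))
        if any(p + 1 in right_positions for p in left_positions):
            hits.add(doc_id)
    return hits

# Made-up sample collection.
DOCS = {
    "d1": "Portable computers for travel",
    "d2": "Desktop computers and printers",
    "d3": "Portable radios",
}
INDEX = build_index(DOCS)
portable = docs_with(INDEX, "portable")
computers = docs_with(INDEX, "computers")

both = portable & computers                   # portable AND computers
either = portable | computers                 # portable OR computers
only_computers = computers - portable         # computers NOT portable
phrase = adj(INDEX, "portable", "computers")  # portable ADJ computers
```

AND and NOT shrink the result set, which is why they are useful for narrowing a search, while OR widens it.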
For data collections whose documents are structured in a semi-regular format, the regular portions of the documents can be tagged by the WAIS parser as fields. A client can then ask a WAIS server to limit its search to those documents containing a user-specified value of a particular field. This is called a "Fielded Search".
The mail-or-rmail parse format is an example of a parse format in which fields are tagged. For this parse format, the WAIS parser detects the "to" and "cc" fields, the "from" and "sender" fields, the "subject" field, and the "date" field. An example of a question using natural language, a boolean operator, and fielded search is: "company picnic AND from=barbara". The WAIS server would then return documents containing messages about a company picnic that barbara sent.
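A question such as "company picnic AND from=barbara" can be illustrated with a minimal sketch. It assumes a simplified mail layout (header lines, a blank line, then the body) and, for brevity, requires every natural language word to appear in the subject or body. The functions parse_mail and fielded_search and the sample messages are hypothetical and do not reproduce the mail-or-rmail parser.

```python
def parse_mail(raw):
    """Split a simple message into lowercase header fields plus a 'body' entry."""
    head, _, body = raw.partition("\n\n")
    fields = {"body": body}
    for line in head.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip lines that are not "Name: value" headers
            fields[name.strip().lower()] = value.strip()
    return fields

def fielded_search(messages, words, field, value):
    """Messages whose subject or body contains every word and whose field contains value."""
    hits = []
    for msg in messages:
        text = (msg.get("subject", "") + " " + msg["body"]).lower()
        in_field = value.lower() in msg.get(field, "").lower()
        # Plain substring matching keeps the sketch short; it is not whole-word matching.
        if in_field and all(w.lower() in text for w in words):
            hits.append(msg)
    return hits

# Made-up sample messages.
MAILBOX = [parse_mail(raw) for raw in (
    "From: barbara@example.com\nTo: staff@example.com\n"
    "Subject: Company picnic\n\nThe picnic is on Friday.",
    "From: ben@example.com\nSubject: Re: Company picnic\n\nI will bring the drinks.",
    "From: barbara@example.com\nSubject: Budget\n\nPlease send the budget numbers.",
)]

hits = fielded_search(MAILBOX, ["company", "picnic"], "from", "barbara")
```

Only the first message satisfies both the natural language part and the field restriction.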
A user can specify right truncation by ending a word with the asterisk ('*') wild card character. This tells the search engine to search on words matching the base characters before the '*' and to ignore any trailing characters. For example, you might use right truncation in a question such as "geo*", which may retrieve documents containing the words: geographer, geography, geologist, geometry, geometrical, etc.
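Right truncation amounts to a prefix match against the indexed vocabulary. The following sketch assumes the vocabulary is kept as a sorted list, so that all words sharing a prefix sit next to each other. The function name expand_truncation and the word list are illustrative only.

```python
from bisect import bisect_left

def expand_truncation(term, vocabulary):
    """Return the words in a sorted vocabulary that match a right-truncated term."""
    if not term.endswith("*"):
        return [term] if term in vocabulary else []
    prefix = term[:-1]
    start = bisect_left(vocabulary, prefix)  # first word that is >= prefix
    matches = []
    for word in vocabulary[start:]:
        if not word.startswith(prefix):
            break  # sorted order: no later word can match
        matches.append(word)
    return matches

# Made-up vocabulary; it must be sorted for the bisect lookup to be valid.
VOCABULARY = sorted([
    "gear", "geographer", "geography", "geologist",
    "geometrical", "geometry", "gerbil", "map",
])

expanded = expand_truncation("geo*", VOCABULARY)
```

Here "geo*" expands to geographer, geography, geologist, geometrical, and geometry, while "gear" and "gerbil" are skipped because they do not share the prefix.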
A user can group search terms and phrases together using parentheses. For example, if you wished to search for information about snowstorms, tornadoes, or hurricanes in New York City, you might search for "(snowstorms OR tornadoes OR hurricanes) AND (New ADJ York ADJ City)". You can also nest parentheses; for example, "from = ( (ben ADJ wais) OR (brewster ADJ think) )" searches for messages from either ben@wais.com or brewster@think.com.
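Grouping with parentheses can be handled by a small recursive-descent parser. The sketch below is one possible approach, not the WAIS query parser. It assumes that operators are written in upper case, that AND and NOT bind more tightly than OR, that "A NOT B" means "A but not B", and that a chain of ADJ terms is an exact phrase. Fielded terms such as "from = (...)" are left out.

```python
import re

OPERATORS = {"AND", "OR", "NOT", "ADJ"}

def is_word(tok):
    """True for a search word, False for an operator, a parenthesis, or end of input."""
    return tok is not None and tok not in OPERATORS and tok not in ("(", ")")

def parse(query):
    """Parse a query into nested tuples such as ("AND", left, right)."""
    tokens = re.findall(r"\(|\)|[A-Za-z0-9]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():  # expr := and_expr ("OR" and_expr)*
        node = and_expr()
        while peek() == "OR":
            take()
            node = ("OR", node, and_expr())
        return node

    def and_expr():  # and_expr := unit (("AND" | "NOT") unit)*
        node = unit()
        while peek() in ("AND", "NOT"):
            op = take()
            node = (op, node, unit())
        return node

    def unit():  # unit := "(" expr ")" | phrase
        if peek() == "(":
            take()
            node = expr()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return node
        return phrase()

    def phrase():  # phrase := WORD ("ADJ" WORD)*
        if not is_word(peek()):
            raise ValueError(f"expected a word, got {peek()!r}")
        words = [take().lower()]
        while peek() == "ADJ":
            take()
            if not is_word(peek()):
                raise ValueError("ADJ must be followed by a word")
            words.append(take().lower())
        return ("PHRASE", words)

    tree = expr()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return tree

def matches(node, doc_words):
    """Evaluate a parsed query against a document given as a list of lowercase words."""
    kind = node[0]
    if kind == "PHRASE":
        words, n = node[1], len(node[1])
        return any(doc_words[i:i + n] == words for i in range(len(doc_words) - n + 1))
    left, right = matches(node[1], doc_words), matches(node[2], doc_words)
    if kind == "AND":
        return left and right
    if kind == "OR":
        return left or right
    return left and not right  # kind == "NOT", read as "left but not right"

tree = parse("(snowstorms OR tornadoes OR hurricanes) AND (New ADJ York ADJ City)")
hit = matches(tree, "hurricanes rarely reach new york city".split())
miss = matches(tree, "tornadoes struck the city of york".split())
```

The second document mentions tornadoes but never contains the phrase "new york city", so it does not match.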
Each document is scored based on its relevance to a user's question, where the most relevant document has the highest score, or rank -- 1000 being the highest, 1 being the lowest. A document receives a higher score if the words in the question are in the headline, or if the words appear many times, or if phrases occur as in the question. A document's score is derived using techniques such as word weighting, term weighting, proximity relationships, and word density. Note that questions made up of natural language, relevant documents, and boolean expressions are all weighted using these techniques.
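The text gives the score range but not how raw scores are mapped onto it. One plausible mapping, shown here purely as an assumption, scales the raw scores so that the best document receives 1000 and no matching document falls below 1. The function name normalize_scores and the sample values are invented for the example.

```python
def normalize_scores(raw_scores):
    """Scale positive raw scores so the best document gets 1000 and none drops below 1."""
    if not raw_scores:
        return {}
    best = max(raw_scores.values())
    if best <= 0:
        raise ValueError("raw scores are assumed to be positive")
    return {doc: max(1, round(1000 * score / best)) for doc, score in raw_scores.items()}

# Made-up raw scores for three matching documents.
ranked = normalize_scores({"a": 8.0, "b": 4.0, "c": 0.002})
```

In this example document "a" is scored 1000, "b" is scored 500, and "c" is held at the floor of 1.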
If a word in a document is found to match a word in the user's question, the word is assigned a weight, and this weight adds to the overall score of the document. The exact weight that a word receives depends on the emphasis given to the word by the author, and on where in the document the word was found. For example, a word is weighted highest if it appears in the headline, lower if it is written in all capital letters or begins with a capital letter, and lowest if it appears only in plain body text. The WAIS parser determines word weights as it reads through the original data collection.
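The ordering of word weights can be sketched as follows. The numeric values are hypothetical, since the text gives only the ordering (headline, then capitalized, then plain text), and the function word_weight is an illustration rather than the WAIS parser's logic.

```python
import re

# Hypothetical weights; only their relative order comes from the text.
HEADLINE_WEIGHT = 3.0
EMPHASIS_WEIGHT = 2.0  # all capitals or an initial capital
TEXT_WEIGHT = 1.0

def tokens(text):
    """Split text into words, keeping the original capitalization."""
    return re.findall(r"[A-Za-z0-9]+", text)

def word_weight(query_word, headline, body):
    """Return the highest weight earned by any occurrence of query_word."""
    target = query_word.lower()
    if any(w.lower() == target for w in tokens(headline)):
        return HEADLINE_WEIGHT
    occurrences = [w for w in tokens(body) if w.lower() == target]
    if not occurrences:
        return 0.0
    # An upper-case first letter covers both ALL CAPS and Capitalized words.
    if any(w[0].isupper() for w in occurrences):
        return EMPHASIS_WEIGHT
    return TEXT_WEIGHT

in_headline = word_weight("wais", "WAIS search engine", "Some body text.")
emphasized = word_weight("server", "Introduction", "The Server answers questions.")
plain = word_weight("question", "Introduction", "The server answers a question.")
```

A limitation of this sketch is that a word capitalized only because it starts a sentence is also treated as emphasized.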
Each word used in a data collection is assigned a numerical value, called the term weight, based on the frequency of occurrence of that word over all documents in the data collection. Words that occur frequently are not weighted as highly as those that appear less frequently. Very common words are either ignored or diminished in the scoring. For example, since the term "animal" may occur frequently in many of the documents in a data collection, its term weight is small compared to a term such as "hippopotamus", which may occur only a few times.
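The text does not give the formula used for term weights. A common stand-in with the same behavior is inverse document frequency, log(N / df), where N is the number of documents and df is the number of documents containing the word. The sketch below uses that formula as an assumption, with an invented four-document collection.

```python
import math
import re

def term_weights(docs):
    """Weight each word by log(N / df); rarer words get larger weights."""
    n = len(docs)
    doc_freq = {}
    for text in docs:
        # Count each word once per document, however often it repeats there.
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n / df) for word, df in doc_freq.items()}

# Made-up collection: "animal" is in three documents, "hippopotamus" in one.
COLLECTION = [
    "Every animal needs water",
    "An animal shelter opened",
    "The hippopotamus is a large animal",
    "Rivers flood in spring",
]
weights = term_weights(COLLECTION)
```

Here "animal" receives log(4/3), about 0.29, while "hippopotamus" receives log(4), about 1.39. A word present in every document would receive 0, which corresponds to being ignored.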
Under proximity relationships, words from a natural language question that are located close together in a document are given a higher weight than those found further apart. The idea behind a proximity relationship is that if a document contains a phrase similar to one in the user's question, that document is more likely to be relevant.
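A proximity bonus can be sketched by finding the smallest distance, in word positions, between two question words and rewarding small distances. The 1/distance bonus and the function names are assumptions chosen for illustration; the actual WAIS weighting is not specified here. The sketch also assumes the two words are different.

```python
import re

def min_distance(words, a, b):
    """Smallest gap in word positions between a and b, or None if either is missing."""
    last_a = last_b = best = None
    for i, word in enumerate(words):
        if word == a:
            last_a = i
        elif word == b:
            last_b = i
        else:
            continue
        if last_a is not None and last_b is not None:
            gap = abs(last_a - last_b)
            if best is None or gap < best:
                best = gap
    return best

def proximity_bonus(text, a, b):
    """Hypothetical bonus of 1/distance: adjacent words score 1.0, distant words less."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    distance = min_distance(words, a, b)
    return 0.0 if distance is None else 1.0 / distance

adjacent = proximity_bonus("Portable computers are light", "portable", "computers")
far_apart = proximity_bonus("Portable radios and desktop computers", "portable", "computers")
```

The first document, where the words are adjacent, earns a bonus of 1.0. The second, where they are four positions apart, earns 0.25.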
The ratio of the number of times a word appears in a document to the size of the document is called the word density. It is a measure of how important a word is to the overall content of the document. A higher word density results in a higher relevance ranking.
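Word density follows directly from its definition. The sketch below assumes that document size is measured in words (the text does not say whether words or bytes are used), and the function name word_density is illustrative.

```python
import re

def word_density(word, text):
    """Occurrences of the word divided by the total number of words in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    return words.count(word.lower()) / len(words)

short_doc = word_density("wais", "WAIS search")
long_doc = word_density("wais", "WAIS is a server that ranks documents by relevance")
```

Both documents mention the word once, but the shorter one has the higher density (1/2 against 1/9), so it would rank higher on this measure alone.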
Courtesy of WAIS Inc.