A Heuristic that Works
To deal with the wide range of possibilities for bibliographic string input the heuristic needs to be randomized. Our heuristic works in the following way:
1. All stop words are removed from the bibliographic string, abbreviations and punctuation marks are deleted. These include words such as "of, the, and." In addition, words such as "computer" in a computer science library resulted in a large number of database hits. The useful feature of a stop word list is that it acts to focus the result set. Our particular service uses a stop word list that is optimized for a computer science library, and contains over 60 words. The list will change by necessity for different types of libraries. For example, we might wish to make "DNA" a stop word for a biology library.
2. From the resulting cleaned bibliography string the heuristic attempts to create at most 12, three-word sets. The words are selected randomly.
3. Each of the three-word sets is sent to the library search interface. The library database returns a list of identifiers which are unique pointers to database entries. The identifiers are "scored" by the number of times they are returned. For example: if an identifier is returned by the server as a result of ten searches, it has a score of ten.
We chose a 12/3 system for the following reasons: If the number of word sets is decreased from twelve, the scoring of the identifiers does not allow for a definite answer. However, increasing the number beyond twelve slows the search process to a speed unacceptable to users. Additionally, we do not want to choose less than three words for each set because the number of results would be high. However, we don't want to choose a number much greater then three because the logical "and" operation when used with a large number of words in a search tends to fail.
4. Identifiers are sorted by their score, top scoring identifiers first. The identifiers are handed to the library server, and the returned database entries are displayed to the user.
This heuristic has proven to be remarkably successful for a number of reasons: it is fast, it does not require any knowledge of bibliography formats, and it deals well with bibliography strings with errors.