Xerox document search to improve results
20 June, 2007 22:00
Xerox researchers have developed a search tool that tries to understand documents, rather than looking for keywords, in order to provide better results.
The tool, Factspotter, analyses the underlying grammar of a text in order to infer additional information, such as whether ambiguous words are being used as nouns or verbs, or to whom a pronoun refers, said Frédérique Segond, who manages the parsing and semantics research group at Xerox Research Centre Europe near Grenoble, France.
The analysis allows the software to recognise that references to "Bill Gates", "he" and "the head of Microsoft" in the same document likely refer to the same person. The software should also be able to tell that "Bill Gates said ..." and "A friend of Bill Gates said ..." introduce statements by different people, a distinction that search engines relying on keyword analysis alone would likely miss, returning irrelevant results.
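The gap between keyword matching and attribution can be illustrated with a toy Python sketch. It is not how Factspotter works: the function names and the "everything before 'said' is the speaker" rule are assumptions made for illustration, standing in for the grammatical analysis the researchers describe.

```python
import re
from typing import Optional


def keyword_match(sentence: str, name: str) -> bool:
    """Keyword search: true whenever the name appears anywhere in the sentence."""
    return name.lower() in sentence.lower()


def speaker_of(sentence: str) -> Optional[str]:
    """Toy heuristic (an assumption, not a real parser): treat the whole
    phrase from the start of the sentence up to the first 'said' as the speaker."""
    m = re.match(r"\s*(.+?)\s+said\b", sentence)
    return m.group(1) if m else None


def said_by(sentence: str, name: str) -> bool:
    """True only when the full subject phrase of 'said' is exactly the name."""
    speaker = speaker_of(sentence)
    return speaker is not None and speaker.lower() == name.lower()


direct = "Bill Gates said the launch went well."
indirect = "A friend of Bill Gates said the launch went well."

for text in (direct, indirect):
    print(text, "| keyword:", keyword_match(text, "Bill Gates"),
          "| attributed:", said_by(text, "Bill Gates"))
```

Both sentences match the keyword query, but only the first is attributed to Bill Gates. A real system would need a full parse to handle pronouns, passive constructions and descriptions such as "the head of Microsoft".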
One of the first groups to use Factspotter will be Xerox Litigation Services, which next year will build it into a suite of "e-discovery" software for the legal profession, Segond said. In the discovery phase of a lawsuit, where legal teams must often sift through millions of e-mail messages and other documents, the software could be used to identify the sender and recipients of messages, and pick out information about events and dates from them. These features could be used to form a picture of who knew what, and when, in order to build a solid legal case, she said.
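As a rough illustration of the simplest part of that task, the sketch below reads the sender, recipients and sending date from a message's headers using Python's standard `email` package. The sample message and the `summarize` helper are hypothetical, and the sketch says nothing about how the Xerox software extracts events and dates from the body text, which requires the linguistic analysis described above.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample message, invented for illustration.
RAW = """\
From: Alice Example <alice@example.com>
To: Bob Example <bob@example.com>, carol@example.com
Date: Wed, 20 Jun 2007 22:00:00 +0200
Subject: Contract draft

The revised draft is attached.
"""


def summarize(raw: str) -> dict:
    """Return who sent a message, to whom, and when, from its headers only."""
    msg = message_from_string(raw)
    senders = getaddresses(msg.get_all("From", []))
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return {
        "sender": senders[0][1],                    # address part of the first From entry
        "recipients": [addr for _, addr in recipients],
        "sent": parsedate_to_datetime(msg["Date"]),  # timezone-aware datetime
    }


info = summarize(RAW)
print(info["sender"], "->", info["recipients"], "at", info["sent"].isoformat())
```

Sorting such records by sender and date gives a first, crude timeline of who wrote to whom, which could then be combined with information drawn from the message text.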
Segond's research team developed their own metalanguage to describe the grammars of different human languages. So far, they have used it to build descriptions of Dutch, English, French, German, Italian, Portuguese and Spanish. A joint Fujitsu-Xerox research team has also used it to describe Japanese grammar, showing that it can be used for languages using other writing systems.
Factspotter itself is written in the C programming language, and the researchers have also developed modules in Java and Python, allowing the software to interface with other applications.
Although the software only analyses written language, it can be linked with audio transcription tools in order to search radio and TV archives, and the company is involved in joint research projects to do just that, Segond said.