Sunday, August 17th, 2008...2:15 pm

RelEx Crawler: Natural Language Processing Web Crawler and HyperGraphDB Manager


This is the second of three posts wrapping up my experiences with Google Summer of Code 2008 and the Singularity Institute for Artificial Intelligence.

The RelEx Crawler was the heart of my project. The summary proposal is here on Google’s site and my full project proposal is here on the OpenCog wiki.

Given an input URL and a number of pages to crawl, the crawler runs the text of each page through the RelEx semantic relationship extractor and writes the results in a variety of formats before moving on to the next page. The SIAI people wanted an NLP crawler based on the popular ‘Nutch’ crawler, but unfortunately the Nutch Java client was functionally useless, so there was no good way of combining the two products. As a result, this crawler is new (though based very heavily on a guide put out by Sun).
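To make the "start URL plus page limit" idea concrete, here is a minimal sketch of such a crawl loop. It is not the actual crawler source: the class name `CrawlLoopSketch`, the `fetch` and `process` parameters, and the regex-based link extraction are all assumptions made for illustration. A real crawler would use an HTML parser and resolve relative links.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a page-limited, breadth-first crawl loop. */
final class CrawlLoopSketch {
    // Naive link extraction, for illustration only.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    /**
     * Visits at most maxPages pages, starting from startUrl.
     * fetch returns the HTML for a URL (or null on failure);
     * process stands in for "parse the text and write the output".
     */
    static List<String> crawl(String startUrl, int maxPages,
                              Function<String, String> fetch,
                              Consumer<String> process) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        frontier.add(startUrl);
        seen.add(startUrl);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            String html = fetch.apply(url);
            if (html == null) {
                continue; // fetch failed, try the next queued page
            }
            process.accept(html);
            visited.add(url);
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (seen.add(link)) { // queue each link only once
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" so the sketch runs without network access.
        Map<String, String> web = Map.of(
            "a", "<a href=\"b\">b</a> <a href=\"c\">c</a>",
            "b", "<a href=\"a\">a</a> <a href=\"d\">d</a>",
            "c", "no links",
            "d", "no links");
        System.out.println(crawl("a", 3, web::get, html -> { })); // prints [a, b, c]
    }
}
```

Passing the fetcher and the per-page processing step in as functions keeps the loop itself independent of both the network and the parser, which also makes it easy to try out on canned pages.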

The signal-to-noise ratio is a problem when using the entire web as a corpus, so this project was designed with specific knowledge bases in mind. Most of my test cases were done using Wikipedia, and there are also a few Wikipedia-specific tweaks to stop the crawler from escaping into the open web through external links and from crawling edit history pages, user pages and the like. Wikipedia also has the advantage of being available under the GNU Free Documentation License, meaning that it can be freely redistributed. That avoids any copyright unpleasantness which might arise from redistributing semantically-parsed material from other, non-free knowledge bases. I can’t imagine Microsoft would be too happy if you were freely redistributing a processed version of their Microsoft Developer Network!
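The kind of Wikipedia-specific filtering described above might look roughly like the following sketch. The class name, the method `shouldCrawl` and the list of skipped namespace prefixes are illustrative assumptions, not the crawler's real rules. The sketch relies only on the fact that Wikipedia serves articles under `/wiki/` and history views through `/w/index.php`.

```java
import java.net.URI;
import java.util.List;

/** Hypothetical link filter that keeps a crawl inside one wiki's article pages. */
final class WikiLinkFilterSketch {
    // Namespace prefixes to skip; an illustrative list, not the crawler's own.
    private static final List<String> SKIPPED_PREFIXES =
        List.of("User:", "User_talk:", "Talk:", "Special:");

    static boolean shouldCrawl(String startUrl, String candidateUrl) {
        URI start;
        URI candidate;
        try {
            start = URI.create(startUrl);
            candidate = URI.create(candidateUrl);
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
        String host = candidate.getHost();
        if (host == null || !host.equalsIgnoreCase(start.getHost())) {
            return false; // external link, would lead out into the open web
        }
        String path = candidate.getPath();
        if (path == null || !path.startsWith("/wiki/")) {
            return false; // e.g. /w/index.php?...&action=history
        }
        String title = path.substring("/wiki/".length());
        for (String prefix : SKIPPED_PREFIXES) {
            if (title.startsWith(prefix)) {
                return false; // user pages, talk pages and similar
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String start = "https://simple.wikipedia.org/wiki/Battle_of_Hastings";
        System.out.println(shouldCrawl(start, "https://simple.wikipedia.org/wiki/Normans"));      // true
        System.out.println(shouldCrawl(start, "https://simple.wikipedia.org/wiki/User:Example")); // false
        System.out.println(shouldCrawl(start, "https://example.com/"));                           // false
    }
}
```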

The crawler can output six different formats: simple relations, RelXML, openCogXML, RelEx Compact Output, the ARC archival format and HyperGraphDB.

The RelEx Compact Output is a new XML-based format created specifically for this project. The problem with crawling and parsing large amounts of data is that the annotation markup blows up the space requirement by at least a factor of 15, which can mean storage requirements in the hundreds of gigabytes for large corpora, so the compact output tries to minimize the size of the output data. There is also a need to include other relevant information, such as the date parsed and the versions of the software used. This format does both and, coupled with the gzipping provided by ARC, minimizes the output file size.
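The sketch below illustrates the general idea of bundling parse metadata with the output and gzipping the result. The record layout here (plain `key=value` header lines followed by the body) is invented purely for illustration. It is neither the real RelEx Compact Output schema nor the ARC format, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: metadata header plus parse output, gzip-compressed. */
final class GzipRecordSketch {
    /** Prepends made-up metadata lines to the body and gzips the whole record. */
    static byte[] compress(String url, String parsedOn, String parserVersion, String body) {
        String record = "url=" + url + "\n"
            + "parsed=" + parsedOn + "\n"
            + "parser-version=" + parserVersion + "\n"
            + "\n"
            + body;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(record.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reverses compress(), returning the header lines and body as one string. */
    static String decompress(byte[] gzipped) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Repetitive markup, standing in for annotation-heavy parser output.
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            body.append("<relation name=\"_subj\" head=\"win\" dep=\"William\"/>\n");
        }
        byte[] packed = compress("http://example.org/page", "2008-08-17", "0.0-example",
                                 body.toString());
        System.out.println("raw bytes:     " + body.length());
        System.out.println("gzipped bytes: " + packed.length);
    }
}
```

Annotation markup is highly repetitive, so general-purpose compression recovers much of the space that the markup adds.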

The other interesting output is the HyperGraphDB output. For more information on hypergraphs, check out the Wikipedia article, and for information on HyperGraphDB, check out the developing company’s webpage.

The crawler can output a specific type of HyperGraph which I’ve called a RelationGraph. There are two types of value links in a RelationGraph: PropertyLinks and RelationLinks. Using this RelationGraph, we can store and query semantic and grammatical relationships between concepts, even those which appear on different web pages. This also lets us visualize NLP data using the HyperGraphDB viewer, which I’ll go into further in the next post.
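As a rough mental model only, the two link types can be pictured with the plain in-memory Java sketch below. It deliberately does not use the HyperGraphDB API, and the field names, as well as the reading of a PropertyLink as "concept has feature value" and a RelationLink as "relation between two concepts", are assumptions based on RelEx-style output such as `tense(win, past)` and `_subj(win, William)`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a RelationGraph; not the HyperGraphDB API. */
final class RelationGraphSketch {
    /** Attaches a feature value to one concept, e.g. tense(win, past). */
    static final class PropertyLink {
        final String concept, property, value;
        PropertyLink(String concept, String property, String value) {
            this.concept = concept;
            this.property = property;
            this.value = value;
        }
    }

    /** Relates two concepts, e.g. _subj(win, William), and records the source page. */
    static final class RelationLink {
        final String relation, head, dependent, sourceUrl;
        RelationLink(String relation, String head, String dependent, String sourceUrl) {
            this.relation = relation;
            this.head = head;
            this.dependent = dependent;
            this.sourceUrl = sourceUrl;
        }
    }

    private final List<PropertyLink> properties = new ArrayList<>();
    private final List<RelationLink> relations = new ArrayList<>();

    void addProperty(String concept, String property, String value) {
        properties.add(new PropertyLink(concept, property, value));
    }

    void addRelation(String relation, String head, String dependent, String sourceUrl) {
        relations.add(new RelationLink(relation, head, dependent, sourceUrl));
    }

    /** All property links attached to the given concept. */
    List<PropertyLink> propertiesFor(String concept) {
        List<PropertyLink> hits = new ArrayList<>();
        for (PropertyLink p : properties) {
            if (p.concept.equals(concept)) hits.add(p);
        }
        return hits;
    }

    /** All relation links mentioning the concept, whichever page they came from. */
    List<RelationLink> relationsFor(String concept) {
        List<RelationLink> hits = new ArrayList<>();
        for (RelationLink r : relations) {
            if (r.head.equals(concept) || r.dependent.equals(concept)) hits.add(r);
        }
        return hits;
    }

    public static void main(String[] args) {
        RelationGraphSketch graph = new RelationGraphSketch();
        graph.addRelation("_subj", "win", "William", "http://example.org/page1");
        graph.addRelation("_obj", "crown", "William", "http://example.org/page2");
        graph.addRelation("_subj", "lose", "Harold", "http://example.org/page1");
        graph.addProperty("win", "tense", "past");
        for (RelationLink r : graph.relationsFor("William")) {
            System.out.println(r.relation + "(" + r.head + ", " + r.dependent + ") from " + r.sourceUrl);
        }
    }
}
```

Because every link lives in one shared store and carries its source page, a query for a concept naturally returns relationships gathered from different pages.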

hyperpedia.jpg

I’d like to take some time here to thank Linas Vepstas, who is the undisputed king of natural language processing in AI! I would be nowhere without his help and his hard work. Linas, thanks ever so much for everything you’ve done for me. I’ve got some other people to thank too, but I’ll get to that in the third post.

The crawler source (in Java) is located here, along with the compile script. Just dump /crawler/ into your RelEx source path and change build.xml. You’ll probably have to get the HTMLParser library to properly strip the HTML tags from the text. I also used Scott Piao’s Sentence Detector, but I don’t think it’s actually necessary any more.

If you’d like to play with a RelationGraph, here is one containing 3 parsed pages from Simple English Wikipedia, starting at The Battle of Hastings.

Stay tuned for the third post in this series, “Visualizing Natural Language Processing Data and Extracting Conceptual Relationships”!

Rich!

