Sunday, August 17th, 2008...2:15 pm

RelEx Crawler: Natural Language Processing Web Crawler and HyperGraphDB Manager

This is the second of three posts wrapping up my experiences with Google Summer of Code 2008 and the Singularity Institute for Artificial Intelligence.

The RelEx Crawler was the heart of my project. The summary proposal is here on Google’s site and my full project proposal is here on the OpenCog wiki.

Given an input URL and a number of pages to crawl, the crawler runs the text of each page through the RelEx semantic relationship extractor and writes the results in a variety of formats before moving on to the next page. The SIAI people wanted an NLP crawler based on the popular ‘Nutch’ crawler, but unfortunately the Nutch Java client was functionally useless, so there was no good way of combining the two products. As a result, this crawler is new (though based very heavily on a guide put out by Sun).
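For readers who want a feel for the basic loop, here is a minimal sketch of a page-limited, breadth-first crawl. It is not the project's actual code: the class name CrawlLoopSketch, the Page record and the fetch and process parameters are all hypothetical stand-ins (fetch for downloading a page and stripping its HTML, process for handing the text to RelEx and writing the output).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical sketch of a page-limited, breadth-first crawl loop. */
class CrawlLoopSketch {

    /** A fetched page: its plain text and the links found on it. */
    record Page(String text, List<String> links) {}

    /**
     * Visits at most maxPages pages, starting from startUrl.
     * fetch stands in for downloading and HTML stripping (assumption);
     * process stands in for RelEx parsing and output (assumption).
     * Returns the URLs in the order they were processed.
     */
    static List<String> crawl(String startUrl, int maxPages,
                              Function<String, Page> fetch,
                              Consumer<String> process) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        frontier.add(startUrl);
        seen.add(startUrl);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.remove();
            Page page = fetch.apply(url);
            if (page == null) {
                continue; // fetch failed or page unknown: skip it
            }
            process.accept(page.text());
            visited.add(url);
            for (String link : page.links()) {
                if (seen.add(link)) {   // true only for links not queued before
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" so the sketch runs without network access.
        Map<String, Page> fakeWeb = Map.of(
            "a", new Page("Page A.", List.of("b", "c")),
            "b", new Page("Page B.", List.of("a", "d")),
            "c", new Page("Page C.", List.of()),
            "d", new Page("Page D.", List.of()));
        List<String> order = crawl("a", 3, fakeWeb::get, text -> { });
        System.out.println(order); // prints [a, b, c]
    }
}
```

Keeping the fetch and process steps as parameters means the loop can be tried out against an in-memory map, as in main above, before pointing it at a real site.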

The signal-to-noise ratio is a problem when using the entire web as a corpus, so this project was designed with specific knowledge bases in mind. Most of my testing was done on Wikipedia, and there are a few Wikipedia-specific tweaks to stop the crawler from escaping into the open web through external links and from crawling edit history pages, user pages and the like. Wikipedia also has the advantage of being available under the GNU Free Documentation License, meaning that it can be freely redistributed, which avoids any copyright unpleasantness that might arise from redistributing semantically-parsed material from other, non-free knowledge bases. I can’t imagine Microsoft would be too happy if you were freely redistributing a processed version of their Microsoft Developer Network!
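To illustrate the kind of Wikipedia-specific filtering described above, here is a rough sketch of a link filter. The rules are my guess at a reasonable heuristic, not the crawler's actual ones: it assumes articles live under /wiki/, that history and edit views are reached through query strings, and that namespace pages such as User: or Talk: have a colon in the title. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical link filter that keeps a crawl on one wiki's article pages. */
class WikipediaLinkFilter {

    /**
     * Returns true if the absolute link points at an article on allowedHost.
     * Relative links have no host and are rejected, so resolve them first.
     */
    static boolean shouldFollow(String link, String allowedHost) {
        URI uri;
        try {
            uri = new URI(link);
        } catch (URISyntaxException e) {
            return false; // malformed link: ignore it
        }
        String host = uri.getHost();
        if (host == null || !host.equalsIgnoreCase(allowedHost)) {
            return false; // external link: stay out of the open web
        }
        if (uri.getQuery() != null) {
            return false; // history, edit and diff views use query strings
        }
        String path = uri.getPath();
        if (path == null || !path.startsWith("/wiki/")) {
            return false; // not an article-style URL
        }
        String title = path.substring("/wiki/".length());
        // Assumption: a colon marks a namespace page (User:, Talk:, Special:).
        // This also rejects the few real articles with a colon in the title.
        return !title.isEmpty() && !title.contains(":");
    }

    public static void main(String[] args) {
        String host = "simple.wikipedia.org";
        System.out.println(shouldFollow("https://simple.wikipedia.org/wiki/Battle_of_Hastings", host)); // prints true
        System.out.println(shouldFollow("https://simple.wikipedia.org/wiki/User:Example", host));       // prints false
    }
}
```

A filter like this would be applied to each link before it is queued, so rejected pages are never fetched at all.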

The crawler can output six different formats: simple relations, RelXML, openCogXML, RelEx Compact Output, the ARC archival format and HyperGraphDB.

The RelEx Compact Output is a new XML-based format created specifically for this project. The problem with crawling and parsing large amounts of data is that the annotation markup blows up the space requirement by at least a factor of 15, which can mean storage requirements in the hundreds of gigabytes for large corpora, so the compact output tries to minimize the size of the output data. There is also a need to record other relevant information, such as the date parsed and the versions of the software used. The compact format carries this metadata and, coupled with the gzipping provided by ARC, keeps the output files small.
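To show why gzipping helps so much with repetitive annotation markup, here is a small sketch using the standard java.util.zip classes. It does not reproduce the Compact Output or ARC formats. The class name and the sample markup in the usage note are made up for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper that gzips markup in memory and reads it back. */
class CompactGzipSketch {

    /** Compresses text as UTF-8 bytes wrapped in a gzip stream. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed here, so the trailer has been written.
        return buffer.toByteArray();
    }

    /** Reverses gzip(): decompresses the bytes and decodes them as UTF-8. */
    static String gunzip(byte[] data) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Feeding it a few hundred copies of an invented line such as <relation name="_subj" head="win" dep="William"/> gives a result far smaller than the input, because the tag and attribute names repeat on every line. Real parsed text is more varied, so the savings there will be smaller.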

The other interesting output is the HyperGraphDB output. For more information on hypergraphs, check out the Wikipedia article, and for information on HyperGraphDB, check out the developing company’s webpage.

The crawler can output a specific type of HyperGraph which I’ve called a RelationGraph. There are two types of value links in a RelationGraph: PropertyLinks and RelationLinks. Using this RelationGraph, we can store and query semantic and grammatical relationships between concepts, even those which appear on different web pages. This also lets us visualize NLP data using the HyperGraphDB viewer, which I’ll go into further in the next post.
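As a rough mental model only, the two link types can be pictured with plain Java collections. This sketch does not use the HyperGraphDB API, and the field layout is an assumption: a RelationLink joins two concepts under a relation name, and a PropertyLink attaches a named value to one concept. The relation and property names are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a RelationGraph (no HyperGraphDB). */
class RelationGraphSketch {

    /** Assumed shape: a named relation between two concepts. */
    record RelationLink(String name, String head, String dependent) {}

    /** Assumed shape: a named property value attached to one concept. */
    record PropertyLink(String name, String concept, String value) {}

    private final List<RelationLink> relations = new ArrayList<>();
    private final List<PropertyLink> properties = new ArrayList<>();

    void addRelation(String name, String head, String dependent) {
        relations.add(new RelationLink(name, head, dependent));
    }

    void addProperty(String name, String concept, String value) {
        properties.add(new PropertyLink(name, concept, value));
    }

    /** All relations in which the concept appears on either side. */
    List<RelationLink> relationsInvolving(String concept) {
        List<RelationLink> hits = new ArrayList<>();
        for (RelationLink link : relations) {
            if (link.head().equals(concept) || link.dependent().equals(concept)) {
                hits.add(link);
            }
        }
        return hits;
    }

    /** All properties attached to the concept. */
    List<PropertyLink> propertiesOf(String concept) {
        List<PropertyLink> hits = new ArrayList<>();
        for (PropertyLink link : properties) {
            if (link.concept().equals(concept)) {
                hits.add(link);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        RelationGraphSketch graph = new RelationGraphSketch();
        // Links from two different pages end up in the same graph.
        graph.addRelation("_subj", "win", "William");
        graph.addRelation("_obj", "win", "battle");
        graph.addRelation("_subj", "die", "Harold");
        graph.addProperty("tense", "win", "past");
        System.out.println(graph.relationsInvolving("win").size()); // prints 2
    }
}
```

Because every page's links land in one shared structure, a query for a concept picks up relations no matter which page they were parsed from.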

hyperpedia.jpg

I’d like to take some time here to thank Linas Vepstas, who is the undisputed king of natural language processing in AI! I would be nowhere without his help and his hard work. Linas, thanks ever so much for everything you’ve done for me. I’ve got some other people to thank too, but I’ll get to that in the third post.

The crawler source (in Java) is located here, along with the compile script. Just dump /crawler/ into your RelEx source path and change build.xml. You’ll probably have to get the HTMLParser library to properly strip the HTML tags from the text. I also used Scott Piao’s Sentence Detector, but I don’t think it’s actually necessary any more.

If you’d like to play with a RelationGraph, here is one containing 3 parsed pages from Simple English Wikipedia, starting at The Battle of Hastings.

Stay tuned for the third post in this series, “Visualizing Natural Language Processing Data and Extracting Conceptual Relationships”!

Rich!

