Tuesday, August 19th, 2008...8:32 pm

Visualizing Natural Language Processing Data and Extracting Conceptual Relationships

Jump to Comments

This is the final of three posts wrapping up my experiences with Google Summer of Code 2008 and the Singularity Institute for Artificial Intelligence. (The other two are here and here.)

In this post, I will show how to visualize NLP parsed data using Wikipedia as an example.

As I mentioned in the previous post, the RelEx Crawler can output a HyperGraphDB.

To view the HyperGraph, we will be using the HGViewer package of HyperGraphDB. The version in the source repository when I downloaded it was out of date and broken, so I updated the code so it would compile. You can download that compiled JAR file here, and a tarball of the updated source here.

To use the viewer, we will be using the Scriba environment. Scriba is made by Kobrix Software, who also make HyperGraphDB, so the two integrate very well (in fact, Scriba relies on HGDB). Scriba is a ‘notebook environment’ which allows internal Java scripting with BeanShell, letting us execute sections of Java code inside the document on the fly, which is really quite handy. It’s still buggy and can be a little finicky to set up but it does what we need it to.

After using the Crawler to parse a section of the web, you will have a folder with the starting the first url. Get the HGEnvironment and load this into Scriba and start visualizing. If you want a demo, play around with this Scriba document and this small relation graph.

Because Scriba allows in-document scripting, you create a Crawler, give it a target, crawl, parse, get the resulting RelationGraph and visualize it all in the same workspace!

Have a look at some pictures beyond the jump..

Ok, take a look at some of the resulting pictures! If you want to know more about what the parsed data actually means, this page and this page on the openCog wiki will give you a good idea of what this is all about.

This is ‘England’ with a depth of 1:

england1depth.jpg

And this is ‘England’ with a depth of 3. At that depth, it gets crowded quite quickly if you don’t zoom out. Notice the selected yellow blob in the bottom right corner!:

NLP

These are some earlier tests showing all of a different HG which crawled a larger chunk (~50 pages) of English Wikipedia. My limiting factor is actually hard-drive space (I should really get an external HDD), but there’s no reason that this couldn’t do thousands of pages.

hyperpedia1.jpg

And zoomed to fit:
hype2.jpg

And this one is just to show the nice forms that appear if you choose to view some of the most common elements of a RelationGraph, such as this case, the concept of noun:

noun.jpg.

Now, this looks pretty cool and is handy for figuring out the structure of our data, but what’s much more interesting is what we can do with it once we understand it. We have a large collection of interconnected, semantically processed data, so hopefully we should actually be able to extract some conceptual relationships from it.

This is an interesting problem. HyperGraphDB contains a fair set of query conditions, which can be combined to make interesting subsets of the graph. Most useful are the TargetCondition, which allow you trace back the links coming into an element, and the classic AND/NOT/OR conditions.

It also provides access to some searches and some classic path algorithms, but only Djikstra’s algorithm is implemented. This is handy to find the shortest path between two elements. For three elements, this can be daisy-chained to find path(a,c) by finding path(a,b) and path(b,c). But what about for more elements? Then it gets more tricky.

I started working on this a few ways. One of them was just to keep using Dijkstra’s algorithm on all of the different possible element combinations, but that’s horribly inefficient. Next I made the multiDijkstra algorithm, which returns a path from a start point to any endpoint in an array if all of the other elements have been found. Even so, the results were inconsistent a lot of the time and it wasn’t exactly what I was after anyway.

What I’d really like to be able find is the smallest connected component of a graph given a collection of points to include, then find the low-arity neighbors. I haven’t done enough research into graph theory to find the best way to do it. Do you know how? I asked a friend about this and she pointed out that surely this is how Google Maps works. I should ask them!

Now, the fun bit comes once we actually have the resulting subset. Remember that this is all NLP data, so, theoretically, we should be able to piece it back together to form valid new sentences!

Turning parsed data back into valid data is a big problem, but the SIAI has somebody working on it. There is already a crude implementation (warning: 100M file), but it is uncommented code and there is no documentation right now. Last time I talked to Ben Goertzel, supreme leader of all things AI, he was working on porting the documentation into the OpenCog wiki, but I don’t know if that ever actually happened (he’s a very busy man). Once that happens though, I’ll see if I can turn query results into real sentences.

So what’s the practical application of this? Well, with enough pre-parsed text and processing power, it could ideally become a competitor to NLP search engines like Cognition and the fancy but overhyped Powerset. Or, hopefully, the computer from Star Trek. Of course, this is still a narrow AI application. Hopefully some of the same ideas here could be used in a general AI application, though. If you’d like to know more about AGI, I’d strongly suggest you take a peek at OpenCogPrime, which is the embodiment of all of the progress the OpenCog team has made so far.

I’m wrapping up now, but I definitely need to thank a few people first. Thank you, Linas Veptas for your work on RelEx and for helping me with my problems and expanding my ideas about what could be done with the crawler! Thank you, Borislav Iordanov for creating HyperGraphDB helping me every step of the way! Thank you Ben Goertzel for creating this venture! And thank you to David Hart for being my mentor this summer! Also, a final thanks to everybody in the #opencog room on Freenode who helped me over these months and let me play with my robot in there.

Okay, I think that’s it! (Phew!)
Until next year (hopefully!),
Rich!


Stumble! | Save This Page! | Add to Technorati Favorites

3 Comments

  • I’m pretty sure that CLR has a multiple shortest path algorithm that is more efficient than Dijkstra’s run over and over. I’ll have CLR on monday. Email me if you haven’t already figured this out and I’ll take a look.

  • CLR.. .NET? I’d be really interested in anything you can find about this, but you didn’t leave your email!

  • you'llknowifyousee$email
    October 2nd, 2008 at 6:26 pm

    Is that code GPL? if so, feed the beasts…

Leave a Reply