Archive for December, 2009

Learning

21 Dec 2009

Consider how we humans learn language. Even with formal education, it takes a child about 15 years starting from infancy to be able to read and understand general news articles in the New York Times. Over this period, one would probably hear or read at least on the order of 10 billion words. Even so, most high schoolers will need many additional years of schooling to become able to comprehend technical material.

So, how can anyone expect a computer to understand something like medical text after training on only about 100 million words of data? A computer of course runs on nanosecond cycles while the human brain operates on millisecond cycles; but we have had about 50,000 generations to evolve our language software, while the electronic computer has had only about 10 generations.

The bottom line here is that language learning is difficult, and it requires sifting through immense amounts of data. There is probably no magic technological shortcut, but we have now reached the stage where our systems can routinely handle the volumes of data that would support semantic capabilities equivalent to an 8th-grade education. Decent commercial language processing tools are also now available.

Consequently, we are making major progress on semantic dictionaries, but have to be realistic about the work still ahead of us. Expect no overnight miracles from us or anyone else, especially when these are based on measly samples of data. There is still no royal road to semantics.

Testing our SemanticHacker WordPress plugin has some similarities to testing foof, our Firefox extension, in that we are testing within another application. As with testing Firefox extensions, WordPress plugin testing must include testing on multiple operating systems and multiple versions of Firefox, and it adds the need to test on additional browsers. Because WordPress has been releasing frequent updates, we’ve had to focus attention on how to quickly verify our plugin on each WordPress upgrade. As a result, we have two major types of testing for our WordPress plugin: testing a new release of the plugin and verifying our plugin in a new WordPress release.

Regardless of which type of test sequence we’re on, there are some things that we always have to test. We need to validate all supported browser and OS combinations and we need to test all functionality of the SemanticHacker plugin. This functionality includes the ability to use text in a blog post to find relevant content links, tags, webpage links, and products.

When testing a new release of our WordPress plugin, we have two user paths we need to test: an update of an older plugin release and a fresh install of the new version of the plugin. We run our tests along both paths on all versions of WordPress that we support. Of course, if there are new features or bug fixes, we need to add test cases to cover them.

When there is a new WordPress release, we also consider two paths by which our plugin can appear in that version of WordPress. One is that an existing instance of WordPress with the SemanticHacker plugin is upgraded to the new version. The other is that our plugin is installed fresh on the version being tested. All tests are run on the new version of WordPress following both possible paths. Assuming the new WordPress release passes our tests, we add that version to our list of supported WordPress releases. At the same time, we determine whether there are older versions on the list that are now too little used to be worth testing.
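A minimal sketch of how the two test sequences described above might be enumerated. The platform list, WordPress version numbers, and function names are all hypothetical placeholders chosen for illustration; the post does not list the actual supported combinations or describe a test harness.

```python
from itertools import product

# Hypothetical (browser, operating system) pairs; the real supported list is not given.
SUPPORTED_PLATFORMS = [
    ("Firefox", "Windows"),
    ("Firefox", "Mac OS X"),
    ("Firefox", "Linux"),
    ("Internet Explorer", "Windows"),
    ("Safari", "Mac OS X"),
]

# Hypothetical list of WordPress releases currently supported by the plugin.
SUPPORTED_WORDPRESS_VERSIONS = ["2.8", "2.9"]

# The two paths described in the text: upgrading an existing setup, or installing fresh.
INSTALL_PATHS = ["upgrade", "fresh install"]


def build_test_matrix(platforms, wordpress_versions, paths):
    """Return one test configuration per platform, WordPress version, and path."""
    return [
        {"browser": browser, "os": os_name, "wordpress": version, "path": path}
        for (browser, os_name), version, path in product(
            platforms, wordpress_versions, paths
        )
    ]


def matrix_for_new_plugin_release():
    """New plugin release: cover every supported WordPress version."""
    return build_test_matrix(
        SUPPORTED_PLATFORMS, SUPPORTED_WORDPRESS_VERSIONS, INSTALL_PATHS
    )


def matrix_for_new_wordpress_release(new_version):
    """New WordPress release: cover only that version, along both paths."""
    return build_test_matrix(SUPPORTED_PLATFORMS, [new_version], INSTALL_PATHS)
```

Keeping the supported platforms as explicit pairs, rather than crossing every browser with every operating system, avoids generating combinations that cannot exist.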

A semantic dictionary in essence consists of triples [ t , d , w ] , where t is a term, d is a semantic dimension, and w is a weight. Each triple says that the occurrence of a term t in a document constitutes a raw vote of w for dimension d being relevant to the document. For example, [ BRAD , Arts/People/Jolie,_Angelina , 0.12100 ] indicates that the occurrence of BRAD in a news story provides evidence that it might be about the movie celebrity Angelina Jolie. If it were conclusive evidence, the weight would be 1.00000, but we never expect any single term to be that definitive.
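The triple structure can be illustrated with a short sketch. Only the BRAD triple comes from the text above; the other entries are invented for illustration, and summing the raw votes per dimension is an assumption, since the text does not say how votes are combined.

```python
from collections import defaultdict

# Each entry is a triple (term, dimension, weight).
DICTIONARY = [
    ("BRAD", "Arts/People/Jolie,_Angelina", 0.12100),      # from the text
    ("BRAD", "Hardware/Fasteners", 0.05000),               # hypothetical
    ("ANGELINA", "Arts/People/Jolie,_Angelina", 0.30000),  # hypothetical
]


def raw_votes(tokens, dictionary):
    """Accumulate raw votes per dimension for the terms found in a document.

    Assumption: votes for the same dimension are simply added together.
    """
    # Index the triples by term for quick lookup.
    index = defaultdict(list)
    for term, dimension, weight in dictionary:
        index[term].append((dimension, weight))

    votes = defaultdict(float)
    for token in tokens:
        # Dictionary terms are upper case in this sketch, so normalize tokens.
        for dimension, weight in index.get(token.upper(), []):
            votes[dimension] += weight
    return dict(votes)


votes = raw_votes("Brad and Angelina".split(), DICTIONARY)
```

Here `votes` maps each dimension touched by the document to its accumulated raw vote; words without dictionary entries, such as "and", contribute nothing.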

In building a dictionary for an application, we have to start with the dimensions. Do we have the kinds of dimensions to cover the target content, and are there enough dimensions to make the distinctions in content required by the application? For a patent information system, an Angelina dimension may not have much relevance, and even something more appropriate like Electrical Machinery may have to be divided up into multiple dimensions to support a reasonable level of granularity in indexing.

Given the dimensions, we next have to define the terms to go along with them. The target content we want to process will have a certain vocabulary, and our dictionary terms should try to encompass most of it. This can be tricky in a statistical approach because we need reasonably large samples of training data to make a particular term become associated with a particular dimension.

Weights are determined in large part by training data, but the distribution of those numbers is important. To begin with, not all weights should be the same, and generally, we want to see them spread out over the entire dynamic range available to us. Weights that are quite small or quite large have to be supported by more data than those in the middle range. Weights have to be balanced between dimensions, and there should be enough of them so that most terms are related to more than one dimension.
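Two of these properties, the spread of the weights and the share of terms tied to more than one dimension, are easy to check mechanically. The following is a rough sketch with made-up sample triples and a hypothetical function name; it is not a description of how weights are actually evaluated.

```python
from collections import defaultdict

# Made-up sample triples (term, dimension, weight) for illustration only.
SAMPLE_TRIPLES = [
    ("BRAD", "Arts/People/Jolie,_Angelina", 0.121),
    ("BRAD", "Hardware/Fasteners", 0.05),
    ("ANGELINA", "Arts/People/Jolie,_Angelina", 0.3),
]


def dictionary_diagnostics(triples):
    """Summarize weight spread and term ambiguity for a non-empty dictionary."""
    weights = [weight for _, _, weight in triples]

    # Collect the set of dimensions each term votes for.
    dimensions_per_term = defaultdict(set)
    for term, dimension, _ in triples:
        dimensions_per_term[term].add(dimension)

    multi = sum(1 for dims in dimensions_per_term.values() if len(dims) > 1)
    return {
        "min_weight": min(weights),
        "max_weight": max(weights),
        # Fraction of terms related to more than one dimension.
        "multi_dimension_share": multi / len(dimensions_per_term),
    }


report = dictionary_diagnostics(SAMPLE_TRIPLES)
```

A narrow gap between the minimum and maximum weight, or a low multi-dimension share, would suggest the dictionary falls short of the goals described above.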

In theory, we could build a dictionary with just one weight in just one dimension for each term, but that would be in denial about the inherent ambiguity of language. So, we typically want a dictionary to be as big as possible, based on an appropriate amount of training data. To build the best possible dictionary requires much inspiration and much perspiration.

Web 3.0 Conference – January 26-27, Santa Clara, CA

We are pleased to announce that our CEO, Connie Kenneally, will be co-hosting the session “The Evolution of Semantic Search” on Day Two of the Web 3.0 Conference with Mark Johnson, Senior Program Manager at Bing. See below for a registration discount code.

About the Web 3.0 Conference
The emergence of a new era of technologies, collectively known as Web 3.0, provides a strategically significant opportunity to make businesses run better. Also known as the semantic web or linked data, Web 3.0 is a web in which data is linked to allow for more meaningful, actionable insight to be extracted. At the conference, we will explore how companies are using these technologies today, and should be using them tomorrow, for significant bottom line impact in areas like marketing, corporate information management, publishing, search, customer service, and personal productivity. Use code W3SPKR and save 20%! Register: www.web3event.com

A new book has been published by author and software developer Jose Sandoval (http://www.josesandoval.com) titled RESTful Java Web Services. A detailed overview of the book can be found on Javabeat.net.

TextWise is particularly excited about this book since Chapter 3 “shows you how to develop a mashup application that uses RESTful web services that connect to Google, Yahoo!, Twitter, and TextWise’s SemanticHacker API. It also covers in detail what it takes to consume JSON objects using JavaScript.”

We wish the best of luck to Jose on his new book! You can follow Jose on Twitter at http://twitter.com/josesandoval

Even in the world of print, one dictionary is often not enough. Just for English, for example, we can go to standard references like Webster’s Third New International, The American Heritage Dictionary of the English Language, or the Oxford English Dictionary, as well as more specialized lexicons. So how many semantic dictionaries do we really need?

That of course depends on the application. If we are in the situation where our target text data is extremely stable and requires only a general vocabulary, then we might get away with a single semantic dictionary based on a large sample of data processed quite carefully. On the Web, however, we have nothing of the sort, if you haven’t noticed lately.

A sophisticated dictionary that took weeks to build with hairy mathematical algorithms on a reasonable sample of training text may become obsolete overnight. That is not to say that sophisticated dictionaries are unhelpful; but in the merciless competition of the information marketplace, we probably need to be able to pop out a new semantic dictionary based on a gigabyte or more of text in just hours.

Given this kind of turnaround, why would anyone want to rely on a single semantic dictionary with its limited vocabulary and somewhat dated concepts? A new dictionary will of course involve a nontrivial upfront investment, but once a reliable source of tagged data is developed, actual dictionary building can be largely automated. That is the advantage of relying on statistical methods.