Posts Tagged ‘context’

Watson, IBM’s Jeopardy computer, is showing everyone that it’s the 900-pound gorilla of trivia and is likely to beat its human opponents. Watson could still do something stupid, but its formidable performance says much about the effectiveness of current natural language processing technology and computational resources.

Although Watson has a knowledge base of millions of documents gleaned from the Web, its weakness is that it does not really understand any of this data. It is just an extremely smart entity extraction system: Watson uses the terms of a Jeopardy clue as a basis for selecting a particular entity as an answer, which of course then has to be phrased as a question. It has to figure out what kind of entity to look for and what kind of context that entity would be found in.

In a sense, this is a simple kind of semantic search because it involves scanning Watson’s entire knowledge base of documents and scoring contexts statistically. The entities of the right kind in the highest-scoring contexts are then the prime candidates for an answer, and Watson can use their statistics to derive a level of confidence that a given candidate is the right answer. This approach relies heavily on brute computational power.
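The scoring idea described above can be illustrated with a toy sketch. This is not Watson's actual method; it only assumes that each candidate entity comes with some context passages, that a passage scores by how many clue terms it contains, and that confidence is a candidate's share of the total score. The function name `score_candidates` and the sample passages are hypothetical.

```python
from collections import Counter


def score_candidates(clue_terms, passages):
    """Score candidate entities by clue-term overlap with their contexts.

    passages is a list of (entity, context_text) pairs. Returns a dict
    mapping each entity to its share of the total overlap score, used
    here as a crude confidence value.
    """
    clue = {term.lower() for term in clue_terms}
    scores = Counter()
    for entity, text in passages:
        words = set(text.lower().split())
        # A passage contributes one point per clue term it contains.
        scores[entity] += len(clue & words)
    total = sum(scores.values())
    if total == 0:
        return {}
    return {entity: score / total for entity, score in scores.items()}


# Hypothetical miniature "knowledge base".
passages = [
    ("Paris", "paris is the capital of france on the seine"),
    ("Lyon", "lyon is a city in france on the rhone"),
    ("Paris", "the seine flows through paris"),
]

confidences = score_candidates(["capital", "france", "seine"], passages)
best = max(confidences, key=confidences.get)
print(best, confidences[best])
```

A real system would also check that each candidate is an entity of the right kind and would use far richer statistics than raw term overlap.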

As can be seen in the Jeopardy competition, brute power can be quite effective. On most of the straightforward questions, the kind one might expect Google to do well on, Watson can simply outsearch its opponents. It can grab enough right answers in this way to make up for its frequent wrong answers on more subtle questions requiring a deeper understanding. This is as much gamesmanship as it is intelligence.

Now imagine how overwhelming Watson could be if it actually developed some understanding and gave far fewer wrong answers. The first step in this direction is in fact quite easy: develop a large set of semantic categories corresponding to how humans understand language. Indexing a knowledge base by such predefined categories would have the immediate effect of simplifying the search process so that documents do not always have to be analyzed at the lowest linguistic level. That should allow the searches to be broader, much like allowing a chess computer to analyze more moves ahead.

We of course are in the business of semantic dictionaries, which provide a quick way of assigning semantic categories to text documents. Hey, Watson. If you are listening, give us a call.

Our Roots

11 Aug 2009

Semantic Signatures℠ approaches the meaning of words from the perspective of their context. In the past couple of months, there has been extensive discussion here and elsewhere about how this differs from RDF, the basis for the Semantic Web. The simplest answer is that we are data-driven where RDF is model-driven.

This dichotomy is nothing new. In fact, if we look at semantics over a hundred years ago, we see the empirical idea of contextual semantics in the structural linguistics of Ferdinand de Saussure in contrast to the logical formulation of meaning in the predicate calculus of Bertrand Russell and Alfred North Whitehead. The former inferred meaning from the comparative analysis of text; the latter defined a mapping between text and a formal model of possible meanings.

The model-driven approach became less popular after the logician Kurt Gödel proved in the 1930s that any consistent formal system powerful enough to express arithmetic is incomplete. Structural linguistics then became the favored approach until Noam Chomsky put the study of language back on a formal basis in the 1950s, and the semantics of language also tilted to the formal in order to be more consistent with the study of syntax.

This is not to say that one approach is right and the other is wrong. The choice of approach to take should really depend on one’s circumstances. If one has available an appropriate logical model, which today might correspond to a taxonomy and a formal way to relate taxonomic entities, then the model-driven option is compelling. On the other hand, if an appropriate model is lacking or incomplete, but there is plenty of tagged text data to work from, then the data-driven option should be considered.

One can always in fact choose to work with the best of both worlds. We are not the sole providers of data-driven semantic technology, but our statistical characterization of meaning is probably how de Saussure himself might have done it if he had had access to the World Wide Web and 21st-century cloud computing.

According to WordNet, the word BANK has multiple senses, and so any occurrence of it in a text document is ambiguous. For example, we can have a river BANK, a financial BANK, a fog BANK, or an aeronautical BANK. The intended sense in a particular document has to be determined by looking at the context of occurrence. So, to determine the actual meaning of BANK in a document, we have to ask in effect whether the document is talking about streams of water, financial meltdowns, marine navigation, or aircraft in flight.

Now the number of different possible contexts is probably huge. One cannot hope to recognize them all; but for disambiguation of words, we need only fairly general contexts to distinguish the word senses of prime interest to us. Furthermore, given a large sample of our target text, we can employ statistical methods to identify the most important of such contexts.

This is essentially what SemanticHacker is all about. The dimensions of one of our semantic dictionaries define thousands of contextual reference points for the interpretation of terms. For example, if the words stream, water, flow, erosion, and grass are in a document, then with the ODP 2009 dictionary, we find that the top matching dimension is 1461 (Top/Science/Environment/Water_Resources) with a weight of 0.5138. In this context, the word BANK would probably mean “river bank.”
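To make the idea concrete, here is a minimal sketch of context-based disambiguation. It does not use the ODP 2009 dictionary or the SemanticHacker API: the tiny dictionary, its category names, its weights, and the mapping from categories to senses of BANK are all invented for illustration, so the numbers it produces will not match the 0.5138 weight quoted above.

```python
from collections import defaultdict

# Hypothetical toy dictionary: word -> {category: weight}.
# Categories and weights are made up for this example.
TOY_DICTIONARY = {
    "stream": {"water_resources": 0.6, "computing": 0.4},
    "water": {"water_resources": 0.9, "aviation": 0.1},
    "flow": {"water_resources": 0.5, "finance": 0.5},
    "erosion": {"water_resources": 1.0},
    "grass": {"water_resources": 0.3, "gardening": 0.7},
    "loan": {"finance": 1.0},
    "interest": {"finance": 0.8, "gardening": 0.2},
}

# Hypothetical mapping from a context category to a sense of BANK.
BANK_SENSES = {
    "water_resources": "river bank",
    "finance": "financial bank",
    "weather": "fog bank",
    "aviation": "aeronautical bank",
}


def signature(words):
    """Sum category weights over the words and normalize them to sum to 1."""
    totals = defaultdict(float)
    for word in words:
        for category, weight in TOY_DICTIONARY.get(word.lower(), {}).items():
            totals[category] += weight
    norm = sum(totals.values())
    return {c: v / norm for c, v in totals.items()} if norm else {}


def disambiguate_bank(context_words):
    """Pick the sense of BANK whose category weighs most in the context."""
    sig = signature(context_words)
    ranked = sorted((c for c in sig if c in BANK_SENSES), key=sig.get, reverse=True)
    return BANK_SENSES[ranked[0]] if ranked else None


print(disambiguate_bank(["stream", "water", "flow", "erosion", "grass"]))
print(disambiguate_bank(["loan", "interest"]))
```

The point of the sketch is only that a handful of general categories is enough to separate the senses; a real dictionary would have thousands of dimensions derived statistically from text.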

Actually, we don’t need to make this explicit association. With a search engine user interface, one just needs a way of describing the context of ambiguous search terms, perhaps by listing contextual words. Then all a semantic search engine has to do is find a document containing the search term and having the same described context in its semantic signature. This is of course a part of our API for search.
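A search of this kind might look roughly like the sketch below. It is not our actual search API. It assumes a hypothetical toy dictionary, computes a signature for the user's contextual words and for each document, and returns documents that contain the search term, ranked by cosine similarity between the two signatures. All names and weights here are illustrative.

```python
import math
from collections import defaultdict

# Hypothetical toy dictionary: word -> {category: weight}.
TOY_DICTIONARY = {
    "stream": {"water": 1.0},
    "erosion": {"water": 1.0},
    "water": {"water": 1.0},
    "river": {"water": 1.0},
    "loan": {"finance": 1.0},
    "interest": {"finance": 1.0},
    "rates": {"finance": 1.0},
}


def signature(text):
    """Sum category weights over all known words in the text."""
    totals = defaultdict(float)
    for word in text.lower().split():
        for category, weight in TOY_DICTIONARY.get(word, {}).items():
            totals[category] += weight
    return dict(totals)


def cosine(a, b):
    """Cosine similarity between two sparse category vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(term, context_words, documents):
    """Return documents containing term, best context match first."""
    query_sig = signature(" ".join(context_words))
    hits = []
    for doc in documents:
        if term.lower() in doc.lower().split():
            hits.append((cosine(query_sig, signature(doc)), doc))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    # Drop documents whose context does not overlap the query's at all.
    return [doc for score, doc in hits if score > 0]


documents = [
    "the bank of the stream showed erosion after the flood",
    "the bank raised its loan interest rates",
    "the river carried water downstream",
]

print(search("bank", ["river", "water"], documents))
print(search("bank", ["loan"], documents))
```

Note that the user never states which sense of BANK is meant; the contextual words alone steer the ranking.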