Posts Tagged ‘semantics’

Hijacked

26 May 2010

Has this ever happened to you? You are Googling for information on the Web, but inadvertently your query happens to share keywords with the latest cultural phenom: the next tweener heartthrob, a YouTube video suddenly gone viral, or yet another paranoid political fantasy that refuses to die.

You are a professional, however, and so switch into Advanced Mode to reshape your query, but to no avail. Your information has been buried under pop detritus; it has been hijacked by the maximum likelihood estimate (MLE) on the Web.

At times like this, you want to grab your search engine by the neck and shout, “I am NOT a screaming twelve-year-old girl into dancing cats and fixated on the President’s birth place!” But your search engine continues blithely in the wisdom of the crowd.

It is a reminder that statistically grounded information systems are at the mercy of their training data. If we cede too much control of a system to its finely wrought black box judgment, then we sometimes are going to run off the tracks. This is especially true with web semantics.

If we do in fact want to get under the hood to adjust a semantic system to go against the popular flow, then it helps tremendously if the categories underlying the representation of document content are intelligible to people. Such transparency is a prime motivation for how semantic dictionaries are currently built by TextWise.

Of course, if you care nary a lick about transparency, then may I interest you in this slightly used synthetic collateralized debt obligation….

A semantic dictionary in essence consists of triples [ t , d , w ] , where t is a term, d is a semantic dimension, and w is a weight. Each triple says that the occurrence of a term t in a document constitutes a raw vote of w for dimension d being relevant to the document. For example, [ BRAD , Arts/People/Jolie,_Angelina , 0.12100 ] indicates that the occurrence of BRAD in a news story provides evidence that it might be about the movie celebrity Angelina Jolie. If it were conclusive evidence, the weight would be 1.00000, but we never expect any single term to be that definitive.
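As a rough illustration of the triple structure just described, the following Python sketch stores a few triples and tallies the raw votes for a short document. Only the BRAD entry comes from the example above; the other entries, the function names, and the choice to simply add the votes together are assumptions made for illustration, not a description of how TextWise actually combines them.

```python
from collections import defaultdict

# Triples of (term, dimension, weight). Only the first entry appears in the
# text; the others are invented for illustration.
TRIPLES = [
    ("BRAD", "Arts/People/Jolie,_Angelina", 0.12100),
    ("BRAD", "Hardware/Fasteners", 0.05000),
    ("ANGELINA", "Arts/People/Jolie,_Angelina", 0.45000),
    ("NAIL", "Hardware/Fasteners", 0.30000),
]


def build_index(triples):
    """Group triples by term so each term maps to its (dimension, weight) votes."""
    index = defaultdict(list)
    for term, dimension, weight in triples:
        index[term].append((dimension, weight))
    return index


def raw_votes(tokens, index):
    """Add up the raw votes cast by every term occurrence in a tokenized document.

    Summation is an assumed combination rule, used here only to show the idea.
    """
    totals = defaultdict(float)
    for token in tokens:
        for dimension, weight in index.get(token.upper(), []):
            totals[dimension] += weight
    return dict(totals)


votes = raw_votes(["Brad", "and", "Angelina", "arrived"], build_index(TRIPLES))
for dimension, total in sorted(votes.items(), key=lambda item: -item[1]):
    print(f"{dimension}: {total:.5f}")
```

Note that BRAD votes for two dimensions at once, which is how a dictionary of this kind can carry the ambiguity of a term rather than forcing a single reading.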

In building a dictionary for an application, we have to start with the dimensions. Do we have the kinds of dimensions to cover the target content, and are there enough dimensions to make the distinctions in content required by the application? For a patent information system, an Angelina dimension may not have much relevance, and even something more appropriate like Electrical Machinery may have to be divided up into multiple dimensions to support a reasonable level of granularity in indexing.

Given the dimensions, we next have to define the terms to go along with them. The target content we want to process will have a certain vocabulary, and our dictionary terms should try to encompass most of it. This can be tricky in a statistical approach because we need reasonably large samples of training data to make a particular term become associated with a particular dimension.

Weights are determined in large part by training data, but the distribution of those numbers is important. To begin with, not all weights should be the same, and generally, we want to see them spread out over the entire dynamic range available to us. Weights that are quite small or quite large have to be supported by more data than those in the middle range. Weights have to be balanced between dimensions, and there should be enough of them so that most terms are related to more than one dimension.

In theory, we could build a dictionary with just one weight in just one dimension for each term, but that would be in denial about the inherent ambiguity of language. So, we typically want a dictionary to be as big as possible, based on an appropriate amount of training data. To build the best possible dictionary requires much inspiration and much perspiration.

A former colleague of mine used to have an entire can of soup for lunch every day. We razzed him about this, but he shook us all off until one day, I looked at the nutrition label on the can. That soup had 1800 mg of sodium altogether! We gravely informed him of this fact, and the soup was soon history.

To understand this story, you would have to know that the recommended daily maximum dietary intake of sodium for an adult is about 2,300 mg, so a single can accounted for most of a day’s allowance. Without this context, the number 1800 really means nothing. So what do all those numbers in a semantic dictionary mean, if anything?

The key property of semantic dictionary numbers is that they are based on probabilities and so have to fall between 0 and 1. They measure the likelihood that a document containing a given term is related to a given semantic dimension. For example, a dictionary weight of 1.0000 for a term and a dimension would indicate that a document containing the term is absolutely associated with the dimension.

There is a complication here, however. In real life, nothing is ever so certain. If we saw a 1.0000 term weight for a dimension, a more reasonable interpretation is that our sample of training data was too small for estimating the probability of that term accurately. A similar problem arises for a dictionary weight of 0.0000.

In general, a statistician will be highly suspicious of any extreme probabilities like 1.0000 and 0.0000. As proponents of statistical technology, we have to make a special effort to avoid such probability estimates in our semantic dictionaries. In contrast, certain other mathematical approaches to semantics tend to skate over niceties like this, choosing just to plug in numbers to what is essentially a fixed formula.
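One standard way to keep estimates away from the extremes is additive (Laplace) smoothing, sketched below in Python. This is a textbook technique shown only to make the point concrete; the text does not say which method TextWise uses, and the function name and the pseudo-count of 1 are assumptions.

```python
def smoothed_weight(hits, total, alpha=1.0):
    """Additive (Laplace) estimate of P(dimension | term).

    hits:  training documents containing the term that are tagged with the dimension
    total: training documents containing the term
    alpha: pseudo-count added to each of the two outcomes (an assumed value)
    """
    if hits < 0 or total < hits:
        raise ValueError("need 0 <= hits <= total")
    return (hits + alpha) / (total + 2 * alpha)


# Three documents out of three gives a raw estimate of 1.0, which says more
# about the sample size than about the term.
print(3 / 3, smoothed_weight(3, 3))

# With a hundred times more evidence the estimate moves toward 1.0
# but never reaches it.
print(300 / 300, smoothed_weight(300, 300))
```

The smoothed estimate stays strictly between 0 and 1 for any finite sample, and the pull toward the middle weakens as the amount of supporting data grows.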

If one is careless about the meaning of numbers, though, how can one be careful in capturing the actual meaning of words?

In Norton Juster’s classic The Phantom Tollbooth, a young boy visits the Kingdom of Wisdom and finds that its principal cities, Dictionopolis and Digitopolis, are in a cold war likely to turn quite hot. This conflict makes no sense and is the consequence of the Princesses Rhyme and Reason having been exiled to the Castle in the Air.

Okay, the symbolism is a bit over the top, but the conflict about whether semantics should involve numbers as opposed to some logical formalism makes just as little sense and could also benefit from the return of Sweet Rhyme and Pure Reason. There is not just one way to build a house, or plant a garden, or skin a cat. In any real-world enterprise, we always have multiple options, each with tradeoffs.

Our job as a semantic API developer is to provide another option with tradeoffs that are attractive to users. What we offer with statistical semantics is simplicity, transparency, broad coverage, timely data, rigor, and historical grounding of methodology. And we strive to be better each day at what we do.

Iron Semanticist

12 Aug 2009

Some people have been disparaging a statistical approach to the semantics of natural language. This is essentially a kind of prejudice, as if we came from the wrong side of the technology railroad tracks. It ignores the fact that statistical approaches have performed spectacularly well in some high profile settings.

Have you ever watched the “Iron Chef” on the Food Network? This is where two competing chefs are given an ingredient kept secret until the start of the show, and each contestant then has 60 minutes to create an entire meal around that ingredient. A panel then judges and critiques the two meals and crowns a winner.

In 2003, DARPA ran its own version of “Iron Chef,” though with only a single team of collaborators from eleven academic institutions across the U.S. The team was given a language, with the task of creating a cross-language information retrieval system and a machine translation system within TEN DAYS after learning what the language actually was.

To make the challenge harder, the language was not French, Arabic, or Russian, but Cebuano, a language spoken in the Philippines. None of the team was familiar with the language, but through the magic of Internet collaboration, they were able in ten days to collect a corpus of resources in Cebuano and English and apply statistical methods to create both a fully workable cross-language retrieval system and a credible start to a translation capability.

The two principal investigators of the Herculean exercise wrote afterward that, given what they learned in those ten days, they would do better next time. They predicted that their team could build a fully working statistical machine translation facility for a specified language in just a single day given adequate linguistic and computational resources.

In ten days, you could not build even a parser for a language that you have never heard of, much less develop the semantic mapping of that language into some kind of logical model of meaning to support cross-language search and machine translation. Statistical methods do work in semantics.

Our Roots

11 Aug 2009

Semantic Signatures℠ approaches the meaning of words from the perspective of their context. In the past couple of months, there has been extensive discussion here and elsewhere about how this differs from RDF, the basis for the Semantic Web. The simplest answer is that we are data-driven where RDF is model-driven.

This dichotomy is nothing new. In fact, if we look at semantics over a hundred years ago, we see the empirical idea of contextual semantics in the structural linguistics of Ferdinand de Saussure in contrast to the logical formulation of meaning in the predicate calculus of Bertrand Russell and Alfred North Whitehead. The former inferred meaning from the comparative analysis of text; the latter defined a mapping between text and a formal model of possible meanings.

The model-driven approach became less popular after the logician Kurt Gödel proved the incompleteness of all non-trivial logical systems in the 1930’s. Structural linguistics then became the favored approach until Noam Chomsky put the study of language back on a formal basis in the 1950’s, and the semantics of language also tilted to the formal in order to be more consistent with the study of syntax.

This is not to say that one approach is right and the other is wrong. The choice of approach to take should really depend on one’s circumstances. If one has available an appropriate logical model, which today might correspond to a taxonomy and a formal way to relate taxonomic entities, then the model-driven option is compelling. On the other hand, if an appropriate model is lacking or incomplete, but there is plenty of tagged text data to work from, then the data-driven option should be considered.

One can always in fact choose to work with the best of both worlds. We are not the sole providers of data-driven semantic technology, but our statistical characterization of meaning is probably how de Saussure himself might have done it if he had access to the World Wide Web and 21st Century cloud computing.

Some linguists believe that early language was always about specific entities–that is, denotational; this kind of reference then evolves into concepts, which are connotational. For example, a baby learns his or her particular meaning of MAMA, which then generalizes into MOTHERHOOD.

We can still see such evolution at work today. About two years ago, the term “Sarah Palin” was only denotational, but after the fall of 2008, it has now become quite connotational. Something similar might be said on the other side of the political spectrum about the term “Barack Obama.”

The whole process of turning denotation into connotation has been extensively studied and is better known as “branding.” Anyone who has ever written a resumé has had experience in doing it.

A semantic dictionary in fact trades on the natural kind of branding. Since we do statistical analyses of context to assign meaning, we may not yet be able to interpret a term for which we have only a small sample of occurrences. Give us a little time, though.

Juliet Capulet once noted that the semantic function of a name contrasts quite saliently with that of an ordinary word. Shakespeare didn’t quite put it that way, but it is a fact of language. A classic semanticist would frame it as a distinction between denotation (i.e. identification) and connotation (i.e. description).

As it turns out, this difference can be seen even at the statistical level. Ordinary words with a little massaging have a frequency distribution best described as binomial; names are typically not binomial. That will have consequences for how we mine text data to create a semantic dictionary.

This is all a fine point, but the quality of a product is determined by many such fine points. None of our API competitors on the web bother with denotation and connotation, but it can really matter when you are processing data with many product designations.

Recently, there were news reports of scientists identifying an Oprah Winfrey neuron in the brain of an epileptic person who had been wired to help control seizures. This one particular neuron in the hippocampus fires whenever the person hears Oprah’s name or sees a picture of her. It may help to explain how memory works.

It also can explain how semantic dictionaries work. In the case of Oprah, stimuli from many different senses travel various paths to converge on her neuron. In a semantic dictionary with Oprah as a concept, various terms associated with her in effect will vote for the concept with differing degrees of confidence when they occur in some document. When there is convergence because of mutual corroboration of terms, then one can infer that the document is about the queen of daytime TV.
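The idea of convergence can be sketched in Python as a filter that reports a concept only when several distinct terms vote for it. Everything here is hypothetical: the dictionary entries, the dimension names, the function name, and the rule of requiring at least two corroborating terms are assumptions chosen to illustrate the principle, not the actual inference procedure.

```python
from collections import defaultdict

# Hypothetical dictionary entries: term -> {dimension: weight}.
DICTIONARY = {
    "OPRAH": {"People/Winfrey,_Oprah": 0.60},
    "TALK": {"People/Winfrey,_Oprah": 0.05, "Media/Television": 0.20},
    "CHICAGO": {"People/Winfrey,_Oprah": 0.04, "Places/Chicago": 0.50},
    "BOOK": {"People/Winfrey,_Oprah": 0.03, "Media/Publishing": 0.25},
}


def corroborated_dimensions(tokens, dictionary, min_terms=2):
    """Return summed votes for dimensions backed by at least `min_terms` distinct terms."""
    voters = defaultdict(set)     # dimension -> distinct terms that voted for it
    totals = defaultdict(float)   # dimension -> sum of the votes received
    for token in tokens:
        term = token.upper()
        for dimension, weight in dictionary.get(term, {}).items():
            voters[dimension].add(term)
            totals[dimension] += weight
    return {d: totals[d] for d in totals if len(voters[d]) >= min_terms}


# No token names Oprah directly, yet three weak votes converge on her concept.
result = corroborated_dimensions(
    ["talk", "show", "book", "club", "Chicago"], DICTIONARY
)
print(result)
```

In this toy case each term also votes more strongly for some other dimension, but only the Oprah concept receives votes from more than one term, so it is the only one that survives the corroboration test.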

What We Sell

22 Jul 2009

A TextWise semantic dictionary is essentially a big bunch of numbers between 0 and 1. To be more precise, they are conditional probabilities of a semantic dimension being relevant to a document containing an occurrence of a given term; but to a casual observer, they can look very ho-hum and uncool. What is so great about them?
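To make the phrase "conditional probabilities" concrete, here is a minimal Python sketch that estimates P(dimension | term) by counting over a tiny tagged corpus. The corpus, the dimension names, and the function name are invented for illustration, and the plain ratio of counts is only the simplest possible estimator, not the proprietary procedure.

```python
from collections import Counter

# Hypothetical tagged training documents: (terms present, dimensions assigned).
CORPUS = [
    ({"BRAD", "ANGELINA", "PREMIERE"}, {"Arts/People/Jolie,_Angelina"}),
    ({"BRAD", "NAIL", "HAMMER"}, {"Hardware/Fasteners"}),
    ({"BRAD", "PREMIERE"}, {"Arts/Movies"}),
    ({"ANGELINA", "PREMIERE"}, {"Arts/People/Jolie,_Angelina", "Arts/Movies"}),
]


def estimate_weights(corpus):
    """Estimate P(dimension | term) as a ratio of document counts."""
    term_docs = Counter()   # documents containing each term
    pair_docs = Counter()   # documents containing the term AND tagged with the dimension
    for terms, dimensions in corpus:
        for term in terms:
            term_docs[term] += 1
            for dimension in dimensions:
                pair_docs[(term, dimension)] += 1
    return {
        (term, dimension): count / term_docs[term]
        for (term, dimension), count in pair_docs.items()
    }


weights = estimate_weights(CORPUS)
for (term, dimension), weight in sorted(weights.items()):
    print(f"[ {term} , {dimension} , {weight:.5f} ]")
```

With only four documents, some terms come out with a weight of exactly 1, which is the small-sample artifact that a real dictionary has to guard against.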

Some people are in fact dismissive of any numbers being applied to semantics. This is probably because of the unfortunate legacy of numerical abuse in information technology, where system builders all too commonly slam numbers together willy-nilly and hope that something sensible comes out.

At TextWise, we don’t do this. We not only follow rigorous statistical practice to get the most information out of available text data, but also apply proprietary filtering and reduction methods to eliminate many of the anomalies that can slip through any statistical system by chance. To paraphrase the Colonel, “We do numbers right.”