Archive for the ‘semantics’ Category

In Norton Juster’s classic The Phantom Tollbooth, a young boy visits the Kingdom of Wisdom and finds that its principal cities, Dictionopolis and Digitopolis, are in a cold war likely to turn quite hot. This conflict makes no sense and is the consequence of the Princesses Rhyme and Reason having been exiled to the Castle in the Air.

Okay, the symbolism is a bit over the top, but the conflict about whether semantics should involve numbers as opposed to some logical formalism makes just as little sense and could also benefit from the return of Sweet Rhyme and Pure Reason. There is not just one way to build a house, or plant a garden, or skin a cat. In any real-world enterprise, we always have multiple options, each with tradeoffs.

Our job as a semantic API developer is to provide another option with tradeoffs that are attractive to users. What we offer with statistical semantics is simplicity, transparency, broad coverage, timely data, rigor, and historical grounding of methodology. And we strive to be better each day at what we do.

Iron Semanticist

12 Aug 2009

Some people have been disparaging a statistical approach to the semantics of natural language. This is essentially a kind of prejudice, as if we came from the wrong side of the technology railroad tracks. It ignores the fact that statistical approaches have performed spectacularly well in some high profile settings.

Have you ever watched the “Iron Chef” on the Food Network? This is where two competing chefs are given an ingredient kept secret until the start of the show, and each contestant then has 60 minutes to create an entire meal around that ingredient. A panel then judges and critiques the two meals and crowns a winner.

In 2003, DARPA ran its own version of “Iron Chef,” though with only a single team of collaborators from eleven academic institutions across the U.S. The team was given a language, with the task of creating a cross-language information retrieval system and a machine translation system within TEN DAYS after learning what the language actually was.

To make the challenge harder, the language was not French, Arabic, or Russian, but Cebuano, a language spoken in the Philippines. None of the team was familiar with the language, but through the magic of Internet collaboration, they were able in ten days to collect a corpus of resources in Cebuano and English and apply statistical methods to create both a fully workable cross-language retrieval system and a credible start to a translation capability.

The two principal investigators of the Herculean exercise wrote afterward that, given what they learned in those ten days, they would do better next time. They predicted that their team could build a fully working statistical machine translation facility for a specified language in just a single day given adequate linguistic and computational resources.

In ten days, you could not build even a parser for a language that you have never heard of, much less develop the semantic mapping of that language into some kind of logical model of meaning to support cross-language search and machine translation. Statistical methods do work in semantics.

Our Roots

11 Aug 2009

Semantic Signatures℠ approaches the meaning of words from the perspective of their context. In the past couple of months, there has been extensive discussion here and elsewhere about how this differs from RDF, the basis for the Semantic Web. The simplest answer is that we are data-driven where RDF is model-driven.

This dichotomy is nothing new. In fact, if we look at semantics over a hundred years ago, we see the empirical idea of contextual semantics in the structural linguistics of Ferdinand de Saussure in contrast to the logical formulation of meaning in the predicate calculus of Bertrand Russell and Alfred North Whitehead. The former inferred meaning from the comparative analysis of text; the latter defined a mapping between text and a formal model of possible meanings.

The model-driven approach became less popular after the logician Kurt Gödel proved the incompleteness of all non-trivial logical systems in the 1930s. Structural linguistics then became the favored approach until Noam Chomsky put the study of language back on a formal basis in the 1950s, and the semantics of language also tilted to the formal in order to be more consistent with the study of syntax.

This is not to say that one approach is right and the other is wrong. The choice of approach to take should really depend on one’s circumstances. If one has available an appropriate logical model, which today might correspond to a taxonomy and a formal way to relate taxonomic entities, then the model-driven option is compelling. On the other hand, if an appropriate model is lacking or incomplete, but there is plenty of tagged text data to work from, then the data-driven option should be considered.

One can in fact always choose to work with the best of both worlds. We are not the sole providers of data-driven semantic technology, but our statistical characterization of meaning is probably how de Saussure himself might have done it if he had had access to the World Wide Web and 21st-century cloud computing.

The current SemanticHacker API offers more than one semantic dictionary. Each one is crafted from a particular collection of categorized documents at a particular time. The choice of a dictionary depends on one’s target application. Ideally, that dictionary will be trained on categorized documents similar to the documents to be analyzed for content.

Currently, the two main types of dictionaries available in English come from the ODP conceptual hierarchy and the USPTO class hierarchy. The dimensions defined for these types have practically no overlap. The differences in language and vocabulary in training data are also huge, and these have major consequences in the dimensional weights computed for terms in the two dictionaries.

In theory, one could employ a USPTO dictionary in a general web application, but one then risks being unable to pick up on popular language and culture. You won’t find “lol” or “Brangelina” in any patent. Similarly, an ODP dictionary may be a bit thin for handling medical journal articles; it would be much better here to have a semantic dictionary trained specifically on medical language and vocabulary.

The cost of building a specialized dictionary varies, mostly due to the complicated legal, technical, and logistical process of collecting the proper training data. Once the data is obtained, however, the actual dictionary-building process is largely mechanical, although we do carry out extensive quality assessment to determine whether we are running under optimal dictionary-building parameters.

With proper training data in hand, we can turn out a semantic dictionary of about 200,000 terms over 2,000 dimensions in only about a day. This turnaround is possible because of our reliance on statistical methods as opposed to more complicated mathematical modeling of other semantic approaches. It means that we could build new dictionaries fast enough to keep up with news cycles as short as one week, given the computational resources needed.

Some linguists believe that early language was always about specific entities–that is, denotational; this kind of reference then evolves into concepts, which are connotational. For example, a baby learns his or her particular meaning of MAMA, which then generalizes into MOTHERHOOD.

We can still see such evolution at work today. About two years ago, the term “Sarah Palin” was only denotational, but after the fall of 2008, it has now become quite connotational. Something similar might be said on the other side of the political spectrum about the term “Barack Obama.”

The whole process of turning denotation into connotation has been extensively studied and is better known as “branding.” Anyone who has ever written a resumé has had experience in doing it.

A semantic dictionary in fact trades on the natural kind of branding. Since we do statistical analyses of context to assign meaning, we may not yet be able to interpret a term for which we have only a small sample of occurrences. Give us a little time, though.

Juliet Capulet once noted that the semantic function of a name contrasts quite saliently with that of an ordinary word. Shakespeare didn’t quite put it that way, but it is a fact of language. A classic semanticist would frame it as a distinction between denotation (i.e. identification) and connotation (i.e. description).

As it turns out, this difference can be seen even at the statistical level. Ordinary words with a little massaging have a frequency distribution best described as binomial; names are typically not binomial. That will have consequences for how we mine text data to create a semantic dictionary.
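One rough way to look at this kind of difference is sketched below. It is only an illustration, not our production method: it assumes the text has been cut into equal-sized chunks, and the function name `dispersion_ratio` is made up for this example. It compares the observed variance of a term's per-chunk counts with the variance a binomial model would predict; a ratio far from 1 suggests the counts are not well described by a binomial.

```python
def dispersion_ratio(counts_per_chunk, chunk_size):
    """Observed variance of per-chunk counts divided by the binomial variance.

    Assumes every chunk holds exactly `chunk_size` tokens (an assumption
    made for this sketch only).
    """
    k = len(counts_per_chunk)
    mean = sum(counts_per_chunk) / k
    p = mean / chunk_size  # estimated per-token probability of the term
    binomial_var = chunk_size * p * (1.0 - p)
    if binomial_var <= 0.0:
        raise ValueError("term never occurs, or fills every token slot")
    observed_var = sum((c - mean) ** 2 for c in counts_per_chunk) / k
    return observed_var / binomial_var


# Made-up counts: one term spread evenly, one concentrated in a single chunk.
even_ratio = dispersion_ratio([2, 3, 2, 3], chunk_size=100)
bursty_ratio = dispersion_ratio([0, 0, 10, 0], chunk_size=100)
```

Both toy terms have the same average count per chunk, yet their ratios differ widely, which is the kind of signal that could be used to treat a term differently when mining text.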

This is all a fine point, but the quality of a product is determined by many such fine points. None of our API competitors on the web bother with denotation and connotation, but it can really matter when you are processing data with many product designations.

In Chapter 8 of Lewis Carroll’s “Through the Looking-Glass,” our intrepid logical adventurer is talking to the White Knight, who wants to sing to her. He says, “The name of the song is called ‘HADDOCKS’ EYES.’”

It turns out of course that the name of the song is really “THE AGED AGED MAN,” though the song is actually called “WAYS AND MEANS.” The confusion here about naming is quite understandable to anyone who has ever ordered TenderSweet™ clams at HoJo’s and discovered that they are neither tender nor sweet.

All of this would be hilarious except that we have to build semantic dictionaries that must deal extensively with the meaning of names in text. This problem will take a while to talk about adequately; and so please tune in tomorrow.

Recently, there were news reports of scientists identifying an Oprah Winfrey neuron in the brain of an epileptic person who had been wired to help control seizures. This one particular neuron in the hippocampus fires whenever the person hears Oprah’s name or sees a picture of her. It may help to explain how memory works.

It also can explain how semantic dictionaries work. In the case of Oprah, stimuli from many different senses travel various paths to converge on her neuron. In a semantic dictionary with Oprah as a concept, various terms associated with her in effect will vote for the concept with differing degrees of confidence when they occur in some document. When there is convergence because of mutual corroboration of terms, then one can infer that the document is about the queen of daytime TV.
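The voting idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the SemanticHacker implementation: the dictionary contents, the weights, the function names, and the rule that at least two distinct terms must corroborate a concept are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy dictionary: term -> {dimension: strength of association}.
TOY_DICTIONARY = {
    "oprah": {"daytime_tv": 0.9},
    "talk_show": {"daytime_tv": 0.6, "politics": 0.2},
    "book_club": {"daytime_tv": 0.4, "literature": 0.5},
    "senate": {"politics": 0.8},
}


def tally_votes(tokens, dictionary):
    """Sum the strengths each token contributes to each dimension.

    Returns {dimension: (total_strength, number_of_distinct_voting_terms)}.
    A token that occurs several times votes once per occurrence.
    """
    votes = defaultdict(float)
    voters = defaultdict(set)
    for token in tokens:
        for dimension, strength in dictionary.get(token, {}).items():
            votes[dimension] += strength
            voters[dimension].add(token)
    return {d: (votes[d], len(voters[d])) for d in votes}


def converged(tally, min_voters=2):
    """Pick the strongest dimension backed by at least `min_voters` terms."""
    candidates = {d: s for d, (s, n) in tally.items() if n >= min_voters}
    return max(candidates, key=candidates.get) if candidates else None


tally = tally_votes(["oprah", "talk_show", "book_club", "the"], TOY_DICTIONARY)
topic = converged(tally)
```

Requiring more than one voter is one simple way to model corroboration: a single term, however strongly associated, is not treated as convergence in this sketch.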

Ingredients

27 Jul 2009

This posting will probably make the eyes of most people glaze over, but current and prospective users of our SemanticHacker API should really be informed consumers. So think of this as being like one of those federally mandated labels on your bottle of Red Bull.

The ingredients of a semantic dictionary are a set of hundreds of thousands of terms, a set of thousands of dimensions, and various numbers expressing the strength of association between a given term and a given dimension. Most of these associations will have zero strength, indicating that we have no information about them; but there will still be millions of non-zero numbers to provide a rigorous undergirding for statistical semantics.
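Because most associations are zero, a natural way to picture a semantic dictionary is as a sparse table that stores only the non-zero strengths. The class below is a minimal sketch of that idea; the name `SemanticDictionary`, its methods, and the sample weights are assumptions for illustration and say nothing about how the actual dictionaries are stored.

```python
from collections import defaultdict


class SemanticDictionary:
    """Sparse term-by-dimension table; a missing entry means zero strength."""

    def __init__(self):
        self._weights = defaultdict(dict)  # term -> {dimension: strength}

    def set_weight(self, term, dimension, strength):
        # Zero strengths are never stored, which keeps the table sparse.
        if strength != 0.0:
            self._weights[term][dimension] = strength

    def weight(self, term, dimension):
        # Unknown terms or dimensions simply report zero strength.
        return self._weights.get(term, {}).get(dimension, 0.0)

    def nonzero_count(self):
        return sum(len(dims) for dims in self._weights.values())


# Made-up entries using terms mentioned in this post.
d = SemanticDictionary()
d.set_weight("midfielder", "sports/soccer", 0.82)
d.set_weight("rugelach", "food/baking", 0.67)
d.set_weight("purple", "sports/soccer", 0.0)  # no information, not stored
```

With hundreds of thousands of terms and thousands of dimensions, storing only the non-zero cells is what keeps "millions of numbers" from becoming hundreds of millions.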

We build a semantic dictionary by obtaining large training samples of documents relevant to each of its dimensions. The strength of association is then estimated as being proportional to the relative frequency with which a term occurs in the training documents for a dimension versus in those for all other dimensions. The process is actually more complicated than this, but the differences are just refinements of the overall scheme as described.
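As a rough sketch of the overall scheme only (none of the refinements), the code below compares a term's relative frequency in one dimension's training documents with its relative frequency in all the others. The function name, the input layout, and the particular normalization to a value between 0 and 1 are assumptions made for this example.

```python
from collections import Counter


def association_strengths(docs_by_dimension):
    """Estimate strengths from {dimension: [tokenized documents]}.

    Returns {(term, dimension): strength}, where strength is the term's
    relative frequency inside the dimension divided by the sum of its
    relative frequencies inside and outside the dimension.
    """
    counts = {
        dim: Counter(tok for doc in docs for tok in doc)
        for dim, docs in docs_by_dimension.items()
    }
    totals = {dim: sum(c.values()) for dim, c in counts.items()}
    overall = Counter()
    for c in counts.values():
        overall.update(c)
    grand_total = sum(totals.values())

    strengths = {}
    for dim, c in counts.items():
        rest_total = grand_total - totals[dim]
        for term, n in c.items():
            inside = n / totals[dim]
            outside = (overall[term] - n) / rest_total if rest_total else 0.0
            strengths[(term, dim)] = inside / (inside + outside)
    return strengths


# Tiny made-up training set with two dimensions.
training = {
    "sports": [["midfielder", "goal", "goal"]],
    "food": [["rugelach", "goal"]],
}
strengths = association_strengths(training)
```

A term seen in only one dimension gets strength 1.0 here, while a term spread across dimensions is split between them. Pairs that never co-occur are simply absent, matching the zero-strength convention.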

Now we all understand what terms are (e.g. britney_spears, midfielder, rugelach, purple), but where do dimensions come from? The answer is that they are somewhat arbitrary. A dimension can be defined around any kind of category for which someone has provided requisite training documents. In many cases, we can find prior sets of categories to work from (ODP, USPTO), but we also can ourselves try to infer categories from some available pool of potential training data.

However we proceed here, it is necessary that the resulting dimensions be pertinent to an application of interest, be independent of each other, be supported by adequate training data, and be associated with enough terms to support semantic analysis of target text. This all can be tricky to achieve, but if it were easy, everyone would be doing it.

Estimating a probability basically involves computing an average. Since most middle-schoolers know how to do this, what is so difficult about building a semantic dictionary consisting of conditional probabilities?

The problem turns out to be with sample sizes. To get reliable dictionary weights for a given term, we need many examples of its occurrence in text, but most terms are rather infrequent in any given corpus. This fact of life is articulated in Zipf’s Law, which states that occurrences of the n-th most common term in a corpus will be approximately proportional to 1/n.
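To get a feel for what Zipf’s Law implies, here is a small sketch that computes the occurrences one would expect for each rank in an idealized corpus. It assumes the 1/n relationship holds exactly, which real text only approximates, and the function name and corpus figures are invented for the example.

```python
def zipf_expected_counts(total_tokens, vocabulary_size):
    """Expected occurrences by rank if counts are exactly proportional to 1/n."""
    # The harmonic number normalizes the 1/n weights so the counts
    # add up to the corpus size.
    harmonic = sum(1.0 / n for n in range(1, vocabulary_size + 1))
    return [total_tokens / (n * harmonic) for n in range(1, vocabulary_size + 1)]


# Idealized corpus: one million tokens over a vocabulary of 1,000 terms.
counts = zipf_expected_counts(1_000_000, 1000)
rare_terms = sum(1 for c in counts if c < 500)
```

Under this idealization the top-ranked term is ten times as frequent as the tenth and a thousand times as frequent as the thousandth, so the long tail of the vocabulary has far fewer examples to learn from.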

Such a relationship is called a “power law,” which can also be seen in many other natural phenomena. For instance, sociologists often note that only ten percent of the people in any organization do ninety percent of all the work.

Unfortunately, the most frequent terms in any corpus are typically the least interesting for information applications. So the challenge is to make reliable probability estimates for tens of thousands of terms when the statistical support is less than ideal.
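The effect of thin statistical support can be made concrete with the textbook standard error of a proportion. The sketch below is generic statistics rather than anything specific to our dictionary building, and the function name and corpus figures are chosen only for illustration.

```python
import math


def relative_standard_error(occurrences, corpus_size):
    """Standard error of the estimate occurrences / corpus_size, as a
    fraction of the estimate itself (simple binomial approximation)."""
    if not 0 < occurrences <= corpus_size:
        raise ValueError("need 0 < occurrences <= corpus_size")
    p = occurrences / corpus_size
    standard_error = math.sqrt(p * (1.0 - p) / corpus_size)
    return standard_error / p


rare = relative_standard_error(4, 1_000_000)      # about 0.5
common = relative_standard_error(400, 1_000_000)  # about 0.05
```

In this approximation a term seen 4 times in a million tokens has an estimate that is uncertain by roughly half its own value, while one seen 400 times is uncertain by about five percent. Most terms sit much closer to the first case than the second.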

To build a good dictionary, we need to do much more than simply add up some term frequencies and then divide.