Archive for August, 2009

The current SemanticHacker API offers more than one semantic dictionary. Each one is crafted from a particular collection of categorized documents at a particular time. The choice of a dictionary depends on one’s target application. Ideally, that dictionary will be trained on categorized documents similar to the documents to be analyzed for content.

Currently, the two main types of dictionaries available in English come from the ODP conceptual hierarchy and the USPTO class hierarchy. The dimensions defined for these types have practically no overlap. The differences in language and vocabulary in training data are also huge, and these have major consequences in the dimensional weights computed for terms in the two dictionaries.

In theory, one could employ a USPTO dictionary in a general web application, but one then risks being unable to pick up on popular language and culture. You won’t find “lol” or “Brangelina” in any patent. Similarly, an ODP dictionary may be a bit thin for handling medical journal articles; it would be much better here to have a semantic dictionary trained specifically on medical language and vocabulary.

The cost of building a specialized dictionary varies, mostly due to the complicated legal, technical, and logistical process of collecting the proper training data. Once the data is obtained, however, the actual dictionary-building process is largely mechanical, although we do carry out extensive quality assessment to determine whether we are running under optimal dictionary-building parameters.

With proper training data in hand, we can turn out a semantic dictionary of about 200,000 terms over 2,000 dimensions in only about a day. This turnaround is possible because we rely on statistical methods rather than the more complicated mathematical modeling used by other semantic approaches. It means that we could build new dictionaries fast enough to keep up with news cycles as short as one week, given the computational resources needed.

Some linguists believe that early language was always about specific entities–that is, denotational; this kind of reference then evolves into concepts, which are connotational. For example, a baby learns his or her particular meaning of MAMA, which then generalizes into MOTHERHOOD.

We can still see such evolution at work today. About two years ago, the term “Sarah Palin” was only denotational, but since the fall of 2008 it has become quite connotational. Something similar might be said on the other side of the political spectrum about the term “Barack Obama.”

The whole process of turning denotation into connotation has been extensively studied and is better known as “branding.” Anyone who has ever written a resumé has had experience in doing it.

A semantic dictionary in fact trades on the natural kind of branding. Since we do statistical analyses of context to assign meaning, we may not yet be able to interpret a term for which we have only a small sample of occurrences. Give us a little time, though.

Romeo Montague once noted that the semantic function of a name contrasts quite saliently with that of an ordinary word. Shakespeare didn’t quite put it that way, but it is a fact of language. A classic semanticist would frame it as a distinction between denotation (i.e. identification) and connotation (i.e. description).

As it turns out, this difference can be seen even at the statistical level. Ordinary words, with a little massaging, have a frequency distribution best described as binomial; names typically do not. That will have consequences for how we mine text data to create a semantic dictionary.
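One rough way to see the kind of difference described above is to compare the observed variance of a term's per-document counts with the variance a binomial model would predict. The sketch below is only an illustration, not the method we use: the function name `dispersion_ratio`, the assumption that every document has the same length, and the counts themselves are all invented for the example.

```python
from statistics import mean, pvariance


def dispersion_ratio(counts, doc_length):
    """Observed variance divided by the variance a binomial model predicts.

    Assumes (hypothetically) that every document has `doc_length` tokens and
    that each token is the term with the same probability p.
    A ratio near 1 is consistent with a binomial; a ratio far above 1 means
    the term is "bursty", clumping in a few documents.
    """
    m = mean(counts)
    p_hat = m / doc_length                      # estimated per-token probability
    binomial_var = doc_length * p_hat * (1 - p_hat)
    return pvariance(counts) / binomial_var


# Invented counts over eight documents of 1,000 tokens each.
ordinary_word_counts = [1, 3, 5, 2, 4, 0, 6, 3]   # spread fairly evenly
name_counts = [0, 0, 0, 24, 0, 0, 0, 0]           # same total, one document

print(dispersion_ratio(ordinary_word_counts, 1000))
print(dispersion_ratio(name_counts, 1000))
```

Both made-up terms occur 24 times in total, but the second ratio comes out many times larger than the first, which is the sort of signal that would flag a term as not fitting the binomial picture.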

This is all a fine point, but the quality of a product is determined by many such fine points. None of our API competitors on the web bother with denotation and connotation, but it can really matter when you are processing data with many product designations.

In Chapter 8 of Lewis Carroll’s “Alice Through the Looking Glass,” our intrepid logical adventurer is talking to the White Knight, who wants to sing to her. He says, “The name of the song is called ‘HADDOCKS’ EYES.’”

It turns out, of course, that the name of the song is really “THE AGED AGED MAN,” though the song is actually called “WAYS AND MEANS.” The confusion here about naming is quite understandable to anyone who has ever ordered TenderSweet™ clams at HoJo’s and discovered that they are neither tender nor sweet.

All of this would be hilarious except that we have to build semantic dictionaries that must deal extensively with the meaning of names in text. This problem will take a while to talk about adequately; and so please tune in tomorrow.

Suppose that we want to know the average body-mass index (BMI) of American teenagers. Since it is extremely difficult even to count every single teenager in the country, sampling is necessary. So we try to find N typical teenagers, measure and weigh them, and then compute their average BMI with the standard statistical formula

population mean ≈ ∑ᵢ BMIᵢ / (N + 1)

Now we all learned averages in junior high. Where did the “+ 1” come from? This is in fact a simple trick that every statistician has to learn on day 1. When we estimate a population mean from a small sample, there will inevitably be an error, typically on the high side. As a useful rule of thumb, we get a better estimate when dividing by (N + 1) instead of by N. Note that, as N gets large, N ≈ (N + 1); and so we do converge to the population mean in the limit.
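To make the arithmetic concrete, here is a minimal sketch comparing the ordinary average with the (N + 1) version. Dividing by N + 1 is the same as adding one extra observation of zero, so the smoothed figure is always the plain average multiplied by N / (N + 1). The function names and BMI values are made up for illustration.

```python
def plain_mean(values):
    """Ordinary average: sum divided by N."""
    return sum(values) / len(values)


def smoothed_mean(values):
    """Rule-of-thumb average from the text: sum divided by N + 1."""
    return sum(values) / (len(values) + 1)


# Invented BMI measurements for a small sample.
small_sample = [21.0, 24.5, 19.8, 27.3]

# The same values repeated many times stand in for a large sample.
large_sample = small_sample * 2500   # N = 10,000

for name, sample in [("small", small_sample), ("large", large_sample)]:
    n = len(sample)
    gap = plain_mean(sample) - smoothed_mean(sample)
    print(f"{name}: N={n}, plain={plain_mean(sample):.3f}, "
          f"smoothed={smoothed_mean(sample):.3f}, gap={gap:.4f}")
```

With four measurements the two figures differ by a fifth of the plain average; with ten thousand the gap is about one part in ten thousand, which is the convergence noted above.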

A semantic dictionary is nothing more than millions of averages of term frequencies in documents, and most of them are based on only a fairly small number of occurrences of a given term. To get good results here, we have to do more than just junior high math.

Our situation is actually much more complicated than that of estimating a simple population mean, but we have to do a similar kind of data smoothing. This is all to provide you with the highest quality numbers for your web app.

Recently, there were news reports of scientists identifying an Oprah Winfrey neuron in the brain of an epileptic person who had been wired to help control seizures. This one particular neuron in the hippocampus fires whenever the person hears Oprah’s name or sees a picture of her. It may help to explain how memory works.

It also can explain how semantic dictionaries work. In the case of Oprah, stimuli from many different senses travel various paths to converge on her neuron. In a semantic dictionary with Oprah as a concept, various terms associated with her in effect will vote for the concept with differing degrees of confidence when they occur in some document. When there is convergence because of mutual corroboration of terms, then one can infer that the document is about the queen of daytime TV.
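The voting idea can be sketched in a few lines. This is a toy illustration only, not the SemanticHacker API or our actual dictionary format: the table `TOY_DICTIONARY`, its weights, the concept names, and the functions `score_concepts` and `top_concept` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy dictionary: term -> {concept: confidence weight}.
TOY_DICTIONARY = {
    "oprah": {"Oprah Winfrey": 0.9},
    "talk":  {"Oprah Winfrey": 0.2, "Telephony": 0.3},
    "show":  {"Oprah Winfrey": 0.2, "Theater": 0.4},
    "harpo": {"Oprah Winfrey": 0.6, "Marx Brothers": 0.5},
    "phone": {"Telephony": 0.8},
}


def score_concepts(text):
    """Let every known term in the text vote for its concepts."""
    scores = defaultdict(float)
    for raw_token in text.lower().split():
        token = raw_token.strip(".,!?\"'")       # drop surrounding punctuation
        for concept, weight in TOY_DICTIONARY.get(token, {}).items():
            scores[concept] += weight            # votes accumulate per concept
    return dict(scores)


def top_concept(text):
    """Return the concept with the most accumulated votes, or None."""
    scores = score_concepts(text)
    return max(scores, key=scores.get) if scores else None


document = "Oprah taped her talk show at Harpo."
print(score_concepts(document))
print(top_concept(document))
```

No single weak term such as “talk” or “show” settles the question, but when several terms that lean the same way occur together, their votes add up and one concept pulls clearly ahead.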