Posts Tagged ‘Statistics’

Watson, IBM’s Jeopardy computer, is showing everyone that it is the 900-pound gorilla of trivia and is likely to beat its human opponents. Watson could still do something stupid, but its formidable performance says much about the effectiveness of current natural language processing technology and computational resources.

Although Watson has a knowledge base of millions of documents gleaned from the Web, its weakness is that it really does not understand any of this data. It is just an extremely smart entity extraction system; Watson uses the terms of a Jeopardy clue as a basis for selecting a particular entity as an answer, which of course then has to be phrased as a question. It has to figure out what kind of entity to look for and what kind of context that entity would be found in.

In a sense, this is a simple kind of semantic search because it involves scanning its entire knowledge base of documents and scoring contexts statistically. The entities of the right kind in the highest-scoring contexts are then the prime candidates for an answer; and Watson can use their statistics to derive a level of confidence that a given candidate is the right answer. This basically relies heavily on brute computational power.

As can be seen in the Jeopardy competition, brute power can be quite effective. On most of the straightforward questions that one might expect Google to do well on, Watson can simply outsearch its opponents. It can grab enough right answers in this way to make up for its frequent wrong answers on more subtle questions requiring a deeper understanding. This is as much gamesmanship as it is intelligence.

Now imagine how overwhelming Watson could be if it actually developed some understanding and gave far fewer wrong answers. The first step in this direction is in fact quite easy: develop a large set of semantic categories corresponding to how humans understand language. Indexing a knowledge base by such predefined categories would have the immediate effect of simplifying the search process so that documents do not always have to be analyzed at the lowest linguistic level. That should allow the searches to be broader, much like allowing a chess computer to analyze more moves ahead.

We of course are in the business of semantic dictionaries, which provide a quick way of assigning semantic categories to text documents. Hey, Watson. If you are listening, give us a call.

Casino Royale

14 Sep 2009

In any statistical information system, one can never achieve absolute certainty. Every result is a kind of bet with the possibility of losing. For Semantic Signatures, however, this is more like playing blackjack than like playing roulette. Whether we imagine ourselves as the house or some hotshot card counter, we try our utmost to bend the odds in our favor.

When a given term occurs in a document, we know that there is a certain probability that the document is about a given topic. For example, THRILLER may relate to Michael Jackson or to some recent summer popcorn epic. Similarly MOONWALK may refer to Apollo XI or to a dance move. We would be rash to judge content just on the basis of a single term, but when multiple terms can corroborate each other, we do have a better bet.
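The idea of terms corroborating each other can be sketched in a few lines. This is only an illustration under stated assumptions, not the TextWise method: the function name, the prior of 0.5, and the two weights are hypothetical, and the terms are assumed to be conditionally independent given the topic (the naive Bayes assumption).

```python
import math


def combine_term_evidence(term_probs, prior=0.5):
    """Combine per-term topic probabilities into one estimate.

    Each entry of term_probs is P(topic | term) for one term found in the
    document and must lie strictly between 0 and 1. Terms are assumed to be
    conditionally independent given the topic (a naive Bayes assumption).
    """
    def logit(p):
        return math.log(p / (1.0 - p))

    prior_logit = logit(prior)
    # Each term shifts the log-odds by how far it moves us from the prior.
    total = prior_logit + sum(logit(p) - prior_logit for p in term_probs)
    return 1.0 / (1.0 + math.exp(-total))


# Hypothetical weights for a "Michael Jackson" topic; not real dictionary values.
weights = {"THRILLER": 0.6, "MOONWALK": 0.7}
single = combine_term_evidence([weights["THRILLER"]])
both = combine_term_evidence(list(weights.values()))
print(f"one term: {single:.3f}, two terms: {both:.3f}")
```

With these made-up numbers, a single term leaves the estimate at 0.6, while the two terms together raise it to about 0.78, which is the better bet described above.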

The trick here is to be able to set up a semantic dictionary so that we can always expect to find a reasonable number of terms in a target document that allow us to make that better bet. This requires careful balancing: we need enough semantic dimensions to be able to distinguish the different important kinds of content and enough terms for each dimension to put it into play. It is much like developing a diverse portfolio of investments to weather any shift in economic climate.

Most people will probably pass on building their own semantic dictionaries. It takes a tremendous amount of work to collect and filter the requisite text data to ground our dictionary weights and to massage all those numbers to get the maximum amount of usable information. But we want to get on the right side of the odds.

A former colleague of mine used to have an entire can of soup for lunch every day. We razzed him about this, but he shook us all off until one day, I looked at the nutrition label on the can. That soup had 1800 mg of sodium altogether! We gravely informed him of this fact, and the soup was soon history.

To understand this story, you would have to know that the recommended daily maximum dietary intake of sodium for an adult is about 2,300 mg. Without this context, the number 1800 really means nothing. So what do all those numbers in a semantic dictionary mean, if anything?

The key property of semantic dictionary numbers is that they are based on probabilities and so have to fall between 0 and 1. They measure the likelihood that a document containing a given term is related to a given semantic dimension. For example, a dictionary weight of 1.0000 for a term and a dimension would indicate that a document containing the term is absolutely associated with the dimension.

There is a complication here, however. In real life, nothing is ever so certain. If we saw a 1.0000 term weight for a dimension, a more reasonable interpretation is that our sample of training data was too small for estimating the probability of that term accurately. A similar problem arises for a dictionary weight of 0.0000.

In general, a statistician will be highly suspicious of any extreme probabilities like 1.0000 and 0.0000. As proponents of statistical technology, we have to make a special effort to avoid such probability estimates in our semantic dictionaries. In contrast, certain other mathematical approaches to semantics tend to skate over niceties like this, choosing just to plug in numbers to what is essentially a fixed formula.
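One textbook way to keep estimates away from 0.0000 and 1.0000 is additive (Laplace) smoothing. The sketch below is a generic illustration, not a description of how TextWise dictionaries are actually built; the function name, the counts, and the choice of alpha = 1 are assumptions.

```python
def smoothed_probability(hits, total, alpha=1.0):
    """Additive (Laplace) estimate of a probability from counts.

    hits  -- documents containing the term that belong to the dimension
    total -- all documents containing the term
    alpha -- pseudo-count added to both outcomes (an assumed setting)
    """
    if hits < 0 or total < 0 or hits > total:
        raise ValueError("need 0 <= hits <= total")
    # Two outcomes (in the dimension or not), so 2 * alpha in the denominator.
    return (hits + alpha) / (total + 2.0 * alpha)


# A term seen only 3 times, always in the same dimension: the raw ratio
# would be 1.0, but the smoothed estimate stays well short of certainty.
print(smoothed_probability(3, 3))
# With far more data the estimate is allowed to approach 1.
print(smoothed_probability(300, 300))
```

The smaller the sample, the more the pseudo-counts dominate, which matches the intuition that an extreme weight from a handful of occurrences should not be trusted.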

If one is careless about the meaning of numbers, though, how can one be careful in capturing the actual meaning of words?

When is a semantic dictionary good? It really depends on the application, since more specialized content requires more specialized dictionary dimensions. Typically, validation of a given application will involve extensive benchmark testing, often entailing human judgments of the effectiveness of particular statistical characterizations of content.

TextWise does all of this in its product development process, but one would not want to go through an elaborate validation procedure to test the consequences of every small change. As it turns out, there are quick statistical ways to check whether a change is likely to be good or bad. This is no substitute for actual detailed validation at some point, but it allows one to experiment with new ideas at a fairly low cost.

A digital photography metaphor is apt here. Although one cannot use statistics to identify a prize-winning shot, it is certainly possible to detect major problems without human judgments. For example, areas of maximally white pixels indicate blown highlights, which typically detract from the quality of an image. Similarly, problems with white balance, dynamic range, focus, and other conditions are also readily detectable.

With any huge data object like a semantic dictionary, it is difficult to construct a benchmark that will cover every aspect of it thoroughly. Statistical testing provides an overall sanity check on quality. Otherwise, one would just be buying and selling pigs in a poke.
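As a rough illustration of what such a sanity check might look like, the sketch below scans a toy dictionary for the numerical equivalent of blown highlights. The data layout (term mapped to a dict of dimension weights), the function name, and the particular checks are all assumptions made for the example, not the tests TextWise actually runs.

```python
def sanity_report(dictionary):
    """Count suspicious entries in a {term: {dimension: weight}} mapping."""
    report = {"weights": 0, "out_of_range": 0, "saturated": 0, "empty_terms": 0}
    for term, weights in dictionary.items():
        # A term with no non-zero weight carries no information at all.
        if not any(w != 0.0 for w in weights.values()):
            report["empty_terms"] += 1
        for w in weights.values():
            report["weights"] += 1
            if w < 0.0 or w > 1.0:
                report["out_of_range"] += 1   # not a valid probability
            elif w == 1.0:
                report["saturated"] += 1      # absolute certainty is suspect
    return report


# Toy data with made-up weights.
toy = {
    "thriller": {"music": 0.62, "film": 0.31},
    "moonwalk": {"music": 1.0},
    "rugelach": {},
    "purple": {"colors": 1.2},
}
print(sanity_report(toy))
```

Running a report like this before and after a change gives a cheap first signal, in the same way that a histogram reveals clipped highlights without anyone having to look at the photograph.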

In Norton Juster’s classic The Phantom Tollbooth, a young boy visits the Kingdom of Wisdom and finds that its principal cities, Dictionopolis and Digitopolis, are in a cold war likely to turn quite hot. This conflict makes no sense and is the consequence of the Princesses Rhyme and Reason having been exiled to the Castle in the Air.

Okay, the symbolism is a bit over the top, but the conflict about whether semantics should involve numbers as opposed to some logical formalism makes just as little sense and could also benefit from the return of Sweet Rhyme and Pure Reason. There is not just one way to build a house, or plant a garden, or skin a cat. In any real-world enterprise, we always have multiple options, each with tradeoffs.

Our job as a semantic API developer is to provide another option with tradeoffs that are attractive to users. What we offer with statistical semantics is simplicity, transparency, broad coverage, timely data, rigor, and historical grounding of methodology. And we strive to be better each day at what we do.

Iron Semanticist

12 Aug 2009

Some people have been disparaging a statistical approach to the semantics of natural language. This is essentially a kind of prejudice, as if we came from the wrong side of the technology railroad tracks. It ignores the fact that statistical approaches have performed spectacularly well in some high-profile settings.

Have you ever watched the “Iron Chef” on the Food Network? This is where two competing chefs are given an ingredient kept secret until the start of the show, and each contestant then has 60 minutes to create an entire meal around that ingredient. A panel then judges and critiques the two meals and crowns a winner.

In 2003, DARPA ran its own version of “Iron Chef,” though with only a single team of collaborators from eleven academic institutions across the U.S. The team was given a language, with the task of creating a cross-language information retrieval system and a machine translation system within TEN DAYS after learning what the language actually was.

To make the challenge harder, the language was not French, Arabic, or Russian, but Cebuano, a language spoken in the Philippines. None of the team was familiar with the language, but through the magic of Internet collaboration, they were able in ten days to collect a corpus of resources in Cebuano and English and apply statistical methods to create both a fully workable cross-language retrieval system and a credible start to a translation capability.

The two principal investigators of the Herculean exercise wrote afterward that, given what they learned in those ten days, they would do better next time. They predicted that their team could build a fully working statistical machine translation facility for a specified language in just a single day given adequate linguistic and computational resources.

In ten days, you could not build even a parser for a language that you have never heard of, much less develop the semantic mapping of that language into some kind of logical model of meaning to support cross-language search and machine translation. Statistical methods do work in semantics.

Suppose that we want to know the average body-mass index (BMI) of American teenagers. Since it is extremely difficult even to count every single teenager in the country, sampling is necessary. So we try to find N typical teenagers, measure and weigh them, and then compute their average BMI with the standard statistical formula

population mean ≈ ∑ᵢ BMIᵢ / (N + 1)

Now we all learned averages in junior high. Where did the “+ 1” come from? This is in fact a simple trick that every statistician has to learn on day 1. When we estimate a population mean from a small sample, there will inevitably be an error, typically on the high side. As a useful rule of thumb, we get a better estimate when dividing by (N + 1) instead of by N. Note that, as N gets large, N ≈ (N + 1); and so we do converge to the population mean in the limit.
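The effect of the “+ 1” is easy to check numerically. In the sketch below, the function names and the BMI value of 22.0 are made up for illustration; the only point is that dividing by (N + 1) scales the ordinary average by N / (N + 1), which matters for a handful of samples and vanishes as N grows.

```python
def plain_mean(values):
    """Ordinary average: sum divided by N."""
    return sum(values) / len(values)


def shrunk_mean(values):
    """Average with the '+ 1' in the denominator: sum divided by (N + 1)."""
    return sum(values) / (len(values) + 1)


# Hypothetical samples in which every teenager has a BMI of 22.0.
for n in (4, 100, 10_000):
    sample = [22.0] * n
    print(n, plain_mean(sample), round(shrunk_mean(sample), 4))
```

For four samples the adjusted estimate is 17.6 rather than 22.0, a large pull toward zero; for ten thousand samples the two estimates agree to within a fraction of a percent.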

A semantic dictionary is nothing more than millions of averages of term frequencies in documents, and most of them are based on only a fairly small number of occurrences of a given term. To get good results here, we have to do more than just junior high math.

Our situation is actually much more complicated than that of estimating a simple population mean, but we have to do a similar kind of data smoothing. This is all to provide you with the highest quality numbers for your web app.

Ingredients

27 Jul 2009

This posting will probably make the eyes of most people glaze over, but current and prospective users of our SemanticHacker API should really be informed consumers. So think of this as being like one of those federally mandated labels on your bottle of Red Bull.

The ingredients of a semantic dictionary are a set of hundreds of thousands of terms, a set of thousands of dimensions, and various numbers expressing the strength of association between a given term and a given dimension. Most of these associations will have zero strength, indicating that we have no information about them; but there will still be millions of non-zero numbers to provide a rigorous undergirding for statistical semantics.

We build a semantic dictionary by obtaining large training samples of documents relevant to each of its dimensions. The strength of association is then estimated as being proportional to the relative frequency of occurrence in training documents for a term in a dimension versus in those for all other dimensions. The process is actually more complicated than this, but the differences are just refinements of the overall scheme as described.
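One simple reading of this scheme can be sketched as follows. It is an assumption-laden toy, not the actual TextWise procedure: the function name, the data layout (dimension mapped to term counts from its training documents), and the normalization so that strengths sum to 1 are all choices made for the example.

```python
def association_strengths(term, counts_by_dimension):
    """Estimate a term's strength of association with each dimension.

    counts_by_dimension maps a dimension name to a dict of term counts
    gathered from that dimension's training documents.
    """
    rates = {}
    for dim, counts in counts_by_dimension.items():
        total = sum(counts.values())
        # Relative frequency of the term within this dimension's training text.
        rates[dim] = counts.get(term, 0) / total if total else 0.0
    norm = sum(rates.values())
    if norm == 0.0:
        # Term never seen in training: zero strength everywhere.
        return {dim: 0.0 for dim in rates}
    # Compare the rate in each dimension against the rates in all dimensions.
    return {dim: rate / norm for dim, rate in rates.items()}


# Made-up training counts for three hypothetical dimensions.
training = {
    "music": {"thriller": 8, "album": 12},
    "film": {"thriller": 2, "director": 18},
    "space": {"moonwalk": 5, "orbit": 15},
}
print(association_strengths("thriller", training))
```

In this toy data, "thriller" is four times as frequent in the music training text as in the film text, so it comes out with strengths of 0.8 and 0.2 for those dimensions and zero for space.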

Now we all understand what terms are (e.g. britney_spears, midfielder, rugelach, purple), but where do dimensions come from? The answer is that they are somewhat arbitrary. A dimension can be defined around any kind of category for which someone has provided requisite training documents. In many cases, we can find prior sets of categories to work from (ODP, USPTO), but we also can ourselves try to infer categories from some available pool of potential training data.

However we proceed here, it is necessary that the resulting dimensions be pertinent to an application of interest, be independent of each other, be supported by adequate training data, and be associated with enough terms to support semantic analysis of target text. This all can be tricky to achieve, but if it were easy, everyone would be doing it.

Estimating a probability basically involves computing an average. Since most middle-schoolers know how to do this, what is so difficult about building a semantic dictionary consisting of conditional probabilities?

The problem turns out to be with sample sizes. To get reliable dictionary weights for a given term, we need many examples of its occurrence in text, but most terms are rather infrequent in any given corpus. This fact of life is articulated in Zipf’s Law, which states that occurrences of the n-th most common term in a corpus will be approximately proportional to 1/n.
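To get a feel for what Zipf’s Law implies for sample sizes, the sketch below computes idealized counts for a corpus that follows the 1/n rule exactly. The function name, vocabulary size, and corpus size are arbitrary assumptions; real corpora only follow the law approximately.

```python
def zipf_expected_counts(num_terms, total_occurrences):
    """Expected count of each term, by rank, under an exact 1/n law."""
    # Normalizing constant so that the counts add up to the corpus size.
    harmonic = sum(1.0 / n for n in range(1, num_terms + 1))
    return [total_occurrences / (n * harmonic) for n in range(1, num_terms + 1)]


# Hypothetical corpus: 100,000 distinct terms, 10 million term occurrences.
counts = zipf_expected_counts(100_000, 10_000_000)
rare = sum(1 for c in counts if c < 20)
print(f"top term: {counts[0]:.0f} occurrences")
print(f"terms expected fewer than 20 times: {rare} of {len(counts)}")
```

Even in a corpus of this size, a large share of the vocabulary is expected to occur only a few times, which is exactly where probability estimates become unreliable.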

Such a relationship is called a “power law,” which can also be seen in many other natural phenomena. For instance, sociologists often note that only ten percent of the people in any organization do ninety percent of all the work.

Unfortunately, the most frequent terms in any corpus are typically the least interesting for information applications. So the challenge is to make reliable probability estimates for tens of thousands of terms when the statistical support is less than ideal.

To build a good dictionary, we need to do much more than simply add up some term frequencies and then divide.

What We Sell

22 Jul 2009

A TextWise semantic dictionary is essentially a big bunch of numbers between 0 and 1. To be more precise, they are conditional probabilities of a semantic dimension being relevant to a document containing an occurrence of a given term; but to a casual observer, they can look very ho-hum and uncool. What is so great about them?

Some people are in fact dismissive of any numbers being applied to semantics. This is probably because of the unfortunate legacy of numerical abuse in information technology, where system builders all too commonly slam numbers together willy-nilly and hope that something sensible comes out.

At TextWise, we don’t do this. We not only follow rigorous statistical practice to get the most information out of available text data, but also apply proprietary filtering and reduction methods to eliminate many of the anomalies that can slip through any statistical system by chance. To paraphrase the Colonel, “We do numbers right.”