Casino Royale

14 Sep 2009

In any statistical information system, one can never achieve absolute certainty. Every result is a kind of bet with the possibility of losing. For Semantic Signatures, however, this is more like playing blackjack than like playing roulette. Whether we imagine ourselves as the house or some hotshot card counter, we try our utmost to bend the odds in our favor.

When a given term occurs in a document, we know that there is a certain probability that the document is about a given topic. For example, THRILLER may relate to Michael Jackson or to some recent summer popcorn epic. Similarly, MOONWALK may refer to Apollo 11 or to a dance move. We would be rash to judge content on the basis of a single term, but when multiple terms corroborate each other, we have a better bet.
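To make the idea of corroboration concrete, here is a minimal sketch that combines evidence from several terms with a naive Bayes rule. This is one common way to pool term evidence, not necessarily how Semantic Signatures does it; the topic names, the extra term ALBUM, and every probability in the table are invented for illustration.

```python
import math

# Hypothetical P(term | topic) values, invented for illustration only.
LIKELIHOODS = {
    "pop_music": {"THRILLER": 0.020, "MOONWALK": 0.010, "ALBUM": 0.030},
    "spaceflight": {"THRILLER": 0.001, "MOONWALK": 0.008, "ALBUM": 0.001},
}
# Assumed equal prior belief in each topic before seeing any terms.
PRIORS = {"pop_music": 0.5, "spaceflight": 0.5}


def topic_posteriors(terms):
    """Naive Bayes: add log-likelihoods per topic, then normalize."""
    log_scores = {}
    for topic, prior in PRIORS.items():
        log_scores[topic] = math.log(prior) + sum(
            math.log(LIKELIHOODS[topic][term]) for term in terms
        )
    # Subtract the largest score before exponentiating, for numerical stability.
    peak = max(log_scores.values())
    weights = {topic: math.exp(s - peak) for topic, s in log_scores.items()}
    total = sum(weights.values())
    return {topic: w / total for topic, w in weights.items()}


one_term = topic_posteriors(["MOONWALK"])
three_terms = topic_posteriors(["MOONWALK", "THRILLER", "ALBUM"])
print(one_term["pop_music"], three_terms["pop_music"])
```

With these made-up numbers, MOONWALK alone leaves the two topics close to a coin flip, while three mutually supporting terms push the estimate strongly toward one topic.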

The trick here is to be able to set up a semantic dictionary so that we can always expect to find a reasonable number of terms in a target document that allow us to make that better bet. This requires careful balancing: we need enough semantic dimensions to be able to distinguish the different important kinds of content and enough terms for each dimension to put it into play. It is much like developing a diverse portfolio of investments to weather any shift in economic climate.

Most people will probably pass on building their own semantic dictionaries. It takes a tremendous amount of work to collect and filter the requisite text data to ground our dictionary weights and to massage all those numbers to get the maximum amount of usable information. But we want to get on the right side of the odds.

Estimating a probability basically involves computing an average. Since most middle-schoolers know how to do this, what is so difficult about building a semantic dictionary consisting of conditional probabilities?
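As a rough sketch of the "computing an average" view, the function below estimates P(topic | term) by averaging a 0/1 indicator over the documents that contain the term. The function name, the tiny labeled corpus, and the topic labels are all hypothetical.

```python
def conditional_probability(documents, term, topic):
    """Estimate P(topic | term) as the average of a 0/1 indicator.

    `documents` is a list of (topic_label, set_of_terms) pairs.
    Returns None when the term never occurs, since there is nothing to average.
    """
    indicators = [
        1 if doc_topic == topic else 0
        for doc_topic, words in documents
        if term in words
    ]
    if not indicators:
        return None
    return sum(indicators) / len(indicators)


# A toy labeled corpus, invented for illustration only.
docs = [
    ("pop_music", {"THRILLER", "ALBUM"}),
    ("film", {"THRILLER", "SUMMER"}),
    ("pop_music", {"MOONWALK", "THRILLER"}),
    ("spaceflight", {"MOONWALK", "APOLLO"}),
]

p_thriller = conditional_probability(docs, "THRILLER", "pop_music")
p_moonwalk = conditional_probability(docs, "MOONWALK", "spaceflight")
p_unseen = conditional_probability(docs, "ROULETTE", "film")
print(p_thriller, p_moonwalk, p_unseen)
```

The arithmetic really is middle-school level. Note, though, that each estimate here rests on only two or three documents.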

The problem turns out to be with sample sizes. To get reliable dictionary weights for a given term, we need many examples of its occurrence in text, but most terms are rather infrequent in any given corpus. This fact of life is articulated in Zipf’s Law, which states that the number of occurrences of the n-th most common term in a corpus is approximately proportional to 1/n.
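To get a feel for what Zipf's Law implies, the sketch below computes the term counts an idealized corpus would have if the law held exactly. The corpus size, vocabulary size, and the cutoff of ten occurrences are arbitrary assumptions; real corpora only follow the law approximately.

```python
def zipf_counts(corpus_size, vocabulary_size):
    """Expected count per rank if counts were exactly proportional to 1/n."""
    # The harmonic sum normalizes the counts so they add up to corpus_size.
    harmonic = sum(1.0 / n for n in range(1, vocabulary_size + 1))
    scale = corpus_size / harmonic
    return [scale / n for n in range(1, vocabulary_size + 1)]


# Assumed sizes: one million running words, fifty thousand distinct terms.
counts = zipf_counts(1_000_000, 50_000)

# How many terms would have fewer than ten occurrences to estimate from?
rare = sum(1 for c in counts if c < 10)
print(f"top term: {counts[0]:.0f} occurrences")
print(f"terms with fewer than 10 occurrences: {rare} of {len(counts)}")
```

Under these assumptions, well over half of the vocabulary has fewer than ten occurrences, even in a corpus of a million words.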

Such a relationship is called a “power law,” and it can also be seen in many other natural phenomena. For instance, sociologists often note that only ten percent of the people in any organization do ninety percent of all the work.

Unfortunately, the most frequent terms in any corpus are typically the least interesting for information applications. So the challenge is to make reliable probability estimates for tens of thousands of terms when the statistical support is less than ideal.
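One way to see why thin statistical support matters is to look at the standard error of an estimated proportion, which shrinks only with the square root of the sample size. The sketch below uses the usual normal-approximation formula sqrt(p(1 - p)/n); the counts are made up, and this illustrates the problem rather than any particular remedy.

```python
import math


def standard_error(successes, trials):
    """Normal-approximation standard error of an estimated proportion."""
    p = successes / trials
    return math.sqrt(p * (1 - p) / trials)


# The same observed proportion (0.5) with very different support.
se_rare = standard_error(2, 4)        # term seen in only 4 documents
se_common = standard_error(200, 400)  # term seen in 400 documents
print(se_rare, se_common)
```

Both terms yield the same point estimate, but the estimate for the rare term is ten times less certain, and rare terms are the ones a dictionary mostly has to work with.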

To build a good dictionary, we need to do much more than simply add up some term frequencies and then divide.