Posts Tagged ‘sample’

Suppose that we want to know the average body-mass index (BMI) of American teenagers. Since it is extremely difficult even to count every single teenager in the country, sampling is necessary. So we try to find N typical teenagers, measure and weigh them, and then compute their average BMI with the standard statistical formula

population mean ≈ ∑ᵢ BMIᵢ / (N + 1)

Now we all learned averages in junior high. Where did the “+ 1” come from? This is in fact a simple trick that every statistician has to learn on day 1. When we estimate a population mean from a small sample, there will inevitably be an error, typically on the high side. As a useful rule of thumb, we get a better estimate when dividing by (N + 1) instead of by N. Note that, as N gets large, N ≈ (N + 1); and so we do converge to the population mean in the limit.
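To make the rule of thumb concrete, here is a minimal Python sketch that compares the junior high average with the divide-by-(N + 1) version. The function names and the three BMI values are made up for illustration; they are not real survey data.

```python
def plain_mean(values):
    """Ordinary average: sum divided by N."""
    return sum(values) / len(values)


def smoothed_mean(values):
    """Average with the post's rule of thumb: sum divided by (N + 1)."""
    return sum(values) / (len(values) + 1)


# Hypothetical BMI measurements for a tiny sample (illustrative values only).
sample = [21.0, 24.0, 27.0]
small_gap = plain_mean(sample) - smoothed_mean(sample)

# The same values repeated 1000 times stand in for a large sample.
large_sample = sample * 1000
large_gap = plain_mean(large_sample) - smoothed_mean(large_sample)

print(small_gap, large_gap)
```

With only three values the two estimates differ noticeably; with three thousand the gap nearly vanishes, which is the convergence described above.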

A semantic dictionary is nothing more than millions of averages of term frequencies in documents, and most of them are based on only a fairly small number of occurrences of a given term. To get good results here, we have to do more than just junior high math.

Our situation is actually much more complicated than that of estimating a simple population mean, but we have to do a similar kind of data smoothing. This is all to provide you with the highest quality numbers for your web app.

Ingredients

27 Jul 2009

This posting will probably make the eyes of most people glaze over, but current and prospective users of our SemanticHacker API should really be informed consumers. So think of this as being like one of those federally mandated labels on your bottle of Red Bull.

The ingredients of a semantic dictionary are a set of hundreds of thousands of terms, a set of thousands of dimensions, and various numbers expressing the strength of association between a given term and a given dimension. Most of these associations will have zero strength, indicating that we have no information about them; but there will still be millions of non-zero numbers to provide a rigorous undergirding for statistical semantics.
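Because most associations are zero, one natural way to picture the dictionary is as a sparse mapping that stores only the non-zero strengths. The sketch below is an assumption about representation, not a description of how SemanticHacker stores its data; the dimension names and numbers are invented for illustration.

```python
# Hypothetical sparse semantic dictionary: term -> {dimension: strength}.
# Every dimension name and number here is invented for illustration.
semantic_dictionary = {
    "midfielder": {"Sports/Soccer": 0.92},
    "rugelach": {"Food/Baking": 0.81},
    "purple": {"Arts/Color": 0.40, "Food/Baking": 0.05},
}


def strength(dictionary, term, dimension):
    """Return the stored strength, or 0.0 when we have no information."""
    return dictionary.get(term, {}).get(dimension, 0.0)


def nonzero_count(dictionary):
    """Count how many non-zero associations are actually stored."""
    return sum(len(dims) for dims in dictionary.values())


print(strength(semantic_dictionary, "midfielder", "Food/Baking"))
print(nonzero_count(semantic_dictionary))
```

Only the pairs we know something about take up space; everything else is treated as zero at lookup time.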

We build a semantic dictionary by obtaining large training samples of documents relevant to each of its dimensions. The strength of association is then estimated as being proportional to how often a term occurs in the training documents for a dimension relative to how often it occurs in the training documents for all other dimensions. The process is actually more complicated than this, but the differences are just refinements of the overall scheme as described.
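The overall scheme can be sketched in a few lines of Python. This is a simplified illustration, not our production method: the function name, the toy training documents, and the add-one smoothing on the "all other dimensions" frequency are all assumptions made for the example.

```python
from collections import Counter


def association_strengths(training_docs, term):
    """Estimate a term's strength in each dimension.

    training_docs maps a dimension name to a list of tokenized documents.
    Strength is the term's relative frequency inside the dimension divided
    by its (smoothed) relative frequency in all other dimensions.
    """
    counts = {
        dim: Counter(tok for doc in docs for tok in doc)
        for dim, docs in training_docs.items()
    }
    totals = {dim: sum(c.values()) for dim, c in counts.items()}

    strengths = {}
    for dim in counts:
        in_count = counts[dim][term]
        if in_count == 0:
            continue  # zero strength means no information, so store nothing
        in_freq = in_count / totals[dim]
        out_count = sum(counts[d][term] for d in counts if d != dim)
        out_total = sum(totals[d] for d in counts if d != dim)
        # Assumption: add-one smoothing keeps the ratio finite when the
        # term never occurs outside this dimension.
        out_freq = (out_count + 1) / (out_total + 1)
        strengths[dim] = in_freq / out_freq
    return strengths


# Toy training data, invented for illustration.
docs = {
    "soccer": [["midfielder", "scores", "goal"], ["midfielder", "passes"]],
    "baking": [["rugelach", "dough", "recipe"], ["purple", "icing", "goal"]],
}
midfielder = association_strengths(docs, "midfielder")
goal = association_strengths(docs, "goal")
print(midfielder, goal)
```

A term such as midfielder that appears only in the soccer documents gets a single strong association, while goal, which appears in both sets, gets weaker strengths spread across the two dimensions.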

Now we all understand what terms are (e.g. britney_spears, midfielder, rugelach, purple), but where do dimensions come from? The answer is that they are somewhat arbitrary. A dimension can be defined around any kind of category for which someone has provided the requisite training documents. In many cases, we can find existing sets of categories to work from (ODP, USPTO), but we can also try to infer categories ourselves from some available pool of potential training data.

However we proceed here, it is necessary that the resulting dimensions be pertinent to an application of interest, be independent of each other, be supported by adequate training data, and be associated with enough terms to support semantic analysis of target text. This all can be tricky to achieve, but if it were easy, everyone would be doing it.