Posts Tagged ‘understanding’

The February issue of Scientific American had an article on the latest thinking about the Whorfian hypothesis, which states that language strongly influences how humans think. This was a hot idea about sixty years ago, but it eventually fell out of academic favor for lack of hard empirical evidence. Such evidence is now starting to appear, and it has some implications for computational semantics.

The standard view on language and meaning has recently emphasized universality. That is to say, the understanding of language is hardwired in our heads, and so any competent human should qualify as an expert in the algorithmic delineation of meaning. The Whorfian hypothesis throws us a curve here in that we now have to consider language along with culture in our models of thought. A single well-crafted taxonomy or other semantic construct will not fit all.

We see something of this problem on the World Wide Web. As Jimmy Wales noted this past week, the content of the Web, and Wikipedia in particular, is largely created by twenty- and thirty-something males and so is dominated by their interests. A set of semantic categories derived from the Web in general will certainly be insufficient for understanding text on finance or on medicine and may be challenged even when dealing with the pages frequented by twenty- and thirty-something females.

This does not mean that a given semantic scheme is invalid. Each scheme, however, is limited by the vocabulary it covers and by the kinds of distinctions that it makes. That should be good news for those of us who make our living in computational semantics.

Going Deep

12 May 2010

When people read text, they may not understand everything in it. For example, a layman might look at an article from a medical journal and see only that it is about some kind of drug. Someone more familiar with medicine would pick up that this is an experimental drug for treating estrogen-sensitive breast cancer. An expert would note that the drug is an aromatase blocker that performs as well as a standard approved drug in a double-blind controlled trial with a large sample of patients.

If an application seeks simply to distinguish documents about pharmaceuticals from documents about toxic financial assets or about the World Cup tournament in South Africa, then it is enough to understand at a superficial level. If a physician is searching for treatment options for a patient with a recurrence of breast cancer, however, a much deeper grasp of content is called for.
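To make the superficial level concrete, here is a minimal sketch of the kind of shallow topic check that would be enough for the first case. The cue-word lists and the function name shallow_topic are hypothetical and chosen only for illustration; they do not come from any real semantic dictionary.

```python
import re
from collections import Counter

# Hypothetical cue words for each broad topic. These lists are
# illustrative assumptions, not entries from a real semantic dictionary.
TOPIC_CUES = {
    "pharmaceuticals": {"drug", "trial", "patients", "dose", "cancer"},
    "finance": {"assets", "bank", "mortgage", "toxic", "securities"},
    "world_cup": {"match", "goal", "tournament", "team", "stadium"},
}


def shallow_topic(text):
    """Return the topic whose cue words occur most often, or None if no cue matches."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for topic, cues in TOPIC_CUES.items():
        # Count every token that belongs to this topic's cue list.
        counts[topic] = sum(1 for token in tokens if token in cues)
    topic, hits = counts.most_common(1)[0]
    return topic if hits > 0 else None
```

A check like this can separate a drug article from a banking article, but it sees a medical paper the way the layman does: it has no way to tell an aromatase blocker from any other drug, which is exactly the gap the physician's search runs into.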

A general-purpose semantic dictionary covering a wide variety of subjects is more or less forced to favor breadth over depth. Collecting enough training data for two thousand dimensions is a major undertaking; doing it for twenty thousand dimensions entails a commitment of resources that one will have to justify. Still, if such a dictionary is critical for a given application, then we need to make the investment.

In many cases the domain of content to be covered can be quite circumscribed. Accordingly, we would probably be better off adding a fairly small number of dimensions to an existing semantic dictionary than building a whole new dictionary from scratch. This will require some special statistical balancing, of course, but balancing is what dictionary building is all about.
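As a rough illustration of adding a few dimensions to an existing dictionary, the sketch below assumes a dictionary is simply a mapping from words to weights over named dimensions. The balancing rule used here, rescaling the new dimensions so that their mean total weight matches that of the existing ones, is a stand-in chosen for simplicity and not a description of how any particular dictionary is actually balanced. The names dimension_totals and extend_dictionary are hypothetical.

```python
from statistics import mean


def dimension_totals(dictionary):
    """Sum the weights assigned to each dimension across all words."""
    totals = {}
    for weights in dictionary.values():
        for dim, weight in weights.items():
            totals[dim] = totals.get(dim, 0.0) + weight
    return totals


def extend_dictionary(base, additions):
    """Merge domain-specific entries into a copy of `base`.

    Assumed balancing rule (illustrative only): weights on new dimensions
    are rescaled so that the mean total weight of the new dimensions
    equals the mean total weight of the existing dimensions.
    """
    base_totals = dimension_totals(base)
    add_totals = dimension_totals(additions)
    new_dims = {dim for dim in add_totals if dim not in base_totals}

    scale = 1.0
    if new_dims and base_totals:
        new_mean = mean(add_totals[dim] for dim in new_dims)
        if new_mean > 0:
            scale = mean(base_totals.values()) / new_mean

    # Copy the base entries so the original dictionary is left untouched.
    merged = {word: dict(weights) for word, weights in base.items()}
    for word, weights in additions.items():
        entry = merged.setdefault(word, {})
        for dim, weight in weights.items():
            entry[dim] = weight * scale if dim in new_dims else weight
    return merged


# Toy data, invented for this example.
base = {
    "drug": {"medicine": 1.0},
    "bank": {"finance": 2.0},
    "trial": {"medicine": 1.0, "law": 2.0},
}
additions = {
    "aromatase": {"oncology": 0.5},
    "drug": {"oncology": 0.5},
}
merged = extend_dictionary(base, additions)
```

With this toy data, the single new dimension is scaled up so that it carries as much total weight as an average existing dimension, and the existing entries are left as they were. A real balancing step would presumably be driven by training data rather than by a single scale factor.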