Posts Tagged ‘language’

The February issue of Scientific American had an article on the latest thinking about the Whorfian Hypothesis, which states that language strongly influences how humans think. This was a hot idea about sixty years ago, but eventually fell out of academic favor because of the lack of hard empirical evidence. Now that evidence is starting to show up, which has some implications for computational semantics.

The standard view on language and meaning has recently emphasized universality. This is to say that the understanding of language is hardwired in our heads, and so any competent human should qualify as an expert in the algorithmic delineation of meaning. The Whorfian hypothesis throws us a curve here in that we now have to consider language along with culture in our models of thought. A single well-crafted taxonomy or other semantic construct will not fit all.

We see something of this problem on the World Wide Web. As Jimmy Wales noted this past week, the content of the Web, and Wikipedia in particular, is largely created by twenty- and thirty-something males and so is dominated by their interests. A set of semantic categories derived from the Web in general will certainly be insufficient for understanding text on finance or on medicine and may be challenged even when dealing with the pages frequented by twenty- and thirty-something females.

This does not mean that a given semantic scheme is invalid. Each scheme, however, is limited by the vocabulary it covers and by the kinds of distinctions that it makes. That should be good news for those of us who make our living in computational semantics.

Back in the 1960s and 1970s, the Whorfian hypothesis was a hot subject on college campuses. This was the idea that one’s native language, its syntax and semantics, strongly shaped one’s worldview. For example, Eskimos speaking Inuit supposedly had thirty different words for snow and so had a more complex relationship with their environment than someone speaking English with only one word for snow.

The problem of course is that skiers can make plenty of distinctions about kinds of snow even in English. Although the Whorfian hypothesis was theoretically attractive, it did not square in the end with our actual experience with language. That pretty much took the steam out of the Whorfian hypothesis, but now in the 21st century, empirical support has been accumulating for a weaker version of it. This was the subject of an article in the New York Times Magazine (http://nyti.ms/boqzs5).

The weak Whorfian hypothesis rejects the idea that language establishes an absolute limit on thinking. Thus we can learn about distinctions in types of snow if we really need them. The structure of a language, however, definitely can bias our thinking; and this could have consequences in practical matters like the ranking of retrieved documents. The choice of a particular semantic framework like RDF may therefore affect the performance of an information system in unexpected ways.

So far, experimental results on language and thought have focused on highly specific biases in areas of language like giving spatial directions, assigning gender to nouns, and dividing the spectrum into colors. It seems plausible, though, that this should generalize to the overall semantic problem of dividing up meaning into some kind of compact space. There is more than one way to skin a cat here, and there are probably advantages and disadvantages in each possibility.

A dogmatist might be tempted to argue here that RDF with certain standard taxonomies is the right way and everything else is wrong, but that is probably overreaching. We are not yet savvy enough about semantics to carve tablets in stone about its implementation. At present, one can say only whether a given scheme is optimal in some formal sense; but if it makes no obvious sense to people, then something more comprehensible might be better in the long run even if it is less than optimal.

The weak Whorfian hypothesis forces us to be more honest. If each semantic scheme introduces its own biases, then we need to experiment to see how different approaches work out for a given target application. Given that humans operate with more than one linguistic framework, we should not be so quick to assume that machines can do better at semantics with just a single framework.

Basics

5 Oct 2010

Linguists have long debated whether human language ability is innate or is simply learned by highly plastic neurocircuitry of a general sort. Recent studies with fMRI scans indicate, however, that cognitive skills like language understanding tend to be associated with highly specific brain locations across different individuals, supporting the idea that some kind of language-related structure exists. Studies of people impaired by strokes occurring in language regions also have shown this.

So when a young child learns that Mama is related to a concept of MOTHER, which applies to more than a single individual, this seems to draw upon specialized built-in logic within the human brain. This kind of symbolic capability is not unique to humans, being found to some extent in other large-brained social animals like elephants, whales, dolphins, and chimpanzees; but we certainly have more of it. This can be seen in the relative size and organizational complexity of human brains.

The implication here is that concepts like MOTHER, BIRD, HOUSE, or FOOD are real in some sense at the genetic level. We of course do not necessarily all learn the same particular concepts; for example, speakers of different languages in different cultures can be expected to develop divergent concept frameworks. Nevertheless, it is possible to translate between unrelated languages like Inuit and English, meaning that there is still a large overlap in their linguistic repertoires of concepts.

Consequently, when we technologists talk about incorporating semantics into search engines and other applications, we need to remember that semantics existed a long time before the first boolean electronic circuit and that what we call “semantics” should be consistent with what goes on in our own heads. This is perhaps only a marketing concern, but the business of selling semantic technology will be that much harder if we cannot agree on what we really mean.

The concept of CONCEPT would seem to be a focus point for semantics that everyone can grasp. Whether we approach language and meaning like Wittgenstein or like Russell or like Korzybski or like Chomsky or like Miller or like Berners-Lee, it helps to get grounded properly.

Learning

21 Dec 2009

Consider how we humans learn language. Even with formal education, it takes a child about 15 years starting from infancy to be able to read and understand general news articles in the New York Times. Over this period, one would probably hear or read at least on the order of 10 billion words. Even so, most high schoolers will need many additional years of schooling to become able to comprehend technical material.

So, how can anyone expect a computer to understand something like medical text after training on only about 100 million words of data? A computer of course runs on nanosecond cycles while the human brain operates on millisecond cycles; but we have had about 50,000 generations to evolve our language software, while the electronic computer has had only about 10 generations.

The bottom line here is that language learning is difficult, and it requires sifting through immense amounts of data. There probably is no magic technological shortcut here, but we have now reached the stage where our systems can routinely handle the volumes of data that would support semantic capabilities equivalent to an 8th-grade education. Decent commercial language processing tools are also now available.

Consequently, we are making major progress on semantic dictionaries, but have to be realistic about the work still ahead of us. Expect no overnight miracles from us or anyone else, especially when these are based on measly samples of data. There is still no royal road to semantics.

In Norton Juster’s classic The Phantom Tollbooth, a young boy visits the Kingdom of Wisdom and finds that its principal cities, Dictionopolis and Digitopolis, are in a cold war likely to turn quite hot. This conflict makes no sense and is the consequence of the Princesses Rhyme and Reason having been exiled to the Castle in the Air.

Okay, the symbolism is a bit over the top, but the conflict about whether semantics should involve numbers as opposed to some logical formalism makes just as little sense and could also benefit from the return of Sweet Rhyme and Pure Reason. There is not just one way to build a house, or plant a garden, or skin a cat. In any real-world enterprise, we always have multiple options, each with tradeoffs.

Our job as a semantic API developer is to provide another option with tradeoffs that are attractive to users. What we offer with statistical semantics is simplicity, transparency, broad coverage, timely data, rigor, and historical grounding of methodology. And we strive to be better each day at what we do.

The current SemanticHacker API offers more than one semantic dictionary. Each one is crafted from a particular collection of categorized documents at a particular time. The choice of a dictionary depends on one’s target application. Ideally, that dictionary will be trained on categorized documents similar to the documents to be analyzed for content.

Currently, the two main types of dictionaries available in English come from the ODP conceptual hierarchy and the USPTO class hierarchy. The dimensions defined for these types have practically no overlap. The differences in language and vocabulary in training data are also huge, and these have major consequences in the dimensional weights computed for terms in the two dictionaries.

In theory, one could employ a USPTO dictionary in a general web application, but one then risks being unable to pick up on popular language and culture. You won’t find “lol” or “Brangelina” in any patent. Similarly, an ODP dictionary may be a bit thin for handling medical journal articles; it would be much better here to have a semantic dictionary trained specifically on medical language and vocabulary.
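One rough way to check whether a dictionary suits a target application is to measure how much of a sample document's vocabulary it covers. The sketch below is only an illustration of that idea: the two toy vocabularies and the function name `coverage` are assumptions made up for the example, not SemanticHacker data or API calls.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def coverage(vocabulary, text):
    """Return the fraction of distinct tokens in text found in vocabulary."""
    tokens = set(tokenize(text))
    if not tokens:
        return 0.0
    return len(tokens & vocabulary) / len(tokens)

# Toy vocabularies standing in for real dictionaries (hypothetical contents).
web_vocab = {"lol", "brangelina", "movie", "the", "new", "is", "about"}
patent_vocab = {"apparatus", "claim", "embodiment", "the", "is", "new", "about"}

sample = "LOL the new Brangelina movie is about"
scores = {
    "web": coverage(web_vocab, sample),
    "patent": coverage(patent_vocab, sample),
}
# Pick the dictionary whose vocabulary covers the sample best.
best = max(scores, key=scores.get)
```

A real comparison would use a representative sample of target documents rather than one sentence, and coverage is only a first filter; it says nothing about whether the dimensions themselves fit the application.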

The cost of building a specialized dictionary varies, mostly due to the complicated legal, technical, and logistical process of collecting the proper training data. Once the data is obtained, however, the actual dictionary-building process is largely mechanical, although we do carry out extensive quality assessment to determine whether we are running under optimal dictionary-building parameters.

With proper training data in hand, we can turn out a semantic dictionary of about 200,000 terms over 2,000 dimensions in only about a day. This turnaround is possible because of our reliance on statistical methods as opposed to more complicated mathematical modeling of other semantic approaches. It means that we could build new dictionaries fast enough to keep up with news cycles as short as one week, given the computational resources needed.
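The post does not spell out the statistics involved, so the following is only a minimal sketch of what a term-by-dimension table built from categorized documents might look like. It assumes, purely for illustration, that each category is one dimension and that a term's weight is the share of its occurrences falling in that category; the function name `build_dictionary` and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict

def build_dictionary(categorized_docs):
    """Map each term to per-category weights.

    The weight is the share of a term's occurrences that fall in each
    category. This is an illustrative choice, not the actual method.
    """
    counts = defaultdict(Counter)  # term -> Counter of category -> occurrences
    for category, text in categorized_docs:
        for term in text.lower().split():
            counts[term][category] += 1
    dictionary = {}
    for term, by_category in counts.items():
        total = sum(by_category.values())
        dictionary[term] = {c: n / total for c, n in by_category.items()}
    return dictionary

# Toy training data: (category, text) pairs.
docs = [
    ("medicine", "patient dose trial"),
    ("medicine", "dose response"),
    ("finance", "bond yield trial"),
]
weights = build_dictionary(docs)
```

Because this kind of counting is a single pass over the training data, rebuilding a dictionary when new categorized documents arrive is mostly a matter of rerunning it, which fits the short turnaround described above.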

Some linguists believe that early language was always about specific entities (that is, denotational); this kind of reference then evolves into concepts, which are connotational. For example, a baby learns his or her particular meaning of MAMA, which then generalizes into MOTHERHOOD.

We can still see such evolution at work today. About two years ago, the term “Sarah Palin” was only denotational, but since the fall of 2008, it has become quite connotational. Something similar might be said on the other side of the political spectrum about the term “Barack Obama.”

The whole process of turning denotation into connotation has been extensively studied and is better known as “branding.” Anyone who has ever written a resumé has had experience in doing it.

A semantic dictionary in fact trades on the natural kind of branding. Since we do statistical analyses of context to assign meaning, we may not yet be able to interpret a term for which we have only a small sample of occurrences. Give us a little time, though.
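The small-sample caveat can be made concrete with a toy context counter. The sketch below is an assumption-laden illustration, not the actual analysis: the function name `context_profile`, the window size, and the threshold of three occurrences are all invented for the example.

```python
from collections import Counter

def context_profile(term, texts, window=2, min_occurrences=3):
    """Count words appearing within `window` positions of `term`.

    Returns None when the term occurs fewer than `min_occurrences` times,
    signalling that the sample is too small to interpret.
    """
    neighbors = Counter()
    occurrences = 0
    for text in texts:
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            if token != term:
                continue
            occurrences += 1
            start = max(0, i - window)
            # Words before and after the term, excluding the term itself.
            context = tokens[start:i] + tokens[i + 1:i + 1 + window]
            neighbors.update(context)
    if occurrences < min_occurrences:
        return None
    return neighbors

texts = [
    "fresh snow fell overnight",
    "skiers love powder snow",
    "wet snow is heavy",
]
snow = context_profile("snow", texts)      # seen three times: profile returned
rare = context_profile("powder", texts)    # seen once: too little evidence
```

Here "snow" clears the threshold and gets a context profile, while "powder" is held back until more occurrences accumulate.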