Learning

21 Dec 2009

Consider how we humans learn language. Even with formal education, it takes a child about 15 years starting from infancy to be able to read and understand general news articles in the New York Times. Over this period, one would probably hear or read at least on the order of 10 billion words. Even so, most high schoolers will need many additional years of schooling before they can comprehend technical material.

So, how can anyone expect a computer to understand something like medical text after training on only about 100 million words of data? A computer of course runs on nanosecond cycles while the human brain operates on millisecond cycles; but we have had about 50,000 generations to evolve our language software, while the electronic computer has had only about 10 generations.

The bottom line here is that language learning is difficult, and it requires sifting through immense amounts of data. There probably is no magic technological shortcut here, but we have now reached the stage where our systems can routinely handle the volumes of data that would support semantic capabilities equivalent to an 8th-grade education. Decent commercial language processing tools are also now available.

Consequently, we are making major progress on semantic dictionaries, but have to be realistic about the work still ahead of us. Expect no overnight miracles from us or anyone else, especially when these are based on measly samples of data. There is still no royal road to semantics.

Even in the world of print, one dictionary is often not enough. Just for English, for example, we can go to standard references like Webster’s Third New International, The American Heritage Dictionary of the English Language, or the Oxford English Dictionary, as well as more specialized lexicons. So how many semantic dictionaries do we really need?

That of course depends on the application. If we are in the situation where our target text data is extremely stable and requires only a general vocabulary, then we might get away with a single semantic dictionary based on a large sample of data processed quite carefully. On the Web, however, we have nothing of the sort, if you haven’t noticed lately.

A sophisticated dictionary that took weeks to build with hairy mathematical algorithms on a reasonable sample of training text may become obsolete overnight. That is not to say that sophisticated dictionaries are unhelpful, but in the merciless competition of the information marketplace, we probably need to be able to pop out a new semantic dictionary based on a gigabyte or more of text in just hours.

Given this kind of turnaround, why would anyone want to rely on a single semantic dictionary with its limited vocabulary and somewhat dated concepts? A new dictionary will of course involve a nontrivial upfront investment, but once a reliable source of tagged data is developed, actual dictionary building can be largely automated. That is the advantage of relying on statistical methods.
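To make the idea of automated, statistics-based dictionary building concrete, here is a minimal sketch in Python. It is not the method behind any actual product. It assumes that a "semantic dictionary" can be approximated as a map from each term to a weight per category, and that tagged data arrives as (category, text) pairs. The function name `build_dictionary`, the whitespace tokenization, and the weighting scheme (per-category relative frequency, normalized per term) are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def build_dictionary(tagged_docs, min_count=1):
    """Build {term: {category: weight}} from (category, text) pairs.

    Hypothetical weighting: the term's relative frequency within each
    category, rescaled so that a term's weights sum to 1.
    """
    term_cat_counts = defaultdict(Counter)  # term -> category -> count
    cat_totals = Counter()                  # category -> total tokens seen

    for category, text in tagged_docs:
        for term in text.lower().split():   # naive whitespace tokenizer
            term_cat_counts[term][category] += 1
            cat_totals[category] += 1

    dictionary = {}
    for term, counts in term_cat_counts.items():
        if sum(counts.values()) < min_count:
            continue  # drop terms too rare to weight reliably
        rates = {c: n / cat_totals[c] for c, n in counts.items()}
        total = sum(rates.values())
        dictionary[term] = {c: r / total for c, r in rates.items()}
    return dictionary


sample = [
    ("medicine", "patient dose patient"),
    ("sports", "goal match goal patient"),
]
weights = build_dictionary(sample)
```

In this toy sample, "dose" occurs only in the medicine documents and so gets all of its weight on that category, while "patient" is split between the two with medicine dominating. Rebuilding on fresh tagged data is just a rerun of the same function, which is the point of the statistical approach.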

The current SemanticHacker API offers more than one semantic dictionary. Each one is crafted from a particular collection of categorized documents at a particular time. The choice of a dictionary depends on one’s target application. Ideally, that dictionary will be trained on categorized documents similar to the documents to be analyzed for content.

Currently, the two main types of dictionaries available in English come from the ODP conceptual hierarchy and the USPTO class hierarchy. The dimensions defined for these types have practically no overlap. The differences in language and vocabulary in the training data are also huge, and these have major consequences for the dimensional weights computed for terms in the two dictionaries.

In theory, one could employ a USPTO dictionary in a general web application, but one then risks being unable to pick up on popular language and culture. You won’t find “lol” or “Brangelina” in any patent. Similarly, an ODP dictionary may be a bit thin for handling medical journal articles; it would be much better here to have a semantic dictionary trained specifically on medical language and vocabulary.
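One rough way to check whether a dictionary fits a target application is to measure how much of a sample of the target text its vocabulary actually covers. The sketch below does that under simplifying assumptions: each dictionary is represented only by its set of terms, tokenization is plain whitespace splitting, and the names `coverage` and `pick_dictionary` are hypothetical helpers, not part of the SemanticHacker API. The tiny vocabularies are made up for illustration.

```python
def coverage(dictionary_terms, text):
    """Fraction of tokens in text that appear in the dictionary's vocabulary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in dictionary_terms)
    return known / len(tokens)


def pick_dictionary(dictionaries, sample_text):
    """Return the name of the dictionary with the highest coverage of the sample."""
    return max(dictionaries, key=lambda name: coverage(dictionaries[name], sample_text))


# Made-up vocabularies standing in for a general web and a patent dictionary.
dictionaries = {
    "general_web": {"lol", "movie", "star", "that"},
    "patents": {"apparatus", "claim", "method", "embodiment"},
}
best = pick_dictionary(dictionaries, "lol that movie star")
```

Coverage is only a first filter. A dictionary can contain a term and still weight it poorly for the target domain, so a real evaluation would also look at how sensible the resulting weights are.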

The cost of building a specialized dictionary varies, mostly due to the complicated legal, technical, and logistical process of collecting the proper training data. Once the data is obtained, however, the actual dictionary building process is largely mechanical, although we do carry out extensive quality assessment to determine whether we are running under optimal dictionary building parameters.

With proper training data in hand, we can turn out a semantic dictionary of about 200,000 terms over 2,000 dimensions in only about a day. This turnaround is possible because we rely on statistical methods rather than the more complicated mathematical modeling used in other semantic approaches. It means that we could build new dictionaries fast enough to keep up with news cycles as short as one week, given the computational resources needed.
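To put those figures in perspective, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that every term carries a weight on every dimension (a dense table). A real dictionary may well be sparse, in which case the counts below are upper bounds.

```python
TERMS = 200_000                  # dictionary size quoted above
DIMENSIONS = 2_000               # dimensions quoted above
SECONDS_PER_DAY = 24 * 60 * 60   # "about a day" taken as exactly 24 hours


def dense_weight_count(terms, dimensions):
    """Number of weights if every term has a value on every dimension."""
    return terms * dimensions


def weights_per_second(terms, dimensions, build_seconds):
    """Average rate needed to fill a dense table within the build time."""
    return dense_weight_count(terms, dimensions) / build_seconds


total = dense_weight_count(TERMS, DIMENSIONS)
rate = weights_per_second(TERMS, DIMENSIONS, SECONDS_PER_DAY)
print(f"{total:,} weights, about {rate:,.0f} per second over one day")
```

Under the dense assumption that comes to 400 million weights, or roughly 4,600 per second sustained over a day, which is a modest rate for simple counting but would be demanding for heavier per-weight modeling.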