Consider how we humans learn language. Even with formal education, it takes a child about 15 years starting from infancy to be able to read and understand general news articles in the New York Times. Over this period, one would probably hear or read at least on the order of 10 billion words. Even so, most high schoolers will need many additional years of schooling to become able to comprehend technical material.
So, how can anyone expect a computer to understand something like medical text after training on only about 100 million words of data? A computer of course runs on nanosecond cycles while the human brain operates on millisecond cycles; but we have had about 50,000 generations to evolve our language software, while the electronic computer has had only about 10 generations.
The bottom line here is that language learning is difficult; and it requires sifting through immense amounts of data. There probably is no magic technological shortcut here, but we have reached now the stage where our systems can routinely handle the volumes of data that would support semantic capabilities equivalent to an 8th-grade education. Decent commercial language processing tools are also now available.
Consequently, we are making major progress on semantic dictionaries, but have to be realistic about the work still ahead of us. Expect no overnight miracles from us or anyone else, especially when these are based on measly samples of data. There is still no royal road to semantics.
0 Comments » Posted in semantics by Clinton Mah
Read More »
In Norton Juster’s classic The Phantom Tollbooth, a young boy boy visits the Kingdom of Wisdom and finds that its principal cities, Dictionopolis and Digitopolis, are in a cold war likely to turn quite hot. This conflict makes no sense and is the consequence of the Princesses Rhyme and Reason having been exiled to the Castle in the Air.
Okay, the symbolism is a bit over the top, but the conflict about whether semantics should involve numbers as opposed to some logical formalism makes just as little sense and could also benefit from the return of Sweet Rhyme and Pure Reason. There is not just one way to build a house, or plant a garden, or skin a cat. In any real-world enterprise, we always have multiple options, each with tradeoffs.
Our job as an semantic API developer is to provide another option with tradeoffs that are attractive to users. What we offer with statistical semantics is simplicity, transparency, broad coverage, timely data, rigor, and historical grounding of methodology. And we strive to be better each day at what we do.
0 Comments » Posted in semantics by Clinton Mah
Read More »
The current SemanticHacker API offers more than one semantic dictionary. Each one is crafted from a particular collection of categorized documents at a particular time. The choice of a dictionary depends on one’s target application. Ideally, that dictionary will be trained on categorized documents similar to the documents to be analyzed for content.
Currently, the two main types of dictionaries available in English come from the ODP conceptual hierarchy and the USPTO class hierarchy. The dimensions defined for these types have practically no overlap. The differences in language and vocabulary in training data are also huge, and these have major consequences in the dimensional weights computed for terms in the two dictionaries.
In theory, one could employ a USPTO dictionary in a general web application, but one then risks being unable to pick up on popular language and culture. You won’t find “lol” or “Brangelina” in any patent. Similarly, an ODP dictionary may be a bit thin for handling medical journal articles; it would be much better here to have a semantic dictionary trained specifically on medical language and vocabulary.
The cost of building a specialized dictionary varies, mostly due to the complicated legal, technical, and logistical process of collecting the proper training data. Once the data is obtained, however, the actual dictionary process is largely mechanical, although we do carry out extensive quality assessment to determine whether we are running under optimal dictionary building parameters.
With proper training data in hand, we can turn out a semantic dictionary of about 200,000 terms over 2,000 dimensions in only about a day. This turnaround is possible because of our reliance on statistical methods as opposed to more complicated mathematical modeling of other semantic approaches. It means that we could build new dictionaries fast enough to keep up with news cycles as short as one week, given the computational resources needed.
0 Comments » Posted in General, Semantic Signatures, semantics by Clinton Mah
Read More »
Some linguists believe that early language was always about specific entities–that is, denotational; this kind of reference then evolves into concepts, which are connotational. For example, a baby learns his or her particular meaning of MAMA, which then generalizes into MOTHERHOOD.
We can still see such evolution at work today. About two years ago, the term “Sarah Palin” was only denotational, but after the fall of 2008, it has now become quite connotational. Something similar might be said on the other side of the political spectrum about the term “Barack Obama.”
The whole process of turning denotation into connotation has been extensively studied and is better known as “branding.” Anyone who has ever written a resumé has had experience in doing it.
A semantic dictionary in fact trades on the natural kind of branding. Since we do statistical analyses of context to assign meaning, We may not yet be able to interpret a term for which we have only a small sample of occurrences. Give us a little time, though.
0 Comments » Posted in General, semantics by Clinton Mah
Read More »