Archive for May, 2010

Hijacked

26 May 2010

Has this ever happened to you? You are Googling for information on the Web, but inadvertently your query happens to share keywords with the latest cultural phenom: the next tweener heart throb, a YouTube video suddenly gone viral, or yet another paranoid political fantasy that refuses to die.

You are a professional, however, and so switch into Advanced Mode to reshape your query, but to no avail. Your information has been buried under pop detritus; it has been hijacked by the maximum likelihood estimate (MLE) on the Web.

At times like this, you want to grab your search engine by the neck and shout, “I am NOT a screaming twelve-year-old girl into dancing cats and fixated on the President’s birth place!” But your search engine continues blithely in the wisdom of the crowd.

It is a reminder that statistically grounded information systems are at the mercy of their training data. If we cede too much control of a system to its finely wrought black box judgment, then we sometimes are going to run off the tracks. This is especially true with web semantics.

If we do in fact want to get under the hood to adjust a semantic system to go against the popular flow, then it helps tremendously if the categories underlying the representation of document content are intelligible to people. Such transparency is a prime motivation for how semantic dictionaries are currently built by TextWise.

Of course, if you care nary a lick about transparency, then may I interest you in this slightly used synthetic collateralized debt obligation….

Going Deep

12 May 2010

When people read text, they may not understand everything in it. For example, a layman might look at an article from a medical journal and see only that it is about some kind of drug. Someone more familiar with medicine would pick up that this is an experimental drug for treating estrogen-sensitive breast cancer. An expert would note that the drug is an aromatase blocker that performs as well as a standard approved drug in a double-blind controlled trials with a large sample of patients.

If an application seeks simply to distinguish documents about pharmaceuticals from documents about toxic financial assets or about the World Cup tournament in South Africa, then it is enough to understand at a superficial level. If a physician is searching for treatment options for a patient with a recurrence of breast cancer, however, a much deeper grasp of content is called for.

A general type of semantic dictionary covering a broad variety of different subjects is more or less forced to opt for broad coverage by default. Collecting enough training data for two thousand dimensions is a major undertaking; having to do it for twenty thousand dimensions will entail a big commitment of resources that one will have to justify. Still, if such a dictionary is critical for a given application, then we need to make the investment.

In many cases the domain of content to be covered can be quite circumscribed. Accordingly, we probably would be better off to add a fairly small number of dimensions to an existing semantic dictionary rather than build a whole new dictionary from scratch. This will require some special statistical balancing of course, but balancing is what dictionary building is all about.

The 2010 Semantic Technology Conference will take place from June 21 – 25 in San Francisco.

This year our Executive Director of Science, Wen Ruan, will be on the ‘Semantic Social Networking’ panel along with Amit P. Sheth, Professor & Director of Kno.e.sis, Wright State University and Jamie Taylor who oversees data operations at Metaweb Technologies.

This panel will present some approaches and tools that combine semantics with social network data, visualization of relationships, measurement and interactions, and user-generated content analysis. For more on the topics being covered, go to the schedule. The session is scheduled on June 24th from 4:45pm – 5:45pm PST.

Watch a video of last year’s panel that Wen hosted: “Semantic Search: Beyond RDF

Hope to see you there!

Tags: