Posts Tagged ‘semantics’

Our Roots

11 Aug 2009

Semantic Signatures℠ approaches the meaning of words from the perspective of their context. In the past couple of months, there has been extensive discussion here and elsewhere about how this differs from RDF, the basis for the Semantic Web. The simplest answer is that we are data-driven where RDF is model-driven.

This dichotomy is nothing new. In fact, if we look at semantics over a hundred years ago, we see the empirical idea of contextual semantics in the structural linguistics of Ferdinand de Saussure in contrast to the logical formulation of meaning in the predicate calculus of Bertrand Russell and Alfred North Whitehead. The former inferred meaning from the comparative analysis of text; the latter defined a mapping between text and a formal model of possible meanings.

The model-driven approach became less popular after the logician Kurt Gödel proved the incompleteness of all sufficiently expressive logical systems in the 1930s. Structural linguistics then became the favored approach until Noam Chomsky put the study of language back on a formal basis in the 1950s, and the semantics of language also tilted to the formal in order to be more consistent with the study of syntax.

This is not to say that one approach is right and the other is wrong. The choice of approach to take should really depend on one’s circumstances. If one has available an appropriate logical model, which today might correspond to a taxonomy and a formal way to relate taxonomic entities, then the model-driven option is compelling. On the other hand, if an appropriate model is lacking or incomplete, but there is plenty of tagged text data to work from, then the data-driven option should be considered.

One can always in fact choose to work with the best of both worlds. We are not the sole providers of data-driven semantic technology, but our statistical characterization of meaning is probably how de Saussure himself might have done it if he had had access to the World Wide Web and 21st-century cloud computing.

Some linguists believe that early language was always about specific entities–that is, denotational; this kind of reference then evolves into concepts, which are connotational. For example, a baby learns his or her particular meaning of MAMA, which then generalizes into MOTHERHOOD.

We can still see such evolution at work today. About two years ago, the term “Sarah Palin” was only denotational, but after the fall of 2008, it has now become quite connotational. Something similar might be said on the other side of the political spectrum about the term “Barack Obama.”

The whole process of turning denotation into connotation has been extensively studied and is better known as “branding.” Anyone who has ever written a resumé has had experience in doing it.

A semantic dictionary in fact trades on the natural kind of branding. Since we do statistical analyses of context to assign meaning, we may not yet be able to interpret a term for which we have only a small sample of occurrences. Give us a little time, though.

Romeo Montague once noted that the semantic function of a name contrasts quite saliently with that of an ordinary word. Shakespeare didn’t quite put it that way, but it is a fact of language. A classic semanticist would frame it as a distinction between denotation (i.e. identification) and connotation (i.e. description).

As it turns out, this difference can be seen even at the statistical level. Ordinary words with a little massaging have a frequency distribution best described as binomial; names are typically not binomial. That will have consequences for how we mine text data to create a semantic dictionary.
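One rough way to see this kind of difference is to compare how much a term's count varies from one chunk of text to the next against what a binomial model would predict. The sketch below is only an illustration: the function name, the chunking scheme, and the counts are all made up, and it assumes that names depart from the binomial by clustering in a few places, which is an assumption and not something established above.

```python
from statistics import mean, pvariance

def dispersion_ratio(counts, chunk_size):
    """Observed variance of per-chunk counts divided by the binomial variance.

    counts: occurrences of one term in each equal-sized chunk of text.
    chunk_size: number of tokens per chunk (the binomial n).
    A ratio near 1 is consistent with a binomial model; a ratio far
    above 1 means the term clusters more than a binomial would allow.
    """
    p = mean(counts) / chunk_size            # estimated per-token probability
    expected_var = chunk_size * p * (1 - p)  # binomial variance n*p*(1-p)
    if expected_var == 0:
        raise ValueError("term never occurs, or occurs at every token")
    return pvariance(counts) / expected_var

# Invented counts per 1000-token chunk, for illustration only.
ordinary_word = [6, 13, 10, 14, 5, 11, 8, 13]   # spread fairly evenly
name = [0, 0, 38, 0, 1, 0, 41, 0]               # concentrated in two chunks

print(dispersion_ratio(ordinary_word, 1000))  # close to 1
print(dispersion_ratio(name, 1000))           # far above 1
```

Both invented terms occur 80 times in total, so raw frequency alone would not tell them apart; only the spread across chunks does.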

This is all a fine point, but the quality of a product is determined by many such fine points. None of our API competitors on the web bother with denotation and connotation, but it can really matter when you are processing data with many product designations.

Recently, there were news reports of scientists identifying an Oprah Winfrey neuron in the brain of an epileptic person who had been wired to help control seizures. This one particular neuron in the hippocampus fires whenever the person hears Oprah’s name or sees a picture of her. It may help to explain how memory works. It also can explain how semantic dictionaries work. In the case of Oprah, stimuli from many different senses travel various paths to converge on her neuron. In a semantic dictionary with Oprah as a concept, various terms associated with her in effect will vote for the concept with differing degrees of confidence when they occur in some document. When there is convergence because of mutual corroboration of terms, then one can infer that the document is about the queen of daytime TV.
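The voting idea can be sketched in a few lines of Python. Everything here is hypothetical: the dictionary entries, the confidence values, and the rule that at least two distinct terms must corroborate a concept are assumptions chosen for illustration, not a description of any actual dictionary.

```python
from collections import defaultdict

# Hypothetical entries: term -> {concept: confidence between 0 and 1}.
TOY_DICTIONARY = {
    "oprah": {"Oprah Winfrey": 0.9},
    "talk show": {"Oprah Winfrey": 0.4, "Television": 0.5},
    "book club": {"Oprah Winfrey": 0.3, "Literature": 0.6},
    "chicago": {"Oprah Winfrey": 0.1, "Illinois": 0.7},
}

def vote(terms, dictionary, min_voters=2):
    """Let each term in a document vote for the concepts it is linked to.

    A concept is kept only if at least `min_voters` distinct terms voted
    for it, so that a single term cannot carry a concept on its own.
    Returns (concept, total confidence) pairs, strongest first.
    """
    totals = defaultdict(float)
    voters = defaultdict(set)
    for term in terms:
        for concept, confidence in dictionary.get(term, {}).items():
            totals[concept] += confidence
            voters[concept].add(term)
    corroborated = {c: totals[c] for c in totals if len(voters[c]) >= min_voters}
    return sorted(corroborated.items(), key=lambda item: item[1], reverse=True)

# Three terms converge on one concept; the other concepts get one vote each.
print(vote(["oprah", "talk show", "book club"], TOY_DICTIONARY))
# A lone term corroborates nothing.
print(vote(["chicago"], TOY_DICTIONARY))
```

Summing confidences is the simplest possible way to combine votes; a real system might well combine them differently.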

What We Sell

22 Jul 2009

A TextWise semantic dictionary is essentially a big bunch of numbers between 0 and 1. To be more precise, they are conditional probabilities of a semantic dimension being relevant to a document containing an occurrence of a given term; but to a casual observer, they can look very ho-hum and uncool. What is so great about them?
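To make the idea of such a conditional probability concrete, here is a minimal sketch that estimates it by plain counting over documents already tagged with dimensions. The function name and the toy documents are invented, and simple counting is an assumption made for illustration; the actual estimation behind a TextWise dictionary is not described here.

```python
from collections import Counter

def estimate_weights(tagged_docs):
    """Estimate P(dimension is relevant | document contains term) by counting.

    tagged_docs: iterable of (terms, dimensions) pairs, one per document.
    Returns {(term, dimension): probability}, each value between 0 and 1.
    """
    docs_with_term = Counter()
    docs_with_both = Counter()
    for terms, dimensions in tagged_docs:
        for term in set(terms):              # count each document once per term
            docs_with_term[term] += 1
            for dimension in set(dimensions):
                docs_with_both[(term, dimension)] += 1
    return {
        (term, dimension): count / docs_with_term[term]
        for (term, dimension), count in docs_with_both.items()
    }

# Invented training documents, each a bag of terms plus its tagged dimensions.
toy_docs = [
    ({"bank", "river", "water"}, {"Water_Resources"}),
    ({"bank", "loan"}, {"Finance"}),
    ({"bank", "interest", "loan"}, {"Finance"}),
    ({"river", "erosion"}, {"Water_Resources"}),
]

weights = estimate_weights(toy_docs)
# Two of the three documents containing "bank" are tagged Finance.
print(weights[("bank", "Finance")])
```

With only a handful of documents per term, estimates like these are unreliable, which is one reason a small sample of occurrences is hard to interpret.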

Some people are in fact dismissive of any numbers being applied to semantics. This is probably because of the unfortunate legacy of numerical abuse in information technology, where system builders all too commonly slam numbers together willy-nilly and hope that something sensible comes out.

At TextWise, we don’t do this. We not only follow rigorous statistical practice to get the most information out of available text data, but also apply proprietary filtering and reduction methods to eliminate many of the anomalies that can slip through any statistical system by chance. To paraphrase the Colonel, “We do numbers right.”

According to WordNet, the word BANK has multiple senses, and so any occurrence of it in a text document is ambiguous. For example, we can have a river BANK, a financial BANK, a fog BANK, or an aeronautical BANK. The intended sense in a particular document has to be determined by looking at the context of occurrence. So, to determine the actual meaning of BANK in a document, we have to ask in effect whether the document is talking about streams of water, financial meltdowns, marine navigation, or aircraft in flight.

Now the number of different possible contexts is probably huge. One cannot hope to recognize them all; but for disambiguation of words, we need only fairly general contexts to distinguish the word senses of prime interest to us. Furthermore, given a large sample of our target text, we can employ statistical methods to identify the most important of such contexts.

This is essentially what SemanticHacker is all about. The dimensions of one of our semantic dictionaries define thousands of contextual reference points for the interpretation of terms. For example, if the words stream, water, flow, erosion, and grass are in a document, then with the ODP 2009 dictionary, we find that the top match dimension is 1461 (Top/Science/Environment/Water_Resources) with a weight of 0.5138. In this context, the word BANK would probably mean “river bank.”

Actually, we don’t need to make this explicit association. With a search engine user interface, one just needs a way of describing the context of ambiguous search terms, perhaps by listing contextual words. Then all a semantic search engine has to do is find a document containing the search term and having the same described context in its semantic signature. This is of course a part of our API for search.
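A toy version of that kind of search might look like the following. This is not the SemanticHacker API: the function names, the tiny dictionary, the documents, and the scoring by summed weight products are all assumptions made for the sake of a runnable sketch.

```python
# Hypothetical dictionary: word -> {dimension: weight}.
SEARCH_DICTIONARY = {
    "river": {"Water_Resources": 0.6},
    "erosion": {"Water_Resources": 0.5},
    "stream": {"Water_Resources": 0.5},
    "loan": {"Finance": 0.7},
    "interest": {"Finance": 0.4},
}

# Hypothetical documents, already split into words.
DOCUMENTS = {
    "doc1": ["bank", "river", "erosion"],
    "doc2": ["bank", "loan", "interest"],
    "doc3": ["river", "stream"],
}

def signature(words, dictionary):
    """Sum the per-dimension weights of the words into a crude signature."""
    sig = {}
    for word in words:
        for dimension, weight in dictionary.get(word, {}).items():
            sig[dimension] = sig.get(dimension, 0.0) + weight
    return sig

def search(term, context_words, documents, dictionary):
    """Return ids of documents that contain `term` and share context.

    Documents are ranked by the overlap between their signature and the
    signature of the context words; documents with no overlap are dropped.
    """
    context_sig = signature(context_words, dictionary)
    scored = []
    for doc_id, words in documents.items():
        if term not in words:
            continue
        doc_sig = signature(words, dictionary)
        overlap = sum(w * doc_sig.get(d, 0.0) for d, w in context_sig.items())
        if overlap > 0:
            scored.append((overlap, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

print(search("bank", ["stream", "water"], DOCUMENTS, SEARCH_DICTIONARY))  # ['doc1']
print(search("bank", ["loan"], DOCUMENTS, SEARCH_DICTIONARY))             # ['doc2']
```

Note that the sense of BANK is never named anywhere in the code; the context words alone steer the search toward the right documents.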

Search engines work remarkably well when one is searching for a popular topic. Just try the query LOVATO. If you are of the demographic normally reading this blog, then you probably don’t know yet who she is, but Google or Bing will find her. Although she is still obscure enough so that Lovato Electric, Inc., beats her out for top spot on Bing, there is no problem in getting the goods on this latest Disney ‘tween idol.

Here is a different, more frustrating search story, however. I was over at the National Gallery in Washington on Sunday and saw a remarkable series of Renaissance Italian frescos. At home afterwards, I queried on ITALIAN VILLA FRESCO NATIONAL GALLERY WASHINGTON, but found nothing recognizable on Google with either web or image search. About an hour later, I gave up after trying numerous variations of queries.

Then I went to www.nga.gov and navigated down to its 16th Century Italian art page. It offered a virtual tour of a series of frescos by Bernardino Luini on the legend of Procris and Cephalus. Bingo! According to the web site, “These nine paintings are the only examples of an Italian Renaissance fresco series in America.” Strangely enough, I had actually tried the term LUINI in one of my unsuccessful queries.

So we obviously have a failure to communicate here; and this is really a problem that semantic search should be addressing. The relevant page was out there and my queries should have been specific enough, but somehow a beautiful young bride being run through and killed by a magic javelin just wasn’t as sexy as Britney 4.0.

It was a great pleasure teaming up with Ron Kaplan (Powerset/Microsoft), Riza Berkan (hakia) and Kiki Hempelmann (RiverGlass) in this panel presentation on Semantic Search Beyond RDF at the SemTech 2009 conference. What is semantics, what is not? It is quite interesting to hear different perspectives. Particularly, is statistics semantics? One often hears the statement that statistics is not semantics. Then what about contextual semantics?

Statistics are only numbers, but with enough of the right kinds of numbers, one can model the economy of Uzbekistan, prove the existence of the Higgs boson, or characterize the content of a text document. Numbers are our friends, if we treat them with proper respect.

The important thing is to keep an open mind when it comes to semantic search. But we do agree on one thing: semantic search CAN go beyond RDF markup. The question is a matter of how.

Scalability and standard measurements were still hot topics around semantic search during the Q&A session. When the question of benchmarks for comparing search systems came up, the panelists all agreed that there is NO one benchmark number that can be used to compare all search systems, simply because such a number is hard to interpret and may not make sense for one’s business.

We’ve had a successful first couple of days after launching the Challenge. In addition to making TechCrunch, we’ve received coverage on a few other sites.

We also have a couple of applications and ideas already on the Forum. Overall we feel confident that something amazing will come out of this!