Archive for August, 2009

I’ve worked in Product Management for years, and for years our battles were ones with well-known, battle-weary opponents.  The battle lines had been drawn, strategic alliances were made and broken, and weapons were fired (i.e., fingers were pointed) if a product flopped, forecasts weren’t met, or the “biggest prospect ever” walked away.

Sales blamed Marketing, Sales blamed the Product Manager and vice versa, Developers blamed the Requirements Analysts and QA (bad requirements and bad QA people for not catching the bugs prior to launch), and the CEO blamed everybody, but always came back with donuts and cookies for us to enjoy as we licked our collective wounds.

I hate using the whole war/battleground analogy because in the end we’re all on the same team, working toward the same goal.  And if your company is in the business of selling products and services, the goal typically boils down to the bottom line – profitability.

Just over a year ago, I embarked on a new path in my Product Management journey.  I’m now working as a Product Manager for a company whose underlying technology is semantic web technology.  This means we’ve got PhDs and patent holders with specialties in things like informatics, search engine technology, information retrieval and relevance testing.  So in addition to engineers wanting to continually improve the system and its architecture, I’ve now got scientists who want to (and rightly so) continually improve the core technology, the algorithms.  Tweak, tweak, tweak.  They go to conferences like SIGIR and come back fired up, ready to go. They’re like a new battalion that has shown up on the battlefield, and I have to size them up quickly in order to protect my precious project timeline. “Are you friend or foe?”

So I find myself with an interesting dilemma.  How do you tell a scientist to think faster, come to a conclusion faster, improve the technology but get it done NOW so we can get the product built and out the door?  How do I get the engineers to wrap it up? Is it research or is it a project?  I need an answer, and what I keep hearing from all camps is, “It’s both!”

I can hear you Product Managers and Project Managers saying it now.  Just continue with new product development and merge the R&D advancements into the products as they’re ready.  I used to say and do that too, and it worked.  But that was before I landed in a place where what works for typical Web data doesn’t quite work for patent data, and none of it compares to that elusive Holy Grail called the medical domain.  And even tweaks in certain parts of the engineering code can potentially affect relevance and skew the results.

Have I given up on my beloved project schedules, my requirements documents, my product development lifecycle tools?  NEVER. I’m a Product Manager at heart. But just like the ever changing world of the semantic web, and the scientists and engineers I interact with daily, I’m tweaking my process as I go.

I was asked that question quite a few times when I was at the KM World and SemTech conferences. The answer is simple: use a Semantic Signature as a query against an index of Semantic Signatures to find the most relevant content.

In order to illustrate what a Semantic Signature is, we provide the example of a document with 30 semantic dimensions labeled using the Open Directory Project taxonomy (www.dmoz.org). The example led people to believe that a Semantic Signature is nothing more than a multivariate categorizer for content navigation, categorization, or other forms of content bucketing. While Signatures can certainly be used for that, it is not how we use them at TextWise.

If you examine a Semantic Signature without reading the thirty labels, you’ll observe it is a 30-dimension vector of concepts and weights. These concepts and weights are used by TextWise in a simple vector math calculation to determine the similarity between two signatures. Once a score is obtained, it is normalized to an integer value, and a cutoff is then chosen to determine whether each signature is relevant to the query.

For a real-world application, a user-controlled sliding scale from 1 to 10 can be used within the calculation to control which content items, represented by their Semantic Signatures, are displayed: a setting of 9 would instruct the application to show only highly relevant content, while a setting of 2 would show more content at greater recall.
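To make the calculation above concrete, here is a minimal sketch. It assumes cosine similarity as the “simple vector math” and a ceiling-based mapping onto the 1 to 10 scale; neither detail is spelled out here, and the names `cosine_similarity`, `to_scale`, and `search`, along with the three-dimension toy signatures, are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def to_scale(similarity, levels=10):
    """Map a similarity in [0, 1] to an integer from 1 to `levels` (assumed scheme)."""
    clamped = max(0.0, min(1.0, similarity))
    return max(1, math.ceil(clamped * levels))


def search(query, index, cutoff):
    """Return (doc_id, score) pairs whose integer score meets the cutoff, best first."""
    scored = [(doc_id, to_scale(cosine_similarity(query, signature)))
              for doc_id, signature in index.items()]
    hits = [(doc_id, score) for doc_id, score in scored if score >= cutoff]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy index: real Signatures would have 30 weighted dimensions, not 3.
index = {
    "sports": [1.0, 0.0, 0.0],
    "mixed": [1.0, 1.0, 0.0],
    "finance": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]           # the document being viewed, used as the query

strict = search(query, index, cutoff=9)   # slider set high: precision
loose = search(query, index, cutoff=2)    # slider set low: recall
```

With the slider set high only the near-identical signature survives; lowering it admits the partially overlapping one as well.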

Why would I use Semantic Signatures to search for content?  If you have read an article on the web or on your company’s intranet and attempted to find additional content related to what you’re looking at, you know it is a cumbersome process:  identify keywords to use from the source, use them to search, review the results, and repeat the process until you either find what you are looking for or give up.  If you perform the same search against an index of Semantic Signatures, you simply use the document as the query, eliminating the keyword/guesswork/review cycle inherent in today’s keyword systems.

From a developer’s perspective, the major benefits of Semantic Signatures are:

  • They are a very accurate and compact representation of a document – each Signature only consumes ~180 bytes of RAM.
  • Computing the similarity of Signatures is a very lightweight vector calculation and, unlike keyword matching, needs no patterns, alias tables, synonym tables, spell correction, etc.
  • Scalability. 3 million Signatures will easily fit within a 1.5 GB 32-bit Java VM, and full index searches take ~70 milliseconds.
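A quick back-of-the-envelope check of the figures in the list above. It only counts the raw Signature data (JVM object overhead is ignored) and reads “1.5 GB” as 1.5 GiB, both of which are simplifying assumptions.

```python
SIGNATURE_BYTES = 180            # approximate size per Signature, as quoted
SIGNATURE_COUNT = 3_000_000      # index size, as quoted
HEAP_BYTES = 1.5 * 1024 ** 3     # 1.5 GB heap, read as GiB (assumption)
SEARCH_SECONDS = 0.070           # ~70 ms full index search, as quoted

index_bytes = SIGNATURE_BYTES * SIGNATURE_COUNT
index_mib = index_bytes / 1024 ** 2
heap_fraction = index_bytes / HEAP_BYTES
comparisons_per_second = SIGNATURE_COUNT / SEARCH_SECONDS

print(f"raw index size: {index_mib:.0f} MiB")
print(f"share of heap:  {heap_fraction:.0%}")
print(f"comparisons/s:  {comparisons_per_second:,.0f}")
```

The raw data comes to roughly a third of the heap, which leaves room for the overhead this estimate leaves out.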

If you want to learn more about Semantic Signature technology and use our free API to create Semantic Signatures for your content, visit www.semantichacker.com.

I had an uncle who would coach my brothers when bringing home those less-than-spectacular report cards.  “Just tell your mother and father they changed the scale” and he suggested the following: A=Awful, B=Bad, C=Could be better, D=Dandy and F=Fantastic.  Alas the new scale was rejected, and the usual no TV purgatory followed for the boys. 

We all have a pretty good intuitive sense of the A-F grading scale, or the happy/sad face pain scale now used in hospitals and doctors’ offices.  But there is no intuitive scale in use today to determine whether something returned to us from a search on the web, or suggested as a related item to a web page or document being viewed, is actually ‘what you’re looking for’, or matches ‘intent’.  At my iSchool (Syracuse) we investigated all the variants of ‘what you’re looking for/intent’, including relevant, pertinent, accurate, useful, etc., and of course the recall/precision tradeoff.  (Wonder if it’s true that 9 times out of 10 business people want precision? That’s what a Gartner analyst claimed at SIGIR this year.)  But it still isn’t easy to create a scale for match judgments even when the definition (intent) is tightened up.  Pre-web, a binary scale was certainly most popular: Relevant/Not Relevant.  This was the TREC scale used for years, and it worked very well for tuning systems along precision/recall lines.  But this scale has always been problematic.  There are so many gradations of Relevant that it is very hard for humans to make a yes/no call on a match, especially when the evaluator is not the person who came up with the query.
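For readers who have not worked with the binary scale: once every item is judged Relevant or Not Relevant, precision and recall fall out of simple set arithmetic. The sketch below uses made-up document IDs, and the function name `precision_recall` is just for illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given binary relevance judgments."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: what share of the returned items were judged Relevant?
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: what share of the Relevant items were actually returned?
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


retrieved = ["d1", "d2", "d3", "d4"]   # what the system returned
relevant = ["d1", "d3", "d7"]          # what the judges marked Relevant
precision, recall = precision_recall(retrieved, relevant)
```

Returning more items tends to raise recall and lower precision, which is the tradeoff mentioned above.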

When we tested our advertising system at TextWise, we worked long and hard to provide definitions for a 5-point scale: Extremely Relevant, Highly Relevant, Somewhat Relevant, Not Relevant, Embarrassing.  Eventually Extremely Relevant was collapsed into Highly Relevant, as the distinction simply required too much documentation.  What counts as Embarrassing changes with the application.  For advertising, Embarrassing was the car ad placed on the page containing the article about how the car resembled an accordion after an accident.  For other applications, Embarrassing may simply mean that there is no discernible reason why the match occurred.

At the recent SIGIR09 meeting in Boston, another scale was frequently presented for web search: Perfect, Excellent, Good, Fair, Bad.  Most any system will show reasonably well with a scale like that.  Is this really “Yes!  Kind of, Sort of, Maybe, No”?  What is Excellent v Good v Fair?  These distinctions require lengthy discussions and serious documentation, and they will still yield noisy judgments.

Inter-coder reliability, or assessing how much humans actually agree on these human judgment tasks, is frequently measured by Kappa statistics.  For match judgments, this is usually not a strong number.    
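As a reminder of what the number means, here is a minimal sketch of Cohen’s kappa, the two-rater member of the Kappa family. It compares the observed agreement with the agreement two raters would reach by chance given how often each uses each label. The ratings below are invented for illustration.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(ratings_a)
    # Share of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Two judges, ten matches, R = Relevant, N = Not Relevant (made-up data).
judge_1 = ["R", "R", "R", "R", "R", "N", "N", "N", "N", "N"]
judge_2 = ["R", "R", "R", "N", "N", "N", "N", "N", "R", "R"]
kappa = cohens_kappa(judge_1, judge_2)
```

Here the judges agree on 6 of 10 items, but half of that agreement would be expected by chance, so kappa comes out at only about 0.2.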

There are even services now that provide these judgments, such as Mechanical Turk, Dolores Labs, etc.  These services use a “casual workforce”, so the training/documentation can’t be too burdensome.  One page max for guidelines is recommended.  This means whatever scale is used has to be pretty intuitive.  And there is a load of noise generated by low inter-coder reliability, which means paying for lots of judgments to account for the noise.

Interested in hearing from others who have travelled down this road.  What are you using?

A former colleague of mine used to have an entire can of soup for lunch every day. We razzed him about this, but he shook us all off until one day, I looked at the nutrition label on the can. That soup had 1800 mg of sodium altogether! We gravely informed him of this fact, and the soup was soon history.

To understand this story, you would have to know that the recommended daily maximum dietary intake of sodium for an adult is about 900 mg. Without this context, the number 1800 really means nothing. So what do all those numbers in a semantic dictionary mean, if anything?

The key property of semantic dictionary numbers is that they are based on probabilities and so have to fall between 0 and 1. They measure the likelihood that a document containing a given term is related to a given semantic dimension. For example, a dictionary weight of 1.0000 for a term and a dimension would indicate that a document containing the term is absolutely associated with the dimension.

There is a complication here, however. In real life, nothing is ever so certain. If we saw a 1.0000 term weight for a dimension, a more reasonable interpretation is that our sample of training data was too small for estimating the probability of that term accurately. A similar problem arises for a dictionary weight of 0.0000.

In general, a statistician will be highly suspicious of any extreme probabilities like 1.0000 and 0.0000. As proponents of statistical technology, we have to make a special effort to avoid such probability estimates in our semantic dictionaries. In contrast, certain other mathematical approaches to semantics tend to skate over niceties like this, choosing just to plug numbers into what is essentially a fixed formula.
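One textbook way to keep estimates away from 0 and 1 is additive (Laplace) smoothing. The sketch below is only an illustration of that general idea, not a description of how our dictionaries are actually built; the function name and the counts are hypothetical.

```python
def smoothed_probability(hits, total, alpha=1.0):
    """Estimate P(dimension | term) with additive smoothing over two outcomes.

    hits:  training documents containing the term that belong to the dimension
    total: all training documents containing the term
    alpha: pseudo-count added to each outcome (1.0 is classic Laplace smoothing)
    """
    if hits < 0 or total < hits:
        raise ValueError("need 0 <= hits <= total")
    return (hits + alpha) / (total + 2 * alpha)


# A term seen only 3 times, always in the dimension: the raw estimate is 1.0.
small_sample = smoothed_probability(3, 3)       # pulled well below 1
# The same pattern over 300 documents is far more convincing.
large_sample = smoothed_probability(300, 300)   # close to 1, but never equal
# A term never seen in the dimension no longer gets a hard 0.
never_seen = smoothed_probability(0, 3)
```

The smaller the sample, the harder the estimate is pulled toward the middle, which matches the intuition that a 1.0000 from three documents deserves less trust than one from three hundred.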

If one is careless about the meaning of numbers, though, how can one be careful in capturing the actual meaning of words?

When is a semantic dictionary good? It really depends on the application, since more specialized content requires more specialized dictionary dimensions. Typically, validation of a given application will involve extensive benchmark testing, often entailing human judgments of the effectiveness of particular statistical characterizations of content.

TextWise does all of this in its product development process, but one would not want to go through an elaborate validation procedure to test the consequences of every small change. As it turns out, there are quick statistical ways to check whether a change is likely to be good or bad. This is no substitute for actual detailed validation at some point, but it allows one to experiment with new ideas at a fairly low cost.

A digital photography metaphor is apt here. While one cannot use statistics to identify a prize-winning shot, it is certainly possible to detect major problems without human judgments. For example, areas of maximally white pixels indicate blown highlights, which typically detract from the quality of an image. Similarly, problems with white balance, dynamic range, focus, and other conditions are also readily detectable.
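To stay with the metaphor for a moment, the blown-highlight check really is that mechanical. This sketch assumes an 8-bit grayscale image held as nested lists, and the 5% warning threshold is an arbitrary value chosen for illustration.

```python
def blown_highlight_fraction(pixels, white=255):
    """Fraction of pixels at (or above) the maximum brightness value."""
    flat = [value for row in pixels for value in row]
    if not flat:
        return 0.0
    return sum(1 for value in flat if value >= white) / len(flat)


def looks_overexposed(pixels, threshold=0.05):
    """Flag an image when too many pixels are pure white (threshold is assumed)."""
    return blown_highlight_fraction(pixels) > threshold


# Tiny 2 x 4 "image": three of its eight pixels are pure white.
image = [
    [255, 255, 120, 80],
    [255, 30, 200, 10],
]
fraction = blown_highlight_fraction(image)
flagged = looks_overexposed(image)
```

No one had to look at the picture to learn that something is probably wrong with it, and that is the whole point of a statistical sanity check.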

With any huge data object like a semantic dictionary, it is difficult to construct a benchmark that will cover every aspect of it thoroughly. Statistical testing provides an overall sanity check on quality. Otherwise, one would just be buying and selling pigs in a poke.

If you have studied the mathematics of linear spaces, then you know that there are infinitely many ways to represent a given point in a space as a set of coordinates. For any particular set of data points, however, one can mechanically derive a particular set of axes that results in the representation requiring the fewest coordinates to capture the most important characteristics of those points.
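A toy illustration of deriving such an axis mechanically: the sketch below finds the single direction of greatest variance for a handful of 2-D points using power iteration on their scatter matrix, then replaces each point’s two coordinates with one coordinate along that axis. Full-scale methods apply a singular value decomposition to much larger matrices; the data and function names here are invented for illustration.

```python
import math


def principal_axis(points, iterations=100):
    """Approximate the direction of greatest variance via power iteration."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[i] for p in points) / n for i in range(dims)]
    centered = [[p[i] - means[i] for i in range(dims)] for p in points]
    # Scatter matrix: the covariance matrix up to a constant factor.
    scatter = [[sum(row[i] * row[j] for row in centered) for j in range(dims)]
               for i in range(dims)]
    # Uneven starting vector, so it is unlikely to be orthogonal to the answer.
    axis = [1.0 / (i + 1) for i in range(dims)]
    for _ in range(iterations):
        product = [sum(scatter[i][j] * axis[j] for j in range(dims))
                   for i in range(dims)]
        length = math.sqrt(sum(v * v for v in product))
        if length == 0:
            break
        axis = [v / length for v in product]
    return means, axis


def project(point, means, axis):
    """Coordinate of a point along the derived axis."""
    return sum((point[i] - means[i]) * axis[i] for i in range(len(axis)))


# Points scattered close to the diagonal line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
means, axis = principal_axis(points)
coords = [project(p, means, axis) for p in points]
```

For these points the derived axis lies close to the diagonal, so one coordinate per point captures nearly all of the variation that originally took two.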

This is the principle behind Latent Semantic Indexing (LSI), which was in large part why Susan Dumais of Microsoft received the Salton Award at the recently concluded SIGIR Conference. So, why don’t we use LSI?

It all boils down to whether one believes that a linear space is a good model for the semantics of natural language, and the main issue here is that of orthogonality. Orthogonality is great in a Pythagorean ideal world, but the real world tends to be quite messy. A rectangular grid can be imposed on places like the U.S. Midwest, but would be quite inappropriate for land management or road building in the Amazon Basin or in Siberia.

An orthogonal system disregards the landscape, which is in fact what we have to live with and in.  Two towns ten miles apart along a navigable river are in a sense closer than two towns five miles apart with a mountain range between them. Our approach to semantics is that of conforming to the landscape of text data, which is probably better described as being fractal than being orthogonal.

In our current effort to develop a French semantic dictionary, we ran across the word TSOIN in two stop lists posted on the Web. It was not in my old pocket Larousse, but an online French lexicon explained that it usually appeared in the doubled form “tsoin-tsoin.” Unfortunately, it had “no official definition.”

How can an expression common enough to be included on a stop list have no meaning? At this point, contextual semantics came in to save the day. Who really needs a normative model-driven definition? We could simply apply our standard structural methods to characterize where the expression occurs.

To begin with, there is a French Wikipedia tongue-in-cheek article about the African “tsoin-tsoin fly” that carries a lethargic disease that kills a person in about twenty or thirty years. A French rock band issued an album with “tsoin-tsoin” in its title. Several bloggers or forum posters have it in their user names.

So we can infer that “tsoin-tsoin” is probably not obscene. It seems to have some negative connotations like “slacker” or “lazy,” but in a somewhat positive way. A similar English adjective would be “laidback,” and one may perhaps even say “cool.”

Actually, our sample size is much too small to arrive at any reliable definition yet, but with more examples of its occurrence, we can eventually home in on a broadly accepted meaning. In a sense, this meaning is still being developed in popular usage; we are watching contextual semantics at work in real life.

In Norton Juster’s classic The Phantom Tollbooth, a young boy visits the Kingdom of Wisdom and finds that its principal cities, Dictionopolis and Digitopolis, are in a cold war likely to turn quite hot. This conflict makes no sense and is the consequence of the Princesses Rhyme and Reason having been exiled to the Castle in the Air.

Okay, the symbolism is a bit over the top, but the conflict about whether semantics should involve numbers as opposed to some logical formalism makes just as little sense and could also benefit from the return of Sweet Rhyme and Pure Reason. There is not just one way to build a house, or plant a garden, or skin a cat. In any real-world enterprise, we always have multiple options, each with tradeoffs.

Our job as a semantic API developer is to provide another option with tradeoffs that are attractive to users. What we offer with statistical semantics is simplicity, transparency, broad coverage, timely data, rigor, and historical grounding of methodology. And we strive to be better each day at what we do.

Iron Semanticist

12 Aug 2009

Some people have been disparaging a statistical approach to the semantics of natural language. This is essentially a kind of prejudice, as if we came from the wrong side of the technology railroad tracks. It ignores the fact that statistical approaches have performed spectacularly well in some high profile settings.

Have you ever watched the “Iron Chef” on the Food Network? This is where two competing chefs are given an ingredient kept secret until the start of the show, and each contestant then has 60 minutes to create an entire meal around that ingredient. A panel then judges and critiques the two meals and crowns a winner.

In 2003, DARPA ran its own version of “Iron Chef,” though with only a single team of collaborators from eleven academic institutions across the U.S. The team was given a language, with the task of creating a cross-language information retrieval system and a machine translation system within TEN DAYS after learning what the language actually was.

To make the challenge harder, the language was not French, Arabic, or Russian, but Cebuano, a language spoken in the Philippines. None of the team was familiar with the language, but through the magic of Internet collaboration, they were able in ten days to collect a corpus of resources in Cebuano and English and apply statistical methods to create both a fully workable cross-language retrieval system and a credible start to a translation capability.

The two principal investigators of the Herculean exercise wrote afterward that, given what they learned in those ten days, they would do better next time. They predicted that their team could build a fully working statistical machine translation facility for a specified language in just a single day given adequate linguistic and computational resources.

In ten days, you could not build even a parser for a language that you have never heard of, much less develop the semantic mapping of that language into some kind of logical model of meaning to support cross-language search and machine translation. Statistical methods do work in semantics.

Our Roots

11 Aug 2009

Semantic Signatures℠ approaches the meaning of words from the perspective of their context. In the past couple of months, there has been extensive discussion here and elsewhere about how this differs from RDF, the basis for the Semantic Web. The simplest answer is that we are data-driven where RDF is model-driven.

This dichotomy is nothing new. In fact, if we look at semantics over a hundred years ago, we see the empirical idea of contextual semantics in the structural linguistics of Ferdinand de Saussure in contrast to the logical formulation of meaning in the predicate calculus of Bertrand Russell and Alfred North Whitehead. The former inferred meaning from the comparative analysis of text; the latter defined a mapping between text and a formal model of possible meanings.

The model-driven approach became less popular after the logician Kurt Gödel proved the incompleteness of all non-trivial logical systems in the 1930s. Structural linguistics then became the favored approach until Noam Chomsky put the study of language back on a formal basis in the 1950s, and the semantics of language also tilted to the formal in order to be more consistent with the study of syntax.

This is not to say that one approach is right and the other is wrong. The choice of approach to take should really depend on one’s circumstances. If one has available an appropriate logical model, which today might correspond to a taxonomy and a formal way to relate taxonomic entities, then the model-driven option is compelling. On the other hand, if an appropriate model is lacking or incomplete, but there is plenty of tagged text data to work from, then the data-driven option should be considered.

One can in fact always choose to work with the best of both worlds. We are not the sole providers of data-driven semantic technology, but our statistical characterization of meaning is probably how de Saussure himself might have done it if he had had access to the World Wide Web and 21st-century cloud computing.