Archive for the ‘semantics’ Category

We have been thinking lately about how many dimensions a semantic dictionary should have. Some researchers at Carnegie Mellon have been approaching the same question from the perspective of neuroscience and real-time imaging of activity in the human brain while it is understanding language (http://bit.ly/buIZEx).

According to CMU, there are really only THREE basic semantic dimensions: (1) Can I eat it? (2) Can I pick it up? (3) Can I hide in it? Admittedly, this primitive partitioning of the world probably goes back to our primate origins, but it does have a certain resonance. Let’s remember it the next time we try to categorize journal articles in nanotechnology or search postings on someone’s Facebook wall.

Learning

21 Dec 2009

Consider how we humans learn language. Even with formal education, it takes a child about 15 years starting from infancy to be able to read and understand general news articles in the New York Times. Over this period, one would probably hear or read at least on the order of 10 billion words. Even so, most high schoolers will need many additional years of schooling to become able to comprehend technical material.

So, how can anyone expect a computer to understand something like medical text after training on only about 100 million words of data? A computer of course runs on nanosecond cycles while the human brain operates on millisecond cycles; but we have had about 50,000 generations to evolve our language software, while the electronic computer has had only about 10 generations.

The bottom line here is that language learning is difficult, and it requires sifting through immense amounts of data. There probably is no magic technological shortcut here, but we have now reached the stage where our systems can routinely handle the volumes of data that would support semantic capabilities equivalent to an 8th-grade education. Decent commercial language processing tools are also now available.

Consequently, we are making major progress on semantic dictionaries, but have to be realistic about the work still ahead of us. Expect no overnight miracles from us or anyone else, especially when these are based on measly samples of data. There is still no royal road to semantics.

A semantic dictionary in essence consists of triples [ t , d , w ] , where t is a term, d is a semantic dimension, and w is a weight. Each triple says that the occurrence of a term t in a document constitutes a raw vote of w for dimension d being relevant to the document. For example, [ BRAD , Arts/People/Jolie,_Angelina , 0.12100 ] indicates that the occurrence of BRAD in a news story provides evidence that it might be about the movie celebrity Angelina Jolie. If it were conclusive evidence, the weight would be 1.00000, but we never expect any single term to be that definitive.
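The triple structure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: apart from the BRAD triple quoted above, every term, dimension, and weight is invented, and simply summing the raw votes per dimension is an assumed way of tallying them, not a description of our production scoring.

```python
from collections import defaultdict

# Dictionary triples [t, d, w]. Only the first triple comes from the text;
# the others are hypothetical values made up for this sketch.
TRIPLES = [
    ("BRAD", "Arts/People/Jolie,_Angelina", 0.121),
    ("BRAD", "Arts/People/Pitt,_Brad", 0.350),
    ("ANGELINA", "Arts/People/Jolie,_Angelina", 0.400),
    ("PATENT", "Business/Electrical_Machinery", 0.200),
]

def build_index(triples):
    """Group triples by term so each term maps to its (dimension, weight) pairs."""
    index = defaultdict(list)
    for term, dimension, weight in triples:
        index[term].append((dimension, weight))
    return index

def raw_votes(tokens, index):
    """Add up the raw vote w for dimension d each time a term t occurs.

    Summation is an assumption made for illustration.
    """
    votes = defaultdict(float)
    for token in tokens:
        for dimension, weight in index.get(token.upper(), []):
            votes[dimension] += weight
    return dict(votes)

votes = raw_votes(["Brad", "and", "Angelina", "arrived"], build_index(TRIPLES))
```

Here BRAD and ANGELINA both vote for the Jolie dimension, while the Electrical Machinery dimension receives no votes at all because none of its terms occur.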

In building a dictionary for an application, we have to start with the dimensions. Do we have the kinds of dimensions to cover the target content, and are there enough dimensions to make the distinctions in content required by the application? For a patent information system, an Angelina dimension may not have much relevance, and even something more appropriate like Electrical Machinery may have to be divided up into multiple dimensions to support a reasonable level of granularity in indexing.

Given the dimensions, we next have to define the terms to go along with them. The target content we want to process will have a certain vocabulary, and our dictionary terms should try to encompass most of it. This can be tricky in a statistical approach because we need reasonably large samples of training data to make a particular term become associated with a particular dimension.

Weights are determined in large part by training data, but the distribution of those numbers is important. To begin with, not all weights should be the same, and generally, we want to see them spread out over the entire dynamic range available to us. Weights that are quite small or quite large have to be supported by more data than those in the middle range. Weights have to be balanced between dimensions, and there should be enough of them so that most terms are related to more than one dimension.

In theory, we could build a dictionary with just one weight in just one dimension for each term, but that would be in denial about the inherent ambiguity of language. So, we typically want a dictionary to be as big as possible, based on an appropriate amount of training data. To build the best possible dictionary requires much inspiration and much perspiration.

Even in the world of print, one dictionary is often not enough. Just for English, for example, we can go to standard references like Webster’s Third New International, The American Heritage Dictionary of the English Language, or the Oxford English Dictionary, as well as more specialized lexicons. So how many semantic dictionaries do we really need?

That of course depends on the application. If we are in the situation where our target text data is extremely stable and requires only a general vocabulary, then we might get away with a single semantic dictionary based on a large sample of data processed quite carefully. On the Web, however, we have nothing of the sort, if you haven’t noticed lately.

A sophisticated dictionary that took weeks to build with hairy mathematical algorithms on a reasonable sample of training text may become obsolete overnight. That is not to say that sophisticated dictionaries are unhelpful; but in the merciless competition of the information marketplace, we probably need to be able to pop out a new semantic dictionary based on a gigabyte or more of text in just hours.

Given this kind of turnaround, why would anyone want to rely on a single semantic dictionary with its limited vocabulary and somewhat dated concepts? A new dictionary will of course involve a nontrivial upfront investment, but once a reliable source of tagged data is developed, actual dictionary building can be largely automated. That is the advantage of relying on statistical methods.

Casino Royale

14 Sep 2009

In any statistical information system, one can never achieve absolute certainty. Every result is a kind of bet with the possibility of losing. For Semantic Signatures, however, this is more like playing blackjack than like playing roulette. Whether we imagine ourselves as the house or some hotshot card counter, we try our utmost to bend the odds in our favor.

When a given term occurs in a document, we know that there is a certain probability that the document is about a given topic. For example, THRILLER may relate to Michael Jackson or to some recent summer popcorn epic. Similarly MOONWALK may refer to Apollo XI or to a dance move. We would be rash to judge content just on the basis of a single term, but when multiple terms can corroborate each other, we do have a better bet.
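To see why corroboration gives a better bet, here is a small sketch. It assumes a noisy-OR combination, which treats each term as independent evidence; that rule and the weights are chosen purely for illustration and are not the formula behind Semantic Signatures.

```python
def combined_evidence(weights):
    """Noisy-OR: chance that at least one independent piece of evidence is right.

    Independence between terms is an assumption of this sketch.
    """
    miss = 1.0
    for weight in weights:
        miss *= 1.0 - weight  # probability that every term so far misleads us
    return 1.0 - miss

# Hypothetical weights for a "Michael Jackson" dimension.
one_term = combined_evidence([0.3])        # THRILLER alone
two_terms = combined_evidence([0.3, 0.3])  # THRILLER and MOONWALK together
```

Under these assumptions a single term leaves the bet at 0.3, while two corroborating terms raise it to 0.51. No number of weak terms ever pushes it all the way to certainty.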

The trick here is to be able to set up a semantic dictionary so that we can always expect to find a reasonable number of terms in a target document that allow us to make that better bet. This requires careful balancing: we need enough semantic dimensions to be able to distinguish the different important kinds of content and enough terms for each dimension to put it into play. It is much like developing a diverse portfolio of investments to weather any shift in economic climate.

Most people will probably pass on building their own semantic dictionaries. It takes a tremendous amount of work to collect and filter the requisite text data to ground our dictionary weights and to massage all those numbers to get the maximum amount of usable information. But we want to get on the right side of the odds.

I was asked that question quite a few times when I was at the KM World and SemTech conferences. The answer is simple: use a Semantic Signature as a query against an index of Semantic Signatures to find the most relevant content.

In order to illustrate what a Semantic Signature is, we provide the example of a document with 30 semantic dimensions labeled using the Open Directory Project taxonomy (www.dmoz.org). The example leads people to believe that a Semantic Signature is nothing more than a multivariate categorizer for content navigation, categorization, or other forms of content bucketing. While Signatures can certainly be used for that, it is not how we use them at TextWise.

If you examine a Semantic Signature without reading the thirty labels, you’ll observe that it is a 30-dimension vector of concepts and weights. These concepts and weights are used by TextWise in a simple vector math calculation to determine the similarity between two signatures. Once a score is obtained, it is normalized to an integer value, and then a cutoff is chosen to determine whether each signature is relevant to the query.

For a real-world application, a user-controlled sliding scale from 1 to 10 can be used within the calculation to control which content items, represented by their Semantic Signatures, are displayed: a setting of 9 would instruct the application to show only the highly relevant content, while a setting of 2 would show more content, with greater recall.
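The flow of scoring, normalizing, and applying a cutoff might look roughly like the sketch below. The text only says "a simple vector math calculation," so cosine similarity is an assumption here, as are the mapping onto integers 1 to 10, the function names, and the tiny three-dimension signatures used in place of real 30-dimension ones.

```python
import math

def cosine(sig_a, sig_b):
    """Cosine similarity between two signatures stored as {dimension: weight}."""
    dot = sum(w * sig_b.get(dim, 0.0) for dim, w in sig_a.items())
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def to_score(similarity, scale=10):
    """Map a similarity in [0, 1] to an integer from 1 to `scale` (assumed scheme)."""
    return max(1, min(scale, math.ceil(similarity * scale)))

def search(query, index, slider):
    """Return document ids whose integer score meets the slider cutoff, best first."""
    scored = [(to_score(cosine(query, sig)), doc_id) for doc_id, sig in index.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score >= slider]

# Hypothetical index and query signature.
INDEX = {
    "a": {"Sports": 0.9, "Arts": 0.1},
    "b": {"Arts": 0.8, "Science": 0.2},
}
QUERY = {"Sports": 0.8, "Arts": 0.2}

strict = search(QUERY, INDEX, slider=9)  # only the closest match survives
loose = search(QUERY, INDEX, slider=2)   # weaker matches are admitted too
```

With these made-up numbers, document "a" scores 10 and document "b" scores 3, so the strict setting returns only "a" while the loose setting returns both.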

Why would I use Semantic Signatures to search for content? If you have read an article on the web or on your company’s intranet and attempted to find additional content related to what you’re looking at, you know it is a cumbersome process: identify keywords to use from the source, use them to search, review the results, and repeat the process until you either find what you are looking for or give up. If you performed the same search against an index of Semantic Signatures, you would simply use the document as the query, eliminating the keyword/guesswork/review cycle inherent in today’s keyword systems.

From a developer’s perspective, the major benefits of Semantic Signatures are:

  • They are a very accurate and compact representation of a document – each Signature only consumes ~180 bytes of RAM.
  • Computing the similarity of Signatures is a very lightweight vector calculation and, unlike keyword matching, needs no patterns, alias tables, synonym tables, spell correction, etc.
  • Scalability. 3 million Signatures will easily fit within a 1.5 GB 32-bit Java VM, with full index searches taking ~70 milliseconds.
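The figures in the list above can be checked with quick arithmetic. The sketch uses only the numbers quoted there; treating a full index search as one comparison per Signature is an assumption.

```python
SIGNATURES = 3_000_000
BYTES_PER_SIGNATURE = 180   # approximate size quoted above
SEARCH_SECONDS = 0.070      # approximate full index search time quoted above

total_bytes = SIGNATURES * BYTES_PER_SIGNATURE
total_mib = total_bytes / 2**20

# Assumes a full search compares the query against every Signature once.
ns_per_signature = SEARCH_SECONDS / SIGNATURES * 1e9

print(f"{total_bytes:,} bytes, about {total_mib:.0f} MiB")
print(f"about {ns_per_signature:.1f} ns per comparison")
```

The raw Signature data comes to 540,000,000 bytes, roughly a third of a 1.5 GB heap, and the search time works out to a little over 23 nanoseconds per Signature.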

If you want to learn more about Semantic Signature technology and use our free API to create Semantic Signatures for your content, visit www.semantichacker.com.

A former colleague of mine used to have an entire can of soup for lunch every day. We razzed him about this, but he shook us all off until one day, I looked at the nutrition label on the can. That soup had 1800 mg of sodium altogether! We gravely informed him of this fact, and the soup was soon history.

To understand this story, you would have to know that the recommended daily maximum dietary intake of sodium for an adult is about 900 mg. Without this context, the number 1800 really means nothing. So what do all those numbers in a semantic dictionary mean, if anything?

The key property of semantic dictionary numbers is that they are based on probabilities and so have to fall between 0 and 1. They measure the likelihood that a document containing a given term is related to a given semantic dimension. For example, a dictionary weight of 1.0000 for a term and a dimension would indicate that a document containing the term is absolutely associated with the dimension.

There is a complication here, however. In real life, nothing is ever so certain. If we saw a 1.0000 term weight for a dimension, a more reasonable interpretation is that our sample of training data was too small for estimating the probability of that term accurately. A similar problem arises for a dictionary weight of 0.0000.

In general, a statistician will be highly suspicious of any extreme probabilities like 1.0000 and 0.0000. As proponents of statistical technology, we have to make a special effort to avoid such probability estimates in our semantic dictionaries. In contrast, certain other mathematical approaches to semantics tend to skate over niceties like this, choosing just to plug numbers into what is essentially a fixed formula.
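One textbook way to keep estimates away from 0 and 1 is additive (Laplace) smoothing. The sketch below is offered only as an example of the general idea; it is an assumption made for illustration and not necessarily how our dictionary weights are computed. The counts are invented.

```python
def raw_estimate(hits, total):
    """Fraction of documents containing the term that belong to the dimension."""
    return hits / total

def smoothed_estimate(hits, total, alpha=1.0):
    """Additive smoothing over two outcomes (related or not related).

    alpha is a pseudo-count; alpha=1.0 is the classic Laplace choice.
    """
    return (hits + alpha) / (total + 2.0 * alpha)

# A term seen in only 3 training documents, all tagged with the dimension.
tiny_sample_raw = raw_estimate(3, 3)            # 1.0, suspiciously certain
tiny_sample_smooth = smoothed_estimate(3, 3)    # 0.8, more cautious

# The same pattern over 300 documents earns a weight much closer to 1.
large_sample_smooth = smoothed_estimate(300, 300)
```

The smoothed weight never reaches 1.0000 or 0.0000, and a small sample is pulled toward the middle more strongly than a large one, which matches the intuition that extreme weights need more supporting data.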

If one is careless about the meaning of numbers, though, how can one be careful in capturing the actual meaning of words?

When is a semantic dictionary good? It really depends on the application, since more specialized content requires more specialized dictionary dimensions. Typically, validation of a given application will involve extensive benchmark testing, often entailing human judgments of the effectiveness of particular statistical characterizations of content.

TextWise does all of this in its product development process, but one would not want to go through an elaborate validation procedure to test the consequences of every small change. As it turns out, there are quick statistical ways to check whether a change is likely to be good or bad. This is no substitute for actual detailed validation at some point, but it allows one to experiment with new ideas at a fairly low cost.

A digital photography metaphor is apt here. Although one cannot use statistics to identify a prize-winning shot, it is certainly possible to detect major problems without human judgments. For example, areas of maximally white pixels indicate blown highlights, which typically detract from the quality of an image. Similarly, problems with white balance, dynamic range, focus, and other conditions are also readily detectable.

With any huge data object like a semantic dictionary, it is difficult to construct a benchmark that will cover every aspect of it thoroughly. Statistical testing provides an overall sanity check on quality. Otherwise, one would just be buying and selling pigs in a poke.
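As a rough idea of what such a sanity check could look like, the sketch below scans a list of [term, dimension, weight] triples for three warning signs: weights pinned at 0 or 1, terms tied to only a single dimension, and stretches of the 0 to 1 range that no weight occupies. The particular checks, the function name, and the sample data are assumptions for illustration, not our actual test suite.

```python
from collections import defaultdict

def sanity_report(triples, bins=10):
    """Cheap statistical checks on (term, dimension, weight) triples."""
    # Weights at the extremes usually signal too little training data.
    extreme = [(t, d, w) for t, d, w in triples if w <= 0.0 or w >= 1.0]

    dims_per_term = defaultdict(set)
    histogram = [0] * bins
    for term, dimension, weight in triples:
        dims_per_term[term].add(dimension)
        slot = max(0, min(int(weight * bins), bins - 1))
        histogram[slot] += 1

    single = sum(1 for dims in dims_per_term.values() if len(dims) == 1)
    return {
        "extreme": extreme,
        "single_dimension_fraction": single / len(dims_per_term) if dims_per_term else 0.0,
        "empty_bins": histogram.count(0),  # unused parts of the weight range
    }

# Hypothetical triples.
SAMPLE = [
    ("A", "d1", 0.15),
    ("A", "d2", 0.85),
    ("B", "d1", 1.00),
    ("C", "d2", 0.55),
]
report = sanity_report(SAMPLE)
```

Like a blown-highlight warning on a camera, a report of this kind says nothing about whether the dictionary is excellent, only whether something is obviously off.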

If you have studied the mathematics of linear spaces, then you know that there are infinitely many ways to represent a given point in a space as a set of coordinates. For any particular set of data points, however, one can mechanically derive a particular set of axes that results in the representation requiring the fewest coordinates to capture the most important characteristics of those points.
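A two-dimensional toy case shows the idea. The sketch below finds the principal axis of a handful of points from their covariance and re-expresses each point as a single coordinate along that axis. It is a plain principal-axis computation on invented data, used here as a stand-in for the general technique; real systems work in far more dimensions.

```python
import math

def principal_axis(points):
    """Return the centroid and the unit direction of greatest variance (2-D only)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Orientation of the major axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))

def project(points, center, axis):
    """Coordinate of each point along the axis, measured from the centroid."""
    cx, cy = center
    ax, ay = axis
    return [(x - cx) * ax + (y - cy) * ay for x, y in points]

# Points lying on a diagonal line need two coordinates in the original axes
# but only one coordinate along the derived axis.
POINTS = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
center, axis = principal_axis(POINTS)
coords = project(POINTS, center, axis)
```

For these points the derived axis runs along the diagonal, and the single coordinate per point is enough to reconstruct each original position exactly.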

This is the principle behind Latent Semantic Indexing (LSI), which was in large part why Susan Dumais of Microsoft received the Salton prize at the recently concluded SIGIR Conference. So, why don’t we use LSI?

It all boils down to whether one believes that a linear space is a good model for the semantics of natural language, and the main issue here is that of orthogonality. Orthogonality is great in a Pythagorean ideal world, but the real world tends to be quite messy. A rectangular grid can be imposed on places like the U.S. Midwest, but would be quite inappropriate for land management or road building in the Amazon Basin or in Siberia.

An orthogonal system disregards the landscape, which is in fact what we have to live with and in. Two towns ten miles apart along a navigable river are in a sense closer than two towns five miles apart with a mountain range between them. Our approach to semantics is that of conforming to the landscape of text data, which is probably better described as being fractal than being orthogonal.

In our current effort to develop a French semantic dictionary, we ran across the word TSOIN in two stop lists posted on the Web. It was not in my old pocket Larousse, but an online French lexicon explained that it usually appeared in the doubled form “tsoin-tsoin.” Unfortunately, it had “no official definition.”

How can an expression common enough to be included on a stop list have no meaning? At this point, contextual semantics came in to save the day. Who really needs a normative model-driven definition? We could simply apply our standard structural methods to characterize where the expression occurs.

To begin with, there is a tongue-in-cheek French Wikipedia article about the African “tsoin-tsoin fly” that carries a lethargic disease that kills a person in about twenty or thirty years. A French rock band issued an album with “tsoin-tsoin” in its title. Several bloggers or forum posters have it in their user names.

So we can infer that “tsoin-tsoin” is probably not obscene. It seems to have some negative connotations like “slacker” or “lazy,” but in a somewhat positive way. A similar English adjective would be “laidback,” and one may perhaps even say “cool.”

Actually, our sample size is much too small to support any reliable definition yet, but with more examples of its occurrence, we can eventually home in on a broadly accepted meaning. In a sense, this meaning is still being developed in popular usage; we are watching contextual semantics at work in real life.