Archive for the ‘semantics’ Category

Hijacked

26 May 2010

Has this ever happened to you? You are Googling for information on the Web, but inadvertently your query happens to share keywords with the latest cultural phenom: the next tweener heartthrob, a YouTube video suddenly gone viral, or yet another paranoid political fantasy that refuses to die.

You are a professional, however, and so you switch into Advanced Mode to reshape your query, but to no avail. Your information has been buried under pop detritus; it has been hijacked by the maximum likelihood estimate (MLE) on the Web.

At times like this, you want to grab your search engine by the neck and shout, “I am NOT a screaming twelve-year-old girl into dancing cats and fixated on the President’s birth place!” But your search engine continues blithely in the wisdom of the crowd.

It is a reminder that statistically grounded information systems are at the mercy of their training data. If we cede too much control of a system to its finely wrought black box judgment, then we sometimes are going to run off the tracks. This is especially true with web semantics.

If we do in fact want to get under the hood to adjust a semantic system to go against the popular flow, then it helps tremendously if the categories underlying the representation of document content are intelligible to people. Such transparency is a prime motivation for how semantic dictionaries are currently built by TextWise.

Of course, if you care nary a lick about transparency, then may I interest you in this slightly used synthetic collateralized debt obligation….

Going Deep

12 May 2010

When people read text, they may not understand everything in it. For example, a layman might look at an article from a medical journal and see only that it is about some kind of drug. Someone more familiar with medicine would pick up that this is an experimental drug for treating estrogen-sensitive breast cancer. An expert would note that the drug is an aromatase blocker that performs as well as a standard approved drug in double-blind controlled trials with a large sample of patients.

If an application seeks simply to distinguish documents about pharmaceuticals from documents about toxic financial assets or about the World Cup tournament in South Africa, then it is enough to understand at a superficial level. If a physician is searching for treatment options for a patient with a recurrence of breast cancer, however, a much deeper grasp of content is called for.

A general type of semantic dictionary covering a broad variety of different subjects is more or less forced to opt for broad coverage by default. Collecting enough training data for two thousand dimensions is a major undertaking; having to do it for twenty thousand dimensions will entail a big commitment of resources that one will have to justify. Still, if such a dictionary is critical for a given application, then we need to make the investment.

In many cases the domain of content to be covered can be quite circumscribed. Accordingly, we probably would be better off to add a fairly small number of dimensions to an existing semantic dictionary rather than build a whole new dictionary from scratch. This will require some special statistical balancing of course, but balancing is what dictionary building is all about.

Perfect What?

22 Mar 2010

We have been musing about the true topology of semantic spaces and how this affects our concept of dimensionality. This segues logically into a hot area of contention. In our linear approximation of meaning, how many dimensions do we really need and what should they be?

Some people prefer to approach this problem mathematically. Given a representative sample of documents to describe semantically, we can look at the relationship between terms and documents as defining a vector space. One can then apply the method of singular value decomposition (SVD) to find a minimal set of basis vectors to span that space. These singular vectors are like eigenvectors on steroids.

If you have actually read this far into this blog, then you will know that we (TextWise) have a competitor that employs SVD for semantic analysis. We get asked all the time why we have stuck with basic statistical techniques when we could instead be rigorously mathematical. Our usual response is that we have much faster turnaround in building semantic dictionaries, finer-grain descriptions of content, and more intuitive concepts overall.

There are more fundamental concerns, however, both theoretical and practical. On the theoretical side, SVD might be pushing a linear-space semantic model too far if meaning is in fact topologically complex. More significant on the practical side, though, is that one might get caught in the common problem of overtraining.

Suppose that we have a hundred thousand blog postings to which we apply SVD to get some optimal set of dimensions for analyzing their content. What then happens next week when we get a million new blogs that we have never seen before? Our perfect basis set is now distinctly handicapped.

Now we could try to reprocess all our data here, but SVD is so computationally intensive as an algorithm that it probably will be too slow to keep up without superextraordinary investments in hardware resources. We also would end up with an unstable system in which it is quite difficult to compare results from one week to the next. Anyway, we made our choice here.

People in the information sciences are fond of high-dimensional vector spaces as models of document content. These are in fact only approximations of reality, however; and in the specific case of semantics, they are probably an oversimplification. We already know something about how the neural circuitry in our brains works when we process the meaning of language; we can find no clean finite-dimensional linear space in the tangle of our synapses.

Neural imaging like PET does support the theory that linguistic concepts correspond to particular clusters of neurons connected in fairly complex feedback loops. Our understanding here is still quite limited, though. We do not know how many such clusters exist or how widely they are distributed. Visual concepts are in a different part of the brain than auditory concepts, for example; and overall, we have not yet found any obvious switchboard, say in the hippocampus, that could somehow tie everything together neatly.

In our computational semantic model, we assume that all concepts are independent and equal. That seems to work in semantic dictionary applications when we have thousands of concepts as dimensions, but an epistemologist here would have the lurking suspicion that our actual semantic space has to be some kind of complex manifold with all kinds of holes and twisting surfaces like a deranged n-th-order Moebius strip. Meaning is messy.

Our linear Euclidean model may therefore be valid only in a small local region of our actual semantic space, but in practice, that is really where all our apps have to live. One cannot presume to comprehend all possible content in text. We can only slice off a small piece of the pie of meaning, and until world peace and perfect enlightenment break out, that is a good start.

We have been thinking lately about how many dimensions a semantic dictionary should have. Some researchers at Carnegie Mellon have been approaching the same question from the perspective of neuroscience and real-time imaging of activity in the human brain while it is understanding language (http://bit.ly/buIZEx).

According to CMU, there are really only THREE basic semantic dimensions: (1) Can I eat it? (2) Can I pick it up? (3) Can I hide in it? Admittedly, this primitive partitioning of the world probably goes back to our primate origins, but does have a certain resonance. Let’s remember it the next time we try to categorize journal articles in nanotechnology or search postings on someone’s Facebook wall.

Learning

21 Dec 2009

Consider how we humans learn language. Even with formal education, it takes a child about 15 years starting from infancy to be able to read and understand general news articles in the New York Times. Over this period, one would probably hear or read at least on the order of 10 billion words. Even so, most high schoolers will need many additional years of schooling to become able to comprehend technical material.

So, how can anyone expect a computer to understand something like medical text after training on only about 100 million words of data? A computer of course runs on nanosecond cycles while the human brain operates on millisecond cycles; but we have had about 50,000 generations to evolve our language software, while the electronic computer has had only about 10 generations.

The bottom line here is that language learning is difficult; and it requires sifting through immense amounts of data. There probably is no magic technological shortcut here, but we have reached now the stage where our systems can routinely handle the volumes of data that would support semantic capabilities equivalent to an 8th-grade education. Decent commercial language processing tools are also now available.

Consequently, we are making major progress on semantic dictionaries, but have to be realistic about the work still ahead of us. Expect no overnight miracles from us or anyone else, especially when these are based on measly samples of data. There is still no royal road to semantics.

A semantic dictionary in essence consists of triples [ t , d , w ] , where t is a term, d is a semantic dimension, and w is a weight. Each triple says that the occurrence of a term t in a document constitutes a raw vote of w for dimension d being relevant to the document. For example, [ BRAD , Arts/People/Jolie,_Angelina , 0.12100 ] indicates that the occurrence of BRAD in a news story provides evidence that it might be about the movie celebrity Angelina Jolie. If it were conclusive evidence, the weight would be 1.00000, but we never expect any single term to be that definitive.
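
To make the voting idea concrete, here is a minimal sketch of how such triples might be applied to a document. It is an illustration under stated assumptions, not our production code: only the BRAD weight for the Jolie dimension comes from the example above, the other entries are invented, and simply summing raw votes per dimension is an assumed combination rule.

```python
from collections import defaultdict

# Hypothetical dictionary entries as (term, dimension, weight) triples.
# Only the first triple is taken from the text; the rest are made up
# purely for illustration.
TRIPLES = [
    ("BRAD", "Arts/People/Jolie,_Angelina", 0.121),
    ("BRAD", "Arts/People/Pitt,_Brad", 0.450),
    ("ANGELINA", "Arts/People/Jolie,_Angelina", 0.600),
    ("PREMIERE", "Arts/Movies", 0.200),
]


def build_index(triples):
    """Group triples by term so each term maps to its (dimension, weight) votes."""
    index = defaultdict(list)
    for term, dimension, weight in triples:
        index[term].append((dimension, weight))
    return index


def raw_votes(tokens, index):
    """Sum the raw votes cast by every known term in the document.

    Summation is an assumption made for this sketch; a real system would
    also balance and normalize the totals.
    """
    votes = defaultdict(float)
    for token in tokens:
        for dimension, weight in index.get(token.upper(), ()):
            votes[dimension] += weight
    return dict(votes)


index = build_index(TRIPLES)
votes = raw_votes(["Brad", "and", "Angelina", "at", "the", "premiere"], index)
for dimension, score in sorted(votes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {dimension}")
```

Note that BRAD alone would favor the hypothetical Pitt dimension in this toy dictionary; it takes the second term, ANGELINA, to tip the total toward Jolie.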

In building a dictionary for an application, we have to start with the dimensions. Do we have the kinds of dimensions to cover the target content, and are there enough dimensions to make the distinctions in content required by the application? For a patent information system, an Angelina dimension may not have much relevance, and even something more appropriate like Electrical Machinery may have to be divided up into multiple dimensions to support a reasonable level of granularity in indexing.

Given the dimensions, we next have to define the terms to go along with them. The target content we want to process will have a certain vocabulary, and our dictionary terms should try to encompass most of it. This can be tricky in a statistical approach because we need reasonably large samples of training data to make a particular term become associated with a particular dimension.

Weights are determined in large part by training data, but the distribution of those numbers is important. To begin with, not all weights should be the same, and generally, we want to see them spread out over the entire dynamic range available to us. Weights that are quite small or quite large have to be supported by more data than those in the middle range. Weights have to be balanced between dimensions, and there should be enough of them so that most terms are related to more than one dimension.

In theory, we could build a dictionary with just one weight in just one dimension for each term, but that would be in denial about the inherent ambiguity of language. So, we typically want a dictionary to be as big as possible, based on an appropriate amount of training data. To build the best possible dictionary requires much inspiration and much perspiration.

Even in the world of print, one dictionary is often not enough. Just for English, for example, we can go to standard references like Webster’s Third New International, The American Heritage Dictionary of the English Language, or the Oxford English Dictionary, as well as more specialized lexicons. So how many semantic dictionaries do we really need?

That of course depends on the application. If we are in the situation where our target text data is extremely stable and requires only a general vocabulary, then we might get away with a single semantic dictionary based on a large sample of data processed quite carefully. On the Web, however, we have nothing of the sort, if you haven’t noticed lately.

A sophisticated dictionary that took weeks to build with hairy mathematical algorithms on a reasonable sample of training text may become obsolete overnight. That is not to say that sophisticated dictionaries are unhelpful; but in the merciless competition of the information marketplace, we probably need to be able to pop out a new semantic dictionary based on a gigabyte or more of text in just hours.

Given this kind of turnaround, why would anyone want to rely on a single semantic dictionary with its limited vocabulary and somewhat dated concepts? A new dictionary will of course involve a nontrivial upfront investment, but once a reliable source of tagged data is developed, actual dictionary building can be largely automated. That is the advantage of relying on statistical methods.

Casino Royale

14 Sep 2009

In any statistical information system, one can never achieve absolute certainty. Every result is a kind of bet with the possibility of losing. For Semantic Signatures, however, this is more like playing blackjack than like playing roulette. Whether we imagine ourselves as the house or some hotshot card counter, we try our utmost to bend the odds in our favor.

When a given term occurs in a document, we know that there is a certain probability that the document is about a given topic. For example, THRILLER may relate to Michael Jackson or to some recent summer popcorn epic. Similarly MOONWALK may refer to Apollo XI or to a dance move. We would be rash to judge content just on the basis of a single term, but when multiple terms can corroborate each other, we do have a better bet.

The trick here is to be able to set up a semantic dictionary so that we can always expect to find a reasonable number of terms in a target document that allow us to make that better bet. This requires careful balancing: we need enough semantic dimensions to be able to distinguish the different important kinds of content and enough terms for each dimension to put it into play. It is much like developing a diverse portfolio of investments to weather any shift in economic climate.

Most people will probably pass on building their own semantic dictionaries. It takes a tremendous amount of work to collect and filter the requisite text data to ground our dictionary weights and to massage all those numbers to get the maximum amount of usable information. But we want to get on the right side of the odds.

I was asked that question quite a few times when I was at the KM World and SemTech conferences. The answer is simple: use a Semantic Signature as a query against an index of Semantic Signatures to find the most relevant content.

In order to illustrate what a Semantic Signature is, we provide the example of a document with 30 semantic dimensions labeled using the Open Directory Project taxonomy (www.dmoz.org). The example has led people to believe that a Semantic Signature is nothing more than a multivariate categorizer for content navigation, categorization, or other forms of content bucketing. While Signatures can certainly be used for that, it is not how we use them at TextWise.

If you examine a Semantic Signature without reading the thirty labels, you’ll observe that it is a 30-dimension vector of concepts and weights. These concepts and weights are used by TextWise in a simple vector math calculation to determine the similarity between two signatures. Once a score is obtained, it is normalized to an integer value, and a cutoff is then chosen to determine whether each signature is relevant to the query.
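
As a rough illustration of that calculation, the sketch below treats a signature as a sparse mapping from dimension label to weight. The text only says "simple vector math", so cosine similarity, the 0 to 100 integer scale, the cutoff of 50, and the dimension labels are all assumptions chosen for the example.

```python
import math


def cosine(sig_a, sig_b):
    """Cosine similarity between two sparse signatures (dimension -> weight)."""
    dot = sum(weight * sig_b.get(dim, 0.0) for dim, weight in sig_a.items())
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty signature is similar to nothing
    return dot / (norm_a * norm_b)


def integer_score(sig_a, sig_b, scale=100):
    """Normalize the similarity to an integer (assumed range: 0..scale)."""
    return round(scale * cosine(sig_a, sig_b))


def is_relevant(query, candidate, cutoff=50):
    """Apply an assumed cutoff to decide whether a candidate matches the query."""
    return integer_score(query, candidate) >= cutoff


# Hypothetical signatures with made-up dimension labels and weights.
query = {"Health/Oncology": 0.8, "Health/Pharmacy": 0.6}
doc_a = {"Health/Oncology": 0.6, "Health/Pharmacy": 0.8}
doc_b = {"Sports/Soccer": 1.0}

print(integer_score(query, doc_a), is_relevant(query, doc_a))
print(integer_score(query, doc_b), is_relevant(query, doc_b))
```

Because the signatures are sparse, the dot product only touches dimensions that the two documents share, which is part of why the comparison stays cheap.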

For a real-world application, a user-controlled sliding scale from 1 to 10 can be used within the calculation to control which content items, represented by their Semantic Signatures, are displayed: a setting of 9 would instruct the application to show only the highly relevant content, while a setting of 2 would show more content at greater recall.
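
One way to wire up such a slider is sketched below. The linear mapping from slider position to cutoff, the 0 to 100 score range, and the function names are hypothetical choices for this example; an application could use any monotonic mapping.

```python
def cutoff_for_slider(position, max_score=100):
    """Map a slider position (1..10) to a score cutoff, assuming a linear scale."""
    if not 1 <= position <= 10:
        raise ValueError("slider position must be between 1 and 10")
    return position * max_score // 10


def visible_items(scored_items, position):
    """Keep items scoring at or above the cutoff, best match first."""
    cutoff = cutoff_for_slider(position)
    kept = [(item, score) for item, score in scored_items if score >= cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical (item, integer score) pairs from a signature comparison.
scored = [("doc-b", 45), ("doc-a", 96), ("doc-d", 8), ("doc-c", 22)]

print(visible_items(scored, 9))  # high precision: only the closest match
print(visible_items(scored, 2))  # higher recall: weaker matches appear too
```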

Why would I use Semantic Signatures to search for content? If you have read an article on the web or on your company’s intranet and attempted to find additional content related to what you’re looking at, you know it is a cumbersome process: identify keywords to use from the source, use them to search, review the results, and repeat the process until you have either found what you are looking for or capitulated in your effort. If you performed the same search against an index of Semantic Signatures, you would simply use the document as the query, eliminating the keyword/guesswork/review cycle inherent in today’s keyword systems.

From a developer’s perspective, the major benefits of Semantic Signatures are:

  • They are a very accurate and compact representation of a document – each Signature only consumes ~180 bytes of RAM.
  • Computing the similarity of Signatures is a very lightweight vector calculation and, unlike keyword matching, needs no patterns, alias tables, synonym tables, spell correction, etc.
  • Scalability. 3 million Signatures will easily fit within a 1.5 GB 32-bit Java VM and result in full index searches taking ~70 milliseconds.
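
The figures in that list can be checked with some back-of-the-envelope arithmetic. The sketch below uses only the numbers quoted above; it counts raw signature data and ignores JVM object overhead and index structures, which is an assumption that flatters the result.

```python
BYTES_PER_SIGNATURE = 180          # quoted size of one Signature
SIGNATURE_COUNT = 3_000_000        # quoted index size
HEAP_BYTES = int(1.5 * 1024**3)    # 1.5 GB heap, read here as GiB
SEARCH_SECONDS = 70e-3             # quoted full index search time

# Raw signature payload only; object headers and index overhead are ignored.
raw_bytes = BYTES_PER_SIGNATURE * SIGNATURE_COUNT
heap_share = raw_bytes / HEAP_BYTES

# Average time budget per signature comparison in a full scan.
ns_per_signature = SEARCH_SECONDS / SIGNATURE_COUNT * 1e9

print(f"raw signature data: {raw_bytes:,} bytes")
print(f"share of heap:      {heap_share:.0%}")
print(f"time per signature: {ns_per_signature:.1f} ns")
```

By this estimate the raw data takes roughly a third of the heap, and each comparison in a full scan gets a budget of a couple dozen nanoseconds.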

If you want to learn more about Semantic Signature technology and use our free API to create Semantic Signatures for your content, visit www.semantichacker.com.