Archive for the ‘General’ Category

Hijacked

26 May 2010

Has this ever happened to you? You are Googling for information on the Web, but inadvertently your query happens to share keywords with the latest cultural phenom: the next tweener heartthrob, a YouTube video suddenly gone viral, or yet another paranoid political fantasy that refuses to die.

You are a professional, however, and so switch into Advanced Mode to reshape your query, but to no avail. Your information has been buried under pop detritus; it has been hijacked by the maximum likelihood estimate (MLE) on the Web.

At times like this, you want to grab your search engine by the neck and shout, “I am NOT a screaming twelve-year-old girl into dancing cats and fixated on the President’s birth place!” But your search engine continues blithely in the wisdom of the crowd.

It is a reminder that statistically grounded information systems are at the mercy of their training data. If we cede too much control of a system to its finely wrought black box judgment, then we sometimes are going to run off the tracks. This is especially true with web semantics.

If we do in fact want to get under the hood to adjust a semantic system to go against the popular flow, then it helps tremendously if the categories underlying the representation of document content are intelligible to people. Such transparency is a prime motivation for how semantic dictionaries are currently built by TextWise.

Of course, if you care nary a lick about transparency, then may I interest you in this slightly used synthetic collateralized debt obligation….

Going Deep

12 May 2010

When people read text, they may not understand everything in it. For example, a layman might look at an article from a medical journal and see only that it is about some kind of drug. Someone more familiar with medicine would pick up that this is an experimental drug for treating estrogen-sensitive breast cancer. An expert would note that the drug is an aromatase blocker that performs as well as a standard approved drug in a double-blind controlled trial with a large sample of patients.

If an application seeks simply to distinguish documents about pharmaceuticals from documents about toxic financial assets or about the World Cup tournament in South Africa, then it is enough to understand at a superficial level. If a physician is searching for treatment options for a patient with a recurrence of breast cancer, however, a much deeper grasp of content is called for.

A general type of semantic dictionary covering a broad variety of different subjects is more or less forced to opt for broad coverage by default. Collecting enough training data for two thousand dimensions is a major undertaking; having to do it for twenty thousand dimensions will entail a big commitment of resources that one will have to justify. Still, if such a dictionary is critical for a given application, then we need to make the investment.

In many cases the domain of content to be covered can be quite circumscribed. Accordingly, we probably would be better off to add a fairly small number of dimensions to an existing semantic dictionary rather than build a whole new dictionary from scratch. This will require some special statistical balancing of course, but balancing is what dictionary building is all about.

Perfect What?

22 Mar 2010

We have been musing about the true topology of semantic spaces and how this affects our concept of dimensionality. This segues logically into a hot area of contention. In our linear approximation of meaning, how many dimensions do we really need and what should they be?

Some people prefer to approach this problem mathematically. Given a representative sample of documents to describe semantically, we can look at the relationship between terms and documents as defining a vector space. One can then apply the method of singular value decomposition (SVD) to find a minimal set of basis vectors to span that space. These singular vectors are like eigenvectors on steroids.
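For readers who want the algebra, a standard statement of the decomposition follows. The symbols are notation chosen here for illustration (A for a term-document matrix with m terms and n documents, r for its rank, k for the number of dimensions kept) and do not describe any particular product.

```latex
A = U \Sigma V^{\mathsf{T}}, \qquad
A \in \mathbb{R}^{m \times n},\;
U \in \mathbb{R}^{m \times r},\;
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r),\;
V \in \mathbb{R}^{n \times r},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\mathsf{T}}, \qquad k \le r
```

Here u_i and v_i are the columns of U and V. Keeping only the k largest singular values gives the best rank-k approximation of A in the least-squares sense (the Eckart-Young theorem), which is the sense in which the retained singular vectors form a "minimal" basis.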

If you have actually read this far into this blog, then you will know that we (TextWise) have a competitor that employs SVD for semantic analysis. We get asked all the time why we have stuck with basic statistical techniques when we could instead be rigorously mathematical. Our usual response is that we have much faster turnaround in building semantic dictionaries, finer-grain descriptions of content, and more intuitive concepts overall.

There are more fundamental concerns, however, both theoretical and practical. On the theoretical side, SVD might be pushing a linear-space semantic model too far if meaning is in fact topologically complex. More significant on the practical side, though, is that one might be getting caught in the common problem of overtraining.

Suppose that we have a hundred thousand blog postings to which we apply SVD to get some optimal set of dimensions for analyzing their content. What then happens next week when we get a million new blogs that we have never seen before? Our perfect basis set is now distinctly handicapped.
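A small sketch can make the handicap concrete. It assumes we already hold an orthonormal basis derived from last week's documents, and it measures how much of a new document's term vector falls outside the span of that basis. The four-term vocabulary, the basis vectors, and the function name `residual_fraction` are all hypothetical, chosen only for illustration.

```python
import math


def residual_fraction(doc, basis):
    """Fraction of a document vector's squared length that an
    orthonormal basis fails to capture (0.0 means fully captured)."""
    total = sum(x * x for x in doc)
    if total == 0.0:
        return 0.0
    captured = 0.0
    for b in basis:
        # Coordinate of the document along this basis vector.
        coord = sum(x * y for x, y in zip(doc, b))
        captured += coord * coord
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - captured / total)


# Hypothetical vocabulary order: [drug, trial, cancer, vuvuzela]
s = 1.0 / math.sqrt(2.0)
old_basis = [
    [s, s, 0.0, 0.0],      # a "drug trial" direction
    [0.0, 0.0, 1.0, 0.0],  # a "cancer" direction
]

familiar_doc = [2.0, 2.0, 1.0, 0.0]  # uses only last week's concepts
novel_doc = [0.0, 0.0, 0.0, 3.0]     # uses a term the basis never saw

print(residual_fraction(familiar_doc, old_basis))
print(residual_fraction(novel_doc, old_basis))
```

The familiar document is captured almost entirely, while the novel one lies wholly outside the old basis, so every coordinate it gets in the old system is zero.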

Now we could try to reprocess all our data here, but SVD is so computationally intensive as an algorithm that it probably will be too slow to keep up without superextraordinary investments in hardware resources. We also would end up with an unstable system in which it is quite difficult to compare results from one week to the next. Anyway, we made our choice here.

People in the information sciences are fond of high-dimensional vector spaces as models of document content. These are in fact only approximations of reality, however; and in the specific case of semantics, they are probably an oversimplification. We already know something about how the neural circuitry in our brains works when we process the meaning of language; we can find no clean finite-dimensional linear space in the tangle of our synapses.

Neural imaging like PET does support the theory that linguistic concepts correspond to particular clusters of neurons connected in fairly complex feedback loops. Our understanding here is still quite limited, though. We do not know how many such clusters exist or how widely they are distributed. Visual concepts are in a different part of the brain than auditory concepts, for example; and overall, we have not yet found any obvious switchboard, say in the hippocampus, that could somehow tie everything together neatly.

In our computational semantic model, we assume that all concepts are independent and equal. That seems to work in semantic dictionary applications when we have thousands of concepts as dimensions, but an epistemologist here would have the lurking suspicion that our actual semantic space has to be some kind of complex manifold with all kinds of holes and twisting surfaces like a deranged n-th-order Moebius strip. Meaning is messy.

Our linear Euclidean model may therefore be valid only in a small local region of our actual semantic space, but in practice, that is really where all our apps have to live. One cannot presume to comprehend all possible content in text. We can only slice off a small piece of the pie of meaning, and until world peace and perfect enlightenment break out, that is a good start.

We have been thinking lately about how many dimensions a semantic dictionary should have. Some researchers at Carnegie Mellon have been approaching the same question from the perspective of neuroscience, using real-time imaging of activity in the human brain as it processes language (http://bit.ly/buIZEx).

According to CMU, there are really only THREE basic semantic dimensions: (1) Can I eat it? (2) Can I pick it up? (3) Can I hide in it? Admittedly, this primitive partitioning of the world probably goes back to our primate origins, but does have a certain resonance. Let’s remember it the next time we try to categorize journal articles in nanotechnology or search postings on someone’s Facebook wall.

Even in the world of print, one dictionary is often not enough. Just for English, for example, we can go to standard references like Webster’s Third New International, The American Heritage Dictionary of the English Language, or the Oxford English Dictionary, as well as more specialized lexicons. So how many semantic dictionaries do we really need?

That of course depends on the application. If we are in the situation where our target text data is extremely stable and requires only a general vocabulary, then we might get away with a single semantic dictionary based on a large sample of data processed quite carefully. On the Web, however, we have nothing of the sort, if you haven’t noticed lately.

A sophisticated dictionary that took weeks to build with hairy mathematical algorithms on a reasonable sample of training text may become obsolete overnight. That is not to say that sophisticated dictionaries are unhelpful; but in the merciless competition of the information marketplace, we probably need to be able to pop out a new semantic dictionary based on a gigabyte or more of text in just hours.

Given this kind of turnaround, why would anyone want to rely on a single semantic dictionary with its limited vocabulary and somewhat dated concepts? A new dictionary will of course involve a nontrivial upfront investment, but once a reliable source of tagged data is developed, actual dictionary building can be largely automated. That is the advantage of relying on statistical methods.

One of the challenges with creating and maintaining applications for the web is keeping up with all of today’s different web browsers and their differing under-the-hood technologies and functionality. New versions of browsers and operating systems are released frequently, for reasons ranging from feature enhancements to security fixes. There is a wide variety of web browsers available today, each offering something a bit different from the others. Operating system vendors have their own browsers, some of which are cross-platform and run on other operating systems; then there are the third-party browsers, and we haven’t even explored the mobile browser realm yet… Creating and maintaining a set of browser and OS combinations as a company standard toward which applications can be developed and tested has become key for us.

Our standard has been created using statistics on browser and OS usage from W3Schools, broken down by brand and version. By collecting this data and observing trends over time, we can decide when it’s appropriate to either start or discontinue supporting a browser, OS, or combination of the two. Our process is to evaluate our browser/OS support matrix each time a new major or minor version of a browser or OS is released, or at least once every 6 months if no browser or OS updates have occurred. Doing an evaluation of the statistics is important even if no updates have occurred, because some browsers may fall below the percentage of use needed for support, while others may have increased enough in usage or popularity to now be supported.
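The review step described above can be sketched in a few lines. This is only an illustration: the 5% cutoff, the usage figures, the placeholder browser and OS names, and the function name `review_support_matrix` are all invented for the example and are not real thresholds or published statistics.

```python
def review_support_matrix(usage_share, currently_supported, min_share=5.0):
    """Compare usage percentages against a cutoff and report which
    browser/OS combinations to add, keep, or drop."""
    qualifying = {combo for combo, share in usage_share.items()
                  if share >= min_share}
    return {
        "add": sorted(qualifying - currently_supported),
        "keep": sorted(qualifying & currently_supported),
        "drop": sorted(currently_supported - qualifying),
    }


# Invented usage percentages keyed by (browser version, OS).
usage_share = {
    ("BrowserA 3", "OS1"): 40.0,
    ("BrowserB 8", "OS1"): 25.0,
    ("BrowserB 6", "OS1"): 3.5,
    ("BrowserC 5", "OS2"): 6.0,
}
currently_supported = {
    ("BrowserA 3", "OS1"),
    ("BrowserB 8", "OS1"),
    ("BrowserB 6", "OS1"),
}

decision = review_support_matrix(usage_share, currently_supported)
for action, combos in decision.items():
    print(action, combos)
```

Running the same function on each new set of statistics shows at a glance which combinations have crossed the cutoff in either direction since the last review.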

It’s also important to be able to test those combinations to ensure compatibility. Rather than bearing the expense of having every possible combination in-house, we use a service on the web that specializes in providing those tools to help us test. The service that we use is called BrowserCam, which gives us the ability to take “snapshots” of our applications in various browser/OS combinations on the web, as well as remote access to those machines for interactive testing. And to answer the original question, we have no idea – PlanetWeb2.6 on Dreamcast is not one of our supported combinations.

I’ve worked in Product Management for years, and for years our battles were ones with well-known, battle-weary opponents. The battle lines had been drawn, strategic alliances were made and broken, and weapons were fired (i.e., fingers were pointed) if a product flopped, forecasts weren’t met, or the “biggest prospect ever” walked away.

Sales blamed Marketing, Sales blamed the Product Manager and vice versa, Developers blamed the Requirements Analysts and QA (bad requirements and bad QA people for not catching the bugs prior to launch), and the CEO blamed everybody, but always came back with donuts and cookies for us to enjoy as we licked our collective wounds.

I hate using the whole war/battleground analogy because in the end we’re all on the same team, working toward the same goal.  And if your company is in the business of selling products and services, the goal typically boils down to the bottom line – profitability.

Just over a year ago, I embarked on a new path in my Product Management journey. I’m now working as a Product Manager for a company whose underlying technology is semantic web technology. This means we’ve got PhDs and patent holders with specialties in things like informatics, search engine technology, information retrieval and relevance testing. So in addition to engineers wanting to continually improve the system and its architecture, I’ve now got scientists who want to (and rightly so) continually improve the core technology, the algorithms. Tweak, tweak, tweak. They go to conferences like SIGIR and come back fired up, ready to go. They’re like a new battalion that has shown up on the battlefield and I have to size them up quickly in order to protect my precious project timeline. “Are you friend or foe?”

So I find myself with an interesting dilemma.  How do you tell a scientist to think faster, come to a conclusion faster, improve the technology but get it done NOW so we can get the product built and out the door?  How do I get the engineers to wrap it up? Is it research or is it a project?  I need an answer, and what I keep hearing from all camps is, “It’s both!”

I can hear you Product Managers and Project Managers saying it now.  Just continue with new product development and merge the R&D advancements into the products as they’re ready.  I used to say and do that too, and it worked.  But that was before I landed in a place where what works for typical Web data doesn’t quite work for patent data, and none of it compares to that elusive Holy Grail called the medical domain.  And even tweaks in certain parts of the engineering code can potentially affect relevance and skew the results.

Have I given up on my beloved project schedules, my requirements documents, my product development lifecycle tools?  NEVER. I’m a Product Manager at heart. But just like the ever changing world of the semantic web, and the scientists and engineers I interact with daily, I’m tweaking my process as I go.

A former colleague of mine used to have an entire can of soup for lunch every day. We razzed him about this, but he shook us all off until one day, I looked at the nutrition label on the can. That soup had 1800 mg of sodium altogether! We gravely informed him of this fact, and the soup was soon history.

To understand this story, you would have to know that the recommended daily maximum dietary intake of sodium for an adult is about 900 mg. Without this context, the number 1800 really means nothing. So what do all those numbers in a semantic dictionary mean, if anything?

The key property of semantic dictionary numbers is that they are based on probabilities and so have to fall between 0 and 1. They measure the likelihood that a document containing a given term is related to a given semantic dimension. For example, a dictionary weight of 1.0000 for a term and a dimension would indicate that a document containing the term is absolutely associated with the dimension.

There is a complication here, however. In real life, nothing is ever so certain. If we saw a 1.0000 term weight for a dimension, a more reasonable interpretation is that our sample of training data was too small for estimating the probability of that term accurately. A similar problem arises for a dictionary weight of 0.0000.

In general, a statistician will be highly suspicious of any extreme probabilities like 1.0000 and 0.0000. As proponents of statistical technology, we have to make a special effort to avoid such probability estimates in our semantic dictionaries. In contrast, certain other mathematical approaches to semantics tend to skate over niceties like this, choosing just to plug in numbers to what is essentially a fixed formula.
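One common way to keep estimates away from 0 and 1 is additive (Laplace) smoothing, sketched below. This is a textbook technique shown under stated assumptions, not a description of how any particular semantic dictionary is actually built; the function name `smoothed_weight` and the pseudo-count `alpha` are hypothetical.

```python
def smoothed_weight(docs_with_term_in_dim, docs_with_term, alpha=1.0):
    """Estimate P(dimension | term) with additive smoothing.

    Adds `alpha` pseudo-documents to each of the two outcomes
    (related / not related), so with alpha > 0 the estimate can
    never be exactly 0 or 1.
    """
    if docs_with_term_in_dim > docs_with_term:
        raise ValueError("in-dimension count cannot exceed total count")
    return (docs_with_term_in_dim + alpha) / (docs_with_term + 2.0 * alpha)


# A tiny sample: the raw estimates would be 3/3 and 0/3.
print(smoothed_weight(3, 3))
print(smoothed_weight(0, 3))

# A larger sample: the raw estimate would be 500/1000.
print(smoothed_weight(500, 1000))
```

With only three training documents, raw estimates of 1.0000 and 0.0000 are pulled in to 0.8 and 0.2, while with a thousand documents the smoothing barely moves the estimate from 0.5. The size of `alpha` controls how much evidence is needed before a weight is allowed to approach either extreme.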

If one is careless about the meaning of numbers, though, how can one be careful in capturing the actual meaning of words?

If you have studied the mathematics of linear spaces, then you know that there are infinitely many ways to represent a given point in a space as a set of coordinates. For any particular set of data points, however, one can mechanically derive a particular set of axes that results in the representation requiring the fewest coordinates to capture the most important characteristics of those points.

This is the principle behind Latent Semantic Indexing (LSI), which was in large part why Susan Dumais of Microsoft received the Salton prize at the recently concluded SIGIR Conference. So, why don’t we use LSI?

It all boils down to whether one believes that a linear space is a good model for the semantics of natural language; and the main issue here is that of orthogonality. Orthogonality is great in a Pythagorean ideal world, but the real world tends to be quite messy. A rectangular grid can be imposed on places like the U.S. Midwest, but would be quite inappropriate for land management or road building in the Amazon Basin or in Siberia.

An orthogonal system disregards the landscape, which is in fact what we have to live with and in.  Two towns ten miles apart along a navigable river are in a sense closer than two towns five miles apart with a mountain range between them. Our approach to semantics is that of conforming to the landscape of text data, which is probably better described as being fractal than being orthogonal.