Archive for the ‘Science’ Category

Hijacked

26 May 2010

Has this ever happened to you? You are Googling for information on the Web, but inadvertently your query happens to share keywords with the latest cultural phenom: the next tweener heartthrob, a YouTube video suddenly gone viral, or yet another paranoid political fantasy that refuses to die.

You are a professional, however, and so switch into Advanced Mode to reshape your query, but to no avail. Your information has been buried under pop detritus; it has been hijacked by the maximum likelihood estimate (MLE) on the Web.

At times like this, you want to grab your search engine by the neck and shout, “I am NOT a screaming twelve-year-old girl into dancing cats and fixated on the President’s birth place!” But your search engine continues blithely in the wisdom of the crowd.

It is a reminder that statistically grounded information systems are at the mercy of their training data. If we cede too much control of a system to its finely wrought black box judgment, then we sometimes are going to run off the tracks. This is especially true with web semantics.

If we do in fact want to get under the hood to adjust a semantic system to go against the popular flow, then it helps tremendously if the categories underlying the representation of document content are intelligible to people. Such transparency is a prime motivation for how semantic dictionaries are currently built by TextWise.

Of course, if you care nary a lick about transparency, then may I interest you in this slightly used synthetic collateralized debt obligation….

Going Deep

12 May 2010

When people read text, they may not understand everything in it. For example, a layman might look at an article from a medical journal and see only that it is about some kind of drug. Someone more familiar with medicine would pick up that this is an experimental drug for treating estrogen-sensitive breast cancer. An expert would note that the drug is an aromatase blocker that performs as well as a standard approved drug in double-blind controlled trials with a large sample of patients.

If an application seeks simply to distinguish documents about pharmaceuticals from documents about toxic financial assets or about the World Cup tournament in South Africa, then it is enough to understand at a superficial level. If a physician is searching for treatment options for a patient with a recurrence of breast cancer, however, a much deeper grasp of content is called for.

A general type of semantic dictionary covering a broad variety of different subjects is more or less forced to opt for broad coverage by default. Collecting enough training data for two thousand dimensions is a major undertaking; having to do it for twenty thousand dimensions will entail a big commitment of resources that one will have to justify. Still, if such a dictionary is critical for a given application, then we need to make the investment.

In many cases the domain of content to be covered can be quite circumscribed. Accordingly, we probably would be better off adding a fairly small number of dimensions to an existing semantic dictionary rather than building a whole new dictionary from scratch. This will require some special statistical balancing of course, but balancing is what dictionary building is all about.

Perfect What?

22 Mar 2010

We have been musing about the true topology of semantic spaces and how this affects our concept of dimensionality. This segues logically into a hot area of contention. In our linear approximation of meaning, how many dimensions do we really need and what should they be?

Some people prefer to approach this problem mathematically. Given a representative sample of documents to describe semantically, we can look at the relationship between terms and documents as defining a vector space. One can then apply the method of singular value decomposition (SVD) to find a minimal set of basis vectors to span that space. These singular vectors are like eigenvectors on steroids.
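To make the SVD idea concrete, here is a minimal sketch in Python with NumPy. The tiny term-document matrix, the term list, and the choice of k = 2 are all made up for illustration; it is not meant to represent any vendor's actual pipeline.

```python
import numpy as np

# Toy term-document counts (rows = terms, columns = documents).
# All values are invented for illustration.
terms = ["drug", "trial", "cancer", "goal", "match"]
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k strongest dimensions (k = 2 is an arbitrary choice here).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each row is one document expressed in the reduced k-dimensional space.
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Documents 0 and 1 share medical terms; document 2 is about sports.
sim_same_topic = cosine(doc_coords[0], doc_coords[1])
sim_cross_topic = cosine(doc_coords[0], doc_coords[2])
```

In this toy case the two retained dimensions line up with the two obvious topics, so the medical documents land on top of each other and far from the sports documents. Real corpora are never this tidy, and nothing guarantees that the singular vectors will correspond to concepts a person would recognize.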

If you have actually read this far into this blog, then you will know that we (TextWise) have a competitor that employs SVD for semantic analysis. We get asked all the time why we have stuck with basic statistical techniques when we could instead be rigorously mathematical. Our usual response is that we have much faster turnaround in building semantic dictionaries, finer-grain descriptions of content, and more intuitive concepts overall.

There are more fundamental concerns, however, both theoretical and practical. On the theoretical side, SVD might be pushing a linear-space semantic model too far if meaning is in fact topologically complex. More significant, though, is the practical side: one might be getting caught in the common problem of overtraining.

Suppose that we have a hundred thousand blog postings to which we apply SVD to get some optimal set of dimensions for analyzing their content. What then happens next week when we get a million new blogs that we have never seen before? Our perfect basis set is now distinctly handicapped.

Now we could try to reprocess all our data here, but SVD is so computationally intensive as an algorithm that it probably will be too slow to keep up without superextraordinary investments in hardware resources. We also would end up with an unstable system in which it is quite difficult to compare results from one week to the next. Anyway, we made our choice here.

People in the information sciences are fond of high-dimensional vector spaces as models of document content. These are in fact only approximations of reality, however; and in the specific case of semantics, they are probably an oversimplification. We already know something about how the neural circuitry in our brains works when we process the meaning of language; we can find no clean finite-dimensional linear space in the tangle of our synapses.

Neural imaging like PET does support the theory that linguistic concepts correspond to particular clusters of neurons connected in fairly complex feedback loops. Our understanding here is still quite limited, though. We do not know how many such clusters exist or how widely they are distributed. Visual concepts are in a different part of the brain than auditory concepts, for example; and overall, we have not yet found any obvious switchboard, say in the hippocampus, that could somehow tie everything together neatly.

In our computational semantic model, we assume that all concepts are independent and equal. That seems to work in semantic dictionary applications when we have thousands of concepts as dimensions, but an epistemologist here would have the lurking suspicion that our actual semantic space has to be some kind of complex manifold with all kinds of holes and twisting surfaces like a deranged n-th-order Moebius strip. Meaning is messy.

Our linear Euclidean model may therefore be valid only in a small local region of our actual semantic space, but in practice, that is really where all our apps have to live. One cannot presume to comprehend all possible content in text. We can only slice off a small piece of the pie of meaning, and until world peace and perfect enlightenment break out, that is a good start.

We have been thinking lately about how many dimensions a semantic dictionary should have. Some researchers at Carnegie Mellon have been approaching the same question from the perspective of neuroscience and real-time imaging of activity in the human brain as it processes language (http://bit.ly/buIZEx).

According to CMU, there are really only THREE basic semantic dimensions: (1) Can I eat it? (2) Can I pick it up? (3) Can I hide in it? Admittedly, this primitive partitioning of the world probably goes back to our primate origins, but it does have a certain resonance. Let’s remember it the next time we try to categorize journal articles in nanotechnology or search postings on someone’s Facebook wall.

In my last post, I presented some research on the different content types we found in our corpus of 8.9 million Twitter messages. One surprising result we found is that Portuguese is apparently the second most common language on Twitter, beating out both Japanese and Spanish. Given the unreliability of TextCat on short pieces of text, I decided to verify our language statistics by looking at the location field in the user info for the unique set of users in our corpus. This was not straightforward, however, because the location is a free-text field into which people can write absolutely anything they want. For example, the following all occurred more than once in our corpus:

  • “New York”
  • “NYC”
  • “everywhere!!!!”
  • “In ur computers, eating ur RAM”
  • “Earth”
  • “Mars”
  • “Utah :)”
  • “utah :(”

To get around this problem, I normalized the text by converting it to lowercase, removing punctuation, and changing things that looked like addresses to have just the city (so that “123 Fake St., Springfield, USA” becomes just “springfield”). I then looked at the top 500 locations in terms of number of twitterers. These are the most common countries represented in users’ locations:
[Figure: Twitter User Locations (by Country)]
And the top 10 cities are:

  1. New York
  2. São Paulo
  3. Los Angeles
  4. London
  5. Chicago
  6. San Francisco
  7. Rio de Janeiro
  8. Tokyo
  9. Atlanta
  10. Toronto

While the locations are dominated by English-speaking countries, Brazil does come in second in terms of number of users, and two Brazilian cities show up in the top 10, which suggests that our language stats aren’t too far off the mark.
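The location clean-up described above could be sketched roughly as follows in Python. This is only a guess at the steps, not the code used in the study: the function name `normalize_location` and the "leading digit plus at least three comma-separated parts" test for street addresses are assumptions made for illustration.

```python
import re
import string
from collections import Counter


def normalize_location(raw: str) -> str:
    """Lowercase, reduce address-like strings to the city, strip punctuation."""
    text = raw.strip().lower()
    parts = [p.strip() for p in text.split(",")]
    # Assumed heuristic: "street, city, country" starts with a digit and has
    # at least three comma-separated parts; keep only the second part.
    if len(parts) >= 3 and re.match(r"\d", parts[0]):
        text = parts[1]
    # Remove ASCII punctuation (curly quotes and emoji are left alone).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse any leftover runs of whitespace.
    return " ".join(text.split())


# A few of the raw values quoted in the post, plus the address example.
samples = [
    "New York",
    "NYC",
    "everywhere!!!!",
    "Utah :)",
    "utah :(",
    "123 Fake St., Springfield, USA",
]

# Tally normalized locations; most_common() gives the ranking.
counts = Counter(normalize_location(s) for s in samples)
top_locations = counts.most_common(500)
```

Note that this sketch does nothing to merge aliases such as “NYC” and “New York”, and it has no way of knowing that “everywhere” is not a place; both would need a gazetteer or some manual review.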

Another question we considered in our study is whether there is any way to distinguish twitterers who post broadly informative messages from those who post mainly personal messages or spam. Our first thought was that the number of followers a twitterer has would be a good indication of how informative their messages are to a wider audience. But we were quite surprised when we looked at the distribution of the number of followers in our sample:
[Figure: Histogram of Log Number of Followers on Twitter]
The x-axis here is the logarithm base 10 of the number of followers. While most twitterers in our corpus have between 15 and 60 followers (log=1.2 to 1.8), there is a long tail where we can find accounts with more than a thousand, 100,000, or even a million followers. We didn’t realize at first the number of celebrities currently using Twitter, as you can see in this list of the top 100 most-followed Twitter accounts. Of course, it’s a matter of opinion whether the latest funny video that Ashton Kutcher found on YouTube is more important than what Barack Obama has to say about health care, but for our purposes, we’d rather filter out celebrity ramblings from the more serious messages, and that is not easy to do based on the number of followers alone.
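For readers who want to see how a log-scale axis maps back to raw counts, here is a small Python sketch. The follower counts in it are hypothetical and are not taken from our corpus; only the 1.2 and 1.8 endpoints come from the plot described above.

```python
import math
from collections import Counter

# Hypothetical follower counts, NOT data from the study.
followers = [3, 18, 22, 40, 57, 61, 350, 1200, 150000, 2000000]

# Accounts with zero followers have no logarithm, so skip them.
logs = [math.log10(n) for n in followers if n > 0]

# Count accounts per order of magnitude (0 = 1-9, 1 = 10-99, 2 = 100-999, ...).
decades = Counter(math.floor(x) for x in logs)

# Convert the log range quoted in the text back to raw follower counts.
low, high = 10 ** 1.2, 10 ** 1.8
```

Here `low` and `high` come out at roughly 16 and 63 followers, which matches the rounded range quoted above. Taking logs is also what makes the long tail visible at all: a million-follower account sits only a few units to the right of the bulk of users rather than far off the edge of the chart.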

A more surprising fact we discovered is that spammer accounts can have relatively high numbers of followers as well, as you can see in the following boxplot:
[Figure: Number of Followers on Twitter, by Message Type]
This data is from the 1,000-tweet sample, classified by message type, that I discussed in my previous post. In this plot, messages about the user’s current status and private conversations are grouped together as “personal” messages, while all other messages (excluding spam) are “info” messages. The boxes show the middle 50% of the distribution for each type, while the whiskers extending from the boxes show where 99% of the data points lie. (There are a few outliers above 5,000 followers which are not shown here, to make the distributions easier to see.) While spam messages only made up a small fraction (4%) of our sample, the plot shows that within the set of spammer accounts there are quite a few which have more than 500 to 1,000 followers, a number which would be pretty high for the other two message types. There is even one spam account in our sample which had over 10,000 followers at the time it posted.

But how could a spammer get so many followers, given that all they post is spam? For nearly all of these accounts, the number of friends (accounts they are following) is greater than the number of followers, so I suspect that spammers go around following other twitterers at random, and at least some of these people follow them back out of courtesy, without realizing that they are following a spam account. After all, the only way a Twitter spammer can get someone to see their tweets is to be followed by that person. There’s probably a high rate of turnover in a spammer’s followers list, but that wouldn’t matter much, as long as they can find more people to follow who will follow them in return without checking them out first.
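The friends-versus-followers pattern suggests a crude screening rule, sketched below in Python. The `Account` class, the function name, and the 500-follower threshold are all assumptions for illustration (the threshold is loosely borrowed from the boxplot discussion), and this is a flag for a closer look, not a spam classifier we have actually tested.

```python
from dataclasses import dataclass


@dataclass
class Account:
    name: str
    followers: int
    friends: int  # accounts this user is following


def follows_more_than_followed(acct: Account, min_followers: int = 500) -> bool:
    """Flag accounts with many followers that follow even more accounts.

    The min_followers default is an assumed cutoff, not a measured one.
    """
    return acct.followers >= min_followers and acct.friends > acct.followers


# Hypothetical accounts, not drawn from the corpus.
accounts = [
    Account("follow_back_farm", followers=800, friends=2500),
    Account("celebrity", followers=50000, friends=120),
    Account("ordinary_user", followers=40, friends=90),
]

flagged = [a.name for a in accounts if follows_more_than_followed(a)]
```

Plenty of legitimate users also follow more people than follow them, so a rule like this would only be useful in combination with other evidence, such as the content of the messages themselves.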

All of this means that distinguishing spam from informative tweets will not be easy, even if there isn’t that much of it currently. But some good news for us is that twitterers who post lots of informative content do tend to have more followers than those who post mainly personal messages. This fact, combined with some semantic analysis of Twitter messages, should help us a great deal in mining the Twitter stream for useful content.

Recently we’ve been looking at how well our Semantic Signatures technology works with messages posted to Twitter. These kinds of messages pose significant challenges for the semantic web in general, because their extremely short length (140 characters or fewer) means that there will be very little context available for understanding the content of the message. In addition, many of these messages feature “creative” spellings and grammar, and are of a personal nature (e.g. “Having sushi for lunch today”) that would not be of general interest. Extracting any meaningful information from these snippets of random conversation will be quite a difficult task indeed.

To see what exactly we’re up against, we undertook a small study to characterize the different types of messages that can be found on Twitter.