Recently we’ve been looking at how well our Semantic Signatures technology works with messages posted to Twitter. These messages pose significant challenges for the semantic web in general, because their extremely short length (140 characters or fewer) leaves very little context for understanding the content of the message. In addition, many of these messages feature “creative” spelling and grammar, and many are of a personal nature (e.g. “Having sushi for lunch today”) that would not be of general interest. Extracting any meaningful information from these snippets of conversation is quite a difficult task.
To see exactly what we’re up against, we undertook a small study to characterize the different types of messages that can be found on Twitter. We downloaded a sample of messages over a two-week period using the Twitter streaming API. This resulted in a corpus of 8.9 million messages (“tweets”) posted by 2.6 million unique users. About 2.7 million of these tweets, or 31%, were replies to a tweet posted by another user, while half a million (6%) were retweets. Almost 2 million (22%) of the messages contained a URL.
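For readers who want to compute similar counts on their own sample, here is a minimal sketch. It is not the code used in the study: it guesses replies, retweets, and URLs from the tweet text alone (a leading "@", a leading "RT @", and an "http://" or "https://" pattern), which is an assumption on our part. The metadata delivered with each tweet may give more reliable answers. The function names `tweet_flags` and `summarize` are hypothetical.

```python
import re

# Assumption: any http:// or https:// token counts as a URL.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def tweet_flags(text):
    """Return (is_reply, is_retweet, has_url) for one tweet's text."""
    stripped = text.lstrip()
    # Heuristic: a reply starts with an @mention.
    is_reply = stripped.startswith("@")
    # Heuristic: an old-style retweet starts with "RT @user".
    is_retweet = stripped.upper().startswith("RT @")
    has_url = URL_RE.search(text) is not None
    return is_reply, is_retweet, has_url


def summarize(tweets):
    """Map each flag name to (count, percentage of all tweets)."""
    counts = {"reply": 0, "retweet": 0, "url": 0}
    for text in tweets:
        is_reply, is_retweet, has_url = tweet_flags(text)
        counts["reply"] += is_reply
        counts["retweet"] += is_retweet
        counts["url"] += has_url
    total = len(tweets)
    return {
        name: (n, round(100.0 * n / total, 1) if total else 0.0)
        for name, n in counts.items()
    }
```

Calling `summarize` on a list of tweet texts returns the count and percentage for each of the three flags, which is the same shape as the figures quoted above.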
Next we used a modified version of the language-guessing program TextCat to find the distribution of languages in our corpus.

As you might expect, English turns out to be the most common language used on Twitter (61%). But surprisingly, the next most common language is Portuguese (11%), beating out both Japanese (6%) and Spanish (4%). This is the reverse of the ordering of these three languages on the internet in general (http://www.internetworldstats.com/stats7.htm). It seems that Twitter must be a lot more popular in Brazil than in the rest of Latin America. Other languages like French, German, and even Malay are also fairly common in our corpus. It’s hard to get accurate counts for these, though, because TextCat doesn’t deal well with short texts (note the estimated 10% of messages that are unknown or misclassified), and it doesn’t seem to identify these languages in our sample as reliably as it does the first four.
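To give a feel for why short texts are hard for this kind of tool, here is a rough sketch of the character n-gram ranking approach (Cavnar and Trenkle's "out-of-place" measure) that TextCat is based on. This is a simplified illustration, not TextCat itself and not the modified version we used. The function names, the `top_k=300` cutoff, and the penalty for unseen n-grams are assumptions chosen for the example.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Rank the most frequent character n-grams (length 1..max_n) in text."""
    # Normalize case and whitespace, and pad so word edges produce n-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically so ranks are stable.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between a document and a language profile."""
    # Assumption: an n-gram absent from the language profile costs its size.
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def guess_language(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

A tweet of a dozen words yields only a few dozen distinct n-grams, so its ranking is noisy compared with a profile trained on a large body of text. That is one plausible reason for the unknown or misclassified messages mentioned above.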
We were then interested in seeing what kinds of messages get posted to Twitter. Is it really all just people talking about where they are right now and what they’re having for lunch, or is there actually some informative content out there for us to find? To answer this question, we took a random sample of 1,000 English-language tweets from the corpus we collected, and then classified each message as one of the following types:
- User’s current status — where the user is right now, what they’re doing, etc.
- Private conversations — some twitterers seem to use the service as if it were a giant internet chatroom
- Links to web content — a URL with an article title and/or some commentary on its content. Further broken down into: links to blog and news articles; links to images and videos; and other links.
- Politics, sports, current events — discussion of these topics
- Product recommendations/complaints — recommendations or complaints about specific TV shows, movies, techie gadgets, etc.
- Advertising — posted from a company’s twitter account
- Spam — a strange phenomenon, given that an account has to be followed for anyone to see its tweets, but it does exist
- Other messages — messages that don’t quite fit under any of the above categories. Fan messages to celebrities, shoutouts to other users, web-based polls and quizzes, and so on.
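The sampling and tallying step can be sketched in a few lines. This is an illustration only, not the code used in the study: the short category names, the fixed seed, and the function names are all hypothetical, and the labels themselves still have to come from whoever (or whatever) classifies each message.

```python
import random
from collections import Counter

# Hypothetical short names for the eight message types listed above.
CATEGORIES = [
    "status", "conversation", "link", "topical",
    "product", "advertising", "spam", "other",
]


def sample_tweets(tweets, k=1000, seed=0):
    """Draw up to k tweets uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(tweets, min(k, len(tweets)))


def category_shares(labels):
    """Return the percentage of labels that fall into each category."""
    total = len(labels)
    if total == 0:
        return {}
    counts = Counter(labels)
    return {cat: round(100.0 * counts[cat] / total, 1) for cat in CATEGORIES}
```

Given one label per sampled tweet, `category_shares` returns the percentage for every category, including those with no messages.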
Here is a graph showing the frequency of these different message types in the 1,000 tweet sample.

As you can see, over half of the tweets in the sample are either user statuses or private conversations. Only about 10-20% of the messages could be considered more broadly relevant to a larger audience (depending on which message types are considered to be informative). So while there is some interesting content to be discovered on Twitter, it will definitely take a bit of work to find it.
We also collected some statistics on the twitterers in our sample, such as their average rate of posting, the most common clients they use, and so on. But this post is already pretty long, so I’ll save that for a later date. Stay tuned…
Tags: corpus analysis, pragmatics, speech acts, twitter
[...] January 8th, 2010 by Cliff Crawford Leave a reply » [...]
Very nice article, thanks! For a similar style of analysis with some other approaches, see the article “Making the ordinary visible in microblogs” by Oulasvirta et al. (http://www.springerlink.com/content/8l4jl02517051307/)
I am using it for my Master’s thesis.