<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>TextWise Blog &#187; corpus analysis</title>
	<atom:link href="http://blog.textwise.com/tag/corpus-analysis/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.textwise.com</link>
	<description>A blog about the SemanticHacker API by TextWise</description>
	<lastBuildDate>Wed, 31 Aug 2011 18:50:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>How informative is Twitter? (part 2)</title>
		<link>http://blog.textwise.com/2010/01/26/how-informative-is-twitter-part-2/</link>
		<comments>http://blog.textwise.com/2010/01/26/how-informative-is-twitter-part-2/#comments</comments>
		<pubDate>Tue, 26 Jan 2010 13:30:25 +0000</pubDate>
		<dc:creator>Cliff Crawford</dc:creator>
				<category><![CDATA[Science]]></category>
		<category><![CDATA[corpus analysis]]></category>
		<category><![CDATA[pragmatics]]></category>
		<category><![CDATA[speech acts]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=269</guid>
		<description><![CDATA[In my last post, I presented some research on the different content types we found in our corpus of 8.9 million Twitter messages. One surprising result we found is that Portuguese is apparently the second most common language on Twitter, beating out both Japanese and Spanish. Given the unreliability of TextCat on short pieces of [...]]]></description>
			<content:encoded><![CDATA[
<p>In my <a href="http://blog.textwise.com/?p=222">last post</a>, I presented some research on the different content types we found in our corpus of 8.9 million Twitter messages.  One surprising result was that Portuguese is apparently the second most common language on Twitter, beating out both Japanese and Spanish.  Given the unreliability of <a href="http://www.let.rug.nl/vannoord/TextCat/">TextCat</a> on short pieces of text, I decided to verify our language statistics by looking at the location field in the user info for the unique set of users in our corpus.  This was not straightforward, however, because the location is a free-text field into which people can write absolutely anything they want.  For example, the following all occurred more than once in our corpus:</p>
<ul>
<li>&#8220;New York&#8221;</li>
<li>&#8220;NYC&#8221;</li>
<li>&#8220;everywhere!!!!&#8221;</li>
<li>&#8220;In ur computers, eating ur RAM&#8221;</li>
<li>&#8220;Earth&#8221;</li>
<li>&#8220;Mars&#8221;</li>
<li>&#8220;Utah :&#41;&#8221;</li>
<li>&#8220;utah :&#40;&#8221;</li>
</ul>
<p>To get around this problem, I normalized the text by converting it to lowercase, removing punctuation, and changing things that looked like addresses to have just the city (so that &#8220;123 Fake St., Springfield, USA&#8221; becomes just &#8220;springfield&#8221;).  I then looked at the top 500 locations in terms of number of twitterers.  These are the most common countries represented in users&#8217; locations:<br />
<img src="http://blog.textwise.com/wp-content/uploads/2010/01/twitter-countries1.png" alt="Twitter User Locations (by Country)" width="600" height="420" class="size-full wp-image-290" /><br />
And the top 10 cities are:</p>
<ol>
<li>New York</li>
<li>São Paulo</li>
<li>Los Angeles</li>
<li>London</li>
<li>Chicago</li>
<li>San Francisco</li>
<li>Rio de Janeiro</li>
<li>Tokyo</li>
<li>Atlanta</li>
<li>Toronto</li>
</ol>
<p>While the locations are dominated by English-speaking countries, Brazil does come in second in terms of number of users, and two Brazilian cities show up in the top 10, which suggests that our language stats aren&#8217;t too far off the mark.</p>
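<p>The normalization step described above can be sketched roughly as follows.  This is only an illustration, not the code used in the study: the function names, the address pattern (a leading street number followed by comma-separated parts) and the sample inputs are all assumptions.</p>
<pre>
```python
import re
import string
from collections import Counter

# Assumption: a leading street number marks an address-like string.
STREET_NUMBER = re.compile(r"^\d+\s")


def normalize_location(raw):
    """Lowercase, reduce address-like strings to the city, strip punctuation."""
    text = raw.strip().lower()
    parts = [p.strip() for p in text.split(",")]
    # Assumed address shape: "number street, city, country"; keep the city.
    if len(parts) == 3 and STREET_NUMBER.match(parts[0]):
        text = parts[1]
    # Remove ASCII punctuation, then collapse runs of whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def top_locations(raw_locations, n=500):
    """Count users per normalized location and return the n most common."""
    normalized = (normalize_location(r) for r in raw_locations)
    counts = Counter(loc for loc in normalized if loc)
    return counts.most_common(n)
```
</pre>
<p>With this sketch, "Utah :)" and "utah :(" both collapse to "utah", while "NYC" and "New York" remain distinct, so aliases like these would still need a separate mapping step.</p>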
<p>Another question we considered in our study is whether there is any way to distinguish twitterers who post broadly informative messages from those who post mainly personal messages or spam.  Our first thought was that the number of followers a twitterer has would be a good indication of how informative their messages are to a wider audience.  But we were quite surprised when we looked at the distribution of the number of followers in our sample:<br />
<img src="http://blog.textwise.com/wp-content/uploads/2010/01/numfollowers-blog1.png" alt="Histogram of Log Number of Followers on Twitter" width="600" height="480" class="size-full wp-image-284" /><br />
The x-axis here is the logarithm base 10 of the number of followers.  While most twitterers in our corpus have between 15 and 60 followers (log=1.2 to 1.8), there is a long tail where we can find accounts with more than a thousand, 100,000, or even a million followers.  We didn&#8217;t realize at first how many celebrities are currently using Twitter, as you can see in <a href="http://twitterholic.com/top100/followers/">this list</a> of the top 100 most-followed Twitter accounts.  Of course, it&#8217;s a matter of opinion whether the latest funny video that <a href="http://twitter.com/aplusk">Ashton Kutcher</a> found on YouTube is more important than what <a href="http://twitter.com/BarackObama">Barack Obama</a> has to say about health care, but for our purposes, we&#8217;d rather filter out celebrity ramblings from the more serious messages, and that is not easy to do based on the number of followers alone.</p>
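<p>For readers who want to reproduce this kind of histogram, here is a minimal sketch of binning follower counts on a log10 scale.  The bin width (five bins per decade), the function name and the handling of zero-follower accounts are assumptions; the post does not say how the original plot was binned.</p>
<pre>
```python
import math
from collections import Counter


def log_follower_histogram(follower_counts, bins_per_decade=5):
    """Count accounts per bin of log10(followers).

    Keys are the lower edge of each bin on the log10 scale.
    Accounts with zero followers are skipped (log10(0) is undefined).
    """
    bins = Counter()
    for n in follower_counts:
        if n == 0:
            continue
        index = math.floor(math.log10(n) * bins_per_decade)
        bins[index / bins_per_decade] += 1
    return dict(sorted(bins.items()))
```
</pre>
<p>With five bins per decade, an account with 15 followers (log10 of about 1.18) falls in the bin starting at 1.0, and one with 60 followers (about 1.78) falls in the bin starting at 1.6.</p>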
<p>A more surprising fact we discovered is that spammer accounts can have relatively high numbers of followers as well, as you can see in the following boxplot:<br />
<img src="http://blog.textwise.com/wp-content/uploads/2010/01/numfollowers-bytype.png" alt="Number of Followers on Twitter, by Message Type" width="600" height="480" class="size-full wp-image-286" /><br />
This data comes from the 1,000-tweet sample, classified by message type, that I discussed in my <a href="http://blog.textwise.com/?p=222">previous post</a>.  In this plot, messages about the user&#8217;s current status and private conversations are grouped together as &#8220;personal&#8221; messages, while all other messages (excluding spam) are &#8220;info&#8221; messages.  The boxes show the middle 50% of the distribution for each type, while the whiskers extending from the boxes show where 99% of the data points lie.  (A few outliers above 5,000 followers are not shown here, to make the distributions easier to see.)  While spam messages made up only a small fraction (4%) of our sample, the plot shows that quite a few spammer accounts have more than 500 to 1,000 followers, a number which would be pretty high for the other two message types.  There is even one spam account in our sample which had over 10,000 followers at the time it posted.</p>
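<p>The box portion of a plot like this is just the quartiles of each group.  Below is a small sketch of how those could be computed per message type with the Python standard library; the function names and the (type, followers) input shape are assumptions, and the quartile method used for the original plot is not stated in the post.</p>
<pre>
```python
from statistics import quantiles


def box_summary(follower_counts):
    """Return (q1, median, q3), the edges and centre line of one box."""
    q1, median, q3 = quantiles(follower_counts, n=4, method="inclusive")
    return q1, median, q3


def summarize_by_type(samples):
    """samples: iterable of (message_type, follower_count) pairs."""
    grouped = {}
    for message_type, followers in samples:
        grouped.setdefault(message_type, []).append(followers)
    # Each group needs at least two data points for quantiles().
    return {t: box_summary(v) for t, v in grouped.items()}
```
</pre>
<p>Calling <code>summarize_by_type</code> on pairs such as ("personal", 40) or ("spam", 900) gives one quartile triple per type, which is enough to compare the middle 50% of each distribution without drawing the plot.</p>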
<p>But how could a spammer get so many followers, when all they post is spam?  For nearly all of these accounts, the number of friends (accounts they are following) is greater than the number of followers, so I suspect that spammers go around following other twitterers at random, and at least some of these people follow them back out of courtesy, without realizing that they are following a spam account.  After all, the only way a Twitter spammer can get someone to see their tweets is to have that person follow them.  There&#8217;s probably a high rate of turnover in a spammer&#8217;s followers list, but that wouldn&#8217;t matter much, as long as they can find more people to follow who will follow them in return without checking them out first.</p>
<p>All of this means that distinguishing spam from informative tweets will not be easy, even if there isn&#8217;t that much of it currently.  But some good news for us is that twitterers who post lots of informative content do tend to have more followers than those who post mainly personal messages.  This fact, combined with some semantic analysis of Twitter messages, should help us a great deal in mining the Twitter stream for useful content.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2010/01/26/how-informative-is-twitter-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How informative is Twitter?</title>
		<link>http://blog.textwise.com/2010/01/08/how-informative-is-twitter/</link>
		<comments>http://blog.textwise.com/2010/01/08/how-informative-is-twitter/#comments</comments>
		<pubDate>Fri, 08 Jan 2010 19:28:08 +0000</pubDate>
		<dc:creator>Cliff Crawford</dc:creator>
				<category><![CDATA[Science]]></category>
		<category><![CDATA[corpus analysis]]></category>
		<category><![CDATA[pragmatics]]></category>
		<category><![CDATA[speech acts]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=222</guid>
		<description><![CDATA[Recently we&#8217;ve been looking at how well our Semantic Signatures technology works with messages posted to Twitter. These kinds of messages pose significant challenges for the semantic web in general, because their extremely short length (140 characters or less) means that there will be very little context available for understanding the content of the message. [...]]]></description>
			<content:encoded><![CDATA[
<p>Recently we&#8217;ve been looking at how well our Semantic Signatures technology works with messages posted to <a href="http://twitter.com/">Twitter</a>.  These kinds of messages pose significant challenges for the semantic web in general, because their extremely short length (140 characters or fewer) means that there will be very little context available for understanding the content of the message.  In addition, many of these messages feature &#8220;creative&#8221; spelling and grammar, and are of a personal nature (e.g. &#8220;Having sushi for lunch today&#8221;) that would not be of general interest.  Extracting any meaningful information from these snippets of random conversation will be quite a difficult task indeed.</p>
<p>To see what exactly we&#8217;re up against, we undertook a small study to characterize the different types of messages that can be found on Twitter.  <span id="more-222"></span>We downloaded a sample of tweets over a two-week period using the Twitter <a href="http://apiwiki.twitter.com/Streaming-API-Documentation">streaming API</a>.  This resulted in a corpus of 8.9 million messages (&#8220;tweets&#8221;) posted by 2.6 million unique users.  About 2.7 million of these tweets, or 31%, were replies to a tweet posted by another user, while half a million (6%) were <a href="http://mashable.com/2009/04/16/retweet-guide/">retweets</a>.  Almost 2 million (22%) of the messages contained a URL.</p>
<p>Next we used a modified version of the language-guessing program <a href="http://www.let.rug.nl/vannoord/TextCat/">TextCat</a> to find the distribution of languages in our corpus.</p>
<p><img src="http://blog.textwise.com/wp-content/uploads/2009/12/twitter-langs.png" alt="Languages used on Twitter" width="563" height="311" class="size-full wp-image-231" /></p>
<p>As you might expect, English turns out to be the most common language used on Twitter (61%).  But surprisingly, the next most common language is Portuguese (11%), beating out both Japanese (6%) and Spanish (4%).  This is the opposite ordering of these three languages compared to what we find on the internet in general (<a href="http://www.internetworldstats.com/stats7.htm">http://www.internetworldstats.com/stats7.htm</a>).  It seems that Twitter must be a lot more popular in Brazil than in the rest of Latin America.  Other languages like French, German, and even <a href="http://en.wikipedia.org/wiki/Malay_language">Malay</a> are also fairly common in our corpus. It&#8217;s hard to get accurate counts for these, though, because TextCat doesn&#8217;t deal well with shorter texts (note the estimated 10% of messages that are unknown or misclassified), and it doesn&#8217;t seem to be able to identify these languages in our sample as reliably as the first four languages.</p>
<p>We were then interested in seeing what kinds of messages get posted to Twitter.  Is it really all just people talking about where they are right now and what they&#8217;re having for lunch, or is there actually some informative content out there for us to find?  To answer this question, we took a random sample of 1,000 English-language tweets from the corpus we collected, and then classified each message as one of the following types:</p>
<ul>
<li>User&#8217;s current status &mdash; where the user is right now, what they&#8217;re doing, etc.</li>
<li>Private conversations &mdash; some twitterers seem to use the service as if it were a giant internet chatroom</li>
<li>Links to web content &mdash; a URL with an article title and/or some commentary on its content.  Further broken down into: links to blog and news articles; links to images and videos; and other links.</li>
<li>Politics, sports, current events &mdash; discussion of these topics</li>
<li>Product recommendations/complaints &mdash; recommendations or complaints about specific TV shows, movies, techie gadgets, etc.</li>
<li>Advertising &mdash; posted from a company&#8217;s twitter account</li>
<li>Spam &mdash; a strange phenomenon, given that an account has to be followed for anyone to see its tweets, but it does exist</li>
<li>Other messages &mdash; messages that don&#8217;t quite fit under any of the above categories.  Fan messages to celebrities, shoutouts to other users, web-based polls and quizzes, and so on.</li>
</ul>
<p>Here is a graph showing the frequency of these different message types in the 1,000 tweet sample.</p>
<p><img src="http://blog.textwise.com/wp-content/uploads/2009/12/twitter-msg-types.png" alt="Twitter message types" width="660" height="455" class="size-full wp-image-232" /></p>
<p>As you can see, over half of the tweets in the sample are either user statuses or private conversations.  Only about 10-20% of the messages could be considered more broadly relevant to a larger audience (depending on which message types are considered to be informative).  So while there is some interesting content to be discovered on Twitter, it will definitely take a bit of work to find it.</p>
<p>We also collected some statistics on the twitterers in our sample, such as their average rate of posting, the most common clients they use, and so on.  But this post is already pretty long, so I&#8217;ll save that for a later date.  Stay tuned&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2010/01/08/how-informative-is-twitter/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

