<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>TextWise Blog &#187; semantics</title>
	<atom:link href="http://blog.textwise.com/tag/semantics/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.textwise.com</link>
	<description>A blog about the SemanticHacker API by TextWise</description>
	<lastBuildDate>Wed, 31 Aug 2011 18:50:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Language, Thought, and Action</title>
		<link>http://blog.textwise.com/2011/03/03/language-thought-and-action/</link>
		<comments>http://blog.textwise.com/2011/03/03/language-thought-and-action/#comments</comments>
		<pubDate>Thu, 03 Mar 2011 21:01:44 +0000</pubDate>
		<dc:creator>Clinton Mah</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[News]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[semantics]]></category>
		<category><![CDATA[categories]]></category>
		<category><![CDATA[culture]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[understanding]]></category>
		<category><![CDATA[whorf]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=453</guid>
		<description><![CDATA[The February issue of Scientific American had an article on the latest thinking about the Whorfian Hypothesis, which states that language strongly influences how humans think. This was a hot idea about sixty years ago, but eventually fell out of academic favor because of the lack of hard empirical evidence. Now that evidence is starting [...]]]></description>
			<content:encoded><![CDATA[
<p>The February issue of Scientific American had an article on the latest thinking about the Whorfian Hypothesis, which states that language strongly influences how humans think. This was a hot idea about sixty years ago, but eventually fell out of academic favor because of the lack of hard empirical evidence. Now that evidence is starting to show up, which has some implications for computational semantics.</p>
<p>The standard view on language and meaning has recently emphasized universality. This is to say that the understanding of language is hardwired in our heads, and so any competent human should qualify as an expert in the algorithmic delineation of meaning. The Whorfian hypothesis throws us a curve here in that we now have to consider language along with culture in our models of thought. A single well-crafted taxonomy or other semantic construct will not fit all.</p>
<p>We see something of this problem on the World Wide Web. As Jimmy Wales noted this past week, the content of the Web, and Wikipedia in particular, is largely created by twenty- and thirty-something males and so is dominated by their interests. A set of semantic categories derived from the Web in general will certainly be insufficient for understanding text on finance or on medicine and may be challenged even when dealing with the pages frequented by twenty- and thirty-something females.</p>
<p>This does not mean that a given semantic scheme is invalid. Each scheme, however, is limited by the vocabulary it covers and by the kinds of distinctions that it makes. That should be good news for those of us who make our living in computational semantics.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2011/03/03/language-thought-and-action/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>IBM&#8217;s Watson &#8212; It&#8217;s Elementary, My Dear Holmes</title>
		<link>http://blog.textwise.com/2011/02/17/ibms-watson-its-elementary-my-dear-holmes/</link>
		<comments>http://blog.textwise.com/2011/02/17/ibms-watson-its-elementary-my-dear-holmes/#comments</comments>
		<pubDate>Thu, 17 Feb 2011 15:16:24 +0000</pubDate>
		<dc:creator>Clinton Mah</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[Semantic Search]]></category>
		<category><![CDATA[semantics]]></category>
		<category><![CDATA[artificial intelligence]]></category>
		<category><![CDATA[context]]></category>
		<category><![CDATA[entity extraction]]></category>
		<category><![CDATA[ibm]]></category>
		<category><![CDATA[jeopardy]]></category>
		<category><![CDATA[question answering]]></category>
		<category><![CDATA[searching]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[turing test]]></category>
		<category><![CDATA[watson]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=460</guid>
		<description><![CDATA[Watson, IBM&#8217;s Jeopardy computer, is showing everyone that it is the 900-pound gorilla of trivia and is likely to beat its human opponents. Watson could still do something stupid, but its formidable performance says much about the effectiveness of current natural language processing technology and computational resources. Although Watson has a knowledge base of millions of documents [...]]]></description>
			<content:encoded><![CDATA[
<p>Watson, IBM&#8217;s Jeopardy computer, is showing everyone that it is the 900-pound gorilla of trivia and is likely to beat its human opponents. Watson could still do something stupid, but its formidable performance says much about the effectiveness of current natural language processing technology and computational resources.</p>
<p>Although Watson has a knowledge base of millions of documents gleaned from the Web, its weakness is that it really does not understand any of this data. It is just an extremely smart entity extraction system; Watson uses the terms of a Jeopardy clue as a basis for selecting a particular entity as an answer, which of course then has to be phrased as a question. It has to figure out what kind of entity to look for and what kind of context that entity would be found in.</p>
<p>In a sense, this is a simple kind of semantic search because it involves scanning the entire knowledge base of documents and scoring contexts statistically. The entities of the right kind in the highest-scoring contexts are then the prime candidates for an answer, and Watson can use their statistics to derive a level of confidence that a given candidate is the right answer. This approach relies heavily on brute computational power.</p>
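<p>The idea of pooling context scores into a confidence level can be illustrated with a toy sketch. This is not how Watson itself is implemented; the function name <code>rank_candidates</code>, the sample scores, and the choice to treat confidence as a simple share of the total score are all assumptions made purely for illustration.</p>

```python
from collections import defaultdict

def rank_candidates(scored_contexts):
    """Pool context scores per candidate entity and turn them into shares.

    scored_contexts is a list of (entity, score) pairs, one per context in
    which a candidate entity of the right kind was found. The share of the
    total score stands in for a confidence level (an assumption).
    """
    totals = defaultdict(float)
    for entity, score in scored_contexts:
        totals[entity] += score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(entity, total / grand_total) for entity, total in ranked]

# Hypothetical scores: two contexts support "Holmes", one supports "Watson".
ranking = rank_candidates([("Holmes", 3.0), ("Watson", 1.0), ("Holmes", 1.0)])
best_entity, best_confidence = ranking[0]
```

<p>A player built this way would buzz in only when the top confidence clears some threshold, which is where the gamesmanship comes in.</p>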
<p>As can be seen in the Jeopardy competition, brute power can be quite effective. On most of the straightforward questions, the kind that one might expect Google to do well on, Watson can simply outsearch its opponents. It can grab enough right answers in this way to make up for its frequent wrong answers on more subtle questions requiring a deeper understanding. This is as much gamesmanship as it is intelligence.</p>
<p>Now imagine how overwhelming Watson could be if it actually developed some understanding and gave far fewer wrong answers. The first step in this direction is in fact quite easy: develop a large set of semantic categories corresponding to how humans understand language. Indexing a knowledge base by such predefined categories would have the immediate effect of simplifying the search process so that documents do not always have to be analyzed at the lowest linguistic level. That should allow the searches to be broader, much like allowing a chess computer to analyze more moves ahead.</p>
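<p>As a rough sketch of what indexing by predefined categories could look like, the snippet below builds an inverted index from category to documents, so that a query can narrow the candidate set before any fine-grained text analysis. The category names, document IDs, and function names are hypothetical, and a real system would presumably use weighted rather than all-or-nothing category assignments.</p>

```python
from collections import defaultdict

def build_category_index(doc_categories):
    """Map each semantic category to the set of documents tagged with it."""
    index = defaultdict(set)
    for doc_id, categories in doc_categories.items():
        for category in categories:
            index[category].add(doc_id)
    return index

def candidate_docs(index, query_categories):
    """Return the documents tagged with every category in the query."""
    sets = [index.get(category, set()) for category in query_categories]
    return set.intersection(*sets) if sets else set()

# Hypothetical knowledge base: document ID -> assigned categories.
docs = {
    "d1": {"Literature", "Detectives"},
    "d2": {"Computing"},
    "d3": {"Literature"},
}
index = build_category_index(docs)
hits = candidate_docs(index, ["Literature", "Detectives"])
```

<p>Only the documents in <code>hits</code> would then need the expensive low-level linguistic analysis.</p>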
<p>We of course are in the business of semantic dictionaries, which provide a quick way of assigning semantic categories to text documents. Hey, Watson. If you are listening, give us a call.</p>
<div id="textwise_suggestions"><h4 id='twBlogs'>Similar Blog & News Articles: Powered by <a href="http://www.textwise.com/" target="_blank"><img border="0" src="http://blog.textwise.com/wp-content/plugins/textwise/img/textwise_logo.gif" alt="TextWise" align="top" /></a></h4><ul><li><a href="http://feeds.livescience.com/~r/Livesciencecom/~3/QdHwy1aCIlU/12855-jeopardy-computer-ibm-watson-works.html">'Jeopardy!' vs. Computer: How IBM's Watson Works</a> :: <em><a href="http://www.livescience.com/">LiveScience.com</a></em></li><li><a href="http://www.npr.org/2011/02/14/133697585/on-jeopardy-its-man-vs-this-machine?ft=1&f=1001">On 'Jeopardy!' It's Man Vs. This Machine</a> :: <em><a href="http://www.npr.org/templates/story/story.php?storyId=1001&ft=1&f=1001">NPR Topics: News</a></em></li><li><a href="http://feedproxy.google.com/~r/Venturebeat/~3/BCs6puvc4HA/">IBM's Watson obliterates humans in first Jeopardy round</a> :: <em><a href="http://venturebeat.com">VentureBeat</a></em></li><li><a href="http://feeds.arstechnica.com/~r/arstechnica/index/~3/Y3SFR8OjY_Y/ibms-watson-tied-for-1st-in-jeopardy-almost-sneaks-wrong-answer-by-trebek.ars">Jeopardy: IBM's Watson almost sneaks wrong answer by Trebek</a> :: <em><a href="http://arstechnica.com/index.php">Ars Technica</a></em></li></ul><h4 id='twWiki'>Similar Wikipedia Articles: Powered by <a href="http://www.textwise.com/" target="_blank"><img border="0" src="http://blog.textwise.com/wp-content/plugins/textwise/img/textwise_logo.gif" alt="TextWise" align="top" /></a></h4><ul><li><a href="http://en.wikipedia.org/wiki/Watson%20%28artificial%20intelligence%20software%29">Watson (artificial intelligence software)</a></li></ul></div>]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2011/02/17/ibms-watson-its-elementary-my-dear-holmes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Language, Thought, and Reality</title>
		<link>http://blog.textwise.com/2010/10/19/language-thought-and-reality/</link>
		<comments>http://blog.textwise.com/2010/10/19/language-thought-and-reality/#comments</comments>
		<pubDate>Tue, 19 Oct 2010 13:00:31 +0000</pubDate>
		<dc:creator>Clinton Mah</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[News]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[semantics]]></category>
		<category><![CDATA[bias]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[meaning]]></category>
		<category><![CDATA[reality]]></category>
		<category><![CDATA[whorfian hypothesis]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=398</guid>
		<description><![CDATA[Back in the 1960s and 1970s, the Whorfian hypothesis was a hot subject on college campuses. This was the idea that one&#8217;s native language, its syntax and semantics, strongly shaped one&#8217;s worldview. For example, Eskimos speaking Inuit supposedly had thirty different words for snow and so had a more complex relationship [...]]]></description>
			<content:encoded><![CDATA[
<p>Back in the 1960s and 1970s, the Whorfian hypothesis was a hot subject on college campuses. This was the idea that one&#8217;s native language, its syntax and semantics, strongly shaped one&#8217;s worldview. For example, Eskimos speaking Inuit supposedly had thirty different words for snow and so had a more complex relationship with their environment than someone speaking English with only one word for snow.</p>
<p>The problem of course is that skiers can make plenty of distinctions about kinds of snow even in English. Although the Whorfian hypothesis was theoretically attractive, it did not square in the end with our actual experience with language. That pretty much took the steam out of the Whorfian hypothesis, but now in the 21st century, empirical support has been accumulating for a weaker version of it. This was the subject of an article in New York Times Magazine (http://nyti.ms/boqzs5).</p>
<p>The weak Whorfian hypothesis rejects the idea that language establishes an absolute limit on thinking. Thus we can learn about distinctions in types of snow if we really need them. The structure of a language, however, definitely can bias our thinking; and this could have consequences in practical matters like the ranking of retrieved documents. The choice of a particular semantic framework like RDF may therefore affect the performance of an information system in unexpected ways.</p>
<p>So far, experimental results on language and thought have focused on highly specific biases in areas of language like giving spatial directions, assigning gender to nouns, and dividing the spectrum into colors. It seems plausible, though, that this should generalize to the overall semantic problem of dividing up meaning into some kind of compact space. There is more than one way to skin a cat here, and there are probably advantages and disadvantages in each possibility.</p>
<p>A dogmatist might be tempted to argue here that RDF with certain standard taxonomies is the right way and everything else is wrong, but that is probably overreaching. We are not yet savvy enough about semantics to carve tablets in stone about its implementation. At present, one can say only whether a given scheme is optimal in some formal sense; but if it makes no obvious sense to people, then something more comprehensible might be better in the long run even if it is less than optimal.</p>
<p>The weak Whorfian hypothesis forces us to be more honest. If each semantic scheme introduces its own biases, then we need to experiment to see how different approaches work out for a given target application. Given that humans operate with more than one linguistic framework, we should not be so quick to assume that machines can do better at semantics with just a single framework.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2010/10/19/language-thought-and-reality/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Basics</title>
		<link>http://blog.textwise.com/2010/10/05/basics/</link>
		<comments>http://blog.textwise.com/2010/10/05/basics/#comments</comments>
		<pubDate>Tue, 05 Oct 2010 13:00:34 +0000</pubDate>
		<dc:creator>Clinton Mah</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[semantics]]></category>
		<category><![CDATA[concept]]></category>
		<category><![CDATA[innateness]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[meaning]]></category>
		<category><![CDATA[neuroscience]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=394</guid>
		<description><![CDATA[Linguists have long debated whether human language ability is innate or is simply learned by highly plastic neurocircuitry of a general sort. Recent studies with fMRI scans indicate, however, that cognitive skills like language understanding tend to be associated with highly specific brain locations across different individuals, supporting the idea that some kind of language-related [...]]]></description>
			<content:encoded><![CDATA[
<p>Linguists have long debated whether human language ability is innate or is simply learned by highly plastic neurocircuitry of a general sort. Recent studies with fMRI scans indicate, however, that cognitive skills like language understanding tend to be associated with highly specific brain locations across different individuals, supporting the idea that some kind of language-related structures exist. Studies of people impaired by strokes occurring in language regions also have shown this.</p>
<p>So when a young child learns that Mama is related to a concept of MOTHER, which applies to more than a single individual, this seems to draw upon specialized built-in logic within the human brain. This kind of symbolic capability is not unique to humans, being found to some extent in other large-brained social animals like elephants, whales, dolphins, and chimpanzees; but we certainly have more of it. This can be seen in the relative size and organizational complexity of human brains.</p>
<p>The implication here is that concepts like MOTHER, BIRD, HOUSE, or FOOD are real in some sense at the genetic level. We of course do not necessarily all learn the same particular concepts; for example, speakers of different languages in different cultures can be expected to develop divergent concept frameworks. Nevertheless, it is possible to translate between unrelated languages like Inuit and English, meaning that there is still a large overlap in their linguistic repertoires of concepts.</p>
<p>Consequently, when we technologists talk about incorporating semantics into search engines and other applications, we need to remember that semantics existed a long time before the first Boolean electronic circuit and that what we call &#8220;semantics&#8221; should be consistent with what goes on in our own heads. This is perhaps only a marketing concern, but the business of selling semantic technology will be that much harder if we cannot agree on what we really mean.</p>
<p>The concept of CONCEPT would seem to be a focus point for semantics that everyone can grasp. Whether we approach language and meaning like Wittgenstein or like Russell or like Korzybski or like Chomsky or like Miller or like Berners-Lee, it helps to get grounded properly.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2010/10/05/basics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>One Size Fits All?</title>
		<link>http://blog.textwise.com/2010/09/15/one-size-fits-all/</link>
		<comments>http://blog.textwise.com/2010/09/15/one-size-fits-all/#comments</comments>
		<pubDate>Wed, 15 Sep 2010 19:06:22 +0000</pubDate>
		<dc:creator>Clinton Mah</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Opinion]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[semantics]]></category>
		<category><![CDATA[customization]]></category>
		<category><![CDATA[global solutions]]></category>
		<category><![CDATA[long tail]]></category>
		<category><![CDATA[semantic learning]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=388</guid>
		<description><![CDATA[Because my wife is petite in size, she has difficulty finding clothes and shoes that fit and, being of a certain age, no longer has the option of shopping in the teens section. One understands exactly how this situation came about. Some researchers developed a demographic profile of women in the United States and then [...]]]></description>
			<content:encoded><![CDATA[
<p>Because my wife is petite in size, she has difficulty finding clothes and shoes that fit and, being of a certain age, no longer has the option of shopping in the teens section. One understands exactly how this situation came about. Some researchers developed a demographic profile of women in the United States and then applied a multivariate optimization algorithm to calculate the mix of sizes that would generate the most profit for manufacturers. If you happen to be in the tail of the demographic profile, you are just out of luck.</p>
<p>We see something similar happening on the Semantic Web. Developers seem to favor global solutions, which are often highly optimized for capturing content deemed to be the most important somehow. The problem, though, is that this strategy discounts the long tail of distributions, which is unfortunate, since pundits see long-tail processing as the new frontier for Web applications. As with clothing and shoes, an optimal mix of sizes does not fit all.</p>
<p>How we each use our words in language depends on how we learned to speak; and each of us has had a unique history. The differences in our language are most evident when we look at text in specialized areas like medicine, law, government, or technology; but even in &#8220;normal&#8221; discourse on the Web, we see jargon or unusual usages of common terms that may be missed by some kind of global semantic solution.</p>
<p>This is not to say that global solutions are invalid. They are quite useful in the absence of better information about what someone means; but the Web is moving more towards customization and personalization and localization. Our semantic frameworks should follow that lead. It means that we have to streamline our automated learning of semantic concepts so that we can in fact support individual solutions at least in part.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2010/09/15/one-size-fits-all/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hijacked</title>
		<link>http://blog.textwise.com/2010/05/26/hijacked/</link>
		<comments>http://blog.textwise.com/2010/05/26/hijacked/#comments</comments>
		<pubDate>Wed, 26 May 2010 13:00:15 +0000</pubDate>
		<dc:creator>Clinton Mah</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[Semantic Search]]></category>
		<category><![CDATA[semantics]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[MLE]]></category>
		<category><![CDATA[popularity]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=351</guid>
		<description><![CDATA[Has this ever happened to you? You are Googling for information on the Web, but inadvertently your query happens to share keywords with the latest cultural phenom: the next tweener heart throb, a YouTube video suddenly gone viral, or yet another paranoid political fantasy that refuses to die. You are a professional, however, and so [...]]]></description>
			<content:encoded><![CDATA[
<p>Has this ever happened to you? You are Googling for information on the Web, but inadvertently your query happens to share keywords with the latest cultural phenom: the next tweener heart throb, a YouTube video suddenly gone viral, or yet another paranoid political fantasy that refuses to die.</p>
<p>You are a professional, however, and so switch into Advanced Mode to reshape your query, but to no avail. Your information has been buried under pop detritus; it has been hijacked by the maximum likelihood estimate (MLE) on the Web.</p>
<p>At times like this, you want to grab your search engine by the neck and shout, &#8220;I am NOT a screaming twelve-year-old girl into dancing cats and fixated on the President&#8217;s birthplace!&#8221; But your search engine continues blithely in the wisdom of the crowd.</p>
<p>It is a reminder that statistically grounded information systems are at the mercy of their training data. If we cede too much control of a system to its finely wrought black box judgment, then we sometimes are going to run off the tracks. This is especially true with web semantics.</p>
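<p>The hijacking effect is easy to reproduce in miniature. The sketch below is an illustration only: the sense labels and counts are invented, and a real search engine is of course far more elaborate. It picks the maximum likelihood interpretation of an ambiguous query term from observed usage, so the majority sense always wins no matter who is asking.</p>

```python
from collections import Counter

def mle_sense(observed_senses):
    """Return the most frequent sense and its relative frequency.

    With nothing else to go on, the maximum likelihood estimate of what a
    term means is whatever sense dominates the training data.
    """
    counts = Counter(observed_senses)
    total = sum(counts.values())
    sense, count = counts.most_common(1)[0]
    return sense, count / total

# Hypothetical training data: nine pop-culture uses for every technical one.
observations = ["pop star"] * 9 + ["particle physics"]
sense, probability = mle_sense(observations)
```

<p>The professional looking for the minority sense gets the majority answer unless the system allows some way to override the estimate.</p>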
<p>If we do in fact want to get under the hood to adjust a semantic system to go against the popular flow, then it helps tremendously if the categories underlying the representation of document content are intelligible to people. Such transparency is a prime motivation for how semantic dictionaries are currently built by TextWise.</p>
<p>Of course, if you care nary a lick about transparency, then may I interest you in this slightly used synthetic collateralized debt obligation&#8230;.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2010/05/26/hijacked/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ABC&#8217;s of Semantic Dictionaries</title>
		<link>http://blog.textwise.com/2009/12/14/abcs-of-semantic-dictionaries/</link>
		<comments>http://blog.textwise.com/2009/12/14/abcs-of-semantic-dictionaries/#comments</comments>
		<pubDate>Mon, 14 Dec 2009 14:21:49 +0000</pubDate>
		<dc:creator>Clinton Mah</dc:creator>
				<category><![CDATA[qa]]></category>
		<category><![CDATA[Semantic Signatures]]></category>
		<category><![CDATA[semantics]]></category>
		<category><![CDATA[dimension]]></category>
		<category><![CDATA[quality]]></category>
		<category><![CDATA[semantic dictionary]]></category>
		<category><![CDATA[term]]></category>
		<category><![CDATA[weight]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=182</guid>
		<description><![CDATA[A semantic dictionary in essence consists of triples [ t , d , w ] , where t is a term, d is a semantic dimension, and w is a weight. Each triple says that the occurrence of a term t in a document constitutes a raw vote of w for dimension d being relevant [...]]]></description>
			<content:encoded><![CDATA[
<p>A semantic dictionary in essence consists of triples [ t , d , w ] , where t is a term, d is a semantic dimension, and w is a weight. Each triple says that the occurrence of a term t in a document constitutes a raw vote of w for dimension d being relevant to the document. For example, [ BRAD , Arts/People/Jolie,_Angelina , 0.12100 ] indicates that the occurrence of BRAD in a news story provides evidence that it might be about the movie celebrity Angelina Jolie. If it were conclusive evidence, the weight would be 1.00000, but we never expect any single term to be that definitive.</p>
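<p>The triple structure described above can be sketched in a few lines. Only the BRAD triple comes from this post; the other entries, the function name <code>raw_votes</code>, and the choice to combine votes by simple addition are assumptions made for illustration, not a description of the production system.</p>

```python
from collections import defaultdict

# Each entry is a triple (term, dimension, weight).
DICTIONARY = [
    ("BRAD", "Arts/People/Jolie,_Angelina", 0.12100),      # from the post
    ("BRAD", "Arts/People/Pitt,_Brad", 0.30000),           # hypothetical
    ("ANGELINA", "Arts/People/Jolie,_Angelina", 0.25000),  # hypothetical
]

def raw_votes(tokens, dictionary):
    """Add up the raw votes that a document's terms cast for each dimension."""
    by_term = defaultdict(list)
    for term, dimension, weight in dictionary:
        by_term[term].append((dimension, weight))
    votes = defaultdict(float)
    for token in tokens:
        # Terms are stored in upper case, so normalize tokens the same way.
        for dimension, weight in by_term.get(token.upper(), []):
            votes[dimension] += weight
    return dict(votes)

votes = raw_votes("Brad and Angelina attended the premiere".split(), DICTIONARY)
```

<p>Here BRAD votes for two dimensions at once, and the co-occurrence of ANGELINA is what tips the balance toward the Jolie dimension.</p>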
<p>In building a dictionary for an application, we have to start with the dimensions. Do we have the kinds of dimensions to cover the target content, and are there enough dimensions to make the distinctions in content required by the application? For a patent information system, an Angelina dimension may not have much relevance, and even something more appropriate like Electrical Machinery may have to be divided up into multiple dimensions to support a reasonable level of granularity in indexing.</p>
<p>Given the dimensions, we next have to define the terms to go along with them. The target content we want to process will have a certain vocabulary, and our dictionary terms should try to encompass most of it. This can be tricky in a statistical approach because we need reasonably large samples of training data to make a particular term become associated with a particular dimension.</p>
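<p>One simple way to check whether dictionary terms encompass most of a target vocabulary is to measure coverage on a sample of the content. The sketch below assumes whitespace tokenization and exact upper-case matching, both simplifications, and the sample terms are invented.</p>

```python
def vocabulary_coverage(tokens, dictionary_terms):
    """Return the fraction of distinct sample terms found in the dictionary."""
    vocab = {token.upper() for token in tokens}
    if not vocab:
        return 0.0
    covered = vocab & {term.upper() for term in dictionary_terms}
    return len(covered) / len(vocab)

# Hypothetical patent-like sample and a small hypothetical term list.
sample = "rotor stator winding rotor armature".split()
coverage = vocabulary_coverage(sample, ["ROTOR", "STATOR", "COIL"])
```

<p>A low figure on representative content suggests that more training data or more terms are needed before the dictionary is ready for the application.</p>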
<p>Weights are determined in large part by training data, but the distribution of those numbers is important. To begin with, not all weights should be the same, and generally, we want to see them spread out over the entire dynamic range available to us. Weights that are quite small or quite large have to be supported by more data than those in the middle range. Weights have to be balanced between dimensions, and there should be enough of them so that most terms are related to more than one dimension.</p>
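<p>These properties of a weight distribution can be checked mechanically. The following sketch reports the range of weights and the share of terms tied to more than one dimension; the function name, the sample triples, and the particular statistics chosen are assumptions for illustration.</p>

```python
from collections import defaultdict

def weight_diagnostics(dictionary):
    """Summarize the spread of weights and how many terms are ambiguous."""
    weights = [weight for _, _, weight in dictionary]
    dims_per_term = defaultdict(set)
    for term, dimension, _ in dictionary:
        dims_per_term[term].add(dimension)
    multi = sum(1 for dims in dims_per_term.values() if len(dims) > 1)
    return {
        "min_weight": min(weights),
        "max_weight": max(weights),
        "multi_dimension_share": multi / len(dims_per_term),
    }

# Hypothetical triples (term, dimension, weight).
stats = weight_diagnostics([
    ("BRAD", "Arts/People/Jolie,_Angelina", 0.121),
    ("BRAD", "Arts/People/Pitt,_Brad", 0.300),
    ("ROTOR", "Technology/Electrical_Machinery", 0.800),
])
```

<p>A narrow weight range or a low multi-dimension share would both be warning signs under the criteria above.</p>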
<p>In theory, we could build a dictionary with just one weight in just one dimension for each term, but that would be in denial about the inherent ambiguity of language. So, we typically want a dictionary to be as big as possible, based on an appropriate amount of training data. To build the best possible dictionary requires much inspiration and much perspiration.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2009/12/14/abcs-of-semantic-dictionaries/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Significance of Numbers</title>
		<link>http://blog.textwise.com/2009/08/26/significance-of-numbers/</link>
		<comments>http://blog.textwise.com/2009/08/26/significance-of-numbers/#comments</comments>
		<pubDate>Wed, 26 Aug 2009 14:46:11 +0000</pubDate>
		<dc:creator>Clinton Mah</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[semantics]]></category>
		<category><![CDATA[extreme probabilities]]></category>
		<category><![CDATA[interpreting numbers]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=71</guid>
		<description><![CDATA[A former colleague of mine used to have an entire can of soup for lunch every day. We razzed him about this, but he shook us all off until one day, I looked at the nutrition label on the can. That soup had 1800 mg of sodium altogether! We gravely informed him of this fact, [...]]]></description>
			<content:encoded><![CDATA[
<p>A former colleague of mine used to have an entire can of soup for lunch every day. We razzed him about this, but he shook us all off until one day, I looked at the nutrition label on the can. That soup had 1800 mg of sodium altogether! We gravely informed him of this fact, and the soup was soon history.</p>
<p>To understand this story, you would have to know that the recommended daily maximum dietary intake of sodium for an adult is about 2,300 mg, and closer to 1,500 mg for many people. Without this context, the number 1800 really means nothing. So what do all those numbers in a semantic dictionary mean, if anything?</p>
<p>The key property of semantic dictionary numbers is that they are based on probabilities and so have to fall between 0 and 1. They measure the likelihood that a document containing a given term is related to a given semantic dimension. For example, a dictionary weight of 1.0000 for a term and a dimension would indicate that a document containing the term is absolutely associated with the dimension.</p>
<p>There is a complication here, however. In real life, nothing is ever so certain. If we saw a 1.0000 term weight for a dimension, a more reasonable interpretation is that our sample of training data was too small for estimating the probability of that term accurately. A similar problem arises for a dictionary weight of 0.0000.</p>
<p>In general, a statistician will be highly suspicious of any extreme probabilities like 1.0000 and 0.0000. As proponents of statistical technology, we have to make a special effort to avoid such probability estimates in our semantic dictionaries. In contrast, certain other mathematical approaches to semantics tend to skate over niceties like this, choosing just to plug numbers into what is essentially a fixed formula.</p>
<p>If one is careless about the meaning of numbers, though, how can one be careful in capturing the actual meaning of words?</p>
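<p>One common way to steer clear of estimates of exactly 0 or 1 is additive (Laplace) smoothing, which adds a small pseudo-count to each outcome before dividing. The Python sketch below illustrates the idea; the function name, the pseudo-count of 1, and the notion that a weight is estimated from simple document counts are all assumptions for illustration, since this post does not say how the dictionary weights are actually estimated.</p>

```python
def smoothed_weight(hits, total, alpha=1.0):
    """Estimate a term-dimension weight with additive (Laplace) smoothing.

    hits  -- training documents containing the term that belong to the dimension
    total -- all training documents containing the term
    alpha -- pseudo-count added to each of the two outcomes (assumed value)
    """
    if hits not in range(total + 1):
        raise ValueError("hits must lie between 0 and total")
    if not alpha > 0:
        raise ValueError("alpha must be positive")
    # Two outcomes (related / not related), so 2 * alpha goes in the denominator.
    return (hits + alpha) / (total + 2 * alpha)


print(smoothed_weight(3, 3))      # 0.8 rather than a raw 3/3 = 1.0
print(smoothed_weight(0, 3))      # 0.2 rather than a raw 0/3 = 0.0
print(smoothed_weight(300, 300))  # much closer to 1, but still short of it
```

<p>With only three training documents, the estimate stays well away from the extremes, and it approaches them only as the supporting data grows.</p>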
]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2009/08/26/significance-of-numbers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Return of Rhyme and Reason</title>
		<link>http://blog.textwise.com/2009/08/13/return-of-rhyme-and-reason/</link>
		<comments>http://blog.textwise.com/2009/08/13/return-of-rhyme-and-reason/#comments</comments>
		<pubDate>Thu, 13 Aug 2009 16:01:35 +0000</pubDate>
		<dc:creator>Clinton Mah</dc:creator>
				<category><![CDATA[semantics]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[phantom tollbooth]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[two cultures]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=67</guid>
		<description><![CDATA[In Norton Juster&#8217;s classic The Phantom Tollbooth, a young boy visits the Kingdom of Wisdom and finds that its principal cities, Dictionopolis and Digitopolis, are in a cold war likely to turn quite hot. This conflict makes no sense and is the consequence of the Princesses Rhyme and Reason having been exiled to the [...]]]></description>
			<content:encoded><![CDATA[
<p>In Norton Juster&#8217;s classic <em>The Phantom Tollbooth</em>, a young boy visits the Kingdom of Wisdom and finds that its principal cities, Dictionopolis and Digitopolis, are in a cold war likely to turn quite hot. This conflict makes no sense and is the consequence of the Princesses Rhyme and Reason having been exiled to the Castle in the Air.</p>
<p>Okay, the symbolism is a bit over the top, but the conflict about whether semantics should involve numbers as opposed to some logical formalism makes just as little sense and could also benefit from the return of Sweet Rhyme and Pure Reason. There is not just one way to build a house, or plant a garden, or skin a cat. In any real-world enterprise, we always have multiple options, each with tradeoffs.</p>
<p>Our job as a semantic API developer is to provide another option with tradeoffs that are attractive to users. What we offer with statistical semantics is simplicity, transparency, broad coverage, timely data, rigor, and historical grounding of methodology. And we strive to be better each day at what we do.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2009/08/13/return-of-rhyme-and-reason/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Iron Semanticist</title>
		<link>http://blog.textwise.com/2009/08/12/iron-semanticist/</link>
		<comments>http://blog.textwise.com/2009/08/12/iron-semanticist/#comments</comments>
		<pubDate>Wed, 12 Aug 2009 13:21:58 +0000</pubDate>
		<dc:creator>Clinton Mah</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[semantics]]></category>
		<category><![CDATA[cebuano]]></category>
		<category><![CDATA[darpa]]></category>
		<category><![CDATA[iron chef]]></category>
		<category><![CDATA[machine translation]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=65</guid>
		<description><![CDATA[Some people have been disparaging a statistical approach to the semantics of natural language. This is essentially a kind of prejudice, as if we came from the wrong side of the technology railroad tracks. It ignores the fact that statistical approaches have performed spectacularly well in some high profile settings. Have you ever watched the [...]]]></description>
			<content:encoded><![CDATA[
<p>Some people have been disparaging a statistical approach to the semantics of natural language. This is essentially a kind of prejudice, as if we came from the wrong side of the technology railroad tracks. It ignores the fact that statistical approaches have performed spectacularly well in some high-profile settings.</p>
<p>Have you ever watched the &#8220;Iron Chef&#8221; on the Food Network? This is where two competing chefs are given an ingredient kept secret until the start of the show, and each contestant then has 60 minutes to create an entire meal around that ingredient. A panel then judges and critiques the two meals and crowns a winner.</p>
<p>In 2003, DARPA ran its own version of &#8220;Iron Chef,&#8221; though with only a single team of collaborators from eleven academic institutions across the U.S. The team was given a language, with the task of creating a cross-language information retrieval system and a machine translation system within TEN DAYS after learning what the language actually was.</p>
<p>To make the challenge harder, the language was not French, Arabic, or Russian, but Cebuano, a language spoken in the Philippines. None of the team was familiar with the language, but through the magic of Internet collaboration, they were able in ten days to collect a corpus of resources in Cebuano and English and apply statistical methods to create both a fully workable cross-language retrieval system and a credible start to a translation capability.</p>
<p>The two principal investigators of the Herculean exercise wrote afterward that, given what they learned in those ten days, they would do better next time. They predicted that their team could build a fully working statistical machine translation facility for a specified language in just a single day given adequate linguistic and computational resources.</p>
<p>In ten days, you could not build even a parser for a language that you have never heard of, much less develop the semantic mapping of that language into some kind of logical model of meaning to support cross-language search and machine translation. Statistical methods do work in semantics.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2009/08/12/iron-semanticist/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

