<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>TextWise Blog &#187; dimension</title>
	<atom:link href="http://blog.textwise.com/tag/dimension/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.textwise.com</link>
	<description>A blog about the SemanticHacker API by TextWise</description>
	<lastBuildDate>Wed, 31 Aug 2011 18:50:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>ABC&#8217;s of Semantic Dictionaries</title>
		<link>http://blog.textwise.com/2009/12/14/abcs-of-semantic-dictionaries/</link>
		<comments>http://blog.textwise.com/2009/12/14/abcs-of-semantic-dictionaries/#comments</comments>
		<pubDate>Mon, 14 Dec 2009 14:21:49 +0000</pubDate>
		<dc:creator>Clinton Mah</dc:creator>
				<category><![CDATA[qa]]></category>
		<category><![CDATA[Semantic Signatures]]></category>
		<category><![CDATA[semantics]]></category>
		<category><![CDATA[dimension]]></category>
		<category><![CDATA[quality]]></category>
		<category><![CDATA[semantic dictionary]]></category>
		<category><![CDATA[term]]></category>
		<category><![CDATA[weight]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=182</guid>
		<description><![CDATA[A semantic dictionary in essence consists of triples [ t , d , w ] , where t is a term, d is a semantic dimension, and w is a weight. Each triple says that the occurrence of a term t in a document constitutes a raw vote of w for dimension d being relevant [...]]]></description>
			<content:encoded><![CDATA[
<p>A semantic dictionary in essence consists of triples [ t , d , w ] , where t is a term, d is a semantic dimension, and w is a weight. Each triple says that the occurrence of a term t in a document constitutes a raw vote of w for dimension d being relevant to the document. For example, [ BRAD , Arts/People/Jolie,_Angelina , 0.12100 ] indicates that the occurrence of BRAD in a news story provides evidence that it might be about the movie celebrity Angelina Jolie. If it were conclusive evidence, the weight would be 1.00000, but we never expect any single term to be that definitive.</p>
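<p>As a minimal sketch of the idea, the triples can be held in a list and the raw votes tallied per dimension. Only the BRAD triple comes from the text above; the other entries, the function name <code>raw_votes</code>, and the choice to simply sum the votes are assumptions made for illustration, not a description of how the SemanticHacker API combines evidence.</p>

```python
from collections import defaultdict

# Dictionary entries in the [term, dimension, weight] form.
# Only the first triple is from the post; the rest are hypothetical.
TRIPLES = [
    ("BRAD", "Arts/People/Jolie,_Angelina", 0.121),
    ("BRAD", "Hardware/Fasteners", 0.040),
    ("ANGELINA", "Arts/People/Jolie,_Angelina", 0.350),
]


def raw_votes(tokens, triples):
    """Sum the raw votes each dimension receives from a document's terms.

    Summation is an assumption; the post only says each occurrence
    contributes a raw vote of w for dimension d.
    """
    by_term = defaultdict(list)
    for term, dimension, weight in triples:
        by_term[term].append((dimension, weight))

    votes = defaultdict(float)
    for token in tokens:
        # Terms are stored in upper case in this sketch.
        for dimension, weight in by_term.get(token.upper(), []):
            votes[dimension] += weight
    return dict(votes)


votes = raw_votes(["Brad", "and", "Angelina"], TRIPLES)
for dimension, score in sorted(votes.items(), key=lambda item: -item[1]):
    print(f"{dimension}: {score:.3f}")
```

<p>Tokens with no dictionary entry, such as "and" here, simply cast no vote.</p>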
<p>In building a dictionary for an application, we have to start with the dimensions. Do we have the kinds of dimensions to cover the target content, and are there enough dimensions to make the distinctions in content required by the application? For a patent information system, an Angelina dimension may not have much relevance, and even something more appropriate like Electrical Machinery may have to be divided up into multiple dimensions to support a reasonable level of granularity in indexing.</p>
<p>Given the dimensions, we next have to define the terms to go along with them. The target content we want to process will have a certain vocabulary, and our dictionary terms should try to encompass most of it. This can be tricky in a statistical approach because we need reasonably large samples of training data to make a particular term become associated with a particular dimension.</p>
<p>Weights are determined in large part by training data, but the distribution of those numbers is important. To begin with, not all weights should be the same; generally, we want to see them spread out over the entire dynamic range available to us. Weights that are quite small or quite large have to be supported by more data than those in the middle range. Weights have to be balanced between dimensions, and there should be enough of them so that most terms are related to more than one dimension.</p>
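<p>Two of these properties can be checked mechanically. The sketch below assumes weights lie between 0 and 1 (consistent with 1.00000 being conclusive evidence); the function names, the bin count, and the sample triples are hypothetical.</p>

```python
from collections import Counter, defaultdict


def weight_histogram(triples, bins=5):
    """Count weights per equal-width bin over the assumed range 0 to 1."""
    counts = Counter()
    for _term, _dimension, weight in triples:
        # A weight of exactly 1.0 goes into the last bin.
        index = min(int(weight * bins), bins - 1)
        counts[index] += 1
    return [counts[i] for i in range(bins)]


def multi_dimension_share(triples):
    """Fraction of terms that are related to more than one dimension."""
    dimensions = defaultdict(set)
    for term, dimension, _weight in triples:
        dimensions[term].add(dimension)
    shared = sum(1 for dims in dimensions.values() if len(dims) > 1)
    return shared / len(dimensions)


# Hypothetical sample; a real dictionary would hold millions of triples.
sample = [
    ("BRAD", "Arts/People/Jolie,_Angelina", 0.121),
    ("BRAD", "Hardware/Fasteners", 0.040),
    ("ANGELINA", "Arts/People/Jolie,_Angelina", 0.350),
    ("RUGELACH", "Food/Baking", 0.900),
]

histogram = weight_histogram(sample)
share = multi_dimension_share(sample)
print(histogram, round(share, 3))
```

<p>A histogram piled into one bin, or a share close to zero, would suggest the dictionary is not using its dynamic range or is treating most terms as unambiguous.</p>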
<p>In theory, we could build a dictionary with just one weight in just one dimension for each term, but that would ignore the inherent ambiguity of language. So, we typically want a dictionary to be as big as possible, based on an appropriate amount of training data. Building the best possible dictionary requires much inspiration and much perspiration.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2009/12/14/abcs-of-semantic-dictionaries/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ingredients</title>
		<link>http://blog.textwise.com/2009/07/27/ingredients/</link>
		<comments>http://blog.textwise.com/2009/07/27/ingredients/#comments</comments>
		<pubDate>Mon, 27 Jul 2009 12:12:26 +0000</pubDate>
		<dc:creator>Clinton Mah</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[semantics]]></category>
		<category><![CDATA[association]]></category>
		<category><![CDATA[dictionary]]></category>
		<category><![CDATA[dimension]]></category>
		<category><![CDATA[sample]]></category>
		<category><![CDATA[semantic]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[term]]></category>
		<category><![CDATA[training]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=55</guid>
		<description><![CDATA[This posting will probably make the eyes of most people glaze over, but current and prospective users of our SemanticHacker API should really be informed consumers. So think of this as being like one of those federally mandated labels on your bottle of Red Bull. The ingredients of a semantic dictionary are a set of [...]]]></description>
			<content:encoded><![CDATA[
<p>This posting will probably make the eyes of most people glaze over, but current and prospective users of our SemanticHacker API should really be informed consumers. So think of this as being like one of those federally mandated labels on your bottle of Red Bull.</p>
<p>The ingredients of a semantic dictionary are a set of hundreds of thousands of terms, a set of thousands of dimensions, and various numbers expressing the strength of association between a given term and a given dimension. Most of these associations will have zero strength, indicating that we have no information about them; but there will still be millions of non-zero numbers to provide a rigorous undergirding for statistical semantics.</p>
<p>We build a semantic dictionary by obtaining large training samples of documents relevant to each of its dimensions. The strength of association is then estimated as proportional to the relative frequency with which a term occurs in the training documents for a dimension versus in those for all other dimensions. The process is actually more complicated than this, but the differences are just refinements of the overall scheme as described.</p>
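<p>The overall scheme can be sketched in a few lines. This is a toy under stated assumptions, not the actual training procedure: the function name, the sample data, and the normalization (relative frequency inside the dimension divided by the sum of the inside and outside frequencies) are all hypothetical, and the refinements mentioned above are left out.</p>

```python
from collections import Counter


def association_strengths(training):
    """Estimate term/dimension strengths from tokenized training documents.

    `training` maps a dimension name to a list of documents, each a list
    of terms. Pairs never seen together are absent, i.e. zero strength.
    """
    counts = {
        dim: Counter(term for doc in docs for term in doc)
        for dim, docs in training.items()
    }
    totals = {dim: sum(c.values()) for dim, c in counts.items()}
    grand_total = sum(totals.values())

    all_counts = Counter()
    for c in counts.values():
        all_counts.update(c)

    strengths = {}
    for dim, c in counts.items():
        rest_total = grand_total - totals[dim]
        for term, n in c.items():
            inside = n / totals[dim]
            outside = (all_counts[term] - n) / rest_total if rest_total else 0.0
            # Assumed normalization: share of relative frequency in this dimension.
            strengths[(term, dim)] = inside / (inside + outside)
    return strengths


# Hypothetical training sample with two dimensions.
training = {
    "Sports": [["midfielder", "goal"], ["goal", "purple"]],
    "Food": [["rugelach", "purple"], ["rugelach", "dough"]],
}

strengths = association_strengths(training)
print(strengths[("purple", "Sports")], strengths[("rugelach", "Food")])
```

<p>With samples this small, a term seen in only one dimension gets the maximum strength, which is one reason large training samples matter.</p>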
<p>Now we all understand what terms are (e.g. britney_spears, midfielder, rugelach, purple), but where do dimensions come from? The answer is that they are somewhat arbitrary. A dimension can be defined around any kind of category for which someone has provided the requisite training documents. In many cases, we can find prior sets of categories to work from (ODP, USPTO), but we can also try to infer categories ourselves from some available pool of potential training data.</p>
<p>However we proceed here, it is necessary that the resulting dimensions be pertinent to an application of interest, be independent of each other, be supported by adequate training data, and be associated with enough terms to support semantic analysis of target text. This all can be tricky to achieve, but if it were easy, everyone would be doing it.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2009/07/27/ingredients/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

