<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>TextWise Blog &#187; quality</title>
	<atom:link href="http://blog.textwise.com/tag/quality/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.textwise.com</link>
	<description>A blog about the SemanticHacker API by TextWise</description>
	<lastBuildDate>Wed, 31 Aug 2011 18:50:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>ABC&#8217;s of Semantic Dictionaries</title>
		<link>http://blog.textwise.com/2009/12/14/abcs-of-semantic-dictionaries/</link>
		<comments>http://blog.textwise.com/2009/12/14/abcs-of-semantic-dictionaries/#comments</comments>
		<pubDate>Mon, 14 Dec 2009 14:21:49 +0000</pubDate>
		<dc:creator>Clinton Mah</dc:creator>
				<category><![CDATA[qa]]></category>
		<category><![CDATA[Semantic Signatures]]></category>
		<category><![CDATA[semantics]]></category>
		<category><![CDATA[dimension]]></category>
		<category><![CDATA[quality]]></category>
		<category><![CDATA[semantic dictionary]]></category>
		<category><![CDATA[term]]></category>
		<category><![CDATA[weight]]></category>

		<guid isPermaLink="false">http://blog.textwise.com/?p=182</guid>
		<description><![CDATA[A semantic dictionary in essence consists of triples [ t , d , w ] , where t is a term, d is a semantic dimension, and w is a weight. Each triple says that the occurrence of a term t in a document constitutes a raw vote of w for dimension d being relevant [...]]]></description>
			<content:encoded><![CDATA[
<p>A semantic dictionary in essence consists of triples [ t , d , w ] , where t is a term, d is a semantic dimension, and w is a weight. Each triple says that the occurrence of a term t in a document constitutes a raw vote of w for dimension d being relevant to the document. For example, [ BRAD , Arts/People/Jolie,_Angelina , 0.12100 ] indicates that the occurrence of BRAD in a news story provides evidence that it might be about the movie celebrity Angelina Jolie. If it were conclusive evidence, the weight would be 1.00000, but we never expect any single term to be that definitive.</p>
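<p>As a rough illustration of how such triples might be applied, the sketch below sums the raw votes for each dimension over the terms in a short text. Only the BRAD triple comes from this post. The other triples, the <code>raw_votes</code> helper, and the choice to combine votes by simple addition are assumptions made for the example, not a description of how the SemanticHacker API works.</p>

```python
from collections import defaultdict

# (term, dimension, weight) triples. Only the first entry appears in the post;
# the other two are made-up values for illustration.
TRIPLES = [
    ("BRAD", "Arts/People/Jolie,_Angelina", 0.121),
    ("BRAD", "Arts/People/Pitt,_Brad", 0.300),
    ("ANGELINA", "Arts/People/Jolie,_Angelina", 0.450),
]


def raw_votes(tokens, triples):
    """Add up weight w for dimension d each time a term t occurs in tokens."""
    by_term = defaultdict(list)
    for term, dimension, weight in triples:
        by_term[term].append((dimension, weight))

    votes = defaultdict(float)
    for token in tokens:
        # Terms are stored upper-case, so normalize each token before lookup.
        for dimension, weight in by_term.get(token.upper(), []):
            votes[dimension] += weight
    return dict(votes)


votes = raw_votes("Brad and Angelina attended the premiere".split(), TRIPLES)
for dimension, total in sorted(votes.items()):
    print(dimension, round(total, 3))
```

<p>In this toy run, the Jolie dimension collects votes from two different terms, which is the sense in which no single term has to be conclusive on its own.</p>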
<p>In building a dictionary for an application, we have to start with the dimensions. Do we have the kinds of dimensions to cover the target content, and are there enough dimensions to make the distinctions in content required by the application? For a patent information system, an Angelina dimension may not have much relevance, and even something more appropriate like Electrical Machinery may have to be divided up into multiple dimensions to support a reasonable level of granularity in indexing.</p>
<p>Given the dimensions, we next have to define the terms to go along with them. The target content we want to process will have a certain vocabulary, and our dictionary terms should try to encompass most of it. This can be tricky in a statistical approach because we need reasonably large samples of training data to make a particular term become associated with a particular dimension.</p>
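<p>One simple way to picture the sample-size problem is a support threshold: a term is only allowed to become associated with a dimension if enough training documents back it up. The sketch below is a hypothetical illustration. The <code>supported_pairs</code> helper, the <code>min_count</code> value, and the tiny training set are all assumptions for the example, not the training procedure used to build real dictionaries.</p>

```python
from collections import Counter


def supported_pairs(labeled_docs, min_count=3):
    """Return (term, dimension) pairs seen in at least min_count documents."""
    counts = Counter()
    for tokens, dimension in labeled_docs:
        # Count each term once per document, however often it repeats.
        for term in set(token.upper() for token in tokens):
            counts[(term, dimension)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Made-up training documents, each labeled with a single dimension.
docs = [
    ("the rotor spins".split(), "Electrical_Machinery"),
    ("rotor and stator".split(), "Electrical_Machinery"),
    ("a rotor winding".split(), "Electrical_Machinery"),
    ("stator core".split(), "Electrical_Machinery"),
]

pairs = supported_pairs(docs, min_count=3)
print(pairs)
```

<p>Here ROTOR appears in three documents and survives, while STATOR appears in only two and is dropped, so a rarer term needs more training data before it can enter the dictionary.</p>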
<p>Weights are determined in large part by training data, but the distribution of those numbers is important. To begin with, not all weights should be the same, and generally, we want to see them spread out over the entire dynamic range available to us. Weights that are quite small or quite large have to be supported by more data than those in the middle range. Weights have to be balanced between dimensions, and there should be enough of them so that most terms are related to more than one dimension.</p>
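<p>Two of these properties are easy to check mechanically. The sketch below is a hypothetical sanity check: it buckets weights to show how they spread over the range, and it lists terms tied to only one dimension. The helper names, the sample triples, and the assumption that weights fall between 0.0 and 1.0 are ours for illustration only.</p>

```python
from collections import Counter, defaultdict


def weight_histogram(triples, buckets=5):
    """Count weights per equal-width bucket, assuming a 0.0 to 1.0 range."""
    counts = Counter()
    for _term, _dimension, weight in triples:
        # A weight of exactly 1.0 is placed in the last bucket.
        index = min(int(weight * buckets), buckets - 1)
        counts[index] += 1
    return [counts[i] for i in range(buckets)]


def single_dimension_terms(triples):
    """Return the terms that are related to only one dimension."""
    dimensions = defaultdict(set)
    for term, dimension, _weight in triples:
        dimensions[term].add(dimension)
    return sorted(t for t, dims in dimensions.items() if len(dims) == 1)


# Made-up triples; the dimension names are placeholders.
SAMPLE = [
    ("BRAD", "A", 0.121),
    ("BRAD", "B", 0.30),
    ("VOLTAGE", "C", 0.85),
    ("ROTOR", "C", 0.55),
    ("ROTOR", "D", 0.05),
]

print(weight_histogram(SAMPLE))
print(single_dimension_terms(SAMPLE))
```

<p>An empty or overcrowded bucket in the histogram hints that the weights are not using the full range, and a long list of single-dimension terms suggests the dictionary is not capturing much ambiguity.</p>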
<p>In theory, we could build a dictionary with just one weight in just one dimension for each term, but that would deny the inherent ambiguity of language. So we typically want a dictionary to be as big as possible, given an appropriate amount of training data. Building the best possible dictionary takes much inspiration and much perspiration.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.textwise.com/2009/12/14/abcs-of-semantic-dictionaries/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

