Archive for April, 2008

The new Semantic Hacker match server exploits the fact that Semantic Signatures® are mathematically related. The original example tools we provided have a “similarity” function, which produces a “score” of how closely related two signatures are. Often, when writing an application, you have a large set of documents in hand and want to find the ones most closely related to some other document. This is exactly how the Wikipedia extension (and the front page demonstration) works. We used the API to generate a signature for every single Wikipedia page. All 2 million of them. Then we took those signatures and added them to a match server. Once that’s all done, we can get the Wikipedia articles most closely related to any document.

The concept of the match server is the same as running the similarity tool against every document and then sorting the results by score. That’s tedious work and error-prone code. We’re providing the match server to speed up development of the applications that need it.
The match server is also very fast. It can sift through all 2 million of those Wikipedia pages and grab the top matches in less than 10 milliseconds.
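
For readers who want to see what the match server saves them from writing, here is a minimal Python sketch of the "score everything, then sort" approach described above. It is not the match server's implementation. The names `top_matches` and `shared_dimensions`, and the representation of a signature as a dictionary of dimension weights, are assumptions made for illustration; in a real application the `similarity` argument would be the score returned by the similarity tool.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Assumption for illustration: a signature is a mapping of dimension -> weight.
Signature = Dict[str, float]


def top_matches(
    query: Signature,
    corpus: Dict[str, Signature],
    similarity: Callable[[Signature, Signature], float],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every stored signature against the query and keep the k best."""
    scored = ((doc_id, similarity(query, sig)) for doc_id, sig in corpus.items())
    # nlargest keeps only the k highest scores instead of sorting everything.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


def shared_dimensions(a: Signature, b: Signature) -> float:
    """Toy stand-in score: the fraction of dimensions two signatures share."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


corpus = {
    "doc_a": {"sports": 0.9, "tv": 0.1},
    "doc_b": {"cooking": 1.0},
    "doc_c": {"sports": 0.5},
}
best = top_matches({"sports": 1.0}, corpus, shared_dimensions, k=2)
print(best)
```

This brute-force loop touches every document on every query, which is exactly the cost a dedicated match server is meant to hide.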

Semantic Signatures® represent content as points in an abstract m-dimensional conceptual space. By itself, a signature can give us an idea of what a particular document is about; but signatures become even more useful when we can define a distance between two of them. Those familiar with vector space theory and the IR work of Gerard Salton will know what to do next: compute a cosine measure between a pair of signatures, interpreted as vectors.
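
For the curious, a cosine measure takes only a few lines. The sketch below assumes a signature can be held as a sparse dictionary of dimension weights; that layout and the name `cosine_similarity` are ours for illustration, not the format the API returns. When all weights are non-negative, the result falls between 0 and 1.

```python
import math
from typing import Dict

# Assumption for illustration: a signature is a sparse mapping of
# dimension -> weight, with missing dimensions treated as zero.
Signature = Dict[str, float]


def cosine_similarity(a: Signature, b: Signature) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(dim, 0.0) for dim, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty signature matches nothing
    return dot / (norm_a * norm_b)


hockey = {"sports": 0.8, "ice": 0.6}
soccer = {"sports": 1.0}
baking = {"cooking": 1.0}

print(cosine_similarity(hockey, soccer))  # shares one dimension
print(cosine_similarity(hockey, baking))  # shares none
```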

You need not understand the math underlying cosine measures. It is enough to know that, for Semantic Signatures®, they will range from 0 to 1, where 0 means no match at all and 1 means a perfect match. The problem is in determining what range of values will indicate a good match for a given application. The SemanticHacker example tools web page recommends a minimum of 0.4, with 0.8 and above being a good match; but the choice really depends on your application.

For example, if you are checking for plagiarism, then a similarity ≥ 0.95 might be a helpful result. In most information contexts, however, you are rarely interested in getting exact or close duplicates of what you have already; and so you might want to set an upper threshold of 0.9 or even 0.85 so that your matches will find more diverse information.

Similarly, when missing something entails a high cost (e.g. 9/11), you may want to lower your match threshold to 0.2. This means that most of the matches will be noise, but if you are willing to sift through them all, then there is a significant chance that you will find something. One has to make the proper trade-off here.
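
The threshold choices above amount to keeping scores inside a band. Here is a small sketch using the example values from this post; the function name `select_matches` and the sample scores are made up for illustration.

```python
from typing import Iterable, List, Optional, Tuple

Scored = Tuple[str, float]  # (document id, similarity score)


def select_matches(
    scored: Iterable[Scored],
    lower: float = 0.4,
    upper: Optional[float] = None,
) -> List[Scored]:
    """Keep matches scoring at least `lower` and, if given, at most `upper`."""
    return [
        (doc_id, score)
        for doc_id, score in scored
        if score >= lower and (upper is None or score <= upper)
    ]


# Made-up scores, purely for illustration.
scored = [("a", 0.97), ("b", 0.86), ("c", 0.55), ("d", 0.25), ("e", 0.10)]

plagiarism = select_matches(scored, lower=0.95)          # near duplicates only
diverse = select_matches(scored, lower=0.4, upper=0.9)   # related, not duplicates
high_recall = select_matches(scored, lower=0.2)          # accept a lot of noise
```

The same list of scores gives three very different result sets, which is why the right band depends on the application.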

In theory, half of any Semantic Signature® conceptual space will be within 0.7 of any given point in the space. In practice, signatures are so sparse that there will usually be only a few within 0.7 of a given reference point. This sparseness is actually a good situation to have, when your application allows you to take advantage of it.

A Semantic Signature® can be seen as the result of a kind of election to choose semantic categories to describe the content of a document. A semantic dictionary serves to define how each term in the document will vote for different categories; and so this will be critical to the usefulness of signatures. The suitability of a dictionary for an application will depend on its range of categories and on the breadth of its vocabulary for those categories.
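
To make the election picture concrete, here is a deliberately tiny toy in Python. It is not how Semantic Signatures® are actually computed. The dictionary entries, vote strengths, category names, and the function `tally_votes` are all invented for illustration; the only point is to show terms voting for categories through a dictionary, and terms missing from the dictionary casting no vote at all.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Invented toy dictionary: term -> {category: vote strength}.
TOY_DICTIONARY: Dict[str, Dict[str, float]] = {
    "goalie": {"Sports/Hockey": 1.0, "Sports/Soccer": 0.6},
    "puck": {"Sports/Hockey": 1.0},
    "recipe": {"Home/Cooking": 1.0},
}


def tally_votes(
    terms: Iterable[str],
    dictionary: Dict[str, Dict[str, float]],
    top_n: int = 30,
) -> List[Tuple[str, float]]:
    """Add up each term's votes and return the top_n categories."""
    totals: Counter = Counter()
    for term in terms:
        # Terms the dictionary does not know simply cast no votes.
        for category, vote in dictionary.get(term, {}).items():
            totals[category] += vote
    return totals.most_common(top_n)


ranking = tally_votes(["goalie", "puck", "puck", "umbrella"], TOY_DICTIONARY)
print(ranking)
```

Here "umbrella" is outside the toy vocabulary and contributes nothing, which is the situation a dictionary with too narrow a vocabulary puts you in.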

The current API dictionary was trained on the listings of the DMOZ Open Directory Project (ODP). It is particularly strong in covering the content most commonly found on the World Wide Web; for example digital electronics, video games, professional sports, movies, and cooking recipes. Since the Web seems to be biased toward the interests of young males, however, an ODP dictionary may provide less detailed coverage of subjects like designer shoes for women, Roth IRAs, or Tanzanian rural development.

When generating Semantic Signatures® for a particular application, check their weights to see how well they are capturing your own target content. In the top 30 weights now shown, you should see a good contrast between the highest and lowest weights. We want to avoid something like the 2008 Democratic U.S. presidential primaries and caucuses, where one candidate is ahead, but there seems to be no clear winner. One can approach such a degree of contrast statistically, but simple eyeballing should be good enough most of the time.
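
If you would rather not eyeball it, one crude way to put a number on contrast is sketched below. The measure itself (the gap between the highest and lowest weight, as a fraction of the highest) and the name `weight_contrast` are our own illustrative choices, not something the API provides, and the sample weights are made up.

```python
from typing import Sequence


def weight_contrast(weights: Sequence[float]) -> float:
    """Gap between highest and lowest weight, as a fraction of the highest.

    For non-negative weights this is 0.0 when all weights are equal and
    approaches 1.0 when the top weight dwarfs the lowest one.
    """
    if not weights:
        raise ValueError("need at least one weight")
    highest, lowest = max(weights), min(weights)
    if highest <= 0:
        return 0.0
    return (highest - lowest) / highest


peaked = weight_contrast([0.50, 0.10, 0.05])  # a clear winner: close to 0.9
flat = weight_contrast([0.20, 0.19, 0.20])    # no clear winner: close to 0.05
```

What counts as enough contrast is something you would need to calibrate against your own content.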

In areas where the ODP offers fine-grained coverage, you may get many relevant categories, which is OK. The problem is when you see signature weights about the same for many categories that don’t seem to be closely related. In that case, you may want to try increasing the amount of text you generate signatures from in order to get more corroboration on voting. If you insist on doing women’s haute couture or calls and puts in the options market, however, you probably want a specialized semantic dictionary; this is not difficult to build, but requires proper training data.

NOTE: For the purpose of the SemanticHacker Innovators’ Challenge we will evaluate all application prototypes using the general purpose dictionary provided with the API. We understand that certain dictionary customization may be required after a winner is selected to improve the “matching” capability for a vertical. That work will be included in the product build.


TextWise will be speaking and exhibiting at the 2008 Semantic Technology Conference in San Jose, CA. We would like to invite you to attend with a $200 discount off the full registration fee. The discount expires May 9th. MORE INFORMATION. The tutorials, sessions, etc. run from May 18th – 23rd.

Join our conference session on Wednesday, May 21st from 9:45 – 10:45am.

Also, look for us at booth #302 and we’ll have some of our beta applications up (internet access permitting) that you can play with and plenty of representatives to chat with. Hope to see you there!

The GIGO principle has long been part of the wisdom of computing. In the age of Web 2.0 and higher, when anyone can be an information creator, quality still counts. To develop that next killer app, we need not only cutting edge concepts and technology, but also decent data.

Semantic Signatures® are derived from text by what is essentially a complicated voting scheme: occurrences of terms in the text select the semantic dimensions used to represent its content. The current API returns the top 30 dimensions, but the reliability of those results for a document will depend on how well its voting terms corroborate each other. If we have only two or three terms that are completely unrelated to each other, then selection of dimensions may be hit or miss.

Discerning a corroboration problem by just looking at a Semantic Signature® is nearly impossible. Dimensional weights in a signature will always be normalized to make them more easily comparable, which makes it difficult to infer anything about the original text. The safer check is on the input itself: make sure that the text is not something like a splash page, a login page, or a status page without any content of interest.

We have been working with Semantic Signatures® for a while now, both on the theoretical and on the practical sides. In that time, we have learned a few tricks that you won’t find in the marketing materials or the official API documentation. Our new blog series, “The Skinny on Semantic Signatures®,” will try to share these insights with you, the prospective semantic hacker.

Now, sharing of information is the very life of the Web; but with rather large sums of money at stake in this challenge, it would be understandable if people were a bit tight-lipped. However, we still want to try to fill in some of the gaps. After all, we do want you all to succeed in building innovative software prototypes or business plans using Semantic Signatures®.

We believe that Semantic Signatures® provide an unprecedented scalable tool for computational characterization of meaning in natural language text; but one must remember that they are statistical constructs based on numbers. To make numbers work right for you, you must also treat them properly. This is not being negative, but wise.