Archive for the ‘Semantic Signatures’ Category

According to WordNet, the word BANK has multiple senses, and so any occurrence of it in a text document is ambiguous. For example, we can have a river BANK, a financial BANK, a fog BANK, or an aeronautical BANK. The intended sense in a particular document has to be determined by looking at the context of occurrence. So, to determine the actual meaning of BANK in a document, we have to ask in effect whether the document is talking about streams of water, financial meltdowns, marine navigation, or aircraft in flight.

Now the number of different possible contexts is probably huge. One cannot hope to recognize them all; but for disambiguation of words, we need only fairly general contexts to distinguish the word senses of prime interest to us. Furthermore, given a large sample of our target text, we can employ statistical methods to identify the most important of such contexts.

This is essentially what SemanticHacker is all about. The dimensions of one of our semantic dictionaries define thousands of contextual reference points for the interpretation of terms. For example, if the words stream, water, flow, erosion, and grass are in a document, then with the ODP 2009 dictionary, we find that the top match dimension is 1461 (Top/Science/Environment/Water_Resources) with a weight of 0.5138. In this context, the word BANK would probably mean “river bank.”

Actually, we don’t need to make this explicit association. With a search engine user interface, one just needs a way of describing the context of ambiguous search terms, perhaps by listing contextual words. Then all a semantic search engine has to do is find a document containing the search term and having the same described context in its semantic signature. This is of course a part of our API for search.

Semantic Signatures® represent content as points in an abstract m-dimensional conceptual space. By itself, a signature can give us an idea of what a particular document is about; but signatures become even more useful when we can define a distance between two of them. Those familiar with vector space theory and the IR work of Gerard Salton will know what to do next: interpret a pair of signatures as vectors and compute a cosine measure between them.

You need not understand the math underlying cosine measures. It is enough to know that, for Semantic Signatures®, they will range from 0 to 1, where 0 means no match at all and 1 means a perfect match. The problem is in determining what range of values will indicate a good match for a given application. The SemanticHacker example tools web page recommends a minimum of 0.4, with 0.8 and above being a good match; but the choice really depends on your application.
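For readers who do want to see the arithmetic, here is a minimal sketch of a cosine measure. It assumes a signature can be held as a mapping from dimension id to a non-negative weight, which is our own illustrative representation and not necessarily what the API returns. The two example signatures and the function name are made up for the illustration.

```python
import math

def cosine(sig_a, sig_b):
    """Cosine measure between two sparse signatures (dimension id -> weight)."""
    # Only dimensions present in both signatures contribute to the dot product.
    dot = sum(w * sig_b[d] for d, w in sig_a.items() if d in sig_b)
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty signature matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical signatures: ids and weights are invented for illustration.
river_doc = {1461: 0.5138, 2040: 0.30, 877: 0.12}
finance_doc = {3312: 0.60, 2040: 0.10, 95: 0.25}

print(cosine(river_doc, river_doc))    # identical signatures
print(cosine(river_doc, finance_doc))  # one shared dimension only
```

With non-negative weights, the result stays between 0 and 1, as described above: signatures with no dimension in common score 0, and a signature compared with itself scores 1 (up to rounding).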

For example, if you are checking for plagiarism, then a similarity ≥ 0.95 might be a helpful result. In most information contexts, however, you are rarely interested in getting exact or close duplicates of what you have already; and so you might want to set an upper threshold of 0.9 or even 0.85 so that your matches will find more diverse information.

Similarly, when missing something entails a high cost (e.g. 9/11), then you may want to lower your match threshold to 0.2. This means that most of the matches will be noise, but if you are willing to sift through them all, then there is a significant chance that you will find something. One has to make the proper trade-off here.
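The threshold choices discussed above can be sketched as a simple band filter. The function name, the (document id, similarity) pair format, and the sample scores are all hypothetical; only the threshold values come from the discussion.

```python
def filter_matches(scored, lower=0.4, upper=None):
    """Keep (doc_id, similarity) pairs inside the band, best match first.

    lower: minimum similarity to count as a match.
    upper: optional maximum similarity, to screen out near duplicates.
    """
    kept = [
        (doc_id, sim)
        for doc_id, sim in scored
        if sim >= lower and (upper is None or sim <= upper)
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical similarity scores against one reference signature.
scored = [("a", 0.97), ("b", 0.88), ("c", 0.45), ("d", 0.25), ("e", 0.10)]

plagiarism = filter_matches(scored, lower=0.95)            # near duplicates only
diverse = filter_matches(scored, lower=0.4, upper=0.9)     # related but not duplicate
high_recall = filter_matches(scored, lower=0.2)            # noisy, misses little
```

Keeping both bounds as parameters makes the trade-off explicit: each application picks its own band rather than relying on one built-in cutoff.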

In theory, half of any Semantic Signature® conceptual space will be within 0.7 of any given point in the space. In practice, signatures are so sparse that there will usually be only a few within 0.7 of a given reference point. This sparseness is actually a good situation to have, when your application allows you to take advantage of it.

A Semantic Signature® can be seen as the result of a kind of election to choose semantic categories to describe the content of a document. A semantic dictionary serves to define how each term in the document will vote for different categories; and so this will be critical to the usefulness of signatures. The suitability of a dictionary for an application will depend on its range of categories and on the breadth of its vocabulary for those categories.

The current API dictionary was trained on the listings of the DMOZ Open Directory Project. It is particularly strong in covering the content most commonly found on the World Wide Web; for example, digital electronics, video games, professional sports, movies, and cooking recipes. Since the Web seems to be biased toward the interests of young males, however, an ODP dictionary may provide less detailed coverage of subjects like designer shoes for women, Roth IRAs, or Tanzanian rural development.

When generating Semantic Signatures® for a particular application, check their weights to see how well they are capturing your own target content. In the top 30 weights now shown, you should see a good contrast between the highest and lowest weights. We want to avoid something like the 2008 Democratic U.S. presidential primaries and caucuses, where one candidate is ahead, but there seems to be no clear winner. One can approach such a degree of contrast statistically, but simple eyeballing should be good enough most of the time.
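If eyeballing gets tedious, one rough way to put numbers on contrast is sketched below. The two statistics (ratio of highest to lowest weight, and the top weight's share of the total) are our own illustrative choices, not measures defined by the API, and what counts as "good" contrast remains a judgment call for your application.

```python
def contrast_report(weights):
    """Summarize how sharply the top weight stands out from the rest."""
    if not weights:
        raise ValueError("need at least one weight")
    ordered = sorted(weights, reverse=True)
    total = sum(ordered)
    lowest = ordered[-1]
    return {
        # Close to 1.0 means the weights are nearly flat: no clear winner.
        "max_to_min": ordered[0] / lowest if lowest > 0 else float("inf"),
        # Fraction of all weight held by the single strongest dimension.
        "top_share": ordered[0] / total if total > 0 else 0.0,
    }

# Hypothetical weight lists, shortened for the example.
clear_winner = contrast_report([0.5, 0.25, 0.125, 0.125])
no_winner = contrast_report([0.1] * 30)
```

A flat list such as the second one gives a ratio of 1.0, which is the "no clear winner" situation to watch out for.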

In areas where the ODP offers fine-grain coverage, you may get many relevant categories, which is OK. The problem is when you see signature weights about the same for many categories that don’t seem to be closely related. In that case, you may want to try increasing the amount of text you generate signatures from in order to get more corroboration on voting. If you insist on doing women’s haute couture or calls and puts in the options market, however, you probably want a specialized semantic dictionary; this is not difficult to build, but requires proper training data.

NOTE: For the purpose of the SemanticHacker Innovators’ Challenge we will evaluate all application prototypes using the general purpose dictionary provided with the API. We understand that certain dictionary customization may be required after a winner is selected to improve the “matching” capability for a vertical. That work will be included in the product build.

The GIGO principle has long been part of the wisdom of computing. In the age of Web 2.0 and higher, when anyone can be an information creator, quality still counts. To develop that next killer app, we need not only cutting edge concepts and technology, but also decent data.

Semantic Signatures® are derived from text by what is essentially a complicated voting scheme, in which occurrences of terms in the text select the semantic dimensions that represent its content. The current API returns the top 30 dimensions, but reliability of those results for a document will depend on how well its voting terms corroborate each other. If we have only two or three terms that are completely unrelated to each other, then selection of dimensions may be hit or miss.
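To make the voting idea concrete, here is a deliberately simplified toy. It is not the actual signature algorithm, which is more complicated; the dictionary contents, the dimension names, and the choice to scale weights to unit length are all assumptions made for the illustration.

```python
import math
from collections import Counter

def toy_signature(terms, dictionary, top_k=30):
    """Let each term occurrence vote for dimensions, then keep the top_k."""
    votes = Counter()
    for term in terms:
        # Terms missing from the dictionary simply cast no votes.
        for dim, weight in dictionary.get(term, {}).items():
            votes[dim] += weight
    top = votes.most_common(top_k)
    norm = math.sqrt(sum(w * w for _, w in top))
    if norm == 0.0:
        return {}
    # Assumed normalization: scale the kept weights to unit length.
    return {dim: w / norm for dim, w in top}

# Hypothetical dictionary: term -> {dimension name: vote weight}.
toy_dict = {
    "stream": {"water": 1.0, "media": 0.5},
    "water": {"water": 1.0},
    "flow": {"water": 0.6, "finance": 0.2},
    "bank": {"water": 0.3, "finance": 0.7},
}

corroborated = toy_signature(["stream", "water", "flow", "bank"], toy_dict)
lonely = toy_signature(["bank"], toy_dict)
```

In the first call, several terms agree on the "water" dimension, so it wins comfortably. In the second, a single ambiguous term decides everything, which is the hit-or-miss case described above.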

Discerning a corroboration problem by just looking at a Semantic Signature® is nearly impossible. Dimensional weights in a signature will always be normalized to make them more easily comparable, making it difficult to infer anything about the original text. Instead, look at the source text itself, and make sure that it is not something like a splash page, a login page, or a status page without any content of interest.

We have been working with Semantic Signatures® for a while now, both on the theoretical and on the practical sides. In that time, we have learned a few tricks that you won’t find in the marketing materials or the official API documentation. Our new blog series, “The Skinny on Semantic Signatures®,” will try to share these insights with you, the prospective semantic hacker.

Now sharing of information is the very life of the Web, but this being a challenge with rather large sums of money at stake, it would be understandable if people were a bit tight-lipped. However, we still want to try to fill in some of the gaps. After all, we do want you all to succeed in building innovative software prototypes or business plans using Semantic Signatures®.

We believe that Semantic Signatures® provide an unprecedented scalable tool for computational characterization of meaning in natural language text; but one must remember that they are statistical constructs based on numbers. To make numbers work right for you, you must also treat them properly. This is not being negative, but wise.