A Semantic Signature® can be seen as the result of a kind of election in which semantic categories are chosen to describe the content of a document. A semantic dictionary defines how each term in the document votes for the different categories, so it is critical to the usefulness of signatures. The suitability of a dictionary for an application depends on its range of categories and on the breadth of its vocabulary for those categories.
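To make the election metaphor concrete, here is a minimal sketch in Python. It is not the actual Semantic Signature® algorithm, which is not described here. The toy dictionary, its categories, its vote strengths, and the function name `toy_signature` are all hypothetical, chosen only to illustrate how terms might cast weighted votes that are then tallied and ranked.

```python
from collections import defaultdict

# Hypothetical toy dictionary: term -> {category: vote strength}.
# A real semantic dictionary would be trained from data such as ODP listings.
TOY_DICTIONARY = {
    "console": {"Video Games": 0.7, "Electronics": 0.3},
    "controller": {"Video Games": 0.6, "Electronics": 0.4},
    "recipe": {"Cooking": 1.0},
}


def toy_signature(text, dictionary=TOY_DICTIONARY):
    """Tally the category votes of every known term in `text`.

    Returns (category, weight) pairs sorted by descending weight,
    with the weights normalized to sum to 1. Terms missing from the
    dictionary cast no votes, so narrow vocabulary means fewer votes.
    """
    tally = defaultdict(float)
    for term in text.lower().split():
        for category, vote in dictionary.get(term, {}).items():
            tally[category] += vote
    total = sum(tally.values())
    if total == 0:
        return []  # no term in the text was covered by the dictionary
    ranked = ((category, weight / total) for category, weight in tally.items())
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


signature = toy_signature("console controller recipe")
for category, weight in signature:
    print(f"{category}: {weight:.3f}")
```

In this sketch a text whose terms are absent from the dictionary yields an empty result, which mirrors the point above: a dictionary can only elect categories for which it has vocabulary.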
The current API dictionary was trained on the listings of the DMOZ Open Directory Project (ODP). It is particularly strong in covering the content most commonly found on the World Wide Web: for example, digital electronics, video games, professional sports, movies, and cooking recipes. Since the Web seems to be biased toward the interests of young males, however, an ODP-based dictionary may provide less detailed coverage of subjects like designer shoes for women, Roth IRAs, or Tanzanian rural development.
When generating Semantic Signatures® for a particular application, check their weights to see how well they capture your own target content. In the top 30 weights now shown, you should see a good contrast between the highest and lowest weights. We want to avoid something like the 2008 Democratic U.S. presidential primaries and caucuses, where one candidate is ahead but there seems to be no clear winner. One can assess the degree of contrast statistically, but simple eyeballing should be good enough most of the time.
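For anyone who prefers a number to eyeballing, the following Python sketch shows one possible way to quantify contrast. Both measures, the ratio of highest to lowest weight and the normalized entropy of the weights, are illustrative choices rather than anything prescribed by the API, and the function name `contrast_report` is hypothetical. It assumes the signature weights are available as a plain list of positive numbers.

```python
import math


def contrast_report(weights, top_n=30):
    """Summarize the contrast among the `top_n` largest weights.

    Returns a dict with:
      ratio              -- highest weight divided by lowest weight
      normalized_entropy -- 1.0 means all weights are equal (no clear
                            winner); values near 0 mean a few categories
                            dominate.
    Assumes every weight considered is strictly positive.
    """
    top = sorted(weights, reverse=True)[:top_n]
    if not top or top[-1] <= 0:
        raise ValueError("need at least one weight, all strictly positive")
    total = sum(top)
    shares = [w / total for w in top]
    entropy = -sum(p * math.log(p) for p in shares)
    # With a single weight the entropy is 0; avoid dividing by log(1) == 0.
    max_entropy = math.log(len(shares)) if len(shares) > 1 else 1.0
    return {
        "ratio": top[0] / top[-1],
        "normalized_entropy": entropy / max_entropy,
    }


flat = contrast_report([1.0] * 30)             # every category tied
peaked = contrast_report([10.0] + [0.1] * 29)  # one clear winner
print(flat)
print(peaked)
```

A flat signature gives a ratio near 1 and a normalized entropy near 1, which is the "no clear winner" case described above. What counts as enough contrast is a judgment call for each application; the sketch does not supply a threshold.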
In areas where the ODP offers fine-grained coverage, you may get many relevant categories, which is OK. The problem arises when you see roughly equal signature weights for many categories that don’t seem to be closely related. In that case, you may want to try increasing the amount of text you generate signatures from, so that more terms corroborate each other's votes. If you insist on working with women’s haute couture or calls and puts in the options market, however, you probably want a specialized semantic dictionary; this is not difficult to build, but it requires proper training data.
NOTE: For the purpose of the SemanticHacker Innovators’ Challenge we will evaluate all application prototypes using the general purpose dictionary provided with the API. We understand that certain dictionary customization may be required after a winner is selected to improve the “matching” capability for a vertical. That work will be included in the product build.