Archive for March, 2008

TextWise Semantic Signatures® are based on vector spaces with thousands of dimensions, each corresponding to a single concept in some domain of interest. We use a special semantic dictionary to map the content of a given text document into a point within that conceptual space; and one can then gauge the similarity of two documents from the distance between their points in the conceptual space.
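As a rough illustration of comparing two documents by their points in a conceptual space, here is a minimal sketch in Python. It assumes a signature can be represented as a sparse mapping from concept name to weight. The concept names, the weights, and the choice of Euclidean distance and cosine similarity are assumptions made for the example, not a description of the actual TextWise measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {concept: weight} dicts."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty signature is treated as similar to nothing
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two sparse vectors."""
    concepts = set(a) | set(b)
    return math.sqrt(sum((a.get(c, 0.0) - b.get(c, 0.0)) ** 2 for c in concepts))

# Hypothetical signatures: two documents about related topics, one unrelated.
doc_a = {"astronomy": 0.8, "physics": 0.6}
doc_b = {"astronomy": 0.6, "physics": 0.8}
doc_c = {"cooking": 1.0}

print(cosine_similarity(doc_a, doc_b))   # close to 1: similar content
print(cosine_similarity(doc_a, doc_c))   # 0.0: no shared concepts
print(euclidean_distance(doc_a, doc_c))  # larger distance: dissimilar content
```

Storing only the nonzero weights keeps the comparison cheap even when the full space has thousands of dimensions.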

High-dimensional vector spaces should be quite familiar to information retrieval specialists and users of Salton’s SMART system. One should note, however, that SMART is based solely on counting word occurrences and so is not semantic. The highly skewed distribution of words in text, as described by Zipf’s Law, means that the dimensions will be highly unbalanced, much like all the extra folded up physical dimensions of strings in current grand unification theories.

Semantic spaces behave better, in large part because we get to choose the concepts for their dimensions. One wants those concepts to be independent, well balanced, and representative of the kind of text content to be described. This turns out to be a challenging set of requirements, but the underlying ideas are quite straightforward.

TextWise takes a purely statistical approach to semantics. Each concept in a semantic space has to be defined by a big sample of text documents related to that concept. We can then apply standard language modeling methods to such data to estimate the conditional probabilities of certain terms being associated with certain concepts; and these numbers, with a few adjustments, will then constitute our semantic dictionaries. This whole process is called “training.”
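The training idea can be sketched with a toy example. The code below assumes a simple unigram model with additive smoothing and a uniform prior over concepts, and it skips the "few adjustments" mentioned above. The function name `train_dictionary`, the smoothing parameter `alpha`, and the sample texts are all hypothetical, and the real TextWise procedure may differ substantially.

```python
from collections import Counter

def train_dictionary(samples, alpha=1.0):
    """Build {term: {concept: P(concept | term)}} from {concept: [documents]}."""
    # Count term occurrences in the training sample of each concept.
    counts = {
        concept: Counter(word for doc in docs for word in doc.lower().split())
        for concept, docs in samples.items()
    }
    vocab = set().union(*counts.values())
    dictionary = {}
    for term in vocab:
        # Smoothed P(term | concept) for every concept (add-alpha smoothing).
        likelihood = {
            concept: (cnt[term] + alpha) / (sum(cnt.values()) + alpha * len(vocab))
            for concept, cnt in counts.items()
        }
        # Bayes' rule with a uniform prior: normalize across concepts.
        total = sum(likelihood.values())
        dictionary[term] = {c: p / total for c, p in likelihood.items()}
    return dictionary

# Hypothetical training samples, two tiny documents per concept.
samples = {
    "astronomy": ["the star and the planet", "planet orbit star"],
    "cooking": ["the pan and the oven", "oven recipe pan"],
}
dictionary = train_dictionary(samples)
print(dictionary["planet"])  # weighted toward "astronomy"
print(dictionary["the"])     # evenly split: the term carries no signal
```

A term that occurs equally often in every concept's sample ends up with a flat distribution, which is why common function words contribute little to a signature under this scheme.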

TextWise has already built several large semantic dictionaries, most notably one with categories and training data from the USPTO and another with categories and indexed web pages from the ODP (the Open Directory Project, also known as DMOZ). The latter is probably the best choice for working with web applications, but one should note that many ODP categories have had to be consolidated or eliminated in order to satisfy minimum training data requirements for a dictionary.

A Semantic Signature® derived with the latest ODP dictionary will have over ten thousand dimensions. This would be hard to work with, but for typical web pages only a few of those dimensions will have a significant weight. So our API currently keeps only the top 30 dimensions of a signature, which should be plenty to work with. A subsequent release will introduce signatures with variable degrees of truncation according to the actual statistical significance of weights.
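Keeping only the highest-weighted dimensions is easy to sketch. The snippet below assumes a signature is a mapping from concept name to weight. The function name `truncate_signature` and the sample data are hypothetical, and this is not the API's actual implementation.

```python
import heapq

def truncate_signature(signature, k=30):
    """Keep the k dimensions with the largest weights; drop the rest."""
    top = heapq.nlargest(k, signature.items(), key=lambda item: item[1])
    return dict(top)

# Hypothetical signature with 100 weighted dimensions.
full = {f"concept_{i}": i / 100 for i in range(100)}
truncated = truncate_signature(full)
print(len(truncated))  # 30
```

Using `heapq.nlargest` avoids sorting all ten thousand or more dimensions when only a few dozen are needed.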

Because they are statistical, Semantic Signatures® will always have a small unavoidable degree of noise. It is possible that some highly weighted categories in a signature will be wrong, just as it is possible for the house to lose some bets in a casino. With proper handling of training data, though, one should be able to ensure that the house will in fact still win most of the time.

Semantics is a slippery subject to talk about because the meaning of ‘meaning’ is poorly defined. That’s a pity, but those are the breaks that we have to live with. We believe there are at least three different approaches to deciphering meaning.

1. We can build a computational representation of the world and try to understand natural language by mapping its elements into that model. This is the classical artificial intelligence method of Minsky and McCarthy; and the prime example here would be CYC, but one could also work with simpler semantic networks.

2. We could define meaning procedurally as done in programming languages. This is the way that natural language interfaces to formatted databases are typically constructed. That is, a question in English might be translated into SQL that then produces an answer.

3. We can organize a system of tags to identify types of content, possibly according to some standard taxonomy. The type of tag is critical here, because those employed in HTML and other schemes are clearly not semantic. An example of semantic tagging is entity extraction, where the names of persons, organizations, and geographic regions in a text document are identified.

The Semantic Web of Berners-Lee aspires to both 2 and 3, but from a practical perspective, Web application developers tend to focus on only one or the other.

The TextWise view of semantics falls mainly into approach 3, with a little of 1. We do not define taxonomies, but have a high-dimensional semantic space that we map the content of text documents into. We avoid the huge cost of model building, but still have an extensive array of concepts that can be used in effect to triangulate meaning.

We’ve had a successful first couple of days after launching the Challenge. In addition to making TechCrunch, a few other sites where we’ve received coverage include:

We also have a couple of applications and ideas already on the Forum. Overall we feel confident that something amazing will come out of this!

The doors to the SemanticHacker API and SemanticHacker Innovators’ Challenge officially opened today at 9am EST.

TechCrunch broke the news first.

Official Press Release

Unlocking the intellectual capital trapped inside organizations that deal with enormous amounts of text is a major challenge. To accelerate the development of new semantic business applications tackling this problem, TextWise LLC, a pioneer in semantic application development, today announced the SemanticHacker Innovators’ Challenge (SHIC). TextWise is offering an incentive of up to $1 million for each useful and inventive implementation of the firm’s open API for semantic discovery.

“Organizations need to cut through the noise and see what really matters both in their information archives and on the Web. They need systems that can help them process masses of information and understand what it all means,” said Connie Kenneally, TextWise CEO. With a system that can automatically discern the meaning of text, it becomes easy to bring relevant, related information to the surface from huge volumes of text, with little or no prior organization and no keyword search.

“We get paralyzed by the scale of what we find online and in our corporate networks,” stated Ms. Kenneally. The TextWise SemanticHacker open API uses patented technology to decode and distill the meaning of any piece of text by creating a Semantic Signature®. A Semantic Signature® can be thought of as a representation of the ‘DNA’ of that text.

TextWise’s multi-million dollar challenge is designed to showcase the power of the firm’s patented Semantic Signature® technology and to accelerate the development of breakthrough applications for Semantic Discovery. The overall goal of the SemanticHacker Innovators’ Challenge is to encourage the development of software prototypes and/or business plans that demonstrate commercial viability in specific industries. Information-intensive businesses in the legal, pharmaceutical, and financial markets are all good candidates for Semantic Discovery applications, but other industries are also of interest.

TextWise is making its open API available for immediate use, at www.semantichacker.com, and inviting developers and business innovators throughout the U.S. to submit software prototypes and/or business plan ideas. The new site also includes a simple demonstration of the use of Semantic Signatures. “We know that there are brilliant minds out there, looking for ways to crack the infoglut epidemic, and we’re ready to fund the best and most inspiring solutions with up to a million dollars each,” said Ms. Kenneally.

A judging panel composed of industry experts, venture capitalists, and representatives from TextWise and its advisory board will assess all entries to the TextWise SHIC program. Once selected, the winner or winners of the Challenge will each receive $100,000 immediately, with the potential for revenue share up to an additional $900,000 over the first year after the chosen application reaches the market.

The SemanticHacker Innovators’ Challenge is open to individuals who are legal U.S. residents and over the age of 18 at the time the entry is submitted. Full details on SHIC, including rules, entry requirements, and judging criteria, are available online at www.semantichacker.com.

TextWise is busy extending its application reach beyond its successful beta in contextual advertising to other semantic applications including an ad replacement plug-in using semantic matching, a discovery tool for eBay auctions, a blog organization service that uses semantic bookmarks, and a Facebook application for sharing conversation and organizing links among users.