Archive for the ‘Semantic Signatures’ Category

A semantic dictionary in essence consists of triples [ t , d , w ] , where t is a term, d is a semantic dimension, and w is a weight. Each triple says that the occurrence of a term t in a document constitutes a raw vote of w for dimension d being relevant to the document. For example, [ BRAD , Arts/People/Jolie,_Angelina , 0.12100 ] indicates that the occurrence of BRAD in a news story provides evidence that it might be about the movie celebrity Angelina Jolie. If it were conclusive evidence, the weight would be 1.00000, but we never expect any single term to be that definitive.
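As a rough illustration of how such triples might be put to work, the Python sketch below stores a few of them and tallies the raw votes per dimension for a tokenized document. Only the BRAD entry comes from the example above; the other triples, their weights, and the function names are invented for illustration, and simply summing the weights is an assumption about how votes could be combined.

```python
from collections import defaultdict

# Each triple is (term, dimension, weight). Only the BRAD/Jolie entry is
# taken from the text; the remaining triples are invented for illustration.
TRIPLES = [
    ("BRAD", "Arts/People/Jolie,_Angelina", 0.12100),
    ("ANGELINA", "Arts/People/Jolie,_Angelina", 0.30000),
    ("BRAD", "Business/Hardware/Fasteners", 0.05000),
]


def build_dictionary(triples):
    """Index triples by term so each term maps to its (dimension, weight) pairs."""
    index = defaultdict(list)
    for term, dimension, weight in triples:
        index[term].append((dimension, weight))
    return index


def raw_votes(tokens, index):
    """Sum the weights voted for each dimension by the terms in a document.

    Summation is an assumed way of combining votes, chosen for simplicity.
    """
    votes = defaultdict(float)
    for token in tokens:
        for dimension, weight in index.get(token.upper(), []):
            votes[dimension] += weight
    return dict(votes)


votes = raw_votes(["Brad", "and", "Angelina"], build_dictionary(TRIPLES))
```

In this toy case the two corroborating terms push the Jolie dimension well ahead of the competing sense of BRAD, which is the behavior the triples are meant to produce.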

In building a dictionary for an application, we have to start with the dimensions. Do we have the kinds of dimensions to cover the target content, and are there enough dimensions to make the distinctions in content required by the application? For a patent information system, an Angelina dimension may not have much relevance, and even something more appropriate like Electrical Machinery may have to be divided up into multiple dimensions to support a reasonable level of granularity in indexing.

Given the dimensions, we next have to define the terms to go along with them. The target content we want to process will have a certain vocabulary, and our dictionary terms should try to encompass most of it. This can be tricky in a statistical approach because we need reasonably large samples of training data to make a particular term become associated with a particular dimension.

Weights are determined in large part by training data, but the distribution of those numbers is important. To begin with, not all weights should be the same, and generally, we want to see them spread out over the entire dynamic range available to us. Weights that are quite small or quite large have to be supported by more data than those in the middle range. Weights have to be balanced between dimensions, and there should be enough of them so that most terms are related to more than one dimension.
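Two of these properties are easy to measure. The sketch below, which assumes weights lie between 0 and 1 and that each (term, dimension) pair appears only once, counts how the weights spread across ten equal bins and what share of terms have weights in more than one dimension. The sample triples and function names are hypothetical.

```python
from collections import Counter


def weight_histogram(triples, bins=10):
    """Count weights per equal-width bin over the assumed range [0, 1]."""
    counts = [0] * bins
    for _term, _dimension, weight in triples:
        # Clamp so that a weight of exactly 1.0 falls in the last bin.
        counts[min(int(weight * bins), bins - 1)] += 1
    return counts


def share_of_multi_dimension_terms(triples):
    """Fraction of distinct terms that carry weights in two or more dimensions."""
    dims_per_term = Counter(term for term, _dimension, _weight in triples)
    multi = sum(1 for n in dims_per_term.values() if n > 1)
    return multi / len(dims_per_term)


# Invented sample data, not taken from any real dictionary.
sample = [
    ("BANK", "Finance", 0.62),
    ("BANK", "Rivers", 0.18),
    ("LOAN", "Finance", 0.85),
    ("EROSION", "Rivers", 0.07),
]
histogram = weight_histogram(sample)
multi_share = share_of_multi_dimension_terms(sample)
```

A histogram piled into one or two bins, or a low share of multi-dimension terms, would suggest that the dictionary is not using its dynamic range or is ignoring ambiguity.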

In theory, we could build a dictionary with just one weight in just one dimension for each term, but that would be in denial about the inherent ambiguity of language. So, we typically want a dictionary to be as big as possible, based on an appropriate amount of training data. To build the best possible dictionary requires much inspiration and much perspiration.

Even in the world of print, one dictionary is often not enough. Just for English, for example, we can go to standard references like Webster’s Third New International, The American Heritage Dictionary of the English Language, or the Oxford English Dictionary, as well as more specialized lexicons. So how many semantic dictionaries do we really need?

That of course depends on the application. If we are in the situation where our target text data is extremely stable and requires only a general vocabulary, then we might get away with a single semantic dictionary based on a large sample of data processed quite carefully. On the Web, however, we have nothing of the sort, if you haven’t noticed lately.

A sophisticated dictionary that took weeks to build with hairy mathematical algorithms on a reasonable sample of training text may become obsolete overnight. That is not to say that sophisticated dictionaries are unhelpful; but in the merciless competition of the information marketplace, we probably need to be able to pop out a new semantic dictionary based on a gigabyte or more of text in just hours.

Given this kind of turnaround, why would anyone want to rely on a single semantic dictionary with its limited vocabulary and somewhat dated concepts? A new dictionary will of course involve a nontrivial upfront investment, but once a reliable source of tagged data is developed, actual dictionary building can be largely automated. That is the advantage of relying on statistical methods.

Casino Royale

14 Sep 2009

In any statistical information system, one can never achieve absolute certainty. Every result is a kind of bet with the possibility of losing. For Semantic Signatures, however, this is more like playing blackjack than like playing roulette. Whether we imagine ourselves as the house or some hotshot card counter, we try our utmost to bend the odds in our favor.

When a given term occurs in a document, we know that there is a certain probability that the document is about a given topic. For example, THRILLER may relate to Michael Jackson or to some recent summer popcorn epic. Similarly, MOONWALK may refer to Apollo XI or to a dance move. We would be rash to judge content just on the basis of a single term, but when multiple terms can corroborate each other, we do have a better bet.

The trick here is to be able to set up a semantic dictionary so that we can always expect to find a reasonable number of terms in a target document that allow us to make that better bet. This requires careful balancing: we need enough semantic dimensions to be able to distinguish the different important kinds of content and enough terms for each dimension to put it into play. It is much like developing a diverse portfolio of investments to weather any shift in economic climate.

Most people will probably pass on building their own semantic dictionaries. It takes a tremendous amount of work to collect and filter the requisite text data to ground our dictionary weights and to massage all those numbers to get the maximum amount of usable information. But we want to get on the right side of the odds.

I was asked that question quite a few times when I was at the KM World and SemTech conferences. The answer is simple: use a Semantic Signature as a query against an index of Semantic Signatures to find the most relevant content.

In order to illustrate what a Semantic Signature is, we provide the example of a document with 30 semantic dimensions labeled using the Open Directory Project taxonomy (www.dmoz.org). The example led people to believe that a Semantic Signature is nothing more than a multivariate categorizer for content navigation, categorization, or other forms of content bucketing. While Signatures can certainly be used for that, it is not how we use them at TextWise.

If you examine a Semantic Signature without reading the thirty labels, you’ll observe that it is a 30-dimension vector of concepts and weights. These concepts and weights are used by TextWise in a simple vector math calculation to determine the similarity between two signatures. Once a score is obtained, it is normalized to an integer value, and then a cutoff is chosen to determine whether each signature is relevant to the query.
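The exact calculation is not spelled out here, so the following Python sketch assumes cosine similarity as the "simple vector math," an integer scale of 0 to 100, and a cutoff of 50. All three are assumptions for illustration. Dimension 1461 is an ID from the ODP 2009 dictionary; the other dimension IDs and all weights are made up.

```python
import math


def cosine_similarity(sig_a, sig_b):
    """Cosine similarity of two sparse signatures given as {dimension_id: weight}."""
    dot = sum(w * sig_b[d] for d, w in sig_a.items() if d in sig_b)
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def integer_score(sig_a, sig_b, scale=100):
    """Map the similarity onto an integer scale (0..scale is an assumed range)."""
    return round(cosine_similarity(sig_a, sig_b) * scale)


def is_relevant(query_sig, doc_sig, cutoff=50):
    """Apply an assumed cutoff to the integer score."""
    return integer_score(query_sig, doc_sig) >= cutoff


query = {1461: 0.5, 17: 0.5}
same_topic = {1461: 0.5, 17: 0.5}
other_topic = {902: 0.7, 17: 0.1}
```

Because only shared dimensions contribute to the dot product, two signatures with little overlap score low no matter how heavy their individual weights are.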

For a real-world application, a user-controlled sliding scale from 1 to 10 can be used within the calculation to control which content items, represented by their Semantic Signatures, are displayed: a setting of 9 would instruct the application to show only the highly relevant content, while a setting of 2 would show a greater recall of content.
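One simple reading of this, offered only as a sketch, is that the slider acts as a threshold on relevance scores that have already been mapped onto the same 1 to 10 scale. The function name and the sample scores below are hypothetical.

```python
def visible_items(scored_items, slider):
    """Return items whose relevance score meets the slider setting, best first.

    scored_items: iterable of (item_id, score), scores already on a 1-10 scale.
    slider: user-selected threshold from 1 (more recall) to 10 (more precision).
    """
    if not 1 <= slider <= 10:
        raise ValueError("slider must be between 1 and 10")
    kept = [(item, score) for item, score in scored_items if score >= slider]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


results = [("doc-a", 9), ("doc-b", 4), ("doc-c", 2), ("doc-d", 10)]
strict = visible_items(results, 9)   # high setting: only the strongest matches
loose = visible_items(results, 2)    # low setting: nearly everything
```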

Why would I use Semantic Signatures to search for content? If you have read an article on the web or on your company’s intranet and attempted to find additional content related to what you’re looking at, you know it is a cumbersome process: identify keywords to use from the source, use them to search, review the results, and repeat the process until you have either found what you are looking for or given up. If you performed the same search against an index of Semantic Signatures, you would simply use the document as the query, eliminating the keyword/guesswork/review cycle inherent in today’s keyword systems.

From a developer’s perspective, the major benefits of Semantic Signatures are:

  • They are a very accurate and compact representation of a document – each Signature only consumes ~180 bytes of RAM.
  • Computing similarity of Signatures is a very lightweight vector calculation and, unlike keyword matching, there is no need for patterns, alias tables, synonym tables, spell correction, etc.
  • Scalability. 3 million Signatures will easily fit within a 1.5 GB 32-bit Java VM and result in full index searches taking ~70 milliseconds.
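The figures in this list can be checked with a little arithmetic. The short Python calculation below uses only the approximate numbers quoted above; the variable names are ours.

```python
SIGNATURES = 3_000_000
BYTES_PER_SIGNATURE = 180        # approximate figure quoted above
SEARCH_TIME_S = 0.070            # approximate full-index search time quoted above

raw_bytes = SIGNATURES * BYTES_PER_SIGNATURE
raw_mib = raw_bytes / 2**20
signatures_per_second = SIGNATURES / SEARCH_TIME_S

print(f"raw signature data: {raw_bytes:,} bytes (~{raw_mib:.0f} MiB)")
print(f"implied scan rate: ~{signatures_per_second / 1e6:.0f} million signatures per second")
```

At roughly 540 MB of raw signature data, about a third of a 1.5 GB heap is used, which is consistent with the claim that the index fits easily.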

If you want to learn more about Semantic Signature technology and use our free API to create Semantic Signatures for your content, visit www.semantichacker.com.

When is a semantic dictionary good? It really depends on the application, since more specialized content requires more specialized dictionary dimensions. Typically, validation of a given application will involve extensive benchmark testing, often entailing human judgments of the effectiveness of particular statistical characterizations of content.

TextWise does all of this in its product development process, but one would not want to go through an elaborate validation procedure to test the consequences of every small change. As it turns out, there are quick statistical ways to check whether a change is likely to be good or bad. This is no substitute for actual detailed validation at some point, but it allows one to experiment with new ideas at a fairly low cost.

A digital photography metaphor is apt here. Although one cannot use statistics to identify a prize-winning shot, it is certainly possible to detect major problems without human judgments. For example, areas of maximally white pixels indicate blown highlights, which typically detract from the quality of an image. Similarly, problems with white balance, dynamic range, focus, and other conditions are also readily detectable.

With any huge data object like a semantic dictionary, it is difficult to construct a benchmark that will cover every aspect of it thoroughly. Statistical testing provides an overall sanity check on quality. Otherwise, one would just be buying and selling pigs in a poke.

Our Roots

11 Aug 2009

Semantic Signatures℠ approaches the meaning of words from the perspective of their context. In the past couple of months, there has been extensive discussion here and elsewhere about how this differs from RDF, the basis for the Semantic Web. The simplest answer is that we are data-driven where RDF is model-driven.

This dichotomy is nothing new. In fact, if we look at semantics over a hundred years ago, we see the empirical idea of contextual semantics in the structural linguistics of Ferdinand de Saussure in contrast to the logical formulation of meaning in the predicate calculus of Bertrand Russell and Alfred North Whitehead. The former inferred meaning from the comparative analysis of text; the latter defined a mapping between text and a formal model of possible meanings.

The model-driven approach became less popular after the logician Kurt Gödel proved the incompleteness of all non-trivial logical systems in the 1930’s. Structural linguistics then became the favored approach until Noam Chomsky put the study of language back on a formal basis in the 1950’s, and the semantics of language also tilted to the formal in order to be more consistent with the study of syntax.

This is not to say that one approach is right and the other is wrong. The choice of approach to take should really depend on one’s circumstances. If one has available an appropriate logical model, which today might correspond to a taxonomy and a formal way to relate taxonomic entities, then the model-driven option is compelling. On the other hand, if an appropriate model is lacking or incomplete, but there is plenty of tagged text data to work from, then the data-driven option should be considered.

One can always in fact choose to work with the best of both worlds. We are not the sole providers of data-driven semantic technology, but our statistical characterization of meaning is probably how de Saussure himself might have done it if he had had access to the World Wide Web and 21st-century cloud computing.

The current SemanticHacker API offers more than one semantic dictionary. Each one is crafted from a particular collection of categorized documents at a particular time. The choice of a dictionary depends on one’s target application. Ideally, that dictionary will be trained on categorized documents similar to the documents to be analyzed for content.

Currently, the two main types of dictionaries available in English come from the ODP conceptual hierarchy and the USPTO class hierarchy. The dimensions defined for these types have practically no overlap. The differences in language and vocabulary in training data are also huge, and these have major consequences in the dimensional weights computed for terms in the two dictionaries.

In theory, one could employ a USPTO dictionary in a general web application, but one then risks being unable to pick up on popular language and culture. You won’t find “lol” or “Brangelina” in any patent. Similarly, an ODP dictionary may be a bit thin for handling medical journal articles; it would be much better here to have a semantic dictionary trained specifically on medical language and vocabulary.

The cost of building a specialized dictionary varies, mostly due to the complicated legal, technical, and logistical process of collecting the proper training data. Once the data is obtained, however, the actual dictionary building process is largely mechanical, although we do carry out extensive quality assessment to determine whether we are running under optimal dictionary building parameters.

With proper training data in hand, we can turn out a semantic dictionary of about 200,000 terms over 2,000 dimensions in only about a day. This turnaround is possible because of our reliance on statistical methods as opposed to more complicated mathematical modeling of other semantic approaches. It means that we could build new dictionaries fast enough to keep up with news cycles as short as one week, given the computational resources needed.

Suppose that we want to know the average body-mass index (BMI) of American teenagers. Since it is extremely difficult even to count every single teenager in the country, sampling is necessary. So we try to find N typical teenagers, measure and weigh them, and then compute their average BMI with the standard statistical formula

population mean ≈ ∑ᵢ BMIᵢ / (N + 1)

Now we all learned averages in junior high. Where did the “+ 1” come from? This is in fact a simple trick that every statistician has to learn on day 1. When we estimate a population mean from a small sample, there will inevitably be an error, typically on the high side. As a useful rule of thumb, we get a better estimate when dividing by (N + 1) instead of by N. Note that, as N gets large, N ≈ (N + 1); and so we do converge to the population mean in the limit.
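To see how much the “+ 1” matters at different sample sizes, here is a small Python sketch. Arithmetically, dividing by (N + 1) is the same as averaging in one extra observation of zero, so the estimate is pulled down by a fraction 1/(N + 1) of the ordinary mean. The BMI values are made up, and the function names are ours.

```python
def plain_mean(values):
    """Ordinary sample mean: sum divided by N."""
    return sum(values) / len(values)


def plus_one_mean(values):
    """Estimate described above: sum divided by (N + 1)."""
    return sum(values) / (len(values) + 1)


def relative_gap(n):
    """Relative difference between the two estimates for a sample of size n.

    sum/N - sum/(N+1) = (sum/N) / (N+1), so the gap is 1/(N+1) of the plain mean.
    """
    return 1 / (n + 1)


small_sample = [22.0, 24.0, 20.0]   # made-up BMI values
```

With three observations the two estimates differ by a quarter of the plain mean; with 999 observations the difference is one part in a thousand, which is the convergence noted above.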

A semantic dictionary is nothing more than millions of averages of term frequencies in documents, and most of them are based on only a fairly small number of occurrences of a given term. To get good results here, we have to do more than just junior high math.

Our situation is actually much more complicated than that of estimating a simple population mean, but we have to do a similar kind of data smoothing. This is all to provide you with the highest quality numbers for your web app.

Overview

A group of us at TextWise were working on our semantic similarity technology, which allowed us to match arbitrary text documents to similar documents. One of our initial uses of the technology was to contextually match ads to Web pages. This worked very well, but we decided to focus on a Web 3.0 API (Semantic Hacker) and wanted to come up with an interesting demo of the technology.
The idea for foof (foofme.com) came from suggestions in various forums and blogs about possible improvements to Wladimir Palant’s Adblock Plus. These suggestions focused on allowing pictures or other images to replace the ads, instead of just crunching (or blanking) the space.
Ad blockers examine the html of a web page and look for patterns of code that are indicative of ad displays. They then eliminate the code, while trying to not disrupt the look and feel of the base page.
During the debugging of our original advertising system, we had implemented a tool that replaced ads on test Web pages with our ads, allowing us to debug in situ. Being users of Adblock Plus, we were reading the blogs and realized that we could use our technology to offer more than just replacing ads with images. Thus was born the idea of using TextWise’s semantic similarity engine and various content sources (news, blogs, Wikipedia, videos, and personal images) to match interesting content to web pages and fill the ad holes.
In developing the foof ad blocker, we needed to solve several problems:

  • Finding and eliminating the ads on the web page
  • Determining the size of the hole that remained, so that we could fit content into the hole
  • Selecting which content indexes to be used to fill each hole
  • Determining what the web page is about
  • Matching the replacement content to the web page
  • Providing an experience that is not overwhelming

Finding the Ads

This was the easiest part of the design. We started with Wladimir Palant’s open source Adblock Plus code as a base. This is the best Firefox ad blocker, and using it as our base meant that foof would do an equally good job.

Determining the Hole Size

Once the ads are located on the web page, we examine both the ad and the page structure and determine the possible size of the hole left after elimination. As each type of content only fits well into holes of certain sizes and geometries, we characterize each hole and decide if it is to be left blank or can contain content.

If the user, during set-up, chose to only block ads, then the process is complete and blank space is substituted for all ads.

Determining the Type of Content for a Hole

Once we determine that a specific hole can contain content, we characterize the hole to see what types of content it can support (news, blogs, Wikipedia, videos, personal images). A typical hole might be capable of containing more than one type of content. At this point we examine the user’s configuration settings to see which types of content the user enabled and in which priority order the user would like us to choose the types of content. The order is important, because there may not be a relevant content match available for the web page for every content type.

Determining What the Web Page is About

Determining what a web page is about is a multi-step process. The steps include:

  1. Determining the address of the web page
  2. Fetching the web page
  3. Filtering the web page to remove HTML, JavaScript, and boilerplate text
  4. Generating a semantic signature™ for the page (a signature is the digital DNA of the page’s content – see http://www.textwise.com and http://www.semantichacker.com for more information)
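Step 3 can be approximated with nothing but the Python standard library. The sketch below strips markup and the bodies of script and style elements; it makes no attempt at boilerplate removal or at signature generation, and the class and function names are hypothetical rather than part of foof.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    """Return the page text with markup and script/style bodies removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><h1>River banks</h1><p>Erosion and flow.</p></body></html>"
)
```

The text returned by `visible_text(page)` is the kind of input one would then hand to a signature generator.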

Matching Relevant Content to the Web Page

Given the semantic signature™ of the web page, it is relatively easy to take that signature and match it to the content signatures in the signature index of the content type chosen to fill the hole.

A signature is simply the best 30 weighted dimensions of a 1700+ dimension semantic space. The best matches are then biased by a keyword match that is done using a proprietary term selection algorithm. This is done to improve the precision of the results. The combined signature and keyword matches are ranked, and if there are any acceptable matches, the results are returned.
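The first sentence above, keeping the best 30 of 1700+ dimensions, can be sketched in a few lines of Python. We assume here that "best" means highest weight; the keyword-bias step is proprietary and is not attempted. The toy vector is invented for illustration.

```python
import heapq


def top_dimensions(weights, k=30):
    """Keep the k highest-weighted dimensions of a full weight vector.

    weights: {dimension_id: weight} over the whole semantic space.
    Returns a list of (dimension_id, weight) pairs, heaviest first.
    """
    return heapq.nlargest(k, weights.items(), key=lambda pair: pair[1])


# Toy vector over 1,700 dimensions: dimension i gets weight i / 1700.
full_vector = {i: i / 1700 for i in range(1, 1701)}
signature = top_dimensions(full_vector)
```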

If there were no acceptable matches, then the match is retried with the next content type’s index. If there are no matches for a given hole, then a blank is used to fill the hole.
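The retry logic reads naturally as a loop over content types in the user’s priority order. In the sketch below, the `search` callable, the `min_score` threshold, and the canned results are all hypothetical stand-ins for the real signature indexes.

```python
def fill_hole(page_signature, content_types, search, min_score=0.5):
    """Try each content type in priority order; return the first acceptable matches.

    content_types: type names in the user's priority order.
    search: hypothetical callable (content_type, signature) -> list of (item, score).
    min_score: assumed acceptability threshold.
    Returns (content_type, matches), or (None, []) to signal a blank hole.
    """
    for content_type in content_types:
        matches = [m for m in search(content_type, page_signature) if m[1] >= min_score]
        if matches:
            ranked = sorted(matches, key=lambda m: m[1], reverse=True)
            return content_type, ranked
    return None, []


def fake_search(content_type, signature):
    """Stand-in for a real index lookup, with canned results for the demo."""
    canned = {
        "news": [("news-1", 0.2)],
        "blogs": [("blog-7", 0.6), ("blog-3", 0.9)],
    }
    return canned.get(content_type, [])


chosen = fill_hole({1461: 0.5}, ["news", "blogs", "videos"], fake_search)
blank = fill_hole({1461: 0.5}, ["videos"], fake_search)
```

Here the news index has no acceptable match, so the hole falls through to blogs; with only videos enabled, nothing matches and the hole stays blank.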

Maintaining a Quality Experience

During alpha testing, we determined that in order to have a pleasing experience we needed to:

  • Only fill one hole on a web page with a given content type (for example:  news would appear only once on a page)
  • Only fill two holes on a page with content, leaving the others blank
  • Provide a mechanism to browse content within the hole. This mechanism would allow the user to:
    • View additional articles, images, or videos related to the page, beyond the initially visible item (this is done by clicking on the <- and -> arrows in the content header)
    • View other types of content related to the page (this is done via tabs in the content header)
  • Provide a mechanism to verify the presence of our servers on the web and default to pure ad block mode, if the servers are not available
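The first two rules in this list amount to a small allocation problem. The Python sketch below applies them greedily, hole by hole in page order, which is our assumption and not necessarily how foof decides; the function name and sample data are hypothetical.

```python
def plan_page(holes, max_filled=2):
    """Choose content types for holes under the two limits listed above.

    holes: list of lists; each inner list holds the content types that have an
    acceptable match for that hole, in the user's priority order.
    Returns one entry per hole: a content type, or None for a blank.
    """
    used_types = set()
    plan = []
    for candidates in holes:
        choice = None
        # Each filled hole uses a distinct type, so len(used_types) is also
        # the number of holes filled so far.
        if len(used_types) < max_filled:
            # First candidate type not already shown elsewhere on the page.
            choice = next((t for t in candidates if t not in used_types), None)
        if choice is not None:
            used_types.add(choice)
        plan.append(choice)
    return plan


page_holes = [["news", "blogs"], ["news"], ["news", "videos"], ["blogs"]]
plan = plan_page(page_holes)
```

In this example the first hole takes news, the second stays blank because news is already shown, the third takes videos, and the fourth stays blank because two holes are already filled.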

Additionally, though we did not implement contextual image search in foof (it is now available in the Semantic Hacker API), we decided to add an option for users to view their own photos in place of ads on the web pages. To implement this, we chose Flickr and provided a way to point to a Flickr account as an option.

And it Works!

The development of foof was an interesting experience that gave the team a chance to have some fun and at the same time solve interesting problems.

Currently there are over 27,000 users of foof (July, 2009). The download for Firefox is available in the Mozilla Add-On sandbox (experimental Add-On) and at http://www.foofme.com .

According to WordNet, the word BANK has multiple senses, and so any occurrence of it in a text document is ambiguous. For example, we can have a river BANK, a financial BANK, a fog BANK, or an aeronautical BANK. The intended sense in a particular document has to be determined by looking at the context of occurrence. So, to determine the actual meaning of BANK in a document, we have to ask in effect whether the document is talking about streams of water, financial meltdowns, marine navigation, or aircraft in flight.

Now the number of different possible contexts is probably huge. One cannot hope to recognize them all; but for disambiguation of words, we need only fairly general contexts to distinguish the word senses of prime interest to us. Furthermore, given a large sample of our target text, we can employ statistical methods to identify the most important of such contexts.

This is essentially what SemanticHacker is all about. The dimensions of one of our semantic dictionaries define thousands of contextual reference points for the interpretation of terms. For example, if the words stream, water, flow, erosion, and grass are in a document, then with the ODP 2009 dictionary, we find that the top match dimension is 1461 (Top/Science/Environment/Water_Resources) with a weight of 0.5138. In this context, the word BANK would probably mean “river bank.”

Actually, we don’t need to make this explicit association. With a search engine user interface, one just needs a way of describing the context of ambiguous search terms, perhaps by listing contextual words. Then all a semantic search engine has to do is find a document containing the search term and having the same described context in its semantic signature. This is of course a part of our API for search.