Archive for the ‘General’ Category

Romeo Montague once noted that the semantic function of a name contrasts quite saliently with that of an ordinary word. Shakespeare didn’t quite put it that way, but it is a fact of language. A classical semanticist would frame it as a distinction between denotation (i.e. identification) and connotation (i.e. description).

As it turns out, this difference can be seen even at the statistical level. Ordinary words, with a little massaging, have a frequency distribution best described as binomial; names typically do not. That has consequences for how we mine text data to create a semantic dictionary.

This is all a fine point, but the quality of a product is determined by many such fine points. None of our API competitors on the web bother with denotation and connotation, but it can really matter when you are processing data with many product designations.

In Chapter 8 of Lewis Carroll’s “Through the Looking-Glass,” our intrepid logical adventurer is talking to the White Knight, who wants to sing to her. He says, “The name of the song is called ‘HADDOCKS’ EYES.’”

It turns out of course that the name of the song is really “THE AGED AGED MAN,” though the song is actually called “WAYS AND MEANS.” The confusion here about naming is quite understandable to anyone who has ever ordered TenderSweet™ clams at HoJo’s and discovered that they are neither tender nor sweet.

All of this would be hilarious except that we have to build semantic dictionaries that must deal extensively with the meaning of names in text. This problem will take a while to talk about adequately; and so please tune in tomorrow.

Suppose that we want to know the average body-mass index (BMI) of American teenagers. Since it is extremely difficult even to count every single teenager in the country, sampling is necessary. So we try to find N typical teenagers, measure and weigh them, and then compute their average BMI with the standard statistical formula

population mean ≈ ∑ᵢ BMIᵢ / (N + 1)

Now we all learned averages in junior high. Where did the “+ 1” come from? This is in fact a simple trick that every statistician has to learn on day 1. When we estimate a population mean from a small sample, there will inevitably be an error, typically on the high side. As a useful rule of thumb, we get a better estimate when dividing by (N + 1) instead of by N. Note that, as N gets large, N ≈ (N + 1); and so we do converge to the population mean in the limit.
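To make the arithmetic concrete, here is a minimal Python sketch that compares the ordinary average with the (N + 1) rule of thumb described above. The BMI values are made up for illustration, and the function names are our own.

```python
def plain_mean(values):
    """Ordinary sample mean: the sum divided by N."""
    return sum(values) / len(values)


def smoothed_mean(values):
    """Rule-of-thumb estimate from the text: the sum divided by (N + 1)."""
    return sum(values) / (len(values) + 1)


# Made-up BMI measurements, for illustration only.
small_sample = [22.0, 24.0, 26.0]
large_sample = small_sample * 1000  # same values repeated, so N = 3000

# With N = 3 the two estimates differ noticeably.
print(plain_mean(small_sample), smoothed_mean(small_sample))

# With N = 3000 they are nearly identical, since N is close to N + 1.
print(plain_mean(large_sample), smoothed_mean(large_sample))
```

The gap between the two estimates shrinks as N grows, which is the convergence the text refers to.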

A semantic dictionary is nothing more than millions of averages of term frequencies in documents, and most of them are based on only a fairly small number of occurrences of a given term. To get good results here, we have to do more than just junior high math.

Our situation is actually much more complicated than that of estimating a simple population mean, but we have to do a similar kind of data smoothing. This is all to provide you with the highest quality numbers for your web app.

Recently, there were news reports of scientists identifying an Oprah Winfrey neuron in the brain of an epileptic person who had been wired to help control seizures. This one particular neuron in the hippocampus fires whenever the person hears Oprah’s name or sees a picture of her. It may help to explain how memory works. It also can explain how semantic dictionaries work. In the case of Oprah, stimuli from many different senses travel various paths to converge on her neuron. In a semantic dictionary with Oprah as a concept, various terms associated with her in effect will vote for the concept with differing degrees of confidence when they occur in some document. When there is convergence because of mutual corroboration of terms, then one can infer that the document is about the queen of daytime TV.
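The voting idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the terms, the concepts, the confidence weights, and the choice to simply add the votes together. It is not how any real semantic dictionary is stored or scored.

```python
from collections import defaultdict

# Hypothetical term -> {concept: confidence} votes; not real dictionary data.
VOTES = {
    "oprah": {"oprah_winfrey": 0.9},
    "talk_show": {"oprah_winfrey": 0.4, "late_night_tv": 0.5},
    "book_club": {"oprah_winfrey": 0.3, "publishing": 0.6},
    "chicago": {"oprah_winfrey": 0.2, "illinois": 0.7},
}


def tally(terms):
    """Add up the votes each concept receives from the terms in a document."""
    scores = defaultdict(float)
    for term in terms:
        for concept, confidence in VOTES.get(term, {}).items():
            scores[concept] += confidence
    return dict(scores)


def top_concept(terms):
    """Return the concept with the most accumulated confidence, or None."""
    scores = tally(terms)
    return max(scores, key=scores.get) if scores else None
```

In this toy data, no single one of talk_show, book_club, or chicago points most strongly at Oprah, yet `top_concept(["talk_show", "book_club", "chicago"])` still picks her because the weak votes corroborate each other.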

Overview

A group of us at TextWise were working on our semantic similarity technology, which allowed us to match arbitrary text documents to similar documents. One of our initial uses of the technology was to contextually match ads to Web pages. This worked very well, but we decided to focus on a Web 3.0 API and wanted to come up with an interesting demo of the technology.
The idea for foof came from suggestions in various forums and blogs about possible improvements to Wladimir Palant’s Adblock Plus. These suggestions focused on allowing pictures or other images to replace the ads, instead of just crunching (or blanking) the space.
Ad blockers examine the HTML of a web page and look for patterns of code that are indicative of ad displays. They then eliminate the code, while trying not to disrupt the look and feel of the base page.
During the debugging of our original advertising system, we had implemented a tool that replaced ads on test Web pages with our ads, to allow us to debug in situ. Being users of Adblock Plus, we were reading the blogs and realized that we could use our technology to offer more than just replacing ads with images. Thus was born the idea of using TextWise’s semantic similarity engine and various content sources (news, blogs, Wikipedia, videos and personal images) to match interesting content to web pages and fill the ad holes.
In developing the foof ad blocker, we needed to solve several problems:

  • Finding and eliminating the ads on the web page
  • Determining the size of the hole that remained, so that we could fit content into the hole
  • Selecting which content indexes to be used to fill each hole
  • Determining what the web page is about
  • Matching the replacement content to the web page
  • Providing an experience that is not overwhelming

Finding the Ads

This was the easiest part of the design. We started with Wladimir Palant’s open source Adblock Plus code as a base. This is the best Firefox ad blocker, and using it as our base meant that foof would do an equally good job.

Determining the Hole Size

Once the ads are located on the web page, we examine both the ad and the page structure and determine the possible size of the hole left after elimination. As each type of content only fits well into holes of certain sizes and geometries, we characterize each hole and decide if it is to be left blank or can contain content.

If the user, during set-up, chose to only block ads, then the process is complete and blank space is substituted for all ads.

Determining the Type of Content for a Hole

Once we determine that a specific hole can contain content, we characterize the hole to see what types of content it can support (news, blogs, Wikipedia, videos, personal images). A typical hole might be capable of containing more than one type of content. At this point we examine the user’s configuration settings to see which types of content the user enabled and in which priority order the user would like us to choose them. The order is important, because there may not be a relevant content match available for the web page for every content type.

Determining What the Web Page is About

Determining what a web page is about is a multi-step process. The steps include:

  1. Determining the address of the web page
  2. Fetching the web page
  3. Filtering the web page to remove HTML, JavaScript, and boilerplate text
  4. Generating a semantic signature™ for the page (a signature is the digital DNA of the page’s content – see http://www.textwise.com for more information)
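As an illustration of the filtering step only, here is a minimal Python sketch that strips tags, scripts, and style blocks from a page using the standard library’s `html.parser`. This is not the foof code. Boilerplate removal and signature generation are not shown, and the class and function names are our own.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style and it is not blank.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the page text with markup, JavaScript, and CSS removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The resulting plain text is what a later step would turn into a signature.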

Matching Relevant Content to the Web Page

Given the semantic signature™ of the web page, it is relatively easy to take that signature and match it to the content signatures in the signature index of the content type chosen to fill the hole.

A signature is simply the best 30 weighted dimensions of a 1700+ dimension semantic space. The best matches are then biased by a keyword match that is done using a proprietary term selection algorithm. This is done to improve the precision of the results. The combined signature and keyword matches are ranked, and if there are any acceptable matches, the results are returned.
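A rough Python sketch of the signature idea follows. The only detail taken from the text is that a signature keeps the best 30 weighted dimensions. The use of cosine similarity is our assumption, since the actual similarity measure is not described here, and the proprietary keyword bias is left out entirely.

```python
import math

SIGNATURE_SIZE = 30  # from the text: the best 30 weighted dimensions


def make_signature(weights, size=SIGNATURE_SIZE):
    """Keep the `size` highest-weighted entries of a {dimension_id: weight} vector."""
    top = sorted(weights.items(), key=lambda item: item[1], reverse=True)[:size]
    return dict(top)


def cosine(sig_a, sig_b):
    """Cosine similarity between two sparse signatures (an assumed measure)."""
    shared = sig_a.keys() & sig_b.keys()
    dot = sum(sig_a[d] * sig_b[d] for d in shared)
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(page_signature, index):
    """Rank the items of `index` ({content_id: signature}), best match first."""
    scored = [(cosine(page_signature, sig), cid) for cid, sig in index.items()]
    return sorted(scored, reverse=True)
```

Because each signature holds at most 30 non-zero entries, comparing two of them only touches the dimensions they share, which keeps matching against a large index cheap.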

If there are no acceptable matches, then the match is retried with the next content type’s index. If there are no matches in any index for a given hole, then a blank is used to fill the hole.
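The retry logic amounts to walking the user’s priority list until something matches. A minimal Python sketch, in which `search` is a hypothetical callable standing in for a lookup against one content type’s signature index:

```python
def fill_hole(page_signature, priority, search):
    """Try each enabled content type in the user's priority order.

    `search(content_type, page_signature)` is a hypothetical callable that
    returns a list of acceptable matches, possibly empty.
    """
    for content_type in priority:
        matches = search(content_type, page_signature)
        if matches:
            return content_type, matches
    # No acceptable match in any index: the hole is left blank.
    return None, []
```

A caller would treat a result of `(None, [])` as the signal to substitute blank space for the ad.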

Maintaining a Quality Experience

During alpha testing, we determined that in order to have a pleasing experience we needed to:

  • Only fill one hole on a web page with a given content type (for example, news would appear only once on a page)
  • Only fill two holes on a page with content, leaving the others blank
  • Provide a mechanism to browse content within the hole. This mechanism would allow the user to:
    • View additional articles, images, or videos related to the page, beyond the initially visible item (this is done by clicking on the <- and -> arrows in the content header)
    • View other types of content related to the page (this is done via tabs in the content header)
  • Provide a mechanism to verify the presence of our servers on the web and default to pure ad block mode, if the servers are not available

Additionally, though we did not implement contextual image search in foof (it is now available in the Semantic Hacker API), we decided to add an option for users to view their own photos in place of ads on web pages. To implement this, we chose Flickr and provided a way to point to a Flickr account as an option.

And it Works!

The development of foof was an interesting experience that gave the team a chance to have some fun and at the same time solve interesting problems.

Currently there are over 27,000 users of foof (July, 2009). The download for Firefox is available in the Mozilla Add-On sandbox (experimental Add-On) and at http://www.foofme.com.

Ingredients

27 Jul 2009

This posting will probably make the eyes of most people glaze over, but current and prospective users of our SemanticHacker API should really be informed consumers. So think of this as being like one of those federally mandated labels on your bottle of Red Bull.

The ingredients of a semantic dictionary are a set of hundreds of thousands of terms, a set of thousands of dimensions, and various numbers expressing the strength of association between a given term and a given dimension. Most of these associations will have zero strength, indicating that we have no information about them; but there will still be millions of non-zero numbers to provide a rigorous undergirding for statistical semantics.

We build a semantic dictionary by obtaining large training samples of documents relevant to each of its dimensions. The strength of association is then estimated as being proportional to the relative frequency of occurrence of a term in the training documents for a dimension versus in those for all other dimensions. The process is actually more complicated than this, but the differences are just refinements of the overall scheme as described.
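Here is one possible reading of that scheme as a Python sketch. The exact formula is not given in this post, so the ratio below (relative frequency inside the dimension, divided by the sum of the inside and outside relative frequencies) is our assumption, and the training data is a toy example.

```python
from collections import Counter


def association(term, dimension, training):
    """Estimate how strongly `term` is tied to `dimension`.

    `training` maps each dimension name to the list of tokens found in its
    training documents. The returned value lies between 0.0 and 1.0.
    """
    inside = training[dimension]
    outside = [
        token
        for other, tokens in training.items()
        if other != dimension
        for token in tokens
    ]
    # Relative frequency of the term inside and outside the dimension.
    f_in = Counter(inside)[term] / len(inside) if inside else 0.0
    f_out = Counter(outside)[term] / len(outside) if outside else 0.0
    total = f_in + f_out
    return f_in / total if total else 0.0


# Toy training data, for illustration only.
TRAINING = {
    "baking": ["rugelach", "flour", "rugelach", "purple"],
    "soccer": ["midfielder", "goal", "purple", "goal"],
}
```

With this toy data, rugelach is tied entirely to baking, while purple, which is equally common in both dimensions, gets a middling 0.5. A term never seen in training gets zero strength, matching the "no information" case described above.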

Now we all understand what terms are (e.g. britney_spears, midfielder, rugelach, purple), but where do dimensions come from? The answer is that they are somewhat arbitrary. A dimension can be defined around any kind of category for which someone has provided requisite training documents. In many cases, we can find prior sets of categories to work from (ODP, USPTO), but we also can ourselves try to infer categories from some available pool of potential training data.

However we proceed here, it is necessary that the resulting dimensions be pertinent to an application of interest, be independent of each other, be supported by adequate training data, and be associated with enough terms to support semantic analysis of target text. This all can be tricky to achieve, but if it were easy, everyone would be doing it.

Estimating a probability basically involves computing an average. Since most middle-schoolers know how to do this, what is so difficult about building a semantic dictionary consisting of conditional probabilities?

The problem turns out to be with sample sizes. To get reliable dictionary weights for a given term, we need many examples of its occurrence in text, but most terms are rather infrequent in any given corpus. This fact of life is articulated in Zipf’s Law, which states that occurrences of the n-th most common term in a corpus will be approximately proportional to 1/n.
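A small Python sketch shows what the law predicts. The token list is made up so that it follows Zipf’s Law exactly; real corpora only approximate it.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, term, count) tuples, from most to least frequent."""
    ordered = Counter(tokens).most_common()
    return [(rank, term, count) for rank, (term, count) in enumerate(ordered, start=1)]


def zipf_prediction(top_count, rank):
    """Count that Zipf's Law predicts at `rank`, given the count at rank 1."""
    return top_count / rank


# Made-up corpus whose counts are 12, 6, 4, 3, i.e. proportional to 1/rank.
tokens = ["the"] * 12 + ["of"] * 6 + ["bank"] * 4 + ["fresco"] * 3

for rank, term, count in rank_frequency(tokens):
    print(rank, term, count, zipf_prediction(12, rank))
```

Even in this tiny example the fourth-ranked term has only a quarter of the occurrences of the first, which is why most terms end up with thin statistical support.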

Such a relationship is called a “power law,” which can also be seen in many other natural phenomena. For instance, sociologists often note that only ten percent of the people in any organization do ninety percent of all the work.

Unfortunately, the most frequent terms in any corpus are typically the least interesting for information applications. So the challenge is to make reliable probability estimates for tens of thousands of terms when the statistical support is less than ideal.

To build a good dictionary, we need to do much more than simply add up some term frequencies and then divide.

According to WordNet, the word BANK has multiple senses, and so any occurrence of it in a text document is ambiguous. For example, we can have a river BANK, a financial BANK, a fog BANK, or an aeronautical BANK. The intended sense in a particular document has to be determined by looking at the context of occurrence. So, to determine the actual meaning of BANK in a document, we have to ask in effect whether the document is talking about streams of water, financial meltdowns, marine navigation, or aircraft in flight.

Now the number of different possible contexts is probably huge. One cannot hope to recognize them all; but for disambiguation of words, we need only fairly general contexts to distinguish the word senses of prime interest to us. Furthermore, given a large sample of our target text, we can employ statistical methods to identify the most important of such contexts.

This is essentially what SemanticHacker is all about. The dimensions of one of our semantic dictionaries define thousands of contextual reference points for the interpretation of terms. For example, if the words stream, water, flow, erosion, and grass are in a document, then with the ODP 2009 dictionary, we find that the top match dimension is 1461 (Top/Science/Environment/Water_Resources) with a weight of 0.5138. In this context, the word BANK would probably mean “river bank.”

Actually, we don’t need to make this explicit association. With a search engine user interface, one just needs a way of describing the context of ambiguous search terms, perhaps by listing contextual words. Then all a semantic search engine has to do is find a document containing the search term and having the same described context in its semantic signature. This is of course a part of our API for search.
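The following Python sketch illustrates the idea with a toy dictionary. The terms, dimensions, and weights are invented for illustration, and reducing a signature to a single top dimension is a simplification of ours. This is not the SemanticHacker API.

```python
# Hypothetical toy dictionary: term -> {dimension: weight}. Invented values.
DICTIONARY = {
    "stream": {"water_resources": 0.6},
    "erosion": {"water_resources": 0.5},
    "loan": {"finance": 0.7},
    "interest": {"finance": 0.5},
    "bank": {"water_resources": 0.2, "finance": 0.3},
}


def top_dimension(words):
    """Return the dimension with the highest total weight, or None."""
    scores = {}
    for word in words:
        for dimension, weight in DICTIONARY.get(word, {}).items():
            scores[dimension] = scores.get(dimension, 0.0) + weight
    return max(scores, key=scores.get) if scores else None


def search(term, context_words, documents):
    """Return ids of documents that contain `term` and share the described context.

    `documents` maps a document id to its list of words.
    """
    wanted = top_dimension(context_words)
    if wanted is None:
        return []  # the context words told us nothing
    return [
        doc_id
        for doc_id, words in documents.items()
        if term in words and top_dimension(words) == wanted
    ]
```

Both toy documents in a query such as `search("bank", ["stream", "erosion"], documents)` may contain the word bank, but only the one whose overall context matches the listed words is returned. The sense of BANK is never named explicitly.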

SIGIR 09 Day One

http://sigir2009.org/Program

Several parallel tracks at SIGIR – here are some highlights from sessions I attended today.

Susan Dumais gave the opening keynote @ SIGIR 09, “An Interdisciplinary Perspective on Information Retrieval.” Dumais was the 2009 recipient of the Gerard Salton Award for her contribution to the Information Retrieval field. Her work at Bell Labs/Bellcore exploring vocabulary mismatch (aka verbal disagreement) led to her LSI work. She has worked at Microsoft Research since 1997 and currently leads the Context, Learning, and User Experiences in Search team. Her talk covered her background (cognitive psychology/mathematics) and how the problems of information retrieval, together with the huge social and technical leaps in the field over the last fifteen years, have made it a very exciting time to be working in this area.

However, as much as things have changed, much has stayed the same. We haven’t escaped the search box or the results list. Observed searching habits: we repeat our searches with high frequency (“re-finding”) on the desktop and the web. Date is the most common sort selected when changing from the default option. She called for more personalized search research – we need models to support personalized search: when to use it, when not to (it works only some of the time). Evaluation continues to be challenging. Behavioral data is extremely noisy – especially click data.

For future research: IR solutions must acknowledge the dynamic information environment, and experiments and data must reflect this environment. We need data that mirrors the dynamic information environment; she called for a ‘Living Laboratory’ made up of logs of search engines, searching resources such as Wikipedia, etc. A group needs to mobilize to put this resource together; she plugged the Lemur Query Toolbar. IR research needs an interdisciplinary team to understand users, and thinking outside the box, to meet the challenges ahead in IR.

Novel Search Features Session Notes:

“Web Searching for Daily Living” (NTT Comm): Collecting information about everyday actions from cameras and incorporating the information into web search queries, using clustering techniques to return useful information. Forward-looking research, as few of us have web browsing tools on our appliances or in our bathrooms, but it is paving the way. This is what they mean by the phrase search ubiquity!

“Global Ranking by Exploiting User Clicks” (Yahoo!): Collecting information about user click sequences and then, through supervised learning, providing prediction. Must look across results, not within single documents after click. Position influences clicks – the first result is often clicked on. Aggregation of data is key – click data is very noisy.

“Good Abandonment in Mobile and PC Internet Search” (Google): Investigation of when search abandonment is good (the answer is right in the results list, with no need to open a page). Much more likely to occur on a mobile device than on a PC; varies by locale (looked at US, Japan, China) and by category of query. Research to estimate rates and get a first study designed: classification by modality, locale, category.

Web 2.0 Session Notes:

“A Statistical Comparison of Tag and Query Logs” (Strathclyde & Lugano Universities): Very cool zooming slideware used in presentation. Found more vocabulary shared between queries and tags than any combination of queries, tags, and content of search results. Data set used: AOL query logs, Delicious tags, ODP categories.

“Enhancing Cluster Labeling Using Wikipedia” (IBM Research): Found very promising results using Wikipedia metadata to label clusters. Walked through approach, evaluation. Findings suggest continued development of this work would provide better quality labeling of clusters.

Question Answering Session Notes:

“A Classification-based Approach to QA in Discussion Boards” (Lehigh University): How to ask questions on the web – options: search engines, QA portals, discussion boards. This research focused on detecting questions and answers on discussion boards. Discussed techniques found to work best for questions and for answers.

“Ranking Community Answers by Modeling Question-Answer Relationships via Analogical Reasoning” (Microsoft Research & Huazhong Science and Tech University): Presenter said search engines must deliver answers sooner rather than later. Mining data from community forums (Yahoo! Answers Archive) to find clues for linkages among questions and answers. Model the previous knowledge. Each question had 16 answers on average in the data set. Very promising results.


Search engines work remarkably well when one is searching for a popular topic. Just try the query LOVATO. If you are of the demographic normally reading this blog, then you probably don’t know yet who she is, but Google or Bing will find her. Although she is still obscure enough so that Lovato Electric, Inc., beats her out for top spot on Bing, there is no problem in getting the goods on this latest Disney ‘tween idol.

Here is a different, more frustrating search story, however. I was over at the National Gallery in Washington on Sunday and saw a remarkable series of Renaissance Italian frescos. At home afterwards, I queried on ITALIAN VILLA FRESCO NATIONAL GALLERY WASHINGTON, but found nothing recognizable on Google with either web or image search. About an hour later, I gave up after trying numerous variations of queries.

Then I went to www.nga.gov and navigated down to its 16th Century Italian art page. It offered a virtual tour of a series of frescos by Bernardino Luini on the legend of Procris and Cephalus. Bingo! According to the web site, “These nine paintings are the only examples of an Italian Renaissance fresco series in America.” Strangely enough, I had actually tried the term LUINI in one of my unsuccessful queries.

So we obviously have a failure to communicate here; and this is really a problem that semantic search should be addressing. The relevant page was out there and my queries should have been specific enough, but somehow a beautiful young bride being run through and killed by a magic javelin just wasn’t as sexy as Britney 4.0.