Archive for July, 2009

Overview

A group of us at TextWise were working on our semantic similarity technology, which lets us match arbitrary text documents to similar documents. One of our initial uses of the technology was to contextually match ads to Web pages. This worked very well, but we decided to focus on a Web 3.0 API (Semantic Hacker) and wanted to come up with an interesting demo of the technology.
The idea for foof (foofme.com) came from suggestions in various forums and blogs about possible improvements to Wladimir Palant’s Adblock Plus. These suggestions focused on allowing pictures or other images to replace the ads, instead of just crunching (or blanking) the space.
Ad blockers examine the HTML of a web page and look for patterns of code that indicate ad displays. They then eliminate the code, while trying not to disrupt the look and feel of the base page.
During the debugging of our original advertising system, we had implemented a tool that replaced ads on test Web pages with our ads, allowing us to debug in situ. Being users of Adblock Plus, we were reading the blogs and realized that we could use our technology to offer more than just replacing ads with images. Thus was born the idea of using TextWise’s semantic similarity engine and various content sources (news, blogs, Wikipedia, videos, and personal images) to match interesting content to web pages and fill the ad holes.
In developing the foof ad blocker, we needed to solve several problems:

  • Finding and eliminating the ads on the web page
  • Determining the size of the hole that remained, so that we could fit content into the hole
  • Selecting which content indexes to be used to fill each hole
  • Determining what the web page is about
  • Matching the replacement content to the web page
  • Providing an experience that is not overwhelming

Finding the Ads

This was the easiest part of the design. We started with Wladimir Palant’s open source Adblock Plus code as a base. This is the best Firefox ad blocker, and using it as our base meant that foof would do an equally good job.

Determining the Hole Size

Once the ads are located on the web page, we examine both the ad and the page structure and determine the possible size of the hole left after elimination. As each type of content only fits well into holes of certain sizes and geometries, we characterize each hole and decide if it is to be left blank or can contain content.

If the user, during set-up, chose to only block ads, then the process is complete and blank space is substituted for all ads.

Determining the Type of Content for a Hole

Once we determine that a specific hole can contain content, we characterize the hole to see what types of content it can support (news, blogs, Wikipedia, videos, personal images). A typical hole might be capable of containing more than one type of content. At this point we examine the user’s configuration settings to see which types of content the user enabled and in which priority order the user would like us to choose them. The order is important, because there may not be a relevant content match available for the web page for every content type.

Determining What the Web Page is About

Determining what a web page is about is a multi-step process. The steps include:

  1. Determining the address of the web page
  2. Fetching the web page
  3. Filtering the web page to remove HTML, JavaScript, and boilerplate text
  4. Generating a semantic signature™ for the page (a signature is the digital DNA of the page’s content – see http://www.textwise.com and http://www.semantichacker.com for more information)
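
As a rough illustration of step 3, here is a minimal Python sketch of stripping markup and scripts from a fetched page. It is not the foof implementation: the class and function names are hypothetical, and dropping short text blocks is only a crude stand-in (an assumption on our part) for real boilerplate removal.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of script/style tags."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text_blocks(self):
        return list(self._chunks)


def extract_page_text(html, min_words=5):
    """Return visible text; blocks shorter than min_words are treated as
    boilerplate (menus, link labels). The threshold is an assumption."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    kept = [b for b in parser.text_blocks() if len(b.split()) >= min_words]
    return "\n".join(kept)
```

The text that survives this kind of filtering is what would be handed to the signature generator in step 4.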

Matching Relevant Content to the Web Page

Given the semantic signature™ of the web page, it is relatively easy to take that signature and match it to the content signatures in the signature index of the content type chosen to fill the hole.

A signature is simply the best 30 weighted dimensions of a 1700+ dimension semantic space. The best matches are then biased by a keyword match that uses a proprietary term selection algorithm; this improves the precision of the results. The combined signature and keyword matches are ranked and, if there are any acceptable matches, the results are returned.

If there were no acceptable matches, then the match is retried with the next content type’s index. If there are no matches for a given hole, then a blank is used to fill the hole.
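
The matching and retry logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: cosine similarity, the 0.5 acceptance threshold, and all function names are hypothetical choices, and the proprietary keyword biasing step is left out entirely.

```python
import math


def make_signature(weights, top_k=30):
    """Keep the top_k strongest dimensions of a sparse {dimension: weight} vector."""
    strongest = sorted(weights.items(), key=lambda item: item[1], reverse=True)[:top_k]
    return dict(strongest)


def cosine(sig_a, sig_b):
    """Cosine similarity between two sparse signatures (an assumed measure)."""
    dot = sum(w * sig_b.get(dim, 0.0) for dim, w in sig_a.items())
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(page_sig, indexes, type_priority, threshold=0.5):
    """Try each content type's index in priority order.

    indexes: {content_type: {item_id: signature}}
    Returns (content_type, item_id, score), or None if nothing is acceptable.
    """
    for content_type in type_priority:
        scored = [
            (cosine(page_sig, item_sig), item_id)
            for item_id, item_sig in indexes.get(content_type, {}).items()
        ]
        acceptable = [pair for pair in scored if pair[0] >= threshold]
        if acceptable:
            score, item_id = max(acceptable)
            return content_type, item_id, score
    return None  # the caller leaves the hole blank
```

Returning None corresponds to the case where no index yields an acceptable match and the hole is filled with a blank.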

Maintaining a Quality Experience

During alpha testing, we determined that in order to have a pleasing experience we needed to:

  • Only fill one hole on a web page with a given content type (for example:  news would appear only once on a page)
  • Only fill two holes on a page with content, leaving the others blank
  • Provide a mechanism to browse content within the hole. This mechanism would allow the user to:
    • View additional articles, images, or videos related to the page, beyond the initially visible item (this is done by clicking on the <- and -> arrows in the content header)
    • View other types of content related to the page (this is done via tabs in the content header)
  • Provide a mechanism to verify the presence of our servers on the web and default to pure ad block mode, if the servers are not available
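
These rules amount to a small fill policy, sketched below in Python. The function name, its parameters, and the data shapes are hypothetical; only the rules themselves (one hole per content type, at most two filled holes, blank everything if the servers are unreachable) come from the list above.

```python
def plan_fills(holes, type_priority, find_match, max_filled=2, servers_up=True):
    """Decide what goes into each ad hole.

    holes: list of (hole_id, set of content types the hole can hold)
    type_priority: content types in the user's preferred order
    find_match: callable(content_type) -> content item, or None if no match
    Returns {hole_id: (content_type, item)}, with None meaning a blank hole.
    """
    plan = {hole_id: None for hole_id, _ in holes}  # default: blank space
    if not servers_up:
        return plan  # pure ad-block mode
    used_types = set()
    filled = 0
    for hole_id, supported in holes:
        if filled >= max_filled:
            break  # remaining holes stay blank
        for content_type in type_priority:
            # Each content type may appear only once per page.
            if content_type in used_types or content_type not in supported:
                continue
            item = find_match(content_type)
            if item is not None:
                plan[hole_id] = (content_type, item)
                used_types.add(content_type)
                filled += 1
                break
    return plan
```

For example, with four holes and matches available only for news and blogs, the first suitable hole gets news, the next suitable one gets blogs, and the rest stay blank.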

Additionally, though we did not implement contextual image search in foof (it is now available through the Semantic Hacker API), we decided to add an option for users to view their own photos in place of ads on web pages. To implement this, we chose Flickr and provided a way to point to a Flickr account as an option.

And it Works!

The development of foof was an interesting experience that gave the team a chance to have some fun and at the same time solve interesting problems.

Currently there are over 27,000 users of foof (July, 2009). The download for Firefox is available in the Mozilla Add-On sandbox (experimental Add-On) and at http://www.foofme.com.

Ingredients

27 Jul 2009

This posting will probably make the eyes of most people glaze over, but current and prospective users of our SemanticHacker API should really be informed consumers. So think of this as being like one of those federally mandated labels on your bottle of Red Bull.

The ingredients of a semantic dictionary are a set of hundreds of thousands of terms, a set of thousands of dimensions, and various numbers expressing the strength of association between a given term and a given dimension. Most of these associations will have zero strength, indicating that we have no information about them; but there will still be millions of non-zero numbers to provide a rigorous undergirding for statistical semantics.

We build a semantic dictionary by obtaining large training samples of documents relevant to each of its dimensions. The strength of association is then estimated as being proportional to the relative frequency of occurrence of a term in the training documents for a dimension versus in those for all other dimensions. The process is actually more complicated than this, but the differences are just refinements of the overall scheme as described.
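
For the curious, the overall scheme can be sketched in a few lines of Python. This is a simplified reading of the description above, not our actual estimator: counting each term once per document and normalizing across dimensions are assumptions made for illustration, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def build_dictionary(training_docs):
    """training_docs: {dimension: [list of token lists]}.

    Returns {term: {dimension: strength}}. Associations with zero
    strength are simply absent from the inner dict.
    """
    rel_freq = defaultdict(dict)
    for dimension, docs in training_docs.items():
        doc_counts = Counter()
        for tokens in docs:
            doc_counts.update(set(tokens))  # count each term once per document
        for term, count in doc_counts.items():
            # Fraction of this dimension's documents that contain the term.
            rel_freq[term][dimension] = count / len(docs)

    dictionary = {}
    for term, by_dim in rel_freq.items():
        # Compare a term's frequency in one dimension against all dimensions.
        total = sum(by_dim.values())
        dictionary[term] = {dim: freq / total for dim, freq in by_dim.items()}
    return dictionary
```

A term seen in only one dimension gets full strength there, while a term spread evenly over two dimensions gets half in each.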

Now we all understand what terms are (e.g. britney_spears, midfielder, rugelach, purple), but where do dimensions come from? The answer is that they are somewhat arbitrary. A dimension can be defined around any kind of category for which someone has provided requisite training documents. In many cases, we can find prior sets of categories to work from (ODP, USPTO), but we also can ourselves try to infer categories from some available pool of potential training data.

However we proceed here, it is necessary that the resulting dimensions be pertinent to an application of interest, be independent of each other, be supported by adequate training data, and be associated with enough terms to support semantic analysis of target text. This all can be tricky to achieve, but if it were easy, everyone would be doing it.

Estimating a probability basically involves computing an average. Since most middle-schoolers know how to do this, what is so difficult about building a semantic dictionary consisting of conditional probabilities?

The problem turns out to be with sample sizes. To get reliable dictionary weights for a given term, we need many examples of its occurrence in text, but most terms are rather infrequent in any given corpus. This fact of life is articulated in Zipf’s Law, which states that occurrences of the n-th most common term in a corpus will be approximately proportional to 1/n.

Such a relationship is called a “power law,” which can also be seen in many other natural phenomena. For instance, sociologists often note that only ten percent of the people in any organization do ninety percent of all the work.

Unfortunately, the most frequent terms in any corpus are typically the least interesting for information applications. So the challenge is to make reliable probability estimates for tens of thousands of terms when the statistical support is less than ideal.
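
To get a feel for the sample-size problem, the short Python sketch below computes expected term counts for an idealized corpus in which frequency is exactly proportional to 1/rank. The function names and the corpus sizes in the usage note are illustrative assumptions, not measurements from our data.

```python
def zipf_expected_counts(vocab_size, total_tokens):
    """Expected occurrences per rank if frequency is exactly proportional to 1/rank."""
    harmonic = sum(1.0 / rank for rank in range(1, vocab_size + 1))
    return [total_tokens / (rank * harmonic) for rank in range(1, vocab_size + 1)]


def share_below(counts, min_occurrences):
    """Fraction of terms expected to occur fewer than min_occurrences times."""
    rare = sum(1 for count in counts if count < min_occurrences)
    return rare / len(counts)
```

Calling `share_below(zipf_expected_counts(100_000, 10_000_000), 100)` on this idealized model suggests that more than 90 percent of a 100,000-term vocabulary would be expected fewer than 100 times in a ten-million-token corpus.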

To build a good dictionary, we need to do much more than simply add up some term frequencies and then divide.

SIGIR Day Three  July 22, 2009

Great job by Daniel Tunkelang of Endeca putting this track together

Morning: Industry Track Speakers

“Webspam and Adversarial IR: the Road Ahead” (Google) Matt Cutts

Requirements for spammers: content, reputation, opportunity for monetization. Examples of on-page and off-page spam provided. Spoke of defensive tools such as nofollow. Clear escalation from devising spamming routines to outright hacking. 1) Concentrate on finding hackers, who are joining with spammers (malware detection is key) to hack sites and sell links. 2) Prevent common spam with human tests, etc.: which techniques prevent it that any site publisher can use (spam classification for WordPress blogs is a good tool). 3) Looking for trust, identity, authentication. Warning: Facebook, Twitter, etc. are a new ecosystem with new forms of spam; fake profiles abound.

“The Searchable Nature of Acts in Networked Publics” Danah Boyd (Microsoft)

Danah’s research area is social media; she has looked at differences between MySpace and Facebook, etc. Her focus is communication – she is an ethnographer: “How young people use the internet.” Everything is VISIBLE. Distinction between social network sites and social networking sites.

Social network sites: A/ Engaging with preexisting friends (different from social networking sites, where you meet new people). Profile is the digital body – misinformation is intended and everywhere: 1) meant to be funny (alter egos), 2) young people have been told to lie about who/what – keep away the predators, 3) don’t want to be searchable/found. Don’t assume there is accurate information in social network sites. Average age stats are wrong! B/ Public articulation of “friends” – assumes links are equal, but relationships are not equal. Three key concepts of networks: sociological, articulated (public), behavioral (exchange content/interact).

Networked Publics: Issues – Persistence, Replicability (context frequently gone), Searchability (not who you want for the most part), Scalability (who is seeing your content), Invisible Audiences (who are you talking to? leads to imaginary audiences), Collapsed Contexts (social context is constantly changing, frequently misleading), New Public/Private Boundaries (getting reworked).

Twitter: just this spring a big player. Twitter is not a chat. It is constantly changing who is using it for what; celebrity cachet and mouthpiece to get back at powerful bloggers; soapbox, you choose who you follow. 5-15% of accounts are protected – most accounts are public. 5% contain a hashtag (almost half of these contain a URL). 22% include a URL. 36% mention another Twitter user (put it at the beginning – the tweet is really directed at an individual). 50 accounts have over 1M followers, 350 have several hundred thousand, millions of accounts are dead. 140 characters – a very difficult constraint for searchability; retweets – some attribute, some drop it.

Info Retrieval Thoughts: Social media is about conversation and contexts – tough to make sense of the social context. Danah@danah.org

“Ad Retrieval – A New Frontier of Information Retrieval” (Vanja Josifovski – Yahoo! Research)

Disclaimer – can’t expose any Yahoo! trade secrets. 40% of ads are textual, competing with other content on the page: sponsored search and content match placement. ~30% of web users interact with ads (he thinks this is because ads are not relevant). Text ads have visible and non-visible parts – landing URL too big, too much info. Bid phrase (keywords) used to target the ad (ads are creatives + bid phrase). Ad Retrieval: Sponsored search – keyword bid – database technology. Content Match – look for bid phrases, place ads – still single-feature matching. New way to look at it: treat the ad as a document in IR. Cost of serving the ad needs to be less than the revenue returned; also need to keep performance in mind.

“Corpus Linguistics and Semantic Technology at the New York Times” (Evan Sandhaus – Semantic Technologist, NYT R&D)

NYT annotated corpus: 20 years of data (1987-2007), launched 10/08, obtainable through LDC or NYT. 1.8M articles, 900K+ tags, 665K abstracts, in NITF format (an XML standard). corpus.nytimes.com. Reuters corpus came first, a smaller annotated collection. Potential uses of data, some ideas: location of article is an implicit ranking, # of words, etc. (mm: too temporal?); automated document summarization – corpus gold. 80 users after nine months.

“Query Modeling at bing” Nick Craswell, Bing (Microsoft Research Cambridge) (Filling in for sick speaker – OCLC)

Can’t tell how everything works (trade secrets, etc.). Ambiguous queries: ‘house’. Mine the logs – click logs; session data – what query follows the query “house”?; what other queries have clicks on that URL (co-click data). Intent clustering – use session data (strings of queries), keep nodes but replace edges with co-click data, then cluster on top of this. Provides measurement possibilities, improves IR modeling of queries, feeds into UI development. Temporal dynamics in logs: spiking/seasonal queries – feed ranking; periodic queries (DST 2007/DST 2008); stale anchors / trailing signals (BO campaign page v White House); temporal query expansion – watch spikes. Table of Contents: summary of aspects of an entity – summarizes mainline (results). Sticky control panel. “Bing Gets It”: provides info along popularity and consistent content, summary of mainline results. V1 just out, development continues.

Industry Track: Afternoon Panels

Search Industry Analysts: Whit Andrews (Gartner), Sue Feldman (IDC), Theresa Regli (CMS Watch), Marti Hearst (Responder), Daniel Tunkelang (Moderator)

TR: ( Implementation consultant for ten years) 3 yrs as an analyst, evaluate products, what fits for your needs? CR of search products – clients have very specific needs. Sounds like she has lots of eDiscovery clients; audio/video clients;

SF: (linguistics background) Web search is not ahead of enterprise search – v interesting stuff in the enterprise search systems. Real time information – Enterprises have immediate info needs, automated online info key;  Mobile work force – access to everything in company (security and access); Money – now a necessity to fund access, not a nice to have.

Trends – search-based apps to solve a biz problem, borrowed from search architecture; convergence of platforms – IM, search, etc.; unified access to info – BI tools on all data; flexible hybrid architectures – database with inverted indexes, but database features supporting ad hoc querying and search. Search is not a goal in itself; it needs to be integrated into the workflow process. UI will sell new tools to new buyers (marketing/managers, not IT staff). Task/tools, not single search tech.

Open source embedded everywhere – collab, CRM, sales, etc. Lucene, Solr, etc. (not free really)

WA: (ex-journalist) 4 trends: Federation (access without paying for it) doesn’t necessarily always work, but seeing improvement and will see value. Conversation – disambiguation of query – ask the user – participatory search. Transparency – what is driving results ranking. Video – growing like crazy in 2009. Real time is more important than ever. Value: “Relevance is about money.”

Lively discussion throughout the session about the relevance=money statement. Big Disagreement.

QA Session – Could not hear some floor speakers, selected questions only here

?Employees want a search box – no one wants sophistication? No defense against the search box and search button

A: People are not married to Google results if you show effective use of different system.

? How do you evaluate an enterprise search system?

A: depends on the need: recall sometimes, precision sometimes A: precision.  Business goals are what matter -any evaluation needs to be within business goal and most companies don’t have them coming to the table.

Marti: Spent a lot of time in this panel talking about integrity, unusual for this conference as we strive to be honest and direct in our work. How do you keep control when your competitor says “this is the cutting edge”; how do you not jump on the bandwagon? SF: Don’t read competitor reports. If you are a buyer, first bring your requirements. WA: We have a hype cycle: tech trigger to hype, to inflated expectations, trough of no delivery, then the reality. Where is it on the hype cycle? Where is it on the adoption curve? What hype are you willing to tolerate – where does your business fit on this curve?

Theresa: Tamping down the hype – be skeptical.

(Several other questions here but responses were wide ranging so I have left out)

Industry Track: Vendor Panel

Jeff Fried, Sr Prod Mgr, MSFT (M); Raul Valdes-Perez, Co-Founder, Vivisimo (V); Adam Ferrari, CTO, Endeca (E)

Liz (Moderator)    Bruce Croft (Responder)

Liz: What areas need work/advances?

Endeca: Evaluation for interactive IR? What features to include? Efficiency of architectures

Search to find vs search to learn -> interactive search

Microsoft: 3 views – Biz Analyst, End User, Systems Guy. Where researchers and practitioners align:

Take the user’s intent, match it to content. Most people are unhappy with enterprise search. Context matters, different systems for different companies. Positive feedback: what’s working. User experience measure: search UI matters.

Vivisimo: The big opportunity in enterprise search – web search history.  Enterprise – lots of new companies and UIs

Single search box access to everything is the big opportunity in enterprise search.   

Liz: What is the first problem we should work on? E: Test the efficacy of interactive IR.  M: Holistic eval is needed. Better theory about interaction. V: Search to Problem solving systems.

Liz: What’s unique about determining relevance in your system? M: Provide controls, slide bar for precision/recall control, exploration. Use user logs to improve relevance. V: Tunability, and something else, missed it. E: Need relevance, facets; users may care about them differently – customizability.

Liz: What from enterprise search would be most appreciated by users of web search? V: Definitely UI. Don’t need to worry about ad real estate on the enterprise screen. E: People will enjoy better faceting, etc., but slow. Advanced features, visualization, etc. really are appreciated. M: UI, faceting, exploration. People will use these things if they work.

L: What evaluation measures does Enterprise search use/need? E: business metrics will dominate. Look at logs. What’s working, what isn’t.  M: Task completion and happy users.  V: we see people use # searches before/after adoption. What is the average position of the doc people click on?

L: What is the fundamental technical problem you are currently focused on? M: Systems scale footprint needs. Grow as much as you need it to. V: Getting search to work across a vast array of content types. E: Greater effectiveness with greater simplicity. M: Search connectors are always tough, would love to have research on it.

Where could academics and vendors work together? V: Companies only want people, not their research. People transfer would be the best project.  Fund university – get grad students. E: Events like today. Cross pollinate. Formalize ways of opening dialogue. More openness on part of vendors for transparency. M: LDC, SIGIR Industry Track, University sponsorship needs to be easier, cheaper.

Panel had an opportunity to ask each other questions:

E? How do you reconcile single search box with enterprise search needing deep interaction with repositories of data?

High value deep problems won’t work with this… V: Do believe this is an opportunity. Not a research problem. It’s an opportunity. Facets are needed for every diff kind of data. Provide both.

? Where does Endeca see the role of federation? Coexist but infrastructure needs to be there to accommodate deep search needs.

V?: Sharepoint is wonderful. I’ve seen the videos. Can a platform serve the entire market? Underserve/overserve markets. High-end medical records management might be underserved. (Missed audience QA and Responder)

What We Sell

22 Jul 2009

A TextWise semantic dictionary is essentially a big bunch of numbers between 0 and 1. To be more precise, they are conditional probabilities of a semantic dimension being relevant to a document containing an occurrence of a given term; but to a casual observer, they can look very ho-hum and uncool. What is so great about them?

Some people are in fact dismissive of any numbers being applied to semantics. This is probably because of the unfortunate legacy of numerical abuse in information technology, where system builders all too commonly slam numbers together willy-nilly and hope that something sensible comes out.

At TextWise, we don’t do this. We not only follow rigorous statistical practice to get the most information out of available text data, but also apply proprietary filtering and reduction methods to eliminate many of the anomalies that can slip through any statistical system by chance. To paraphrase the Colonel, “We do numbers right.”

SIGIR 09 Highlights July 21, 2009   Jumped around to attend sessions across parallel tracks today.

Information Extraction: “Named Entity Recognition in Query” (Microsoft Research & Institute of Computing Technology CAS)  Queries are very challenging for named entity extraction: few words, poorly formed, ambiguous (harry potter review – movie? book?). Research used triples from queries and used millions of queries as training data.  Method outperforms the baseline model and shows promise in experimental results.

Web Retrieval 1: “Using Anchor Texts with their Hyperlink Structures” (Microsoft Research & University of Montreal) Use of anchor text works best with navigational queries (a navigational query has only one satisfactory result – need to nail the one right page/site). Previous work gave each link equal and independent status. New models: a web site counts for only one vote – sites are independent. Relationships of links within sites and between sites are in the model. Relevance scale: Perfect, Excellent, Good, Fair, Bad. Combined body and anchor text performed best. Anchor text can be improved over current models, and site relationships are most promising for navigational searching. Future investigations will explore other anchor text relationships.

Interactive Search: “Predicting User Interests from Contextual Information” (Microsoft Research) Using what we know about you to predict future best retrieval. User interest models, personalization, IF, etc.; little is actually known about the value of different context sources. Cited Ingwersen and Järvelin – nested model of context stratification representing the main contextual influences on people engaged in information behavior. Study showed different context sources should not be treated equally. Which source is most important depends on whether you are looking at the next hour, day, or week in the user’s schedule.

“ A Comparison of Query and Term Suggestion Features for Interactive Searching” (UNC)

Study set out to help users stumped by system – need help to form query to find information. Used several variations of system generated terms, queries, and user generated terms, queries. Nice quantitative and qualitative analysis. Findings suggest next study should be hybrid system that lets users change terms in suggested queries.

“An Aspectual Interface for Supporting Complex Search Tasks” (Univ of Glasgow) Complex task as defined by Campbell (1988). Used the BOSS search engine; designed an aspectual search interface to support subtasks. 3 research questions: 1) Does the aspectual interface help the user discover information? 2) Does the aspectual interface help the user understand the task? 3) What features are used to carry out the task? Best results were in broad tasks. Users worked the entire twenty-minute limit with the aspectual interface, not so with the baseline interface.

Keynote Day Two: “From Networks to Human Behavior” Albert-László Barabási, Center for Complex Network Research, Northeastern University. Fascinating presentation on analysis of networks. A 90-minute presentation covering a complex topic, so it does not lend itself to a blog summary. Excellent speaker and a great end to day two.

According to WordNet, the word BANK has multiple senses, and so any occurrence of it in a text document is ambiguous. For example, we can have a river BANK, a financial BANK, a fog BANK, or an aeronautical BANK. The intended sense in a particular document has to be determined by looking at the context of occurrence. So, to determine the actual meaning of BANK in a document, we have to ask in effect whether the document is talking about streams of water, financial meltdowns, marine navigation, or aircraft in flight.

Now the number of different possible contexts is probably huge. One cannot hope to recognize them all; but for disambiguation of words, we need only fairly general contexts to distinguish the word senses of prime interest to us. Furthermore, given a large sample of our target text, we can employ statistical methods to identify the most important of such contexts.

This is essentially what SemanticHacker is all about. The dimensions of one of our semantic dictionaries define thousands of contextual reference points for the interpretation of terms. For example, if the words stream, water, flow, erosion, and grass are in a document, then with the ODP 2009 dictionary, we find that the top match dimension is 1461 (Top/Science/Environment/Water_Resources) with a weight of 0.5138. In this context, the word BANK would probably mean “river bank.”

Actually, we don’t need to make this explicit association. With a search engine user interface, one just needs a way of describing the context of ambiguous search terms, perhaps by listing contextual words. Then all a semantic search engine has to do is find a document containing the search term and having the same described context in its semantic signature. This is of course a part of our API for search.
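
For readers who like to see such things spelled out, here is a toy Python sketch of the context lookup described above. Everything in it is hypothetical: the miniature dictionary, its dimension names, and its weights are invented for illustration and are not taken from the ODP 2009 dictionary, and summing weights is just one simple way to combine evidence.

```python
from collections import defaultdict

# Hypothetical toy dictionary: {term: {dimension: weight}}.
# The dimension names and weights are invented for illustration only.
TOY_DICTIONARY = {
    "stream":  {"Water_Resources": 0.6, "Media_Streaming": 0.3},
    "water":   {"Water_Resources": 0.7},
    "erosion": {"Water_Resources": 0.5, "Geology": 0.4},
    "loan":    {"Banking": 0.8},
    "deposit": {"Banking": 0.6, "Geology": 0.2},
}

# Hypothetical mapping from a context dimension to a sense of BANK.
SENSE_BY_DIMENSION = {
    "Water_Resources": "river bank",
    "Banking": "financial bank",
}


def top_dimension(context_words, dictionary):
    """Sum term weights per dimension and return the strongest (name, score)."""
    totals = defaultdict(float)
    for word in context_words:
        for dimension, weight in dictionary.get(word, {}).items():
            totals[dimension] += weight
    if not totals:
        return None  # no context word is in the dictionary
    return max(totals.items(), key=lambda item: item[1])


def guess_sense(context_words, dictionary=TOY_DICTIONARY):
    """Map the dominant context dimension to a sense of BANK, if one is known."""
    top = top_dimension(context_words, dictionary)
    if top is None:
        return None
    return SENSE_BY_DIMENSION.get(top[0])
```

With this toy data, the context words stream, water, flow, erosion, and grass point to the water dimension, while loan and deposit point to banking.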

SIGIR 09 Day One  http://sigir2009.org/Program  Several parallel tracks at SIGIR – here are some highlights from sessions I attended today.

Susan Dumais gave the opening keynote @ SIGIR 09: “An Interdisciplinary Perspective on Information Retrieval.” Dumais was the 2009 recipient of the Gerard Salton Award for her contribution to the Information Retrieval field. Her work at Bell Labs/Bellcore exploring vocabulary mismatch (aka verbal disagreement) led to her LSI work. She has worked at Microsoft Research since 1997 and currently leads the Context, Learning, and User Experiences in Search team. Her talk covered her background (cognitive psychology/mathematics) and how the problems of information retrieval and the huge social and technical leaps in the field over the last fifteen years have made it a very exciting time to be working in this area. However, as much as things have changed, much has stayed the same. We haven’t escaped the search box, or the results list. Observed searching habits: we repeat our searches with high frequency – “re-finding” on the desktop and the web. Date is the most common sort selected when changing from the default option. She called for more personalized search research – we need models to support personalized search: when to use it, when not to (it works only some of the time). Evaluation continues to be challenging. Behavioral data is extremely noisy – especially click data. For future research: IR solutions must acknowledge the dynamic information environment, and experiments and data must reflect this environment. Need data that mirrors the dynamic information environment; she called for a ‘Living Laboratory’ made up of search engine logs, searching resources such as Wikipedia, etc. A group needs to mobilize to put this resource together; she plugged the Lemur Query Toolbar. IR research needs an interdisciplinary team to understand users, and thinking outside the box, to meet the challenges ahead in IR.

Novel Search Features Session Notes: “Web Searching for Daily Living” (NTT Comm): collecting information about everyday actions from cameras and incorporating the information into web search queries using clustering techniques to return useful information. Forward-looking research, as few of us have web browsing tools on our appliances or in our bathrooms, but paving the way. This is what they mean by the phrase search ubiquity! “Global Ranking by Exploiting User Clicks” (Yahoo!): Collecting information about user click sequences and then providing predictions through supervised learning. Must look across results, not within single documents after a click. Position influences clicks – first result often clicked on. Aggregation of data is key – click data is very noisy. “Good Abandonment in Mobile and PC Internet Search” (Google): Investigation of when search abandonment is good (answer is right in the results list – no need to open a page); much more likely to occur on a mobile device as opposed to a PC; varies by locale (looked at US, Japan, China) and by category of query. Research to estimate rates and get a first study designed: classification by modality, locale, category.

Web 2.0 Session Notes: “A Statistical Comparison of Tag and Query Logs” (Strathclyde & Lugano Universities) Very cool zooming slide ware used in presentation.  Found more vocabulary shared between queries and tags than any combination of queries, tags, and content of search results.  Data set used: AOL query logs, Delicious tags, ODP categories.  “Enhancing Cluster Labeling Using Wikipedia” (IBM Research) Found very promising results using Wikipedia metadata to label clusters. Walked through approach, evaluation. Findings suggest continued development of this work would provide better quality labeling of clusters.

Question Answering Session Notes:

“A Classification-based Approach to QA in Discussion Boards” (Lehigh University) How to ask questions on the web – options: search engines, QA portals, discussion boards. This research focused on detecting questions and answers on discussion boards. Discussed techniques found to work best for questions and for answers. “Ranking Community Answers by Modeling Question-Answer Relationships via Analogical Reasoning” (Microsoft Research & Huazhong Science and Tech University) Presenter said search engines must deliver answers sooner rather than later. Mining data from community forums (Yahoo! Answers Archive) to find clues for linkages between questions and answers. Model the previous knowledge. Each question had 16 answers on average in the data set. Very promising results.


Search engines work remarkably well when one is searching for a popular topic. Just try the query LOVATO. If you are of the demographic normally reading this blog, then you probably don’t know yet who she is, but Google or Bing will find her. Although she is still obscure enough so that Lovato Electric, Inc., beats her out for top spot on Bing, there is no problem in getting the goods on this latest Disney ‘tween idol.

Here is a different, more frustrating search story, however. I was over at the National Gallery in Washington on Sunday and saw a remarkable series of Renaissance Italian frescos. At home afterwards, I queried on ITALIAN VILLA FRESCO NATIONAL GALLERY WASHINGTON, but found nothing recognizable on Google with either web or image search. About an hour later, I gave up after trying numerous variations of queries.

Then I went to www.nga.gov and navigated down to its 16th Century Italian art page. It offered a virtual tour of a series of frescos by Bernardino Luini on the legend of Procris and Cephalus. Bingo! According to the web site, “These nine paintings are the only examples of an Italian Renaissance fresco series in America.” Strangely enough, I had actually tried the term LUINI in one of my unsuccessful queries.

So we obviously have a failure to communicate here; and this is really a problem that semantic search should be addressing. The relevant page was out there and my queries should have been specific enough, but somehow a beautiful young bride being run through and killed by a magic javelin just wasn’t as sexy as Britney 4.0.

It was a great pleasure teaming up with Ron Kaplan (Powerset/Microsoft), Riza Berkan (hakia) and Kiki Hempelmann (RiverGlass) in this panel presentation on Semantic Search Beyond RDF at the SemTech 2009 conference. What is semantics, what is not? It is quite interesting to hear different perspectives. Particularly, is statistics semantics? One often hears the statement that statistics is not semantics. Then what about contextual semantics?

Statistics are only numbers, but with enough of the right kinds of numbers, one can model the economy of Uzbekistan, prove the existence of the Higgs boson, or characterize the content of a text document. Numbers are our friends, if we treat them with proper respect.

The important thing is to keep an open mind when it comes to semantic search. But we do agree on one thing – semantic search CAN go beyond RDF markups. The question is a matter of how.

Scalability and standard measurements were still hot topics around semantic search during the Q&A session. When the question of benchmarks for comparing search systems came up, all of the panelists agreed that there is NO one benchmark number that can be used to compare all search systems, simply because such a number is hard to interpret and may not make sense for one’s business.