Overview
A group of us at TextWise were working on our semantic similarity technology, which matches arbitrary text documents to similar documents. One of our first uses of the technology was to contextually match ads to Web pages. This worked very well, but we decided to focus on a Web 3.0 API (Semantic Hacker) and wanted to come up with an interesting demo of the technology.
The idea for foof (foofme.com) came from suggestions in various forums and blogs about possible improvements to Wladimir Palant’s Adblock Plus. These suggestions focused on allowing pictures or other images to replace the ads, instead of just crunching (or blanking) the space.
Ad blockers examine the HTML of a web page and look for patterns of code that indicate ad displays. They then remove that code while trying not to disrupt the look and feel of the base page.
While debugging our original advertising system, we had implemented a tool that replaced ads on test Web pages with our own ads, so that we could debug in situ. As users of Adblock Plus, we were reading the blogs and realized that we could use our technology to offer more than just replacing ads with images. Thus was born the idea of using TextWise’s semantic similarity engine and various content sources (news, blogs, Wikipedia, videos and personal images) to match interesting content to web pages and fill the ad holes.
In developing the foof ad blocker, we needed to solve several problems:
- Finding and eliminating the ads on the web page
- Determining the size of the hole that remained, so that we could fit content into the hole
- Selecting which content indexes to use to fill each hole
- Determining what the web page is about
- Matching the replacement content to the web page
- Providing an experience that is not overwhelming
Finding the Ads
This was the easiest part of the design. We started with Wladimir Palant’s open source Adblock Plus code as a base. We considered it the best Firefox ad blocker, and building on it meant that foof would block ads equally well.
Determining the Hole Size
Once the ads are located on the web page, we examine both the ad and the page structure and determine the possible size of the hole left after elimination. As each type of content only fits well into holes of certain sizes and geometries, we characterize each hole and decide if it is to be left blank or can contain content.
If the user, during set-up, chose to only block ads, then the process is complete and blank space is substituted for all ads.
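The hole characterization described above can be sketched roughly as follows. This is a minimal illustration, not foof's actual code: the `Hole` class, the `MIN_SIZE` table, and every pixel threshold in it are hypothetical values chosen only to show the idea of mapping a hole's geometry to the content types it could hold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hole:
    width: int   # pixels
    height: int  # pixels


# Hypothetical minimum (width, height) per content type.
# These numbers are assumptions for illustration, not foof's real limits.
MIN_SIZE = {
    "news": (200, 90),
    "blogs": (200, 90),
    "wikipedia": (200, 90),
    "videos": (300, 250),
    "personal_images": (120, 120),
}


def supported_content_types(hole):
    """Return the content types that fit the hole; an empty list means leave it blank."""
    return [
        name
        for name, (min_w, min_h) in MIN_SIZE.items()
        if hole.width >= min_w and hole.height >= min_h
    ]
```

With these assumed thresholds, a wide, short banner hole would accept text content but not video, and a very small button-sized hole would be left blank.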
Determining the Type of Content for a Hole
Once we determine that a specific hole can contain content, we characterize the hole to see what types of content it can support (news, blogs, Wikipedia, videos, personal images). A typical hole might be capable of containing more than one type of content. At this point we examine the user’s configuration settings to see which types of content the user enabled and in which priority order the user would like us to choose them. The order is important, because a relevant content match may not be available for the web page for every content type.
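One way to picture this selection step is as a filter over the user's priority list. The sketch below assumes a hypothetical `candidate_types` helper and plain string names for the content types; the real configuration format is not described here.

```python
def candidate_types(hole_supports, user_priority):
    """Return the user's enabled content types, in the user's priority order,
    restricted to the types this hole can hold.

    hole_supports: content types the hole can physically contain.
    user_priority: content types the user enabled, highest priority first.
    """
    supported = set(hole_supports)
    return [content_type for content_type in user_priority if content_type in supported]
```

For example, if a hole supports news, blogs and videos, and the user enabled videos, Wikipedia and news in that order, the candidates are videos first and then news.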
Determining What the Web Page is About
Determining what a web page is about is a multi-step process. The steps include:
- Determining the address of the web page
- Fetching the web page
- Filtering the web page to remove HTML, JavaScript, and boilerplate text
- Generating a semantic signature™ for the page (a signature is the digital DNA of the page’s content – see http://www.textwise.com and http://www.semantichacker.com for more information)
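The filtering step can be illustrated with Python's standard `html.parser` module. This is only a sketch of removing markup, scripts and styles; it does not attempt boilerplate removal, and it is not the filter foof uses. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style elements."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def visible_text(html):
    """Return the page's text with tags, JavaScript and CSS removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The resulting plain text is what would be handed to the signature generator, which is not sketched here because its internals are not described in this article.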
Matching Relevant Content to the Web Page
Given the semantic signature™ of the web page, it is relatively easy to take that signature and match it to the content signatures in the signature index of the content type chosen to fill the hole.
A signature is simply the best 30 weighted dimensions of a 1700+ dimension semantic space. The best matches are then biased by a keyword match that uses a proprietary term selection algorithm, which improves the precision of the results. The combined signature and keyword matches are ranked, and if there are any acceptable matches, the results are returned.
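To make the matching concrete, here is a rough sketch under stated assumptions. Only the figure of 30 dimensions comes from the text. The use of cosine similarity, the simple keyword-overlap score, the linear blend of the two, and the `keyword_weight` and `threshold` values are all assumptions for illustration; the actual similarity measure and term selection algorithm are proprietary and not described here.

```python
import math

SIGNATURE_SIZE = 30  # from the text: the best 30 weighted dimensions


def make_signature(weights, size=SIGNATURE_SIZE):
    """Keep the `size` highest-weighted entries of a {dimension_id: weight} mapping."""
    top = sorted(weights.items(), key=lambda item: item[1], reverse=True)[:size]
    return dict(top)


def cosine(sig_a, sig_b):
    """Cosine similarity of two sparse signatures (an assumed similarity measure)."""
    shared = sig_a.keys() & sig_b.keys()
    dot = sum(sig_a[d] * sig_b[d] for d in shared)
    norm_a = math.sqrt(sum(w * w for w in sig_a.values()))
    norm_b = math.sqrt(sum(w * w for w in sig_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keyword_overlap(page_terms, doc_terms):
    """Fraction of the page's terms found in the document (a stand-in for the real keyword match)."""
    page_terms = set(page_terms)
    if not page_terms:
        return 0.0
    return len(page_terms & set(doc_terms)) / len(page_terms)


def rank_matches(page_sig, page_terms, index, keyword_weight=0.25, threshold=0.3):
    """Rank documents in `index`, an iterable of (doc_id, signature, terms) tuples.

    Documents scoring below `threshold` are dropped as unacceptable matches.
    """
    scored = []
    for doc_id, sig, terms in index:
        score = ((1 - keyword_weight) * cosine(page_sig, sig)
                 + keyword_weight * keyword_overlap(page_terms, terms))
        if score >= threshold:
            scored.append((score, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored]
```

An empty result from `rank_matches` corresponds to the "no acceptable matches" case for that content type's index.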
If there are no acceptable matches, the match is retried with the next content type’s index. If no content type produces a match for a given hole, the hole is left blank.
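The retry behavior amounts to a short loop over the candidate content types. In this sketch, `fill_hole` and the `search` callback are hypothetical names; `search` stands in for whatever queries a given content type's signature index.

```python
def fill_hole(candidate_types, search):
    """Try each content type in priority order.

    search: callable taking a content type and returning a list of matches
            (empty if nothing acceptable was found).
    Returns (content_type, matches) for the first type with matches,
    or None if the hole should be left blank.
    """
    for content_type in candidate_types:
        matches = search(content_type)
        if matches:
            return content_type, matches
    return None
```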
Maintaining a Quality Experience
During alpha testing, we determined that in order to have a pleasing experience we needed to:
- Only fill one hole on a web page with a given content type (for example: news would appear only once on a page)
- Only fill two holes on a page with content, leaving the others blank
- Provide a mechanism to browse content within the hole. This mechanism would allow the user to:
  - View additional articles, images, or videos related to the page, beyond the initially visible item (this is done by clicking on the <- and -> arrows in the content header)
  - View other types of content related to the page (this is done via tabs in the content header)
- Provide a mechanism to verify the presence of our servers on the web and default to pure ad block mode, if the servers are not available
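The first two rules and the server fallback can be combined into a small page-level planner. This is a simplified sketch: `plan_page` and its arguments are hypothetical, and it assumes every candidate content type would find a match, which the real system cannot assume.

```python
MAX_FILLED_HOLES = 2  # from the rules above: fill at most two holes per page


def plan_page(holes, servers_available=True):
    """Decide which content type, if any, goes in each hole.

    holes: one list of candidate content types per hole, in priority order.
    Returns a list with the chosen type per hole, or None for a blank hole.
    """
    plan = []
    used = set()  # each content type appears at most once per page
    for candidates in holes:
        choice = None
        # Each filled hole adds one type to `used`, so its size is the fill count.
        if servers_available and len(used) < MAX_FILLED_HOLES:
            choice = next((t for t in candidates if t not in used), None)
        if choice is not None:
            used.add(choice)
        plan.append(choice)
    return plan
```

When `servers_available` is false, every hole is planned as blank, which corresponds to the pure ad block mode described in the last rule.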
Additionally, though we did not implement contextual image search in foof (it is now available through the Semantic Hacker API), we decided to add an option for users to view their own photos in place of ads on web pages. To implement this, we chose Flickr and provided a way to point to a Flickr account as an option.
And it Works!
The development of foof was an interesting experience that gave the team a chance to have some fun and at the same time solve interesting problems.
As of July 2009, foof has over 27,000 users. The download for Firefox is available in the Mozilla Add-On sandbox (experimental Add-On) and at http://www.foofme.com.