Archive for the ‘Opinion’ Category

Aristotle lived about 2,400 years ago, well before the advent of the World Wide Web. Yet his ideas drive the still-emerging Semantic Web. In fact, we could probably do a better job as modern information scientists if we paid a bit more attention to the ancient Greek philosopher.

In his work “Categories,” Aristotle addressed the problem of meaning in language and developed a logical framework for semantics. In this work, he invented the theory of subjects and predicates, which modern grammar and formal logic have adopted. This was in effect RDF version 0.0.0.

Aristotle also talked about using taxonomies (from the Greek τάξις + νόμος) to define the meanings of concepts, introducing “genus” and “species” as essential relationships. This approach was adopted by Linnaeus in the 18th Century to catalog the great diversity of life on earth; and more than a hundred years later, formal taxonomies made their way into library science.

Of special interest to us here is Aristotle’s classification of the predicates associated with definitions of meaning. He defined five types: genus, species, difference, property, and accident. The first two are already familiar to information scientists as IS-A relationships. A difference predicate relates to a defining characteristic for a concept. A property is an important characteristic for a concept, but not sufficient to define it. An accident is a true predicate that makes no contribution to meaning.

For example,

(genus/species) Angelina Jolie is an American movie star.
(difference) She is the daughter of American actor Jon Voight.
(property) She trained with Lee Strasberg.
(accident) She visited Costa del Sol.
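
To make the distinction concrete, here is a minimal sketch in Python (not anyone’s production code) of how the examples above might be recorded as triples tagged with Aristotle’s predicate types; the relation names are made up for illustration.

# Represent each statement as a subject-predicate-object triple plus
# the Aristotelian type of the predicate.
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "obj", "kind"])

facts = [
    Triple("Angelina Jolie", "is-a", "American movie star", "genus/species"),
    Triple("Angelina Jolie", "daughter-of", "Jon Voight", "difference"),
    Triple("Angelina Jolie", "trained-with", "Lee Strasberg", "property"),
    Triple("Angelina Jolie", "visited", "Costa del Sol", "accident"),
]

# Keep only the triples that actually contribute to the meaning of the subject.
defining = [t for t in facts if t.kind != "accident"]
for t in defining:
    print(t.subject, t.predicate, t.obj)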

In automated building of semantic dictionaries, our problem is with accidental predicates. Such predicates have only a weak relationship to a subject and tend to lead to noisy inferred associations. We probably do not want to retrieve a news item about Angelina Jolie given a query about Costa del Sol.

Unfortunately, many and perhaps most predicates in text data are accidental. Current data-driven semantic learning systems make no such distinction yet, so there is room for major improvement. One possible approach is to employ the techniques of text summarization to identify the most important “predicates” in our data and thus bias our statistics away from accidents toward properties and differences. Aristotle would be amused.
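
For what it is worth, the flavor of that idea can be sketched in a few lines of Python. Nothing here is our actual pipeline; the crude frequency-based sentence score simply stands in for a real summarizer, and the function names are hypothetical.

from collections import Counter

def sentence_scores(sentences):
    # Score each sentence by the average corpus frequency of its words --
    # a crude stand-in for a real summarization importance measure.
    freq = Counter(w for s in sentences for w in s.lower().split())
    return [sum(freq[w] for w in s.lower().split()) / max(len(s.split()), 1)
            for s in sentences]

def weighted_predicate_counts(extractions, scores):
    # extractions: list of (sentence_index, predicate) pairs pulled from the text.
    # Each predicate's count is weighted by its sentence's importance score,
    # biasing the statistics toward properties and differences, away from accidents.
    counts = Counter()
    for i, predicate in extractions:
        counts[predicate] += scores[i]
    return counts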

The February issue of Scientific American had an article on the latest thinking about the Whorfian Hypothesis, which states that language strongly influences how humans think. This was a hot idea about sixty years ago, but eventually fell out of academic favor because of the lack of hard empirical evidence. Now that evidence is starting to show up, which has some implications for computational semantics.

The standard view on language and meaning has recently emphasized universality. This is to say that the understanding of language is hardwired in our heads, and so any competent human should qualify as an expert in the algorithmic delineation of meaning. The Whorfian hypothesis throws us a curve here in that we now have to consider language along with culture in our models of thought. A single well-crafted taxonomy or other semantic construct will not fit all.

We see something of this problem on the World Wide Web. As Jimmy Wales noted this past week, the content of the Web, and Wikipedia in particular, is largely created by twenty- and thirty-something males and so is dominated by their interests. A set of semantic categories derived from the Web in general will certainly be insufficient for understanding text on finance or on medicine and may be challenged even when dealing with the pages frequented by twenty- and thirty-something females.

This does not mean that a given semantic scheme is invalid. Each scheme, however, is limited by the vocabulary it covers and by the kinds of distinctions it makes. That should be good news for those of us who make our living in computational semantics.

Back in the ’60s and ’70s of the last century, the Whorfian hypothesis was a hot subject on college campuses. This was the idea that one’s native language, its syntax and semantics, strongly shaped one’s worldview. For example, Eskimos speaking Inuit supposedly had thirty different words for snow and so had a more complex relationship with their environment than someone speaking English with only one word for snow.

The problem of course is that skiers can make plenty of distinctions about kinds of snow even in English. Although the Whorfian hypothesis was theoretically attractive, it did not square in the end with our actual experience with language. That pretty much took the steam out of the Whorfian hypothesis, but now in the 21st century, empirical support has been accumulating for a weaker version of it. This was the subject of an article in The New York Times Magazine (http://nyti.ms/boqzs5).

The weak Whorfian hypothesis rejects the idea that language establishes an absolute limit on thinking. Thus we can learn about distinctions in types of snow if we really need them. The structure of a language, however, definitely can bias our thinking; and this could have consequences in practical matters like the ranking of retrieved documents. The choice of a particular semantic framework like RDF may therefore affect the performance of an information system in unexpected ways.

So far, experimental results on language and thought have focused on highly specific biases in areas of language like giving spatial directions, assigning gender to nouns, and dividing the spectrum into colors. It seems plausible, though, that this should generalize to the overall semantic problem of dividing up meaning into some kind of compact space. There is more than one way to skin a cat here, and there are probably advantages and disadvantages in each possibility.

A dogmatist might be tempted to argue here that RDF with certain standard taxonomies is the right way and everything else is wrong, but that is probably overreaching. We are not yet savvy enough about semantics to carve tablets in stone about its implementation. At present, one can say only whether a given scheme is optimal in some formal sense; but if it makes no obvious sense to people, then something more comprehensible might be better in the long run even if it is less than optimal.

The weak Whorfian hypothesis forces us to be more honest. If each semantic scheme introduces its own biases, then we need to experiment to see how different approaches work out for a given target application. Given that humans operate with more than one linguistic framework, we should not be so quick to assume that machines can do better at semantics with just a single framework.

Basics

5 Oct 2010

Linguists have long debated whether human language ability is innate or is simply learned by highly plastic neurocircuitry of a general sort. Recent studies with fMRI scans indicate, however, that cognitive skills like language understanding tend to be associated with highly specific brain locations across different individuals, supporting the idea that some kind of language-related structure exists. Studies of people impaired by strokes occurring in language regions have also shown this.

So when a young child learns that Mama is related to a concept of MOTHER, which applies to more than a single individual, this seems to draw upon specialized built-in logic within the human brain. This kind of symbolic capability is not unique to humans, being found to some extent in other large-brained social animals like elephants, whales, dolphins, and chimpanzees; but we certainly have more of it. This can be seen in the relative size and organizational complexity of human brains.

The implication here is that concepts like MOTHER, BIRD, HOUSE, or FOOD are real in some sense at the genetic level. We of course do not necessarily all learn the same particular concepts; for example, speakers of different languages in different cultures can be expected to develop divergent concept frameworks. Nevertheless, it is possible to translate between unrelated languages like Inuit and English, meaning that there is still a large overlap in their linguistic repertoires of concepts.

Consequently, when we technologists talk about incorporating semantics into search engines and other applications, we need to remember that semantics existed a long time before the first Boolean electronic circuit and that what we call “semantics” should be consistent with what goes on in our own heads. This is perhaps only a marketing concern, but the business of selling semantic technology will be that much harder if we cannot agree on what we really mean.

The concept of CONCEPT would seem to be a focal point for semantics that everyone can grasp. Whether we approach language and meaning like Wittgenstein or like Russell or like Korzybski or like Chomsky or like Miller or like Berners-Lee, it helps to get grounded properly.

Because my wife is petite, she has difficulty finding clothes and shoes that fit and, being of a certain age, no longer has the option of shopping in the teens section. One understands exactly how this situation came about. Some researchers developed a demographic profile of women in the United States and then applied a multivariate optimization algorithm to calculate the mix of sizes that would generate the most profit for manufacturers. If you happen to be in the tail of the demographic profile, you are just out of luck.

We see something similar happening on the Semantic Web. Developers seem to favor global solutions, which are often highly optimized for capturing content deemed to be the most important somehow. The problem, though, is that this strategy discounts the long tail of distributions, which is unfortunate, since pundits see long-tail processing as the new frontier for Web applications. As with clothing and shoes, an optimal mix of sizes does not fit all.

How we each use our words in language depends on how we learned to speak; and each of us has had a unique history. The differences in our language are most evident when we look at text in specialized areas like medicine, law, government, or technology; but even in “normal” discourse on the Web, we see jargon or unusual usages of common terms that may be missed by some kind of global semantic solution.

This is not to say that global solutions are invalid. They are quite useful in the absence of better information about what someone means; but the Web is moving more towards customization and personalization and localization. Our semantic frameworks should follow that lead. It means that we have to streamline our automated learning of semantic concepts so that we can in fact support individual solutions at least in part.

Perfect What?

22 Mar 2010

We have been musing about the true topology of semantic spaces and how this affects our concept of dimensionality. This segues logically into a hot area of contention. In our linear approximation of meaning, how many dimensions do we really need, and what should they be?

Some people prefer to approach this problem mathematically. Given a representative sample of documents to describe semantically, we can look at the relationship between terms and documents as defining a vector space. One can then apply the method of singular value decomposition (SVD) to find a minimal set of basis vectors to span that space. These singular vectors are like eigenvectors on steroids.
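
For readers who like to see the mechanics, here is a toy illustration with NumPy; the term-document counts are invented, and the retained dimensions are the “semantic” axes that the decomposition produces.

import numpy as np

# rows = terms, columns = documents; the counts are made up for illustration
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 3, 0, 1],
              [0, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                        # keep only the strongest dimensions
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # each document in the reduced space
print(doc_vectors)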

If you have actually read this far into this blog, then you will know that we (TextWise) have a competitor that employs SVD for semantic analysis. We get asked all the time why we have stuck with basic statistical techniques when we could instead be rigorously mathematical. Our usual response is that we have much faster turnaround in building semantic dictionaries, finer-grain descriptions of content, and more intuitive concepts overall.

There are more fundamental concerns, however, both theoretical and practical. On the theoretical side, SVD might be pushing a linear-space semantic model too far if meaning is in fact topologically complex. More significant on the practical side, though, is that one might be getting caught in the common problem of overtraining.

Suppose that we have a hundred thousand blog postings to which we apply SVD to get some optimal set of dimensions for analyzing their content. What then happens next week when we get a million new posts that we have never seen before? Our perfect basis set is now distinctly handicapped.

Now we could try to reprocess all our data here, but SVD is so computationally intensive as an algorithm that it probably will be too slow to keep up without extraordinary investments in hardware resources. We also would end up with an unstable system in which it is quite difficult to compare results from one week to the next. Anyway, we made our choice here.

People in the information sciences are fond of high-dimensional vector spaces as models of document content. These are in fact only approximations of reality, however; and in the specific case of semantics, they are probably an oversimplification. We already know something about how the neural circuitry in our brains works when we process the meaning of language; we can find no clean finite-dimensional linear space in the tangle of our synapses.

Neural imaging like PET does support the theory that linguistic concepts correspond to particular clusters of neurons connected in fairly complex feedback loops. Our understanding here is still quite limited, though. We do not know how many such clusters exist or how widely they are distributed. Visual concepts are in a different part of the brain than auditory concepts, for example; and overall, we have not yet found any obvious switchboard, say in the hippocampus, that could somehow tie everything together neatly.

In our computational semantic model, we assume that all concepts are independent and equal. That seems to work in semantic dictionary applications when we have thousands of concepts as dimensions, but an epistemologist here would have the lurking suspicion that our actual semantic space has to be some kind of complex manifold with all kinds of holes and twisting surfaces like a deranged n-th-order Moebius strip. Meaning is messy.
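
The flat, everything-is-orthogonal model we actually compute with looks, in miniature, something like the sketch below; the concept weights are invented, and the point is only that every dimension is treated as independent of every other.

import numpy as np

def cosine(u, v):
    # closeness of two documents in the flat concept space
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical weights for two documents over five independent concept dimensions
doc_a = np.array([0.9, 0.1, 0.0, 0.3, 0.0])
doc_b = np.array([0.7, 0.0, 0.2, 0.4, 0.1])
print(cosine(doc_a, doc_b))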

Our linear Euclidean model may therefore be valid only in a small local region of our actual semantic space, but in practice, that is really where all our apps have to live. One cannot presume to comprehend all possible content in text. We can only slice off a small piece of the pie of meaning, and until world peace and perfect enlightenment break out, that is a good start.

Recently I came across this blog post while following the #prodmgmt hashtag on Twitter: “Death by a thousand paper cuts…”  The author, Gopal Shenoy, talks about how visiting your customers gives you valuable insight into their daily ills and what you can do to fix them.

This article caused me to reflect on similar experiences I’ve had.  With a giant dose of humility and non-defensiveness, you can certainly learn a lot from your customers and see first-hand those things that start out as little frustrations but over time mount into productivity losses and, if you’re not careful, the loss of a loyal customer.

Now I know every Product Manager recognizes the importance of repeat customers, especially in today’s economic environment, but sometimes it’s easy to lose sight of what it really takes to keep them.  It’s not always what I call Big Feature X.  I’ve sat next to a customer who’s been trying unsuccessfully to import a massive spreadsheet of data into the application and getting bogged down in the process, and another who’s had to repeat the same step over and over again when a simple “update all users” option would have done the job.

I have also experienced a surprising standing ovation when I presented a single unattractive slide with a bulleted list of the small enhancements we’d made over the past six months. The customers in the audience stood up and cheered, and so did the sales reps. Why?  Because the customers felt we finally heard them.  I mean we really listened to our customers and we made their lives easier.

One of the customers came to me during dinner that evening and told me how she was looking forward to my Big Feature X and she was sure it would be awesome, but she doubted it would come close to the happiness she felt when she clicked that “update all users” option the very first time.  She was able to finish her month-end tasks in minutes instead of the 3-4 days she usually allocated to this task.

I walked away from that Sales Meeting slightly embarrassed, but with a lesson that I will never forget.  Listen to your customers.  Really listen. Whether you’re on site, or whether they’re talking via social media tools like Twitter and Facebook, or through blogs, forums or other web tools, listen to them.  Paper cuts hurt.  And thousands of them?  No way.  Sooner or later that paper’s going in the trash and the next thing you know, they’re reaching for a new sheet of paper from another stack. Ouch!

I use Google Reader religiously. Google Reader is one of my first web destinations in the morning and one of the last at the end of the day. I skim titles looking for clues as to what will be of interest to me (I only view items in the ‘list’ format, I’d be dead and buried by the time I got to the end if I viewed in the ‘expanded’ format). My Reader account keeps me informed, in-the-know, and on top of the latest bit of intelligence I can hope to find. I replaced my once beloved Bloglines for this service.

Now Google Reader has become my biggest nemesis, a time-waster if you will, and the bigger it gets, the worse it becomes. Loosely organized and, because it’s based on RSS feeds, not real-time by any stretch of the imagination (our own blog posts here on SemanticHacker.com take hours to show up), I can spend literally hours perusing the information tsunami that arrives on a daily basis. Like most of the population I have many interests, from keeping up with the latest social media marketing craze to what’s happening in the world of semantic web applications to finding fun toys at discount prices I can buy for my kids.

What are Google Reader’s Shortcomings?

1.      Organizing feeds is a manual process.
Every time I subscribe to a new RSS feed, I need to manually place it into a folder. Many times the feed I subscribe to spans several topics (TechCrunch is a prime example of this).

2.      The ‘starring’ option is an unusable feature.
Related to the point above: unless Google can automatically organize my starred items, this is as pointless as starring something in Gmail. Likewise for items I’ve shared and those the people I follow have shared.

3.      Search is nice, but of course, keyword-based.
Steve Rubel thinks it makes a good personal database to search using Google, but let’s face it – it doesn’t solve the overwhelming-amount-of-information problem. If I search for ‘Facebook acquisition,’ there is no context in which it searches. Essentially I’m still forced to filter through (in my case) 740 items.

How Google Reader Can Be Better

1.      Make it real-time.
Go beyond RSS. Allow me to add my Twitter accounts (I manage more than 1) and Facebook account. Maybe Google Caffeine gets us closer.

2.      Automatically organize my feeds.
Don’t make me create folders and force a feed into a single topic. Sure, it would be smart to allow the user to rename a folder, but the initial organization of the feed shouldn’t be so daunting or so simplistic – “marketing” is too broad, and I don’t have the time or patience to narrow these down further.

3.      Fix search.
A loaded statement for sure, but show me related items in my feeds just by using a specific article as the basis for a search. Blatant plug here, but if everything were indexed and tagged with a TextWise Semantic Signature, this feature would be a no-brainer (a rough sketch of the idea follows this list). Why rely on keyword matching when I already have an idea of what more I want to see?
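
To be clear about what I am asking for, here is a back-of-the-envelope sketch of “show me related items” in Python. Plain bag-of-words vectors stand in for a richer representation like a Semantic Signature; none of this is actual Google Reader or TextWise code.

import math
from collections import Counter

def vectorize(text):
    # plain bag-of-words counts; a semantic representation would go here instead
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def related_items(current_article, feed_items, top_n=5):
    # rank every feed item by similarity to the article being viewed
    query = vectorize(current_article)
    return sorted(feed_items, key=lambda item: cosine(query, vectorize(item)), reverse=True)[:top_n]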

Recently TechCrunch had their own idea of what else needs fixing with the newer “like” feature, and I fully concur, so I don’t think I need to rehash that. These are just a few of my wish-list items for Google Reader. They are making an effort to add functionality, so perhaps one day I will see something like one of the ideas mentioned above come to fruition.