Aristotle lived about 2,400 years ago, well before the advent of the World Wide Web. Yet his ideas drive the still-emerging Semantic Web. In fact, we could probably do a better job as modern information scientists if we paid a bit more attention to the ancient Greek philosopher.
In his treatise “Categories,” Aristotle addressed the problem of meaning in language and developed a logical framework for semantics. In this work, he invented the theory of subjects and predicates, which modern grammar and formal logic have adopted. This was, in effect, RDF version 0.0.0.
Aristotle also talked about using taxonomies (from the Greek τάξις + νόμος) to define the meanings of concepts, introducing “genus” and “species” as essential relationships. This approach was adopted by Linnaeus in the 18th century to catalog the great diversity of life on Earth, and more than a hundred years later, formal taxonomies made their way into library science.
Of special interest to us here is Aristotle’s classification of the predicates associated with definitions of meaning. He defined five types: genus, species, difference, property, and accident. The first two are already familiar to information scientists as IS-A relationships. A difference predicate relates to a defining characteristic for a concept. A property is an important characteristic for a concept, but not sufficient to define it. An accident is a true predicate that makes no contribution to meaning.
For example,
(genus/species) Angelina Jolie is an American movie star.
(difference) She is the daughter of American actor Jon Voight.
(property) She trained at the Lee Strasberg Theatre Institute.
(accident) She visited Costa del Sol.
In automated building of semantic dictionaries, our problem is with accidental predicates. Such predicates have only a weak relationship to a subject and tend to lead to noisy inferred associations. We probably do not want to retrieve a news item about Angelina Jolie given a query about Costa del Sol.
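To make the retrieval problem concrete, here is a minimal sketch in Python. It assumes each statement already carries a hand-assigned Aristotelian label, which is exactly what real text data lacks; the names `PredicateKind`, `Statement`, and `subjects_for` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class PredicateKind(Enum):
    """The predicate types used in the example above (labels are hand-assigned)."""
    GENUS_SPECIES = "genus/species"
    DIFFERENCE = "difference"
    PROPERTY = "property"
    ACCIDENT = "accident"


@dataclass(frozen=True)
class Statement:
    """A subject-predicate-object statement tagged with its predicate type."""
    subject: str
    predicate: str
    obj: str
    kind: PredicateKind


# Toy data: the four example statements, labeled by hand (an assumption).
STATEMENTS = [
    Statement("Angelina Jolie", "is a", "American movie star", PredicateKind.GENUS_SPECIES),
    Statement("Angelina Jolie", "daughter of", "Jon Voight", PredicateKind.DIFFERENCE),
    Statement("Angelina Jolie", "trained at", "Lee Strasberg Theatre Institute", PredicateKind.PROPERTY),
    Statement("Angelina Jolie", "visited", "Costa del Sol", PredicateKind.ACCIDENT),
]


def subjects_for(query, statements, include_accidents=True):
    """Return the subjects linked to `query` by any qualifying statement."""
    return sorted({
        s.subject
        for s in statements
        if s.obj.lower() == query.lower()
        and (include_accidents or s.kind is not PredicateKind.ACCIDENT)
    })


# Treating all predicates alike links Jolie to a query about Costa del Sol.
print(subjects_for("Costa del Sol", STATEMENTS))
# Dropping accidental predicates removes that noisy association.
print(subjects_for("Costa del Sol", STATEMENTS, include_accidents=False))
```

With the labels in hand, filtering is trivial. The hard part, of course, is that nobody labels predicates for us.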
Unfortunately, many and perhaps most predicates in text data are accidental. Current data-driven semantic learning systems make no such distinction, so there is room for major improvement. One possible approach is to use text summarization techniques to identify the most important “predicates” in our data and thus bias our statistics away from accidents and toward properties and differences. Aristotle would be amused.
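As a rough illustration of the summarization idea, the sketch below weights each co-occurrence by the salience of the sentence it appears in, instead of counting every co-occurrence as 1. The salience score is a simple Luhn-style word-frequency heuristic standing in for a real summarizer. The toy sentences, the stopword list, and the function names are all assumptions made for this example.

```python
import re
from collections import Counter
from itertools import combinations

# A tiny, hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "and", "is", "was", "her", "she",
             "to", "with", "for", "on", "at", "as", "that"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def sentence_salience(sentences):
    """Score each sentence by the average document frequency of its content
    words, then scale so the top sentence scores 1.0."""
    doc_freq = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for s in sentences:
        words = tokenize(s)
        scores.append(sum(doc_freq[w] for w in words) / len(words) if words else 0.0)
    top = max(scores, default=0.0)
    return [score / top if top else 0.0 for score in scores]


def weighted_associations(sentences, entities):
    """Accumulate entity-pair weights, adding each sentence's salience
    rather than a flat count of 1 per co-occurrence."""
    weights = Counter()
    for sentence, salience in zip(sentences, sentence_salience(sentences)):
        present = sorted(e for e in entities if e.lower() in sentence.lower())
        for pair in combinations(present, 2):
            weights[pair] += salience
    return weights


# Toy document (invented for illustration).
DOC = [
    "Angelina Jolie is an American actress and film star.",
    "The actress Angelina Jolie is the daughter of actor Jon Voight.",
    "Jolie won an award for her film work as an actress.",
    "Angelina Jolie visited Costa del Sol last summer.",
]
ENTITIES = ["Angelina Jolie", "Jon Voight", "Costa del Sol"]

for pair, weight in weighted_associations(DOC, ENTITIES).most_common():
    print(pair, round(weight, 3))
```

In this toy document, raw counts would tie the two pairs at one co-occurrence each. With salience weighting, the Jolie and Voight pair comes out ahead of the Jolie and Costa del Sol pair, because the vacation sentence shares fewer content words with the rest of the text. A real summarizer would presumably separate the two far more sharply.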