The TextWise Semantic Signatures® technology provides relevant data (matches, tags, and so on) for textual content. Measuring the relevance quality of our technology has presented us with some challenges: What data do we use to test relevance? Who judges the relevance quality? On what scale do they judge relevance? And how do we give those judges meaningful instructions for making a qualitative judgment?
We quickly determined that we needed unbiased, external judges to test our relevance quality and, in order to measure change over time, a static set of test data. External judges were selected from candidates who demonstrated an ability to read attentively; no particular expertise in semantics was required. The test data was collected to represent the variety of content types that someone using our API might encounter. Since we perform a variety of relevance tests, it is important to choose a judging scale that fits the purpose of each test. In a recent matching test, for example, we used a four-point scale, which allowed some distinction among degrees of relevance without overwhelming the judges.
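For concreteness, here is how a four-point scale like the one from that matching test might be modeled in code. The category names and the sample judgment record are illustrative, not our actual labels:

```python
from enum import IntEnum

class Relevance(IntEnum):
    """A four-point relevance scale. Labels here are illustrative,
    not the actual category names from our guidelines."""
    IRRELEVANT = 0        # no meaningful connection to the source text
    MARGINAL = 1          # touches the topic, but not useful
    RELEVANT = 2          # clearly on topic
    HIGHLY_RELEVANT = 3   # on topic and directly useful

# A judgment pairs a test item with the score one judge assigned it.
judgment = {"item_id": "doc-0042", "judge": "J1", "score": Relevance.RELEVANT}
```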
Writing the instructions for the judges was more challenging. We knew that the guidelines we gave to the external judges would have to be clear and concise. The judges would need an explanation of each degree of relevance, supported by examples from which they could generalize to the variety of cases they might see in the data. At the same time, the guidelines had to be brief enough that judges could refer to them easily during the judging task and quickly refresh their memory before assignments that might come months apart.
Most importantly, we first needed an internal consensus on how to define our relevance scale. Getting this internal consensus required multiple rounds of reviewing drafts of the guidelines, performing judgments, and discussing our differences. Participants in this process were drawn from the science, quality assurance, and product management teams to ensure that this was a company-wide initiative. After each round of judging, we calculated the degree of agreement using a kappa coefficient, a standard measure of reliability among multiple judges.
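For readers who want the mechanics: with more than two judges, a common formulation is Fleiss' kappa, which compares the agreement actually observed against the agreement expected by chance. A minimal sketch, assuming every item is rated by the same number of judges (the function and the sample ratings are illustrative, not our test data):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for multiple judges.

    counts[i][j] = number of judges who assigned item i to category j.
    Assumes every item is rated by the same number of judges.
    """
    N = len(counts)         # number of items
    n = sum(counts[0])      # judges per item
    k = len(counts[0])      # number of categories

    # Observed agreement: mean pairwise agreement per item.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ) / N

    # Chance agreement: based on the marginal category proportions.
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)

# Three judges rating four items on the four-point scale above: each row
# counts how many judges chose each category for that item.
ratings = [
    [0, 0, 1, 2],   # one judge said RELEVANT, two said HIGHLY_RELEVANT
    [0, 3, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 3, 0],
]
print(f"kappa = {fleiss_kappa(ratings):.3f}")
```

A kappa of 1.0 means perfect agreement, 0 means agreement no better than chance, and negative values indicate systematic disagreement.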
Once we reached internal agreement, we trained the external judges using the guidelines. Our test data included a subset reserved specifically for training, and our final internal judgments were retained as an answer key for that set. Again, agreement among the judges was measured with kappa. Once the judges are trained, we continue to monitor performance on a small shared data set for which every judge submits judgments. When the kappa measure shows drift among the judges, we run a retraining exercise.
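The drift check itself can be as simple as recomputing kappa on the shared monitoring set and comparing it against a floor. A sketch reusing the fleiss_kappa function above; the threshold value is an assumption for illustration, not our actual cutoff:

```python
RETRAIN_THRESHOLD = 0.6  # illustrative cutoff, not our actual policy

def needs_retraining(monitoring_counts):
    """Check agreement on the shared monitoring set.

    monitoring_counts has the same shape as the fleiss_kappa input:
    one row per monitoring item, one column per scale category.
    """
    kappa = fleiss_kappa(monitoring_counts)
    return kappa < RETRAIN_THRESHOLD, kappa

retrain, kappa = needs_retraining(ratings)
if retrain:
    print(f"kappa = {kappa:.2f}; schedule a retraining exercise")
```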
After the external judges are trained, they can perform judgments on the larger data set. Relevance judgments are done on major releases and on minor releases that include changes to any components that could impact relevance. Judgments are retained in a database from one relevance test to the next so that any given judgment only needs to be performed once. When a test is completed, we use multiple statistical measures to analyze the outcome.
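Judgment reuse amounts to a lookup keyed by the pair being judged. A minimal in-memory sketch with hypothetical names (our actual store is a database, not a dict):

```python
# Hypothetical sketch of judgment reuse; keys and names are illustrative.
# Each judgment is keyed by the (source text, matched result) pair it scores.
judgment_db = {}  # (source_id, result_id) -> score on the relevance scale

def judgments_needed(test_pairs):
    """Return only the pairs no judge has scored in a previous test."""
    return [pair for pair in test_pairs if pair not in judgment_db]

def record_judgment(source_id, result_id, score):
    """Store a completed judgment so later tests can reuse it."""
    judgment_db[(source_id, result_id)] = score

# Only pairs produced by new system output go out to the external judges;
# pairs already judged in an earlier release are served from the store.
new_work = judgments_needed([("doc-0042", "match-17"), ("doc-0042", "match-18")])
```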