We have been musing about the true topology of semantic spaces and how this affects our concept of dimensionality. This segues logically into a hot area of contention. In our linear approximation of meaning, how many dimensions do we really need, and what should they be?
Some people prefer to approach this problem mathematically. Given a representative sample of documents to describe semantically, we can look at the relationship between terms and documents as defining a vector space. One can then apply the method of singular value decomposition (SVD) to find a minimal set of basis vectors to span that space. These singular vectors are like eigenvectors on steroids.
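To make the idea concrete, here is a minimal sketch in Python with NumPy. The six terms and four documents are made up purely for illustration, and the choice of k = 2 is an assumption, not anything taken from a real system. It factors a small term-document matrix with SVD and keeps only the strongest singular vectors as the basis.

```python
import numpy as np

# Hypothetical toy corpus: rows are terms, columns are documents.
# Terms: cat, dog, pet, stock, bond, market
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, with singular values sorted largest first.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k strongest directions as the reduced "semantic" basis.
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k approximation of the original matrix, and how much it loses.
A_k = U_k @ np.diag(s_k) @ Vt_k
rel_error = np.linalg.norm(A - A_k) / np.linalg.norm(A)

print("singular values:", np.round(s, 4))
print("relative error of rank-%d approximation: %.4f" % (k, rel_error))
```

The columns of `U_k` play the role of the reduced set of dimensions, and `rel_error` gives a rough sense of how much of the sample is lost by truncating to k of them.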
If you have actually read this far into this blog, then you will know that we (TextWise) have a competitor that employs SVD for semantic analysis. We get asked all the time why we have stuck with basic statistical techniques when we could instead be rigorously mathematical. Our usual response is that we have much faster turnaround in building semantic dictionaries, finer-grained descriptions of content, and more intuitive concepts overall.
There are more fundamental concerns, however, both theoretical and practical. On the theoretical side, SVD might be pushing a linear-space semantic model too far if meaning is in fact topologically complex. More significant, though, is the practical side: one might be getting caught in the common problem of overtraining.
Suppose that we have a hundred thousand blog postings to which we apply SVD to get some optimal set of dimensions for analyzing their content. What then happens next week when we get a million new postings that we have never seen before? Our perfect basis set, fitted to the terms and topics of the original sample, is now distinctly handicapped.
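Here is a rough illustration of that handicap, again in Python with NumPy. Everything in it is hypothetical: the tiny vocabulary, the three training documents, and the helper `captured_fraction`, which is just our name for the share of a document's squared length that survives projection onto the learned basis. The basis is fitted to documents about pets only and then asked to describe documents it has never seen.

```python
import numpy as np

# Hypothetical term-document counts. Rows are terms, columns are documents.
# Terms: cat, dog, pet, stock, bond, market
# The training sample happens to contain only documents about pets.
train = np.array([
    [2, 1, 1],
    [1, 2, 1],
    [1, 1, 2],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
], dtype=float)

# Fit the basis to the training sample only.
U, s, Vt = np.linalg.svd(train, full_matrices=False)
k = 3  # assumption: keep every direction the sample supports
basis = U[:, :k]

def captured_fraction(basis, doc):
    """Share of a document's squared length that lies inside the basis."""
    coords = basis.T @ doc
    return float(coords @ coords) / float(doc @ doc)

seen_doc = train[:, 0]                                     # from the sample
unseen_doc = np.array([0, 0, 0, 2, 1, 1], dtype=float)     # finance only
mixed_doc = np.array([1, 0, 0, 1, 0, 0], dtype=float)      # cat plus stock

seen_score = captured_fraction(basis, seen_doc)
unseen_score = captured_fraction(basis, unseen_doc)
mixed_score = captured_fraction(basis, mixed_doc)

print("seen:   %.3f" % seen_score)
print("unseen: %.3f" % unseen_score)
print("mixed:  %.3f" % mixed_score)
```

In this contrived case the basis describes the sample document completely, sees nothing at all in the finance document, and captures only half of the mixed one. Real collections are never this clean, but the direction of the effect is the same.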
Now we could try to reprocess all our data, but SVD is so computationally intensive that it would probably be too slow to keep up without extraordinary investments in hardware. We would also end up with an unstable system in which it is quite difficult to compare results from one week to the next, since the basis itself would change with every rerun. Anyway, we made our choice here.