Text Mining

similarity between documents

One purpose of text mining is to compare documents for similarity. Under a few simplifying assumptions, document comparison is a good place to start, and it provides a useful benchmark against which to compare more sophisticated techniques.

So let's start by assuming that language is just a "bag of words." Specifically, let's assume that word order (and therefore grammar) is not important, and that words do not have synonyms or antonyms. Under these assumptions, we can measure the similarity between two documents by counting the words in each document and comparing the counts with the cosine measure.

Before discussing the details of the cosine measure, let's start with a simple example. Suppose that document A contains the words: {up,up,down} and document B contains the words: {up,down,down}.

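Since there are only two unique words in the documents (i.e. "up" and "down"), we can plot the documents as vectors in two-dimensional space (i.e. one dimension for each word). For example, if the first axis counts "up" and the second axis counts "down", then document A is the vector (2, 1) and document B is the vector (1, 2).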
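The counting step can be sketched in a few lines of Python. This is only an illustration: the names `doc_a`, `doc_b`, `vocabulary` and `count_vector` are hypothetical, and the axis order ("up" first, "down" second) is an arbitrary choice.

```python
from collections import Counter

# The two example documents, each treated as a bag of words.
doc_a = ["up", "up", "down"]
doc_b = ["up", "down", "down"]

# Assumed axis order: one dimension per unique word.
vocabulary = ["up", "down"]

def count_vector(words, vocabulary):
    """Return the word counts of a document, one entry per vocabulary word."""
    counts = Counter(words)
    return [counts[word] for word in vocabulary]

vec_a = count_vector(doc_a, vocabulary)
vec_b = count_vector(doc_b, vocabulary)

print(vec_a)  # [2, 1]
print(vec_b)  # [1, 2]
```

Any axis order works, as long as both documents use the same one.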

cosine measure of similarity

In this simple example, the cosine of the angle between the two vectors, cos(θ), is our measure of the similarity between the two documents. In the example above, the angle is about 37° and cos(37°) ≈ 0.80.

Note that if both vectors were the same (e.g. if both documents contained one "up" and one "down"), then the angle would be zero degrees and the cosine measure of similarity would be one (i.e. cos(0°) = 1.00).

Note also that if the documents had no words in common, then the vectors would meet at a right angle and the cosine measure of similarity would be zero (i.e. cos(90°) = 0.00).
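The angle can be checked geometrically. The sketch below assumes the two documents are the vectors (2, 1) and (1, 2), with "up" on the first axis and "down" on the second. It measures each vector's angle from the first axis and takes the difference.

```python
import math

# Assumed coordinates: (count of "up", count of "down").
vec_a = (2, 1)  # document A: {up, up, down}
vec_b = (1, 2)  # document B: {up, down, down}

# Angle of each vector measured from the "up" axis, in radians.
angle_a = math.atan2(vec_a[1], vec_a[0])
angle_b = math.atan2(vec_b[1], vec_b[0])

# Angle between the two document vectors.
theta = abs(angle_a - angle_b)

print(round(math.degrees(theta), 2))  # 36.87
print(round(math.cos(theta), 2))      # 0.8
```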

In the more general case, where the documents contain many unique words, we can calculate the cosine measure as the dot product of the two vectors, A · B, divided by the product of the lengths of the two vectors, ||A|| · ||B||:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \cdot \|B\|}$$

where the dot product is:

$$A \cdot B = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$$

where the length of a vector is:

$$\|A\| = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}$$

and where $a_i$ is the number of times that word i occurs in document A.
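As a minimal sketch, the three formulas translate directly into Python. The function names `dot`, `length` and `cosine_similarity` are illustrative choices, and the vectors are assumed to be equal-length lists of word counts over the same vocabulary.

```python
import math

def dot(a, b):
    """Dot product: a1*b1 + a2*b2 + ... + an*bn."""
    return sum(x * y for x, y in zip(a, b))

def length(a):
    """Euclidean length: square root of the sum of squared counts."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    denominator = length(a) * length(b)
    if denominator == 0:
        # An empty document has a zero-length vector, so no angle is defined.
        raise ValueError("cosine similarity is undefined for an empty document")
    return dot(a, b) / denominator

# Document A = {up, up, down}, document B = {up, down, down}.
print(round(cosine_similarity([2, 1], [1, 2]), 2))  # 0.8
```

For the two example documents, the dot product is 2·1 + 1·2 = 4 and each vector has length √5, so the measure is 4/5 = 0.80.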

Because word counts must be non-negative, the cosine measure will always return a value between zero and one. The measure will be zero when the two documents have no words in common, and it will be one when the two documents use the same words in the same proportions (as identical documents do).
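The two extremes can be illustrated with a short sketch that works directly on word lists. The helper `cosine_from_words` is a hypothetical name, and both documents are assumed to be non-empty.

```python
import math
from collections import Counter

def cosine_from_words(words_a, words_b):
    """Cosine measure computed from two non-empty lists of words."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    # Words missing from a document count as zero, so only shared words
    # contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    len_a = math.sqrt(sum(c * c for c in counts_a.values()))
    len_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (len_a * len_b)

# No shared words: the vectors meet at a right angle.
no_overlap = cosine_from_words(["up", "up"], ["down"])

# Same proportions of each word, but different document lengths.
same_mix = cosine_from_words(["up", "down"], ["up", "up", "down", "down"])

print(round(no_overlap, 2))  # 0.0
print(round(same_mix, 2))    # 1.0
```

The second case shows that the measure depends on the direction of the vector, not on its length: doubling every word count leaves the similarity unchanged.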

Eryk Wdowiak
last updated: 22 June 2015