Text Mining

term-document matrix

In our discussion of similarity between documents, we assumed that language is just a "bag of words." Specifically, we assumed that grammar is unimportant and that words do not have synonyms or antonyms.

Analysis of a term-document matrix retains those assumptions, but its count-based methods help us identify the most frequently used words and the words that tend to appear alongside them.

The resulting frequencies and correlation coefficients do not help us identify synonyms or antonyms (for that we need WordNet), but analysis of a term-document matrix does help us classify documents and understand relationships among words.

Suppose we have four documents with the following text:

  1. "I spent the weekend with my niece."
  2. "My niece likes swimming."
  3. "My niece and I went swimming last weekend."
  4. "We will go swimming this weekend and next weekend."

With those documents, we can build a term-document matrix from the number of times each word appears in each document. Here the documents are in the rows and the words are in the columns (the orientation that the tm package calls a document-term matrix).

After stripping out "stop words" (i.e. common words like "the," "and," "my" and "we"), our term-document matrix might be:

         likes   niece   swimming   weekend
doc 1      0       1        0          1
doc 2      1       1        1          0
doc 3      0       1        1          1
doc 4      0       0        1          2
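
As a rough sketch of how such counts can be produced, the base R code below tokenizes the four example documents and counts a hand-picked list of terms. The names docs, terms, tokens and counts are illustrative only, and picking the terms by hand is an assumption that stands in for stop-word removal; it is not how the tm package builds its matrix.

```r
# The four example documents from the text.
docs <- c("I spent the weekend with my niece.",
          "My niece likes swimming.",
          "My niece and I went swimming last weekend.",
          "We will go swimming this weekend and next weekend.")

# Terms kept for this illustration (assumption: chosen by hand).
terms <- c("likes", "niece", "swimming", "weekend")

# Lower-case the text, drop punctuation and split on whitespace.
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(docs)), "[[:space:]]+")

# Count each term in each document: documents in rows, terms in columns.
counts <- sapply(terms, function(term) {
  sapply(tokens, function(words) sum(words == term))
})
rownames(counts) <- paste("doc", seq_along(docs))

print(counts)
```

Each row of counts should match the corresponding row of the matrix above.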

With a term-document matrix like the one above (stored here as docTerms), we can use Ingo Feinerer's tm (text mining) package for R to find frequently used terms and the terms that are most highly correlated with them.

> findFreqTerms(docTerms, lowfreq=3)
[1] "niece" "swimming" "weekend"
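
The frequency filter can be approximated in base R by summing each column and keeping the terms whose total is at least 3. This is only a sketch on the four-term matrix shown earlier; the names counts, totals and frequent are illustrative, and it is not the tm implementation.

```r
# Four-term matrix from the text: documents in rows, terms in columns.
counts <- matrix(c(0, 1, 0, 1,
                   1, 1, 1, 0,
                   0, 1, 1, 1,
                   0, 0, 1, 2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste("doc", 1:4),
                                 c("likes", "niece", "swimming", "weekend")))

# Total number of occurrences of each term across all documents.
totals <- colSums(counts)

# Keep the terms that occur at least three times in total.
frequent <- names(totals)[totals >= 3]

print(totals)
print(frequent)
```

Here "likes" occurs once, "niece" and "swimming" three times each and "weekend" four times, so only "likes" is filtered out.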

> findAssocs(docTerms, term = "niece" , corlimit = 0.10)
  last likes spent went
  0.33  0.33  0.33 0.33
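
The association lookup can be approximated in base R by correlating the "niece" column with every other column and keeping the correlations that reach the limit. The sketch below assumes a matrix that also holds the columns "last," "spent" and "went," with counts read off the four example sentences; the names m, others, r and assoc are illustrative, and this is not the tm implementation.

```r
# Assumed matrix: one row per document, counts taken from the example sentences.
m <- cbind(last     = c(0, 0, 1, 0),
           likes    = c(0, 1, 0, 0),
           niece    = c(1, 1, 1, 0),
           spent    = c(1, 0, 0, 0),
           swimming = c(0, 1, 1, 1),
           weekend  = c(1, 0, 1, 2),
           went     = c(0, 0, 1, 0))

# Pearson correlation of "niece" with each of the other terms.
others <- m[, colnames(m) != "niece"]
r <- cor(m[, "niece"], others)[1, ]

# Keep only the correlations of at least 0.10, rounded to two decimals.
assoc <- round(r[r >= 0.10], 2)

print(assoc)
```

In this sketch "swimming" and "weekend" fall below the limit because their correlations with "niece" are negative.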

Note, however, that findAssocs only returns words that are positively correlated with the search term. Words that are negatively correlated with the search term are omitted: "swimming" (-0.33) and "weekend" (-0.82) do not appear in the output for "niece," as can be seen from the correlation matrix for our four terms below.

            likes   niece   swimming   weekend
likes        1.00    0.33     0.33      -0.82
niece        0.33    1.00    -0.33      -0.82
swimming     0.33   -0.33     1.00       0.00
weekend     -0.82   -0.82     0.00       1.00
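
A correlation matrix like this one can be computed with base R's cor function, which returns the Pearson correlation between every pair of columns. The sketch below uses the four-term matrix from earlier; the names counts and corr are illustrative.

```r
# Four-term matrix from the text: documents in rows, terms in columns.
counts <- matrix(c(0, 1, 0, 1,
                   1, 1, 1, 0,
                   0, 1, 1, 1,
                   0, 0, 1, 2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste("doc", 1:4),
                                 c("likes", "niece", "swimming", "weekend")))

# Pairwise Pearson correlations between the term columns, to two decimals.
corr <- round(cor(counts), 2)

print(corr)
```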

Eryk Wdowiak
last updated: 23 June 2015
