Term-Document Matrix

In our discussion of cosine similarity, we assumed that language is just a "bag of words." Specifically, we assumed that grammar is unimportant and that words do not have synonyms or antonyms.

Analysis of a term-document matrix retains those assumptions, but its count-based methods help us identify the most frequently used words and the words associated with them.

The resulting frequencies and correlation coefficients do not help us identify synonyms or antonyms, but analysis of a term-document matrix does help us classify documents and understand relationships among words.

Suppose we have four documents with the following text:

"I spent the weekend with my niece."
"My niece likes swimming."
"My niece and I went swimming last weekend."
"We will go swimming this weekend and next weekend."

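With those documents, we can build a term-document matrix from the number of times each word appears in each document. Here the documents are in the rows and the words are in the columns, an orientation that is strictly called a document-term matrix (the transpose of a term-document matrix).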

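After stripping out "stop words" (i.e. common words like "the," "and," "my," "we," etc.), our term-document matrix might look like the table below. The table shows only four of the columns. The full matrix also has a column for each of the other remaining words (e.g. "spent," "went" and "last").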

term-document matrix
	likes	niece	swimming	weekend
A	0	1	0	1
B	1	1	1	0
C	0	1	1	1
D	0	0	1	2
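
As a minimal sketch of how such a matrix could be built, the base R code below counts the words in the four documents. The stop-word list is a hypothetical one chosen by hand for this example (it is not the list that tm uses), and the names docs, stop_words, tokens, terms and dtm are illustrative.

```r
# The four example documents, labelled A to D
docs <- c(A = "I spent the weekend with my niece.",
          B = "My niece likes swimming.",
          C = "My niece and I went swimming last weekend.",
          D = "We will go swimming this weekend and next weekend.")

# Hypothetical stop-word list, chosen by hand for this example only
stop_words <- c("i", "the", "with", "my", "and", "we", "will", "this")

# Lower-case, strip punctuation, split on whitespace, drop stop words
tokens <- lapply(docs, function(d) {
  words <- strsplit(gsub("[[:punct:]]", "", tolower(d)), "\\s+")[[1]]
  words[!(words %in% stop_words)]
})

# One column per distinct term, one row per document
terms <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens,
                function(w) tabulate(match(w, terms), nbins = length(terms)),
                integer(length(terms))))
colnames(dtm) <- terms

# Show the four columns that appear in the table above
dtm[, c("likes", "niece", "swimming", "weekend")]
```

With this particular stop-word list, the four selected columns match the table above. A different stop-word list would leave a different set of remaining columns.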

From the full matrix, including the columns not shown above, we can compute the cosine similarity matrix, which measures the similarity in word frequencies for each pair of documents.

cosine similarity matrix
	A	B	C	D
A	1.00	0.33	0.52	0.44
B	0.33	1.00	0.52	0.22
C	0.52	0.52	1.00	0.51
D	0.44	0.22	0.51	1.00
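
The cosine similarities can be checked with a few lines of base R. This is only a sketch: the nine-column matrix below is an assumption about the full matrix. In particular, the columns "go" and "next" are guesses about which words of document D survive stop-word removal. Under that assumption, the rounded result agrees with the table above.

```r
# Assumed full matrix: documents in rows, terms in columns.
# The "go" and "next" columns are assumptions, not taken from the text.
terms <- c("go", "last", "likes", "next", "niece",
           "spent", "swimming", "weekend", "went")
dtm <- matrix(c(0, 0, 0, 0, 1, 1, 0, 1, 0,
                0, 0, 1, 0, 1, 0, 1, 0, 0,
                0, 1, 0, 0, 1, 0, 1, 1, 1,
                1, 0, 0, 1, 0, 0, 1, 2, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "B", "C", "D"), terms))

# Cosine similarity: dot product of two rows divided by the product of their lengths
norms  <- sqrt(rowSums(dtm^2))
cosine <- (dtm %*% t(dtm)) / (norms %o% norms)

round(cosine, 2)
```

For example, documents A and B share only the word "niece" and each contains three counted words, so their similarity is 1 / (sqrt(3) * sqrt(3)) = 0.33.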

With the term-document matrix, we can also use Ingo Feinerer's tm (text mining) package for R to find frequently used terms and the terms that are correlated with them.

> findFreqTerms(docTerms, lowfreq=3)
[1] "niece"    "swimming" "weekend"

> findAssocs(docTerms, term = "niece", corlimit = 0.10)
 last likes spent  went
 0.33  0.33  0.33  0.33
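
To see what those two calls compute, the sketch below approximates them in base R, without the tm package. It is not tm's implementation. The nine-column matrix is an assumption about the full matrix (the "go" and "next" columns are guesses), and the names freq, r and assoc are illustrative.

```r
# Assumed full matrix: documents in rows, terms in columns.
# The "go" and "next" columns are assumptions, not taken from the text.
terms <- c("go", "last", "likes", "next", "niece",
           "spent", "swimming", "weekend", "went")
dtm <- matrix(c(0, 0, 0, 0, 1, 1, 0, 1, 0,
                0, 0, 1, 0, 1, 0, 1, 0, 0,
                0, 1, 0, 0, 1, 0, 1, 1, 1,
                1, 0, 0, 1, 0, 0, 1, 2, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "B", "C", "D"), terms))

# Frequent terms: total count across all documents of at least 3
freq <- colSums(dtm)
names(freq)[freq >= 3]

# Associations: Pearson correlation of each term's column with "niece",
# keeping only the correlations of at least 0.10
r <- cor(dtm)[, "niece"]
r <- r[names(r) != "niece"]
assoc <- round(r[r >= 0.10], 2)
assoc
```

Under these assumptions, the frequent terms are "niece," "swimming" and "weekend," and the associated terms are "last," "likes," "spent" and "went," each with a correlation of 0.33.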

Note, however, that findAssocs returns only the terms whose correlation with the search term is at least corlimit. Terms that are negatively correlated with the search term, such as "swimming" (-0.33) and "weekend" (-0.82), are omitted, as can be seen from the correlation matrix for four of the terms below.

correlation matrix
	likes	niece	swimming	weekend
likes	1.00	0.33	0.33	-0.82
niece	0.33	1.00	-0.33	-0.82
swimming	0.33	-0.33	1.00	0.00
weekend	-0.82	-0.82	0.00	1.00
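
The correlation matrix above can be reproduced from the four columns of the term-document matrix shown earlier. The sketch below uses base R's cor function. The names tdm, corr and neg are illustrative.

```r
# The four columns of the term-document matrix shown earlier
tdm <- matrix(c(0, 1, 0, 1,
                1, 1, 1, 0,
                0, 1, 1, 1,
                0, 0, 1, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "B", "C", "D"),
                              c("likes", "niece", "swimming", "weekend")))

# Pearson correlation between every pair of term columns
corr <- round(cor(tdm), 2)
corr

# Terms negatively correlated with "niece", which findAssocs leaves out
neg <- corr["niece", corr["niece", ] < 0]
neg
```

The last line lists "swimming" (-0.33) and "weekend" (-0.82), the two terms in this table that a positive corlimit excludes.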
