Natural Language


The term-document matrix helps us identify frequently used words and the words associated with them. But it's often helpful to have a visual representation of those frequently used words as well.
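As a quick illustration of that kind of exploration, here is a minimal sketch using tm's built-in "crude" corpus of Reuters articles (an assumption for illustration -- the SMS data discussed below is not reproduced here):

```r
library(tm)

# "crude" is a small corpus of Reuters articles shipped with the tm package
data("crude")
tdm <- TermDocumentMatrix(crude)

findFreqTerms(tdm, lowfreq = 10)          # words used at least 10 times
findAssocs(tdm, "oil", corlimit = 0.7)    # words correlated with "oil"
```

The same two functions -- `findFreqTerms()` and `findAssocs()` -- work on any term-document matrix built from a tm corpus.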

Wordclouds provide that visual aid, so this page will briefly explain how to create one with the wordcloud package for R. Comparing two wordclouds helps us notice differences in word choice, so this page will also explain how to test the null hypothesis of no significant difference in usage of a given word.

We'll start by creating a "corpus" of text documents with Feinerer's tm (text mining) package for R and applying a few transformations (such as the removal of "stop words" and numbers).

> smsDF <- Corpus(DataframeSource(smsMsg),
+                 readerControl = list(reader = smsReader))

> smsDF <- tm_map(smsDF, removeWords, stopwords("english"))

> smsDF <- tm_map(smsDF, removePunctuation)

> smsDF <- tm_map(smsDF, removeNumbers)

> smsDF <- tm_map(smsDF, stripWhitespace)

Next, we create the wordcloud.

> wordcloud(smsDF, min.freq=10, max.words=50)

Our next step is to compare two wordclouds, take note of word differences between the two, and test the null hypothesis of no significant difference in the usage of a given word.

[Figure: comparison of the two wordclouds]

A quick look at the two clouds suggests that my sister (Lauren) uses the word "love" more often than I do. And in fact, she used the word "love" in 11 of the 63 messages that she sent to me, whereas I only used the word "love" in 6 of the 143 messages that I sent to her.
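Counts like these can be obtained directly from the message text. A minimal sketch, assuming the messages are held in character vectors (the short vectors below are hypothetical stand-ins -- the real SMS data is not reproduced here):

```r
# hypothetical stand-ins for the real message vectors
laurenMsgs <- c("love you!", "see you at dinner", "love the photos")
erykMsgs   <- c("running late", "ok", "love ya")

# number of messages containing the word "love" (case-insensitive)
sum(grepl("\\blove\\b", tolower(laurenMsgs)))
sum(grepl("\\blove\\b", tolower(erykMsgs)))
```

The `\\b` word boundaries keep the pattern from matching words like "lovely".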

Using R's prop.test() function to perform the χ² test of the difference in proportions, we find that we can reject the null hypothesis of no difference in the proportions of messages that contain the word "love."
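The test itself is a one-liner. Using the message counts reported above (11 of 63 for Lauren, 6 of 143 for me):

```r
# two-sample test of equal proportions with continuity correction
result <- prop.test(c(11, 6), c(63, 143))

result$estimate    # the two sample proportions
result$p.value     # small enough to reject the null at the 5% level
```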

Put differently, we accept the alternative hypothesis that my sister sends her "love" more often than I do ... and I'd better send my sister some flowers.


Eryk Wdowiak
last updated: 23 June 2015