Text Mining

:: previous ::
language

:: next ::
similarity

motivation

Statistics and probability theory help us understand large amounts of numeric data. But most information is stored in human language, so we also need statistical tools and techniques to help us understand linguistic data. These notes document some of those tools and techniques.

We will start by comparing documents for similarity. Then we will create a term-document matrix and a word cloud. In a future version of these notes, we will parse sentences with Link Grammar to identify subjects, verbs and objects. In a separate project, we will group similar words together with WordNet.


Eryk Wdowiak
last updated: 30 November 2017
