Thirukkural Word Cloud

Building a word cloud seemed scary until I found out I could do it myself with a few statistical packages available in R.

For this expedition I decided to build a word cloud on Thirukkural. Thirukkural is one of the finest masterpieces of Tamil literature, written during the Tamil Sangam period. It consists of 133 chapters, each containing 10 couplets, which gives 133 * 10 = 1330 couplets in total. Depending on their message or meaning, the couplets are categorised into three groups, namely Aram, Porul and Inbam. Now let me stop the lecture on Thirukkural and jump to the actual point of this post.

The rest of this post explains the process I followed to build a word cloud on Thirukkural. All source code and the corpus used in this exercise can be found on my GitHub account.

Steps in building the word cloud on Thirukkural:

Document preparation.
Building the term frequency matrix.
Constructing the word cloud.

Document preparation.

While I could have used many reference websites to obtain the Thirukkural corpus, I decided to crawl gokulnath.com: its pages are neatly organized and the target text is reachable through either HTML or CSS selectors. After analyzing the HTML DOM structure I wrote a simple web crawler that iterates over the pages and extracts the couplets from each chapter. After extraction, the crawler writes all the couplets from the same chapter to a separate file. Thus at the end of the crawl I had 133 files on my machine, each containing 10 couplets.
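The file-writing half of that step can be sketched in base R. This is only an illustration, not the crawler itself: the `chapters` list, the placeholder strings, the output directory and the `chapter_NNN.txt` naming scheme are all assumptions made up for the example, standing in for whatever the real crawler extracted.

```r
# Hypothetical stand-in for the crawler's output: one character vector of
# couplets per chapter. Placeholder strings are used instead of real text.
n_chapters <- 133
couplets_per_chapter <- 10
chapters <- lapply(seq_len(n_chapters), function(ch) {
  sprintf("chapter %d couplet %d (placeholder)", ch, seq_len(couplets_per_chapter))
})

# Assumed output location; each chapter goes into its own UTF-8 text file.
out_dir <- file.path(tempdir(), "thirukkural_corpus")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

for (ch in seq_along(chapters)) {
  path <- file.path(out_dir, sprintf("chapter_%03d.txt", ch))
  con <- file(path, open = "w", encoding = "UTF-8")
  writeLines(chapters[[ch]], con)
  close(con)
}

# Number of chapter files written.
n_files <- length(list.files(out_dir, pattern = "\\.txt$"))
print(n_files)
```

Keeping one file per chapter means each file can later be treated as one document when the matrix is built.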

The next step is to build the term frequency document (TFD) matrix from the corpus. The TFD matrix has 133 columns and 6566 rows. Rows correspond to terms and columns correspond to documents, so each cell holds the frequency of a term in the corresponding column (document). After constructing the matrix we sum each row, which gives the total frequency of that term across the entire corpus. These totals are the input to the word cloud function.
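To make the row-sum idea concrete, here is a minimal base R sketch on three toy documents. The documents, the variable names and the whitespace tokenization are assumptions for illustration only; the real matrix is built from the 133 chapter files.

```r
# Three tiny placeholder "documents" standing in for chapter files.
docs <- c(
  doc1 = "a b a c",
  doc2 = "b b d",
  doc3 = "a d d d"
)

# Split each document into terms on white space.
tokens <- strsplit(docs, "\\s+")

# All distinct terms in the corpus, sorted for a stable row order.
terms <- sort(unique(unlist(tokens)))

# Term-by-document matrix: rows are terms, columns are documents.
tfd <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
rownames(tfd) <- terms

# Total frequency of each term across the whole corpus.
term_totals <- rowSums(tfd)
print(tfd)
print(term_totals)
```

Here `tfd` has one row per distinct term and one column per document, the same orientation as the 6566 by 133 matrix described above.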

After building the corpus, we use the package "tm", which handles text mining tasks such as stripping white space, stemming the text and building the TFD matrix. Since our corpus is in Tamil we will not worry about stemming. The package "wordcloud" is the presentation layer: it draws the word cloud and lets us customize the colors and the number of words drawn on the graph.
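A rough sketch of that tm step might look like the following. It is not the original script: the directory name and the two placeholder files are assumptions so the example runs on its own, and the `wordLengths` control is an assumed choice to keep short words, since tm drops terms shorter than three characters by default. How well the default tokenizer handles Tamil text may depend on locale and encoding, so treat this as a starting point.

```r
library(tm)

# Hypothetical corpus directory with two tiny placeholder files; the real
# run would point at the directory holding the 133 chapter files.
corpus_dir <- file.path(tempdir(), "tm_demo_corpus")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines("a  b   a c", file.path(corpus_dir, "chapter_001.txt"))
writeLines("b b d", file.path(corpus_dir, "chapter_002.txt"))

# Read every file in the directory as one document.
corpus <- VCorpus(DirSource(corpus_dir, encoding = "UTF-8"))

# Collapse runs of white space; stemming is deliberately skipped.
corpus <- tm_map(corpus, stripWhitespace)

# Rows are terms, columns are documents.
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# Sum each row to get corpus-wide term frequencies, highest first.
term_totals <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
print(term_totals)
```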

wordcloud(lords, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))

scale controls the difference between the largest and smallest font sizes.
max.words limits the number of words drawn in the cloud.
rot.per controls the proportion of words drawn vertically (0.35 means roughly 35% of them).
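For readers who want to try these parameters on their own data, here is a small self-contained sketch that passes words and frequencies explicitly. The `term_totals` vector is placeholder data invented for the example, and writing to a PDF file under `tempdir()` is an assumption so that the code runs without an interactive graphics window.

```r
library(wordcloud)
library(RColorBrewer)

# Placeholder term totals; in the real run these are the row sums of the
# TFD matrix.
term_totals <- c(alpha = 12, beta = 9, gamma = 7, delta = 4, epsilon = 2)

# Assumed output file so the sketch works in a non-interactive session.
out_path <- file.path(tempdir(), "wordcloud_demo.pdf")
pdf(out_path)

set.seed(42)  # fixes the random parts of the layout
wordcloud(
  words = names(term_totals),
  freq = term_totals,
  scale = c(5, 0.5),       # largest and smallest font size
  min.freq = 1,            # keep every term in this tiny example
  max.words = 100,         # upper limit on words drawn
  random.order = FALSE,    # most frequent words are placed first
  rot.per = 0.35,          # proportion of words rotated by 90 degrees
  use.r.layout = FALSE,
  colors = brewer.pal(8, "Dark2")
)

invisible(dev.off())
```

Lowering max.words or narrowing scale is usually the quickest way to make a crowded cloud readable.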

As of this writing I have not found any word cloud built from Thirukkural. If that is true, I would be the first person to build one, which is something cool to take pride in.