


Predict continuous-valued outputs associated with text documents. The input corpus of text documents is transformed into a document-term matrix (DTM), and a regularized linear regression is then fit that uses this matrix as predictors of the continuous-valued output. The corpus's terms, the coefficients for all terms, and an estimate of the model's predictive power are returned in a list.
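The pipeline described above (documents to DTM, then a penalized linear fit) can be sketched in a few lines. This is only an illustration of the idea, not the package's R interface: the function names `build_dtm` and `fit_ridge`, the whitespace tokenizer, the L2 (ridge) penalty, the gradient-descent solver, and the toy data are all assumptions made for the example.

```python
from collections import Counter


def build_dtm(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0.0] * len(vocab)
        for term, count in Counter(toks).items():
            row[index[term]] = float(count)
        matrix.append(row)
    return vocab, matrix


def fit_ridge(matrix, y, penalty=0.1, lr=0.05, epochs=2000):
    """Fit y ~ intercept + matrix @ coefs with an L2 penalty on coefs.

    Plain batch gradient descent; the intercept is not penalized.
    """
    n, p = len(matrix), len(matrix[0])
    coefs, intercept = [0.0] * p, 0.0
    for _ in range(epochs):
        # Residuals are computed once per epoch, before any update.
        residuals = [
            intercept + sum(c * x for c, x in zip(coefs, row)) - target
            for row, target in zip(matrix, y)
        ]
        intercept -= lr * sum(residuals) / n
        for j in range(p):
            grad = sum(r * row[j] for r, row in zip(residuals, matrix)) / n
            coefs[j] -= lr * (grad + penalty * coefs[j])
    return intercept, coefs


# Hypothetical toy corpus with a numeric rating per document.
docs = ["good good film", "bad film", "good plot", "bad bad plot"]
ratings = [5.0, 2.0, 4.0, 1.0]

vocab, dtm = build_dtm(docs)
intercept, coefs = fit_ridge(dtm, ratings)
weights = dict(zip(vocab, coefs))  # one coefficient per term
```

Terms that co-occur with high ratings receive larger coefficients than terms that co-occur with low ones; a real analysis would also hold out data to estimate predictive power, which this sketch omits.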


Korean language processing package. Morphological analyzer, POS tagger, keystroke converter, Hangul automata, concordance, mutual information.


An R interface to the C code that implements Porter's word stemming algorithm for collapsing words to a common root to aid comparison of texts. There is code for different languages (Danish, Dutch, English, Finnish, French, German, Norwegian, Portuguese, Russian, Spanish, Swedish). However, these may not be applicable if the words require UTF encoding. The interface is extensible: different routines can be specified to create the C routines used in the stemming, permitting debugging, profiling, pool management, caching, etc.


RTextTools is a machine learning package for automatic text classification that makes it simple for novice users to get started with machine learning, while allowing experienced users to easily experiment with different settings and algorithm combinations. The package includes nine algorithms for ensemble classification (svm, slda, boosting, bagging, random forests, glmnet, decision trees, neural networks, maximum entropy), comprehensive analytics, and thorough documentation.


maxent is an R package with tools for low-memory multinomial logistic regression, also known as maximum entropy. The focus of this maximum entropy classifier is to minimize memory consumption on very large datasets, particularly sparse document-term matrices represented by the tm package. The classifier is based on an efficient C++ implementation written by Dr. Yoshimasa Tsuruoka.


Pretty word clouds.


A plug-in for the text mining framework tm to support text mining in a distributed way. The package provides a convenient interface for handling distributed corpus objects based on distributed list objects.


WARNING: This package is currently in beta status! This package provides a GUI for demonstrating text mining concepts and the "tm" package. It is implemented as a plug-in to the R-Commander, which is based on Tcl/Tk. The set of dialogs can be accessed through the TextMining menu that is added to the R-Commander menus.


Provides an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topic Models (CTM) by David M. Blei and co-authors, and to the C++ code for fitting LDA models using Gibbs sampling by Xuan-Hieu Phan and co-authors.


Text categorization based on n-grams.
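One common way to categorize text with n-grams is to compare ranked character n-gram profiles using an "out-of-place" distance, as described by Cavnar and Trenkle. The sketch below illustrates that general technique under stated assumptions; it is not claimed to be the package's implementation. The function names, the profile size `top`, the fixed `missing_penalty`, and the two tiny training texts are all hypothetical choices for the example.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Return character n-grams (lengths 1..n_max), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top]


def out_of_place(doc_profile, cat_profile, missing_penalty=1000):
    """Sum of rank differences; n-grams absent from the category pay a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(cat_profile)}
    return sum(
        abs(r - rank[gram]) if gram in rank else missing_penalty
        for r, gram in enumerate(doc_profile)
    )


def categorize(text, category_profiles):
    """Return the label whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(
        category_profiles,
        key=lambda label: out_of_place(doc, category_profiles[label]),
    )


# Hypothetical, deliberately tiny training texts; real profiles need far more data.
training = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
profiles = {label: ngram_profile(text) for label, text in training.items()}

print(categorize("the dog is in the house", profiles))
print(categorize("der hund ist im haus", profiles))
```

The penalty for unseen n-grams is set larger than any possible rank difference, so a category that simply lacks most of a document's n-grams cannot win on rank differences alone.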