LDA
Synopsis
This operator finds topics using the LDA method.
Description
LDA (Latent Dirichlet Allocation) is a method for identifying topics in a collection of documents. This implementation uses the ParallelTopicModel of the Mallet library (source: Newman, Asuncion, Smyth and Welling, "Distributed Algorithms for Topic Models", JMLR (2009)) with the SparseLDA sampling scheme and data structure (source: Yao, Mimno and McCallum, "Efficient Methods for Topic Model Inference on Streaming Document Collections", KDD (2009)).
LDA provides topic diagnostics in the model object. For details on the measures see: http://mallet.cs.umass.edu/diagnostics.php . Note that some of the measures depend on the number of top words.
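When the model is applied to new documents, LDA uses Gibbs sampling to infer their topics. The sampling is controlled by additional parameters that are available at application time; they are marked "(application)" in the parameter list.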
Input
exa
An ExampleSet with the text attribute to be processed.
Output
exa
An ExampleSet with added "documentId" and "TopicId" attributes, and an additional attribute showing the confidence that this document belongs to the topic.
top
An ExampleSet with details on the topic. For each topic the operator returns the top 5 most used words.
mod
The topic model. It can be applied to a new collection of documents using Apply Model (Documents).
per
The LogLikelihood value of the fit which can be used for optimization.
Parameters
Number of topics
Number of topics to search.
Use alpha heuristics
If this parameter is set to true, alpha is set automatically. The heuristic used is: 50 / Number of topics.
Alpha sum
Bayesian prior on the topic distribution.
Use beta heuristics
If this parameter is set to true, beta is set automatically. The heuristic used is: 50 / Number of words.
Beta
Bayesian prior on the word distribution.
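As a rough illustration of the two heuristics described above, the sketch below computes the automatically chosen priors. The function names are hypothetical, and it assumes that "Number of words" means the number of distinct words (the vocabulary size); the operator's internal calculation may differ.

```python
def heuristic_alpha(num_topics: int) -> float:
    """Alpha heuristic from the parameter description: 50 / number of topics."""
    if num_topics <= 0:
        raise ValueError("num_topics must be positive")
    return 50.0 / num_topics


def heuristic_beta(num_words: int) -> float:
    """Beta heuristic from the parameter description: 50 / number of words.

    Assumption: num_words is the number of distinct words (vocabulary size).
    """
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return 50.0 / num_words


print(heuristic_alpha(20))     # 2.5
print(heuristic_beta(10000))   # 0.005
```

Both priors shrink as the model grows: more topics give a smaller alpha, and a larger vocabulary gives a smaller beta.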
Optimize hyperparameters
If this parameter is set to true, both alpha and beta are optimized every k-th step, where k is given by the "Optimize interval for hyperparameters" parameter.
Optimize interval for hyperparameters
Frequency of hyperparameter optimization.
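To make "every k-th step" concrete, the following sketch lists the iterations at which optimization would run. It is only an assumption about the schedule: iterations are counted from 1 and no initial iterations are skipped. The function name is hypothetical.

```python
from typing import List


def optimization_steps(iterations: int, interval: int) -> List[int]:
    """Return the iterations at which alpha and beta would be re-optimized.

    Assumption: iterations are numbered 1..iterations and optimization
    runs whenever the iteration number is a multiple of the interval.
    """
    if interval <= 0:
        return []  # treat a non-positive interval as "never optimize"
    return [i for i in range(1, iterations + 1) if i % interval == 0]


print(optimization_steps(50, 10))  # [10, 20, 30, 40, 50]
```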
Top words per topic
Number of words to pull to describe one topic.
Iterations
Number of iterations for optimization.
Reproducible
If this parameter is set to true, parallel execution will be deactivated. Results may differ between runs if this is left unchecked.
Enable logging
If this parameter is set to true, additional output is provided in the Log panel.
Use local random seed
This parameter indicates if a local random seed should be used.
Local random seed
If the "Use local random seed" parameter is checked, this parameter determines the local random seed.
Lda.iterations (application)
Number of iterations for Gibbs sampling. Available in Apply Model (Documents).
Lda.burnin (application)
Ignore the first x rounds of sampling. Should be smaller than Lda.iterations, otherwise no samples remain. Available in Apply Model (Documents).
Lda.thinning (application)
Only use every x-th iteration to determine the confidence. Available in Apply Model (Documents).
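The three application parameters interact, so a small sketch may help. It shows which sampling rounds would contribute to the confidence and how the kept samples could be averaged. The exact rule used by the operator is not documented here; the sketch assumes rounds are numbered from 1, a round is kept when it lies after the burn-in and its number is a multiple of the thinning value, and the confidence is the mean of the kept samples. All function names are hypothetical.

```python
from typing import List, Sequence


def kept_iterations(iterations: int, burnin: int, thinning: int) -> List[int]:
    """Return the sampling rounds that survive burn-in and thinning.

    Assumption: rounds are numbered 1..iterations; a round is kept if it
    comes after the burn-in and its number is a multiple of thinning.
    """
    if thinning <= 0:
        raise ValueError("thinning must be positive")
    return [i for i in range(1, iterations + 1)
            if i > burnin and i % thinning == 0]


def average_confidence(samples: Sequence[Sequence[float]]) -> List[float]:
    """Average per-topic proportions over the kept samples (assumed rule)."""
    if not samples:
        raise ValueError("no samples kept; lower burnin or raise iterations")
    n = len(samples)
    # zip(*samples) groups the values of each topic across all samples
    return [sum(topic_values) / n for topic_values in zip(*samples)]


print(kept_iterations(10, 4, 2))                      # [6, 8, 10]
print(average_confidence([[0.5, 0.5], [1.0, 0.0]]))   # [0.75, 0.25]
```

Under these assumptions, a larger thinning value keeps fewer samples, and a burn-in equal to or above the iteration count leaves nothing to average.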