LDA

Synopsis

This operator finds topics using the LDA method.

Description

LDA (Latent Dirichlet Allocation) is a method for identifying topics in documents. This implementation of LDA uses the ParallelTopicModel of the Mallet library (source: Newman, Asuncion, Smyth and Welling, Distributed Algorithms for Topic Models, JMLR (2009)) with the SparseLDA sampling scheme and data structure (source: Yao, Mimno and McCallum, Efficient Methods for Topic Model Inference on Streaming Document Collections, KDD (2009)).

LDA provides topic diagnostics in the model object. For details on the measures see: http://mallet.cs.umass.edu/diagnostics.php . Note that some of the measures depend on the number of top words.

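When the model is applied to new documents, LDA uses Gibbs sampling to infer their topics. For this reason Apply Model (Documents) exposes additional parameters (iterations, burn-in and thinning), which are described at the end of the parameter list.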

Input

col

A preprocessed collection of documents.

Output

exa

An ExampleSet with added "documentId" and "TopicId" attributes, and an additional attribute showing the confidence that this document belongs to the topic.

top

An ExampleSet with details on the topic. For each topic the operator returns the top 5 most used words.

mod

The topic model. It can be applied to a new collection of documents using Apply Model (Documents).

per

The LogLikelihood value of the fit which can be used for optimization.

Parameters

Number of topics

Number of topics to search for.

Use alpha heuristics

If this parameter is set to true, alpha is set automatically. The heuristic used is: 50 / Number of topics.

Alpha sum

Bayesian prior on the topic distribution.

Use beta heuristics

If this parameter is set to true, beta is set automatically. The heuristic used is: 50 / Number of words.

Beta

Bayesian prior on the word distribution.
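The two heuristics above are simple ratios. The following Python sketch only illustrates the arithmetic; the function names are hypothetical and not part of the operator, and this page does not say whether "Number of words" means the vocabulary size or the total token count, so the example treats it as an opaque count.

```python
def alpha_heuristic(num_topics: int) -> float:
    """Alpha as described for 'Use alpha heuristics': 50 / number of topics."""
    if num_topics <= 0:
        raise ValueError("num_topics must be positive")
    return 50 / num_topics


def beta_heuristic(num_words: int) -> float:
    """Beta as described for 'Use beta heuristics': 50 / number of words.

    Assumption: num_words is whatever count the operator uses internally;
    the documentation does not define it further.
    """
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return 50 / num_words


if __name__ == "__main__":
    print(alpha_heuristic(20))     # 2.5
    print(beta_heuristic(10_000))  # 0.005
```

Both values shrink as the model or the word count grows, which corresponds to a weaker prior per topic or per word.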

Optimize hyperparameters

If this parameter is set to true, both alpha and beta are optimized every k-th step. The value of k is set by the "Optimize interval for hyperparameters" parameter.

Optimize interval for hyperparameters

Frequency of hyperparameter optimization.
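To make "every k-th step" concrete, the sketch below lists the iterations at which an optimization would be triggered. It is an illustration under stated assumptions, not the operator's code: iterations are assumed to be counted from 1, and any internal warm-up period before the first optimization is ignored. The function name is hypothetical.

```python
def optimization_steps(iterations: int, interval: int) -> list[int]:
    """Return the iteration numbers at which hyperparameters would be
    re-estimated, assuming 1-based counting and no warm-up period."""
    if interval <= 0:
        # Assumption: a non-positive interval means "never optimize".
        return []
    return [i for i in range(1, iterations + 1) if i % interval == 0]


if __name__ == "__main__":
    print(optimization_steps(50, 10))  # [10, 20, 30, 40, 50]
```

A smaller interval re-estimates alpha and beta more often, at the cost of extra computation per run.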

Top words per topic

Number of words to pull to describe one topic.

Iterations

Number of iterations for optimization.

Reproducible

If this parameter is set to true, parallel execution will be deactivated. Results may differ between runs if this is left unchecked.

Enable logging

If this parameter is set to true, additional output is provided in the Log panel.

Use local random seed

This parameter indicates if a local random seed should be used.

Local random seed

If the "Use local random seed" parameter is checked, this parameter determines the local random seed.

Include meta data

If checked, available meta information of the text, such as filename and date, is added as attributes.

Lda.iterations (application)

Number of iterations for Gibbs sampling. Available in Apply Model (Documents).

Lda.burnin (application)

Ignore the first x rounds of sampling. Should be smaller than the number of iterations, otherwise no samples remain. Available in Apply Model (Documents).

Lda.thinning (application)

Only use every x-th iteration to determine the confidence. Available in Apply Model (Documents).
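The three application parameters interact: iterations sets the length of the sampling run, burn-in discards the start of it, and thinning keeps only every x-th round of the remainder. The Python sketch below shows which rounds would contribute to the confidence under one common convention (rounds counted from 1, a round kept when it lies after the burn-in and its distance from the burn-in is a multiple of the thinning value). The exact rule inside the operator is an assumption here, and the function name is hypothetical.

```python
def kept_iterations(iterations: int, burnin: int, thinning: int) -> list[int]:
    """Return the sampling rounds that would be used for the confidence,
    assuming rounds 1..iterations, a discarded burn-in prefix, and
    every thinning-th round kept afterwards."""
    if thinning <= 0:
        raise ValueError("thinning must be positive")
    return [
        i
        for i in range(1, iterations + 1)
        if i > burnin and (i - burnin) % thinning == 0
    ]


if __name__ == "__main__":
    # 100 rounds, first 10 ignored, every 10th round kept afterwards.
    print(kept_iterations(100, 10, 10))
    # [20, 30, 40, 50, 60, 70, 80, 90, 100]
```

With these example values, nine of the 100 rounds are averaged. Increasing thinning or burn-in reduces that number, so iterations usually has to grow with them.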