LdaModel train


train(self, frame, document_column_name, word_column_name, word_count_column_name, max_iterations=None, alpha=None, beta=None, convergence_threshold=None, evaluate_cost=None, num_topics=None)

[BETA] Creates a Latent Dirichlet Allocation model.

Parameters:

frame : Frame

Input frame data.

document_column_name : unicode

Column name for documents. Column should contain a str value.

word_column_name : unicode

Column name for words. Column should contain a str value.

word_count_column_name : unicode

Column name for word count. Column should contain an int32 or int64 value.

max_iterations : int32 (default=None)

The maximum number of iterations that the algorithm will execute. Valid values are positive integers. Default is 20.

alpha : float32 (default=None)

The hyperparameter for the document-specific distribution over topics. Mainly used as a smoothing parameter in Bayesian inference. A larger value implies that documents are assumed to cover all topics more uniformly; a smaller value implies that documents are more concentrated on a small subset of topics. Valid values are positive floats. Default is 0.1.

beta : float32 (default=None)

The hyperparameter for the topic-specific distribution over words. Mainly used as a smoothing parameter in Bayesian inference. A larger value implies that topics contain all words more uniformly; a smaller value implies that topics are more concentrated on a small subset of words. Valid values are positive floats. Default is 0.1.

convergence_threshold : float32 (default=None)

The amount of change in LDA model parameters that will be tolerated at convergence. If the change is less than this threshold, the algorithm exits before it reaches the maximum number of iterations. Valid values are 0.0 and all positive floats. Default is 0.001.

evaluate_cost : bool (default=None)

“True” turns cost evaluation on and “False” turns it off. Evaluating the cost function is relatively expensive for LDA, so time-critical applications can use this option to skip it. Default is “False”.

num_topics : int32 (default=None)

The number of topics to identify in the LDA model. Using fewer topics speeds up the computation, but the extracted topics might be more abstract or less specific; using more topics requires more computation but leads to more specific topics. Valid values are positive integers. Default is 10.
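
The effect of alpha and beta described above can be illustrated outside the toolkit. The sketch below is not part of the trustedanalytics API and does not reproduce its training algorithm; it only samples from a symmetric Dirichlet distribution with the Python standard library to show how the concentration value changes the shape of the resulting distributions. The helper names are hypothetical.

```python
import random


def sample_symmetric_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(concentration, size=10, trials=2000, seed=0):
    """Average weight of the largest component over many samples.

    A value near 1.0 means mass is concentrated on one component;
    a value near 1.0 / size means mass is spread almost uniformly.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(sample_symmetric_dirichlet(concentration, size, rng))
    return total / trials


# Small concentration (like alpha=0.1): a few components dominate.
concentrated = mean_largest_share(0.1)
# Large concentration: weight is spread more evenly across components.
spread_out = mean_largest_share(10.0)
```

Comparing `concentrated` with `spread_out` shows the smaller value producing a more dominant largest component, which matches the description of alpha (documents over topics) and beta (topics over words).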

Returns:

: dict

The data returned is composed of multiple components:

Frame : topics_given_doc
Conditional probabilities of topic given document.
Frame : word_given_topics
Conditional probabilities of word given topic.
Frame : topics_given_word
Conditional probabilities of topic given word.
str : report
The configuration and learning curve report for Latent Dirichlet Allocation, as a multi-line str.

See the discussion about Latent Dirichlet Allocation at Wikipedia.

Examples
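
The document, word, and word count columns imply an input frame with one row per (document, word) pair. As a minimal sketch of that layout, assuming raw text keyed by document id, the rows could be prepared in plain Python as follows. The function name and the sample documents are hypothetical, and loading the rows into a frame is not shown.

```python
from collections import Counter


def to_word_count_rows(documents):
    """Turn {doc_id: text} into sorted (doc_id, word, word_count) rows."""
    rows = []
    for doc_id in sorted(documents):
        # Naive whitespace tokenization; real pipelines usually do more.
        counts = Counter(documents[doc_id].lower().split())
        for word in sorted(counts):
            rows.append((doc_id, word, counts[word]))
    return rows


docs = {"doc1": "apple banana apple", "doc2": "banana cherry"}
rows = to_word_count_rows(docs)
for row in rows:
    print(row)
```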

Create a model and train it on the input frame:

>>> my_model = ta.LdaModel()
>>> results = my_model.train(frame, 'doc_id', 'word_id', 'word_count', max_iterations = 3, num_topics = 2)

The variable results is a dictionary with four keys:

>>> topics_given_doc = results['topics_given_doc']
>>> word_given_topics = results['word_given_topics']
>>> topics_given_word = results['topics_given_word']
>>> report = results['report']

Inspect the results:

View the report:

>>> print report

======Graph Statistics======
Number of vertices: 11 (doc: 3, word: 8)
Number of edges: 32

======LDA Configuration======
numTopics: 2
alpha: 0.100000
beta: 0.100000
convergenceThreshold: 0.001000
maxIterations: 3
evaluateCost: false

======Learning Progress======
iteration = 1   maxDelta = 0.677352
iteration = 2   maxDelta = 0.173309
iteration = 3   maxDelta = 0.181216
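
The Learning Progress lines of a report like the one above can also be read programmatically. The following is a minimal sketch that assumes the report keeps the `iteration = N   maxDelta = X` layout shown here; the helper names are hypothetical and are not part of the toolkit.

```python
import re

# Matches lines such as "iteration = 2   maxDelta = 0.173309".
PROGRESS_LINE = re.compile(
    r"iteration\s*=\s*(\d+)\s+maxDelta\s*=\s*([0-9.eE+-]+)"
)


def parse_learning_progress(report):
    """Return [(iteration, max_delta), ...] found in a report string."""
    return [(int(i), float(d)) for i, d in PROGRESS_LINE.findall(report)]


def converged(progress, convergence_threshold):
    """True if the last recorded maxDelta is below the threshold."""
    return bool(progress) and progress[-1][1] < convergence_threshold


sample_report = """======Learning Progress======
iteration = 1   maxDelta = 0.677352
iteration = 2   maxDelta = 0.173309
iteration = 3   maxDelta = 0.181216
"""

progress = parse_learning_progress(sample_report)
reached_threshold = converged(progress, 0.001)
```

For the sample report, the last maxDelta (0.181216) is still above the configured convergenceThreshold of 0.001, which suggests the run ended because it reached maxIterations rather than because it converged.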