LdaModel train

train(self, frame, document_column_name, word_column_name, word_count_column_name, max_iterations=None, alpha=None, beta=None, convergence_threshold=None, evaluate_cost=None, num_topics=None)

[BETA] Creates a Latent Dirichlet Allocation (LDA) model.

Parameters:


frame : Frame

Input frame data.

document_column_name : unicode

Column name for documents. The column should contain str values.

word_column_name : unicode

Column name for words. Column should contain a str value.

word_count_column_name : unicode

Column name for word count. Column should contain an int32 or int64 value.

max_iterations : int32 (default=None)

The maximum number of iterations that the algorithm will execute. Must be a positive int. Default is 20.

alpha : float32 (default=None)

The hyperparameter for the document-specific distribution over topics. Mainly used as a smoothing parameter in Bayesian inference. A larger value implies that documents are assumed to cover all topics more uniformly; a smaller value implies that documents are more concentrated on a small subset of topics. Must be a positive float. Default is 0.1.

beta : float32 (default=None)

The hyperparameter for the topic-specific distribution over words. Mainly used as a smoothing parameter in Bayesian inference. A larger value implies that topics contain all words more uniformly; a smaller value implies that topics are more concentrated on a small subset of words. Must be a positive float. Default is 0.1.

convergence_threshold : float32 (default=None)

The amount of change in LDA model parameters that will be tolerated at convergence. If the change is less than this threshold, the algorithm exits before it reaches the maximum number of iterations. Must be a float greater than or equal to 0.0. Default is 0.001.

evaluate_cost : bool (default=None)

“True” turns cost evaluation on and “False” turns it off. Evaluating the cost function is relatively expensive for LDA, so time-critical applications can use this option to skip it. Default is “False”.

num_topics : int32 (default=None)

The number of topics to identify in the LDA model. Using fewer topics will speed up the computation, but the extracted topics might be more abstract or less specific; using more topics will result in more computation but lead to more specific topics. Valid value range is all positive int. Default is 10.
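The three column-name parameters above suggest a long-format input with one row per (document, word) pair. The following is a minimal sketch of producing such rows from raw text, assuming whitespace tokenization and lowercase folding. The helper name to_word_count_rows and the toy corpus are hypothetical, and loading the rows into a frame is not shown.

```python
from collections import Counter


def to_word_count_rows(documents):
    """Turn {doc_id: text} into sorted (doc_id, word, count) rows."""
    rows = []
    for doc_id in sorted(documents):
        # Assumption: words are separated by whitespace and case is ignored.
        counts = Counter(documents[doc_id].lower().split())
        for word in sorted(counts):
            rows.append((doc_id, word, counts[word]))
    return rows


# Hypothetical toy corpus, used only for illustration.
docs = {
    "nytimes": "harry economy magic harry",
    "economist": "economy economy jobs",
}
rows = to_word_count_rows(docs)
for row in rows:
    print(row)
```

Each tuple maps onto the document, word, and word count columns, in that order.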

Returns:

: dict

The data returned is composed of multiple components:

Frame : topics_given_doc

Conditional probabilities of topic given document.

Frame : word_given_topics

Conditional probabilities of word given topic.

Frame : topics_given_word

Conditional probabilities of topic given word.

str : report

The configuration and learning curve report for Latent Dirichlet Allocation as a multiple line str.
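The returned frames hold conditional probabilities in both directions between words and topics. As a reminder of how the two directions relate, here is a small sketch of Bayes' rule for a single word. The function name posterior_over_topics and all numbers are hypothetical, and how the library itself derives topics_given_word is not stated here.

```python
def posterior_over_topics(word_given_topics, topic_prior):
    """Bayes' rule: P(topic | word) is proportional to P(word | topic) * P(topic)."""
    joint = [p_w * p_t for p_w, p_t in zip(word_given_topics, topic_prior)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical values for one word and two topics.
p_word_given_topic = [0.30, 0.10]  # P(word | topic 0), P(word | topic 1)
p_topic = [0.5, 0.5]               # assumed uniform prior over topics
posterior = posterior_over_topics(p_word_given_topic, p_topic)
print(posterior)
```

The resulting list sums to 1, as any distribution over topics for a given word must.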

See the discussion about Latent Dirichlet Allocation at Wikipedia.
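To illustrate the smoothing effect described for alpha (the same reasoning applies to beta), the sketch below samples topic mixtures from a symmetric Dirichlet distribution using only the standard library. It is not the library's implementation; the function names, the choice of 5 topics, and the number of trials are assumptions made for the illustration.

```python
import random


def sample_dirichlet(alpha, num_topics, rng):
    """Draw one symmetric Dirichlet(alpha) sample by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha, num_topics=5, trials=500, seed=0):
    """Average weight of the dominant topic over many sampled documents."""
    rng = random.Random(seed)
    largest = [max(sample_dirichlet(alpha, num_topics, rng)) for _ in range(trials)]
    return sum(largest) / trials


concentrated = mean_largest_share(0.1)  # small alpha: one topic tends to dominate
spread_out = mean_largest_share(10.0)   # large alpha: weight spread across topics
print(concentrated, spread_out)
```

With a small alpha the dominant topic takes most of the probability mass on average, while a large alpha pushes each sampled mixture toward uniform.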

Examples

Create the model and train it on an existing frame (here frame, with columns doc_id, word_id, and word_count):

>>> my_model = ta.LdaModel()
>>> results = my_model.train(frame, 'doc_id', 'word_id', 'word_count', max_iterations = 3, num_topics = 2)

The variable results is a dictionary with four keys:

>>> topics_given_doc = results['topics_given_doc']
>>> word_given_topics = results['word_given_topics']
>>> topics_given_word = results['topics_given_word']
>>> report = results['report']

Inspect the results:

View the report:

>>> print report

======Graph Statistics======
Number of vertices: 11 (doc: 3, word: 8)
Number of edges: 32

======LDA Configuration======
numTopics: 2
alpha: 0.100000
beta: 0.100000
convergenceThreshold: 0.001000
maxIterations: 3
evaluateCost: false

======Learning Progress======
iteration = 1   maxDelta = 0.677352
iteration = 2   maxDelta = 0.173309
iteration = 3   maxDelta = 0.181216
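Because the report is returned as a plain multi-line str, the learning curve can be recovered with a regular expression. The sketch below parses the "Learning Progress" lines shown above and compares the last value with the convergence threshold. The function names are hypothetical, and treating maxDelta as the quantity compared against convergence_threshold is an assumption based on the parameter description.

```python
import re

# The tail of the report printed above, copied verbatim.
REPORT_TAIL = """\
======Learning Progress======
iteration = 1   maxDelta = 0.677352
iteration = 2   maxDelta = 0.173309
iteration = 3   maxDelta = 0.181216
"""


def parse_learning_progress(report):
    """Return [(iteration, max_delta), ...] found in the report text."""
    pattern = re.compile(r"iteration\s*=\s*(\d+)\s+maxDelta\s*=\s*([0-9.]+)")
    return [(int(i), float(d)) for i, d in pattern.findall(report)]


def converged(progress, convergence_threshold):
    """True if the last recorded maxDelta is below the threshold (assumed rule)."""
    return bool(progress) and progress[-1][1] < convergence_threshold


progress = parse_learning_progress(REPORT_TAIL)
print(progress)
print(converged(progress, 0.001))
```

Under that assumption, this run ended because max_iterations was reached: the last maxDelta is still well above the default threshold of 0.001.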
