sparktk gmm

Functions

def load(path, tc=<class 'sparktk.arguments.implicit'>)

Loads a GaussianMixtureModel from the given path.

def train(frame, observation_columns, column_scalings, k=2, max_iterations=20, convergence_tol=0.01, seed=None)

Creates a GaussianMixtureModel by training on the given frame.

frame(Frame):frame of training data
observation_columns(List(str)):names of columns containing the observations for training
column_scalings(List(float)):column scalings for each of the observation columns. The scaling value is multiplied by the corresponding value in the observation column
k(Optional(int)):number of clusters
max_iterations(Optional(int)):maximum number of iterations for which the algorithm should run
convergence_tol(Optional(float)):largest change in log-likelihood at which convergence is considered to have occurred
seed(Optional(int)):seed for randomness

Returns: GaussianMixtureModel
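The max_iterations and convergence_tol parameters jointly bound the EM training loop: iteration stops once the gain in log-likelihood falls below convergence_tol, or after max_iterations passes, whichever comes first. A minimal pure-Python sketch of that stopping rule (an illustration, not sparktk's actual implementation; `step` is a hypothetical callable returning the current log-likelihood):

```python
def run_em(step, max_iterations=20, convergence_tol=0.01):
    """Run an EM-style loop until the log-likelihood gain is small.

    `step` is a hypothetical callable returning the model's current
    log-likelihood after one expectation-maximization pass.
    """
    prev = float("-inf")
    for i in range(max_iterations):
        ll = step()
        if ll - prev < convergence_tol:
            return i + 1, ll  # converged: improvement below tolerance
        prev = ll
    return max_iterations, prev  # hit the iteration cap

# Toy log-likelihood sequence whose improvement shrinks each iteration
seq = iter([-100.0, -50.0, -45.0, -44.999])
iterations, ll = run_em(lambda: next(seq))
print(iterations)  # 4: the gain 0.001 on the fourth pass is below 0.01
```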

Classes

class GaussianMixtureModel

A trained GaussianMixtureModel

Example:
>>> frame = tc.frame.create([[2, "ab"],
...                          [1,"cd"],
...                          [7,"ef"],
...                          [1,"gh"],
...                          [9,"ij"],
...                          [2,"kl"],
...                          [0,"mn"],
...                          [6,"op"],
...                          [5,"qr"]],
...                         [("data", float), ("name", str)])

>>> frame.inspect()
[#]  data  name
===============
[0]   2.0  ab
[1]   1.0  cd
[2]   7.0  ef
[3]   1.0  gh
[4]   9.0  ij
[5]   2.0  kl
[6]   0.0  mn
[7]   6.0  op
[8]   5.0  qr

>>> model = tc.models.clustering.gmm.train(frame, ["data"], [1.0], 3, seed=1)

>>> model.k
3

>>> for g in model.gaussians:
...     print(g)
mu    = [1.1984786097160265]
sigma = [[0.5599222134199012]]
mu    = [6.643997733061858]
sigma = [[2.19222016401446]]
mu    = [6.79435719737145]
sigma = [[2.2637494400157774]]
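Each printed mu/sigma pair defines one mixture component's normal density over the (scaled) observations; in this one-dimensional case sigma is a 1x1 covariance matrix, i.e. a variance. As an illustration (plain Python, not a sparktk API), the first component's density can be evaluated directly from the values printed above:

```python
import math

# mu and sigma of the first Gaussian printed above; sigma is a variance here
mu, var = 1.1984786097160265, 0.5599222134199012

def normal_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Observations near mu (such as the 1.0 and 2.0 rows) get far higher density
# under this component than distant ones (such as 7.0)
print(normal_pdf(1.0, mu, var) > normal_pdf(7.0, mu, var))  # True
```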



>>> predicted_frame = model.predict(frame)

>>> predicted_frame.inspect()
[#]  data  name  predicted_cluster
==================================
[0]   9.0  ij                    0
[1]   2.0  ab                    1
[2]   0.0  mn                    1
[3]   5.0  qr                    0
[4]   7.0  ef                    0
[5]   1.0  cd                    1
[6]   1.0  gh                    1
[7]   6.0  op                    0
[8]   2.0  kl                    1

>>> model.observation_columns
[u'data']

>>> model.column_scalings
[1.0]
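Each scaling value is multiplied into its observation column before training, so the [1.0] scaling above leaves the data unchanged. A plain-Python sketch of the transformation (an illustration, not sparktk code; scale_row is a hypothetical helper):

```python
def scale_row(row, scalings):
    """Multiply each observation value by its column's scaling factor."""
    return [x * s for x, s in zip(row, scalings)]

# With a scaling of 0.5, an observation of 9.0 trains as 4.5
print(scale_row([9.0], [0.5]))  # [4.5]
```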

>>> model.save("sandbox/gmm")

>>> restored = tc.load("sandbox/gmm")

>>> model.cluster_sizes(frame) == restored.cluster_sizes(frame)
True

Instance variables

var column_scalings

scaling values applied to each observation column during model training

var convergence_tol

convergence tolerance

var gaussians

list of Gaussian objects, each containing the mu and sigma values for one component

var k

maximum number of resulting clusters

var max_iterations

maximum number of iterations

var observation_columns

observation columns used for model training

var seed

seed used during training of the model

Methods

def __init__(self, tc, scala_model)

def cluster_sizes(self, frame)

Returns a map of clusters and their sizes.
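The sizes correspond to counting the predicted_cluster column that predict produces. A plain-Python sketch of that count (collections.Counter standing in for the distributed computation; the actual map's key format may differ):

```python
from collections import Counter

# predicted_cluster values from the predicted_frame shown earlier
predicted = [0, 1, 1, 0, 0, 1, 1, 0, 1]
sizes = dict(Counter(predicted))
print(sizes)  # {0: 4, 1: 5}
```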

def export_to_mar(self, path)

Exports the trained model as a model archive (.mar) to the specified path.

Parameters:
path(str):Path to save the trained model

Returns(str): Full path to the saved .mar file

def predict(self, frame, columns=None)

Predicts the labels for the observation columns in the given input frame. Creates a new frame with the existing columns and a new predicted column.

Parameters:
frame(Frame):frame used for predicting the values
columns(Optional(List[str])):names of the observation columns; defaults to the columns used during training

Returns(Frame): A new frame containing the original frame's columns and a prediction column

def save(self, path)

Saves the trained model to the given path.

def to_dict(self)

def to_json(self)