Naive Bayes¶
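Naive Bayes is a simple probabilistic classifier. It applies Bayes' theorem under the assumption that the features are conditionally independent given the class.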
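For reference, the textbook decision rule can be written as below, where $c$ ranges over the classes and $x_1, \dots, x_n$ are the feature values of one sample. This is the general formulation, not a statement about how ATK implements it internally.

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{j=1}^{n} P(x_j \mid c)
```

The product over features is where the independence assumption enters: each feature contributes its own factor, with no interaction terms.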
Setup¶
Establish a connection to the ATK REST Server. This handle will be used for the remainder of the script.
Get your server URL and credentials file from the TAP administrator.
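import os
import trustedanalytics as ia  # assumption: the ATK Python client, aliased as ia to match the calls below

atk_server_uri = os.getenv("ATK_SERVER_URI", ia.server.uri)
credentials_file = os.getenv("ATK_CREDENTIALS", "")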
Set the server, and use the credentials to connect to the ATK REST server.
ia.server.uri = atk_server_uri
ia.connect(credentials_file)
Workflow¶
The general workflow is: build a frame, build a model, train the model on the frame, predict using the model, and evaluate the results using classification metrics.
Build a Frame¶
Construct the frame from Python lists, which are uploaded to the server.
Each row represents a sample drawn from a probability distribution: a feature vector associated with a category. In this example there are two categories (cat or dog) and three features (weight, height, fur type) used to tell them apart.
The frame has the schema class, f1, f2, f3, where class is the category the sample belongs to and f1 to f3 are the three features.
rows_frame = ia.UploadRows([[0, 1, 0, 0],
                            [0, 2, 0, 0],
                            [1, 0, 1, 0],
                            [1, 0, 2, 0]],
                           [("class", ia.float32),
                            ("f1", ia.int32),
                            ("f2", ia.int32),
                            ("f3", ia.int32)])
frame = ia.Frame(rows_frame)
print frame.inspect()
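Before uploading, it can help to confirm locally that every row matches the schema. The following is a minimal sketch in plain Python that does not touch the ATK API; the helper name `split_labels_features` is hypothetical and introduced only for illustration.

```python
def split_labels_features(rows, column_names, label_column):
    """Validate row widths, then separate the label from the feature values."""
    label_index = column_names.index(label_column)
    labels, features = [], []
    for i, row in enumerate(rows):
        if len(row) != len(column_names):
            raise ValueError("row %d has %d values, expected %d"
                             % (i, len(row), len(column_names)))
        labels.append(row[label_index])
        features.append([v for j, v in enumerate(row) if j != label_index])
    return labels, features


# The same four rows and column names used for the frame.
rows = [[0, 1, 0, 0], [0, 2, 0, 0], [1, 0, 1, 0], [1, 0, 2, 0]]
labels, features = split_labels_features(rows, ["class", "f1", "f2", "f3"], "class")
print(labels)
print(features)
```

A row with the wrong number of values raises a `ValueError` here, so the mistake surfaces locally instead of after the upload.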
Build a Model¶
nb_model = ia.NaiveBayesModel()
Train the model on the frame. This is a supervised training technique, so the category column is used in the training process. Note that the feature vector is represented as a list of column names.
nb_model.train(frame, "class", ["f1", "f2", "f3"])
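To give a feel for what training estimates, here is a minimal pure-Python sketch of a multinomial Naive Bayes fit with additive (Laplace) smoothing. It assumes the features are non-negative counts and a smoothing constant of 1.0. Both are assumptions made for illustration, and the ATK model may compute its parameters differently. The function name `fit_multinomial_nb` is hypothetical.

```python
import math
from collections import defaultdict


def fit_multinomial_nb(rows, n_features, alpha=1.0):
    """Estimate log priors and log feature likelihoods.

    rows: list of [label, f1, ..., fn] with non-negative counts as features.
    alpha: additive smoothing constant (an assumed default).
    """
    class_rows = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * n_features)
    for row in rows:
        label, values = row[0], row[1:]
        class_rows[label] += 1
        for j, value in enumerate(values):
            feature_sums[label][j] += value

    n_rows = len(rows)
    n_classes = len(class_rows)
    log_prior, log_theta = {}, {}
    for label, count in class_rows.items():
        # Smoothed class prior.
        log_prior[label] = math.log((count + alpha) / (n_rows + alpha * n_classes))
        # Smoothed per-feature likelihoods, normalised within the class.
        total = sum(feature_sums[label]) + alpha * n_features
        log_theta[label] = [math.log((s + alpha) / total)
                            for s in feature_sums[label]]
    return log_prior, log_theta


# The same four rows as the training frame: [class, f1, f2, f3].
rows = [[0, 1, 0, 0], [0, 2, 0, 0], [1, 0, 1, 0], [1, 0, 2, 0]]
log_prior, log_theta = fit_multinomial_nb(rows, n_features=3)
print(math.exp(log_prior[0]))
print([round(math.exp(v), 3) for v in log_theta[0]])
```

With this data, class 0 puts most of its likelihood on f1 and class 1 on f2, which is what separates the two categories.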
Predict assigns a category to a sample in the feature space. To illustrate the workflow, I am predicting on the same frame used to train; normally you would predict on a different frame representing data that does not yet have a category assigned. Again, note that the feature vector is a Python list of column names.
result = nb_model.predict(frame, ["f1", "f2", "f3"])
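The scoring step behind a prediction can be sketched in plain Python as follows. The parameters are assumed values: roughly what a multinomial model with add-one smoothing would estimate from the four training rows. The new samples are hypothetical stand-ins for a frame the model has not seen. None of this reflects ATK internals.

```python
import math

# Assumed parameters (class 0 dominated by f1, class 1 dominated by f2).
LOG_PRIOR = {0: math.log(0.5), 1: math.log(0.5)}
LOG_THETA = {
    0: [math.log(4 / 6.0), math.log(1 / 6.0), math.log(1 / 6.0)],
    1: [math.log(1 / 6.0), math.log(4 / 6.0), math.log(1 / 6.0)],
}


def predict_class(features):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label in LOG_PRIOR:
        scores[label] = LOG_PRIOR[label] + sum(
            x * w for x, w in zip(features, LOG_THETA[label])
        )
    return max(scores, key=scores.get)


# Hypothetical unlabeled samples: [f1, f2, f3].
new_samples = [[3, 0, 0], [0, 1, 0], [1, 4, 0]]
predictions = [predict_class(s) for s in new_samples]
print(predictions)
```

Working in log space turns the product of probabilities into a sum, which avoids numerical underflow when there are many features.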
The result is a frame with a new “predicted_class” column.
print result.inspect()
Run classification metrics on the resultant frame to understand model performance.
cm = result.classification_metrics("class", "predicted_class")
print cm
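For intuition about what such a report summarizes, the sketch below computes the usual binary classification metrics from two plain Python lists. The labels and predictions are hypothetical data chosen to include some errors, and the helper `binary_metrics` is illustrative only. The exact fields ATK returns are not reproduced here.

```python
def binary_metrics(actual, predicted, positive=1):
    """Compute confusion counts, accuracy, precision, recall and F1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)

    accuracy = (tp + tn) / float(len(pairs))
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels and predictions.
actual = [0, 0, 1, 1, 1, 0]
predicted = [0, 1, 1, 1, 0, 0]
metrics = binary_metrics(actual, predicted)
print(metrics)
```

On the tiny training frame above, predicting on the training data itself will tend to look perfect, so metrics like these are more informative on held-out data.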