Naive Bayes¶
Naive Bayes is a probabilistic classifier that applies Bayes' theorem with the simplifying ("naive") assumption that features are conditionally independent given the class.
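Concretely, the model assigns the class with the largest posterior probability, computed from the class prior and the per-feature likelihoods under the conditional-independence assumption. In the standard textbook form, shown here only for orientation:

\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)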
Setup¶
Establish a connection to the ATK REST server. This handle will be used for the remainder of the script.
Get your server URL and credentials file from the TAP administrator.
# The ATK Python client is assumed to be installed; the TAP documentation imports it as "ia".
import os
import trustedanalytics as ia

atk_server_uri = os.getenv("ATK_SERVER_URI", ia.server.uri)
credentials_file = os.getenv("ATK_CREDENTIALS", "")
Set the server URI, then use the credentials file to connect to the ATK.
ia.server.uri = atk_server_uri
ia.connect(credentials_file)
Workflow¶
The general workflow is: build a frame, build a model, train the model on the frame, predict using the model, and evaluate the results using classification metrics.
Build a Frame¶
Construct a frame to be uploaded; this is done using Python lists uploaded to the server.
Each row represents a sample drawn from a probability distribution: a feature vector associated with a category. For the purposes of this example there are two categories (e.g. cat and dog) and three features that indicate whether the sample is a cat or a dog (weight, height, fur type).
The frame has the schema class, feature 1, feature 2, feature 3, where class is the category that the sample belongs to.
rows_frame = ia.UploadRows([[0, 1, 0, 0],
                            [0, 2, 0, 0],
                            [1, 0, 1, 0],
                            [1, 0, 2, 0]],
                           [("class", ia.float32),
                            ("f1", ia.int32),
                            ("f2", ia.int32),
                            ("f3", ia.int32)])
frame = ia.Frame(rows_frame)
print frame.inspect()
Build a Model¶
nb_model = ia.NaiveBayesModel()
Train the model on the frame. This is a supervised training technique, so the category is used in the training process. Note that the feature vector is represented as a list of column names.
nb_model.train(frame, "class", ["f1", "f2", "f3"])
Predict assigns a category to a sample in the feature space.
# For the purposes of illustrating the workflow, I am predicting on the
# same frame used to train; normally you would predict on a different
# frame representing data that didn't have a category assigned to it
# (a sketch of that case follows at the end of this section).
# Again, note the feature vector is a Python list of column names.
result = nb_model.predict(frame, ["f1", "f2", "f3"])

# The result is a frame with a new "predicted_class" column.
print result.inspect()

# Run classification metrics on the resultant frame to understand
# model performance.
cm = result.classification_metrics("class", "predicted_class")
print cm
# Assert the results are correct: the confusion matrix should show both
# classes predicted perfectly (two correct predictions each, no errors).
assert cm.confusion_matrix.values[0][0] == 2
assert cm.confusion_matrix.values[1][1] == 2
assert cm.confusion_matrix.values[0][1] == 0
assert cm.confusion_matrix.values[1][0] == 0
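As the comments above note, prediction would normally run on a frame of data that has no class column. Here is a minimal sketch of that case, reusing only the UploadRows, Frame, and predict calls shown earlier; the feature values are invented purely for illustration.

# Hypothetical unlabeled samples with the same three feature columns.
new_rows = ia.UploadRows([[3, 0, 0],
                          [0, 3, 0]],
                         [("f1", ia.int32),
                          ("f2", ia.int32),
                          ("f3", ia.int32)])
new_frame = ia.Frame(new_rows)

# predict appends a "predicted_class" column to the resulting frame.
new_result = nb_model.predict(new_frame, ["f1", "f2", "f3"])
print new_result.inspect()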