Friday, September 15, 2017

How to write kNN with TensorFlow

Overview

How do we write machine learning algorithms with TensorFlow?
I usually use TensorFlow only when I write neural networks, but TensorFlow is not just for that: it can also be used to write other machine learning algorithms.
In this article, I roughly implement the kNN algorithm with TensorFlow.





Motivation


For the past year or so, I have mostly used Keras when writing deep learning algorithms, and for tasks without deep learning, sklearn has been working well. The frequency with which I use TensorFlow directly has clearly been decreasing.

But with every update TensorFlow keeps getting more useful, which makes me feel I should touch and study it regularly.

And for the last few months, I have been practicing writing some machine learning algorithms from scratch, so I want to know how to write them in different ways.

Why kNN?


As a supervised learning algorithm, kNN is very simple and easy to write. So I chose it as the first trial of writing a non-neural-network algorithm with TensorFlow.

I briefly explained the details of kNN (k nearest neighbors) in the following articles. Please check those.



Data


For this trial, I used the iris dataset. The data has three classes: setosa, versicolor, and virginica. The purpose of this article is to build a model that classifies data into those three.

Let's make a model


The whole procedure can be separated into three parts: data preparation, writing the algorithm, and training. Strictly speaking, kNN doesn't have the concept of a model to train, so you can read "training" here as prediction.
Anyway, let's go step by step.

Data preparation


The code below prepares the data.

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
import tensorflow as tf

# load data
iris = datasets.load_iris()
x_vals = np.array([x[0:4] for x in iris.data])
y_vals = np.array(iris.target)

# one hot encoding
y_vals = np.eye(len(set(y_vals)))[y_vals]

# normalize
x_vals = (x_vals - x_vals.min(0)) / x_vals.ptp(0)

# train-test split
np.random.seed(59)
train_indices = np.random.choice(len(x_vals), round(len(x_vals) * 0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) - set(train_indices)))

x_vals_train = x_vals[train_indices]
x_vals_test = x_vals[test_indices]
y_vals_train = y_vals[train_indices]
y_vals_test = y_vals[test_indices]

As the comments show, this part can be separated into four phases: loading the data, one-hot encoding, normalization, and the train-test split.
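
The one-hot encoding with np.eye() may look a bit cryptic. Here is a quick illustration of what it does, on a small made-up label array rather than the iris data:

labels = np.array([0, 2, 1])
print(np.eye(3)[labels])
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]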

Algorithm writing


This is the main part of the code. Here, the kNN algorithm is written.

feature_number = len(x_vals_train[0])

k = 5

# placeholders for the train features, train labels and test features
x_data_train = tf.placeholder(shape=[None, feature_number], dtype=tf.float32)
y_data_train = tf.placeholder(shape=[None, len(y_vals[0])], dtype=tf.float32)
x_data_test = tf.placeholder(shape=[None, feature_number], dtype=tf.float32)

# manhattan distance: broadcasting yields shape [num_test, num_train]
distance = tf.reduce_sum(tf.abs(tf.subtract(x_data_train, tf.expand_dims(x_data_test, 1))), axis=2)

# nearest k points: indices of the k smallest distances for each test point
_, top_k_indices = tf.nn.top_k(tf.negative(distance), k=k)
top_k_label = tf.gather(y_data_train, top_k_indices)

# sum the one-hot labels of the k neighbors and take the majority class
sum_up_predictions = tf.reduce_sum(top_k_label, axis=1)
prediction = tf.argmax(sum_up_predictions, axis=1)

For the details of the kNN algorithm, please read kNN by Golang from scratch.
In TensorFlow, we usually define Variables and placeholders. A Variable holds parameters to be updated, and a placeholder holds data fed in at run time. This time, kNN has no parameters to update, so only placeholders are necessary, for the train and test data.
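
Just for contrast, here is a minimal sketch (a hypothetical linear model, not part of the kNN code) of how a Variable and a placeholder usually appear together:

# data fed in at run time
x = tf.placeholder(tf.float32, shape=[None, feature_number])
# parameters that training would update
w = tf.Variable(tf.zeros([feature_number, 3]))
logits = tf.matmul(x, w)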

For the distance, I used Manhattan distance, simply because it keeps the code simple. We could use Euclidean distance or another distance function instead; a sketch of the Euclidean version follows below.

# manhattan distance
distance = tf.reduce_sum(tf.abs(tf.subtract(x_data_train, tf.expand_dims(x_data_test, 1))), axis=2)

tf.subtract does the subtraction. With it we get the differences between the train features and the test features; after converting each value to its absolute value, tf.reduce_sum adds them up along the feature axis.
The resulting distance tensor holds, for each test data point, its distances to all the train data points.
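
If you prefer Euclidean distance, only this one line needs to change. A minimal sketch (the name distance_l2 is mine; the rest of the graph stays the same):

# euclidean distance
distance_l2 = tf.sqrt(tf.reduce_sum(
    tf.square(tf.subtract(x_data_train, tf.expand_dims(x_data_test, 1))), axis=2))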
Next, the part below picks the nearest k points to each test data point and checks their labels. Here the code starts to feel distinctly like TensorFlow.

# nearest k points
_, top_k_indices = tf.nn.top_k(tf.negative(distance), k=k)
top_k_label = tf.gather(y_data_train, top_k_indices)

sum_up_predictions = tf.reduce_sum(top_k_label, axis=1)
prediction = tf.argmax(sum_up_predictions, axis=1)

Roughly, the functions used above do the following.
  • tf.nn.top_k(): gets the k biggest values and their indices
  • tf.negative(): negates the values, so that top_k of the negated distances picks the k smallest distances, i.e. the nearest points
  • tf.gather(): extracts the values at the given indices
  • tf.reduce_sum(): sums the elements along an axis
  • tf.argmax(): gets the index of the maximum value

The role of tf.reduce_sum() is probably the least intuitive part of the code. Keep in mind that top_k_label holds one-hot-encoded labels, so summing them over the k neighbors counts the votes per class, and tf.argmax() then picks the class with the most votes.
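
For a single test point the counting works as below (a made-up set of neighbor labels, sketched in NumPy; in the graph the same sum runs over axis=1 because of the extra test-point dimension):

# one-hot labels of the k = 5 nearest neighbors of one hypothetical test point
top_k_label_example = np.array([[0, 1, 0],
                                [0, 1, 0],
                                [0, 0, 1],
                                [0, 1, 0],
                                [0, 1, 0]])

votes = top_k_label_example.sum(axis=0)  # -> [0, 4, 1]: votes per class
print(votes.argmax())                    # -> 1: the majority class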

Training


This is the training part; for kNN, you could equally call it the prediction part.

sess = tf.Session()
prediction_outcome = sess.run(prediction, feed_dict={x_data_train: x_vals_train,
                               x_data_test: x_vals_test,
                               y_data_train: y_vals_train})

# evaluation
accuracy = 0
for pred, actual in zip(prediction_outcome, y_vals_test):
    if pred == np.argmax(actual):
        accuracy += 1

print(accuracy / len(prediction_outcome))

Simply, this part feeds the placeholders with data. For the evaluation, I checked the rate at which the predicted class equals the actual class.
The outcome is as follows.

0.9666666666666667
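
The same number can also be computed in one line with NumPy, as an equivalent, vectorized form of the loop above:

print(np.mean(prediction_outcome == np.argmax(y_vals_test, axis=1)))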

I wrote a naive kNN algorithm with TensorFlow. At the bottom of the article, I added all the code used here.

Reference


At many points while writing this, I referred to the TensorFlow Machine Learning Cookbook.



This book is not only about the basic use of TensorFlow but also about how to use TensorFlow for many machine learning algorithms. Actually, when I read the TensorFlow tutorial for the first time, its contents were what I wanted.

Code

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
import tensorflow as tf

# prepare data
iris = datasets.load_iris()
x_vals = np.array([x[0:4] for x in iris.data])
y_vals = np.array(iris.target)

y_vals = np.eye(len(set(y_vals)))[y_vals]

x_vals = (x_vals - x_vals.min(0)) / x_vals.ptp(0)

np.random.seed(59)
train_indices = np.random.choice(len(x_vals), round(len(x_vals) * 0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) - set(train_indices)))

x_vals_train = x_vals[train_indices]
x_vals_test = x_vals[test_indices]
y_vals_train = y_vals[train_indices]
y_vals_test = y_vals[test_indices]

feature_number = len(x_vals_train[0])

k = 5

x_data_train = tf.placeholder(shape=[None, feature_number], dtype=tf.float32)
y_data_train = tf.placeholder(shape=[None, len(y_vals[0])], dtype=tf.float32)
x_data_test = tf.placeholder(shape=[None, feature_number], dtype=tf.float32)

# manhattan distance
distance = tf.reduce_sum(tf.abs(tf.subtract(x_data_train, tf.expand_dims(x_data_test, 1))), axis=2)

# nearest k points
_, top_k_indices = tf.nn.top_k(tf.negative(distance), k=k)
top_k_label = tf.gather(y_data_train, top_k_indices)

sum_up_predictions = tf.reduce_sum(top_k_label, axis=1)
prediction = tf.argmax(sum_up_predictions, axis=1)


sess = tf.Session()
prediction_outcome = sess.run(prediction, feed_dict={x_data_train: x_vals_train,
                               x_data_test: x_vals_test,
                               y_data_train: y_vals_train})

# evaluation
accuracy = 0
for pred, actual in zip(prediction_outcome, y_vals_test):
    if pred == np.argmax(actual):
        accuracy += 1

print(accuracy / len(prediction_outcome))