# TensorFlow and Machine Learning Basics

Machine Learning is not just a field for scientists any more. Now almost every company is taking advantage of easy-to-use frameworks that make it pretty simple to write machine learning programs and integrate them into business applications. TensorFlow is one such framework and is the subject of this article.

# TensorFlow Basics

Every TensorFlow Python program generally imports the **tensorflow** package as **tf**:

The basic building block in TensorFlow is called a **tensor**, oddly enough. The easiest example of a tensor is just a simple constant node which always emits the same value:

As you can see, what gets created is two tensor objects. In order to actually get the values out of them, we must run them through a **session**:
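A minimal sketch of the import, the two constants, and the session run. It assumes the TensorFlow 1.x graph API, reached here through `tensorflow.compat.v1` so that it also runs on 2.x installations. The names `node1` and `node2` and the values 3.0 and 4.0 are illustrative assumptions:

```python
import tensorflow.compat.v1 as tf  # assumption: 1.x graph API via the compat module

tf.disable_v2_behavior()  # keep graph-and-session semantics on TensorFlow 2.x

# Two constant nodes; the values 3.0 and 4.0 are arbitrary.
node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0)  # dtype is inferred as float32

# Printing a node shows the tensor object, not the number it holds.
print(node1, node2)

# A session evaluates the nodes and returns their values.
with tf.Session() as sess:
    values = sess.run([node1, node2])
print(values)
```

The first `print` shows tensor descriptions (name, shape, dtype); only the values returned by `sess.run` are the actual numbers.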

You can use existing tensors to create other tensors. For example, you can make a simple adder that adds the values of two nodes together:
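As a rough sketch under the same assumptions (1.x graph API through `tensorflow.compat.v1`, illustrative node names), an adder over two constants might look like this:

```python
import tensorflow.compat.v1 as tf  # assumption: 1.x graph API via the compat module

tf.disable_v2_behavior()

node1 = tf.constant(3.0)
node2 = tf.constant(4.0)

# A new tensor built from two existing ones; nothing is computed yet.
node3 = tf.add(node1, node2)

with tf.Session() as sess:
    total = sess.run(node3)  # the addition happens here
print(total)
```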

The next type of tensor is a **placeholder**. It puts a name on a value that must be supplied when the network of tensors is run, so it is essentially an input.

Using placeholders, we can create an adder function that can add values specified at run time:
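One way this could look, again assuming the 1.x graph API through `tensorflow.compat.v1`. The names `a`, `b`, and `adder_node` and the fed values are illustrative:

```python
import tensorflow.compat.v1 as tf  # assumption: 1.x graph API via the compat module

tf.disable_v2_behavior()

# Placeholders are inputs whose values are supplied at run time.
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b  # shorthand for tf.add(a, b)

with tf.Session() as sess:
    # feed_dict maps each placeholder to the value it should take.
    scalar_sum = sess.run(adder_node, feed_dict={a: 3.0, b: 4.5})
    vector_sum = sess.run(adder_node, feed_dict={a: [1.0, 3.0], b: [2.0, 4.0]})
print(scalar_sum)
print(vector_sum)
```

The same `adder_node` handles both calls: the first adds two scalars, the second adds two vectors element by element.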

With this example, I hope you can start to see the power of TensorFlow. Not only can the adder add simple numbers, it can also add vectors (or even matrices). In Machine Learning, working with vectors and matrices is a key part of making computations more efficient.

In this way, TensorFlow is a lot like Matlab, Octave, and other numerical programming frameworks that easily work with these higher-dimensional representations of data.

You can wrap the output of the adder with another tensor to easily chain operations together:
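A small sketch of such chaining, under the same assumptions as before. The name `add_and_triple` and the factor 3 are made up for illustration:

```python
import tensorflow.compat.v1 as tf  # assumption: 1.x graph API via the compat module

tf.disable_v2_behavior()

a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b

# Wrap the adder's output in another operation.
add_and_triple = adder_node * 3.0

with tf.Session() as sess:
    result = sess.run(add_and_triple, feed_dict={a: 3.0, b: 4.5})
print(result)
```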

The next type of tensor is called a **variable**. Developers are quite familiar with variables, but they have a special purpose in TensorFlow. We’ll describe them in the next section.

# Linear Regression Example

Let’s build a simple linear regression model:
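A sketch of how such a model could be declared, assuming the 1.x graph API through `tensorflow.compat.v1`. The names **W**, **b**, **x**, and **linearModel** and the initial values come from the text; the exact declarations are an assumption:

```python
import tensorflow.compat.v1 as tf  # assumption: 1.x graph API via the compat module

tf.disable_v2_behavior()

# Trainable parameters with their initial default values.
W = tf.Variable([0.3], dtype=tf.float32)
b = tf.Variable([-0.3], dtype=tf.float32)

# The input feature, supplied at run time.
x = tf.placeholder(tf.float32)

# The output tensor: y = W * x + b
linearModel = W * x + b
```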

In this case, we have built a model with one feature, **x** (the input), and two parameters, **W** and **b** (machine learning scientists often call these **Theta1** and **Theta0**, respectively). The **linearModel** tensor is the output tensor, i.e., it is the value we are looking for.

In order to correctly predict the output value, we need to train the model by adjusting the values of **W** and **b**. We give them initial default values of **0.3** and **-0.3**, but during training these values will change. In order to actually do the training, we must first initialize the variables to the default values (we can run the same initialize again if we want to reset the values back to the defaults):

Alright. Now in every linear regression problem, we have some input data, often called training data, that specifies what we are trying to calculate with our model. Say we have the following training data:

In this case, **x_train** is a set of values that are sent into the model, whereas **y_train** is the expected output of the model. Recall from above that our model is basically **y = W * x + b**. Our goal is to find the values for **W** and **b** such that when we set **x** to **1**, **y** will be **0**; when **x** is **2**, **y** will be **-1**, and so on. **W** and **b** have initial values of **0.3** and **-0.3**, so we can just run the model as is with the **x_train** input and it will spit out a value for each of the inputs:
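Putting the last few steps together, here is a hedged sketch of initializing the variables, defining the training data, and running the untrained model. The first two data points come from the text; the remaining two are an assumption that continues the same pattern. The 1.x graph API through `tensorflow.compat.v1` is assumed as well:

```python
import tensorflow.compat.v1 as tf  # assumption: 1.x graph API via the compat module

tf.disable_v2_behavior()

W = tf.Variable([0.3], dtype=tf.float32)
b = tf.Variable([-0.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
linearModel = W * x + b

# Training data; the last two pairs are assumed to follow the pattern.
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]

# Running this op sets (or resets) every variable to its initial value.
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    predictions = sess.run(linearModel, feed_dict={x: x_train})
print(predictions)  # approximately 0, 0.3, 0.6, 0.9 (float32 rounding)
```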

The first output value is **0**, which is correct. The other three, however, are wrong. So we need to change **W** and **b**. In order to train our model, we need to be able to input the expected values, so we define a placeholder to take these in:

In machine learning, the expected outputs you specify are called **labels**. Some machine learning models don’t have labels: you pass in input and the output is something that you didn't know but are hoping to learn. A good example of this is a clustering model, where you are trying to divide people up into groups but you don’t necessarily know what the groups are, so you are searching for possible correlations between them that may be hidden.

In order to adjust our parameters **W** and **b**, we need something that tells us if we are going in the right direction. This thing is called a **cost** or **loss** function, and it essentially calculates how far off we are from accurate predictions. One typical way of calculating the cost function is to take the difference between the predicted value and the expected value and then to square it (the reason we square the value is explained below when we talk about how gradient descent works).

When you sum all the squared differences between the predictions and expected values, you get a single number that gives you an idea of how far off you are. We do this in TensorFlow by first using the **square** function on the difference between the **linearModel** (which computes the predicted value given **x**) and **y** (which is our expected values):

Then we sum all the deltas together using the **reduce_sum** function:

Don’t get tripped up by the name **reduce_sum**: this function isn't trying to minimize the sum, it is reducing a vector down to a scalar by summing all the values together. The resulting tensor, which we called **loss**, outputs the value that we want to try to make as small as possible. We can compute the current loss value given the input and expected output values:
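The loss steps described above might be sketched as follows. The 1.x graph API through `tensorflow.compat.v1` is assumed, the name `squared_deltas` is illustrative, and the last two training pairs are an assumption that continues the pattern given in the text:

```python
import tensorflow.compat.v1 as tf  # assumption: 1.x graph API via the compat module

tf.disable_v2_behavior()

W = tf.Variable([0.3], dtype=tf.float32)
b = tf.Variable([-0.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
linearModel = W * x + b

# Placeholder for the expected outputs (the labels).
y = tf.placeholder(tf.float32)

# Square each difference, then collapse the vector to one number.
squared_deltas = tf.square(linearModel - y)
loss = tf.reduce_sum(squared_deltas)

x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]  # last two pairs assumed

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    current_loss = sess.run(loss, feed_dict={x: x_train, y: y_train})
print(current_loss)  # about 23.66 for this data
```

By hand, the squared differences are 0, 1.69, 6.76, and 15.21, which sum to 23.66.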

We will ultimately find that the optimal values of **W** and **b** are **-1** and **1**, and we can run our loss function again with these values to see that our cost goes to 0:
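One way to check this is to overwrite the variables by hand with `tf.assign` and evaluate the loss again. This is a sketch under the same assumptions as before (1.x graph API through `tensorflow.compat.v1`, assumed training data); the names `fixW` and `fixb` are illustrative:

```python
import tensorflow.compat.v1 as tf  # assumption: 1.x graph API via the compat module

tf.disable_v2_behavior()

W = tf.Variable([0.3], dtype=tf.float32)
b = tf.Variable([-0.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
linearModel = W * x + b
loss = tf.reduce_sum(tf.square(linearModel - y))

x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]  # last two pairs assumed

# Ops that overwrite the variables with the optimal values.
fixW = tf.assign(W, [-1.0])
fixb = tf.assign(b, [1.0])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run([fixW, fixb])
    zero_loss = sess.run(loss, feed_dict={x: x_train, y: y_train})
print(zero_loss)
```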

# Training

Up to this point, all we have really done with TensorFlow is build up a series of calculations based on some inputs. We could have done this in any programming language, but TensorFlow made it relatively simple, even if the inputs are vectors or matrices. Now we get to the real fun of machine learning where we train the model to find optimal parameters.

For this example, we are going to use a method called **gradient descent**. The way this method works is to take a partial derivative of the loss function in order to determine how to adjust the parameters **W** and **b**. If you remember your calculus, a derivative essentially tells you how fast something is changing — acceleration is the derivative of velocity, and velocity is the derivative of position. In this case, the derivative is essentially the slope of a point on the graph of the loss function which could look something like this:

What we are looking for are the points lowest on the graph which are shown in blue and represent the points where the error is lowest. The gradient descent method basically works the same way you would try to find the fastest way down a mountain. You would look around you and follow the slope that leads you downwards.

The derivative tells gradient descent the way to adjust the value of a given variable that will cause the overall loss to go down. You take a step, and then you do it again, over and over until the slope is flat or begins to slope in the wrong direction (i.e., you have hit the bottom).

As you can see from the graph above, depending on where you start, you might find a local minimum that isn't the global minimum, so you might have to run the method many times starting at different points to see if you find a better result.

Another factor that is part of this method is called the **learning rate**, which is essentially the size of the step you take each time. There is a trade-off in choosing the learning rate: one that is too small will make finding the minimum slow because you are taking tiny steps, while one that is too big can overshoot the minimum and can even diverge away from it.

Sometimes it is necessary to check while the model is training to ensure that the value of the loss function keeps going down so that you know you are converging on the minimum.
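To make the step-taking, the learning-rate trade-off, and the loss monitoring concrete, here is a plain-Python sketch of gradient descent on the linear model from the previous section. It is hand-written arithmetic, not TensorFlow's implementation; the function names, the three learning rates, and the last two training pairs are assumptions chosen for illustration:

```python
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [0.0, -1.0, -2.0, -3.0]  # last two pairs assumed


def loss(W, b):
    """Sum of squared differences between predictions and labels."""
    return sum((W * x + b - y) ** 2 for x, y in zip(x_train, y_train))


def gradient_descent(learning_rate, steps, W=0.3, b=-0.3):
    """Run plain gradient descent and return (W, b, loss history)."""
    history = [loss(W, b)]
    for _ in range(steps):
        # Partial derivatives of the summed squared error.
        grad_W = sum(2 * (W * x + b - y) * x for x, y in zip(x_train, y_train))
        grad_b = sum(2 * (W * x + b - y) for x, y in zip(x_train, y_train))
        # Step downhill; the learning rate sets the step size.
        W -= learning_rate * grad_W
        b -= learning_rate * grad_b
        history.append(loss(W, b))
    return W, b, history


_, _, slow = gradient_descent(0.0001, 50)     # tiny steps
_, _, good = gradient_descent(0.01, 50)       # reasonable steps
_, _, diverging = gradient_descent(0.05, 50)  # steps too large

print(slow[-1], good[-1], diverging[-1])
```

Watching the history is the monitoring described above: with the middle rate the loss falls at every step, with the smallest it falls only slowly, and with the largest it grows instead of shrinking.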

TensorFlow has a built-in implementation of gradient descent and all we need to do is specify the learning rate. This optimizer will go through our tensors and find the variables that need to be adjusted in order to minimize our loss:
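A sketch of wiring up the optimizer, assuming the 1.x graph API through `tensorflow.compat.v1`. The learning rate of 0.01 is an assumption, as are the last two training pairs; a single training step is run only to show that the loss drops:

```python
import tensorflow.compat.v1 as tf  # assumption: 1.x graph API via the compat module

tf.disable_v2_behavior()

W = tf.Variable([0.3], dtype=tf.float32)
b = tf.Variable([-0.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
linearModel = W * x + b
loss = tf.reduce_sum(tf.square(linearModel - y))

# Gradient descent with an assumed learning rate of 0.01.
optimizer = tf.train.GradientDescentOptimizer(0.01)
# minimize() finds the variables behind `loss` and builds the update op.
train = optimizer.minimize(loss)

feed = {x: [1, 2, 3, 4], y: [0, -1, -2, -3]}  # last two pairs assumed

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    loss_before = sess.run(loss, feed_dict=feed)
    sess.run(train, feed_dict=feed)  # one gradient descent step
    loss_after = sess.run(loss, feed_dict=feed)
print(loss_before, loss_after)
```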

You might be asking yourself, “how do I know what a good learning rate is?” This part is really about math, but each step in gradient descent computes a small difference to add to one of the parameters, and the formula looks essentially like:
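Reconstructed from the description that follows, and so an assumption rather than the exact original formula, the update for a parameter has roughly this shape. The symbol names and the factor for the input value are assumptions: for **b** the factor is 1, and for **W** it is the input **x** of that sample.

```latex
\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m}
  \left( \text{predicted\_value}_i - \text{expected\_value}_i \right) x_{j,i},
\qquad x_{0,i} = 1 \ (\text{for } b), \quad x_{1,i} = x_i \ (\text{for } W)
```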

In this formula, **alpha** is the learning rate, **m** is the number of samples we have, and the **predicted_value** and **expected_value** are the sets of predictions and expected outputs; we take the differences and sum them. This results in a small value that adjusts the parameter we are currently training. Remember how our loss function squared the differences? The derivative of **x²** is **2x**, so the derivative of the square of the differences becomes 2 times the differences, and that constant factor is usually folded into the learning rate.

The summed differences are scaled down because we only want to take a small step, so we multiply by the learning rate and by 1/m. If the number of samples you have is large, 1/m will be small, so the learning rate can be larger (which is good because you have a lot more calculations to do for each step with all your training samples). If the number of samples is small, you will likely need to make the learning rate smaller in order to not miss the minimum.
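As a worked example of a single update, here is the arithmetic in plain Python for the starting values **W = 0.3** and **b = -0.3**. The learning rate of 0.01, the last two training pairs, and the extra factor of **x** in the update for **W** are assumptions. This follows the 1/m form described above, which is not necessarily the exact step TensorFlow's optimizer takes for a summed loss:

```python
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [0.0, -1.0, -2.0, -3.0]  # last two pairs assumed

W, b = 0.3, -0.3
alpha = 0.01      # assumed learning rate
m = len(x_train)  # number of samples

# predicted_value - expected_value for every sample
differences = [(W * x + b) - y for x, y in zip(x_train, y_train)]

# One step for each parameter; W's differences are weighted by x.
W_new = W - alpha * (1 / m) * sum(d * x for d, x in zip(differences, x_train))
b_new = b - alpha * (1 / m) * sum(differences)

print(differences)   # roughly [0, 1.3, 2.6, 3.9]
print(W_new, b_new)  # roughly 0.235 and -0.3195
```

Both parameters move a little toward their optimal values of **-1** and **1**, which is all one step is meant to do.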

Now we actually have to run the training, so first we reset the variables back to their initial values:

Then we run our training data through the gradient descent a bunch of times (i.e., we take a lot of steps — in this case 1000):

When this is done, we have new values in **W** and **b** that should allow us to correctly predict the output based on the input:

As you can see, we didn't exactly hit the minimal values **-1** and **1**, but we are pretty darn close. How close? Let’s check the value of the loss function now:
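The whole training run (reset, 1000 steps, then reading back the parameters and the loss) could be sketched like this. The 1.x graph API through `tensorflow.compat.v1`, the learning rate of 0.01, and the last two training pairs are assumptions:

```python
import tensorflow.compat.v1 as tf  # assumption: 1.x graph API via the compat module

tf.disable_v2_behavior()

W = tf.Variable([0.3], dtype=tf.float32)
b = tf.Variable([-0.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
linearModel = W * x + b
loss = tf.reduce_sum(tf.square(linearModel - y))
train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)  # assumed rate

feed = {x: [1, 2, 3, 4], y: [0, -1, -2, -3]}  # last two pairs assumed
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)  # reset W and b to their initial values
    for _ in range(1000):
        sess.run(train, feed_dict=feed)  # one gradient descent step
    # Read back the trained parameters and the final loss.
    final_W, final_b, final_loss = sess.run([W, b, loss], feed_dict=feed)
print(final_W, final_b, final_loss)
```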

That’s really small. We could train even more and get the loss smaller, but the parameters are already within a small fraction of a percent of their optimal values, so further training would improve them only marginally.

# Conclusion

This is just an introduction to TensorFlow, and there is so much more that can be done with it. In reality, it is a framework that will allow you to build almost any Machine Learning program you can think of.

One limitation is that you may end up writing a lot of code in order to describe a complicated mathematical calculation that is done as part of your Machine Learning algorithm.

As time goes on, however, developers are contributing libraries that will do more of this heavy lifting and let you focus more on analyzing the results. You can read more at https://tensorflow.org.

*Originally published on November 20, 2017.*