This article was originally published on Towards Data Science on June 18th, 2020.

If you’ve done any deep learning you’ve probably noticed two different types of activation function — the ones used on hidden layers and the one used on the output layer.

Hidden layers mostly share the same activation function. You’re unlikely to see ReLU on the first hidden layer followed by a hyperbolic tangent on the next; it’s usually ReLU or tanh all the way.

But we’re here to talk about the output layer. There we need a function that takes arbitrary real values and transforms them into a probability distribution.

Softmax function to the rescue.

The function is great for **classification** problems, especially multi-class ones, as it reports a “confidence score” for each class. Since we’re dealing with probabilities here, the scores returned by the softmax function each lie between 0 and 1 and add up to 1.

The predicted class is, therefore, the one with the highest confidence score.

Right now we’ll see how the softmax function is expressed mathematically, and then how easy it is to translate it into Python code.

### Mathematical representation

According to the Wikipedia page on the softmax function, here’s its formula:
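For an input vector $\mathbf{z}$ of $K$ real numbers, the standard definition can be written as follows (the symbol names $\mathbf{z}$, $K$, $i$, and $j$ are notation chosen here for illustration):

$$
\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
$$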

It might be daunting to look at first, but it’s one of the simpler functions you’ll encounter while studying deep learning.

It states that we need to apply the standard exponential function to each element of the output layer, and then normalize these values by dividing each one by the sum of all the exponentials. Doing so ensures that the normalized values add up to 1.

Please take the time to read through the previous paragraph multiple times if needed, as it will be crucial for further understanding. If it’s still a bit fuzzy, we’ve prepared a (hopefully) helpful diagram:

Here are the steps:

- Exponentiate every element of the output layer and sum the results (around 181.73 in this case)
- Take each element of the output layer, exponentiate it, and divide by the sum obtained in the first step
  *(exp(1.3) / 181.73 = 3.67 / 181.73 = 0.02)*

By now I hope you know how the softmax activation function works in theory. In the next section, we’ll implement it from scratch in NumPy.

### Implementation

This part will be easy and intuitive if you’ve understood the previous section. If not, the simple Python implementation should still help with general understanding.

To start, let’s declare an array which imitates the output layer of a neural network:

```
import numpy as np

output_layer = np.array([1.3, 5.1, 2.2, 0.7, 1.1])
output_layer
>>> array([1.3, 5.1, 2.2, 0.7, 1.1])
```

Think of this as a K-class classification problem, where K is 5. Up next, we need to exponentiate each element of the output layer:

```
exponentiated = np.exp(output_layer)
exponentiated
>>> array([ 3.66929667, 164.0219073 , 9.0250135 , 2.01375271, 3.00416602])
```

And now we’re ready to calculate probabilities! We can use NumPy to divide each element by the sum of the exponentiated values and store the results in another array:

```
probabilities = exponentiated / np.sum(exponentiated)
probabilities
>>> array([0.02019046, 0.90253769, 0.04966053, 0.01108076, 0.01653055])
```

And that’s it — these are the target class probabilities obtained from the output layer’s raw values.
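As mentioned earlier, the predicted class is the one with the highest probability. Here is a minimal sketch of that lookup, assuming the probabilities from the example above (rounded to four decimals). The class names are hypothetical and serve only as illustration:

```python
import numpy as np

# Probabilities from the example above, rounded for readability.
probabilities = np.array([0.0202, 0.9025, 0.0497, 0.0111, 0.0165])

# Hypothetical class names, used only for illustration.
class_names = ["cat", "dog", "bird", "fish", "horse"]

# np.argmax returns the index of the largest value in the array.
predicted_index = int(np.argmax(probabilities))
predicted_label = class_names[predicted_index]

print(predicted_index, predicted_label)  # prints: 1 dog
```

The second element (index 1) wins because it holds roughly 90% of the probability mass.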

Earlier we mentioned that the probabilities should add up to 1, so let’s quickly verify that:

```
probabilities.sum()
>>> 1.0
```
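The steps above can be collected into one reusable function. The sketch below is one possible way to do it, not the only one. The function name `softmax` and the maximum-subtraction step are additions that don't appear in the walkthrough above. Subtracting the maximum is a common trick that leaves the result mathematically unchanged but keeps `np.exp` from overflowing on large inputs:

```python
import numpy as np


def softmax(logits):
    """Return softmax probabilities for a 1-D sequence of raw scores."""
    logits = np.asarray(logits, dtype=float)
    # Assumption not in the walkthrough: shift by the maximum so that the
    # largest exponent is exp(0) = 1, which avoids overflow in np.exp.
    shifted = logits - np.max(logits)
    exponentiated = np.exp(shifted)
    return exponentiated / np.sum(exponentiated)


# Same output layer as in the walkthrough.
probabilities = softmax([1.3, 5.1, 2.2, 0.7, 1.1])
print(probabilities)

# Scores this large would overflow a naive np.exp call.
large = softmax([1000.0, 1001.0, 1002.0])
print(large)
```

For the walkthrough's output layer, this gives the same probabilities as the step-by-step version up to floating-point rounding.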

And we’re done. If you understand this, you understand the softmax activation function. Let’s wrap things up in the next section.

### Before you go

The softmax function should be pretty straightforward to understand. Thanks to libraries like TensorFlow and PyTorch, we don’t need to implement it manually. That doesn’t mean we shouldn’t know how it works under the hood.

I hope the article was easy enough to understand and follow. Thanks for reading.