Understand the basics of the Sigmoid function

An in-depth look into the popular activation function.

Jithin Raghavan
May 18, 2022

Why did people invest time in developing the sigmoid function when simpler functions were available? As I answer that for you, I hope you will have full clarity on the sigmoid function by the end of this post.

A real-world example

Tedd is a data science professional in the human resources department of a fast-growing start-up. He has built an attrition model to predict which employees are likely to leave the firm in the coming months. The model scores each employee with a value ranging from negative to positive infinity: higher scores indicate stronger evidence that the employee will leave, and lower scores indicate stronger evidence that they will stay.

For ease of interpretation, Tedd wants to convert this score into, say, a probability value. Comparing two large real numbers (as scores) is tougher than comparing their probability values, which lie between 0 and 1 (duh!). If the probability is 95%, the HR team can then get in touch with the employee to discuss retention policies. If the probability is around 5%, then it is not a matter of immediate concern. How can he achieve this?

Understanding the bounds

Consider a training sample of employees and their respective scores. This sample is going to consist of a “finite” number of employees. The output score, thus, will range between two finite values (and not “infinity” as we stated earlier). Let us say we got the highest score of +50 and the lowest score of -50 when we trained the model. Our goal is to map this range of (-50,50) to the range of (0,1). The easiest and “first-principle” approach would be to draw a straight line, as shown below:

Example of a linear fit and why it is not the best choice.

An employee who scores -50 maps to a probability of zero, and an employee who scores +50 maps to a probability of one. Everything between the two bounds will get mapped in a linear fashion.
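To make this concrete, here is a minimal sketch of the straight-line mapping in Python (the function name is mine; the bounds of -50 and +50 are the ones from our training sample):

    def linear_probability(score, low=-50.0, high=50.0):
        # Map a raw model score onto (0, 1) with a straight line.
        return (score - low) / (high - low)

    print(linear_probability(-50))  # 0.0
    print(linear_probability(0))    # 0.5
    print(linear_probability(50))   # 1.0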

Now, this may seem like the most natural approach, and it is simple, but it is not the best choice. Two reasons why:

1. What happens when the score is beyond +50 or -50? In the future, when we use this model, if an employee gets a score of, say, 75, the straight line maps it out of bounds. A value less than 0 or greater than 1 is no longer interpretable as a probability.

2. The bigger reason is the rate at which the probability changes for a change in the model score. Here is an example of what I mean:

An employee with a score of 0 has a 50% chance of exiting (probability = 0.5). There isn’t enough evidence to predict his attrition. If his score had been 2 instead, we would then have a lot more information in comparison to a score of 0. An increase of 2 units in the score gives us “so much more” information to predict if the employee will exit the firm. We expect the probability now (with a score of 2) to take a “big leap” from 0.5. This “leap” is not captured by the straight line.

Likewise, at the other end, if an employee gets a score of 48, we'd say there is a high chance he/she will leave. But should this score change from 48 to 50, our prediction stays pretty much the same: there was already enough evidence, so an increase of 2 does not affect our prediction. This plateauing behaviour, again, isn't captured by the straight line. Enter the sigmoid curve:

The green curve is the ‘Sigmoid’ which accurately captures the rate of change of probability.

The sigmoid solves these two issues:

1. It always remains between 0 and 1. Even for a very high score, say, 1M, the sigmoid, even though very close to 1, never touches 1. And likewise, it never touches the value of 0.

2. From a point of no information (score = 0), a small increase in score has a big impact on the probability. But if we already have “enough” evidence, the same amount of change will hardly impact the probability.

The math behind the sigmoid

The equation below best represents the curve we drew above.

Sigmoid function:

p(x) = 1 / (1 + e^-x)

To confirm that it does, let us input the limiting values of the score (x → +∞ and x → −∞) as well as 0 into the function. As you can see below, the output values (probabilities) are exactly what we observe in the graph. The equation above, thus, represents the correct formulation of the sigmoid function.

As x → +∞, e^-x → 0, so p(x) → 1.
As x → −∞, e^-x → ∞, so p(x) → 0.
At x = 0, p(0) = 1 / (1 + e^0) = 1/2 = 0.5.
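Here is a minimal sketch in Python that evaluates the function at a few scores. It demonstrates the properties discussed above: the output stays strictly between 0 and 1, the probability takes a big leap near a score of 0, and it plateaus once the score is already large. (At scores like -50 or 50 the true value is only approximately 0 or 1, but floating-point arithmetic rounds it.)

    import math

    def sigmoid(x):
        # p(x) = 1 / (1 + e^-x)
        return 1.0 / (1.0 + math.exp(-x))

    for score in [-50, 0, 2, 10, 12, 50]:
        print(f"score {score:>4} -> probability {sigmoid(score):.7f}")

    # score    0 -> 0.5000000  (point of no information)
    # score    2 -> 0.8807971  (a big leap from 0.5)
    # score   10 -> 0.9999546
    # score   12 -> 0.9999939  (+2 barely moves the needle anymore)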

Let us also look at the derivative of the sigmoid. It is a very important concept in ML applications such as neural networks: the derivative is what lets us compute the gradient of the loss function, which in turn is how we train our machine learning models.

We already know p(x). To compute the derivative, we use the quotient rule, since both the numerator and the denominator are differentiable. The quotient rule defines the derivative as below:

Quotient rule: (N/D)' = (N'·D - N·D') / D^2

where D is the denominator and N is the numerator of the fraction.

Because our sigmoid function’s numerator is 1 (a constant), its derivative is 0. We use the chain rule to take the derivative of the denominator (1+e^-x):

Chain rule: d/dx (1 + e^-x) = e^-x · d/dx (-x) = -e^-x

Coming back to taking the sigmoid derivative using the quotient rule:

p'(x) = (0 · (1 + e^-x) - 1 · (-e^-x)) / (1 + e^-x)^2 = e^-x / (1 + e^-x)^2

Now, can we reduce this to something a bit easier to work with? Yes! This is exactly what makes the sigmoid function so popular: its cool derivative property. The derivative of the sigmoid function can be split into a product of two terms, as below:

Re-writing the derivative of the sigmoid function:

p'(x) = [1 / (1 + e^-x)] · [e^-x / (1 + e^-x)] = p(x) · (1 - p(x))

since e^-x / (1 + e^-x) = 1 - 1 / (1 + e^-x) = 1 - p(x).

The derivative of the sigmoid is therefore simply the product of the function and one minus the function.
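As a quick sanity check, here is a minimal sketch that compares p(x) · (1 - p(x)) against a numerical (finite-difference) estimate of the derivative:

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def sigmoid_derivative(x):
        # p'(x) = p(x) * (1 - p(x)), the property derived above
        p = sigmoid(x)
        return p * (1.0 - p)

    h = 1e-6  # small step for the finite-difference estimate
    for x in [-2.0, 0.0, 3.0]:
        numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
        print(x, sigmoid_derivative(x), numeric)  # the two values agree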

Interpreting the derivative

It's not just the simple form; the derivative of the sigmoid also has a good intuitive interpretation. Let us look at three cases:

  1. If p = 1, the derivative comes to 1 * (1 - 1) = 0. What does this mean? Since the probability is already very high, the score is also very high, and a small change in the score results in essentially no change in the probability.
  2. If p = 0, the derivative comes to 0 * (1 - 0) = 0. Here the score is very low (a large negative number); we already have strong evidence that the employee will stay, so again a small change in the score barely moves the probability.
  3. Finally, if p = 0.5, the derivative comes to 0.5 * (1 - 0.5) = 0.25, its maximum value. At this point we are maximally uncertain about whether the employee will quit. And it makes sense that this is where the probability changes most for a minor change in score: moving away from the point of no information, in either direction, gives us a lot of new information, so the probability changes quickly and the derivative is large (see the quick check below).
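Plugging the three cases into p · (1 - p) confirms the intuition; a quick sketch (using 0.99 and 0.01 rather than exactly 1 and 0, since, as we saw, the sigmoid never quite reaches those values):

    for p in [0.99, 0.5, 0.01]:
        print(p, p * (1 - p))

    # 0.99 -> 0.0099  (curve almost flat: high score, little new information)
    # 0.5  -> 0.25    (curve at its steepest)
    # 0.01 -> 0.0099  (almost flat again)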

I hope this post helps you grasp the sigmoid at a level where you now understand why it is so widely used in machine learning.
