15+ YEARS OF E-COMMERCE CONSULTING

HELPED OVER 1,000 SUCCESSFUL STORE OWNERS

GENERATED MILLIONS IN MONTHLY REVENUE

# Exploring Activation and Loss Functions in Machine Learning

In this post, we’re going to discuss the most widely-used activation and loss functions for machine learning models. We’ll take a brief look at the foundational mathematics of these functions and discuss their use cases, benefits, and limitations.

Without further ado, let’s get started!

## What is an Activation Function?

To learn complex patterns in data, each node in a neural network passes its input through a function that limits and defines the node’s output value. In other words, the function takes the signal arriving from the previous layer and converts it into a form the next layer can use. This function is called an activation function.

### Need for an Activation Function

1. Restricting values: A bounded activation function such as sigmoid or tanh keeps a node’s output within a fixed range. Without it, values can become vanishingly small or extremely large as they are repeatedly multiplied through successive layers, which is related to the vanishing and exploding gradient problems.
2. Adding non-linearity: Without activation functions, the layers of a network are just linear operations stacked on top of one another, and a composition of linear operations is itself linear. Thus, however many layers it has, a neural network without activation functions is essentially a linear regression model.
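The second point can be checked numerically. The sketch below is only an illustration: the 2x2 weight matrices `W1` and `W2`, the input `x`, and the helper names are made up for this example. It shows that two stacked linear layers give the same output as one layer whose weights are the product of the two, and that placing a ReLU between them breaks this equivalence.

```python
# Hypothetical 2x2 weights for two layers (illustrative values only).
W1 = [[1.0, 2.0], [0.5, -1.0]]
W2 = [[-1.0, 0.0], [2.0, 1.0]]


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Apply ReLU element-wise to a vector."""
    return [max(0.0, vi) for vi in v]


x = [1.0, -3.0]

# Two linear layers applied one after the other, with no activation between them.
stacked = matvec(W2, matvec(W1, x))

# A single linear layer whose weight matrix is the product W2 * W1.
collapsed = matvec(matmul(W2, W1), x)

# The same two layers with a ReLU in between.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(stacked, collapsed, with_relu)
```

Here `stacked` and `collapsed` are identical, while `with_relu` differs, which is the sense in which the activation function adds expressive power.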

## Types of Activation Functions

Various types of activation functions are listed below:

### Sigmoid Activation Function

The sigmoid function, σ(x) = 1 / (1 + e^(-x)), squashes any input into the range (0, 1). It was traditionally used for binary classification problems by thresholding its output (along the lines of “if σ(x) ≤ 0.5, y=0 else y=1”). However, it tends to cause the vanishing gradient problem: when the output is very close to 0 or 1, the curve is almost flat, so the gradient is nearly zero and learning becomes very slow.

It’s also relatively expensive to compute, since it requires evaluating an exponential for every input.
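As a rough illustration of the saturation described above, here is a minimal sketch of the sigmoid and its derivative using only Python’s `math` module. The function names are chosen for this example, and the two-branch form is just one common way to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: maps any real x into the interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use an equivalent form that never exponentiates a large positive number.
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient is largest at 0 and shrinks rapidly as |x| grows.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The gradient at x = ±10 is several thousand times smaller than at x = 0, which is why saturated sigmoid units learn so slowly.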

### Tanh Activation Function

The tanh function was also traditionally used for binary classification problems, by thresholding its output at zero (along the lines of “if tanh(x) ≤ 0, y=0 else y=1”).

It differs from sigmoid in that it’s zero-centred: it restricts output values to between -1 and +1. Like sigmoid, it saturates for inputs of large magnitude and is relatively expensive to compute, since it relies on exponentials that must be evaluated for every input on every iteration.
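The relationship between tanh and sigmoid can be made concrete: tanh is a rescaled and shifted sigmoid, tanh(x) = 2·σ(2x) − 1. The sketch below checks this identity against `math.tanh`; the helper names are illustrative only.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to stay stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh_via_sigmoid(x):
    """tanh expressed through the sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which peaks at 1.0 when x = 0."""
    t = math.tanh(x)
    return 1.0 - t * t


for x in (-3.0, -0.7, 0.0, 0.7, 3.0):
    print(x, math.tanh(x), tanh_via_sigmoid(x), tanh_grad(x))
```

The outputs are symmetric around zero, and the gradient still approaches zero for inputs of large magnitude.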

### ReLU Activation Function

ReLU, which stands for Rectified Linear Unit, is a widely-used non-linear activation function (it goes along the lines of “if x≤0, y=0 else y=x”).

Thus, it’s only activated when the values are positive. ReLU is less computationally expensive than tanh and sigmoid because it involves only a comparison with zero.

But it faces what’s known as the “dying ReLU problem”: for negative inputs the gradient of the function is exactly zero, so a node whose input is negative for every training example receives no weight updates and stops learning. Even so, ReLU is considered a go-to function if one is new to activation functions or is unsure about which one to choose.
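A minimal sketch of ReLU and its gradient follows. The single-weight “node” with weight `w` and bias `b` is a made-up example, chosen so that its pre-activation is negative for every input, which is the situation in which a ReLU unit receives no gradient at all.

```python
def relu(x):
    """ReLU: returns x for positive inputs and 0 otherwise."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (0 is used at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0


# Hypothetical node: pre-activation z = w * x + b, with a strongly negative bias.
w, b = 1.0, -5.0
inputs = [0.5, 1.0, 2.0]

# Local gradient of the node's output with respect to w, for each input.
grads_wrt_w = [relu_grad(w * x + b) * x for x in inputs]

print([relu(w * x + b) for x in inputs])  # the node outputs 0 for every input
print(grads_wrt_w)                        # and every gradient is 0
```

Because every gradient is zero, gradient descent never changes `w` or `b`, so the node stays inactive.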

### Leaky-ReLU Activation Function

The leaky ReLU function proposes a solution to the dying ReLU problem. It has a small positive slope in the negative plane, so the gradient is never exactly zero and the model can keep learning even for negative input values.

Leaky ReLUs are widely used with generative adversarial networks. The slope in the negative plane is set by a value “alpha”, a small fixed constant such as 0.01 or 0.1. Parametric ReLUs (PReLU) use the same form but treat alpha as a parameter learned during training.
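As a small sketch of the idea, leaky ReLU can be written with the negative-side slope as an argument. The default `alpha=0.01` used here is an assumption made for the example, not a value prescribed by this post.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Gradient of leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha


for x in (-4.0, -1.0, 0.5, 3.0):
    print(x, leaky_relu(x, alpha=0.1), leaky_relu_grad(x, alpha=0.1))
```

Unlike plain ReLU, the gradient for negative inputs is `alpha` rather than zero, so the node can still be updated.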

### Softmax Activation Function

The softmax function turns the raw outputs of a layer into a discrete probability distribution. Its formula, softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ), applies an exponential function to each element of the output layer and normalizes the values so that they sum to 1. The output class is the one with the highest confidence score.

This function is mostly used as the last layer in classification problems—especially multi-class classification problems—where the model ultimately outputs a probability for each of the available classes, and the most probable one is chosen as the answer.
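The steps described above (exponentiate, normalize, pick the most probable class) can be sketched as follows. The `scores` values are arbitrary, and subtracting the maximum score before exponentiating is a common stability trick added here rather than something this post prescribes; it does not change the result.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(logits)                            # shift by the max to avoid overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]                       # hypothetical output-layer values
probs = softmax(scores)

# The predicted class is the index with the highest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)

print(probs, sum(probs), predicted)
```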

## What is a Loss Function?

To understand how well or poorly our model is working, we monitor the value of the loss function over several iterations. It measures how far the model’s predictions are from the correct classes or values, and helps us understand how the model behaves for certain inputs. The larger the value of the loss function, the further our model strays from making the correct prediction.

### Types of Loss Functions

Depending on the type of learning task, loss functions can be broadly classified into 2 categories:

1. Regression loss functions
2. Classification loss functions

## Regression Loss Functions

In this sub-section, we’ll discuss some of the more widely-used regression loss functions:

### Mean Absolute Error or L1 Loss (MAE)

The mean absolute error is the average of the absolute differences between the values predicted by the model and the actual values. Taking the absolute value matters: with a plain average of signed errors, values that are underestimated (negative error) and values that are almost equally overestimated (positive error) would cancel each other out and give the wrong idea about the net error. MAE avoids this, and because every error counts in proportion to its size, it is relatively robust to outliers.
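To make the role of the absolute value concrete, the sketch below compares a plain average of signed errors with MAE on a small made-up dataset in which over- and under-estimates are perfectly balanced. The function names and data are illustrative only.

```python
def mean_error(y_true, y_pred):
    """Plain average of signed errors; positive and negative errors can cancel."""
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of |prediction - target|."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [5.0, 3.0, 9.0, 7.0]   # each prediction is off by exactly 2, alternating in sign

print(mean_error(y_true, y_pred))  # signed errors cancel out
print(mae(y_true, y_pred))         # absolute errors do not
```

The signed average reports no error at all, while MAE reports the true typical miss of 2.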

### Mean Squared Error or L2 Loss (MSE)

The mean squared error is the average of the squared differences between the values predicted by the model and the actual values. Squaring the errors, like taking their absolute values, keeps positive and negative errors from cancelling each other out.

Because the error terms are squared, large errors have relatively greater influence on MSE than small errors: an error of 10 contributes 100, while an error of 0.5 contributes only 0.25.

However, this can be a gamble when there are a lot of outliers in your data. Since outliers produce large errors that are then squared, they carry far greater weight and can dominate the loss. Thus, outliers should be handled (for example, removed or clipped) before applying MSE.
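The effect of a single outlier can be seen in a small experiment. The data below is invented for illustration: the two prediction lists are identical except for the last value, which is far off in the second list.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of (prediction - target)^2."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error, shown for comparison."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 2.5, 3.5]           # every prediction is off by 0.5
with_outlier = [1.5, 2.5, 2.5, 14.0]   # same, except the last is off by 10

print(mse(y_true, clean), mae(y_true, clean))
print(mse(y_true, with_outlier), mae(y_true, with_outlier))
```

One bad prediction multiplies the MSE by roughly a hundred, while the MAE grows by a factor of less than six.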

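### Huber Loss

Huber loss combines the two previous losses: it is quadratic, like MSE, when the error is small, and linear, like MAE, when the error is large:

$$
L_{t_h}(y, x^{T}\theta) =
\begin{cases}
\frac{1}{2}\,(y - x^{T}\theta)^{2} & \text{if } |y - x^{T}\theta| \le t_h \\
t_h\,|y - x^{T}\theta| - \frac{1}{2}\,t_h^{2} & \text{otherwise}
\end{cases}
$$

In the above formula, y is the expected value, xᵗ ϴ is the predicted value, and tₕ is a user-defined hyper-parameter that sets the error size at which the loss switches from quadratic to linear.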
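A minimal sketch of the standard Huber loss for a single prediction is shown below. The argument name `t_h` mirrors the threshold tₕ in the text; its default of 1.0 is an assumption made for this example.

```python
def huber(y, y_hat, t_h=1.0):
    """Huber loss: quadratic for errors up to t_h, linear beyond it."""
    r = abs(y - y_hat)
    if r <= t_h:
        return 0.5 * r * r
    return t_h * (r - 0.5 * t_h)


# Small error: behaves like half the squared error.
print(huber(0.0, 0.5))
# Large error: grows linearly, so an outlier is penalized less than under MSE.
print(huber(0.0, 3.0))
# Raising the threshold makes more errors fall in the quadratic region.
print(huber(0.0, 3.0, t_h=2.0))
```

The two branches meet at the threshold, so the loss has no jump where it switches from quadratic to linear.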

## Classification Loss Functions

In this sub-section, we’ll discuss some of the more widely-used loss functions for classification tasks:

### Cross-Entropy Loss

This loss is also called log loss. To understand cross-entropy loss, let’s first understand what entropy is. Entropy refers to the disorder or uncertainty in data. The larger the entropy value, the higher the level of disorder.

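Formally, the entropy of a random variable X with possible outcomes x is

$$
H(X) = -\sum_{x} p(x)\,\log p(x)
$$

As you can see in the above formula, the entropy is the negative summation, over all possible outcomes, of the probability of each outcome multiplied by its log. Cross-entropy applies the same idea to two distributions, the true one and the predicted one, so minimizing it as a loss function reduces the model’s uncertainty about the class to be predicted.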

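Cross-entropy loss is therefore defined as the negative summation, over all possible classes, of the product of the expected class indicator and the natural log of the predicted probability for that class:

$$
L = -\sum_{c} y_{c}\,\log(p_{c})
$$

where y_c is 1 if c is the correct class and 0 otherwise, and p_c is the predicted probability of class c. The negative sign is used because the log of a number < 1 is negative, and a negative loss would be confusing to work with while evaluating model performance.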
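Both definitions can be sketched in a few lines. The function names, the one-hot label, and the small `eps` used to avoid taking log(0) are assumptions made for this illustration.

```python
import math


def entropy(p):
    """Entropy of a discrete distribution p, using the natural log."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label y_true and predicted probabilities y_pred."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# A certain outcome has zero entropy; a fair coin has the maximum for two outcomes.
print(entropy([1.0, 0.0]), entropy([0.5, 0.5]))

y_true = [0.0, 1.0, 0.0]                     # the correct class is index 1
confident = [0.1, 0.8, 0.1]
unsure = [0.4, 0.3, 0.3]

# The loss is lower when more probability is placed on the correct class.
print(cross_entropy(y_true, confident), cross_entropy(y_true, unsure))
```

With a one-hot label, only the term for the correct class survives, so the loss is simply the negative log of the probability assigned to that class.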

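For example, if the problem at hand is binary classification, the value of y can be 0 or 1. In such a case, the above loss formula reduces to:

$$
L = -\big(y\,\log(p) + (1 - y)\,\log(1 - p)\big)
$$

where p is the predicted probability that an observation O is of class C (that is, y = 1).

Thus, the loss function over the complete set of N samples would be the average:

$$
L = -\frac{1}{N}\sum_{i=1}^{N}\big(y_{i}\,\log(p_{i}) + (1 - y_{i})\,\log(1 - p_{i})\big)
$$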
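A minimal sketch of binary cross-entropy averaged over a set of samples is given below. It assumes the usual convention of taking the mean over samples and clips probabilities with a small `eps` so that log(0) is never evaluated; both choices, like the function name, are assumptions of this example.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average binary cross-entropy for labels in {0, 1} and predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)        # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]    # confident and correct
bad = [0.1, 0.9, 0.2, 0.8]     # confident and wrong

print(binary_cross_entropy(labels, good))
print(binary_cross_entropy(labels, bad))
```

Confident wrong predictions are penalized much more heavily than confident correct ones are rewarded, which is the characteristic behaviour of log loss.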

### Hinge Loss

Hinge loss penalizes wrongly-predicted values, as well as values that were predicted correctly but with low confidence, that is, too close to the decision boundary. For a true label y ∈ {-1, +1} and a raw model score f(x), it is defined as max(0, 1 − y·f(x)). Hinge loss is primarily used with Support Vector Machines (SVMs), since penalizing every point that falls inside the margin pushes the model toward a large-margin classifier.
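The behaviour described above can be sketched with the standard hinge loss, max(0, 1 − y·score), where the label y is −1 or +1 and `score` is the raw (unnormalized) model output. The function name and the sample scores are illustrative.

```python
def hinge(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw model score."""
    return max(0.0, 1.0 - y * score)


# Correct and outside the margin: no penalty.
print(hinge(+1, 2.0))
# Correct but inside the margin (low confidence): small penalty.
print(hinge(+1, 0.5))
# Wrong side of the decision boundary: larger penalty.
print(hinge(+1, -1.0))
```

Only predictions that are both correct and at least a full margin away from the boundary incur zero loss.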

## Conclusion

In this post, we discussed various activation functions like sigmoid, tanh, ReLU, leaky ReLU, and softmax, along with their primary use cases. These are the most widely-used activation functions and are essential for developing efficient neural networks.

We also discussed a few major loss functions like mean squared error, mean absolute error, Huber loss, cross-entropy loss, and hinge loss.