*This post is a part of a series about feature engineering techniques for machine learning with Python.*

You can check out the rest of the articles:

- Hands-on with Feature Engineering Techniques: Broad Introduction.
- Hands-on with Feature Engineering Techniques: Variable Types.
- Hands-on with Feature Engineering Techniques: Common Issues in Datasets.
- Hands-on with Feature Engineering Techniques: Imputing Missing Values.
- Hands-on with Feature Engineering Techniques: Encoding Categorical Variables.
- Hands-on with Feature Engineering Techniques: Transforming Variables.
- Hands-on with Feature Engineering Techniques: Variable Discretization.
- Hands-on with Feature Engineering Techniques: Handling Outliers.
- Hands-on with Feature Engineering Techniques: Feature Scaling.
- Hands-on with Feature Engineering Techniques: Handling Date-time and Mixed Variables.
- Hands-on with Feature Engineering Techniques: Advanced Methods.

Welcome back! In this post of our series on feature engineering, we’re going to focus on another common issue in most datasets—**outliers**. Here, we’ll examine what an outlier is and the different methods to handle them, alongside some code snippets. Let’s get started.

## Outliers

An outlier is a data point that’s **significantly different** from the remaining data.

Put another way, an outlier is an observation that deviates so much from the other observations that it arouses suspicion that a different mechanism produced it.

## Detecting Outliers

We can detect outliers using various techniques, including:

**Using visualization plots like boxplot and scatterplot:**

**Using a normal distribution (mean and std):**

In a normal distribution, about 99.7% of the data lie within three standard deviations of the mean. Consequently, if an observation lies more than three standard deviations away from the mean, it may be an outlier.
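As a minimal sketch of the three-standard-deviation rule, the snippet below flags values outside mean ± 3 × std. The data is synthetic and the variable names are illustrative assumptions, not part of any particular dataset:

```python
import numpy as np
import pandas as pd

# Synthetic, roughly normal data with two injected extreme values (illustration only)
rng = np.random.default_rng(0)
values = pd.Series(np.concatenate([rng.normal(50, 5, 500), [120.0, -30.0]]))

# Boundaries at three standard deviations on either side of the mean
mean, std = values.mean(), values.std()
lower, upper = mean - 3 * std, mean + 3 * std

# Boolean mask: True where an observation falls outside the boundaries
is_outlier = (values < lower) | (values > upper)
flagged = values[is_outlier]
print(flagged)
```

Note that extreme values inflate the mean and standard deviation themselves, so this rule works best when the variable is approximately normal.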

**Interquartile range proximity rule:**

The concept of the **interquartile range** (**IQR**) is used to build boxplot graphs. The idea is simple: we sort our data and divide it into four equal parts, and the cut points between the parts are the quartiles.

IQR is the difference between the third quartile, Q3 (the 75th percentile), and the first quartile, Q1 (the 25th percentile).

With IQR, outliers are defined as the observations that are:

- Below Q1 − 1.5 × IQR.
- Above Q3 + 1.5 × IQR.
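The two boundaries above can be sketched in a few lines of pandas. The sample values here are made up for illustration:

```python
import pandas as pd

# Small illustrative sample; 95 is the obvious extreme value
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95])

# Quartiles and interquartile range
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# IQR proximity rule boundaries
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the observations outside the boundaries
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [95]
```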

Here’s a very descriptive image found on Wikipedia:

## Handling Outliers

Now that we know how to detect outliers, it’s time to handle them. We’re going to explore a few different techniques and methods to achieve that:

- **Trimming:** Simply removing the outliers from our dataset.
- **Imputing:** We treat outliers as missing data, and we apply missing data imputation techniques.
- **Discretization:** We place outliers in edge bins with higher or lower values of the distribution.
- **Censoring:** Capping the variable distribution at the maximum and minimum values.
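To make the discretization idea concrete, here is a minimal sketch using `pd.cut` with open-ended first and last bins, so that extreme values fall into the edge bins. The bin edges, labels, and sample ages are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative ages; 3 and 140 are the extreme values
ages = pd.Series([3, 25, 37, 48, 61, 140])

# Open-ended edge bins absorb anything beyond the usual range
bins = [-np.inf, 20, 40, 60, np.inf]
labels = ["<=20", "21-40", "41-60", ">60"]
age_bins = pd.cut(ages, bins=bins, labels=labels)
print(age_bins.tolist())
```

The extreme values are kept in the data, but they no longer stand out numerically because they share a bin with their nearest neighbors.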

## Trimming

Trimming (or truncation) merely means removing outliers from the dataset; what we need here is just to decide on a metric to determine outliers.

Here are some points to consider when working with the trimming method:

- This method is fast.
- It can remove a significant amount of data, so be careful.

Here’s a sample code snippet for trimming outliers with Python:

```
# import the needed packages
import pandas as pd
import numpy as np

# read your data
data = pd.read_csv("yourData.csv")

# loop over the numeric columns only (quantiles are undefined for text columns)
for variable in data.select_dtypes(include="number").columns:
    # calculate the IQR
    IQR = data[variable].quantile(0.75) - data[variable].quantile(0.25)
    # calculate the boundaries
    lower = data[variable].quantile(0.25) - (IQR * 1.5)
    upper = data[variable].quantile(0.75) + (IQR * 1.5)
    # find the outliers (boolean array, True where the value is out of bounds)
    outliers = np.where(data[variable] > upper, True, np.where(data[variable] < lower, True, False))
    # remove the rows that contain outliers from the data
    data = data.loc[~outliers]
```
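Because trimming can discard a lot of rows, it can help to measure the loss before committing to it. The sketch below uses synthetic data (an assumption, not a real dataset) and computes every column's IQR bounds on the full data up front, which differs slightly from removing rows one column at a time:

```python
import numpy as np
import pandas as pd

# Synthetic data: one symmetric column and one skewed column
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "a": rng.normal(0, 1, 1000),
    "b": rng.exponential(1.0, 1000),  # skewed, so the IQR rule flags more points
})

# Start by keeping every row, then drop rows that are out of bounds in any column
keep = pd.Series(True, index=data.index)
for variable in data.columns:
    q1, q3 = data[variable].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep &= data[variable].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

trimmed = data[keep]
removed_fraction = 1 - len(trimmed) / len(data)
print(f"Trimming would remove {removed_fraction:.1%} of the rows")
```

If the removed fraction is large, capping or imputing may be a better fit than trimming.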

## Censoring

Censoring (or capping) means setting the maximum **and/or** the minimum of the distribution at an arbitrary value.

In other words, values bigger or smaller than the arbitrarily chosen value are **replaced by this value.**

When doing capping, remember that:

- It does not remove data.
- It distorts the distributions of the variables.
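Both points can be checked on a small example. This sketch uses pandas' `Series.clip` as a compact way to cap values; the caps of 13 and 25 and the sample data are hypothetical:

```python
import pandas as pd

# Illustrative sample with one low and one very high value
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95])

# Cap the distribution at hypothetical lower and upper limits
capped = values.clip(lower=13, upper=25)

print(len(values), len(capped))    # same number of rows: nothing is removed
print(values.std(), capped.std())  # the spread shrinks: the distribution is distorted
```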

The numbers at which to cap the distribution can be determined using various methods, which we’ll cover below.

### Arbitrarily

You can choose values to replace outliers arbitrarily; this can be based on the requirements of your use case. Here’s a code snippet:

```
# import the needed packages
import pandas as pd
import numpy as np

# read your data
data = pd.read_csv("yourData.csv")

# loop over the numeric columns only
for variable in data.select_dtypes(include="number").columns:
    # create boundaries (age for example)
    lower = 10
    upper = 89
    # replace the outliers with the nearest boundary
    data[variable] = np.where(data[variable] > upper, upper, np.where(data[variable] < lower, lower, data[variable]))
```

### Interquartile range proximity rule

In this rule, the boundaries are determined using IQR proximity rules:

```
# import the needed packages
import pandas as pd
import numpy as np

# read your data
data = pd.read_csv("yourData.csv")

# loop over the numeric columns only
for variable in data.select_dtypes(include="number").columns:
    # calculate the IQR
    IQR = data[variable].quantile(0.75) - data[variable].quantile(0.25)
    # calculate the boundaries
    lower = data[variable].quantile(0.25) - (IQR * 1.5)
    upper = data[variable].quantile(0.75) + (IQR * 1.5)
    # replace the outliers with the nearest boundary
    data[variable] = np.where(data[variable] > upper, upper, np.where(data[variable] < lower, lower, data[variable]))
```

### Gaussian approximation

Here, the boundaries are set at three standard deviations on either side of the mean:

```
# import the needed packages
import pandas as pd
import numpy as np

# read your data
data = pd.read_csv("yourData.csv")

# loop over the numeric columns only
for variable in data.select_dtypes(include="number").columns:
    # calculate the boundaries
    lower = data[variable].mean() - 3 * data[variable].std()
    upper = data[variable].mean() + 3 * data[variable].std()
    # replace the outliers with the nearest boundary
    data[variable] = np.where(data[variable] > upper, upper, np.where(data[variable] < lower, lower, data[variable]))
```

### Using quantiles

In the following code snippet, the boundaries are determined using the quantiles, through which you can specify any percentage you want:

```
# import the needed packages
import pandas as pd
import numpy as np

# read your data
data = pd.read_csv("yourData.csv")

# loop over the numeric columns only
for variable in data.select_dtypes(include="number").columns:
    # calculate the boundaries (10th and 90th percentiles)
    lower = data[variable].quantile(0.10)
    upper = data[variable].quantile(0.90)
    # replace the outliers with the nearest boundary
    data[variable] = np.where(data[variable] > upper, upper, np.where(data[variable] < lower, lower, data[variable]))
```

## Imputing

Another technique used to handle outliers is to treat them as missing data. We have a range of methods that we can use to replace or impute outliers. If you’d like to explore these techniques in more depth, see the Imputing Missing Values post in this series.
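As a rough sketch of this approach, the snippet below flags outliers with the IQR rule, sets them to missing, and fills them with the median of the remaining values. The choice of median imputation and the sample data are assumptions made for illustration; any imputation method could be substituted:

```python
import pandas as pd

# Illustrative sample; 95 is the extreme value
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95], dtype="float64")

# Flag outliers with the IQR proximity rule
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Step 1: treat the flagged values as missing (NaN)
masked = values.mask(is_outlier)

# Step 2: impute the missing values with the median of the remaining data
imputed = masked.fillna(masked.median())
print(imputed.tolist())
```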

## Transformation

We can also apply mathematical transformations, such as the log transformation, to reduce the influence of outliers. There is a range of transformation techniques, which you can learn more about in the Transforming Variables post in this series.
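Here is a minimal sketch of a log transformation on made-up data. It uses `np.log1p`, which computes log(1 + x) and is therefore defined at zero; it assumes the variable has no values at or below -1:

```python
import numpy as np
import pandas as pd

# Illustrative right-skewed sample with one very large value
values = pd.Series([1, 2, 3, 5, 8, 13, 1000], dtype="float64")

# log(1 + x) compresses large values far more than small ones
log_values = np.log1p(values)

# Compare how far the maximum sits from the median before and after
print(values.max() / values.median())
print(log_values.max() / log_values.median())
```

The transformation is reversible with `np.expm1`, so predictions made on the log scale can be mapped back to the original units.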

## Conclusion

To sum things up, we have learned how to detect outliers in our dataset and learned multiple methods that we can use to handle them.

I hope this post will get you started with engineering outliers—the practices described here can certainly enhance your data science toolkit.
