Feature engineering is one of the most important skills needed in data science and machine learning. It has a major influence on the performance of machine learning models and even the quality of insights derived during exploratory data analysis (EDA).
In this article, we’re going to learn some important techniques and tools that will help you properly extract, prepare, and engineer features from your dataset.
What you will learn in this article:
- What is feature engineering?
- How to handle missing values.
- How to handle categorical features.
- How to handle numerical/continuous features.
- Creating polynomial features.
- Normalization of features.
- Working with date/time features.
- Working with latitudes and longitudes.
Find the Jupyter notebook for this post here.
What is Feature Engineering?
Feature engineering is the process of using data’s domain knowledge to create features that make machine learning algorithms work (Wikipedia). It’s the act of extracting important features from raw data and transforming them into formats that are suitable for machine learning.
To perform feature engineering, a data scientist combines domain knowledge (knowledge about a specific field) with math and programming skills to transform or come up with new features that will help a machine learning model perform better.
Feature engineering is a practical area of machine learning and one of its most important aspects.
Preparing our Datasets
Now that we understand what feature engineering is, let’s go straight into the practical aspect of this article. We’ll use two datasets for this article. The first is the Loan Default Prediction dataset hosted on Zindi by Data Science Nigeria, and the second — also hosted on Zindi — is the Sendy Logistics dataset by Sendy.
You can find the descriptions of the dataset and the corresponding machine learning tasks in the links above. If you have cloned the repo, you’ll have a folder of the datasets and the notebook used for this article and can follow along easily.
First, let’s import some libraries and the datasets:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
#load loan datasets
loan_demographics = pd.read_csv('traindemographics.csv')
loan_prev = pd.read_csv('trainprevloans.csv')
loan_perf = pd.read_csv('trainperf.csv')
#load logistics dataset
sendy_data = pd.read_csv('sendy_logistics.csv')
- We import Pandas, NumPy, Seaborn, and Matplotlib for basic data manipulation and visualization.
- We silence unnecessary warnings with the filterwarnings function from the warnings module, and finally we load the datasets.
We can see that the loan dataset has three tables. These tables are related to each other by the primary key (customerid).
Let’s take a peek at our dataset and get a feel for what records are present.
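One way to take that peek is with the head and info methods. Here's a minimal sketch (run each call in its own notebook cell, or wrap it in print() in a script):
#preview the first few rows of each table
loan_demographics.head()
loan_prev.head()
loan_perf.head()
sendy_data.head()
#column types, non-null counts, and memory usage for the logistics dataset
sendy_data.info()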
We now have an overview of our datasets. From this, we can see that the loan dataset contains mainly three types of features (numerical, categorical, and date features), while the logistics dataset contains four types (numerical, categorical, date, and geo features).
With this in mind, let's do some feature engineering.
How to handle missing values
Missing values are values that are not recorded during data collection. They are mostly not provided, left out due to errors, or too difficult to measure. Missing values may be very important to models, and as such there exist numerous ways and techniques to handle them. Let’s go over some of these techniques.
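Before choosing a technique, it helps to see which columns actually contain missing values and how many. A quick check (not from the original notebook, but using only the dataframes loaded above) might look like this:
#count missing values per column in each dataframe
for name, df in [('loan_demographics', loan_demographics), ('loan_prev', loan_prev),
                 ('loan_perf', loan_perf), ('sendy_data', sendy_data)]:
    print(name)
    print(df.isnull().sum().sort_values(ascending=False).head())
    print('--------------------------')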
The choice of technique is dependent on the type of features you have. For categorical features, you can do things like:
- Mode filling: Fill missing values with the most popular/frequent/modal class.
- Temporal filling (forward or backward fill): Fill missing values with the preceding value (top-down) or with the succeeding value (bottom-up).
- Encoding and fill: In this method, you can encode the values using different strategies, and then fill with either the mean, mode, or the median.
In the loan demographic dataset, we have three categorical features (bank_branch_clients, employment_status_clients, level_of_education_clients) with missing values. Let’s try mode filling using the employment_status_clients feature:
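A quick way to inspect the class frequencies is value_counts:
#frequency of each class in employment_status_clients
loan_demographics['employment_status_clients'].value_counts()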
From the output of the value_counts above, we can see the classes present in the employment_status_clients feature and their corresponding frequencies. The Permanent class is the most frequent, so we can use it to fill all missing values in the feature, as shown below.
#fill with mode (Permanent)
loan_demographics['employment_status_clients'] = loan_demographics['employment_status_clients'].fillna(value='Permanent')
For numerical features, we can also do things like:
- Filling with mean, mode, or median.
- Temporal filling (backward or forward filling); see the short sketch after this list.
- Use machine learning models: Train a machine learning model to learn the most appropriate fill values.
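As a short sketch of the temporal option mentioned above (for illustration only; we don't apply it to our data here), forward and backward filling can be done directly with Pandas:
#forward fill: propagate the last valid observation downward
temp_ffill = sendy_data['Temperature'].ffill()
#backward fill: propagate the next valid observation upward
temp_bfill = sendy_data['Temperature'].bfill()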
To demonstrate the process of filling numerical values, we will use the Sendy logistics dataset, as it has two numerical features (Temperature and Precipitation in millimeters) that contain missing values.
mean_df = round(sendy_data['Temperature'].mean())
mode_df = round(sendy_data['Temperature'].mode()[0])
median_df = round(sendy_data['Temperature'].median())
#Fill with mean
print("Filling with mean value of {}".format(mean_df))
sendy_data['Temperature'] = sendy_data['Temperature'].fillna(mean_df)
#Fill with mode
print("Filling with modal value of {}".format(mode_df))
sendy_data['Temperature'] = sendy_data['Temperature'].fillna(mode_df)
#Fill with median
print("Filling with median value of {}".format(median_df))
sendy_data['Temperature'] = sendy_data['Temperature'].fillna(median_df)
- First, we calculate the value for the chosen fill method, then we use the Pandas fillna function to replace missing values with it. Note that in practice you would pick just one of these strategies; because the mean fill runs first here, the later mode and median fills find no remaining missing values to replace.
Use modeling to fill missing values
To demonstrate filling with modeling, we'll use the Precipitation in millimeters feature in the Sendy dataset. But first, we need to select features that correlate with it. That is, features that can help predict Precipitation in millimeters.
The Seaborn heatmap plot can help us decide. We demonstrate this below.
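A sketch of such a heatmap (the exact styling in the original notebook may differ):
#correlation between the numerical features in the Sendy dataset
corr = sendy_data.select_dtypes(include=np.number).corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title("Correlation heatmap of numerical features")
plt.show()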
From the heatmap plot above, we can see that the majority of the features do not really correlate with Precipitation in millimeters. We can make use of the last three features (Destination Lat, Destination Long, and Time from Pickup to Arrival), as these show at least a little correlation. Let's demonstrate this below:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
to_train = ['Precipitation in millimeters', 'Destination Lat', 'Destination Long', 'Time from Pickup to Arrival']
temp_df = sendy_data[to_train]
#Rows without missing values form the train set; rows with missing values form the test set
x_train = temp_df[temp_df['Precipitation in millimeters'].notnull()].drop(columns='Precipitation in millimeters')
y_train = temp_df[temp_df['Precipitation in millimeters'].notnull()]['Precipitation in millimeters']
x_test = temp_df[temp_df['Precipitation in millimeters'].isnull()].drop(columns='Precipitation in millimeters')
#Fit a simple linear model to the dataset
lr.fit(x_train, y_train)
pred = lr.predict(x_test)
#print fill values
print(np.round(pred, 5))
#Perform filling
sendy_data.loc[sendy_data['Precipitation in millimeters'].isnull(), 'Precipitation in millimeters'] = pred
- First, we import and use a simple linear regression model.
- We save the correlated features as observed from the heatmap to a list called to_train.
- We create a train-test dataset from these features, where the train dataset contains no missing values and the test dataset contains the missing values we want to fill.
- Finally, we fit the model, make predictions on the test set, and use the predictions to fill in the missing values.
Alternatively, you can use the IterativeImputer class from the sklearn library to automatically fill missing values. It is still experimental, so it has to be enabled through sklearn.experimental first. We demonstrate this below:
#get the index of the missing values so we can inspect some of the fill values afterwards
missing_indx = list(sendy_data['Temperature'][sendy_data['Temperature'].isna()].index)
First, we save the index positions of the missing values in a variable so we can observe the fill values calculated by IterativeImputer. (If you already filled Temperature with the mean/mode/median above, reload the dataset so this step has missing values to work with.)
# explicitly require this experimental feature
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
# Run the imputer with a simple Random Forest estimator
imp = IterativeImputer(RandomForestRegressor(n_estimators=5), max_iter=5, random_state=1)
to_train = ['Temperature', 'Destination Lat', 'Destination Long','Time from Pickup to Arrival']
#perform filling
sendy_data[to_train] = pd.DataFrame(imp.fit_transform(sendy_data[to_train]), columns=to_train)
#display some of the filled values
sendy_data['Temperature'][missing_indx].head(10)
- First, we explicitly enable the experimental IterativeImputer by importing enable_iterative_imputer. This must be done before IterativeImputer can be imported and used.
- Next, we import RandomForestRegressor and create an imputer object that uses it as the estimator.
- Next, we specify the list of correlated features as a Python list.
- Finally, we fit-transform the selected columns and convert the result back to a Pandas dataframe.
How to handle categorical features
Categorical features are features that can take on values from a limited set. For example, the relative hotness of a place/thing (hot, hotter, hottest) or star ratings for an application (1, 2, 3, 4, 5). In our dataset, level_of_education_clients in the loan_demographics table is a categorical feature containing classes like Secondary, Graduate, Post-Graduate, and Primary.
Machine learning models cannot work with categorical features the way they are. These features must be converted to numerical forms before they can be used. The process of converting categorical features to numerical form is called encoding.
There are numerous types of encoding, and the choice of which kind to use is mostly dependent on the categorical type. So first, let’s understand the different categorical feature types.
Types of categorical features
- Ordinal Categorical Features: Ordinal categorical features have a natural ordered category. That is, one class is higher than another. For example, star ratings (1,2,3,4,5), where class 5 is a higher rating than 4/3/2/1.
- Non-Ordinal (Nominal) Categorical Features: This type of feature has no specific order. That is, no class is higher than another. One example would be type of food (rice, pasta, macaroni, spaghetti). Rice is not in some weird way higher than pasta/macaroni/spaghetti, right?
What encoding scheme to use and when to use it
Manual Encoding of Ordinal Features: If the classes in a categorical feature are ordinal and the number of unique values is small, you can manually assign labels that preserve the ordering. Let's demonstrate this below using the level_of_education_clients feature in the loan demographic dataset. This feature has some form of ordinality, as the Post-Graduate class is higher than the Graduate/Secondary/Primary classes.
#use a simple map function
map_education = {"Primary" : 1, "Secondary": 2, "Graduate": 3, "Post-Graduate": 4}
loan_demographics['level_of_education_clients'] = loan_demographics['level_of_education_clients'].map(map_education)
loan_demographics['level_of_education_clients'].value_counts()
- First, we create a dictionary mapping classes to their labels. Here, the highest class (Post-Graduate) is assigned the highest number.
- Next, we use the Pandas map function to find and replace every class with its corresponding label.
To perform automated encoding, we'll use an efficient library called category_encoders. This library offers numerous encoding schemes out of the box and has first-class support for Pandas dataframes.
To install the library, you can use pip as follows:
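pip install category_encoders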
Label Encoding: If you have a large number of classes in a categorical feature, you can use label encoding. Label encoding assigns a unique label (integer number) to a specific class. We demonstrate this using two features (bank_name_clients and bank_branch_clients) with large numbers of unique classes, 18 and 45 respectively.
#Check the number of unique classes
cat_cols = loan_demographics.select_dtypes(include='object').columns
for col in cat_cols:
    print("Number of classes in {}".format(col))
    print(loan_demographics[col].nunique())
    print('--------------------------')
The OrdinalEncoder class in the category_encoders library can be used to label encode, as shown below:
import category_encoders as ce
#Label encoding
cat_cols = ['bank_name_clients', 'bank_branch_clients']
encoder = ce.OrdinalEncoder(cols=cat_cols)
loan_demographics = encoder.fit_transform(loan_demographics)
- First, we save the categorical columns we want to encode into a list.
- Next, we create the encoder object.
- Finally, we fit-transform the dataset.
One-Hot Encoding: One-hot encoding uses binary values to represent classes. It creates a feature per category, and can quickly become inefficient as the number of classes in the categorical feature increases. We demonstrate how to use this below:
cats = ['bank_account_type','level_of_education_clients']
one_hot_enc = ce.OneHotEncoder(cols=cats)
loan_demographics = one_hot_enc.fit_transform(loan_demographics)
loan_demographics.head().T
Hash Encoding: Hash encoding or feature hashing is a fast and space-efficient way of encoding features. It’s very efficient for categorical features with large numbers of classes. A hash encoder works by applying a hash function to the features. We demonstrate how to use this below.
cat_cols = ['bank_name_clients', 'bank_branch_clients']
hash_enc = ce.HashingEncoder(cols=cat_cols, n_components=10)
loan_demographics = hash_enc.fit_transform(loan_demographics)
loan_demographics.head()
- First, we specify the features we want to hash encode.
- Next, we create a hash encoder object and specify the length of the hash vector to be used.
- Finally, we fit-transform the dataset.
Target Encoding: In target encoding, we replace each category with the average value of the target for that category. Target encoding helps preserve useful properties of the feature and can sometimes improve classification models; however, it can also lead to severe overfitting.
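To make the mechanics concrete, here is a small hand-rolled sketch on a toy dataframe (not part of the original notebook):
#toy example: replace each category with the mean of the target for that category
toy = pd.DataFrame({'city':   ['A', 'A', 'B', 'B', 'B'],
                    'target': [1, 0, 1, 1, 0]})
category_means = toy.groupby('city')['target'].mean()
toy['city_encoded'] = toy['city'].map(category_means)
print(toy)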
To demonstrate target encoding, we’ll use the loan performance dataset (loan_perf). The target of interest in this dataset is the good_bad_flag feature, and the task is to predict if a customer will repay a loan or not. The feature good_bad_flag is represented as a categorical feature, so we first convert it to numerical form, as shown below:
map_target = {"Good": 0, "Bad": 1}
loan_perf['good_bad_flag'] = loan_perf['good_bad_flag'].map(map_target)
Next, we’ll target encode the feature loannumber. This feature is numeric in nature, but can be treated as a categorical feature because it has a limited number of classes.
target_enc = ce.TargetEncoder(cols=['loannumber'])
loan_perf = target_enc.fit_transform(X=loan_perf, y=loan_perf['good_bad_flag'])
loan_perf.head().T
- First, we create a target encoder object and pass the column(s) we want to encode.
- Next, we fit-transform the dataset by passing both the features and the target of interest.
There are many more encoding schemes (binary encoders, count encoders, leave-one-out encoders, CatBoost encoders, etc.) that you can try out for your use cases. A good place to learn about them is the official documentation of the category_encoders library.
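As a quick illustration of the shared API, here is a sketch of a binary encoder on a toy dataframe; the create-then-fit-transform pattern is the same one used above:
#binary encoding: classes are label encoded, then the labels are expanded into binary digit columns
toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'green', 'red']})
binary_enc = ce.BinaryEncoder(cols=['color'])
print(binary_enc.fit_transform(toy))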
How to handle numerical/continuous features
Numerical/Continuous features are the most common type of features found in datasets. They can represent values from a given range. For example, the price of a product, the temperature of a place, or coordinates on a map.
Feature engineering on numerical data mostly depends on domain knowledge. Some of the things we can do here are:
Log Transformation: Log transformation helps reduce skewness, making a feature's distribution more symmetric (closer to normal). This can help many machine learning models perform better.
Log transformations are mostly performed on skewed features. Features can either be left or right skewed. Skewness can be easily checked by visualization. To demonstrate log transformation, we will use the Distance (KM) feature in the Sendy dataset, as this feature is right skewed.
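To confirm the skew before transforming, a quick check (not in the original notebook) might look like this:
#positive skew indicates a right-skewed distribution
print(sendy_data['Distance (KM)'].skew())
sns.distplot(sendy_data['Distance (KM)'])
plt.title("Distribution of Distance (KM) before log transform")
plt.show()
Now we apply the log transformation: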
sendy_data['Distance (KM)'] = np.log1p(sendy_data['Distance (KM)'])
sns.distplot(sendy_data['Distance (KM)'])
plt.title("Log-Transformed plot of Distance (KM)")
plt.show()
We log transform by taking the logarithm of every instance. We use NumPy's efficient log1p implementation, which adds 1 to each value before taking the logarithm. This helps us avoid taking the log of zero.
Using Domain Knowledge: If you have domain expertise or have someone with domain expertise on your team, you can come up with new features that can greatly help your machine learning models. We demonstrate this by creating some new features from the loan dataset, as shown below.
1. Interest elapsed: Interest elapsed is a feature we create from the difference between totaldue and loanamount, as shown in the sketch below.
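A minimal sketch of this, assuming the totaldue and loanamount columns are present in the previous-loans table (loan_prev):
#interest = total amount due minus the principal borrowed
loan_prev['interest_elapsed'] = loan_prev['totaldue'] - loan_prev['loanamount']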
2. Loan count: We can calculate the total number of loans collected by a customer by aggregating loan numbers.
#Group by customer id and count the number of previous loans per customer
loannumber_count = loan_prev.groupby('customerid')['loannumber'].count().reset_index(name='loannumber_count')
#merge back to the dataset on customerid
loan_prev = loan_prev.merge(loannumber_count, how='left', on='customerid')
loan_prev.head()
3. Speed: From physics, we know that speed is distance divided by time. We can therefore create a new feature (speed) in the Sendy dataset from the features Distance (KM) and Time from Pickup to Arrival.
#create feature speed in sendy dataset
sendy_data['speed'] = sendy_data['Distance (KM)'] / sendy_data['Time from Pickup to Arrival']
sendy_data.head().T
Polynomial (Cross) Features
Polynomial features create interactions among features. They help capture relationships among independent variables and can decrease the bias of a machine learning model, as long as they don't contribute to massive overfitting.
We can create polynomial/cross features manually by simply adding, multiplying, or dividing features with each other. In this article, we're going to use the PolynomialFeatures transformer from the sklearn library.
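For reference, a manual cross feature can be as simple as combining two columns arithmetically; here is a sketch using columns we already know exist in loan_prev:
#a handmade interaction feature: amount due per day of the loan term
loan_prev['totaldue_per_termday'] = loan_prev['totaldue'] / loan_prev['termdays']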
We will create polynomial features from the loannumber, totaldue, and termdays features in the loan previous dataset, as shown below.
#Use Sklearn Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures()
to_cross = ['loannumber', 'totaldue', 'termdays']
crossed_feats = poly.fit_transform(loan_prev[to_cross].values)
#Convert to Pandas DataFrame and merge to original dataset
crossed_feats = pd.DataFrame(crossed_feats)
loan_prev = pd.concat([loan_prev, crossed_feats], axis=1)
loan_prev.head().T
- First, we import the PolynomialFeatures class from sklearn.preprocessing.
- Next, we create an object from it. Here, we can specify the degree of interaction (defaults to 2).
- Next, we select the columns we want to cross (PolynomialFeatures crosses all the columns it is given).
- Next, we perform crossing using the fit-transform method.
- PolynomialFeatures returns a NumPy array, so we convert it to a Pandas dataframe and then merge it with the original dataset (loan_prev).
Normalization of Features
Normalization helps change the values of numeric features to a common scale, without distorting differences in the range of values or losing information. Normalization is very important for distance-based models like KNNs, and it also helps speed up training in neural networks.
Some normalization functions available in sklearn include:
- StandardScaler: Standardize features by subtracting the mean and scaling to unit variance.
- RobustScaler: Scale features using statistics that are robust to outliers.
- MinMaxScaler: Normalize features by scaling each feature to a specified range (range depends on you!).
from sklearn.preprocessing import StandardScaler
feats = ['loannumber', 'totaldue', 'termdays']
sc = StandardScaler()
sc_data = sc.fit_transform(loan_prev[feats])
sc_data
from sklearn.preprocessing import RobustScaler
robsc = RobustScaler()
rb_data = robsc.fit_transform(loan_prev[feats])
rb_data
from sklearn.preprocessing import MinMaxScaler
minsc = MinMaxScaler(feature_range=(0,2))
minmax_data = minsc.fit_transform(loan_prev[feats])
minmax_data
Working with date features
Date features are a popular type of feature present in many datasets. These features are temporal in nature and require specific feature extraction techniques. There are numerous things we can do with temporal features, some of which we briefly demonstrate below.
Time Elapsed: Time elapsed is the difference in time between two dates. We demonstrate this below by calculating the seconds elapsed between the approveddate and creationdate features in the loan performance dataset.
#First convert to pandas datetime format
loan_perf['approveddate'] = pd.to_datetime(loan_perf['approveddate'])
loan_perf['creationdate'] = pd.to_datetime(loan_perf['creationdate'])
loan_perf['date_elapsed_in_secs'] = (loan_perf['approveddate'] - loan_perf['creationdate']) / np.timedelta64(1, 's') #substitute 's' with 'h' or 'm' for hours or minutes
loan_perf.head()
Extract Date Features: We demonstrate below how to extract features like days, weeks, hours, seconds, etc:
#First convert to pandas datetime format
loan_perf['approveddate'] = pd.to_datetime(loan_perf['approveddate'])
#use pandas built in functions
loan_perf['approved_day'] = loan_perf['approveddate'].dt.day
loan_perf['approved_week'] = loan_perf['approveddate'].dt.isocalendar().week #.dt.week is deprecated in newer pandas versions
loan_perf['approved_hour'] = loan_perf['approveddate'].dt.hour
Period of the day: We can extract the period of the day (morning, afternoon, evenings) from a date feature by manually mapping the hours of a day to the period. We demonstrate this below.
def map_hours(x):
    if x in [0,1,2,3,4,5,6,7,8,9,10,11,12]:
        return 'morning'
    elif x in [13,14,15,16]:
        return 'afternoon'
    else:
        return 'evening'
loan_perf['period_of_day'] = loan_perf['approved_hour'].map(map_hours)
loan_perf.head()
Working with latitudes and longitudes
Geo-based features are a class of features present in a wide range of datasets. They record the geographical location of a place/point in space. Features like longitudes, latitudes, and addresses are geo-features that need to be engineered.
There are numerous things we can do with latitude and longitude features. We could use a geocoding library like Geopy to convert these numerical values into physical addresses on a map. But these methods are slow and don't really scale to large datasets, so in this article we'll bypass them in favor of simpler and quicker ways to extract features from longitudes and latitudes.
The techniques shown below are culled from this amazing kernel on Kaggle by Beluga.
1. Manhattan distance: The Manhattan distance is the sum of the horizontal and vertical distances between two points. Let's demonstrate this below using the Sendy dataset:
#Manhattan distance
def manhattan_distance(lat1, lng1, lat2, lng2):
    a = np.abs(lat2 - lat1)
    b = np.abs(lng1 - lng2)
    return a + b

sendy_data['manhattan_dist'] = manhattan_distance(sendy_data['Pickup Lat'].values, sendy_data['Pickup Long'].values,
                                                  sendy_data['Destination Lat'].values, sendy_data['Destination Long'].values)
sendy_data.head()
- First, we write a function to calculate the Manhattan distance. This is implemented in base NumPy.
- Next, we use the function to calculate the Manhattan distance between Pickup and Destination.
2. Haversine distance: The Haversine distance is the great-circle distance between two points on a sphere, given their longitudes and latitudes. It’s very important in navigation.
#Haversine distance
def haversine_array(lat1, lng1, lat2, lng2):
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    AVG_EARTH_RADIUS = 6371  # in km
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    return h

sendy_data['haversine_dist'] = haversine_array(sendy_data['Pickup Lat'].values, sendy_data['Pickup Long'].values,
                                               sendy_data['Destination Lat'].values, sendy_data['Destination Long'].values)
sendy_data.head()
- First, we write a function to calculate the Haversine distance. This is also implemented in base NumPy.
- Next, we use the function to calculate the Haversine distance between Pickup and Destination.
3. Bearing: The bearing is the compass direction used to travel from a starting point to a destination. It is conventionally expressed in the range 0 to 360 degrees (the function below returns values between -180 and 180 degrees, which can be shifted into that range by adding 360 to negative values).
#Bearing
def bearing_array(lat1, lng1, lat2, lng2):
    lng_delta_rad = np.radians(lng2 - lng1)
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    y = np.sin(lng_delta_rad) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(lng_delta_rad)
    return np.degrees(np.arctan2(y, x))

sendy_data['bearing'] = bearing_array(sendy_data['Pickup Lat'].values, sendy_data['Pickup Long'].values,
                                      sendy_data['Destination Lat'].values, sendy_data['Destination Long'].values)
sendy_data.head()
- First, we create the bearing function in NumPy.
- Next, we use the function to calculate the bearing between Pickup and Destination.
4. Center point: We can calculate the mid-point between two points from their latitudes and longitudes. This can be done by adding the points and dividing the result by 2.
#Get center of lat and longitude
sendy_data['center_latitude'] = (sendy_data['Pickup Lat'].values + sendy_data['Destination Lat'].values) / 2
sendy_data['center_longitude'] = (sendy_data['Pickup Long'].values + sendy_data['Destination Long'].values) / 2
sendy_data.head()
- First, we calculate the center latitude by adding Pickup Latitude with Destination Latitude, and then dividing the result by 2. We do the same for the Pickup Longitude and Destination Longitude.
And we draw the curtain here…
Feature engineering is essential and is often the difference between a good machine learning model and the best machine learning model.
In this post, we have learned about some of the techniques and tools for performing feature engineering. We started by defining feature engineering, then looked at ways for handling missing values.
Next, we explored some encoding techniques for categorical features, and then, various ways for handling numerical features, where we specifically talked about log transformations, polynomial/cross features, and the use of domain expertise in creating new features.
Then, we looked at some normalization strategies available in sklearn, how to work with date features, and finally, how to handle geo features like latitude and longitude.
This has been a really long post, but I hope you’ve learned a lot and will use some (or many!) of the techniques explored here in your next project.
If you have any questions, suggestions, or feedback, don’t hesitate to use the comment section below. I’ll see you soon, happy analysis!
Link to full Notebook with explanations and codes on GitHub
Connect with me on Twitter.
Connect with me on LinkedIn.