Engineering Feature Engineering

Defining feature engineering and applying it to your dataset

What is feature engineering?

Feature engineering is the use of domain knowledge to create features that make machine learning algorithms work. It’s a paramount step in the real-world application of ML.

It’s also both difficult and expensive.

  • Feature engineering is essentially the process of creating new input features from existing attributes to improve model performance.
  • It’s about isolating/highlighting key information to help your algorithm “focus” on what’s important.
  • Feature engineering takes place in both data preparation and model building.

What is a feature?

A feature is an attribute or property shared by all of the independent units on which analysis or prediction is to be done. Any attribute can be a feature, as long as it's useful to the model.

Quality test of the engineered feature

There are a number of key factors to consider when testing the quality and viability of engineered features:

  1. The intuition behind coming up with the new feature
  2. Support by domain knowledge or exploratory analysis
  3. Ability to be computed for future observations
  4. Influence on the prediction (label), i.e. the feature should actually carry signal for the target; don't just create a feature for the sake of it.

Types of features to be engineered

Note: No new feature should be derived from the target variable itself, as that would leak the label into the inputs.

1. Indicator (Dummy) Variables

Indicator variables are a set of (0 or 1) variables that each represent a single category from a categorical feature.

  • Threshold-based indicator variables: Example: In some countries, girls are allowed to marry without their parents’ presence when they turn 21. So a proper indicator variable in a related problem could be independent_marriage_allowed: true for every girl’s record with age greater than or equal to 21.
  • Multiple feature-based indicator variables: Example: Some laws enforce extra taxes on shops that have more than 200 employees and more than 5 branches. A proper indicator could be extra_taxes: true for every shop that has an employee_count greater than or equal to 200 and a branches_count greater than or equal to 5.
  • Special event indicators: Example: People tend to go shopping around special days like Christmas, New Year’s and so on. A proper indicator could be new_years: true when the date is the first of January.
  • Grouping indicators: Example: Traffic on a certain website can be divided into subgroups such as paid and organic; paid_traffic could be true for all observations whose traffic comes from ad sources such as google_ads or instagram_ads.
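As a rough sketch, here is how these four indicator types might look in pandas. The DataFrame and its column names (age, employee_count, branches_count, date, traffic_source) are hypothetical stand-ins for the examples above:

    import pandas as pd

    # Hypothetical records matching the scenarios above
    df = pd.DataFrame({
        "age": [19, 23, 30],
        "employee_count": [50, 250, 300],
        "branches_count": [2, 6, 8],
        "date": pd.to_datetime(["2023-01-01", "2023-06-15", "2023-12-25"]),
        "traffic_source": ["google_ads", "organic", "instagram_ads"],
    })

    # Threshold-based indicator
    df["independent_marriage_allowed"] = (df["age"] >= 21).astype(int)

    # Multiple feature-based indicator
    df["extra_taxes"] = ((df["employee_count"] >= 200)
                         & (df["branches_count"] >= 5)).astype(int)

    # Special event indicator
    df["new_years"] = ((df["date"].dt.month == 1)
                       & (df["date"].dt.day == 1)).astype(int)

    # Grouping indicator
    df["paid_traffic"] = df["traffic_source"].isin(
        ["google_ads", "instagram_ads"]).astype(int)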

2. Aggregation of Sparse Categories in Nominal Attributes

Sparse categories are those that have very few total observations. They’re a problem because they’re too detailed and can lead to overfitting an ML model to specific values.

So for example this…

Colors
baby_blue: ***(3)
navy_blue: **(2)
light_green: **********(10)
grass_green: ***************(15)
red: **********(10)
maroon: *********(9)
white: *(1)
purple: *(1)

Could be turned into this…

Colors
blue: *****(5)
light_green: **********(10)
grass_green: ***************(15)
red: **********(10)
maroon: *********(9)
others: **(2)
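A minimal pandas sketch of this aggregation, assuming a colors series like the one above and an arbitrary, hypothetical minimum count of 5 for a category to stand on its own:

    import pandas as pd

    colors = pd.Series(
        ["baby_blue"] * 3 + ["navy_blue"] * 2
        + ["light_green"] * 10 + ["grass_green"] * 15
        + ["red"] * 10 + ["maroon"] * 9
        + ["white", "purple"]
    )

    # First merge related sparse categories using domain knowledge...
    colors = colors.replace({"baby_blue": "blue", "navy_blue": "blue"})

    # ...then lump whatever is still rare into a catch-all "others" bucket
    counts = colors.value_counts()
    rare = counts[counts < 5].index
    colors = colors.where(~colors.isin(rare), "others")

    print(colors.value_counts())

The cutoff is a judgment call: set it too high and you discard real signal along with the noise.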

3. Interaction Features

Interaction features are combinations of two or more features: products, sums, differences, or ratios. Look at each pair of features and ask yourself, “Could I combine this information in any way that might be more useful?”

  • Sum of two features: Example: Traffic on a certain website can come from several sources; paid_traffic could be the sum of the traffic counts from ad sources such as google_ads and instagram_ads.
  • Difference between two features: Example: The difference between house_built_date and house_purchase_date creates the feature house_age_at_purchase.
  • Product of two features: Example: tax_percentage multiplied by salary to get tax_amount
  • Division of two features: Example: area_of_block divided by houses_count would give us the average house_area.
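Here is a short pandas sketch covering all four operations, again with hypothetical column names taken from the examples above:

    import pandas as pd

    # Hypothetical data for the four interaction examples
    df = pd.DataFrame({
        "google_ads": [120, 80],
        "instagram_ads": [60, 40],
        "house_built_date": pd.to_datetime(["1990-05-01", "2005-08-20"]),
        "house_purchase_date": pd.to_datetime(["2010-03-15", "2015-01-10"]),
        "tax_percentage": [0.20, 0.25],
        "salary": [50_000, 60_000],
        "area_of_block": [12_000, 9_000],
        "houses_count": [40, 30],
    })

    # Sum: total paid traffic across ad sources
    df["paid_traffic"] = df["google_ads"] + df["instagram_ads"]

    # Difference: age of the house at purchase time, in years
    df["house_age_at_purchase"] = (
        (df["house_purchase_date"] - df["house_built_date"]).dt.days / 365.25
    )

    # Product: tax amount owed
    df["tax_amount"] = df["tax_percentage"] * df["salary"]

    # Division: average house area within a block
    df["house_area"] = df["area_of_block"] / df["houses_count"]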

Conclusion

In this post, we’ve covered some of the most important strategies and methods for feature engineering, a difficult yet essential task in many machine learning use cases.

One last important note to keep in mind before we conclude:

Just as we can add features, it’s also important to keep in mind that we can remove features that don’t make sense to pass into ML algorithms—for example, ID columns, free-text descriptions, or features that wouldn’t be available at the time of prediction.

In a real-world case, the role of feature engineering is to use domain-specific knowledge and human insight to ensure that the data contains relevant indicators for the prediction task. The beauty of machine learning, in turn, is that it discovers the genuinely relevant patterns and filters out the incorrect human insights.
