Feature Selection

Feature Selection Definition

Feature selection is the process of isolating the most consistent, non-redundant, and relevant features to use in model construction. Methodically reducing the size of datasets is important as the size and variety of datasets continue to grow. The main goal of feature selection is to improve the performance of a predictive model and reduce the computational cost of modeling.



What is Feature Selection?

Feature selection, one of the main components of feature engineering, is the process of selecting the most important features to use as input to machine learning algorithms. Feature selection techniques reduce the number of input variables by eliminating redundant or irrelevant features, narrowing the set down to those most relevant to the machine learning model. 

The main benefits of performing feature selection in advance, rather than letting the machine learning model figure out which features are most important, include:

  • simpler models: simple models are easier to explain - a model that is too complex to interpret is of little practical value
  • shorter training times: a more precise subset of features decreases the amount of time needed to train a model
  • variance reduction: fewer, better features increase the precision of the estimates that can be obtained for a given simulation 
  • avoiding the curse of dimensionality: the curse of dimensionality states that, as the number of features increases, the volume of the feature space grows so fast that the available data become sparse - dimensionality-reduction techniques such as PCA can help counteract this 
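As a minimal sketch of the dimensionality-reduction point above, PCA (strictly a feature-extraction rather than a feature-selection technique) can compress many correlated columns into a handful of components; this example uses synthetic data and scikit-learn, assumed installed:

```python
# Sketch: countering high dimensionality with PCA (synthetic data, scikit-learn).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))   # 100 samples, 20 features

pca = PCA(n_components=5)        # keep the 5 strongest directions of variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)           # (100, 5)
```

The model is then trained on the 5 compressed columns instead of the original 20.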

The most common input variable data types include: Numerical Variables, such as Integer Variables and Floating Point Variables; and Categorical Variables, such as Boolean Variables, Ordinal Variables, and Nominal Variables. Popular tools for feature selection include scikit-learn's feature_selection module in Python and packages such as caret and Boruta in R. 

What makes one variable better than another? Typically, three key properties make a feature representation desirable: it is easy to model, it works well with regularization strategies, and it disentangles the underlying causal factors.

Feature Selection Methods

Feature selection algorithms are categorized as either supervised, which can be used with labeled data, or unsupervised, which can be used with unlabeled data. The supervised techniques are further classified as filter methods, wrapper methods, embedded methods, or hybrid methods:

  • Filter methods: Filter methods select features based on statistical measures rather than cross-validation performance. A chosen metric is applied to score each attribute and discard irrelevant ones. Filter methods are either univariate, in which an ordered ranking of the features informs the final selection of the feature subset; or multivariate, which evaluates the relevance of the features as a whole, identifying redundant and irrelevant features.
  • Wrapper methods: Wrapper methods treat the selection of a set of features as a search problem, in which candidate combinations of features are prepared, evaluated, and compared against other combinations using the predictive model itself. This approach facilitates the detection of possible interactions amongst variables, at a higher computational cost. Popular examples include Boruta feature selection and forward feature selection.
  • Embedded methods: Embedded methods integrate feature selection into the learning algorithm itself, so that model fitting and feature selection are performed simultaneously. The features that contribute most at each iteration of the model training process are retained. Random forest importances, decision trees, and LASSO regularization are common embedded methods.
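The three method families above can be sketched with scikit-learn on a toy classification dataset; the dataset and parameter choices here are illustrative, not prescriptive:

```python
# Sketch: one filter, one wrapper, and one embedded method from scikit-learn.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Toy data: 200 samples, 10 features, only 4 of which are informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

# Filter: rank features by a univariate statistic (ANOVA F-score), keep the top 4.
X_filter = SelectKBest(score_func=f_classif, k=4).fit_transform(X, y)

# Wrapper: recursive feature elimination searches subsets using a model's weights.
X_wrapper = RFE(LogisticRegression(max_iter=1000),
                n_features_to_select=4).fit_transform(X, y)

# Embedded: a random forest's feature importances drive selection during training.
X_embedded = SelectFromModel(RandomForestClassifier(random_state=0),
                             max_features=4).fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```

Each transform returns the same rows with only the selected columns, ready to feed into a downstream model.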

How to Choose a Feature Selection Method

Choosing the best feature selection method depends on the input and output in consideration:

  • Numerical Input, Numerical Output: a regression problem with numerical input variables - use a correlation coefficient, such as Pearson's correlation coefficient (for linear relationships) or Spearman's rank coefficient (for nonlinear, monotonic relationships).
  • Numerical Input, Categorical Output: a classification problem with numerical input variables - use a correlation measure that accounts for the categorical target, such as the ANOVA F-statistic (linear) or Kendall's rank coefficient (nonlinear; assumes an ordinal target).
  • Categorical Input, Numerical Output: a regression predictive modeling problem with categorical input variables (rare) - use the same measures, such as the ANOVA F-statistic (linear) or Kendall's rank coefficient (nonlinear), with the roles of input and output reversed.
  • Categorical Input, Categorical Output: a classification predictive modeling problem with categorical input variables - use the Chi-Squared test (on contingency tables) or Mutual Information, a powerful measure that is agnostic to data types.
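These statistical tests are available in SciPy and scikit-learn; the sketch below pairs each input/output combination with a suitable measure on small synthetic data (variable names are illustrative):

```python
# Sketch: matching a statistical test to the input/output variable types.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import f_classif, chi2, mutual_info_classif

rng = np.random.default_rng(0)
x_num = rng.normal(size=100)
y_num = 2 * x_num + rng.normal(scale=0.5, size=100)  # numerical target
y_cat = (x_num > 0).astype(int)                      # categorical target
x_counts = rng.integers(0, 5, size=(100, 3))         # non-negative categorical codes

# Numerical input, numerical output: Pearson (linear) or Spearman (monotonic).
r, _ = pearsonr(x_num, y_num)
rho, _ = spearmanr(x_num, y_num)

# Numerical input, categorical output: ANOVA F-test.
f_scores, _ = f_classif(x_num.reshape(-1, 1), y_cat)

# Categorical input, categorical output: chi-squared or mutual information.
chi_scores, _ = chi2(x_counts, y_cat)
mi = mutual_info_classif(x_counts, y_cat, random_state=0)

print(round(r, 2), round(rho, 2))
```

Features are then ranked by the chosen score, and the lowest-scoring ones are dropped.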

Why Feature Selection is Important

Feature selection is an invaluable asset for data scientists. Understanding how to select important features in machine learning is crucial to the efficacy of the machine learning algorithm. Irrelevant, redundant, and noisy features can pollute an algorithm, negatively impacting learning performance, accuracy, and computational cost. Feature selection is increasingly important as the size and complexity of the average dataset continue to grow.

Does OmniSci Offer a Feature Selection Solution?

OmniSci Immerse is a browser-based, interactive data visualization client that enables users to visually explore data at the speed of thought. Data exploration is a crucial component in feature selection. The goal of data exploration for machine learning is to gain insights that will inform subsequent feature selection and model-building. Feature selection improves the machine learning process and increases the predictive power of machine learning algorithms by selecting the most important variables and eliminating redundant and irrelevant features.

Try OmniSci for Free today!