ANOVA for Feature Selection
In this notebook, we demonstrate how ANOVA (Analysis of Variance) can be used to identify better features for machine learning models. We’ll use the Fantasy Premier League (FPL) dataset to show how ANOVA helps in selecting features that best differentiate categories.
# Uncomment the line below if you need to install the dataidea package
# !pip install -U dataidea
First, we’ll import the necessary packages: scipy for performing ANOVA, dataidea for loading the FPL dataset, and SelectKBest from scikit-learn for univariate feature selection based on statistical tests.
from scipy import stats
from sklearn.feature_selection import SelectKBest, f_classif
import dataidea as di
Let’s load the FPL dataset and preview the top 5 rows.
# Load FPL dataset
fpl = di.loadDataset('fpl')

# Preview the top 5 rows
fpl.head(n=5)
| | First_Name | Second_Name | Club | Goals_Scored | Assists | Total_Points | Minutes | Saves | Goals_Conceded | Creativity | Influence | Threat | Bonus | BPS | ICT_Index | Clean_Sheets | Red_Cards | Yellow_Cards | Position |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bruno | Fernandes | MUN | 18 | 14 | 244 | 3101 | 0 | 36 | 1414.9 | 1292.6 | 1253 | 36 | 870 | 396.2 | 13 | 0 | 6 | MID |
| 1 | Harry | Kane | TOT | 23 | 14 | 242 | 3083 | 0 | 39 | 659.1 | 1318.2 | 1585 | 40 | 880 | 355.9 | 12 | 0 | 1 | FWD |
| 2 | Mohamed | Salah | LIV | 22 | 6 | 231 | 3077 | 0 | 41 | 825.7 | 1056.0 | 1980 | 21 | 657 | 385.8 | 11 | 0 | 0 | MID |
| 3 | Heung-Min | Son | TOT | 17 | 11 | 228 | 3119 | 0 | 36 | 1049.9 | 1052.2 | 1046 | 26 | 777 | 315.2 | 13 | 0 | 0 | MID |
| 4 | Patrick | Bamford | LEE | 17 | 11 | 194 | 3052 | 0 | 50 | 371.0 | 867.2 | 1512 | 26 | 631 | 274.6 | 10 | 0 | 3 | FWD |
ANOVA helps us determine whether there is a significant difference between the means of two or more groups. For feature selection, we use it to find the features whose values best separate the categories we want to predict: the larger a feature’s F-statistic, the more of its variation lies between the groups rather than within them, and the more useful it is for telling the groups apart.
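Concretely, the one-way ANOVA F-statistic is the ratio of between-group variability to within-group variability:

$$
F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}}
  = \frac{\sum_{g=1}^{k} n_g \,(\bar{x}_g - \bar{x})^2 \,/\, (k - 1)}{\sum_{g=1}^{k} \sum_{i=1}^{n_g} (x_{ig} - \bar{x}_g)^2 \,/\, (N - k)}
$$

where $k$ is the number of groups (here, the four player positions), $n_g$ and $\bar{x}_g$ are the size and mean of group $g$, $\bar{x}$ is the overall mean, and $N$ is the total number of observations. A large $F$ means the group means are spread far apart relative to the spread inside each group.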
ANOVA for Goals Scored
We will create groups of goals scored by each player position (forwards, midfielders, defenders, and goalkeepers) and run an ANOVA test.
# Create groups of goals scored for each player position
forwards_goals = fpl[fpl.Position == 'FWD']['Goals_Scored']
midfielders_goals = fpl[fpl.Position == 'MID']['Goals_Scored']
defenders_goals = fpl[fpl.Position == 'DEF']['Goals_Scored']
goalkeepers_goals = fpl[fpl.Position == 'GK']['Goals_Scored']
# Perform the ANOVA test for the groups
f_statistic, p_value = stats.f_oneway(forwards_goals, midfielders_goals, defenders_goals, goalkeepers_goals)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
F-statistic: 33.281034594400445
p-value: 3.9257634156019246e-20
We observe an F-statistic of 33.281 and a p-value of 3.926e-20, indicating a highly significant difference in mean goals scored between the position groups at any conventional significance level.
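To see where this difference comes from, we can look at the group means directly. A quick sanity-check sketch, using the fpl DataFrame loaded above (output not shown):

# Average goals scored per position
fpl.groupby('Position')['Goals_Scored'].mean()

Forwards and midfielders should average far more goals than defenders and goalkeepers, and that between-group separation is exactly what the F-statistic measures.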
ANOVA for Assists
Next, we’ll create groups for assists and run an ANOVA test.
# Create groups of assists for each player position
forwards_assists = fpl[fpl.Position == 'FWD']['Assists']
midfielders_assists = fpl[fpl.Position == 'MID']['Assists']
defenders_assists = fpl[fpl.Position == 'DEF']['Assists']
goalkeepers_assists = fpl[fpl.Position == 'GK']['Assists']
# Perform the ANOVA test for the groups
f_statistic, p_value = stats.f_oneway(forwards_assists, midfielders_assists, defenders_assists, goalkeepers_assists)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
F-statistic: 19.263717036430815
p-value: 5.124889288362087e-12
We observe an F-statistic of 19.264 and a p-value of 5.125e-12, again indicating a significant difference between the position groups.
Comparing Results
Both features produce significant F-statistics, but Goals_Scored scores higher (33.28 vs. 19.26), meaning its between-position variation is larger relative to its within-position variation. It is therefore the better of the two features for differentiating player positions.
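This pairwise comparison generalizes naturally: to rank any set of candidate numeric features, we can loop over them and compute each one’s F-statistic. A minimal sketch, assuming the fpl DataFrame and the stats import from above (the candidate_features list is just an illustration):

# Rank candidate features by their ANOVA F-statistic across positions
candidate_features = ['Goals_Scored', 'Assists']
for feature in candidate_features:
    # Split the feature's values into one group per position
    groups = [group[feature] for _, group in fpl.groupby('Position')]
    f_statistic, p_value = stats.f_oneway(*groups)
    print(f'{feature}: F = {f_statistic:.3f}, p = {p_value:.3e}')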
Using SelectKBest for Feature Selection
We can also use SelectKBest from scikit-learn to automate this process. With score_func=f_classif, it computes the same ANOVA F-statistics and keeps the k highest-scoring features.
# Use scikit-learn's SelectKBest (with f_classif)
test = SelectKBest(score_func=f_classif, k=1)

# Fit the model to the data
fit = test.fit(fpl[['Goals_Scored', 'Assists']], fpl.Position)

# Get the F-statistics
scores = fit.scores_

# Select the best feature
features = fit.transform(fpl[['Goals_Scored', 'Assists']])

# Get the indices of the selected features (optional)
selected_indices = test.get_support(indices=True)

# Print indices and scores
print('Feature Scores: ', scores)
print('Selected Features Indices: ', selected_indices)
Feature Scores: [33.28103459 19.26371704]
Selected Features Indices: [0]
The feature at index 0 (Goals_Scored) is selected as the best feature, based on its higher F-statistic.
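In practice, SelectKBest is usually embedded in a scikit-learn Pipeline so that the feature scores are computed on the training data only, avoiding leakage into the test set. A minimal sketch, assuming the same fpl data; the LogisticRegression classifier is just an illustrative choice:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the data so feature selection is fitted on the training set only
X_train, X_test, y_train, y_test = train_test_split(
    fpl[['Goals_Scored', 'Assists']], fpl.Position, random_state=42
)

# SelectKBest runs inside the pipeline, before the (illustrative) classifier
pipeline = Pipeline([
    ('select', SelectKBest(score_func=f_classif, k=1)),
    ('classify', LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
print('Test accuracy:', pipeline.score(X_test, y_test))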
Summary
In this notebook, we demonstrated how to use ANOVA for feature selection on the Fantasy Premier League dataset. By comparing F-statistics, we found that ‘Goals Scored’ differentiates player positions better than ‘Assists’, and we confirmed this with SelectKBest from scikit-learn. The same approach can be applied to other datasets and features to improve the performance of machine learning models.