
Screen Pairwise Correlations Among Candidate Features
Source: R/screen_feature_correlations.R
screen_feature_correlations.Rd
Computes pairwise correlations among numeric candidate predictors and suggests
a reduced set of variables for modeling workflows that should avoid strongly
correlated features. This is useful after window_pane() creates many
metric-window columns.
Usage
screen_feature_correlations(
  data,
  feature_cols = NULL,
  exclude_cols = NULL,
  method = c("pearson", "spearman", "kendall"),
  threshold = 0.8,
  use = "pairwise.complete.obs"
)
Arguments
- data
A data frame containing candidate feature columns.
- feature_cols
Optional character vector of feature columns to compare. If NULL, all numeric columns are considered except exclude_cols.
- exclude_cols
Columns to exclude when feature_cols = NULL, such as identifiers, assessment times, or response variables.
- method
Correlation method passed to stats::cor(). One of "pearson", "spearman", or "kendall".
- threshold
Absolute correlation threshold used to flag highly correlated feature pairs.
- use
Missing-data handling passed to
stats::cor().
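The effect of the use argument can be illustrated with stats::cor() directly. The following is a minimal sketch on a made-up data frame (toy and its columns are hypothetical and not part of the package); it only shows how base R treats missing values under two settings, not how screen_feature_correlations() processes its input internally.

```r
# Hypothetical toy data with missing values in two columns
toy <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 6, 8, 10),
  z = c(5, 3, NA, 1, 0)
)

# use = "everything" (the stats::cor() default) propagates NA to any
# pair that involves a column with missing values
cor_all <- stats::cor(toy, method = "spearman", use = "everything")

# use = "pairwise.complete.obs" computes each pair from the rows that
# are complete for that particular pair
cor_pw <- stats::cor(toy, method = "spearman", use = "pairwise.complete.obs")

print(cor_all)
print(cor_pw)
```

Because each pair may then be based on a different subset of rows, pairwise correlations from columns with many missing values are worth reading with some caution.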
Value
A list with:
- correlation_matrix
Feature-by-feature correlation matrix.
- high_correlations
Data frame of feature pairs with absolute correlation greater than or equal to threshold.
- suggested_features
Character vector of features suggested to keep.
- removed_features
Character vector of features suggested for removal.
- feature_summary
Data frame with each feature's mean absolute correlation and suggested decision.
Details
The selection heuristic is intentionally simple and transparent. For every
pair with absolute correlation greater than or equal to threshold, the
function removes the variable with the larger mean absolute correlation to
all other candidate variables. This favors keeping variables that are less
redundant overall.
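The heuristic described above can be sketched in base R roughly as follows. This is an illustration under stated assumptions, not the package's actual implementation: the function name sketch_screen() and the toy data are hypothetical, pairs are processed in upper-triangle order, mean absolute correlations are computed once up front, and a pair is skipped if one of its members has already been removed. The real function may differ on any of these points.

```r
# Illustrative sketch only; sketch_screen() is a hypothetical helper
sketch_screen <- function(data, threshold = 0.8, method = "pearson") {
  cor_mat <- stats::cor(data, method = method, use = "pairwise.complete.obs")
  abs_mat <- abs(cor_mat)
  diag(abs_mat) <- NA  # ignore each feature's correlation with itself

  # Mean absolute correlation of each feature with all other features
  mean_abs <- rowMeans(abs_mat, na.rm = TRUE)

  # Index pairs (row < col) whose absolute correlation meets the threshold
  flagged <- which(abs_mat >= threshold & upper.tri(abs_mat), arr.ind = TRUE)

  removed <- character(0)
  for (k in seq_len(nrow(flagged))) {
    a <- rownames(abs_mat)[flagged[k, "row"]]
    b <- colnames(abs_mat)[flagged[k, "col"]]
    # Assumption: a pair is already resolved if either member was removed
    if (a %in% removed || b %in% removed) next
    # Drop the member that is more correlated with everything else
    drop <- if (mean_abs[[a]] >= mean_abs[[b]]) a else b
    removed <- c(removed, drop)
  }

  list(
    suggested = setdiff(colnames(data), removed),
    removed = removed
  )
}

# Hypothetical toy data: a and b are nearly identical, c is unrelated
toy <- data.frame(a = 1:10)
toy$c <- rep(c(1, -1), 5)
toy$b <- toy$a + 0.1 * toy$c

res <- sketch_screen(toy[, c("a", "b", "c")], threshold = 0.8)
print(res)
```

In this toy case only one of the two near-duplicate columns is dropped, and the unrelated column is kept.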
Examples
weather <- simulate_weather_series(
  days = 40,
  n_series = 8,
  id_col = "site_id",
  seed = 1
)
assessments <- simulate_assessment_data(weather, id_col = "site_id", seed = 1)
windows <- make_windows(min_offset = -10, max_offset = -1, width = 3)
features <- window_pane(
  weather = weather,
  assessments = assessments,
  windows = windows,
  reference_col = "assessment_time",
  id_col = "site_id",
  response_col = "disease_intensity"
)
correlation_screen <- screen_feature_correlations(
  features,
  exclude_cols = c("site_id", "assessment_time", "disease_intensity"),
  method = "spearman",
  threshold = 0.8
)
correlation_screen$suggested_features
#> [1] "temp_mean_window_m10_m07" "temp_min_window_m10_m07"
#> [3] "temp_max_window_m10_m07" "rh_mean_window_m10_m07"
#> [5] "rain_sum_window_m10_m07" "leaf_wetness_sum_window_m10_m07"
#> [7] "temp_mean_window_m09_m06" "temp_min_window_m09_m06"
#> [9] "temp_max_window_m09_m06" "temp_mean_window_m08_m05"
#> [11] "rh_mean_window_m08_m05" "leaf_wetness_sum_window_m08_m05"
#> [13] "temp_mean_window_m07_m04" "temp_min_window_m07_m04"
#> [15] "temp_max_window_m07_m04" "rh_mean_window_m07_m04"
#> [17] "rain_sum_window_m07_m04" "temp_max_window_m06_m03"
#> [19] "rh_mean_window_m06_m03" "temp_mean_window_m05_m02"
#> [21] "temp_min_window_m05_m02" "rh_mean_window_m05_m02"
#> [23] "temp_mean_window_m04_m01" "temp_min_window_m04_m01"
#> [25] "temp_max_window_m04_m01" "leaf_wetness_sum_window_m04_m01"