Computes pairwise correlations among numeric candidate predictors and suggests a reduced set of features for modeling workflows that should avoid strongly correlated predictors. This is useful after window_pane() has created many metric-by-window columns.

Usage

screen_feature_correlations(
  data,
  feature_cols = NULL,
  exclude_cols = NULL,
  method = c("pearson", "spearman", "kendall"),
  threshold = 0.8,
  use = "pairwise.complete.obs"
)

Arguments

data

A data frame containing candidate feature columns.

feature_cols

Optional character vector of feature columns to compare. If NULL, all numeric columns are considered except exclude_cols.

exclude_cols

Columns to exclude when feature_cols = NULL, such as identifiers, assessment times, or response variables.

method

Correlation method passed to stats::cor(). One of "pearson", "spearman", or "kendall".

threshold

Absolute correlation threshold used to flag highly correlated feature pairs. A pair is flagged when its absolute correlation is greater than or equal to threshold.

use

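Missing-data handling, passed to the use argument of stats::cor(). With the default, "pairwise.complete.obs", each correlation is computed from the rows in which both features are non-missing.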
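To illustrate what the use argument changes, here is a minimal base R sketch that calls stats::cor() directly on a small made-up data frame. The data frame toy and its columns are hypothetical; only stats::cor() and its use values come from base R.

```r
# Hypothetical toy data: b is missing in row 3, c is missing in row 4.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 4, NA, 1, 10, 13),
  c = c(6, 5, 4, NA, 2, 1)
)

# Pairwise deletion: the a-b correlation uses every row where both a and b
# are present (rows 1, 2, 4, 5, 6), even though c is missing in row 4.
pairwise <- stats::cor(toy, method = "spearman", use = "pairwise.complete.obs")

# Listwise deletion: any row with a missing value in any column is dropped
# first, so every correlation uses rows 1, 2, 5, 6 only.
listwise <- stats::cor(toy, method = "spearman", use = "complete.obs")

pairwise["a", "b"]  # 0.7, row 4 breaks the monotone pattern
listwise["a", "b"]  # 1, row 4 was dropped because c is missing there
```

Pairwise deletion keeps more data per pair, but each entry of the matrix may then be based on a different subset of rows, which is worth keeping in mind when comparing correlations across features.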

Value

A list with:

correlation_matrix

Feature-by-feature correlation matrix.

high_correlations

Data frame of feature pairs with absolute correlation greater than or equal to threshold.

suggested_features

Character vector of features suggested to keep.

removed_features

Character vector of features suggested for removal.

feature_summary

Data frame with each feature's mean absolute correlation and suggested decision.
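As a rough illustration of how a table of flagged pairs and a per-feature mean absolute correlation can be derived from a correlation matrix, the following base R sketch works on simulated data. It is not the package's implementation; the data frame toy and the column names feature_1, feature_2, and correlation are assumptions made for this example, and the actual columns of high_correlations and feature_summary may differ.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
# Hypothetical features: x2 is a noisy copy of x1, x3 is unrelated.
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),
  x3 = rnorm(n)
)

cor_mat <- stats::cor(toy, method = "pearson", use = "pairwise.complete.obs")
threshold <- 0.8

# Look only at the upper triangle so each unordered pair appears once
# and the diagonal (self-correlation) is ignored.
idx <- which(upper.tri(cor_mat) & abs(cor_mat) >= threshold, arr.ind = TRUE)
high_pairs <- data.frame(
  feature_1 = rownames(cor_mat)[idx[, "row"]],
  feature_2 = colnames(cor_mat)[idx[, "col"]],
  correlation = cor_mat[idx]
)

# Mean absolute correlation of each feature with all other features.
abs_mat <- abs(cor_mat)
diag(abs_mat) <- NA
mean_abs <- rowMeans(abs_mat, na.rm = TRUE)

high_pairs
mean_abs
```

In this simulated case only the x1 and x2 pair is flagged, and x3 has the smallest mean absolute correlation because it is unrelated to the other two columns.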

Details

The selection heuristic is intentionally simple and transparent. For every pair with absolute correlation greater than or equal to threshold, the function removes the variable with the larger mean absolute correlation to all other candidate variables. This favors keeping variables that are less redundant overall.
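The heuristic described above can be sketched in base R as follows. This is an illustrative approximation, not the package's source: the function name sketch_screen, the toy data, the order in which pairs are visited, the rule of skipping pairs in which one member has already been removed, and the tie-breaking rule are all assumptions.

```r
set.seed(42)
n <- 300
base_a <- rnorm(n)
base_b <- rnorm(n)
# Hypothetical features: two clusters of near-duplicates (a* and b*).
toy <- data.frame(
  a1 = base_a,
  a2 = base_a + rnorm(n, sd = 0.2),
  a3 = base_a + rnorm(n, sd = 0.2),
  b1 = base_b,
  b2 = base_b + rnorm(n, sd = 0.2)
)

# Hypothetical helper illustrating the documented heuristic.
sketch_screen <- function(data, threshold = 0.8, method = "pearson") {
  cor_mat <- stats::cor(data, method = method, use = "pairwise.complete.obs")
  abs_mat <- abs(cor_mat)
  diag(abs_mat) <- NA
  # Mean absolute correlation of each feature with all other candidates.
  mean_abs <- rowMeans(abs_mat, na.rm = TRUE)

  # Flagged pairs, each unordered pair once (upper triangle only).
  pairs <- which(upper.tri(abs_mat) & abs_mat >= threshold, arr.ind = TRUE)

  removed <- character(0)
  for (k in seq_len(nrow(pairs))) {
    f1 <- rownames(abs_mat)[pairs[k, "row"]]
    f2 <- colnames(abs_mat)[pairs[k, "col"]]
    # Assumption: a pair is already resolved if either member was removed.
    if (f1 %in% removed || f2 %in% removed) next
    # Remove the member that is more correlated with everything else.
    # Assumption: on an exact tie, the second member is removed.
    drop <- if (mean_abs[[f1]] > mean_abs[[f2]]) f1 else f2
    removed <- c(removed, drop)
  }

  list(
    suggested = setdiff(colnames(data), removed),
    removed = removed,
    mean_abs = mean_abs
  )
}

result <- sketch_screen(toy, threshold = 0.8)
result$suggested
result$removed
```

Under these assumptions the sketch keeps one feature from each cluster of near-duplicates. Because the outcome depends on the visiting order and on how already-removed features are handled, the packaged function may select different features on the same data.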

Examples

weather <- simulate_weather_series(
  days = 40,
  n_series = 8,
  id_col = "site_id",
  seed = 1
)
assessments <- simulate_assessment_data(weather, id_col = "site_id", seed = 1)
windows <- make_windows(min_offset = -10, max_offset = -1, width = 3)
features <- window_pane(
  weather = weather,
  assessments = assessments,
  windows = windows,
  reference_col = "assessment_time",
  id_col = "site_id",
  response_col = "disease_intensity"
)

correlation_screen <- screen_feature_correlations(
  features,
  exclude_cols = c("site_id", "assessment_time", "disease_intensity"),
  method = "spearman",
  threshold = 0.8
)
correlation_screen$suggested_features
#>  [1] "temp_mean_window_m10_m07"        "temp_min_window_m10_m07"        
#>  [3] "temp_max_window_m10_m07"         "rh_mean_window_m10_m07"         
#>  [5] "rain_sum_window_m10_m07"         "leaf_wetness_sum_window_m10_m07"
#>  [7] "temp_mean_window_m09_m06"        "temp_min_window_m09_m06"        
#>  [9] "temp_max_window_m09_m06"         "temp_mean_window_m08_m05"       
#> [11] "rh_mean_window_m08_m05"          "leaf_wetness_sum_window_m08_m05"
#> [13] "temp_mean_window_m07_m04"        "temp_min_window_m07_m04"        
#> [15] "temp_max_window_m07_m04"         "rh_mean_window_m07_m04"         
#> [17] "rain_sum_window_m07_m04"         "temp_max_window_m06_m03"        
#> [19] "rh_mean_window_m06_m03"          "temp_mean_window_m05_m02"       
#> [21] "temp_min_window_m05_m02"         "rh_mean_window_m05_m02"         
#> [23] "temp_mean_window_m04_m01"        "temp_min_window_m04_m01"        
#> [25] "temp_max_window_m04_m01"         "leaf_wetness_sum_window_m04_m01"