Time-aware validation

AlloyGBM includes leakage-aware split helpers for time series and panel data.

Why they matter

Random splits are often misleading for:

  • forecasting

  • panel regression

  • finance datasets with ordered timestamps

If rows from the same time bucket appear in both the training and test data, the model has effectively seen the test period during training, and measured performance can be inflated by leakage.
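A quick way to see the problem is to list the time buckets that land on both sides of a split. The helper below is a hypothetical plain-Python illustration (`shared_time_buckets` is not part of AlloyGBM). The interleaved split stands in for what a random shuffle tends to produce.

```python
def shared_time_buckets(time_index, train_idx, test_idx):
    """Return the time buckets that appear in both the train and test rows."""
    train_buckets = {time_index[i] for i in train_idx}
    test_buckets = {time_index[i] for i in test_idx}
    return train_buckets & test_buckets


# Two rows per time bucket, as in a small panel dataset.
time_index = [0, 0, 1, 1, 2, 2, 3, 3]

# Interleaved split: one row of every bucket goes to each side.
mixed_train, mixed_test = [0, 2, 4, 6], [1, 3, 5, 7]

# Chronological split: the last time bucket is held out entirely.
ordered_train = [i for i, t in enumerate(time_index) if t < 3]
ordered_test = [i for i, t in enumerate(time_index) if t == 3]

mixed_overlap = shared_time_buckets(time_index, mixed_train, mixed_test)
ordered_overlap = shared_time_buckets(time_index, ordered_train, ordered_test)
print("interleaved overlap:", sorted(mixed_overlap))
print("chronological overlap:", sorted(ordered_overlap))
```

An empty overlap is a necessary condition for a leakage-free split, not a sufficient one: features built from future rows can still leak even when the buckets are disjoint.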

Time series splits

Use purged_time_series_splits(...) when you have a single time axis:

from alloygbm import purged_time_series_splits

time_index = [0, 0, 1, 1, 2, 2, 3, 3]
splits = purged_time_series_splits(
    time_index,
    n_splits=4,
    purge_gap=0,
    embargo=0,
)

Key parameters:

  • n_splits

  • purge_gap

  • embargo
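This section names the parameters without defining them. In the usual purged cross-validation vocabulary, a purge gap drops training rows in the time buckets just before the test block, and an embargo drops training rows in the buckets just after it. The sketch below illustrates that conventional meaning in plain Python. It is an assumption, not AlloyGBM's implementation: `sketch_purged_splits` is a hypothetical name, and whether the library also trains on rows later than the test block or only walks forward is not stated here.

```python
def sketch_purged_splits(time_index, n_splits, purge_gap=0, embargo=0):
    """Yield (train_idx, test_idx) pairs over contiguous blocks of time buckets.

    Illustrative only: this is NOT AlloyGBM's implementation.
    """
    buckets = sorted(set(time_index))
    if not 2 <= n_splits <= len(buckets):
        raise ValueError("n_splits must be between 2 and the number of time buckets")

    position_of = {bucket: pos for pos, bucket in enumerate(buckets)}
    base, extra = divmod(len(buckets), n_splits)

    start = 0
    for fold in range(n_splits):
        # Each fold tests on one contiguous block of ordered time buckets.
        stop = start + base + (1 if fold < extra else 0)
        # Training may not use the test block, the purge_gap buckets
        # before it, or the embargo buckets after it.
        blocked_from = max(0, start - purge_gap)
        blocked_to = min(len(buckets), stop + embargo)
        train_idx = [i for i, t in enumerate(time_index)
                     if not blocked_from <= position_of[t] < blocked_to]
        test_idx = [i for i, t in enumerate(time_index)
                    if start <= position_of[t] < stop]
        yield train_idx, test_idx
        start = stop


time_index = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
folds = list(sketch_purged_splits(time_index, n_splits=3, purge_gap=1, embargo=1))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

With `purge_gap=0` and `embargo=0`, as in the example above, only the test block itself is excluded from training.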

Panel splits

Use purged_panel_splits(...) when you have both a time index and a group index:

from alloygbm import purged_panel_splits

time_index = [0, 0, 1, 1, 2, 2]
group_index = ["A", "B", "A", "B", "A", "B"]

splits = purged_panel_splits(
    time_index,
    group_index,
    n_splits=3,
    purge_gap=0,
    embargo=0,
)

Panel splits are still bucketed by time across all groups: rows that share a time bucket are kept on the same side of a split regardless of group. This is usually the correct default when leakage is primarily temporal.
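To make the time-bucketing concrete, the sketch below holds out one time bucket by hand and checks which groups and times end up on each side. It is plain Python with a hypothetical helper (`rows_in_buckets`) and does not call AlloyGBM.

```python
def rows_in_buckets(time_index, held_out_buckets):
    """Split row positions by whether their time bucket is held out."""
    held_out = set(held_out_buckets)
    test_idx = [i for i, t in enumerate(time_index) if t in held_out]
    train_idx = [i for i, t in enumerate(time_index) if t not in held_out]
    return train_idx, test_idx


time_index = [0, 0, 1, 1, 2, 2]
group_index = ["A", "B", "A", "B", "A", "B"]

train_idx, test_idx = rows_in_buckets(time_index, held_out_buckets=[2])

# Both groups appear in the test fold for time bucket 2, and neither group
# has a row from that bucket left in training.
test_groups = sorted({group_index[i] for i in test_idx})
train_times = sorted({time_index[i] for i in train_idx})
print("test groups:", test_groups)
print("train time buckets:", train_times)
```

If the main risk is the same entity appearing on both sides rather than the same period, a split by group would be the relevant tool instead.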

Using the splits with AlloyGBM

from alloygbm import GBMRegressor, purged_time_series_splits, rmse

# Toy data: 20 rows, two rows per time bucket.
rows = [[float(i), float(i % 2)] for i in range(20)]
targets = [float(i) * 0.1 for i in range(20)]
time_index = [i // 2 for i in range(20)]

# Fit a fresh model on each fold and score it on the held-out rows.
scores = []
for train_idx, test_idx in purged_time_series_splits(time_index, n_splits=5):
    model = GBMRegressor(deterministic=True, seed=7)
    model.fit([rows[i] for i in train_idx], [targets[i] for i in train_idx])
    preds = model.predict([rows[i] for i in test_idx])
    scores.append(rmse([targets[i] for i in test_idx], preds))

# Mean RMSE across folds.
print(sum(scores) / len(scores))

Explicit validation with fit(...)

Use eval_set when you want AlloyGBM to track validation metrics during training or when you enable early stopping:

from alloygbm import GBMRegressor

model = GBMRegressor(
    n_estimators=1200,
    early_stopping_rounds=50,
    min_validation_improvement=1e-4,
    deterministic=True,
    seed=7,
)

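# X_train, y_train, X_valid, and y_valid are assumed to be defined already,
# for example from a chronological holdout.
model.fit(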
    X_train,
    y_train,
    eval_set=(X_valid, y_valid),
)

print(model.best_iteration_)
print(model.best_score_)
print(model.n_estimators_)

Rules:

  • early_stopping_rounds requires eval_set

  • eval_time_index requires eval_set

  • Early stopping monitors the objective-appropriate metric:

      - RMSE for GBMRegressor

      - Log-loss for GBMClassifier

      - NDCG for GBMRanker
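The interplay between early_stopping_rounds and min_validation_improvement can be pictured as a patience counter. The following is a generic sketch of that idea for a lower-is-better metric such as RMSE or log-loss. The function name and the exact comparison rule are assumptions, not AlloyGBM's documented behavior. For a higher-is-better metric such as NDCG, the comparison would be reversed.

```python
def stop_with_patience(validation_scores, patience, min_improvement=0.0):
    """Return (best_iteration, best_score, rounds_run) for a lower-is-better metric.

    An iteration counts as an improvement only if it beats the best score so
    far by more than min_improvement. The loop stops once `patience`
    iterations have passed since the last improvement.
    """
    best_iteration, best_score = None, float("inf")
    rounds_run = 0
    for iteration, score in enumerate(validation_scores):
        rounds_run = iteration + 1
        if best_score - score > min_improvement:
            best_iteration, best_score = iteration, score
        elif iteration - best_iteration >= patience:
            break
    return best_iteration, best_score, rounds_run


# Hypothetical per-iteration validation RMSE values.
history = [1.00, 0.80, 0.70, 0.69995, 0.71, 0.72, 0.73]
result = stop_with_patience(history, patience=3, min_improvement=1e-4)
print(result)
```

In this run the score at iteration 3 is lower than the one at iteration 2, but by less than the threshold, so it does not reset the counter. This is the effect a minimum-improvement threshold is meant to have: tiny gains do not keep training alive indefinitely.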

Passing time_index into fit(...)

Pass time_index= to fit(...) when you are using:

  • categorical_feature_index or categorical_feature_indices

  • categorical_time_aware=True

That enables time-aware categorical handling during training.
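This section does not spell out what time-aware categorical handling computes. One common technique with that goal is target encoding that uses only rows from strictly earlier time buckets, so a row's own target and those of its contemporaries never feed its encoded value. The sketch below shows that technique in plain Python as an assumption about the general idea. It is not AlloyGBM's internal algorithm, and `past_only_target_encoding` is a hypothetical name.

```python
from collections import defaultdict


def past_only_target_encoding(categories, targets, time_index, prior):
    """Encode each row's category as the mean target of earlier-time rows.

    Rows with no earlier observation of their category receive `prior`.
    """
    encoded = [prior] * len(categories)
    sums = defaultdict(float)
    counts = defaultdict(int)

    # Visit rows bucket by bucket in time order.
    order = sorted(range(len(categories)), key=lambda i: time_index[i])
    start = 0
    while start < len(order):
        stop = start
        while stop < len(order) and time_index[order[stop]] == time_index[order[start]]:
            stop += 1
        bucket = order[start:stop]
        # Encode the whole bucket first, using statistics from the past only.
        for i in bucket:
            if counts[categories[i]]:
                encoded[i] = sums[categories[i]] / counts[categories[i]]
        # Then fold the bucket's targets into the running statistics.
        for i in bucket:
            sums[categories[i]] += targets[i]
            counts[categories[i]] += 1
        start = stop
    return encoded


categories = ["a", "b", "a", "b", "a"]
targets = [1.0, 0.0, 3.0, 2.0, 5.0]
time_index = [0, 0, 1, 1, 2]
encoded = past_only_target_encoding(categories, targets, time_index, prior=0.5)
print(encoded)
```

A scheme like this needs a per-row time value to know what counts as "earlier", which is consistent with fit(...) requiring time_index when the time-aware option is on.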

If you also pass eval_set with categorical_time_aware=True, pass eval_time_index= for the validation rows as well.
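When the validation rows come from a chronological holdout, both time indexes fall out of the same split. The helper below is a hypothetical plain-Python sketch (`chronological_holdout` is not an AlloyGBM function). The commented fit call only repeats the keyword names mentioned in this section and has not been checked against the library.

```python
def chronological_holdout(rows, targets, time_index, cutoff):
    """Split into train (time < cutoff) and validation (time >= cutoff) parts."""
    train = [i for i, t in enumerate(time_index) if t < cutoff]
    valid = [i for i, t in enumerate(time_index) if t >= cutoff]

    def pick(values, idx):
        return [values[i] for i in idx]

    return {
        "X_train": pick(rows, train),
        "y_train": pick(targets, train),
        "train_time": pick(time_index, train),
        "X_valid": pick(rows, valid),
        "y_valid": pick(targets, valid),
        "valid_time": pick(time_index, valid),
    }


rows = [[float(i)] for i in range(6)]
targets = [float(i) * 0.1 for i in range(6)]
time_index = [0, 0, 1, 1, 2, 2]

parts = chronological_holdout(rows, targets, time_index, cutoff=2)
print(parts["train_time"], parts["valid_time"])

# Assumed usage, following the keyword names described above:
# model.fit(
#     parts["X_train"], parts["y_train"],
#     time_index=parts["train_time"],
#     eval_set=(parts["X_valid"], parts["y_valid"]),
#     eval_time_index=parts["valid_time"],
# )
```

Each time index has exactly one entry per row of the data it accompanies, in the same order.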