Time-aware validation

AlloyGBM includes leakage-aware split helpers for time series and panel data.

Why they matter

Random splits are often misleading for:

forecasting
panel regression
finance datasets with ordered timestamps

If the same time bucket appears in both train and test data, measured performance can be inflated by leakage.

Time series splits

Use purged_time_series_splits(...) when you have a single time axis:

from alloygbm import purged_time_series_splits

time_index = [0, 0, 1, 1, 2, 2, 3, 3]
splits = purged_time_series_splits(
    time_index,
    n_splits=4,
    purge_gap=0,
    embargo=0,
)

Key parameters:

n_splits
purge_gap
embargo

Panel splits

Use purged_panel_splits(...) when you have both a time index and a group index:

from alloygbm import purged_panel_splits

time_index = [0, 0, 1, 1, 2, 2]
group_index = ["A", "B", "A", "B", "A", "B"]

splits = purged_panel_splits(
    time_index,
    group_index,
    n_splits=3,
    purge_gap=0,
    embargo=0,
)

Panel behavior is still time-bucketed across all groups, which is usually the correct default when leakage is primarily temporal.

Using the splits with AlloyGBM

from alloygbm import GBMRegressor, purged_time_series_splits, rmse

rows = [[float(i), float(i % 2)] for i in range(20)]
targets = [float(i) * 0.1 for i in range(20)]
time_index = [i // 2 for i in range(20)]

scores = []
for train_idx, test_idx in purged_time_series_splits(time_index, n_splits=5):
    model = GBMRegressor(deterministic=True, seed=7)
    model.fit([rows[i] for i in train_idx], [targets[i] for i in train_idx])
    preds = model.predict([rows[i] for i in test_idx])
    scores.append(rmse([targets[i] for i in test_idx], preds))

Explicit validation with `fit(...)`

Use eval_set when you want AlloyGBM to track validation metrics during training or when you enable early stopping:

from alloygbm import GBMRegressor

model = GBMRegressor(
    n_estimators=1200,
    early_stopping_rounds=50,
    min_validation_improvement=1e-4,
    deterministic=True,
    seed=7,
)

model.fit(
    X_train,
    y_train,
    eval_set=(X_valid, y_valid),
)

print(model.best_iteration_)
print(model.best_score_)
print(model.n_estimators_)

Rules:

early_stopping_rounds requires eval_set
eval_time_index requires eval_set
Early stopping monitors the objective-appropriate metric: - RMSE for GBMRegressor - Log-loss for GBMClassifier - NDCG for GBMRanker

Passing `time_index` into `fit(...)`

Pass time_index= to fit(...) when you are using:

categorical_feature_index or categorical_feature_indices
categorical_time_aware=True

That enables time-aware categorical handling during training.

If you also pass eval_set with categorical_time_aware=True, pass eval_time_index= for the validation rows as well.