GBMRegressor

GBMRegressor is the main Python estimator for regression in AlloyGBM.

Core parameters

learning_rate: float = 0.1 - additive update step size
max_depth: int = 6 - maximum tree depth
n_estimators: int = 6 - requested boosting rounds
row_subsample: float = 1.0 - per-round row sampling fraction; ignored when boosting_mode="goss" (GOSS uses gradient-based sampling instead).
col_subsample: float = 1.0 - per-round feature sampling fraction
quantile_alpha: float = 0.5 - target quantile when objective="quantile"; must lie strictly in (0.0, 1.0).

Boosting mode

boosting_mode: str = "standard" – per-round sample-selection strategy. Three values are accepted:
- "standard" (default) – uniform row subsampling under row_subsample. Byte-identical to v0.7.5 on every API surface.
- "goss" – LightGBM-style Gradient-based One-Side Sampling. Each round, score rows by |gradient|, keep the top goss_top_rate fraction, uniformly sample goss_other_rate fraction from the rest, and amplify the sampled-low rows’ gradient and hessian by (n - top_n) / other_n (realized counts) so the histogram statistics remain an unbiased estimator of the full-data gradient sums. Convergence is typically faster on data with a long-tailed gradient distribution (the canonical LightGBM advantage).
- "dart" – Dropouts meet MART. Each round, drop a random subset of previously-trained trees, fit a new tree on the residuals of the dropped-out ensemble, then rescale the dropped trees + the new tree so the prediction sum stays unbiased. Reduces over-specialization of late trees; can improve generalization on noisy data. Per-stump tree_weight: f32 is persisted via a new DartTreeWeights artifact section.
goss_top_rate: float = 0.2 – top-by-gradient kept fraction when boosting_mode="goss". Must be in (0, 1).
goss_other_rate: float = 0.1 – random-sample fraction from the remaining rows when boosting_mode="goss". Must be in (0, 1) and goss_top_rate + goss_other_rate <= 1.0.
dart_drop_rate: float = 0.1 – per-tree drop probability per round when boosting_mode="dart". Must be in (0, 1).
dart_max_drop: int = 50 – cap on the number of trees dropped per round. Must be >= 1.
dart_normalize_type: str = "tree" – rescale policy after the new tree is fit. "tree" mode sets new-tree weight to 1 / (K + 1) and each dropped-tree weight to K / (K + 1); "forest" mode sets both to 1 / (K + 1) (more aggressive rescale).
dart_sample_type: str = "uniform" – dropout sampling strategy. "uniform" picks each tree independently with probability dart_drop_rate. "weighted" biases dropout probability toward heavier-weight trees.
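The GOSS selection and amplification rule described above can be sketched in plain Python. This is an illustration, not AlloyGBM's implementation: the helper name goss_sample, the truncating rounding of the two row counts, and the use of Python's random module are all assumptions.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: multiplier} for one GOSS round (illustrative only)."""
    n = len(gradients)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = min(n, max(1, int(n * top_rate)))
    top_idx, rest = order[:top_n], order[top_n:]
    # Uniformly sample from the low-gradient remainder.
    other_n = min(len(rest), max(1, int(n * other_rate)))
    other_idx = random.Random(seed).sample(rest, other_n)
    multipliers = {i: 1.0 for i in top_idx}
    if other_n:
        # Amplify the sampled low-gradient rows by the realized counts so the
        # gradient and hessian sums stay unbiased estimates of the full sums.
        amplify = (n - top_n) / other_n
        multipliers.update({i: amplify for i in other_idx})
    return multipliers
```

With 100 rows and the default rates, 20 rows keep multiplier 1.0 and the 10 sampled rows get (100 - 20) / 10 = 8.0, so the multipliers sum back to 100.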

GOSS and DART are supported for the single-output objectives: regression, binary classification, and ranking. The multiclass softmax path explicitly rejects non-"standard" boosting modes pending per-class gradient scoring (a v0.10.x follow-up that applies to both GOSS and DART).
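The two dart_normalize_type policies can be written out as a small helper. The function name dart_weights is hypothetical, and the sketch only restates the documented weights for K dropped trees; it does not model learning-rate interaction or any other engine detail.

```python
def dart_weights(k_dropped, normalize_type="tree"):
    """Return (new_tree_weight, dropped_tree_weight) for K dropped trees."""
    if k_dropped < 0:
        raise ValueError("k_dropped must be non-negative")
    new_weight = 1.0 / (k_dropped + 1)
    if normalize_type == "tree":
        # Dropped trees keep most of their weight: K / (K + 1).
        dropped_weight = k_dropped / (k_dropped + 1)
    elif normalize_type == "forest":
        # More aggressive: dropped trees shrink to the new tree's weight.
        dropped_weight = 1.0 / (k_dropped + 1)
    else:
        raise ValueError(f"unknown normalize_type: {normalize_type!r}")
    return new_weight, dropped_weight
```

For K = 3, "tree" gives (0.25, 0.75) and "forest" gives (0.25, 0.25).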

As of v0.10.0, DART + warm_start is supported on GBMRegressor, binary GBMClassifier, and GBMRanker. The continuation seeds dart_state.tree_weights from the prior model’s per-stump tree_weight snapshot and pre-populates the dropout bookkeeping arrays so new-round dropouts can correctly subtract / replay prior trees. Historical RNG-driven dropped_per_round is intentionally not persisted; new rounds start fresh dropout bookkeeping going forward.

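# X, y are the training features and targets, assumed to be loaded already.
# Initial DART fit: 10 boosting rounds.
base = GBMRegressor(n_estimators=10, boosting_mode="dart",
                    dart_drop_rate=0.1, seed=7)
base.fit(X, y)

# Continuation: 10 further rounds, seeded from the fitted model's DART tree weights.
cont = GBMRegressor(n_estimators=10, boosting_mode="dart",
                    dart_drop_rate=0.1, warm_start=True, seed=7)
cont.fit(X, y, init_model=base)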

Stopping and policy controls

early_stopping_rounds: int | None = None
min_validation_improvement: float = 0.0
training_policy: str = "auto"

training_policy="auto" applies dataset-aware heuristics and is the recommended default for practical use. "manual" is more appropriate for controlled ablation work.

Early stopping is opt-in: it runs only when early_stopping_rounds is set, and in that case fit must be given a validation set, as in fit(..., eval_set=(X_valid, y_valid)).
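How early_stopping_rounds and min_validation_improvement could interact is sketched below as a conventional patience rule. The rule itself is an assumption (a lower-is-better metric, and a round counts as an improvement only when it beats the best loss by more than min_validation_improvement); the helper name early_stop_round is hypothetical and the engine's exact bookkeeping may differ.

```python
def early_stop_round(val_losses, early_stopping_rounds,
                     min_validation_improvement=0.0):
    """Return (best_iteration, last_round_run) for a list of validation losses."""
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(val_losses):
        if best_loss - loss > min_validation_improvement:
            # Meaningful improvement: reset the patience window.
            best_loss, best_iter = loss, i
        elif i - best_iter >= early_stopping_rounds:
            # No meaningful improvement for `early_stopping_rounds` rounds.
            return best_iter, i
    return best_iter, len(val_losses) - 1
```

A larger min_validation_improvement makes small gains count as stagnation, so training stops earlier and an earlier round is reported as best.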

Leaf and split controls

min_data_in_leaf: int = 1 – when training_policy="auto", the engine may increase this based on dataset size but will never reduce it below the value you set.
lambda_l1: float = 0.0
lambda_l2: float = 0.0
min_child_hessian: float = 0.0
min_split_gain: float = 0.0 – minimum gain required for a split. The auto policy may set this adaptively.

These map directly to native training controls instead of relying on environment-variable overrides.

Tree growth strategy

tree_growth: str = "level" – "level" (level-wise, breadth-first) or "leaf" (leaf-wise, best-first, similar to LightGBM)
max_leaves: int | None = None – maximum leaves for leaf-wise growth

Constraints

monotone_constraints: list[int] | dict[int, int] | None = None – constrain features to monotone increasing (+1), decreasing (-1), or unconstrained (0)
feature_weights: list[float] | dict[int, float] | None = None – per-feature importance weights influencing split selection
interaction_constraints: list[list[int]] | None = None – LightGBM-compatible interaction constraints. Each inner list is a group of feature indices; any root-to-leaf path is restricted to splits on features from a single allowed group. Features outside every group are unconstrained and may appear alongside any group. Up to 64 groups per fit; enforced in both level-wise and leaf-wise growth. None disables the constraint (default).
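The interaction-constraint rule can be read as a per-split admissibility check. The sketch below restates that rule in plain Python; the function name split_allowed is hypothetical and the check is not AlloyGBM's internal representation (which, given the 64-group cap, may well differ).

```python
def split_allowed(path_features, candidate, groups):
    """Check whether `candidate` may be split on below `path_features`."""
    if not groups:
        return True  # constraint disabled
    grouped = set().union(*groups)
    if candidate not in grouped:
        return True  # features outside every group are unconstrained
    constrained_on_path = [f for f in path_features if f in grouped]
    # The candidate and every constrained ancestor must share one group.
    return any(
        candidate in group and all(f in group for f in constrained_on_path)
        for group in groups
    )
```

With groups [[0, 1], [2, 3]], a path that already split on feature 0 may split on 1 or on the ungrouped feature 5, but not on 2.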

Reproducibility

seed: int = 0
deterministic: bool = True

Continuous-feature controls

continuous_binning_strategy: str = "linear"
continuous_binning_max_bins: int = 256

Supports up to 65,535 bins per feature. Use continuous_binning_strategy="quantile" when you need more robust handling of skewed continuous features.
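The difference between the two strategies shows up clearly on skewed data. The sketch below computes interior cut points both ways; the helper names and the exact cut-point rules are assumptions for illustration, not the engine's binning code.

```python
def linear_cuts(values, n_bins):
    """Equal-width cut points between the minimum and maximum."""
    lo, hi = min(values), max(values)
    return [lo + (hi - lo) * k / n_bins for k in range(1, n_bins)]


def quantile_cuts(values, n_bins):
    """Cut points at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]


# 99 small values plus one extreme outlier.
skewed = list(range(1, 100)) + [10000]
print(linear_cuts(skewed, 4))    # almost every value lands in the first bin
print(quantile_cuts(skewed, 4))  # values are spread evenly across bins
```

With equal-width bins the single outlier stretches the range so that 99 of the 100 values share one bin, while rank-based cuts keep the bins evenly populated.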

Categorical support

categorical_feature_index: int | None = None – single column (legacy)
categorical_feature_indices: list[int] | None = None – multiple columns
categorical_smoothing: float = 20.0
categorical_min_samples_leaf: int = 1
categorical_time_aware: bool = False
max_cat_threshold: int = 0 – maximum category cardinality for native categorical splits. When a categorical feature has at most this many unique values, AlloyGBM uses the Fisher-sort algorithm for O(K log K) optimal binary partition with O(1) bitset prediction. Features exceeding the threshold fall back to target encoding. Default 0 disables native splits.
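The sorted-scan idea behind Fisher-style categorical splits is to order the K categories by their gradient-to-hessian ratio and then evaluate only the K - 1 prefix partitions. The sketch below is a generic version of that technique under assumed details: the standard second-order gain formula, an L2 term lam, and the hypothetical name best_categorical_split. AlloyGBM's actual gain and bitset encoding are not shown in this document.

```python
def best_categorical_split(stats, lam=1.0):
    """stats maps category -> (gradient_sum, hessian_sum).

    Returns (left_categories, gain) for the best prefix of the sorted order.
    """
    # O(K log K): sort categories by gradient/hessian ratio.
    order = sorted(stats, key=lambda c: stats[c][0] / (stats[c][1] + lam))
    g_total = sum(g for g, _ in stats.values())
    h_total = sum(h for _, h in stats.values())
    parent_score = g_total ** 2 / (h_total + lam)
    best_gain, best_left = float("-inf"), set()
    g_left = h_left = 0.0
    # Scan the K - 1 prefixes; each is a candidate left partition.
    for i, cat in enumerate(order[:-1]):
        g, h = stats[cat]
        g_left += g
        h_left += h
        gain = (g_left ** 2 / (h_left + lam)
                + (g_total - g_left) ** 2 / (h_total - h_left + lam)
                - parent_score)
        if gain > best_gain:
            best_gain, best_left = gain, set(order[: i + 1])
    return best_left, best_gain
```

At prediction time the chosen left set can be stored as a bitset, which is what makes membership tests O(1).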

DRO leaf solver

leaf_solver: str = "standard" – "standard" keeps the usual scalar Newton leaf update; "dro" enables a fast robust scalar update that penalizes weak leaf signal by within-leaf gradient dispersion.
dro_radius: float = 0.05 – non-negative penalty scale. 0.0 preserves standard-leaf predictions while recording DRO metadata.
dro_metric: str = "wasserstein" – the only accepted value today. It denotes a Wasserstein-inspired closed-form robust counterpart over leaf gradient uncertainty.

This is not a full Wasserstein optimizer over raw feature/target distributions. Inference speed is unchanged because robust scalar leaf values are stored directly in the artifact. leaf_solver="dro" works on all three estimators, composes with training_mode="morph", and requires leaf_model="constant".

Factor-neutral boosting

neutralization: str = "none" – one of "none", "pre_target", "per_round_gradient", or "split_penalty".
factor_neutralization_lambda: float = 1e-6 – finite, non-negative ridge term added to F^T W F.
factor_penalty: float = 0.0 – finite, non-negative split exposure penalty scale. Only active for neutralization="split_penalty".

Pass factors as fit-time data:

model = GBMRegressor(neutralization="per_round_gradient", seed=7)
model.fit(X_train, y_train, factor_exposures=F_train)

factor_exposures must be dense, row-major, finite, and shaped (n_rows, n_factors). It is fit data, not constructor state, so sklearn cloning remains clean and large matrices are not embedded in estimator params.

Mode semantics:

neutralization="none" preserves current behavior and expects factor_exposures to be left as None. Passing a non-None matrix while neutralization is inactive raises a validation error in Python, so the mistake is reported instead of being silently ignored.

neutralization="pre_target" residualizes the regression target once before training:

y_perp = y - F (F^T W F + lambda I)^-1 F^T W y

This mode is supported for GBMRegressor only. It is rejected for classification and ranking because target residualization is not well-defined for class labels or ranking relevance. eval_set is also rejected for pre_target in this release because the public API does not yet accept validation-set factor exposures to residualize validation targets consistently.

neutralization="per_round_gradient" projects objective gradients before each boosting round:

g_perp = g - F (F^T W F + lambda I)^-1 F^T W g

Hessians are unchanged. This mode is supported for regression, binary classification, multiclass, and ranking. For multiclass, each class-gradient column is projected independently against the same factor projector.
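The projection formula can be checked numerically with a dependency-free sketch. It assumes W is diagonal (one weight per row) and uses a small Gaussian-elimination helper so that it runs on the standard library alone; the names solve and project_gradients are hypothetical and this is not the engine's code path.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def project_gradients(g, factors, weights, lam=1e-6):
    """Return g - F (F^T W F + lam I)^-1 F^T W g for row-major F."""
    n_factors = len(factors[0])
    # Ridge-regularized Gram matrix F^T W F + lam I.
    gram = [[sum(w * row[i] * row[j] for row, w in zip(factors, weights))
             + (lam if i == j else 0.0)
             for j in range(n_factors)] for i in range(n_factors)]
    # Right-hand side F^T W g.
    rhs = [sum(w * row[i] * gi for row, w, gi in zip(factors, weights, g))
           for i in range(n_factors)]
    beta = solve(gram, rhs)
    return [gi - sum(b * f for b, f in zip(beta, row))
            for gi, row in zip(g, factors)]
```

After projection the weighted inner product of each factor column with g_perp equals lambda times the fitted coefficient, so it is close to zero for a small ridge term.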

neutralization="split_penalty" includes per-round gradient projection and subtracts a factor-load penalty from split gain:

penalty = factor_penalty * || F_L^T update_L + F_R^T update_R ||^2 / max(row_count, 1)
gain_final = gain_after_existing_modes - penalty

For scalar leaves, update_L and update_R are the candidate scalar leaf values before any final MorphBoost depth/iteration leaf scaling. For DRO leaves, the scalar values use the DRO effective gradients. For MorphBoost, the order is: project gradients, compute standard/DRO gradient gain, blend MorphBoost information score, subtract factor penalty, then apply MorphBoost leaf scaling when storing leaves. split_penalty performs additional factor-exposure work during split search and should be treated as the slowest neutralization mode until production-scale benchmarks justify stronger claims.
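For scalar leaves the penalty formula reduces to a few sums. The sketch below assumes that F_L^T update_L means each left-child row contributes its exposure row times the scalar left leaf value (and likewise on the right); that reading, and the name split_factor_penalty, are assumptions made for illustration.

```python
def split_factor_penalty(f_left, f_right, update_left, update_right,
                         factor_penalty):
    """factor_penalty * ||F_L^T u_L + F_R^T u_R||^2 / max(row_count, 1)."""
    rows = f_left + f_right
    if not rows:
        return 0.0
    load = [0.0] * len(rows[0])
    # Accumulate the factor load implied by each child's scalar update.
    for child_rows, update in ((f_left, update_left), (f_right, update_right)):
        for row in child_rows:
            for j, exposure in enumerate(row):
                load[j] += exposure * update
    squared_norm = sum(v * v for v in load)
    return factor_penalty * squared_norm / max(len(rows), 1)


penalty = split_factor_penalty(
    f_left=[[1.0, 0.0], [1.0, 1.0]],
    f_right=[[1.0, 2.0]],
    update_left=0.5,
    update_right=-1.0,
    factor_penalty=2.0,
)
```

Here the combined load is [0.0, -1.5], so the penalty is 2.0 * 2.25 / 3 = 1.5, which would then be subtracted from the candidate split's gain.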

Compatibility:

Feature	pre_target	per_round_gradient	split_penalty
`GBMRegressor`	supported	supported	supported
`GBMClassifier`	rejected	supported	supported
`GBMRanker`	rejected	supported	supported
`training_mode="morph"`	supported	supported	supported
`leaf_solver="dro"`	supported	supported	supported
`leaf_model="linear"`	supported	supported	rejected
warm start	supported	supported	supported

This is a training-time regularization tool. It does not guarantee prediction-time zero exposure unless predictions are neutralized against evaluation-time factors outside the model.

Exposure matrices are not persisted in the estimator or artifact (they would balloon the model size and surface sensitive data). As of v0.7.1 neutralized warm-start and init_model continuation are supported: the caller must supply the same factor_exposures matrix used for the initial fit so the projection has the same column space. Omitting factor_exposures on a resumed fit raises a contract error.

pre_target neutralization is idempotent under repeated residualization against the same exposures, so warm-start continuation residualizes the original targets again on the resumed fit and trains on the same target stream as a fresh N + M-round fit.

Piecewise-linear leaves

leaf_model: str = "constant"
- "constant" (default) – standard scalar leaf value, identical to all prior AlloyGBM behaviour.
- "linear" – each leaf stores a small linear model f_s(x) = b_s + Σ α_j x_j (up to 8 regressors per leaf, inherited from the split path’s feature indices; the per-leaf cap is internal and not currently user-tunable). Optimal weights are solved in closed form via the ridge regression α* = -(XᵀHX + λI)⁻¹ Xᵀg, regularised by lambda_l2.

Empirically, "linear" converges in fewer rounds on data with linear within-node residual structure (~10× faster on linearly-structured datasets, a 3.5% RMSE improvement on California Housing, +1.75 pp accuracy on Breast Cancer), at a 2–8× per-round training overhead. Use lambda_l2 >= 0.01 for weight stability.
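The closed-form leaf solve can be illustrated for a single regressor. This sketch assumes the intercept b_s is handled as a constant-1 column of X and regularized like the slope, which the text above does not specify; the name fit_linear_leaf is hypothetical.

```python
def fit_linear_leaf(x, g, h, lam=0.01):
    """Return (intercept, slope) = -(X^T H X + lam I)^-1 X^T g, X = [1, x]."""
    # Entries of the 2x2 matrix X^T H X + lam I.
    a00 = sum(h) + lam
    a01 = sum(hi * xi for hi, xi in zip(h, x))
    a11 = sum(hi * xi * xi for hi, xi in zip(h, x)) + lam
    # Entries of X^T g.
    b0 = sum(g)
    b1 = sum(gi * xi for gi, xi in zip(g, x))
    # Solve the 2x2 system with Cramer's rule.
    det = a00 * a11 - a01 * a01
    intercept = -(a11 * b0 - a01 * b1) / det
    slope = -(a00 * b1 - a01 * b0) / det
    return intercept, slope


# Squared error with current prediction 0: g = -y, h = 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 + 3.0 * v for v in xs]
intercept, slope = fit_linear_leaf(xs, [-y for y in ys], [1.0] * 4, lam=1e-9)
```

With a near-zero ridge term the leaf recovers the generating line (intercept about 2, slope about 3) in a single step; a larger lambda_l2 shrinks both weights toward zero.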

Limitations:

Native-bitset categorical splits (max_cat_threshold > 0) fall back to constant leaves at the categorical split node; descendant leaves below the split use linear leaves on remaining numeric regressors.
SHAP (shap_values, feature_importances) supports leaf_model="linear" with strict additivity as of v0.7.4: on the default predictor-aligned binning path, the attributions reconstruct predict(x) to within atol + rtol·|predict(x)| (default 1e-5 + 1e-4·|predict(x)|). See Feature importances and SHAP for the decomposition and docs/limitations.md for the legacy-non-binning exemption.
leaf_model="linear" composes with training_mode="morph".
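The additivity tolerance can be expressed as a one-line check. The sketch assumes the compared quantity is the expected value plus the sum of per-feature attributions against predict(x); the helper name additivity_holds is hypothetical.

```python
def additivity_holds(shap_row, expected_value, prediction,
                     atol=1e-5, rtol=1e-4):
    """True when expected_value + sum(shap_row) matches prediction within tolerance."""
    reconstructed = expected_value + sum(shap_row)
    return abs(reconstructed - prediction) <= atol + rtol * abs(prediction)
```

For example, attributions [0.5, -0.2, 0.1] with expected value 1.0 reconstruct a prediction of 1.4 and pass, while a prediction of 2.0 would fail.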

Multi-label ranking

MultiLabelGBMRanker is a unified multi-output ranking estimator: y has shape (n_rows, n_labels) and predict returns scores with the same column layout. As of v0.7.1 the wrapper trains one independent GBMRanker per label using a shared group (and optional shared factor_exposures) so every per-label fit observes the same query structure. Each per-label ranker independently picks up every existing GBMRanker feature (warm-start, neutralization, MorphBoost, piecewise-linear leaves, DRO, interaction constraints, custom eval metrics).

ranking_objective may be a single string (applied to every label) or a list of length n_labels for heterogeneous objectives. save_model serialises every per-label ranker into a single .mlrk bundle via pickle.HIGHEST_PROTOCOL.

Joint shared-tree multi-label boosting is deferred to v0.10.0 (paired with the K-output shared-histogram primitive); see docs/limitations.md for the upgrade-path caveat.

MorphBoost (Adaptive Split Criterion)

GBMRegressor (and the classifier / ranker subclasses) support an opt-in MorphBoost training mode. See MorphBoost (Adaptive Split Criterion) for the full guide.

training_mode: str = "auto" – one of "auto" (default), "manual", or "morph".
morph_rate: float = 0.1 – per-iteration leaf shrinkage rate.
evolution_pressure: float = 0.2 – EMA-driven gain shaping strength.
morph_warmup_iters: int = 5 – rounds before the morph blend engages.
info_score_weight: float = 0.3 – mixing weight for the information-theoretic gain term.
depth_penalty_base: float = 0.9 – depth-based leaf penalty base.
balance_penalty: bool = True – whether to penalize imbalanced splits.
lr_schedule: str = "constant" – per-iteration LR schedule ("constant" or "warmup_cosine"); independent of training_mode.
lr_warmup_frac: float = 0.1 – linear-warmup fraction when lr_schedule="warmup_cosine".
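A "warmup_cosine" schedule is conventionally a linear ramp followed by cosine decay. The sketch below shows that conventional shape only: the decay floor of zero, the rounding of the warmup length, and the name warmup_cosine are assumptions, and AlloyGBM's exact schedule is not specified in this document.

```python
import math


def warmup_cosine(base_lr, n_rounds, warmup_frac=0.1):
    """Per-round learning rates: linear warmup, then cosine decay toward 0."""
    warmup = max(1, int(n_rounds * warmup_frac))
    rates = []
    for i in range(n_rounds):
        if i < warmup:
            # Linear ramp that reaches base_lr on the last warmup round.
            rates.append(base_lr * (i + 1) / warmup)
        else:
            progress = (i - warmup) / max(1, n_rounds - warmup)
            rates.append(base_lr * 0.5 * (1.0 + math.cos(math.pi * progress)))
    return rates
```

With base_lr=0.1, 100 rounds, and lr_warmup_frac=0.1, the rate climbs to 0.1 over the first 10 rounds and then decays smoothly over the remaining 90.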

Warm-starting

warm_start: bool = False – when True, fit() continues from the previously fitted model

Diagnostics

store_node_stats: bool = False

This stores optional node-level debug statistics inside the artifact for later analysis. It is not required for ordinary prediction.

Main methods

fit(X, y, *, sample_weight=None, eval_set=None, eval_sample_weight=None, group=None, eval_group=None, eval_time_index=None, categorical_feature_values=None, time_index=None, factor_exposures=None)
predict(X)
shap_values(X, *, include_expected_value=False)
feature_importances(X, *, method="shap")
predict_from_artifact(artifact_bytes, X)
save_model(path)
load_model(path) (classmethod)
artifact_bytes – property returning the raw artifact bytes
score(X, y)

Important fit(...) rules:

early_stopping_rounds requires eval_set
eval_time_index requires eval_set
categorical_time_aware=True requires time_index during training and eval_time_index for validation when eval_set is used
sample_weight applies per-sample weights to the training loss

Post-fit attributes

After fitting, the estimator may expose:

best_iteration_
best_score_
n_estimators_
rounds_completed_
stop_reason_
evals_result_ – shaped like {"train": {"rmse": [...]}, "validation": {"rmse": [...]}}
diagnostics_per_round_ – list of per-round dicts containing gradient_l2_norm, gradient_variance, hessian_l2_norm, original_gradient_l2_norm, projected_gradient_l2_norm, neutralization_effectiveness, n_active_rows, n_active_features. The three projection-related entries are None unless factor neutralization (per_round_gradient or split_penalty) is configured; pre_target mode never projects per round and therefore omits them.
fit_timing_
feature_names_ – captured from training data or auto-generated

Regression objectives (v0.11.1+)

GBMRegressor accepts the following values for the objective kwarg:

"squared_error" (default) – standard least-squares regression.
"poisson" – log-link Poisson regression for count targets. Targets must be >= 0. predict() returns exp(raw).
"gamma" – log-link Gamma regression for strictly-positive continuous targets. Targets must be > 0. predict() returns exp(raw).
"tweedie" – log-link compound Poisson-gamma regression for 1 < variance_power < 2. Useful for insurance/claims data with a mass at zero and a positive tail. Set tweedie_variance_power=1.5 (or another value in (1, 2)). Targets must be >= 0. predict() returns exp(raw).
"quantile" – pinball-loss regression with parameter quantile_alpha. Split finding uses a proxy Hessian h_i = w_i (the sample weight), and an empirical-quantile leaf refinement step runs at the end of each round over the full dataset.
Custom callable – any user-supplied (predictions, targets) → (gradients, hessians) function.

tweedie_variance_power: float = 1.5 – only used when objective="tweedie". Must satisfy 1 < p < 2. For p = 1 use objective="poisson"; for p = 2 use objective="gamma".

quantile_alpha: float = 0.5 – quantile to estimate when objective="quantile". Must be in (0, 1).
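The pinball-loss gradients and the proxy Hessian h_i = w_i can be written in the same (predictions, targets) → (gradients, hessians) shape that a custom callable uses. This is a sketch on plain Python sequences; the function name, the extra alpha and weights arguments, and the tie-breaking at y == p are assumptions, and the document does not state what array type the engine passes to a callable.

```python
def pinball_grad_hess(predictions, targets, alpha=0.5, weights=None):
    """Gradients of the pinball loss and the proxy Hessian h_i = w_i."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly in (0, 1)")
    if weights is None:
        weights = [1.0] * len(targets)
    grads, hess = [], []
    for p, y, w in zip(predictions, targets, weights):
        # d/dp of the loss: -alpha when under-predicting, (1 - alpha) otherwise.
        grads.append(w * (-alpha if y > p else 1.0 - alpha))
        # The true second derivative is zero almost everywhere, hence the proxy.
        hess.append(w)
    return grads, hess
```

For alpha = 0.9, under-predictions are pushed up nine times harder than over-predictions are pushed down, which is what moves the fit toward the 90th percentile.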

All three GLM objectives ("poisson", "gamma", and "tweedie") compose with boosting_mode="dart", boosting_mode="goss", warm-start, tree_growth="leaf", neutralization="per_round_gradient" / neutralization="split_penalty", and training_mode="morph". neutralization="pre_target" remains squared-error-only.

The "quantile" objective is supported in combination with DART, MorphBoost, and piecewise-linear leaves (leaf_model="linear"). It is explicitly rejected when combined with classification, ranking, or joint multi-output training.

Three deviance metrics in alloygbm.evaluation partner with the new objectives: poisson_deviance, gamma_deviance, and tweedie_deviance.
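For reference, the textbook Tweedie unit deviance for 1 < p < 2 is sketched below. It is written as a standalone function with a hypothetical name; whether alloygbm.evaluation.tweedie_deviance uses the same normalization, averaging, or argument order is not stated here and should not be inferred from this sketch.

```python
def tweedie_unit_deviance(y, mu, power=1.5):
    """Unit deviance of the Tweedie family for 1 < power < 2."""
    if not 1.0 < power < 2.0:
        raise ValueError("power must lie strictly in (1, 2)")
    if y < 0.0 or mu <= 0.0:
        raise ValueError("requires y >= 0 and mu > 0")
    p = power
    return 2.0 * (
        y ** (2.0 - p) / ((1.0 - p) * (2.0 - p))
        - y * mu ** (1.0 - p) / (1.0 - p)
        + mu ** (2.0 - p) / (2.0 - p)
    )
```

The deviance is zero when the prediction equals the target and stays finite at y = 0, which is why this family suits claims data with a mass at zero.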

SHAP interaction values (v0.11.0+)

GBMRegressor.shap_interaction_values(X) returns pairwise SHAP attributions as an (n_rows, n_features, n_features) tensor in O(T · L · D² · M) time via Lundberg et al. (2020) Algorithm 2. See Feature importances and SHAP for the full contract and scope limits.

Recommended usage pattern

For most users:

start with training_policy="auto"
keep deterministic=True during evaluation
use time-aware validation for temporal or panel-like problems
use the benchmark suite to compare profile shapes rather than trusting a single run