GBMRegressor
GBMRegressor is the main Python estimator for regression in AlloyGBM.
Core parameters
- learning_rate: float = 0.1 - additive update step size
- max_depth: int = 6 - maximum tree depth
- n_estimators: int = 6 - requested boosting rounds
- row_subsample: float = 1.0 - per-round row sampling fraction; ignored when boosting_mode="goss" (GOSS uses gradient-based sampling instead).
- col_subsample: float = 1.0 - per-round feature sampling fraction
- quantile_alpha: float = 0.5 - target quantile for "quantile" regression. Must be strictly in (0.0, 1.0).
Boosting mode
- boosting_mode: str = "standard" – per-round sample-selection strategy. Three values are accepted:
  - "standard" (default) – uniform row subsampling under row_subsample. Byte-identical to v0.7.5 on every API surface.
  - "goss" – LightGBM-style Gradient-based One-Side Sampling. Each round, score rows by |gradient|, keep the top goss_top_rate fraction, uniformly sample a goss_other_rate fraction from the rest, and amplify the sampled low-gradient rows’ gradient and hessian by (n - top_n) / other_n (realized counts) so the histogram statistics remain an unbiased estimator of the full-data gradient sums. Convergence is typically faster on data with a long-tailed gradient distribution (the canonical LightGBM advantage).
  - "dart" – Dropouts meet MART. Each round, drop a random subset of previously trained trees, fit a new tree on the residuals of the dropped-out ensemble, then rescale the dropped trees and the new tree so the prediction sum stays unbiased. Reduces over-specialization of late trees; can improve generalization on noisy data. Per-stump tree_weight: f32 is persisted via a new DartTreeWeights artifact section.
- goss_top_rate: float = 0.2 – top-by-gradient kept fraction when boosting_mode="goss". Must be in (0, 1).
- goss_other_rate: float = 0.1 – random-sample fraction from the remaining rows when boosting_mode="goss". Must be in (0, 1), and goss_top_rate + goss_other_rate <= 1.0.
- dart_drop_rate: float = 0.1 – per-tree drop probability per round when boosting_mode="dart". Must be in (0, 1).
- dart_max_drop: int = 50 – cap on the number of trees dropped per round. Must be >= 1.
- dart_normalize_type: str = "tree" – rescale policy after the new tree is fit. "tree" mode sets new-tree weight to 1 / (K + 1) and each dropped-tree weight to K / (K + 1); "forest" mode sets both to 1 / (K + 1) (more aggressive rescale).
- dart_sample_type: str = "uniform" – dropout sampling strategy. "uniform" picks each tree independently with probability dart_drop_rate. "weighted" biases dropout probability toward heavier-weight trees.
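The GOSS round described above can be sketched in a few lines of plain Python. This is an illustrative stand-in, not AlloyGBM's native implementation: the function name goss_sample, the use of int() truncation for the realized counts, and the list-based inputs are all assumptions made for the example.

```python
import random

def goss_sample(gradients, hessians, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (rows, gradients, hessians) selected for one GOSS round (sketch)."""
    n = len(gradients)
    rng = random.Random(seed)
    # Score rows by |gradient|, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n = int(top_rate * n)                       # assumed rounding rule
    rest = order[top_n:]
    other_n = min(int(other_rate * n), len(rest))   # realized sample count
    top = order[:top_n]
    sampled = rng.sample(rest, other_n)
    # Amplify the sampled low-gradient rows by (n - top_n) / other_n.
    amplify = (n - top_n) / other_n if other_n else 0.0
    rows = top + sampled
    g = [gradients[i] for i in top] + [gradients[i] * amplify for i in sampled]
    h = [hessians[i] for i in top] + [hessians[i] * amplify for i in sampled]
    return rows, g, h
```

With constant hessians the amplified hessian sum equals the full-data sum exactly, which is a quick way to sanity-check the amplification factor.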
GOSS and DART are supported on the binary classifier / regression /
ranking single-output objective. The multiclass softmax path
explicitly rejects non-"standard" boosting modes pending
per-class gradient scoring (v0.10.x follow-up — applies to both
GOSS and DART).
As of v0.10.0, DART + warm_start is supported on
GBMRegressor, binary GBMClassifier, and GBMRanker. The
continuation seeds dart_state.tree_weights from the prior model’s
per-stump tree_weight snapshot and pre-populates the dropout
bookkeeping arrays so new-round dropouts can correctly subtract /
replay prior trees. Historical RNG-driven dropped_per_round is
intentionally not persisted; new rounds start fresh dropout
bookkeeping going forward.
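# Import path assumed; X and y are the training features and targets.
from alloygbm import GBMRegressor
# Initial DART fit.
base = GBMRegressor(n_estimators=10, boosting_mode="dart",
                    dart_drop_rate=0.1, seed=7)
base.fit(X, y)
# Continue from base; tree weights are seeded from its per-stump snapshot.
cont = GBMRegressor(n_estimators=10, boosting_mode="dart",
                    dart_drop_rate=0.1, warm_start=True, seed=7)
cont.fit(X, y, init_model=base)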
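To make the dart_drop_rate, dart_max_drop, and dart_normalize_type rules concrete, here is a minimal sketch of "uniform" dropout selection and the two rescale policies. The function names are hypothetical, the way the dart_max_drop cap is enforced (a random subset of the selected trees) is an assumption, and the dropped-tree value is treated as a multiplicative factor on the existing weight, which the text above does not spell out.

```python
import random

def select_dropped(n_trees, drop_rate=0.1, max_drop=50, seed=0):
    """Pick each tree independently with probability drop_rate, capped at max_drop."""
    rng = random.Random(seed)
    dropped = [t for t in range(n_trees) if rng.random() < drop_rate]
    if len(dropped) > max_drop:
        # Assumed cap policy: keep a random subset of the selected trees.
        dropped = sorted(rng.sample(dropped, max_drop))
    return dropped

def rescale_factors(k, normalize_type="tree"):
    """Return (new_tree_weight, dropped_tree_factor) for k dropped trees."""
    if normalize_type == "tree":
        return 1.0 / (k + 1), k / (k + 1)
    if normalize_type == "forest":
        return 1.0 / (k + 1), 1.0 / (k + 1)
    raise ValueError(f"unknown normalize_type: {normalize_type!r}")
```

For example, with K = 3 dropped trees, "tree" mode yields a new-tree weight of 0.25 and a dropped-tree factor of 0.75, while "forest" mode yields 0.25 for both.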
Stopping and policy controls
- early_stopping_rounds: int | None = None
- min_validation_improvement: float = 0.0
- training_policy: str = "auto"
training_policy="auto" applies dataset-aware heuristics and is the
recommended default for practical use. manual is more appropriate for
controlled ablation work.
Early stopping is explicit-only. If early_stopping_rounds is set, call
fit(..., eval_set=(X_valid, y_valid)).
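The interaction between early_stopping_rounds and min_validation_improvement can be illustrated with a small patience loop over a recorded validation curve. This is a sketch under assumptions: the metric is lower-is-better, an improvement must strictly exceed min_validation_improvement, and iterations are 0-based. The engine's exact rule is not stated on this page, and the function name is hypothetical.

```python
def early_stopping_round(validation_scores, early_stopping_rounds,
                         min_validation_improvement=0.0):
    """Return (best_iteration, rounds_completed) for a lower-is-better metric."""
    best_score = float("inf")
    best_iteration = -1
    for i, score in enumerate(validation_scores):
        if best_score - score > min_validation_improvement:
            # Counts as an improvement: reset the patience window.
            best_score, best_iteration = score, i
        elif i - best_iteration >= early_stopping_rounds:
            # No qualifying improvement for early_stopping_rounds rounds.
            return best_iteration, i + 1
    return best_iteration, len(validation_scores)
```

On the curve [1.0, 0.9, 0.8, 0.81, 0.82, 0.83, 0.7] with a patience of 3, the loop stops after six rounds and reports round 2 as best, so the later dip to 0.7 is never seen.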
Leaf and split controls
- min_data_in_leaf: int = 1 – when training_policy="auto", the engine may increase this based on dataset size but will never reduce it below the value you set.
- lambda_l1: float = 0.0
- lambda_l2: float = 0.0
- min_child_hessian: float = 0.0
- min_split_gain: float = 0.0 – minimum gain required for a split. The auto policy may set this adaptively.
These map directly to native training controls instead of relying on environment-variable overrides.
Tree growth strategy
- tree_growth: str = "level" – "level" (depth-wise, one level at a time) or "leaf" (best-first, similar to LightGBM)
- max_leaves: int | None = None – maximum leaves for leaf-wise growth
Constraints
- monotone_constraints: list[int] | dict[int, int] | None = None – constrain features to monotone increasing (+1), decreasing (-1), or unconstrained (0)
- feature_weights: list[float] | dict[int, float] | None = None – per-feature importance weights influencing split selection
- interaction_constraints: list[list[int]] | None = None – LightGBM-compatible interaction constraints. Each inner list is a group of feature indices; any root-to-leaf path is restricted to splits on features from a single allowed group. Features outside every group are unconstrained and may appear alongside any group. Up to 64 groups per fit; enforced in both level-wise and leaf-wise growth. None disables the constraint (default).
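The interaction-constraint rule can be read as a predicate over the split features on one root-to-leaf path. The sketch below is only an interpretation of the wording above (the helper name path_is_allowed is hypothetical and the native engine enforces this during split search, not after the fact): ungrouped features are ignored, and the remaining features must all fit inside one group.

```python
def path_is_allowed(path_features, interaction_constraints):
    """True if the split features on one root-to-leaf path fit a single group."""
    if interaction_constraints is None:
        return True  # constraint disabled
    groups = [set(group) for group in interaction_constraints]
    grouped = set().union(*groups)
    # Features outside every group are unconstrained, so drop them first.
    constrained = {f for f in path_features if f in grouped}
    if not constrained:
        return True
    return any(constrained <= group for group in groups)
```

With interaction_constraints=[[0, 1], [2, 3]], a path that splits on features 0 and 1 is allowed, a path that splits on 0 and 2 is not, and a path that splits on 0 and an ungrouped feature 5 is allowed.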
Reproducibility
- seed: int = 0
- deterministic: bool = True
Continuous-feature controls
- continuous_binning_strategy: str = "linear"
- continuous_binning_max_bins: int = 256
Supports up to 65,535 bins per feature. Use quantile when you need more
robust handling of skewed continuous features.
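The difference between the two strategies shows up clearly on a skewed column. The sketch below assumes "linear" means equal-width bin edges between the minimum and maximum and "quantile" means edges at empirical quantiles; this page does not define either strategy precisely, and the helper names are hypothetical.

```python
from bisect import bisect_right

def linear_edges(values, max_bins):
    """Equal-width interior bin edges between min and max (assumed 'linear')."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / max_bins
    return [lo + step * i for i in range(1, max_bins)]

def quantile_edges(values, max_bins):
    """Interior bin edges at empirical quantiles (assumed 'quantile')."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[(n * i) // max_bins] for i in range(1, max_bins)]
    return sorted(set(edges))  # drop duplicate edges from tied values

def bin_counts(values, edges):
    """Count how many values fall into each bin defined by the edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return counts
```

For the values [1, 2, 3, 4, 5, 6, 7, 1000] and four bins, equal-width edges put seven of the eight values into the first bin, while quantile edges spread them two per bin.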
Categorical support
- categorical_feature_index: int | None = None – single column (legacy)
- categorical_feature_indices: list[int] | None = None – multiple columns
- categorical_smoothing: float = 20.0
- categorical_min_samples_leaf: int = 1
- categorical_time_aware: bool = False
- max_cat_threshold: int = 0 – maximum category cardinality for native categorical splits. When a categorical feature has at most this many unique values, AlloyGBM uses the Fisher-sort algorithm for O(K log K) optimal binary partition with O(1) bitset prediction. Features exceeding the threshold fall back to target encoding. Default 0 disables native splits.
DRO leaf solver
- leaf_solver: str = "standard" – "standard" keeps the usual scalar Newton leaf update; "dro" enables a fast robust scalar update that penalizes weak leaf signal by within-leaf gradient dispersion.
- dro_radius: float = 0.05 – non-negative penalty scale. 0.0 preserves standard-leaf predictions while recording DRO metadata.
- dro_metric: str = "wasserstein" – the only accepted value today. It denotes a Wasserstein-inspired closed-form robust counterpart over leaf gradient uncertainty.
This is not a full Wasserstein optimizer over raw feature/target
distributions. Inference speed is unchanged because robust scalar leaf values
are stored directly in the artifact. leaf_solver="dro" works on all three
estimators, composes with training_mode="morph", and requires
leaf_model="constant".
Factor-neutral boosting
- neutralization: str = "none" – one of "none", "pre_target", "per_round_gradient", or "split_penalty".
- factor_neutralization_lambda: float = 1e-6 – finite, non-negative ridge term added to F^T W F.
- factor_penalty: float = 0.0 – finite, non-negative split exposure penalty scale. Only active for neutralization="split_penalty".
Pass factors as fit-time data:
model = GBMRegressor(neutralization="per_round_gradient", seed=7)
model.fit(X_train, y_train, factor_exposures=F_train)
factor_exposures must be dense, row-major, finite, and shaped
(n_rows, n_factors). It is fit data, not constructor state, so sklearn
cloning remains clean and large matrices are not embedded in estimator params.
Mode semantics:
neutralization="none" preserves current behavior. Passing a non-None
factor_exposures matrix while neutralization is inactive raises a clear
validation error in Python, to prevent silent user mistakes.
neutralization="pre_target" residualizes the regression target once before
training:
y_perp = y - F (F^T W F + lambda I)^-1 F^T W y
This mode is supported for GBMRegressor only. It is rejected for
classification and ranking because target residualization is not well-defined
for class labels or ranking relevance. eval_set is also rejected for
pre_target in this release because the public API does not yet accept
validation-set factor exposures to residualize validation targets consistently.
neutralization="per_round_gradient" projects objective gradients before
each boosting round:
g_perp = g - F (F^T W F + lambda I)^-1 F^T W g
Hessians are unchanged. This mode is supported for regression, binary classification, multiclass, and ranking. For multiclass, each class-gradient column is projected independently against the same factor projector.
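For a single factor column the projection g_perp = g - F (F^T W F + lambda I)^-1 F^T W g collapses to scalar arithmetic, which makes it easy to check by hand. The sketch below covers only that one-factor special case with W as a diagonal of per-row weights; the function name is hypothetical and the native engine handles the general (n_rows, n_factors) matrix.

```python
def project_out_factor(g, f, w=None, ridge=1e-6):
    """Remove the weighted component of g along one factor column f (sketch)."""
    n = len(g)
    w = [1.0] * n if w is None else list(w)
    # Scalar versions of F^T W F and F^T W g.
    ftwf = sum(wi * fi * fi for wi, fi in zip(w, f))
    ftwg = sum(wi * fi * gi for wi, fi, gi in zip(w, f, g))
    beta = ftwg / (ftwf + ridge)
    return [gi - fi * beta for gi, fi in zip(g, f)]
```

With f = [1, 2, 3, 4], g = [4, 3, 6, 8], unit weights, and ridge = 0, beta is 60 / 30 = 2 and the projected gradient is [2, -1, 0, 0], whose inner product with f is zero. A positive ridge leaves a small residual exposure in exchange for numerical stability.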
neutralization="split_penalty" includes per-round gradient projection and
subtracts a factor-load penalty from split gain:
penalty = factor_penalty * || F_L^T update_L + F_R^T update_R ||^2 / max(row_count, 1)
gain_final = gain_after_existing_modes - penalty
For scalar leaves, update_L and update_R are the candidate scalar leaf
values before any final MorphBoost depth/iteration leaf scaling. For DRO
leaves, the scalar values use the DRO effective gradients. For MorphBoost, the
order is: project gradients, compute standard/DRO gradient gain, blend
MorphBoost information score, subtract factor penalty, then apply MorphBoost
leaf scaling when storing leaves. split_penalty performs additional
factor-exposure work during split search and should be treated as the slowest
neutralization mode until production-scale benchmarks justify stronger claims.
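The penalty formula can be worked through numerically for scalar leaves. The sketch below reads F_L^T update_L as the candidate leaf value multiplied by the column sums of the left child's exposure rows (and likewise for the right child), and takes row_count as the number of rows in the node being split. Both readings are assumptions about the formula above, and the function name is hypothetical.

```python
def split_factor_penalty(F_left, F_right, update_left, update_right, factor_penalty):
    """penalty = factor_penalty * ||F_L^T u_L + F_R^T u_R||^2 / max(row_count, 1)."""
    n_factors = len(F_left[0]) if F_left else (len(F_right[0]) if F_right else 0)
    load = [0.0] * n_factors
    # Accumulate the factor load contributed by each child's scalar update.
    for rows, update in ((F_left, update_left), (F_right, update_right)):
        for row in rows:
            for j, exposure in enumerate(row):
                load[j] += exposure * update
    row_count = len(F_left) + len(F_right)
    squared_norm = sum(v * v for v in load)
    return factor_penalty * squared_norm / max(row_count, 1)
```

For F_left = [[1, 0], [1, 1]], F_right = [[-1, 0.5]], updates 0.5 and -1.0, and factor_penalty = 2.0, the factor load is [2, 0], so the penalty is 2 * 4 / 3 and gain_final is the candidate gain minus that amount.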
Compatibility:
| Feature | pre_target | per_round_gradient | split_penalty |
|---|---|---|---|
| | supported | supported | supported |
| | rejected | supported | supported |
| | rejected | supported | supported |
| | supported | supported | supported |
| | supported | supported | supported |
| | supported | supported | rejected |
| warm start | supported | supported | supported |
This is a training-time regularization tool. It does not guarantee prediction-time zero exposure unless predictions are neutralized against evaluation-time factors outside the model.
Exposure matrices are not persisted in the estimator or artifact (they
would balloon the model size and surface sensitive data). As of v0.7.1
neutralized warm-start and init_model continuation are supported: the
caller must supply the same factor_exposures matrix used for the
initial fit so the projection has the same column space. Omitting
factor_exposures on a resumed fit raises a contract error.
pre_target neutralization is idempotent under repeated residualization
against the same exposures, so warm-start continuation residualizes the
original targets again on the resumed fit and trains on the same
target stream as a fresh N + M-round fit.
Piecewise-linear leaves
- leaf_model: str = "constant"
  - "constant" (default) – standard scalar leaf value, identical to all prior AlloyGBM behaviour.
  - "linear" – each leaf stores a small linear model f_s(x) = b_s + Σ α_j x_j (up to 8 regressors per leaf, inherited from the split path’s feature indices; the per-leaf cap is internal and not currently user-tunable). Optimal weights are solved in closed form via the ridge regression α* = -(XᵀHX + λI)⁻¹ Xᵀg, regularised by lambda_l2.
Empirically, "linear" converges in fewer rounds on data with linear
within-node residual structure (~10× faster on linearly-structured datasets,
+3.5% RMSE on California Housing, +1.75pp accuracy on Breast Cancer), at a
2–8× per-round training overhead. Recommended lambda_l2 >= 0.01 for weight
stability.
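The closed-form solve α* = -(XᵀHX + λI)⁻¹ Xᵀg is small enough to write out for one regressor plus an intercept. The sketch below assumes the intercept b_s is represented as a column of ones in X and is regularised like any other weight, which this page does not confirm; the function name is hypothetical.

```python
def fit_linear_leaf(x, g, h, lambda_l2=0.01):
    """Solve -(X^T H X + lambda I)^-1 X^T g for X = [1, x]; returns (b, alpha)."""
    # Entries of the symmetric 2x2 matrix X^T H X + lambda I.
    a00 = sum(h) + lambda_l2
    a01 = sum(hi * xi for hi, xi in zip(h, x))
    a11 = sum(hi * xi * xi for hi, xi in zip(h, x)) + lambda_l2
    # Right-hand side -X^T g.
    r0 = -sum(g)
    r1 = -sum(gi * xi for gi, xi in zip(g, x))
    det = a00 * a11 - a01 * a01  # positive when lambda_l2 > 0 or x varies
    b = (a11 * r0 - a01 * r1) / det
    alpha = (a00 * r1 - a01 * r0) / det
    return b, alpha
```

For squared error with a current prediction of zero (g = -y, h = 1) on targets y = 1 + 2x, the unregularised solve recovers b = 1 and alpha = 2 in a single step, which is the intuition behind the faster convergence on linearly structured data.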
Limitations:
- Native-bitset categorical splits (max_cat_threshold > 0) fall back to constant leaves at the categorical split node; descendant leaves below the split use linear leaves on remaining numeric regressors.
- SHAP (shap_values, feature_importances) supports leaf_model="linear" with strict additivity as of v0.7.4: the reconstruction error stays within atol + rtol·|predict(x)| (default 1e-5 + 1e-4·|predict(x)|) on the default predictor-aligned binning path. See Feature importances and SHAP for the decomposition and docs/limitations.md for the legacy-non-binning exemption.
- leaf_model="linear" composes with training_mode="morph".
Multi-label ranking
MultiLabelGBMRanker is a unified multi-output ranking estimator: y
has shape (n_rows, n_labels) and predict returns scores with the
same column layout. As of v0.7.1 the wrapper trains one independent
GBMRanker per label using a shared group (and optional shared
factor_exposures) so every per-label fit observes the same query
structure. Each per-label ranker independently picks up every existing
GBMRanker feature (warm-start, neutralization, MorphBoost, PL
leaves, DRO, interaction constraints, custom eval metrics).
ranking_objective may be a single string (applied to every label) or
a list of length n_labels for heterogeneous objectives. save_model
serialises every per-label ranker into a single .mlrk bundle via
pickle.HIGHEST_PROTOCOL.
Joint shared-tree multi-label boosting is deferred to v0.10.0
(paired with the K-output shared-histogram primitive); see
docs/limitations.md for the upgrade-path caveat.
MorphBoost (Adaptive Split Criterion)
GBMRegressor (and the classifier / ranker subclasses) support an opt-in MorphBoost training mode. See MorphBoost (Adaptive Split Criterion) for the full guide.
- training_mode: str = "auto" – one of "auto" (default), "manual", or "morph".
- morph_rate: float = 0.1 – per-iteration leaf shrinkage rate.
- evolution_pressure: float = 0.2 – EMA-driven gain shaping strength.
- morph_warmup_iters: int = 5 – rounds before the morph blend engages.
- info_score_weight: float = 0.3 – mixing weight for the information-theoretic gain term.
- depth_penalty_base: float = 0.9 – depth-based leaf penalty base.
- balance_penalty: bool = True – whether to penalize imbalanced splits.
- lr_schedule: str = "constant" – per-iteration LR schedule ("constant" or "warmup_cosine"); independent of training_mode.
- lr_warmup_frac: float = 0.1 – linear-warmup fraction when lr_schedule="warmup_cosine".
Warm-starting
- warm_start: bool = False – when True, fit() continues from the previously fitted model
Diagnostics
store_node_stats: bool = False
This stores optional node-level debug statistics inside the artifact for later analysis. It is not required for ordinary prediction.
Main methods
- fit(X, y, *, sample_weight=None, eval_set=None, eval_sample_weight=None, group=None, eval_group=None, eval_time_index=None, categorical_feature_values=None, time_index=None, factor_exposures=None)
- predict(X)
- shap_values(X, *, include_expected_value=False)
- feature_importances(X, *, method="shap")
- predict_from_artifact(artifact_bytes, X)
- save_model(path)
- load_model(path) (classmethod)
- artifact_bytes – property returning the raw artifact bytes
- score(X, y)
Important fit(...) rules:
- early_stopping_rounds requires eval_set
- eval_time_index requires eval_set
- categorical_time_aware=True requires time_index during training and eval_time_index for validation when eval_set is used
- sample_weight applies per-sample weights to the training loss
Post-fit attributes
After fitting, the estimator may expose:
- best_iteration_
- best_score_
- n_estimators_
- rounds_completed_
- stop_reason_
- evals_result_ – shaped like {"train": {"rmse": [...]}, "validation": {"rmse": [...]}}
- diagnostics_per_round_ – list of per-round dicts containing gradient_l2_norm, gradient_variance, hessian_l2_norm, original_gradient_l2_norm, projected_gradient_l2_norm, neutralization_effectiveness, n_active_rows, n_active_features. The three projection-related entries are None unless factor neutralization (per_round_gradient or split_penalty) is configured; pre_target mode never projects per round and therefore omits them.
- fit_timing_
- feature_names_ – captured from training data or auto-generated
Regression objectives (v0.11.1+)
GBMRegressor accepts the following values for the objective kwarg:
- "squared_error" (default) – standard least-squares regression.
- "poisson" – log-link Poisson regression for count targets. Targets must be >= 0. predict() returns exp(raw).
- "gamma" – log-link Gamma regression for strictly-positive continuous targets. Targets must be > 0. predict() returns exp(raw).
- "tweedie" – log-link compound Poisson-gamma regression for 1 < variance_power < 2. Useful for insurance/claims data with a mass at zero and a positive tail. Set tweedie_variance_power=1.5 (or another value in (1, 2)). Targets must be >= 0. predict() returns exp(raw).
- "quantile" – pinball loss regression with parameter quantile_alpha. Uses a proxy Hessian h_i = w_i (sample weight) during split-finding, and performs an empirical quantile leaf refinement step at the end of each round acting on the full dataset.
- Custom callable – any user-supplied (predictions, targets) → (gradients, hessians) function.
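As an illustration of the custom-callable shape (predictions, targets) → (gradients, hessians), here is a pseudo-Huber objective written in plain Python. The factory name make_pseudo_huber and the delta parameter are hypothetical, and the container types AlloyGBM passes in and expects back (lists versus arrays) are not specified on this page, so the sketch simply accepts any iterables of floats and returns lists.

```python
import math

def make_pseudo_huber(delta=1.0):
    """Build a (predictions, targets) -> (gradients, hessians) callable (sketch)."""
    def objective(predictions, targets):
        gradients, hessians = [], []
        for p, y in zip(predictions, targets):
            r = p - y                              # residual w.r.t. the prediction
            s = math.sqrt(1.0 + (r / delta) ** 2)
            gradients.append(r / s)                # first derivative of the loss
            hessians.append(1.0 / s ** 3)          # second derivative, always > 0
        return gradients, hessians
    return objective
```

Under these assumptions the callable would be supplied through the objective kwarg, for example GBMRegressor(objective=make_pseudo_huber(delta=4.0)). The strictly positive hessian keeps Newton-style leaf updates well-defined even for large residuals.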
tweedie_variance_power: float = 1.5 – only used when
objective="tweedie". Must satisfy 1 < p < 2. For p = 1 use
objective="poisson"; for p = 2 use objective="gamma".
quantile_alpha: float = 0.5 – quantile to estimate when objective="quantile".
Must be in (0, 1).
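The two ingredients of the "quantile" objective, the pinball gradient with its proxy Hessian and the empirical quantile leaf refinement, can be sketched as follows. Assumptions in this example: the gradient is scaled by the sample weight, ties (target equal to prediction) take the 1 - alpha branch, and the leaf quantile uses the inverse-CDF definition. The helper names are hypothetical and the native refinement step may differ in detail.

```python
import math

def pinball_grad_hess(predictions, targets, alpha, sample_weight=None):
    """Pinball-loss gradients with the proxy Hessian h_i = w_i (sketch)."""
    n = len(predictions)
    w = [1.0] * n if sample_weight is None else list(sample_weight)
    gradients = [
        wi * (-alpha if y > p else 1.0 - alpha)  # d(loss)/d(prediction)
        for p, y, wi in zip(predictions, targets, w)
    ]
    return gradients, w

def refine_leaf(residuals, alpha):
    """Empirical alpha-quantile of the non-empty residuals (target - prediction) in a leaf."""
    ordered = sorted(residuals)
    idx = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[idx]
```

Because the pinball loss has zero curvature almost everywhere, the gradient alone carries no scale information; the refinement step is what moves each leaf to the requested quantile of the residuals it contains.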
All three GLM objectives compose with boosting_mode="dart",
boosting_mode="goss", warm-start, tree_growth="leaf",
neutralization="per_round_gradient" /
neutralization="split_penalty", and training_mode="morph".
neutralization="pre_target" remains squared-error-only.
The "quantile" objective is supported in combination with DART, MorphBoost, and piecewise-linear leaves (leaf_model="linear"). It is explicitly rejected when combined with classification, ranking, or joint multi-output training.
Three deviance metrics in alloygbm.evaluation partner with the new
objectives: poisson_deviance, gamma_deviance, and
tweedie_deviance.
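For reference, the textbook mean Poisson deviance can be computed from raw (log-link) scores as shown below. This is a standalone sketch of the standard formula, not the alloygbm.evaluation implementation: whether poisson_deviance there averages or sums, and whether it takes raw scores or exp(raw) predictions, is not stated on this page. The function name is chosen for the example.

```python
import math

def mean_poisson_deviance(targets, raw_predictions):
    """Mean of 2 * (y * log(y / mu) - (y - mu)) with mu = exp(raw)."""
    total = 0.0
    for y, raw in zip(targets, raw_predictions):
        mu = math.exp(raw)                     # log link: predict() returns exp(raw)
        # The y * log(y / mu) term is defined as 0 when y == 0.
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (term - (y - mu))
    return total / len(targets)
```

The deviance is zero when every prediction equals its target and grows as predictions drift away, which makes it a natural validation metric to pair with objective="poisson".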
SHAP interaction values (v0.11.0+)
GBMRegressor.shap_interaction_values(X) returns pairwise SHAP
attributions as an (n_rows, n_features, n_features) tensor in
O(T · L · D² · M) time via Lundberg et al. (2020) Algorithm 2.
See Feature importances and SHAP for the full contract and scope limits.
Recommended usage pattern
For most users:
- start with training_policy="auto"
- keep deterministic=True during evaluation
- use time-aware validation for temporal or panel-like problems
- use the benchmark suite to compare profile shapes rather than trusting a single run