GBMRegressor ============ ``GBMRegressor`` is the main Python estimator for regression in AlloyGBM. Core parameters --------------- - ``learning_rate: float = 0.1`` - additive update step size - ``max_depth: int = 6`` - maximum tree depth - ``n_estimators: int = 6`` - requested boosting rounds - ``row_subsample: float = 1.0`` - per-round row sampling fraction; ignored when ``boosting_mode="goss"`` (GOSS uses gradient-based sampling instead). - ``col_subsample: float = 1.0`` - per-round feature sampling fraction - ``quantile_alpha: float = 0.5`` - Target quantile for ``"quantile"`` regression. Must be strictly in ``(0.0, 1.0)``. Boosting mode ------------- - ``boosting_mode: str = "standard"`` -- per-round sample-selection strategy. Three values are accepted: - ``"standard"`` (default) -- uniform row subsampling under ``row_subsample``. Byte-identical to v0.7.5 on every API surface. - ``"goss"`` -- LightGBM-style **G**\ radient-based **O**\ ne-**S**\ ide **S**\ ampling. Each round, score rows by ``|gradient|``, keep the top ``goss_top_rate`` fraction, uniformly sample ``goss_other_rate`` fraction from the rest, and amplify the sampled-low rows' gradient and hessian by ``(n - top_n) / other_n`` (realized counts) so the histogram statistics remain an unbiased estimator of the full-data gradient sums. Convergence is typically faster on data with a long-tailed gradient distribution (the canonical LightGBM advantage). - ``"dart"`` -- **D**\ ropouts meet **MART**. Each round, drop a random subset of previously-trained trees, fit a new tree on the residuals of the dropped-out ensemble, then rescale the dropped trees + the new tree so the prediction sum stays unbiased. Reduces over-specialization of late trees; can improve generalization on noisy data. Per-stump ``tree_weight: f32`` is persisted via a new ``DartTreeWeights`` artifact section. - ``goss_top_rate: float = 0.2`` -- top-by-gradient kept fraction when ``boosting_mode="goss"``. Must be in ``(0, 1)``. - ``goss_other_rate: float = 0.1`` -- random-sample fraction from the remaining rows when ``boosting_mode="goss"``. Must be in ``(0, 1)`` and ``goss_top_rate + goss_other_rate <= 1.0``. - ``dart_drop_rate: float = 0.1`` -- per-tree drop probability per round when ``boosting_mode="dart"``. Must be in ``(0, 1)``. - ``dart_max_drop: int = 50`` -- cap on the number of trees dropped per round. Must be ``>= 1``. - ``dart_normalize_type: str = "tree"`` -- rescale policy after the new tree is fit. ``"tree"`` mode sets new-tree weight to ``1 / (K + 1)`` and each dropped-tree weight to ``K / (K + 1)``; ``"forest"`` mode sets both to ``1 / (K + 1)`` (more aggressive rescale). - ``dart_sample_type: str = "uniform"`` -- dropout sampling strategy. ``"uniform"`` picks each tree independently with probability ``dart_drop_rate``. ``"weighted"`` biases dropout probability toward heavier-weight trees. GOSS and DART are supported on the binary classifier / regression / ranking single-output objective. The multiclass softmax path explicitly rejects non-``"standard"`` boosting modes pending per-class gradient scoring (v0.10.x follow-up — applies to both GOSS and DART). As of v0.10.0, **DART + ``warm_start``** is supported on ``GBMRegressor``, binary ``GBMClassifier``, and ``GBMRanker``. The continuation seeds ``dart_state.tree_weights`` from the prior model's per-stump ``tree_weight`` snapshot and pre-populates the dropout bookkeeping arrays so new-round dropouts can correctly subtract / replay prior trees. Historical RNG-driven ``dropped_per_round`` is intentionally not persisted; new rounds start fresh dropout bookkeeping going forward. .. code-block:: python base = GBMRegressor(n_estimators=10, boosting_mode="dart", dart_drop_rate=0.1, seed=7) base.fit(X, y) cont = GBMRegressor(n_estimators=10, boosting_mode="dart", dart_drop_rate=0.1, warm_start=True, seed=7) cont.fit(X, y, init_model=base) Stopping and policy controls ---------------------------- - ``early_stopping_rounds: int | None = None`` - ``min_validation_improvement: float = 0.0`` - ``training_policy: str = "auto"`` ``training_policy="auto"`` applies dataset-aware heuristics and is the recommended default for practical use. ``manual`` is more appropriate for controlled ablation work. Early stopping is explicit-only. If ``early_stopping_rounds`` is set, call ``fit(..., eval_set=(X_valid, y_valid))``. Leaf and split controls ----------------------- - ``min_data_in_leaf: int = 1`` -- when ``training_policy="auto"``, the engine may increase this based on dataset size but will never reduce it below the value you set. - ``lambda_l1: float = 0.0`` - ``lambda_l2: float = 0.0`` - ``min_child_hessian: float = 0.0`` - ``min_split_gain: float = 0.0`` -- minimum gain required for a split. The auto policy may set this adaptively. These map directly to native training controls instead of relying on environment-variable overrides. Tree growth strategy -------------------- - ``tree_growth: str = "level"`` -- ``level`` (depth-first) or ``leaf`` (best-first, similar to LightGBM) - ``max_leaves: int | None = None`` -- maximum leaves for leaf-wise growth Constraints ----------- - ``monotone_constraints: list[int] | dict[int, int] | None = None`` -- constrain features to monotone increasing (+1), decreasing (-1), or unconstrained (0) - ``feature_weights: list[float] | dict[int, float] | None = None`` -- per-feature importance weights influencing split selection - ``interaction_constraints: list[list[int]] | None = None`` -- LightGBM-compatible interaction constraints. Each inner list is a group of feature indices; any root-to-leaf path is restricted to splits on features from a single allowed group. Features outside every group are unconstrained and may appear alongside any group. Up to 64 groups per fit; enforced in both level-wise and leaf-wise growth. ``None`` disables the constraint (default). Reproducibility --------------- - ``seed: int = 0`` - ``deterministic: bool = True`` Continuous-feature controls --------------------------- - ``continuous_binning_strategy: str = "linear"`` - ``continuous_binning_max_bins: int = 256`` Supports up to 65,535 bins per feature. Use ``quantile`` when you need more robust handling of skewed continuous features. Categorical support ------------------- - ``categorical_feature_index: int | None = None`` -- single column (legacy) - ``categorical_feature_indices: list[int] | None = None`` -- multiple columns - ``categorical_smoothing: float = 20.0`` - ``categorical_min_samples_leaf: int = 1`` - ``categorical_time_aware: bool = False`` - ``max_cat_threshold: int = 0`` -- maximum category cardinality for native categorical splits. When a categorical feature has at most this many unique values, AlloyGBM uses the Fisher-sort algorithm for O(K log K) optimal binary partition with O(1) bitset prediction. Features exceeding the threshold fall back to target encoding. Default 0 disables native splits. DRO leaf solver --------------- - ``leaf_solver: str = "standard"`` -- ``"standard"`` keeps the usual scalar Newton leaf update; ``"dro"`` enables a fast robust scalar update that penalizes weak leaf signal by within-leaf gradient dispersion. - ``dro_radius: float = 0.05`` -- non-negative penalty scale. ``0.0`` preserves standard-leaf predictions while recording DRO metadata. - ``dro_metric: str = "wasserstein"`` -- the only accepted value today. It denotes a Wasserstein-inspired closed-form robust counterpart over leaf gradient uncertainty. This is not a full Wasserstein optimizer over raw feature/target distributions. Inference speed is unchanged because robust scalar leaf values are stored directly in the artifact. ``leaf_solver="dro"`` works on all three estimators, composes with ``training_mode="morph"``, and requires ``leaf_model="constant"``. Factor-neutral boosting ----------------------- - ``neutralization: str = "none"`` - one of ``"none"``, ``"pre_target"``, ``"per_round_gradient"``, or ``"split_penalty"`` - ``factor_neutralization_lambda: float = 1e-6`` -- finite, non-negative ridge term added to ``F^T W F``. - ``factor_penalty: float = 0.0`` -- finite, non-negative split exposure penalty scale. Only active for ``neutralization="split_penalty"``. Pass factors as fit-time data: .. code-block:: python model = GBMRegressor(neutralization="per_round_gradient", seed=7) model.fit(X_train, y_train, factor_exposures=F_train) ``factor_exposures`` must be dense, row-major, finite, and shaped ``(n_rows, n_factors)``. It is fit data, not constructor state, so sklearn cloning remains clean and large matrices are not embedded in estimator params. Mode semantics: ``neutralization="none"`` preserves current behavior and ignores ``factor_exposures`` unless a non-``None`` matrix is provided with an inactive mode, in which case Python raises a clear validation error to prevent silent user mistakes. ``neutralization="pre_target"`` residualizes the regression target once before training: .. code-block:: text y_perp = y - F (F^T W F + lambda I)^-1 F^T W y This mode is supported for ``GBMRegressor`` only. It is rejected for classification and ranking because target residualization is not well-defined for class labels or ranking relevance. ``eval_set`` is also rejected for ``pre_target`` in this release because the public API does not yet accept validation-set factor exposures to residualize validation targets consistently. ``neutralization="per_round_gradient"`` projects objective gradients before each boosting round: .. code-block:: text g_perp = g - F (F^T W F + lambda I)^-1 F^T W g Hessians are unchanged. This mode is supported for regression, binary classification, multiclass, and ranking. For multiclass, each class-gradient column is projected independently against the same factor projector. ``neutralization="split_penalty"`` includes per-round gradient projection and subtracts a factor-load penalty from split gain: .. code-block:: text penalty = factor_penalty * || F_L^T update_L + F_R^T update_R ||^2 / max(row_count, 1) gain_final = gain_after_existing_modes - penalty For scalar leaves, ``update_L`` and ``update_R`` are the candidate scalar leaf values before any final MorphBoost depth/iteration leaf scaling. For DRO leaves, the scalar values use the DRO effective gradients. For MorphBoost, the order is: project gradients, compute standard/DRO gradient gain, blend MorphBoost information score, subtract factor penalty, then apply MorphBoost leaf scaling when storing leaves. ``split_penalty`` performs additional factor-exposure work during split search and should be treated as the slowest neutralization mode until production-scale benchmarks justify stronger claims. Compatibility: .. list-table:: :header-rows: 1 * - Feature - pre_target - per_round_gradient - split_penalty * - ``GBMRegressor`` - supported - supported - supported * - ``GBMClassifier`` - rejected - supported - supported * - ``GBMRanker`` - rejected - supported - supported * - ``training_mode="morph"`` - supported - supported - supported * - ``leaf_solver="dro"`` - supported - supported - supported * - ``leaf_model="linear"`` - supported - supported - rejected * - warm start - supported - supported - supported This is a training-time regularization tool. It does not guarantee prediction-time zero exposure unless predictions are neutralized against evaluation-time factors outside the model. Exposure matrices are not persisted in the estimator or artifact (they would balloon the model size and surface sensitive data). As of v0.7.1 neutralized warm-start and ``init_model`` continuation are supported: the caller must supply the same ``factor_exposures`` matrix used for the initial fit so the projection has the same column space. Omitting ``factor_exposures`` on a resumed fit raises a contract error. ``pre_target`` neutralization is idempotent under repeated residualization against the same exposures, so warm-start continuation residualizes the original targets again on the resumed fit and trains on the same target stream as a fresh ``N + M``-round fit. Piecewise-linear leaves ----------------------- - ``leaf_model: str = "constant"`` - ``"constant"`` (default) -- standard scalar leaf value, identical to all prior AlloyGBM behaviour. - ``"linear"`` -- each leaf stores a small linear model ``f_s(x) = b_s + Σ α_j x_j`` (up to 8 regressors per leaf, inherited from the split path's feature indices; the per-leaf cap is internal and not currently user-tunable). Optimal weights are solved in closed form via the ridge regression ``α* = -(XᵀHX + λI)⁻¹ Xᵀg``, regularised by ``lambda_l2``. Empirically, ``"linear"`` converges in fewer rounds on data with linear within-node residual structure (~10× faster on linearly-structured datasets, +3.5% RMSE on California Housing, +1.75pp accuracy on Breast Cancer), at a 2–8× per-round training overhead. Recommended ``lambda_l2 >= 0.01`` for weight stability. Limitations: - Native-bitset categorical splits (``max_cat_threshold > 0``) fall back to constant leaves at the categorical split node; descendant leaves below the split use linear leaves on remaining numeric regressors. - SHAP (``shap_values``, ``feature_importances``) supports ``leaf_model="linear"`` with strict additivity as of v0.7.4: the reconstruction satisfies ``atol + rtol·|predict(x)|`` (default ``1e-5 + 1e-4·|predict(x)|``) on the default predictor-aligned binning path. See :doc:`explanations` for the decomposition and ``docs/limitations.md`` for the legacy-non-binning exemption. - ``leaf_model="linear"`` composes with ``training_mode="morph"``. Multi-label ranking ------------------- ``MultiLabelGBMRanker`` is a unified multi-output ranking estimator: ``y`` has shape ``(n_rows, n_labels)`` and ``predict`` returns scores with the same column layout. As of v0.7.1 the wrapper trains one independent :class:`GBMRanker` per label using a shared ``group`` (and optional shared ``factor_exposures``) so every per-label fit observes the same query structure. Each per-label ranker independently picks up every existing :class:`GBMRanker` feature (warm-start, neutralization, MorphBoost, PL leaves, DRO, interaction constraints, custom eval metrics). ``ranking_objective`` may be a single string (applied to every label) or a list of length ``n_labels`` for heterogeneous objectives. ``save_model`` serialises every per-label ranker into a single ``.mlrk`` bundle via ``pickle.HIGHEST_PROTOCOL``. Joint shared-tree multi-label boosting is deferred to v0.10.0 (paired with the K-output shared-histogram primitive); see ``docs/limitations.md`` for the upgrade-path caveat. MorphBoost (Adaptive Split Criterion) ------------------------------------- GBMRegressor (and the classifier / ranker subclasses) support an opt-in MorphBoost training mode. See :doc:`morphboost` for the full guide. - ``training_mode: str = "auto"`` -- one of ``"auto"`` (default), ``"manual"``, or ``"morph"``. - ``morph_rate: float = 0.1`` -- per-iteration leaf shrinkage rate. - ``evolution_pressure: float = 0.2`` -- EMA-driven gain shaping strength. - ``morph_warmup_iters: int = 5`` -- rounds before the morph blend engages. - ``info_score_weight: float = 0.3`` -- mixing weight for the information-theoretic gain term. - ``depth_penalty_base: float = 0.9`` -- depth-based leaf penalty base. - ``balance_penalty: bool = True`` -- whether to penalize imbalanced splits. - ``lr_schedule: str = "constant"`` -- per-iteration LR schedule (``"constant"`` or ``"warmup_cosine"``); independent of ``training_mode``. - ``lr_warmup_frac: float = 0.1`` -- linear-warmup fraction when ``lr_schedule="warmup_cosine"``. Warm-starting ------------- - ``warm_start: bool = False`` -- when ``True``, ``fit()`` continues from the previously fitted model Diagnostics ----------- - ``store_node_stats: bool = False`` This stores optional node-level debug statistics inside the artifact for later analysis. It is not required for ordinary prediction. Main methods ------------ - ``fit(X, y, *, sample_weight=None, eval_set=None, eval_sample_weight=None, group=None, eval_group=None, eval_time_index=None, categorical_feature_values=None, time_index=None, factor_exposures=None)`` - ``predict(X)`` - ``shap_values(X, *, include_expected_value=False)`` - ``feature_importances(X, *, method="shap")`` - ``predict_from_artifact(artifact_bytes, X)`` - ``save_model(path)`` - ``load_model(path)`` (classmethod) - ``artifact_bytes`` -- property returning the raw artifact bytes - ``score(X, y)`` Important ``fit(...)`` rules: - ``early_stopping_rounds`` requires ``eval_set`` - ``eval_time_index`` requires ``eval_set`` - ``categorical_time_aware=True`` requires ``time_index`` during training and ``eval_time_index`` for validation when ``eval_set`` is used - ``sample_weight`` applies per-sample weights to the training loss Post-fit attributes ------------------- After fitting, the estimator may expose: - ``best_iteration_`` - ``best_score_`` - ``n_estimators_`` - ``rounds_completed_`` - ``stop_reason_`` - ``evals_result_`` -- shaped like ``{"train": {"rmse": [...]}, "validation": {"rmse": [...]}}`` - ``diagnostics_per_round_`` -- list of per-round dicts containing ``gradient_l2_norm``, ``gradient_variance``, ``hessian_l2_norm``, ``original_gradient_l2_norm``, ``projected_gradient_l2_norm``, ``neutralization_effectiveness``, ``n_active_rows``, ``n_active_features``. The three projection-related entries are ``None`` unless factor neutralization (``per_round_gradient`` or ``split_penalty``) is configured; ``pre_target`` mode never projects per round and therefore omits them. - ``fit_timing_`` - ``feature_names_`` -- captured from training data or auto-generated Regression objectives (v0.11.1+) -------------------------------- ``GBMRegressor`` accepts the following values for the ``objective`` kwarg: - ``"squared_error"`` (default) -- standard least-squares regression. - ``"poisson"`` -- log-link Poisson regression for count targets. Targets must be ``>= 0``. ``predict()`` returns ``exp(raw)``. - ``"gamma"`` -- log-link Gamma regression for strictly-positive continuous targets. Targets must be ``> 0``. ``predict()`` returns ``exp(raw)``. - ``"tweedie"`` -- log-link compound Poisson-gamma regression for ``1 < variance_power < 2``. Useful for insurance/claims data with a mass at zero and a positive tail. Set ``tweedie_variance_power=1.5`` (or another value in ``(1, 2)``). Targets must be ``>= 0``. ``predict()`` returns ``exp(raw)``. - ``"quantile"`` -- pinball loss regression with parameter ``quantile_alpha``. Uses a proxy Hessian ``h_i = w_i`` (sample weight) during split-finding, and performs an empirical quantile leaf refinement step at the end of each round acting on the full dataset. - Custom callable -- any user-supplied ``(predictions, targets) → (gradients, hessians)`` function. ``tweedie_variance_power: float = 1.5`` -- only used when ``objective="tweedie"``. Must satisfy ``1 < p < 2``. For ``p = 1`` use ``objective="poisson"``; for ``p = 2`` use ``objective="gamma"``. ``quantile_alpha: float = 0.5`` -- quantile to estimate when ``objective="quantile"``. Must be in ``(0, 1)``. All three GLM objectives compose with ``boosting_mode="dart"``, ``boosting_mode="goss"``, warm-start, ``tree_growth="leaf"``, ``neutralization="per_round_gradient"`` / ``neutralization="split_penalty"``, and ``training_mode="morph"``. ``neutralization="pre_target"`` remains squared-error-only. The ``"quantile"`` objective is supported in combination with DART, MorphBoost, and piecewise-linear leaves (``leaf_model="linear"``). It is explicitly rejected when combined with classification, ranking, or joint multi-output training. Three deviance metrics in ``alloygbm.evaluation`` partner with the new objectives: ``poisson_deviance``, ``gamma_deviance``, and ``tweedie_deviance``. SHAP interaction values (v0.11.0+) ---------------------------------- ``GBMRegressor.shap_interaction_values(X)`` returns pairwise SHAP attributions as an ``(n_rows, n_features, n_features)`` tensor in ``O(T · L · D² · M)`` time via Lundberg et al. (2020) Algorithm 2. See :doc:`explanations` for the full contract and scope limits. Recommended usage pattern ------------------------- For most users: - start with ``training_policy="auto"`` - keep ``deterministic=True`` during evaluation - use time-aware validation for temporal or panel-like problems - use the benchmark suite to compare profile shapes rather than trusting a single run