Architecture ============ This page gives a technical overview of how AlloyGBM is organized internally. High-level component layout --------------------------- The repository is split into Rust workspace crates plus Python bindings: - ``crates/core`` -- shared data contracts, matrices, gradients, artifacts, and NaN handling. - ``crates/engine`` -- training logic, objective implementations, and policy-driven iteration control. - ``crates/backend_cpu`` -- CPU histogram building, split evaluation, and NaN-aware partitioning. - ``crates/predictor`` -- artifact-backed prediction with post-transform support, including sigmoid for classification. - ``crates/shap`` -- TreeSHAP explanation support using polynomial-time exact Shapley values. - ``crates/categorical`` -- categorical support helpers such as target encoding and frequency encoding. - ``bindings/python`` -- Python extension module and public package (``GBMRegressor``, ``GBMClassifier``, ``GBMRanker``). Objective implementations ------------------------- The engine implements a generic ``ObjectiveOps`` trait with these concrete objectives: - ``SquaredErrorObjective`` -- MSE loss for regression - ``BinaryCrossEntropyObjective`` -- log-loss for binary classification - ``RankPairwiseObjective`` -- RankNet pairwise logistic - ``RankNdcgObjective`` -- LambdaMART with NDCG weighting - ``RankXendcgObjective`` -- cross-entropy approximation to NDCG - ``QueryRmseObjective`` -- query-grouped RMSE - ``YetiRankObjective`` -- stochastic NDCG-weighted pairwise Training pipeline ----------------- At a high level, Python training flows like this: 1. Python input validation and coercion 2. dense fast-path detection for array-like inputs 3. continuous-feature quantization when needed (up to 65,535 bins) 4. NaN handling -- missing values are routed to a dedicated bin 5. Rust engine training with the selected objective 6. artifact serialization (includes objective metadata for post-transforms) 7. native predictor handle creation for later inference .. figure:: _static/training_pipeline.png :alt: AlloyGBM training pipeline from Python inputs through quantization, engine training, backend execution, artifact serialization, and native predictor creation. :width: 100% :align: center High-level AlloyGBM training pipeline from Python inputs to serialized model artifact and native predictor handle. Tree growth strategies ---------------------- AlloyGBM supports two tree growth strategies: - **Level-wise** (default): grows trees level-by-level, expanding all nodes at the current depth before moving deeper - **Leaf-wise**: grows trees by selecting the leaf with the highest split gain at each step, similar to LightGBM's strategy Artifact design --------------- AlloyGBM keeps a binary artifact format with magic bytes ``AGBM`` and versioned sections: - Trees section - PredictorLayout section (includes objective type for post-transforms) - CategoricalState section - JSON metadata header Artifact-backed inference is part of the public Python story, and the format supports model persistence via pickle, ``save_model``/``load_model``, and raw byte export. Recent design choices --------------------- The current codebase includes several design decisions: - dense native ingestion paths to avoid unnecessary Python row materialization - flat histogram storage for better cache behavior, with buffer reuse across rounds - dataset-aware training policy in ``auto`` mode - NaN-aware histogram building and split finding - adaptive u8/u16 bin storage (u8 for <=256 bins, u16 for larger) - monotone constraint enforcement during split finding - feature weight integration into split candidate selection - optional node statistics for later introspection .. figure:: _static/tree_node_structure.png :alt: AlloyGBM split node structure showing threshold bin, gain, child nodes, and optional node statistics. :width: 85% :align: center Conceptual split-node structure used to explain artifact layout and optional node-level diagnostics.