Benchmarks
==========

This page summarizes how AlloyGBM is benchmarked and what the current public
results say.

Methodology
-----------

The benchmark runner lives in ``benchmarks/run_model_comparison.py`` and
compares AlloyGBM against:

- XGBoost
- LightGBM
- CatBoost

It also includes additional AlloyGBM variants as separate arms by default
per task type:

- ``alloygbm_morph`` -- ``training_mode="morph"`` with constant LR
- ``alloygbm_morph_cosine`` -- ``training_mode="morph"`` with
  ``lr_schedule="warmup_cosine"``
- ``alloygbm_linear`` -- ``leaf_model="linear"`` (piecewise-linear leaves)
  with auto training mode
- ``alloygbm_morph_linear`` -- ``leaf_model="linear"`` combined with
  ``training_mode="morph"``

Use the runner's ``--models`` flag to filter which arms run. Focused
harnesses are also provided:

- ``benchmarks/morph_report.py`` -- quick MorphBoost-vs-peers comparison
- ``benchmarks/numerai_benchmark.py`` -- Numerai tournament benchmark with
  walk-forward CV, residualized targets, and Numerai-specific scoring
- ``benchmarks/pl_trees_benchmark.py`` -- piecewise-linear-leaf
  convergence-curve and λ-sweep analysis. Report at
  ``docs/benchmarks/pl_trees_v1.md``.

The suite spans three task types with the following scenarios:

**Regression:** ``dense_numeric``, ``california_housing``, ``bike_sharing``,
``panel_time_series``, ``dow_jones_financial``

**Classification:** ``breast_cancer``, ``synthetic_classification``

**Ranking:** ``synthetic_ranking``, ``california_ranking``

Profiles are evaluated across shallow, mid, and deep configurations so the
comparison is not tied to a single parameter shape.

Current results
---------------

**Regression:**

- AlloyGBM is strongest on ``panel_time_series``
- AlloyGBM is strong on ``dow_jones_financial``
- AlloyGBM is competitive but not leading on ``dense_numeric``
- AlloyGBM trails on ``california_housing`` and ``bike_sharing``
- AlloyGBM is typically the fastest trainer on most scenario/profile rows

**Classification:**

- AlloyGBM is competitive with established libraries on accuracy, log-loss, and
  AUC across ``breast_cancer`` and ``synthetic_classification``

**Ranking:**

- AlloyGBM competes on ``synthetic_ranking`` and ``california_ranking`` using
  native LambdaMART, evaluated via NDCG@5, NDCG@10, and full NDCG

**MorphBoost variants:**

- On Numerai-style residualized regression at scale (~2.7M rows × 42 features
  × 5000 rounds), AlloyGBM's MorphBoost variants lead all peer libraries on
  validation MMC (Meta-Model Contribution) and Sharpe; numerai_corr trails by
  a small margin (~0.0006-0.0009).
- ``alloygbm_morph`` is typically the fastest of the three AlloyGBM variants
  on this workload due to faster convergence under the EMA-shaped gain.

**Piecewise-linear leaf variants:**

- ``leaf_model="linear"`` shows ~10× faster convergence on linearly-structured
  data, +3.5% RMSE on California Housing, and +1.75pp accuracy on Breast
  Cancer vs constant-leaf baselines, at a 2–8× per-round training overhead.
- See ``docs/benchmarks/pl_trees_v1.md`` for the full report.

Metrics by task type
--------------------

.. list-table::
   :header-rows: 1

   * - Task type
     - Metrics
   * - Regression
     - RMSE, MAE, R2
   * - Classification
     - Accuracy, Log-Loss, AUC
   * - Ranking
     - NDCG@5, NDCG@10, NDCG

How to run the suite
--------------------

.. code-block:: console

   python3 benchmarks/run_model_comparison.py --force-prepare

Focused regression comparison:

.. code-block:: console

   python3 benchmarks/run_model_comparison.py \
     --force-prepare \
     --scenarios california_housing bike_sharing dense_numeric panel_time_series dow_jones_financial

Classification only:

.. code-block:: console

   python3 benchmarks/run_model_comparison.py \
     --force-prepare \
     --scenarios breast_cancer synthetic_classification

Ranking only:

.. code-block:: console

   python3 benchmarks/run_model_comparison.py \
     --force-prepare \
     --scenarios synthetic_ranking california_ranking

Stage timing output
-------------------

Per-record benchmark output includes:

- ``input_adaptation_seconds``
- ``native_bridge_prepare_seconds``
- ``native_train_seconds``
- ``fit_seconds``
- ``predict_seconds``

Use those timing columns to tell apart Python-side adaptation cost and native
training cost.

Interpretation
--------------

The benchmark suite is designed to answer both of these questions:

- Where is AlloyGBM already strong?
- Where does it still lag established libraries?

The second question matters. These docs intentionally preserve that honesty.