Benchmarks

This page summarizes how AlloyGBM is benchmarked and what the current public results say.

Methodology

The benchmark runner lives in benchmarks/run_model_comparison.py and compares AlloyGBM against:

  • XGBoost

  • LightGBM

  • CatBoost

It also includes additional AlloyGBM variants as separate arms by default per task type:

  • alloygbm_morphtraining_mode="morph" with constant LR

  • alloygbm_morph_cosinetraining_mode="morph" with lr_schedule="warmup_cosine"

  • alloygbm_linearleaf_model="linear" (piecewise-linear leaves) with auto training mode

  • alloygbm_morph_linearleaf_model="linear" combined with training_mode="morph"

Use the runner’s --models flag to filter which arms run. Focused harnesses are also provided:

  • benchmarks/morph_report.py – quick MorphBoost-vs-peers comparison

  • benchmarks/numerai_benchmark.py – Numerai tournament benchmark with walk-forward CV, residualized targets, and Numerai-specific scoring

  • benchmarks/pl_trees_benchmark.py – piecewise-linear-leaf convergence-curve and λ-sweep analysis. Report at docs/benchmarks/pl_trees_v1.md.

The suite spans three task types with the following scenarios:

Regression: dense_numeric, california_housing, bike_sharing, panel_time_series, dow_jones_financial

Classification: breast_cancer, synthetic_classification

Ranking: synthetic_ranking, california_ranking

Profiles are evaluated across shallow, mid, and deep configurations so the comparison is not tied to a single parameter shape.

Current results

Regression:

  • AlloyGBM is strongest on panel_time_series

  • AlloyGBM is strong on dow_jones_financial

  • AlloyGBM is competitive but not leading on dense_numeric

  • AlloyGBM trails on california_housing and bike_sharing

  • AlloyGBM is typically the fastest trainer on most scenario/profile rows

Classification:

  • AlloyGBM is competitive with established libraries on accuracy, log-loss, and AUC across breast_cancer and synthetic_classification

Ranking:

  • AlloyGBM competes on synthetic_ranking and california_ranking using native LambdaMART, evaluated via NDCG@5, NDCG@10, and full NDCG

MorphBoost variants:

  • On Numerai-style residualized regression at scale (~2.7M rows × 42 features × 5000 rounds), AlloyGBM’s MorphBoost variants lead all peer libraries on validation MMC (Meta-Model Contribution) and Sharpe; numerai_corr trails by a small margin (~0.0006-0.0009).

  • alloygbm_morph is typically the fastest of the three AlloyGBM variants on this workload due to faster convergence under the EMA-shaped gain.

Piecewise-linear leaf variants:

  • leaf_model="linear" shows ~10× faster convergence on linearly-structured data, +3.5% RMSE on California Housing, and +1.75pp accuracy on Breast Cancer vs constant-leaf baselines, at a 2–8× per-round training overhead.

  • See docs/benchmarks/pl_trees_v1.md for the full report.

Metrics by task type

Task type

Metrics

Regression

RMSE, MAE, R2

Classification

Accuracy, Log-Loss, AUC

Ranking

NDCG@5, NDCG@10, NDCG

How to run the suite

python3 benchmarks/run_model_comparison.py --force-prepare

Focused regression comparison:

python3 benchmarks/run_model_comparison.py \
  --force-prepare \
  --scenarios california_housing bike_sharing dense_numeric panel_time_series dow_jones_financial

Classification only:

python3 benchmarks/run_model_comparison.py \
  --force-prepare \
  --scenarios breast_cancer synthetic_classification

Ranking only:

python3 benchmarks/run_model_comparison.py \
  --force-prepare \
  --scenarios synthetic_ranking california_ranking

Stage timing output

Per-record benchmark output includes:

  • input_adaptation_seconds

  • native_bridge_prepare_seconds

  • native_train_seconds

  • fit_seconds

  • predict_seconds

Use those timing columns to tell apart Python-side adaptation cost and native training cost.

Interpretation

The benchmark suite is designed to answer both of these questions:

  • Where is AlloyGBM already strong?

  • Where does it still lag established libraries?

The second question matters. These docs intentionally preserve that honesty.