Benchmarks

This page summarizes how AlloyGBM is benchmarked and what the current public results say.

Methodology

The benchmark runner lives in benchmarks/run_model_comparison.py and compares AlloyGBM against:

XGBoost
LightGBM
CatBoost

By default, it also runs additional AlloyGBM variants as separate arms for each task type:

alloygbm_morph – training_mode="morph" with constant LR
alloygbm_morph_cosine – training_mode="morph" with lr_schedule="warmup_cosine"
alloygbm_linear – leaf_model="linear" (piecewise-linear leaves) with auto training mode
alloygbm_morph_linear – leaf_model="linear" combined with training_mode="morph"
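The variant arms above differ only in a handful of estimator settings. The sketch below records that mapping. The parameter names and values are taken from the list above, while the `VARIANT_ARMS` dictionary and the `describe_arm` helper are hypothetical illustrations and not part of the runner. Settings the list describes only in words (constant LR, auto training mode) are left out rather than guessed.

```python
# Hypothetical mapping from arm name to the settings that distinguish it.
# Parameter names and values come from the list above; the dictionary itself
# is an illustration, not the runner's actual data structure.
VARIANT_ARMS = {
    # Constant learning rate: no schedule parameter is set here.
    "alloygbm_morph": {"training_mode": "morph"},
    "alloygbm_morph_cosine": {
        "training_mode": "morph",
        "lr_schedule": "warmup_cosine",
    },
    # Training mode is left at its default (described above as "auto").
    "alloygbm_linear": {"leaf_model": "linear"},
    "alloygbm_morph_linear": {"leaf_model": "linear", "training_mode": "morph"},
}


def describe_arm(name: str) -> str:
    """Render an arm's distinguishing settings as a keyword-argument string."""
    settings = VARIANT_ARMS[name]
    return ", ".join(f'{key}="{value}"' for key, value in sorted(settings.items()))


if __name__ == "__main__":
    for arm in VARIANT_ARMS:
        print(f"{arm}: {describe_arm(arm)}")
```

Keeping the differences in one table like this makes it easy to see that alloygbm_morph_linear is simply the union of the morph and linear settings.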

Use the runner’s --models flag to filter which arms run. Focused harnesses are also provided:

benchmarks/morph_report.py – quick MorphBoost-vs-peers comparison
benchmarks/numerai_benchmark.py – Numerai tournament benchmark with walk-forward CV, residualized targets, and Numerai-specific scoring
benchmarks/pl_trees_benchmark.py – piecewise-linear-leaf convergence-curve and λ-sweep analysis. Report at docs/benchmarks/pl_trees_v1.md.

The suite spans three task types with the following scenarios:

Regression: dense_numeric, california_housing, bike_sharing, panel_time_series, dow_jones_financial

Classification: breast_cancer, synthetic_classification

Ranking: synthetic_ranking, california_ranking

Profiles are evaluated across shallow, mid, and deep configurations so the comparison is not tied to a single parameter shape.
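To make the size of the grid concrete, the following sketch enumerates scenario/profile combinations. The scenario and profile names come from the lists above. The `benchmark_rows` helper is hypothetical, and it assumes every profile is evaluated on every scenario, which the runner may or may not do.

```python
from itertools import product

# Scenario names as listed above, grouped by task type.
SCENARIOS = {
    "regression": [
        "dense_numeric",
        "california_housing",
        "bike_sharing",
        "panel_time_series",
        "dow_jones_financial",
    ],
    "classification": ["breast_cancer", "synthetic_classification"],
    "ranking": ["synthetic_ranking", "california_ranking"],
}
PROFILES = ["shallow", "mid", "deep"]


def benchmark_rows(scenarios=SCENARIOS, profiles=PROFILES):
    """Return every (task_type, scenario, profile) combination.

    Assumption: a full cross product of scenarios and profiles.
    """
    return [
        (task, scenario, profile)
        for task, names in scenarios.items()
        for scenario, profile in product(names, profiles)
    ]


if __name__ == "__main__":
    rows = benchmark_rows()
    print(f"{len(rows)} scenario/profile rows per model arm")
```

Under that full-cross-product assumption, nine scenarios and three profiles give 27 rows per model arm.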

Current results

Regression:

AlloyGBM is strongest on panel_time_series
AlloyGBM is strong on dow_jones_financial
AlloyGBM is competitive but not leading on dense_numeric
AlloyGBM trails on california_housing and bike_sharing
AlloyGBM is the fastest trainer on most scenario/profile rows

Classification:

AlloyGBM is competitive with established libraries on accuracy, log-loss, and AUC across breast_cancer and synthetic_classification

Ranking:

AlloyGBM competes on synthetic_ranking and california_ranking using native LambdaMART, evaluated via NDCG@5, NDCG@10, and full NDCG

MorphBoost variants:

On Numerai-style residualized regression at scale (~2.7M rows × 42 features × 5000 rounds), AlloyGBM’s MorphBoost variants lead all peer libraries on validation MMC (Meta-Model Contribution) and Sharpe; numerai_corr trails by a small margin (~0.0006-0.0009).
alloygbm_morph is typically the fastest of the three AlloyGBM variants on this workload due to faster convergence under the EMA-shaped gain.

Piecewise-linear leaf variants:

leaf_model="linear" shows ~10× faster convergence on linearly structured data, a 3.5% RMSE improvement on California Housing, and a 1.75pp accuracy gain on Breast Cancer vs constant-leaf baselines, at the cost of a 2–8× per-round training overhead.
See docs/benchmarks/pl_trees_v1.md for the full report.

Metrics by task type

Task type	Metrics
Regression	RMSE, MAE, R2
Classification	Accuracy, Log-Loss, AUC
Ranking	NDCG@5, NDCG@10, NDCG
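For readers less familiar with the ranking metrics in the table, here is a minimal pure-Python sketch of NDCG@k. It assumes the common exponential gain (2^rel − 1) with a log2 position discount, and it returns 0.0 when no item is relevant. The source does not state which NDCG convention the benchmark uses, so treat this as an illustration rather than the runner's scoring code.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """Discounted cumulative gain over the first k items (all items if k is None)."""
    top = relevances if k is None else relevances[:k]
    # rank is 0-based, so the discount for the first item is log2(2) = 1.
    return sum((2.0 ** rel - 1.0) / math.log2(rank + 2) for rank, rel in enumerate(top))


def ndcg(relevances_in_predicted_order: Sequence[float], k: Optional[int] = None) -> float:
    """NDCG@k for one query; pass k=None for full-list NDCG.

    The input holds the true relevance labels, ordered by predicted score
    (highest predicted score first).
    """
    ideal = sorted(relevances_in_predicted_order, reverse=True)
    ideal_dcg = dcg(ideal, k)
    if ideal_dcg == 0.0:
        # Convention chosen for this sketch: no relevant items scores 0.0.
        return 0.0
    return dcg(relevances_in_predicted_order, k) / ideal_dcg


if __name__ == "__main__":
    labels = [1, 3, 0, 2, 0, 1]  # made-up relevance labels in predicted order
    for cutoff in (5, 10, None):
        print(cutoff, round(ndcg(labels, cutoff), 4))
```

A per-dataset score would typically be the mean of this per-query value across all query groups.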

How to run the suite

python3 benchmarks/run_model_comparison.py --force-prepare

Focused regression comparison:

python3 benchmarks/run_model_comparison.py \
  --force-prepare \
  --scenarios california_housing bike_sharing dense_numeric panel_time_series dow_jones_financial

Classification only:

python3 benchmarks/run_model_comparison.py \
  --force-prepare \
  --scenarios breast_cancer synthetic_classification

Ranking only:

python3 benchmarks/run_model_comparison.py \
  --force-prepare \
  --scenarios synthetic_ranking california_ranking
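When scripting these runs, the three task-specific invocations above can be generated from one table. The sketch below only assembles the argument list, using the script path, the --force-prepare flag, and the --scenarios flag shown above. The `build_command` helper and the `SCENARIOS_BY_TASK` dictionary are hypothetical names introduced for illustration.

```python
import shlex

RUNNER = "benchmarks/run_model_comparison.py"

# Scenario groups exactly as used in the commands above.
SCENARIOS_BY_TASK = {
    "regression": [
        "california_housing",
        "bike_sharing",
        "dense_numeric",
        "panel_time_series",
        "dow_jones_financial",
    ],
    "classification": ["breast_cancer", "synthetic_classification"],
    "ranking": ["synthetic_ranking", "california_ranking"],
}


def build_command(task_type: str, force_prepare: bool = True) -> list:
    """Build the argv list for a task-specific benchmark run (does not execute it)."""
    if task_type not in SCENARIOS_BY_TASK:
        raise ValueError(f"unknown task type: {task_type!r}")
    cmd = ["python3", RUNNER]
    if force_prepare:
        cmd.append("--force-prepare")
    cmd.append("--scenarios")
    cmd.extend(SCENARIOS_BY_TASK[task_type])
    return cmd


if __name__ == "__main__":
    for task in SCENARIOS_BY_TASK:
        # shlex.join renders the list as a copy-pasteable shell command.
        print(shlex.join(build_command(task)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the run.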

Stage timing output

Per-record benchmark output includes:

input_adaptation_seconds
native_bridge_prepare_seconds
native_train_seconds
fit_seconds
predict_seconds

Use these timing columns to separate Python-side adaptation cost from native training cost.
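One way to read the timing columns is as shares of total fit time. The sketch below assumes that input_adaptation_seconds, native_bridge_prepare_seconds, and native_train_seconds are all components of fit_seconds, which the field names suggest but this page does not confirm. The `stage_shares` helper is hypothetical and the record values are invented for illustration.

```python
# Stage fields assumed (not confirmed) to be sub-intervals of fit_seconds.
STAGE_FIELDS = (
    "input_adaptation_seconds",
    "native_bridge_prepare_seconds",
    "native_train_seconds",
)


def stage_shares(record: dict) -> dict:
    """Return each stage's fraction of fit_seconds, plus the unattributed remainder."""
    fit = record["fit_seconds"]
    if fit <= 0:
        raise ValueError("fit_seconds must be positive")
    shares = {name: record[name] / fit for name in STAGE_FIELDS}
    # Whatever the listed stages do not account for (Python overhead, etc.).
    shares["other"] = 1.0 - sum(shares.values())
    return shares


if __name__ == "__main__":
    # Invented example record; real values come from the benchmark output.
    example = {
        "input_adaptation_seconds": 0.5,
        "native_bridge_prepare_seconds": 0.25,
        "native_train_seconds": 1.0,
        "fit_seconds": 2.0,
        "predict_seconds": 0.1,
    }
    for stage, share in stage_shares(example).items():
        print(f"{stage}: {share:.1%}")
```

A large input_adaptation_seconds share points at Python-side data conversion rather than the native trainer as the place to optimize.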

Interpretation

The benchmark suite is designed to answer both of these questions:

Where is AlloyGBM already strong?
Where does it still lag established libraries?

The second question matters: these docs intentionally report where AlloyGBM lags rather than omitting it.