Benchmarks
This page summarizes how AlloyGBM is benchmarked and what the current public results say.
Methodology
The benchmark runner lives in benchmarks/run_model_comparison.py and
compares AlloyGBM against:
XGBoost
LightGBM
CatBoost
By default, it also includes additional AlloyGBM variants as separate arms per task type:
alloygbm_morph – training_mode="morph" with constant LR
alloygbm_morph_cosine – training_mode="morph" with lr_schedule="warmup_cosine"
alloygbm_linear – leaf_model="linear" (piecewise-linear leaves) with auto training mode
alloygbm_morph_linear – leaf_model="linear" combined with training_mode="morph"
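The variant arms differ only in a few keyword arguments. As a rough sketch, the mapping below restates the list above as parameter overrides. The names `VARIANT_ARMS` and `overrides_for` are hypothetical and not part of the runner; only the keyword names and values quoted on this page are taken from the source.

```python
# Hypothetical mapping from arm name to the keyword overrides described above.
# Only keys and values quoted on this page are used; the runner's real
# internal structure may look different.
VARIANT_ARMS = {
    "alloygbm_morph": {"training_mode": "morph"},  # constant learning rate
    "alloygbm_morph_cosine": {
        "training_mode": "morph",
        "lr_schedule": "warmup_cosine",
    },
    "alloygbm_linear": {"leaf_model": "linear"},  # training mode left on auto
    "alloygbm_morph_linear": {"leaf_model": "linear", "training_mode": "morph"},
}


def overrides_for(arm: str) -> dict:
    """Return a copy of the keyword overrides for one variant arm."""
    return dict(VARIANT_ARMS[arm])


if __name__ == "__main__":
    for name in VARIANT_ARMS:
        print(name, overrides_for(name))
```

Keeping the overrides as plain data makes it easy to see at a glance which arms share `training_mode="morph"` and which ones switch the leaf model.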
Use the runner’s --models flag to filter which arms run. Focused
harnesses are also provided:
benchmarks/morph_report.py – quick MorphBoost-vs-peers comparison
benchmarks/numerai_benchmark.py – Numerai tournament benchmark with walk-forward CV, residualized targets, and Numerai-specific scoring
benchmarks/pl_trees_benchmark.py – piecewise-linear-leaf convergence-curve and λ-sweep analysis. Report at docs/benchmarks/pl_trees_v1.md.
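For readers unfamiliar with walk-forward CV, the sketch below shows one common expanding-window form over ordered eras. It is a generic illustration and not the logic in benchmarks/numerai_benchmark.py; the function name `walk_forward_splits` and its parameters `n_folds` and `min_train` are assumptions made for this example.

```python
from typing import Iterator, List, Sequence, Tuple


def walk_forward_splits(
    eras: Sequence[int], n_folds: int, min_train: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_eras, valid_eras) pairs with an expanding training window.

    Each validation block comes strictly after its training eras, so no fold
    trains on data from its own future.
    """
    ordered = sorted(set(eras))
    remaining = len(ordered) - min_train
    if n_folds < 1 or min_train < 1 or remaining < n_folds:
        raise ValueError("not enough eras for the requested folds")
    block = remaining // n_folds  # validation eras per fold
    for fold in range(n_folds):
        start = min_train + fold * block
        # The last fold absorbs any leftover eras.
        stop = len(ordered) if fold == n_folds - 1 else start + block
        yield ordered[:start], ordered[start:stop]


if __name__ == "__main__":
    for train, valid in walk_forward_splits(range(1, 11), n_folds=3, min_train=4):
        print("train", train, "valid", valid)
```

The expanding window is only one variant; a fixed-length rolling window or a gap between train and validation eras are equally plausible choices for a real harness.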
The suite spans three task types with the following scenarios:
Regression: dense_numeric, california_housing, bike_sharing,
panel_time_series, dow_jones_financial
Classification: breast_cancer, synthetic_classification
Ranking: synthetic_ranking, california_ranking
Each scenario is evaluated under shallow, mid, and deep profiles so the comparison is not tied to a single parameter shape.
Current results
Regression:
AlloyGBM is strongest on panel_time_series
AlloyGBM is strong on dow_jones_financial
AlloyGBM is competitive but not leading on dense_numeric
AlloyGBM trails on california_housing and bike_sharing
AlloyGBM is typically the fastest trainer on most scenario/profile rows
Classification:
AlloyGBM is competitive with established libraries on accuracy, log-loss, and AUC across breast_cancer and synthetic_classification
Ranking:
AlloyGBM competes on synthetic_ranking and california_ranking using native LambdaMART, evaluated via NDCG@5, NDCG@10, and full NDCG
MorphBoost variants:
On Numerai-style residualized regression at scale (~2.7M rows × 42 features × 5000 rounds), AlloyGBM’s MorphBoost variants lead all peer libraries on validation MMC (Meta-Model Contribution) and Sharpe; numerai_corr trails by a small margin (~0.0006-0.0009).
alloygbm_morph is typically the fastest of the three AlloyGBM variants on this workload due to faster convergence under the EMA-shaped gain.
Piecewise-linear leaf variants:
leaf_model="linear" shows ~10× faster convergence on linearly-structured data, +3.5% RMSE on California Housing, and +1.75pp accuracy on Breast Cancer vs constant-leaf baselines, at a 2–8× per-round training overhead.
See docs/benchmarks/pl_trees_v1.md for the full report.
Metrics by task type
| Task type | Metrics |
|---|---|
| Regression | RMSE, MAE, R2 |
| Classification | Accuracy, Log-Loss, AUC |
| Ranking | NDCG@5, NDCG@10, NDCG |
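
To make the difference between NDCG@5, NDCG@10, and full NDCG concrete, here is a minimal single-query sketch in plain Python. It assumes the exponential gain `2^rel - 1` with a `log2` position discount; the benchmark suite may use a different gain convention (for example the raw label as gain) or a library implementation, so treat the function names `dcg` and `ndcg` as illustrative only.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """Discounted cumulative gain over the first k items (all items if k is None)."""
    top = relevances if k is None else relevances[:k]
    # Assumed convention: exponential gain (2^rel - 1), log2 position discount.
    return sum(
        (2.0 ** rel - 1.0) / math.log2(pos + 2) for pos, rel in enumerate(top)
    )


def ndcg(
    labels: Sequence[float], scores: Sequence[float], k: Optional[int] = None
) -> float:
    """NDCG for one query: DCG of the predicted order divided by the ideal DCG."""
    # Order the true labels by descending predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    by_score = [label for _, label in ranked]
    ideal = sorted(labels, reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(by_score, k) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    labels = [3, 2, 0, 1]
    scores = [0.9, 0.8, 0.7, 0.1]
    print("NDCG@2:", ndcg(labels, scores, k=2))
    print("full NDCG:", ndcg(labels, scores))
```

With a cutoff `k`, only the top `k` positions count, so a model can score perfectly at a small `k` while still misordering items further down the list. Full NDCG (`k=None`) penalizes those later mistakes as well.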
How to run the suite
python3 benchmarks/run_model_comparison.py --force-prepare
Focused regression comparison:
python3 benchmarks/run_model_comparison.py \
  --force-prepare \
  --scenarios california_housing bike_sharing dense_numeric panel_time_series dow_jones_financial
Classification only:
python3 benchmarks/run_model_comparison.py \
  --force-prepare \
  --scenarios breast_cancer synthetic_classification
Ranking only:
python3 benchmarks/run_model_comparison.py \
  --force-prepare \
  --scenarios synthetic_ranking california_ranking
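When scripting several runs, it can help to build the argument list in Python. The helper below is a hypothetical sketch; `build_command` is not part of the repository. The `--force-prepare` and `--scenarios` forms mirror the commands above. Passing several space-separated names to `--models` is an assumption modeled on `--scenarios`, so check the runner's argument parser before relying on it.

```python
import sys
from typing import List, Optional, Sequence

RUNNER = "benchmarks/run_model_comparison.py"


def build_command(
    scenarios: Optional[Sequence[str]] = None,
    models: Optional[Sequence[str]] = None,
    force_prepare: bool = True,
) -> List[str]:
    """Build an argv list for the benchmark runner without executing it."""
    cmd = [sys.executable, RUNNER]
    if force_prepare:
        cmd.append("--force-prepare")
    if scenarios:
        cmd += ["--scenarios", *scenarios]
    if models:
        # Assumption: --models takes space-separated arm names like --scenarios.
        cmd += ["--models", *models]
    return cmd


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run(..., check=True)
    # from the repository root to actually execute it.
    ranking = build_command(scenarios=["synthetic_ranking", "california_ranking"])
    print(" ".join(ranking))
```

Returning a list rather than a shell string avoids quoting problems and lets the same command be logged next to the results it produced.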
Stage timing output
Per-record benchmark output includes:
input_adaptation_seconds
native_bridge_prepare_seconds
native_train_seconds
fit_seconds
predict_seconds
Use these timing columns to separate Python-side adaptation cost from native training cost.
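One way to read these columns is as shares of the total fit time. The sketch below assumes that `fit_seconds` is the end-to-end fit time and that the three stage columns are measured inside it; this page does not state that relationship, so verify it against the runner's output before drawing conclusions. The helper name `stage_shares` and the example numbers are invented for illustration and are not benchmark results.

```python
from typing import Dict

STAGE_KEYS = (
    "input_adaptation_seconds",
    "native_bridge_prepare_seconds",
    "native_train_seconds",
)


def stage_shares(record: Dict[str, float]) -> Dict[str, float]:
    """Return each stage's share of fit_seconds for one benchmark record."""
    fit = record["fit_seconds"]
    if fit <= 0:
        raise ValueError("fit_seconds must be positive")
    # Assumption: each stage is a sub-interval of the overall fit time.
    return {key: record[key] / fit for key in STAGE_KEYS}


if __name__ == "__main__":
    # Made-up numbers for illustration only; not a benchmark result.
    example = {
        "input_adaptation_seconds": 0.5,
        "native_bridge_prepare_seconds": 0.25,
        "native_train_seconds": 4.0,
        "fit_seconds": 5.0,
        "predict_seconds": 0.125,
    }
    for stage, share in stage_shares(example).items():
        print(f"{stage}: {share:.0%} of fit time")
```

A high `input_adaptation_seconds` share points at data conversion on the Python side, while a high `native_train_seconds` share means the time is spent in the training core itself.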
Interpretation
The benchmark suite is designed to answer both of these questions:
Where is AlloyGBM already strong?
Where does it still lag established libraries?
The second question matters as much as the first, so these docs intentionally report the scenarios where AlloyGBM trails alongside the ones where it leads.