Evaluation and Results
This page describes how NeuCo-Bench evaluates embeddings, how to configure the pipeline, how results are stored, and how to aggregate runs into a leaderboard.
Run the benchmark on your embeddings with:
python main.py \
--annotation_path path/to/annotation_folder \
--submission_file path/to/submission_file.csv \
--output_dir path/to/results \
--config path/to/config.yaml \
--method_name "your-method-name" \
--phase "phase-name"
Arguments:
- annotation_path — Folder containing task label files (<task>__<type>.csv).
- submission_file — Path to your embeddings CSV.
- output_dir — Destination for per-task reports, plots, and aggregated benchmark results.
- config — YAML file specifying cross-validation settings and logging options (see below).
- method_name — Optional name used to tag your run. Defaults to a name inferred from the embeddings CSV filename.
- phase — Optional tag to group runs (e.g. dev, ablation). Defaults to results.
Output directory:
output_dir/<phase>/<method_name>_<timestamp>/
Evaluation Pipeline
NeuCo-Bench applies a task-wise linear‑probing workflow (see the sketch after this overview).
1. Data Loading & Preprocessing
- Load embeddings and task annotations
- Align and match sample IDs
- Optional: standardize embeddings
- Optional: normalize labels
- Filter to samples present in both files
2. Cross‑Validated Linear Probing
- Each task is evaluated independently
- Repeated shuffle‑split cross‑validation
- Train a linear model for each split
- Evaluate on the held‑out validation set
3. Metric Computation
- Regression: optimized with MSE and evaluated using R²
- Classification: optimized for a binary objective and evaluated with F1
- Compute mean and standard deviation across splits
- Compute Q statistic (see below)
- Optionally save plots
4. Result Writing & Aggregation
- Save per‑task results (JSON)
- Save run‑level summary
- Optionally aggregate all runs under the same phase into a leaderboard
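The sketch below illustrates steps 1–3 for a single regression task. It is a minimal sketch, not the repository's implementation: the file names, the "sample_id"/"label" column schema, the 80/20 split ratio, and the use of scikit-learn's Ridge in place of the gradient-trained probe configured via batch_size/epochs/learning_rate are all assumptions.

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit

# Step 1: load, align, and (optionally) preprocess.
# "sample_id" and "label" are assumed column names, not the actual schema.
embeddings = pd.read_csv("path/to/submission_file.csv", index_col="sample_id")
labels = pd.read_csv("path/to/annotation_folder/biomass_mean__regression.csv",
                     index_col="sample_id")

common_ids = embeddings.index.intersection(labels.index)  # samples in both files
X = embeddings.loc[common_ids].to_numpy(dtype=float)
y = labels.loc[common_ids, "label"].to_numpy(dtype=float)

X = (X - X.mean()) / (X.std() + 1e-8)    # standardize_embeddings (global mean/std)
y = (y - y.min()) / (y.max() - y.min())  # normalize_labels to [0, 1]

# Step 2: repeated shuffle-split cross-validation with a linear probe.
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
scores = []
for train_idx, val_idx in splitter.split(X):
    probe = Ridge().fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[val_idx], probe.predict(X[val_idx])))

# Step 3: reduce per-split scores to summary statistics.
scores = np.asarray(scores)
mean_score, std_dev = scores.mean(), scores.std()
q_stat = mean_score / (0.02 + std_dev) * 2  # Q statistic (defined below)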
Configuration
A reference config is provided in configs/sample_config.yaml. The following options control the evaluation pipeline.
Required Parameters
- batch_size — Batch size for linear probes
- epochs — Training epochs
- learning_rate — Optimizer learning rate
- k_folds — Number of cross‑validation folds
Optional Parameters
- embedding_dim — Expected embedding size; smaller vectors are zero‑padded.
- standardize_embeddings — Standardize embeddings using global mean/std.
- normalize_labels — Normalize regression labels to [0, 1].
- enable_plots — Save loss curves and task‑specific plots.
- task_filter — Specify tasks to evaluate (defaults to all available).
- update_leaderboard — Aggregate results across runs.
- output_fold_results — Also store per-fold metrics in the result JSON.
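A config combining these options might look like the following (all values are illustrative, not the defaults shipped in configs/sample_config.yaml):

batch_size: 256
epochs: 20
learning_rate: 0.001
k_folds: 5

embedding_dim: 1024
standardize_embeddings: true
normalize_labels: true
enable_plots: false
task_filter:
  - cloud_cover
  - biomass_mean
update_leaderboard: true
output_fold_results: false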
Q-Score
To quantify per‑task stability and performance, NeuCo‑Bench reports a Q statistic:
Q = mean_score / (0.02 + std_dev) * 2
- mean_score — Average performance across splits
- std_dev — Variability across splits
A higher Q indicates a method that is both strong and stable.
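As a quick worked example (hypothetical scores; whether NeuCo-Bench uses the population or sample standard deviation is an assumption here):

import numpy as np

scores = np.array([0.71, 0.69, 0.73, 0.70, 0.72])  # per-split R² (hypothetical)
q = scores.mean() / (0.02 + scores.std()) * 2      # ≈ 41.6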
Results & Leaderboard
Per‑Task Results
Each evaluated task produces a <task>_result.json containing:
- q_stat (see above)
- mean_score (R²/F1)
- std_dev (R²/F1)
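An illustrative (hypothetical, rounded) file for a regression task, consistent with the Q formula above:

{
  "q_stat": 41.6,
  "mean_score": 0.71,
  "std_dev": 0.0141
}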
If enable_plots is set, the task directory also includes loss curves.
Run‑Level Summary
Each run also generates a summary JSON that aggregates all task results for comparison.
Aggregating and Ranking Multiple Runs
You can aggregate all runs under the same phase into a leaderboard, either by setting update_leaderboard: true in your config or by calling the helper directly:
from evaluation.results import summarize_runs
summarize_runs(output_dir="path/to/results", phase="phase-name")
This produces a method‑ranking table summarizing all tasks, ranked by Q-Score.
Output Structure
results/
└── <phase>/
└── <method>_<timestamp>/
├── <task_name>/
│ ├── <task_name>_result.json
│ ├── loss_train.png
│ ├── loss_validation.png
│ └── ...
└── run_summary.json
Practical Notes
- Keep embedding_dim, k_folds, and seeds consistent across experiments in the same phase.
- Plots help diagnose learning instability.
- Use different phase names to group experiments.
- Set task_filter for faster subset tests:
task_filter:
- cloud_cover
- biomass_mean
- Disable preprocessing when your data is already normalized, or to explicitly test raw embedding features:
standardize_embeddings: false
normalize_labels: false