Evaluation and Results

This page describes how NeuCo-Bench evaluates embeddings, how to configure the pipeline, how results are stored, and how to aggregate runs into a leaderboard.


Running the Benchmark

Run the benchmark on your embeddings with:

python main.py \
  --annotation_path path/to/annotation_folder \
  --submission_file path/to/submission_file.csv \
  --output_dir path/to/results \
  --config path/to/config.yaml \
  --method_name "your-method-name" \
  --phase "phase-name"

The last two flags are optional (see below).

Arguments:
- annotation_path — Folder containing task label files (<task>__<type>.csv).
- submission_file — Path to your embeddings CSV.
- output_dir — Destination for per-task reports, plots, and aggregated benchmark results.
- config — YAML file specifying cross-validation settings and logging options (see below).
- method_name — Optional name used to tag your run. Defaults to a name inferred from the embeddings CSV filename.
- phase — Optional tag to group runs (e.g. dev, ablation). Defaults to results.
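
The exact submission schema is not spelled out on this page; as a rough sketch, assuming one row per sample with a sample-ID column followed by the embedding values (column names here are illustrative, not prescribed by NeuCo-Bench), you could export embeddings like this:

import numpy as np
import pandas as pd

# Hypothetical layout: one row per sample, ID column first, then embedding dims.
# Verify the required column names against the NeuCo-Bench reference docs.
ids = [f"sample_{i:04d}" for i in range(100)]
embeddings = np.random.randn(100, 64).astype(np.float32)

df = pd.DataFrame(embeddings, columns=[f"dim_{j}" for j in range(embeddings.shape[1])])
df.insert(0, "id", ids)
df.to_csv("submission_file.csv", index=False)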

Output directory:

output_dir/<phase>/<method_name>_<timestamp>/

Evaluation Pipeline

NeuCo-Bench applies a task-wise linear-probing workflow in four stages; a simplified code sketch of the probing loop follows the list.

1. Data Loading & Preprocessing

  • Load embeddings and task annotations
  • Align and match sample IDs
  • Optional: standardize embeddings
  • Optional: normalize labels
  • Filter to samples present in both files

2. Cross‑Validated Linear Probing

  • Each task is evaluated independently
  • Repeated shuffle‑split cross‑validation
  • Train a linear model for each split
  • Evaluate on the held‑out validation set

3. Metric Computation

  • Regression: optimized with MSE and evaluated using R²
  • Classification: optimized for a binary objective and evaluated with F1
  • Compute mean and standard deviation across splits
  • Compute Q statistic (see below)
  • Optionally save plots

4. Result Writing & Aggregation

  • Save per‑task results (JSON)
  • Save run‑level summary
  • Optionally aggregate all runs under the same phase into a leaderboard
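
As a rough illustration of steps 2 and 3, the loop below evaluates one regression task with repeated shuffle-split cross-validation. It is a simplified sketch only: NeuCo-Bench's actual probe is configured via batch_size, epochs, and learning_rate (i.e., trained by gradient descent), whereas this stand-in uses a closed-form scikit-learn model:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit

def probe_regression_task(X, y, n_splits=5, seed=0):
    """Repeated shuffle-split linear probing; returns mean and std of R^2."""
    scores = []
    splitter = ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=seed)
    for train_idx, val_idx in splitter.split(X):
        model = Ridge().fit(X[train_idx], y[train_idx])  # fit a linear probe on this split
        scores.append(r2_score(y[val_idx], model.predict(X[val_idx])))  # held-out R^2
    return float(np.mean(scores)), float(np.std(scores))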

Configuration

A reference config is provided in configs/sample_config.yaml. The following options control the evaluation pipeline.

Required Parameters

  • batch_size — Batch size for linear probes
  • epochs — Training epochs
  • learning_rate — Optimizer learning rate
  • k_folds — Number of cross‑validation folds

Optional Parameters

  • embedding_dim — Expected embedding size; smaller vectors are zero‑padded.
  • standardize_embeddings — Standardize embeddings using global mean/std.
  • normalize_labels — Normalize regression labels to [0, 1].
  • enable_plots — Save loss curves and task‑specific plots.
  • task_filter — Specify tasks to evaluate (defaults to all available).
  • update_leaderboard — Aggregate results across runs.
  • output_fold_results — Also store per-fold metrics in the result JSON.
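
Put together, a config might look like the following; the values are illustrative, and configs/sample_config.yaml remains the authoritative reference:

batch_size: 256
epochs: 20
learning_rate: 0.001
k_folds: 5

# Optional settings
embedding_dim: 1024
standardize_embeddings: true
normalize_labels: true
enable_plots: false
update_leaderboard: true
output_fold_results: false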

Q-Score

To quantify per‑task stability and performance, NeuCo‑Bench reports a Q statistic:

Q = 2 * mean_score / (0.02 + std_dev)
  • mean_score — Average performance across splits
  • std_dev — Variability across splits

A higher Q indicates a method that is both strong and stable.
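
The same statistic in code (a minimal sketch; mean_score and std_dev are as defined above):

def q_score(mean_score: float, std_dev: float) -> float:
    # The 0.02 floor keeps the denominator bounded away from zero for
    # very stable methods; the factor of 2 is a fixed scaling constant.
    return 2 * mean_score / (0.02 + std_dev)

print(q_score(0.80, 0.02))  # -> 40.0: strong and stable across splits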


Results & Leaderboard

Per‑Task Results

Each evaluated task produces a <task>_result.json containing:
- q_stat (see above)
- mean_score (R²/F1)
- std_dev (R²/F1)

If enable_plots is set, the task directory also includes training and validation loss curves.
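
For illustration (not actual benchmark output), a regression task's result file could look like:

{
  "q_stat": 38.5,
  "mean_score": 0.77,
  "std_dev": 0.02
}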

Run‑Level Summary

Each run also generates a summary JSON that aggregates all task results for comparison.

Aggregating Multiple Runs and Ranking

You can aggregate all runs under the same phase into a leaderboard, either by running with update_leaderboard: true in your config or by calling the aggregation helper directly:

from evaluation.results import summarize_runs

# Collect every run stored under output_dir/<phase>/ into one leaderboard.
summarize_runs(output_dir=output_dir, phase=phase)

This produces a method-ranking table that summarizes all tasks, ordered by Q-Score.


Output Structure

results/
└── <phase>/
    └── <method>_<timestamp>/
        ├── <task_name>/
        │   ├── <task_name>_result.json
        │   ├── loss_train.png
        │   ├── loss_validation.png
        │   └── ...
        └── run_summary.json

Practical Notes

  • Keep embedding_dim, k_folds, and seeds consistent across experiments in the same phase.
  • Plots help diagnose learning instability.
  • Use different phase names to group experiments.
  • Use task_filter to evaluate only a subset of tasks for faster iteration:
task_filter:
  - cloud_cover
  - biomass_mean
  • Disable preprocessing if your custom data is already normalized, or to test raw embedding features explicitly:
standardize_embeddings: false
normalize_labels: false