
Multi-Model x Multi-Shell SDLC Benchmark Execution Plan

Harness Engineering execution plan: this is an agent-executable scenario that shows how the control plane coordinates environment, workflow, guardrails, and feedback loops rather than a one-off agent call.

Agent Collaboration: This document is an agent-executable plan. Open this project in an AI coding agent (Claude Code, OpenCode, Codex, etc.) — the agent reads this plan and orchestrates other agents via the orchestrator CLI to collaboratively complete the task, from resource deployment and execution to result evaluation, fully autonomously.

1. Benchmark Objective

Holding the task goal and Workflow constant (a controlled-variable design), substitute different LLM models and Agent shells to evaluate performance differences along two independent dimensions:

  • Model dimension: Fix the shell (e.g., Claude Code), swap the model (Opus / Sonnet / GLM-5 / Gemini / GPT-5.4), and observe how model capability affects task completion and code quality
  • Shell dimension: Fix the model (e.g., Opus 4.6), swap the shell (Claude Code / OpenCode / Codex / Gemini CLI), and observe how the shell toolchain affects execution efficiency and results

2. Variable Matrix

| ID | Shell | Model | Agent Manifest | SecretStore |
|----|-------|-------|----------------|-------------|
| A1 | Claude Code | Opus 4.6 | fixtures/benchmarks/agent-claude-opus.yaml | fixtures/benchmarks/secrets-claude-opus.yaml |
| B1 | OpenCode | Opus 4.6 | fixtures/benchmarks/agent-opencode-opus.yaml | fixtures/benchmarks/secrets-claude-opus.yaml |
| C1 | OpenCode | GLM-5 | fixtures/benchmarks/agent-opencode-glm5.yaml | fixtures/benchmarks/secrets-glm5.yaml |
| D1 | Gemini CLI | Gemini 3.1 Pro | fixtures/benchmarks/agent-gemini-pro.yaml | fixtures/benchmarks/secrets-gemini.yaml |
| E1 | Codex CLI | GPT-5.4 | fixtures/benchmarks/agent-codex-gpt54.yaml | fixtures/benchmarks/secrets-openai.yaml |

Controlled Variable Analysis Groups:

| Comparison | Fixed Variable | Changed Variable | Observation Target |
|------------|----------------|------------------|--------------------|
| A1 vs B1 | Opus 4.6 | Claude Code vs OpenCode | Shell difference |
| B1 vs C1 | OpenCode | Opus 4.6 vs GLM-5 | Model difference |
| A1 vs D1 vs E1 | (none) | All combinations | Overall difference |

Extensible as needed: simply create new Agent + SecretStore manifests.

3. Prerequisites

The executor (Agent) should first verify the following conditions:

  • orchestrator --version and orchestratord --version are executable
  • The daemon is running (orchestrator daemon status); if not, start it: orchestratord --foreground --workers 2 &
  • The shells listed in the matrix are installed (claude --version, opencode --version, gemini --version, codex --version, etc.)
  • API keys in fixtures/benchmarks/secrets-*.yaml have been filled in (not <placeholder> values)

4. Unified Task Goal

All combinations use the exact same goal (controlled variable):

Implement a retry mechanism for gRPC client connections with exponential backoff and configurable max retries. Add unit tests.
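As a reference point for evaluators, the backoff schedule the goal describes can be sketched in a few lines. This is a minimal, language-agnostic illustration; `retry_delays` and its parameters are hypothetical names, not the benchmarked codebase's API.

```python
# Minimal sketch of exponential backoff with a configurable retry cap.
# The function name and parameters are illustrative only.
def retry_delays(base_ms: int, max_retries: int) -> list[int]:
    """Delay in milliseconds before each retry attempt, doubling every time."""
    return [base_ms * (2 ** attempt) for attempt in range(max_retries)]

print(retry_delays(100, 4))  # [100, 200, 400, 800]
```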

5. Execution Flow (Per Combination)

For each combination ID in the matrix, follow these steps:

5.1 Environment Preparation

```bash
cd "$ORCHESTRATOR_ROOT"
git stash --include-untracked || true
```

5.2 Deploy Resources

```bash
orchestrator apply -f fixtures/benchmarks/<secret_file> --project benchmark
orchestrator apply -f fixtures/benchmarks/<agent_file> --project benchmark
orchestrator apply -f fixtures/benchmarks/workflow-benchmark-bootstrap.yaml --project benchmark
```

Verify:

```bash
orchestrator get agents --project benchmark -o json
orchestrator get workflows --project benchmark -o json
```

5.3 Create Task

```bash
orchestrator task create \
  --project benchmark \
  --workflow benchmark-bootstrap \
  --goal "Implement a retry mechanism for gRPC client connections with exponential backoff and configurable max retries. Add unit tests."
```

Record the returned task_id.

5.4 Monitor Until Completion

```bash
orchestrator task watch <task_id> --timeout 1800
```

If it times out or fails, record the status and proceed to the next combination.

5.5 Collect Results

```bash
orchestrator task info <task_id> -o json
orchestrator event list --task <task_id> -o json
orchestrator task items <task_id> -o json
orchestrator task trace <task_id> --json
```

5.6 Save Artifact Snapshot

```bash
mkdir -p results   # ensure the artifact directory exists before the first run
git diff > results/<combo_id>-diff.patch
git diff --stat > results/<combo_id>-diffstat.txt
```

5.7 Restore Environment

```bash
git checkout -- .
git clean -fd -e results   # -e keeps the collected artifacts in results/
git stash pop || true
```

Repeat steps 5.1–5.7 until all combinations have been executed.
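This per-combination loop can be sketched as command generation. A hypothetical Python helper, with manifest pairs taken from the matrix in section 2 (a real run would shell these out, e.g. via subprocess):

```python
FIXTURES = "fixtures/benchmarks"

# Manifest pairs from the section 2 matrix (trimmed subset).
COMBOS = {
    "C1": ("agent-opencode-glm5.yaml", "secrets-glm5.yaml"),
    "D1": ("agent-gemini-pro.yaml", "secrets-gemini.yaml"),
    "E1": ("agent-codex-gpt54.yaml", "secrets-openai.yaml"),
}

def deploy_commands(combo_id: str, project: str = "benchmark") -> list[str]:
    """The three `orchestrator apply` calls from step 5.2 for one combination."""
    agent, secret = COMBOS[combo_id]
    return [
        f"orchestrator apply -f {FIXTURES}/{secret} --project {project}",
        f"orchestrator apply -f {FIXTURES}/{agent} --project {project}",
        f"orchestrator apply -f {FIXTURES}/workflow-benchmark-bootstrap.yaml --project {project}",
    ]

for cmd in deploy_commands("C1"):
    print(cmd)
```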

6. Evaluation Phase

After all combinations have been executed, the host agent (the agent executing this plan — not the target agents being benchmarked) performs a unified evaluation of all artifacts in the results/ directory.

Evaluator independence: The workflow includes an in-loop benchmark_eval step executed by each target agent as a self-check. However, the authoritative six-dimension scores in §6.2 are produced by the host agent examining collected artifacts (diffs, event logs, task traces) post-hoc. This separation ensures the evaluator is independent of the evaluated agent — the same principle as having a referee who isn't also a player.

6.1 Quantitative Metrics (Extracted from JSON Results)

| Metric | Data Source |
|--------|-------------|
| Completion status | task info → status (completed/failed) |
| Total duration | task info → started_at to completed_at |
| Execution cycles | event list → cycle_completed event count |
| Step success rate | event list → proportion of step_finished with success: true |
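These metrics can be extracted mechanically from the collected JSON. A sketch, assuming the field names listed above (status, started_at, completed_at, type, success) appear at the top level of the orchestrator's output; the actual schema may differ:

```python
import json
from datetime import datetime

def extract_metrics(task_info_json: str, events_json: str) -> dict:
    """Quantitative metrics from `task info` and `event list` output.
    Field names are assumptions based on section 6.1, not a confirmed schema."""
    info = json.loads(task_info_json)
    events = json.loads(events_json)
    start = datetime.fromisoformat(info["started_at"].replace("Z", "+00:00"))
    end = datetime.fromisoformat(info["completed_at"].replace("Z", "+00:00"))
    steps = [e for e in events if e["type"] == "step_finished"]
    return {
        "status": info["status"],
        "duration_s": (end - start).total_seconds(),
        "cycles": sum(e["type"] == "cycle_completed" for e in events),
        "step_success_rate": (
            sum(bool(e.get("success")) for e in steps) / len(steps) if steps else 0.0
        ),
    }

m = extract_metrics(
    '{"status": "completed", "started_at": "2026-04-05T10:00:00Z",'
    ' "completed_at": "2026-04-05T10:05:14Z"}',
    '[{"type": "step_finished", "success": true},'
    ' {"type": "step_finished", "success": false},'
    ' {"type": "cycle_completed"}]',
)
print(m)  # duration_s 314.0, cycles 1, step_success_rate 0.5
```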

6.2 Six-Dimension Evaluation (Host Agent Post-Hoc)

The host agent applies the diff from each combination, runs git diff --stat, build/test/lint commands, and inspects the collected JSON artifacts. If the diff is empty, Task Completion scores 0 and remaining dimensions are skipped.

| Dimension | Score | Scoring Rubric |
|-----------|-------|----------------|
| Task Completion | 0-10 | 0 = no code changes; 5 = partial implementation missing key requirements; 10 = all requirements addressed with working code |
| Code Quality | 0-10 | 0 = syntax errors or broken build; 5 = compiles but non-idiomatic; 10 = correct, idiomatic, concise |
| Test Coverage | 0-10 | 0 = no tests; 5 = tests exist but miss edge cases; 10 = comprehensive unit tests covering new/changed code |
| Execution Efficiency | 0-10 | Based on wall time from task info timestamps: 0 = timeout (>30min); 5 = 10-15min; 10 = under 5min |
| Step Success Rate | 0-10 | From event list JSON: proportion of step_finished events with success: true, linearly mapped to 0-10 |
| Engineering Standards | 0-10 | 0 = lint failures / missing error handling; 5 = compiles clean; 10 = error handling, doc comments, safety annotations, zero warnings |

The host agent applies each patch, runs cargo check, cargo test, and cargo clippy, then outputs a six-dimension JSON score (total 0-60).

Data sources: Execution Efficiency and Step Success Rate are derived from quantitative data (timestamps, event logs), not subjective judgment. The remaining four dimensions are assessed by the host agent after running the project's toolchain against the actual code output.
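A sketch of the two mechanical mappings. The success-rate mapping is stated directly in the rubric; the efficiency scores between the rubric's stated anchor points (under 5min, 10-15min, >30min) are an interpolation assumption, not taken from this document:

```python
def success_rate_score(rate: float) -> float:
    """Step Success Rate: proportion of successful step_finished events,
    linearly mapped to 0-10 as stated in the section 6.2 rubric."""
    return round(10 * rate, 1)

def efficiency_score(duration_s: float) -> int:
    """Execution Efficiency using the rubric's anchors; the 5-10 and
    15-30 minute bands are unspecified, so those values are assumptions."""
    minutes = duration_s / 60
    if minutes > 30:
        return 0   # timeout per the rubric
    if minutes < 5:
        return 10  # under 5min per the rubric
    if 10 <= minutes <= 15:
        return 5   # 10-15min per the rubric
    # interpolation assumption for the unstated bands
    return 7 if minutes < 10 else 2
```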

6.3 Output Evaluation Report

Output the comparison results in markdown table + radar chart format:

```markdown
| Combo | Shell | Model | Status | Duration | Cycles | Completion | Quality | Tests | Efficiency | Success | Standards | Total(/60) | Notes |
|-------|-------|-------|--------|----------|--------|------------|---------|-------|------------|---------|-----------|------------|-------|
| A1    | Claude Code  | Opus 4.6      | | | | | | | | | | | |
| B1    | OpenCode     | Opus 4.6      | | | | | | | | | | | |
| C1    | OpenCode     | GLM-5         | | | | | | | | | | | |
| D1    | Gemini CLI   | Gemini 3.1 Pro| | | | | | | | | | | |
| E1    | Codex CLI    | GPT-5.4       | | | | | | | | | | | |
```

Finally, provide a summary analysis covering:

  1. Model dimension (comparing B1 vs C1: same shell OpenCode, different models): impact of model capability on results
  2. Shell dimension (comparing A1 vs B1: same model Opus, different shells): impact of toolchain on execution efficiency
  3. Overall ranking: six-dimension radar chart comparison, recommendation ranking of all combinations
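The overall ranking in item 3 reduces to a sort on the six-dimension totals. An illustrative snippet using the totals from the sample run in section 8.3:

```python
# Sort combos by total score, descending; the data shape is illustrative.
scores = {"C1": 20, "D1": 26, "E1": 44}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['E1', 'D1', 'C1']
```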

7. Constraints

  • Controlled variables: Change only one variable at a time (model or shell); the workflow and goal remain unchanged
  • Environment isolation: Restore a clean git state before and after each combination execution
  • Timeout protection: 30-minute timeout per task
  • Cost awareness: Opus is approximately 5x the cost of Sonnet; confirm budget before batch execution
  • Reproducibility: All manifests are versioned in fixtures/benchmarks/

8. Example: One-Click Benchmark Execution

8.1 User Prerequisites (Manual Steps)

Before handing the prompt to your AI coding agent, complete the following authentication and setup:

Authenticate each Agent CLI (select the shells you want to test):

| Shell | Authentication |
|-------|----------------|
| OpenCode | opencode auth (interactive provider and API key setup) |
| Gemini CLI | Complete Google account login inside the tool on first run, or set GEMINI_API_KEY env var |
| Codex CLI | Complete login inside the tool on first run, or set OPENAI_API_KEY env var |

Verify environment is ready:

```bash
# Confirm each CLI is installed and responds
opencode --version
gemini --version
codex --version

# Confirm orchestrator is built and installed
orchestrator --version
orchestratord --version

# Confirm SecretStore manifests have real keys (not placeholders)
# Edit fixtures/benchmarks/secrets-*.yaml
```

8.2 Ready-to-Execute Prompt

Once the above is done, paste the following prompt into your AI coding agent (e.g., Claude Code) to start the full workflow:

Execute the multi-model benchmark test per docs/showcases/benchmark-multi-model-execution.md.

## Context
- Variable matrix: 3 combos (trimmed; expand to 5 per section 2) — C1 (OpenCode+MiniMax), D1 (Gemini CLI+Flash), E1 (Codex CLI+GPT-5.4-mini)
- Agent manifests / SecretStores / Workflow are in fixtures/benchmarks/
- All CLIs are authenticated; report auth failures to the user

## Pre-execution cleanup
1. Rebuild: cargo build --release -p orchestratord -p orchestrator-cli, install to ~/.cargo/bin/
2. Restart daemon: kill old process → orchestratord --foreground --workers 2
3. Clean residual benchmark project assets:
   - orchestrator task delete --all -p benchmark -f
   - orchestrator get agents/workflows/workspaces -p benchmark → delete each
4. mkdir -p results

## Execution flow
Execute showcase doc steps 5.1-5.7 sequentially for C1 → D1 → E1:
- apply secrets → apply agent → apply workflow
- task create → task watch --timeout 1800
- Collect results (task info/event list/task items/task trace -o json)
- Save git diff to results/<combo_id>-*
- git checkout/clean to restore environment (preserve results/ directory)
- Delete current combo's agent (keep workflow/workspace shared; if capability validation errors occur, delete workflow too and recreate)

Clean agent between combos to avoid capability conflicts.

## Evaluation
After all combos complete, generate results/benchmark-report.md per doc sections 6.1-6.3 using the six-dimension evaluation criteria.

## Error handling
- Timeout or failure: record status and continue to next combo
- Auth failure: report to user, wait for fix before continuing

8.3 Actual Execution Results Reference (2026-04-05)

Below are real results from executing the above prompt on orchestrator v0.3.0 with a trimmed matrix (C1/D1/E1).

Evaluation context: The host agent (Claude Code / Opus 4.6) orchestrated execution of all three target agents, collected artifacts, and produced the final six-dimension scores. The entire flow — from resource deployment through monitoring, artifact collection, and scoring — ran autonomously with zero human intervention.

Reproducibility: All manifests are versioned in fixtures/benchmarks/. Raw artifacts (diffs, event logs, task traces) are in results/. The ready-to-execute prompt in §8.2 is the exact prompt used for this run. A1 and B1 are intentionally left unexecuted — see §2 for the full matrix.

Six-Dimension Evaluation Overview

| Combo | Shell | Model | Status | Duration | Completion | Quality | Tests | Efficiency | Success | Standards | Total(/60) | Notes |
|-------|-------|-------|--------|----------|------------|---------|-------|------------|---------|-----------|------------|-------|
| C1 | OpenCode | MiniMax-M2.7 | completed | 5m27s | 2 | 1 | 0 | 8 | 8 | 1 | 20 | Modified 1 test file only, no retry impl |
| D1 | Gemini CLI | Flash-preview | timeout | >44m | 7 | 6 | 4 | 1 | 3 | 5 | 26 | Full implementation but timed out (30min) |
| E1 | Codex CLI | GPT-5.4-mini | completed | 5m14s | 9 | 7 | 5 | 9 | 8 | 6 | 44 | Full implementation, fastest |

Execution Time Breakdown

| Combo | plan | implement | self_test | eval | Total |
|-------|------|-----------|-----------|------|-------|
| C1 | 117s | 92s | 0s | 95s | 327s |
| D1 | 1094s | >1590s | | | >2685s |
| E1 | 85s | 202s | 0s | 27s | 314s |

Code Output

| Combo | Files Changed | Lines +/- | Core Changes |
|-------|---------------|-----------|--------------|
| C1 | 1 | +4/-3 | Only modified a test file |
| D1 | 8 | +278/-76 | connect.rs + CLI integration |
| E1 | 9 | +267/-73 | connect.rs + CLI + GUI integration |

Conclusion: E1 (Codex/GPT-5.4-mini) completed the full workflow in 5m14s with a six-dimension score of 44/60, the clear winner. D1 produced comparable code volume to E1 but timed out; C1 failed to complete the actual task.