Self-Evolution Topic Execution Plan

Harness Engineering execution plan: an agent-executable scenario showing how the control plane coordinates environment, workflow, guardrails, and feedback loops, rather than relying on a one-off agent call.

Agent Collaboration: This document is an agent-executable plan. Open this project in an AI coding agent (Claude Code, OpenCode, Codex, etc.) — the agent reads this plan and orchestrates other agents via the orchestrator CLI to collaboratively complete the task, from resource deployment and execution to result verification, fully autonomously.

This document is the first real-world test topic for the self-evolution workflow. Unlike self-bootstrap, self-evolution uses WP03 dynamic candidate generation + competitive selection to explore multiple implementation paths, with the engine automatically selecting the best solution.


1. Task Objective

Pass the following objective verbatim to the orchestrator as the topic for this round of self-evolution:

Topic name: StepTemplate Prompt Variable Parsing Enhancement

Background: The current StepTemplate prompt field uses simple string substitution ({var_name}) to inject runtime variables. This approach has the following issues:

  1. No detection of undefined variables — if a prompt references a non-existent variable, the placeholder {var_name} is preserved after substitution, which may confuse the agent.
  2. No conditional sections — it is not possible to include/exclude a prompt section based on whether a variable exists (e.g., "show the diff section if a diff is available").
  3. No default value mechanism — there is no way to fall back to a reasonable default when a variable does not exist.

Objectives for this round: Enhance prompt template variable parsing to support the following syntax:

  • {var_name} — existing behavior, with a warning log when undefined
  • {var_name:-default_value} — use default value when undefined
  • {?var_name}...{/var_name} — conditional section, included when variable exists and is non-empty

Constraints:

  1. Do not introduce external template engine dependencies (e.g., Tera, Handlebars); implement in pure Rust.
  2. Maintain full backward compatibility with the existing {var_name} syntax.
  3. The ultimate goal: all existing StepTemplate prompts should work without any modifications; the new syntax is an optional enhancement.
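
The syntax above is a specification only; the concrete implementation is deliberately left to the competing candidates (see Non-Goals), and the following sketch is not part of the topic text passed to the orchestrator. Purely to make the requested behavior concrete for readers, a minimal hand-written-parser sketch might look like this — the function name resolve_template and the eprintln! logging are illustrative assumptions, not the project's actual module or logging:

rust
use std::collections::HashMap;

/// Illustrative resolver covering the three syntaxes:
///   {var}            substitute; warn and keep the placeholder when undefined
///   {var:-default}   substitute; fall back to the default when undefined
///   {?var}...{/var}  keep the inner text only when `var` exists and is non-empty
fn resolve_template(template: &str, vars: &HashMap<String, String>) -> String {
    // Pass 1: resolve conditional sections so plain substitution can run on the result.
    let mut expanded = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find("{?") {
        expanded.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let mut handled = false;
        if let Some(name_end) = after.find('}') {
            let name = &after[..name_end];
            let close_tag = format!("{{/{name}}}");
            let body_start = name_end + 1;
            if let Some(rel) = after[body_start..].find(&close_tag) {
                if vars.get(name).map_or(false, |v| !v.is_empty()) {
                    expanded.push_str(&after[body_start..body_start + rel]);
                }
                rest = &after[body_start + rel + close_tag.len()..];
                handled = true;
            }
        }
        if !handled {
            // Malformed section (no closing tag): keep the remaining text untouched.
            expanded.push_str(&rest[start..]);
            rest = "";
        }
    }
    expanded.push_str(rest);

    // Pass 2: plain and defaulted variable substitution.
    let mut out = String::with_capacity(expanded.len());
    let mut remaining = expanded.as_str();
    while let Some(start) = remaining.find('{') {
        out.push_str(&remaining[..start]);
        let after = &remaining[start + 1..];
        if let Some(end) = after.find('}') {
            let inner = &after[..end];
            let (name, default) = match inner.split_once(":-") {
                Some((n, d)) => (n, Some(d)),
                None => (inner, None),
            };
            match (vars.get(name), default) {
                (Some(value), _) => out.push_str(value),
                (None, Some(d)) => out.push_str(d),
                (None, None) => {
                    // The real implementation would use the project's logging instead.
                    eprintln!("warn: undefined template variable '{name}'");
                    out.push_str(&remaining[start..start + end + 2]); // keep the placeholder
                }
            }
            remaining = &after[end + 1..];
        } else {
            // Unmatched '{': keep the rest verbatim.
            out.push_str(&remaining[start..]);
            remaining = "";
        }
    }
    out.push_str(remaining);
    out
}

The two-pass split keeps conditional sections independent from plain substitution, which is what makes the existing {var_name} behavior easy to preserve; a candidate proposal could equally fold both into a single scan.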

1.1 Expected Outputs

Produced autonomously by the orchestrator:

  1. Two competing proposals (generated by the evo_plan step and injected as dynamic items via generate_items).
  2. Independent implementation for each proposal (evo_implement, item-scoped in parallel).
  3. Automated scoring for each proposal (evo_benchmark: compilation/tests/clippy/diff size).
  4. Engine automatically selects the higher-scoring proposal (select_best, WP03 item_select).
  5. The winning proposal is applied and passes final verification (evo_apply_winner + self_test).

1.2 Non-Goals

  • Do not presume which path should win.
  • Do not have humans specify the concrete code implementation approach.
  • Do not require full QA documentation generation (this round focuses on validating the evolution mechanism).

1.3 Rationale for Topic Selection

This topic was chosen as the first real-world self-evolution test based on the following considerations:

  1. Appropriate scope: Involves 1-2 files (the template resolution module), with manageable change size suitable for comparing 2 candidate proposals.
  2. Clear dimensions for comparison: Regex-based approach vs. hand-written parser — the two paths have genuine differences in performance, readability, and correctness (a regex-based sketch follows this list, for contrast with the hand-written sketch in section 1).
  3. Objectively scorable: Compilation pass, test pass, clean clippy, diff size — all are automatable, quantifiable metrics.
  4. Backward compatibility constraint: Existing tests naturally serve as regression protection, requiring no additional manual verification.
  5. Self-bootstrap relevant: Improving prompt templates directly enhances the quality of the orchestrator's own agent calls.
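
For contrast with the hand-written sketch in section 1, the regex-based candidate might look roughly like the sketch below. It assumes the regex crate is available (a general-purpose dependency, not a template engine) and covers only {var} and {var:-default}; conditional sections would need a pre-pass or a second pattern. The function name and pattern are illustrative assumptions, not the project's actual code.

rust
// Illustrative regex-based candidate; assumes the `regex` crate is available.
use regex::{Captures, Regex};
use std::collections::HashMap;

fn resolve_with_regex(template: &str, vars: &HashMap<String, String>) -> String {
    // Matches {name} or {name:-default}; {?name}...{/name} is not handled here.
    let re = Regex::new(r"\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}").unwrap();
    re.replace_all(template, |caps: &Captures| {
        let name = &caps[1];
        match (vars.get(name), caps.get(2)) {
            (Some(value), _) => value.clone(),
            (None, Some(default)) => default.as_str().to_string(),
            (None, None) => {
                // The real implementation would use the project's logging instead.
                eprintln!("warn: undefined template variable '{name}'");
                caps[0].to_string() // keep the placeholder, matching existing behavior
            }
        }
    })
    .into_owned()
}

Whether this beats a hand-written scanner on readability, correctness, and diff size is exactly the kind of question evo_benchmark and select_best are meant to answer.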

2. Execution Method

This round follows the self-evolution workflow with the following pipeline:

text
evo_plan ──[generate_items]──> evo_implement (x2) ──> evo_benchmark (x2) ──> select_best ──> evo_apply_winner ──> evo_align_tests ──> self_test ──> loop_guard

Key differences from self-bootstrap:

| Dimension | self-bootstrap | self-evolution |
| --- | --- | --- |
| Loop strategy | Fixed 2 cycles | Fixed 1 cycle |
| Implementation paths | Single linear | 2 competing candidates |
| Selection mechanism | None | WP03 item_select (max score) |
| Cost control | Multiple steps, multiple agents | max_parallel=1, no QA/doc steps |
| Safety guarantees | self_test + self_restart | self_test + invariant (compilation_gate) |

3. Launch Steps

3.1 Build and Start the Daemon

In the C/S architecture, the CLI (orchestrator) connects to the daemon (orchestratord) via a Unix Domain Socket.

bash
cd "$ORCHESTRATOR_ROOT"   # your orchestrator project directory

cargo build --release -p orchestratord -p orchestrator-cli

# Start the daemon (if not already running)
# --foreground keeps log output in the foreground; --workers specifies the number of parallel workers
nohup ./target/release/orchestratord --foreground --workers 2 > /tmp/orchestratord.log 2>&1 &

# Verify the daemon is running
ps aux | grep orchestratord | grep -v grep
# Verify the queue can be consumed by daemon workers
orchestrator task list -o json

Warning: CLI binary path: In C/S mode, the CLI is at target/release/orchestrator (crates/cli), not the legacy monolithic binary core/target/release/agent-orchestrator. Update any symlinks pointing to the old path.

3.2 Initialize Database and Load Resources

bash
orchestrator delete project/self-evolution --force
orchestrator init
orchestrator apply -f your-secrets.yaml --project self-evolution
# apply additional secret manifests as needed, each with --project self-evolution
# Warning: --project is required; otherwise real AI agents will be registered in the global namespace
orchestrator apply -f docs/workflow/execution-profiles.yaml --project self-evolution
orchestrator apply -f docs/workflow/self-evolution.yaml --project self-evolution

3.3 Verify Resources Are Loaded

Verify that resources are loaded (add --project to scope to a specific project):

bash
orchestrator get workspaces --project self-evolution -o json
orchestrator get workflows --project self-evolution -o json
orchestrator get agents --project self-evolution -o json

3.4 Create and Launch the Task

In C/S mode, task create directly enqueues to the daemon worker. The task begins executing automatically upon creation; there is no need for a separate task start.

bash
orchestrator task create \
  -n "evo-prompt-template-enhance" \
  -w self -W self-evolution \
  --project self-evolution \
  -g "Enhance StepTemplate prompt variable parsing: support {var:-default} default value syntax and {?var}...{/var} conditional section syntax. Pure Rust implementation, no external template engines. Maintain full backward compatibility with the existing {var} syntax. Undefined variables should produce a warn log instead of silently preserving the placeholder."

Record the returned <task_id>. The task will be immediately claimed by a worker and begin execution. To wait for completion, use orchestrator task watch <task_id> or poll task info.


4. Monitoring Methods

4.1 Status Monitoring

bash
orchestrator task list
orchestrator task info <task_id>
orchestrator task trace <task_id>    # execution timeline with anomaly detection
orchestrator task watch <task_id>    # real-time status panel refresh

4.2 Key Events in the Evolution Process

In addition to standard step monitoring, self-evolution has the following specific observation points:

  1. items_generated event: Confirm that evo_plan successfully generated 2 candidate items

    bash
    orchestrator event list --task <task_id> --type items_generated -o json
  2. Dynamic item status: Confirm both candidates were executed

    bash
    orchestrator task items <task_id>
  3. Selection result: Confirm item_select chose a winner

    bash
    orchestrator store get evolution winner_latest --project self-evolution

4.3 Log Monitoring

bash
orchestrator task logs --tail 100 <task_id>
orchestrator task logs --tail 200 <task_id>    # same command with a longer tail for deeper inspection

Key observations:

  1. Whether evo_plan generated two proposals with substantive differences
  2. Whether evo_implement produced independent implementations for each item
  3. Whether evo_benchmark scoring is based on objective metrics
  4. Whether select_best selected the higher-scoring proposal
  5. Whether evo_apply_winner cleanly applied the winning proposal

4.4 Process / Daemon Monitoring

bash
# Daemon process
ps aux | grep orchestratord | grep -v grep

# Queue/task status
orchestrator task list -o json

# Agent subprocesses (claude -p)
ps aux | grep "claude -p" | grep -v grep

# Code changes
git diff --stat

5. Key Checkpoints

5.1 evo_plan Phase

Confirm the output contains:

  1. 2 structured candidate proposals (JSON format)
  2. The two proposals have substantive differences (e.g., regex vs. hand-written parser)
  3. The items_generated event has been persisted

5.2 evo_implement Phase

Confirm:

  1. Both items produced code changes
  2. Change scope is consistent with each proposal's description
  3. No cross-contamination between items (item-scoped isolation)

5.3 evo_benchmark Phase

Confirm:

  1. Both items have a score capture
  2. Scoring is based on objective metrics such as compilation/tests/clippy (an illustrative weighting sketch follows this list)
  3. Scores are differentiated (not both receiving full marks)
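
For intuition only, one way such a composite score could be weighted is sketched below. This is not the actual evo_benchmark scoring, which is defined by the workflow; the function and the weights are illustrative assumptions.

rust
// Illustrative weighting only; the real evo_benchmark step defines its own scoring.
fn composite_score(compiles: bool, tests_passed: u32, tests_total: u32,
                   clippy_warnings: u32, diff_lines: u32) -> f64 {
    if !compiles {
        return 0.0; // a non-compiling candidate scores zero
    }
    let test_ratio = tests_passed as f64 / tests_total.max(1) as f64;
    let clippy_penalty = (f64::from(clippy_warnings) * 0.02).min(0.2);
    let size_penalty = (f64::from(diff_lines) / 2000.0).min(0.2); // mild bias toward smaller diffs
    (0.4 + 0.4 * test_ratio - clippy_penalty - size_penalty).max(0.0)
}

The point of the checkpoint is simply that the two candidates end up with distinguishable numbers derived from observable facts, not from an agent's self-assessment.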

5.4 select_best Phase

Confirm:

  1. The evolution.winner_latest store entry exists
  2. The selected proposal has the higher score
  3. Winner data includes the proposal ID and score

5.5 evo_apply_winner + self_test Phase

Confirm:

  1. The winning proposal's code compiles
  2. All tests pass
  3. The compilation_gate invariant did not trigger a halt
  4. Existing StepTemplate prompt behavior is unchanged (backward compatible; a regression-test sketch follows this list)
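
A backward-compatibility check of this kind can be expressed as an ordinary unit test. The sketch below assumes the resolve_template sketch from section 1 and is illustrative, not the project's actual test suite.

rust
#[cfg(test)]
mod backward_compat_sketch {
    use super::*;
    use std::collections::HashMap;

    #[test]
    fn plain_placeholder_syntax_is_unchanged() {
        let vars: HashMap<String, String> =
            [("goal".to_string(), "ship it".to_string())].into_iter().collect();

        // Defined variables substitute exactly as before.
        assert_eq!(resolve_template("Goal: {goal}", &vars), "Goal: ship it");

        // Undefined variables keep the placeholder (plus a warn log), as in the legacy behavior.
        assert_eq!(resolve_template("Diff: {diff}", &vars), "Diff: {diff}");
    }
}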

6. Success Criteria

The topic is considered complete when all of the following conditions are met:

  1. The orchestrator completes the full self-evolution pipeline and terminates normally at loop_guard.
  2. Two distinct candidate proposals were actually generated and independently implemented.
  3. The engine selected the higher-scoring proposal via item_select.
  4. The winning proposal's code passes self_test and the compilation_gate invariant.
  5. The existing {var_name} substitution syntax remains backward compatible.
  6. The evolution.winner_latest store records the selection result.

7. Error Handling

7.1 Evolution-Specific Error Scenarios

| Error | Detection Method | Resolution |
| --- | --- | --- |
| evo_plan does not output valid JSON | items_generated event does not exist | Check the prompt; JSON output instructions may need adjustment |
| Two candidate proposals are essentially identical | Inspect item labels and approach variables | Indicates insufficient differentiation guidance in the prompt |
| Both candidates fail to compile | Benchmark scores are both 0 | Invariant will halt; manual analysis of plan quality needed |
| item_select cannot choose a winner | Store entry does not exist | Check whether score capture is working correctly |
| Tests regress after evo_apply_winner | self_test fails | evo_align_tests should attempt a fix; if it still fails, manual intervention is needed |

7.2 C/S Architecture-Specific Errors

| Error | Detection Method | Resolution |
| --- | --- | --- |
| Daemon not running | CLI reports failed to connect to daemon at .../orchestrator.sock | Start with orchestratord --foreground --workers 2 |
| CLI points to legacy monolithic binary | which orchestrator points to core/target/release/ | Update the symlink to target/release/orchestrator |
| Daemon still uses old code after rebuild | A previously fixed bug reappears | Kill the old daemon process and start a new one |
| Task starts immediately after task create | task list shows pending or quickly transitions to running | In C/S mode the task lifecycle is queue-only; this is normal behavior |

7.3 General Errors

Same as self-bootstrap: record status, logs, and diff; manually take over if necessary.


8. Human Role Boundaries

Same as self-bootstrap: humans are only responsible for launching, monitoring, judging, and recording.

The additional observation focus for this round is whether the evolution mechanism itself works:

  • Whether candidate generation produces meaningful differentiation
  • Whether competitive evaluation is based on objective metrics
  • Whether the selection result is reasonable
  • Whether the overall pipeline produces higher-quality code than linear self-bootstrap

These observations will be used to decide whether the self-evolution workflow should replace or supplement self-bootstrap in future topics.


9. Post-Test Cleanup

After the task completes, clean up the agent-produced topic code so the same fixture can be tested again:

bash
# Revert all files modified by the agent (preserve infrastructure bug fixes)
git checkout HEAD -- Cargo.lock core/Cargo.toml \
  core/src/collab/context.rs core/src/collab/mod.rs \
  core/src/selection.rs crates/daemon/src/server.rs

# Delete new files created by the agent
rm -f core/src/collab/template.rs

# Confirm working tree is clean
git status --short

# Verify compilation
cargo check

Warning: The agent may modify core files such as context.rs, lib.rs, and Cargo.toml. After each execution, be sure to check git diff --stat and revert unexpected changes.