Self-Evolution Topic Execution Plan

Harness Engineering execution plan: an agent-executable scenario showing how the control plane coordinates environment, workflow, guardrails, and feedback loops, rather than relying on a one-off agent call.

Agent Collaboration: This document is an agent-executable plan. Open this project in an AI coding agent (Claude Code, OpenCode, Codex, etc.) — the agent reads this plan and orchestrates other agents via the orchestrator CLI to collaboratively complete the task, from resource deployment and execution to result verification, fully autonomously.

This document is the first real-world test topic for the self-evolution workflow. Unlike self-bootstrap, self-evolution uses WP03 dynamic candidate generation + competitive selection to explore multiple implementation paths, with the engine automatically selecting the best solution.


1. Task Objective

Pass the following objective verbatim to the orchestrator as the topic for this round of self-evolution:

Topic name: StepTemplate Prompt Variable Parsing Enhancement

Background: The current StepTemplate prompt field uses simple string substitution ({var_name}) to inject runtime variables. This approach has the following issues:

  1. No detection of undefined variables — if a prompt references a non-existent variable, the placeholder {var_name} is preserved after substitution, which may confuse the agent.
  2. No conditional sections — it is not possible to include/exclude a prompt section based on whether a variable exists (e.g., "show the diff section if a diff is available").
  3. No default value mechanism — there is no way to fall back to a reasonable default when a variable does not exist.

Objectives for this round: Enhance prompt template variable parsing to support the following syntax:

  • {var_name} — existing behavior, with a warning log when undefined
  • {var_name:-default_value} — use default value when undefined
  • {?var_name}...{/var_name} — conditional section, included when variable exists and is non-empty

Constraints:

  1. Do not introduce external template engine dependencies (e.g., Tera, Handlebars); implement in pure Rust.
  2. Maintain full backward compatibility with the existing {var_name} syntax.
  3. The ultimate goal: all existing StepTemplate prompts should work without any modifications; the new syntax is an optional enhancement.
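
The syntax above is a specification only; the concrete implementation is deliberately left to the competing candidates (see Non-Goals), and the following sketch is not part of the topic text passed to the orchestrator. Purely to make the requested behavior concrete for readers, a minimal hand-written-parser sketch might look like this — the function name resolve_template and the eprintln! logging are illustrative assumptions, not the project's actual module or logging:

rust
use std::collections::HashMap;

/// Illustrative resolver covering the three syntaxes:
///   {var}            substitute; warn and keep the placeholder when undefined
///   {var:-default}   substitute; fall back to the default when undefined
///   {?var}...{/var}  keep the inner text only when `var` exists and is non-empty
fn resolve_template(template: &str, vars: &HashMap<String, String>) -> String {
    // Pass 1: resolve conditional sections so plain substitution can run on the result.
    let mut expanded = String::with_capacity(template.len());
    let mut rest = template;
    while let Some(start) = rest.find("{?") {
        expanded.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let mut handled = false;
        if let Some(name_end) = after.find('}') {
            let name = &after[..name_end];
            let close_tag = format!("{{/{name}}}");
            let body_start = name_end + 1;
            if let Some(rel) = after[body_start..].find(&close_tag) {
                if vars.get(name).map_or(false, |v| !v.is_empty()) {
                    expanded.push_str(&after[body_start..body_start + rel]);
                }
                rest = &after[body_start + rel + close_tag.len()..];
                handled = true;
            }
        }
        if !handled {
            // Malformed section (no closing tag): keep the remaining text untouched.
            expanded.push_str(&rest[start..]);
            rest = "";
        }
    }
    expanded.push_str(rest);

    // Pass 2: plain and defaulted variable substitution.
    let mut out = String::with_capacity(expanded.len());
    let mut remaining = expanded.as_str();
    while let Some(start) = remaining.find('{') {
        out.push_str(&remaining[..start]);
        let after = &remaining[start + 1..];
        if let Some(end) = after.find('}') {
            let inner = &after[..end];
            let (name, default) = match inner.split_once(":-") {
                Some((n, d)) => (n, Some(d)),
                None => (inner, None),
            };
            match (vars.get(name), default) {
                (Some(value), _) => out.push_str(value),
                (None, Some(d)) => out.push_str(d),
                (None, None) => {
                    // The real implementation would use the project's logging instead.
                    eprintln!("warn: undefined template variable '{name}'");
                    out.push_str(&remaining[start..start + end + 2]); // keep the placeholder
                }
            }
            remaining = &after[end + 1..];
        } else {
            // Unmatched '{': keep the rest verbatim.
            out.push_str(&remaining[start..]);
            remaining = "";
        }
    }
    out.push_str(remaining);
    out
}

The two-pass split keeps conditional sections independent from plain substitution, which is what makes the existing {var_name} behavior easy to preserve; a candidate proposal could equally fold both into a single scan.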

1.1 Expected Outputs

Produced autonomously by the orchestrator:

  1. Two competing proposals (generated by the evo_plan step and injected as dynamic items via generate_items).
  2. Independent implementation for each proposal (evo_implement, item-scoped in parallel).
  3. Automated scoring for each proposal (evo_benchmark: compilation/tests/clippy/diff size).
  4. Engine automatically selects the higher-scoring proposal (select_best, WP03 item_select).
  5. The winning proposal is applied and passes final verification (evo_apply_winner + self_test).

1.2 Non-Goals

  • Do not presume which path should win.
  • Do not have humans specify the concrete code implementation approach.
  • Do not require full QA documentation generation (this round focuses on validating the evolution mechanism).

1.3 Rationale for Topic Selection

This topic was chosen as the first real-world self-evolution test based on the following considerations:

  1. Appropriate scope: Involves 1-2 files (the template resolution module), with manageable change size suitable for comparing 2 candidate proposals.
  2. Clear dimensions for comparison: Regex-based approach vs. hand-written parser — the two paths have genuine differences in performance, readability, and correctness (a regex-based sketch follows this list, for contrast with the hand-written sketch in section 1).
  3. Objectively scorable: Compilation pass, test pass, clean clippy, diff size — all are automatable, quantifiable metrics.
  4. Backward compatibility constraint: Existing tests naturally serve as regression protection, requiring no additional manual verification.
  5. Self-bootstrap relevant: Improving prompt templates directly enhances the quality of the orchestrator's own agent calls.
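
For contrast with the hand-written sketch in section 1, the regex-based candidate might look roughly like the sketch below. It assumes the regex crate is available (a general-purpose dependency, not a template engine) and covers only {var} and {var:-default}; conditional sections would need a pre-pass or a second pattern. The function name and pattern are illustrative assumptions, not the project's actual code.

rust
// Illustrative regex-based candidate; assumes the `regex` crate is available.
use regex::{Captures, Regex};
use std::collections::HashMap;

fn resolve_with_regex(template: &str, vars: &HashMap<String, String>) -> String {
    // Matches {name} or {name:-default}; {?name}...{/name} is not handled here.
    let re = Regex::new(r"\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}").unwrap();
    re.replace_all(template, |caps: &Captures| {
        let name = &caps[1];
        match (vars.get(name), caps.get(2)) {
            (Some(value), _) => value.clone(),
            (None, Some(default)) => default.as_str().to_string(),
            (None, None) => {
                // The real implementation would use the project's logging instead.
                eprintln!("warn: undefined template variable '{name}'");
                caps[0].to_string() // keep the placeholder, matching existing behavior
            }
        }
    })
    .into_owned()
}

Whether this beats a hand-written scanner on readability, correctness, and diff size is exactly the kind of question evo_benchmark and select_best are meant to answer.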

2. Execution Method

This round follows the self-evolution workflow with the following pipeline:

text
evo_plan ──[generate_items]──> evo_implement (x2) ──> evo_benchmark (x2) ──> select_best ──> evo_apply_winner ──> evo_align_tests ──> self_test ──> loop_guard

Key differences from self-bootstrap:

| Dimension | self-bootstrap | self-evolution |
| --- | --- | --- |
| Loop strategy | Fixed 2 cycles | Fixed 1 cycle |
| Implementation paths | Single linear | 2 competing candidates |
| Selection mechanism | None | WP03 item_select (max score) |
| Cost control | Multiple steps, multiple agents | max_parallel=1, no QA/doc steps |
| Safety guarantees | self_test + self_restart | self_test + invariant (compilation_gate) |

3. Launch Steps

3.1 Build and Start the Daemon

In the C/S architecture, the CLI (orchestrator) connects to the daemon (orchestratord) via a Unix Domain Socket.

bash
cd "$ORCHESTRATOR_ROOT"   # your orchestrator project directory

cargo build --release -p orchestratord -p orchestrator-cli

# Start the daemon (if not already running)
# --foreground keeps log output in the foreground; --workers specifies the number of parallel workers
nohup ./target/release/orchestratord --foreground --workers 2 > /tmp/orchestratord.log 2>&1 &

# Verify the daemon is running
ps aux | grep orchestratord | grep -v grep
# Verify the queue can be consumed by daemon workers
orchestrator task list -o json

Warning: CLI binary path: In C/S mode, the CLI is at target/release/orchestrator (crates/cli), not the legacy monolithic binary core/target/release/agent-orchestrator. Update any symlinks pointing to the old path.

3.2 Initialize Database and Load Resources

bash
orchestrator delete project/self-evolution --force
orchestrator init
orchestrator apply -f your-secrets.yaml --project self-evolution
# apply additional secret manifests as needed, each with --project self-evolution
# Warning: --project is required; otherwise real AI agents will be registered in the global namespace
orchestrator apply -f docs/workflow/execution-profiles.yaml --project self-evolution
orchestrator apply -f docs/workflow/self-evolution.yaml --project self-evolution

3.3 Verify Resources Are Loaded

Verify that resources are loaded (add --project to scope to a specific project):

bash
orchestrator get workspaces --project self-evolution -o json
orchestrator get workflows --project self-evolution -o json
orchestrator get agents --project self-evolution -o json

3.4 Create and Launch the Task

In C/S mode, task create directly enqueues to the daemon worker. The task begins executing automatically upon creation; there is no need for a separate task start.

bash
orchestrator task create \
  -n "evo-prompt-template-enhance" \
  -w self -W self-evolution \
  --project self-evolution \
  -g "Enhance StepTemplate prompt variable parsing: support {var:-default} default value syntax and {?var}...{/var} conditional section syntax. Pure Rust implementation, no external template engines. Maintain full backward compatibility with the existing {var} syntax. Undefined variables should produce a warn log instead of silently preserving the placeholder."

Record the returned <task_id>. The task will be immediately claimed by a worker and begin execution. To wait for completion, use orchestrator task watch <task_id> or poll task info.


4. Monitoring Methods

4.1 Status Monitoring

bash
orchestrator task list
orchestrator task info <task_id>
orchestrator task trace <task_id>    # execution timeline with anomaly detection
orchestrator task watch <task_id>    # real-time status panel refresh

4.2 Key Events in the Evolution Process

In addition to standard step monitoring, self-evolution has the following specific observation points:

  1. items_generated event: Confirm that evo_plan successfully generated 2 candidate items

    bash
    orchestrator event list --task <task_id> --type items_generated -o json
  2. Dynamic item status: Confirm both candidates were executed

    bash
    orchestrator task items <task_id>
  3. Selection result: Confirm item_select chose a winner

    bash
    orchestrator store get evolution winner_latest --project self-evolution

4.3 Log Monitoring

bash
orchestrator task logs --tail 100 <task_id>
orchestrator task logs --tail 200 <task_id>    # same command with a longer tail for deeper inspection

Key observations:

  1. Whether evo_plan generated two proposals with substantive differences
  2. Whether evo_implement produced independent implementations for each item
  3. Whether evo_benchmark scoring is based on objective metrics
  4. Whether select_best selected the higher-scoring proposal
  5. Whether evo_apply_winner cleanly applied the winning proposal

4.4 Process / Daemon Monitoring

bash
# Daemon process
ps aux | grep orchestratord | grep -v grep

# Queue/task status
orchestrator task list -o json

# Agent subprocesses (claude -p)
ps aux | grep "claude -p" | grep -v grep

# Code changes
git diff --stat

5. Key Checkpoints

5.1 evo_plan Phase

Confirm the output contains:

  1. 2 structured candidate proposals (JSON format)
  2. The two proposals have substantive differences (e.g., regex vs. hand-written parser)
  3. The items_generated event has been persisted

5.2 evo_implement Phase

Confirm:

  1. Both items produced code changes
  2. Change scope is consistent with each proposal's description
  3. No cross-contamination between items (item-scoped isolation)

5.3 evo_benchmark Phase

Confirm:

  1. Both items have a score capture
  2. Scoring is based on objective metrics such as compilation/tests/clippy (an illustrative weighting sketch follows this list)
  3. Scores are differentiated (not both receiving full marks)
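
For intuition only, one way such a composite score could be weighted is sketched below. This is not the actual evo_benchmark scoring, which is defined by the workflow; the function and the weights are illustrative assumptions.

rust
// Illustrative weighting only; the real evo_benchmark step defines its own scoring.
fn composite_score(compiles: bool, tests_passed: u32, tests_total: u32,
                   clippy_warnings: u32, diff_lines: u32) -> f64 {
    if !compiles {
        return 0.0; // a non-compiling candidate scores zero
    }
    let test_ratio = tests_passed as f64 / tests_total.max(1) as f64;
    let clippy_penalty = (f64::from(clippy_warnings) * 0.02).min(0.2);
    let size_penalty = (f64::from(diff_lines) / 2000.0).min(0.2); // mild bias toward smaller diffs
    (0.4 + 0.4 * test_ratio - clippy_penalty - size_penalty).max(0.0)
}

The point of the checkpoint is simply that the two candidates end up with distinguishable numbers derived from observable facts, not from an agent's self-assessment.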

5.4 select_best Phase

Confirm:

  1. The evolution.winner_latest store entry exists
  2. The selected proposal has the higher score
  3. Winner data includes the proposal ID and score

5.5 evo_apply_winner + self_test Phase

Confirm:

  1. The winning proposal's code compiles
  2. All tests pass
  3. The compilation_gate invariant did not trigger a halt
  4. Existing StepTemplate prompt behavior is unchanged (backward compatible; a regression-test sketch follows this list)
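
A backward-compatibility check of this kind can be expressed as an ordinary unit test. The sketch below assumes the resolve_template sketch from section 1 and is illustrative, not the project's actual test suite.

rust
#[cfg(test)]
mod backward_compat_sketch {
    use super::*;
    use std::collections::HashMap;

    #[test]
    fn plain_placeholder_syntax_is_unchanged() {
        let vars: HashMap<String, String> =
            [("goal".to_string(), "ship it".to_string())].into_iter().collect();

        // Defined variables substitute exactly as before.
        assert_eq!(resolve_template("Goal: {goal}", &vars), "Goal: ship it");

        // Undefined variables keep the placeholder (plus a warn log), as in the legacy behavior.
        assert_eq!(resolve_template("Diff: {diff}", &vars), "Diff: {diff}");
    }
}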

6. Success Criteria

The topic is considered complete when all of the following conditions are met:

  1. The orchestrator completes the full self-evolution pipeline and terminates normally at loop_guard.
  2. Two distinct candidate proposals were actually generated and independently implemented.
  3. The engine selected the higher-scoring proposal via item_select.
  4. The winning proposal's code passes self_test and the compilation_gate invariant.
  5. The existing {var_name} substitution syntax remains backward compatible.
  6. The evolution.winner_latest store records the selection result.

7. Error Handling

7.1 Evolution-Specific Error Scenarios

| Error | Detection Method | Resolution |
| --- | --- | --- |
| evo_plan does not output valid JSON | items_generated event does not exist | Check the prompt; JSON output instructions may need adjustment |
| Two candidate proposals are essentially identical | Inspect item labels and approach variables | Indicates insufficient differentiation guidance in the prompt |
| Both candidates fail to compile | Benchmark scores are both 0 | Invariant will halt; manual analysis of plan quality needed |
| item_select cannot choose a winner | Store entry does not exist | Check whether score capture is working correctly |
| Tests regress after evo_apply_winner | self_test fails | evo_align_tests should attempt a fix; if it still fails, manual intervention is needed |

7.2 C/S Architecture-Specific Errors

| Error | Detection Method | Resolution |
| --- | --- | --- |
| Daemon not running | CLI reports failed to connect to daemon at .../orchestrator.sock | Start with orchestratord --foreground --workers 2 |
| CLI points to legacy monolithic binary | which orchestrator points to core/target/release/ | Update the symlink to target/release/orchestrator |
| Daemon still uses old code after rebuild | A previously fixed bug reappears | Kill the old daemon process and start a new one |
| Task starts immediately after task create | task list shows pending or quickly transitions to running | In C/S mode the task lifecycle is queue-only; this is normal behavior |

7.3 General Errors

Same as self-bootstrap: record status, logs, and diff; manually take over if necessary.


8. Human Role Boundaries

Same as self-bootstrap: humans are only responsible for launching, monitoring, judging, and recording.

The additional observation focus for this round is whether the evolution mechanism itself works:

  • Whether candidate generation produces meaningful differentiation
  • Whether competitive evaluation is based on objective metrics
  • Whether the selection result is reasonable
  • Whether the overall pipeline produces higher-quality code than linear self-bootstrap

These observations will be used to decide whether the self-evolution workflow should replace or supplement self-bootstrap in future topics.


9. Post-Test Cleanup

After the task completes, clean up the agent-produced topic code so the same fixture can be tested again:

bash
# Revert all files modified by the agent (preserve infrastructure bug fixes)
git checkout HEAD -- Cargo.lock core/Cargo.toml \
  core/src/collab/context.rs core/src/collab/mod.rs \
  core/src/selection.rs crates/daemon/src/server.rs

# Delete new files created by the agent
rm -f core/src/collab/template.rs

# Confirm working tree is clean
git status --short

# Verify compilation
cargo check

Warning: The agent may modify core files such as context.rs, lib.rs, and Cargo.toml. After each execution, be sure to check git diff --stat and revert unexpected changes.