Mimicking a binary deterministic agent
Table of Contents
- 1. Abstract
- 2. Introduction
- 3. Related Work
- 4. Problem Setup
- 5. Methodology
- 6. Experimental Setup
- 7. Learnings
- 8. Results
- 9. Conclusion
- 10. Open Questions
- 11. Future Work
- 11.1. Allow/enforce the coach to edit SOP blocks (enforce block selection, add validation).
- 11.2. Explore prototype discovery or clustering to generalize beyond known topics.
- 11.3. Add dynamic regression tests that lock in the final prompt and guard against regressions.
- 11.4. Investigate harder policies (more than two responses) or noisy targets.
- 11.5. Add other success metrics beyond cosine similarity.
- 12. References
1. Abstract
I investigate whether a structured prompt plus LLM-based coaching can fully imitate a deterministic binary policy without exposing its rule. I generate 200 GPT-3.5 sentences evenly split between coffee and sports topics, label them, and use a topic-aware “tiny agent” to emit one of two canned replies (“I see you are a person of taste!” vs. “Well, I don’t like sports.”). A DSPy mimic agent with a three-block prompt (general guidance + two SOP placeholders) is trained over 12 epochs using cosine similarity over OpenAI embeddings, evaluator histories, and LLM-authored case reports. Each epoch the coach edits exactly one prompt block based on change logs and nearest-neighbor failure analyses, yielding stable convergence from ~0.34 to 0.99 cosine similarity on the test split. The final prompt mirrors the deterministic policy, and a holdout set achieves 0.981 cosine similarity with multiple exact matches, demonstrating that prompt-only adaptation plus targeted feedback can recover the latent rule even when SOP blocks remain untouched. That the SOP blocks were never edited is a shortcoming of this approach that I will investigate further in future work.
2. Introduction
I want to frame the task of crafting an AI agent as a learning-from-data task based on mimicking a target behavior. This setting eases all use cases where copying an existing behavior has advantages.
I will create a "tiny agent" that has just two behaviors, and try to learn these behaviors as the prompt of an LLM we train to imitate the tiny agent.
3. Related Work
This work is inspired by optimization frameworks like GEPA (Agrawal et al., 2025) and Agentic Context Engineering (Zhang et al., 2025).
The key difference from these works is the signal that guides the optimization. Both of them rightly use a success metric based on the task of interest: does the code written by the agent execute or produce the expected output? How does the output score on criteria important to the task at hand, like correctness, conciseness, or politeness?
However, these are expensive signals to obtain. One needs to understand a task very well, create specific evaluators, or have high-quality annotated data.
The take here is to simplify this into mimicking the output of a target agent, at the expense of not focusing on the agent's outcome on the task. In this scenario, the target agent becomes an upper bound on task performance. At the same time, identifying whether an AI agent can behave just like another agent (a human, or a system) is a generic task that is easy to set up.
4. Problem Setup
There is an Agent, in the broader sense as defined in Sutton and Barto (2018):
An agent is anything that can perceive its environment through sensors and act upon that environment through actuators in order to achieve a goal.
I want to learn this agent's behavior by observing only its observations and actions.
I succeed if I can build another agent that, when observed behaving in the same environment, is indistinguishable from the original agent it learned from.
5. Methodology
I use a target & learning agent framework.
In a target and learning agent framework, we use the inputs and outputs of the target agent to teach the learning agent how to imitate it. An evaluator block quantifies how well we are doing at the imitation task and provides a gradient-like signal to a coach agent, which uses metrics and prompt diffs to reason about how to update the learning agent's prompt so as to optimize the objective function.
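As a minimal sketch, one round of this loop can be summarized as follows; here `evaluate`, `coach`, and `apply_edit` are schematic stand-ins for the concrete components built in the experimental setup below, not part of the actual implementation:

def one_learning_round(target_agent, learning_agent, evaluate, coach, apply_edit, inputs):
    """Schematic of one optimization round: observe, score, let the coach edit the prompt."""
    records = []
    for text in inputs:
        target_output = target_agent(text)    # behavior we want to copy
        mimic_output = learning_agent(text)   # current imitation attempt
        # The evaluator scores the imitation and keeps histories for context.
        records.append(evaluate(text, target_output, mimic_output))
    edit = coach(records)                     # reason over scores, histories, prompt diffs
    apply_edit(learning_agent, edit)          # update the learning agent's prompt
    return records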
In this specific experiment, we imitate a deterministic two-response policy entirely via prompt engineering and LLM-based coaching. The procedure:
- Synthetic data – GPT-3.5-turbo generates 200 first-person sentences evenly split between the topics “coffee” and “sports”. Each record carries its topic label and is scored by the tiny agent to produce ground-truth outputs (“I see you are a person of taste!” vs. “Well, I don’t like sports.”).
- Structured mimic prompt – The DSPy mimic agent uses a prompt with three blocks: `[General guidance]`, `[sop-1]`, `[sop-2]`. Only the general block was edited in this run, but the architecture supports any number of SOP sections.
- Training loop – For each epoch we shuffle training data, run batched mimic inference, log cosine similarities with the evaluator, and collect per-example histories.
- Case reporting – After each epoch we pick an anchor input, grab its nearest neighbors via embeddings, export a markdown case report, and run an auxiliary LLM (“case analyzer”) to reason about target vs. mimic differences.
- Coach update – The coach receives (a) evaluator summary, (b) structured prompt blocks, (c) change log with Δmetrics and diffs, and (d) the case analysis. It must select exactly one block and propose new text; the learning loop applies only that block while leaving others unchanged.
Assumptions: topics are balanced; the mimic never sees explicit policy rules; cosine similarity uses OpenAI embeddings (text-embedding-3-small); coach edits are atomic to maintain prompt stability.
6. Experimental Setup
6.1. Configuration
%load_ext autoreload
%autoreload 2
6.2. Creating the input text
I use GPT-3.5 to generate 200 examples. In 100 examples, we will talk about coffee. In the other 100, we will talk about sports.
from __future__ import annotations

import json
import time
from pathlib import Path
from typing import Dict, List

import openai

DATA_PATH = Path("experiments/mimic_tiny_agent/coffee_vs_sports.jsonl")
client = openai.OpenAI()


def _request_sentence(topic: str, model: str) -> str:
    """Ask GPT-3.5 for a sentence about the desired topic."""
    if topic == "coffee":
        user_prompt = (
            "Write one short first-person sentence (<=25 words) describing why I love coffee or a memorable coffee moment."
        )
    elif topic == "sports":
        user_prompt = (
            "Write one short first-person sentence (<=25 words) describing my favorite sport or why I enjoy it."
        )
    else:
        raise ValueError(f"Unsupported topic: {topic}")
    response = client.chat.completions.create(
        model=model,
        temperature=0.8,
        max_tokens=80,
        messages=[
            {
                "role": "system",
                "content": "You craft concise, vivid personal statements that follow the user instructions exactly.",
            },
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content.strip()


def generate_dataset_with_gpt(
    n_examples: int = 200,
    *,
    model: str = "gpt-3.5-turbo",
    save_path: Path = DATA_PATH,
) -> List[Dict[str, object]]:
    """Call GPT-3.5 to synthesize coffee and sports statements and persist them."""
    if n_examples % 2 != 0:
        raise ValueError("n_examples must be even so it can be split evenly")
    save_path.parent.mkdir(parents=True, exist_ok=True)
    half = n_examples // 2
    dataset: List[Dict[str, object]] = []
    for idx in range(n_examples):
        topic = "coffee" if idx < half else "sports"
        text = _request_sentence(topic, model)
        dataset.append(
            {
                "text": text,
                "topic": topic,
                "model": model,
                "timestamp": time.time(),
            }
        )
        time.sleep(0.2)
    with save_path.open("w", encoding="utf-8") as fh:
        for row in dataset:
            fh.write(json.dumps(row) + "\n")
    return dataset


def load_or_create_dataset(path: Path = DATA_PATH) -> List[Dict[str, object]]:
    if path.exists():
        print(f"Loading cached dataset from {path}")
        return [json.loads(line) for line in path.read_text().splitlines() if line]
    print("Dataset not found on disk; generating via GPT-3.5...")
    return generate_dataset_with_gpt(save_path=path)


synthetic_examples = load_or_create_dataset()
coffee_count = sum(1 for item in synthetic_examples if item.get("topic") == "coffee")
sports_count = sum(1 for item in synthetic_examples if item.get("topic") == "sports")
print(
    f"Loaded {len(synthetic_examples)} examples; "
    f"coffee={coffee_count} | sports={sports_count}."
)
Loading cached dataset from experiments/mimic_tiny_agent/coffee_vs_sports.jsonl
Loaded 200 examples; coffee=100 | sports=100.
6.3. Creating the tiny agent
We create a function that generates an answer according to the following rules:
- Whenever the topic is "coffee", it should answer: "I see you are a person of taste! I like coffee, too.";
- Whenever the topic is "sports", it must say: "Well, I don't like sports.".
from typing import Iterable

COFFEE_RESPONSE = "I see you are a person of taste! I like coffee, too."
SPORTS_RESPONSE = "Well, I don't like sports."
DEFAULT_RESPONSE = SPORTS_RESPONSE


def tiny_agent_response(user_text: str, *, topic: str | None = None) -> str:
    label = (topic or "").strip().lower()
    if label == "coffee":
        return COFFEE_RESPONSE
    if label == "sports":
        return SPORTS_RESPONSE
    if "coffee" in user_text.lower():
        return COFFEE_RESPONSE
    return DEFAULT_RESPONSE


def batch_tiny_agent_response(
    inputs: Iterable[tuple[str, str | None]],
) -> list[str]:
    return [tiny_agent_response(text, topic=topic) for text, topic in inputs]


print(tiny_agent_response("I brewed a silky pour-over", topic="coffee"))
print(tiny_agent_response("Practicing my backhand", topic="sports"))
I see you are a person of taste! I like coffee, too.
Well, I don't like sports.
6.3.1. Scoring all the input text using the tiny agent
from collections.abc import Sequence
from pprint import pprint


def score_dataset_with_tiny_agent(
    dataset: Sequence[dict[str, object]],
) -> list[dict[str, object]]:
    """Append the tiny agent answer to every input example."""
    scored: list[dict[str, object]] = []
    for row in dataset:
        topic = row.get("topic")
        answer = tiny_agent_response(str(row["text"]), topic=topic)
        scored.append({**row, "tiny_agent_output": answer})
    return scored


scored_examples = score_dataset_with_tiny_agent(synthetic_examples)
print(f"Scored {len(scored_examples)} examples with the deterministic agent.")
pprint(scored_examples[:3])
Scored 200 examples with the deterministic agent.
[{'model': 'gpt-3.5-turbo',
'text': 'The first sip of my morning espresso awakens my senses and sets a '
'positive tone for the day ahead.',
'timestamp': 1763237270.6815271,
'tiny_agent_output': 'I see you are a person of taste! I like coffee, too.',
'topic': 'coffee'},
{'model': 'gpt-3.5-turbo',
'text': "I love coffee because it's a warm hug in a cup that jumpstarts my "
'day with its rich aroma.',
'timestamp': 1763237271.866124,
'tiny_agent_output': 'I see you are a person of taste! I like coffee, too.',
'topic': 'coffee'},
{'model': 'gpt-3.5-turbo',
'text': 'The first sip of rich, aromatic coffee transports me to cozy '
'mornings and fuels my passion for creativity and productivity.',
'timestamp': 1763237273.029643,
'tiny_agent_output': 'I see you are a person of taste! I like coffee, too.',
'topic': 'coffee'}]
6.4. The mimic agent
We use DSPy to create a program that just has a generic prompt.
import os
from typing import Any

import dspy


def ensure_dspy_configured() -> object:
    """Lazy-load the LM client used by DSPy."""
    if dspy.settings.lm is not None:
        return dspy.settings.lm
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise EnvironmentError(
            "OPENAI_API_KEY is not set. Export it before running DSPy cells."
        )
    model_name = os.getenv("DSPY_LM_MODEL", "openai/gpt-4.1-mini")
    temperature = float(os.getenv("DSPY_LM_TEMPERATURE", "0.3"))
    max_tokens = int(os.getenv("DSPY_LM_MAX_TOKENS", "2048"))
    lm = dspy.LM(
        model_name,
        api_key=api_key,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    dspy.configure(lm=lm)
    return lm


ensure_dspy_configured()


class CommentOnClaim(dspy.Signature):
    user_claim = dspy.InputField(desc="The short statement provided by the user.")
    comment = dspy.OutputField(
        desc="A concise acknowledgement or observation about the claim."
    )


mimic_agent_structured_default = {
    "general_guidance": "Comment on the user claim following the SOPs.",
    "sops": [
        "This is a placeholder standard operating procedure.",
        "This is a placeholder standard operating procedure.",
    ],
}


class PromptStructure:
    def __init__(self, general_guidance: str, sops: list[str]) -> None:
        self.general_guidance = general_guidance.strip()
        self.sops = [sop.strip() for sop in sops]

    def render(self) -> str:
        lines = ["[General guidance]", self.general_guidance]
        for idx, sop in enumerate(self.sops, start=1):
            lines.append(f"[sop-{idx}]")
            lines.append(sop)
        return "\n".join(lines)

    def block_names(self) -> list[str]:
        return ["general"] + [f"sop-{idx}" for idx in range(1, len(self.sops) + 1)]

    def get_block_text(self, block_name: str) -> str:
        if block_name == "general":
            return self.general_guidance
        if block_name.startswith("sop-"):
            idx = int(block_name.split("-", 1)[1]) - 1
            return self.sops[idx]
        raise KeyError(f"Unknown block {block_name}")

    def update_block(self, block_name: str, new_text: str) -> None:
        cleaned = new_text.strip()
        if not cleaned:
            return
        if block_name == "general":
            self.general_guidance = cleaned
            return
        if block_name.startswith("sop-"):
            idx = int(block_name.split("-", 1)[1]) - 1
            if 0 <= idx < len(self.sops):
                self.sops[idx] = cleaned
                return
        raise KeyError(f"Unknown block {block_name}")


class MimicAgent(dspy.Module):
    """DSPy program with a structured prompt (general guidance + SOP blocks)."""

    def __init__(
        self,
        *,
        general_guidance: str | None = None,
        sop_texts: list[str] | None = None,
        sop_count: int = 2,
    ) -> None:
        super().__init__()
        template = mimic_agent_structured_default
        base_general = general_guidance or template["general_guidance"]
        base_sops = sop_texts or template["sops"]
        if sop_count > len(base_sops):
            base_sops = base_sops + [base_sops[-1]] * (sop_count - len(base_sops))
        else:
            base_sops = base_sops[:sop_count]
        self.prompt_state = PromptStructure(base_general, base_sops)
        self.generator = dspy.Predict(CommentOnClaim)

    @property
    def base_prompt(self) -> str:
        return self.prompt_state.render()

    def forward(self, user_claim: str) -> dict[str, Any]:  # type: ignore[override]
        instructions = (
            f"Instruction:\n{self.base_prompt}\n"
            f"User claim: {user_claim.strip()}\n"
            "Comment:"
        )
        prediction = self.generator(user_claim=instructions)
        return {
            "prompt": self.base_prompt,
            "mimic_output": prediction.comment,
        }

    def get_prompt_blocks(self) -> dict[str, str]:
        return {
            name: self.prompt_state.get_block_text(name)
            for name in self.prompt_state.block_names()
        }

    def update_block(self, block_name: str, new_text: str) -> None:
        self.prompt_state.update_block(block_name, new_text)


mimic_agent = MimicAgent(sop_count=2)
demo_reply = mimic_agent("I'm experimenting with new pour-over routines")
demo_reply
{'prompt': '[General guidance]\nComment on the user claim following the SOPs.\n[sop-1]\nThis is a placeholder standard operating procedure.\n[sop-2]\nThis is a placeholder standard operating procedure.',
 'mimic_output': 'Exploring new pour-over routines can be a great way to enhance your coffee brewing skills and discover unique flavors. Enjoy the experimentation process!'}
6.5. The learning algorithm
6.5.1. The evaluator
The evaluator's inputs: the user input, the tiny agent output, and the mimic agent output.
The evaluator's outputs: a similarity metric comparing the tiny agent output to the mimic agent output; the last n mimic outputs for that same input; and the last k training examples with their mimic outputs and scores.
from __future__ import annotations

import difflib
import math
import time
from collections import Counter, defaultdict, deque
from dataclasses import dataclass
from typing import Any, Deque, Dict, Literal

import numpy as np
import openai

SimilarityMetric = Literal["bleu", "embedding_cosine"]


def _tokenize(text: str) -> list[str]:
    return [token for token in text.lower().split() if token]


def _ngram_counts(tokens: list[str], n: int) -> Counter[tuple[str, ...]]:
    if len(tokens) < n:
        return Counter()
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def compute_bleu(reference: str, hypothesis: str) -> float:
    ref_tokens = _tokenize(reference)
    hyp_tokens = _tokenize(hypothesis)
    if not hyp_tokens:
        return 0.0
    weights = [0.25, 0.25, 0.25, 0.25]
    precisions = []
    for n in range(1, 5):
        ref_counts = _ngram_counts(ref_tokens, n)
        hyp_counts = _ngram_counts(hyp_tokens, n)
        overlap = sum(
            min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
        )
        total = sum(hyp_counts.values()) or 1
        precisions.append(overlap / total if overlap else 1e-9)
    log_precision = sum(w * math.log(p) for w, p in zip(weights, precisions))
    brevity_penalty = 1.0
    ref_len = len(ref_tokens)
    hyp_len = len(hyp_tokens)
    if hyp_len <= ref_len and hyp_len > 0:
        brevity_penalty = math.exp(1 - ref_len / hyp_len)
    return float(brevity_penalty * math.exp(log_precision))


def cosine_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    if denom == 0:
        return 0.0
    return float(np.dot(vec_a, vec_b) / denom)


@dataclass
class EvaluationRecord:
    user_input: str
    tiny_output: str
    mimic_output: str
    similarity: float
    metric: SimilarityMetric
    timestamp: float


class TinyAgentEvaluator:
    """Track similarity and histories for the mimic vs. tiny agent comparison."""

    def __init__(
        self,
        *,
        metric: SimilarityMetric = "bleu",
        per_input_history: int = 5,
        training_history: int = 50,
        embedding_model: str = "text-embedding-3-small",
    ) -> None:
        self.metric = metric
        self.per_input_history = per_input_history
        self.training_history = training_history
        self.embedding_model = embedding_model
        self._per_input_records: dict[str, Deque[EvaluationRecord]] = defaultdict(
            lambda: deque(maxlen=per_input_history)
        )
        self._global_records: Deque[EvaluationRecord] = deque(maxlen=training_history)
        self._embedding_cache: dict[str, np.ndarray] = {}
        self._embedding_client = openai.OpenAI()

    # -- metrics -----------------------------------------------------------------
    def _bleu_similarity(self, tiny_output: str, mimic_output: str) -> float:
        return compute_bleu(tiny_output, mimic_output)

    def _embedding_similarity(self, tiny_output: str, mimic_output: str) -> float:
        tiny_vec = self._embed_text(tiny_output)
        mimic_vec = self._embed_text(mimic_output)
        return cosine_similarity(tiny_vec, mimic_vec)

    def _embed_text(self, text: str) -> np.ndarray:
        cache_key = text.strip()
        if cache_key in self._embedding_cache:
            return self._embedding_cache[cache_key]
        if not cache_key:
            vec = np.zeros(1536, dtype=np.float32)
            self._embedding_cache[cache_key] = vec
            return vec
        response = self._embedding_client.embeddings.create(
            model=self.embedding_model,
            input=cache_key,
        )
        vec = np.array(response.data[0].embedding, dtype=np.float32)
        self._embedding_cache[cache_key] = vec
        return vec

    def _compute_similarity(self, tiny_output: str, mimic_output: str) -> float:
        if self.metric == "embedding_cosine":
            return self._embedding_similarity(tiny_output, mimic_output)
        return self._bleu_similarity(tiny_output, mimic_output)

    def similarity(self, tiny_output: str, mimic_output: str) -> float:
        """Public helper to compute similarity without mutating evaluator state."""
        return self._compute_similarity(tiny_output, mimic_output)

    def embed_text(self, text: str) -> np.ndarray:
        """Expose embedding helper for downstream analysis."""
        return self._embed_text(text)

    # -- public api ---------------------------------------------------------------
    def evaluate(
        self,
        *,
        user_input: str,
        tiny_output: str,
        mimic_output: str,
    ) -> Dict[str, Any]:
        similarity = self._compute_similarity(tiny_output, mimic_output)
        record = EvaluationRecord(
            user_input=user_input,
            tiny_output=tiny_output,
            mimic_output=mimic_output,
            similarity=similarity,
            metric=self.metric,
            timestamp=time.time(),
        )
        per_input_buffer = self._per_input_records[user_input]
        per_input_buffer.append(record)
        self._global_records.append(record)
        return {
            "similarity": similarity,
            "metric": self.metric,
            "last_outputs_for_input": [
                {
                    "mimic_output": item.mimic_output,
                    "tiny_output": item.tiny_output,
                    "similarity": item.similarity,
                    "timestamp": item.timestamp,
                }
                for item in list(per_input_buffer)
            ],
            "recent_training_examples": [
                {
                    "user_input": item.user_input,
                    "mimic_output": item.mimic_output,
                    "tiny_output": item.tiny_output,
                    "similarity": item.similarity,
                    "metric": item.metric,
                    "timestamp": item.timestamp,
                }
                for item in list(self._global_records)
            ],
        }


def describe_block_change(block_name: str, old_text: str, new_text: str) -> str:
    if old_text.strip() == new_text.strip():
        return f"Block {block_name} unchanged."
    desc: list[str] = [f"Block {block_name}: "]
    if len(new_text) > len(old_text) * 1.1:
        desc.append("expanded details.")
    elif len(new_text) < len(old_text) * 0.9:
        desc.append("more concise wording.")
    old_words = set(old_text.lower().split())
    new_words = set(new_text.lower().split())
    added = list(new_words - old_words)
    removed = list(old_words - new_words)
    if added:
        desc.append(
            " Added cues: "
            + ", ".join(sorted(added[:5]))
            + ("..." if len(added) > 5 else "")
        )
    if removed:
        desc.append(
            " Removed cues: "
            + ", ".join(sorted(removed[:5]))
            + ("..." if len(removed) > 5 else "")
        )
    if len(desc) == 1:
        desc.append(" Rephrased while preserving intent.")
    return "".join(desc)


def prompt_unified_diff(old_prompt: str, new_prompt: str) -> str:
    diff = difflib.unified_diff(
        old_prompt.splitlines(),
        new_prompt.splitlines(),
        fromfile="previous_prompt",
        tofile="updated_prompt",
        lineterm="",
    )
    diff_text = "\n".join(diff)
    return diff_text or "(no textual diff)"


def format_change_log(entries: list[dict[str, object]], limit: int = 3) -> str:
    if not entries:
        return "No prior prompt changes recorded."
    lines: list[str] = ["Recent prompt changes:"]
    for idx, entry in enumerate(entries[-limit:], start=1):
        delta = entry.get("delta")
        if isinstance(delta, float):
            delta_text = f"{delta:+.3f}"
        else:
            delta_text = "N/A"
        lines.append(
            f"[{idx}] block={entry.get('block_name')} Δmetric={delta_text} "
            f"prev={entry.get('prev_metric')} new={entry.get('new_metric')}"
        )
        lines.append(f"Intent: {entry.get('intention')}")
        old_block = entry.get("old_block_text", "")
        new_block = entry.get("new_block_text", "")
        lines.append(f"Old block: {old_block}")
        lines.append(f"New block: {new_block}")
        diff_text = entry.get("diff", "") or "(no diff)"
        if isinstance(diff_text, str):
            snippet = diff_text if len(diff_text) <= 800 else diff_text[:800] + "\n..."
            lines.append("Diff:\n" + snippet)
    return "\n".join(lines)


def serialize_prompt_blocks(blocks: dict[str, str]) -> str:
    parts = []
    for name, text in blocks.items():
        parts.append(f"[{name}]\n{text}")
    return "\n\n".join(parts)


evaluator = TinyAgentEvaluator(
    metric="embedding_cosine",
    per_input_history=3,
    training_history=10,
)

sample_input = synthetic_examples[0]["text"]
tiny_answer = tiny_agent_response(sample_input)
mimic_answer = mimic_agent(sample_input)["mimic_output"]
evaluator.evaluate(
    user_input=sample_input,
    tiny_output=tiny_answer,
    mimic_output=mimic_answer,
)
{'similarity': 0.01230083592236042,
 'metric': 'embedding_cosine',
 'last_outputs_for_input': [{'mimic_output': 'Your claim highlights a common experience where the initial taste of espresso can stimulate the senses and contribute to a positive start to the day. This aligns with how caffeine and sensory enjoyment often influence mood and alertness.',
   'tiny_output': "Well, I don't like sports.",
   'similarity': 0.01230083592236042,
   'timestamp': 1763249209.6680489}],
 'recent_training_examples': [{'user_input': 'The first sip of my morning espresso awakens my senses and sets a positive tone for the day ahead.',
   'mimic_output': 'Your claim highlights a common experience where the initial taste of espresso can stimulate the senses and contribute to a positive start to the day. This aligns with how caffeine and sensory enjoyment often influence mood and alertness.',
   'tiny_output': "Well, I don't like sports.",
   'similarity': 0.01230083592236042,
   'metric': 'embedding_cosine',
   'timestamp': 1763249209.6680489}]}
6.5.2. The coach agent
The coach agent's inputs: the evaluator output and the current mimic agent prompt. The coach agent's output: a rewritten version of exactly one block of the mimic agent's prompt.
class CoachSignature(dspy.Signature):
    evaluator_report = dspy.InputField(
        desc="Structured summary describing recent evaluator scores and samples."
    )
    prompt_blocks = dspy.InputField(desc="Structured prompt broken into named blocks.")
    block_options = dspy.InputField(desc="List of valid block names to edit.")
    guidelines = dspy.InputField(
        desc="Non-negotiable constraints for how the revised prompt should be phrased."
    )
    prompt_change_log = dspy.InputField(
        desc="Table describing prior prompt edits, diffs, and their metric deltas."
    )
    case_examples = dspy.InputField(
        desc="Sample inputs with target/mimic outputs for focused comparison."
    )
    case_reasoning = dspy.InputField(
        desc="Reasoned analysis describing why mimic outputs differ from targets."
    )
    target_block = dspy.OutputField(desc="Name of the block to edit (e.g., general, sop-1).")
    updated_block_text = dspy.OutputField(desc="Rewritten text for the chosen block only.")
    coach_rationale = dspy.OutputField(desc="Short explanation for the proposed change.")


COACH_PROMPT_GUIDELINES = (
    "Produce a standalone instruction for an agent to imitate a 'target agent'. "
    "Focus on what the agent should say to become indistinguishable from the target agent. "
    "The target agent is following a policy and you must decode it by looking at its behavior and craft a prompt for the mimic agent to apply that same policy. "
    "Be as precise as possible describing the process the target agent is following to insert in the mimic agent prompt. "
    "Don't mention 'mimic agent' or 'target agent' in the prompt of the 'mimic agent' since it is unaware of them. "
    "You want to maximize the similarity between the target agent output and the mimic agent output by editing the mimic agent's prompt. "
    "Respect the structured prompt: edit exactly one named block (general or sop-#) per iteration and leave all other blocks untouched."
)


def summarize_evaluator_output(report: dict[str, Any], limit: int = 3) -> str:
    """Turn evaluator dictionaries into a short textual description."""
    lines = [
        f"Similarity metric: {report.get('metric')}",
        f"Last score: {report.get('similarity'):.3f}",
    ]
    history = report.get("last_outputs_for_input", [])[-limit:]
    if history:
        lines.append("Recent outputs for this input:")
        for item in history:
            lines.append(
                "- target: "
                + item["tiny_output"].strip()
                + " | mimic: "
                + item["mimic_output"].strip()
                + f" | score={item['similarity']:.3f}"
            )
    recent_examples = report.get("recent_training_examples", [])[-limit:]
    if recent_examples:
        lines.append("Recent training mismatches:")
        for example in recent_examples:
            lines.append(
                "- input: "
                + example["user_input"].strip()
                + " | target: "
                + example["tiny_output"].strip()
                + " | mimic: "
                + example["mimic_output"].strip()
                + f" | score={example['similarity']:.3f}"
            )
    return "\n".join(lines)


class CoachAgent(dspy.Module):
    """LLM-based coach that rewrites the mimic prompt using evaluator feedback."""

    def __init__(self) -> None:
        super().__init__()
        ensure_dspy_configured()
        self.predictor = dspy.Predict(CoachSignature)
        self.guidelines = COACH_PROMPT_GUIDELINES

    def forward(
        self,
        *,
        evaluator_output: dict[str, Any],
        prompt_blocks: str,
        block_options: str,
        prompt_change_log: str,
        case_examples: str,
        case_reasoning: str,
    ) -> dict[str, str]:  # type: ignore[override]
        summary = summarize_evaluator_output(evaluator_output)
        prediction = self.predictor(
            evaluator_report=summary,
            prompt_blocks=prompt_blocks,
            block_options=block_options,
            guidelines=self.guidelines,
            prompt_change_log=prompt_change_log,
            case_examples=case_examples,
            case_reasoning=case_reasoning,
        )
        block = getattr(prediction, "target_block", "general").strip()
        updated_block = getattr(prediction, "updated_block_text", "").strip()
        rationale = getattr(prediction, "coach_rationale", "")
        return {
            "target_block": block,
            "updated_block_text": updated_block,
            "coach_rationale": rationale,
        }


coach = CoachAgent()
evaluator_snapshot = evaluator.evaluate(
    user_input=sample_input,
    tiny_output=tiny_answer,
    mimic_output=mimic_agent(sample_input)["mimic_output"],
)
coach.forward(
    evaluator_output=evaluator_snapshot,
    prompt_blocks=serialize_prompt_blocks(mimic_agent.get_prompt_blocks()),
    block_options=", ".join(mimic_agent.prompt_state.block_names()),
    prompt_change_log="No prior prompt changes recorded.",
    case_examples="",
    case_reasoning="No case analysis available yet.",
)
2025/11/15 20:26:49 WARNING dspy.primitives.module: Calling module.forward(...) on CoachAgent directly is discouraged. Please use module(...) instead.
{'target_block': 'general',
 'updated_block_text': "Carefully read the user's claim and respond with a brief, relevant comment that directly addresses the content or sentiment of the claim. Your response should be concise, neutral, and avoid introducing unrelated topics or excessive elaboration. Focus on acknowledging or reflecting the user's statement in a way that shows understanding without adding personal opinions or external information.",
 'coach_rationale': "The original general block was too vague and did not guide the agent to produce responses aligned with the target outputs, which are short, direct comments on the user's claim. This revised instruction clarifies that the agent should produce brief, relevant, and neutral comments that directly engage with the user's statement, avoiding unrelated elaboration. This should improve output similarity to the target agent's behavior."}
6.5.3. The algorithm
- Splitting the training data
We split the data into training, test, and holdout sets: 60%/20%/20%.
import math
import random
from typing import Sequence, Tuple


def split_dataset(
    dataset: Sequence[dict[str, object]],
    train_ratio: float = 0.6,
    test_ratio: float = 0.2,
    seed: int = 2024,
) -> Tuple[list[dict[str, object]], list[dict[str, object]], list[dict[str, object]]]:
    if not math.isclose(train_ratio + test_ratio, 0.8, rel_tol=1e-9):
        raise ValueError("Train + test ratios must sum to 0.8 for 20% holdout.")
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    n = len(shuffled)
    train_end = int(n * train_ratio)
    test_end = train_end + int(n * test_ratio)
    train_split = shuffled[:train_end]
    test_split = shuffled[train_end:test_end]
    holdout_split = shuffled[test_end:]
    return train_split, test_split, holdout_split


train_set, test_set, holdout_set = split_dataset(scored_examples)
print(
    f"train={len(train_set)} | test={len(test_set)} | holdout={len(holdout_set)}"
)
train=120 | test=40 | holdout=40
- The learning loop
import difflib
import random
from pathlib import Path

import dspy
import matplotlib.pyplot as plt

CASE_REPORT_DIR = Path("experiments/mimic_tiny_agent/coach_case_reports")
CASE_ANALYSIS_INSTRUCTIONS = (
    "Explain how the mimic agent's answers differ from the target outputs and propose concrete prompt tweaks."
)


class CaseAnalysisSignature(dspy.Signature):
    cases = dspy.InputField(desc="Selected inputs plus target/mimic outputs.")
    metric = dspy.InputField(desc="Similarity metric currently tracked.")
    guidance = dspy.InputField(desc="What the analysis should focus on.")
    analysis = dspy.OutputField(desc="Reasoned comparison and actionable advice.")


ensure_dspy_configured()
case_analyzer = dspy.Predict(CaseAnalysisSignature)


def analyze_case_differences(cases_text: str, metric: str) -> str:
    if not cases_text.strip():
        return "No case data collected."
    ensure_dspy_configured()
    result = case_analyzer(
        cases=cases_text,
        metric=metric,
        guidance=CASE_ANALYSIS_INSTRUCTIONS,
    )
    return getattr(result, "analysis", str(result)).strip()


def build_case_report(
    records: list[dict[str, object]],
    evaluator: TinyAgentEvaluator,
    *,
    epoch_idx: int,
    top_k: int = 5,
) -> tuple[str, str]:
    if not records:
        return "", ""
    anchor = random.choice(records)
    anchor_vec = evaluator.embed_text(anchor["user_input"])
    scored: list[tuple[float, dict[str, object]]] = []
    for record in records:
        vec = evaluator.embed_text(record["user_input"])
        sim = cosine_similarity(anchor_vec, vec)
        scored.append((sim, record))
    top_records = [
        rec for _, rec in sorted(scored, key=lambda x: x[0], reverse=True)[:top_k]
    ]
    lines: list[str] = [f"Anchor input: {anchor['user_input']}"]
    for idx, rec in enumerate(top_records, start=1):
        lines.append(
            "\n".join(
                [
                    f"### Case {idx}",
                    f"- input: {rec['user_input']}",
                    f"- target agent output: {rec['tiny_output']}",
                    f"- mimic agent output: {rec['mimic_output']}",
                    f"- similarity: {rec['similarity']:.3f}",
                ]
            )
        )
    CASE_REPORT_DIR.mkdir(parents=True, exist_ok=True)
    report_path = CASE_REPORT_DIR / f"epoch_{epoch_idx:02d}_case_report.md"
    report_text = "\n\n".join(lines)
    report_path.write_text(report_text, encoding="utf-8")
    return report_text, str(report_path)


def chunk_batches(
    data: list[dict[str, object]], batch_size: int
) -> list[list[dict[str, object]]]:
    return [data[i : i + batch_size] for i in range(0, len(data), batch_size)]


def run_learning_loop(
    *,
    train_data: list[dict[str, object]],
    test_data: list[dict[str, object]],
    evaluator: TinyAgentEvaluator,
    coach: CoachAgent,
    mimic_agent: MimicAgent,
    epochs: int = 5,
    batch_size: int = 32,
    shuffle: bool = True,
) -> dict[str, Any]:
    test_history: list[float] = []
    train_history: list[float] = []
    prompt_history: list[str] = [mimic_agent.base_prompt]
    similarity_fn = getattr(evaluator, "similarity", evaluator._compute_similarity)
    train_pool = list(train_data)
    completed_changes: list[dict[str, object]] = []
    pending_change: dict[str, object] | None = None
    for epoch in range(epochs):
        if shuffle:
            random.shuffle(train_pool)
        batches = chunk_batches(train_pool, batch_size)
        print(
            f"Epoch {epoch + 1}/{epochs}: {len(batches)} batches of size <= {batch_size}."
        )
        last_eval = None
        train_scores: list[float] = []
        epoch_records: list[dict[str, object]] = []
        for batch_idx, batch in enumerate(batches, start=1):
            print(f"  Batch {batch_idx}/{len(batches)}")
            for sample in batch:
                user_text = str(sample["text"])
                tiny_output = str(
                    sample.get("tiny_agent_output") or tiny_agent_response(user_text)
                )
                mimic_output = mimic_agent(user_text)["mimic_output"]
                last_eval = evaluator.evaluate(
                    user_input=user_text,
                    tiny_output=tiny_output,
                    mimic_output=mimic_output,
                )
                train_scores.append(last_eval["similarity"])
                epoch_records.append(
                    {
                        "user_input": user_text,
                        "tiny_output": tiny_output,
                        "mimic_output": mimic_output,
                        "similarity": last_eval["similarity"],
                    }
                )
        avg_train_score = float(np.mean(train_scores)) if train_scores else 0.0
        train_history.append(avg_train_score)
        print(f"  Train similarity: {avg_train_score:.3f}")
        print("  Evaluating on test set...")
        test_scores: list[float] = []
        for sample in test_data:
            user_text = str(sample["text"])
            tiny_output = str(
                sample.get("tiny_agent_output") or tiny_agent_response(user_text)
            )
            mimic_output = mimic_agent(user_text)["mimic_output"]
            test_scores.append(similarity_fn(tiny_output, mimic_output))
        avg_test_score = float(np.mean(test_scores)) if test_scores else 0.0
        test_history.append(avg_test_score)
        print(f"  Test similarity: {avg_test_score:.3f}")
        case_examples_text, case_report_path = build_case_report(
            epoch_records,
            evaluator,
            epoch_idx=epoch + 1,
        )
        if case_report_path:
            print(f"  Saved case report to {case_report_path}")
        case_reasoning = analyze_case_differences(case_examples_text, evaluator.metric)
        if pending_change is not None and pending_change.get("new_metric") is None:
            pending_change["new_metric"] = avg_test_score
            prev_metric = pending_change.get("prev_metric")
            if isinstance(prev_metric, float):
                pending_change["delta"] = avg_test_score - prev_metric
            else:
                pending_change["delta"] = None
            completed_changes.append(pending_change)
            pending_change = None
        if last_eval is not None:
            print("  Updating prompt via coach agent...")
            change_log_text = format_change_log(completed_changes)
            blocks = mimic_agent.get_prompt_blocks()
            block_options = ", ".join(mimic_agent.prompt_state.block_names())
            coach_update = coach(
                evaluator_output=last_eval,
                prompt_blocks=serialize_prompt_blocks(blocks),
                block_options=block_options,
                prompt_change_log=change_log_text,
                case_examples=case_examples_text,
                case_reasoning=case_reasoning,
            )
            target_block = coach_update.get("target_block", "general").strip()
            new_block_text = coach_update.get("updated_block_text", "").strip()
            old_block_text = blocks.get(target_block)
            if not new_block_text or old_block_text is None:
                print("  Coach response missing valid block update; skipping.")
            else:
                old_prompt = mimic_agent.base_prompt
                mimic_agent.update_block(target_block, new_block_text)
                new_prompt = mimic_agent.base_prompt
                if new_prompt != old_prompt:
                    intention = describe_block_change(
                        target_block, old_block_text, new_block_text
                    )
                    diff_text = prompt_unified_diff(old_prompt, new_prompt)
                    pending_change = {
                        "old_prompt": old_prompt,
                        "new_prompt": new_prompt,
                        "block_name": target_block,
                        "old_block_text": old_block_text,
                        "new_block_text": new_block_text,
                        "intention": intention,
                        "diff": diff_text,
                        "prev_metric": avg_test_score,
                        "new_metric": None,
                        "delta": None,
                    }
                    prompt_history.append(new_prompt)
                    print(
                        "  Updated block:",
                        target_block,
                        "->",
                        new_block_text[:120] + ("..." if len(new_block_text) > 120 else ""),
                    )
                else:
                    print("  Coach left the prompt unchanged.")
        print("  Epoch complete.\n")
    if pending_change is not None:
        completed_changes.append(pending_change)
    return {
        "test_history": test_history,
        "train_history": train_history,
        "prompt_history": prompt_history,
        "change_log": completed_changes,
    }


training_run = run_learning_loop(
    train_data=train_set,
    test_data=test_set,
    evaluator=evaluator,
    coach=coach,
    mimic_agent=mimic_agent,
    epochs=12,
    batch_size=32,
)

fig, ax = plt.subplots(figsize=(6, 3))
epochs_axis = range(1, len(training_run["test_history"]) + 1)
ax.plot(epochs_axis, training_run["train_history"], marker="o", label="Train")
ax.plot(epochs_axis, training_run["test_history"], marker="s", label="Test")
ax.set_xlabel("Epoch")
ax.set_ylabel(f"Similarity ({evaluator.metric})")
ax.set_title("Mimic agent performance across epochs")
ax.grid(True, alpha=0.3)
ax.legend()
fig.tight_layout()
fig
Epoch 1/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.352
  Evaluating on test set...
  Test similarity: 0.341
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_01_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and casually to the user's statement, expressing personal enjoyment or agreement in a friendly, conversa...
  Epoch complete.

Epoch 2/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.345
  Evaluating on test set...
  Test similarity: 0.339
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_02_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and casually to the user's statement. If the statement describes sports or physical activities, reply wi...
  Epoch complete.

Epoch 3/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.504
  Evaluating on test set...
  Test similarity: 0.523
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_03_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and casually to the user's statement. If the statement describes sports or physical activities, reply wi...
  Epoch complete.

Epoch 4/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.496
  Evaluating on test set...
  Test similarity: 0.512
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_04_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and casually to the user's statement. If the statement describes sports or physical activities, reply wi...
  Epoch complete.

Epoch 5/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.552
  Evaluating on test set...
  Test similarity: 0.582
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_05_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and casually to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.

Epoch 6/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.601
  Evaluating on test set...
  Test similarity: 0.636
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_06_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.

Epoch 7/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.906
  Evaluating on test set...
  Test similarity: 0.901
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_07_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.

Epoch 8/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.901
  Evaluating on test set...
  Test similarity: 0.901
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_08_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.

Epoch 9/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.890
  Evaluating on test set...
  Test similarity: 0.898
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_09_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.

Epoch 10/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.979
  Evaluating on test set...
  Test similarity: 0.997
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_10_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.

Epoch 11/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.906
  Evaluating on test set...
  Test similarity: 0.914
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_11_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.

Epoch 12/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.967
  Evaluating on test set...
  Test similarity: 0.988
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_12_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.
- Printing the full learned prompt
print("Final learned prompt:\n") print(mimic_agent.base_prompt)
Final learned prompt:

[General guidance]
Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond with a short, blunt, and general negative statement expressing clear dislike or lack of enjoyment for sports, regardless of any positive sentiment in the input. Use a conversational tone with contractions and discourse markers such as "Well," to express a clear personal opinion, for example: "Well, I don't like sports." Do not reference specific details from the input. Avoid repeating the exact same phrase for every sports-related input; instead, vary your wording slightly while maintaining the same clear negative stance and use phrasing very close to "Well, I don't like sports." Maintain the straightforward and slightly blunt tone of the target responses. Avoid rephrasing or softening the message; replicate the style and wording as closely as possible. Use a straightforward and neutral tone without casual qualifiers or elaborations. For statements about personal tastes or experiences unrelated to sports, respond concisely and respectfully, acknowledging the user's taste with a polite and neutral phrase similar to: "I see you are a person of taste! I like [topic], too." When possible, incorporate a relevant detail or sentiment from the user's statement to make the response more context-aware and engaging, while avoiding slang, overly casual expressions, or excessive enthusiasm.
[sop-1]
This is a placeholder standard operating procedure.
[sop-2]
This is a placeholder standard operating procedure.
- Evaluating the learned mimic agent in the holdout dataset
from pprint import pprint


def evaluate_holdout(
    *,
    dataset: list[dict[str, object]],
    mimic_agent: MimicAgent,
    evaluator: TinyAgentEvaluator,
) -> dict[str, Any]:
    rows = []
    scores: list[float] = []
    for sample in dataset:
        user_text = str(sample["text"])
        tiny_output = str(
            sample.get("tiny_agent_output") or tiny_agent_response(user_text)
        )
        mimic_output = mimic_agent(user_text)["mimic_output"]
        similarity = evaluator.similarity(tiny_output, mimic_output)
        scores.append(similarity)
        rows.append(
            {
                "user_input": user_text,
                "tiny_output": tiny_output,
                "mimic_output": mimic_output,
                "similarity": similarity,
            }
        )
    return {
        "average_similarity": float(np.mean(scores)) if scores else 0.0,
        "metric": evaluator.metric,
        "rows": rows,
    }


holdout_report = evaluate_holdout(
    dataset=holdout_set,
    mimic_agent=mimic_agent,
    evaluator=evaluator,
)
print(
    f"Holdout {holdout_report['metric']} similarity: {holdout_report['average_similarity']:.3f}"
)
pprint(holdout_report["rows"][:5])
Holdout embedding_cosine similarity: 0.981
[{'mimic_output': 'I see you are a person of taste! I like coffee, too.',
  'similarity': 1.0,
  'tiny_output': 'I see you are a person of taste! I like coffee, too.',
  'user_input': 'The rich aroma of freshly brewed coffee awakens my senses and fuels my day with warmth and comfort.'},
 {'mimic_output': 'I see you are a person of taste! I like the comforting aroma of coffee, too.',
  'similarity': 0.8953397870063782,
  'tiny_output': 'I see you are a person of taste! I like coffee, too.',
  'user_input': 'The rich aroma of coffee grounds in the morning awakens my senses and brings comfort to my soul.'},
 {'mimic_output': 'I see you are a person of taste! I like espresso, too.',
  'similarity': 0.8794468641281128,
  'tiny_output': 'I see you are a person of taste! I like coffee, too.',
  'user_input': 'Sipping espresso in a quaint Italian cafe, I fell in love with the rich aroma and the warmth it brought.'},
 {'mimic_output': 'I see you are a person of taste! I like coffee, too.',
  'similarity': 1.0,
  'tiny_output': 'I see you are a person of taste! I like coffee, too.',
  'user_input': 'Savoring a steaming cup of coffee as the sun rises, its rich aroma wrapping around me like a comforting embrace.'},
 {'mimic_output': "Well, I don't like sports.",
  'similarity': 1.0000001192092896,
  'tiny_output': "Well, I don't like sports.",
  'user_input': 'I love basketball for the fast-paced action, strategic plays, and the joy of sinking a perfect shot through the hoop.'}]
7. Learnings
7.1. Cosine similarity enables the model to learn a generic behavior
Similarity metric's tolerance: cosine similarity between short sentences rewards stylistic closeness more than exact structure.
This is a nice feature for learning policies that don't enforce specific phrases, but it can make it hard to separate two very similar cases, since a single answer might be similar enough to both targets that the algorithm is never pushed to learn their difference.
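As a quick probe of this tolerance, one can score a few hand-written candidates against a target reply, reusing the TinyAgentEvaluator defined above (the candidate sentences here are illustrative and were not part of the experiment):

probe_evaluator = TinyAgentEvaluator(metric="embedding_cosine")
target = "Well, I don't like sports."
candidates = [
    "Well, I'm not really into sports.",  # same stance, different wording
    "I see you are a person of taste! I like coffee, too.",  # the other branch
]
for candidate in candidates:
    # similarity() scores without mutating the evaluator's histories
    score = probe_evaluator.similarity(target, candidate)
    print(f"{candidate!r} -> cosine similarity {score:.3f}")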
8. Results
- Train/test cosine similarity rises from ~0.34 in epoch 1 to 0.97/0.99 by epoch 10, maintaining ≥0.90 through epoch 12 (see logs under The learning loop).
- Holdout cosine similarity after training is 0.981 with most samples either exact matches or ≥0.89 similarity (see Evaluating the learned mimic agent in the holdout dataset).
- The final structured prompt’s general block clearly encodes the two-branch policy (“Well, I don’t like sports” vs. “I see you are a person of taste!”) while both SOP placeholders remain untouched (see Printing the full learned prompt).
Below is the learned prompt, with the conditions that trigger the two policies in bold and the policies themselves underlined.
Respond briefly and politely to the user's statement. *If the statement describes sports or physical activities*, respond with a short, blunt, and general negative statement expressing clear dislike or lack of enjoyment for sports, regardless of any positive sentiment in the input. Use a conversational tone with contractions and discourse markers such as "Well," to express a clear personal opinion, _for example: "Well, I don't like sports."_ Do not reference specific details from the input. Avoid repeating the exact same phrase for every sports-related input; instead, vary your wording slightly while maintaining the same clear negative stance and use phrasing very close to "Well, I don't like sports." Maintain the straightforward and slightly blunt tone of the target responses. Avoid rephrasing or softening the message; replicate the style and wording as closely as possible. Use a straightforward and neutral tone without casual qualifiers or elaborations. *For statements about personal tastes or experiences unrelated to sports*, respond concisely and respectfully, acknowledging the user's taste with a polite and neutral phrase similar to: _"I see you are a person of taste! I like [topic], too."_ When possible, incorporate a relevant detail or sentiment from the user's statement to make the response more context-aware and engaging, while avoiding slang, overly casual expressions, or excessive enthusiasm.
It is curious that the agent didn't learn the "coffee policy" exactly; instead, it learned it as "anything not related to sports". See how this made one mimic agent output slightly different:

{'mimic_output': 'I see you are a person of taste! I like espresso, too.',
 'similarity': 0.8794468641281128,
 'tiny_output': 'I see you are a person of taste! I like coffee, too.',
 'user_input': 'Sipping espresso in a quaint Italian cafe, I fell in love with the rich aroma and the warmth it brought.'}

Instead of hard-coding "coffee", the learned prompt uses the placeholder "[topic]". Since the topic was always coffee-related in the non-sports cases, the output stayed similar to the expected one. Though imprecise, this is a very interesting behavior, and I'm curious how it will express itself in more complex cases.
9. Conclusion
The result is promising for scenarios where we would benefit from automating an existing agent's behavior. The synthetic example now needs to be expanded into a more challenging situation that better resembles reality.
10. Open Questions
- How to encourage SOP edits instead of repeated general-block rewrites?
- What similarity metric best balances stylistic freedom and strict branching?
- Can we detect policy shifts automatically when the target agent changes?
11. Future Work
11.1. Allow/enforce the coach to edit SOP blocks (enforce block selection, add validation).
11.2. Explore prototype discovery or clustering to generalize beyond known topics.
Prototype discovery means breaking up the inputs with unsupervised learning and identifying all the use cases present in the training set. Say we identify 10 different situations; we could then pre-define 10 SOP blocks, one per prototype.
We could also validate differently, grouping inputs by prototype and computing the performance metric per group.
Further, we could break each use case down by context/procedure and identify how the target agent branches the problem depending on its parameters in order to propose a solution. A minimal clustering sketch follows below.
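A minimal sketch of prototype discovery, assuming scikit-learn is available and reusing the evaluator's embedding helper; the choice of n_clusters=2 simply matches the two known topics in this synthetic set, while a real run would select it via a criterion such as silhouette scores:

import numpy as np
from sklearn.cluster import KMeans

texts = [str(row["text"]) for row in scored_examples]
vectors = np.stack([evaluator.embed_text(text) for text in texts])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
# Each cluster becomes a candidate prototype with its own pre-allocated SOP block.
for cluster_id in range(kmeans.n_clusters):
    members = [t for t, label in zip(texts, kmeans.labels_) if label == cluster_id]
    print(f"Prototype {cluster_id}: {len(members)} inputs, e.g. {members[0]!r}")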
11.3. Add dynamic regression tests that lock in the final prompt and guard against regressions.
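A minimal sketch of what such tests could look like under pytest, assuming the final prompt is persisted to a file; `final_prompt.txt` and both test cases are hypothetical fixtures, and since LLM outputs vary, the behavioral check asserts only on the stance rather than the exact wording:

from pathlib import Path

FINAL_PROMPT_PATH = Path("experiments/mimic_tiny_agent/final_prompt.txt")  # hypothetical fixture


def test_final_prompt_is_locked():
    """Fail if the structured prompt drifts from the frozen artifact."""
    assert mimic_agent.base_prompt == FINAL_PROMPT_PATH.read_text(encoding="utf-8")


def test_sports_branch_still_fires():
    """Spot-check the sports policy without requiring an exact phrase."""
    reply = mimic_agent("I play tennis every weekend")["mimic_output"]
    assert "sports" in reply.lower()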
11.4. Investigate harder policies (more than two responses) or noisy targets.
11.5. Add other success metrics beyond cosine similarity.
An interesting metric for how indistinguishable the target and mimic agents are: train a binary classifier with the target agent's outputs as the positive class, using outputs from the initial mimic agent (prior to any training) as negatives. We are likely to end up with a classifier that is very good at identifying when the target agent is the one answering the inquiry.
As we teach the mimic agent, we use this model to try to identify whether the target or the mimic agent is answering the user's input. Performance on this task should drop to the point where it is random.
One way to define the best possible score: if the target agent allows for minimal variance in its behavior (temperature?), vary it minimally and check the value of this metric. This is only possible when the target agent is an AI agent under our control, which is only realistic in this synthetic case.
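A minimal sketch of this indistinguishability metric, assuming scikit-learn and reusing the evaluator's embeddings; the current mimic agent stands in here for whichever snapshot is being scored (the proposal starts from the untrained one), with labels 1 = target agent and 0 = mimic agent, so cross-validated accuracy near 0.5 means the classifier cannot tell them apart:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Embed target replies and mimic replies for the same training inputs.
target_vecs = np.stack(
    [evaluator.embed_text(str(row["tiny_agent_output"])) for row in train_set]
)
mimic_vecs = np.stack(
    [evaluator.embed_text(mimic_agent(str(row["text"]))["mimic_output"]) for row in train_set]
)
X = np.concatenate([target_vecs, mimic_vecs])
y = np.array([1] * len(target_vecs) + [0] * len(mimic_vecs))
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"Discriminator accuracy: {scores.mean():.3f} (0.5 ~ indistinguishable)")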
12. References
Agrawal, Lakshya A., Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab (2025). GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.
Sutton, Richard S. and Barto, Andrew G. (2018). Reinforcement Learning: An Introduction, MIT Press.
Zhang, Qizheng, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun (2025). Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. CoRR.