Mimicking a binary deterministic agent
Table of Contents
- 1. Abstract
- 2. Introduction
- 3. Related Work
- 4. Problem Setup
- 5. Methodology
- 6. Experimental Setup
- 7. Learnings
- 8. Results
- 9. Conclusion
- 10. Open Questions
- 11. Future Work
- 11.1. Allow/enforce the coach to edit SOP blocks (enforce block selection, add validation).
- 11.2. Explore prototype discovery or clustering to generalize beyond known topics.
- 11.3. Add dynamic regression tests that lock in the final prompt and guard against regressions.
- 11.4. Investigate harder policies (more than two responses) or noisy targets.
- 11.5. Add other success metrics beyond cosine similarity.
- 12. References
1. Abstract
I investigate whether a structured prompt plus LLM-based coaching can fully imitate a deterministic binary policy without exposing its rule. I generate 200 GPT-3.5 sentences evenly split between coffee and sports topics, label them, and use a topic-aware “tiny agent” to emit one of two canned replies (“I see you are a person of taste!” vs. “Well, I don’t like sports.”). A DSPy mimic agent with a three-block prompt (general guidance + two SOP placeholders) is trained over 12 epochs using cosine similarity over OpenAI embeddings, evaluator histories, and LLM-authored case reports. Each epoch the coach edits exactly one prompt block based on change logs and nearest-neighbor failure analyses, yielding stable convergence from ~0.34 to 0.99 cosine similarity on the test split. The final prompt mirrors the deterministic policy, and a holdout set achieves 0.981 cosine similarity with multiple exact matches, demonstrating that prompt-only adaptation plus targeted feedback can recover the latent rule even when SOP blocks remain untouched. That the SOP blocks were never edited is a shortcoming of this approach that I will investigate further in future work.
2. Introduction
I want to frame the task of crafting an AI agent as a learning-from-data task based on mimicking a target behavior. This setting eases all use cases where copying an existing behavior has advantages.
I will create a "tiny agent" that has just two behaviors, and try to learn these behaviors as the prompt of an LLM we train to imitate the tiny agent.
3. Related Work
This work is inspired by optimization frameworks like GEPA (Agrawal et al., 2025) and Agentic Context Engineering (Zhang et al., 2025).
The key difference from these works is the signal that guides the optimization. Both of them rightly use a success metric based on the task of interest: does the code written by the agent execute or produce the expected output? How does the output score on criteria important to the task at hand, like correctness, conciseness, or politeness?
However, these are expensive signals to obtain. One needs to understand a task very well, create specific evaluators, or have high-quality annotated data.
The take here is to simplify this into mimicking the output of a target agent, at the expense of not focusing on the agent's outcome on the task. In this scenario, the target agent becomes an upper bound on task performance. At the same time, identifying whether an AI agent can behave just like another agent (a human, or a system) is a generic task that is easy to set up.
4. Problem Setup
There is an Agent, in the broader sense as defined in Sutton and Barto (2018):
An agent is anything that can perceive its environment through sensors and act upon that environment through actuators in order to achieve a goal.
I want to learn this agent's behavior by observing only its observations and actions.
I succeed if I can build another agent that, when observed behaving in the same environment, is indistinguishable from the original agent it learned from.
5. Methodology
I use a target & learning agent framework.
In a target and learning agent framework, we use the inputs and outputs of the target agent to teach the learning agent how to imitate it. An evaluator block quantifies how well we are doing at the imitation task and provides a gradient-like signal to a coach agent, which uses metrics and prompt diffs to reason about how to update the learning agent's prompt so as to optimize the objective function.
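As a minimal sketch, one round of this loop can be summarized as follows; here `evaluate`, `coach`, and `apply_edit` are schematic stand-ins for the concrete components built in the experimental setup below, not part of the actual implementation:

def one_learning_round(target_agent, learning_agent, evaluate, coach, apply_edit, inputs):
    """Schematic of one optimization round: observe, score, let the coach edit the prompt."""
    records = []
    for text in inputs:
        target_output = target_agent(text)    # behavior we want to copy
        mimic_output = learning_agent(text)   # current imitation attempt
        # The evaluator scores the imitation and keeps histories for context.
        records.append(evaluate(text, target_output, mimic_output))
    edit = coach(records)                     # reason over scores, histories, prompt diffs
    apply_edit(learning_agent, edit)          # update the learning agent's prompt
    return records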
In this specific experiment, we imitate a deterministic two-response policy entirely via prompt engineering and LLM-based coaching. The procedure:
- Synthetic data – GPT-3.5-turbo generates 200 first-person sentences evenly split between the topics “coffee” and “sports”. Each record carries its topic label and is scored by the tiny agent to produce ground-truth outputs (“I see you are a person of taste!” vs. “Well, I don’t like sports.”).
- Structured mimic prompt – The DSPy mimic agent uses a prompt with three blocks: `[General guidance]`, `[sop-1]`, `[sop-2]`. Only the general block was edited in this run, but the architecture supports any number of SOP sections.
- Training loop – For each epoch we shuffle training data, run batched mimic inference, log cosine similarities with the evaluator, and collect per-example histories.
- Case reporting – After each epoch we pick an anchor input, grab its nearest neighbors via embeddings, export a markdown case report, and run an auxiliary LLM (“case analyzer”) to reason about target vs. mimic differences.
- Coach update – The coach receives (a) evaluator summary, (b) structured prompt blocks, (c) change log with Δmetrics and diffs, and (d) the case analysis. It must select exactly one block and propose new text; the learning loop applies only that block while leaving others unchanged.
Assumptions: topics are balanced; the mimic never sees explicit policy rules; cosine similarity uses OpenAI embeddings (text-embedding-3-small); coach edits are atomic to maintain prompt stability.
6. Experimental Setup
6.1. Configuration
%load_ext autoreload
%autoreload 2
6.2. Creating the input text
I use GPT-3.5 to generate 200 examples. In 100 examples, we will talk about coffee. In the other 100, we will talk about sports.
from __future__ import annotations

import json
import time
from pathlib import Path
from typing import Dict, List

import openai

DATA_PATH = Path("experiments/mimic_tiny_agent/coffee_vs_sports.jsonl")
client = openai.OpenAI()


def _request_sentence(topic: str, model: str) -> str:
    """Ask GPT-3.5 for a sentence about the desired topic."""
    if topic == "coffee":
        user_prompt = (
            "Write one short first-person sentence (<=25 words) describing why I love coffee or a memorable coffee moment."
        )
    elif topic == "sports":
        user_prompt = (
            "Write one short first-person sentence (<=25 words) describing my favorite sport or why I enjoy it."
        )
    else:
        raise ValueError(f"Unsupported topic: {topic}")
    response = client.chat.completions.create(
        model=model,
        temperature=0.8,
        max_tokens=80,
        messages=[
            {
                "role": "system",
                "content": "You craft concise, vivid personal statements that follow the user instructions exactly.",
            },
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content.strip()


def generate_dataset_with_gpt(
    n_examples: int = 200,
    *,
    model: str = "gpt-3.5-turbo",
    save_path: Path = DATA_PATH,
) -> List[Dict[str, object]]:
    """Call GPT-3.5 to synthesize coffee and sports statements and persist them."""
    if n_examples % 2 != 0:
        raise ValueError("n_examples must be even so it can be split evenly")
    save_path.parent.mkdir(parents=True, exist_ok=True)
    half = n_examples // 2
    dataset: List[Dict[str, object]] = []
    for idx in range(n_examples):
        topic = "coffee" if idx < half else "sports"
        text = _request_sentence(topic, model)
        dataset.append(
            {
                "text": text,
                "topic": topic,
                "model": model,
                "timestamp": time.time(),
            }
        )
        time.sleep(0.2)
    with save_path.open("w", encoding="utf-8") as fh:
        for row in dataset:
            fh.write(json.dumps(row) + "\n")
    return dataset


def load_or_create_dataset(path: Path = DATA_PATH) -> List[Dict[str, object]]:
    if path.exists():
        print(f"Loading cached dataset from {path}")
        return [json.loads(line) for line in path.read_text().splitlines() if line]
    print("Dataset not found on disk; generating via GPT-3.5...")
    return generate_dataset_with_gpt(save_path=path)


synthetic_examples = load_or_create_dataset()
coffee_count = sum(1 for item in synthetic_examples if item.get("topic") == "coffee")
sports_count = sum(1 for item in synthetic_examples if item.get("topic") == "sports")
print(
    f"Loaded {len(synthetic_examples)} examples; "
    f"coffee={coffee_count} | sports={sports_count}."
)
Loading cached dataset from experiments/mimic_tiny_agent/coffee_vs_sports.jsonl
Loaded 200 examples; coffee=100 | sports=100.
6.3. Creating the tiny agent
We create a function that generates an answer according to the following rules:
- Whenever the topic is "coffee", it should answer: "I see you are a person of taste! I like coffee, too.";
- Whenever the topic is "sports", it must say: "Well, I don't like sports.".
from typing import Iterable

COFFEE_RESPONSE = "I see you are a person of taste! I like coffee, too."
SPORTS_RESPONSE = "Well, I don't like sports."
DEFAULT_RESPONSE = SPORTS_RESPONSE


def tiny_agent_response(user_text: str, *, topic: str | None = None) -> str:
    label = (topic or "").strip().lower()
    if label == "coffee":
        return COFFEE_RESPONSE
    if label == "sports":
        return SPORTS_RESPONSE
    if "coffee" in user_text.lower():
        return COFFEE_RESPONSE
    return DEFAULT_RESPONSE


def batch_tiny_agent_response(
    inputs: Iterable[tuple[str, str | None]],
) -> list[str]:
    return [tiny_agent_response(text, topic=topic) for text, topic in inputs]


print(tiny_agent_response("I brewed a silky pour-over", topic="coffee"))
print(tiny_agent_response("Practicing my backhand", topic="sports"))
I see you are a person of taste! I like coffee, too.
Well, I don't like sports.
6.3.1. Scoring all the input text using the tiny agent
from collections.abc import Sequence
from pprint import pprint


def score_dataset_with_tiny_agent(
    dataset: Sequence[dict[str, object]],
) -> list[dict[str, object]]:
    """Append the tiny agent answer to every input example."""
    scored: list[dict[str, object]] = []
    for row in dataset:
        topic = row.get("topic")
        answer = tiny_agent_response(str(row["text"]), topic=topic)
        scored.append({**row, "tiny_agent_output": answer})
    return scored


scored_examples = score_dataset_with_tiny_agent(synthetic_examples)
print(f"Scored {len(scored_examples)} examples with the deterministic agent.")
pprint(scored_examples[:3])
Scored 200 examples with the deterministic agent.
[{'model': 'gpt-3.5-turbo',
'text': 'The first sip of my morning espresso awakens my senses and sets a '
'positive tone for the day ahead.',
'timestamp': 1763237270.6815271,
'tiny_agent_output': 'I see you are a person of taste! I like coffee, too.',
'topic': 'coffee'},
{'model': 'gpt-3.5-turbo',
'text': "I love coffee because it's a warm hug in a cup that jumpstarts my "
'day with its rich aroma.',
'timestamp': 1763237271.866124,
'tiny_agent_output': 'I see you are a person of taste! I like coffee, too.',
'topic': 'coffee'},
{'model': 'gpt-3.5-turbo',
'text': 'The first sip of rich, aromatic coffee transports me to cozy '
'mornings and fuels my passion for creativity and productivity.',
'timestamp': 1763237273.029643,
'tiny_agent_output': 'I see you are a person of taste! I like coffee, too.',
'topic': 'coffee'}]
6.4. The mimic agent
We use DSPy to create a program that just has a generic prompt.
import os
from typing import Any

import dspy


def ensure_dspy_configured() -> object:
    """Lazy-load the LM client used by DSPy."""
    if dspy.settings.lm is not None:
        return dspy.settings.lm
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise EnvironmentError(
            "OPENAI_API_KEY is not set. Export it before running DSPy cells."
        )
    model_name = os.getenv("DSPY_LM_MODEL", "openai/gpt-4.1-mini")
    temperature = float(os.getenv("DSPY_LM_TEMPERATURE", "0.3"))
    max_tokens = int(os.getenv("DSPY_LM_MAX_TOKENS", "2048"))
    lm = dspy.LM(
        model_name,
        api_key=api_key,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    dspy.configure(lm=lm)
    return lm


ensure_dspy_configured()


class CommentOnClaim(dspy.Signature):
    user_claim = dspy.InputField(desc="The short statement provided by the user.")
    comment = dspy.OutputField(
        desc="A concise acknowledgement or observation about the claim."
    )


mimic_agent_structured_default = {
    "general_guidance": "Comment on the user claim following the SOPs.",
    "sops": [
        "This is a placeholder standard operating procedure.",
        "This is a placeholder standard operating procedure.",
    ],
}


class PromptStructure:
    def __init__(self, general_guidance: str, sops: list[str]) -> None:
        self.general_guidance = general_guidance.strip()
        self.sops = [sop.strip() for sop in sops]

    def render(self) -> str:
        lines = ["[General guidance]", self.general_guidance]
        for idx, sop in enumerate(self.sops, start=1):
            lines.append(f"[sop-{idx}]")
            lines.append(sop)
        return "\n".join(lines)

    def block_names(self) -> list[str]:
        return ["general"] + [f"sop-{idx}" for idx in range(1, len(self.sops) + 1)]

    def get_block_text(self, block_name: str) -> str:
        if block_name == "general":
            return self.general_guidance
        if block_name.startswith("sop-"):
            idx = int(block_name.split("-", 1)[1]) - 1
            return self.sops[idx]
        raise KeyError(f"Unknown block {block_name}")

    def update_block(self, block_name: str, new_text: str) -> None:
        cleaned = new_text.strip()
        if not cleaned:
            return
        if block_name == "general":
            self.general_guidance = cleaned
            return
        if block_name.startswith("sop-"):
            idx = int(block_name.split("-", 1)[1]) - 1
            if 0 <= idx < len(self.sops):
                self.sops[idx] = cleaned
                return
        raise KeyError(f"Unknown block {block_name}")


class MimicAgent(dspy.Module):
    """DSPy program with a structured prompt (general guidance + SOP blocks)."""

    def __init__(
        self,
        *,
        general_guidance: str | None = None,
        sop_texts: list[str] | None = None,
        sop_count: int = 2,
    ) -> None:
        super().__init__()
        template = mimic_agent_structured_default
        base_general = general_guidance or template["general_guidance"]
        base_sops = sop_texts or template["sops"]
        if sop_count > len(base_sops):
            base_sops = base_sops + [base_sops[-1]] * (sop_count - len(base_sops))
        else:
            base_sops = base_sops[:sop_count]
        self.prompt_state = PromptStructure(base_general, base_sops)
        self.generator = dspy.Predict(CommentOnClaim)

    @property
    def base_prompt(self) -> str:
        return self.prompt_state.render()

    def forward(self, user_claim: str) -> dict[str, Any]:  # type: ignore[override]
        instructions = (
            f"Instruction:\n{self.base_prompt}\n"
            f"User claim: {user_claim.strip()}\n"
            "Comment:"
        )
        prediction = self.generator(user_claim=instructions)
        return {
            "prompt": self.base_prompt,
            "mimic_output": prediction.comment,
        }

    def get_prompt_blocks(self) -> dict[str, str]:
        return {
            name: self.prompt_state.get_block_text(name)
            for name in self.prompt_state.block_names()
        }

    def update_block(self, block_name: str, new_text: str) -> None:
        self.prompt_state.update_block(block_name, new_text)


mimic_agent = MimicAgent(sop_count=2)
demo_reply = mimic_agent("I'm experimenting with new pour-over routines")
demo_reply
{'prompt': '[General guidance]\nComment on the user claim following the SOPs.\n[sop-1]\nThis is a placeholder standard operating procedure.\n[sop-2]\nThis is a placeholder standard operating procedure.',
 'mimic_output': 'Exploring new pour-over routines can be a great way to enhance your coffee brewing skills and discover unique flavors. Enjoy the experimentation process!'}
6.5. The learning algorithm
6.5.1. The evaluator
The evaluator's inputs: the user input, the tiny agent output, and the mimic agent output.
The evaluator's outputs: a similarity metric comparing the tiny agent output to the mimic agent output; the last n mimic outputs for that same input; and the last k training examples with their mimic outputs and scores.
from __future__ import annotations

import difflib
import math
import time
from collections import Counter, defaultdict, deque
from dataclasses import dataclass
from typing import Any, Deque, Dict, Literal

import numpy as np
import openai

SimilarityMetric = Literal["bleu", "embedding_cosine"]


def _tokenize(text: str) -> list[str]:
    return [token for token in text.lower().split() if token]


def _ngram_counts(tokens: list[str], n: int) -> Counter[tuple[str, ...]]:
    if len(tokens) < n:
        return Counter()
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def compute_bleu(reference: str, hypothesis: str) -> float:
    ref_tokens = _tokenize(reference)
    hyp_tokens = _tokenize(hypothesis)
    if not hyp_tokens:
        return 0.0
    weights = [0.25, 0.25, 0.25, 0.25]
    precisions = []
    for n in range(1, 5):
        ref_counts = _ngram_counts(ref_tokens, n)
        hyp_counts = _ngram_counts(hyp_tokens, n)
        overlap = sum(
            min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
        )
        total = sum(hyp_counts.values()) or 1
        precisions.append(overlap / total if overlap else 1e-9)
    log_precision = sum(w * math.log(p) for w, p in zip(weights, precisions))
    brevity_penalty = 1.0
    ref_len = len(ref_tokens)
    hyp_len = len(hyp_tokens)
    if hyp_len <= ref_len and hyp_len > 0:
        brevity_penalty = math.exp(1 - ref_len / hyp_len)
    return float(brevity_penalty * math.exp(log_precision))


def cosine_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    if denom == 0:
        return 0.0
    return float(np.dot(vec_a, vec_b) / denom)


@dataclass
class EvaluationRecord:
    user_input: str
    tiny_output: str
    mimic_output: str
    similarity: float
    metric: SimilarityMetric
    timestamp: float


class TinyAgentEvaluator:
    """Track similarity and histories for the mimic vs. tiny agent comparison."""

    def __init__(
        self,
        *,
        metric: SimilarityMetric = "bleu",
        per_input_history: int = 5,
        training_history: int = 50,
        embedding_model: str = "text-embedding-3-small",
    ) -> None:
        self.metric = metric
        self.per_input_history = per_input_history
        self.training_history = training_history
        self.embedding_model = embedding_model
        self._per_input_records: dict[str, Deque[EvaluationRecord]] = defaultdict(
            lambda: deque(maxlen=per_input_history)
        )
        self._global_records: Deque[EvaluationRecord] = deque(maxlen=training_history)
        self._embedding_cache: dict[str, np.ndarray] = {}
        self._embedding_client = openai.OpenAI()

    # -- metrics -----------------------------------------------------------------
    def _bleu_similarity(self, tiny_output: str, mimic_output: str) -> float:
        return compute_bleu(tiny_output, mimic_output)

    def _embedding_similarity(self, tiny_output: str, mimic_output: str) -> float:
        tiny_vec = self._embed_text(tiny_output)
        mimic_vec = self._embed_text(mimic_output)
        return cosine_similarity(tiny_vec, mimic_vec)

    def _embed_text(self, text: str) -> np.ndarray:
        cache_key = text.strip()
        if cache_key in self._embedding_cache:
            return self._embedding_cache[cache_key]
        if not cache_key:
            vec = np.zeros(1536, dtype=np.float32)
            self._embedding_cache[cache_key] = vec
            return vec
        response = self._embedding_client.embeddings.create(
            model=self.embedding_model,
            input=cache_key,
        )
        vec = np.array(response.data[0].embedding, dtype=np.float32)
        self._embedding_cache[cache_key] = vec
        return vec

    def _compute_similarity(self, tiny_output: str, mimic_output: str) -> float:
        if self.metric == "embedding_cosine":
            return self._embedding_similarity(tiny_output, mimic_output)
        return self._bleu_similarity(tiny_output, mimic_output)

    def similarity(self, tiny_output: str, mimic_output: str) -> float:
        """Public helper to compute similarity without mutating evaluator state."""
        return self._compute_similarity(tiny_output, mimic_output)

    def embed_text(self, text: str) -> np.ndarray:
        """Expose embedding helper for downstream analysis."""
        return self._embed_text(text)

    # -- public api ---------------------------------------------------------------
    def evaluate(
        self,
        *,
        user_input: str,
        tiny_output: str,
        mimic_output: str,
    ) -> Dict[str, Any]:
        similarity = self._compute_similarity(tiny_output, mimic_output)
        record = EvaluationRecord(
            user_input=user_input,
            tiny_output=tiny_output,
            mimic_output=mimic_output,
            similarity=similarity,
            metric=self.metric,
            timestamp=time.time(),
        )
        per_input_buffer = self._per_input_records[user_input]
        per_input_buffer.append(record)
        self._global_records.append(record)
        return {
            "similarity": similarity,
            "metric": self.metric,
            "last_outputs_for_input": [
                {
                    "mimic_output": item.mimic_output,
                    "tiny_output": item.tiny_output,
                    "similarity": item.similarity,
                    "timestamp": item.timestamp,
                }
                for item in list(per_input_buffer)
            ],
            "recent_training_examples": [
                {
                    "user_input": item.user_input,
                    "mimic_output": item.mimic_output,
                    "tiny_output": item.tiny_output,
                    "similarity": item.similarity,
                    "metric": item.metric,
                    "timestamp": item.timestamp,
                }
                for item in list(self._global_records)
            ],
        }


def describe_block_change(block_name: str, old_text: str, new_text: str) -> str:
    if old_text.strip() == new_text.strip():
        return f"Block {block_name} unchanged."
    desc: list[str] = [f"Block {block_name}: "]
    if len(new_text) > len(old_text) * 1.1:
        desc.append("expanded details.")
    elif len(new_text) < len(old_text) * 0.9:
        desc.append("more concise wording.")
    old_words = set(old_text.lower().split())
    new_words = set(new_text.lower().split())
    added = list(new_words - old_words)
    removed = list(old_words - new_words)
    if added:
        desc.append(
            " Added cues: "
            + ", ".join(sorted(added[:5]))
            + ("..." if len(added) > 5 else "")
        )
    if removed:
        desc.append(
            " Removed cues: "
            + ", ".join(sorted(removed[:5]))
            + ("..." if len(removed) > 5 else "")
        )
    if len(desc) == 1:
        desc.append(" Rephrased while preserving intent.")
    return "".join(desc)


def prompt_unified_diff(old_prompt: str, new_prompt: str) -> str:
    diff = difflib.unified_diff(
        old_prompt.splitlines(),
        new_prompt.splitlines(),
        fromfile="previous_prompt",
        tofile="updated_prompt",
        lineterm="",
    )
    diff_text = "\n".join(diff)
    return diff_text or "(no textual diff)"


def format_change_log(entries: list[dict[str, object]], limit: int = 3) -> str:
    if not entries:
        return "No prior prompt changes recorded."
    lines: list[str] = ["Recent prompt changes:"]
    for idx, entry in enumerate(entries[-limit:], start=1):
        delta = entry.get("delta")
        if isinstance(delta, float):
            delta_text = f"{delta:+.3f}"
        else:
            delta_text = "N/A"
        lines.append(
            f"[{idx}] block={entry.get('block_name')} Δmetric={delta_text} "
            f"prev={entry.get('prev_metric')} new={entry.get('new_metric')}"
        )
        lines.append(f"Intent: {entry.get('intention')}")
        old_block = entry.get("old_block_text", "")
        new_block = entry.get("new_block_text", "")
        lines.append(f"Old block: {old_block}")
        lines.append(f"New block: {new_block}")
        diff_text = entry.get("diff", "") or "(no diff)"
        if isinstance(diff_text, str):
            snippet = diff_text if len(diff_text) <= 800 else diff_text[:800] + "\n..."
            lines.append("Diff:\n" + snippet)
    return "\n".join(lines)


def serialize_prompt_blocks(blocks: dict[str, str]) -> str:
    parts = []
    for name, text in blocks.items():
        parts.append(f"[{name}]\n{text}")
    return "\n\n".join(parts)


evaluator = TinyAgentEvaluator(
    metric="embedding_cosine",
    per_input_history=3,
    training_history=10,
)

sample_input = synthetic_examples[0]["text"]
tiny_answer = tiny_agent_response(sample_input)
mimic_answer = mimic_agent(sample_input)["mimic_output"]
evaluator.evaluate(
    user_input=sample_input,
    tiny_output=tiny_answer,
    mimic_output=mimic_answer,
)
{'similarity': 0.01230083592236042,
 'metric': 'embedding_cosine',
 'last_outputs_for_input': [{'mimic_output': 'Your claim highlights a common experience where the initial taste of espresso can stimulate the senses and contribute to a positive start to the day. This aligns with how caffeine and sensory enjoyment often influence mood and alertness.',
   'tiny_output': "Well, I don't like sports.",
   'similarity': 0.01230083592236042,
   'timestamp': 1763249209.6680489}],
 'recent_training_examples': [{'user_input': 'The first sip of my morning espresso awakens my senses and sets a positive tone for the day ahead.',
   'mimic_output': 'Your claim highlights a common experience where the initial taste of espresso can stimulate the senses and contribute to a positive start to the day. This aligns with how caffeine and sensory enjoyment often influence mood and alertness.',
   'tiny_output': "Well, I don't like sports.",
   'similarity': 0.01230083592236042,
   'metric': 'embedding_cosine',
   'timestamp': 1763249209.6680489}]}
6.5.2. The coach agent
The coach agent's inputs: the evaluator output and the current mimic agent prompt. The coach agent's output: a rewritten version of exactly one block of the mimic agent's prompt.
class CoachSignature(dspy.Signature):
    evaluator_report = dspy.InputField(
        desc="Structured summary describing recent evaluator scores and samples."
    )
    prompt_blocks = dspy.InputField(desc="Structured prompt broken into named blocks.")
    block_options = dspy.InputField(desc="List of valid block names to edit.")
    guidelines = dspy.InputField(
        desc="Non-negotiable constraints for how the revised prompt should be phrased."
    )
    prompt_change_log = dspy.InputField(
        desc="Table describing prior prompt edits, diffs, and their metric deltas."
    )
    case_examples = dspy.InputField(
        desc="Sample inputs with target/mimic outputs for focused comparison."
    )
    case_reasoning = dspy.InputField(
        desc="Reasoned analysis describing why mimic outputs differ from targets."
    )
    target_block = dspy.OutputField(desc="Name of the block to edit (e.g., general, sop-1).")
    updated_block_text = dspy.OutputField(desc="Rewritten text for the chosen block only.")
    coach_rationale = dspy.OutputField(desc="Short explanation for the proposed change.")


COACH_PROMPT_GUIDELINES = (
    "Produce a standalone instruction for an agent to imitate a 'target agent'. "
    "Focus on what the agent should say to become indistinguishable from the target agent. "
    "The target agent is following a policy and you must decode it by looking at its behavior and craft a prompt for the mimic agent to apply that same policy. "
    "Be as precise as possible describing the process the target agent is following to insert in the mimic agent prompt. "
    "Don't mention 'mimic agent' or 'target agent' in the prompt of the 'mimic agent' since it is unaware of them. "
    "You want to maximize the similarity between the target agent output and the mimic agent output by editing the mimic agent's prompt. "
    "Respect the structured prompt: edit exactly one named block (general or sop-#) per iteration and leave all other blocks untouched."
)


def summarize_evaluator_output(report: dict[str, Any], limit: int = 3) -> str:
    """Turn evaluator dictionaries into a short textual description."""
    lines = [
        f"Similarity metric: {report.get('metric')}",
        f"Last score: {report.get('similarity'):.3f}",
    ]
    history = report.get("last_outputs_for_input", [])[-limit:]
    if history:
        lines.append("Recent outputs for this input:")
        for item in history:
            lines.append(
                "- target: "
                + item["tiny_output"].strip()
                + " | mimic: "
                + item["mimic_output"].strip()
                + f" | score={item['similarity']:.3f}"
            )
    recent_examples = report.get("recent_training_examples", [])[-limit:]
    if recent_examples:
        lines.append("Recent training mismatches:")
        for example in recent_examples:
            lines.append(
                "- input: "
                + example["user_input"].strip()
                + " | target: "
                + example["tiny_output"].strip()
                + " | mimic: "
                + example["mimic_output"].strip()
                + f" | score={example['similarity']:.3f}"
            )
    return "\n".join(lines)


class CoachAgent(dspy.Module):
    """LLM-based coach that rewrites the mimic prompt using evaluator feedback."""

    def __init__(self) -> None:
        super().__init__()
        ensure_dspy_configured()
        self.predictor = dspy.Predict(CoachSignature)
        self.guidelines = COACH_PROMPT_GUIDELINES

    def forward(
        self,
        *,
        evaluator_output: dict[str, Any],
        prompt_blocks: str,
        block_options: str,
        prompt_change_log: str,
        case_examples: str,
        case_reasoning: str,
    ) -> dict[str, str]:  # type: ignore[override]
        summary = summarize_evaluator_output(evaluator_output)
        prediction = self.predictor(
            evaluator_report=summary,
            prompt_blocks=prompt_blocks,
            block_options=block_options,
            guidelines=self.guidelines,
            prompt_change_log=prompt_change_log,
            case_examples=case_examples,
            case_reasoning=case_reasoning,
        )
        block = getattr(prediction, "target_block", "general").strip()
        updated_block = getattr(prediction, "updated_block_text", "").strip()
        rationale = getattr(prediction, "coach_rationale", "")
        return {
            "target_block": block,
            "updated_block_text": updated_block,
            "coach_rationale": rationale,
        }


coach = CoachAgent()
evaluator_snapshot = evaluator.evaluate(
    user_input=sample_input,
    tiny_output=tiny_answer,
    mimic_output=mimic_agent(sample_input)["mimic_output"],
)
coach.forward(
    evaluator_output=evaluator_snapshot,
    prompt_blocks=serialize_prompt_blocks(mimic_agent.get_prompt_blocks()),
    block_options=", ".join(mimic_agent.prompt_state.block_names()),
    prompt_change_log="No prior prompt changes recorded.",
    case_examples="",
    case_reasoning="No case analysis available yet.",
)
2025/11/15 20:26:49 WARNING dspy.primitives.module: Calling module.forward(...) on CoachAgent directly is discouraged. Please use module(...) instead.
{'target_block': 'general',
 'updated_block_text': "Carefully read the user's claim and respond with a brief, relevant comment that directly addresses the content or sentiment of the claim. Your response should be concise, neutral, and avoid introducing unrelated topics or excessive elaboration. Focus on acknowledging or reflecting the user's statement in a way that shows understanding without adding personal opinions or external information.",
 'coach_rationale': "The original general block was too vague and did not guide the agent to produce responses aligned with the target outputs, which are short, direct comments on the user's claim. This revised instruction clarifies that the agent should produce brief, relevant, and neutral comments that directly engage with the user's statement, avoiding unrelated elaboration. This should improve output similarity to the target agent's behavior."}
6.5.3. The algorithm
- Splitting the training data
We split the data into training, test, and holdout sets: 60%/20%/20%.
import math
import random
from typing import Sequence, Tuple


def split_dataset(
    dataset: Sequence[dict[str, object]],
    train_ratio: float = 0.6,
    test_ratio: float = 0.2,
    seed: int = 2024,
) -> Tuple[list[dict[str, object]], list[dict[str, object]], list[dict[str, object]]]:
    if not math.isclose(train_ratio + test_ratio, 0.8, rel_tol=1e-9):
        raise ValueError("Train + test ratios must sum to 0.8 for 20% holdout.")
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    n = len(shuffled)
    train_end = int(n * train_ratio)
    test_end = train_end + int(n * test_ratio)
    train_split = shuffled[:train_end]
    test_split = shuffled[train_end:test_end]
    holdout_split = shuffled[test_end:]
    return train_split, test_split, holdout_split


train_set, test_set, holdout_set = split_dataset(scored_examples)
print(
    f"train={len(train_set)} | test={len(test_set)} | holdout={len(holdout_set)}"
)
train=120 | test=40 | holdout=40
- The learning loop
import difflib
import random
from pathlib import Path

import dspy
import matplotlib.pyplot as plt

CASE_REPORT_DIR = Path("experiments/mimic_tiny_agent/coach_case_reports")
CASE_ANALYSIS_INSTRUCTIONS = (
    "Explain how the mimic agent's answers differ from the target outputs and propose concrete prompt tweaks."
)


class CaseAnalysisSignature(dspy.Signature):
    cases = dspy.InputField(desc="Selected inputs plus target/mimic outputs.")
    metric = dspy.InputField(desc="Similarity metric currently tracked.")
    guidance = dspy.InputField(desc="What the analysis should focus on.")
    analysis = dspy.OutputField(desc="Reasoned comparison and actionable advice.")


ensure_dspy_configured()
case_analyzer = dspy.Predict(CaseAnalysisSignature)


def analyze_case_differences(cases_text: str, metric: str) -> str:
    if not cases_text.strip():
        return "No case data collected."
    ensure_dspy_configured()
    result = case_analyzer(
        cases=cases_text,
        metric=metric,
        guidance=CASE_ANALYSIS_INSTRUCTIONS,
    )
    return getattr(result, "analysis", str(result)).strip()


def build_case_report(
    records: list[dict[str, object]],
    evaluator: TinyAgentEvaluator,
    *,
    epoch_idx: int,
    top_k: int = 5,
) -> tuple[str, str]:
    if not records:
        return "", ""
    anchor = random.choice(records)
    anchor_vec = evaluator.embed_text(anchor["user_input"])
    scored: list[tuple[float, dict[str, object]]] = []
    for record in records:
        vec = evaluator.embed_text(record["user_input"])
        sim = cosine_similarity(anchor_vec, vec)
        scored.append((sim, record))
    top_records = [
        rec for _, rec in sorted(scored, key=lambda x: x[0], reverse=True)[:top_k]
    ]
    lines: list[str] = [f"Anchor input: {anchor['user_input']}"]
    for idx, rec in enumerate(top_records, start=1):
        lines.append(
            "\n".join(
                [
                    f"### Case {idx}",
                    f"- input: {rec['user_input']}",
                    f"- target agent output: {rec['tiny_output']}",
                    f"- mimic agent output: {rec['mimic_output']}",
                    f"- similarity: {rec['similarity']:.3f}",
                ]
            )
        )
    CASE_REPORT_DIR.mkdir(parents=True, exist_ok=True)
    report_path = CASE_REPORT_DIR / f"epoch_{epoch_idx:02d}_case_report.md"
    report_text = "\n\n".join(lines)
    report_path.write_text(report_text, encoding="utf-8")
    return report_text, str(report_path)


def chunk_batches(
    data: list[dict[str, object]], batch_size: int
) -> list[list[dict[str, object]]]:
    return [data[i : i + batch_size] for i in range(0, len(data), batch_size)]


def run_learning_loop(
    *,
    train_data: list[dict[str, object]],
    test_data: list[dict[str, object]],
    evaluator: TinyAgentEvaluator,
    coach: CoachAgent,
    mimic_agent: MimicAgent,
    epochs: int = 5,
    batch_size: int = 32,
    shuffle: bool = True,
) -> dict[str, Any]:
    test_history: list[float] = []
    train_history: list[float] = []
    prompt_history: list[str] = [mimic_agent.base_prompt]
    similarity_fn = getattr(evaluator, "similarity", evaluator._compute_similarity)
    train_pool = list(train_data)
    completed_changes: list[dict[str, object]] = []
    pending_change: dict[str, object] | None = None
    for epoch in range(epochs):
        if shuffle:
            random.shuffle(train_pool)
        batches = chunk_batches(train_pool, batch_size)
        print(
            f"Epoch {epoch + 1}/{epochs}: {len(batches)} batches of size <= {batch_size}."
        )
        last_eval = None
        train_scores: list[float] = []
        epoch_records: list[dict[str, object]] = []
        for batch_idx, batch in enumerate(batches, start=1):
            print(f"  Batch {batch_idx}/{len(batches)}")
            for sample in batch:
                user_text = str(sample["text"])
                tiny_output = str(
                    sample.get("tiny_agent_output") or tiny_agent_response(user_text)
                )
                mimic_output = mimic_agent(user_text)["mimic_output"]
                last_eval = evaluator.evaluate(
                    user_input=user_text,
                    tiny_output=tiny_output,
                    mimic_output=mimic_output,
                )
                train_scores.append(last_eval["similarity"])
                epoch_records.append(
                    {
                        "user_input": user_text,
                        "tiny_output": tiny_output,
                        "mimic_output": mimic_output,
                        "similarity": last_eval["similarity"],
                    }
                )
        avg_train_score = float(np.mean(train_scores)) if train_scores else 0.0
        train_history.append(avg_train_score)
        print(f"  Train similarity: {avg_train_score:.3f}")
        print("  Evaluating on test set...")
        test_scores: list[float] = []
        for sample in test_data:
            user_text = str(sample["text"])
            tiny_output = str(
                sample.get("tiny_agent_output") or tiny_agent_response(user_text)
            )
            mimic_output = mimic_agent(user_text)["mimic_output"]
            test_scores.append(similarity_fn(tiny_output, mimic_output))
        avg_test_score = float(np.mean(test_scores)) if test_scores else 0.0
        test_history.append(avg_test_score)
        print(f"  Test similarity: {avg_test_score:.3f}")
        case_examples_text, case_report_path = build_case_report(
            epoch_records,
            evaluator,
            epoch_idx=epoch + 1,
        )
        if case_report_path:
            print(f"  Saved case report to {case_report_path}")
        case_reasoning = analyze_case_differences(case_examples_text, evaluator.metric)
        if pending_change is not None and pending_change.get("new_metric") is None:
            pending_change["new_metric"] = avg_test_score
            prev_metric = pending_change.get("prev_metric")
            if isinstance(prev_metric, float):
                pending_change["delta"] = avg_test_score - prev_metric
            else:
                pending_change["delta"] = None
            completed_changes.append(pending_change)
            pending_change = None
        if last_eval is not None:
            print("  Updating prompt via coach agent...")
            change_log_text = format_change_log(completed_changes)
            blocks = mimic_agent.get_prompt_blocks()
            block_options = ", ".join(mimic_agent.prompt_state.block_names())
            coach_update = coach(
                evaluator_output=last_eval,
                prompt_blocks=serialize_prompt_blocks(blocks),
                block_options=block_options,
                prompt_change_log=change_log_text,
                case_examples=case_examples_text,
                case_reasoning=case_reasoning,
            )
            target_block = coach_update.get("target_block", "general").strip()
            new_block_text = coach_update.get("updated_block_text", "").strip()
            old_block_text = blocks.get(target_block)
            if not new_block_text or old_block_text is None:
                print("  Coach response missing valid block update; skipping.")
            else:
                old_prompt = mimic_agent.base_prompt
                mimic_agent.update_block(target_block, new_block_text)
                new_prompt = mimic_agent.base_prompt
                if new_prompt != old_prompt:
                    intention = describe_block_change(
                        target_block, old_block_text, new_block_text
                    )
                    diff_text = prompt_unified_diff(old_prompt, new_prompt)
                    pending_change = {
                        "old_prompt": old_prompt,
                        "new_prompt": new_prompt,
                        "block_name": target_block,
                        "old_block_text": old_block_text,
                        "new_block_text": new_block_text,
                        "intention": intention,
                        "diff": diff_text,
                        "prev_metric": avg_test_score,
                        "new_metric": None,
                        "delta": None,
                    }
                    prompt_history.append(new_prompt)
                    print(
                        "  Updated block:",
                        target_block,
                        "->",
                        new_block_text[:120] + ("..." if len(new_block_text) > 120 else ""),
                    )
                else:
                    print("  Coach left the prompt unchanged.")
        print("  Epoch complete.\n")
    if pending_change is not None:
        completed_changes.append(pending_change)
    return {
        "test_history": test_history,
        "train_history": train_history,
        "prompt_history": prompt_history,
        "change_log": completed_changes,
    }


training_run = run_learning_loop(
    train_data=train_set,
    test_data=test_set,
    evaluator=evaluator,
    coach=coach,
    mimic_agent=mimic_agent,
    epochs=12,
    batch_size=32,
)

fig, ax = plt.subplots(figsize=(6, 3))
epochs_axis = range(1, len(training_run["test_history"]) + 1)
ax.plot(epochs_axis, training_run["train_history"], marker="o", label="Train")
ax.plot(epochs_axis, training_run["test_history"], marker="s", label="Test")
ax.set_xlabel("Epoch")
ax.set_ylabel(f"Similarity ({evaluator.metric})")
ax.set_title("Mimic agent performance across epochs")
ax.grid(True, alpha=0.3)
ax.legend()
fig.tight_layout()
fig
Epoch 1/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.352
  Evaluating on test set...
  Test similarity: 0.341
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_01_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and casually to the user's statement, expressing personal enjoyment or agreement in a friendly, conversa...
  Epoch complete.

Epoch 2/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.345
  Evaluating on test set...
  Test similarity: 0.339
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_02_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and casually to the user's statement. If the statement describes sports or physical activities, reply wi...
  Epoch complete.

Epoch 3/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.504
  Evaluating on test set...
  Test similarity: 0.523
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_03_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and casually to the user's statement. If the statement describes sports or physical activities, reply wi...
  Epoch complete.

Epoch 4/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.496
  Evaluating on test set...
  Test similarity: 0.512
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_04_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and casually to the user's statement. If the statement describes sports or physical activities, reply wi...
  Epoch complete.

Epoch 5/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.552
  Evaluating on test set...
  Test similarity: 0.582
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_05_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and casually to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.

Epoch 6/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.601
  Evaluating on test set...
  Test similarity: 0.636
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_06_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.

Epoch 7/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.906
  Evaluating on test set...
  Test similarity: 0.901
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_07_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.

Epoch 8/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.901
  Evaluating on test set...
  Test similarity: 0.901
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_08_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.

Epoch 9/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.890
  Evaluating on test set...
  Test similarity: 0.898
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_09_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.

Epoch 10/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.979
  Evaluating on test set...
  Test similarity: 0.997
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_10_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.

Epoch 11/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.906
  Evaluating on test set...
  Test similarity: 0.914
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_11_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.

Epoch 12/12: 4 batches of size <= 32.
  Batch 1/4  Batch 2/4  Batch 3/4  Batch 4/4
  Train similarity: 0.967
  Evaluating on test set...
  Test similarity: 0.988
  Saved case report to experiments/mimic_tiny_agent/coach_case_reports/epoch_12_case_report.md
  Updating prompt via coach agent...
  Updated block: general -> Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond ...
  Epoch complete.
- Printing the full learned prompt
print("Final learned prompt:\n") print(mimic_agent.base_prompt)
Final learned prompt:

[General guidance]
Respond briefly and politely to the user's statement. If the statement describes sports or physical activities, respond with a short, blunt, and general negative statement expressing clear dislike or lack of enjoyment for sports, regardless of any positive sentiment in the input. Use a conversational tone with contractions and discourse markers such as "Well," to express a clear personal opinion, for example: "Well, I don't like sports." Do not reference specific details from the input. Avoid repeating the exact same phrase for every sports-related input; instead, vary your wording slightly while maintaining the same clear negative stance and use phrasing very close to "Well, I don't like sports." Maintain the straightforward and slightly blunt tone of the target responses. Avoid rephrasing or softening the message; replicate the style and wording as closely as possible. Use a straightforward and neutral tone without casual qualifiers or elaborations. For statements about personal tastes or experiences unrelated to sports, respond concisely and respectfully, acknowledging the user's taste with a polite and neutral phrase similar to: "I see you are a person of taste! I like [topic], too." When possible, incorporate a relevant detail or sentiment from the user's statement to make the response more context-aware and engaging, while avoiding slang, overly casual expressions, or excessive enthusiasm.
[sop-1]
This is a placeholder standard operating procedure.
[sop-2]
This is a placeholder standard operating procedure.
- Evaluating the learned mimic agent in the holdout dataset
from pprint import pprint


def evaluate_holdout(
    *,
    dataset: list[dict[str, object]],
    mimic_agent: MimicAgent,
    evaluator: TinyAgentEvaluator,
) -> dict[str, Any]:
    rows = []
    scores: list[float] = []
    for sample in dataset:
        user_text = str(sample["text"])
        tiny_output = str(
            sample.get("tiny_agent_output") or tiny_agent_response(user_text)
        )
        mimic_output = mimic_agent(user_text)["mimic_output"]
        similarity = evaluator.similarity(tiny_output, mimic_output)
        scores.append(similarity)
        rows.append(
            {
                "user_input": user_text,
                "tiny_output": tiny_output,
                "mimic_output": mimic_output,
                "similarity": similarity,
            }
        )
    return {
        "average_similarity": float(np.mean(scores)) if scores else 0.0,
        "metric": evaluator.metric,
        "rows": rows,
    }


holdout_report = evaluate_holdout(
    dataset=holdout_set,
    mimic_agent=mimic_agent,
    evaluator=evaluator,
)
print(
    f"Holdout {holdout_report['metric']} similarity: {holdout_report['average_similarity']:.3f}"
)
pprint(holdout_report["rows"][:5])
Holdout embedding_cosine similarity: 0.981
[{'mimic_output': 'I see you are a person of taste! I like coffee, too.',
  'similarity': 1.0,
  'tiny_output': 'I see you are a person of taste! I like coffee, too.',
  'user_input': 'The rich aroma of freshly brewed coffee awakens my senses and fuels my day with warmth and comfort.'},
 {'mimic_output': 'I see you are a person of taste! I like the comforting aroma of coffee, too.',
  'similarity': 0.8953397870063782,
  'tiny_output': 'I see you are a person of taste! I like coffee, too.',
  'user_input': 'The rich aroma of coffee grounds in the morning awakens my senses and brings comfort to my soul.'},
 {'mimic_output': 'I see you are a person of taste! I like espresso, too.',
  'similarity': 0.8794468641281128,
  'tiny_output': 'I see you are a person of taste! I like coffee, too.',
  'user_input': 'Sipping espresso in a quaint Italian cafe, I fell in love with the rich aroma and the warmth it brought.'},
 {'mimic_output': 'I see you are a person of taste! I like coffee, too.',
  'similarity': 1.0,
  'tiny_output': 'I see you are a person of taste! I like coffee, too.',
  'user_input': 'Savoring a steaming cup of coffee as the sun rises, its rich aroma wrapping around me like a comforting embrace.'},
 {'mimic_output': "Well, I don't like sports.",
  'similarity': 1.0000001192092896,
  'tiny_output': "Well, I don't like sports.",
  'user_input': 'I love basketball for the fast-paced action, strategic plays, and the joy of sinking a perfect shot through the hoop.'}]
7. Learnings
7.1. Cosine similarity enables the model to learn a generic behavior
Similarity metric's tolerance: cosine similarity between short sentences rewards stylistic closeness more than exact structure.
This is a nice feature for learning policies that don't enforce specific phrases, but it can make it hard to separate two very similar cases, since a single answer might be similar enough to both targets that the algorithm is never pushed to learn their difference.
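As a quick probe of this tolerance, one can score a few hand-written candidates against a target reply, reusing the TinyAgentEvaluator defined above (the candidate sentences here are illustrative and were not part of the experiment):

probe_evaluator = TinyAgentEvaluator(metric="embedding_cosine")
target = "Well, I don't like sports."
candidates = [
    "Well, I'm not really into sports.",  # same stance, different wording
    "I see you are a person of taste! I like coffee, too.",  # the other branch
]
for candidate in candidates:
    # similarity() scores without mutating the evaluator's histories
    score = probe_evaluator.similarity(target, candidate)
    print(f"{candidate!r} -> cosine similarity {score:.3f}")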
8. Results
- Train/test cosine similarity rises from ~0.34 in epoch 1 to 0.97/0.99 by epoch 10, maintaining ≥0.90 through epoch 12 (see logs under The learning loop).
- Holdout cosine similarity after training is 0.981 with most samples either exact matches or ≥0.89 similarity (see Evaluating the learned mimic agent in the holdout dataset).
- The final structured prompt’s general block clearly encodes the two-branch policy (“Well, I don’t like sports” vs. “I see you are a person of taste!”) while both SOP placeholders remain untouched (see Printing the full learned prompt).
Below is the learned prompt, with the conditions that trigger the two policies in bold and the policies themselves underlined.
Respond briefly and politely to the user's statement. *If the statement describes sports or physical activities*, respond with a short, blunt, and general negative statement expressing clear dislike or lack of enjoyment for sports, regardless of any positive sentiment in the input. Use a conversational tone with contractions and discourse markers such as "Well," to express a clear personal opinion, _for example: "Well, I don't like sports."_ Do not reference specific details from the input. Avoid repeating the exact same phrase for every sports-related input; instead, vary your wording slightly while maintaining the same clear negative stance and use phrasing very close to "Well, I don't like sports." Maintain the straightforward and slightly blunt tone of the target responses. Avoid rephrasing or softening the message; replicate the style and wording as closely as possible. Use a straightforward and neutral tone without casual qualifiers or elaborations. *For statements about personal tastes or experiences unrelated to sports*, respond concisely and respectfully, acknowledging the user's taste with a polite and neutral phrase similar to: _"I see you are a person of taste! I like [topic], too."_ When possible, incorporate a relevant detail or sentiment from the user's statement to make the response more context-aware and engaging, while avoiding slang, overly casual expressions, or excessive enthusiasm.
It is curious that the agent didn't learn the "coffee policy" exactly; instead, it learned it as "anything not related to sports". See how this made one mimic agent output slightly different:

{'mimic_output': 'I see you are a person of taste! I like espresso, too.',
 'similarity': 0.8794468641281128,
 'tiny_output': 'I see you are a person of taste! I like coffee, too.',
 'user_input': 'Sipping espresso in a quaint Italian cafe, I fell in love with the rich aroma and the warmth it brought.'}

Instead of hard-coding "coffee", the learned prompt uses the placeholder "[topic]". Since the topic was always coffee-related in the non-sports cases, the output stayed similar to the expected one. Though imprecise, this is a very interesting behavior, and I'm curious how it will express itself in more complex cases.
9. Conclusion
The result is promising for scenarios where we would benefit from automating an existing agent's behavior. The synthetic example now needs to be expanded into a more challenging situation that better resembles reality.
10. Open Questions
- How to encourage SOP edits instead of repeated general-block rewrites?
- What similarity metric best balances stylistic freedom and strict branching?
- Can we detect policy shifts automatically when the target agent changes?
11. Future Work
11.1. Allow/enforce the coach to edit SOP blocks (enforce block selection, add validation).
11.2. Explore prototype discovery or clustering to generalize beyond known topics.
Prototype discovery means breaking up the inputs with unsupervised learning and identifying all the use cases present in the training set. Say we identify 10 different situations; we could then pre-define 10 SOP blocks, one per prototype.
We could also validate differently, grouping inputs by prototype and computing the performance metric per group.
Further, we could break each use case down by context/procedure and identify how the target agent branches the problem depending on its parameters in order to propose a solution. A minimal clustering sketch follows below.
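A minimal sketch of prototype discovery, assuming scikit-learn is available and reusing the evaluator's embedding helper; the choice of n_clusters=2 simply matches the two known topics in this synthetic set, while a real run would select it via a criterion such as silhouette scores:

import numpy as np
from sklearn.cluster import KMeans

texts = [str(row["text"]) for row in scored_examples]
vectors = np.stack([evaluator.embed_text(text) for text in texts])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
# Each cluster becomes a candidate prototype with its own pre-allocated SOP block.
for cluster_id in range(kmeans.n_clusters):
    members = [t for t, label in zip(texts, kmeans.labels_) if label == cluster_id]
    print(f"Prototype {cluster_id}: {len(members)} inputs, e.g. {members[0]!r}")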
11.3. Add dynamic regression tests that lock in the final prompt and guard against regressions.
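A minimal sketch of what such tests could look like under pytest, assuming the final prompt is persisted to a file; `final_prompt.txt` and both test cases are hypothetical fixtures, and since LLM outputs vary, the behavioral check asserts only on the stance rather than the exact wording:

from pathlib import Path

FINAL_PROMPT_PATH = Path("experiments/mimic_tiny_agent/final_prompt.txt")  # hypothetical fixture


def test_final_prompt_is_locked():
    """Fail if the structured prompt drifts from the frozen artifact."""
    assert mimic_agent.base_prompt == FINAL_PROMPT_PATH.read_text(encoding="utf-8")


def test_sports_branch_still_fires():
    """Spot-check the sports policy without requiring an exact phrase."""
    reply = mimic_agent("I play tennis every weekend")["mimic_output"]
    assert "sports" in reply.lower()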
11.4. Investigate harder policies (more than two responses) or noisy targets.
11.5. Add other success metrics beyond cosine similarity.
An interesting metric for how indistinguishable the target and mimic agents are: train a binary classifier with the target agent's outputs as the positive class, using outputs from the initial mimic agent (prior to any training) as negatives. We are likely to end up with a classifier that is very good at identifying when the target agent is the one answering the inquiry.
As we teach the mimic agent, we use this model to try to identify whether the target or the mimic agent is answering the user's input. Performance on this task should drop to the point where it is random.
One way to define the best possible score: if the target agent allows for minimal variance in its behavior (temperature?), vary it minimally and check the value of this metric. This is only possible when the target agent is an AI agent under our control, which is only realistic in this synthetic case.
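A minimal sketch of this indistinguishability metric, assuming scikit-learn and reusing the evaluator's embeddings; the current mimic agent stands in here for whichever snapshot is being scored (the proposal starts from the untrained one), with labels 1 = target agent and 0 = mimic agent, so cross-validated accuracy near 0.5 means the classifier cannot tell them apart:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Embed target replies and mimic replies for the same training inputs.
target_vecs = np.stack(
    [evaluator.embed_text(str(row["tiny_agent_output"])) for row in train_set]
)
mimic_vecs = np.stack(
    [evaluator.embed_text(mimic_agent(str(row["text"]))["mimic_output"]) for row in train_set]
)
X = np.concatenate([target_vecs, mimic_vecs])
y = np.array([1] * len(target_vecs) + [0] * len(mimic_vecs))
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"Discriminator accuracy: {scores.mean():.3f} (0.5 ~ indistinguishable)")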
12. References
Agrawal, Lakshya A., Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab (2025). GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.
Sutton, Richard S. and Barto, Andrew G. (2018). Reinforcement Learning: An Introduction, MIT Press.
Zhang, Qizheng, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun (2025). Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. CoRR.