Auto-Tuning Prompts with Feedback Loops

By David Factor — Published: 2025-09-05

I’ve been working on a project that uses large language models (LLMs) to classify text in fairly complex legal documents, asking them to pick out things like case citations, judge names, and final decisions. The nice thing about these categories is that they’re objectively checkable: you can compare the model’s answers to a set of correct examples and quickly see whether the output improved or not.

Early on, I was doing what everyone does: tweak the prompt, re‑run, “sense check” the outputs, and repeat. It works—until small prompt edits produce surprisingly large swings, and reasoning about why becomes slippery. I’d read a few articles recently that advocate for running coding agents in simple while loops to get a result. That got me wondering: what if I just make the loop explicit and let the metrics drive prompt changes?

while :; do echo testdata | run_eval.sh | coding_agent ; done

Inspired by Geoffrey Huntley’s Ralph* post, which shows how a simple loop can just keep trying!*

From Gut Feel to a Fitness Function

Usually when we play with prompts, evaluation is a bit hand‑wavy—we read the output and decide if it “looks good.” But with NER‑style tasks, you can define a fitness function.

Gold data: think of this as your answer key—a set of examples where you already know the right outputs.
Precision: out of everything the model guessed, how many did it get right?
Recall: out of all the right answers that existed, how many did the model actually find?
F1 score: basically a way of asking, “how good is this overall?” It balances precision and recall into a single number.

Once you have a test set (inputs) and gold data (expected outputs), you can run an evaluation, tweak your prompt, and immediately see whether things improved. No more guessing.

Why Prompts Need Help

By this point, I’d already set up evaluations and was tracking metrics, but the results were baffling—tiny prompt tweaks sent things swinging wildly, and with many classification tasks to cover, hand‑tuning was a slog. At some point you have to accept that LLMs are a bit of a black box; trying to reason out every surprising shift is a dead end. Around then I came across Geoffrey Huntley’s Ralph—named after Ralph Wiggum from the Simpsons, who just tries things without overthinking. His framing kept a human in the loop as the tuner, nudging the system back on course. That stuck with me, but I wondered: what if the eval metrics themselves could act as the tuner? Could the fitness function take over some of that role automatically?

Diagram of the feedback loop

[Test set (inputs)] + [Gold data (expected outputs)]
                 │
                 ▼
        Evaluate(model, prompt)
                 │
                 ▼
  Metrics & Error Report (Precision, Recall, F1)
                 │
                 ▼
 Coding Agent updates: Prompt + Few‑Shot Examples
                 │
   ┌─────────────┴─────────────┐
   │                           │
   ▼                           │
Re‑run with updates ───────────┘
   │
   ▼
If F1 ≥ target → STOP  •  Else → LOOP

How I Implemented It

I built a small tool that:

Runs an evaluation with the current prompt.
Produces a structured report of errors (what was correct/missed/wrong).
Feeds that report to a coding agent (in my case Aider), which modifies the prompt and the few‑shot examples.
Re‑runs the evaluation and repeats until a target F1 (e.g., 0.90) is reached.

Two tweaks noticeably improved results:

Warm‑start each iteration with the latest improved prompt/examples, and discard updates that don’t improve results.
Let the agent see prior conversation history and ask it to use it for learning.

Pseudocode for the feedback loop

TARGET_F1 = 0.90
prompt, examples = seed_prompt(), seed_examples()

while True:
    # 1) Evaluate
    predictions = run_task(model, prompt, examples, test_set)
    metrics, error_report = evaluate(predictions, gold)

    # 2) Check fitness
    if metrics.f1 >= TARGET_F1:
        break

    # 3) Improve prompt/examples using the report
    prompt, examples = coding_agent_update(
        prompt=prompt,
        examples=examples,
        error_report=error_report,
        history=get_conversation_history()
    )

# Done: prompt/examples are auto‑tuned for this task
save(prompt, examples, metrics)

Results (and Caveats)

With this loop, I can pretty consistently auto‑tune few‑shot prompts for specific classification tasks. All I need is some test data and a starting prompt; the loop nudges things toward better performance.

A few takeaways:

Fitness functions matter. If you can measure “good,” you can probably automate improvement.
Watch for overfitting. Guardrails help: expand the test set with varied scenarios, add simple heuristic checks, or explicitly forbid prompt tricks that memorize specifics.
Embrace the black box. LLMs are powerful but unpredictable. A feedback loop gives you a pragmatic way to improve without perfectly explaining every quirk.

Without getting too far ahead of myself, it’s interesting to ask how far this idea could be generalised. How tight does the loop need to be? Could the same principle extend over longer timeframes, into domains where fitness functions are fuzzier but still possible to define?