Experimental: Pushing OpenClaw to Its Limits
Every tool has edges — places where it works differently than expected, breaks in interesting ways, or just wasn’t designed to go. This post is about documenting those edges on purpose rather than stumbling into them by accident.
The Value of Intentional Failure
Most people discover a tool’s limitations by having something go wrong in the middle of a real task. That’s a bad time to learn — you’re already invested, deadlines are pressing, and you’ve got context switching cost.
Intentional experimentation — “let’s try to break this on purpose and see what happens” — gives you a map of the edges before you need them. You learn what the tool can genuinely do, what it can approximate, and what’s simply outside the envelope.
OpenClaw’s edges are interesting because they’re partly technical (context window, tool access) and partly fundamental to LLMs themselves.
Experiment 1: Long-Running Autonomous Task
Hypothesis: OpenClaw can run a multi-hour task with multiple decision points without degrading in quality or losing track of the original goal.
Setup: Ask it to research a broad topic (“the state of open source LLMs in 2026”), synthesize findings, draft a report, and iterate on the draft — without intermediate check-ins.
What actually happened: The first pass was solid. The second iteration flagged some hallucinated citations — the LLM had synthesized confidently but cited sources that don’t exist. The third iteration tried to correct this by searching for the specific citations it had invented, found nothing, and then… quietly stopped correcting them in the text.
This is the “confirmation drift” problem: once a model has written something with confidence, it tends to reinforce rather than question it on subsequent passes. The fix: never do more than 2 passes without a human in the loop.
Verdict: Useful for research drafts, but requires human citation checking. Don’t trust multi-pass LLM output without verification.
Experiment 2: Multi-Agent Delegation
Hypothesis: OpenClaw can spawn sub-agents to work on parts of a problem in parallel, then synthesize their results.
What actually happened: Parallelization worked — sub-agents were able to research different sub-topics simultaneously and report back. The synthesis step was where it got messy: the parent agent had to receive raw outputs from sub-agents and integrate them, and integration quality varied significantly based on how well-separated the sub-topics were.
For highly parallelizable tasks (researching N independent topics), this worked well. For tasks with interdependencies (each sub-task’s answer affects the next), it fell apart — the synthesis step had to do all the work of connecting things that should have been connected during the research phase.
Verdict: Good for embarrassingly parallel problems. Bad for problems that need iterative refinement across parts.
Experiment 3: Persistent State Manipulation
Hypothesis: OpenClaw can maintain complex, evolving state across dozens of operations in a single session — like running a small database or managing a queue.
Setup: Give it a JSON file representing a task queue and ask it to manage additions, completions, and priority updates across 50 operations.
What actually happened: It was remarkably good at this for the first 20-30 operations. After that, it started making subtle errors — skipping updates, losing track of priority ordering, occasionally “inventing” state that wasn’t in the file. This is a context window problem: as the file grows and the conversation history fills, the model has to hold more in context and accuracy degrades.
The lesson: for stateful workloads beyond a certain complexity, use a real database. OpenClaw is great at the logic layer on top of data, but shouldn’t be the data store itself.
Verdict: Fine for simple state (a few dozen records). Use a real database for anything production-critical.
Experiment 4: Emotional Tone Calibration
Hypothesis: OpenClaw can detect emotional subtext in user messages and respond appropriately — not just literally, but with appropriate warmth, urgency, or levity.
Setup: Feed it messages that have both a surface content and an emotional subtext, and see if responses address both.
What actually happened: It was surprisingly good at this in obvious cases — a user who says “I guess this doesn’t matter anyway” gets a different response than one who says “can you help me with X.” But it was inconsistent in subtle cases, and it had a tendency to over-pathologize mild frustration (“I see you’re feeling frustrated about this” felt patronizing when the user was just mildly annoyed).
The broader problem: LLMs are trained to produce helpful responses, which maps poorly onto situations where the most helpful response is actually to match the user’s emotional register rather than label it.
Verdict: Good for obvious emotional signals. Inconsistent and sometimes off-target in subtle cases. Don’t rely on it for emotionally loaded conversations without human review.
A Real Experiment Workflow
If you want to run your own OpenClaw boundary experiments, here’s a repeatable pattern that works:
The Setup
Create an experiment file at memory/experiments/YYYY-MM-DD-[name].md:
# Experiment: [name]
**Date:** YYYY-MM-DD
**Hypothesis:** ...
## Setup
<!-- What you configured, what prompt you used -->
## Run Log
<!-- Timestamped entries of what happened -->
## Observations
<!-- What stood out -->
## Verdict
<!-- What you'd tell someone asking if this works -->
The Run Loop
- Pre-flight check — Clear old memory context if you need a clean slate. Set
memory/experiments/active.mdto note you’re mid-experiment. - Run the experiment — Execute the task and log observations in real-time to the experiment file via
exec >>appends or directwritecalls. - Capture session ID — Note the session key so you can reload it if needed.
- Post-mortem — After the run, review the experiment file, write your verdict, and clean up
active.md.
Multi-Session Experiments
For long-running experiments (Experiment 1-type tasks), the pattern is:
- Start with a
sessions_spawnisolated session so the parent is free - The spawned session logs to a shared file
- Parent checks file periodically or on heartbeat
- Parent synthesizes findings when the child completes
This prevents the parent session from growing stale mid-experiment.
Troubleshooting Common Issues
When experiments go wrong in ways that aren’t the point of the experiment:
Sub-agent goes silent: Check if the parent session is healthy. Silent sub-agents are usually a parent crash. Kill and respawn.
Output looks great but has subtle errors: This is the most common failure mode. Build a specific factual check into your experiment design — something you can verify independently after the run.
Context pollution from prior conversation: If your experiment session started from a loaded context, note that in your methodology. Results may not replicate in a truly clean session.
Long experiment runs out of tokens: Set up a checkpoint mid-way — have the agent write progress to a file and then stop. Resume from the file rather than from context.
Document failures in the experiment log too. What didn’t work is often more informative than what did.
Experiment 5: Tool Chain Fidelity
Hypothesis: OpenClaw can execute a complex multi-step tool chain (write → exec → read → exec → write) and maintain fidelity through all intermediate steps without dropping or mangling outputs from earlier steps.
Setup: Give it a task requiring a specific chained sequence — write a config file, exec a validation command, read the result, exec a corrective command based on that result, write an updated config, then confirm the final state.
What actually happened: This is where things got genuinely interesting. The chain worked well for the first 2-3 steps. After that, two failure modes appeared consistently:
First, output truncation drift — when exec returns long output, and the model then passes that output into the next exec or write call, truncation can cause subtle but critical information loss. For example: a configuration validation command returns 40 lines of error output, the model passes the first 20 lines to a corrective sed command, and the real error was in line 31.
Second, implicit assumption creep — after several successful steps, the model starts skipping verification steps (“we can assume the previous command worked”) and acting on assumptions rather than outputs. By step 4-5 of a 6-step chain, it was running commands that had no relationship to what the previous step actually produced.
The fix: force explicit verification at each step. If you’re building a tool chain, build it as an explicit loop with a check at every iteration, not as a linear sequence of tool calls.
Verdict: Works reliably for 2-3 step chains with verified outputs. Beyond that, insert explicit checkpoints and don’t let the model skip them.
A Structured Results Log
Beyond the per-experiment file, running multiple experiments over time benefits from a cross-experiment results log — a single place where you record verdicts in a scannable format. Here’s a structure that works:
# Experiment Log
| Date | Experiment | Hypothesis | Verdict | Key Finding |
|------|------------|-----------|---------|-------------|
| 2026-05-01 | Long-running autonomy | Multi-hour task without degradation | ⚠️ Partial | Citation hallucination in 3rd pass; cap at 2 passes |
| 2026-05-03 | Multi-agent delegation | Parallel sub-agents with synthesis | ✅ Good | Works for embarrassingly parallel; fails for interdependent |
| 2026-05-07 | Tool chain fidelity | 6-step chained tool sequence | ⚠️ Partial | Degrades after step 3; needs explicit checkpoints |
Verdicts in this format (✅, ⚠️, ❌) make it easy to scan your findings at a glance. Update the log after each experiment and review it monthly — patterns emerge that individual experiment files don’t show.
Experiment Design Criteria
Before running any experiment, define these four things in writing. Without them, you’ll end up with a story, not data.
Success criteria — what does “working” mean? Be specific. “Seems better” is not a success criterion. “Sub-agent produces code review with no false negatives on a known 20-issue codebase” is.
Failure threshold — at what point do you call the experiment a fail? Knowing when to stop is as important as knowing when to continue.
Verification method — how will you independently confirm the result? This is the hardest part. If you can’t verify the output, you can’t trust the verdict.
Run count — how many times will you run this? One run tells you what’s possible. Three runs tell you what’s reliable. Five runs tell you whether the reliability is consistent.
A good experiment log captures all four before you start, not after you’ve already seen the results.
Choosing the Right Experiment Type
Use this as a rough decision guide:
Is the failure mode you're testing LOW-PROBABILITY but HIGH-IMPACT?
└─ YES → Long-tail experiment (build capture mechanism, then observe)
└─ NO ↓
Do you already KNOW the boundary exists and just need to map it?
└─ YES → Boundary experiment (define stopping criteria upfront)
└─ NO ↓
Are you comparing TWO SPECIFIC APPROACHES to the same problem?
└─ YES → Comparison experiment (matched inputs, agreed rubric)
└─ NO ↓
Did something FAIL unexpectedly during normal use?
└─ YES → Surprise experiment (document everything, follow threads)
└─ NO ↓
Are you verifying something STILL WORKS after a change?
└─ YES → Regression experiment (preserve the baseline test)
└─ NO → Start over: clarify your hypothesis first
Common Experiment Types
Not all experiments are the same. Knowing what kind you’re running helps you design the methodology and interpret the results correctly.
Boundary experiments push a known limit until it breaks. “How many sequential state operations can OpenClaw handle before errors appear?” You already know the boundary exists; you’re mapping where it is. These need clear success criteria defined upfront so you don’t move the goalposts. Best paired with a decision guide: define your failure threshold (e.g., “error rate exceeds 5%”) before you start, not during.
Surprise experiments find limits you didn’t expect. You run a normal task and something unexpected fails. These are valuable but harder to design — you can’t plan for unknown unknowns. The methodology here is just “document everything” and follow threads that feel off. The trap is discounting anomalies because they don’t fit your hypothesis. Don’t.
Comparison experiments test two approaches to the same problem. “Does a spawned sub-agent produce better code review than the parent session directly?” These need matched inputs and an agreed evaluation rubric — otherwise confirmation bias slides the verdict. The comparison must be fair: same task, same conditions, same evaluator.
Regression experiments verify that something that used to work still works after an update or config change. These are the simplest to run: you have a baseline, you run the same test, you compare. The hard part is maintaining the baseline test suite. Keep inputs and expected outputs in a fixed file so you can run the same check after any OpenClaw version update.
Long-tail experiments test low-probability, high-impact scenarios — what happens if a sub-agent goes silent mid-run, or the session crashes at step 4 of 6? You can’t easily simulate these in advance, but you can build the capture mechanism (logging to files outside the session) and then wait for the real event. The value is in recognizing and capturing the event when it happens, not in predicting it.
Design your experiment before you run it. The worst results come from experiments where the methodology wasn’t settled in advance.
When NOT to Run Experiments
Experiments are useful, but they’re not always the right tool:
When you need a reliable answer quickly. Experimental methodology requires multiple runs, controlled conditions, and time to analyze. If you need a yes/no answer for a production decision today, a 20-minute experiment isn’t going to give you one — it’s going to give you a data point.
When the failure mode is too expensive. If you’re testing a boundary in a production system and crossing it means data loss or service disruption, run the experiment in isolation first. No experiment is worth an outage.
When you’re emotionally attached to a specific outcome. If you already “know” what the experiment should show, you’ll unconsciously design it to show that. Run experiments to find out, not to confirm. If you can’t hold the hypothesis loosely, get someone else to run it.
When you don’t have time to document. An undocumented experiment is just a guess with extra steps. If you can’t write up the methodology and findings, the experiment time is mostly wasted.
Cross-Experiment Pattern Analysis
After running several experiments, you start seeing patterns that no single experiment reveals:
Failure modes cluster into types:
- Confidence mismatches — the model acts more certain than it should (hallucinated citations, skipped verification)
- Context degradation — quality drops as context grows (tool chains, state management)
- Emotional miscalibration — responses that are technically correct but socially wrong
- Silent failures — output looks fine but contains subtle errors
Experiment design affects results more than expected: An experiment run with a fresh isolated session produces different results than one run inside a busy parent session. The session context bleeds in. Always note the session state in your methodology.
Time-of-day and token budget effects: Long experiments run near the end of a token budget show degraded quality — not just mid-experiment checkpoints failing, but actual output quality degradation on the final steps. Run high-stakes experiments on a fresh budget.
Understanding these patterns changes how you design future experiments. You start building in isolation, fresh budgets, and explicit checkpoints not because you’re paranoid, but because you know what’s likely to happen.
What’s Interesting About These Failures
The interesting thing isn’t that OpenClaw failed these experiments — every tool fails at its edges. The interesting thing is how it fails:
- It fails confidently — it doesn’t signal uncertainty when it should
- It fails silently — sometimes the output just looks reasonable and the errors are subtle
- It fails in context — performance degrades as context grows, which is hard to predict from the outside
None of this is a knock on OpenClaw specifically — it’s how all LLM-based systems work at the frontier. The answer isn’t to avoid these uses; it’s to use them with appropriate checkpoints and human oversight.
What You Need to Set This Up
Running experiments against OpenClaw doesn’t require much beyond standard tooling:
- OpenClaw installed and running — the core requirement
- File system access — for reading/writing state files and logs during experiments
- Sub-agent access — for the multi-agent experiment;
sessions_spawnmust be available - A way to track results — a simple markdown log or JSON file to record outcomes across runs
- Patience — you’ll run experiments multiple times to confirm results; single runs can be misleading
That’s it. No special integrations, no external services. You’re testing the system itself, so the setup is minimal.
Limitations
Running experiments against OpenClaw has its own constraints worth knowing before you invest heavily:
-
Time cost — Long-running autonomous experiments consume significant tokens. A 6-step tool chain with checkpoints can burn through a third of a daily budget in one run. Budget accordingly, especially for multi-pass synthesis tasks. If you’re running experiments regularly, consider isolating them to a separate session to avoid contaminating your main working context.
-
Reproducibility variance — LLMs have non-deterministic elements (temperature, sampling). Running the same experiment twice can yield different results. Treat single runs as directional, not definitive. The variance isn’t random noise — it often clusters around certain difficulty levels. Hard cases show more variance than easy ones. Factor that into how you interpret results.
-
No ground truth for subjective outputs — When testing outputs that require judgment (code quality, writing quality, emotional tone), you become the ground truth — which means your own biases filter the results. The workaround is to define evaluation rubrics in advance, or use a separate evaluator agent with different prompting. Neither is perfect, but both reduce drift.
-
Context cleanliness is hard to guarantee — Experiments run inside an active session with conversation history. Prior messages can subtly influence outcomes — not because the model “remembers” but because the context window shapes its responses. Isolation helps (fresh sessions, spawned sub-agents) but isn’t always available when you want to continue a thread. Note the session state in your methodology so you know whether results are session-specific or generalizable.
-
Sub-agent experiments amplify failure modes — If parent session health is a concern in normal experiments, sub-agent experiments multiply it. A parent crash mid-sub-agent-run loses both the parent context and the sub-agent’s partial work. Log aggressively to files outside the session hierarchy. A sub-agent that logs its state to
memory/experiments/session-abc.logsurvives a parent crash. -
Hypothesis contamination — Once you’ve run an experiment and seen the results, it’s nearly impossible to re-run it with a clean hypothesis. You know what you expect, and that expectation subtly shapes how you evaluate. The only fix is to write your success criteria and evaluation method before the first run, then not look at intermediate outputs until the full run is done.
Document your methodology alongside your findings. Future-you will want to know what the setup actually was, not just what the results were. The experiment log template in the Workflow section is designed to capture this — fill it in before you run, not after.
Bottom Line
OpenClaw is excellent at: reasoning, research synthesis, creative concept development, scheduling, automation, and anything where you can verify the output.
It’s unreliable at: hallucination-prone multi-pass generation, complex persistent state, emotionally nuanced communication at scale, and tasks requiring real-time sensory data.
Use it where it’s strong. Build checkpoints where it’s weak.
Want to try this with OpenClaw?
OpenClaw is free and open source. Get started at openclaw.ai
Try OpenClaw →