marcosomma
I Ran 500 More Agent Memory Experiments. The Real Problem Wasn’t Recall. It Was Binding.

Rigor beyond happy-path testing

This is a follow-up to I Tried to Turn Agent Memory Into Plumbing Instead of Philosophy. If you haven't read that one, the short version: I built a persistent memory system for AI agents called OrKa Brain, ran 30 benchmark tasks, got a 63% pairwise win rate and a +0.10 rubric improvement, and concluded that "the model already knew most of what the Brain was recalling." Then I got some very good comments that made me uncomfortable. This is what happened next.


The Comfortable Lie I Told Myself

After the first benchmark, I had a narrative that felt reasonable: the memory system works, the numbers are positive, the confounds are acknowledged, and more data will clarify things.

That last part, "more data will clarify things", is what engineers say when they don't want to admit they might be wrong. I said it too. And then I went and got more data.

250 tasks. Five specialized tracks. 500 total runs (brain vs. brainless). A separate judge model so the LLM wasn't grading its own homework. Eleven code changes addressing five root-cause problems I'd identified from the first round.

The results came back. They didn't clarify things. They made them worse.

What I Fixed Before Running Again

I'm not going to pretend I just blindly re-ran the same experiment. I did real work between benchmark v1 and v2. The first article's comments called out several things, and I addressed them:

Problem 1: Skills were storing verbatim LLM output, not abstract patterns.

This was the big one. When the Brain learned a skill from a data engineering task, it stored the literal steps: "Load CSV files into staging tables using pandas read_csv with error handling." That's not transferable knowledge, it's a paraphrase of what the model already knows. I rewrote the abstraction layer (orka/brain/constants.py, brain.py, brain_agent.py) to extract verb-target patterns: "implement [target]", "validate [component]", "trace [target]". The idea was that abstract patterns would transfer better across domains.

Problem 2: The recall threshold was zero.

min_score=0.0 meant any vaguely related skill could get recalled. I raised it to 0.5 and added a semantic floor in transfer_engine.py: if the embedding similarity is below 0.1 AND the structural match is below 0.6, the candidate gets rejected entirely.
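As a sketch, that gate reads like this. The function and argument names are mine for illustration, not the actual transfer_engine.py API:

```python
def passes_recall_gate(score: float, semantic_sim: float, structural_match: float,
                       min_score: float = 0.5, sem_floor: float = 0.1,
                       struct_floor: float = 0.6) -> bool:
    """Reject weak recall candidates outright (illustrative, not OrKa's API)."""
    if score < min_score:
        return False  # raised threshold: vaguely related skills no longer pass
    if semantic_sim < sem_floor and structural_match < struct_floor:
        return False  # semantic floor: both signals weak, reject entirely
    return True
```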

Problem 3: The model was judging its own output.

v1 used the same LLM for execution and evaluation. v2 uses a separate judge model (qwen/qwen3-coder-30b) with dedicated rubric and pairwise workflow YAMLs. Execution and judgment are completely decoupled, different scripts, different models, different runs.

Problem 4: Track diversity.

v1 had one track. v2 has five:

| Track | Focus | Why It Matters |
|---|---|---|
| A | Cross-domain transfer | Does a data engineering skill help with cybersecurity? |
| B | Ethical reasoning | Do anti-pattern detection skills transfer? |
| C | Routing decisions | Hardest track; complex multi-path choices |
| D | Multi-step reasoning | Do procedural patterns help new reasoning chains? |
| E | Iterative refinement | Do improvement patterns compound? |

50 tasks per track, 250 total. All available in the benchmark dataset.

Problem 5: Single-pass baselines.

The brainless condition now runs through a properly equivalent pipeline, same structure, same number of agents, just without the Brain recall/learn steps. No more two-pass advantage that could inflate brainless scores. Baseline workflows: baseline_track_a.yml, baseline_track_b.yml, etc.

I also split the pipeline into three standalone scripts (execution, judging, aggregation) so you can re-run any phase independently. Eleven code changes total, all committed and tested, with 3,014 unit tests passing. You can verify everything in the results directory.

I felt good about this. I'd addressed every valid criticism. Time to re-run.

The Numbers

Here's the overall aggregate from 250 tasks, brain vs. brainless:

Rubric Scores (1–10 scale, six dimensions)

| Dimension | Brain | Brainless | Delta |
|---|---|---|---|
| Reasoning Quality | 9.51 | 9.52 | −0.01 |
| Structural Completeness | 9.87 | 9.83 | +0.04 |
| Depth of Analysis | 8.79 | 8.74 | +0.05 |
| Actionability | 9.67 | 9.64 | +0.03 |
| Domain Adaptability | 9.85 | 9.82 | +0.03 |
| Confidence Calibration | 9.38 | 9.39 | −0.01 |
| **Overall** | 9.37 | 9.31 | +0.06 |

A +0.06 rubric delta across 250 tasks.

For reference, v1 was +0.10 across 30 tasks. So the effect got smaller with more data, not larger. That's not what you want to see.

Pairwise Comparison (245 head-to-head comparisons)

| Question | Brain Wins | Brainless Wins | Tie |
|---|---|---|---|
| Stronger reasoning | 152 | 91 | 2 |
| More complete | 149 | 92 | 4 |
| More trustworthy | 151 | 92 | 2 |
| Overall | 151 | 92 | 2 |

Brain win rate: 61.6%

Here's where it gets uncomfortable. The pairwise judge says brain wins 62% of the time. The rubric judge says brain is +0.06 better, which is noise at a 9.3/10 baseline. These two metrics should agree. They don't.

I've seen this pattern before. It's length/position bias. Brain responses tend to be longer because the pipeline has more agents in the chain, which means more context, which means more text. Pairwise judges prefer longer answers. The rubric doesn't care about length, it scores each dimension independently.
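One way to probe that hypothesis from the raw results: compare the brain/brainless length ratio in the comparisons each side won. The record shape below is my assumption for illustration, not the benchmark's actual result schema:

```python
def length_bias_check(records):
    """Compare average length ratio (brain/brainless) split by pairwise winner.

    `records` is a list of (winner, brain_len, brainless_len) tuples.
    If brain's wins cluster on tasks where its answer was much longer,
    that's consistent with length bias in the pairwise judge.
    """
    def avg_ratio(winner):
        ratios = [b / bl for w, b, bl in records if w == winner and bl]
        return sum(ratios) / len(ratios) if ratios else 0.0

    return avg_ratio("brain"), avg_ratio("brainless")
```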

Per-Track Breakdown

This is where the story gets interesting:

| Track | Focus | Rubric Δ | Pairwise Win% | Brainless Baseline |
|---|---|---|---|---|
| A | Cross-domain transfer | −0.02 | 60% | 9.33 |
| B | Ethical reasoning | +0.00 | 52% | 9.54 |
| C | Routing decisions | +0.40 | 60% | 8.12 |
| D | Multi-step reasoning | +0.08 | 60% | 9.49 |
| E | Iterative refinement | +0.06 | 76% | 9.61 |

Track C stands out. It's the hardest track: brainless scores only 8.12, nearly a full point below every other track. And it's the only track where brain shows a meaningful rubric gain: +0.40 across six dimensions.

Track E has the highest pairwise win rate (76%) but one of the smallest rubric gains (+0.06). That's the length-bias signature: the pairwise judge loves brain's longer outputs, but the rubric says they're not actually better.

Track B is essentially a coin flip. 52% pairwise, +0.00 rubric. The Brain adds nothing to ethical reasoning tasks.

The Ugly Detail: Skill Usage

Here's what really killed me. I dug into the individual results to see how many tasks actually used their recalled skill:

  • Tasks with skill recall attempted: 51 / 250 (20%)
  • Tasks that actually used the recalled skill: 0 / 250
  • Average semantic match score: ~0.02 (near zero)

Zero. Not one single task out of 250 used the recalled skill. The model read the skill, evaluated it, and decided every single time that it wasn't helpful. And the semantic similarity between the abstract skill and the actual task was essentially random noise.

The abstraction layer I was so proud of, the one that converts "Load CSV files into staging tables using pandas" into "implement [target]", produced skills so abstract they were vacuous. Two words of content. The embedding model sees no relationship between "implement [target]" and any real task. The execution model correctly recognizes that "implement [target]" tells it nothing it doesn't already know.
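To make the failure concrete, here's a toy version of that over-aggressive abstraction. The extractor below is hypothetical (not the actual constants.py code), but it shows how rich steps collapse to near-identical two-word shells:

```python
import re

# Hypothetical verb whitelist; anything else falls back to a generic verb.
ACTION_VERBS = {"load", "implement", "validate", "trace", "filter", "deduplicate"}

def abstract_step(step: str) -> str:
    """Collapse a concrete step to '<verb> [target]', discarding the content
    that actually carried meaning (illustration of the v2 abstraction bug)."""
    first_word = re.split(r"\s+", step.strip().lower())[0]
    verb = first_word if first_word in ACTION_VERBS else "apply"
    return f"{verb} [target]"
```

Run it on two unrelated procedures and you get nearly indistinguishable patterns, so the embedding model has nothing to match against a real task.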

I had gone from skills that were too specific (literal LLM paraphrases) to skills that were too abstract (empty shells). The sweet spot, actual transferable knowledge, was somewhere I hadn't found.

Sitting with the Discomfort

I'm going to be honest about what went through my head at this point. I've been working on OrKa for over a year. Forty blog posts. A research paper about the Agricultural Threshold for machine intelligence. An open-source framework that allows me to test, experiment, and explore my ideas with real AI runs. And the core thesis, that persistent memory makes agents better, keeps failing to show up in the numbers.

I considered dropping the whole Brain system. Making OrKa just an orchestration framework. Simpler. Easier to explain. No embarrassing benchmarks.

But then I looked at Track C again.

Track C is the only track where brainless *struggles*. It scores 8.12, good but not great. The tasks involve complex routing decisions where the model has to consider multiple paths and trade-offs. This is the only track where the model actually needs help.

And it's the only track where brain provides meaningful help. +0.40 rubric delta is not noise. Across 50 tasks and six scoring dimensions, that's a consistent, measurable improvement.

The pattern is simple: the Brain helps when the model needs help, and doesn't help when the model doesn't need help.

That sounds obvious in retrospect. But it means the thesis isn't wrong, it's just being tested in the wrong conditions. You wouldn't evaluate a life jacket by putting it on people standing on dry land and measuring whether they're drier.

The Real Problem: What Is a Memory?

This is where the story changes. Because instead of asking "does memory help?" I started asking "what is a memory, actually?"

Think about how you remember how to drive a car. What fires in your brain when you approach an unfamiliar intersection?

It's not one thing. It's not "turn the wheel, press the gas." That's the procedural part, and yes, it's there. But it's bound together with other things:

  • The time you nearly got T-boned because you assumed a green light meant it was safe without checking cross traffic. That's episodic memory, a specific event with emotional weight.
  • "Right of way doesn't mean right of safety." That's semantic memory. A general fact you learned, maybe from a driving instructor, maybe from experience.
  • "Checking mirrors BEFORE entering the intersection prevents blind-spot collisions BECAUSE turning reduces your field of vision." That's causal reasoning. You know why the sequence matters, not just that it matters.

When you encounter the intersection, all of these fire together. The procedure tells you what to do. The episode tells you what happened last time. The semantic fact tells you a principle. The causal link tells you why. That combination, that binding, is what makes the memory useful. Any single component alone is much less helpful.

Now look at what OrKa Brain currently stores as a "skill":

```
implement [target]
trace [target]
```

That's it. No episodes. No semantic context. No causal reasoning. Just two abstract action verbs. No wonder the model ignores it. It's like handing a driver a note that says "steer [vehicle]" and expecting it to help at the intersection.

The Memory Binding Problem

I went down a rabbit hole into cognitive science literature on this. What I found is that neuroscientists have been arguing about this exact problem for decades. They call it the binding problem, how does the brain take separate memory traces stored in different systems and combine them into a unified experience?

The hippocampus doesn't store the memory. It stores the index, the binding that links the procedural memory in the motor cortex, the emotional trace in the amygdala, the spatial context in the parietal cortex, and the semantic facts in the temporal lobe. When you recall one, you recall all of them, because they're bound together.

I had built the hippocampus and the motor cortex as two separate systems that had never met.

Here's what actually exists in OrKa today:

The Skill system (fully operational, used in benchmarks):

  • Abstract procedure steps
  • Preconditions and postconditions
  • Transfer history and confidence scores
  • Structural/semantic matching for recall

The Episode system (fully built, tested, never used in any benchmark):

  • Specific task input and outcome
  • What worked and what failed
  • Root cause analysis for failures
  • Actionable lessons learned
  • Resource metrics (tokens, latency)
  • Links to related episodes

Both systems are production-ready. Both have full test coverage. Both are integrated into the Brain class. I wrote record_episode(), recall_episodes(), EpisodeStore, EpisodeRecall, all of it. Complete with semantic search, retention policies, and four-dimensional scoring.

And then I never connected them together.

The Skill has no episode_id field. The Episode has no skill_id field. brain.learn() creates a Skill but not an Episode. brain.recall() returns Skills but not Episodes. The benchmark workflows run brain_learn and brain_recall, but never brain_record_episode or brain_recall_episodes.

Two complete memory systems, sitting in the same codebase, sharing no information.

When I saw this, I felt stupid. But I also felt something else: the architecture was already 80% there. The hard parts, embedding storage, semantic search, decay policies, scoring systems, were done. The missing piece wasn't a new system. It was the wiring between existing systems.

What a Memory Should Actually Look Like

Here's the concept I'm now calling a Memory Bundle:

```
┌─────────────────────────────────────────┐
│            MEMORY BUNDLE                │
│                                         │
│  ┌───────────┐  ┌──────────────────┐    │
│  │ Procedure │  │ Episodes (1..N)  │    │
│  │ (steps)   │──│ what worked      │    │
│  │           │  │ what failed      │    │
│  └───────────┘  │ lessons          │    │
│                 │ "X+Z → Y"        │    │
│  ┌───────────┐  └──────────────────┘    │
│  │ Semantic  │                          │
│  │ (domain   │  ┌──────────────────┐    │
│  │  facts)   │  │ Causal Links     │    │
│  │           │  │ "A because B"    │    │
│  └───────────┘  └──────────────────┘    │
│                                         │
│  transfer_score = f(all_components)     │
└─────────────────────────────────────────┘
```

When the system learns from an execution, it creates both a skill AND an episode, linked by ID. The skill stores the abstract procedure. The episode stores what actually happened, the specific outcome, what worked, what failed, and crucially, the lessons: "Running validation before deduplication caught 30% of bad records that would have been duplicated, always validate first."

When the system recalls, it returns the skill with its episodes attached. The prompt to the model isn't "implement [target]", it's:

Here's an abstract procedure: implement [target] → validate [component] → trace [target].

This skill has been applied 3 times before:

  • Data engineering (ETL): Validation before dedup caught 30% of dirty records. Lesson: always validate before any deduplication step.
  • API integration: Target implementation worked, but tracing missed async callbacks. Lesson: tracing needs to account for async execution paths.
  • Log analysis: Pattern worked well. Filtering noisy entries before analysis reduced false positives by 40%.

That's a memory a model can actually use. It has the abstract pattern (transferable) AND the concrete evidence (grounding). The model can decide whether the pattern applies here based on real outcomes, not just structural similarity.

The transfer scoring changes too. A skill backed by five successful episodes with clear lessons should score higher than a skill backed by zero episodes. The episode quality becomes part of the transfer decision.
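A sketch of how episode evidence could fold into that decision. The weights and the tuple shape are illustrative assumptions, not OrKa's actual scoring:

```python
def transfer_score(structural_match, semantic_sim, episodes):
    """Blend structural/semantic matching with episode evidence.

    `episodes` is a list of (succeeded, has_lesson) booleans per prior
    application. Unproven skills are discounted; successful, lesson-bearing
    episodes restore (and can exceed) the base score.
    """
    base = 0.5 * structural_match + 0.5 * semantic_sim
    if not episodes:
        return base * 0.7  # no evidence: discount the match-only score
    successes = sum(1 for ok, _ in episodes if ok)
    lessons = sum(1 for _, lesson in episodes if lesson)
    evidence = (successes + 0.5 * lessons) / (1.5 * len(episodes))
    return base * (0.7 + 0.3 * evidence)
```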

And feedback updates both, the skill's confidence changes, AND a new episode gets recorded for this application. The episode chain grows over time, and future recalls get richer context.

Why This Is Actually About the Thesis

My research paper argues that intelligence becomes civilization-scale only through recursive environmental control loops: project, act, observe, revise, compound. Agriculture was the first time humans did this at scale. The agricultural threshold.

The current Brain system doesn't cross that threshold. It projects (learns a skill), acts (recalls it), but doesn't truly observe or revise. The skill never learns from its own application. It just accumulates abstract patterns with no connection to real outcomes.

The Memory Bundle changes this. Each episode is an observation. Each lesson is a revision. Each future recall that includes those lessons is compounding. The loop closes:

  1. Learn: Execute a task → create skill + record episode (with what worked/failed)
  2. Recall: Find matching skill → include its episodes as evidence
  3. Apply: Model uses the procedure + the concrete lessons
  4. Feedback: Record a new episode for this application → update skill confidence
  5. Compound: Next recall is richer, it has more episodes, more lessons, more evidence

That's the recursive loop. That's the agricultural threshold. And the architecture for it already exists, it just needs the binding.

What About Track C?

This also explains why Track C was the only track that showed improvement. Track C tasks are routing decisions, complex, multi-path choices where the model has to weigh trade-offs. These are exactly the kind of tasks where episodic evidence would help most.

When someone says "last time we tried path A for a similar routing problem, it failed because of X, path B worked because of Y," that's genuinely new information. The model can't derive it from its weights. It's system-specific, run-specific, outcome-specific.

The current brain helped Track C even without episodes because the tasks are hard enough that any additional context, even a vague abstract skill, provides a useful scaffold. But imagine Track C with Memory Bundles, the model would get both the abstract pattern AND the specific outcomes from previous routing decisions.

Tracks A, B, D, and E didn't improve because the model already scores 9.3+/10 on them. It doesn't need help. No amount of memory, procedural, episodic, or otherwise, will turn a 9.5/10 response into a 10/10 response. The tasks aren't hard enough to require accumulated knowledge.

This isn't a failure of the memory system. It's a boundary condition. Memory helps when the task exceeds single-shot capability. It doesn't help when the model is already near-perfect without it.

What I'm Not Claiming

I want to be careful here, because I've been burned before by getting ahead of my own evidence.

I'm not claiming that Memory Bundles will definitely show large improvements. I'm claiming that the current system stores memories that are too impoverished to be useful, and I now understand what richer memories should look like.

I'm not claiming the ceiling effect is the only problem. The pairwise-rubric disagreement at 62% vs +0.06 suggests position/length bias is still contaminating the pairwise results. That confound exists regardless of memory architecture.

I'm not claiming this is a new idea. Cognitive scientists have written about memory binding for decades. What's new (maybe) is applying it to agent memory systems where the default assumption seems to be that one type of memory, usually RAG-style document retrieval, is sufficient.

And I'm not pretending the community feedback didn't shape this thinking. When TechPulse Lab wrote that episodic and institutional memory matters more than procedural memory, they were describing exactly the gap I ended up finding. When Nova Elvaris pointed out that skills can only grow, never decay, that's the absence of failure episodes. When Kuro said memory maintenance matters more than storage, that's about binding quality, not storage quantity.

I just didn't understand what they were telling me until the numbers forced me to look harder.

What Happens Next

The code changes needed are surprisingly small. The Episode system is already built, episode.py, episode_store.py, episode_recall.py are all production-ready with tests. What's needed:

  1. Binding: Add episode_ids[] to Skill, add skill_id to Episode. When brain.learn() fires, it creates both and links them.
  2. Unified recall: When brain.recall() finds a matching skill, it fetches the associated episodes automatically. The prompt template includes both the abstract procedure and the concrete lessons.
  3. Transfer scoring: Episode quality becomes a component of the transfer score. Skills with successful episodes score higher.
  4. Feedback loop: brain.feedback() records a new episode for the current application, so the skill's evidence base grows over time.
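The binding in step 1 could look roughly like this. The dataclasses below are minimal stand-ins, not OrKa's actual Skill and Episode types, but the field names mirror the plan (episode_ids on Skill, skill_id on Episode):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Episode:
    id: str
    skill_id: str   # back-reference: every episode knows its skill
    outcome: str
    lesson: str

@dataclass
class Skill:
    id: str
    procedure: str
    episode_ids: List[str] = field(default_factory=list)  # forward references

def learn(store: dict, skill_id: str, procedure: str, outcome: str, lesson: str):
    """Sketch of brain.learn(): create skill AND episode, linked by ID."""
    skill = store.setdefault(skill_id, Skill(skill_id, procedure))
    episode = Episode(f"{skill_id}-ep{len(skill.episode_ids)}",
                      skill_id, outcome, lesson)
    skill.episode_ids.append(episode.id)
    return skill, episode
```

With that wiring in place, a unified recall can fetch the skill and hydrate its episodes in one step, which is what feeds the richer prompt shown earlier.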

Then re-run the benchmark. Specifically on Track C-difficulty tasks, where the model actually needs help.

I'm not going to promise the numbers will be different this time. I've been wrong before, twice now, measured against my own benchmarks, published for everyone to see. But I understand something I didn't understand before: a memory without experience is just a note. A memory with experience is a skill.

The plumbing metaphor from the first article still holds. But I was plumbing one pipe when the system needs at least four, all flowing into the same tap.


All benchmark data, scripts, and results are publicly available in the OrKa repository. The full result files include every individual task response, judge score, and pairwise comparison. If you want to re-run the analysis: `python aggregate_benchmark.py --judge-tag local`.

If you've worked on agent memory systems and found similar walls, or found ways through them, I'd genuinely like to hear about it. The comments on the first article were more useful than most papers I've read on the topic.


This is part of an ongoing series about building OrKa, an open-source YAML-first agent orchestration framework. Previous installments: Part 1: Plumbing Instead of Philosophy.

Top comments (28)

Victor Okefie

The binding problem is the real insight. Most memory systems treat recall as retrieval, find the relevant note and hand it over. But memory isn't a library. It's a network. The procedure without the episode is an instruction without evidence. The episode without the procedure is a story without structure. You built both systems separately. The gap wasn't recall. It was connection. That's fixable. Most people would have quit at 0/250 skills used. You kept going. That's the difference between a benchmark and a builder.

marcosomma

Exactly 😅. That was the turning point for me.

0/250 was not really a storage failure 😁. It was a binding failure. I had the procedure and I had the episode, but I did not yet have the mechanism that kept them attached when the next task arrived. Without that, skill memory is just archived text with a better label.

That is why I have stopped thinking about memory as retrieval alone. The real question is not just what the system can recall, but what prior should remain attached to the current decision, failure mode, and task shape.

And yes, I agree with your last point. A benchmark gives you the miss. Building starts when you treat that miss as the actual signal.

marcosomma

@theeagle You nailed it, "instruction without evidence" is exactly what 0/250 proved.

OrKa already stores skills as nodes in a graph (skill_graph.py). The episode system exists too, storage, semantic search, retention, scoring, all tested and production-ready. But today they're disconnected. Nodes with no edges. Evidence with no structure.

The fix is conceptually simple: episodes become the edges. "implement [target]" alone is empty. But an edge saying "applied to ETL, validation before dedup caught 30% of bad records" gives the node weight. Another edge: "applied to log analysis, filtering before aggregation cut false positives by 40%." Now the model has a reason to use the skill.

The harder problem, as you say, isn't building edges, it's graph maintenance. Stale edges must decay, failed episodes should weaken differently than successes, isolated nodes should expire. The graph has to self-organize around what's useful, not just accumulate.

The binding, skill_id on Episode, episode_ids on Skill, recall that returns nodes with their edges, is next.

CapeStart

“Memory without experience is just a note” that line sticks.

Archit Mittal

The 0/250 skill utilization finding is incredibly valuable precisely because you published it honestly. In my automation consulting work, I've hit a similar wall: when building multi-step AI workflows with n8n and Claude, the agent "remembers" previous steps in the chain but can't bind that memory to novel decision points in branching logic.

Your Track C insight — memory only helps where the model is already struggling — maps directly to what I see in production. My clients' automation pipelines work flawlessly on happy paths, but the moment there's an ambiguous routing decision (e.g., "is this invoice a duplicate or a revised version?"), the agent needs prior context about what happened last time with similar edge cases. That's episodic memory, not procedural.

The Memory Bundle concept feels like the right abstraction. Curious whether you've considered a confidence-gated recall approach — only injecting memory when the agent's initial confidence on a routing decision falls below a threshold. That would avoid the ceiling-effect dilution you saw in Tracks A/B/D/E while concentrating the memory system's value where it actually matters.

Suny Choudhary

This matches what I’ve seen as well.

Recall is usually the easier part to solve. Binding is where things break, especially when the agent has to associate the right context with the right action across steps.

It gets even trickier once you add tools and longer workflows. At that point, it’s less about memory itself and more about how consistently the system uses that memory.

okram_mAI

The 0/250 skill utilization result is the most honest finding here, and arguably the most valuable. You didn't bury it, which is rare.
The abstraction pendulum swing you described (too literal → too abstract → vacuous) is a classic trap in knowledge representation. "implement [target]" is essentially a no-op embedding: it's so generic that cosine similarity against any real task will be near zero. The sweet spot isn't midway between those two extremes, it's a different dimension entirely, which is exactly what the Memory Bundle gets at.
What strikes me most is that both the Skill and Episode systems were already production-ready, just unconnected. That's not a failure of the architecture, it's a failure of binding at the design level, which mirrors the exact problem you're solving in the agent. You built two lobes of a brain with no hippocampus between them.
The Track C signal is worth dwelling on. The fact that memory only helped where the model was already struggling, not where it was already near-perfect, should probably be the core design principle for when to invoke recall at all. Rather than always injecting memory into the prompt pipeline, a confidence-gated recall (only fetch memory when the task difficulty exceeds a threshold) might actually improve the signal-to-noise ratio significantly, and prevent the ceiling-effect tracks from diluting your aggregate numbers.

One question: for the Memory Bundle's transfer scoring, are you planning to weight episode recency or just volume and success rate? A skill with 10 old successful episodes might be less reliable than one with 2 recent ones in a fast-moving domain.

marcosomma

Yes. I think that is exactly the right direction.

The 0/250 result pushed me to the same conclusion. Memory should probably not be injected by default. If the model is already operating on a near-solved path, extra recall is mostly noise. The real value appears when the system is close to a routing boundary, a failure surface, or a low-confidence decision. So I am increasingly thinking recall should be gated by need, not treated as a constant layer in the pipeline.

And on transfer scoring, no, I do not think volume plus success rate is enough. Recency has to matter, but not as a global constant. It should be weighted by domain volatility. Ten successful episodes from a stable domain may still be highly valuable. Ten successful episodes from a fast-moving domain may already be stale. So my current intuition is that transfer scoring has to blend success rate, recency, transfer breadth, and contextual similarity, with decay increasing when a skill stops transferring or starts failing in newer contexts.

In other words, I do not want skills to survive because they were historically useful. I want them to survive because they remain useful under present conditions. That is probably the only way to stop the graph from becoming a museum of expired competence.
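Roughly what I have in mind, as a sketch (the names, weights, and half-life formula are placeholders, not OrKa code):

```python
import math

def skill_weight(successes, ages_days, domain_volatility):
    """Volatility-weighted recency for transfer scoring.

    Each successful episode decays exponentially with age, and the half-life
    shrinks as the domain moves faster (volatility in [0, 1]). Ten stale
    successes in a volatile domain can weigh less than two fresh ones.
    """
    half_life = 180 / (1 + 4 * domain_volatility)  # days
    return sum(
        ok * math.exp(-math.log(2) * age / half_life)
        for ok, age in zip(successes, ages_days)
    )
```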

Rhumb

Really strong writeup. The shift from recall to binding feels exactly right.

A memory system can retrieve something "relevant" and still be operationally useless if the recalled item is not bound to the current task shape, failure mode, and decision surface. That is where persistent memory starts acting less like context compression and more like a trust boundary. The saved memory is changing what the next agent believes before it acts.

That is why your Track C result is the interesting one. Memory did not help where the base model already knew how to finish the task. It helped where the agent needed prior structure for a routing decision. That sounds less like "agents need more memory" and more like "agents need typed prior decisions that stay attached to the right context."

My guess is the next jump is not a better generic skill abstraction, but separating memory into roles like decision, constraint, anti-pattern, mistake, and evidence instead of one catch-all skill bucket. Then the model is binding the right kind of prior, not just retrieving semantically adjacent text.

marcosomma

Exactly. That is very close to where my thinking is moving.

The problem is not memory as volume. It is memory as operative structure.

A recalled item can be relevant in a semantic sense and still be useless, or even damaging, if it is not bound to the current task, failure mode, and decision surface. At that point memory is no longer just retrieval. It is shaping the next action boundary.

That is also why Track C mattered to me. It was not about helping the model "know more." It was about giving it the right prior structure at the moment a routing decision had to be made.

And yes, I think your point about typed memory is probably the next real step. A mistake, a constraint, an anti-pattern, a prior decision, and a piece of evidence should not live in the same bucket just because they are all "memory." They play different operational roles and should bind differently at runtime.

So I suspect the next jump is not bigger memory, but more explicit memory semantics. Not just retrieve something similar, but retrieve the right kind of prior for the kind of decision the agent is about to make.

Valentin Monteiro

The 0/250 skill utilization stat is the kind of result most people would bury. Props for putting it front and center.

The memory bundle direction makes sense to me, binding procedure to episodes and causal links is where the real signal lives. But I keep thinking about what happens at scale. Once you start linking everything to everything (episodes to skills to facts to causal chains), you're basically rebuilding a knowledge graph. And KGs have a well-documented failure mode: they collapse under their own weight when nobody maintains the edges.

How are you thinking about decay and pruning as bundles accumulate? Because the binding problem doesn't go away, it just moves one layer up.

marcosomma

Within OrKa I already implemented TTL-based decay for memory in general.

For skills, my current hypothesis is that persistence should be earned through successful transfer, not granted by default. In practice, that means a skill extends its lifetime only when it proves useful again in a new context. Reuse reinforces it. Failed transfer or inactivity lets it decay.

So yes, the binding problem does move one layer up. My answer is not to preserve everything, but to make survival conditional on cross-context utility. Otherwise the graph just turns into dead weight.

This is probably the real challenge: not how to store more links, but how to let useless links die without killing the structure that still generalizes.

Yaniv

Brilliant breakdown. The shift from simple vector recall to 'Memory Binding' perfectly mirrors the challenges in architecting robust RAG systems for production. Validating these multi-layered execution contexts—especially when mixing procedural and episodic memory without leaking data across contexts—is a massive engineering hurdle. I've been heavily focused on building automated backend ecosystems and secure RAG gateways to tackle exactly this kind of complex context validation. Thanks for sharing the raw data behind the philosophy!

Ethan Frost

500 experiments is the kind of rigorous testing AI agent development desperately needs. Most people build agents, test them on 3 happy-path scenarios, and ship.

The binding problem you've identified is fascinating — it maps to something I've noticed in production too. Agents can recall facts fine, but they lose the relationship between facts across conversation turns. It's like having a good memory for vocabulary but terrible grammar.

This has huge implications for testing AI agents at scale. You can't just test if the agent remembers X — you have to test if it correctly associates X with Y across different contexts. That's a combinatorial explosion of test cases, which is why most teams skip it and just hope for the best.

Curious about your testing methodology — did you use automated evals or manual review for the 500 experiments? The eval tooling gap for agent memory is one of the biggest unsolved problems I've seen.

marcosomma

Thanks. Yes, this was fully automated, not manual review.

The evals are all here:
github.com/marcosomma/orka-reasoni...

The flow is split into 3 scripts so each phase can be rerun independently:

run_benchmark_v2.py runs the benchmark tasks across the different tracks and generates the raw brain vs baseline outputs.

judge_benchmark.py then evaluates those outputs with a separate judge model using both rubric and pairwise workflows.

aggregate_benchmark.py analyzes the judged results and produces the final summaries, deltas, win rates, and per-track breakdowns.

That split helped a lot because execution, judging, and analysis are different failure surfaces, and keeping them separate made the benchmark much easier to inspect and debug.

Ethan Frost

The binding problem is exactly the gap I keep running into when building agent workflows. Memories stored as flat key-value pairs lose relational context — the 'why' behind the 'what'. Your Memory Bundle concept reminds me of how experienced developers naturally organize knowledge: not as isolated facts, but as linked decision chains. Curious if you've experimented with graph-based structures for the bundles, or if the overhead kills latency in practice.
