OpenClaw incident analysis

Why full-context sessions can fail instead of compacting cleanly

A focused analysis of the newest compaction issues and the last four days of relevant OpenClaw code changes. The short version: this is not one bug. The visible failure is produced by several paths that meet exactly when a long Codex-backed session reaches its context limit.

Generated: 2026-05-26
Scope: issues + source + local branch diff
Repo: /home/hakalya/openclaw

Executive Summary

Most likely local failure mode

On Codex-OAuth sessions, normal turns can use the Codex runtime while compaction fallback can still resolve to the plain openai provider. If there is no direct OpenAI API key for that path, compaction fails at the moment the context is full.

Not a single cause

Recent issues also show stale thread bindings, early preflight compactions, untracked Codex runtime overhead, event-loop stalls, and stuck locks.

Upstream is moving

origin/main contains several fixes after the stable branch, including direct Codex compaction routing and Codex boundary hardening.

Conclusion: for the current setup, updating blindly is risky because the local branch is not a simple ancestor of origin/main and has dirty changes in the exact compaction runtime-context files. But the dirty local patch is in the right place: it maps the context-engine runtime context from openai to openai-codex when the harness runtime is Codex.

Failure Mechanism

The concrete failure chain reported in the newest issues looks like this:

1. Session grows

Long tool-heavy or chat-heavy session reaches preflight or provider context pressure.

2. Native compact attempted

Codex app-server compaction needs an existing thread/session binding.

3. Binding missing

If the binding is missing or stale, older paths fall back to the context engine.

4. Provider mismatch

The fallback can resolve openai/gpt-5.5 instead of openai-codex/gpt-5.5.

5. Compaction fails

The direct OpenAI path asks for a normal API key and fails, or the session stalls around locks/timeouts.

Newest Relevant Issues

Issue | Signal | Why it matters here | Status read
#86820 (primary) | Codex OAuth fallback tries direct OpenAI | Matches the observed symptom: compaction reaches fallback, then fails because the plain OpenAI provider has no API key. | Open, updated 2026-05-26.
#86373 (routing) | embedded compaction fallback target mismatch | Describes the provider/auth split directly: provider=openai with authProfileId=openai-codex:.... | Recent; connected to the same fix train.
#86470 (doctor) | rewrites Codex profiles | Explains how a valid openai-codex/* route can be normalized into an apparently valid but compaction-breaking openai/* setup. | Still relevant on stable v2026.5.22.
#86819 (accounting) | untracked runtime overhead | /context detail can account for only a small fraction of reported context, leaving about 62k tokens as provider/runtime overhead. | Open; exact live Codex baseline still needs proof.
#86358 (runtime) | event-loop starvation | Compaction can stall the Node event loop long enough that unrelated fetches time out, making a recovering session look broken. | Open, P1-class behavior.
#85712 (preflight) | tiny context after compact-only route | Preflight can decide to compact only, then continue with a tiny assembled context and no user-facing warning. | Open; likely adjacent to repeated compaction reports.
#81178 (stale state) | repeated early preflight compactions | After a successful compact, stale pre-compaction usage can trigger another premature compact. | Recent comments, regression-shaped.
#70334 (stuck) | session processing lock remains | Older but still explains the user-visible “it went quiet” pattern after context overflow handling. | Historical but aligned.

Last Four Days of Relevant Code Changes

c08400ea7d Fix context pressure preflight for tool-heavy sessions

Introduces stronger preflight pressure estimation and routes such as compact_only, truncate_tool_results_only, and compact_then_truncate. This is useful, but it means compaction can now be triggered before the model call by OpenClaw's own estimator.
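To make the effect of estimator-driven routing concrete, here is a minimal sketch of how a preflight step might choose among those routes. Only the three route names come from the commit description; the function name `choosePreflightRoute`, the field names, the 90% threshold, and the tool-result cap are assumptions made for illustration and do not mirror the real estimator.

```typescript
// Hypothetical sketch: only the route names are taken from the commit summary.
type PreflightRoute =
  | "none"
  | "compact_only"
  | "truncate_tool_results_only"
  | "compact_then_truncate";

interface PressureEstimate {
  historyTokens: number;    // estimated tokens in conversation history
  toolResultTokens: number; // estimated tokens held by tool results
  contextLimit: number;     // model context window
}

// Assumed policy values, not taken from OpenClaw.
const PRESSURE_THRESHOLD = 0.9; // act above 90% of the window
const TOOL_RESULT_CAP = 20_000; // tokens kept after truncating tool results

function choosePreflightRoute(e: PressureEstimate): PreflightRoute {
  const budget = e.contextLimit * PRESSURE_THRESHOLD;
  if (e.historyTokens + e.toolResultTokens <= budget) return "none";

  // If capping tool output alone gets under budget, skip a full compaction.
  const cappedTools = Math.min(e.toolResultTokens, TOOL_RESULT_CAP);
  if (e.historyTokens + cappedTools <= budget) return "truncate_tool_results_only";

  // History itself is too large: compact, and also truncate if tools are heavy.
  return e.toolResultTokens <= TOOL_RESULT_CAP
    ? "compact_only"
    : "compact_then_truncate";
}
```

Because the decision depends only on the estimate, an estimator that overcounts will trigger compaction even when the provider would have accepted the request.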

46de078b2a Bound embedded compaction write locks

Targets a known failure class where compaction/session locks can remain held too long and block later session progress.
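The general idea of bounding a lock's hold time can be sketched as follows. This is not the code from 46de078b2a: the class name `BoundedSessionLock`, the owner strings, and the injected clock are all hypothetical, chosen so the behavior is easy to reason about.

```typescript
// Hypothetical sketch of a bounded-hold lock; not the actual OpenClaw change.
interface LockState {
  owner: string;
  acquiredAtMs: number;
}

class BoundedSessionLock {
  private state: LockState | null = null;
  private readonly maxHoldMs: number;

  constructor(maxHoldMs: number) {
    this.maxHoldMs = maxHoldMs;
  }

  // Returns true if `owner` holds the lock after the call. A holder that has
  // kept the lock for maxHoldMs or longer is treated as stuck and replaced.
  tryAcquire(owner: string, nowMs: number): boolean {
    const held = this.state;
    if (held !== null && nowMs - held.acquiredAtMs < this.maxHoldMs) {
      return held.owner === owner;
    }
    this.state = { owner, acquiredAtMs: nowMs };
    return true;
  }

  // Only the current owner may release; a reclaimed holder gets false.
  release(owner: string): boolean {
    if (this.state === null || this.state.owner !== owner) return false;
    this.state = null;
    return true;
  }
}
```

With a bound in place, a compaction that never releases can delay later turns by at most `maxHoldMs` instead of blocking the session indefinitely.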

dd47e479ae Fail Codex compaction at the Codex boundary

Important hardening: when the harness runtime is Codex, missing or stale native compaction binding should not silently fall through into the wrong context-engine path.

f0061ddc54 Preserve partial summary on mid-chain chunk failure

Improves recovery when chunked compaction fails partway through, reducing all-or-nothing loss.
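A minimal sketch of the partial-preservation idea, assuming a chain where each chunk is folded into a running summary. The names `summarizeChain` and `ChainResult` are hypothetical, and the summarizer is synchronous here only to keep the example short; a real one would be an asynchronous model call.

```typescript
// Hypothetical sketch: keep what was summarized before a mid-chain failure.
interface ChainResult {
  summary: string;         // summary of every chunk processed before a failure
  completedChunks: number; // how many chunks were folded in
  failedAt: number | null; // index of the failing chunk, or null on success
}

function summarizeChain(
  chunks: string[],
  summarizeChunk: (previousSummary: string, chunk: string) => string,
): ChainResult {
  let summary = "";
  let completed = 0;
  for (const chunk of chunks) {
    try {
      summary = summarizeChunk(summary, chunk);
    } catch {
      // Return the partial summary instead of discarding earlier work.
      return { summary, completedChunks: completed, failedAt: completed };
    }
    completed += 1;
  }
  return { summary, completedChunks: completed, failedAt: null };
}
```

The caller can then decide whether to retry from `failedAt` or continue with the partial summary.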

f4cfa012e1 Route compaction through Codex auth provider

The direct embedded compaction path now maps OpenAI + Codex runtime/auth to openai-codex before resolving model auth. This directly addresses the missing direct OpenAI API key failure for that path.

bcde7b138a Handle preflight compaction no-op budgets

Targets repeated/no-op compaction behavior after the preflight estimator believes compaction is needed but the effective budget situation has not improved.

Local Branch Read

Branch state

Local source is on stable-v2026.5.22-guest, which is not an ancestor of origin/main, so a clean fast-forward is not possible. A blind merge would mix guest branch changes with the current upstream fix train.

Dirty files are relevant

Two dirty files are exactly in the compaction runtime-context area: compaction-runtime-context.ts and its test.

What the dirty patch appears to fix

It resolves the harness policy, maps openai + Codex runtime to openai-codex for context-engine runtime context, and preserves openai-codex:... auth profiles only when the provider is deliberately changed to the Codex provider. That covers a gap still visible in origin/main where buildEmbeddedCompactionRuntimeContext returns provider: resolved.provider.
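The mapping described above might look roughly like the sketch below. The type and function names (`ResolvedRoute`, `HarnessRuntime`, `mapCompactionRoute`) are hypothetical and do not reflect the actual contents of compaction-runtime-context.ts. Keeping the profile when the provider was already openai-codex is also an assumption.

```typescript
// Hypothetical shapes; they do not mirror the real OpenClaw types.
interface ResolvedRoute {
  provider: string;       // e.g. "openai" or "openai-codex"
  model: string;
  authProfileId?: string; // e.g. "openai-codex:..."
}

type HarnessRuntime = "codex" | "default";

const CODEX_PROVIDER = "openai-codex";

function mapCompactionRoute(
  resolved: ResolvedRoute,
  runtime: HarnessRuntime,
): ResolvedRoute {
  // Plain openai under the Codex runtime is rerouted to the Codex provider.
  const provider =
    resolved.provider === "openai" && runtime === "codex"
      ? CODEX_PROVIDER
      : resolved.provider;

  const profile = resolved.authProfileId;
  const isCodexProfile =
    profile !== undefined && profile.startsWith(`${CODEX_PROVIDER}:`);

  // A Codex auth profile is kept only when the final provider is Codex;
  // otherwise it is dropped so it cannot be paired with plain openai.
  if (profile !== undefined && (!isCodexProfile || provider === CODEX_PROVIDER)) {
    return { provider, model: resolved.model, authProfileId: profile };
  }
  return { provider, model: resolved.model };
}
```

Under these assumptions the provider=openai with authProfileId=openai-codex:... combination reported in #86373 cannot leave this function unchanged: it is either rerouted to the Codex provider or stripped of the Codex profile.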

Recommended Next Actions

1. Test the dirty runtime-context patch

Add a minimal Codex-OAuth compaction fallback repro and run the targeted test file. This patch is likely not cosmetic; it covers the remaining context-engine runtime-context mismatch.

2. Do not run doctor fixes blindly

Until #86470 is resolved, avoid automatic rewrites that turn openai-codex/* into openai/* without proving compaction still routes through Codex auth.

3. Fail visibly at the right boundary

If Codex native compaction has no valid thread binding, report that specific condition instead of falling into a misleading direct OpenAI API-key failure.
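One way to report that condition is a dedicated error type raised before any fallback is attempted. The sketch below is illustrative only: `CodexBindingError`, `ThreadBinding`, and `requireCodexBinding` are hypothetical names, and the `stale` flag stands in for whatever staleness check the real code performs.

```typescript
// Hypothetical sketch of failing at the Codex boundary with a specific error.
type BindingProblem = "missing" | "stale";

class CodexBindingError extends Error {
  readonly reason: BindingProblem;

  constructor(reason: BindingProblem, sessionId: string) {
    super(
      `Codex native compaction unavailable: thread binding is ${reason} ` +
        `for session ${sessionId}`,
    );
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "CodexBindingError";
    this.reason = reason;
  }
}

interface ThreadBinding {
  threadId: string;
  stale: boolean; // assumed flag; the real staleness check is not shown here
}

// Throws a specific error instead of letting the caller fall through to a
// different provider path.
function requireCodexBinding(
  sessionId: string,
  binding: ThreadBinding | undefined,
): ThreadBinding {
  if (binding === undefined) throw new CodexBindingError("missing", sessionId);
  if (binding.stale) throw new CodexBindingError("stale", sessionId);
  return binding;
}
```

A caller that catches `CodexBindingError` can surface the binding problem directly rather than an unrelated API-key message.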

4. Separate runtime overhead in reports

/context detail should label Codex native/cache/runtime overhead separately so users do not chase AGENTS.md, tools, or memory ghosts for a 62k-token residual.
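The residual can be computed by subtracting every tracked category from the provider-reported total. The sketch below assumes a simple list of labeled token counts. The function name `splitRuntimeOverhead` and the field names are hypothetical, and no claim is made about how /context detail stores its data.

```typescript
// Hypothetical sketch: separate tracked context from unexplained overhead.
interface TrackedCategory {
  label: string; // e.g. "AGENTS.md", "tools", "memory"
  tokens: number;
}

interface OverheadReport {
  trackedTotal: number;
  runtimeOverhead: number; // reported total minus everything tracked
  overheadShare: number;   // fraction of the reported total, 0..1
}

function splitRuntimeOverhead(
  reportedTotal: number,
  tracked: TrackedCategory[],
): OverheadReport {
  const trackedTotal = tracked.reduce((sum, c) => sum + c.tokens, 0);
  // Clamp at zero in case local estimates exceed the provider's figure.
  const runtimeOverhead = Math.max(0, reportedTotal - trackedTotal);
  const overheadShare = reportedTotal > 0 ? runtimeOverhead / reportedTotal : 0;
  return { trackedTotal, runtimeOverhead, overheadShare };
}
```

Reporting `runtimeOverhead` as its own line makes it clear that the remainder is not attributable to any user-editable input.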

5. Guard preflight no-op loops

After compaction, the next preflight should measure pressure from the post-compaction active replay, not from stale transcript records. No-op compactions need an escalation path so they cannot repeat indefinitely.
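A minimal sketch of such a guard follows. The 10% minimum reduction, the limit of two no-op attempts, and all names (`nextStepAfterCompaction`, `CompactionOutcome`, `NextStep`) are assumptions for illustration, not values taken from OpenClaw.

```typescript
// Hypothetical sketch of a no-op compaction guard with an escalation path.
type NextStep = "proceed" | "retry_compaction" | "escalate";

interface CompactionOutcome {
  tokensBefore: number; // usage measured before compaction
  tokensAfter: number;  // usage measured from the post-compaction replay
}

// Assumed policy values.
const MIN_REDUCTION_RATIO = 0.1; // below 10% saved counts as a no-op
const MAX_NOOP_ATTEMPTS = 2;

function isNoOp(o: CompactionOutcome): boolean {
  return o.tokensBefore - o.tokensAfter < o.tokensBefore * MIN_REDUCTION_RATIO;
}

function nextStepAfterCompaction(
  outcome: CompactionOutcome,
  budgetTokens: number,
  priorNoOps: number,
): NextStep {
  // Judge pressure from the post-compaction count, never the stale one.
  if (outcome.tokensAfter <= budgetTokens) return "proceed";
  if (!isNoOp(outcome)) return "retry_compaction";
  // Compaction is not helping: stop looping once the attempt limit is hit.
  return priorNoOps + 1 >= MAX_NOOP_ATTEMPTS ? "escalate" : "retry_compaction";
}
```

What "escalate" means is left open here; it could be truncation, a user-facing warning, or a hard failure.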

6. Watch event-loop stalls

Large compaction should not starve Telegram/API fetches. Keep timer-delay diagnostics and consider offloading CPU-heavy summary assembly or token accounting.
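One common way to keep CPU-heavy work from monopolizing the Node event loop is to split it into slices and yield between them. The sketch below assumes per-item cost estimates are available; `planSlices`, `runSliced`, and the budget parameter are hypothetical names. Moving the work to a worker thread is an alternative that this sketch does not cover.

```typescript
// Hypothetical sketch: time-slice CPU-heavy work so timers and I/O can run.

// Groups item indices into slices whose estimated cost stays within the
// budget. A single item costlier than the budget gets a slice of its own.
function planSlices(costsMs: number[], sliceBudgetMs: number): number[][] {
  const slices: number[][] = [];
  let current: number[] = [];
  let used = 0;
  costsMs.forEach((cost, index) => {
    if (current.length > 0 && used + cost > sliceBudgetMs) {
      slices.push(current);
      current = [];
      used = 0;
    }
    current.push(index);
    used += cost;
  });
  if (current.length > 0) slices.push(current);
  return slices;
}

// Runs each slice synchronously, then yields to the event loop so pending
// timers and network callbacks get a turn before the next slice.
async function runSliced(
  slices: number[][],
  work: (index: number) => void,
): Promise<void> {
  for (const slice of slices) {
    for (const index of slice) work(index);
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
}
```

The slice budget bounds how long any unrelated fetch waits behind compaction work, at the cost of a slightly longer total compaction time.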

Evidence Used