close
Skip to content

RTC: fix connection lost error on large update cause by mismatch between update size bounds check and expanded base64 update size#77669

Merged
alecgeatches merged 3 commits into
WordPress:trunkfrom
danluu:fix-connection-error-large-update-pr
Apr 28, 2026
Merged

RTC: fix connection lost error on large update cause by mismatch between update size bounds check and expanded base64 update size#77669
alecgeatches merged 3 commits into
WordPress:trunkfrom
danluu:fix-connection-error-large-update-pr

Conversation

@danluu
Copy link
Copy Markdown
Contributor

@danluu danluu commented Apr 25, 2026

This is part of a series of bug reports and PRs from an AI fuzzing project. See #77532 for more details on that.

A large update can cause the dreaded "connection lost" modal. The code has a check that's intended to handle this, but the client check is on the raw Yjs update byte length whereas the server check runs on the base64 encoded string. This expansion can cause an update to get sent which is larger than the server can handle.

The fix proposed in this PR is to fix the client check, which will reduce the size of updates the client can send. Alternately, we could change what's checked on the server and increase the update size the server will accept. Below are videos of what it looks like before and after the fix. The repro is a fairly unrealistic, in that it stuffs giant updates into the title, but this bug isn't specific to the title being large and can happen if large edits are made in documents.

annotated-repro.mp4
annotated-after-fix.mp4

BEGIN AI GENERATED TEXT

The PR branch is fix-connection-error-large-update-pr. It contains only the
regression test commit and the fix commit. This explanatory history lives on the
separate branch try/connection-error-large-update-explanation.

This explanation exists only on the danluu remote, not on the PR branch or
the upstream WordPress/gutenberg remote. Remote copy:
https://github.com/danluu/gutenberg/blob/try/connection-error-large-update-explanation/docs/explanations/fuzzer-bugs/real-time-collaboration-large-update-connection-lost-bug.md.

Bug Summary

An edit that produced a sync update near the client's old 1 MiB size limit
could still be rejected by the server and then surface the generic
Connection lost modal.

The mismatch was in the unit being limited:

  • The client checked the raw Yjs update byte length.
  • The sync REST payload sends each update as a base64 data string.
  • The server schema validates that encoded data string with a 1 MiB
    maxLength.
  • Base64 expands three raw bytes into four encoded characters.

Before the fix, the client allowed a raw update of 1,048,576 bytes. Encoding
that update produces a 1,398,104 character string, which is larger than the
server's per-update limit. The client therefore treated the update as safe, but
the server rejected it. The repeated poll failure eventually reached the
existing retryable connection-error path and displayed Connection lost.

Existing UI Behavior

The PR does not add a new popup.

Two pre-existing UI paths matter here:

  • Connection lost: the generic retryable sync failure modal.
  • This post is already being edited: the normal WordPress post-lock modal.

After the fix, oversized document updates use the existing
document-size-limit-exceeded path. Real-time collaboration is disabled for
that document, and WordPress falls back to ordinary post locking. A second user
then sees the existing post-lock modal with the existing explanation that the
post is too large for real-time collaboration.

Fix Plan

The fix keeps the server contract unchanged and makes the client enforce the
same effective limit before enqueueing an update.

  1. Define the server's encoded update limit as 1 MiB.
  2. Compute the largest raw update that cannot exceed that encoded limit:
    Math.floor( encodedLimit / 4 ) * 3, which is 786,432 bytes.
  3. Use that computed value as MAX_UPDATE_SIZE_IN_BYTES.
  4. When PollingManager sees a local Yjs update larger than that raw limit,
    emit document-size-limit-exceeded, unregister the room, and return without
    queueing the oversized update.
  5. Let the existing editor behavior disable real-time collaboration and fall
    back to post locking.

The small return after unregistering the room is important. Without it, the
oversized update can still be added to the queue after the room is marked
unrecoverable.

Why Not Chunk The Update

Splitting the update into arbitrary request chunks is not a local client-only
fix.

The sync endpoint stores and forwards typed Yjs updates. It does not currently
have a protocol for partial update chunks. A chunking design would need explicit
metadata such as update IDs, chunk indexes, total chunks, ordering rules, and
completion state. It would also need server-side storage and reassembly,
cleanup of abandoned chunks, idempotency for retrying chunks, backward
compatibility for older clients, and clear interaction with compaction.

Sending arbitrary byte slices as normal updates would be unsafe because peers
and the server would not have a complete Yjs update to apply until all pieces
were reassembled.

For this PR, graceful degradation is the safer behavior: detect the oversize
update before the server rejects it, disable real-time collaboration for that
document, and fall back to the existing post-lock workflow.

Reproduction Levels

The narrow regression can be reproduced at the utility-test level:

npm run test:unit packages/sync/src/providers/http-polling/test/utils.test.ts -- --runInBand

Before the fix, a test that creates a sync update at MAX_UPDATE_SIZE_IN_BYTES
shows the encoded data string can exceed 1 MiB. After the fix, the largest
allowed raw update encodes within the server limit.

The lower-level polling manager tests cover the existing
document-size-limit-exceeded path with a mocked small limit:

npm run test:unit packages/sync/src/providers/http-polling/test/polling-manager.test.ts -- --runInBand

There is also an existing browser-level path in:

test/e2e/specs/editor/collaboration/collaboration-document-size-lock.spec.ts

That test exercises the post-fix user behavior: an oversized document disables
real-time collaboration, and another editor sees the standard post-lock modal.

A separate exploratory top-level Playwright repro for an oversized aggregate
sync request body is pushed only to the danluu remote, outside this PR branch:
https://github.com/danluu/gutenberg/blob/try/connection-error-large-update-explanation/test/e2e/specs/editor/collaboration/collaboration-sync-body-size-connection-lost.spec.ts.
That path is related to the same generic Connection lost symptom, but it is a
different limit: the full request body can exceed 16 MiB even when individual
updates are under the per-update limit. This PR targets the per-update base64
accounting mismatch.

PR-description-ready note:

Additional browser-level repro, kept outside this PR:
https://github.com/danluu/gutenberg/blob/try/connection-error-large-update-explanation/test/e2e/specs/editor/collaboration/collaboration-sync-body-size-connection-lost.spec.ts

It lives only on the `danluu` remote, on branch
`try/connection-error-large-update-explanation`. It is intentionally not part
of this PR because the PR uses lower-level unit coverage for the per-update
base64 limit. The browser repro demonstrates the related `Connection lost`
symptom when one aggregate sync poll exceeds the server's `16 MiB` request-body
limit.

To run it from a clean checkout:

```bash
git fetch danluu try/connection-error-large-update-explanation
git worktree add ../gutenberg-large-update-repro FETCH_HEAD
cd ../gutenberg-large-update-repro
source ~/.nvm/nvm.sh && nvm use
npm install
WP_ENV_PORT=8893 npm run wp-env-test start
WP_ENV_PORT=8893 WP_BASE_URL=http://localhost:8893 npm exec \
	--workspace @wordpress/e2e-tests-playwright -- wp-scripts test-playwright \
	test/e2e/specs/editor/collaboration/collaboration-sync-body-size-connection-lost.spec.ts \
	--project=chromium
WP_ENV_PORT=8893 npm run wp-env-test stop
```

Use a different `WP_ENV_PORT` if `8893` is already occupied.

PR Contents

The PR branch has two commits:

  • Add sync update size limit test
  • Fix sync update size limit accounting

The code changes are limited to:

  • packages/sync/src/providers/http-polling/config.ts
  • packages/sync/src/providers/http-polling/polling-manager.ts
  • packages/sync/src/providers/http-polling/test/utils.test.ts

No markdown explanation file is included in the PR branch.

Verification

The PR branch was checked with:

npm run format -- packages/sync/src/providers/http-polling/config.ts packages/sync/src/providers/http-polling/polling-manager.ts packages/sync/src/providers/http-polling/test/utils.test.ts
npm run lint:js -- packages/sync/src/providers/http-polling/config.ts packages/sync/src/providers/http-polling/polling-manager.ts packages/sync/src/providers/http-polling/test/utils.test.ts
npm run test:unit packages/sync/src/providers/http-polling/test/utils.test.ts -- --runInBand
npm run test:unit packages/sync/src/providers/http-polling/test/polling-manager.test.ts -- --runInBand
git diff --check try/fuzz..HEAD

The browser-level post-fix behavior was also recorded locally: it shows the
post-lock fallback instead of the generic Connection lost modal.

END AI GENERATED TEXT

Use of AI Tools

All of the code here was AI generated and the bug was found by an AI-written fuzzer with automated AI triage. After a discussion with @alecgeatches I'm filing a number of bugs and PRs since the initial bug seems to be a real bug and the fuzzer has found a number of bugs that users are really encountering.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 25, 2026

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: danluu <danluu@git.wordpress.org>
Co-authored-by: alecgeatches <alecgeatches@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@github-actions
Copy link
Copy Markdown

👋 Thanks for your first Pull Request and for helping build the future of Gutenberg and WordPress, @danluu! In case you missed it, we'd love to have you join us in our Slack community.

If you want to learn more about WordPress development in general, check out the Core Handbook full of helpful information.

@github-actions github-actions Bot added the First-time Contributor Pull request opened by a first-time contributor to Gutenberg repository label Apr 25, 2026
@danluu danluu changed the title Fix connection lost error on large update cause by mismatch between update size bounds check and expanded base64 update size RTC: fix connection lost error on large update cause by mismatch between update size bounds check and expanded base64 update size Apr 25, 2026
@t-hamano t-hamano added [Type] Bug An existing feature does not function as intended [Feature] Real-time Collaboration Phase 3 of the Gutenberg roadmap around real-time collaboration labels Apr 27, 2026
@dmsnell dmsnell mentioned this pull request Apr 27, 2026
@alecgeatches
Copy link
Copy Markdown
Contributor

@danluu The fix here looks straightforward, and I see the possible mismatch. I'm happy to merge when tests pass!

@danluu
Copy link
Copy Markdown
Contributor Author

danluu commented Apr 27, 2026

@alecgeatches What's the deal with the performance tests? I've noticed that they're failing with a 60-minute timeout across many PRs, not just the ones I've touched. Is there a known issue that's causing them to often (not sure if always) fail?

@alecgeatches
Copy link
Copy Markdown
Contributor

@danluu Taking a look at the performance tests, not sure. We previously had an issue with site editor tests timing out with RTC, I'll see if this is related.

@alecgeatches
Copy link
Copy Markdown
Contributor

I found a maybe relevant PR in #77443 (comment) and am continuing discussion there. Still planning to wait until we have trunk tests under control before merging, I'll come back here after!

@alecgeatches alecgeatches enabled auto-merge (squash) April 28, 2026 17:10
@alecgeatches alecgeatches merged commit 5ddf4ad into WordPress:trunk Apr 28, 2026
39 checks passed
@github-actions github-actions Bot added this to the Gutenberg 23.1 milestone Apr 28, 2026
@alecgeatches
Copy link
Copy Markdown
Contributor

@danluu This has been merged! The performance tests were fixed separately and hopefully won't cause issues on your other PRs. We'll work on getting through those this week, thank you!

@danluu
Copy link
Copy Markdown
Contributor Author

danluu commented Apr 28, 2026

@alecgeatches Thanks for all the help with these! I know it can be a pain to review AI generated PRs with the overly verbose AI generated descriptions they carry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

[Feature] Real-time Collaboration Phase 3 of the Gutenberg roadmap around real-time collaboration First-time Contributor Pull request opened by a first-time contributor to Gutenberg repository [Package] Sync [Type] Bug An existing feature does not function as intended

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants