RTC: fix connection lost error on large update caused by mismatch between update size bounds check and expanded base64 update size #77669
Conversation
👋 Thanks for your first Pull Request and for helping build the future of Gutenberg and WordPress, @danluu! In case you missed it, we'd love to have you join us in our Slack community. If you want to learn more about WordPress development in general, check out the Core Handbook full of helpful information. |
@danluu The fix here looks straightforward, and I see the possible mismatch. I'm happy to merge when tests pass! |
@alecgeatches What's the deal with the performance tests? I've noticed that they're failing with a 60-minute timeout across many PRs, not just the ones I've touched. Is there a known issue that's causing them to often (not sure if always) fail? |
@danluu Taking a look at the performance tests, not sure. We previously had an issue with site editor tests timing out with RTC, I'll see if this is related. |
I found a maybe relevant PR in #77443 (comment) and am continuing discussion there. Still planning to wait until we have trunk tests under control before merging, I'll come back here after! |
@danluu This has been merged! The performance tests were fixed separately and hopefully won't cause issues on your other PRs. We'll work on getting through those this week, thank you! |
@alecgeatches Thanks for all the help with these! I know it can be a pain to review AI generated PRs with the overly verbose AI generated descriptions they carry. |

This is part of a series of bug reports and PRs from an AI fuzzing project. See #77532 for more details on that.
A large update can cause the dreaded "connection lost" modal. The code has a check that's intended to handle this, but the client check is on the raw Yjs update byte length whereas the server check runs on the base64 encoded string. This expansion can cause an update to get sent which is larger than the server can handle.
This PR fixes the client check, which reduces the maximum update size the client will send. Alternatively, we could change what the server checks and increase the update size it will accept. Below are videos of what it looks like before and after the fix. The repro is fairly unrealistic, in that it stuffs giant updates into the title, but this bug isn't specific to the title being large and can happen whenever large edits are made to a document.
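The mismatch can be illustrated with a short sketch. The constant names below are hypothetical stand-ins, not the ones used in the codebase; the arithmetic follows directly from how base64 works (every 3 raw bytes become 4 characters, with the final group padded):

```typescript
// Base64 encodes every 3 raw bytes as 4 characters, padding the final group,
// so an update at the old 1 MiB raw-byte limit expands well past a 1 MiB
// limit applied to the encoded string.
function base64Length( rawBytes: number ): number {
	return Math.ceil( rawBytes / 3 ) * 4;
}

const OLD_CLIENT_RAW_LIMIT = 1024 * 1024; // 1,048,576 raw bytes
const SERVER_ENCODED_LIMIT = 1024 * 1024; // 1 MiB of base64 characters

console.log( base64Length( OLD_CLIENT_RAW_LIMIT ) ); // 1398104
console.log( base64Length( OLD_CLIENT_RAW_LIMIT ) > SERVER_ENCODED_LIMIT ); // true
```

So an update the client considered safe could be roughly 33% over the server's limit once encoded.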
annotated-repro.mp4
annotated-after-fix.mp4
BEGIN AI GENERATED TEXT
The PR branch is `fix-connection-error-large-update-pr`. It contains only the regression test commit and the fix commit. This explanatory history lives on the separate branch `try/connection-error-large-update-explanation`. This explanation exists only on the `danluu` remote, not on the PR branch or the upstream `WordPress/gutenberg` remote. Remote copy: https://github.com/danluu/gutenberg/blob/try/connection-error-large-update-explanation/docs/explanations/fuzzer-bugs/real-time-collaboration-large-update-connection-lost-bug.md
Bug Summary
An edit that produced a sync update near the client's old 1 MiB size limit could still be rejected by the server and then surface the generic Connection lost modal. The mismatch was in the unit being limited: the client checked the raw Yjs update bytes, while the server checked the base64-encoded `data` string against a 1 MiB maxLength. Before the fix, the client allowed a raw update of 1,048,576 bytes. Encoding that update produces a 1,398,104-character string, which is larger than the server's per-update limit. The client therefore treated the update as safe, but the server rejected it. The repeated poll failure eventually reached the existing retryable connection-error path and displayed Connection lost.
Existing UI Behavior
The PR does not add a new popup.
Two pre-existing UI paths matter here:
- Connection lost: the generic retryable sync failure modal.
- This post is already being edited: the normal WordPress post-lock modal.

After the fix, oversized document updates use the existing `document-size-limit-exceeded` path. Real-time collaboration is disabled for that document, and WordPress falls back to ordinary post locking. A second user then sees the existing post-lock modal with the existing explanation that the post is too large for real-time collaboration.
Fix Plan
The fix keeps the server contract unchanged and makes the client enforce the same effective limit before enqueueing an update:
- The server continues to accept an encoded `data` string of up to 1 MiB.
- The client's raw limit becomes `Math.floor( encodedLimit / 4 ) * 3`, which is 786,432 bytes, defined as `MAX_UPDATE_SIZE_IN_BYTES`.
- When `PollingManager` sees a local Yjs update larger than that raw limit, it emits `document-size-limit-exceeded`, unregisters the room, and returns without queueing the oversized update.
- The existing handling for that event disables real-time collaboration for the document and falls back to post locking.

The small `return` after unregistering the room is important. Without it, the oversized update can still be added to the queue after the room is marked unrecoverable.
Why Not Chunk The Update
Splitting the update into arbitrary request chunks is not a local client-only
fix.
The sync endpoint stores and forwards typed Yjs updates. It does not currently
have a protocol for partial update chunks. A chunking design would need explicit
metadata such as update IDs, chunk indexes, total chunks, ordering rules, and
completion state. It would also need server-side storage and reassembly,
cleanup of abandoned chunks, idempotency for retrying chunks, backward
compatibility for older clients, and clear interaction with compaction.
Sending arbitrary byte slices as normal updates would be unsafe because peers
and the server would not have a complete Yjs update to apply until all pieces
were reassembled.
For this PR, graceful degradation is the safer behavior: detect the oversize
update before the server rejects it, disable real-time collaboration for that
document, and fall back to the existing post-lock workflow.
Reproduction Levels
The narrow regression can be reproduced at the utility-test level:
Before the fix, a test that creates a sync update at `MAX_UPDATE_SIZE_IN_BYTES` shows the encoded `data` string can exceed 1 MiB. After the fix, the largest allowed raw update encodes within the server limit.
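The post-fix invariant can be checked directly with Node's `Buffer` for base64 encoding (a sketch, not the actual regression test in the PR):

```typescript
// The largest raw update the fixed client allows must encode within the
// server's 1 MiB limit on the base64 `data` string.
const ENCODED_LIMIT = 1024 * 1024;
const MAX_UPDATE_SIZE_IN_BYTES = Math.floor( ENCODED_LIMIT / 4 ) * 3; // 786,432

const raw = new Uint8Array( MAX_UPDATE_SIZE_IN_BYTES );
const encoded = Buffer.from( raw ).toString( 'base64' );

console.log( encoded.length ); // 1048576
console.log( encoded.length <= ENCODED_LIMIT ); // true
```

Because 786,432 is a multiple of 3, the encoding needs no padding and lands exactly on the 1 MiB boundary.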
The lower-level polling manager tests cover the existing `document-size-limit-exceeded` path with a mocked small limit. There is also an existing browser-level path in:
That test exercises the post-fix user behavior: an oversized document disables
real-time collaboration, and another editor sees the standard post-lock modal.
A separate exploratory top-level Playwright repro for an oversized aggregate
sync request body is pushed only to the
`danluu` remote, outside this PR branch: https://github.com/danluu/gutenberg/blob/try/connection-error-large-update-explanation/test/e2e/specs/editor/collaboration/collaboration-sync-body-size-connection-lost.spec.ts
That path is related to the same generic Connection lost symptom, but it is a different limit: the full request body can exceed 16 MiB even when individual updates are under the per-update limit. This PR targets the per-update base64 accounting mismatch.
PR Contents
The PR branch has two commits:
1. Add sync update size limit test
2. Fix sync update size limit accounting

The code changes are limited to:
- packages/sync/src/providers/http-polling/config.ts
- packages/sync/src/providers/http-polling/polling-manager.ts
- packages/sync/src/providers/http-polling/test/utils.test.ts

No markdown explanation file is included in the PR branch.
Verification
The PR branch was checked with:
The browser-level post-fix behavior was also recorded locally: it shows the
post-lock fallback instead of the generic
Connection lost modal.
END AI GENERATED TEXT
Use of AI Tools
All of the code here was AI generated, and the bug was found by an AI-written fuzzer with automated AI triage. After a discussion with @alecgeatches I'm filing a number of bugs and PRs, since the initial report turned out to be a real bug and the fuzzer has found several others that users are actually encountering.