# What Could Go Wrong With Folders?

This is the story of how I survived a disastrous folder system and what I learned from it.

# The Outage

We were having a chill year-end party, everyone was playing table games and chatting. Then, a sharp ping hit the Slack channel — prod is down.

A quick scan over our AWS dashboard showed that our Redis instance was running out of memory and crashing. Our newly developed feature, realtime folders, had gone live just a week earlier and used Redis for syncing under the hood. Maybe too many users were trying out the new feature, we thought.

To slap a quick band-aid on prod, we vertically and horizontally scaled the processing worker to effectively 10x the capacity. But something wasn't right — the unprocessed messages in Redis kept piling up, just not as fast as before, and users were starting to complain about broken folders and missing notes.

A folder is a pretty common feature across various products. What could go so terribly wrong that even 10x the capacity cannot fix it?

# Our Folder System

Before I uncover the underlying problem and the solution, let's first understand how our folder system is designed. Here at HackMD, our core value is realtime collaboration (e.g., our note editor). So, from the very beginning, we designed the folder system on top of CRDTs (Conflict-free Replicated Data Types), a family of data structures for conflict-free realtime synchronization.

interactive demo: realtime primer — why collaborative folders felt plausible

Both clients hold a local copy of the same tree. Edits apply instantly, then fan out through the shared room.
We settled on a JS implementation, Yjs, and one of its transport layers, y-redis. Since we were already using Socket.IO, we created a fork to swap out the WebSocket layer with Socket.IO. The overall architecture looks like this:

overall architecture

With the current setup, everything seemed fine in our local development environment. There were some slight hiccups on the staging server, but those were squashed and fixed. At the time, it was close to the year end (Lunar New Year approaching), so we had to pack things up and prepare to ship it to prod without having had a chance to stress test it first.

Not even a week into the deployment, we already had a lot of problems. Folder operations not persisting, notes disappearing, workspaces not loading, and our server failing to keep up with all of this without us having much insight into what was actually happening.

# We re-architected y-redis

Of course rewriting the thing would fix it. But rewriting requires a fundamental understanding of the domain and a lot of testing to ensure compatibility, hence it's usually a last resort. Why and how did we end up here? Well, let me introduce you to the y-redis architecture first.

# y-redis overview

y-redis uses a server/worker split:

detailed y-redis architecture

Seems fine, right? The server is stateless, probably easily scalable. Workers are split into consumer groups, which also sounds scalable. So, what's the problem here? Let's do a quick estimation on how many servers and workers we'll need to handle a production workload.

Say we have 10,000 active rooms (in our design, a workspace is a y-redis room), each doing 1 message per second with default settings (REDIS_TASK_DEBOUNCE=10s). With a 10s debounce, each room produces roughly one compaction task every 10 seconds, so about 1,000 tasks per second in total. (N = rooms, k = clients)

Let's say we have a super fast database, so reading and writing are both sub-10ms operations. We still get a ~100ms processing time per task, which translates into a per-worker throughput P of 10 tasks per second, and thus about 100 workers. Still a lot for 10,000 rooms and just 20 or so servers.
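The back-of-the-envelope numbers work out like this (a sketch in TypeScript; all constants are the assumed values from the text):

```typescript
// Capacity estimate for the y-redis worker pool, using the assumed numbers
// from the text (10,000 rooms, 10s debounce, ~100ms per compaction task).
const rooms = 10_000;
const debounceSec = 10;      // REDIS_TASK_DEBOUNCE
const taskLatencySec = 0.1;  // ~100ms to rebuild + persist one room

// Each active room produces roughly one compaction task per debounce window.
const tasksPerSec = rooms / debounceSec;            // 1,000 tasks/sec
const perWorkerRate = 1 / taskLatencySec;           // P = 10 tasks/sec
const workersNeeded = tasksPerSec / perWorkerRate;  // 100 workers

console.log(workersNeeded); // 100
```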

# The problem

This would have been just costly but fine (not really) if we had known we needed that many workers beforehand (which we didn't). Let's see how many workers we'd need if we under-provisioned at 10% for a day.

Assume N=10,000, P=5, and we deployed 20 workers where we actually need 200. After 1 day under-provisioned, the backlog grows at 900 tasks per second (1,000 produced minus 100 processed), piling up roughly 77.8 million unprocessed tasks.

If we want to recover the backlog in R days, the required total worker count is W_total = (1000 + 900/R) / 5. So the workers to add (W_total - 20) come out to 360 for a one-day recovery (W_total = 380).

And the worst part? This calculation is the best case. Remember that we are rebuilding the ydoc in the worker? That has a cost of O(msg), so piling up unprocessed messages slows the system down even more, making P even smaller. With P halved, we'd need to add 740 workers to recover in a day. And the worst case is unbounded.
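Plugging numbers into the recovery formula (a quick sketch; same assumed rates as above):

```typescript
// Workers required to clear an under-provisioning backlog in R days.
// demandRate and deployedRate follow the assumptions in the text:
// 1,000 tasks/sec needed, 20 workers at P=5 giving 100 tasks/sec deployed.
const demandRate = 1000;
const deployedRate = 100;
const deficit = demandRate - deployedRate; // backlog grows at 900 tasks/sec

function workersToAdd(P: number, recoverDays: number, deployed = 20): number {
  // Total rate needed = ongoing demand + backlog drained over recoverDays.
  const totalWorkers = (demandRate + deficit / recoverDays) / P;
  return totalWorkers - deployed;
}

console.log(workersToAdd(5, 1));   // 360 extra workers to recover in one day
console.log(workersToAdd(2.5, 1)); // 740 once the ydoc rebuild cost halves P
```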

Can things get even worse? Yes — we under-provisioned for at least 7 days. Some of the rooms now took over a minute just to load, making the processing rate even worse — a negative feedback loop. Using our previously assumed numbers, we would have at least 544,320,000 backlogged compactions (we had way more than 1 msg/room/sec, so the real number was likely much bigger).

# The fix

The debugging stage was a disaster — nobody knew what was broken or where, and we didn't know we needed that many workers at the time. By the time we were 7 days in, I was so frustrated that I thought: "If the worker doesn't work, why don't we just throw that away?".

And so I did.

detailed y-socketio-redis architecture

The transport layer is basically the same, but the server is now stateful, maintaining live ydoc copies of all connected workspaces. This allows reactive in-place persistence instead of the original pulling/requeueing worker flow. I employed a RedLock-like election algorithm to choose a persist leader, making sure we only run persistence once per loaded workspace to avoid race conditions. Persistence itself is done in a worker thread colocated with the server and reused across different workspaces.
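The persist-leader election can be sketched roughly like this. This is my illustration of the idea, not the production code; a real deployment would take the lock in Redis with `SET key value NX PX ttl`, while here an in-memory map stands in so the snippet is self-contained:

```typescript
// RedLock-style single-leader election per workspace (in-memory stand-in).
type Lock = { owner: string; expiresAt: number };
const locks = new Map<string, Lock>();

// Try to become (or stay) the persist leader for a workspace.
function tryAcquire(
  workspaceId: string,
  serverId: string,
  ttlMs: number,
  now: number = Date.now()
): boolean {
  const lock = locks.get(workspaceId);
  // Refuse if another live holder exists; expired locks are up for grabs.
  if (lock && lock.expiresAt > now && lock.owner !== serverId) return false;
  locks.set(workspaceId, { owner: serverId, expiresAt: now + ttlMs });
  return true;
}

console.log(tryAcquire("ws-1", "server-A", 5000)); // true: A leads ws-1
console.log(tryAcquire("ws-1", "server-B", 5000)); // false: A still holds it
```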

How much better is this approach? Let's do some quick math with active rooms N = 10,000, message rate m = 1, and the default debounce of 3s.

We can approximate this with the equation: Servers = ((N * m) / debounce) / P

But there's a big catch. The worker thread can persist different workspaces concurrently, and since the biggest cost here is the DB write, that gives us a huge improvement. The per-server rate becomes P = K / L, where K is the max number of concurrent persist tasks and L is the per-task persistence latency in seconds. Which gives us: Servers ≈ 10,000 / 3 / (10 / 0.1) ≈ 33.3 (assuming K = 10 and L = 100ms), way less than the previous requirement of over 200 workers in y-redis.
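The same estimate in code (a sketch; K and L are the assumed values above):

```typescript
// Server estimate for the stateful design: K concurrent persist tasks per
// server, each dominated by an L-second DB write (assumed values from the text).
const activeRooms = 10_000;
const debounce = 3;  // seconds
const K = 10;        // max concurrent persist tasks
const L = 0.1;       // seconds per DB write

const perServerRate = K / L;                            // P = 100 tasks/sec
const servers = activeRooms / debounce / perServerRate; // about 33.3

console.log(servers.toFixed(1)); // "33.3"
```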

For a more realistic workload after all the optimizations, L is usually between 0.05 and 0.08, which translates into 17 to 27 servers (last time I checked, we were running 10–20 servers).

This change alone fixed the production outage and reduced the Redis memory usage significantly, and I successfully saved my Lunar New Year vacation.

# Highly resilient folder data structure

Earlier, I mentioned doing some hardening on the data structure so we could remove most of the expensive checks in the persistence layer. Let me walk you through what I did.

# V1 folder tree

To better understand the problem, let's first take a look at the v1 folder tree.

The folder tree was our first attempt at a realtime collaborative folder system. The architecture was pretty straightforward: a simple struct inside an LWW (last-write-wins) map. In this design, children holds the ids of all subfolders, and notes holds the ids of all notes in the folder.

```ts
type Folder = {
  id: string;
  name: string;
  icon: string;
  parentId: string;
  children: string[];
  notes: string[];
  // ...other fields
};

type ClientId = string;
type FolderTree = LWWMap<ClientId, Folder>;
```

Sounds reasonable enough — what could go wrong? Well, the LWW boundary is at the folder level instead of the field level. Let me illustrate the problem:

Bad conflict resolution due to LWW granularity

With this simple layout:

```txt
root
├── folder1
│   └── note1-1
└── folder2
    └── folder2-1
```

Two users, Alice and Bob, do the following at almost the same time:

- Alice (at t1) moves note1-1 from folder1 into folder2, which rewrites both the folder1 and folder2 objects.
- Bob (at t2) moves folder2-1 out of folder2 into root, which rewrites his still-stale copy of the folder2 object.

Here's the problem: if t2 is slightly larger than t1 and arrives just before t1 has a chance to propagate to Bob's client, LWW keeps Bob's version of folder2, which never contained note1-1, while Alice's emptied folder1 still wins on its own key.

After this conflict resolution, we end up with this folder tree:

```txt
root
├── folder1
├── folder2
└── folder2-1
```

note1-1 is now completely invisible — gone from the entire tree.
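The loss is easy to reproduce with a toy folder-level LWW merge (my simplified reconstruction, not the actual v1 code):

```typescript
// Toy last-write-wins register at FOLDER granularity: a newer timestamp
// replaces the whole folder object, including fields the writer never touched.
type FolderV1 = { parentId: string | null; notes: string[]; ts: number };
const state = new Map<string, FolderV1>();

function lwwPut(id: string, next: FolderV1): void {
  const prev = state.get(id);
  if (!prev || next.ts > prev.ts) state.set(id, next);
}

// Baseline: note1-1 lives in folder1.
lwwPut("folder1", { parentId: null, notes: ["note1-1"], ts: 0 });
lwwPut("folder2", { parentId: null, notes: [], ts: 0 });

// Alice (t1): moves note1-1 into folder2, rewriting both folder objects.
lwwPut("folder1", { parentId: null, notes: [], ts: 1 });
lwwPut("folder2", { parentId: null, notes: ["note1-1"], ts: 1 });

// Bob (t2): writes folder2 from a stale copy that never saw note1-1.
lwwPut("folder2", { parentId: null, notes: [], ts: 2 });

// note1-1 now belongs to no folder at all.
const allNotes = [...state.values()].flatMap((f) => f.notes);
console.log(allNotes); // []
```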

interactive demo: folder-level LWW drops a note

v1 stores whole folders as LWW objects. Unrelated fields inside the same folder can overwrite each other.

Invalid state not handled

What if we do something similar, but with folders instead of notes?

With this very simple layout:

```txt
root
├── folder1
└── folder2
```

And the same two users do the following simultaneously:

- Alice moves folder1 into folder2.
- Bob moves folder2 into folder1.

This time, since there's no conflict on any single folder, the result is exactly what you'd expect: folder1 is moved into folder2, and folder2 is moved into folder1.

Now we have this — a cycle, detached from root.

```txt
root

┌── folder1
└── folder2
```

And everyone knows what happens to a disconnected component in a graph: it's unreachable, meaning the tree now looks like this to the users:

```txt
root
```
interactive demo: mutual moves create a detached cycle

In v1, nothing repaired this graph. Once root lost reachability, the UI had no path back.

Confusing data structure transformation

Because the folder tree data structure and the actual backend data layer were implemented by different people, the results were a bit all over the place.

For example, the "folder data structure" has three different representations across the database, the logic living in fp-ts, and the output structure the frontend code actually uses. The most painful point was that there's an explicit root folder (parentId === 'root') in the fp-ts logic and the frontend code, while parentId === null means the folder is in root on the backend.

So we literally had these:

```ts
// backend
const parentFolderId = folder.parentId === "root" ? null : folder.parentId;

// frontend
const parentFolderId = folder.parentId ?? "root";
```

# V2 folder core

With how folder tree was implemented and the LWW structure we were using, both problems mentioned above were very hard to fix without creating more issues.

To fix all the issues, I redesigned the folder sync data structure from the ground up, incorporating invalid state recovery from day one.

Fine-grained per-field sync

To allow per-field sync, I devised a new nested data structure that uses Y.Map directly inside Y.Array to represent metadata. Like the LWWMap in the folder tree, a top-level append-only Y.Array retains history entries (so we can actually do LWW). The key difference is using a Y.Map instead of a plain object for the field value.

```ts
// NOTE: this is pseudocode

// LWWMap
Y.Array<[{Key,Value,Timestamp}, ..., {Key,Value,Timestamp}]>

// My data structure; metadata like timestamps are stored in the Y.Map as field values
Y.Array<[Y.Map<Value>, ..., Y.Map<Value>]>

// For folder per-field sync
type Folder = Y.Map<any>; // updating individual fields on Y.Map is safe and syncable
Y.Array<[Y.Map<Folder>, ..., Y.Map<Folder>]>
```

This let me put any Yjs primitive inside the array field, including a nested Y.Map. Using a nested Y.Map for the actual folder data finally enabled per-field fine-grained LWW sync.

Handling invalid states

To eliminate all invalid states, the new core structure runs a quick check-and-fix operation immediately after applying an update.

The explicit root folder is also gone — any folder without a parentId is simply attached to root. Same goes for notes: the folder ↔ notes relation is now stored as a single map (Map<noteId, folderId>). Notes automatically fall back to root if not present in the relation map.
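The relation map with root fallback is tiny in practice; here is a sketch of the lookup (illustrative names, not our actual API):

```typescript
// Notes map to folders through one relation map. Anything missing from the
// map simply renders under root, so there is no explicit root folder to corrupt.
const noteToFolder = new Map<string, string>([["note-a", "folder1"]]);

function folderOf(noteId: string): string | null {
  return noteToFolder.get(noteId) ?? null; // null means "render at root"
}

console.log(folderOf("note-a")); // "folder1"
console.log(folderOf("note-b")); // null, falls back to root
```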

The new structure also dropped the notes and children arrays, and now aligns closely with the data model actually stored in the DB.

```diff
 type FullFolderMeta = {
-  id?: string;
+  id: string;
+  clientId: string;
   name: string;
   icon?: string | null;
-  parentId: string;
+  parentId?: string | null;
-  children: string[];
-  notes: string[];
   // ...other fields
 };
```

Let's revisit the earlier example of Alice and Bob moving folders into each other. Right after the first sync, we still get this:

```txt
root (not real)

┌── folder1 (parentId: folder2)
└── folder2 (parentId: folder1)
```

Then immediately after, the update hook runs and breaks the cycle (order is deterministic):

```txt
root (not real)
├── folder1 (parentId: null)
└── folder2 (parentId: folder1)
```
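The cycle-breaking pass can be sketched like this (my illustration of the idea, not the production hook):

```typescript
// Deterministic cycle repair: walk each folder's parent chain; the first
// node we revisit is on a cycle, so re-attach it to root (parentId = null).
// Iterating folder ids in sorted order keeps the repair identical on every client.
type FolderMeta = { parentId?: string | null };

function repairCycles(folders: Map<string, FolderMeta>): void {
  for (const start of [...folders.keys()].sort()) {
    const seen = new Set<string>();
    let cur: string | null | undefined = start;
    while (cur && folders.has(cur)) {
      if (seen.has(cur)) {
        folders.get(cur)!.parentId = null; // break the cycle at this node
        break;
      }
      seen.add(cur);
      cur = folders.get(cur)!.parentId;
    }
  }
}

const tree = new Map<string, FolderMeta>([
  ["folder1", { parentId: "folder2" }],
  ["folder2", { parentId: "folder1" }],
]);
repairCycles(tree);
console.log(tree.get("folder1")!.parentId); // null
console.log(tree.get("folder2")!.parentId); // "folder1"
```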
interactive demo: the same bad state now self-heals

v2 treats invalid states as expected inputs that need deterministic cleanup, not impossible situations.
Additional hardening

There are a few more things I did to further harden the whole system, including but not limited to:

# Validation

To ensure correctness and reliability beyond normal unit and integration tests, I employed a simulation testing strategy to catch edge cases that traditional tests would miss.

This idea was inspired by TigerBeetle's testing approach.

Seeded with a Linear Congruential Generator (LCG), every operation in the simulator is fully replayable. I then ran randomized tests based on the LCG output. The seed, input, output, and errors were all recorded per round, allowing me to quickly reproduce the exact error that crashed the simulation.

Pseudo decision tree:

```txt
runSimulation(seed):
  rng = LCG(seed)
  clients = createClients(count = rng.int(2..5))

  repeat 1000 times:
    actor = rng.pick(clients)
    action = weightedPick(rng, {
      create, delete, update, reparent,
      moveNotes, batchUpdate, noOp, query
    })

    if action == create:
      actor.createFolder(randomName(rng), randomParent(rng))
    if action == delete:
      actor.deleteFolder(randomFolder(rng))
    if action == update:
      actor.updateFolder(randomFolder(rng), randomPatch(rng))
    if action == reparent:
      actor.moveFolder(randomFolder(rng), randomParent(rng))
    if action == moveNotes:
      actor.moveNotes(randomNotes(rng), randomFolderOrRoot(rng))
    if action == batchUpdate:
      actor.updateMany(randomFolderSubset(rng), randomPatch(rng))
    if action == noOp:
      actor.writeSomethingThatShouldChangeNothing()
    if action == query:
      actor.readPathDepthOrChildren(randomFolder(rng))

    occasionally check basic invariants

  waitUntilAllClientsResync()
  assert all clients converge to same final shape
```
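The replayability hinges on the seeded LCG; a minimal version looks something like this (the constants are the common Numerical Recipes parameters, an illustrative choice rather than what our simulator uses):

```typescript
// Minimal 32-bit linear congruential generator: the same seed always yields
// the same operation stream, so any crashing round can be replayed exactly.
class LCG {
  private state: number;
  constructor(seed: number) {
    this.state = seed >>> 0;
  }
  next(): number {
    // state = (1664525 * state + 1013904223) mod 2^32
    this.state = (Math.imul(1664525, this.state) + 1013904223) >>> 0;
    return this.state;
  }
  int(min: number, max: number): number {
    return min + (this.next() % (max - min + 1));
  }
  pick<T>(items: T[]): T {
    return items[this.next() % items.length];
  }
}

// Two generators with the same seed produce identical streams.
const a = new LCG(42);
const b = new LCG(42);
console.log(a.int(2, 5) === b.int(2, 5)); // true
```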

Some issues surfaced after multiple simulation runs; I fixed them and kept running until the simulations stopped crashing. After running millions of simulations with multi-client concurrent operations, I was pretty confident that the new folder core structure wouldn't produce any invalid state.

# Rollout, migration, and observability

To allow a zero-downtime migration from folder tree (v1) to folder core (v2), I designed a two-mode traffic router.

We first switched everyone to shadow traffic mode to auto-bootstrap the v2 folder core data structure. Once everything looked stable, we gradually enrolled users into the pure v2 schema, bucket by bucket.
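A two-mode router of this shape can be sketched as follows. This is my simplified reconstruction; the bucketing-by-hash detail is an assumption, not necessarily how our router works:

```typescript
// Two-mode rollout router: "shadow" keeps v1 authoritative while v2 is
// bootstrapped in the background; "v2" serves the new schema exclusively.
type Mode = "shadow" | "v2";

// Stable bucket in [0, 100) derived from the user id (assumed detail).
function bucketOf(userId: string): number {
  let h = 0;
  for (const ch of userId) h = (Math.imul(h, 31) + ch.charCodeAt(0)) >>> 0;
  return h % 100;
}

function modeFor(userId: string, v2RolloutPercent: number): Mode {
  return bucketOf(userId) < v2RolloutPercent ? "v2" : "shadow";
}

console.log(modeFor("alice", 0));   // "shadow" (nobody enrolled yet)
console.log(modeFor("alice", 100)); // "v2" (rollout complete)
```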

To improve observability, I also added an end-to-end trace logging system. The client creates a randomized trace ID for each connection, and all logs across frontend and backend for that session share the same trace ID, making connection issues much easier to debug.

We still had some issues right after the complete v2 rollout, but since the v2 core was hardened and thoroughly tested, tracking down bugs was much easier — we could skip second-guessing whether the core data structure was at fault.

rough data flow diagram

# Wrapping up

Welp, that was a wild ride. I learned a lot trying to survive all of this — if you asked me for a one-sentence summary, I'd say "never roll your own sync engine", just use Convex or something.

But for real, the biggest takeaway is that testing is really important — bet hard on it, especially for mission-critical parts. Testing should be part of the system design from the very start; otherwise you'll have a really hard time trying to add it later.

Finally, I hope you found this article interesting — and I sincerely hope you never find yourself in a position that needs any of this.


AI-assisted disclosure:

  • I hand-wrote the entire article and ran it through Opus 4.6 once for grammar and spelling correction.
  • Interactive demos were mostly created by GPT-5.4 and Opus 4.6.
