# What Could Go Wrong With Folders?

This is the story of how I survived a disastrous folder system and what I learned from it.

# The Outage

We were having a chill year-end party, everyone was playing table games and chatting. Then, a sharp ping hit the Slack channel — prod is down.

A quick scan over our AWS dashboard showed that our Redis instance was running out of memory and crashing. Our newly developed feature, realtime folders, had gone live just a week earlier and used Redis for syncing under the hood. Maybe too many users were trying out the new feature, we thought.

To slap a quick band-aid on prod, we vertically and horizontally scaled the processing worker to effectively 10x the capacity. But something wasn't right — the unprocessed messages in Redis kept piling up, just not as fast as before, and users were starting to complain about broken folders and missing notes.

A folder is a pretty common feature across various products. What could go so terribly wrong that even 10x the capacity cannot fix it?

# Our Folder System

Before I uncover the underlying problem and the solution, let's first understand how our folder system is designed. Here at HackMD, our core value is realtime collaboration (e.g., our note editor). So, from the very beginning, we designed the folder system on top of CRDTs (Conflict-free Replicated Data Types), a family of data structures for conflict-free realtime synchronization.

interactive demo: realtime primer — why collaborative folders felt plausible

Both clients hold a local copy of the same tree. Edits apply instantly, then fan out through the shared room.
We settled on a JS implementation, Yjs, and one of its transport layers, y-redis. Since we were already using Socket.IO, we created a fork to swap out the WebSocket layer with Socket.IO. The overall architecture looks like this:

overall architecture

With the current setup, everything seemed fine in our local development environment. There were some slight hiccups on the staging server, but those were squashed and fixed. At the time, it was close to the year end (Lunar New Year approaching), so we had to pack things up and prepare to ship it to prod without having had a chance to stress test it first.

Not even a week into the deployment, we already had a lot of problems. Folder operations not persisting, notes disappearing, workspaces not loading, and our server failing to keep up with all of this without us having much insight into what was actually happening.

# We re-architected y-redis

Of course rewriting the thing would fix it. But rewriting requires a fundamental understanding of the domain and a lot of testing to ensure compatibility, hence it's usually a last resort. Why and how did we end up here? Well, let me introduce you to the y-redis architecture first.

# y-redis overview

y-redis uses a server/worker split:

detailed y-redis architecture

Seems fine, right? The server is stateless, probably easily scalable. Workers are split into consumer groups, which also sounds scalable. So, what's the problem here? Let's do a quick estimation on how many servers and workers we'll need to handle a production workload.

Say we have 10,000 active rooms (in our design, a workspace is a y-redis room), each doing 1 message per second with default settings (REDIS_TASK_DEBOUNCE=10s). With a 10s debounce, each room produces roughly one compaction task every 10 seconds, so about 1,000 tasks per second in total. (N = rooms, k = clients)

Let's say we have a super fast database, so reading and writing are both sub-10ms operations. We still get a ~100ms processing time per task, which translates into a per-worker throughput P of 10 tasks per second, and thus about 100 workers. Still a lot for 10,000 rooms and just 20 or so servers.
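The back-of-the-envelope numbers work out like this (a sketch in TypeScript; all constants are the assumed values from the text):

```typescript
// Capacity estimate for the y-redis worker pool, using the assumed numbers
// from the text (10,000 rooms, 10s debounce, ~100ms per compaction task).
const rooms = 10_000;
const debounceSec = 10;      // REDIS_TASK_DEBOUNCE
const taskLatencySec = 0.1;  // ~100ms to rebuild + persist one room

// Each active room produces roughly one compaction task per debounce window.
const tasksPerSec = rooms / debounceSec;            // 1,000 tasks/sec
const perWorkerRate = 1 / taskLatencySec;           // P = 10 tasks/sec
const workersNeeded = tasksPerSec / perWorkerRate;  // 100 workers

console.log(workersNeeded); // 100
```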

# The problem

This would have been just costly but fine (not really) if we had known we needed that many workers beforehand (which we didn't). Let's see how many workers we'd need if we under-provisioned at 10% for a day.

Assume N=10,000, P=5, and we deployed 20 workers where we actually need 200. After 1 day under-provisioned, the backlog grows at 900 tasks per second (1,000 produced minus 100 processed), piling up roughly 77.8 million unprocessed tasks.

If we want to recover the backlog in R days, the required total worker count is W_total = (1000 + 900/R) / 5. So the workers to add (W_total - 20) come out to 360 for a one-day recovery (W_total = 380).

And the worst part? This calculation is the best case. Remember that we are rebuilding the ydoc in the worker? That has a cost of O(msg), so piling up unprocessed messages slows the system down even more, making P even smaller. With P halved, we'd need to add 740 workers to recover in a day. And the worst case is unbounded.
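Plugging numbers into the recovery formula (a quick sketch; same assumed rates as above):

```typescript
// Workers required to clear an under-provisioning backlog in R days.
// demandRate and deployedRate follow the assumptions in the text:
// 1,000 tasks/sec needed, 20 workers at P=5 giving 100 tasks/sec deployed.
const demandRate = 1000;
const deployedRate = 100;
const deficit = demandRate - deployedRate; // backlog grows at 900 tasks/sec

function workersToAdd(P: number, recoverDays: number, deployed = 20): number {
  // Total rate needed = ongoing demand + backlog drained over recoverDays.
  const totalWorkers = (demandRate + deficit / recoverDays) / P;
  return totalWorkers - deployed;
}

console.log(workersToAdd(5, 1));   // 360 extra workers to recover in one day
console.log(workersToAdd(2.5, 1)); // 740 once the ydoc rebuild cost halves P
```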

Can things get even worse? Yes — we under-provisioned for at least 7 days. Some of the rooms now took over a minute just to load, making the processing rate even worse — a negative feedback loop. Using our previously assumed numbers, we would have at least 544,320,000 backlogged compactions (we had way more than 1 msg/room/sec, so the real number was likely much bigger).

# The fix

The debugging stage was a disaster — nobody knew what was broken or where, and we didn't know we needed that many workers at the time. By the time we were 7 days in, I was so frustrated that I thought: "If the worker doesn't work, why don't we just throw that away?".

And so I did.

detailed y-socketio-redis architecture

The transport layer is basically the same, but the server is now stateful, maintaining live ydoc copies of all connected workspaces. This allows reactive in-place persistence instead of the original pulling/requeueing worker flow. I employed a RedLock-like election algorithm to choose a persist leader, making sure we only run persistence once per loaded workspace to avoid race conditions. Persistence itself is done in a worker thread colocated with the server and reused across different workspaces.
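The persist-leader election can be sketched roughly like this. This is my illustration of the idea, not the production code; a real deployment would take the lock in Redis with `SET key value NX PX ttl`, while here an in-memory map stands in so the snippet is self-contained:

```typescript
// RedLock-style single-leader election per workspace (in-memory stand-in).
type Lock = { owner: string; expiresAt: number };
const locks = new Map<string, Lock>();

// Try to become (or stay) the persist leader for a workspace.
function tryAcquire(
  workspaceId: string,
  serverId: string,
  ttlMs: number,
  now: number = Date.now()
): boolean {
  const lock = locks.get(workspaceId);
  // Refuse if another live holder exists; expired locks are up for grabs.
  if (lock && lock.expiresAt > now && lock.owner !== serverId) return false;
  locks.set(workspaceId, { owner: serverId, expiresAt: now + ttlMs });
  return true;
}

console.log(tryAcquire("ws-1", "server-A", 5000)); // true: A leads ws-1
console.log(tryAcquire("ws-1", "server-B", 5000)); // false: A still holds it
```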

How much better is this approach? Let's do some quick math with active rooms N = 10,000, message rate m = 1, and the default debounce of 3s.

We can approximate this with the equation: Servers = ((N * m) / debounce) / P

But there's a big catch. The worker thread can persist different workspaces concurrently, and since the biggest cost here is the DB write, that gives us a huge improvement. The per-server rate becomes P = K / L, where K is the max number of concurrent persist tasks and L is the per-task persistence latency in seconds. Which gives us: Servers ≈ 10,000 / 3 / (10 / 0.1) ≈ 33.3 (assuming K = 10 and L = 100ms), way less than the previous requirement of over 200 workers in y-redis.
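The same estimate in code (a sketch; K and L are the assumed values above):

```typescript
// Server estimate for the stateful design: K concurrent persist tasks per
// server, each dominated by an L-second DB write (assumed values from the text).
const activeRooms = 10_000;
const debounce = 3;  // seconds
const K = 10;        // max concurrent persist tasks
const L = 0.1;       // seconds per DB write

const perServerRate = K / L;                            // P = 100 tasks/sec
const servers = activeRooms / debounce / perServerRate; // about 33.3

console.log(servers.toFixed(1)); // "33.3"
```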

For a more realistic workload after all the optimizations, L is usually between 0.05 and 0.08, which translates into 17 to 27 servers (last time I checked, we were running 10–20 servers).

This change alone fixed the production outage and reduced the Redis memory usage significantly, and I successfully saved my Lunar New Year vacation.

# Highly resilient folder data structure

Earlier, I mentioned doing some hardening on the data structure so we could remove most of the expensive checks in the persistence layer. Let me walk you through what I did.

# V1 folder tree

To better understand the problem, let's first take a look at the v1 folder tree.

The folder tree was our first attempt at a realtime collaborative folder system. The architecture was pretty straightforward: a simple struct inside an LWW (last-write-wins) map. In this design, children holds the ids of all subfolders, and notes holds the ids of all notes in the folder.

```ts
type Folder = {
  id: string;
  name: string;
  icon: string;
  parentId: string;
  children: string[];
  notes: string[];
  // ...other fields
};

type ClientId = string;
type FolderTree = LWWMap<ClientId, Folder>;
```

Sounds reasonable enough — what could go wrong? Well, the LWW boundary is at the folder level instead of the field level. Let me illustrate the problem:

Bad conflict resolution due to LWW granularity

With this simple layout:

```txt
root
├── folder1
│   └── note1-1
└── folder2
    └── folder2-1
```

Two users, Alice and Bob, do the following at almost the same time:

- Alice (at t1) moves note1-1 from folder1 into folder2, which rewrites both the folder1 and folder2 objects.
- Bob (at t2) moves folder2-1 out of folder2 into root, which rewrites his still-stale copy of the folder2 object.

Here's the problem: if t2 is slightly larger than t1 and arrives just before t1 has a chance to propagate to Bob's client, LWW keeps Bob's version of folder2, which never contained note1-1, while Alice's emptied folder1 still wins on its own key.

After this conflict resolution, we end up with this folder tree:

```txt
root
├── folder1
├── folder2
└── folder2-1
```

note1-1 is now completely invisible — gone from the entire tree.
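The loss is easy to reproduce with a toy folder-level LWW merge (my simplified reconstruction, not the actual v1 code):

```typescript
// Toy last-write-wins register at FOLDER granularity: a newer timestamp
// replaces the whole folder object, including fields the writer never touched.
type FolderV1 = { parentId: string | null; notes: string[]; ts: number };
const state = new Map<string, FolderV1>();

function lwwPut(id: string, next: FolderV1): void {
  const prev = state.get(id);
  if (!prev || next.ts > prev.ts) state.set(id, next);
}

// Baseline: note1-1 lives in folder1.
lwwPut("folder1", { parentId: null, notes: ["note1-1"], ts: 0 });
lwwPut("folder2", { parentId: null, notes: [], ts: 0 });

// Alice (t1): moves note1-1 into folder2, rewriting both folder objects.
lwwPut("folder1", { parentId: null, notes: [], ts: 1 });
lwwPut("folder2", { parentId: null, notes: ["note1-1"], ts: 1 });

// Bob (t2): writes folder2 from a stale copy that never saw note1-1.
lwwPut("folder2", { parentId: null, notes: [], ts: 2 });

// note1-1 now belongs to no folder at all.
const allNotes = [...state.values()].flatMap((f) => f.notes);
console.log(allNotes); // []
```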

interactive demo: folder-level LWW drops a note

v1 stores whole folders as LWW objects. Unrelated fields inside the same folder can overwrite each other.

Invalid state not handled

What if we do something similar, but with folders instead of notes?

With this very simple layout:

```txt
root
├── folder1
└── folder2
```

And the same two users do the following simultaneously:

- Alice moves folder1 into folder2.
- Bob moves folder2 into folder1.

This time, since there's no conflict on any single folder, the result is exactly what you'd expect: folder1 is moved into folder2, and folder2 is moved into folder1.

Now we have this — a cycle, detached from root.

```txt
root

┌── folder1
└── folder2
```

And everyone knows what happens to a disconnected component in a graph: it's unreachable, meaning the tree now looks like this to the users:

```txt
root
```
interactive demo: mutual moves create a detached cycle

In v1, nothing repaired this graph. Once root lost reachability, the UI had no path back.

Confusing data structure transformation

Because the folder tree data structure and the actual backend data layer were implemented by different people, the results were a bit all over the place.

For example, the "folder data structure" has three different representations across the database, the logic living in fp-ts, and the output structure the frontend code actually uses. The most painful point was that there's an explicit root folder (parentId === 'root') in the fp-ts logic and the frontend code, while parentId === null means the folder is in root on the backend.

So we literally had these:

```ts
// backend
const parentFolderId = folder.parentId === "root" ? null : folder.parentId;

// frontend
const parentFolderId = folder.parentId ?? "root";
```

# V2 folder core

With how folder tree was implemented and the LWW structure we were using, both problems mentioned above were very hard to fix without creating more issues.

To fix all the issues, I redesigned the folder sync data structure from the ground up, incorporating invalid state recovery from day one.

Fine-grained per-field sync

To allow per-field sync, I devised a new nested data structure that uses Y.Map directly inside Y.Array to represent metadata. Like the LWWMap in the folder tree, a top-level append-only Y.Array retains history entries (so we can actually do LWW). The key difference is using a Y.Map instead of a plain object for the field value.

```ts
// NOTE: this is pseudocode

// LWWMap
Y.Array<[{Key,Value,Timestamp}, ..., {Key,Value,Timestamp}]>

// My data structure; metadata like timestamps are stored in the Y.Map as field values
Y.Array<[Y.Map<Value>, ..., Y.Map<Value>]>

// For folder per-field sync
type Folder = Y.Map<any>; // updating individual fields on Y.Map is safe and syncable
Y.Array<[Y.Map<Folder>, ..., Y.Map<Folder>]>
```

This let me put any Yjs primitive inside the array field, including a nested Y.Map. Using a nested Y.Map for the actual folder data finally enabled per-field fine-grained LWW sync.

Handling invalid states

To eliminate all invalid states, the new core structure runs a quick check-and-fix operation immediately after applying an update.

The explicit root folder is also gone — any folder without a parentId is simply attached to root. Same goes for notes: the folder ↔ notes relation is now stored as a single map (Map<noteId, folderId>). Notes automatically fall back to root if not present in the relation map.
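The relation map with root fallback is tiny in practice; here is a sketch of the lookup (illustrative names, not our actual API):

```typescript
// Notes map to folders through one relation map. Anything missing from the
// map simply renders under root, so there is no explicit root folder to corrupt.
const noteToFolder = new Map<string, string>([["note-a", "folder1"]]);

function folderOf(noteId: string): string | null {
  return noteToFolder.get(noteId) ?? null; // null means "render at root"
}

console.log(folderOf("note-a")); // "folder1"
console.log(folderOf("note-b")); // null, falls back to root
```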

The new structure also dropped the notes and children arrays, and now aligns closely with the data model actually stored in the DB.

```diff
 type FullFolderMeta = {
-  id?: string;
+  id: string;
+  clientId: string;
   name: string;
   icon?: string | null;
-  parentId: string;
+  parentId?: string | null;
-  children: string[];
-  notes: string[];
   // ...other fields
 };
```

Let's revisit the earlier example of Alice and Bob moving folders into each other. Right after the first sync, we still get this:

```txt
root (not real)

┌── folder1 (parentId: folder2)
└── folder2 (parentId: folder1)
```

Then immediately after, the update hook runs and breaks the cycle (order is deterministic):

```txt
root (not real)
├── folder1 (parentId: null)
└── folder2 (parentId: folder1)
```
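The cycle-breaking pass can be sketched like this (my illustration of the idea, not the production hook):

```typescript
// Deterministic cycle repair: walk each folder's parent chain; the first
// node we revisit is on a cycle, so re-attach it to root (parentId = null).
// Iterating folder ids in sorted order keeps the repair identical on every client.
type FolderMeta = { parentId?: string | null };

function repairCycles(folders: Map<string, FolderMeta>): void {
  for (const start of [...folders.keys()].sort()) {
    const seen = new Set<string>();
    let cur: string | null | undefined = start;
    while (cur && folders.has(cur)) {
      if (seen.has(cur)) {
        folders.get(cur)!.parentId = null; // break the cycle at this node
        break;
      }
      seen.add(cur);
      cur = folders.get(cur)!.parentId;
    }
  }
}

const tree = new Map<string, FolderMeta>([
  ["folder1", { parentId: "folder2" }],
  ["folder2", { parentId: "folder1" }],
]);
repairCycles(tree);
console.log(tree.get("folder1")!.parentId); // null
console.log(tree.get("folder2")!.parentId); // "folder1"
```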
interactive demo: the same bad state now self-heals

v2 treats invalid states as expected inputs that need deterministic cleanup, not impossible situations.
Additional hardening

There are a few more things I did to further harden the whole system, including but not limited to:

# Validation

To ensure correctness and reliability beyond normal unit and integration tests, I employed a simulation testing strategy to catch edge cases that traditional tests would miss.

This idea was inspired by TigerBeetle's testing approach.

Seeded with a Linear Congruential Generator (LCG), every operation in the simulator is fully replayable. I then ran randomized tests based on the LCG output. The seed, input, output, and errors were all recorded per round, allowing me to quickly reproduce the exact error that crashed the simulation.

Pseudo decision tree:

```txt
runSimulation(seed):
  rng = LCG(seed)
  clients = createClients(count = rng.int(2..5))

  repeat 1000 times:
    actor = rng.pick(clients)
    action = weightedPick(rng, {
      create, delete, update, reparent,
      moveNotes, batchUpdate, noOp, query
    })

    if action == create:
      actor.createFolder(randomName(rng), randomParent(rng))
    if action == delete:
      actor.deleteFolder(randomFolder(rng))
    if action == update:
      actor.updateFolder(randomFolder(rng), randomPatch(rng))
    if action == reparent:
      actor.moveFolder(randomFolder(rng), randomParent(rng))
    if action == moveNotes:
      actor.moveNotes(randomNotes(rng), randomFolderOrRoot(rng))
    if action == batchUpdate:
      actor.updateMany(randomFolderSubset(rng), randomPatch(rng))
    if action == noOp:
      actor.writeSomethingThatShouldChangeNothing()
    if action == query:
      actor.readPathDepthOrChildren(randomFolder(rng))

    occasionally check basic invariants

  waitUntilAllClientsResync()
  assert all clients converge to same final shape
```
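The replayability hinges on the seeded LCG; a minimal version looks something like this (the constants are the common Numerical Recipes parameters, an illustrative choice rather than what our simulator uses):

```typescript
// Minimal 32-bit linear congruential generator: the same seed always yields
// the same operation stream, so any crashing round can be replayed exactly.
class LCG {
  private state: number;
  constructor(seed: number) {
    this.state = seed >>> 0;
  }
  next(): number {
    // state = (1664525 * state + 1013904223) mod 2^32
    this.state = (Math.imul(1664525, this.state) + 1013904223) >>> 0;
    return this.state;
  }
  int(min: number, max: number): number {
    return min + (this.next() % (max - min + 1));
  }
  pick<T>(items: T[]): T {
    return items[this.next() % items.length];
  }
}

// Two generators with the same seed produce identical streams.
const a = new LCG(42);
const b = new LCG(42);
console.log(a.int(2, 5) === b.int(2, 5)); // true
```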

Some issues surfaced after multiple simulation runs; I fixed them and kept running until the simulations stopped crashing. After running millions of simulations with multi-client concurrent operations, I was pretty confident that the new folder core structure wouldn't produce any invalid state.

# Rollout, migration, and observability

To allow a zero-downtime migration from folder tree (v1) to folder core (v2), I designed a two-mode traffic router.

We first switched everyone to shadow traffic mode to auto-bootstrap the v2 folder core data structure. Once everything looked stable, we gradually enrolled users into the pure v2 schema, bucket by bucket.
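A two-mode router of this shape can be sketched as follows. This is my simplified reconstruction; the bucketing-by-hash detail is an assumption, not necessarily how our router works:

```typescript
// Two-mode rollout router: "shadow" keeps v1 authoritative while v2 is
// bootstrapped in the background; "v2" serves the new schema exclusively.
type Mode = "shadow" | "v2";

// Stable bucket in [0, 100) derived from the user id (assumed detail).
function bucketOf(userId: string): number {
  let h = 0;
  for (const ch of userId) h = (Math.imul(h, 31) + ch.charCodeAt(0)) >>> 0;
  return h % 100;
}

function modeFor(userId: string, v2RolloutPercent: number): Mode {
  return bucketOf(userId) < v2RolloutPercent ? "v2" : "shadow";
}

console.log(modeFor("alice", 0));   // "shadow" (nobody enrolled yet)
console.log(modeFor("alice", 100)); // "v2" (rollout complete)
```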

To improve observability, I also added an end-to-end trace logging system. The client creates a randomized trace ID for each connection, and all logs across frontend and backend for that session share the same trace ID, making connection issues much easier to debug.

We still had some issues right after the complete v2 rollout, but since the v2 core was hardened and thoroughly tested, tracking down bugs was much easier — we could skip second-guessing whether the core data structure was at fault.

rough data flow diagram

# Wrapping up

Welp, that was a wild ride. I learned a lot trying to survive all of this — if you asked me for a one-sentence summary, I'd say "never roll your own sync engine", just use Convex or something.

But for real, the biggest takeaway is that testing is really important — bet hard on it, especially for mission-critical parts. Testing should be part of the system design from the very start; otherwise you'll have a really hard time trying to add it later.

Finally, I hope you found this article interesting — and I sincerely hope you never find yourself in a position that needs any of this.


AI-assisted disclosure:

  • I hand-wrote the entire article and ran it through Opus 4.6 once for grammar and spelling correction.
  • Interactive demos were mostly created by GPT-5.4 and Opus 4.6.
