The pull request was open from June 16 to October 30. Four and a half months. 138 files. ~12,000 lines of diff. The branch consolidated how the whole admin asks “is this user allowed to do this thing” — replacing several ad-hoc permission checks with a single, named API.
It merged at 12:28. The revert merged at 12:57. The refactor lived twenty-nine minutes on master.
I think those 29 minutes were the best decision we made that quarter, and I want to write down why.
What the refactor actually was
The admin had grown several parallel ways of expressing the same idea:
if ($user->hasRight('something'))in some placesif (in_array('something', $user->getRights()))in others- A few callsites that constructed the right-name from a feature toggle
- One module that bypassed the check by re-reading the session
- A handful of
if ($user->isAdmin())shortcuts that subtly diverged from “has the rights for this thing”
Five idioms, same intent, slightly different behavior on edge cases. The refactor collapsed them into one: a Permissions::can($user, $action) style API with the per-module right names enumerated in a single registry. New checks could be grepped. Old checks would, eventually, all line up.
The PR description called it a refactor. In practice it was a refactor and a bugfix — several of the old idioms had been quietly granting access they shouldn’t, or denying it where they shouldn’t, and the rewrite tightened those up.
How a 12,000-LOC refactor passes review
In bits. The PR went through five distinct review rounds between June and October, mostly driven by QA finding behavior differences on specific admin screens. Most rounds were small (“the multi-select on the roles screen lost its select-all”). A few were bigger (“the data-fields page now hides a tab from non-admins that used to be visible to leads”). Each round produced more commits, more screenshots, more “verify this on customer X.”
By late October the QA passes had cleared everything that could be clicked through. The test suite was green. The Danger bot warned, accurately, that it was a “Large PR detected: ~12000 LOC changed. Consider splitting.” We did not split.
The bug, and the merge
It merged at 12:28 on a Thursday.
The bug was in create project — a flow that runs hundreds of times a day. One of the old idioms had implicitly granted a permission that the new registry now required explicitly. Creating a project from one specific entry point hit a path that asked the new API a question the old code never asked, and the new API correctly answered “no” — which then aborted the create flow with an opaque error.
Test suite green. Manual QA green. Production: project creation fell off a cliff.
By 12:35-ish, support had two tickets and our error tracker had counted ten or twenty unique exceptions.
Why we reverted instead of hotfixing
The reflex on a fresh merge breaking production is stabilize forward: identify the broken path, push a fix, redeploy. That works when the broken thing is a leaf — a single endpoint, a single template, a single function. It does not work well when the broken thing is one expression of a refactor that touched 138 files.
The argument for reverting:
- The bug was in production, which meant other bugs of the same shape almost certainly existed and hadn’t surfaced yet. The test suite couldn’t find them; manual QA hadn’t found them; the only thing finding them was live traffic. Stabilizing forward means stabilizing one bug at a time, on production, while the codebase is in a state nobody fully understands.
- The PR had a clean parent commit. The revert was a single click. The customer-facing apps weren’t pinned to anything in the new code yet.
- The team that had built it had been on it for four months. They were not the right people to debug it from cold, mid-incident.
The argument against reverting was sunk cost. Four and a half months of work. A coordinated set of PRs across two other repositories. A QA pass that had cleared everything. Reverting felt like throwing it away.
It wasn’t throwing it away. Reverting bought us the time to re-merge the same diff with the bug fixed, in calm, under code review, on a calendar that wasn’t “right now.”
By 12:57, master was back where it was at 12:27.
The 29 minutes
What actually happened in those 29 minutes:
- 12:28 — Merge lands.
- 12:30-ish — First production exceptions appear.
- 12:35-ish — Support has two tickets. The team gets pinged.
- 12:40-ish — We’ve reproduced the create-project error locally on the merged branch. Root cause is one missing right in the registry.
- 12:45-ish — Quick discussion: hotfix or revert. We choose revert.
- 12:50-ish — Revert PR opened.
- 12:57 — Revert merged.
The decision-to-revert step was the longest. The PR author wanted to hotfix; reverting feels like a personal rejection of the work. The argument that won was: “the revert lets you fix this PR. The hotfix doesn’t let us fix the next one.”
The operational muscle this took — being able to revert in under 30 minutes — is the part nobody plans for and everybody benefits from. Specifically:
- The branch was rebased on master, not merged-with-merge-commits. The revert was one commit.
- Deploys are fast — under five minutes from merge to production on this repo.
- We don’t have a manual QA gate on master (the manual QA happens pre-merge). So the revert deployed itself, the same way the broken merge had.
- Nobody had to ask permission to revert. The dev who merged it could revert it.
Those four things are not free. They are the result of past investment. If your incident response includes “ping the release manager for approval,” your revert window is not 29 minutes.
What we changed for v2
The v2 PR opened the next morning. It was the same branch, rebased on the new master (which now didn’t contain the reverted merge), with one additional commit: the missing right added to the registry, and a regression test that called the create-project endpoint end-to-end and asserted a non-zero return.
That regression test is the actual lesson.
The original PR’s test coverage was excellent for the new code. It had unit tests on the new Permissions::can() API. It had unit tests on the new registry. It had QA screenshots from manual click-through on the affected admin screens.
What it didn’t have was a test that asked, of every important user-facing flow: does this still work end-to-end after the refactor? The create-project flow was hot enough that “does this still work” should have been a CI gate.
For v2 we added a small set of integration tests covering the highest-traffic flows: create-project, create-customer, create-registration, the daily approval cycle. Not exhaustive. Just enough that a regression in any one of them would fail CI before the merge button could go green.
v2 merged on November 6 and has stayed in. There were two small hotfixes the following week (one typo on a right name, one missed callsite in a rarely-used corner) but no incidents of the create-project shape.
The lesson
The shape of this story is not “we did something wrong and learned a lesson.” We did something carefully — four months of review, multiple QA rounds, coordinated cross-repo PRs — and shipped a regression anyway. Test coverage on the new code does not protect you from a refactor’s effect on existing call paths. The only thing that does is exercising those call paths end-to-end.
Two things I’d write down for next time:
- Identify the top-N hot flows before merging any refactor that touches more than ~50 files. Create-project, create-customer, daily approval — for us, on this codebase, those three are the load-bearing wall. A test that fails when any of them stop working ten minutes from now is worth more than a hundred unit tests that pass on the new abstraction.
- Optimize for a fast revert, not for never needing one. The 29-minute revert was free for us because of investments in deploy speed, merge hygiene, and a culture where reverting is a tool rather than an admission. If those weren’t already in place, this incident would have lasted four hours, not twenty-nine minutes.
Review will not save you on a refactor this big. Plan for the revert.