What Clockwork showed us about a 30-second list page

A list page in our admin had a problem nobody quite knew how to file. On a customer with a busy month it would sometimes finish in three seconds, sometimes in twenty, and occasionally just crash with a 500 because PHP’s max_execution_time of thirty seconds had run out. Reload, try again, and you might get the page. Or you might not.

Pages with this shape are the worst kind to optimize. The slow case is real but not reliable. Profiling locally returns “everything is fast.” Adding microtime(true) calls around suspected blocks confirms whatever you thought before you added them.

What broke the impasse was Clockwork, which a colleague had installed in our codebase the week before. Clockwork is a PHP profiler that hangs in the browser dev-tools, showing per-request timings, queries, and custom events. The default install was already giving us “how many queries does this page run” — useful for the obvious cases, not enough to figure out where the time on this page was actually going.

I added a small helper, addClockworkEvent($name, $startTime), that wraps Clockwork’s event API with a backtrace filter and a consistent naming convention. Then I dropped calls to it at the boundaries I’d suspected without evidence — repository constructors, the rule engine, the holiday cache, the schedule lookups, the export builders.

The next Clockwork trace was unrecognizable. The fast cases and the slow cases looked completely different — in a way the old “how many queries” view had hidden.

What the trace showed

The slow case wasn’t one slow query. It was a thousand small things, none of them individually scary.

The same ConfigRepository was instantiated many times per request because every callsite that needed config did new ConfigRepository(...) instead of asking a registry. The holiday repository was loading the year’s holidays once per employee. The custom-fields repository was re-reading its definitions on each form render. The rule engine was reconstructing its rule graph for every registration even though the graph only changed when an admin saved a configuration page.

FIND_IN_SET was used as a filter on a column with hundreds of thousands of rows. Two SQL tables had index orderings that no longer matched the queries hitting them. Inside the employee-registration render, the loop fetched one employee at a time, generating one round trip per employee for what could have been a single batched fetch.

Each of these was, on a small customer, basically free. On a large customer in a busy month, they added up to the 30-second timeout.

The ten fixes

Across three weeks I shipped roughly ten changes. None of them were a refactor in the dramatic sense:

Singleton pattern for ~20 repositories and services. The trait that owns this is fifteen lines. Hit MySQLWrapper, ConfigRepository, RuleCache, RuleBuilder, RuleService, HolidayRepository, ScheduleRepository, Schedule, ModuleService, AssistantService, RegistrationConversionService, and friends.
In-memory caching in CustomFieldRepository. First call hits the DB; subsequent calls use a per-request static array.
Cached assistant evaluations. The evaluator was deterministic per (employee_id, date); caching it within the request collapsed N calls into one.
FIND_IN_SET → IN in three queries in RegistrationRepository. FIND_IN_SET doesn’t use indexes; IN does.
Recursive query for holidaysInRange. Replaced a per-day fetch loop with one query that returns the whole range.
JSON aggregation for worktimeSchedules. Collapsed an N+1 of schedule-day fetches into one query returning a JSON column the app already knew how to read.
Index reorganization on two SQL tables. No new indexes — just reordering columns inside existing ones so the leftmost prefix matched the actual queries.
Subquery restructure + a deleted flag check in the list query that started this whole thing. The old form re-evaluated the subquery per outer row; the new form pre-filters before the join.
Chunked employee rendering — 20 at a time. The render loop used to issue one DB round trip per employee. Now it batches.
SystemLogger for batch logging. The audit-log write that fired on every action used to be one INSERT per call. The new logger buffers and writes once per request.

Plus one cleanup: the preload limit on the timesheet view rose from 100 to 500, because with the changes above, fetching 500 was cheaper than fetching 100 had been before.

What I almost wrote, and didn’t

The temptation in the middle of this work was to rewrite the list page entirely — pull out the query, redesign it as a single read with computed columns, lift it into a view. Cleaner, more elegant, lots of demo value.

It would also have taken a month and left a long tail of unknown regressions.

The ten small fixes took three weeks, each one shipped on its own, each one a measurable win in the Clockwork trace, each one rollback-able on its own merits. The big rewrite would have been one PR I’d have been afraid to revert.

A page that’s slow because of one heroic bad query is rare. A page that’s slow because of years of accumulated micro-costs is normal. Treat it accordingly.

The lesson

Two things stuck with me.

The instrumentation pass was worth more than any single fix. Without the addClockworkEvent() helper and the targeted calls, every one of the ten changes above was a guess. With it, each fix was a measurable response to a specific line in the trace. The afternoon I spent on instrumentation is the work I’d recommend most.

Trust the trace over your intuition about hotspots. I would have sworn the bottleneck was a slow query. It wasn’t. It was a constellation of small ones, plus instantiation costs, plus loop-per-employee fetches. My intuition was wrong by an order of magnitude on which class of fix mattered.

The page loads cleanly now. I haven’t seen a 30-second timeout since.

What the trace showed#

The ten fixes#

What I almost wrote, and didn’t#

The lesson#

What the trace showed

The ten fixes

What I almost wrote, and didn’t

The lesson