The seed loss — how $16,800 in 2023 became $532,000 by 2026

The episode discussed below is a specific event in the internal backtest pipeline; the compounding math is exact for the historical window described. It is not a forecast of any forward outcome, and live execution will produce a different number for reasons that include slippage, fills, and the normal divergence of any sample from its model.

A trade-selection disturbance of roughly $16,800 in mid-2023 became a $532,000 gap in the 2026 backtest. Same period, same names, same rules. The only thing between the two numbers is compounding over the rest of the five-year backtest window.

The arithmetic

Net asset value compounded roughly 31.6× between the date of the original disturbance and the end of the five-year backtest window. Any dollar of pnl that the system did not capture at the start of that window cost ~$31.60 by the end. A sixteen-thousand-dollar miss therefore did not stay a sixteen-thousand-dollar miss. It became a half-million-dollar gap relative to the locked baseline.
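
As a quick check on these figures, the arithmetic can be sketched in a few lines of Python. The variable names are illustrative, and the inputs are the rounded values quoted in the text, so the result lands near the headline gap rather than exactly on it.

```python
# Rounded inputs quoted in the text; the names are illustrative.
seed_loss = 16_800      # direct pnl missed in mid-2023, in dollars
nav_multiple = 31.6     # NAV growth from the disturbance to the end of the window

# Each uncaptured dollar would have grown with the NAV it was part of.
compounded_gap = seed_loss * nav_multiple

print(f"compounded gap: ${compounded_gap:,.0f}")
```

With these rounded inputs the product is about $530,880, which is within rounding of the $532,000 quoted above.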

The principle is older than any of our work. Compounding amplifies the early disturbance more than the late one. A dollar of pnl in year one of a five-year run is worth far more than a dollar of pnl in year five, because the dollar in year one is itself reinvested into the engine that produces all subsequent dollars. Early drift is catastrophically expensive.
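
To make the early-versus-late asymmetry concrete, the sketch below assumes a constant annual growth rate, which is a simplification since real NAV paths are not smooth. It uses the +128.8% baseline CAGR quoted later in this note purely as an illustrative rate. The function name and its parameters are assumptions made for the example.

```python
def terminal_value_of_dollar(year_earned: float,
                             horizon_years: float,
                             annual_growth: float) -> float:
    """Value at the end of the run of one dollar earned at `year_earned`,
    assuming it is reinvested at a constant annual growth rate."""
    years_remaining = horizon_years - year_earned
    return (1.0 + annual_growth) ** years_remaining


growth = 1.288  # +128.8% per year, used here as an illustrative constant rate

for year in range(0, 6):
    value = terminal_value_of_dollar(year, 5, growth)
    print(f"dollar earned at year {year}: worth ${value:,.2f} at year 5")
```

Under this assumption the terminal value of a dollar falls geometrically with the year in which it is earned, which is the sense in which early drift costs more than late drift.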

What happened

A routine refactor moved part of the scanner code into a shared library. The commit message asserted equivalence: same trades, same signals. The trades were almost the same — same total count, same composite scores on the names that appeared in both versions — but a few entry triggers shifted by two or three trading sessions for two specific names in May 2023.

Direct realised pnl from the four changed trade legs added up to roughly −$16,800. The discovery was forensic: the live four-engine stack's full-period return had dropped from the corrected baseline of +128.8% CAGR to roughly +121%, and the final NAV from $6,297,913 to roughly $5.34M. The headline degradation looked alarming. The seed disturbance behind it looked trivial.

Why a five-percent miss is not a five-percent miss

The dangerous frame is to look at a single small backtest regression and shrug it off as noise. A reasonable observer sees +121% CAGR versus +128.8% and concludes that the difference is within sampling tolerance. But the difference is not noise — it is the compounded shadow of a single trade-selection mistake that happened to land early in the run.

If a similar disturbance is absorbed every six months (a scanner refactor, a small library upgrade, a re-derived benchmark) and each one shaves five percent off the long-run CAGR, the strategy that started life as a 130% compounding machine ends its marketing lifecycle as an 80% one. The decay is not loud. It is administrative.
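
The decay from roughly 130% to 80% can be checked with a short loop. The text does not say whether "five percent" means five percentage points or five percent of the current CAGR, so the sketch below tries both readings. The function name and parameters are illustrative.

```python
def shaves_until(start_cagr: float, floor_cagr: float,
                 shave: float, relative: bool) -> tuple[int, float]:
    """Count how many disturbances it takes for the CAGR (in percent)
    to fall to `floor_cagr` or below.

    relative=True  : each disturbance removes `shave` as a fraction of the current CAGR.
    relative=False : each disturbance removes `shave` percentage points.
    """
    cagr = start_cagr
    count = 0
    while cagr > floor_cagr:
        cagr = cagr * (1.0 - shave) if relative else cagr - shave
        count += 1
    return count, cagr


points_count, points_cagr = shaves_until(130.0, 80.0, 5.0, relative=False)
relative_count, relative_cagr = shaves_until(130.0, 80.0, 0.05, relative=True)

print(f"percentage-point reading: {points_count} disturbances, CAGR {points_cagr:.1f}%")
print(f"relative reading:         {relative_count} disturbances, CAGR {relative_cagr:.1f}%")
```

Under either reading it takes ten such disturbances to reach 80% or below, which at one every six months is about five years of routine, individually defensible changes.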

The discipline that follows

Three rules fall out of accepting the arithmetic.

Lock the inputs to the baseline. The raw signal pools that feed Genbu, Suzaku, Byakko, and Seiryū are pinned to specific dated files. The backtest does not regenerate them on every run. A new release re-locks them only after the full pipeline has been verified end-to-end against the prior baseline within a tight tolerance.
Treat silent drift as a regression. A refactor whose commit message says “same trades, same signals” is tested against that claim. The test is not unit-level — it is run on the locked five-year window and the headline NAV is compared. If it differs, the refactor is presumed to have changed semantics until proven otherwise.
Operate inside published envelopes. The engines do not improvise. ATR-derived stops, position-size caps, deployment caps, exit timing, and per-engine activity windows are written down and bounded. The temptation to widen one of them in response to a recent loss is met with the same arithmetic: a short-window improvement that loosens the long-window rule will compound the wrong way.
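
The first two rules can be sketched as a pre-release check. This is a minimal illustration, not the pipeline's actual code: the function names, the use of SHA-256 to pin input files, the file names, and the tolerance value are all assumptions made for the example.

```python
import hashlib
import math


def digest(data: bytes) -> str:
    """SHA-256 fingerprint of a pinned input file's contents."""
    return hashlib.sha256(data).hexdigest()


def inputs_unchanged(pinned: dict[str, str], current: dict[str, bytes]) -> bool:
    """True if every pinned file is present and hashes to its pinned digest."""
    return all(
        name in current and digest(current[name]) == expected
        for name, expected in pinned.items()
    )


def nav_matches_baseline(baseline_nav: float, candidate_nav: float,
                         rel_tol: float = 1e-6) -> bool:
    """True if the candidate headline NAV reproduces the baseline within tolerance."""
    return math.isclose(baseline_nav, candidate_nav, rel_tol=rel_tol)


def release_allowed(pinned: dict[str, str], current: dict[str, bytes],
                    baseline_nav: float, candidate_nav: float) -> bool:
    """A refactor passes only if inputs are untouched AND the headline NAV reproduces."""
    return inputs_unchanged(pinned, current) and nav_matches_baseline(
        baseline_nav, candidate_nav
    )


# Hypothetical pinned signal pool; the contents are placeholder bytes.
pool = {"signals_2023.csv": b"placeholder signal rows"}
pinned_digests = {name: digest(data) for name, data in pool.items()}

# Same inputs, same NAV: allowed. Same inputs, drifted NAV: blocked.
print(release_allowed(pinned_digests, pool, 6_297_913.0, 6_297_913.0))
print(release_allowed(pinned_digests, pool, 6_297_913.0, 5_340_000.0))
```

The design choice worth noting is that the check compares the headline NAV on the full locked window rather than unit-level outputs, so a change that preserves trade counts and scores but shifts entry timing is still caught.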

What we did about this one

The original signal pools from before the refactor were intact in the archive. We pinned them as the v7 baseline, re-ran the four-engine stack against them, and verified that the headline numbers reproduced to the dollar. The regression is closed. The procedure that produced it has been changed so it cannot happen silently again: a baseline verification step now runs before any release that touches the scanner.

Why this is in the public research

Because it is the kind of mistake every systematic manager makes at some point in operating a backtest. Most do not say so. The right response to it is not to bury the episode — it is to write down the math, change the procedure, and publish both. The bot is paper-traded in public for exactly this reason. The record of how we operate, including the corrections, is the only honest track we have.

— Shishin Research