Why we deleted A-grade signals

For the first eighteen months of the intraday sleeve (the broader shape of which is covered in “how we trade levels”) we ran a graded confluence model with three tiers: A, B, and C. Each tier was a combination of signals that had to align at a level: multi-timeframe agreement, volume profile, market-structure context, funding posture. A meant “all of them agreed strongly”. B meant “most of them, with one mild dissent”. C was exploratory and never traded live.
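To make the tiering concrete, here is a minimal sketch of a grader along those lines. The text gives only the verbal definitions, so everything else is an assumption for illustration: the `Reading` levels, the signal names used as dictionary keys, the reading of “most of them” as “exactly one mild dissent”, and the choice to treat every other combination as C.

```python
from enum import Enum

class Reading(Enum):
    # Hypothetical per-signal readings; the real model's scale is not given.
    AGREE_STRONG = "agree_strong"
    AGREE_WEAK = "agree_weak"
    DISSENT_MILD = "dissent_mild"
    DISSENT_STRONG = "dissent_strong"

DISSENTS = {Reading.DISSENT_MILD, Reading.DISSENT_STRONG}

def grade(readings):
    """Map per-signal readings at a level to a tier: 'A', 'B', or 'C'."""
    values = list(readings.values())
    if not values:
        raise ValueError("need at least one signal reading")
    # A: every signal agrees strongly.
    if all(r is Reading.AGREE_STRONG for r in values):
        return "A"
    # B (assumed): exactly one dissent, and that dissent is mild.
    dissents = [r for r in values if r in DISSENTS]
    if len(dissents) == 1 and dissents[0] is Reading.DISSENT_MILD:
        return "B"
    # Everything else is lumped into the exploratory tier in this sketch.
    return "C"

example = {
    "multi_timeframe": Reading.AGREE_STRONG,
    "volume_profile": Reading.AGREE_STRONG,
    "market_structure": Reading.AGREE_STRONG,
    "funding_posture": Reading.DISSENT_MILD,
}
print(grade(example))  # one mild dissent, so this sketch returns "B"
```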

The intuition was hard to argue with. A grade is rarer, cleaner, higher-conviction; you’d expect a higher win rate, a fatter average R, and quieter drawdown. We expected exactly that going in.

For three quarters we ran A and B side by side, both live, both at small size. Same instrument set, same execution stack, same exit logic. Only the entry filter differed. By the end of the third quarter the spreadsheet looked like this:

Metric          A grade   B grade
Trades / 90d    ~14       ~78
Win rate        52.9%     54.8%
Avg R / trade   +0.21     +0.18
Stddev R        0.94      0.71
90d total R     +2.9      +14.0
Max drawdown    −3.4 R    −2.1 R

A had a slightly fatter average R and a slightly worse win rate, but the unsettling number was the standard deviation. A’s per-trade standard deviation was about 32% wider than B’s (0.94 against 0.71), on 18% of the cadence. When you compound that over a quarter, A is louder, slower, and ends up with the worse Sharpe of the two. We deleted A from the live book the quarter after.
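The arithmetic behind that comparison can be checked directly from the table. The sketch below is a rough reconstruction, not the sleeve's own Sharpe calculation: it takes average R over standard deviation of R as a per-trade Sharpe and scales it by the square root of the trade count, which assumes trades are independent and identically distributed.

```python
import math

# Figures copied from the table (per 90-day quarter).
grades = {
    "A": {"trades": 14, "avg_r": 0.21, "std_r": 0.94},
    "B": {"trades": 78, "avg_r": 0.18, "std_r": 0.71},
}

def per_trade_sharpe(g):
    """Average R divided by the standard deviation of R, per trade."""
    return g["avg_r"] / g["std_r"]

def quarter_sharpe(g):
    """Per-trade Sharpe scaled by sqrt(trade count).

    Assumption: trades are independent, which flatters any tier whose
    entries are actually correlated.
    """
    return per_trade_sharpe(g) * math.sqrt(g["trades"])

std_ratio = grades["A"]["std_r"] / grades["B"]["std_r"]   # about 1.32
cadence = grades["A"]["trades"] / grades["B"]["trades"]   # about 0.18

for name, g in grades.items():
    total_r = g["trades"] * g["avg_r"]
    print(name, round(total_r, 1),
          round(per_trade_sharpe(g), 2), round(quarter_sharpe(g), 2))
```

Under that independence assumption B comes out ahead on both measures: its per-trade Sharpe is modestly higher, and its quarterly figure is well over twice A's because the trade count enters through the square root.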

The first explanation we considered was statistical power, and we set it aside as the main cause. A grade gives you ~14 trades a quarter, which is not enough to size with confidence. The Sharpe-weighted sizer needs a few dozen observations before its weights stabilise. On A alone, sizing was always lagging the regime by two to three weeks; on B with ~78 trades the sizer was usually within a week of optimal. That much we expected; it does not, on its own, explain the variance gap.
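One way to see how thin 14 trades is: put a confidence interval around each tier's average R. This is a back-of-envelope sketch using the table's figures and a normal approximation (an assumption, and a generous one at 14 observations); it says nothing about how the sizer itself estimates anything.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal

def mean_r_interval(avg_r, std_r, n_trades, z=Z_95):
    """Approximate confidence interval for the true average R per trade."""
    standard_error = std_r / math.sqrt(n_trades)
    return avg_r - z * standard_error, avg_r + z * standard_error

a_low, a_high = mean_r_interval(0.21, 0.94, 14)
b_low, b_high = mean_r_interval(0.18, 0.71, 78)

print(f"A: [{a_low:+.2f}, {a_high:+.2f}]")
print(f"B: [{b_low:+.2f}, {b_high:+.2f}]")
```

On one quarter of data, A's interval straddles zero while B's sits just above it. A's nominally fatter average R is not distinguishable from no edge at that sample size.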

The second explanation does. A signals cluster. When the macro tape lines up, you get three A entries on correlated names within a few hours. The portfolio thinks it has three independent edges, but it actually has one edge expressed three times. Drawdowns on A were always sharper than on B for this reason: when the shared driver moved against you, all three positions lost at once. B trades fire more evenly across a wider observation window and don’t bunch the same way.
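The “one edge expressed three times” effect has a standard closed form for equally sized positions with a common pairwise correlation. The sketch below uses it; the correlation of 0.8 is a hypothetical value chosen for illustration, not a measured one.

```python
def effective_bets(n_positions, avg_correlation):
    """Independent-bet equivalent of n equal positions sharing one
    average pairwise correlation."""
    return n_positions / (1.0 + (n_positions - 1) * avg_correlation)

def cluster_std(n_positions, per_trade_std, avg_correlation):
    """Standard deviation of the summed R across the cluster."""
    variance = (n_positions * per_trade_std ** 2
                * (1.0 + (n_positions - 1) * avg_correlation))
    return variance ** 0.5

RHO = 0.8  # hypothetical correlation between clustered A entries

print(round(effective_bets(3, RHO), 2))      # roughly 1.15 independent bets
print(round(cluster_std(3, 0.94, 0.0), 2))   # three truly independent trades
print(round(cluster_std(3, 0.94, RHO), 2))   # the same three, clustered
```

With three positions at that correlation, the book holds barely more than one independent bet, and the combined standard deviation is noticeably wider than it would be for three unrelated trades of the same size.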

The third explanation is the one that took us a year to admit to ourselves: the information edge lives in the disagreement. A is “everything agrees”, which is by definition the consensus reading. B is “most things agree, with one mild dissent” — and the dissent is the part of the trade that’s not yet priced into the consensus. B trades that work do so because the dissent was the truth and the consensus was the crowd. B has more upside conditional on being right, because being right on B means the crowd was wrong. A trades that work do so because the consensus was right, which is fine, except that the consensus already moved the price before we got there.

We left the grading machinery in. We just stopped trading A. The grader still computes A; we use it as a suppression signal — when an A grade fires, B entries on that instrument are skipped for the next four hours, because the consensus has already moved. That suppression is worth roughly +0.04 R per skipped trade, the difference between B alone and B-with-A-suppression on the same data. It’s a reminder that a discarded model can still earn its keep as a filter on a different model.
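The suppression rule is simple enough to sketch. The following is an illustrative version only: the class name, method names, and instrument symbol are hypothetical, and the handling of the exact four-hour boundary (a B signal at exactly four hours is allowed through) is an assumption the text does not settle.

```python
from datetime import datetime, timedelta

SUPPRESSION_WINDOW = timedelta(hours=4)  # window length taken from the text

class ASuppressor:
    """Tracks the latest A-grade time per instrument and gates B entries."""

    def __init__(self, window=SUPPRESSION_WINDOW):
        self.window = window
        self.last_a = {}  # instrument -> timestamp of most recent A grade

    def should_trade(self, instrument, grade, ts):
        """Return True only for B signals outside an A suppression window."""
        if grade == "A":
            # A is never traded; it only starts (or restarts) the window.
            self.last_a[instrument] = ts
            return False
        if grade == "B":
            last = self.last_a.get(instrument)
            # Assumed boundary: suppressed strictly inside the window.
            return last is None or ts - last >= self.window
        return False  # C is exploratory and never traded live

gate = ASuppressor()
t0 = datetime(2024, 1, 2, 9, 0)
gate.should_trade("XYZ", "A", t0)                              # starts window
skipped = gate.should_trade("XYZ", "B", t0 + timedelta(hours=1))
taken = gate.should_trade("XYZ", "B", t0 + timedelta(hours=5))
print(skipped, taken)
```

Keeping the window per instrument matches the description above, where an A grade suppresses B entries “on that instrument” and leaves B signals elsewhere untouched.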

The size cap moved too. With A out of the book, the per-trade size cap came down from 1.0 R to 0.6 R, because B fires more often and the sleeve’s notional has to be bounded across a cluster of correlated B trades, not just per trade.

The general lesson is uncomfortable. If you’re running a model where you can grade your own signals (A through D, 1 through 5, however you like), the strongest tier is probably the wrong place to live. The strongest tier is the consensus, and the consensus is by definition the part of the move that’s already happened. The next tier down is where the disagreement lives, and the disagreement is where the edge lives.

The same logic almost certainly applies to the long sleeve, but on a different timescale. The macro-regime model has its own analogue of A grade — moments where every input agrees on a regime. Those moments are also when the consensus is most priced. We are studying whether the sleeve allocator should fade unanimous regime reads instead of following them. Early backtest is encouraging. More on that when we have a quarter of forward data.

— inite team