Backtesting practice — a staged validation process

Jarosław Wasiński · Published: May 17, 2026 · Updated: May 28, 2026 · 7 min read

Last verified: May 28, 2026 · Long-term evergreen content

Educational purposes only — not investment advice. 74–89% of retail accounts lose money.

Short answer

Honest backtesting is not a single test but a multi-stage process in which each step is a sieve with smaller holes. First, write entry, exit, stop-loss, take-profit, and position-size rules precisely enough that a second human would place identical trades. Second, choose the history: ten years minimum for daily strategies, five for M30 and M15, and two years of real tick data for scalping, so that the sample spans different regimes (trend, consolidation, volatility shocks, rate cycles). Third, split the data chronologically into roughly seventy to seventy-five percent in-sample (IS) for parameter optimisation and the remaining twenty-five to thirty percent out-of-sample (OOS), left untouched until tuning is complete. Fourth, run walk-forward analysis: five to seven iterations with a rolling window, where a walk-forward efficiency (WFE, annualised OOS return divided by annualised IS return) between 0.5 and 0.75 signals a strategy worth taking forward. Fifth, forward test on a demo account for three to six months with parameters frozen. Sixth, trade a micro-lot live, one-tenth of the target size, for another three to six months. A strategy returning twelve percent annualised at every stage is genuinely better than one returning thirty on IS and eight on OOS: consistency matters, not the peak.

Almost everyone serious about a strategy runs a backtest, but few follow an honest, staged validation process, and that process is what separates the account that survives its first year from the one wiped out in month three. This piece walks through the full pipeline: writing rules cleanly, choosing the history window, splitting the data, walk-forward analysis, demo, and the first micro-lot live. It is not a guide to one test but to the discipline of taking a strategy from idea to live account.

Why a single backtest is not a process

A single backtest produces a number you cannot interpret. The strategy earned thirty percent, but did it capture a real edge, or did you run three hundred parameter combinations until one fitted the historical noise? The number alone cannot answer that. A process can, by gradually removing the trader's ability to learn from the data on which the strategy is finally judged. Each stage is a sieve with smaller holes. Running one clean test is covered in my piece on how to backtest a strategy; here I focus on the discipline that starts long before the first click of Start.

Rules on paper, no interpretation

A strategy that cannot be translated into code or a precise sheet of rules is not ready to be tested. Write entry, exit, stop-loss, take-profit, and position size precisely enough that a second human reading the document would place identical trades. "I buy when I see a trend" is an impression, not a rule. A testable version: "buy at the close of a daily candle when EMA(50) is above EMA(200), RSI(14) below seventy, price touches the twenty-period average from above; stop loss 1.5 ATR(14) below entry; target 2.5 ATR above; risk one percent". The discipline filters out most ideas before the test runs.
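One way to check that the example rule really leaves no room for interpretation is to write it as code. The sketch below is a minimal illustration, not the author's implementation: it assumes the indicator values (EMA(50), EMA(200), RSI(14), ATR(14) and the twenty-period average) are already computed, and it reads "touches the twenty-period average from above" as the candle's low reaching the average while the close stays above it. That reading, and the names `LongRules`, `long_entry` and `order_plan`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LongRules:
    """Every parameter of the example rule, fixed in writing."""
    rsi_max: float = 70.0         # RSI(14) must be below this
    stop_atr: float = 1.5         # stop-loss distance in ATR(14) units
    target_atr: float = 2.5       # take-profit distance in ATR(14) units
    risk_fraction: float = 0.01   # one percent of equity at risk per trade


def long_entry(ema50: float, ema200: float, rsi14: float,
               low: float, close: float, ma20: float,
               rules: LongRules) -> bool:
    """True if the daily candle meets every entry condition."""
    trend_up = ema50 > ema200
    not_overbought = rsi14 < rules.rsi_max
    # Assumption: "touches from above" = low reaches the average, close stays above it.
    touched_from_above = low <= ma20 < close
    return trend_up and not_overbought and touched_from_above


def order_plan(entry: float, atr14: float, equity: float,
               rules: LongRules) -> tuple[float, float, float]:
    """Return (stop, target, units) so that a stop-out loses risk_fraction of equity."""
    stop = entry - rules.stop_atr * atr14
    target = entry + rules.target_atr * atr14
    units = equity * rules.risk_fraction / (entry - stop)
    return stop, target, units
```

Two people given the same `LongRules` and the same candle data get the same decision and the same order size, which is exactly the test the written rule has to pass.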

The data window must span more than one regime

The second filter is history. My rule: ten years minimum for daily strategies, five for M30 and M15, and two for sub-M15 scalping, using real tick data rather than synthetic broker history. The last decade included long DXY trends (2014–2017), the volatility shock of March 2020, the rate-hike cycle (2022–2023), and the consolidation of 2024. A strategy that works in only one of those worlds is an illusion fitted to one epoch. Fewer than one hundred trades over five years is too small a sample: one hundred is the floor, and professionals aim for three hundred.
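Why one hundred trades is a floor rather than a comfortable number can be illustrated with a standard estimate the article itself does not use: the approximate 95% confidence interval of a win rate. The sketch below relies on the normal approximation to the binomial and assumes trades are independent, which real trades often are not, so the true uncertainty is likely wider.

```python
import math


def win_rate_margin(win_rate: float, n_trades: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a win rate.

    Normal approximation to the binomial; assumes independent trades,
    so treat the result as an optimistic lower bound on uncertainty.
    """
    return z * math.sqrt(win_rate * (1.0 - win_rate) / n_trades)


for n in (30, 100, 300):
    print(n, "trades: +/-", round(100 * win_rate_margin(0.55, n), 1), "percentage points")
```

With a fifty-five percent win rate the margin is roughly ten percentage points at one hundred trades and about six at three hundred, so a backtest with fewer trades can barely tell a real edge from a coin flip.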

Splitting the data and protecting the out-of-sample block

The third stage takes data away from you. Split history chronologically into roughly seventy to seventy-five percent IS for optimisation and the remaining twenty-five to thirty percent OOS, untouched until optimisation is complete. OOS is the honesty sieve: it shows whether parameters chosen on training data have value beyond it. If IS gives an eighty percent win rate and a profit factor of 2.4, and the same parameters give fifty percent and 1.1 on OOS, you have just caught yourself curve-fitting. Twelve percent annualised on IS and eleven on OOS is genuinely better than thirty on IS and eight on OOS. Consistency, not the peak.
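The split itself is simple, but it must be chronological: shuffling bars or trades before splitting would let information from the OOS period leak into optimisation. A minimal sketch, with a hypothetical helper name, applied to yearly blocks:

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def chronological_split(bars: Sequence[T],
                        is_fraction: float = 0.7) -> tuple[list[T], list[T]]:
    """Split time-ordered data into an in-sample and an out-of-sample block.

    No shuffling: the OOS block is always the most recent slice of history,
    so nothing from the future leaks into parameter optimisation.
    """
    if not 0.0 < is_fraction < 1.0:
        raise ValueError("is_fraction must be strictly between 0 and 1")
    cut = round(len(bars) * is_fraction)
    return list(bars[:cut]), list(bars[cut:])


in_sample, out_of_sample = chronological_split(list(range(2014, 2024)))
print(in_sample)      # the optimisation years
print(out_of_sample)  # locked away until tuning is finished
```

Applied to the years 2014 to 2023 with a seventy percent fraction, this yields IS 2014–2020 and OOS 2021–2023, the same split used in the breakout example later in this piece.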

Walk-forward as the finest sieve

A single IS/OOS split gives one number. Walk-forward repeats it five to seven times: the first window uses IS 2018–2021 and OOS 2022; the second shifts IS to 2019–2022 and OOS to 2023; and so on. For each window you re-optimise, freeze the winning parameters, test on OOS, record the result, and shift the window. After five to seven cycles the average OOS result is the most honest proxy for what a live account will produce. Walk-forward efficiency (WFE), the annualised OOS return divided by the annualised IS return, in the 0.5–0.75 range signals a strategy worth taking forward; below 0.3 it is a curve-fit confession. The mechanics, and the difference between rolling and anchored variants, are covered in my piece on walk-forward analysis; for wider context, see the traders' workshop on ForexMechanics.
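The window bookkeeping and the WFE average can be sketched in a few lines. This is an illustration under stated assumptions, not a full walk-forward engine: windows are whole calendar years, the optimisation step is left out, and WFE is computed per window as annualised OOS return over annualised IS return, then averaged. The function names are hypothetical.

```python
def rolling_windows(first_year: int, last_year: int,
                    is_years: int = 4, oos_years: int = 1) -> list[tuple[range, range]]:
    """Return (in-sample years, out-of-sample years) pairs, rolling forward by oos_years."""
    windows = []
    start = first_year
    while start + is_years + oos_years - 1 <= last_year:
        in_sample = range(start, start + is_years)
        out_of_sample = range(start + is_years, start + is_years + oos_years)
        windows.append((in_sample, out_of_sample))
        start += oos_years  # rolling: the IS start moves; an anchored variant would keep it fixed
    return windows


def walk_forward_efficiency(is_annualised: list[float],
                            oos_annualised: list[float]) -> float:
    """Mean over windows of annualised OOS return / annualised IS return."""
    if len(is_annualised) != len(oos_annualised) or not is_annualised:
        raise ValueError("need one IS and one OOS figure per window")
    if any(r <= 0 for r in is_annualised):
        raise ValueError("WFE is only meaningful when every IS return is positive")
    ratios = [oos / ins for ins, oos in zip(is_annualised, oos_annualised)]
    return sum(ratios) / len(ratios)


for is_span, oos_span in rolling_windows(2018, 2023):
    print(f"IS {is_span[0]}-{is_span[-1]}  OOS {oos_span[0]}")
```

With data only up to 2023 this produces just the two windows named in the text; five to seven iterations need a longer history or shorter windows. For hypothetical annualised returns of 20% and 25% on IS and 14% and 15% on OOS, the average WFE is 0.65, inside the 0.5–0.75 band.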

"The whole purpose of walk-forward analysis is to reveal the real-time, real-money performance of a trading strategy without actually trading it with real money in real time." — Robert Pardo, The Evaluation and Optimization of Trading Strategies, Wiley, 2008

Demo and micro-lot live — where the strategy meets reality

A strategy that has cleared walk-forward is ready for demo, not real money. Three to six months of forward testing with parameters frozen is the first test of real-time behaviour: live spreads, actual macro releases, the Sunday-evening gap, real trading-hour liquidity. Demo exposes what a backtest never shows: a strategy that looks easy to trade on history can prove hard to execute because its signals appear while you are asleep. These are not data problems but problems of you and your market.

After forward testing you do not jump to full size. You trade a micro-lot, one-tenth of the target size, for three to six months on real money. The point is informational: how live execution differs from demo, what the real slippage is, how the broker behaves around NFP, and how you react to genuine if modest losses. The results table across stages has four columns: IS, OOS, demo, micro-lot. The closer the figures sit, the lower the risk that you are living in an illusion. A wider divergence is the signal to step back, not scale up.
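Measuring "the real slippage" only needs the price each order was expected to fill at and the price it actually filled at. The sketch below assumes a four-decimal pair such as EUR/USD (one pip = 0.0001) and hypothetical fill records of the form (side, expected price, filled price); the sign convention makes a positive number always mean a worse fill than planned.

```python
PIP = 0.0001  # assumption: four-decimal quote such as EUR/USD; use 0.01 for JPY pairs


def slippage_pips(side: str, expected: float, filled: float, pip: float = PIP) -> float:
    """Adverse slippage in pips: positive means the fill was worse than expected."""
    if side == "buy":
        return (filled - expected) / pip
    if side == "sell":
        return (expected - filled) / pip
    raise ValueError("side must be 'buy' or 'sell'")


def average_slippage(fills: list[tuple[str, float, float]]) -> float:
    """Mean adverse slippage over (side, expected_price, filled_price) records."""
    if not fills:
        raise ValueError("no fills recorded")
    return sum(slippage_pips(s, e, f) for s, e, f in fills) / len(fills)


demo_fills = [("buy", 1.10000, 1.10005), ("sell", 1.10000, 1.09995)]  # hypothetical
print(round(average_slippage(demo_fills), 2), "pips")
```

Running the same calculation on demo fills and on micro-lot fills, and comparing both with the slippage assumed in the backtest, turns the informational goal of this stage into a number you can put in the results table.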

Illustrative example — full pipeline for a breakout strategy

A breakout strategy on EUR/USD, M30: entry on a break of the highest high of twenty candles, exit on a break of the lowest low of ten. History 2014–2023, IS 2014–2020, OOS 2021–2023. Optimisation on IS: twenty-three candles for the high, eleven for the low, stop loss 1.4 ATR; win rate fifty-eight percent, profit factor 1.72, twenty-two percent annualised. On OOS: fifty-four percent win rate, profit factor 1.51, eighteen percent annualised. Five walk-forward iterations give an average WFE of 0.71. Four months of demo show half a pip more slippage than assumed; win rate and profit factor stay in range. Micro-lot live from January gives a fifty-one percent win rate after three months: below the backtest, still net positive. The third quarter brings the decision: scale up, or wait if volatility drifts from the norm. Figures are illustrative.
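The entry and exit logic of this breakout example is a classic price-channel rule and can be sketched as a position series. Several details here are assumptions the example leaves open: the strategy is long-only, a "break" means a candle closing beyond the channel, and the channel is built only from earlier candles so the test cannot peek at the current bar's own high or low. The ATR stop from the example is omitted to keep the sketch short.

```python
def breakout_positions(highs: list[float], lows: list[float], closes: list[float],
                       entry_len: int = 20, exit_len: int = 10) -> list[int]:
    """Return 1 (long) or 0 (flat) for each candle of a long-only channel breakout.

    Enter when the close breaks above the highest high of the previous
    entry_len candles; exit when it breaks below the lowest low of the
    previous exit_len candles. Only prior candles form the channel.
    """
    position = 0
    positions = []
    for i in range(len(closes)):
        if position == 0 and i >= entry_len:
            if closes[i] > max(highs[i - entry_len:i]):
                position = 1
        elif position == 1 and i >= exit_len:
            if closes[i] < min(lows[i - exit_len:i]):
                position = 0
        positions.append(position)
    return positions
```

In the example, IS optimisation would search over `entry_len` and `exit_len` (ending at twenty-three and eleven), and each walk-forward window would repeat that search on its own IS slice.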

What to do tomorrow

Write the strategy rules into a plain text file precisely enough that another human reading it would place identical trades — no interpretation, no "I feel the trend", with exact parameter values, a stop-loss formula, and position size as a percentage of account equity.
Download historical data for the pair you actually trade — ten years minimum for daily charts, five for M30 and M15, two years of genuine tick data for scalping; verify the sample spans different regimes, including trend, consolidation, volatility shocks, and rate cycles.
Split the data into seventy percent in-sample and twenty-five to thirty percent out-of-sample, do not touch OOS until IS optimisation is complete, then run five to seven walk-forward iterations; if WFE lands below 0.5 or parameters jump more than fifty percent between iterations, simplify the logic.
For a strategy that has cleared walk-forward, run three to six months of forward testing on demo with parameters frozen, then three to six months of micro-lot live; scale to target size only when the four result sets are consistent, and run a Monte Carlo simulation alongside.
Read results through the lens of consistency, not the highest return: twelve percent annualised at every stage is better than thirty on IS and eight on OOS, because consistency decides whether the account survives its first year; for the edge this process tests, see discovering a trading edge.
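The parameter-stability check from the third step can be automated once each walk-forward window's winning parameters are recorded. The article does not define how a "jump" is measured; this sketch assumes it is the relative change from one window to the next, and the parameter names and values are hypothetical.

```python
def parameter_jumps(per_window: list[dict[str, float]],
                    max_jump: float = 0.5) -> list[tuple[int, str, float]]:
    """Flag parameters whose optimised value moves by more than max_jump
    (relative to the previous window) between consecutive walk-forward windows."""
    flags = []
    for i in range(1, len(per_window)):
        prev, curr = per_window[i - 1], per_window[i]
        for name, old in prev.items():
            if old == 0:
                continue  # relative change undefined; inspect such parameters by hand
            change = abs(curr[name] - old) / abs(old)
            if change > max_jump:
                flags.append((i, name, change))
    return flags


windows = [
    {"entry_len": 23, "exit_len": 11, "stop_atr": 1.4},
    {"entry_len": 25, "exit_len": 12, "stop_atr": 1.5},
    {"entry_len": 40, "exit_len": 11, "stop_atr": 1.4},
]
for window, name, change in parameter_jumps(windows):
    print(f"window {window}: {name} moved {change:.0%}, consider simplifying the logic")
```

Here the entry channel moving from 25 to 40 candles is a sixty percent jump and gets flagged, while the small drifts in the other parameters pass.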

About the author

Jarosław Wasiński

Editor-in-chief at MyBank.pl · Financial and market analyst

Independent analyst and practitioner with 20+ years in finance. Founder and editor-in-chief of MyBank.pl, running since 2004. Fundamental analysis of FX and macro markets since 2007.

Sources & bibliography

Robert Pardo · The Evaluation and Optimization of Trading Strategies · classic textbook on evaluating trading systems and walk-forward methodology · onlinelibrary.wiley.com
MetaQuotes · MetaTrader 5 Help: Strategy Tester · official MT5 documentation on the Strategy Tester, forward testing, and parameter optimisation · www.metatrader5.com
MetaQuotes · MetaTrader 4 Help: Strategy Testing · description of the MT4 Strategy Tester: run settings, tick models, report interpretation · www.metatrader4.com
Backtrader · Backtrader documentation: Introduction · introduction to the open-source Python backtesting engine used by quants · www.backtrader.com
TradingView · Pine Script v6: Welcome · official documentation for Pine Script and the TradingView Strategy Tester · www.tradingview.com

Frequently asked

How does this multi-stage process differ from a single backtest?

A single backtest yields one number and one interpretation. It shows whether the strategy was historically profitable but silently assumes the optimisation process did not learn the noise. A multi-stage process turns that one number into a sequence of sieves. First, written rules filter out unverifiable ideas. Next, a long history window filters out strategies that work only in one regime. Then the in-sample versus out-of-sample split filters out parameters that work only on training data. Walk-forward filters out parameters that work only in a single, possibly lucky, out-of-sample window. Demo filters out strategies that cannot be executed on live spreads. Micro-lot live filters out traders who do not hold up psychologically. After all six sieves only a fraction of the original strategies remains, but that is the fraction with a real chance of surviving the first year. A single backtest does not enforce that selection, which is why most retail accounts lose money even though many of their owners ran some kind of historical test.

How long does the full process from idea to micro-lot live take?

A realistic schedule runs from nine to fifteen months from first written rules to scaling the position to target size. The first two weeks go on precise rule-writing and downloading historical data. The next two to four weeks cover in-sample optimisation and the first out-of-sample validation; if it fails, you return to the rules rather than hunt for a better test. Walk-forward with five to seven iterations adds another month because every window requires its own optimisation. Three to six months of forward testing on a demo account follow. After that another three to six months of micro-lot live trading. Only after comparing the four result sets and confirming their consistency can you scale up to the target position. Cutting the schedule below nine months means skipping one of the sieves, and every skipped sieve moves the risk from the validation phase to the live phase, where it costs real money.

Which metrics should I track across stages to catch inconsistency?

The table maintained across the process should have four result columns (in-sample, out-of-sample, demo, micro-lot live) and at least four metric rows. The first is win rate as a percentage: a gap of more than ten percentage points between stages signals inconsistency. The second is profit factor, gross profit divided by gross loss: a gap of more than 0.3 between stages is a warning sign. The third is the average reward-to-risk ratio: a gap of more than 0.5 R suggests the stop-loss behaves differently from the test. The fourth is maximum drawdown, which almost always grows from stage to stage; a rise of more than fifty percent between two adjacent stages means the strategy is meeting conditions absent from the test. The fifth, optional, is average slippage in pips: the difference between backtest and demo, and between demo and micro-lot, tells you whether the broker behaves in line with assumptions. Inconsistency in any of these metrics is the cue to step back one stage and understand the source rather than scale the position.
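The thresholds above translate directly into an automated check on the four-column table. This is a sketch under one stated assumption: "a rise above fifty percent" in maximum drawdown is read as relative growth (for example 10% to more than 15%). The metric keys, the helper name and the sample figures are hypothetical; only the thresholds come from the answer above.

```python
STAGES = ["in-sample", "out-of-sample", "demo", "micro-lot"]

# Absolute gaps between adjacent stages that count as a warning.
GAP_LIMITS = {
    "win_rate_pct": 10.0,   # percentage points
    "profit_factor": 0.3,   # gross profit / gross loss
    "avg_r": 0.5,           # average reward-to-risk, in R
}


def inconsistencies(table: dict[str, dict[str, float]]) -> list[str]:
    """Compare each pair of adjacent stages and list every breached threshold."""
    warnings = []
    for prev, curr in zip(STAGES, STAGES[1:]):
        for metric, limit in GAP_LIMITS.items():
            gap = abs(table[curr][metric] - table[prev][metric])
            if gap > limit:
                warnings.append(f"{metric}: {prev} -> {curr} gap {gap:.2f} exceeds {limit}")
        # Assumption: "a rise above fifty percent" means relative growth of max drawdown.
        if table[curr]["max_dd_pct"] > 1.5 * table[prev]["max_dd_pct"]:
            warnings.append(f"max_dd_pct: {prev} -> {curr} rose by more than 50%")
    return warnings


results = {  # hypothetical figures
    "in-sample":     {"win_rate_pct": 58, "profit_factor": 1.72, "avg_r": 1.2, "max_dd_pct": 8},
    "out-of-sample": {"win_rate_pct": 54, "profit_factor": 1.51, "avg_r": 1.1, "max_dd_pct": 10},
    "demo":          {"win_rate_pct": 53, "profit_factor": 1.45, "avg_r": 1.0, "max_dd_pct": 11},
    "micro-lot":     {"win_rate_pct": 51, "profit_factor": 1.05, "avg_r": 0.9, "max_dd_pct": 18},
}
for line in inconsistencies(results):
    print(line)
```

With these made-up numbers the first three stages agree, but the step from demo to micro-lot breaches both the profit-factor and the drawdown thresholds: the cue to step back rather than scale up.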

Does clearing the full process guarantee that the strategy will earn live?

No. Each stage raises the probability that the strategy has a real edge, but no set of historical or forward tests removes the fundamental risk: next quarter's market may be unlike anything you saw in the data. The whole process silently assumes that the regime in the out-of-sample windows and in demo will be similar enough to the live regime. If the strategy learned the 2018–2023 market, with two volatility shocks and two rate cycles, and trades from 2024 in a long range with low volatility and fewer market-moving releases, micro-lot live may show results far from the backtest. That is why the discipline lies not in a clean run through the sieves alone but in holding the micro-lot long enough to compare the real-money result against the three earlier stages. The complement is a Monte Carlo simulation that randomly reorders the trade sequence and reveals the distribution of possible equity curves, an estimate of the worst reasonable scenario that a backtest alone never exposes.
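The trade-reordering Monte Carlo described above can be sketched with the standard library alone. Assumptions: each trade is recorded as a fractional return on equity that compounds, and the simulation shuffles the existing trades without replacement (bootstrapping with replacement is a common alternative). Shuffling leaves the final return unchanged but changes the path, which is exactly what the maximum-drawdown distribution captures. The function names and the sample trades are hypothetical.

```python
import random


def max_drawdown(trade_returns: list[float], start_equity: float = 1.0) -> float:
    """Largest peak-to-trough fall of the compounded equity curve, as a fraction."""
    equity = peak = start_equity
    worst = 0.0
    for r in trade_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


def monte_carlo_drawdowns(trade_returns: list[float], runs: int = 5000,
                          seed: int = 42) -> list[float]:
    """Max drawdown of each random reordering of the same trades, sorted ascending."""
    rng = random.Random(seed)
    shuffled = list(trade_returns)
    results = []
    for _ in range(runs):
        rng.shuffle(shuffled)
        results.append(max_drawdown(shuffled))
    return sorted(results)


def percentile(sorted_values: list[float], q: float) -> float:
    """Simple nearest-rank percentile of an already sorted list."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]


if __name__ == "__main__":
    trades = [0.012] * 60 + [-0.01] * 45  # hypothetical: 60 winners, 45 losers
    dds = monte_carlo_drawdowns(trades)
    print(f"backtest order: {max_drawdown(trades):.1%}")
    print(f"95th percentile of reshuffles: {percentile(dds, 0.95):.1%}")
```

Here the backtest order books every winner first, so its drawdown is the losing streak at the end, while the 95th percentile of reshuffles shows how bad a more typical ordering of the same trades can get. Sizing the position to survive that figure, rather than the single backtest drawdown, is the practical use of the simulation.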
