Three forex bots, fifteen production bugs, and a year of backtests that taught me one hard lesson: most trading "edges" are lucky streaks wearing a disguise. Here's the whole story — in order — ending with the tool I built to stop fooling myself.
The bot's backtest said PF 1.40. The live account's first 120 trades said PF 0.46. Closing that gap is what this journey has been about.
A friend of mine turned a small deposit into roughly $300k over a few years of careful, mostly-manual forex trading. I wanted to understand what he was doing well enough to automate a version of it on my own laptop — no VPS, no prop firm, no subscription signal service. Just a script that reads the market and takes the trades a disciplined person would.
The first real deployment was a ~4,200-line Python script running against an OANDA live micro-account with $297. Simple rules: EMA crossover with RSI confirmation, fixed percent risk, nothing fancy. It taught me more about spreads, slippage, and session liquidity in two months than any YouTube video.
V3 moved to an OANDA practice account with a sizeable demo balance so I could stress test features that need more room: six strategies, session filters, regime detection, per-pair blocks, ML entry filters, confluence scoring. The codebase grew to ~7,200 lines with a PyQt desktop GUI and rolling logs.
The backtest looked good. The live account didn't.
After 17 days of live trading, V3 had closed 45 trades with a profit factor of 0.54 — a losing system. I paused the bot and replayed every one of those trades in a simulator with the same candles, the same signals, but with a different exit: a wide stop, a break-even shield, and a fixed 2.75R take-profit instead of a tight trailing stop.
Same 45 entries. Different exits. PF 1.60. The edge was always there — the exit was giving it all back.
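To make the replay idea concrete, here is a minimal sketch of an exit simulator and a profit-factor calculation. It is not the bot's actual simulator: the function names, the pips-from-entry path representation, and the sample paths in the usage note are assumptions made for illustration, and it only handles long trades.

```python
def simulate_exit(excursions_pips, stop_pips, be_trigger_pips, tp_r):
    """Replay one long trade and return its result in R (multiples of initial risk).

    excursions_pips: price path after entry, in pips relative to the entry price.
    """
    stop_level = -float(stop_pips)   # wide initial stop
    tp_level = tp_r * stop_pips      # fixed take-profit, e.g. 2.75R
    for x in excursions_pips:
        if x <= stop_level:
            return stop_level / stop_pips      # -1.0, or 0.0 once the shield is armed
        if x >= tp_level:
            return float(tp_r)
        if x >= be_trigger_pips:
            stop_level = max(stop_level, 0.0)  # break-even shield: stop moves to entry
    return excursions_pips[-1] / stop_pips     # neither exit hit: mark to the last price


def profit_factor(results):
    """Gross profit divided by gross loss; above 1.0 means net profitable."""
    gross_profit = sum(r for r in results if r > 0)
    gross_loss = -sum(r for r in results if r < 0)
    if gross_loss == 0:
        return float("inf") if gross_profit > 0 else 0.0
    return gross_profit / gross_loss
```

With a hypothetical 30-pip stop, a +10 pip shield and a 2.75R target, `simulate_exit([10, 40, 85], 30, 10, 2.75)` returns 2.75, while a path that touches +12 and then falls back through entry returns 0.0 instead of a full loss. Feeding the same entries through two exit configurations and comparing `profit_factor` on each result list is the comparison described above.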
The fix is three phases: a wide stop, a break-even shield, and a fixed 2.75R take-profit.
Position sizing uses 1% flat risk calculated on the wide stop distance — so the dollar risk is constant even as the pip stop gets wider. Backtested the same way, simulated the same way, shipped live.
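The sizing rule is simple arithmetic, sketched below. The function name and the `pip_value_per_unit` parameter are assumptions for illustration; for a USD account trading a USD-quoted pair such as EUR/USD, one pip on one unit is worth about $0.0001, and other pairs need a converted value.

```python
import math


def position_units(balance, risk_fraction, stop_pips, pip_value_per_unit):
    """Units sized so that a full stop-out loses about balance * risk_fraction."""
    if stop_pips <= 0 or pip_value_per_unit <= 0:
        raise ValueError("stop_pips and pip_value_per_unit must be positive")
    risk_amount = balance * risk_fraction
    return math.floor(risk_amount / (stop_pips * pip_value_per_unit))


def dollar_risk(units, stop_pips, pip_value_per_unit):
    """Loss in account currency if the stop is hit."""
    return units * stop_pips * pip_value_per_unit
```

On a hypothetical $10,000 balance at 1% risk, a 20-pip stop sizes to roughly 50,000 units and a 40-pip stop to roughly 25,000 units. The unit count halves as the stop doubles, so the dollar risk stays near $100 either way.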
Rolling the config out wasn't one commit; it was fifteen. Each one fixed a subtle mismatch between what the simulator did and what live execution actually did. Any single mismatch was enough to turn a winning config into a losing one. A sample:
Legacy BE code fired at +3 pips and moved the stop, killing the shield at +10.
An old "take half off at 60% of TP" block was trimming every winner in half.
A favorable-excursion rule was swapping the fixed TP back to the old trailing one on winning trades.
A 10-pip cap on stops was silently narrowing the wide stop back to the old one on volatile pairs.
Kelly-VAPS, ML confidence, and streak multipliers compounded — on losing days, 1% risk became 0.47%.
Config override of {} didn't clear the default — so a stale pair-risk override kept leaking through.
On an $85k balance the 500,000-unit cap halved every trade's real risk to 0.5%, which would have halved expected return.
Pyramiding logic, tick-level proportional BE, two separate MFE-tracking paths, an unscoped local variable, a missing account fallback. The blog post walks through every one.
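The empty-override bug in the list above is worth a closer look, because Python makes it easy to write. The sketch below shows one common way it can happen. It is an assumption, not the bot's actual code, and `DEFAULT_PAIR_RISK` and both function names are hypothetical.

```python
DEFAULT_PAIR_RISK = {"EUR_GBP": 0.005}  # hypothetical stale default


def pair_risk_buggy(override=None):
    # An empty dict is falsy, so an explicit {} falls through to the default.
    return override or DEFAULT_PAIR_RISK


def pair_risk_fixed(override=None):
    # Fall back only when no override was supplied at all.
    return DEFAULT_PAIR_RISK if override is None else override
```

`pair_risk_buggy({})` still returns the stale default, while `pair_risk_fixed({})` returns the empty dict the caller asked for. Testing against `None` rather than truthiness is the usual fix.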
With all 15 bugs fixed and a dry-run passing on $85,733 balance (all 7 pairs sizing at exactly 1.00% risk), V3 went live Sunday evening. The 7-day tab on the Journal page is watching this run in real time.
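A dry-run check of this kind can be sketched as follows: compute the risk a trade actually carries after any unit cap, then flag pairs that miss the target. The function names, the single shared `pip_value_per_unit`, and the numbers in the usage note are assumptions for illustration, not the bot's real dry-run.

```python
def effective_risk_fraction(balance, target_risk, stop_pips, pip_value_per_unit, max_units):
    """Fraction of the balance actually at risk once the unit cap is applied."""
    wanted_units = balance * target_risk / (stop_pips * pip_value_per_unit)
    units = min(wanted_units, max_units)
    return units * stop_pips * pip_value_per_unit / balance


def pairs_off_target(stops_by_pair, balance, target_risk, pip_value_per_unit,
                     max_units, tol=1e-4):
    """Pairs whose effective risk differs from the target by more than tol."""
    return [
        pair
        for pair, stop_pips in stops_by_pair.items()
        if abs(
            effective_risk_fraction(balance, target_risk, stop_pips,
                                    pip_value_per_unit, max_units)
            - target_risk
        ) > tol
    ]
```

On a hypothetical $100,000 balance with a 10-pip stop, 1% risk wants about 1,000,000 units, so a 500,000-unit cap leaves only 0.5% at risk and the pair is flagged. A 40-pip stop wants about 250,000 units, stays under the cap, and passes.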
V3's weakness isn't the strategy anymore — it's the Python GIL and the ~30ms latency between a price tick and a decision. V4 rebuilds the data plane in Rust: async OANDA streaming, lock-free indicator updates, sub-millisecond decide() latency, the same PyQt GUI reading shared state through JSON files and SQLite.
V4 runs in shadow mode first: every signal is logged as if the bot had traded it, but no orders are actually submitted. This lets us validate against live prices for a week before flipping execute_trades = true.
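Shadow mode amounts to one flag in front of the order call. The sketch below is in Python for consistency with the rest of the project, although V4's data plane is Rust. Apart from the `execute_trades` flag, the class and field names are hypothetical, and the `submitted` list stands in for a real broker call.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    pair: str
    side: str
    units: int


@dataclass
class Executor:
    execute_trades: bool = False                    # shadow mode by default
    shadow_log: list = field(default_factory=list)  # every signal, traded or not
    submitted: list = field(default_factory=list)   # stand-in for real orders

    def handle(self, signal):
        self.shadow_log.append(signal)      # always record the decision
        if not self.execute_trades:
            return "shadow"                 # nothing is sent to the broker
        self.submitted.append(signal)       # a real bot would place the order here
        return "submitted"
```

Because the log is written before the flag is checked, the shadow record and the live record come from the same code path, which is what makes the week of shadow data comparable to later live trading.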
This site used to advertise paid signal subscriptions on the strength of a backtest. The live numbers didn't back that up — so instead of selling signals, I published the full trade log. Every trade visible, wins and losses.
Fixing the exit helped, but the gap between backtest and reality kept coming back in new forms. The real problem was bigger than any single bug: I kept finding "edges" that were really just lucky stretches. Mine enough variations of a strategy, keep the best-looking one, and a coin flip starts to look like skill. I was fooling myself — methodically.
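The selection effect is easy to reproduce with a small simulation, sketched below. Every "strategy" is a fair coin flip with no edge, and only the best per-trade Sharpe among the variants is kept. The function names and the choice of +1R/-1R outcomes are assumptions for illustration.

```python
import random
import statistics


def sharpe(returns):
    """Per-trade Sharpe ratio: mean over sample standard deviation."""
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0


def best_of_n_random(n_variants, n_trades, seed=0):
    """Best Sharpe among n_variants strategies that are pure coin flips."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_variants):
        trades = [rng.choice((-1.0, 1.0)) for _ in range(n_trades)]
        best = max(best, sharpe(trades))
    return best
```

Comparing `best_of_n_random(1, 100)` with `best_of_n_random(200, 100)` shows the problem: the best of 200 zero-edge variants reports a clearly positive Sharpe, even though none of them has any skill.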
So I stopped tuning strategies and built a validation framework with one purpose: to reject a strategy, not bless it. It runs a strategy's trades through a row of statistical kill-tests — a Deflated Sharpe that penalizes how many variants you tried, a bootstrap that exposes the real drawdown tail, a cost-stress test, a regime split, a parameter-plateau check.
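As one example of these kill-tests, here is a minimal sketch of a bootstrap drawdown check using only the standard library. It is not The Gauntlet's actual code. It resamples trades independently with replacement, which ignores any serial correlation between trades, and the function names and defaults are assumptions.

```python
import random


def max_drawdown(pnl):
    """Largest peak-to-trough drop of the cumulative P&L curve."""
    equity = peak = worst = 0.0
    for p in pnl:
        equity += p
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


def bootstrap_drawdown(pnl, n_resamples=2000, quantile=0.95, seed=0):
    """Chosen quantile of max drawdown across resampled trade sequences."""
    rng = random.Random(seed)
    draws = sorted(
        max_drawdown(rng.choices(pnl, k=len(pnl))) for _ in range(n_resamples)
    )
    return draws[min(len(draws) - 1, int(quantile * len(draws)))]
```

The single drawdown in a backtest is one draw from this distribution. The 95th-percentile value is usually a better guide to what the account has to survive, and it is typically well above what the historical equity curve showed.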
Then I ran everything I'd ever built through it. Most of it died. That was the point — a backtest that can't fail you can't protect you either.
Tested cleanly — real spreads, no look-ahead, multi-year samples:
The pattern was undeniable: simple price-pattern edges on FX majors don't survive realistic costs. The good stretches were hot streaks that gave themselves back.
Carry (earning the interest-rate differential) was small but real. Trend-following on a diversified basket of indices, metals and bonds was the strongest, most robust thing I found. And stacking a few genuinely uncorrelated thin edges into one portfolio beat any of them alone — the best result of the entire project.
I queried my own live account to be certain. OANDA's US entity lets retail traders trade spot FX only: no indices, no oil, no bonds, not even gold. The diversified trend edge, the strongest thing I found, runs on instruments my account literally cannot touch.
That reframed everything. The year of struggle wasn't a strategy problem — it was an access problem. I'd been forced to fish in the one market (hyper-efficient FX majors) where a small mechanical edge is hardest to find.
The most valuable thing I built this year wasn't a money-making strategy. It was the tool that reliably told me when I didn't have one — and kept me from betting real money on luck. So I'm giving it away.
The Gauntlet is on GitHub → One file, zero dependencies. Run the demo and watch it pass a real edge and kill a data-snooped fake in about a minute.
Our backtest and our live bot ran different code. Same repo, same config file — different code paths. The only way to catch that is to replay real trades against the live engine, not just against a tidy simulator loop.
The same 45 entries produced a losing system with one exit and a winning one with another. We spent months tuning signal filters when the leverage was in the exit logic.
Dynamic risk scaling sounds responsible and reasonable. In practice it meant our losing-streak trades were half-sized by the time the winning trade showed up. A flat 1% is boring, and boring is profitable.
EUR/GBP lost 6 straight on our account. Rather than re-tune it, we blocked it — and the next winning trade somewhere else paid for the analysis. Blocking a pair is cheap; re-tuning is expensive.
Try enough variations and keep the best, and randomness alone hands you a great-looking backtest. The antidote is a Deflated Sharpe that charges you for every variant you tried — it's the gate that killed most of my "winners."
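For reference, here is a minimal stdlib sketch of the Deflated Sharpe Ratio as I understand the Bailey and López de Prado formulation. The Gauntlet's implementation may differ. Sharpe values are per observation (not annualized), `sharpe_variance` is the variance of Sharpe ratios across the variants tried, and `kurtosis` is the raw (non-excess) value, so 3.0 for a normal distribution.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def expected_max_sharpe(n_trials, sharpe_variance):
    """Approximate best Sharpe expected from n_trials zero-edge variants."""
    if n_trials < 2:
        return 0.0  # convention in this sketch: a single trial is not deflated
    z = NormalDist()
    return math.sqrt(sharpe_variance) * (
        (1 - EULER_GAMMA) * z.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * z.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe(sharpe, n_obs, n_trials, sharpe_variance, skew=0.0, kurtosis=3.0):
    """Probability that the observed Sharpe beats the luck-adjusted benchmark."""
    sr0 = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1 - skew * sharpe + (kurtosis - 1) / 4 * sharpe ** 2)
    return NormalDist().cdf((sharpe - sr0) * math.sqrt(n_obs - 1) / denom)
```

With hypothetical inputs, a per-trade Sharpe of 0.1 over 500 trades scores about 0.99 if it was the only variant tried, but close to zero if it was the best of 100 variants whose Sharpe ratios had a variance of 0.01. The backtest is the same in both cases, and only the number of variants tried changes the verdict.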
The strongest edge I found needed instruments my US spot-FX account can't trade. A whole year of pain turned out to be the wrong market, not the wrong code. Where you're allowed to trade matters more than how clever the rules are.
I didn't find a money printer. I found a reliable way to know when I don't have an edge — worth more than another hopeful strategy. An honest "no" is the rarest and most valuable answer in trading.
The Gauntlet is open source — one file, no dependencies, with a runnable demo that passes a real edge and kills a data-snooped fake in a minute. And I'm writing up each phase on the blog with the real code and the real numbers.