A harness for rigorous AI/ML experiments in finance.
An equity research-and-trading system — multi-agent research, ML prediction, risk-gated execution, weekly self-tuning — instrumented end-to-end[1].
Every signal, prediction, fill, and dollar of P&L instrumented and traceable. The console is a view, not a measurement layer.
Six sector teams, a portfolio decision agent, and a macro layer on LangGraph + Claude. Structured outputs, LLM-as-judge.
Stacked ensemble of gradient-boosted and linear models. 21-day market-relative return predictions with confidence-driven veto.
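As a rough illustration of what a confidence-driven veto could look like, the sketch below blocks an entry when the model predicts underperformance with high confidence. The `Prediction` fields, the `vetoed` helper, and the 0.6 threshold are all hypothetical; the actual veto rule is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    ticker: str
    expected_rel_return: float  # predicted 21-day return relative to the market
    confidence: float           # assumed to be a score in [0, 1]


def vetoed(pred: Prediction, min_confidence: float = 0.6) -> bool:
    """Assumed rule: veto only when underperformance is predicted with high confidence."""
    return pred.expected_rel_return < 0 and pred.confidence >= min_confidence


candidates = [
    Prediction("AAA", -0.03, 0.80),  # confident underperformer: vetoed
    Prediction("BBB", -0.03, 0.40),  # low confidence: allowed through
    Prediction("CCC", 0.02, 0.90),   # predicted outperformer: allowed through
]
allowed = [p.ticker for p in candidates if not vetoed(p)]
```

Under this reading, a low-confidence negative prediction does not block a trade; only confident ones do.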
Weekly evaluation writes optimized parameters back to four S3 configs. Downstream modules read them on cold-start.
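A minimal sketch of the cold-start read, assuming each config is a JSON object and that a module falls back to built-in defaults when a tuned value is missing. The key name, the parameter names, and the `fetch` callable (a stand-in for an S3 read) are hypothetical.

```python
import json
from typing import Callable

# Hypothetical defaults; the real parameter names are not given in this document.
DEFAULTS = {"min_confidence": 0.6, "max_position_pct": 0.05}


def load_config(fetch: Callable[[str], bytes], key: str, defaults: dict) -> dict:
    """Read a JSON config once at cold start; tuned values override defaults."""
    try:
        tuned = json.loads(fetch(key))
    except KeyError:
        tuned = {}  # no optimized config written yet: run on defaults
    return {**defaults, **tuned}


# An in-memory dict stands in for the bucket so the sketch runs anywhere.
store = {"config/executor.json": json.dumps({"min_confidence": 0.7}).encode()}
cfg = load_config(store.__getitem__, "config/executor.json", DEFAULTS)
missing = load_config(store.__getitem__, "config/unknown.json", DEFAULTS)
```

Merging over defaults means a partially written or absent config degrades to known-safe values instead of failing the module at startup.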
Current phase
Every aspect of the system reliable and measurable — every experiment decided on data, not vibes.
- Seven modules wired end-to-end via S3 — data, research, prediction, execution, backtesting, evaluation, dashboard.
- Multi-agent research, stacked meta-ensemble, risk-gated executor, weekly backtester.
- Three Step Functions running unattended (Saturday weekly + weekday morning + EOD).
- Step Functions reliable end-to-end with drift detection and runtime trend alarms.
- Every decision point measurable — agent calls, predictor verdicts, fills, P&L attribution, risk events.
- Closed feedback loop — backtester writing four optimized configs to S3 weekly.
- Runs featured experiments against pre-committed bars on the Phase-2 substrate.
- Broader feature coverage in inference (21 features today → ~50-feature ArcticDB store).
- Gated on a ≥99% Step Functions success rate over 8 weeks and a complete transparency inventory.
- Paper → live capital with progressive sizing.
- Portfolio-level risk overlays beyond per-position gates.
- Gated on sustained positive alpha vs SPY over a 12-week Phase 3 window.
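The Step Functions success-rate gate above (at least 99% over 8 weeks) can be checked mechanically. The sketch below assumes the rate is pooled across all executions in the trailing window; a per-week reading is equally plausible, and the `gate_passed` helper and its input shape are hypothetical.

```python
def gate_passed(weekly_runs, min_rate=0.99, weeks=8):
    """weekly_runs: list of (succeeded, total) execution counts, oldest first."""
    recent = weekly_runs[-weeks:]
    if len(recent) < weeks:
        return False  # not enough history to evaluate the gate
    succeeded = sum(s for s, _ in recent)
    total = sum(t for _, t in recent)
    return total > 0 and succeeded / total >= min_rate


# Illustrative counts only: 11 executions per week, all successful.
clean = [(11, 11)] * 8
one_failure = [(11, 11)] * 7 + [(10, 11)]
```

With pooled counts, `gate_passed(clean)` holds while `gate_passed(one_failure)` does not, since 87 of 88 is below 99%.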
Instrumented end-to-end[1]
Every layer of the pipeline is observable and auditable:
Research-agent LLM calls capture prompt, response, tool calls, and structured metadata to s3://alpha-engine-research/decision_artifacts/, with LLM-as-judge rubric scores attached on every load-bearing agent type (sector quant, qual, peer-review, thesis-update, macro economist, CIO).
Every order, fill, and exit decision is recorded with per-trade realized_pnl and rationale where applicable. This is the audit trail that surfaced the PFE short-sell retro.
Signal accuracy at 10d / 30d, predictor rolling 30d IC, NAV vs SPY daily returns, per-trade realized P&L, daily portfolio-level attribution (position P&L + interest + unattributed residual).
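The daily attribution can be read as a simple identity: the NAV change equals position P&L plus interest plus whatever is left unexplained. The sketch below assumes that identity; the `attribute_day` helper and the sample figures are illustrative, not taken from the system.

```python
def attribute_day(nav_start, nav_end, position_pnl, interest):
    """Split one day's NAV change into position P&L, interest, and a residual."""
    total = nav_end - nav_start
    positions = sum(position_pnl.values())
    return {
        "total": total,
        "positions": positions,
        "interest": interest,
        "residual": total - positions - interest,  # unattributed remainder
    }


report = attribute_day(
    nav_start=100_000.0,
    nav_end=100_250.0,
    position_pnl={"AAA": 300.0, "BBB": -75.0},
    interest=20.0,
)
```

Tracking the residual explicitly makes gaps in attribution visible: a residual that grows over time points at fees, corporate actions, or pricing mismatches that no component is accounting for.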
Per-call cost tracked at LLM call time and aggregated to a weekly cost parquet.
Trace ID + token counts on every production LLM call.
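A minimal sketch of per-call cost tracking rolled up by week, assuming cost is derived from token counts and per-million-token prices. The prices, the helper names, and the ISO-week grouping are assumptions, and the sketch stops at an in-memory total instead of writing a parquet file.

```python
from collections import defaultdict
from datetime import datetime


def call_cost(input_tokens, output_tokens, usd_per_mtok_in, usd_per_mtok_out):
    """Cost of one LLM call in USD, given prices per million tokens."""
    return (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000


def weekly_totals(records):
    """records: iterable of (timestamp, cost_usd); returns totals keyed by ISO (year, week)."""
    totals = defaultdict(float)
    for ts, cost in records:
        year, week, _ = ts.isocalendar()
        totals[(year, week)] += cost
    return dict(totals)


# Hypothetical prices and call records, for illustration only.
one_call = call_cost(1_000_000, 0, usd_per_mtok_in=3.0, usd_per_mtok_out=15.0)
totals = weekly_totals([
    (datetime(2024, 1, 1), 0.5),
    (datetime(2024, 1, 7), 0.25),
    (datetime(2024, 1, 8), 1.0),
])
```

Computing cost at call time, when the token counts are already in hand, avoids having to reconstruct usage from logs later.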
The backtester replays the morning-planner stage of historical runs as an observational diff against current code.