A harness for rigorous AI/ML experiments in finance.

An equity research-and-trading system — multi-agent research, ML prediction, risk-gated execution, weekly self-tuning — instrumented end-to-end^[1].

End-to-end measurement

Every signal, prediction, fill, and dollar of P&L instrumented and traceable. The console is a view, not a measurement layer.

Multi-agent research

Six sector teams, a portfolio decision agent, and a macro layer on LangGraph + Claude. Agents return structured outputs, scored by an LLM-as-judge rubric.

Machine-learning overlay

Stacked ensemble of gradient-boosted and linear models. 21-day market-relative return predictions with confidence-driven veto.
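A minimal sketch of what a confidence-driven veto could look like. The page does not define the veto rule, so everything here is an assumption for illustration: the `Prediction` fields, the `apply_veto` helper, the 0.6 threshold, and the choice to veto only when the model confidently expects a ticker to underperform the market.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    ticker: str
    expected_alpha_21d: float  # predicted 21-day return relative to the market
    confidence: float          # hypothetical score in [0.0, 1.0]


def apply_veto(candidates, predictions, min_confidence=0.6):
    """Return the candidates that survive the (assumed) veto rule.

    A candidate is vetoed only when a prediction exists, the predicted
    market-relative return is negative, and confidence meets the threshold.
    Tickers without a prediction pass through untouched.
    """
    approved = []
    for ticker in candidates:
        pred = predictions.get(ticker)
        vetoed = (
            pred is not None
            and pred.expected_alpha_21d < 0
            and pred.confidence >= min_confidence
        )
        if not vetoed:
            approved.append(ticker)
    return approved
```

Under this rule a low-confidence negative prediction does not block a research pick; only a confident one does.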

Self-improvement loop

Weekly evaluation writes optimized parameters back to four S3 configs. Downstream modules read them on cold-start.
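The write-back and cold-start read can be sketched as follows. This is an illustration under stated assumptions, not the system's code: `InMemoryStore` stands in for an S3 bucket so the example runs without credentials, and the helper names, the key, and the parameter names are hypothetical.

```python
import json


class InMemoryStore:
    """Stand-in for an S3 bucket: maps object keys to bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body):
        self._objects[key] = body

    def get(self, key):
        # Returns None when the object does not exist.
        return self._objects.get(key)


def publish_config(store, key, params):
    """Weekly evaluation side: serialize tuned parameters and write them."""
    store.put(key, json.dumps(params, sort_keys=True).encode("utf-8"))


def load_config(store, key, defaults):
    """Downstream side, called once on cold start.

    Falls back to built-in defaults when no config has been published,
    and lets published values override defaults key by key.
    """
    merged = dict(defaults)
    raw = store.get(key)
    if raw is not None:
        merged.update(json.loads(raw.decode("utf-8")))
    return merged
```

Merging over defaults is one possible design choice: a module still starts if a config is missing or omits a parameter.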

Current phase

Every aspect of the system reliable and measurable — every experiment decided on data, not vibes.

Phase 1 ✓ Completeness (KPI: Coverage)
Phase 2 ▶ Reliability + Measurability (KPI: Uptime + Coverage)
Phase 3 · Performance (paper) (KPI: Alpha vs SPY)
Phase 4 · Performance (live) (KPI: NAV)

Phase 1 · Completeness

Seven modules wired end-to-end via S3 — data, research, prediction, execution, backtesting, evaluation, dashboard.
Multi-agent research, stacked meta-ensemble, risk-gated executor, weekly backtester.
Three Step Functions running unattended (Saturday weekly + weekday morning + EOD).

Phase 2 · Reliability + Measurability

Step Functions reliable end-to-end with drift detection and runtime trend alarms.
Every decision point measurable — agent calls, predictor verdicts, fills, P&L attribution, risk events.
Closed feedback loop — backtester writing four optimized configs to S3 weekly.

Phase 3 · Performance (paper)

Runs featured experiments against pre-committed success bars on the Phase 2 substrate.
Broader feature set at inference (currently 21 features, moving to a ~50-feature ArcticDB store).
Gated on a ≥99% Step Functions success rate over 8 weeks and a complete transparency inventory.
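The Phase 3 gate amounts to a simple check, sketched below. The page does not say how the success rate is measured, so this version assumes it is pooled across all Step Functions executions in the trailing 8 weeks. The function name and input shape are hypothetical.

```python
def phase3_gate_open(weekly_runs, inventory_complete, min_rate=0.99, weeks=8):
    """Return True when the (assumed) Phase 3 entry gate is satisfied.

    weekly_runs: list of (succeeded, total) execution counts per week,
                 oldest first.
    inventory_complete: whether the transparency inventory is done.
    """
    if len(weekly_runs) < weeks:
        return False  # not enough history yet
    window = weekly_runs[-weeks:]
    succeeded = sum(s for s, _ in window)
    total = sum(t for _, t in window)
    if total == 0:
        return False  # no executions means nothing has been demonstrated
    return inventory_complete and succeeded / total >= min_rate
```

With one Saturday run plus a morning and an EOD run on each of five weekdays, a week has 11 executions and an 8-week window has 88. A single failure gives 87/88, roughly 98.9%, so under this pooled reading the gate tolerates no failures in the window.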

Phase 4 · Performance (live)

Paper → live capital with progressive sizing.
Portfolio-level risk overlays beyond per-position gates.
Gated on sustained positive alpha vs SPY over a 12-week Phase 3 window.
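One way to read the Phase 4 gate is as compounded portfolio return minus compounded SPY return over the window. The sketch below uses that reading. It is an assumption: the page does not define "alpha" or "sustained", and the function names are hypothetical.

```python
def compound(returns):
    """Compound a sequence of simple periodic returns into one total return."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def excess_return_vs_spy(portfolio_weekly, spy_weekly, weeks=12):
    """Compounded portfolio return minus compounded SPY return.

    Uses the trailing `weeks` observations. Returns None when there is
    not yet enough history to evaluate the window.
    """
    if len(portfolio_weekly) != len(spy_weekly):
        raise ValueError("return series must be the same length")
    if len(portfolio_weekly) < weeks:
        return None
    return compound(portfolio_weekly[-weeks:]) - compound(spy_weekly[-weeks:])
```

A stricter notion of "sustained" could also require the excess to stay positive across rolling sub-windows. This sketch checks only the full window.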

Instrumented end-to-end^[1]

Every layer of the pipeline is observable and auditable:

Decision artifacts

Each research-agent LLM call writes its prompt, response, tool calls, and structured metadata to s3://alpha-engine-research/decision_artifacts/, with LLM-as-judge rubric scores attached for every load-bearing agent type (sector quant, qual, peer-review, thesis-update, macro economist, CIO).

Trade audit log

Every order, fill, and exit decision is recorded with per-trade realized_pnl and, where applicable, a rationale. This is the audit trail that surfaced the PFE short-sell retro.

Performance metrics

Signal accuracy at 10d / 30d, predictor rolling 30d IC, NAV vs SPY daily returns, per-trade realized P&L, daily portfolio-level attribution (position P&L + interest + unattributed residual).
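The daily attribution identity can be sketched as follows: the day's NAV change is split into position P&L, interest, and whatever is left over. The function and field names are hypothetical, and the sketch assumes no deposits or withdrawals during the day.

```python
def attribute_day(nav_start, nav_end, position_pnl, interest):
    """Decompose one day's NAV change.

    position_pnl: dict mapping ticker -> that position's P&L for the day.
    interest: interest earned (or paid, if negative) that day.
    The residual is whatever the named components fail to explain
    (fees, rounding, data gaps, and so on).
    """
    nav_change = nav_end - nav_start
    positions_total = sum(position_pnl.values())
    residual = nav_change - positions_total - interest
    return {
        "nav_change": nav_change,
        "position_pnl": positions_total,
        "interest": interest,
        "unattributed_residual": residual,
    }
```

For example, a NAV move from 100,000 to 100,150 with position P&L of +120 and -10 and interest of 2.5 leaves a residual of 37.5. A residual that stays near zero suggests the named components explain the book.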

Cost telemetry

Per-call cost tracked at LLM call time and aggregated to a weekly cost parquet.
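A sketch of per-call costing and weekly roll-up. The per-million-token prices are caller-supplied placeholders rather than real rates, the function names are hypothetical, and the result is a plain dict keyed by ISO week. Writing it out as parquet is left aside to keep the example dependency-free.

```python
from collections import defaultdict
from datetime import date


def call_cost(input_tokens, output_tokens, usd_per_mtok_in, usd_per_mtok_out):
    """Cost in USD of one LLM call, given per-million-token prices."""
    return (
        input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out
    ) / 1_000_000


def weekly_totals(calls):
    """Sum per-call costs by ISO (year, week).

    calls: iterable of (call_date, cost_usd) pairs.
    """
    totals = defaultdict(float)
    for day, cost in calls:
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        totals[(iso[0], iso[1])] += cost
    return dict(totals)
```

Keying on the ISO year as well as the week number keeps weeks that straddle a calendar-year boundary from being split or merged incorrectly.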

LangSmith tracing

Trace ID + token counts on every production LLM call.

Parity replay

The backtester replays the morning-planner stage of historical runs as an observational diff against current code.
