docs: reconciled SpecForge architecture plan (DataFlow runtime + domain layer) by maocheng23 · Pull Request #630 · sgl-project/SpecForge

maocheng23 · 2026-06-30T20:39:00Z

Docs only — no code change. Rewrites plan.md to reconcile the original "SpecForge Redesign
Plan" (from-scratch, torchrun-native, HTTP-only) with the landed DataFlow runtime/ spine, and
adds a consolidated docs/roadmap/ that folds in the online-disaggregation roadmap (#618).

Why

There were two architecture efforts for the same goal. They aren't either/or — they're
complementary, given a real multi-node / >100 GB/s / isolated-pool requirement that overturns
the original draft's "no Mooncake / HTTP is sufficient" bet.

What the reconciled plan says

- Canonical substrate = the runtime control + data plane. SampleRef (metadata) +
  FeatureStore (tensors, Local/SharedDir/Mooncake) + FeatureDataLoader → TrainBatch.
- No separate HiddenStateStream source of truth — FeatureDataLoader over
  SampleRef+FeatureStore already is the stream; online/offline/disaggregated variation lives
  in (ref source + FeatureStore) and is shielded from training.
- training / inference become plan.md-style domain packages on top: keep the runtime
  TrainerCore/DraftTrainStrategy seam, add Trainer lifecycle + managers
  (CheckpointManager/Evaluator/no_sync()/full resume); converge SGLangAdapter →
  TargetEngine + backends (de-EAGLE3).
- Colocated lightweight path = control plane opt-in/no-op (one canonical path, not a fork),
  guarded by a colocated≡disaggregated numerical-equivalence gate.
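
To make the substrate and the equivalence gate concrete, here is a minimal, dependency-free sketch. Only the names SampleRef, FeatureStore, FeatureDataLoader and TrainBatch come from the plan; every field, method signature, the two toy stores, and `toy_loss` are assumptions made for illustration and are not the SpecForge API. Tensors are stood in for by plain lists of floats.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Dict, Iterator, List, Protocol


@dataclass(frozen=True)
class SampleRef:
    """Metadata only: points at features, never carries them (hypothetical fields)."""
    sample_id: str
    feature_key: str


@dataclass
class TrainBatch:
    sample_ids: List[str]
    hidden_states: List[List[float]]


class FeatureStore(Protocol):
    """Hypothetical minimal interface; the real store may look different."""
    def put(self, key: str, value: List[float]) -> None: ...
    def get(self, key: str) -> List[float]: ...


class LocalStore:
    """In-process store, standing in for the colocated path."""
    def __init__(self) -> None:
        self._data: Dict[str, List[float]] = {}

    def put(self, key: str, value: List[float]) -> None:
        self._data[key] = list(value)

    def get(self, key: str) -> List[float]:
        return self._data[key]


class SharedDirStore:
    """File-backed store, standing in for a producer/trainer split."""
    def __init__(self, root: str) -> None:
        self._root = root

    def put(self, key: str, value: List[float]) -> None:
        with open(os.path.join(self._root, key + ".json"), "w") as f:
            json.dump(value, f)

    def get(self, key: str) -> List[float]:
        with open(os.path.join(self._root, key + ".json")) as f:
            return json.load(f)


class FeatureDataLoader:
    """Resolves refs against a store; training code only ever sees TrainBatch."""
    def __init__(self, refs: List[SampleRef], store: FeatureStore, batch_size: int) -> None:
        self._refs, self._store, self._batch_size = refs, store, batch_size

    def __iter__(self) -> Iterator[TrainBatch]:
        for start in range(0, len(self._refs), self._batch_size):
            chunk = self._refs[start:start + self._batch_size]
            yield TrainBatch(
                sample_ids=[r.sample_id for r in chunk],
                hidden_states=[self._store.get(r.feature_key) for r in chunk],
            )


def toy_loss(batch: TrainBatch) -> float:
    """Placeholder for a training step: mean of per-sample feature sums."""
    return sum(sum(h) for h in batch.hidden_states) / len(batch.sample_ids)


def run(store: FeatureStore) -> List[float]:
    """Write four toy samples to the store, then compute one loss per batch of two."""
    refs = []
    for i in range(4):
        key = f"feat-{i}"
        store.put(key, [float(i), float(i) + 0.5])
        refs.append(SampleRef(sample_id=f"s{i}", feature_key=key))
    return [toy_loss(batch) for batch in FeatureDataLoader(refs, store, batch_size=2)]


# Equivalence gate in miniature: swapping the store must not change the losses.
colocated = run(LocalStore())
with tempfile.TemporaryDirectory() as shared_dir:
    disaggregated = run(SharedDirStore(shared_dir))
assert colocated == disaggregated
```

The point of the sketch is that `run` and `toy_loss` never branch on topology: the only thing that differs between the two runs is which store is passed in, which is what "shielded from training" amounts to here. A real gate would presumably compare losses or gradients within a tolerance rather than by exact equality.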

Scope decisions (consolidating with #618)

- Frozen target — no weight sync. "Train-with-decode" = a frozen target streaming hidden
  states (W2/W3), not a serve-and-push workload. The predecessor's W4 weight lifecycle
  (WeightVersion/WeightPublisher/update_draft_weights/ServingTrafficStream) is out of
  scope; draft_weight_version is provenance only.
- Ray is an open decision, not a non-goal. Candidate for the O2 scale-out orchestrator
  (multi-node N-producer/M-trainer); decision gate lives in the online roadmap. Until then we keep
  the home-grown metadata-only control plane.

Consolidated roadmap (`docs/roadmap/`)

Per-phase Goal / Target state / Implementation (files+symbols) / Tests / Done-when across three
tracks, with a README index (standing decisions, phase-status-at-a-glance, cross-track deps):

- domain-refactor.md — A (done) → B (TargetEngine + domain Trainer) → C (colocated) → D
  (managers) → E (drafts registry / config / CLI / export).
- online-disaggregation.md — folds in #618 ([DataFlow runtime] Online disaggregated training
  roadmap + PR plan (train-with-decode)): O1.1/O1.2 (in review) → O1.3 live frozen-target
  capture (next) → O2 scale-out (Ray = open) → O3 hardening.
- eval-and-breadth.md — E1 acceptance-length eval harness → E2 algorithm breadth (new algo = a
  StrategySpec + loss).

Relation to the in-flight work

The composable-launch stack (#627 / #628 / #629 — StrategySpec registry + parameterized
launch.py; eagle3/dflash/domino end-to-end) is Phase A. The online track (folding #618)
proceeds in parallel; #618 is superseded by docs/roadmap/online-disaggregation.md.

Files

- plan.md — rewritten (reconciled); frozen-target + Ray-open applied.
- docs/roadmap/ — README + three track docs (new).
- docs/redesign-draft-legacy.md — the original redesign draft, preserved verbatim (with a
  "superseded by plan.md" banner).

🤖 Generated with Claude Code

Rewrite plan.md to reconcile the original "SpecForge Redesign Plan" (from-scratch, torchrun-native, HTTP-only) with the landed DataFlow runtime/ spine.

Docs only — no code change.

- Canonical substrate = runtime control + data plane (SampleRef + FeatureStore incl. Mooncake + FeatureDataLoader). The isolated-pool / >100GB/s requirement overturns the original "no Mooncake" bet.
- No separate HiddenStateStream source of truth — FeatureDataLoader over SampleRef+FeatureStore IS the stream; topology variation lives in (ref source + FeatureStore), shielded from training.
- training/inference become plan.md-style domain packages on top: keep the runtime TrainerCore/DraftTrainStrategy seam, add Trainer lifecycle + managers (checkpoint/eval/no_sync/resume); converge SGLangAdapter -> TargetEngine + backends.
- colocated lightweight path = control plane opt-in/no-op, guarded by a colocated == disaggregated numerical-equivalence gate.
- original redesign draft preserved verbatim in docs/redesign-draft-legacy.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


Add docs/roadmap/ — per-phase Goal/Target/Implementation/Tests/Done-when across three tracks (domain-refactor, online-disaggregation, eval-and-breadth) plus a README index. The online track folds in the former online-disaggregation roadmap (#618) so there is one roadmap home.

Apply two scope decisions to plan.md and the roadmap:

- Frozen target, no weight sync. "Train-with-decode" = a frozen target streaming hidden states (W2/W3), not a serve-and-push workload. Drop the W4 weight lifecycle (WeightVersion/WeightPublisher/update_draft_weights/ServingTrafficStream); draft_weight_version is provenance only.
- Ray is an open decision, not a non-goal. Candidate for the O2 scale-out orchestrator; decision gate in the online roadmap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… W3′ naming

Review fixes (verified against the files):

- Status (confirmed): stop calling the in-review composable-launch stack (#627/#628/#629) "landed"/"DONE"/"done". Split the genuinely-merged spine from the in-review stack in §1; one consistent "in review" label in §1/Phase A/success table and across the roadmap (README, Phase A). Leave the spine's "landed" wording (it is merged).
- Module placement (confirmed): Evaluator/EvalCache are top-level domain managers (specforge/eval/), not specforge/runtime/eval/ — fix the eval-and-breadth.md outlier to match plan.md §2.3 and domain-refactor.md.
- W3′ naming (confirmed): SGLangServerEngine is ONE engine with two feature transports (capture-into-FeatureStore for W3/O1.3, inline-HTTP for the light W3′) — disambiguate in §2.2, the workload table and §G2 rather than overloading one name.
- O1.3 spike (reviewer's premise refuted — it is already an explicit 🔴 gate): added the valid narrow point instead — the spike scopes only the sglang_server slice of Phase B; the de-EAGLE3 extraction and domain Trainer carry no engine risk.

Additional contradictions found by a completeness sweep and fixed:

- StrategySpec registry: plan.md said it "stays in runtime/training unchanged" but §6 + Phase E move it — clarify the per-step strategy seam stays, the registry converges into training/strategies/.
- TargetEngine source: extracted from modeling/target/*TargetModel (adapters wrap it), not "absorbs runtime/inference adapters".
- Draft package: models/drafts is the target layout; note today's modeling/draft/ + real filenames.
- Dependency graph: align domain-refactor (E depended on {C,D}) with README (D→E, C parallel).
- Drop the up-19/up-20 branch tags that only appeared in the online doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

maocheng23 mentioned this pull request Jun 30, 2026

[DataFlow runtime] Online disaggregated training roadmap + PR plan (train-with-decode) #618

Closed

maocheng23 mentioned this pull request Jul 1, 2026

[DataFlow runtime] Phase B1 — TargetEngine ABC + de-EAGLE3 the target boundary #631

Draft

maocheng23 wants to merge 3 commits into main from docs/plan-reconciled
