docs: reconciled SpecForge architecture plan (DataFlow runtime + domain layer) #630

Draft
maocheng23 wants to merge 3 commits into main from docs/plan-reconciled

maocheng23 (Collaborator) commented Jun 30, 2026

Docs only — no code change. Rewrites plan.md to reconcile the original "SpecForge Redesign
Plan" (from-scratch, torchrun-native, HTTP-only) with the landed DataFlow runtime/ spine, and
adds a consolidated docs/roadmap/ that folds in the online-disaggregation roadmap (#618).

Why

There were two architecture efforts for the same goal. They aren't either/or — they're
complementary, given a real multi-node / >100 GB/s / isolated-pool requirement that overturns
the original draft's "no Mooncake / HTTP is sufficient" bet.

What the reconciled plan says

  • Canonical substrate = the runtime control + data plane. SampleRef (metadata) +
    FeatureStore (tensors, Local/SharedDir/Mooncake) + FeatureDataLoader → TrainBatch.
  • No separate HiddenStateStream source of truth: FeatureDataLoader over
    SampleRef+FeatureStore already is the stream; online/offline/disaggregated variation lives
    in (ref source + FeatureStore) and is shielded from training.
  • training / inference become plan.md-style domain packages on top: keep the runtime
    TrainerCore/DraftTrainStrategy seam, add Trainer lifecycle + managers
    (CheckpointManager/Evaluator/no_sync()/full resume); converge SGLangAdapter →
    TargetEngine + backends (de-EAGLE3).
  • Colocated lightweight path = control plane opt-in/no-op (one canonical path, not a fork),
    guarded by a colocated≡disaggregated numerical-equivalence gate.
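
A minimal sketch of the shape these bullets describe, under loud assumptions: the class names SampleRef, FeatureStore and TrainBatch come from the description above, but every signature, field, and the `feature_data_loader` helper below are hypothetical and not taken from the SpecForge code base. Plain Python lists stand in for tensors, an in-process dict stands in for the colocated path, and a JSON directory stands in for a disaggregated store. The point is only that the loader sees refs plus a store, so swapping the store does not change the batches it yields.

```python
# Hypothetical sketch: signatures and fields are assumptions, not SpecForge APIs.
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Protocol


@dataclass(frozen=True)
class SampleRef:
    """Metadata only: identifies a sample, never carries feature data."""
    sample_id: str
    num_tokens: int


class FeatureStore(Protocol):
    """Assumed minimal interface shared by all store backends."""
    def put(self, sample_id: str, hidden: List[float]) -> None: ...
    def get(self, sample_id: str) -> List[float]: ...


class LocalStore:
    """In-process store, standing in for the colocated path."""
    def __init__(self) -> None:
        self._data: Dict[str, List[float]] = {}

    def put(self, sample_id: str, hidden: List[float]) -> None:
        self._data[sample_id] = list(hidden)

    def get(self, sample_id: str) -> List[float]:
        return self._data[sample_id]


class SharedDirStore:
    """File-backed store, standing in for a disaggregated producer/trainer pair."""
    def __init__(self, root: str) -> None:
        self._root = root

    def _path(self, sample_id: str) -> str:
        return os.path.join(self._root, f"{sample_id}.json")

    def put(self, sample_id: str, hidden: List[float]) -> None:
        with open(self._path(sample_id), "w") as f:
            json.dump(list(hidden), f)

    def get(self, sample_id: str) -> List[float]:
        with open(self._path(sample_id)) as f:
            return json.load(f)


@dataclass
class TrainBatch:
    sample_ids: List[str]
    hidden: List[List[float]]


def feature_data_loader(
    refs: Iterable[SampleRef], store: FeatureStore, batch_size: int
) -> Iterator[TrainBatch]:
    """Resolve refs against the store and group them into batches."""
    pending: List[SampleRef] = []
    for ref in refs:
        pending.append(ref)
        if len(pending) == batch_size:
            yield TrainBatch([r.sample_id for r in pending],
                             [store.get(r.sample_id) for r in pending])
            pending = []
    if pending:  # emit the final partial batch
        yield TrainBatch([r.sample_id for r in pending],
                         [store.get(r.sample_id) for r in pending])


refs = [SampleRef(f"s{i}", num_tokens=2) for i in range(3)]
features = {r.sample_id: [float(i), i + 0.5] for i, r in enumerate(refs)}

local = LocalStore()
with tempfile.TemporaryDirectory() as tmp:
    shared = SharedDirStore(tmp)
    for sid, hidden in features.items():
        local.put(sid, hidden)
        shared.put(sid, hidden)
    colocated = list(feature_data_loader(refs, local, batch_size=2))
    disaggregated = list(feature_data_loader(refs, shared, batch_size=2))

# Toy version of an equivalence gate: both topologies yield identical batches.
assert colocated == disaggregated
```

In this toy the comparison is exact equality on batches; a real numerical-equivalence gate would presumably compare losses or gradients within a tolerance, which this sketch does not attempt.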

Scope decisions (consolidating with #618)

  • Frozen target — no weight sync. "Train-with-decode" = a frozen target streaming hidden
    states (W2/W3), not a serve-and-push workload. The predecessor's W4 weight lifecycle
    (WeightVersion/WeightPublisher/update_draft_weights/ServingTrafficStream) is out of
    scope; draft_weight_version is provenance only.
  • Ray is an open decision, not a non-goal. Candidate for the O2 scale-out orchestrator
    (multi-node N-producer/M-trainer); decision gate lives in the online roadmap. Until then we keep
    the home-grown metadata-only control plane.

Consolidated roadmap (docs/roadmap/)

Per-phase Goal / Target state / Implementation (files+symbols) / Tests / Done-when across three
tracks, with a README index (standing decisions, phase-status-at-a-glance, cross-track deps):

  • domain-refactor.md — A (done) → B (TargetEngine + domain Trainer) → C (colocated) → D
    (managers) → E (drafts registry / config / CLI / export).
  • online-disaggregation.md (folds in #618, "[DataFlow runtime] Online disaggregated training
    roadmap + PR plan (train-with-decode)"): O1.1/O1.2 (in review) → O1.3 live frozen-target
    capture (next) → O2 scale-out (Ray = open) → O3 hardening.
  • eval-and-breadth.md — E1 acceptance-length eval harness → E2 algorithm breadth (new algo = a
    StrategySpec + loss).

Relation to the in-flight work

The composable-launch stack (#627 / #628 / #629: StrategySpec registry + parameterized
launch.py; eagle3/dflash/domino end-to-end) is Phase A. The online track (folding #618)
proceeds in parallel; #618 is superseded by docs/roadmap/online-disaggregation.md.

Files

  • plan.md — rewritten (reconciled); frozen-target + Ray-open applied.
  • docs/roadmap/ — README + three track docs (new).
  • docs/redesign-draft-legacy.md — the original redesign draft, preserved verbatim (with a
    "superseded by plan.md" banner).

🤖 Generated with Claude Code

Rewrite plan.md to reconcile the original "SpecForge Redesign Plan" (from-scratch,
torchrun-native, HTTP-only) with the landed DataFlow runtime/ spine. Docs only — no
code change.

- Canonical substrate = runtime control + data plane (SampleRef + FeatureStore incl.
  Mooncake + FeatureDataLoader). The isolated-pool / >100GB/s requirement overturns the
  original "no Mooncake" bet.
- No separate HiddenStateStream source of truth — FeatureDataLoader over
  SampleRef+FeatureStore IS the stream; topology variation lives in (ref source +
  FeatureStore), shielded from training.
- training/inference become plan.md-style domain packages on top: keep the runtime
  TrainerCore/DraftTrainStrategy seam, add Trainer lifecycle + managers (checkpoint/
  eval/no_sync/resume); converge SGLangAdapter -> TargetEngine + backends.
- colocated lightweight path = control plane opt-in/no-op, guarded by a
  colocated == disaggregated numerical-equivalence gate.
- original redesign draft preserved verbatim in docs/redesign-draft-legacy.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add docs/roadmap/ — per-phase Goal/Target/Implementation/Tests/Done-when
across three tracks (domain-refactor, online-disaggregation, eval-and-breadth)
plus a README index. The online track folds in the former online-disaggregation
roadmap (#618) so there is one roadmap home.

Apply two scope decisions to plan.md and the roadmap:
- Frozen target, no weight sync. "Train-with-decode" = a frozen target streaming
  hidden states (W2/W3), not a serve-and-push workload. Drop the W4 weight
  lifecycle (WeightVersion/WeightPublisher/update_draft_weights/ServingTrafficStream);
  draft_weight_version is provenance only.
- Ray is an open decision, not a non-goal. Candidate for the O2 scale-out
  orchestrator; decision gate in the online roadmap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… W3′ naming

Review fixes (verified against the files):
- Status (confirmed): stop calling the in-review composable-launch stack (#627/#628/#629)
  "landed"/"DONE"/"done". Split the genuinely-merged spine from the in-review stack in §1; one
  consistent "in review" label in §1/Phase A/success table and across the roadmap (README, Phase A).
  Leave the spine's "landed" wording (it is merged).
- Module placement (confirmed): Evaluator/EvalCache are top-level domain managers
  (specforge/eval/), not specforge/runtime/eval/ — fix the eval-and-breadth.md outlier to match
  plan.md §2.3 and domain-refactor.md.
- W3′ naming (confirmed): SGLangServerEngine is ONE engine with two feature transports
  (capture-into-FeatureStore for W3/O1.3, inline-HTTP for the light W3′) — disambiguate in §2.2,
  the workload table and §G2 rather than overloading one name.
- O1.3 spike (reviewer's premise refuted — it is already an explicit 🔴 gate): added the valid
  narrow point instead — the spike scopes only the sglang_server slice of Phase B; the de-EAGLE3
  extraction and domain Trainer carry no engine risk.

Additional contradictions found by a completeness sweep and fixed:
- StrategySpec registry: plan.md said it "stays in runtime/training unchanged" but §6 + Phase E
  move it — clarify the per-step strategy seam stays, the registry converges into training/strategies/.
- TargetEngine source: extracted from modeling/target/*TargetModel (adapters wrap it), not
  "absorbs runtime/inference adapters".
- Draft package: models/drafts is the target layout; note today's modeling/draft/ + real filenames.
- Dependency graph: align domain-refactor (E depended on {C,D}) with README (D→E, C parallel).
- Drop the up-19/up-20 branch tags that only appeared in the online doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>