docs: reconciled SpecForge architecture plan (DataFlow runtime + domain layer) #630
Draft
maocheng23 wants to merge 3 commits into
Rewrite plan.md to reconcile the original "SpecForge Redesign Plan" (from-scratch, torchrun-native, HTTP-only) with the landed DataFlow runtime/ spine. Docs only; no code change.

- Canonical substrate = runtime control + data plane (SampleRef + FeatureStore incl. Mooncake + FeatureDataLoader). The isolated-pool / >100GB/s requirement overturns the original "no Mooncake" bet.
- No separate HiddenStateStream source of truth: FeatureDataLoader over SampleRef+FeatureStore IS the stream; topology variation lives in (ref source + FeatureStore), shielded from training.
- training/inference become plan.md-style domain packages on top: keep the runtime TrainerCore/DraftTrainStrategy seam, add Trainer lifecycle + managers (checkpoint/eval/no_sync/resume); converge SGLangAdapter -> TargetEngine + backends.
- colocated lightweight path = control plane opt-in/no-op, guarded by a colocated == disaggregated numerical-equivalence gate.
- original redesign draft preserved verbatim in docs/redesign-draft-legacy.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add docs/roadmap/: per-phase Goal/Target/Implementation/Tests/Done-when across three tracks (domain-refactor, online-disaggregation, eval-and-breadth) plus a README index. The online track folds in the former online-disaggregation roadmap (#618) so there is one roadmap home.

Apply two scope decisions to plan.md and the roadmap:

- Frozen target, no weight sync. "Train-with-decode" = a frozen target streaming hidden states (W2/W3), not a serve-and-push workload. Drop the W4 weight lifecycle (WeightVersion/WeightPublisher/update_draft_weights/ServingTrafficStream); draft_weight_version is provenance only.
- Ray is an open decision, not a non-goal. Candidate for the O2 scale-out orchestrator; decision gate in the online roadmap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… W3′ naming

Review fixes (verified against the files):

- Status (confirmed): stop calling the in-review composable-launch stack (#627/#628/#629) "landed"/"DONE"/"done". Split the genuinely-merged spine from the in-review stack in §1; one consistent "in review" label in §1/Phase A/success table and across the roadmap (README, Phase A). Leave the spine's "landed" wording (it is merged).
- Module placement (confirmed): Evaluator/EvalCache are top-level domain managers (specforge/eval/), not specforge/runtime/eval/. Fix the eval-and-breadth.md outlier to match plan.md §2.3 and domain-refactor.md.
- W3′ naming (confirmed): SGLangServerEngine is ONE engine with two feature transports (capture-into-FeatureStore for W3/O1.3, inline-HTTP for the light W3′). Disambiguate in §2.2, the workload table and §G2 rather than overloading one name.
- O1.3 spike (reviewer's premise refuted; it is already an explicit 🔴 gate): added the valid narrow point instead. The spike scopes only the sglang_server slice of Phase B; the de-EAGLE3 extraction and domain Trainer carry no engine risk.

Additional contradictions found by a completeness sweep and fixed:

- StrategySpec registry: plan.md said it "stays in runtime/training unchanged" but §6 + Phase E move it. Clarify that the per-step strategy seam stays and the registry converges into training/strategies/.
- TargetEngine source: extracted from modeling/target/*TargetModel (adapters wrap it), not "absorbs runtime/inference adapters".
- Draft package: models/drafts is the target layout; note today's modeling/draft/ + real filenames.
- Dependency graph: align domain-refactor (E depended on {C,D}) with README (D→E, C parallel).
- Drop the up-19/up-20 branch tags that only appeared in the online doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Docs only; no code change. Rewrites `plan.md` to reconcile the original "SpecForge Redesign Plan" (from-scratch, torchrun-native, HTTP-only) with the landed DataFlow `runtime/` spine, and adds a consolidated `docs/roadmap/` that folds in the online-disaggregation roadmap (#618).

Why
There were two architecture efforts for the same goal. They aren't either/or — they're
complementary, given a real multi-node / >100 GB/s / isolated-pool requirement that overturns
the original draft's "no Mooncake / HTTP is sufficient" bet.
What the reconciled plan says
- Canonical substrate = the runtime control + data plane: `SampleRef` (metadata) + `FeatureStore` (tensors; Local/SharedDir/Mooncake) + `FeatureDataLoader` → `TrainBatch`.
- No separate `HiddenStateStream` source of truth: `FeatureDataLoader` over `SampleRef` + `FeatureStore` already is the stream; online/offline/disaggregated variation lives in (ref source + `FeatureStore`) and is shielded from training.
- `training` / `inference` become plan.md-style domain packages on top: keep the runtime `TrainerCore` / `DraftTrainStrategy` seam, add `Trainer` lifecycle + managers (`CheckpointManager` / `Evaluator` / `no_sync()` / full resume); converge `SGLangAdapter` → `TargetEngine` + backends (de-EAGLE3).
- Colocated lightweight path = control plane opt-in/no-op, guarded by a colocated≡disaggregated numerical-equivalence gate.
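A minimal sketch of the data-plane idea above, under loud assumptions: only the names `SampleRef`, `FeatureStore`, `FeatureDataLoader` and `TrainBatch` come from this PR; every field, method and class body below is hypothetical, and plain Python lists stand in for tensors so the example stays self-contained. It illustrates why the loader can act as the stream: the training side iterates `TrainBatch` objects and never learns which store backend produced them, so two backends can be compared batch-for-batch in the spirit of the equivalence gate.

```python
import json
import os
import tempfile
from dataclasses import dataclass
from typing import Dict, Iterator, List, Protocol

Hidden = List[List[float]]  # stand-in for a [tokens, dim] tensor


@dataclass(frozen=True)
class SampleRef:
    """Metadata only (hypothetical fields): where a sample's features live."""
    sample_id: str
    key: str
    num_tokens: int


class FeatureStore(Protocol):
    """Hypothetical minimal interface; the real one may differ."""
    def put(self, key: str, hidden: Hidden) -> None: ...
    def get(self, key: str) -> Hidden: ...


class LocalStore:
    """In-process store, standing in for a colocated path."""
    def __init__(self) -> None:
        self._data: Dict[str, Hidden] = {}

    def put(self, key: str, hidden: Hidden) -> None:
        self._data[key] = hidden

    def get(self, key: str) -> Hidden:
        return self._data[key]


class SharedDirStore:
    """File-backed store, standing in for a disaggregated path."""
    def __init__(self, root: str) -> None:
        self._root = root

    def _path(self, key: str) -> str:
        return os.path.join(self._root, f"{key}.json")

    def put(self, key: str, hidden: Hidden) -> None:
        with open(self._path(key), "w") as f:
            json.dump(hidden, f)

    def get(self, key: str) -> Hidden:
        with open(self._path(key)) as f:
            return json.load(f)


@dataclass
class TrainBatch:
    sample_ids: List[str]
    hidden_states: List[Hidden]


def feature_data_loader(
    refs: List[SampleRef], store: FeatureStore, batch_size: int
) -> Iterator[TrainBatch]:
    """Resolve refs against a store and yield batches; backend-agnostic."""
    for start in range(0, len(refs), batch_size):
        chunk = refs[start:start + batch_size]
        yield TrainBatch(
            sample_ids=[r.sample_id for r in chunk],
            hidden_states=[store.get(r.key) for r in chunk],
        )


def demo():
    """Load the same refs through both backends and return both batch lists."""
    features = {"s0": [[0.1, 0.2]], "s1": [[0.3, 0.4], [0.5, 0.6]], "s2": [[0.7, 0.8]]}
    refs = [SampleRef(k, k, len(v)) for k, v in features.items()]
    local = LocalStore()
    with tempfile.TemporaryDirectory() as root:
        shared = SharedDirStore(root)
        for key, hidden in features.items():
            local.put(key, hidden)
            shared.put(key, hidden)
        via_local = list(feature_data_loader(refs, local, batch_size=2))
        via_shared = list(feature_data_loader(refs, shared, batch_size=2))
    return via_local, via_shared
```

Comparing `via_local` and `via_shared` for equality is a toy analogue of the colocated≡disaggregated gate; a real gate would presumably compare tensors and training outputs under a stated tolerance.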
Scope decisions (consolidating with #618)
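- Frozen target, no weight sync. "Train-with-decode" = a frozen target streaming hidden states (W2/W3), not a serve-and-push workload. The predecessor's W4 weight lifecycle (`WeightVersion` / `WeightPublisher` / `update_draft_weights` / `ServingTrafficStream`) is out of scope; `draft_weight_version` is provenance only.
- Ray is an open decision, not a non-goal. It is a candidate for the O2 scale-out orchestrator (multi-node N-producer/M-trainer); the decision gate lives in the online roadmap. Until then we keep the home-grown metadata-only control plane.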
Consolidated roadmap (`docs/roadmap/`)

Per-phase Goal / Target state / Implementation (files + symbols) / Tests / Done-when across three tracks, with a README index (standing decisions, phase-status-at-a-glance, cross-track deps):

- `domain-refactor.md`: A (in review) → B (`TargetEngine` + domain `Trainer`) → C (colocated) → D (managers) → E (drafts registry / config / CLI / export).
- `online-disaggregation.md`: folds in #618 ("[DataFlow runtime] Online disaggregated training roadmap + PR plan (train-with-decode)"): O1.1/O1.2 (in review) → O1.3 live frozen-target capture (next) → O2 scale-out (Ray = open) → O3 hardening.
- `eval-and-breadth.md`: E1 acceptance-length eval harness → E2 algorithm breadth (new algo = a `StrategySpec` + loss).

Relation to the in-flight work
The composable-launch stack (#627 / #628 / #629: `StrategySpec` registry + parameterized `launch.py`; eagle3/dflash/domino end-to-end) is Phase A. The online track (folding #618) proceeds in parallel; #618 is superseded by `docs/roadmap/online-disaggregation.md`.

Files

- `plan.md`: rewritten (reconciled); frozen-target + Ray-open applied.
- `docs/roadmap/`: README + three track docs (new).
- `docs/redesign-draft-legacy.md`: the original redesign draft, preserved verbatim (with a "superseded by plan.md" banner).
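For readers unfamiliar with the "new algo = a `StrategySpec` + loss" shorthand used in the roadmap, here is a rough sketch of what a name-keyed strategy registry could look like. Only the name `StrategySpec` comes from this PR; the fields, `register_strategy`, `get_strategy` and the `toy_mse` entry are hypothetical illustrations and do not describe the actual registry in #627 / #628 / #629.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical loss signature: (predicted, target) -> scalar.
LossFn = Callable[[List[float], List[float]], float]


@dataclass(frozen=True)
class StrategySpec:
    """Hypothetical shape: a name plus the loss that defines the algorithm."""
    name: str
    loss_fn: LossFn


_REGISTRY: Dict[str, StrategySpec] = {}


def register_strategy(spec: StrategySpec) -> StrategySpec:
    """Add a spec to the registry, rejecting duplicate names."""
    if spec.name in _REGISTRY:
        raise ValueError(f"strategy {spec.name!r} is already registered")
    _REGISTRY[spec.name] = spec
    return spec


def get_strategy(name: str) -> StrategySpec:
    """Look up a spec by name, e.g. from a launcher argument."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise KeyError(
            f"unknown strategy {name!r}; known: {sorted(_REGISTRY)}"
        ) from None


def mse_loss(pred: List[float], target: List[float]) -> float:
    """Mean squared error over two equal-length vectors (toy loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Adding an algorithm is one registration: a spec plus its loss.
register_strategy(StrategySpec(name="toy_mse", loss_fn=mse_loss))
```

With a registry of this kind, a parameterized launcher only needs the strategy name to pick the loss, which is the property the E2 "algorithm breadth" phase appears to rely on.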
🤖 Generated with Claude Code