Studio: Add Tensor-Parallel llama.cpp support #6040

Open
oobabooga wants to merge 8 commits into unslothai:main from oobabooga:studio-tensor-parallelism

@oobabooga oobabooga commented Jun 6, 2026

This adds tensor parallelism support for llama.cpp through a new "Tensor Parallelism" checkbox:

[Screenshot: the "Tensor Parallelism" checkbox]

When checked, it passes --split-mode tensor to llama-server.

On single-GPU setups, the option is ignored even if checked.

How

The PR doesn't just pass the flag when the checkbox is checked; it also integrates the option with Studio's existing memory allocation framework. Specifically, it adds a new policy for the tensor-parallel case:

  • Use all GPUs instead of trying to use the smallest viable subset.
  • Calculate the context using a reduced VRAM budget (total free VRAM - weights - N*reserve, where reserve is the per-GPU compute buffer), sized across all GPUs at once (reusing _estimate_kv_cache_bytes).
  • Pass --tensor-split weighted by free - reserve per GPU, but only on asymmetric setups where an even split would overflow the smallest GPU.
  • If a tensor-parallel load still fails (some architectures crash or aren't supported by --split-mode tensor), fall back to layer split automatically so the model loads regardless.
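The budget and split rules above can be sketched roughly as follows. This is an illustration, not the PR's `_plan_tensor_parallel`: the function name, its signature, and in particular the "even split would overflow" test are assumptions made for the example. Only the 5 GiB reserve, the pooled budget formula, and the `free - reserve` weighting come from the description.

```python
MIB = 1024 * 1024
RESERVE_MIB = 5 * 1024  # fixed per-GPU compute-buffer reserve (5 GiB, per the PR)


def plan_tensor_parallel(gpus, model_size_bytes, kv_bytes_needed):
    """Hypothetical planner sketch.

    gpus: list of (gpu_index, free_mib) tuples.
    Returns (kv_budget_bytes, tensor_split), where tensor_split is a list of
    weights ordered by GPU index, or None when an even split is assumed to fit.
    """
    n = len(gpus)
    # VRAM each GPU can actually give to weights + KV cache.
    usable_mib = {idx: free - RESERVE_MIB for idx, free in gpus}
    pool_mib = sum(free for _, free in gpus)

    # Pooled budget: total free VRAM - N * reserve - weights.
    kv_budget = (pool_mib - n * RESERVE_MIB) * MIB - model_size_bytes

    # Assumption: an even split gives every GPU 1/N of weights + KV cache,
    # and it "overflows" when that share exceeds the smallest usable GPU.
    kv_bytes = min(kv_bytes_needed, max(kv_budget, 0))
    even_share = (model_size_bytes + kv_bytes) / n
    smallest = min(usable_mib.values()) * MIB
    if even_share <= smallest:
        return kv_budget, None  # symmetric enough: omit --tensor-split

    # Asymmetric case: weight each GPU by (free - reserve).
    return kv_budget, [usable_mib[idx] for idx in sorted(usable_mib)]


budget, split = plan_tensor_parallel(
    [(0, 40960), (1, 20480)],  # 40 GiB and 20 GiB free
    model_size_bytes=28672 * MIB,
    kv_bytes_needed=4096 * MIB,
)
print(budget // MIB, split)  # 22528 [35840, 15360]
```

With these example numbers an even split would put 16 GiB on the smaller GPU, which has only 15 GiB usable after the reserve, so the sketch emits weights that could be joined into a `--tensor-split 35840,15360` argument.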

Each of these changes is necessary to prevent out-of-memory errors and crashes. They came out of extensive testing rather than static code analysis. My asymmetric multi-GPU setup was valuable for developing this PR because it's the hardest type of setup to get tensor parallelism right on (it OOMs by default).

About the per-GPU buffer reserve

In tensor mode, llama.cpp allocates a compute-graph buffer on every GPU (the logits buffer, n_batch × vocab, plus activation scratch). Unlike the KV cache, it can't be derived from the context length: llama.cpp sizes it internally via graph_reserve, it's roughly equal on each device regardless of the split, and it barely changes with context. In practice it ranged from ~2.3 GB (gemma-3-27B) to ~3.8 GB (gemma-4-31B) on a 256k-vocab model. Because --fit is disabled in tensor mode, nothing else caps the KV cache: underestimate this buffer and the load OOMs.

So instead of computing it, the PR reserves a fixed 5 GiB per GPU, above the measured worst case. It's subtracted from each GPU's free VRAM before computing --tensor-split, and held back per device when capping context. It scales with vocab/batch, not context, so a constant works; the auto-fallback to layer split covers any underestimate.
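As a rough sanity check on why the buffer tracks vocab and batch size rather than context, the logits term alone can be estimated as n_batch × vocab × bytes per value. The numbers below are assumptions for illustration (float32 logits, a batch of 2048, a vocab of 262,144), not values read out of llama.cpp, and the helper name is hypothetical.

```python
GIB = 1024 ** 3


def logits_buffer_bytes(n_batch, vocab_size, bytes_per_value=4):
    """Size of an n_batch x vocab logits buffer (float32 assumed)."""
    return n_batch * vocab_size * bytes_per_value


# Assumed example values; context length does not appear in the formula.
size = logits_buffer_bytes(n_batch=2048, vocab_size=262_144)
print(size / GIB)  # 2.0

# The fixed 5 GiB reserve leaves headroom for activation scratch on top.
reserve = 5 * GIB
print((reserve - size) / GIB)  # 3.0
```

Under these assumptions the logits buffer alone is about 2 GiB, which is at least the same order of magnitude as the measured 2.3 to 3.8 GB totals.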

Speed measurements

RTX 6000 Ada (48 GB) + RTX 3090 (24 GB), PCIe, no NVLink, no NCCL. Decode speed in tokens/sec, measured during a 512-token generation after a ~5.9k-token prompt (i.e. at a realistic context depth, not an empty context), layer split (default) vs tensor split. This is a weak interconnect, so NVLink/NCCL setups should see more.

Dense models (tensor parallelism improves decode, more so for larger models):

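| Model | Layer split (tok/s) | Tensor split (tok/s) | Change |
|---|---|---|---|
| Qwen3-8B Q4_K_M | 75.3 | 87.1 | +16% |
| gemma-3-12B UD-Q4_K_XL | 50.2 | 58.8 | +17% |
| gemma-4-31B BF16 | 9.4 | 11.5 | +22% |
| Qwen3.6-27B BF16 | 11.2 | 13.8 | +23% |
| gemma-3-27B Q8_0 | 16.5 | 25.4 | +54% |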

MoE models (no benefit, can be much slower):

| Model | Layer split (tok/s) | Tensor split (tok/s) | Change |
|---|---|---|---|
| gpt-oss-20B Q8_0 | 122.7 | 127.2 | +4% |
| gemma-4-26B-A4B UD-Q4_K_M | 76.5 | 74.7 | -2% |
| Qwen3.5-35B-A3B UD-Q6_K_XL | 140.6 | 128.3 | -9% |
| Qwen3.6-35B-A3B BF16 | 64.2 | 32.4 | -50% |

For reference, VRAM splits across both GPUs as expected. For example, gemma-3-27B uses 29.6 GB on a single GPU with layer split, and 16.1 / 15.9 GB split across both with tensor.

@oobabooga oobabooga requested a review from danielhanchen as a code owner June 6, 2026 03:59
@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds support for a Tensor Parallelism toggle (--split-mode tensor) for GGUF models on multi-GPU setups, threading the option from the frontend UI to the backend. The review feedback suggests enhancing robustness by filtering out low-VRAM GPUs (less than the 5 GiB reserve) during tensor-parallel planning to avoid OOM crashes, and refactoring a fragile test that asserts on raw source code strings to use proper mocking instead.

I am having trouble creating individual review comments. My feedback follows.

studio/backend/core/inference/llama_cpp.py (lines 3079-3115), priority: high

On multi-GPU setups, attempting to run tensor parallelism when one or more GPUs have less free VRAM than the required compute-graph buffer reserve (_TENSOR_PARALLEL_BUFFER_RESERVE_MIB, i.e., 5 GiB) will result in an immediate out-of-memory (OOM) error or crash on those low-VRAM devices.

We can prevent this by filtering the list of GPUs to only those that have enough free VRAM to accommodate the compute-graph buffer. If fewer than 2 GPUs meet this requirement, we should disable tensor_parallel early and fall back to the standard layer-split planning path. This avoids unnecessary process launches and crashes, providing a much smoother fallback experience.

                    usable_gpus = []
                    if tensor_parallel:
                        reserve_mib = self._TENSOR_PARALLEL_BUFFER_RESERVE_MIB
                        usable_gpus = [g for g in gpus if g[1] >= reserve_mib]
                        if len(usable_gpus) < 2:
                            logger.info(
                                "Tensor parallelism requested but only %d GPU(s) have "
                                "enough free VRAM (>= %d MiB) for the compute buffer; "
                                "ignoring (needs >= 2).",
                                len(usable_gpus),
                                reserve_mib,
                            )
                            tensor_parallel = False

                    if tensor_parallel and gpus:
                        # Tensor-parallel allocation: use all GPUs, weight the split
                        # by (free - buffer), and cap context to the pooled VRAM
                        # after weights + per-device compute-graph buffers. See
                        # _plan_tensor_parallel for the policy + rationale.
                        target_ctx = (
                            effective_ctx
                            if explicit_ctx
                            else (self._context_length or effective_ctx)
                        )
                        (
                            effective_ctx,
                            max_available_ctx,
                            gpu_indices,
                            tp_tensor_split,
                        ) = self._plan_tensor_parallel(
                            usable_gpus,
                            model_size,
                            target_ctx,
                            cache_type_kv = cache_type_kv,
                            n_parallel = n_parallel,
                            mtp_engaged = _mtp_will_engage,
                        )
                        use_fit = False

studio/backend/core/inference/llama_cpp.py (lines 2760-2775), priority: medium

To ensure _plan_tensor_parallel is fully defensive and self-contained, we should filter the input gpus list to only include devices that have enough free VRAM to hold the compute-graph buffer. This prevents planning tensor splits on low-VRAM GPUs that are guaranteed to OOM.

        reserve_mib = self._TENSOR_PARALLEL_BUFFER_RESERVE_MIB
        usable_gpus = [g for g in gpus if g[1] >= reserve_mib]
        gpu_indices = sorted(idx for idx, _ in usable_gpus)
        if len(gpu_indices) < 2:
            # Tensor parallelism is meaningless on <2 GPUs (the caller drops the
            # toggle before this); be defensive and never emit a split here.
            return (
                target_ctx if target_ctx > 0 else 4096,
                target_ctx if target_ctx > 0 else 4096,
                gpu_indices,
                None,
            )
        free_by_idx = {idx: free for idx, free in usable_gpus}
        pool_mib = sum(free_by_idx.values())
        kv_budget_b = (
            (pool_mib - len(gpu_indices) * reserve_mib) * 1024 * 1024 - model_size
        )

studio/backend/tests/test_tensor_parallel.py (lines 408-430), priority: medium

Asserting on the raw source text of routes/inference.py using string manipulation (find, rfind) is fragile: minor formatting changes, refactoring, or variable renaming will break the test even when the behavior is unchanged.

Instead, this retry behavior should be tested by mocking llama_backend.load_model to raise an exception on the first call and verifying that the route catches it and retries with tensor_parallel=False.

References
  1. To test a function's try...except block, mock a dependency called within the try block to raise an exception. Do not mock the entire function and set its side_effect to an exception, as this will not execute the function's own exception handling logic.
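A minimal sketch of that testing approach, using unittest.mock. The `load_with_fallback` function is a hypothetical stand-in for the route's retry logic (the real handler in routes/inference.py is not shown here); only the idea of mocking `load_model` so the first call raises comes from the suggestion above.

```python
from unittest.mock import Mock, call


def load_with_fallback(backend, model_path, tensor_parallel):
    """Hypothetical stand-in for the route: retry with layer split on failure."""
    try:
        return backend.load_model(model_path, tensor_parallel=tensor_parallel)
    except Exception:
        if not tensor_parallel:
            raise  # nothing to fall back to
        return backend.load_model(model_path, tensor_parallel=False)


def test_retries_with_layer_split():
    backend = Mock()
    # First call raises, second call returns normally. The dependency inside
    # the try block is mocked, so the function's own except branch executes.
    backend.load_model.side_effect = [
        RuntimeError("tensor split unsupported"),
        "loaded",
    ]

    result = load_with_fallback(backend, "model.gguf", tensor_parallel=True)

    assert result == "loaded"
    assert backend.load_model.call_args_list == [
        call("model.gguf", tensor_parallel=True),
        call("model.gguf", tensor_parallel=False),
    ]


test_retries_with_layer_split()
```

Because the assertions target the recorded calls rather than source text, the test survives reformatting and renaming inside the handler.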

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1399280483

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread studio/backend/routes/inference.py Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: 6e96608243

Comment thread studio/backend/routes/inference.py Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: 6e96608243

Comment thread studio/backend/routes/inference.py Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: 6bf2ab4172

Comment thread studio/backend/core/inference/llama_cpp.py Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: 2d8cb50c91

Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts