Studio: Add Tensor-Parallel llama.cpp support by oobabooga · Pull Request #6040 · unslothai/unsloth

oobabooga · 2026-06-06T03:59:18Z

This PR adds tensor parallelism support for llama.cpp through a new "Tensor Parallelism" checkbox.

When checked, it passes --split-mode tensor to llama-server.

On single-GPU setups, the checkbox is ignored even if checked.
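
As a rough illustration of the behavior described above, the flag handling could look like the sketch below. The function name `build_split_args` and its parameters are hypothetical and not taken from the PR; only the `--split-mode tensor` and `--tensor-split` arguments come from the description.

```python
from typing import List, Optional, Sequence


def build_split_args(
    tensor_parallel: bool,
    gpu_count: int,
    tensor_split: Optional[Sequence[float]] = None,
) -> List[str]:
    """Return extra llama-server arguments for the requested split mode.

    Hypothetical helper: the real Studio code may be structured differently.
    """
    # The checkbox is ignored on single-GPU machines.
    if not tensor_parallel or gpu_count < 2:
        return []
    args = ["--split-mode", "tensor"]
    if tensor_split:
        # Comma-separated proportions, one entry per GPU.
        args += ["--tensor-split", ",".join(f"{w:g}" for w in tensor_split)]
    return args


print(build_split_args(True, 1))
print(build_split_args(True, 2, [3, 1]))
```

Returning an empty list for the single-GPU case keeps the caller free of special cases: it can always append the result to the base command line.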

How

The PR doesn't just pass the flag when the checkbox is checked; it also integrates the flag with Studio's existing memory allocation framework. Specifically, a new policy was added for the tensor-parallel case:

Use all GPUs instead of trying to use the smallest viable subset.
Calculate the context using a reduced VRAM budget (total free VRAM - weights - N*reserve, where reserve is the per-GPU compute buffer), sized across all GPUs at once (reusing _estimate_kv_cache_bytes).
Pass --tensor-split weighted by free - reserve per GPU, but only on asymmetric setups where an even split would overflow the smallest GPU.
If a tensor-parallel load still fails (some architectures crash or aren't supported by --split-mode tensor), fall back to layer split automatically so the model loads regardless.
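
The budget and split rules in the list above can be sketched as follows. This is a simplified illustration, not the PR's `_plan_tensor_parallel`: the function names, the `free_mib` mapping, and the example VRAM figures are assumptions made for the example, and the "even split would overflow" test is one plausible reading of the policy.

```python
from typing import Dict, List, Optional

MIB = 1024 * 1024
RESERVE_MIB = 5 * 1024  # fixed 5 GiB per-GPU compute-buffer reserve


def kv_budget_bytes(free_mib: Dict[int, int], model_size_bytes: int) -> int:
    """Total free VRAM - weights - N * reserve, clamped at zero."""
    pool_mib = sum(free_mib.values())
    budget = (pool_mib - len(free_mib) * RESERVE_MIB) * MIB - model_size_bytes
    return max(budget, 0)


def tensor_split_weights(
    free_mib: Dict[int, int], model_size_bytes: int, kv_bytes: int
) -> Optional[List[int]]:
    """Return per-GPU weights (free - reserve), or None if an even split fits."""
    indices = sorted(free_mib)
    usable = [max(free_mib[i] - RESERVE_MIB, 0) for i in indices]
    # With an even split, every GPU holds 1/N of the weights and KV cache.
    even_share = (model_size_bytes + kv_bytes) / len(indices)
    if even_share <= min(usable) * MIB:
        return None  # no --tensor-split needed
    return usable


# Hypothetical free-VRAM figures for a 48 GB + 24 GB pair, 28 GiB of weights.
free = {0: 46000, 1: 22000}
weights_bytes = 28 * 1024 * MIB
print(kv_budget_bytes(free, weights_bytes) // MIB)                    # 29088
print(tensor_split_weights(free, weights_bytes, 4 * 1024 * MIB))      # None
print(tensor_split_weights(free, weights_bytes, 8 * 1024 * MIB))      # [40880, 16880]
```

In the last call the even share (18432 MiB per GPU) exceeds what the smaller GPU can hold after the reserve (16880 MiB), so the sketch emits weights proportional to free - reserve.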

Each of these changes is necessary to prevent out-of-memory errors and crashes. They were obtained through extensive testing rather than static code analysis. My asymmetric multi-GPU setup was valuable for developing this PR because it's the hardest type of setup to get tensor parallelism right on (it OOMs by default).

About the per-GPU buffer reserve

In tensor mode, llama.cpp allocates a compute-graph buffer on every GPU (the logits buffer, n_batch x vocab, plus activation scratch). Unlike the KV cache, it can't be derived from the context length: llama.cpp sizes it internally via graph_reserve, it's roughly equal on each device regardless of the split, and it barely changes with context. In practice, on 256k-vocab models, it ranged from ~2.3 GB (gemma-3-27B) to ~3.8 GB (gemma-4-31B). Because --fit is disabled in tensor mode, nothing else caps the KV cache: if the reserve is underestimated, the load OOMs.

So instead of computing it, the PR reserves a fixed 5 GiB per GPU, above the measured worst case. The reserve is subtracted from each GPU's free VRAM before computing --tensor-split, and held back per device when capping context. The buffer scales with vocab/batch, not context, so a constant works; the auto-fallback to layer split covers any underestimate.
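
To see why the buffer tracks vocab and batch size rather than context, here is a back-of-the-envelope estimate of the logits part alone. The inputs are assumptions for illustration (a batch of 2048 tokens, a 262,144-entry vocabulary, 4-byte float logits); the real buffer is sized by llama.cpp's graph_reserve and also includes activation scratch, so this is a lower-bound sketch rather than the actual formula.

```python
GIB = 1024 ** 3


def logits_buffer_bytes(n_batch: int, vocab_size: int, bytes_per_value: int = 4) -> int:
    """Size of an n_batch x vocab logits matrix; context length does not appear."""
    return n_batch * vocab_size * bytes_per_value


# Assumed values, not measurements from the PR.
size = logits_buffer_bytes(n_batch=2048, vocab_size=262_144)
print(f"{size / GIB:.2f} GiB per GPU")  # 2.00 GiB per GPU
```

Under these assumptions the logits matrix alone takes 2 GiB per device, which is the same order of magnitude as the measured buffers and leaves the 5 GiB reserve with headroom for activation scratch.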

Speed measurements

RTX 6000 Ada (48 GB) + RTX 3090 (24 GB), PCIe, no NVLink, no NCCL. Decode speed in tokens/sec, measured during a 512-token generation after a ~5.9k-token prompt (i.e. at a realistic context depth, not an empty context), layer split (default) vs tensor split. This is a weak interconnect, so NVLink/NCCL setups should see larger gains.

Dense models (tensor parallelism improves decode, more so for larger models):

Model	layer	tensor	change
Qwen3-8B Q4_K_M	75.3	87.1	+16%
gemma-3-12B UD-Q4_K_XL	50.2	58.8	+17%
gemma-4-31B BF16	9.4	11.5	+22%
Qwen3.6-27B BF16	11.2	13.8	+23%
gemma-3-27B Q8_0	16.5	25.4	+54%

MoE models (no benefit, can be much slower):

Model	layer	tensor	change
gpt-oss-20B Q8_0	122.7	127.2	+4%
gemma-4-26B-A4B UD-Q4_K_M	76.5	74.7	-2%
Qwen3.5-35B-A3B UD-Q6_K_XL	140.6	128.3	-9%
Qwen3.6-35B-A3B BF16	64.2	32.4	-50%

For reference, VRAM splits across both GPUs as expected. For example, gemma-3-27B uses 29.6 GB on a single GPU with layer split, and 16.1 / 15.9 GB split across both with tensor.

gemini-code-assist

Code Review

This pull request adds support for a Tensor Parallelism toggle (--split-mode tensor) for GGUF models on multi-GPU setups, threading the option from the frontend UI to the backend. The review feedback suggests enhancing robustness by filtering out low-VRAM GPUs (less than the 5 GiB reserve) during tensor-parallel planning to avoid OOM crashes, and refactoring a fragile test that asserts on raw source code strings to use proper mocking instead.

studio/backend/core/inference/llama_cpp.py (3079-3115)

On multi-GPU setups, attempting to run tensor parallelism when one or more GPUs have less free VRAM than the required compute-graph buffer reserve (_TENSOR_PARALLEL_BUFFER_RESERVE_MIB, i.e., 5 GiB) will result in an immediate out-of-memory (OOM) error or crash on those low-VRAM devices.

We can prevent this by filtering the list of GPUs to only those that have enough free VRAM to accommodate the compute-graph buffer. If fewer than 2 GPUs meet this requirement, we should disable tensor_parallel early and fall back to the standard layer-split planning path. This avoids unnecessary process launches and crashes, providing a much smoother fallback experience.

                    usable_gpus = []
                    if tensor_parallel:
                        reserve_mib = self._TENSOR_PARALLEL_BUFFER_RESERVE_MIB
                        usable_gpus = [g for g in gpus if g[1] >= reserve_mib]
                        if len(usable_gpus) < 2:
                            logger.info(
                                "Tensor parallelism requested but only %d GPU(s) have "
                                "enough free VRAM (>= %d MiB) for the compute buffer; "
                                "ignoring (needs >= 2).",
                                len(usable_gpus),
                                reserve_mib,
                            )
                            tensor_parallel = False

                    if tensor_parallel and gpus:
                        # Tensor-parallel allocation: use all GPUs, weight the split
                        # by (free - buffer), and cap context to the pooled VRAM
                        # after weights + per-device compute-graph buffers. See
                        # _plan_tensor_parallel for the policy + rationale.
                        target_ctx = (
                            effective_ctx
                            if explicit_ctx
                            else (self._context_length or effective_ctx)
                        )
                        (
                            effective_ctx,
                            max_available_ctx,
                            gpu_indices,
                            tp_tensor_split,
                        ) = self._plan_tensor_parallel(
                            usable_gpus,
                            model_size,
                            target_ctx,
                            cache_type_kv = cache_type_kv,
                            n_parallel = n_parallel,
                            mtp_engaged = _mtp_will_engage,
                        )
                        use_fit = False

studio/backend/core/inference/llama_cpp.py (2760-2775)

To ensure _plan_tensor_parallel is fully defensive and self-contained, we should filter the input gpus list to only include devices that have enough free VRAM to hold the compute-graph buffer. This prevents planning tensor splits on low-VRAM GPUs that are guaranteed to OOM.

        reserve_mib = self._TENSOR_PARALLEL_BUFFER_RESERVE_MIB
        usable_gpus = [g for g in gpus if g[1] >= reserve_mib]
        gpu_indices = sorted(idx for idx, _ in usable_gpus)
        if len(gpu_indices) < 2:
            # Tensor parallelism is meaningless on <2 GPUs (the caller drops the
            # toggle before this); be defensive and never emit a split here.
            return (
                target_ctx if target_ctx > 0 else 4096,
                target_ctx if target_ctx > 0 else 4096,
                gpu_indices,
                None,
            )
        free_by_idx = {idx: free for idx, free in usable_gpus}
        pool_mib = sum(free_by_idx.values())
        kv_budget_b = (
            (pool_mib - len(gpu_indices) * reserve_mib) * 1024 * 1024 - model_size
        )

studio/backend/tests/test_tensor_parallel.py (408-430)

Asserting on the raw source code of routes/inference.py using string manipulation (find, rfind) is fragile: the test breaks on minor formatting changes, refactoring, or variable renaming.

Instead, this retry behavior should be tested by mocking llama_backend.load_model to raise an exception on the first call and verifying that the route catches it and retries with tensor_parallel=False.

References

To test a function's try...except block, mock a dependency called within the try block to raise an exception. Do not mock the entire function and set its side_effect to an exception, as this will not execute the function's own exception handling logic.
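
A minimal sketch of the mocking approach suggested above, using only the standard library. The helper `load_with_fallback` is a hypothetical stand-in for the route's retry logic, and catching `RuntimeError` is an assumption; the actual route and exception types in routes/inference.py may differ.

```python
from unittest import mock


def load_with_fallback(backend, model_path: str, tensor_parallel: bool) -> bool:
    """Hypothetical route helper: try tensor split, fall back to layer split.

    Returns the tensor_parallel value that was actually used.
    """
    try:
        backend.load_model(model_path, tensor_parallel=tensor_parallel)
        return tensor_parallel
    except RuntimeError:
        if not tensor_parallel:
            raise  # nothing left to fall back to
        backend.load_model(model_path, tensor_parallel=False)
        return False


def test_retries_with_layer_split() -> None:
    backend = mock.Mock()
    # First call (tensor split) raises, second call (layer split) succeeds.
    backend.load_model.side_effect = [RuntimeError("tensor split unsupported"), None]

    used = load_with_fallback(backend, "model.gguf", tensor_parallel=True)

    assert used is False
    assert backend.load_model.call_args_list == [
        mock.call("model.gguf", tensor_parallel=True),
        mock.call("model.gguf", tensor_parallel=False),
    ]


test_retries_with_layer_split()
```

Only the dependency inside the try block is mocked, so the function's own exception handling runs for real, which is the point of the reference rule quoted above.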

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1399280483

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e96608243

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e96608243

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6bf2ab4172

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2d8cb50c91

Studio: Add Tensor-Parallel llama.cpp support

1399280

oobabooga requested a review from danielhanchen as a code owner June 6, 2026 03:59

[pre-commit.ci] auto fixes from pre-commit.com hooks

f183dee

for more information, see https://pre-commit.ci

gemini-code-assist Bot reviewed Jun 6, 2026

chatgpt-codex-connector Bot reviewed Jun 6, 2026

Comment thread studio/backend/routes/inference.py Outdated

Studio: harden Tensor-Parallel fallback and GPU selection

6e96608

chatgpt-codex-connector Bot reviewed Jun 6, 2026

Comment thread studio/backend/routes/inference.py Outdated

chatgpt-codex-connector Bot reviewed Jun 6, 2026

Comment thread studio/backend/routes/inference.py Outdated

oobabooga and others added 2 commits June 6, 2026 01:55

Studio: reconcile split-mode extras and harden tensor-split planning

c5db8df

[pre-commit.ci] auto fixes from pre-commit.com hooks

6bf2ab4

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed Jun 6, 2026

Comment thread studio/backend/core/inference/llama_cpp.py Outdated

oobabooga and others added 2 commits June 6, 2026 02:17

Studio: reconcile split-mode extras in backend duplicate-load guard

c21b8a3

[pre-commit.ci] auto fixes from pre-commit.com hooks

2d8cb50

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed Jun 6, 2026

Comment thread studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts

Studio: preserve inherited non-tensor split modes on reload

dff1f59

oobabooga wants to merge 8 commits into unslothai:main from oobabooga:studio-tensor-parallelism
