Hi — we spent several days root-causing the Gemma-4-12B perplexity anomalies
and want to share results + reproduction instruments, because the June-5
rebuild did not fix the underlying problem.
Method (engine-independent). We wrote a from-scratch reference forward
for gemma-4 directly off the official safetensors + config (no llama.cpp, no
transformers — ~130 lines of numpy/torch). On a verified token fixture
(identical to HF tokenizer.json, 5431/5431) it measures the TRUE
full-precision wikitext chunk-0 PPL at 4.6776, with targets at max-logit
(NLL ≈ 0.001). The same script can dequantize any GGUF's tensors and run the
identical arithmetic over them — which removes the inference engine as a
variable entirely.
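
For readers who want to check the metric itself, here is a minimal sketch of a teacher-forced perplexity computation over one chunk. It is not the `_t2_manual_forward.py` script; the function name `teacher_forced_ppl` and the `[T, V]` logits layout are assumptions made for illustration. It only shows the arithmetic that turns a logits matrix and the fixture tokens into a PPL figure.

```python
import numpy as np

def teacher_forced_ppl(logits: np.ndarray, tokens: np.ndarray) -> float:
    """Perplexity of one chunk under teacher forcing.

    logits: [T, V], row t is the model output after consuming tokens[: t + 1]
            (assumed layout, hypothetical).
    tokens: [T] integer token ids of the fixture.
    """
    # Row t predicts token t + 1, so drop the last row and the first token.
    pred = logits[:-1].astype(np.float64)
    target = np.asarray(tokens)[1:]
    # Numerically stable log-softmax over the vocabulary axis.
    pred = pred - pred.max(axis=-1, keepdims=True)
    log_probs = pred - np.log(np.exp(pred).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each ground-truth next token.
    nll = -log_probs[np.arange(target.shape[0]), target]
    return float(np.exp(nll.mean()))
```

Because the function only sees logits and token ids, the same call works whether the logits came from the bf16 checkpoint or from dequantized GGUF tensors, which is the point of keeping the protocol fixed.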
Results (same fixture, same protocol, only the weight bytes change):

| weights | PPL (our forward) | PPL (llama.cpp) |
|---|---|---|
| official bf16 safetensors | 4.68 | n/a |
| pre-fix Q4_K_M GGUF | 271.2 | 505.9 |
| pre-fix QAT-Q4_0 GGUF | 364.3 | 397.5 |
| rebuilt (post-June-5) gemma-4-12B-it-qat-UD-Q4_K_XL | 192.9 | n/a |

Two independent engines agree per artifact that PPL is far above the 4.68 reference, so llama.cpp's forward is NOT the problem; the artifacts are. PR ggml-org/llama.cpp#24118 fixed vision/audio projector config handling; the text-tower weight damage predates and survives it.
Damage anatomy (forensics scripts included):
- No layer permutation: the blk↔layer mapping is exactly diagonal
(cross-layer cosines ≈ 0).
- In-place damage with a period-6 signature: vs the official checkpoint,
layers ≡ 0,1 (mod 6) sit at cos 0.93–0.97 while the other four sit at
0.24–0.70 (measured on the pre-fix K_M, which shares the bf16 source).
- The per-layer layer_output_scale class is independently defective:
restoring ONLY those scalars from the checkpoint takes the QAT artifact
from 364 → 97. Restoring norms or embeddings makes it worse (they are
coherent with the damaged weights — so the matmul tensors are damaged too).
- Generation looks deceptively OK (confident positions stay correct), which
is why this slipped through smoke tests. PPL on a fixed fixture catches it.
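
The cosine checks in the list above can be sketched roughly as follows. This is not `_t2g_perm_hunt.py`; the helper names (`cosine`, `cosine_matrix`, `diagonal_by_residue`) are hypothetical, and it assumes you already have one dequantized tensor per layer for both the checkpoint and the GGUF. The full matrix answers the permutation question (off-diagonal entries near 0, each row's maximum on the diagonal), and grouping the diagonal by layer index mod 6 exposes the periodic signature.

```python
import numpy as np

def cosine(a, b) -> float:
    """Cosine similarity of two tensors, flattened and computed in float64."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def cosine_matrix(ref_layers, test_layers) -> np.ndarray:
    """Entry [i, j] is the cosine between reference layer i and test layer j."""
    return np.array([[cosine(r, t) for t in test_layers] for r in ref_layers])

def diagonal_by_residue(cos_mat: np.ndarray, period: int = 6) -> dict:
    """Mean same-layer cosine for each residue class of the layer index."""
    diag = np.diag(cos_mat)
    return {r: float(diag[r::period].mean())
            for r in range(period) if diag[r::period].size}
```

A healthy artifact should give a near-uniform dictionary from `diagonal_by_residue`; two residues standing well above the other four is the pattern reported here.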
Reproduction: all instruments + receipts (MIT) are here:
https://github.com/nihilistau/shannon-prime-lattice/tree/main/tests/gemma4_gold
— _t2_manual_forward.py (gold), _t2c_gold_on_gguf.py (grade any GGUF),
_t2g_perm_hunt.py (cosine forensics). A step-by-step verification + fix
write-up: https://github.com/nihilistau/Position_Is_Arithmetic/blob/main/GEMMA4-QUANT-FIX.md
Suggested fix on your side: re-convert from the official safetensors and
verify at the weight level before publishing — per-layer cosine vs the
checkpoint (should be >0.99 for ≥8-bit tensors, uniform across layers) and a
fixed-fixture teacher-forced PPL within a few percent of 4.68. Happy to help
validate a candidate rebuild with the instruments above.
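
As an illustration of what a pre-publish gate could look like, here is a small sketch that applies the two criteria above to numbers you have already measured. The function name `check_rebuild` is hypothetical, and two thresholds are assumptions of this sketch rather than values from our measurements: `max_spread` (how uniform the layers must be) and `ppl_tol` (0.05, one reading of "a few percent"). Only the 0.99 cosine floor and the 4.68 reference come from the text.

```python
def check_rebuild(layer_cosines, ppl, ref_ppl=4.68, min_cos=0.99,
                  max_spread=0.01, ppl_tol=0.05):
    """Return (ok, problems) for a candidate rebuild.

    layer_cosines: per-layer cosine vs the official checkpoint (>= 8-bit tensors).
    ppl:           fixed-fixture teacher-forced perplexity of the candidate.
    max_spread and ppl_tol are assumed thresholds, not measured ones.
    """
    problems = []
    cos = [float(c) for c in layer_cosines]
    if not cos:
        problems.append("no layer cosines supplied")
    else:
        # Every layer must clear the floor, not just the average.
        if min(cos) <= min_cos:
            problems.append(f"min layer cosine {min(cos):.3f} is not > {min_cos}")
        # Uniformity: a periodic pattern shows up as a wide min-max spread.
        if max(cos) - min(cos) > max_spread:
            problems.append(f"layer cosine spread {max(cos) - min(cos):.3f} exceeds {max_spread}")
    # Relative deviation of the candidate PPL from the reference PPL.
    rel = abs(ppl - ref_ppl) / ref_ppl
    if rel > ppl_tol:
        problems.append(f"PPL {ppl:.2f} deviates {rel:.1%} from reference {ref_ppl}")
    return (not problems, problems)
```

Returning the list of failed checks rather than a bare boolean makes it easier to tell a weight-level defect from a PPL-only regression.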
Also FYI: gemma-4-12B is unusually PTQ-hostile — we measured naive
all-tensor symmetric int4 at +45% PPL even from clean weights; keeping
attention/down-proj/embed at 8-bit and quantizing only FFN gate/up to 4-bit
lands at +9.6%. Recipe table in the write-up.
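
To make the terms concrete, here is a minimal sketch of symmetric round-to-nearest quantization and of the mixed-precision split described above. It is not the recipe from the write-up: the per-row scale granularity is an assumption (the text above does not state one), and the tensor-name substrings in `bits_for` follow the common Hugging Face `gate_proj` / `up_proj` naming purely for illustration.

```python
import numpy as np

def quantize_symmetric(w, bits: int) -> np.ndarray:
    """Quantize-dequantize w with a symmetric grid and one scale per row (assumed)."""
    w = np.asarray(w, dtype=np.float64)
    qmax = 2 ** (bits - 1) - 1            # 7 for int4, 127 for int8
    max_abs = np.abs(w).max(axis=-1, keepdims=True)
    # Rows that are entirely zero get scale 1.0 to avoid division by zero.
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.rint(w / scale), -qmax, qmax)
    return q * scale

def bits_for(name: str) -> int:
    """4-bit for FFN gate/up projections, 8-bit for everything else.

    The substrings are hypothetical naming; adapt them to your checkpoint.
    """
    return 4 if ("gate_proj" in name or "up_proj" in name) else 8
```

Usage would be along the lines of `quantize_symmetric(tensor, bits_for(name))` for each weight, followed by the same fixed-fixture PPL measurement to compare against 4.68.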