Use static php array files for l10n cache at WMF (instead of CDB)
Open, High, Public

Description

Facebook's Fred Emmott works on benchmarking HHVM's performance when running various open-source PHP frameworks. This puts him in contact with MediaWiki's codebase. He wrote in to suggest that we experiment with using plain PHP files instead of CDB for the l10n cache. We should try that and see whether it improves performance.


Deployment plan (see also T99740#5165753 by @Krinkle):

  1. Have Scap (also) generate l10n cache in the array format whenever it calls rebuildLocalisationCache.
  2. Package new scap and have it deployed to production (T245530); then run a full scap.
  3. Enable array format for Beta Cluster wikis.
  4. Enable array format for production:
    • bump opcache settings on production
    • 2026: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1281779
    • Enable array format for mwdebug1001/mwdebug2001 for performance testing (establish baseline on x002, detect difference if any, confirm difference in other DC). This is temporary. Undo afterward to reduce differences between debug and prod.
    • Enable array format for group0 (i.e. testwikis, closed, mw.org, office).
    • Enable array format for wikidata.org in production.
    • Enable array format for commons.wikimedia.org in production.
    • Enable array format for group1 in production.
    • Enable array format for group2 in production. (all wikis)
    • Update Scap config to no longer generate the old cdb format via rebuildLocalisationCache.
    • Remove config switch.

Details

Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/mediawiki-config | master | +3 -0
operations/puppet | production | +10 -10
mediawiki/core | master | +43 -46
operations/puppet | production | +2 -0
mediawiki/core | master | +315 -105
operations/mediawiki-config | master | +15 -0
operations/mediawiki-config | master | +0 -15
mediawiki/tools/scap | master | +21 -3
operations/mediawiki-config | master | +2 -3
operations/mediawiki-config | master | +3 -2
operations/mediawiki-config | master | +3 -1
operations/mediawiki-config | master | +9 -1
operations/mediawiki-config | master | +1 -0
operations/mediawiki-config | master | +2 -0
operations/mediawiki-config | master | +5 -21
operations/mediawiki-config | master | +15 -0
mediawiki/core | wmf/1.35.0-wmf.11 | +10 -0
mediawiki/core | master | +10 -0
mediawiki/core | master | +13 -1
operations/mediawiki-config | master | +1 -1
mediawiki/core | master | +21 -14
operations/mediawiki-config | master | +8 -0
operations/mediawiki-config | master | +5 -0
mediawiki/core | wmf/1.26wmf9 | +147 -1
mediawiki/core | master | +147 -1
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
tasks.py: Mention l10n type in Running rebuildLocalisationCache.php message | repos/releng/scap!1166 | dancy | master-I7ba62bc94baa4184098c569163caff182a9cc894 | master

Event Timeline

Change 630592 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/tools/scap@master] Feature flag PHP L10n generation

https://gerrit.wikimedia.org/r/630592

Change 630592 merged by jenkins-bot:
[mediawiki/tools/scap@master] Feature flag PHP L10n generation

https://gerrit.wikimedia.org/r/630592

Change 651228 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[operations/mediawiki-config@master] Disable PHP L10n in beta cluster

https://gerrit.wikimedia.org/r/651228

Change 651228 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable PHP L10n in beta cluster

https://gerrit.wikimedia.org/r/651228

Krinkle changed the task status from Stalled to Open. (Jul 28 2022, 4:05 AM)
Krinkle removed Krinkle as the assignee of this task.

No longer stalled as T266055 is now resolved for prod. Unassigning for now until it comes around as a scheduled goal.

Change 883707 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[operations/mediawiki-config@master] Revert "Disable PHP L10n in beta cluster"

https://gerrit.wikimedia.org/r/883707

This is somewhat important for mw-on-k8s so even if the perf gains are small, it's still a good cause to see done.

Can you expand on what makes this important for mw-on-k8s?

I recall serviceops mentioning this in one of their slides as one of the mw-on-k8s challenges, because cdb files are too big for images and harder to maintain and build. I can't find the slides though :/

From the point of view of image size and memory usage, it might be better to merge all the languages into a single file, and combine identical values by having a numerically indexed global value array. Maybe also abbreviate the message keys by mapping them to integers. We can tolerate a little bit of indirection at runtime.
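As a rough illustration of the idea above (one merged structure, identical values stored once, message keys mapped to integers), here is a minimal Python sketch. The function names and the sample messages are hypothetical and are not taken from MediaWiki.

```python
def assign_id(item, ids, items):
    """Return a stable integer id for item, appending it on first sight."""
    if item not in ids:
        ids[item] = len(items)
        items.append(item)
    return ids[item]


def build_shared_table(messages_by_lang):
    """Merge per-language message maps into one structure with shared values.

    Returns (keys, key_ids, values, table), where
    table[lang][key_id] == value_id and values[value_id] is the message text.
    """
    keys, key_ids = [], {}      # integer id <-> message key
    values, value_ids = [], {}  # each distinct message text is stored once
    table = {}
    for lang, messages in messages_by_lang.items():
        table[lang] = {
            assign_id(key, key_ids, keys): assign_id(text, value_ids, values)
            for key, text in messages.items()
        }
    return keys, key_ids, values, table


def lookup(shared, lang, key):
    """Resolve a message through the two levels of indirection."""
    keys, key_ids, values, table = shared
    return values[table[lang][key_ids[key]]]


# Hypothetical sample data: three languages, only three distinct texts.
sample = {
    "en": {"ok": "OK", "cancel": "Cancel"},
    "en-gb": {"ok": "OK", "cancel": "Cancel"},
    "de": {"ok": "OK", "cancel": "Abbrechen"},
}
shared = build_shared_table(sample)
```

The cost is the extra indirection on every read, which is the trade-off the comment says we can tolerate.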

In T99740#6111784, @Joe wrote:

Basically, my precondition for seeing this in production right now would be:

5 years later a number of these are now taken care of as a side effect of other improvements in the hosting stack.

  • Stop revalidating opcache (which seems a good idea given the occasional corruptions we see anyways)
  • Rolling restart php-fpm with every scap run (this is currently supported in scap, but needs to be tested)

MediaWiki and php-fpm now live inside an OCI container as part of a Kubernetes Pod. Code only changes via container replacement which means the opcache doesn't need to check for on-disk updates and we do get a fresh php-fpm runtime environment for each deployment.

  • Set opcache to be what can contain one train release, not multiple ones like we do today

The Pretrain project (T369112) builds single MediaWiki version PHP containers. That might make it a useful test platform for future experiments in replacing CDB with something else. We do not yet have a functional deployment of Pretrain, so it is not an immediately available operational platform for new experiments, but it is likely to become available in the next 4-6 months. I have a hope that one of the projects to follow the initial Pretrain deployment will be single MediaWiki version containers for the other wikis as well.

  • Make checks *pre-deploy* to ensure we don't get over said limit.

A validation of opcache utilization sounds like something we could work into the deployment process. I'm not exactly sure what we would need to do if the limit validation failed, but I'm sure that could be worked out.

From the point of view of image size and memory usage, it might be better to merge all the languages into a single file, and combine identical values by having a numerically indexed global value array. Maybe also abbreviate the message keys by mapping them to integers. We can tolerate a little bit of indirection at runtime.

I did a brief experiment in this vein; a quick indirection level still using CDBs reduced my l10n cache dir from 850M to 475M, and compressing serialized message payloads with deflate got it down to 300M. Using a lower-level index over the id numbers vs a hashmap for the id->id and id->string lookups should save further space.

[Update] Yeah, using binary packed arrays instead of CDB for the id->id and id->string mappings gets it down to like 140M with the smaller indexes and deflate compression for the payloads. This is a 5x reduction in file size that has to be shuffled around for _relatively_ little added complexity, which feels like a good start.
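To make the "packed index plus deflate" idea concrete, here is a small Python sketch. The file layout (a count, a table of 32-bit offsets, then compressed payloads) is an assumption made for illustration and is not the format used in the experiment. It uses Python's `zlib`, which wraps deflate in a small zlib header.

```python
import struct
import zlib


def pack_strings(strings):
    """Pack strings as: uint32 count, (count + 1) uint32 offsets, payloads.

    Each payload is compressed separately so one id can be read on its own.
    """
    blobs = [zlib.compress(s.encode("utf-8")) for s in strings]
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))  # last entry marks the end
    header = struct.pack("<I", len(blobs))
    index = struct.pack(f"<{len(offsets)}I", *offsets)
    return header + index + b"".join(blobs)


def read_string(packed, string_id):
    """Look up one string by id using the offset table (no hashmap needed)."""
    (count,) = struct.unpack_from("<I", packed, 0)
    if not 0 <= string_id < count:
        raise IndexError(string_id)
    start, end = struct.unpack_from("<2I", packed, 4 + 4 * string_id)
    base = 4 + 4 * (count + 1)  # payloads begin after header and index
    return zlib.decompress(packed[base + start:base + end]).decode("utf-8")


# Hypothetical sample payloads.
packed = pack_strings(["Main Page", "Hauptseite", "Page principale"])
```

Compressing very short strings one by one can make them larger, so a real format would more likely compress larger serialized payloads, as the comment describes.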

Container deploys might also benefit from breaking up the common strings table into multiple files, where we have an "old" common file that is retained as-is (and doesn't have to be re-sent each deploy) and a "new" file that collects any new strings not in the old common file.
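A minimal Python sketch of that old/new split, assuming strings are referenced by a (table, index) pair; the function name and reference shape are hypothetical.

```python
def split_common(old_common, current_strings):
    """Reference strings from a frozen "old" table where possible.

    Strings missing from old_common are collected into a "new" table that
    would ship with each deploy. Returns (new_common, refs), where every
    ref is ("old", index) or ("new", index).
    """
    old_ids = {text: i for i, text in enumerate(old_common)}
    new_common, new_ids = [], {}
    refs = []
    for text in current_strings:
        if text in old_ids:
            refs.append(("old", old_ids[text]))
        else:
            if text not in new_ids:
                new_ids[text] = len(new_common)
                new_common.append(text)
            refs.append(("new", new_ids[text]))
    return new_common, refs


# Hypothetical data: only "Publish" is new in this deploy.
old_common = ["OK", "Cancel", "Save"]
new_common, refs = split_common(old_common, ["OK", "Publish", "Save", "Publish"])
```

Only `new_common` and the references change between deploys, so the old table's image layer would not need to be re-sent.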

https://phabricator.wikimedia.org/T410694#11398762

Very happy with the results of my holiday hacking: T410694#11418011

  • dropped the exotic binary file experiments, the opcode cache just performs too much better with pre-instantiated strings and arrays
  • implemented read-time fallback processing and message key string indirection to massively reduce on-disk file size and opcache size of cached arrays
  • removed the duplicate list of message keys since it can be read from a common list
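The read-time fallback idea can be sketched in a few lines of Python. This is an illustration only, not the LCStoreStaticArray code; the fallback chains and messages below are made up.

```python
def get_message(store, fallbacks, lang, key):
    """Look a key up in lang first, then walk its fallback chain.

    store holds only each language's own messages, so nothing inherited
    from a fallback language is duplicated in the per-language data.
    """
    for code in [lang, *fallbacks.get(lang, [])]:
        messages = store.get(code, {})
        if key in messages:
            return messages[key]
    return None


# Hypothetical fallback chains and per-language data.
fallbacks = {"de-at": ["de", "en"], "de": ["en"], "en": []}
store = {
    "en": {"ok": "OK", "cancel": "Cancel", "january": "January"},
    "de": {"cancel": "Abbrechen", "january": "Januar"},
    "de-at": {"january": "Jänner"},
}
```

Here `de-at` stores a single message and resolves the rest through `de` and `en` at read time, which is where the on-disk and opcache savings come from.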

I think this is quite promising, and with some cleanup would be worth considering for some testing in a more realistic environment.

Change #1212717 had a related patch set uploaded (by Krinkle; author: Bvibber):

[mediawiki/core@master] Language: shrink LCStoreStaticArray by doing fallback at read time

https://gerrit.wikimedia.org/r/1212717

Change #1212717 merged by jenkins-bot:

[mediawiki/core@master] Language: shrink LCStoreStaticArray by doing fallback at read time

https://gerrit.wikimedia.org/r/1212717

Change #1219190 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] scap: Add php_l10n build in Beta Cluster

https://gerrit.wikimedia.org/r/1219190

Change #1219190 merged by Ori:

[operations/puppet@production] scap: Add php_l10n build in Beta Cluster

https://gerrit.wikimedia.org/r/1219190

Change #1219155 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] Language: minor tweak and simplify in LCStoreStaticArray

https://gerrit.wikimedia.org/r/1219155

Did another quick memory test to refresh my data on my local dev machine at the hackathon (T424284#11879476)

Key notes:

  • The opcache's interned strings region comes out of your main opcache memory and reserves the entire area. Choose the size wisely!
    • 256 MB of string interning cache will likely hold all our messages in every language in production, but double-check this.
  • Hitting all languages with this set of extensions eats 129 megabytes of string interning cache over CDB, with an extra million or so strings.
    • The extra RAM usage is because it forces all strings to be cached even if they are never individually fetched.
    • Running multiple MediaWiki versions on the same PHP server likely will not significantly add to this usage, because most strings will be identical to their other-version siblings!
    • Ran a test with two identical wiki frontends and dev trees; confirmed that strings increase by only 1-2 MB from adding the second wiki.
  • Array backend is using 75 megabytes of additional opcache vs CDB, most likely from the backing hashmaps for array literals in the .php files. (This number is significantly reduced since our old production test, thanks to my changes a few months ago to de-duplicate data in the arrays.)
    • Running multiple MediaWiki versions likely will increase this amount linearly by the number of live versions. I don't think it can de-duplicate hashmaps the way it de-duplicates strings, so files in one version that are identical to another version's may still use twice the data.
    • Ran a test with two identical wiki frontends and dev trees; confirmed that total opcache usage on the second wiki goes up by 113 MB on array backend vs a smaller 47 MB on cdb backend. This is a better ratio than I feared. :)
  • I have not attempted to compare runtime memory usage (eg, everything outside the opcache) yet.

Implications of the multi-version setup for our kubernetes setup:

  • if we run multiple versions in one pod:
    • strings will be de-duplicated between versions
    • array data will NOT be de-duplicated between versions
  • if we run multiple versions on separate pods on the same computer:
    • strings will NOT be de-duplicated between versions
    • array data will NOT be de-duplicated between versions
  • if we run one version per pod per computer:
    • you only have the weight of one version :D
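The three layouts above can be turned into a back-of-the-envelope estimate. The Python sketch below is a simplified model, not a measurement: it assumes strings are fully shared between versions inside one pod and that array data scales linearly with the number of versions, and it only counts the extra l10n usage over CDB. The function and layout names are hypothetical.

```python
def l10n_opcache_mb(strings_mb, arrays_mb, versions, layout):
    """Estimate extra l10n opcache use per computer, in MB.

    Simplifying assumptions: strings de-duplicate perfectly within one
    pod, and array data never de-duplicates.
    """
    if layout == "one_pod":
        # Strings shared across versions; arrays counted once per version.
        return strings_mb + arrays_mb * versions
    if layout == "separate_pods":
        # Each pod has its own opcache, so nothing is shared.
        return (strings_mb + arrays_mb) * versions
    if layout == "single_version":
        # One version per pod per computer.
        return strings_mb + arrays_mb
    raise ValueError(f"unknown layout: {layout}")


# Using the dev-machine figures quoted above (129 MB strings, 75 MB arrays).
shared_pod = l10n_opcache_mb(129, 75, versions=2, layout="one_pod")
separate = l10n_opcache_mb(129, 75, versions=2, layout="separate_pods")
single = l10n_opcache_mb(129, 75, versions=1, layout="single_version")
```

With two versions this model gives 279 MB for one shared pod, 408 MB for two separate pods, and 204 MB for a single version; real numbers will differ because strings are only mostly identical between versions.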

Change #1281609 had a related patch set uploaded (by Bvibber; author: Bvibber):

[operations/mediawiki-config@master] Enable LCStoreStaticArray on beta for live performance testing

https://gerrit.wikimedia.org/r/1281609

Notes:

It looks like Timo already enabled generation for beta on scap config, woohoo

Open questions for 2026:

  • Can we proceed with live testing on beta or should that wait on pretrain (T369112)?
  • (Timo and Giuseppe both indicate we should be able to move forward with this on the current system.)
  • Does anything else in MW core or infrastructure need tweaking before we start testing this? Likely have to update these PHP configs:
profile::mediawiki::php::fpm_config:
  opcache.interned_strings_buffer: 96
  opcache.memory_consumption: 1024

to something like this:

profile::mediawiki::php::fpm_config:
  opcache.interned_strings_buffer: 256
  opcache.memory_consumption: 1536

to accommodate increased use of the opcache for both strings and for array data.
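Because the interned strings region is carved out of the main opcache memory (as noted earlier in this task), the room left for everything else is the difference between the two settings. A small Python sketch of that arithmetic, using the values above; the helper name is hypothetical.

```python
def non_string_budget_mb(fpm_config):
    """Opcache memory left after reserving the interned strings buffer."""
    return (
        fpm_config["opcache.memory_consumption"]
        - fpm_config["opcache.interned_strings_buffer"]
    )


current = {
    "opcache.interned_strings_buffer": 96,
    "opcache.memory_consumption": 1024,
}
proposed = {
    "opcache.interned_strings_buffer": 256,
    "opcache.memory_consumption": 1536,
}

current_budget = non_string_budget_mb(current)    # 1024 - 96
proposed_budget = non_string_budget_mb(proposed)  # 1536 - 256
```

So the proposed change adds 160 MB for interned strings and raises the remaining budget from 928 MB to 1280 MB, 352 MB more for compiled scripts and array data.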

Who can own this? I need to coordinate with my team's PM and I expect to have limited time available to help with production testing, so things like "turn it off if it breaks" that have very short turn-around times will need someone from SRE. :)

Change #1281779 had a related patch set uploaded (by Bvibber; author: Bvibber):

[operations/puppet@production] Bump opcache memory limits for LCStoreStaticArray

https://gerrit.wikimedia.org/r/1281779

Change #1281609 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable LCStoreStaticArray on beta for live performance testing

https://gerrit.wikimedia.org/r/1281609

Now live on beta

https://beta-update.wmcloud.org/202605031400.log

[14:00:08] 14:00:08 Started scap prep auto
[14:00:08] 14:00:08 Update https://gerrit.wikimedia.org/r/operations/mediawiki-config (master branch) in /srv/mediawiki-staging
[14:00:14] 14:00:14 Changes pulled down since last fetch:
[14:00:14] commit 5abd8e7c981f4e8d43d8c6b36a1488c039e1b618
[14:00:14] Author: Brooke Vibber <bvibber@wikimedia.org>
[14:00:14] Date:   Fri May 1 21:22:51 2026 +0200
[14:00:14] 
[14:00:14]     Enable LCStoreStaticArray on beta for live performance testing
…
[14:03:27] 14:03:27 Running rebuildLocalisationCache.php
[14:03:33] 14:03:33 0 languages rebuilt out of 549
…
[14:04:53] 14:04:53 Running '/usr/local/sbin/restart-php-fpm-all' on 2 host(s)
[14:04:53] 14:04:53 php-fpm-restart:   0% (ok: 0; fail: 0; left: 0)                        
[14:04:54] 14:04:54 php-fpm-restart: 100% (in-flight: 0; ok: 2; fail: 0; left: 0)          
…
[14:04:54] 14:04:54 Running purgeMessageBlobStore.php

Timo has pushed the config update for beta and it hasn't exploded yet. :D

Closed out the hackathon task; further testing work will proceed here.

I'll run some performance tests on Tuesday when I'm back home and have all day to let things run (and have time to fix it if it breaks under memory load testing).

There are three expected failure modes:

  • logic error corrupts results (unlikely, but if it happens, revert the config patch)
  • hitting every language could result in so much memory usage that it triggers the auto-restart (possible; if it happens, change the site config to raise the limits and try again)
  • if raising the limits results in decreased performance, that's bad
  • something unexpected ;)

Additionally, check performance metrics if possible to confirm that the expected speed benefits are there. (Though beta is weird enough it might obscure that.)

bvibber added a subscriber: SherryYang-WMF.

I'm going to be bold and assign this to myself for additional testing to confirm all looks good on beta and prepping for the production tests.

Note this is not a Readers Growth team task, though we will directly benefit from it with faster backports. :) I expect to be doing this work in the morning my time before regular office hours so I can coordinate with Timo and Giuseppe if I need further input from them in diagnosing metrics or poking configurations.

Pinging @SherryYang-WMF to let her know this is on my personal todo list but I expect to work on it in my spare time to keep from impinging on our assigned team work.

dancy opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/1166

tasks.py: Mention l10n type in Running rebuildLocalisationCache.php message

dancy merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/1166

tasks.py: Mention l10n type in Running rebuildLocalisationCache.php message

@bvibber Maybe interesting datapoints from beta cluster: When rebuildLocalisationCache.php runs for php-l10n, it doesn't regenerate as many language files as the cdb counterpart. Is that expected?

dancy@deployment-deploy04:/srv/beta-update-logs$ grep --context 5  "[^0] languages rebuilt out of" *.log
...
202605050740.log-[07:43:03] 07:43:03 Finished cache_git_info (duration: 00m 25s)
202605050740.log-[07:43:03] 07:43:03 Started l10n-update
202605050740.log-[07:43:06] 07:43:06 Updating ExtensionMessages-master.php
202605050740.log-[07:43:16] 07:43:16 Updating LocalisationCache for master using 2 thread(s)
202605050740.log-[07:43:16] 07:43:16 Running rebuildLocalisationCache.php
202605050740.log:[07:46:24] 07:46:24 549 languages rebuilt out of 549
202605050740.log-[07:46:27] 07:46:27 Running rebuildLocalisationCache.php
202605050740.log:[07:46:40] 07:46:40 15 languages rebuilt out of 549
202605050740.log-[07:46:45] 07:46:45 Generating JSON versions and md5 files (as www-data)
202605050740.log-[07:47:34] 07:47:34 Finished l10n-update (duration: 04m 31s)
202605050740.log-[07:47:34] 07:47:34 Checking for new runtime errors locally
202605050740.log-[07:47:42] 07:47:42 Started sync-canaries
202605050740.log-[07:47:42] 07:47:42 sync-canaries:   0% (ok: 0; fail: 0; left: 1)
--
202605050759.log-[08:05:47] 08:05:47 Finished cache_git_info (duration: 00m 28s)
202605050759.log-[08:05:47] 08:05:47 Started l10n-update
202605050759.log-[08:05:53] 08:05:53 Updating ExtensionMessages-master.php
202605050759.log-[08:06:05] 08:06:05 Updating LocalisationCache for master using 2 thread(s)
202605050759.log-[08:06:05] 08:06:05 Running rebuildLocalisationCache.php
202605050759.log:[08:07:23] 08:07:23 198 languages rebuilt out of 549
202605050759.log-[08:07:27] 08:07:27 Running rebuildLocalisationCache.php
202605050759.log:[08:07:36] 08:07:36 31 languages rebuilt out of 549
202605050759.log-[08:07:39] 08:07:39 Generating JSON versions and md5 files (as www-data)
202605050759.log-[08:08:06] 08:08:06 Finished l10n-update (duration: 02m 18s)
202605050759.log-[08:08:06] 08:08:06 Checking for new runtime errors locally
202605050759.log-[08:08:12] 08:08:12 Started sync-canaries
202605050759.log-[08:08:12] 08:08:12 sync-canaries:   0% (ok: 0; fail: 0; left: 1)
--
202605051900.log-[19:03:04] 19:03:04 Finished cache_git_info (duration: 00m 25s)
202605051900.log-[19:03:04] 19:03:04 Started l10n-update
202605051900.log-[19:03:09] 19:03:09 Updating ExtensionMessages-master.php
202605051900.log-[19:03:19] 19:03:19 Updating LocalisationCache for master using 2 thread(s)
202605051900.log-[19:03:19] 19:03:19 Running rebuildLocalisationCache.php (cdb)
202605051900.log:[19:06:15] 19:06:15 549 languages rebuilt out of 549
202605051900.log-[19:06:19] 19:06:19 Running rebuildLocalisationCache.php (php)
202605051900.log:[19:06:27] 19:06:27 1 languages rebuilt out of 549
202605051900.log-[19:06:30] 19:06:30 Generating JSON versions and md5 files (as www-data)
202605051900.log-[19:07:18] 19:07:18 Finished l10n-update (duration: 04m 14s)
202605051900.log-[19:07:18] 19:07:18 Checking for new runtime errors locally
202605051900.log-[19:07:25] 19:07:25 Started sync-canaries
202605051900.log-[19:07:25] 19:07:25 sync-canaries:   0% (ok: 0; fail: 0; left: 1)

(In older entries, the cdb rebuild results are shown first, then the php rebuild).
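For comparing the two backends across many runs, the rebuild counts can be pulled out of these logs with a short script. A minimal Python sketch follows, assuming the message formats shown in the excerpt above; older logs carry no "(cdb)"/"(php)" label, so the backend is reported as `None` for them.

```python
import re

RUNNING = re.compile(r"Running rebuildLocalisationCache\.php(?: \((\w+)\))?")
REBUILT = re.compile(r"(\d+) languages rebuilt out of (\d+)")


def rebuild_counts(lines):
    """Return (backend, rebuilt, total) for each rebuild result line.

    backend comes from the most recent "Running ..." line and is None
    when that line has no label.
    """
    results = []
    backend = None
    for line in lines:
        match = RUNNING.search(line)
        if match:
            backend = match.group(1)
            continue
        match = REBUILT.search(line)
        if match:
            results.append((backend, int(match.group(1)), int(match.group(2))))
    return results


# Lines copied from the 202605051900.log excerpt above.
sample_lines = [
    "202605051900.log-[19:03:19] 19:03:19 Running rebuildLocalisationCache.php (cdb)",
    "202605051900.log:[19:06:15] 19:06:15 549 languages rebuilt out of 549",
    "202605051900.log-[19:06:19] 19:06:19 Running rebuildLocalisationCache.php (php)",
    "202605051900.log:[19:06:27] 19:06:27 1 languages rebuilt out of 549",
]
counts = rebuild_counts(sample_lines)
```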

Excellent thing to double check on! :D and the answer is yes it is expected -- each language's individual cache file in the static arrays now depends on only its own language's changes, not the changes of the other languages it uses as fallbacks. Eg, a change to "en" used to change hundreds of languages (and still does for CDB), but now should change only the "en" php itself.

This is a nice benefit of being able to _very_ cheaply do the fallback merges at runtime. :D
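The difference in rebuild counts can be modelled with a tiny dependency check. This Python sketch is an illustration under assumed fallback chains, not scap or MediaWiki code: when fallbacks are merged at build time, a language's cache file depends on every language in its chain; when they are merged at read time, it depends only on itself.

```python
def stale_languages(changed, fallbacks, merge_at_build_time):
    """Return the languages whose cache file must be regenerated."""
    stale = set()
    for lang, chain in fallbacks.items():
        deps = {lang}
        if merge_at_build_time:
            deps.update(chain)  # fallback data is baked into the file
        if deps & changed:
            stale.add(lang)
    return stale


# Hypothetical fallback chains.
fallbacks = {"en": [], "de": ["en"], "de-at": ["de", "en"], "fr": ["en"]}

build_time = stale_languages({"en"}, fallbacks, merge_at_build_time=True)
read_time = stale_languages({"en"}, fallbacks, merge_at_build_time=False)
```

In this toy setup a change to "en" invalidates all four files with build-time merging, but only the "en" file with read-time merging.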

This is great!

Did a quick test hitting the beta cluster with requests for all 500-odd ?uselang=xx values and didn't see anything obviously explode in total RAM usage on Grafana.

This is good news! However it's hard to tell 100% if my hits are distributing to all servers, and there's a lot of noise in there because it's already set up to restart php-fpm every 10 minutes and run a puppet rerun every 25.

Later I'll have to specifically measure the opcache memory usage and see whether that's already instrumented.

@bvibber this is *rad* I am very happy you (and I imagine others) went to tackle this topic during the Hackathon! I can't wait to see the positive effects once that is deployed to production 🎉

Note that I can no longer push deployments myself, but I'm happy to continue helping with this project as it was a spare time project for me to begin with. :)