FY25-26 WE5.4.12: Identify the provenance of image requests
Open, Needs Triage, Public

Description

Problem

As part of the work under WE5.4 to protect our infrastructure from abusive scraping, we want to be able to understand the provenance of image requests. This means being able to distinguish when and where a URL to an image was generated.

This will allow us to use this information as a signal in request filtering at the CDN, by helping to determine whether a request comes from a browser session visiting the website, an API query, the dumps, or hotlinking.

Approach

Generate signed URLs for image requests, by adding query parameters that contain the provenance information and a signature that can be trivially validated at the CDN. The signature should be an HMAC that includes the URL, source (web, api, dumps), timestamp and a secret.

Acceptance criteria

  • Generated image URLs include provenance query parameters
  • Generated image URLs include an HMAC signature
  • Signature contents and HMAC algorithm agreed with SRE
  • SRE can configure the CDN based on the source that generated an image URL
  • SRE can configure the CDN based on the freshness of an image URL
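
As an illustration of the approach above, here is a minimal sketch of signing and validating an image URL with an HMAC over the URL, source and timestamp. The parameter names (src, ts, sig), the choice of SHA-256, the message layout and the one-hour freshness window are all assumptions made for this example; the task leaves signature contents and algorithm to be agreed with SRE. The sketch also assumes the unsigned URL has no query string of its own.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical values: neither the secret handling nor the freshness
# window is specified in the task.
SECRET = b"example-shared-secret"
MAX_AGE_SECONDS = 3600


def sign_image_url(url: str, source: str, timestamp: int,
                   secret: bytes = SECRET) -> str:
    """Append provenance parameters and an HMAC-SHA256 signature to url."""
    message = f"{url};{source};{timestamp}".encode("utf-8")
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    query = urlencode({"src": source, "ts": timestamp, "sig": signature})
    return f"{url}?{query}"


def verify_image_url(signed_url: str, now: int, secret: bytes = SECRET,
                     max_age: int = MAX_AGE_SECONDS) -> bool:
    """Return True if the signature matches and the URL is fresh enough."""
    parts = urlsplit(signed_url)
    params = parse_qs(parts.query)
    try:
        source = params["src"][0]
        timestamp = int(params["ts"][0])
        signature = params["sig"][0]
    except (KeyError, ValueError):
        return False
    # Rebuild the unsigned URL that was covered by the signature.
    base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    message = f"{base_url};{source};{timestamp}".encode("utf-8")
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes.
    if not hmac.compare_digest(expected.encode("utf-8"),
                               signature.encode("utf-8")):
        return False
    return 0 <= now - timestamp <= max_age
```

A validator of this shape would cover the last two acceptance criteria: the source is read from a parameter that cannot be altered without invalidating the signature, and freshness is a comparison between the signed timestamp and the current time.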

Event Timeline

Change #1239807 merged by jenkins-bot:

[mediawiki/core@master] FileRepo: Rename cache-busting param to '_' on file description pages

https://gerrit.wikimedia.org/r/1239807

@Joe @CDanis I heard you're the people to talk to about the desired data and format of these query parameters.

Currently, the proposed patch https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1239464 includes the following data:

  • Site which is requesting the image, e.g. 'www.mediawiki.org'
  • Generator (the software component involved), e.g. 'parser' or 'imageinfo'. Entry point is used as fallback if not specified, e.g. 'index', 'api', 'rest'
  • Format of the requested image, 'original', 'thumbnail' or 'thumbnail_unscaled'

The format is UTM parameters (respectively utm_source, utm_campaign and utm_content, in this order), on the assumption that they'll be stripped by search engines etc.

Example: https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg?utm_source=mediawiki.localhost&utm_campaign=parser&utm_content=thumbnail
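
For illustration only, the mapping described above could be sketched as follows. The function name and the validation of the format value are assumptions for this example and are not taken from the MediaWiki patch; only the three parameter names, their order and the listed format values come from the comment.

```python
from urllib.parse import urlencode

# Format values listed in the comment above.
ALLOWED_FORMATS = {"original", "thumbnail", "thumbnail_unscaled"}


def add_provenance(url: str, site: str, generator: str,
                   image_format: str) -> str:
    """Append utm_source, utm_campaign and utm_content, in that order."""
    if image_format not in ALLOWED_FORMATS:
        raise ValueError(f"unknown image format: {image_format}")
    query = urlencode({
        "utm_source": site,         # site requesting the image
        "utm_campaign": generator,  # software component or entry point
        "utm_content": image_format,
    })
    # Extend an existing query string rather than adding a second "?".
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}{query}"


print(add_provenance(
    "https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg",
    "mediawiki.localhost", "parser", "thumbnail"))
```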

Your thoughts on that would be appreciated. I also have two questions:

Change #1239464 merged by jenkins-bot:

[mediawiki/core@master] Media: Add provenance parameters to thumbnail and media file URLs

https://gerrit.wikimedia.org/r/1239464

Sorry, it's been a few weeks of intense work on other stuff. The proposed format is good as far as I'm concerned, as a first step.

I think adding a signature is useful. A simple signature would be enough, such as a SHA-1 hash of the secret followed by the other parameters, in the form $SECRET;site=mediawiki.localhost;generator=parser;format=thumbnail, which we can add in utm_term (again abusing the term). I would go with a plain SHA-1 instead of HMAC because the risk of compromise is pretty low.
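
A minimal sketch of the signature described above, assuming the payload is exactly the semicolon-separated string from the comment, encoded as UTF-8 and hex-encoded after hashing. The function names are hypothetical, and the encoding details are not confirmed by the thread.

```python
import hashlib
import hmac


def provenance_signature(secret: str, site: str, generator: str,
                         image_format: str) -> str:
    """SHA-1 over '$SECRET;site=...;generator=...;format=...'."""
    payload = (f"{secret};site={site};generator={generator};"
               f"format={image_format}")
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def is_valid_signature(utm_term: str, secret: str, site: str,
                       generator: str, image_format: str) -> bool:
    """Recompute the hash and compare it in constant time."""
    expected = provenance_signature(secret, site, generator, image_format)
    return hmac.compare_digest(expected.encode("utf-8"),
                               utm_term.encode("utf-8"))
```

In general, a plain hash over a secret prefix is open to length-extension attacks, which HMAC is designed to avoid; the comment above judges that risk acceptable for this use.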

Change #1253625 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[operations/mediawiki-config@master] Enable $wgTrackMediaRequestProvenance on testwikis and beta cluster

https://gerrit.wikimedia.org/r/1253625

Change #1253625 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable $wgTrackMediaRequestProvenance on testwikis and beta cluster

https://gerrit.wikimedia.org/r/1253625

Mentioned in SAL (#wikimedia-operations) [2026-03-16T20:57:26Z] <catrope@deploy2002> Started scap sync-world: Backport for [[gerrit:1253623|Fix client credentials access tokens (T417278 T419921)]], [[gerrit:1253625|Enable $wgTrackMediaRequestProvenance on testwikis and beta cluster (T414338)]], [[gerrit:1253626|Configure $wgApiClientErrorSampleRate (T418957)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-16T20:59:17Z] <catrope@deploy2002> matmarex, catrope: Backport for [[gerrit:1253623|Fix client credentials access tokens (T417278 T419921)]], [[gerrit:1253625|Enable $wgTrackMediaRequestProvenance on testwikis and beta cluster (T414338)]], [[gerrit:1253626|Configure $wgApiClientErrorSampleRate (T418957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-16T21:05:37Z] <catrope@deploy2002> Finished scap sync-world: Backport for [[gerrit:1253623|Fix client credentials access tokens (T417278 T419921)]], [[gerrit:1253625|Enable $wgTrackMediaRequestProvenance on testwikis and beta cluster (T414338)]], [[gerrit:1253626|Configure $wgApiClientErrorSampleRate (T418957)]] (duration: 08m 06s)

Change #1260029 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Enable $wgTrackMediaRequestProvenance on group0 wikis

https://gerrit.wikimedia.org/r/1260029

Change #1260029 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable $wgTrackMediaRequestProvenance on group0 wikis

https://gerrit.wikimedia.org/r/1260029

Mentioned in SAL (#wikimedia-operations) [2026-03-31T23:10:45Z] <krinkle@deploy1003> Started scap sync-world: Backport for [[gerrit:1260029|Enable $wgTrackMediaRequestProvenance on group0 wikis (T414338)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-31T23:12:45Z] <krinkle@deploy1003> krinkle: Backport for [[gerrit:1260029|Enable $wgTrackMediaRequestProvenance on group0 wikis (T414338)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-31T23:51:06Z] <krinkle@deploy1003> Finished scap sync-world: Backport for [[gerrit:1260029|Enable $wgTrackMediaRequestProvenance on group0 wikis (T414338)]] (duration: 40m 21s)

Change #1267437 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on most group1 wikis

https://gerrit.wikimedia.org/r/1267437

Change #1267437 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on most group1 wikis

https://gerrit.wikimedia.org/r/1267437

Mentioned in SAL (#wikimedia-operations) [2026-04-08T07:36:29Z] <krinkle@deploy1003> Started scap sync-world: Backport for [[gerrit:1267437|Enable wgTrackMediaRequestProvenance on most group1 wikis (T414338)]]

Mentioned in SAL (#wikimedia-operations) [2026-04-08T07:38:18Z] <krinkle@deploy1003> krinkle: Backport for [[gerrit:1267437|Enable wgTrackMediaRequestProvenance on most group1 wikis (T414338)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-04-08T07:46:04Z] <krinkle@deploy1003> Finished scap sync-world: Backport for [[gerrit:1267437|Enable wgTrackMediaRequestProvenance on most group1 wikis (T414338)]] (duration: 09m 34s)

Change #1269440 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on wikidata.org

https://gerrit.wikimedia.org/r/1269440

Change #1269441 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on Commons

https://gerrit.wikimedia.org/r/1269441

Change #1269442 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on remaining Wikipedias

https://gerrit.wikimedia.org/r/1269442

Progress update (2-6 Mar, 9-13 Mar; copied here from Asana for transparency):

  • Investigate and fix broken thumbnails on officewiki (Timo investigated and found missing thumbnail steps on private wikis, Amir enabled this).
  • Test and merge trial implementation of media provenance URLs in MediaWiki core behind a feature flag (developed by Bartosz and Timo). T414338
    • Refactor logic in FileRepo and Media classes in MediaWiki core to reduce duplication and make adding provenance URLs simpler and more reliable. T414338
    • Find and fix VisualEditor would-be-bug where media type breaks due to accidental reliance on URLs having no query string. T419458
  • Enable media provenance feature in Beta Cluster and on testwikis in production. T414338

Progress update (9 Apr 2026):

  • Enable media provenance on 573 additional wikis (including all Wiktionary and Wikivoyage wikis, and 18 Wikipedias). We are now live on 720/1068 wikis. T414338
  • Found regression in MediaViewer causing double downloads. T422586
  • Prepare Stockphoto gadget on Commons ahead of rollout to prevent regression. T419135

Next steps:

  • Deploy media provenance feature to Wikidata, Commons, and 346 remaining Wikipedias.

Change #1276086 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/MultimediaViewer@master] mmv.bootstrap: Avoid double download when thumb is unscaled original

https://gerrit.wikimedia.org/r/1276086

Change #1276086 merged by jenkins-bot:

[mediawiki/extensions/MultimediaViewer@master] mmv.bootstrap: Avoid double download when thumb is unscaled original

https://gerrit.wikimedia.org/r/1276086

Change #1269440 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on wikidata.org

https://gerrit.wikimedia.org/r/1269440

Change #1269441 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on Commons

https://gerrit.wikimedia.org/r/1269441

Mentioned in SAL (#wikimedia-operations) [2026-05-01T19:51:14Z] <krinkle@deploy1003> Started scap sync-world: Backport for [[gerrit:1269440|Enable wgTrackMediaRequestProvenance on wikidata.org (T414338)]], [[gerrit:1269441|Enable wgTrackMediaRequestProvenance on Commons (T414338)]]

Mentioned in SAL (#wikimedia-operations) [2026-05-01T19:52:57Z] <krinkle@deploy1003> krinkle: Backport for [[gerrit:1269440|Enable wgTrackMediaRequestProvenance on wikidata.org (T414338)]], [[gerrit:1269441|Enable wgTrackMediaRequestProvenance on Commons (T414338)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-05-01T20:06:42Z] <krinkle@deploy1003> Finished scap sync-world: Backport for [[gerrit:1269440|Enable wgTrackMediaRequestProvenance on wikidata.org (T414338)]], [[gerrit:1269441|Enable wgTrackMediaRequestProvenance on Commons (T414338)]] (duration: 15m 27s)

I think these changes may be the cause behind https://commons.wikimedia.org/wiki/MediaWiki_talk:Gadget-GoogleImagesTineye.js#c-Masur-20251229182900-Reverse_Image_Search_-_Google_and_TinEye_failing_to_retrieve_source_images_from

the gadget's logic is simple. it takes the url of the image and gives it to the search engines in the form of https://lens.google.com/uploadbyurl?url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Ff%2Ffa%2FStatue_of_Taras_Shevchenko_in_Shevchenkove%252C_Shevchenkove_Raion_2019_by_Venzz_04.jpg%3Futm_source%3Dcommons.wikimedia.org%26utm_campaign%3Dindex%26utm_content%3Doriginal

but as i tested manually, search engines cannot get the image, whether the link comes with or without the new trackers

utm_source=commons.wikimedia.org&utm_campaign=index&utm_content=original
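
To make the URL construction described above concrete, here is a small sketch that percent-encodes an image URL into the uploadbyurl form and, optionally, strips the three provenance parameters first. The function names are hypothetical and this is not the gadget's actual code. As noted in this comment, removing the parameters did not make the images retrievable, so this only illustrates the mechanics and is not a fix.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit

PROVENANCE_PARAMS = {"utm_source", "utm_campaign", "utm_content"}


def strip_provenance(url: str) -> str:
    """Drop the provenance query parameters and keep everything else."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in PROVENANCE_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def lens_url(image_url: str) -> str:
    """Percent-encode the whole image URL as the value of 'url'."""
    return ("https://lens.google.com/uploadbyurl?url="
            + quote(image_url, safe=""))
```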

please explain how to get the gadget working again, i.e. how to get a link of a file that can be read by other websites.

@RoyZuo A more robust way would be to make the gadget download the image (or a thumbnail), then upload it to the search engine, instead of asking the search engine to fetch it from us, which may be blocked if they don't respect our user-agent policy.

In the meantime, it looks like using a thumbnail URL instead of the original file URL works, at least for now.

I've tested everything I wanted to test on Commons and Wikidata.

I expected Wikidata to perhaps not get the provenance params or not work with MMV, but it all looks good. I did find a bug, T426217: MediaViewer downloads high-res image twice if thumb URL is re-used, but that's pre-existing and not caused or made more common by provenance params, and so does not need to block roll-out.

Change #1269442 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on remaining Wikipedias

https://gerrit.wikimedia.org/r/1269442

I've scheduled this for tomorrow afternoon, 13:00 UTC.

Change #1269442 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on remaining Wikipedias

https://gerrit.wikimedia.org/r/1269442

Mentioned in SAL (#wikimedia-operations) [2026-05-14T13:42:53Z] <krinkle@deploy1003> Started scap sync-world: Backport for [[gerrit:1269442|Enable wgTrackMediaRequestProvenance on remaining Wikipedias (T414338)]]

Mentioned in SAL (#wikimedia-operations) [2026-05-14T13:44:41Z] <krinkle@deploy1003> krinkle: Backport for [[gerrit:1269442|Enable wgTrackMediaRequestProvenance on remaining Wikipedias (T414338)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-05-14T13:49:57Z] <krinkle@deploy1003> Finished scap sync-world: Backport for [[gerrit:1269442|Enable wgTrackMediaRequestProvenance on remaining Wikipedias (T414338)]] (duration: 07m 03s)

Progress update (15 May 2026):

  • SRE now includes media provenance as a signal in calculating the X-Is-Browser score on the edge.
  • Fixed regression in MediaViewer causing high-res double downloads (blocking rollout). T422586
  • Enable media provenance on Wikidata and Wikimedia Commons (720 -> 722/1068 wikis).
  • Did broad manual testing across Wikimedia Commons and Wikidata post-rollout.
  • Found bug in MediaViewer causing lack of provenance params in some cases (pre-existing, not blocking rollout). T424082
  • Found bug in MediaViewer causing low-res double downloads (pre-existing, not blocking rollout). T426217
  • Enable media provenance on remaining 346 Wikipedias, including English Wikipedia. Now live on all 1068 wikis.

As explained by @RoyZuo above, we have at Wikimedia Commons a serious problem if the gadget that supports image reverse search on Google Lens, TinEye and Yandex doesn't work. Right now, TinEye and Yandex work but Google Lens fails as it is unable to access the images. I am not responsible for the gadget but as one of the admins at Commons I can tell you that this gadget is absolutely essential to fight against copyright violations. We delete about 2000 copyvios every day and we cannot do this efficiently if Google Lens cannot be conveniently queried. Hence, some solution is required such that this gadget can pass URLs that are subsequently not blocked when the respective services download them.

Change #1288925 had a related patch set uploaded (by Krinkle; author: Seddon):

[operations/mediawiki-config@master] Revert "Enable wgTrackMediaRequestProvenance on Commons"

https://gerrit.wikimedia.org/r/1288925

Change #1288925 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "Enable wgTrackMediaRequestProvenance on Commons"

https://gerrit.wikimedia.org/r/1288925

Mentioned in SAL (#wikimedia-operations) [2026-05-18T21:31:09Z] <krinkle@deploy1003> Started scap sync-world: Backport for [[gerrit:1288925|Revert "Enable wgTrackMediaRequestProvenance on Commons" (T414338 T425580)]]

Mentioned in SAL (#wikimedia-operations) [2026-05-18T21:32:56Z] <krinkle@deploy1003> seddon, krinkle: Backport for [[gerrit:1288925|Revert "Enable wgTrackMediaRequestProvenance on Commons" (T414338 T425580)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-05-18T21:42:39Z] <krinkle@deploy1003> Finished scap sync-world: Backport for [[gerrit:1288925|Revert "Enable wgTrackMediaRequestProvenance on Commons" (T414338 T425580)]] (duration: 11m 29s)

I think these changes may be the cause behind https://commons.wikimedia.org/wiki/MediaWiki_talk:Gadget-GoogleImagesTineye.js#c-Masur-20251229182900-Reverse_Image_Search_-_Google_and_TinEye_failing_to_retrieve_source_images_from

but as i tested manually, search engines cannot get the image, whether the link comes with or without the new trackers

@RoyZuo and @AFBorchert: this, together with the note on VP suggesting that this has been going on for several weeks, implies that this is very likely NOT caused by this ticket. More likely, the Foundation's earlier anti-scraping measures have caught these systems as well.

Matmarex has already given advice on how to change the gadget in a way that might make it work more reliably. This can be done right now. Alternatively, you can open a separate ticket to investigate why these websites are blocked from accessing us, but it might not actually be possible to distinguish these systems from illegitimate scrapers. It's hard to say.

@AFBorchert use browsers like opera, which have "search image with google lens" when you right click on it. probably some extensions for other browsers also do this.
basically the same method described by matmarex: searching the copied image.
but i'm not gonna put my time into making that a gadget on commons.
who broke it should fix it. or who wants to.

@TheDJ I am not familiar with the architecture and the algorithms of the protection system against unwanted scraping. To me it appears quite likely that the amount of traffic from a particular site can play a role, causing the tool to work or to fail for some sites. But it appears to me very likely that the gadget failures are linked to the protection system. Regarding the gadget: I am not the author of the gadget or in any way involved in its development. However, downloading the images and uploading them to the various reverse search services, as suggested by @matmarex, does not appear to be the straightforward solution. I think it would be better if the gadget could generate image URLs that are subsequently accepted by the protection system. My point is that Wikimedia Commons and its defense against copyright violations is a critical part of the infrastructure. IMHO this perspective should be taken into account when designing and updating the protection system.