Python: Automated subclass models #15044

RasmusWL · 2023-12-08T10:46:20Z

Overview

This PR adds automatically captured subclass information for the majority of interesting PyPI packages. As an example of what this PR achieves:

The PyPI package flask-restplus defines flask_restful.Resource (src), which is a subclass of flask.views.MethodView.
Our modeling of Flask-based remote-flow-sources ultimately depends on being able to figure out whether a class is a subclass of flask.views.MethodView,
so knowing that a subclass of flask_restful.Resource is also a flask.views.MethodView allows us to model the remote-flow-sources properly.
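
To make the example concrete, here is a minimal sketch in plain Python. The classes are local stand-ins (Flask is not imported), and `UserEndpoint` and `is_transitive_subclass` are hypothetical names used only for illustration. The point is that the link from user code to `MethodView` passes through a class defined in a third-party package, so it is invisible unless that package is available or the relationship has been recorded.

```python
class MethodView:
    """Stand-in for flask.views.MethodView."""


class Resource(MethodView):
    """Stand-in for the Resource class shipped by the PyPI package."""


class UserEndpoint(Resource):
    """Hypothetical class in the code being analyzed."""

    def get(self, user_id):
        # user_id would be remote (user-controlled) data in a real app.
        return {"id": user_id}


def is_transitive_subclass(cls, base):
    """Return True if `base` is a proper ancestor of `cls`."""
    # __mro__[0] is the class itself, so skip it.
    return base in cls.__mro__[1:]
```

Without knowing what `Resource` inherits from, an analysis that only sees `UserEndpoint` has no way to connect it to `MethodView`.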

We've traditionally had automatic dependency installation available on the codeql-action, but are moving to a solution without this (see https://github.blog/changelog/2023-07-12-code-scanning-with-codeql-no-longer-installs-python-dependencies-automatically-for-new-users/).

We've previously relied on analyzing installed dependencies to reach this conclusion (by following subclass relationships). This PR lets us still reach the same conclusion once we stop installing dependencies.

We achieve this by using the extensible type-models to record important subclass/aliasing information ahead of time. For example, see the very first commit (2f17d2f).
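
As a rough illustration of what "ahead-of-time recording" buys us, the sketch below keeps a small table of recorded facts and consults it instead of the dependency's source. The row layout, `RECORDED`, `build_lookup`, and `known_bases` are all assumptions made for this example; this is not the actual type-model row format or the real implementation.

```python
from collections import defaultdict

# Hypothetical recorded facts:
# (modeled base class, package, class found in that package).
RECORDED = [
    ("flask.views.MethodView", "flask_restful", "flask_restful.Resource"),
]


def build_lookup(rows):
    """Map each recorded class to the modeled bases it is known to extend."""
    lookup = defaultdict(set)
    for base, _package, subclass in rows:
        lookup[subclass].add(base)
    return lookup


def known_bases(qualified_name, lookup):
    """Return every modeled base reachable from `qualified_name`."""
    seen = set()
    todo = [qualified_name]
    while todo:
        current = todo.pop()
        # .get avoids inserting empty entries into the defaultdict.
        for base in lookup.get(current, ()):
            if base not in seen:
                seen.add(base)
                todo.append(base)
    return seen
```

With such a table, a class that extends `flask_restful.Resource` can be tied back to `flask.views.MethodView` without the package being installed.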

Our internal testing shows that for all but a few cases, we end up with a solution comparable to or better than what we had before, even when narrowing our focus to repos where dependency installation was successful before.

(Thanks to @tausbn for helping with the modeling ❤️)

Reviewing this PR

What a mess. Sorry. It's a mix of work on the tooling (process-mrva-results.py / SubclassFinder.qll) and enabling subclasses/aliases to be found in the actual modeling. The latter required removing the private annotations from much of our modeling. I think we'll just have to live with that.

The only commits I've found that don't follow this pattern, and that could have been made into separate PRs, are:

Notes

The tooling to generate these subclass-capture models automatically only lives internally.
To ensure this automated modeling can still be recreated once we no longer install dependencies (for example, if we wanted to use a different format), the actual modeling with MRVA was only done while I made sure we wouldn't make use of any dependencies that might have been installed (specifically by this commit).

tausbn

Only a few suggestions here and there, otherwise I think this looks good!

Of course, we should do some performance testing before merging. Also, it'll be interesting to see if all of these models make a difference in terms of results for our standard suite.

python/ql/lib/change-notes/2023-12-08-automated-subclass-models.md

python/ql/lib/semmle/python/frameworks/FastApi.qll

tausbn · 2023-12-08T14:13:10Z

python/ql/lib/semmle/python/frameworks/Stdlib.qll

@@ -298,7 +303,7 @@ module Stdlib {
 * policy, and the code is not in a polished enough state that we want to do so -- at
 * least not without having convincing use-cases for it :)
 */
-private module StdlibPrivate {
+module StdlibPrivate {


Should probably add "INTERNAL: Do not use." to the associated QLDoc.

python/ql/src/meta/ClassHierarchy/Find.ql

+  }
+}
+
+class WSGIServer extends FindSubclassesSpec {


python/ql/src/meta/ClassHierarchy/Find.ql

+private import semmle.python.frameworks.Pycurl
+private import semmle.python.frameworks.RestFramework
+private import semmle.python.frameworks.SqlAlchemy
+private import semmle.python.frameworks.Tornado


python/ql/src/meta/alerts/Sinks.qll

@@ -0,0 +1,79 @@
+private import python
+private import semmle.python.dataflow.new.DataFlow


yoff

I would suggest a few comments, but I understand that these may be obvious for people working with the tooling, so I am willing to approve this as is.

yoff · 2023-12-14T10:37:16Z

python/ql/src/meta/ClassHierarchy/join-yml-files.py

+for f in glob.glob(f"{subclass_capture_path}/auto-*.model.yml", recursive=True):
+    os.unlink(f)


I wonder if users might be surprised by this also being part of the script behaviour? It should probably be mentioned in the doc-string at least.
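
A minimal sketch of what such a doc-string could look like, assuming a simplified join step. `join_auto_models` and `joined_name` are hypothetical names, and the body concatenates raw text instead of merging YAML the way the real script presumably does. Note also that `glob.glob(..., recursive=True)` only changes behaviour when the pattern contains `**`, so for the pattern quoted above the flag has no effect.

```python
import glob
import os


def join_auto_models(subclass_capture_path, joined_name="joined.model.yml"):
    """Concatenate all auto-*.model.yml files into a single joined file.

    Side effect: every auto-*.model.yml file in `subclass_capture_path`
    is DELETED once its content has been written to the joined file.
    """
    pattern = os.path.join(subclass_capture_path, "auto-*.model.yml")
    # Sorted so the joined output is stable across runs.
    paths = sorted(glob.glob(pattern))
    joined_path = os.path.join(subclass_capture_path, joined_name)
    with open(joined_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                out.write(src.read())
    # Only remove the inputs after the joined file is fully written.
    for path in paths:
        os.unlink(path)
    return joined_path
```

Deleting only after the joined file has been closed means a failure while writing leaves the split files untouched.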

yoff · 2023-12-14T10:41:40Z

python/ql/src/meta/ClassHierarchy/split-yml-files.py

+    package_data[t[1]].add(t)
+write_all_package_data_to_files(package_data)
+
+joined_file.unlink()


Similarly to above. The modality of "either you have the joined file or you have all the split ones" should probably be made clear.
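
The "either joined or split" modality could be spelled out along these lines. This is a sketch under stated assumptions: `split_joined` is a hypothetical name, and rows are modeled as comma-separated text lines with the package in the second field (mirroring `t[1]` above), which is not the real file format.

```python
from collections import defaultdict
from pathlib import Path


def split_joined(joined_file: Path, out_dir: Path):
    """Split the joined file into one auto-<package>.model.yml per package.

    Side effect: `joined_file` is DELETED afterwards, so at any time
    either the joined file or the split files exist, never both.
    """
    package_data = defaultdict(set)
    for line in joined_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        t = tuple(line.split(","))
        package_data[t[1]].add(t)  # group rows by package name

    for package, rows in package_data.items():
        target = out_dir / f"auto-{package}.model.yml"
        # Sorted rows keep future diffs small.
        text = "".join(",".join(row) + "\n" for row in sorted(rows))
        target.write_text(text, encoding="utf-8")

    joined_file.unlink()
    return sorted(package_data)
```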

RasmusWL and others added 30 commits December 8, 2023 11:27

WIP: Flask View class modeling for restplus

2f17d2f

Based on some DBs I had that contained dependencies

WIP rest of modeling done so far

f06bbd2

Python: Improve docs/names around already modeled classes

bb3ced0

Python: Adjust test-code predicate

ba0a5b1

Python: Streamline what modules to allow for now

b66dd23

Python: Add query metadata

b1f5dea

Python: Remove query predicate annotation

451a210

Python: Add script to process results from MRVA (bqrs files)

5e98ff4

Also makes `empty.model.yml` empty once again

FIXME already fixed

1c43d11

Python: Sort MaD rows

734dcb1

(makes future diffing much easier)

Python: Make Django use auto-modeling

d6fec9e

Ooops

Python: Automodel for tornado

eb97a79

Python: Automodel for WSGIServer

ec38464

Python: Improve import * handling

77a4d81

Python: Allow any results.bqrs file

dfdb66f

Python: Improve SelfRefMixin

ba19f95

This is important to model mixins correctly, for example when they help handle incoming requests, and therefore need to know that `self.kwargs` contains data controlled by a user.

Python: Enable auto-model BaseHttpRequestHandler

af6c5cc

Python: More import fixes

1e69762

:thinkies: turns out that .getASubclass*() had to be applied everywhere...

Python: Enable auto-model for cgi.FieldStorage

bff7ae2

Python: Enable auto-model for Django Model

d622d87

Python: Add Django response models

7b1c6b0

Python: Add Flask response model

cb1efa9

Python: Add Requests response model

1d4b4ee

This required making some of the relevant bits public, but they are marked as internal anyway.

Python: Add http.client.HTTPResponse model

750f14f

Python: Improve speed of process-mrva-results.py

7d86a8d

Same trick as 'generate-code-scanning-query-list.py'

Python: Add test of find-subclass code

e7d5573

Python: Also capture alias with new name

f19b672

Python: Add starlette.websocket model

83e6e51

Python: Add clickhouse_driver model

f5bed2d

Python: Add aiohttp.ClientSession model

947aa09

RasmusWL added 11 commits December 8, 2023 11:27

Python: Ignore any captured info with tests in it

c4abffe

Python: Use separate directory for subclass capture models

6db3b37

Python: Disallow examples

6ce8cd3

Python: Disallow invalid path component

004bb50

Python: Don't include docs/ folder

dc90411

Python: auto subclass capture

bd2b5cf

Python: auto subclass capture

dd811ad

Python: auto subclass capture

203f40c

Python: auto subclass capture

e713326

Python: Refactor taint-sinks meta queries

a86d341

Python: Add change-note

299a3f4

github-actions bot added documentation Python labels Dec 8, 2023

RasmusWL requested a review from a team December 8, 2023 10:48

github-advanced-security bot found potential problems Dec 8, 2023

python/ql/lib/semmle/python/frameworks/Stdlib.qll (3 alerts, all marked Fixed)

RasmusWL and others added 5 commits December 8, 2023 16:38

Python: Don't filter subclass tests away

368696a

Python: Adjust subclass finder to no ESSA nodes

35ba495

But the new test results look very strange indeed!

Python: remove control flow nodes

8ab59e0

for module entry definitions from the dataflow graph.

Python: adjust test expectations

8885ee3

Mostly removal of nodes from the graph. One result lost:

```
check("submodule.submodule_attr", submodule.submodule_attr, "submodule_attr", globals()) #$ MISSING:prints=submodule_attr
```

Python: Recover subclass finder .expected after cherry picking commits from github#15030

de55ca3

tausbn previously approved these changes Dec 11, 2023

Apply suggestions from code review

420898d

Co-authored-by: Taus <tausbn@github.com>

RasmusWL dismissed tausbn’s stale review via 420898d December 13, 2023 20:53

RasmusWL and others added 5 commits December 13, 2023 21:54

Python: treat auto subclass capture models as auto-generated

565ca8c

Co-authored-by: Taus <tausbn@github.com>

Python: Update a few QLdocs

1625e37

Python: Script: Improve performance by using C++ impl

ea9c8bc

These changes took the time for loading and writing all files locally from 29.60s to 3.17s (that is, using `gather_from_existing`).

Python: Add ability to split and join autogenerated yml files

d631daf

Verified by joining all files, splitting again, and observing no diff in git. (these operations only take a few seconds on my local machine, so shouldn't be too much of an issue)

Python: join auto-modeling into one file

3220c9e

github-advanced-security bot found potential problems Dec 13, 2023

yoff approved these changes Dec 14, 2023
