Python: Automated subclass models #15044
base: main
Based on some DBs I had that contained dependencies
Also makes `empty.model.yml` empty once again
(makes future diffing much easier)
This is important to model mixins correctly, for example when they help handle incoming requests, and therefore need to know that `self.kwargs` contains data controlled by a user.
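To make the mixin case concrete, here is a minimal sketch with no framework dependency. `BaseView`, `UserLookupMixin`, and `ProfileView` are hypothetical stand-ins, not classes from any modeled library. The mixin reads `self.kwargs` even though it is not itself a subclass of the view class, which is why the analysis needs to know where it ends up being mixed in.

```python
class BaseView:
    """Hypothetical stand-in for a framework view that stores URL parameters."""

    def setup(self, **kwargs):
        # In a real framework these values would come from the request URL,
        # so they are controlled by the user.
        self.kwargs = kwargs


class UserLookupMixin:
    """Hypothetical mixin that only makes sense when combined with a view."""

    def get_username(self):
        # The mixin never defines self.kwargs; it relies on the view class
        # it is mixed into to provide it.
        return self.kwargs["username"]


class ProfileView(UserLookupMixin, BaseView):
    pass


view = ProfileView()
view.setup(username="alice")
print(view.get_username())
```

Looking at `UserLookupMixin` in isolation gives no hint that `self.kwargs` is user-controlled; that only becomes visible through `ProfileView`.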
:thinkies: turns out that .getASubclass*() had to be applied everywhere...
This required making some of the relevant bits public, but they are marked as internal anyway.
Same trick as 'generate-code-scanning-query-list.py'
But the new test results look very strange indeed!
for module entry definitions from the dataflow graph.
mostly removing of nodes from the graph.
One result lost:
```
check("submodule.submodule_attr", submodule.submodule_attr, "submodule_attr", globals()) #$ MISSING:prints=submodule_attr
```
Only a few suggestions here and there, otherwise I think this looks good!
Of course, we should do some performance testing before merging. Also, it'll be interesting to see if all of these models make a difference in terms of results for our standard suite.
python/ql/lib/change-notes/2023-12-08-automated-subclass-models.md
| @@ -298,7 +303,7 @@ module Stdlib { | |||
| * policy, and the code is not in a polished enough state that we want to do so -- at | |||
| * least not without having convincing use-cases for it :) | |||
| */ | |||
| private module StdlibPrivate { | |||
| module StdlibPrivate { | |||
Should probably add "INTERNAL: Do not use." to the associated QLDoc.
Co-authored-by: Taus <tausbn@github.com>
These changes took the time for loading and writing all files locally from 29.60s to 3.17s (that is, using `gather_from_existing`).
Verified by joining all files, splitting again, and observing no diff in git. (these operations only take a few seconds on my local machine, so shouldn't be too much of an issue)
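The join/split round trip described above can be sketched as follows. This is an illustration under assumptions, not the script's actual code: the row shape (a tuple whose second field names the package) and the helper names `split_by_package` and `join_packages` are hypothetical, and the sample rows are placeholders.

```python
from collections import defaultdict


def split_by_package(rows):
    """Group rows by package, assumed here to be the second field of each row."""
    package_data = defaultdict(set)
    for row in rows:
        package_data[row[1]].add(row)
    return package_data


def join_packages(package_data):
    """Merge the per-package row sets back into one set."""
    rows = set()
    for package_rows in package_data.values():
        rows |= package_rows
    return rows


# Placeholder rows, purely for illustration.
rows = {
    ("base.A~Subclass", "pkg_one", "Member[X]"),
    ("base.A~Subclass", "pkg_two", "Member[Y]"),
    ("base.B~Subclass", "pkg_one", "Member[Z]"),
}

split = split_by_package(rows)
round_tripped = join_packages(split)
print(round_tripped == rows)
```

If splitting and joining are inverses, the round trip leaves the data unchanged, which is the same property the "no diff in git" check verifies on the real files.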
| } | ||
| } | ||
| class WSGIServer extends FindSubclassesSpec { |
Check warning
Code scanning / CodeQL
Acronyms should be PascalCase/camelCase. Warning
| private import semmle.python.frameworks.Pycurl | ||
| private import semmle.python.frameworks.RestFramework | ||
| private import semmle.python.frameworks.SqlAlchemy | ||
| private import semmle.python.frameworks.Tornado |
Check warning
Code scanning / CodeQL
Redundant import Warning
semmle.python.frameworks.Tornado
| @@ -0,0 +1,79 @@ | |||
| private import python | |||
| private import semmle.python.dataflow.new.DataFlow | |||
Check warning
Code scanning / CodeQL
Redundant import Warning
semmle.python.security.dataflow.NoSqlInjectionCustomizations
I would suggest a few comments, but I understand that these may be obvious for people working with the tooling, so I am willing to approve this as is.
| for f in glob.glob(f"{subclass_capture_path}/auto-*.model.yml", recursive=True): | ||
| os.unlink(f) |
I wonder if users might be surprised by this also being part of the script behaviour? It should probably be mentioned in the doc-string at least.
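One way to act on this suggestion is to move the cleanup into a small helper whose doc-string states the deletion explicitly. The sketch below is an assumption about how that could look; the function name `remove_generated_models` is hypothetical and not part of the script.

```python
import glob
import os
import tempfile


def remove_generated_models(subclass_capture_path):
    """Delete previously generated model files before new ones are written.

    WARNING: this removes every file matching ``auto-*.model.yml`` directly
    inside ``subclass_capture_path``. Files with other names are kept.
    """
    removed = []
    for f in glob.glob(os.path.join(subclass_capture_path, "auto-*.model.yml")):
        os.unlink(f)
        removed.append(os.path.basename(f))
    return sorted(removed)


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("auto-flask.model.yml", "auto-django.model.yml", "manual.model.yml"):
        open(os.path.join(tmp, name), "w").close()
    removed = remove_generated_models(tmp)
    remaining = sorted(os.listdir(tmp))

print(removed, remaining)
```

The sketch omits `recursive=True`: in `glob.glob` that flag only has an effect when the pattern contains `**`, so with this pattern only files directly in the directory are matched either way.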
| package_data[t[1]].add(t) | ||
| write_all_package_data_to_files(package_data) | ||
| joined_file.unlink() |
Similarly to above. The modality of "either you have the joined file or you have all the split ones" should probably be made clear.
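The "either joined or split" modality could also be made explicit in code, not only in the doc-string. The following is a hedged sketch of such a check; `detect_layout` and the file name `joined.model.yml` are hypothetical and are not taken from the script.

```python
import tempfile
from pathlib import Path


def detect_layout(directory, joined_name="joined.model.yml"):
    """Report whether the directory holds the joined file or the split files.

    Raises RuntimeError if both forms are present, since the tooling is
    expected to keep only one of them at a time.
    """
    joined = Path(directory) / joined_name
    split_files = sorted(Path(directory).glob("auto-*.model.yml"))
    if joined.exists() and split_files:
        raise RuntimeError("found both joined and split files; expected only one form")
    if joined.exists():
        return "joined"
    if split_files:
        return "split"
    return "empty"


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "auto-example.model.yml").touch()
    layout_split = detect_layout(tmp)
    (Path(tmp) / "joined.model.yml").touch()
    try:
        detect_layout(tmp)
        mixed_rejected = False
    except RuntimeError:
        mixed_rejected = True

print(layout_split, mixed_rejected)
```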
Overview
This PR adds automatically captured subclass information for the majority of interesting PyPI packages. As an example of what this PR achieves:
- `flask-restplus` defines `flask_restful.Resource` (src), which is a subclass of `flask.views.MethodView`.
- We already have modeling of `flask.views.MethodView`.
- Knowing that `flask_restful.Resource` is also a `flask.views.MethodView` allows us to model the remote-flow-sources properly.

We've traditionally had automatic dependency installation available on the codeql-action, but are moving to a solution without this (see https://github.blog/changelog/2023-07-12-code-scanning-with-codeql-no-longer-installs-python-dependencies-automatically-for-new-users/).
We've previously relied on analyzing installed dependencies to reach this conclusion (by following subclass relationships). This PR is the solution that lets us still reach the same conclusion when we stop installing dependencies.
We achieve this by using the extensible type-models to ahead-of-time record important subclass/aliasing information. For example, see the very first commit (2f17d2f)
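As a rough illustration of the idea of recording subclass facts ahead of time, here is a small sketch. It does not use the extensible type-model format; the `recorded` mapping and the `subclasses_of` helper are hypothetical, and `example_app.views.UserResource` is a made-up class name. Only the `flask_restful.Resource` / `flask.views.MethodView` relationship comes from the description above.

```python
def subclasses_of(base, recorded):
    """Return every class recorded as a direct or indirect subclass of base."""
    found = set()
    worklist = [base]
    while worklist:
        current = worklist.pop()
        for sub in recorded.get(current, ()):
            if sub not in found:
                found.add(sub)
                worklist.append(sub)
    return found


# Facts captured ahead of time, so no dependency has to be installed
# at analysis time. The second entry is purely illustrative.
recorded = {
    "flask.views.MethodView": {"flask_restful.Resource"},
    "flask_restful.Resource": {"example_app.views.UserResource"},
}

known_views = subclasses_of("flask.views.MethodView", recorded)
print(sorted(known_views))
```

With the facts recorded up front, anything already modeled for the base class can be applied to the recorded subclasses without inspecting the package's source.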
Our internal testing shows that for all but a few cases, we end up with a solution comparable to or better than what we had before, even when narrowing our focus to repos where dependency installation was successful before.
(Thanks to @tausbn for helping with the modeling ❤️)
Reviewing this PR
What a mess. Sorry. It's a mix of working on the tooling `process-mrva-results.py`/`SubclassFinder.qll` and enabling subclasses/aliases to be found in the actual modeling. The latter bit required removing the `private` annotations for much of our modeling. I think we'll just have to live with that.
The only commits I've found that don't follow this pattern, and that could have been made into separate PRs, are:
Notes