Python: Automated subclass models #15044

Draft
wants to merge 93 commits into
base: main

Conversation

RasmusWL
Member

@RasmusWL RasmusWL commented Dec 8, 2023

Overview

This PR adds automatically captured subclass information for the majority of interesting PyPI packages. As an example of what this PR achieves:

We've traditionally had automatic dependency installation available on the codeql-action, but are moving to a solution without this (see https://github.blog/changelog/2023-07-12-code-scanning-with-codeql-no-longer-installs-python-dependencies-automatically-for-new-users/).

We've previously relied on analyzing installed dependencies to reach this conclusion (by following subclass relationships). This PR lets us reach the same conclusion once we stop installing dependencies.

We achieve this by using the extensible type-models to ahead-of-time record important subclass/aliasing information. For example, see the very first commit (2f17d2f)
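
The general idea of recording subclass/alias information ahead of time can be sketched in a few lines of Python. This is only an illustration: the table layout and the names `KNOWN_SUBCLASSES_OR_ALIASES` and `refers_to_subclass_of` are hypothetical, and it does not reproduce the model format used in this PR. The Django names are used as sample data only.

```python
# Illustrative sketch only. The table layout and all names defined here are
# assumptions made for this example, not the model format used in this PR.

# For each "interesting" base class, the fully qualified names that are known
# (from an earlier analysis run) to be aliases or subclasses of it.
KNOWN_SUBCLASSES_OR_ALIASES = {
    "django.views.generic.base.View": {
        "django.views.View",                       # re-export (alias)
        "django.views.generic.View",               # re-export (alias)
        "django.views.generic.base.TemplateView",  # subclass
        "django.views.generic.TemplateView",       # re-export of the subclass
    },
}


def refers_to_subclass_of(base: str, qualified_name: str) -> bool:
    """Return True if `qualified_name` is `base` or a recorded alias/subclass.

    No package has to be installed for this lookup; it only consults the
    table that was recorded ahead of time.
    """
    if qualified_name == base:
        return True
    return qualified_name in KNOWN_SUBCLASSES_OR_ALIASES.get(base, set())


if __name__ == "__main__":
    base = "django.views.generic.base.View"
    print(refers_to_subclass_of(base, "django.views.generic.TemplateView"))
    print(refers_to_subclass_of(base, "os.path.join"))
```

The point of the sketch is the trade-off: the lookup is only as good as the recorded table, but it no longer depends on the dependency being present at analysis time.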

Our internal testing shows that for all but a few cases, we end up with a solution comparable to or better than what we had before, even when narrowing our focus to repos where dependency installation was successful before.

(Thanks to @tausbn for helping with the modeling ❤️)

Reviewing this PR

What a mess. Sorry. It's a mix of working on the tooling (`process-mrva-results.py`/`SubclassFinder.qll`) and enabling subclasses/aliases to be found in the actual modeling. The latter required removing the `private` annotations from much of our modeling. I think we'll just have to live with that.

The only commits I've found that don't follow this pattern, and that could have been made into separate PRs, are:

Notes

  1. The tooling to generate these subclass-capture models automatically only lives internally.
  2. To ensure this automated modeling can still be recreated once we no longer do dependency installation (if we wanted to use a different format, say), the actual modeling with MRVA has only been done while making sure we wouldn't make use of any dependencies that might have been installed (specifically by this commit).

RasmusWL and others added 30 commits December 8, 2023 11:27
Based on some DBs I had that contained dependencies
Also makes `empty.model.yml` empty once again
(makes future diffing much easier)
This is important to model mixins correctly, for example when they help
handle incoming requests, and therefore need to know that `self.kwargs`
contains data controlled by a user.
:thinkies: turns out that .getASubclass*() had to be applied everywhere...
This required making some of the relevant bits public, but they are marked as internal anyway.
Same trick as 'generate-code-scanning-query-list.py'
RasmusWL and others added 5 commits December 8, 2023 16:38
But the new test results look very strange indeed!
for module entry definitions from the dataflow graph.
mostly removing of nodes from the graph.
One result lost:
```
check("submodule.submodule_attr", submodule.submodule_attr, "submodule_attr", globals()) #$ MISSING:prints=submodule_attr
```
tausbn
tausbn previously approved these changes Dec 11, 2023
Contributor

@tausbn tausbn left a comment

Only a few suggestions here and there, otherwise I think this looks good!

Of course, we should do some performance testing before merging. Also, it'll be interesting to see if all of these models make a difference in terms of results for our standard suite.

@@ -298,7 +303,7 @@ module Stdlib {
  * policy, and the code is not in a polished enough state that we want to do so -- at
  * least not without having convincing use-cases for it :)
  */
-private module StdlibPrivate {
+module StdlibPrivate {

Should probably add "INTERNAL: Do not use." to the associated QLDoc.

Co-authored-by: Taus <tausbn@github.com>
RasmusWL and others added 5 commits December 13, 2023 21:54
Co-authored-by: Taus <tausbn@github.com>
These changes took the time for loading and writing all files locally from
29.60s to 3.17s

(that is, using `gather_from_existing`)
Verified by joining all files, splitting again, and observing no diff in
git.

(these operations only take a few seconds on my local machine, so
shouldn't be too much of an issue)
}
}

class WSGIServer extends FindSubclassesSpec {

Check warning

Code scanning / CodeQL

Acronyms should be PascalCase/camelCase. Warning

Acronyms in WSGIServer should be PascalCase/camelCase.
private import semmle.python.frameworks.Pycurl
private import semmle.python.frameworks.RestFramework
private import semmle.python.frameworks.SqlAlchemy
private import semmle.python.frameworks.Tornado

Check warning

Code scanning / CodeQL

Redundant import Warning

Duplicate import, the module is already imported by semmle.python.frameworks.Tornado.
@@ -0,0 +1,79 @@
private import python
private import semmle.python.dataflow.new.DataFlow

Check warning

Code scanning / CodeQL

Redundant import Warning

Redundant import, the module is already imported inside semmle.python.security.dataflow.NoSqlInjectionCustomizations.
Contributor

@yoff yoff left a comment

I would suggest a few comments, but I understand that these may be obvious for people working with the tooling, so I am willing to approve this as is.

Comment on lines +29 to +30
for f in glob.glob(f"{subclass_capture_path}/auto-*.model.yml", recursive=True):
    os.unlink(f)

I wonder if users might be surprised by this also being part of the script behaviour? It should probably be mentioned in the doc-string at least.
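
One way to act on this suggestion is to move the deletion into a small helper whose doc-string states the side effect. The sketch below is only an illustration: the function name `remove_auto_generated_models` and its return value are hypothetical and not taken from the script under review.

```python
import glob
import os


def remove_auto_generated_models(subclass_capture_path: str) -> list:
    """Delete previously generated model files before new ones are written.

    WARNING: this removes every file matching ``auto-*.model.yml`` directly
    inside ``subclass_capture_path``. Files with other names are left alone.

    Returns the sorted list of paths that were deleted, so that a caller can
    log them. (The helper name and return value are assumptions made for
    this sketch.)
    """
    removed = []
    # The pattern contains no "**", so only direct children are matched even
    # though recursive=True is passed, mirroring the reviewed lines.
    pattern = f"{subclass_capture_path}/auto-*.model.yml"
    for f in glob.glob(pattern, recursive=True):
        os.unlink(f)
        removed.append(f)
    return sorted(removed)
```

Returning the deleted paths is a design choice for the sketch only; it makes the side effect easy to report without changing what gets removed.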

package_data[t[1]].add(t)
write_all_package_data_to_files(package_data)

joined_file.unlink()

Similarly to above. The modality of "either you have the joined file or you have all the split ones" should probably be made clear.
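
The "either joined or split" behaviour could be made explicit along the following lines. This is a minimal sketch under stated assumptions: the tab-separated file format, the `auto-<package>.tsv` file names, and the function names `split_joined_file` and `join_split_files` are all hypothetical. Only the grouping by `t[1]` and the final `unlink()` mirror the lines under review.

```python
from collections import defaultdict
from pathlib import Path


def split_joined_file(joined_file: Path, out_dir: Path) -> None:
    """Write one file per package, then DELETE the joined file.

    Afterwards only the split files exist. Each line of the joined file is
    assumed to be a tab-separated tuple whose second field is the package.
    """
    package_data = defaultdict(set)
    for line in joined_file.read_text().splitlines():
        t = tuple(line.split("\t"))
        package_data[t[1]].add(t)  # group rows by package name
    for package, rows in package_data.items():
        content = "".join("\t".join(t) + "\n" for t in sorted(rows))
        (out_dir / f"auto-{package}.tsv").write_text(content)
    joined_file.unlink()


def join_split_files(out_dir: Path, joined_file: Path) -> None:
    """Write the joined file, then DELETE the per-package files.

    Afterwards only the joined file exists.
    """
    split_files = sorted(out_dir.glob("auto-*.tsv"))
    rows = set()
    for f in split_files:
        rows.update(f.read_text().splitlines())
    joined_file.write_text("".join(row + "\n" for row in sorted(rows)))
    for f in split_files:
        f.unlink()
```

Because both directions write sorted output, splitting a sorted joined file and joining it again reproduces it unchanged, which is the same kind of round-trip check described in the commit messages above.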
