Python dataflow: flow summaries restart#8781
Conversation
|
I've only done a superficial glance at this, but I'd like to talk about the issue with The root of the problem is that the following cycle is created between call graph construction and API graphs: Unless we want API graphs to be mutually recursive with type-tracking steps and call construction, we'll need to break the cycle somehow. As far as I can tell, you've broken the cycle like this: This means anything not depending on API graphs can contribute to I believe Ruby's solution so far is to make type-tracking steps explicitly depend on non-library callables only. The downside here is that none of their standard library models are visible to type-tracking. (@hvitved @aibaars do you agree here?) @yoff Since I'm unfamiliar with the Python libraries I'm not 100% sure my interpretation above is an accurate description of what you've done here. Could you please check if this is the case? If so, I think it's a fine solution provided we can come up with names that aren't confusing. |
1430ab6 to
b4aa941
Compare
|
Force pushed because I could see that my amended commit message to the last commit had not been pushed. |
Correct (since flow summaries depend on API graphs, which depend on type tracking). |
|
The diagram looks correct to me. Is the difference with Ruby just that we have not (yet?) written any standard library models using flow summaries? Or is there a more fundamental difference? |
b4aa941 to
ebedc7e
Compare
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatchPointsTo.qll
Fixed
Show fixed
Hide fixed
| /** | ||
| * A query predicate for outputting flow summaries in semi-colon separated format in QL tests. | ||
| * The syntax is: "namespace;type;overrides;name;signature;ext;inputspec;outputspec;kind", | ||
| * ext is hardcoded to empty. | ||
| */ |
Check warning
Code scanning / CodeQL
Predicate QLDoc style.
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatchPointsTo.qll
Fixed
Show fixed
Hide fixed
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatchPointsTo.qll
Fixed
Show fixed
Hide fixed
|
The failing language test belongs to |
tausbn
left a comment
There was a problem hiding this comment.
Thank you for this Herculean effort!
I think in terms of the implementation, this looks pretty solid. I've added a bunch of requests for clarification (and also some requests for renaming stuff). I'll be very happy to discuss this synchronously if that's easier. 🙂
| abstract predicate isParameterOf(DataFlowCallable c, int i); | ||
| } | ||
|
|
||
| class SourceParameterNode extends ParameterNode, CfgNode { |
There was a problem hiding this comment.
I took me a while to figure out what Source was referring to in this context (what with this being dataflow and all). I wonder if there is a better name for this, or is SourceParameterNode just the accepted term?
| } | ||
| } | ||
|
|
||
| abstract class DataFlowSourceCall extends DataFlowCall, TDataFlowSourceCall { |
There was a problem hiding this comment.
Like my comment on SourceParameterNode, I feel like this is also potentially confusing. At the very least there should be QLDoc that explains what's going on.
| nodeFrom.asCfgNode() = call.getNode().getKwargs().getAFlowNode() | ||
| ) | ||
| or | ||
| FlowSummaryImpl::Private::Steps::summaryLocalStep(nodeFrom, nodeTo, true) |
There was a problem hiding this comment.
Hmm... So we allow summaries to extend local flow steps (which means they underpin all of the dataflow machinery). Does this not lead to everything being recursively dependent?
There was a problem hiding this comment.
No, there are now two local step relations. One for type trackers and one for the global data flow computation.
| kind = kind and | ||
| exists(Call call, Name f, FunctionDef def | | ||
| f = call.getAnArg() and | ||
| def.getDefinedFunction().getName() = f.getId() and | ||
| // c.getCallableValue() = def.getDefinedFunction().getDefinition() and | ||
| c.getName() = f.getId() | ||
| ) |
There was a problem hiding this comment.
This may have been fixed in a later commit, but as QL-for-QL helpfully points out, creation isn't restricted by this disjunct.
| * Includes function calls, method calls, class calls and library calls. | ||
| * All these will be associated with a `CallNode`. | ||
| */ | ||
| TNonSpecialCall(CallNode call) or |
There was a problem hiding this comment.
Something about NonSpecial... rubs me the wrong way. Could we use a positive formulation instead? I'm tempted to suggest CallNodeCall, though that is a bit of a mouthful. (However, it does precisely say that these are calls that happen at call nodes.)
Alternatively, since we have SpecialCall, perhaps its counterpart should be NormalCall?
| /** Gets a callable value for this callable, if one exists. */ | ||
| abstract CallableValue getCallableValue(); | ||
|
|
||
| abstract CallNode getACall2(); |
There was a problem hiding this comment.
This needs a better name. Even if it is only internal, it certainly doesn't convey what it's actually supposed to do.
| override DataFlowCallable getEnclosingCallable() { result.getScope() = call.getNode().getScope() } | ||
| } | ||
|
|
||
| abstract class NonLibraryDataFlowSourceCall extends NonSpecialCall { |
There was a problem hiding this comment.
I think this name needs a bit of workshopping. Could we come up with a better description of "non-library"?
| } | ||
|
|
||
| class SourceParameterNode extends ParameterNode, CfgNode { | ||
| //, LocalSourceNode { |
There was a problem hiding this comment.
I'm still a bit ambivalent about these not being local sources, and I would like to understand this a bit more. Is this merely because of negative recursion issues?
There was a problem hiding this comment.
Negative recursion is what directly prevents me from inheriting. But I can add that inheritance into the Ruby codebase which basically has the same structure, so I think the solution is to clear up the mutual recursive structure of our predicates, which I fear lacks a bit of discipline at the moment.
There was a problem hiding this comment.
With all of the recent changes, is your comment still in effect?
There was a problem hiding this comment.
Unfortunately, yes. I do feel we should be able to resolve this, but I had a quick go and it seems there are multiple cycles to untangle, so I would prefer to do this later..
There was a problem hiding this comment.
Alright. Thanks for checking!
hvitved
left a comment
There was a problem hiding this comment.
A few drive-by comments. Is the logic actually hooked up yet? I fail to see references to e.g. any of the FlowSummaryImpl::Private::Steps predicates. See e.g. DataFlowPrivate.qll for ruby for how this is done.
| * flow graph. | ||
| */ | ||
| class ParameterNode extends CfgNode, LocalSourceNode { | ||
| abstract class ParameterNode extends Node { |
There was a problem hiding this comment.
Are you sure you want an abstract class in the public API?
| @@ -0,0 +1,175 @@ | |||
| /** | |||
| * Provides Ruby specific classes and predicates for defining flow summaries. | |||
| or | ||
| call = any(ClassValue c | not c.isAbsent()).getACall() | ||
| or | ||
| call = any(SpecialMethodCallNode special) |
There was a problem hiding this comment.
call instanceof SpecialMethodCallNode
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatchPointsTo.qll
Show resolved
Hide resolved
| @@ -2,6 +2,7 @@ private import python | |||
| private import DataFlowPublic | |||
There was a problem hiding this comment.
I think you are missing allowParameterReturnInSelf, see https://github.com/github/codeql/blob/main/ruby/ql/lib/codeql/ruby/dataflow/internal/DataFlowPrivate.qll#L940.
There was a problem hiding this comment.
Thanks! Indeed we just have none().
They are mostly in |
Ah, good, sorry for the false alarm. I guess I must have looked at a diff limited to a subset of commits perhaps. |
- add flowsummaries shared files - register in indentical files - fix initial non-monotonic recursions - add DataFlowSourceCall - add resolvedCall - add SourceParameterNode failing tests: - 3/library-tests/with/test.ql
allowing more `OutNode`s (not restricting to `CallNode`s), gives more flow in the `classesCallGraph` test
also convert to inline tests
summary flow is excluded from the local flow relation used for typetracking, but included in the one used for global data flow.
- Branch predicates are made simple. In particular, they do not try to detect library calls. - All branches based on `CallNode`s are gathered into one. - That branch has been given a class `NonSpecialCall`, which is the new parent of call classes based on `CallNode`s. (Those classes now have more involved charpreds.) - A new such class, 'LambdaCall` has been split out from `FunctionCall` to allow the latter to replace its general `CallNode` field with a specific `FunctionValue` one. - `NonSpecialCall` is not an abstract class, but it has some abstract overrides. Therefor, it is not considered a resolved call in the test `UnresolvedCalls.qll`.
Two classes have been inserted into the hierarchies: - `NonLibraryDataFlowCallable` with a method `getACall2`. This method implements "get a call, not considering flow summaries". For `NonLibraryDataFlowCallable`s, `getACall` will defer to `getACall2`. While you could have a synthesised call to such a callable, it would not correspond to a `CallNode`. - `NonLibraryDataFlowSourceCall` with methods `getArg2` and `getCallable2`. These also refer to a call graph that does not consider flow summaries. `getArg2` is used to synthesise pre-update nodes for arguments. `getCallable2` is used in `connects` to compute argument passing. This is used to define data flow nodes for overflow arguments. `getACall2` ensures that `LibraryCallableValue::getACall` is not called when the charpred of `FunctionCall` is evaluated.
- sync summary files - format files - fix compilation
This can be used when one wants to consider a (source) parameter node as a local source.
by making client code use the "new" class. Really, this part of the split class should have the old name, to minimise disruptions to clients. Same goes for the other split classes.
Thank you for the review. I think both renaming and qldocs are warranted. Specifically, when classes are split (in order to only have one participate in a recursion that would otherwise be non-monotonic) the user-facing one should be retaining the old name so as to minimise breaking changes. |
to avoid nodes in the extracted stdlib Was there a reason for this depth to be 1?
ede5f3c to
56dcfc2
Compare
since map resolves to a class call also fix test expectation
This is not necessary as long as `LibraryCall` only includes unresolved calls.
RasmusWL
left a comment
There was a problem hiding this comment.
I think it would be super nice with a change to the result type of SummarizedCallable.getACall. (which seems like a fairly simple change, I can even do it myself if you want me to)
Besides that, I don't think I have requests for any other significant changes.
I put it in frameworks\Builtins.qll. It could also go in the StdLib folder, I guess. Or perhaps you have a different place that you find natural, feel free to suggest :-)
We have normally put both builtins and standard library modeling in the Stdlib.qll file. I guess that is not technically 100% correct. However, on the top of my head, it doesn't seem like we're gaining anything from splitting it into two, so I would suggest you move it into the existing file (but do let me know if I'm overlooking something 🤷). ((note: if you do want to keep the Builtins.qll file, you should add module structure like in all the other files in the frameworks folder)
| class ReversedSummary extends SummarizedCallable { | ||
| ReversedSummary() { this = "builtins.reversed" } | ||
|
|
||
| override CallNode getACall() { result = API::builtin("reversed").getACall().getNode() } |
There was a problem hiding this comment.
This line of code makes me wonder if we could change getACall for SummarizedCallable to have DataFlow::CallCfgNode as a result type, so this could just be written as result = API::builtin("reversed").getACall(). The extra .getNode() seems like it will be included every time we define a SummarizedCallable, so we might as well get away with it now?
There was a problem hiding this comment.
It is slightly inconsistent for LibraryCallable to return a DatFlow::Node here, when the other callables return control flow nodes, but I tried it now, see if you like :-)
There was a problem hiding this comment.
yes, I'm happy 👍 Much of it will be changed with the new call-graph anyway 🤷
python/ql/lib/semmle/python/dataflow/new/internal/FlowSummaryImplSpecific.qll
Show resolved
Hide resolved
The other callables return control flow nodes, so it is slightly inconsistent for this to return a data flow node, but it does make models based on API graphs nicer.
RasmusWL
left a comment
There was a problem hiding this comment.
Minor fixes needed... but almost 😅
python/ql/lib/semmle/python/dataflow/new/internal/FlowSummaryImplSpecific.qll
Outdated
Show resolved
Hide resolved
Co-authored-by: Rasmus Wriedt Larsen <rasmuswriedtlarsen@gmail.com>
Indeed, thanks for the fix :-) |
tausbn
left a comment
There was a problem hiding this comment.
A few minor comments here and there. Otherwise, I think this looks very good!
Regarding the overloading of "source" to mean "in source code", how about "extracted"? This, to me, seems fairly unambiguous...
| predicate lambdaCall(DataFlowCall call, LambdaCallKind kind, Node receiver) { none() } | ||
| predicate lambdaCall(DataFlowCall call, LambdaCallKind kind, Node receiver) { | ||
| receiver = call.(SummaryCall).getReceiver() and | ||
| kind = kind |
There was a problem hiding this comment.
So, if I understand correctly, this is to allow kind to be any LambdaCallKind? If so, maybe exists(kind) is a more idiomatic way of indicating an intentional cartesian product?
There was a problem hiding this comment.
That is correct (and LambdaCallKind is currently Unit).
| } | ||
|
|
||
| class SourceParameterNode extends ParameterNode, CfgNode { | ||
| //, LocalSourceNode { |
There was a problem hiding this comment.
With all of the recent changes, is your comment still in effect?
The precedence for the use of "source" to denote elements of source code is found in `EssaVariable::getSourceVariable` as well as in the Ruby code base. But it clashes with the many uses of source to mean "source of flow" found in the data flow library.


This demonstrates that flow summaries can be based on API graphs and all our existing tests can pass 🎉
I have trimmed down the changes needed to get to this state, and given detailed explanations of the tricky transformations needed to avoid non-monotonic recursion.
For a moment, the commit history was beautiful.
Then main moved on and more changes needed to be made, now it is a bit longer and more hairy.
Also, there is a branch where it is even more hairy, cutting up some of the inheritance hierarchies in order to achieve
SourceParameterNode extends LocalSourceNode, but I suggest we leave that for now.SourceParameterNodeis no longer something non-internal development should be concerned about.Names have become a bite nicer and the impact on user code and on
.expected-files has been minimised.We might still do more test, showing off the capabilities, especially the use from MaD which is not demonstrated at the moment. However, simply having it compile and the basic tests pass seems like a worthy milestone at this point.