JS: ATM: New features for imports and for function parameters related to an endpoint #8740
kaeluka wants to merge 889 commits into github:esbena/improve-features from kaeluka:atm-file-imports-feature
|
did some long-needed housekeeping, hence the force pushes. |
|
Experimentation being tracked here. |
|
@kaeluka are When looking and SQL injection,
|
|
Also, |
|
Here's an example that I'm thinking might inspire more feature improvements: I trained a model using a combination of the old and new features, but dropped the 5 old features that are supposed to be replaced by new features (argumentIndex, calleeName, calleeAccessPathWithStructuralInfo, calleeAccessPath, calleeApiName).
This endpoint ended up as a FP for SQL injection. It's missing all but 4 features: contextFunctionInterfaces, enclosingFunctionBody, enclosingFunctionName, and fileImports. It looks like there's a TP SQL sink within the same function on line 35. The four features contextFunctionInterfaces, enclosingFunctionBody, enclosingFunctionName, and fileImports are of course identical for both endpoints, since they sit in the same function. (This sink actually involves the same variable as the FP, but in the TP the variable is used as input to
Maybe we could expect the model to learn that if it has only these more global features it should classify the endpoint as NotASink, because real sinks generally have more features, but some sinks don't have all too many features either.
I wonder whether there are features we could add that would help recognize more directly that this is not a SQL sink -- such as something that indicates that the endpoint is just being assigned to a variable. At the same time, I'm hesitant to turn our features into a long list of heuristics that effectively try to distinguish between sinks and non-sinks, because that won't generalize or scale very well. Ideally we'd have a short, simple set of features, and let the model do the heavy lifting of learning patterns that distinguish sinks from non-sinks.
That was the idea with the original feature, enclosingFunctionBody, but of course a single function often contains both sinks and non-sinks. That's why we tried to think of ways to define a more localized version of enclosingFunctionBody, but we never got anywhere good with that exploration. That's also why we wanted to give the model access to the full syntax in the function body rather than just the names of objects / functions....
Happy to do some brainstorming about whether we can write a smaller set of more general features that would contain enough information for the model to learn how to distinguish sinks from non-sinks. |
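To make the observation above concrete, here is a minimal sketch showing why two endpoints in the same function look identical to the model when only the four function- and file-level features are populated. All feature values, the `SqlInjectionSink` label name, and the `sameFeatures` helper are hypothetical and only serve as illustration; they are not taken from the ATM code or data.

```javascript
// Hypothetical feature values shared by every endpoint in one function.
const shared = {
  contextFunctionInterfaces: "runQuery(client, sql)",
  enclosingFunctionBody: "function runQuery(client, sql) { ... }",
  enclosingFunctionName: "runQuery",
  fileImports: "pg",
};

// Two endpoints in the same function: one real sink, one non-sink.
// The label names are assumptions made for this sketch.
const truePositive = { features: { ...shared }, label: "SqlInjectionSink" };
const falsePositive = { features: { ...shared }, label: "NotASink" };

// Returns true when both endpoints carry exactly the same feature values.
function sameFeatures(a, b) {
  const keysA = Object.keys(a.features);
  const keysB = Object.keys(b.features);
  return (
    keysA.length === keysB.length &&
    keysA.every((k) => a.features[k] === b.features[k])
  );
}

console.log(sameFeatures(truePositive, falsePositive)); // prints: true
```

With identical inputs and different labels, no classifier can separate the two endpoints; at least one endpoint-local feature has to differ.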
|
Thanks, Tiferet! It's good to see that our feedback loop is becoming faster!
One thing we've been trying to achieve here is to increase feature coverage, in other words: make sure that each endpoint has at least some features. One thing I could try next week is to locally run some ad-hoc queries that show me endpoints with only a few features and try to systematically improve the situation.
Regarding the specific example you gave: if all the model gets is values for contextFunctionInterfaces, enclosingFunctionBody, enclosingFunctionName, and fileImports, I'm not surprised it'll misclassify the endpoint. Those features are the same for each endpoint in that function/method, and even in the most dangerous of methods, most endpoints are not sinks. Meaning: what you have found here is a bug in that there NEEDS to be better feature coverage in this instance. |
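The coverage check described above could look roughly like the following sketch. It assumes the feature values have already been exported to plain records with an empty string for a missing feature; the record layout, the endpoint ids, and the `lowCoverage` helper are hypothetical, and the real check would more likely be an ad-hoc QL query.

```javascript
// Hypothetical export: one record per endpoint, "" marks a missing feature.
const endpoints = [
  { id: "a.js:10", features: { calleeImports: "pg", fileImports: "fs pg", enclosingFunctionName: "run" } },
  { id: "a.js:25", features: { calleeImports: "", fileImports: "fs pg", enclosingFunctionName: "run" } },
  { id: "b.js:3", features: { calleeImports: "", fileImports: "", enclosingFunctionName: "" } },
];

// Number of features that have a non-empty value.
function countDefined(features) {
  return Object.values(features).filter((v) => v !== "").length;
}

// Endpoints with fewer than `minDefined` features, worst coverage first.
function lowCoverage(records, minDefined) {
  return records
    .map((e) => ({ id: e.id, defined: countDefined(e.features) }))
    .filter((e) => e.defined < minDefined)
    .sort((x, y) => x.defined - y.defined);
}

// Lists b.js:3 (0 defined) first, then a.js:25 (2 defined).
console.log(lowCoverage(endpoints, 3));
```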
|
Btw, the alert on line 25 is perhaps not perfect, but if that was the only alert a developer would see, it would still clearly point her to an actual vulnerability. So even though this is, technically, an FP, it's also close enough to the TP to make it extremely valuable. A developer would not perceive this as an FP but send us a thank-you note. Edit: |
|
Ok, here's some results from locally digging into this: I was at first surprised that the Additionally, I'm going to add a feature And I had forgotten to answer that question:
Yes, that's expected :) |
|
@kaeluka Following up on our conversation today, here are two spreadsheets listing all the endpoints that have flow from a source in our evaluation set for Tainted Path Injections.
According to our end-to-end evaluation, training with the new features + three of the old features does slightly worse on Tainted Path Injection than training with only the old features, but the change is not significant. For all four queries, we're seeing similar end-to-end metrics with the new features as with the old ones. The question is, do the new features really have little impact (seems unlikely), or do our metrics not reflect the user experience well (e.g. number of embarrassing FPs)?
NOTE: To make it easier to find interesting examples for manual inspection, I added a "classification" column (FN, FP, TN, TP), and sorted results first by this column and then by "num empty features", the number of features this endpoint is missing (out of either 12 or 8, depending on the experiment). |
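For readers who want to reproduce the ordering described in the note, here is a small sketch of deriving the classification column and sorting by it. The row fields (`isSink`, `predictedSink`, `numEmptyFeatures`) are hypothetical names, and ascending order for the second key is an assumption; the spreadsheets may be laid out differently.

```javascript
// Derive FN / FP / TN / TP from ground truth and the model's decision.
function classify(row) {
  if (row.isSink) return row.predictedSink ? "TP" : "FN";
  return row.predictedSink ? "FP" : "TN";
}

// Sort by classification first, then by number of empty features (ascending).
function annotateAndSort(rows) {
  return rows
    .map((r) => ({ ...r, classification: classify(r) }))
    .sort((a, b) => {
      if (a.classification !== b.classification) {
        return a.classification < b.classification ? -1 : 1;
      }
      return a.numEmptyFeatures - b.numEmptyFeatures;
    });
}

// Hypothetical rows for illustration.
const rows = [
  { id: "e1", isSink: true, predictedSink: true, numEmptyFeatures: 2 },
  { id: "e2", isSink: true, predictedSink: false, numEmptyFeatures: 5 },
  { id: "e3", isSink: false, predictedSink: true, numEmptyFeatures: 8 },
  { id: "e4", isSink: true, predictedSink: false, numEmptyFeatures: 1 },
];

// Order: e4 (FN, 1), e2 (FN, 5), e3 (FP, 8), e1 (TP, 2).
console.log(annotateAndSort(rows).map((r) => `${r.id}:${r.classification}`));
```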
|
Perfect, thank you!! As always, looking at specific examples is great. I want to zoom in on one example that's present in both files and show you how I think the new features are a game changer here.
TL;DR: with the old features, the model can't be expected to make a correct decision here. It's blind. With the new features, I'm certain the problem lies elsewhere.
I'm picking an example that is present in both datasets so we can evaluate what the change we're implementing is doing.
Example: https://github.com/mapbox/mapbox-studio-classic/blob/99d9084/lib/style.js#L440
If you look at the example, you'll agree with me that this is an easy one. The model should really, really be able to figure that one out. I've pasted the code below and marked the endpoint with carets:
var reader = fstream.Reader({
  path: uri.dirname,
  //    ^^^^^^^^^^^
  type: 'Directory',
  ...
})
Results
In both datasets you've sent me, this example is classified as a (false) negative.
On What Grounds are Those Misclassifications Made?
I computed the feature values locally on this file, and am listing their values here. This way, we can gain insights into where this misclassification is coming from. If the new features are not informative enough for a human to tell this is a sink, then we lack features. If they are, we have a problem in training data selection, ML model architecture, or who-knows-where.
Defined Feature Values (old)
As you can see, with the old features, the ML model has no way of making a good decision here. It's only getting an unstructured representation of the function body and a name. It doesn't even have a way of knowing which endpoint it's being asked about (I'm making the endpoint bold for you to show you how impossible the ML model's task here is).
Defined Feature Values (new)
Contrary to the old features, where the ML model was definitely not to blame, this misclassification makes no sense to me at ALL. I can see that the input goes into something named path as an argument, I can see that the call is to
Something is wrong here that's beyond features. Any ideas, @tiferet?
edit: update selection of old/new features according to |
I need to look at this, but I think this is the wrong URL? It doesn't match the code you're analyzing |
|
Oh my! This is the line: https://github.com/mapbox/mapbox-studio-classic/blob/99d9084/lib/style.js#L440 Editing above! |
|
@kaeluka A quick first observation: I don't know if you noticed, but with the new features this endpoint was misclassified as a SQL injection sink, not a non-sink. (You can see this by looking at the score columns in the spreadsheets, and finding the column with the highest score.) I assume there's no plausible reason for the model to think that, right? |
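Reading the predicted class off the score columns amounts to an argmax over those columns. A minimal sketch, assuming one score per class in each row; the column names and score values below are made up for illustration and do not come from the spreadsheets.

```javascript
// Returns the name of the score column with the highest value.
function predictedClass(scores) {
  return Object.entries(scores).reduce((best, cur) =>
    cur[1] > best[1] ? cur : best
  )[0];
}

// Hypothetical score columns for a single endpoint.
const scoreRow = {
  NotASink: 0.21,
  SqlInjectionSink: 0.47,
  TaintedPathSink: 0.2,
  XssSink: 0.12,
};

console.log(predictedClass(scoreRow)); // prints: SqlInjectionSink
```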
|
And another 🤔 : Why is |
Not that this answers any of your questions, but I think the two old features that exist are |
Question
Why does the model misclassify https://github.com/mapbox/mapbox-studio-classic/blob/99d9084/lib/style.js#L440 as
Hypothesis
@kaeluka is it possible that this type of sink is just too rare to be represented in our training data?
Reasoning: No similar features appear in training examples
I looped through all training data, looking for
I found no such examples for
For
The token
TL;DR
If my analysis is correct, the reason the model misclassified this example is that none of the features that you found informative as an expert were informative to the model, because the training set included no examples with similar values in these features.
Reference code and data
Code is found in branch
Just for reference, here's the code I used to produce the list:
Here's the list I'm looking at: all training and validation sinks of type |
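The code referenced above did not survive in this thread, so the following is only a sketch of the general idea: scan the training examples for a given token in a given feature. The training records, the choice of `calleeImports` as the feature, the `TaintedPathSink` label, and the `examplesWithToken` helper are all assumptions made for illustration.

```javascript
// Hypothetical training examples; real data and labels will differ.
const training = [
  { label: "TaintedPathSink", features: { calleeImports: "fs" } },
  { label: "TaintedPathSink", features: { calleeImports: "fs path" } },
  { label: "NotASink", features: { calleeImports: "pg" } },
  { label: "NotASink", features: { calleeImports: "" } },
];

// Examples whose feature value contains `token` as a whole
// whitespace-separated token (not merely as a substring).
function examplesWithToken(examples, featureName, token) {
  return examples.filter((e) =>
    (e.features[featureName] || "").split(/\s+/).includes(token)
  );
}

console.log(examplesWithToken(training, "calleeImports", "fstream").length); // prints: 0
console.log(examplesWithToken(training, "calleeImports", "fs").length); // prints: 2
```

An empty result for a token that looks decisive to a human reader is consistent with the hypothesis that the model never saw a comparable example.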
Co-authored-by: Henry Mercer <henrymercer@github.com>
…se positives for XSS query
…g concatenation leaves are usually not sinks
This adds four new features that make the imports in an endpoint's file, and function parameters related to the endpoint, visible to the model.
Contained Features
Feature 1: fileImports
Example:
In this file, any endpoint will have the value fs pg test for the fileImports feature.
Feature 2: calleeImports
This lists only the imports used in the callee of an invocation of which the endpoint is an argument or part of an argument.
This should, once experimentation has concluded that it's OK, replace the existing calleeApiName feature, which is not stable (it relies on API graphs; context: https://github.com/github/ml-ql-adaptive-threat-modeling/issues/1843).
Example
Feature 3: contextSurroundingFunctionParameters
Contains the parameters of all functions surrounding the endpoint.
Feature 4: contextFunctionInterfacesInFile
Interfaces (e.g., name(param1, param2, param3)) of all functions in the same file.
Dependencies
This depends on #8586.
Review starting at the commit "ATM: new feature to list all imports in an endpoint's file".
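As a rough illustration of what fileImports and contextFunctionInterfacesInFile capture, here is a regex-based approximation in JavaScript. This is not how the PR computes the features (that is done in QL on the extracted program); the helper names, the sample file, and the alphabetical ordering of the import list are assumptions made for this sketch.

```javascript
// Approximates fileImports: module names from simple `require("x")` and
// `import ... from "x"` forms, sorted and space-separated (assumed format).
function fileImports(source) {
  const names = new Set();
  const patterns = [
    /require\(\s*['"]([^'"]+)['"]\s*\)/g,
    /import\s+(?:[^'"]*?\s+from\s+)?['"]([^'"]+)['"]/g,
  ];
  for (const pattern of patterns) {
    for (const match of source.matchAll(pattern)) names.add(match[1]);
  }
  return [...names].sort().join(" ");
}

// Approximates contextFunctionInterfacesInFile: `name(param1, param2)` for
// every plain `function name(...)` declaration in the file.
function functionInterfaces(source) {
  const pattern = /function\s+([A-Za-z_$][\w$]*)\s*\(([^)]*)\)/g;
  return [...source.matchAll(pattern)].map((m) => {
    const params = m[2].split(",").map((p) => p.trim()).filter(Boolean);
    return `${m[1]}(${params.join(", ")})`;
  });
}

// Hypothetical file contents; only the text is scanned, nothing is executed.
const sampleSource = `
const pg = require('pg');
const fs = require("fs");
import test from 'test';

function runQuery(client, sql, params) { return client.query(sql, params); }
function readConfig(path) { return fs.readFileSync(path, 'utf8'); }
`;

console.log(fileImports(sampleSource)); // prints: fs pg test
console.log(functionInterfaces(sampleSource)); // runQuery(client, sql, params) and readConfig(path)
```

Arrow functions, methods, and dynamic imports are deliberately ignored here to keep the sketch short.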