Add neighborhood scope token feature to ATM library #7158

Draft: wants to merge 8 commits into main

@annarailton annarailton commented Nov 17, 2021

⚠️ This is a replacement of #7148 where the code is on a branch of codeql, rather than a fork. This makes testing these changes in backend much more straightforward.

Taking the work in https://github.com/github/ml-ql-adaptive-threat-modeling-backend/tree/annarailton/neighbourhood-features and moving it into the CodeQL library.

The feature neighborhoodBody is now a token feature extracted in ExtractEndpointData.ql, alongside the likes of enclosingFunctionBody.

See github/ml-ql-adaptive-threat-modeling#1553

@annarailton annarailton commented Nov 17, 2021

Suggestions for names for this feature (please add to this)

  • neighborhoodBody
  • enclosingFunctionBodyNeighborhood
  • enclosingFunctionBodyLocalScope
  • enclosingFunctionBodyLocal
  • enclosingFunctionBodyEndpointNeighborhood

A few thoughts from @tiferet:

  • Although they're long, I like the options that include enclosingFunctionBody, because they call out the relationship with the existing features enclosingFunctionBody and enclosingFunctionName.
  • To me enclosingFunctionBodyNeighborhood sounds like it's the neighborhood around the function body (i.e. a superset of the function body) rather than a neighborhood within the function body (i.e. a subset of the function body). [ I agree - Anna]
  • enclosingFunctionBodyEndpointNeighborhood is probably clearest, although also longest. I'm happy with that or with enclosingFunctionBodyLocalScope / enclosingFunctionBodyLocal.
  • Eventually maybe we can tweak this so it's not limited only to a subtree of the enclosing function body. If this feature is called enclosingFunctionBodyEndpointNeighborhood, the more general one can then be just endpointNeighborhood...

@annarailton annarailton force-pushed the annarailton/neighborhood-features branch from a14a636 to 0dc2cef Nov 18, 2021
annarailton and others added 8 commits Nov 18, 2021
Co-authored-by: Chris Smowton <smowton@github.com>
Co-authored-by: Chris Smowton <smowton@github.com>
This provides functionality for getting the token features associated with a
neighborhood around an AST node. It is strongly related to `FunctionBodies`.

Co-authored-by: Chris Smowton <smowton@github.com>
Co-authored-by: Chris Smowton <smowton@github.com>
Co-authored-by: Tiferet Gazit <tiferet@github.com>
@annarailton annarailton force-pushed the annarailton/neighborhood-features branch from 0dc2cef to 05460f6 Nov 18, 2021
// approximates the behavior of the classifier on non-generic body features where large body
// features are replaced by the absent token.
if count(DatabaseFeatures::AstNode node, string token | bodyTokens(rootNode, node, token)) > 256
if

@tiferet tiferet Nov 18, 2021

Do we actually need getBodyTokenFeatureForNeighborhoodNode, or can we reuse getBodyTokenFeatureForEntity?
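
For reference, the size cap in the hunk above can be modelled outside QL. This is a minimal Python sketch, not the library code: it assumes the body feature is a space-joined token sequence and that an over-long sequence is replaced wholesale by an absent marker. `body_token_feature` and `MAX_BODY_TOKENS` are hypothetical names, and the `<ABSENT>` spelling is borrowed from later in this thread, so it may not match the internal representation.

```python
from typing import Sequence

MAX_BODY_TOKENS = 256  # threshold taken from the diff hunk above
ABSENT = "<ABSENT>"    # marker spelling assumed for illustration only


def body_token_feature(tokens: Sequence[str]) -> str:
    """Join tokens into one feature string, or return the absent marker if too long."""
    if len(tokens) > MAX_BODY_TOKENS:
        return ABSENT
    return " ".join(tokens)
```

If both the entity and the neighborhood predicates reduce to this shape, the only difference between them would be which root node supplies the tokens.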

if getNumDescendents(node.getParentNode()) > maxNumDescendants()
if
// `node` will always have a parent as we start at an endpoint
node.getParentNode() = getOutermostEnclosingFunction(node) or

@tiferet tiferet Nov 18, 2021

Doesn't this mean the neighborhood can never be the entire enclosing function? We could instead do

if
      node = getOutermostEnclosingFunction(node) or
...

@annarailton annarailton Nov 19, 2021

@annarailton annarailton Nov 19, 2021

This emulates what happens in getTokenBodyFeatureForEntity, which also returns the function body but not the top-level function AST node (i.e. the function name + parameters I think).

In the above example, neighborhoodBody is <ABSENT> but enclosingFunctionBody is also really short (and would be identical to neighborhoodBody).

@tiferet tiferet Nov 19, 2021

I have a few questions about this:

  1. If this emulates what happens in getTokenBodyFeatureForEntity then why does neighborhoodBody end up <ABSENT> while enclosingFunctionBody is short but not absent?
  2. Why would we want neighborhoodBody to be <ABSENT> rather than being short and identical to enclosingFunctionBody? There could still be useful signal in the short sequence. Different features have different paths through the network (different parameters are learned for each), so having identical values in some instances isn't redundant. Also, we're hoping we may be able to replace the full function body with these more localized features eventually.
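
To make the two stopping conditions in this thread concrete, here is a toy Python model of the ancestor walk. It is a sketch under assumptions rather than the QL in the diff: the full condition is truncated in the hunk, so combining the function check with the size check as a disjunction is a guess, and `Node`, `neighborhood_root` and `stop_below_function` are hypothetical names. The walk is assumed to start at an endpoint inside a function.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    label: str
    is_function: bool = False
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def num_descendants(node: Node) -> int:
    # Reflexive count (the node itself plus everything below it).
    return 1 + sum(num_descendants(c) for c in node.children)


def outermost_function(node: Node) -> Optional[Node]:
    result, cur = None, node
    while cur is not None:
        if cur.is_function:
            result = cur
        cur = cur.parent
    return result


def neighborhood_root(node: Node, max_descendants: int,
                      stop_below_function: bool = True) -> Node:
    """Climb from `node` until the next step would leave the allowed region."""
    top = outermost_function(node)
    while True:
        parent = node.parent
        # True: condition as in the diff (parent is the function).
        # False: suggested variant (node itself is the function).
        at_top = (parent is top) if stop_below_function else (node is top)
        if at_top or parent is None or num_descendants(parent) > max_descendants:
            return node
        node = parent


# f() { if (...) { call(endpoint, arg) } }
f = Node("f", is_function=True)
if_stmt = f.add(Node("if"))
call = if_stmt.add(Node("call"))
endpoint = call.add(Node("endpoint"))
call.add(Node("arg"))
```

With a threshold of 10 on this tree, `neighborhood_root(endpoint, 10)` stops at the `if` node, while `neighborhood_root(endpoint, 10, stop_below_function=False)` reaches `f` itself. With a threshold of 3 both variants stop at `call`, because the size check fires first.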

then result = node
else result = getNeighborhoodAstNode(node.getParentNode())
}

/** Count number of descendants of an AST node */
int getNumDescendents(Raw::AstNode node) { result = count(node.getAChildNode*()) }

private ASTNode getContainer(ASTNode node) {

@tiferet tiferet Nov 18, 2021

OOI, what's the difference between a container and a parent?

@annarailton annarailton Nov 19, 2021

A container skips up to the enclosing function, whereas a parent steps one level up the AST, e.g. for

f() { 
  if(endpoint) {
    … 
  } 
}

the container is f() but the parent is if(…)
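
The same distinction as a toy Python model, assuming "container" means the nearest strictly enclosing function as described above. `Node`, `get_parent` and `get_container` are hypothetical stand-ins for illustration, not the CodeQL classes or predicates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    label: str
    is_function: bool = False
    parent: Optional["Node"] = None


def get_parent(node: Node) -> Optional[Node]:
    # One step up the tree.
    return node.parent


def get_container(node: Node) -> Optional[Node]:
    # Skip upwards until a function is found (or the tree runs out).
    cur = node.parent
    while cur is not None and not cur.is_function:
        cur = cur.parent
    return cur


# f() { if (endpoint) { ... } }
f = Node("f()", is_function=True)
if_stmt = Node("if", parent=f)
endpoint = Node("endpoint", parent=if_stmt)
```

Here `get_parent(endpoint)` is the `if` node and `get_container(endpoint)` is `f()`.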

@tiferet tiferet Nov 19, 2021

But then why do we need getContainer*?

@tiferet tiferet Nov 19, 2021

If an endpoint is enclosed in a function that's enclosed in a function, do the enclosingFunction features look at the outermost function?

@@ -8,6 +8,12 @@ import javascript
import CodeToFeatures
import EndpointScoring

/** Maximum number of descendants of an AST node to be considered to be in the "neighborhood" of that node */

@tiferet tiferet Nov 18, 2021

Before running the dev pipeline, do we want to produce several different features with different values of maxNumDescendants, so we can experiment and see which give good signal?

@annarailton annarailton Nov 19, 2021

That sounds sensible yes.
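
One way to picture such an experiment is the Python sketch below. It assumes the neighborhood is the largest ancestor subtree whose size does not exceed the threshold, with the endpoint itself as the fallback. The candidate thresholds and the example sizes are made up for illustration, and `neighborhood_size` and `sweep` are hypothetical helpers.

```python
from typing import Dict, Sequence


def neighborhood_size(ancestor_sizes: Sequence[int], max_descendants: int) -> int:
    """ancestor_sizes[0] is the endpoint's subtree size; each later entry is its parent's."""
    chosen = ancestor_sizes[0]
    for size in ancestor_sizes[1:]:
        if size > max_descendants:
            break  # the next ancestor is too large, so stop climbing
        chosen = size
    return chosen


def sweep(ancestor_sizes: Sequence[int], candidates: Sequence[int]) -> Dict[int, int]:
    # One neighborhood per candidate threshold, e.g. one feature per value.
    return {m: neighborhood_size(ancestor_sizes, m) for m in candidates}


sizes = [1, 3, 12, 40, 150]            # made-up ancestor chain for one endpoint
result = sweep(sizes, [8, 32, 128, 512])
```

Comparing the chosen sizes across thresholds gives a quick sense of how often two candidate values would produce the same neighborhood, and therefore the same feature.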
