Start sharing Concepts across dynamic languages#8307

Closed
hmac wants to merge 20 commits into github:main from hmac:hmac/shared-concepts

@hmac
Contributor

@hmac hmac commented Mar 1, 2022

This PR attempts to move towards the sharing of Concepts.qll across Ruby, Python and JavaScript. To achieve this it adds two main things:

Re-export common libraries at common paths

The classes in Concepts.qll all depend on at least the language's dataflow library. This is currently imported using something like import codeql.ruby.DataFlow or import semmle.javascript.dataflow.DataFlow. In order to share Concepts.qll, we need common paths for these libraries across the languages.

I've chosen codeql as the common prefix for these files, so we have

  • codeql.DataFlow
  • codeql.Concepts
  • codeql.TaintTracking

Not all of these files are needed for the rest of the changes in this PR, but they will allow us to share common queries in the future.

Introduce ConceptsSpecific.qll

This module contains concept classes which are specific to the language. What is left in Concepts.qll are classes that are shared across all three languages. Over time, as we standardise our concepts, the idea is we move classes from ConceptsSpecific to Concepts.

I've started by sharing the following concepts:

  • FileSystemAccess
  • FileSystemReadAccess
  • FileSystemWriteAccess
  • SystemCommandExecution

This required changing a few member predicate names to be consistent across the languages, and using the Range pattern in JS, as we do in Ruby and Python.

This PR is structured to be reviewed commit-by-commit. I've tried to make it clear that nothing has changed when moving everything from Concepts to ConceptsSpecific by renaming the file and re-adding the original.

To do

  • Update changelog

@hmac hmac force-pushed the hmac/shared-concepts branch 2 times, most recently from ab88a4d to 8f6470f Compare March 2, 2022 03:00
  import RouteHandler_getAResponseHeader
  import HeaderDefinition_defines
- import SystemCommandExecution
+ import SystemCommandExecutions
Contributor Author

I changed this because the previous name generated a warning in the test output, that it clashed with the module of the same name in ConceptsSpecific.qll.

/**
* Holds if this expression flows into `sink` in zero or more (possibly
* inter-procedural) steps.
*/
Contributor Author

Not related to this PR, but it was causing a QLDoc test failure.

@hmac
Contributor Author

hmac commented Mar 2, 2022

This may not be the approach we want to take, but I think it's a reasonable one and can help continue the discussion of how we want to share this stuff in practice. I'd be keen to hear your thoughts!

@github/codeql-ruby @github/codeql-javascript @github/codeql-python

@hmac hmac marked this pull request as ready for review March 2, 2022 05:02
@hmac hmac requested review from a team as code owners March 2, 2022 05:02
@asgerf
Contributor

asgerf commented Mar 2, 2022

Splendid work overall! I do have some concerns about breaking changes though.

I've chosen codeql as the common prefix for these files

The ConceptsSpecific.qll file could re-export the relevant libraries, making them available to Concepts.qll. This is closer to how other shared libraries handle this situation. I'd call for consistency here and use the conventional method.

Another point to consider is that the import codeql.DataFlow style affects the public API in a pretty significant way, as it's likely users will start importing these files. It's not the kind of change we can do lightly.

This required changing a few member predicate names to be consistent across the languages, and using the Range pattern in JS, as we do in Ruby and Python.

Some of the refactorings to a ::Range class are breaking changes in the JS libraries. I wish the instanceof syntax had allowed us to migrate gracefully, but no. The only safe method I know of is to rename the class and leave a deprecated alias behind, for example:

/** DEPRECATED. Extend `CommandExecution` or `CommandExecution::Range` instead. */
deprecated class SystemCommandExecution = CommandExecution::Range;

class CommandExecution { .. }
module CommandExecution {
  abstract class Range { ... }
}

It sucks having to arbitrarily rename classes, but it also sucks to break things for our users.

@hmac
Contributor Author

hmac commented Mar 2, 2022

I've chosen codeql as the common prefix for these files

The ConceptsSpecific.qll file could re-export the relevant libraries, making them available to Concepts.qll. This is closer to how other shared libraries handle this situation. I'd call for consistency here and use the conventional method.

That makes sense for this scenario. I wonder, what should we do if we want to share query-related files like this one? Simple files like this are 90% boilerplate and often just depend on DataFlow, RemoteFlowSources and Concepts. The idea behind the codeql. aliases was we could use them here, too. Is there a better alternative we could use instead?

It sucks having to arbitrarily rename classes, but it also sucks to break things for our users.

I agree. Do we have any policy on how long we keep deprecated classes around before completing such a breaking change? It would be sad if we had to keep such things around forever, and surely we can't have so many users that it's literally impossible for them all to migrate? In Ruby I reckon we have around 0, which makes it a bit easier!

@intrigus-lgtm
Contributor

I agree. Do we have any policy on how long we keep deprecated classes around before completing such a breaking change?

https://github.com/github/codeql/blob/main/docs/supported-queries.md#supported-codeql-queries-and-libraries:

Once a query or library has appeared in a stable release, a one-year deprecation period is required before we can remove it. There can be exceptions to this when it's not technically possible to mark it as deprecated.

@hmac
Contributor Author

hmac commented Mar 3, 2022

Assuming we have to rename all the existing classes in the JS Concepts.qll, I propose:

SystemCommandExecution -> CommandExecution
FileSystemAccess -> FileAccess
FileSystemReadAccess -> FileReadAccess
FileSystemWriteAccess -> FileWriteAccess
FileNameSource -> FilenameSource (?)

Then for the remaining three:

DatabaseAccess
PersistentReadAccess
PersistentWriteAccess

I think we should leave these in ConceptsSpecific and consider introducing more fine-grained concepts that better match the Python/Ruby versions, e.g.

SqlConstruction
SqlExecution
CookieReadAccess
CookieWriteAccess
...

This PR will only tackle the first four names, of course.

How does that sound? The Python and Ruby concepts already all use the Range pattern, so I think we don't have to worry about breaking changes there.

@hmac hmac requested a review from a team March 3, 2022 02:55
@hmac hmac removed the request for review from a team March 3, 2022 04:17
@hmac
Contributor Author

hmac commented Mar 3, 2022

(Sorry about the ping there, I think it's because I updated some class names in documentation)

@hmac hmac force-pushed the hmac/shared-concepts branch from d785e08 to 0d8aff3 Compare March 3, 2022 20:14
@RasmusWL
Member

RasmusWL commented Mar 7, 2022

I haven't had a lot of bandwidth to look at this yet, but I am very interested in getting this to work 👍

@hmac hmac force-pushed the hmac/shared-concepts branch from 476f592 to 763d6a1 Compare March 7, 2022 22:47
@hmac
Contributor Author

hmac commented Mar 7, 2022

The ConceptsSpecific.qll file could re-export the relevant libraries, making them available to Concepts.qll. This is closer to how other shared libraries handle this situation. I'd call for consistency here and use the conventional method.

That makes sense for this scenario. I wonder, what should we do if we want to share query-related files like this one? Simple files like this are 90% boilerplate and often just depend on DataFlow, RemoteFlowSources and Concepts. The idea behind the codeql. aliases was we could use them here, too. Is there a better alternative we could use instead?

To revisit this, I realise we can use the exact same pattern, e.g. as done here. So I've removed the codeql.* files and added an Imports module to ConceptsSpecific.qll:

// ruby/ql/lib/codeql/ConceptsSpecific.qll
module Imports {
  import codeql.ruby.DataFlow
}
module Concepts {
  ...
}

// ruby/ql/lib/codeql/Concepts.qll
private import ConceptsSpecific::Imports
import ConceptsSpecific::Concepts

I think this does what we need without polluting the global scope with either unnecessary extra files or extra modules.

@RasmusWL
Member

RasmusWL commented Mar 9, 2022

Thanks so much for taking initiative on this 🔥 💪 🙏

Strategy for sharing concepts

[The ConceptsSpecific.qll] module contains concept classes which are specific to the language. What is left in Concepts.qll are classes that are shared across all three languages. Over time, as we standardise our concepts, the idea is we move classes from ConceptsSpecific to Concepts.

I'm not 100% convinced that this strategy for sharing concepts is the right one. After a bit of standardizing, we'll have a beefy Concepts.qll with many items. But what will then happen when we start adding support for a new language?

No concepts will have any concrete models initially, and I assume that it will take some time to fill in the gaps. This means that potential customers who write QL code will see these concepts being available, but will not be able to use them (since they will produce no results) 😬 Another point is that maybe not all concepts even make sense for a specific language; the most prominent example I can think of is CodeExecution, which might not be a part of compiled languages -- although I guess typically there is some sort of scripting support through libraries, for example for Lua, so this argument might not be that relevant.

When I thought about how to share concepts initially, I thought about another solution: define each concept in a shared file, such as concepts/SqlExecution.qll, and let each language have its own Concepts.qll where it can pick and choose to import the concepts that are relevant/actually have modeling. That could also allow some language-specific additions to a concept:

import concepts.SqlExecution as SharedSqlExecution

class SqlExecution extends SharedSqlExecution::SqlExecution instanceof SqlExecution::Range {
  predicate additionalPredicateOnlyNeededInThisLanguage() {
    super.additionalPredicateOnlyNeededInThisLanguage()
  }
}

module SqlExecution {
  abstract class Range extends SharedSqlExecution::SqlExecution::Range {
    abstract predicate additionalPredicateOnlyNeededInThisLanguage();
  }
}

Although I'm not 100% convinced whether that's a good idea 🤔

Problems with concepts I've seen

  1. You end up wanting to change a concept
    • This happened to me when writing the SSRF query (where I borrowed the concept from Ruby)
  2. You postpone sharing changes back to other languages (looking at myself for those SSRF changes)

I think we need to aim for building a solution that is flexible enough that you will not be blocked for weeks just because you want to change an existing concept slightly, but also a solution that encourages "upstreaming" changes, and not just leaving them in your own version.

PRO/CON comparison

This is an attempt to get my thoughts and feelings down on the two proposed solutions (although there might be others)

Solution 1: Shared Concepts.qll with language specific ConceptsSpecific

  1. PRO: It's very easy to hook into. You subscribe to get an identical version of Concepts.qll, and that's it
  2. CON: Initially you will have concepts without any modeling. This can be confusing
  3. CON: You will need to gain consensus from all participating languages before you can add something to the shared Concepts.qll.
    1. until that is done, new concepts will live in copy-pasted form, and not be properly shared.
    2. the friction from getting approval/consensus might cause people to not add them to Concepts.qll at all 😐
  4. PRO: When something is added to Concepts.qll, multiple languages will have looked at them, and approved their structure/definition.
  5. PRO: concepts will not diverge between languages. If one language makes an addition, all languages will HAVE to get it as well.
    • but this only applies to the shared concepts
  6. CON: it's hard for a single language to extend a concept with a new member-predicate
  7. CON: It will be very hard to share the HTTP module, since for parts of it (route-setup) JS already has a different system. If we share only what we can agree on, we can't really extend the HTTP module in ConceptsSpecific, and would need some non-trivial import mechanism to allow some languages (Python/Ruby) to continue their current HTTP setup.

Solution 2: Individual Concepts.qll with shared definitions in concepts/ directory

  1. PRO: Also very easy to hook into, you just need to add import concepts.SomeNewConcept to the Concepts.qll file.
  2. PRO: Languages will not have concepts exposed without actively choosing to add them.
  3. PRO: Concepts.qll doesn't become thousands of lines long
  4. CON: it's possible to import concepts differently than other languages do (that is, if you only want some HTTP concepts, you might not wrap them in the same ql modules) -- so end users might end up seeing different behavior from different Concepts.qll files 😬
  5. PRO: You can start sharing a Concept between two languages without waiting for all other languages to also add modeling/queries.
    • When one language adds a new concept, the default should be to do it in concepts/MyNewConcept.qll, which makes sharing easy by default.
  6. CON: Concepts only adopted by one or two languages might have design flaws that make them unsuitable for other languages.
    • if a concept is already added to a CodeQL release, we get into 1-year support problem before we can fully deprecate it 😐
  7. PRO: A language can add their own extensions to a concept, and we can do a rolling update of other languages.
  8. CON: A language can add their own extensions to a concept, so concepts can diverge.
  9. PRO: It will be easy to share members of HTTP module in Concepts (bar the problems from (4))

Thoughts on some of the CONS:

  • (4), (6) and (8) are engineering/human challenges, and we might be able to handle these with shared code reviews between languages.
  • Another solution to (6) might be that we require/aspire to new queries being added to all languages, greatly improving collaboration between languages.
  • For (6) we can't just mark all our concepts as INTERNAL: Do not use. since that would be the common API we would want our end-users to write queries against.

Conclusion

I think overall I'm slightly favoring Individual Concepts.qll with shared definitions in concepts/ directory but I'm very open to discussing this further 👍

Sharing query implementations

That makes sense for this scenario. I wonder, what should we do if we want to share query-related files like this one? Simple files like this are 90% boilerplate and often just depend on DataFlow, RemoteFlowSources and Concepts. The idea behind the codeql. aliases was we could use them here, too. Is there a better alternative we could use instead?

We could share the <query>Customizations.qll file directly, all it needs is access to Concepts and RemoteFlowSources -- both are in language-specific places though, so we might end up making an artificial <query>CustomizationsLanguageSpecific.qll file that each language has to supply itself. That's at least one solution. (but I guess from your most recent comment that you have figured out this strategy yourself as well 😄)

Potentially we will just have a single `` in <lang>/ql/lib/security/dataflow/

hmac added 5 commits March 11, 2022 16:58
This prevents a circular dependency when we override shared concepts,
which leads to ambiguous name resolution errors.
Update a member predicate of SystemCommandExecution to match the naming
in the JS version.
Also add a member predicate `isShellInterpreted`, and rename
`getCommand` to `getACommandArgument`. This brings it in line with the
JS version.
@hmac hmac force-pushed the hmac/shared-concepts branch from 763d6a1 to 4f96871 Compare March 13, 2022 21:04
@hmac
Contributor Author

hmac commented Mar 13, 2022

@RasmusWL @esbena I've pushed up a slightly messy WIP of the newer design that we talked about. Feel free to modify as you need!

@esbena
Contributor

esbena commented Mar 14, 2022

@asgerf, to keep you in the loop:

@RasmusWL, @hmac and I had a synchronous discussion about the architecture, which has since been implemented by @hmac. The key difference from what you reviewed before is that the language-specific Concepts.qll now explicitly re-exports the classes of ConceptsShared.qll through aliases, optionally enhancing the classes with additional language-specific predicates, see https://github.com/github/codeql/pull/8307/files#diff-52570c25024d3303a382e02a4f8114bd137c2a579a04954efb282813042d9f95R34 as an example.

Moving forward, Concepts.qll will be the staging ground for brand new concepts from each language, but we will maintain a discipline of moving those concepts to ConceptsShared.qll ASAP.
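
As a rough illustration of that layering (a sketch under stated assumptions, not the code from this PR): the `Shared` module below stands in for ConceptsShared.qll so that everything fits in one file, and the declarations after it stand in for a language-specific Concepts.qll. The predicate names `getAPathArgument` and `isShellInterpreted` are used for illustration only, and the Ruby DataFlow import is just one possible choice of language.

```ql
// Sketch only: in the real layout these two layers live in separate files.
import codeql.ruby.DataFlow

/** Stand-in for ConceptsShared.qll (inlined here as an assumption). */
module Shared {
  /** A data-flow node that accesses the file system. */
  class FileSystemAccess extends DataFlow::Node instanceof FileSystemAccess::Range {
    /** Gets an argument that is interpreted as a path. */
    DataFlow::Node getAPathArgument() { result = super.getAPathArgument() }
  }

  module FileSystemAccess {
    abstract class Range extends DataFlow::Node {
      abstract DataFlow::Node getAPathArgument();
    }
  }

  /** A data-flow node that executes a system command. */
  class SystemCommandExecution extends DataFlow::Node instanceof SystemCommandExecution::Range {
    /** Gets an argument that specifies the command to be executed. */
    DataFlow::Node getACommandArgument() { result = super.getACommandArgument() }
  }

  module SystemCommandExecution {
    abstract class Range extends DataFlow::Node {
      abstract DataFlow::Node getACommandArgument();
    }
  }
}

// --- Stand-in for the language-specific Concepts.qll ---

/** Re-exported unchanged through aliases. */
class FileSystemAccess = Shared::FileSystemAccess;

module FileSystemAccess = Shared::FileSystemAccess;

/** Re-exported with one extra, language-specific predicate (hypothetical). */
class SystemCommandExecution extends Shared::SystemCommandExecution instanceof SystemCommandExecution::Range
{
  /** Holds if `arg` is interpreted by a shell. */
  predicate isShellInterpreted(DataFlow::Node arg) { super.isShellInterpreted(arg) }
}

module SystemCommandExecution {
  abstract class Range extends Shared::SystemCommandExecution::Range {
    abstract predicate isShellInterpreted(DataFlow::Node arg);
  }
}
```

In this shape, models written against the language-specific `Range` also count as instances of the shared concept, while a concept with no language-specific additions needs only the two alias lines.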

Before we merge, we should add some architecture documentation to the top of ConceptsShared.qll.

Member

@RasmusWL RasmusWL left a comment

With the current way the changes are introduced in this PR, we would be breaking our deprecation policy 😳

Comment on lines +513 to +515
"python/ql/lib/semmle/python/ConceptsShared.qll",
"ruby/ql/lib/codeql/ruby/ConceptsShared.qll",
"javascript/ql/lib/semmle/javascript/ConceptsShared.qll"
Member

Can we move this to an internal location, such as "python/ql/lib/semmle/python/internal/ConceptsShared.qll"? -- then it's obvious that end-users and query writers should not use this file directly

We can also move ConceptsImports to that location.

Comment on lines +137 to +138
/** DEPRECATED: use `CommandExecution::Range` instead. */
deprecated class SystemCommandExecution = CommandExecution::Range;
Member

Can we move the deprecated aliases to the language specific Concepts.qll files instead? Then we don't introduce deprecated aliases for new languages that adopt ConceptsShared 😉


  /** Gets the argument that specifies the command to be executed. */
- DataFlow::Node getCommand() { result = range.getCommand() }
+ DataFlow::Node getACommandArgument() { result = super.getACommandArgument() }
Member

This rename is not ok in itself, since we need to follow standard deprecation policy for getCommand ...
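
One way the rename could stay within the policy, sketched under assumptions rather than taken from this PR: keep the old predicate on the public class as a deprecated forwarder to the new one. The Ruby DataFlow import is only there to make the fragment self-contained; the same shape would apply to the other languages.

```ql
import codeql.ruby.DataFlow

/** A data-flow node that executes a system command. */
class SystemCommandExecution extends DataFlow::Node instanceof SystemCommandExecution::Range {
  /** Gets an argument that specifies the command to be executed. */
  DataFlow::Node getACommandArgument() { result = super.getACommandArgument() }

  /** DEPRECATED: Use `getACommandArgument` instead. */
  deprecated DataFlow::Node getCommand() { result = this.getACommandArgument() }
}

module SystemCommandExecution {
  abstract class Range extends DataFlow::Node {
    /** Gets an argument that specifies the command to be executed. */
    abstract DataFlow::Node getACommandArgument();
  }
}
```

This only covers callers of `getCommand`. Existing `Range` subclasses that implement `getCommand` would still break when the abstract predicate is renamed, so that side needs its own migration path.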

  abstract class Range extends DataFlow::Node {
    /** Gets the argument that specifies the command to be executed. */
-   abstract DataFlow::Node getCommand();
+   abstract DataFlow::Node getACommandArgument();
Member

same goes for this rename

   * for instance by spawning a new process.
   */
- abstract class SystemCommandExecution extends DataFlow::Node {
+ class SystemCommandExecution extends DataFlow::Node instanceof SystemCommandExecution::Range {
Member

this change in itself is also not compatible with deprecation policy 😞

@RasmusWL
Member

I tried to rewrite the commits so we would follow deprecation policy (like I pointed out in review above). That spurred me to look closer at the suggested changes. I specifically looked at CommandExecution, and how it was used in the 3 languages, where I found some inconsistencies. Overall I'm not sure we want the shared concept to contain as many member-predicates initially (at least not without reworking the queries in both Python and Ruby).

When we have come to an agreement on how we want to share concepts, as highlighted by @esbena here, I would propose we split this PR up into:

  1. One PR that sets up the scaffolding, with an empty ConceptsShared.qll file that is properly shared.
  2. One PR for each set of concepts we want to share, so we can discuss what the new concept should look like, and align on how the query is implemented.
    • one for CommandExecution
    • one for FileAccess

If you're interested in the details of my investigation, see details below.

Details

investigation of how command execution concept is used

Currently, on main:

Ruby

rb/command-line-injection

class SystemCommandExecutionSink extends Sink {
  SystemCommandExecutionSink() { 
    exists(SystemCommandExecution c | c.isShellInterpreted(this)) 
  }
}

Python

py/command-line-injection

class CommandExecutionAsSink extends Sink {
  CommandExecutionAsSink() {
    this = any(SystemCommandExecution e).getCommand()
  }
}

JS

JS has multiple queries around command injection

JS also has some more advanced logic, implemented in isIndirectCommandArgument that handles things like finding <cmd> in sh -c <cmd>.

isIndirectCommandArgument is not used to define a Sink subclass, but is used in the isSink predicate (with slight indirection through an isSinkWithHighlight predicate)

js/command-line-injection:

  predicate isSinkWithHighlight(DataFlow::Node sink, DataFlow::Node highlight) {
    sink instanceof Sink and highlight = sink
    or
    isIndirectCommandArgument(sink, highlight)
  }

...

  class SystemCommandExecutionSink extends Sink, DataFlow::ValueNode {
    SystemCommandExecutionSink() { this = any(SystemCommandExecution sys).getACommandArgument() }
  }

js/indirect-command-line-injection:

  predicate isSinkWithHighlight(DataFlow::Node sink, DataFlow::Node highlight) {
    sink instanceof Sink and highlight = sink
    or
    isIndirectCommandArgument(sink, highlight)
  }
  
 ...
 
   private class SystemCommandExecutionSink extends Sink, DataFlow::ValueNode {
    SystemCommandExecutionSink() { this = any(SystemCommandExecution sys).getACommandArgument() }
  }

js/shell-command-injection-from-environment:

  predicate isSinkWithHighlight(DataFlow::Node sink, DataFlow::Node highlight) {
    sink instanceof Sink and highlight = sink
    or
    isIndirectCommandArgument(sink, highlight)
  }
  
 ...
 
   class ShellCommandSink extends Sink, DataFlow::ValueNode {
    ShellCommandSink() { any(SystemCommandExecution sys).isShellInterpreted(this) }
  }

js/shell-command-constructed-from-input:

Notice that this is about library input (@name Unsafe shell command constructed from library input)

This one has a slightly more complex implementation, but most of the sinks boil down to using the isExecutedAsShellCommand type-back-tracker, which has this core implementation bit:

t.start() and result = sys.getACommandArgument() and sys.isShellInterpreted(result)
or
t.start() and isIndirectCommandArgument(result, sys)

there is also a ShellTrueCommandExecutionSink sink, that uses a type-back-tracker that looks at getArgumentList directly.

  private DataFlow::SourceNode endsInShellExecutedArray(
    DataFlow::TypeBackTracker t, SystemCommandExecution sys
  ) {
    t.start() and
    result = sys.getArgumentList().getALocalSource() and
    // the array gets joined to a string when `shell` is set to true.
    sys.getOptionsArg()
        .getALocalSource()
        .getAPropertyWrite("shell")
        .getRhs()
        .asExpr()
        .(BooleanLiteral)
        .getValue() = "true"

Conclusion

To gain alignment, it seems like Ruby should use getACommandArgument like JS/Python.

If we want getArgumentList to be part of the shared concept, both Ruby and Python should implement isIndirectCommandArgument

With the current queries in Python/Ruby, it looks like isShellInterpreted should just be a language-specific part in JS.

@esbena
Contributor

esbena commented Mar 17, 2022

👍🏻 to the split of this PR.
The discussions about the language differences for two concepts should not delay the rest.
Still, I think this PR was the right thing to create as it motivated the concrete discussion of what the end result should be.

@RasmusWL
Member

Still, I think this PR was the right thing to create as it motivated the concrete discussion of what the end result should be.

agreed 👌

@hmac
Contributor Author

hmac commented Mar 23, 2022

Superseded by #8476

@hmac hmac closed this Mar 23, 2022