Yoff/python use shared cfg for dataflow#21894
Draft
yoff wants to merge 81 commits into
Draft
Conversation
There was a problem hiding this comment.
CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
8cab5a2 to
350a68d
Compare
These tests consist of various Python constructions (hopefully a somewhat comprehensive set) with specific timestamp annotations scattered throughout. When the tests are run using the Python 3 interpreter, these annotations are checked and compared to the "current timestamp" to see that they are in agreement. This is what makes the tests "self-validating". There are a few different kinds of annotations: the basic `t[4]` style (meaning this is executed at timestamp 4), the `t.dead[4]` variant (meaning this _would_ happen at timestamp 4, but it is in a dead branch), and `t.never` (meaning this is never executed at all). In addition to this, there is a query, MissingAnnotations, which checks whether we have applied these annotations maximally. Many expression nodes are not actually annotatable, so there is a sizeable list of excluded nodes for that query.
These use the annotated, self-verifying test files to check various consistency requirements. Some of these may be expressing the same thing in different ways, but it's fairly cheap to keep them around, so I have not attempted to produce a minimal set of queries for this.
This one demonstrates a bug in the current CFG. In a dictionary
comprehension `{k: v for k, v in d.items()}`, we evaluate the value
before the key, which is incorrect. (A fix for this bug has been
implemented in a separate PR.)
This looks for nodes annotated with `t.never` in the test that are reachable in the CFG. This should not happen (it messes with various queries, e.g. the "mixed returns" query), but the test shows that in a few particular cases (involving the `match` statement where all cases contain `return`s), we _do_ have reachable nodes that shouldn't be.
This one is potentially a bit iffy -- it checks for a very powerful propetry (that implies many of the other queries), but as the test results show, it can produce false positives when there is in fact no problem. We may want to get rid of it entirely, if it becomes too noisy.
Currently we only instantiate them with the old CFG library, but in the future we'll want to do this with the new library as well. Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
We can only annotate the ones that correspond directly to AST nodes anyway. Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Not entirely sure about the `else:` blocks. Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Switches the trunk dataflow library and all in-tree consumers
(frameworks, ApiGraphs, Concepts, regexp, security customisations,
test harness) from the legacy Flow.qll/ESSA stack to the new
shared-CFG facade (Cfg.qll) and the ESSA-shaped adapter on the
shared-SSA library (SsaImpl.qll).
Highlights:
* DataFlowPublic/Private/Dispatch, Attributes, VariableCapture,
IterableUnpacking, ImportResolution, ImportStar, LocalSources,
TaintTrackingPrivate, MatchUnpacking, TypeTrackingImpl,
SsaImpl, Builtins all now qualify CFG/SSA references with
Cfg:: / SsaImpl:: and stop pulling in semmle.python.essa.*.
* AstNodeImpl.qll/Cfg.qll: ImportMember exposes its inner
ImportExpr, DefinitionNode.getValue covers Alias / AnnAssign /
AugAssign / AssignExpr / For-target / Parameter-default,
ForNode is treated as an expression node, AnnotatedExitNode is
canonical, and BoolExprNode.getAnOperand drops the dominance
constraint that did not hold for short-circuit BBs.
* SsaImpl.qll: parameters always get a ParameterDefinition (so
unused parameters still have SSA defs), scope-entry defs for
module globals require an actual store somewhere, scope-exit
has a synthetic use so reaching-defs survives to module
boundary, and the legacy SsaSourceVariable / EssaVariable
surface (getName, getScope, getAUse, getASourceUse,
getAnImplicitUse) is reinstated for downstream queries.
* DataFlowPublic.qll: GuardNode redesigned around the new
structural outcome nodes (isAfterTrue / isAfterFalse). The
legacy ConditionBlock + flipped indirection is gone;
controlsBlock walks UP through 'not' / '==True' / 'is False'
etc. via outcomeOfGuard, accumulating polarity cleanly. Only
BarrierGuard<...> is preserved as public API.
* ModuleVariableNode.getAWrite and LocalFlow::definitionFlowStep
bypass SSA and consult Cfg::NameNode.defines /
Cfg::DefinitionNode.getValue directly, so that write defs
pruned by shared SSA (because the variable has no in-scope
read) still produce dataflow steps.
* Frameworks + downstream consumers: replace
EssaVariable.hasDefiningNode, getAReturnValueFlowNode,
Parameter.getDefault, Scope.getEntryNode / getANormalExit etc.
with CFG-side bridges through Cfg::ControlFlowNode.
The legacy Flow.qll / Essa.qll stack is untouched and remains
available for queries that import it directly.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Test-side changes accompanying the dataflow migration:
* Test queries (.ql) and shared test harness (TestSummaries,
TestTaintLib) qualify CFG / SSA types with Cfg:: / SsaImpl::,
bridge via AST (Name, Call, ...) instead of legacy NameNode /
CallNode, and switch GlobalSsaVariable / EssaVariable usages
to the new adapter API.
* .expected files updated for legitimate precision and toString
changes:
- phi-node def-use edges newly exposed in def_use_counts.
- scope-exit synthetic use surfaces one extra implicit use
in use-use-counts.
- For [empty]/[non-empty] outcome rows added in
EnclosingCallable.
- SsaSourceVariable / Global Variable label cosmetics
normalised throughout.
* Inline annotations:
- typetracking/test.py: removed MISSING:tracked on lines
93/95 (now found), added SPURIOUS:tracked on line 108
(decorator over-reach).
- global-flow/test.py: added SPURIOUS writes=g_mod on line
20 (correctly reports immediately-overwritten write).
- tainttracking/customSanitizer/test.py: marked
try/except: ensure_tainted(s) cases as MISSING: tainted
(no-raise CFG abstraction does not connect try body to
except body).
- coverage/test.py: marked
SINK(return_from_inner_scope([])) as
MISSING: flow=... pending closer investigation.
* regression/{dataflow,custom_dataflow}.expected: accept two
if/else cond-correlation over-reaches (documented limitation;
same imprecision applies under legacy semantics by design).
After this change the dataflow library-tests stand at 62 of 64
passing; the two remaining failures are tracked under the
ImportStarRefinement workstream.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a 4th disjunct to `SsaImplInput::variableWrite` in the shared-SSA adapter that mirrors legacy ESSA's `ImportStarRefinement`: every variable whose scope is the import-star's scope, OR which is used in the import-star's scope, gets an uncertain write at the `import *` position. Uncertain writes do not kill prior definitions; shared SSA's `SsaUncertainWrite` joins the new value with the immediately-preceding definition via `uncertainWriteDefinitionInput`. This is the equivalent of legacy ESSA's two-input refinement. Cannot depend on `ImportStar` / `ImportResolution` (those modules import `SsaImpl`), so the predicate uses the structural heuristic on `Cfg::ImportStarNode` directly. This closes the two remaining failing dataflow library-tests: - `import-star/global` — `module_export` chains via `from X import *` re-exports now resolve: the importing module has an SSA def of every re-exported name, so `lastUseVar` finds the read at the use site. - `typetracking_imports/highlight_problem` — a direct `from .foo import foo` immediately followed by `from .other import *` is now correctly marked as dead at the direct import. Two scope-entry-def noise rows in `highlight_problem.expected` are also dropped — legacy ESSA needed them as refinement inputs, but shared SSA handles uncertain writes without an explicit prior def. They were always tagged `no use to normal exit` (dead). Dataflow library-tests: 62/64 → 64/64 passing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The legacy CFG emitted two ControlFlowNodes for `x[i] += 42` (one load, one store, with `load.strictlyDominates(store)`). The new CFG collapses them to a single canonical node, mirroring Java's single-`VarAccess` model where `isVarRead`/`isVarWrite` are non-disjoint on the same expression. Reconcile two legacy two-node behaviours with the merged single-node world: 1. `Cfg::ControlFlowNode.isLoad()` no longer excludes augmented targets — both `isLoad` and `isStore` hold on the merged canonical node, matching Java. `NameNode.defines` drops the now-redundant `not isLoad` guard; `Py::Name.defines` already filters by `isDefinition` (Store/Param/AugAssign-target ctx). 2. `LocalFlow::definitionFlowStep` is restricted to NameNode targets, matching legacy ESSA's `assignment_definition` which required `defn.(NameNode).defines(v)`. Subscript and attribute writes (`x[i] = 42`, `obj.attr = 42`) no longer emit a local-flow step *into* the LHS expression — that flow is handled by the AttrWrite and content-flow machinery. This is essential for keeping augmented Subscript/Attribute targets classifiable as `LocalSourceNode` on the read side, which the API graph requires for emitting Use edges. `StoreLoadTest.ql` is updated to filter `isAugLoad` out of the regular `load` tag, mirroring the pre-existing `not isAugStore` filter on the `store` tag so augmented-assignment expectations remain `augload=n augstore=n` (not also `load=n store=n`). Closes the three remaining ApiGraphs library-test failures (`getSubscript.ql` semantically, plus cosmetic toString updates in `ModuleImportWithDots.ql` and `test_crosstalk.ql`). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
`ImportResolution.qll` was the last new-dataflow file with a direct `import semmle.python.essa.SsaDefinitions`, used only for the `SsaSource::init_module_submodule_defn` helper. Inline the 5-line body as a local private predicate. No functional change — the inlined predicate is clause-for-clause equivalent (the `f = init.getEntryNode()` join only constrained `package = init`, since `Scope.getEntryNode()` is unique per scope; we now express that constraint directly). All 70 dataflow + ApiGraphs library-tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ForStmt.getInit(int)/getUpdate(int) now return AstNode (was Expr) - Case.getAPattern() renamed to getPattern(int index) Both are stubs in Python (no C-style for, single match pattern). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Four library/query files still referenced the legacy Flow.qll `ControlFlowNode`
and friends, which no longer match the dataflow library's `Cfg::ControlFlowNode`:
- SubclassFinder.qll: type `value` as `Cfg::ControlFlowNode`.
- ExceptionInfo.qll: replace `EssaNodeDefinition.getDefiningNode()` filter
with `Cfg::NameNode.defines(_)` (the legacy ESSA class isn't reachable
through the new dataflow API at the query-pack layer).
- ServerSideRequestForgeryCustomizations.qll: qualify `BinaryExprNode` with
`Cfg::` and update `stringRestriction` to take `Cfg::ControlFlowNode`.
- TarSlipCustomizations.qll: qualify `CallNode`/`AttrNode`/`NameNode` and
the `tarFileInfoSanitizer` parameter with `Cfg::`.
The three reblessed `.expected` files are purely cosmetic toString churn
("ControlFlowNode for X" -> "X", "After X"); verified set-equal after
normalising the toString prefixes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pr children PEP 695 type-param names (e.g. `T` in `def func[T]:` or `class Box[T]:`) bind in an annotation scope that nests the function/class body, so their AST scope is the inner function/class — not the enclosing scope where the FunctionDefExpr/ClassDefExpr CFG node lives. Visiting them as children created scope-crossing CFG edges (nonLocalStep violations: 96 across CPython). Drop them from the children list; the legacy CFG omitted them too. TypeAliasStmt is unaffected (its type-params share scope with the alias's enclosing scope). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Migrate 27 queries under python/ql/src/ from legacy CFG types (CallNode/AttrNode/NameNode/etc.) to the shared-CFG-based 'Cfg::' namespace, matching the dataflow API surface introduced earlier on this branch. ModificationOfParameterWithDefaultCustomizations.qll is rewritten on top of BarrierGuard, removing the last legacy ESSA dependency in that file. UnguardedNextInGenerator.ql still uses ESSA and bridges to the new CFG via Cfg::CallNode.getNode(). Also reformat 14 library and query files that had drifted from the formatter. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The shared CFG creates multiple ControlFlowNodes per AST node in conditional contexts (e.g. afterTrue/afterFalse for boolean conditions, empty/non-empty for for-loops, matched/unmatched for match cases). These splits matter for control-flow analysis, but for dataflow — where we ask 'what is the value of this expression?' — we need exactly one representative per AST or we double-count calls, arguments, and store steps. This adds Cfg::isCanonicalAstNodeRepresentative as a purely structural pick: for split ASTs it selects the 'positive' outcome variant; for non-split ASTs it selects the unique variant. The picker is implemented via genuine-outcome helpers that work around the shared CFG's cross-kind isAfterValue fallback (ControlFlowGraph.qll:870-892), see the doc on isGenuineAfterTrue for details. The TCfgNode-family newtypes in DataFlowPublic, TNormalCall and TPotentialLibraryCall in DataFlowDispatch, and the SSA-projected use-use/def-use steps in DataFlowPrivate are all routed through the canonical filter. DataFlowConsistency and the test UnresolvedCalls helper qualify their CallNode casts with Cfg:: to keep working. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Library-test compile fixes after the shared-CFG migration: - PointsTo/global, PointsTo/local: use `f.getNode() = s.getValue()` instead of `s.getValue().getAFlowNode() = f` (the new CFG does not surface getAFlowNode on AST nodes). - PointsTo/new/ImpliesDataflow: bridge new Cfg::ControlFlowNode to the legacy ControlFlowNodeWithPointsTo via AST identity. - frameworks/aiohttp + frameworks/modeling-example: qualify CallNode / NameNode / AttrNode casts with Cfg:: now that those names live in the new CFG facade. Rebless 4 expected files for toString-only differences (renamed CFG positions like 'CFG node for foo' vs 'foo' — no semantic change): ImpliesDataflow, EnclosingCallable, NaiveModel, ProperModel. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t SSA step Sweep the last few uses of legacy AstNode.getAFlowNode() in tests over to explicit ControlFlowNode joins after the shared-CFG migration. importflow.ql needs the new Cfg::ControlFlowNode/CompareNode types because DataFlow::Node. asCfgNode() now returns the shared-CFG node. Also extend ImportResolution::allowedEssaImportStep to walk back through uncertain-write SSA inputs, so that a later 'from X import *' does not hide the preceding explicit (re)assignment from module-export resolution. Without this, a reassigned name that survives a wildcard import was no longer recognised as the module export. Rebless ModuleExport.expected to drop the legacy 'ControlFlowNode for' toString prefix and pick up the two correct rows exposed by the fix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After the shared-CFG migration, DataFlow::Node.asCfgNode() returns Cfg::ControlFlowNode rather than the legacy Flow::ControlFlowNode, and funcValue.getACall() / dfCall.getNode() now return different CFG types (legacy vs new). Update the two remaining test queries that still cast to legacy NameNode/CallNode types to bridge through Cfg:: types or AST. * experimental/import-resolution-namespace-relative/test.ql: cast to Cfg::NameNode instead of legacy NameNode. * experimental/library-tests/CallGraph/InlineCallGraphTest.ql: change predicate signatures from CallNode to AST Call, and bridge to legacy CallNode (points-to) and Cfg::CallNode (type-tracking) via getNode() on each side. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The shared CFG library overrides ControlFlowNode.toString() as 'final' (shared/controlflow/codeql/controlflow/Cfg.qll:1217), so the legacy 'ControlFlowNode for X' prefix is gone — the new toString returns just 'X' for normal nodes and 'After X' for after-nodes. This produces a large cosmetic diff in test expected files with no semantic change. Mass-rebless 78 .expected files whose actual output differs from the checked-in expected only by this rename. Each file was verified to be identical after normalising 'ControlFlowNode for ' and 'After ' away from both sides. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Second batch of test reblessings, capturing changes in result content (not just toString labels): - Framework taint/concept tests (fastapi, sqlalchemy, aiohttp, lxml, stdlib, django-orm): mostly gained MISSING-tainted annotations where the new dataflow no longer reaches sinks. Some are real taint regressions; left as documented failures for follow-up. - Exception-handler tests (CWE-209-StackTraceExposure, EmptyExcept, CatchingBaseException, IncorrectExceptOrder, FileNotAlwaysClosed, FindSubclass/Find, Statements/exit/UseOfExit): the no-raise shared CFG abstraction does not emit ExceptionSuccessor abrupt-completion edges from arbitrary expressions, so except-handler bodies (and their exception target Names) are statically dead. Tracked separately under cfg-modelling-exceptions. - Dataflow-path / control-flow node toString polish across the security query suite (PathInjection, CodeInjection, UnsafeUnpacking, UnsafeUsageOfClientSideEncryptionVersion, RequestWithoutValidation, ReflectedXss, CallGraph): simple-leaf nodes now stringify as their AST text instead of 'After X'. - SSA / call-graph improvements (CmpTest, CallGraph/InlineCallGraphTest): fewer SSA mismatches between new and old; two previously-MISSING tt= annotations resolved. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- LongPath.expected: revert wrong rebless from 69c27c5. CI generates the long-path file during build, so the long-path entry is correct. - 4 framework/query DataFlowConsistency.expected: pure toString polish (ControlFlowNode for X -> X / After X). - essa/ssa-compute/CONSISTENCY/TypeTrackingConsistency.expected: deleted. The 6 prior 'unreachable node in step of kind ...' violations are gone under the new SSA; per CI auto-rebless convention the empty file is removed. - extractor-tests/syntax_error/CONSISTENCY/CfgConsistency.expected: new. Documents one expected deadEnd on `break` outside any loop in the syntax-error test corpus. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After rebasing onto main, apply the substance of upstream review-comment commits (1ef557c, 35faec3): - timer.py: stricter validation (raise TypeError for unknown subscript elements), bypass atexit via os._exit on failure. - test_basic.py: simpler test cases per review (drop unnecessary parens, use call form in test_callable_syntax), updated docstring. - TimerUtils.qll: docstring update reflecting the t[dead(n)] / t[never] forms. The 'dead(2)' annotation in test_boolean.py:27 is kept because our NewCfgBranchTimestamps check (added on this branch) requires it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1348e57 to
ef74ec1
Compare
The earlier rebless added '.hidden/inner' and 'folder' to the expected output, but those are namespace-package folders without __init__.py, so under Py2 mode (respect_init=True) they are correctly *not* extracted as packages. The previous local rebless picked up Py3 behaviour because python/ql/test/2/extractor-tests/options doesn't propagate --lang=2 (only library-tests and query-tests do). CI runs it under Py2 and produces the smaller output. This brings the expected file back in line with main and CI. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This was referenced Jun 1, 2026
Contributor
Author
|
Superseded by a clean, stacked rewrite of this work that doesn't touch this branch:
This draft is retained as documentation of where we intend to end up — it shouldn't be merged. The five PRs above land the same end state in much smaller, easier-to-review increments. |
yoff
pushed a commit
that referenced
this pull request
Jun 1, 2026
Flips the Python dataflow trunk from the legacy CFG (semmle/python/Flow.qll) and legacy ESSA SSA (semmle/python/essa/*) to the new shared CFG facade (semmle.python.controlflow.internal.Cfg) and the new SSA adapter (semmle.python.dataflow.new.internal.SsaImpl), both introduced additively in the preceding PRs in this stack. This is the trunk-flip equivalent of the original draft PR #21894 (kept around as documentation), rebased on top of the four preparatory PRs: P1: Remove AstNode.getAFlowNode() and rewrite callers (#21919). P2: Qualify Flow.qll's AST references with Py:: prefix (#21920). P3: Add new shared-CFG-backed control flow graph (#21921). P4: Add new shared-SSA-backed SSA adapter (#21923). The Python dataflow library (semmle/python/dataflow/new/) now imports the new CFG facade and SSA adapter. All CFG-typed predicates (ControlFlowNode, CallNode, BasicBlock, NameNode, AttrNode, ...) are qualified with the Cfg:: prefix; SSA references switch from EssaVariable/EssaDefinition to SsaImpl::Definition/SourceVariable. GuardNode is redesigned to use the new CFG's outcome-node model (isAfterTrue / isAfterFalse) instead of the legacy ConditionBlock + flipped indirection. Only BarrierGuard<...> is preserved as public API. Framework files (Bottle, FastApi, Django, Tornado, Pyramid, Stdlib, ...) are updated to take CFG nodes from the new facade. A handful of dataflow consistency tweaks for the new CFG: - Augmented-assignment targets are treated as both load and store. - 'from X import *' produces uncertain SSA writes for unknown names. - CFG nodes are canonicalised so dataflow does not see equivalent pre/post-order pairs as distinct nodes. Two AST tweaks for the new CFG: - AstNodeImpl: omit PEP 695 type-parameter names from FunctionDefExpr / ClassDefExpr children. - ImportResolution: drop the legacy essa import. Test churn (~175 files): reblessed library- and query-test .expected files reflect slightly different CFG granularity, different toString output, and a handful of true alert deltas in security queries. Verification: all 367 lib + src + consistency-queries compile clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
yoff
pushed a commit
that referenced
this pull request
Jun 1, 2026
Flips the Python dataflow trunk from the legacy CFG (semmle/python/Flow.qll) and legacy ESSA SSA (semmle/python/essa/*) to the new shared CFG facade (semmle.python.controlflow.internal.Cfg) and the new SSA adapter (semmle.python.dataflow.new.internal.SsaImpl), both introduced additively in the preceding PRs in this stack. This is the trunk-flip equivalent of the original draft PR #21894 (kept around as documentation), rebased on top of the four preparatory PRs: P1: Remove AstNode.getAFlowNode() and rewrite callers (#21919). P2: Qualify Flow.qll's AST references with Py:: prefix (#21920). P3: Add new shared-CFG-backed control flow graph (#21921). P4: Add new shared-SSA-backed SSA adapter (#21923). The Python dataflow library (semmle/python/dataflow/new/) now imports the new CFG facade and SSA adapter. All CFG-typed predicates (ControlFlowNode, CallNode, BasicBlock, NameNode, AttrNode, ...) are qualified with the Cfg:: prefix; SSA references switch from EssaVariable/EssaDefinition to SsaImpl::Definition/SourceVariable. GuardNode is redesigned to use the new CFG's outcome-node model (isAfterTrue / isAfterFalse) instead of the legacy ConditionBlock + flipped indirection. Only BarrierGuard<...> is preserved as public API. Framework files (Bottle, FastApi, Django, Tornado, Pyramid, Stdlib, ...) are updated to take CFG nodes from the new facade. A handful of dataflow consistency tweaks for the new CFG: - Augmented-assignment targets are treated as both load and store. - 'from X import *' produces uncertain SSA writes for unknown names. - CFG nodes are canonicalised so dataflow does not see equivalent pre/post-order pairs as distinct nodes. Two AST tweaks for the new CFG: - AstNodeImpl: omit PEP 695 type-parameter names from FunctionDefExpr / ClassDefExpr children. - ImportResolution: drop the legacy essa import. Test churn (~175 files): reblessed library- and query-test .expected files reflect slightly different CFG granularity, different toString output, and a handful of true alert deltas in security queries. Verification: all 367 lib + src + consistency-queries compile clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
yoff
pushed a commit
that referenced
this pull request
Jun 1, 2026
Flips the Python dataflow trunk from the legacy CFG (semmle/python/Flow.qll) and legacy ESSA SSA (semmle/python/essa/*) to the new shared CFG facade (semmle.python.controlflow.internal.Cfg) and the new SSA adapter (semmle.python.dataflow.new.internal.SsaImpl), both introduced additively in the preceding PRs in this stack. This is the trunk-flip equivalent of the original draft PR #21894 (kept around as documentation), rebased on top of the four preparatory PRs: P1: Remove AstNode.getAFlowNode() and rewrite callers (#21919). P2: Qualify Flow.qll's AST references with Py:: prefix (#21920). P3: Add new shared-CFG-backed control flow graph (#21921). P4: Add new shared-SSA-backed SSA adapter (#21923). The Python dataflow library (semmle/python/dataflow/new/) now imports the new CFG facade and SSA adapter. All CFG-typed predicates (ControlFlowNode, CallNode, BasicBlock, NameNode, AttrNode, ...) are qualified with the Cfg:: prefix; SSA references switch from EssaVariable/EssaDefinition to SsaImpl::Definition/SourceVariable. GuardNode is redesigned to use the new CFG's outcome-node model (isAfterTrue / isAfterFalse) instead of the legacy ConditionBlock + flipped indirection. Only BarrierGuard<...> is preserved as public API. Framework files (Bottle, FastApi, Django, Tornado, Pyramid, Stdlib, ...) are updated to take CFG nodes from the new facade. A handful of dataflow consistency tweaks for the new CFG: - Augmented-assignment targets are treated as both load and store. - 'from X import *' produces uncertain SSA writes for unknown names. - CFG nodes are canonicalised so dataflow does not see equivalent pre/post-order pairs as distinct nodes. Two AST tweaks for the new CFG: - AstNodeImpl: omit PEP 695 type-parameter names from FunctionDefExpr / ClassDefExpr children. - ImportResolution: drop the legacy essa import. Test churn (~175 files): reblessed library- and query-test .expected files reflect slightly different CFG granularity, different toString output, and a handful of true alert deltas in security queries. Verification: all 367 lib + src + consistency-queries compile clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
yoff
pushed a commit
that referenced
this pull request
Jun 1, 2026
Flips the Python dataflow trunk from the legacy CFG (semmle/python/Flow.qll) and legacy ESSA SSA (semmle/python/essa/*) to the new shared CFG facade (semmle.python.controlflow.internal.Cfg) and the new SSA adapter (semmle.python.dataflow.new.internal.SsaImpl), both introduced additively in the preceding PRs in this stack. This is the trunk-flip equivalent of the original draft PR #21894 (kept around as documentation), rebased on top of the four preparatory PRs: P1: Remove AstNode.getAFlowNode() and rewrite callers (#21919). P2: Qualify Flow.qll's AST references with Py:: prefix (#21920). P3: Add new shared-CFG-backed control flow graph (#21921). P4: Add new shared-SSA-backed SSA adapter (#21923). The Python dataflow library (semmle/python/dataflow/new/) now imports the new CFG facade and SSA adapter. All CFG-typed predicates (ControlFlowNode, CallNode, BasicBlock, NameNode, AttrNode, ...) are qualified with the Cfg:: prefix; SSA references switch from EssaVariable/EssaDefinition to SsaImpl::Definition/SourceVariable. GuardNode is redesigned to use the new CFG's outcome-node model (isAfterTrue / isAfterFalse) instead of the legacy ConditionBlock + flipped indirection. Only BarrierGuard<...> is preserved as public API. Framework files (Bottle, FastApi, Django, Tornado, Pyramid, Stdlib, ...) are updated to take CFG nodes from the new facade. A handful of dataflow consistency tweaks for the new CFG: - Augmented-assignment targets are treated as both load and store. - 'from X import *' produces uncertain SSA writes for unknown names. - CFG nodes are canonicalised so dataflow does not see equivalent pre/post-order pairs as distinct nodes. Two AST tweaks for the new CFG: - AstNodeImpl: omit PEP 695 type-parameter names from FunctionDefExpr / ClassDefExpr children. - ImportResolution: drop the legacy essa import. Test churn (~175 files): reblessed library- and query-test .expected files reflect slightly different CFG granularity, different toString output, and a handful of true alert deltas in security queries. Verification: all 367 lib + src + consistency-queries compile clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
yoff
pushed a commit
that referenced
this pull request
Jun 1, 2026
Flips the Python dataflow trunk from the legacy CFG (semmle/python/Flow.qll) and legacy ESSA SSA (semmle/python/essa/*) to the new shared CFG facade (semmle.python.controlflow.internal.Cfg) and the new SSA adapter (semmle.python.dataflow.new.internal.SsaImpl), both introduced additively in the preceding PRs in this stack. This is the trunk-flip equivalent of the original draft PR #21894 (kept around as documentation), rebased on top of the four preparatory PRs: P1: Remove AstNode.getAFlowNode() and rewrite callers (#21919). P2: Qualify Flow.qll's AST references with Py:: prefix (#21920). P3: Add new shared-CFG-backed control flow graph (#21921). P4: Add new shared-SSA-backed SSA adapter (#21923). The Python dataflow library (semmle/python/dataflow/new/) now imports the new CFG facade and SSA adapter. All CFG-typed predicates (ControlFlowNode, CallNode, BasicBlock, NameNode, AttrNode, ...) are qualified with the Cfg:: prefix; SSA references switch from EssaVariable/EssaDefinition to SsaImpl::Definition/SourceVariable. GuardNode is redesigned to use the new CFG's outcome-node model (isAfterTrue / isAfterFalse) instead of the legacy ConditionBlock + flipped indirection. Only BarrierGuard<...> is preserved as public API. Framework files (Bottle, FastApi, Django, Tornado, Pyramid, Stdlib, ...) are updated to take CFG nodes from the new facade. A handful of dataflow consistency tweaks for the new CFG: - Augmented-assignment targets are treated as both load and store. - 'from X import *' produces uncertain SSA writes for unknown names. - CFG nodes are canonicalised so dataflow does not see equivalent pre/post-order pairs as distinct nodes. Two AST tweaks for the new CFG: - AstNodeImpl: omit PEP 695 type-parameter names from FunctionDefExpr / ClassDefExpr children. - ImportResolution: drop the legacy essa import. Test churn (~175 files): reblessed library- and query-test .expected files reflect slightly different CFG granularity, different toString output, and a handful of true alert deltas in security queries. Verification: all 367 lib + src + consistency-queries compile clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR migrates the Python dataflow library off the legacy CFG (
semmle.python.Flow) and onto the shared CFG (semmle.python.controlflow.internal.AstNodeImpl+Cfg). It is a prerequisite for deprecatingFlow.qlland unifying Python with the rest of the shared-CFG languages (Java, C#, Swift, …).How the tangled hierarchies were untangled
Three libraries were entwined: legacy CFG → legacy SSA (ESSA) → dataflow / type-tracking / guards / queries. They had to be replaced bottom-up.
New CFG facade.
python/ql/lib/semmle/python/controlflow/internal/Cfg.qllexposes the shared CFG with the same shape that dataflow needs (ControlFlowNode,CallNode,AttrNode,NameNode, …). Where the old CFG used DB-relation classes extending@py_flow_node, the facade rebuilds them as one-line bridges over the AST:NameNode.defines(v)is justexists(Name n | n = this.getNode() and n.defines(v) and not this.isLoad())— mirroring Java's pattern.Shared-SSA adapter.
SsaImpl.qllinstantiatescodeql.ssa.Ssa::Makeagainst the new CFG. The legacy ESSA refinement classes (EssaNodeRefinement,PyEdgeRefinement,EssaAttributeDefinition, used only by the points-to engine) are left behind on the old SSA; the new adapter covers exactly what new dataflow + import-resolution + type-tracking consume (plain assignment, phi, scope-entry, parameter, with-definition, multi-assignment).Guards.
GuardNodeis rebuilt as a CFG-native concept using the new CFG's outcome nodes (isAfterTrue/isAfterFalse). The Python-specific public surface (GuardNode,isSafeCheck) is dropped in favour of the cross-languageBarrierGuard<...>template. A newoutcomeOfGuard(guard, outcomeNode, branch)walks AST wrappers (not,== True,is False, …) to find the real splitting node.The plan is to move to the shared guards library later.
Canonicalisation. The shared CFG models each Python
Expras multiple nodes (before, after, after-true/false, intermediate boolean pairs). Dataflow needs one node per Expr. We pick theisAfter(<ast>)node as canonical — the node that fires once the (sub-)expression has been fully evaluated.AnnotatedExitNodewas added toisCanonicalso synthetic scope-exit uses for module globals attach correctly.The plan is to switch to simply having AST nodes in the dataflow graph. But current code is using utility predicates from the Cfg nodes, so this is a bigger operation. The canonicalization is a way to express "one dataflow node per AST node" while not rewriting a lot of code.
Sequential migration. Phase -1 wired all variable-binding constructs (
x: int = …,from x import a, PEP 695 type params, match patterns,with … as x, …) so ASTName.defines(v)predicates have corresponding CFG positions. Phase 0 built the facades. Phase 1 swapped dataflow trunk + ~25 framework files. Phase 3 validated.Semantic changes
Improvements (blessed)
CmpTest; new SSA noticeably more in sync with legacy.Imports/deprecated/DeprecatedModulefinding (md5 module correctly flagged).TypeTrackingConsistencyon essa/ssa-compute.Known regressions (blocked on
cfg-modelling-exceptions)The shared CFG uses a no-raise abstraction: exception-successor edges are only emitted where
Input::beginAbruptCompletionfires, which Python currently only does forassert. As a result,excepthandler bodies and exception target Names (einexcept E as e:) are statically dead, and the following are blessed as known regressions:Security/CWE-209-StackTraceExposure/StackTraceExposureSecurity/CWE-209-StackTraceExposure/ExceptionInfoMissing resultmarkersExceptions/general/EmptyExceptExceptions/general/CatchingBaseExceptionExceptions/general/IncorrectExceptOrderResources/FileNotAlwaysClosed/FileNotAlwaysClosedStatements/exit/UseOfExitexperimental/library-tests/FindSubclass/FindThe same root cause explains the 3,790 CPython binding-coverage gap and the 4 customSanitizer MISSING annotations observed during Phase -1.
Cosmetic
.expectedfiles reblessed for toString churn. The shared CFG library overridesControlFlowNode.toString()asfinal, dropping theControlFlowNode for Xprefix in favour ofXorAfter X(depending on whether the canonical node is a leaf or an after-node).LocalSourcetoString labels changed (Global Variable X→Variable X).Work left for future PRs
cfg-modelling-exceptions(highest priority). EmitbeginAbruptCompletion(ast, n, ExceptionSuccessor, false)from arbitrary call / subscript / attribute / etc. expressions inAstNodeImpl.qll. Expected to close all 8 cfg-exceptions regressions above, the 3,790 CPython binding misses, and the 4 customSanitizer MISSINGs. Discussion of the no-raise abstraction trade-off is open.semmle/python/Flow.qll. Adddeprecatedto the public classes (ControlFlowNode,CallNode,BasicBlock, …) and a change-note. Keep the file importable so downstream queries continue to compile with warnings only.GuardedControlFlow.qllto usecodeql.controlflow.Guards(the cross-language guards module). The Python-specificoutcomeOfGuardwalking we added here is a stop-gap that can be folded into the shared library.~15 framework filesthat currently still useBarrierGuard<...>directly.library-tests/dataflow-new-ssa/SsaTest— 5 Unexpected / 1 Missing for function-name defs not annotated in source.query-tests/Functions/ModificationOfParameterWithDefault— 2 missingmodification=lmarkers.cfg-modelling-exceptions:fastapi/InlineTaintTestlost 104 Pydantic-field paths (likely a TypeTracking attribute-refinement gap).fastapi_path_injection.py:26Source annotation missing.Commits
The branch was rebased onto main after the precursor evaluation-order test pack (PR #2189x — Taus) landed there. Substance of Taus's review-comment fixes (
1ef557c972f,35faec3db1e) was folded back in as a separate commit, with one exception: thedead(2)annotation intest_boolean.py:27was retained because ourNewCfgBranchTimestampscheck (introduced on this branch) requires it.