Python: Improve some flow summaries#22101
Conversation
34db1c7 to
a5444b5
Compare
| preservesValue = true | ||
| ) | ||
| input = "Argument[0].WithAnyDictionaryElement" and | ||
| output = "ReturnValue" and |
There was a problem hiding this comment.
This seems to change the meaning of the flow summary - previously it went from Argument[0].DictionaryElement[x] to ReturnValue.DictionaryElement[x] for all keys x, and now it goes from Argument[0].WithAnyDictionaryElement to ReturnValue. Can you explain?
There was a problem hiding this comment.
Argument[0].DictionaryElement[x] is saying that data must be stored inside a dictionary value with some key, which has the exact same effect as the original flow summary, except it doesn't compile down to a bunch of read-steps followed by a bunch of identical store-steps, and it is hence much more performant.
There was a problem hiding this comment.
Ah, I think I get it now. I've never come across this mechanism before - go seems to be one of the few languages not using it. So, if I've understood correctly, if the input is <input>.WithAnyDictionaryElement and the output is <output> then the MaD machinery will create flow from <input> to <output> as long as there is flow to <input>.AnyDictionaryElement? A little bit like a lookahead in regexes, where it checks for the existence of something without consuming it.
There was a problem hiding this comment.
Pull request overview
This PR updates Python dataflow flow summaries to be (mostly) equivalent while significantly improving performance by relying more on existing ContentSet encodings and adding support for summaries that depend on “with content”. It also broadens the builtins.zip modeling by removing the prior limitation that only the first two arguments were considered.
Changes:
- Rework several stdlib flow summaries to use
Any*Element/WithAny*Element-style encodings instead of per-index/per-key expansion. - Add infrastructure to encode and consume “with content” in flow summaries and plumb “expects content” through the dataflow internals.
- Update Python library tests to reflect the new (generalized) behavior, including new expected and known-spurious flows.
Show a summary per file
| File | Description |
|---|---|
| python/ql/test/library-tests/frameworks/django-orm/testapp/orm_tests.py | Updates ORM test expectations to reflect newly modeled flow through in_bulk().values(). |
| python/ql/test/library-tests/dataflow/coverage/test_builtins.py | Adjusts zip tuple test expectations for generalized argument handling and records a known spurious flow. |
| python/ql/lib/semmle/python/frameworks/Stdlib.qll | Rewrites several stdlib flow summaries to use broader content encodings and “with content” forms. |
| python/ql/lib/semmle/python/dataflow/new/internal/FlowSummaryImpl.qll | Adds encoding helper for “with content” in flow summary representations. |
| python/ql/lib/semmle/python/dataflow/new/internal/DataFlowPrivate.qll | Hooks expectsContent up to flow-summary-specific content expectations. |
Review details
- Files reviewed: 4/5 changed files
- Comments generated: 0
- Review effort level: Low
This PR rewrites some Python flow summaries into equivalent* but much more performant summaries. Some summaries, like
builtins.enumerate, could be rewritten directly by making use of existingContentSets, while others, such asbuiltins.dict, required adding support for "with content".*The flow summary for
builtins.ziphas been generalized by removing the restriction that only the first two arguments are taken into account.DCA looks excellent; we reduce analysis time on
dask__daskby 95 %, but also recover the performance onsaltstack__saltthat was originally lost on #21888 (note that the performance lost on that PR forytdl-org__youtube-dlwas already recovered in #21941).