refactor(physical-plan): externalize statistics traversal into StatisticsContext#23051
refactor(physical-plan): externalize statistics traversal into StatisticsContext#23051asolimando wants to merge 4 commits into
Conversation
…ticsContext Decouple operator statistics propagation from tree traversal and caching, per apache#22958. - Add `StatisticsContext`, which owns the bottom-up walk and per-walk memoization cache. `compute(plan, args)` resolves each child before invoking the operator. - Add `ExecutionPlan::statistics_from_inputs(&self, input_stats, args)`: a stateless per-operator method that computes a node's statistics from pre-resolved child statistics. Default delegates to the deprecated `partition_statistics`. - Add `ExecutionPlan::child_stats_requests(&self, partition) -> Vec<ChildStats>` letting a node declare, per child, which partition to request (`ChildStats::At`) or to skip entirely (`ChildStats::Skip`). Union skips non-owning inputs on partition-specific requests. Operators no longer call back into the framework to fetch child statistics; caching is shared across a single walk via the context.
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
|
@2010YOUY01, @alamb: I have prepared this follow-up PR after #21815, it would be great if you could take a look at your convenience, I am in particular looking for guidance on the point raised in #22958 (comment). cc: @xudong963 in case you are interested in taking a look at this PR as well. Thank you folks for all the design and code feedback so far! |
xudong963
left a comment
There was a problem hiding this comment.
The changes look good to me.
There are two issues are about the API migration boundary:
-
Old callers calling migrated nodes can get unknown stats. Many built-in operators now implement only
statistics_from_inputs. But the deprecatedpartition_statisticsdefault does not delegate to statistics_from_inputs; it just returnsStatistics::new_unknown. So any remaining direct caller of partition_statistics can silently lose useful statistics. -
New callers calling old custom nodes can do wasted child traversal.
StatisticsContext::computealways computes child stats before callingstatistics_from_inputs. For old third-party plans that only overridepartition_statistics, the defaultstatistics_from_inputs
ignores those child stats and delegates topartition_statistics. So the new API may perform extra work, or even surface child-stat errors, that the old implementation never required.
…atistics-from-inputs
Make the default `child_stats_requests` skip every child, so a node that does not derive its statistics from children (for example one that only overrides the deprecated `partition_statistics`) triggers no child traversal. This avoids the eager child walk, and the child-stat errors it could surface, that `StatisticsContext::compute` previously performed even when `statistics_from_inputs` ignored the results. Nodes that read `input_stats` now declare the children they use via `child_stats_requests`.
|
Thanks @xudong963 for your feedback.
This is correct but it's the same one-way-delegation tradeoff we accepted in #21815, it's already documented in the upgrade guide.
This is indeed a regression w.r.t. #21815, whose default never walked children. I considered several options but I finally settled on flipping the default This way, a node that only overrides Side note: the semver check is technically correct but this is a change just between #21815 and this PR, so we don't need to go through a deprecation process as nothing was released in the meantime, correct? |
…atistics-from-inputs # Conflicts: # datafusion/core/src/datasource/listing/table.rs
Which issue does this PR close?
Rationale for this change
Follow-up to #21815. Per @2010YOUY01's suggestion there, this decouples statistics
traversal/caching from each operator's computation: operators now express only local
propagation, while the walk and cache live in an external
StatisticsContext.What changes are included in this PR?
StatisticsContextowns the bottom-up walk + per-walk cache;compute(plan, args)resolves children, then calls the operator.
ExecutionPlan::statistics_from_inputs(input_stats, args)— stateless localpropagation from pre-resolved child stats; default delegates to deprecated
partition_statistics.ExecutionPlan::child_stats_requests(partition) -> Vec<ChildStats>— per-childdirective (
At(Option<usize>)/Skip) for which partition each child is computedat (e.g. broadcast joins request the build side overall;
UnionExecskipsnon-owning inputs).
statistics_with_args;partition_statisticsstaysdeprecated.
Are these changes tested?
Yes — existing statistics tests now run through
StatisticsContext::compute, plus newunit tests for the cache and the
compute_statisticsbenchmark.Are there any user-facing changes?
Yes (public
ExecutionPlanAPI):partition_statisticsdeprecated in favor ofstatistics_from_inputs; newStatisticsContext/ChildStats; unreleasedstatistics_with_argsremoved. Upgrade guide updated.Reviewer note:
statistics_from_inputstakesargs: &StatisticsArgs(currently justpartition) as a non-breaking extension seam rather than a barepartition— happyto change if preferred (see #22958).
Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.