Skip to content

refactor(physical-plan): externalize statistics traversal into StatisticsContext#23051

Open
asolimando wants to merge 4 commits into
apache:mainfrom
asolimando:asolimando/22958-statistics-from-inputs
Open

refactor(physical-plan): externalize statistics traversal into StatisticsContext#23051
asolimando wants to merge 4 commits into
apache:mainfrom
asolimando:asolimando/22958-statistics-from-inputs

Conversation

@asolimando

Copy link
Copy Markdown
Member

Which issue does this PR close?

Rationale for this change

Follow-up to #21815. Per @2010YOUY01's suggestion there, this decouples statistics
traversal/caching from each operator's computation: operators now express only local
propagation, while the walk and cache live in an external StatisticsContext.

What changes are included in this PR?

  • StatisticsContext owns the bottom-up walk + per-walk cache; compute(plan, args)
    resolves children, then calls the operator.
  • ExecutionPlan::statistics_from_inputs(input_stats, args) — stateless local
    propagation from pre-resolved child stats; default delegates to deprecated
    partition_statistics.
  • ExecutionPlan::child_stats_requests(partition) -> Vec<ChildStats> — per-child
    directive (At(Option<usize>) / Skip) for which partition each child is computed
    at (e.g. broadcast joins request the build side overall; UnionExec skips
    non-owning inputs).
  • Removes the unreleased statistics_with_args; partition_statistics stays
    deprecated.
  • Migrates all operators/callers; updates the 55.0.0 upgrade guide.

Are these changes tested?

Yes — existing statistics tests now run through StatisticsContext::compute, plus new
unit tests for the cache and the compute_statistics benchmark.

Are there any user-facing changes?

Yes (public ExecutionPlan API): partition_statistics deprecated in favor of
statistics_from_inputs; new StatisticsContext / ChildStats; unreleased
statistics_with_args removed. Upgrade guide updated.


Reviewer note: statistics_from_inputs takes args: &StatisticsArgs (currently just
partition) as a non-breaking extension seam rather than a bare partition — happy
to change if preferred (see #22958).


Disclaimer: I used AI to assist in the code generation, I have manually reviewed the output and it matches my intention and understanding.

…ticsContext

Decouple operator statistics propagation from tree traversal and caching,
per apache#22958.

- Add `StatisticsContext`, which owns the bottom-up walk and per-walk
  memoization cache. `compute(plan, args)` resolves each child before
  invoking the operator.
- Add `ExecutionPlan::statistics_from_inputs(&self, input_stats, args)`:
  a stateless per-operator method that computes a node's statistics from
  pre-resolved child statistics. Default delegates to the deprecated
  `partition_statistics`.
- Add `ExecutionPlan::child_stats_requests(&self, partition) -> Vec<ChildStats>`
  letting a node declare, per child, which partition to request
  (`ChildStats::At`) or to skip entirely (`ChildStats::Skip`). Union skips
  non-owning inputs on partition-specific requests.

Operators no longer call back into the framework to fetch child statistics;
caching is shared across a single walk via the context.
@github-actions github-actions Bot added documentation Improvements or additions to documentation optimizer Optimizer rules core Core DataFusion crate datasource Changes to the datasource crate ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate labels Jun 19, 2026
@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v54.0.0 (current)
       Built [ 100.320s] (current)
     Parsing datafusion v54.0.0 (current)
      Parsed [   0.034s] (current)
    Building datafusion v54.0.0 (baseline)
       Built [  99.644s] (baseline)
     Parsing datafusion v54.0.0 (baseline)
      Parsed [   0.034s] (baseline)
    Checking datafusion v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.523s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 202.495s] datafusion
    Building datafusion-datasource v54.0.0 (current)
       Built [  35.935s] (current)
     Parsing datafusion-datasource v54.0.0 (current)
      Parsed [   0.031s] (current)
    Building datafusion-datasource v54.0.0 (baseline)
       Built [  35.971s] (baseline)
     Parsing datafusion-datasource v54.0.0 (baseline)
      Parsed [   0.033s] (baseline)
    Checking datafusion-datasource v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.242s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  73.195s] datafusion-datasource
    Building datafusion-ffi v54.0.0 (current)
       Built [  57.843s] (current)
     Parsing datafusion-ffi v54.0.0 (current)
      Parsed [   0.060s] (current)
    Building datafusion-ffi v54.0.0 (baseline)
       Built [  57.596s] (baseline)
     Parsing datafusion-ffi v54.0.0 (baseline)
      Parsed [   0.061s] (baseline)
    Checking datafusion-ffi v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.271s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 116.951s] datafusion-ffi
    Building datafusion-physical-optimizer v54.0.0 (current)
       Built [  37.550s] (current)
     Parsing datafusion-physical-optimizer v54.0.0 (current)
      Parsed [   0.021s] (current)
    Building datafusion-physical-optimizer v54.0.0 (baseline)
       Built [  37.331s] (baseline)
     Parsing datafusion-physical-optimizer v54.0.0 (baseline)
      Parsed [   0.021s] (baseline)
    Checking datafusion-physical-optimizer v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.108s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  75.975s] datafusion-physical-optimizer
    Building datafusion-physical-plan v54.0.0 (current)
       Built [  35.699s] (current)
     Parsing datafusion-physical-plan v54.0.0 (current)
      Parsed [   0.129s] (current)
    Building datafusion-physical-plan v54.0.0 (baseline)
       Built [  35.349s] (baseline)
     Parsing datafusion-physical-plan v54.0.0 (baseline)
      Parsed [   0.128s] (baseline)
    Checking datafusion-physical-plan v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.565s] 223 checks: 221 pass, 2 fail, 0 warn, 30 skip

--- failure inherent_method_missing: pub method removed or renamed ---

Description:
A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/inherent_method_missing.ron

Failed in:
  StatisticsArgs::compute_child_statistics, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/33701b08b2061173c09dbab5ccffd3052a7d815f/datafusion/physical-plan/src/statistics.rs:119
  StatisticsArgs::compute_child_statistics, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/33701b08b2061173c09dbab5ccffd3052a7d815f/datafusion/physical-plan/src/statistics.rs:119

--- failure trait_method_missing: pub trait method removed or renamed ---

Description:
A trait method is no longer callable, and may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#trait-item-signature
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/trait_method_missing.ron

Failed in:
  method statistics_with_args of trait ExecutionPlan, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/33701b08b2061173c09dbab5ccffd3052a7d815f/datafusion/physical-plan/src/execution_plan.rs:529
  method statistics_with_args of trait ExecutionPlan, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/33701b08b2061173c09dbab5ccffd3052a7d815f/datafusion/physical-plan/src/execution_plan.rs:529

     Summary semver requires new major version: 2 major and 0 minor checks failed
    Finished [  72.863s] datafusion-physical-plan

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 19, 2026
@asolimando

Copy link
Copy Markdown
Member Author

@2010YOUY01, @alamb: I have prepared this follow-up PR after #21815, it would be great if you could take a look at your convenience, I am in particular looking for guidance on the point raised in #22958 (comment).

cc: @xudong963 in case you are interested in taking a look at this PR as well.

Thank you folks for all the design and code feedback so far!

@xudong963 xudong963 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me.

There are two issues are about the API migration boundary:

  1. Old callers calling migrated nodes can get unknown stats. Many built-in operators now implement only statistics_from_inputs. But the deprecated partition_statistics default does not delegate to statistics_from_inputs; it just returns Statistics::new_unknown. So any remaining direct caller of partition_statistics can silently lose useful statistics.

  2. New callers calling old custom nodes can do wasted child traversal.
    StatisticsContext::compute always computes child stats before calling statistics_from_inputs. For old third-party plans that only override partition_statistics, the default statistics_from_inputs
    ignores those child stats and delegates to partition_statistics. So the new API may perform extra work, or even surface child-stat errors, that the old implementation never required.

Make the default `child_stats_requests` skip every child, so a node that does not
derive its statistics from children (for example one that only overrides the
deprecated `partition_statistics`) triggers no child traversal. This avoids the
eager child walk, and the child-stat errors it could surface, that
`StatisticsContext::compute` previously performed even when `statistics_from_inputs`
ignored the results.

Nodes that read `input_stats` now declare the children they use via
`child_stats_requests`.
@asolimando

Copy link
Copy Markdown
Member Author

Thanks @xudong963 for your feedback.

  1. Old callers calling migrated nodes can get unknown stats. Many built-in operators now implement only statistics_from_inputs. But the deprecated partition_statistics default does not delegate to statistics_from_inputs; it just returns Statistics::new_unknown. So any remaining direct caller of partition_statistics can silently lose useful statistics.

This is correct but it's the same one-way-delegation tradeoff we accepted in #21815, it's already documented in the upgrade guide.

  1. New callers calling old custom nodes can do wasted child traversal.
    StatisticsContext::compute always computes child stats before calling statistics_from_inputs. For old third-party plans that only override partition_statistics, the default statistics_from_inputs
    ignores those child stats and delegates to partition_statistics. So the new API may perform extra work, or even surface child-stat errors, that the old implementation never required.

This is indeed a regression w.r.t. #21815, whose default never walked children. I considered several options but I finally settled on flipping the default child_stats_requests to Skip to restore the behavior of #21815.

This way, a node that only overrides partition_statistics now declares no children, so there's no walk, and no risk of surfaced child errors. Nodes that read input_stats declare the children they use.

Side note: the semver check is technically correct but this is a change just between #21815 and this PR, so we don't need to go through a deprecation process as nothing was released in the meantime, correct?

…atistics-from-inputs

# Conflicts:
#	datafusion/core/src/datasource/listing/table.rs
@asolimando asolimando requested a review from xudong963 June 26, 2026 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation ffi Changes to the ffi crate optimizer Optimizer rules physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Decouple operator statistics propagation from traversal/caching via statistics_from_inputs

2 participants