What changes were proposed in this pull request?
Make ColumnarToRowExec expose nullable row output when materializing a columnar batch, because the batch null bitmap is the execution-time source of truth.
Preserve that execution-time nullability only through downstream row/materialization paths that rebuild attributes after transition insertion: project-like operators, ExpandExec / GenerateExec, grouped aggregate output/key materialization, and analogous Python UDTF passthrough materialization.
Add a regression covering whole-stage codegen, split consume, row boundaries, partial grouped HashAggregateExec and SortAggregateExec, generate, take-ordered/project, and expand paths when a non-nullable columnar output physically contains a null.
Why are the changes needed?
ColumnarToRowExec can materialize a real null from a columnar batch even when the planned output attribute is non-nullable. Downstream row codegen can then trust the stale non-nullable attribute, skip the null check, and fail while materializing or consuming the row.
In one observed failing plan, the Parquet reader produced a physical null for a column whose planned output attribute was non-nullable, and downstream generated row code failed with a UTF8String.getBaseObject() NullPointerException.
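The failure mode described above can be illustrated with a toy model. This is only a sketch in plain Python, not Spark code: `Attribute`, `compile_length_reader`, and the list-based `batch` are hypothetical stand-ins for a planned output attribute, row codegen, and a columnar batch. The "compiled" reader omits the null check whenever the attribute claims to be non-nullable, which mirrors how generated row code trusts planned metadata.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Attribute:
    """Hypothetical stand-in for a planned output attribute."""
    name: str
    nullable: bool


def compile_length_reader(attr: Attribute) -> Callable[[Optional[str]], Optional[int]]:
    """Mimic row codegen: emit a null check only when the plan says nulls are possible."""
    if attr.nullable:
        return lambda value: None if value is None else len(value)
    # Planned as non-nullable, so the null check is elided.
    return lambda value: len(value)


# The physical data contains a null regardless of what the plan claims.
batch: List[Optional[str]] = ["a", None, "ccc"]

stale = compile_length_reader(Attribute("s", nullable=False))
fresh = compile_length_reader(Attribute("s", nullable=True))


def run(reader: Callable[[Optional[str]], Optional[int]]) -> List[Optional[int]]:
    return [reader(value) for value in batch]


try:
    run(stale)
    stale_failed = False
except TypeError:
    # Python's analogue of the NullPointerException seen in generated row code.
    stale_failed = True

print(stale_failed)  # True
print(run(fresh))    # [1, None, 3]
```

The reader built from the stale non-nullable attribute fails on the physical null, while the reader built from a nullable attribute passes the null through.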
An earlier version rebound attributes across the whole transition insertion rule. That was too broad: it changed unrelated parent expression and schema behavior and triggered failures in DPP reuse, plan-stability, and metadata-schema tests. This version keeps the transition contract local and refreshes only downstream row/materialization paths that can revive stale pre-transition nullability.
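A rough sketch of what a local refresh could look like, again as a plain-Python toy and not the patch itself. `Attr` and `refresh_nullability` are hypothetical names, and matching by `expr_id` and only ever widening nullability are assumptions made for illustration. The idea is that only references resolved against the transition's output pick up the new nullability, and attributes that do not come from that child are left alone.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class Attr:
    """Hypothetical stand-in for an attribute reference."""
    expr_id: int
    name: str
    nullable: bool


def refresh_nullability(refs: List[Attr], child_output: List[Attr]) -> List[Attr]:
    """Widen each reference to the child's current nullability, matched by expr_id.

    References that the child does not produce are returned unchanged, and
    nullability is never narrowed.
    """
    current: Dict[int, bool] = {a.expr_id: a.nullable for a in child_output}
    return [
        replace(r, nullable=r.nullable or current.get(r.expr_id, False))
        for r in refs
    ]


# Output of the columnar-to-row transition, now reported as nullable.
child_output = [Attr(1, "s", True), Attr(2, "n", True)]

# A project-like operator still holds the stale pre-transition attribute.
project_refs = [Attr(1, "s", False)]

# An attribute that does not come from this child must not be touched.
unrelated_refs = [Attr(3, "k", False)]

print(refresh_nullability(project_refs, child_output))
print(refresh_nullability(unrelated_refs, child_output))
```

Keeping the refresh scoped to operators directly downstream of the transition is what distinguishes this from the earlier, broader rebinding.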
Does this PR introduce any user-facing change?
Yes. Queries that read a physical null through ColumnarToRowExec despite stale non-nullable planned metadata now preserve the null through row materialization instead of crashing in downstream row codegen.
How was this patch tested?
build/sbt -java-home /opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home 'sql/testOnly org.apache.spark.sql.execution.SparkPlanSuite -- -z "ColumnarToRowExec should materialize null values from non-nullable columnar output"'
git diff --check
Was this patch authored or co-authored using generative AI tooling?
Generated-by: OpenAI Codex
Discussed with @sunchao:
After digging further, we don’t think this is the right change for OSS Spark. Making every ColumnarToRowExec output nullable changes a very broad execution contract on a hot path: downstream JVM row operators now need to treat the physical null bitmap as authoritative even when the optimized plan says nullable=false. That adds null-check cost after common scan boundaries, changes explain/canonical/state-schema behavior, and still does not provide true end-to-end support for nulls under a non-null logical contract because Catalyst has already optimized using nullable=false. If Spark wants to harden this case, it should be through a narrower invariant check or a fully designed end-to-end semantic change with optimizer, execution, tests, and performance validation, not this localized row-boundary rewrite.
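For illustration only, one possible reading of a "narrower invariant check" is a fail-fast validation at the boundary that reports the contract violation clearly instead of letting downstream code crash. The sketch below is plain Python with hypothetical names (`NonNullContractViolation`, `check_non_null_contract`); it is not an existing Spark API, and the discussion above does not specify what form such a check would take.

```python
from typing import Optional, Sequence


class NonNullContractViolation(ValueError):
    """Hypothetical error for a physical null under a non-nullable plan."""


def check_non_null_contract(
    column: str, nullable: bool, values: Sequence[Optional[object]]
) -> None:
    """Raise a descriptive error if a non-nullable column physically holds a null."""
    if nullable:
        return  # Nothing to enforce for nullable columns.
    for row, value in enumerate(values):
        if value is None:
            raise NonNullContractViolation(
                f"column '{column}' is planned non-nullable but row {row} is null"
            )


# A nullable column may contain nulls without complaint.
check_non_null_contract("s", True, ["a", None])

try:
    check_non_null_contract("s", False, ["a", None])
    message = None
except NonNullContractViolation as e:
    message = str(e)

print(message)
```

A check like this still scans every value of a non-nullable column, so it carries its own cost on the hot path and would need the same performance validation mentioned above.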
Generated-by: OpenAI Codex