Skip to content

[SPARK-57484][SQL] Preserve row nullability after ColumnarToRowExec#56536

Closed
ZitingShen wants to merge 3 commits into
apache:masterfrom
ZitingShen:codex/columnar-to-row-nullability
Closed

[SPARK-57484][SQL] Preserve row nullability after ColumnarToRowExec#56536
ZitingShen wants to merge 3 commits into
apache:masterfrom
ZitingShen:codex/columnar-to-row-nullability

Conversation

@ZitingShen

@ZitingShen ZitingShen commented Jun 16, 2026

Copy link
Copy Markdown

What changes were proposed in this pull request?

  • Make ColumnarToRowExec expose nullable row output when materializing a columnar batch, because the batch null bitmap is the execution-time source of truth.
  • Preserve that execution-time nullability only through downstream row/materialization paths that rebuild attributes after transition insertion: project-like operators, ExpandExec / GenerateExec, grouped aggregate output/key materialization, and analogous Python UDTF passthrough materialization.
  • Add a regression covering whole-stage codegen, split consume, row boundaries, partial grouped HashAggregateExec and SortAggregateExec, generate, take-ordered/project, and expand paths when a non-nullable columnar output physically contains a null.

Why are the changes needed?

ColumnarToRowExec can materialize a real null from a columnar batch even when the planned output attribute is non-nullable. Downstream row codegen can then trust the stale non-nullable attribute, skip the null check, and fail while materializing or consuming the row.

One observed failing plan had this shape:

HashAggregate
  HashAggregate
    ...
      ColumnarToRow
        Scan parquet ...

In that case the Parquet reader produced a physical null for a column whose planned output attribute was non-nullable, and downstream generated row code failed with a UTF8String.getBaseObject() NullPointerException.

An earlier version rebound attributes across the whole transition insertion rule. That was too broad: it changed unrelated parent expression and schema behavior and triggered failures in DPP reuse, plan-stability, and metadata-schema tests. This version keeps the transition contract local and refreshes only downstream row/materialization paths that can revive stale pre-transition nullability.

Does this PR introduce any user-facing change?

Yes. Queries that read a physical null through ColumnarToRowExec despite stale non-nullable planned metadata now preserve the null through row materialization instead of crashing in downstream row codegen.

How was this patch tested?

  • build/sbt -java-home /opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home 'sql/testOnly org.apache.spark.sql.execution.SparkPlanSuite -- -z "ColumnarToRowExec should materialize null values from non-nullable columnar output"'
  • git diff --check

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex

@ZitingShen ZitingShen force-pushed the codex/columnar-to-row-nullability branch 2 times, most recently from 4f39b8a to dd14115 Compare June 16, 2026 04:44
@ZitingShen ZitingShen marked this pull request as ready for review June 16, 2026 04:44
@ZitingShen ZitingShen force-pushed the codex/columnar-to-row-nullability branch from dd14115 to a179ab6 Compare June 16, 2026 13:53
@ZitingShen ZitingShen force-pushed the codex/columnar-to-row-nullability branch from a179ab6 to d8cb5f0 Compare June 16, 2026 13:56
@ZitingShen

Copy link
Copy Markdown
Author

Discussed with @sunchao:
After digging further, we don’t think this is the right change for OSS Spark. Making every ColumnarToRowExec output nullable changes a very broad execution contract on a hot path: downstream JVM row operators now need to treat the physical null bitmap as authoritative even when the optimized plan says nullable=false. That adds null-check cost after common scan boundaries, changes explain/canonical/state-schema behavior, and still does not provide true end-to-end support for nulls under a non-null logical contract because Catalyst has already optimized using nullable=false. If Spark wants to harden this case, it should be through a narrower invariant check or a fully designed end-to-end semantic change with optimizer, execution, tests, and performance validation, not this localized row-boundary rewrite.

@ZitingShen ZitingShen closed this Jun 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant