[SPARK-57484][SQL] Preserve row nullability after ColumnarToRowExec by ZitingShen · Pull Request #56536 · apache/spark

ZitingShen · 2026-06-16T04:30:25Z

What changes were proposed in this pull request?

Make ColumnarToRowExec expose nullable row output when materializing a columnar batch, because the batch null bitmap is the execution-time source of truth.
Preserve that execution-time nullability only through downstream row/materialization paths that rebuild attributes after transition insertion: project-like operators, ExpandExec / GenerateExec, grouped aggregate output/key materialization, and analogous Python UDTF passthrough materialization.
Add a regression covering whole-stage codegen, split consume, row boundaries, partial grouped HashAggregateExec and SortAggregateExec, generate, take-ordered/project, and expand paths when a non-nullable columnar output physically contains a null.

Why are the changes needed?

ColumnarToRowExec can materialize a real null from a columnar batch even when the planned output attribute is non-nullable. Downstream row codegen can then trust the stale non-nullable attribute, skip the null check, and fail while materializing or consuming the row.

One observed failing plan had this shape:

HashAggregate
  HashAggregate
    ...
      ColumnarToRow
        Scan parquet ...

In that case the Parquet reader produced a physical null for a column whose planned output attribute was non-nullable, and downstream generated row code failed with a UTF8String.getBaseObject() NullPointerException.

An earlier version rebound attributes across the whole transition insertion rule. That was too broad: it changed unrelated parent expression and schema behavior and triggered failures in DPP reuse, plan-stability, and metadata-schema tests. This version keeps the transition contract local and refreshes only downstream row/materialization paths that can revive stale pre-transition nullability.

Does this PR introduce any user-facing change?

Yes. Queries that read a physical null through ColumnarToRowExec despite stale non-nullable planned metadata now preserve the null through row materialization instead of crashing in downstream row codegen.

How was this patch tested?

build/sbt -java-home /opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home 'sql/testOnly org.apache.spark.sql.execution.SparkPlanSuite -- -z "ColumnarToRowExec should materialize null values from non-nullable columnar output"'
git diff --check

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex

ZitingShen · 2026-06-17T16:33:26Z

Discussed with @sunchao:
After digging further, we don’t think this is the right change for OSS Spark. Making every ColumnarToRowExec output nullable changes a very broad execution contract on a hot path: downstream JVM row operators now need to treat the physical null bitmap as authoritative even when the optimized plan says nullable=false. That adds null-check cost after common scan boundaries, changes explain/canonical/state-schema behavior, and still does not provide true end-to-end support for nulls under a non-null logical contract because Catalyst has already optimized using nullable=false. If Spark wants to harden this case, it should be through a narrower invariant check or a fully designed end-to-end semantic change with optimizer, execution, tests, and performance validation, not this localized row-boundary rewrite.

ZitingShen force-pushed the codex/columnar-to-row-nullability branch 2 times, most recently from 4f39b8a to dd14115 Compare June 16, 2026 04:44

ZitingShen marked this pull request as ready for review June 16, 2026 04:44

ZitingShen force-pushed the codex/columnar-to-row-nullability branch from dd14115 to a179ab6 Compare June 16, 2026 13:53

[SPARK-57484][SQL] Preserve row nullability after ColumnarToRowExec

d8cb5f0

ZitingShen force-pushed the codex/columnar-to-row-nullability branch from a179ab6 to d8cb5f0 Compare June 16, 2026 13:56

ZitingShen added 2 commits June 16, 2026 16:02

Fix SparkPlanSuite scalastyle

1022860

Refine row-boundary nullability propagation

47e71c5

ZitingShen closed this Jun 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57484][SQL] Preserve row nullability after ColumnarToRowExec#56536

[SPARK-57484][SQL] Preserve row nullability after ColumnarToRowExec#56536
ZitingShen wants to merge 3 commits into
apache:masterfrom
ZitingShen:codex/columnar-to-row-nullability

ZitingShen commented Jun 16, 2026 •

edited

Loading

Uh oh!

ZitingShen commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ZitingShen commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

ZitingShen commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ZitingShen commented Jun 16, 2026 •

edited

Loading