[SPARK-37019][SQL][FOLLOWUP] Resolve nested higher-order function arguments first by sunchao · Pull Request #56507 · apache/spark

sunchao · 2026-06-14T20:23:08Z

What changes were proposed in this pull request?

Resolve each higher-order function's argument expressions before checking their data types and binding its lambda functions.

The analyzer now follows this order:

1. Resolve the argument expressions using the current outer lambda scope.
2. Rebuild the higher-order function with those resolved arguments.
3. If the arguments are ready and valid, bind the lambda functions immediately.
4. Otherwise, resolve only the function expressions and defer binding.

This matches the established sequence in the single-pass HigherOrderFunctionResolver, so both analyzer paths now resolve arguments before binding lambdas.
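The four steps can be sketched with a toy model. This is a minimal illustration in plain Python, not Spark's Scala implementation: `Column`, `Name`, `Transform`, `is_resolved`, `dtype`, and `resolve` are hypothetical names invented for the sketch, types are plain strings, and the model assumes a single one-argument function whose unbound lambda body is left untouched on the deferral path.

```python
from dataclasses import dataclass, replace
from typing import Dict, Union


@dataclass(frozen=True)
class Column:
    """A resolved leaf expression with a known type, e.g. 'array<int>'."""
    dtype: str


@dataclass(frozen=True)
class Name:
    """An unresolved reference, e.g. a lambda variable that is not bound yet."""
    name: str


@dataclass(frozen=True)
class Transform:
    """Toy stand-in for a one-argument higher-order function (hypothetical)."""
    argument: "Expr"
    param: str
    body: "Expr"
    bound: bool = False


Expr = Union[Column, Name, Transform]


def is_resolved(e: Expr) -> bool:
    if isinstance(e, Column):
        return True
    if isinstance(e, Name):
        return False
    return e.bound and is_resolved(e.argument) and is_resolved(e.body)


def dtype(e: Expr) -> str:
    if isinstance(e, Column):
        return e.dtype
    if isinstance(e, Name):
        # Same message as the analysis failure quoted in this PR.
        raise ValueError("Invalid call to dataType on unresolved object")
    return f"array<{dtype(e.body)}>"


def resolve(e: Expr, scope: Dict[str, str]) -> Expr:
    if isinstance(e, Column):
        return e
    if isinstance(e, Name):
        return Column(scope[e.name]) if e.name in scope else e
    # Step 1: resolve the argument using the current outer lambda scope.
    argument = resolve(e.argument, scope)
    # Step 2: rebuild the function with the resolved argument.
    rebuilt = replace(e, argument=argument)
    # Step 3: if the argument is ready and valid, bind the lambda immediately.
    if is_resolved(argument) and dtype(argument).startswith("array<"):
        element = dtype(argument)[len("array<"):-1]
        body = resolve(e.body, {**scope, e.param: element})
        return replace(rebuilt, body=body, bound=True)
    # Step 4: otherwise defer binding (toy assumption: the body stays as-is).
    return rebuilt


# A nested call: the inner function is the outer function's argument.
nested = Transform(Transform(Column("array<int>"), "x", Name("x")), "y", Name("y"))
result = resolve(nested, {})
```

In this sketch the inner call sits in the outer call's argument position, so resolving the argument first lets both lambdas bind in a single call to `resolve`. The type check in step 3 only runs once the argument is resolved, which is why `dtype` is never asked about an unresolved `Name`.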

This is intentionally narrow. It does not change ArrayAggregate accumulator types, casts, code generation, or runtime execution.

The PR also adds a focused regression test for the nested transform / filter / aggregate expression that exposed the bug.

Why are the changes needed?

ResolveLambdaVariables previously bound a higher-order function only when its arguments were already resolved at the start of the visit. If nested argument expressions became resolved during that visit, Spark still walked and rebuilt the remaining expression tree without binding the current lambda functions.

For complex nested types, that ordering could inspect a field extraction whose lambda variable was still unresolved and fail analysis with:

Invalid call to dataType on unresolved object

In short:

Before: check readiness -> resolve nested arguments -> wait for another analyzer pass
After:  resolve nested arguments -> check readiness -> bind lambdas in the same pass
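The difference between the two orderings can be illustrated with a small simulation. This is a hedged toy in plain Python, not the analyzer rule itself: `Hof`, `visit_before`, `visit_after`, and `passes_to_fixed_point` are hypothetical names, lambda bodies and type checks are omitted, and the model only shows the extra passes the old ordering needs, not the `dataType` failure.

```python
from dataclasses import dataclass, replace
from typing import Union


@dataclass(frozen=True)
class Hof:
    """Toy higher-order function: one argument plus a lambda that is bound or not."""
    argument: Union["Hof", str]  # a nested Hof, or the name of a resolved column
    bound: bool = False


def resolved(e) -> bool:
    return isinstance(e, str) or (e.bound and resolved(e.argument))


def visit_before(e):
    """Old order: check readiness first, then walk the children."""
    if resolved(e):
        return e
    if resolved(e.argument):  # readiness is judged at the start of the visit
        return replace(e, bound=True)
    # The argument makes progress here, but this node stays unbound for this pass.
    return replace(e, argument=visit_before(e.argument))


def visit_after(e):
    """New order: resolve the argument first, then check readiness."""
    if resolved(e):
        return e
    argument = visit_after(e.argument)
    return replace(e, argument=argument, bound=resolved(argument))


def passes_to_fixed_point(visit, e) -> int:
    """Count the passes that change the tree before it stops changing."""
    count = 0
    while True:
        nxt = visit(e)
        if nxt == e:
            return count
        e, count = nxt, count + 1


nested = Hof(Hof(Hof("col")))  # three nested calls over one resolved column
before_passes = passes_to_fixed_point(visit_before, nested)
after_passes = passes_to_fixed_point(visit_after, nested)
```

Under these assumptions the old ordering binds one nesting level per pass, while the new ordering binds all three levels in the first pass.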

Does this PR introduce any user-facing change?

Yes. Valid queries with nested higher-order functions that previously failed during analysis can now be analyzed and executed.

There is no public API, configuration, or intended runtime behavior change for queries that already worked.

How was this patch tested?

- Added `ArrayAggregate resolves nested lambda arguments before inspecting their types` to reproduce the production-shaped failure and verify the result.
- `ResolveLambdaVariablesSuite`: 6 tests passed.
- `DataFrameComplexTypeSuite`: 18 tests passed.
- Catalyst and SQL test Scalastyle: 0 errors and 0 warnings.
- `git diff --check` passed.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex (GPT-5)

sunchao · 2026-06-19T16:48:06Z

cc @cloud-fan @viirya @peter-toth @dongjoon-hyun

viirya

Reviewed the diff against the full ResolveLambdaVariables rule, the HigherOrderFunction trait and its subclasses, the subquery validator, and the single-pass HigherOrderFunctionResolver. The fix is correct and minimal, and it brings the legacy analyzer in line with the single-pass resolver. A few notes:

This aligns the legacy analyzer with the single-pass resolver — worth stating in the description. HigherOrderFunctionResolver.resolve (resolver/HigherOrderFunctionResolver.scala:118-141) already does exactly this sequence: resolve arguments -> withNewChildren(resolvedArguments ++ functions) -> validate subquery -> bind -> resolve functions. So the legacy ResolveLambdaVariables was the outlier, and the resolvedArguments ++ functions idiom is already proven there. That consistency is a stronger justification than "intentionally narrow," and reusing the established pattern is the right call.

The change depends on an undocumented invariant: children == arguments ++ functions. Both withNewChildren(... ++ h.functions) calls assume this ordering. It holds for every current subclass (SimpleHigherOrderFunction/BinaryLike, ArrayAggregate/QuaternaryLike, ZipWith/MapZipWith/TernaryLike), but it's not stated anywhere. A one-line comment at the withNewChildren call would protect future HOFs from a subtle child-reordering bug.
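To make the ordering concern concrete, here is a minimal sketch in plain Python of a node whose children are split positionally. It is an illustration under stated assumptions, not Spark's `HigherOrderFunction` trait: the `Hof` class, `with_new_children`, and `rebuild_with_resolved_arguments` are hypothetical, and expressions are stand-in strings.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Hof:
    """Toy node with two arguments and two functions (an aggregate-like shape)."""
    arguments: Tuple[str, ...]
    functions: Tuple[str, ...]

    @property
    def children(self) -> Tuple[str, ...]:
        # The invariant under discussion: arguments first, then functions.
        return self.arguments + self.functions

    def with_new_children(self, new_children: Tuple[str, ...]) -> "Hof":
        # Children are split by position, so callers must keep the same order.
        assert len(new_children) == len(self.children)
        n = len(self.arguments)
        return replace(self, arguments=new_children[:n], functions=new_children[n:])


def rebuild_with_resolved_arguments(h: Hof, resolved_arguments: Tuple[str, ...]) -> Hof:
    # Mirrors the `resolvedArguments ++ functions` idiom quoted in the review.
    return h.with_new_children(resolved_arguments + h.functions)


h = Hof(arguments=("unresolved(xs)", "unresolved(zero)"), functions=("merge", "finish"))
rebuilt = rebuild_with_resolved_arguments(h, ("xs", "zero"))

# Passing the children in the wrong order is not rejected: the lengths still
# match, so the roles are silently swapped.
swapped = h.with_new_children(h.functions + ("xs", "zero"))
```

In this toy, `swapped.arguments` ends up holding the two functions, which is the kind of silent child-reordering bug a one-line comment at the call site would guard against.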

Redundant re-resolution in the bind branch. After bind(...), .mapChildren(resolve(...)) re-walks the just-resolved arguments. It's harmless — they short-circuit on case _ if e.resolved => e — but a brief comment clarifying that the mapChildren is there to resolve the now-bound lambda bodies (not the arguments) would save the next reader a double-take.

JIRA tag looks misattributed. SPARK-37019 is "Add codegen support to array higher-order functions," which is unrelated to analyzer resolution ordering. Consider filing a dedicated ticket for this fix.

Test coverage. The regression test reproduces the production-shaped failure well and asserts the concrete result rather than just "analysis succeeds" — good. Since the fix is on the generic HigherOrderFunction path, a direct unit test in ResolveLambdaVariablesSuite (and ideally a second HOF shape beyond ArrayAggregate) would lock the behavior in closer to the rule. Not a blocker.

Behavior on the deferral path is equivalent to the old mapChildren fallthrough, and resolving arguments before checkArgumentDataTypes() directly removes the Invalid call to dataType on unresolved object failure. LGTM after the (optional) JIRA fix and the two clarifying comments.

peter-toth

+1 to @viirya's review, especially the undocumented children == arguments ++ functions invariant. I independently reached the same point: it holds for every current HOF (SimpleHigherOrderFunction/BinaryLike, ArrayAggregate/QuaternaryLike, ZipWith/MapZipWith/TernaryLike) but the trait doesn't enforce it, so the clarifying comment is worth adding.

sunchao · 2026-06-19T18:51:24Z

Thanks! I added clarifying comments to the related code.

…uments first

Closes #56507 from sunchao/dev/chao/codex/hof-gate-analyzer-behavior-oss.

Authored-by: Chao Sun <chao@openai.com>
Signed-off-by: Chao Sun <chao@openai.com>
(cherry picked from commit 3fa2bea)
Signed-off-by: Chao Sun <chao@openai.com>

sunchao · 2026-06-19T23:54:09Z

Merged to master / branch-4.x, thanks for the review!!

[SPARK-37019][SQL][FOLLOWUP] Resolve nested higher-order function arguments first (dea9dea)

sunchao force-pushed the dev/chao/codex/hof-gate-analyzer-behavior-oss branch from f2fed4d to dea9dea Compare June 17, 2026 04:12

sunchao changed the title ~~[SPARK-37019][SQL][FOLLOWUP] Defer ArrayAggregate accumulator widening~~ [SPARK-37019][SQL][FOLLOWUP] Resolve nested higher-order function arguments first Jun 17, 2026

sunchao marked this pull request as ready for review June 19, 2026 16:47

viirya approved these changes Jun 19, 2026

peter-toth approved these changes Jun 19, 2026

[SPARK-37019][SQL][FOLLOWUP] Clarify higher-order function resolution order (85cf90c)

sunchao closed this in 3fa2bea Jun 19, 2026

sunchao wants to merge 2 commits into apache:master from sunchao:dev/chao/codex/hof-gate-analyzer-behavior-oss
