[SPARK-37019][SQL][FOLLOWUP] Resolve nested higher-order function arguments first#56507
[SPARK-37019][SQL][FOLLOWUP] Resolve nested higher-order function arguments first#56507sunchao wants to merge 2 commits into
Conversation
f2fed4d to
dea9dea
Compare
viirya
left a comment
There was a problem hiding this comment.
Reviewed the diff against the full ResolveLambdaVariables rule, the HigherOrderFunction trait and its subclasses, the subquery validator, and the single-pass HigherOrderFunctionResolver. The fix is correct and minimal, and it brings the legacy analyzer in line with the single-pass resolver. A few notes:
This aligns the legacy analyzer with the single-pass resolver — worth stating in the description. HigherOrderFunctionResolver.resolve (resolver/HigherOrderFunctionResolver.scala:118-141) already does exactly this sequence: resolve arguments -> withNewChildren(resolvedArguments ++ functions) -> validate subquery -> bind -> resolve functions. So the legacy ResolveLambdaVariables was the outlier, and the resolvedArguments ++ functions idiom is already proven there. That consistency is a stronger justification than "intentionally narrow," and reusing the established pattern is the right call.
The change depends on an undocumented invariant: children == arguments ++ functions. Both withNewChildren(... ++ h.functions) calls assume this ordering. It holds for every current subclass (SimpleHigherOrderFunction/BinaryLike, ArrayAggregate/QuaternaryLike, ZipWith/MapZipWith/TernaryLike), but it's not stated anywhere. A one-line comment at the withNewChildren call would protect future HOFs from a subtle child-reordering bug.
Redundant re-resolution in the bind branch. After bind(...), .mapChildren(resolve(...)) re-walks the just-resolved arguments. It's harmless — they short-circuit on case _ if e.resolved => e — but a brief comment clarifying that the mapChildren is there to resolve the now-bound lambda bodies (not the arguments) would save the next reader a double-take.
JIRA tag looks misattributed. SPARK-37019 is "Add codegen support to array higher-order functions," which is unrelated to analyzer resolution ordering. Consider filing a dedicated ticket for this fix.
Test coverage. The regression test reproduces the production-shaped failure well and asserts the concrete result rather than just "analysis succeeds" — good. Since the fix is on the generic HigherOrderFunction path, a direct unit test in ResolveLambdaVariablesSuite (and ideally a second HOF shape beyond ArrayAggregate) would lock the behavior in closer to the rule. Not a blocker.
Behavior on the deferral path is equivalent to the old mapChildren fallthrough, and resolving arguments before checkArgumentDataTypes() directly removes the Invalid call to dataType on unresolved object failure. LGTM after the (optional) JIRA fix and the two clarifying comments.
peter-toth
left a comment
There was a problem hiding this comment.
+1 to @viirya's review, especially the undocumented children == arguments ++ functions invariant. I independently reached the same point: it holds for every current HOF (SimpleHigherOrderFunction/BinaryLike, ArrayAggregate/QuaternaryLike, ZipWith/MapZipWith/TernaryLike) but the trait doesn't enforce it, so the clarifying comment is worth adding.
|
Thanks! Added comments to clarify in the related code |
…uments first ### What changes were proposed in this pull request? Resolve each higher-order function's argument expressions before checking their data types and binding its lambda functions. The analyzer now follows this order: 1. Resolve the argument expressions using the current outer lambda scope. 2. Rebuild the higher-order function with those resolved arguments. 3. If the arguments are ready and valid, bind the lambda functions immediately. 4. Otherwise, resolve only the function expressions and defer binding. This matches the established sequence in the single-pass `HigherOrderFunctionResolver`, so both analyzer paths now resolve arguments before binding lambdas. This is intentionally narrow. It does not change `ArrayAggregate` accumulator types, casts, code generation, or runtime execution. The PR also adds a focused regression test for the nested `transform` / `filter` / `aggregate` expression that exposed the bug. ### Why are the changes needed? `ResolveLambdaVariables` previously bound a higher-order function only when its arguments were already resolved at the start of the visit. If nested argument expressions became resolved during that visit, Spark still walked and rebuilt the remaining expression tree without binding the current lambda functions. For complex nested types, that ordering could inspect a field extraction whose lambda variable was still unresolved and fail analysis with: ``` Invalid call to dataType on unresolved object ``` In short: ``` Before: check readiness -> resolve nested arguments -> wait for another analyzer pass After: resolve nested arguments -> check readiness -> bind lambdas in the same pass ``` ### Does this PR introduce _any_ user-facing change? Yes. Valid queries with nested higher-order functions that previously failed during analysis can now be analyzed and executed. There is no public API, configuration, or intended runtime behavior change for queries that already worked. ### How was this patch tested? - Added `ArrayAggregate resolves nested lambda arguments before inspecting their types` to reproduce the production-shaped failure and verify the result. - `ResolveLambdaVariablesSuite`: 6 tests passed. - `DataFrameComplexTypeSuite`: 18 tests passed. - Catalyst and SQL test Scalastyle: 0 errors and 0 warnings. - `git diff --check` passed. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Codex (GPT-5) Closes #56507 from sunchao/dev/chao/codex/hof-gate-analyzer-behavior-oss. Authored-by: Chao Sun <chao@openai.com> Signed-off-by: Chao Sun <chao@openai.com> (cherry picked from commit 3fa2bea) Signed-off-by: Chao Sun <chao@openai.com>
|
Merged to |
What changes were proposed in this pull request?
Resolve each higher-order function's argument expressions before checking their data types and binding its lambda functions.
The analyzer now follows this order:
This matches the established sequence in the single-pass
HigherOrderFunctionResolver, so both analyzer paths now resolve arguments before binding lambdas.This is intentionally narrow. It does not change
ArrayAggregateaccumulator types, casts, code generation, or runtime execution.The PR also adds a focused regression test for the nested
transform/filter/aggregateexpression that exposed the bug.Why are the changes needed?
ResolveLambdaVariablespreviously bound a higher-order function only when its arguments were already resolved at the start of the visit. If nested argument expressions became resolved during that visit, Spark still walked and rebuilt the remaining expression tree without binding the current lambda functions.For complex nested types, that ordering could inspect a field extraction whose lambda variable was still unresolved and fail analysis with:
In short:
Does this PR introduce any user-facing change?
Yes. Valid queries with nested higher-order functions that previously failed during analysis can now be analyzed and executed.
There is no public API, configuration, or intended runtime behavior change for queries that already worked.
How was this patch tested?
ArrayAggregate resolves nested lambda arguments before inspecting their typesto reproduce the production-shaped failure and verify the result.ResolveLambdaVariablesSuite: 6 tests passed.DataFrameComplexTypeSuite: 18 tests passed.git diff --checkpassed.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Codex (GPT-5)