[SPARK-42751][PS] Support str.findall with capture groups by nguyen1hc · Pull Request #56533 · apache/spark

nguyen1hc · 2026-06-16T00:47:26Z

What changes were proposed in this pull request?

This PR updates pandas-on-Spark Series.str.findall to support regex patterns with multiple capture groups.

When the regex pattern has more than one capture group, pandas returns a list of tuples. The existing pandas UDF declares an array-of-strings return type, so Arrow conversion can fail on those tuples. For this case, the PR declares an array of arrays of strings as the return type and converts each tuple match into a list.

This PR also adds regression test coverage for Series.str.findall with multiple capture groups.

Why are the changes needed?

Series.str.findall currently cannot handle regex patterns with multiple capture groups, for which pandas returns tuples, for example:

ps.Series(["abc-123 def-456"]).str.findall("([a-z]+)-([0-9]+)")

This should return the captured groups instead of failing during conversion.
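For reference, the tuple shape comes from Python's own re.findall, which pandas documents str.findall as applying to each element. The following is a minimal sketch using only the standard library on the same input string; it illustrates the shapes involved and is not the PR's code.

```python
import re

text = "abc-123 def-456"
pattern = "([a-z]+)-([0-9]+)"

# With two capture groups, re.findall returns one tuple per match.
matches = re.findall(pattern, text)
print(matches)  # [('abc', '123'), ('def', '456')]

# Converting each tuple to a list gives the nested-array shape
# described in this PR (an array of arrays of strings).
nested = [list(m) for m in matches]
print(nested)  # [['abc', '123'], ['def', '456']]
```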

Does this PR introduce any user-facing change?

Yes.

Previously, Series.str.findall could fail for regex patterns with multiple capture groups. With this change, it returns nested arrays representing the captured groups.

How was this patch tested?

Added a regression test in SeriesStringOpsAdvTests.test_string_findall.

Also ran:

python -m py_compile python/pyspark/pandas/strings.py python/pyspark/pandas/tests/series/test_string_ops_adv.py
git diff --check
PYTHON_EXECUTABLE=python ./dev/lint-python --ruff

Note: the local lint-python --ruff wrapper skipped ruff because the ruff command was not installed. The focused PySpark unittest could not start locally because Spark jars have not been built in this checkout.

Was this patch authored or co-authored using generative AI tooling?

No.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds support for Series.str.findall patterns with multiple capture groups in pandas-on-Spark by returning nested arrays and validating behavior with a new test.

Changes:

Added a regression test ensuring multi-group findall results match pandas semantics (with list-of-lists normalization).
Updated Series.str.findall to return array<array<string>> for patterns with multiple capture groups.
Added group-count detection via regex compilation to select the appropriate Spark return schema.
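The group-count detection mentioned above can be sketched with the standard library: a compiled pattern exposes its number of capture groups as the groups attribute. The helper name findall_return_schema and the use of type strings are assumptions made for illustration; the PR's actual code is not shown in full here.

```python
import re


def findall_return_schema(pat: str, flags: int = 0) -> str:
    """Hypothetical helper: pick a return type string from the group count."""
    # re.Pattern.groups counts capturing groups only; (?:...) is not counted.
    num_groups = re.compile(pat, flags).groups
    # More than one group means each match is a tuple, so a nested array is needed.
    return "array<array<string>>" if num_groups > 1 else "array<string>"


print(findall_return_schema("wh.*"))               # array<string>
print(findall_return_schema("([a-z]+)"))           # array<string>
print(findall_return_schema("([a-z]+)-([0-9]+)"))  # array<array<string>>
```

Patterns with zero or one capture group yield plain strings per match, so only the multi-group case needs the nested type.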

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
python/pyspark/pandas/tests/series/test_string_ops_adv.py	Adds a test covering multi-capture-group behavior for `str.findall`.
python/pyspark/pandas/strings.py	Updates `str.findall` UDF return type and normalizes multi-group matches into nested arrays.

Comments suppressed due to low confidence (1)

python/pyspark/pandas/strings.py:1

For patterns with multiple capture groups, this implementation now returns a nested array (and the UDF converts pandas' tuple matches into lists). This is a user-visible behavioral difference from pandas (which yields tuples) and should be documented in the findall docstring (and/or user-facing docs) to avoid surprises when moving code between pandas and pandas-on-Spark.
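One concrete way the tuple-versus-list difference can surface is hashability: tuples can go into a set or serve as dict keys, while lists cannot. The sketch below uses plain re.findall to stand in for the pandas result and a list conversion to stand in for the nested-array result; the variable names are illustrative only.

```python
import re

pattern = "([a-z]+)-([0-9]+)"
text = "abc-123 def-456"

tuple_matches = re.findall(pattern, text)        # pandas-style: list of tuples
list_matches = [list(m) for m in tuple_matches]  # nested-array style: list of lists

# Tuples are hashable, so de-duplicating matches with a set works.
unique_pairs = set(tuple_matches)

# Lists are not hashable, so the same call raises TypeError.
try:
    set(list_matches)
    lists_hashable = True
except TypeError:
    lists_hashable = False

# Unpacking each match works the same way for both shapes.
first_word, first_number = list_matches[0]
```

Code that only iterates over or unpacks matches should behave the same in both cases; code that hashes or compares matches against tuples may need an explicit conversion.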


            ret = s.str.findall(pat, flags)
+            if num_groups > 1:
+                ret = ret.map(
+                    lambda matches: [list(match) for match in matches]
+                    if isinstance(matches, list)
+                    else matches
+                )
            if str_dtype:
                # ArrayType does not support NaN, so replace with None
                ret = ret.replace(np.nan, None)


nguyen1hc · 2026-06-16T00:52:06Z

Hi, I have opened a pull request for this issue:
#56533

The PR updates pandas-on-Spark Series.str.findall to support regex patterns with multiple capture groups and adds focused regression test coverage.

uros-b · 2026-06-17T17:41:08Z

                lambda x: x.str.findall("wh.*", flags=re.IGNORECASE), self.pser, ignore_null=True
            )

+        pser = pd.Series(["abc-123 def-456", "no match"])


This test doesn't cover the null + multi-group path. It only exercises list rows; "no match" returns [], not NaN. The isinstance guard's NaN branch is therefore untested. Please consider adding a None element to the series to lock in that behavior.
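The branch the reviewer refers to can be exercised in isolation. The sketch below restates the lambda from the diff as a named function (normalize_matches is a hypothetical name) and feeds it the row kinds in question: a list of tuples, an empty list, and missing values. It assumes a missing input row reaches the function as None or NaN rather than as a list.

```python
import math


def normalize_matches(matches):
    """Hypothetical named version of the lambda in the diff."""
    # List rows: convert each tuple match into a list.
    if isinstance(matches, list):
        return [list(match) for match in matches]
    # Non-list rows (None or NaN for missing input) pass through unchanged.
    return matches


rows = [
    [("abc", "123"), ("def", "456")],  # two matches
    [],                                # "no match" yields an empty list
    None,                              # missing value
    float("nan"),                      # missing value represented as NaN
]
normalized = [normalize_matches(row) for row in rows]
print(normalized[:3])  # [[['abc', '123'], ['def', '456']], [], None]
```

Only the last two rows take the else branch, which is why a series made of matching and non-matching strings alone leaves it untested.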

Copilot AI review requested due to automatic review settings June 16, 2026 00:47

Copilot AI reviewed Jun 16, 2026


[SPARK-42751][PS] Support findall with capture groups (92a3413)

nguyen1hc force-pushed the SPARK-42751-ps-str-findall-capture-groups branch from b9d212a to d872465 Compare June 16, 2026 00:55

[SPARK-42751][PS][DOCS] Document findall capture group output (d3fb798)

nguyen1hc force-pushed the SPARK-42751-ps-str-findall-capture-groups branch from d872465 to d3fb798 Compare June 16, 2026 00:57

nguyen1hc added 2 commits June 16, 2026 12:43

[SPARK-42751][PS][TESTS] Fix findall capture group test assertion (4c84a24)

[SPARK-42751][PS][TESTS] Normalize findall capture group assertion (ef9a01b)

nguyen1hc wants to merge 4 commits into apache:master from nguyen1hc:SPARK-42751-ps-str-findall-capture-groups

nguyen1hc commented Jun 16, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

nguyen1hc commented Jun 16, 2026

Uh oh!

uros-b Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

nguyen1hc commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

nguyen1hc commented Jun 16, 2026

Uh oh!

uros-b Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nguyen1hc commented Jun 16, 2026 •

edited

Loading