[SPARK-42751][PS] Support str.findall with capture groups #56533

Open
nguyen1hc wants to merge 4 commits into apache:master from nguyen1hc:SPARK-42751-ps-str-findall-capture-groups

@nguyen1hc nguyen1hc commented Jun 16, 2026

What changes were proposed in this pull request?

This PR updates pandas-on-Spark Series.str.findall to support regex patterns with multiple capture groups.

When the regex pattern has more than one capture group, pandas returns a list of tuples. The existing pandas UDF return type expects an array of strings, which can fail during Arrow conversion. This PR returns an array of arrays of strings for this case and converts each tuple match into a list.
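The tuple behavior can be reproduced with the standard library alone. The sketch below assumes that pandas' `str.findall` behaves like `re.findall` applied to each element; `tuples_to_lists` is a hypothetical helper name used for illustration, not code from this PR.

```python
import re


def tuples_to_lists(matches):
    """Convert each tuple match into a list (hypothetical helper)."""
    return [list(m) for m in matches]


pattern = r"([a-z]+)-([0-9]+)"

# With two capture groups, each match is a tuple of the captured strings.
raw = re.findall(pattern, "abc-123 def-456")

# Lists of lists fit an array<array<string>> return type; tuples do not.
nested = tuples_to_lists(raw)
```

Here `raw` holds tuples while `nested` holds plain lists with the same contents, which is the shape an array-of-arrays return type needs.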

This PR also adds regression test coverage for Series.str.findall with multiple capture groups.

Why are the changes needed?

Series.str.findall currently cannot handle regex patterns with more than one capture group, whose matches pandas returns as tuples, for example:

ps.Series(["abc-123 def-456"]).str.findall("([a-z]+)-([0-9]+)")

This should return the captured groups instead of failing during conversion.

Does this PR introduce any user-facing change?

Yes.

Previously, Series.str.findall could fail for regex patterns with multiple capture groups. With this change, it returns nested arrays representing the captured groups.

How was this patch tested?

Added a regression test in SeriesStringOpsAdvTests.test_string_findall.

Also ran:

python -m py_compile python/pyspark/pandas/strings.py python/pyspark/pandas/tests/series/test_string_ops_adv.py
git diff --check
PYTHON_EXECUTABLE=python ./dev/lint-python --ruff

Note: the local lint-python --ruff wrapper skipped ruff because the ruff command was not installed. The focused PySpark unittest could not start locally because the Spark jars had not been built in this checkout.

Was this patch authored or co-authored using generative AI tooling?

No.

Copilot AI review requested due to automatic review settings June 16, 2026 00:47

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds support for Series.str.findall patterns with multiple capture groups in pandas-on-Spark by returning nested arrays and validating behavior with a new test.

Changes:

  • Added a regression test ensuring multi-group findall results match pandas semantics (with list-of-lists normalization).
  • Updated Series.str.findall to return array<array<string>> for patterns with multiple capture groups.
  • Added group-count detection via regex compilation to select the appropriate Spark return schema.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| python/pyspark/pandas/tests/series/test_string_ops_adv.py | Adds a test covering multi-capture-group behavior for str.findall. |
| python/pyspark/pandas/strings.py | Updates str.findall UDF return type and normalizes multi-group matches into nested arrays. |
Comments suppressed due to low confidence (1)

python/pyspark/pandas/strings.py:1

  • For patterns with multiple capture groups, this implementation now returns a nested array (and the UDF converts pandas' tuple matches into lists). This is a user-visible behavioral difference from pandas (which yields tuples) and should be documented in the findall docstring (and/or user-facing docs) to avoid surprises when moving code between pandas and pandas-on-Spark.

Comment on lines 1190 to 1199
    ret = s.str.findall(pat, flags)
    if num_groups > 1:
        ret = ret.map(
            lambda matches: [list(match) for match in matches]
            if isinstance(matches, list)
            else matches
        )
    if str_dtype:
        # ArrayType does not support NaN, so replace with None
        ret = ret.replace(np.nan, None)
@nguyen1hc

Author

Hi, I have opened a pull request for this issue:
#56533

The PR updates pandas-on-Spark Series.str.findall to support regex patterns with multiple capture groups and adds focused regression test coverage.

@nguyen1hc nguyen1hc force-pushed the SPARK-42751-ps-str-findall-capture-groups branch from b9d212a to d872465 Compare June 16, 2026 00:55
@nguyen1hc nguyen1hc force-pushed the SPARK-42751-ps-str-findall-capture-groups branch from d872465 to d3fb798 Compare June 16, 2026 00:57
lambda x: x.str.findall("wh.*", flags=re.IGNORECASE), self.pser, ignore_null=True
)

pser = pd.Series(["abc-123 def-456", "no match"])

Member

This test doesn't cover the null + multi-group path. It only exercises list rows; "no match" returns [], not NaN. The isinstance guard's NaN branch is therefore untested. Please consider adding a None element to the series to lock in that behavior.
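The untested branch can be illustrated without Spark. This sketch assumes a missing value reaches the normalization step as a float NaN, as the `np.nan` replacement in the snippet under review suggests; `normalize` is a hypothetical stand-in for the lambda in the PR, and `re.findall` stands in for the per-element pandas call.

```python
import math
import re


def normalize(matches):
    # Mirrors the isinstance guard: a list of tuples becomes a list of lists,
    # and anything that is not a list (e.g. NaN) passes through unchanged.
    return [list(m) for m in matches] if isinstance(matches, list) else matches


pat = r"([a-z]+)-([0-9]+)"
rows = ["abc-123 def-456", "no match", None]

# Simulate per-element findall, with NaN standing in for a missing value.
found = [re.findall(pat, r) if r is not None else float("nan") for r in rows]
result = [normalize(m) for m in found]
```

The second row exercises only the list branch (an empty list), while the third row is the one that reaches the pass-through branch the comment asks to cover.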
