[SPARK-42751][PS] Support str.findall with capture groups#56533
[SPARK-42751][PS] Support str.findall with capture groups#56533nguyen1hc wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds support for Series.str.findall patterns with multiple capture groups in pandas-on-Spark by returning nested arrays and validating behavior with a new test.
Changes:
- Added a regression test ensuring multi-group
findallresults match pandas semantics (with list-of-lists normalization). - Updated
Series.str.findallto returnarray<array<string>>for patterns with multiple capture groups. - Added group-count detection via regex compilation to select the appropriate Spark return schema.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| python/pyspark/pandas/tests/series/test_string_ops_adv.py | Adds a test covering multi-capture-group behavior for str.findall. |
| python/pyspark/pandas/strings.py | Updates str.findall UDF return type and normalizes multi-group matches into nested arrays. |
Comments suppressed due to low confidence (1)
python/pyspark/pandas/strings.py:1
- For patterns with multiple capture groups, this implementation now returns a nested array (and the UDF converts pandas' tuple matches into lists). This is a user-visible behavioral difference from pandas (which yields tuples) and should be documented in the
findalldocstring (and/or user-facing docs) to avoid surprises when moving code between pandas and pandas-on-Spark.
#
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ret = s.str.findall(pat, flags) | ||
| if num_groups > 1: | ||
| ret = ret.map( | ||
| lambda matches: [list(match) for match in matches] | ||
| if isinstance(matches, list) | ||
| else matches | ||
| ) | ||
| if str_dtype: | ||
| # ArrayType does not support NaN, so replace with None | ||
| ret = ret.replace(np.nan, None) |
|
Hi, I have opened a pull request for this issue: The PR updates pandas-on-Spark Series.str.findall to support regex patterns with multiple capture groups and adds focused regression test coverage. |
b9d212a to
d872465
Compare
d872465 to
d3fb798
Compare
| lambda x: x.str.findall("wh.*", flags=re.IGNORECASE), self.pser, ignore_null=True | ||
| ) | ||
|
|
||
| pser = pd.Series(["abc-123 def-456", "no match"]) |
There was a problem hiding this comment.
This test doesn't cover the null + multi-group path. It only exercises list rows; "no match" returns [], not NaN. The isinstance guard's NaN branch is therefore untested. Please consider adding a None element to the series to lock in that behavior.
What changes were proposed in this pull request?
This PR updates pandas-on-Spark
Series.str.findallto support regex patterns with multiple capture groups.When the regex pattern has more than one capture group, pandas returns a list of tuples. The existing pandas UDF return type expects an array of strings, which can fail during Arrow conversion. This PR returns an array of arrays of strings for this case and converts each tuple match into a list.
This PR also adds regression test coverage for
Series.str.findallwith multiple capture groups.Why are the changes needed?
Series.str.findallcurrently cannot handle regex patterns that return tuples, for example:This should return the captured groups instead of failing during conversion.
Does this PR introduce any user-facing change?
Yes.
Previously,
Series.str.findallcould fail for regex patterns with multiple capture groups. With this change, it returns nested arrays representing the captured groups.How was this patch tested?
Added a regression test in
SeriesStringOpsAdvTests.test_string_findall.Also ran:
Note: the local
lint-python --ruffwrapper skipped ruff because theruffcommand was not installed. The focused PySpark unittest could not start locally because Spark jars have not been built in this checkout.Was this patch authored or co-authored using generative AI tooling?
No.