ARROW-10195: [C++] Add string struct extract kernel using re2#8459
ARROW-10195: [C++] Add string struct extract kernel using re2#8459maartenbreddels wants to merge 6 commits intoapache:masterfrom
Conversation
|
should re2 be available on all platform? appveyor for instance gives: |
|
@maartenbreddels Can you rebase this? That would be helpful for me setting up |
9041521 to
5372dc3
Compare
|
I cleaned it up a bit, but it might fail running the test, hope this is enough for you for now. |
5372dc3 to
f66bc36
Compare
|
There are two issues with this PR: empty slicesIn #8728 https://github.com/apache/arrow/pull/8728/files#diff-7771ecc138c4ecc2dc8498affe04f5b7f182c4b77b18a512c3dd07a82d45aa3dR116 was added, which adds an empty slice to the test, which breaks my test (which is good). However, my kernel will not be called, meaning i cannot set the output's type (which depends on the input regex). Shall I open a JIRA for that? The error is: (showing the struct mismatch) ChunkedArray failsCommenting out the empty slice test, the next failure is with I fail to see a difference in this output (maybe a null value in a child?), and this is difficult to debug in gdb. Maybe an opportunity to improve the printing of differences (since I fail to see any). |
b500ed4 to
f1bd901
Compare
d4608a9 to
356c300
Compare
f1bd901 to
43773ea
Compare
|
@maartenbreddels what's the status of this PR? Ready for review, or are there still open issues? (you listed some above at #8459 (comment)) |
|
This needs rebase now that #8468 has merged. |
1357f98 to
58cbcb7
Compare
|
Rebased, but there are two issues left:
|
|
@maartenbreddels Will you need help here? |
|
Yes, that would be appriciated, the null values I find difficult to debug, and the second issue I'm not sure that is good solution. |
|
Ok, I'm looking at this. |
|
Naming nit @jorisvandenbossche @jonkeane @ianmcook . |
f565700 to
9b1f6e0
Compare
|
@maartenbreddels For the record:
I am not sure what was causing this, but my small rewrite of the kernel fixed it.
This was solved by defining a |
In pandas, this is called "extract" (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html), so from that point of view this is a good name. The |
|
One thing I noted while quickly trying it out: When having an optional group like the above, I would maybe expect a null instead of an empty string in the first child? |
Agreed, but unfortunately re2 doesn't give us that information... |
cpp/src/arrow/compute/api_scalar.h
Outdated
There was a problem hiding this comment.
Naming nit: I would maybe call this "pattern" instead of "regex"
|
Thanks for finishing this Antoine!
Good to know, thanks. |
|
@maartenbreddels You're welcome :-) |
The second commit adds re2 to the linked libraries. @xhochy how should this be done, should I open a separate issue for this?