ARROW-10557: [C++] Add scalar string slicing/substring extract kernel #9000
maartenbreddels wants to merge 3 commits into apache:master
@pitrou this is ready for review
You could use regular for loops instead of the pytest parametrization "magic". It would probably remove a lot of overhead.
I think I'll do a full review once the (lr)trim PR is merged; it will be easier.
👋 checking back in here. The PR on which this was based has merged, so this can be rebased now.
@pitrou this is ready for review; the failure seems unrelated (minio). Sorry for taking so long to get back to this. I hope we can get this, and my other open string PRs, sorted out soon!
pitrou
left a comment
Thank you very much @maartenbreddels. Here are some comments.
Failure seems unrelated:
@pitrou @maartenbreddels is this ready to merge? Trying to mop up issues for 4.0
@nealrichardson I need to review this again, sorry.
pitrou
left a comment
Just two more questions. Looks good otherwise!
Hmm... for step < 0, wouldn't it be more logical to first compute end_sliced? Presumably, you then can pass end_sliced as one of the boundaries for computing begin_sliced?
Just a question btw, you don't need to act on this if you think it's unnecessary.
@maartenbreddels Do you want to update this PR?
I'm working on updating this.
Thanks for picking this up @pitrou, I could not find the time to update it!
Needs a rebase after #8621 is merged
I totally agree with https://github.com/python/cpython/blob/c9bc290dd6e3994a4ead2a224178bcba86f0c0e4/Objects/sliceobject.c#L252
This was tricky to get right; the main difficulty is in manually dealing with reverse iterators. I therefore added extra guardrails by having the Python unit tests cover a lot of cases. All edge cases they detected have been translated to the C++ unit test suite, so we could trim the Python tests to reduce pytest execution cost (I added 1 second).
Slicing is based on Python, with `[start, stop)` inclusive/exclusive semantics, where an index refers to a codeunit (like Python apparently, though this is badly documented), and negative indices start counting from the right. `step != 0` is supported, like Python.

The only thing we cannot support easily is something like reversing a string. In Python one can do `s[::-1]` or `s[-1::-1]`, but we don't support empty values with the Option machinery (we model this as a C int64). To mimic this, we can do `pc.utf8_slice_codeunits(ar, start=-1, end=-sys.maxsize, step=-1)` (i.e. a very large negative value).

For instance, libraries such as Pandas and Vaex can do something like that, confirmed to be working by modifying the unittest like this:

So libraries using this can implement the full Python behavior with this workaround.
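As a rough illustration of the workaround, the sketch below maps a Python slice whose fields may be `None` onto plain integers, using a very large positive or negative sentinel for an open end. The helper name `slice_to_int_args` is hypothetical and not part of this PR; the check runs against ordinary Python `str` slicing only, and assumes the kernel follows the same clamping rules as described above.

```python
import sys


def slice_to_int_args(start=None, stop=None, step=None):
    """Map Python slice fields (possibly None) to plain integers.

    Hypothetical helper: an open end is replaced by a very large positive
    or negative value, which Python-style slicing clamps to the string
    boundaries.
    """
    if step is None:
        step = 1
    if step == 0:
        raise ValueError("slice step cannot be zero")
    if step > 0:
        # Forward iteration: default to the whole string, left to right.
        start = 0 if start is None else start
        stop = sys.maxsize if stop is None else stop
    else:
        # Backward iteration: start at the last element and run past the
        # first one by using a very large negative stop.
        start = -1 if start is None else start
        stop = -sys.maxsize if stop is None else stop
    return start, stop, step


# s[::-1] expressed with integers only.
start, stop, step = slice_to_int_args(step=-1)
print("abc"[start:stop:step])  # prints "cba"
```

A library could then forward the three integers to the kernel, for example as the `start`, `end` and `step` arguments of the `pc.utf8_slice_codeunits` call quoted above; that call is not exercised here.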