ARROW-9140: [R] Zero-copy Arrow to R where possible#10445
ARROW-9140: [R] Zero-copy Arrow to R where possible#10445romainfrancois wants to merge 22 commits intoapache:masterfrom
Conversation
|
An interesting side benefit from this is that we don't get altrep R vectors from slices: library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features
x <- 1:1e3+ 1L
v <- Array$create(x)
x1 <- v$as_vector()
.Internal(inspect(x1))
#> @7face1b4ba18 13 INTSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, int32, NONULL> (len=1000, ptr=0x7facdc27e998)
x2 <- v$Slice(500)$as_vector()
.Internal(inspect(x2))
#> @7facde2bfd80 13 INTSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, int32, NONULL> (len=500, ptr=0x7facdb0a2198)Created on 2021-06-08 by the reprex package (v2.0.0) |
f2db09c to
320b33f
Compare
|
@nealrichardson I've turned off making INT64 arrays with no nulls into altrep bit64::integer64 vectors (i.e. reusing the payload of the array as the data of the underlying double array in R ...) because of the auto downcasting to int vectors when it fits: library(arrow, warn.conflicts = FALSE)
i64 <- Array$create(bit64::as.integer64(1:10))
.Internal(inspect(as.vector(i64)))
#> @7ff0b2099c38 13 INTSXP g0c4 [REF(2)] (len=10, tl=0) 1,2,3,4,5,...Created on 2021-06-08 by the reprex package (v2.0.0) I can altogether remove the int64 altrep code if needed ... ? |
|
This is super cool. We should get some benchmarking on this. Tangentially related, should we have an |
|
Interesting, we could make that opt-in. For now it's all happening here so would be easy enough to make conditional: // [[arrow::export]]
SEXP Array__as_vector(const std::shared_ptr<arrow::Array>& array) {
auto type = array->type();
#if defined(HAS_ALTREP)
if (array->null_count() == 0) {
switch (type->id()) {
case arrow::Type::DOUBLE:
return arrow::r::Make_array_nonull_dbl_vector(array);
case arrow::Type::INT32:
return arrow::r::Make_array_nonull_int_vector(array);
// case arrow::Type::INT64:
// return arrow::r::Make_array_nonull_int64_vector(array);
default:
break;
}
}
#endif
return arrow::r::ArrayVector__as_vector(array->length(), type, {array});
} |
|
Cool, so that would just be |
|
I've used I've put
|
SGTM
That surprises me, I'd expect that checking an R boolean would be faster than counting nulls in a (potentially long) array.
I think defaulting to ALTREP on is good (particularly if/when we quantify its benefits) but leaving an escape hatch in case of error is nice. |
pitrou
left a comment
There was a problem hiding this comment.
Not an R user, but here are some comments.
|
Ok, I've run some benchmarks on this branch and I'm seeing a huge speed up for floats + integers with It might be out of scope for this PR, but chunked arrays don't see a similar speed up (which makes sense given they call library(arrow, warn.conflicts = FALSE)
x <- 1:1e3+ 1L
v <- Array$create(x)
x1 <- v$as_vector()
.Internal(inspect(x1))
#> @7f9077f5a1a8 13 INTSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, int32, NONULL> (len=1000, ptr=0x7f90975a9a08)
v_chunked <- ChunkedArray$create(x)
x2 <- v_chunked$as_vector()
.Internal(inspect(x2))
#> @7f908312c000 13 INTSXP g0c7 [REF(2)] (len=1000, tl=0) 2,3,4,5,6,...Created on 2021-06-10 by the reprex package (v2.0.0) arrowbench results (using the new benchmarks in voltrondata-labs/arrowbench#28): |
… no nulls into an altrep R numeric vector w/ no copy
…e `arrow.use_altrep` option (with default to TRUE: opt out)
50a88df to
8721259
Compare
- const std::shared_ptr<Array>& Get()
|
@jonkeane we might be able to deal with a fake I believe this can be a follow up pull request with some altrep vectors to wrap chunked arrays, and arrays with nulls. Implementation will be different, e.g. the |
|
That sounds good. The faked With csv reading, we frequently (with sufficiently large csvs) get multiple chunks, but I've tried reading parquet a few times to get a table with chunked arrays where there are more than one chunk and haven't so far seen it happen. I need to dig more into the parquet reading source code to see if it is ever possible to get more-than-1-chunks- |
This comment has been minimized.
This comment has been minimized.
library(arrow, warn.conflicts = FALSE)
#> See arrow_info() for available features
c_int <- ChunkedArray$create(1:1000)
c_dbl <- ChunkedArray$create(as.numeric(1:1000))
c_int$num_chunks
#> [1] 1
c_dbl$num_chunks
#> [1] 1
.Internal(inspect(as.vector(c_int)))
#> @7fc1314ce2a8 13 INTSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, int32, NONULL> (len=1000, ptr=0x7fc1274a0408)
.Internal(inspect(as.vector(c_dbl)))
#> @7fc131528a90 14 REALSXP g0c0 [REF(65535)] std::shared_ptr<arrow::Array, double, NONULL> (len=1000, ptr=0x7fc12b2b5178)Created on 2021-06-18 by the reprex package (v2.0.0.9000) I think I'll deal with this https://issues.apache.org/jira/browse/ARROW-13114 here as well, and then do a follow up for using RTasks (i.e. https://issues.apache.org/jira/browse/ARROW-13113). |
|
Actually, I now think it is better to do the RTasks thing first https://issues.apache.org/jira/browse/ARROW-13113 and then rework |
|
I'm not sure why we get this: https://github.com/apache/arrow/pull/10445/checks?check_run_id=2858594249#step:8:2641 |
Looks like linking the arrow binding failed. There's a cpp11 symbol missing ... which is odd because it's a template, so it ought to be inlined everywhere |
bkietz
left a comment
There was a problem hiding this comment.
This is neat, thanks for doing it. Some minor comments:
| } | ||
|
|
||
| SEXP MakeInt32ArrayNoNull(const std::shared_ptr<Array>& array) { | ||
| return Int32ArrayNoNull::Make(array); |
There was a problem hiding this comment.
Edge case worthy of mention in a comment somewhere: IIUC this doesn't check for instances of NA_integer_ in array and these will be considered null by R even though is_altrep_int_nonull(as.vector(array)) (and we report no_NA = 1)
There was a problem hiding this comment.
I'll think about where to mention this, but this isn't limited to these altrep variants. any int32 array that happens to have a non null NA_integer_ will be mismanaged by R with this sentinel approach.
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
|
@nealrichardson I believe this is ready to go |
This makes altrep R vectors of type
INTSXPorREALSXPfromarrow::Arrayof typeInt32Type/DoubleTypethat don't have any nulls:the altrep vector holds an external pointer so that the
Arraystays around, and its payload is shared. The R vector is marked as not mutable.create a “big” arrow Array with no nulls (just for testing purposes)
turn into R vector, using altrep, sharing the payload
verify it’s an altrep with the inspect method
it’s marked as not mutable so check that modify -> duplicate
timings for double vector
when a copy is needed, the data is copied entirely:
timings for integer vector
Created on 2021-06-08 by the reprex package (v2.0.0)