Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #4187     +/-   ##
=========================================
  Coverage   99.61%   99.61%
=========================================
  Files          72       72
  Lines       13916    14046    +130
=========================================
+ Hits        13862    13992    +130
  Misses         54       54
src/fifelse.c
Outdated
error("Function 'topn' is not built for large value of 'n'. The algorithm is made for small values. Please prefer the 'order' if you want to proceed with such large value.");
}
if (len0 > len1) {
  error("'n' cannot be larger than length of 'vec'.");
A very likely use case is topn by group. I think most people familiar with top_n would expect this to work:
set.seed(123L)
dt = data.table(x = rnorm(10), grp = sample(5L, 10L, replace = T))
dt[dt[, .I[topn(x, 2L)], by = grp]$V1, ]
##Error: 'n' cannot be larger than length of 'vec'.
so you suggest we return something like below?
> dt[dt[, .I[order(x)[1:2]], by = grp]$V1, ]
x grp
1: -0.6868529 1
2: -0.5604756 1
3: -0.2301775 4
4: NA NA
5: 0.1292877 5
6: NA NA
7: -0.4456620 3
8: 1.7150650 3
9: -1.2650612 2
10: 0.4609162 2
My preference would be to return seq_len(n) and maybe have a warning that there are fewer records than n.
dt[dt[, .I[order(x)[1:min(.N,2)]], by = grp]$V1, ]
x grp
<num> <int>
1: -0.6868529 1
2: -0.5604756 1
3: -0.2301775 4
4: 0.1292877 5
5: -0.4456620 3
6: 1.7150650 3
7: -1.2650612 2
8: 0.4609162 2
P.S., this PR is great.
set.seed(123)
dt = data.table(x = rnorm(1e7), grp = sample(100, 1e7, TRUE))
microbenchmark::microbenchmark(
dt[, .I[topn(x, 5L)], by = grp],
dt[order(-x), .I[seq_len(5L)], by = grp],
dt[, .I[frank(-x, ties.method = "min") <= 5L], by = grp], ##closest to dplyr::top_n
times = 5L
)
Unit: milliseconds
expr min lq mean median uq max neval
dt[, .I[topn(x, 5L)], by = grp] 157.9859 158.9587 161.4327 160.2456 160.6012 169.3721 5
dt[order(-x), .I[seq_len(5L)], by = grp] 1336.8226 1343.3002 1347.5568 1345.3929 1350.1008 1362.1676 5
dt[, .I[frank(-x, ties.method = "min") <= 5L], by = grp] 1432.7179 1459.2029 1529.9473 1498.1096 1626.5848 1633.1211 5
It should now be working like you asked.
Thanks!
This is more for future thought. I was also going to suggest that the maintainers look at updating the Database-like ops benchmark for top n by group as dt[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6]. But as the number of groups increases, there seems to be some overhead. I assume this is a memory and GForce type of issue. I can log a separate issue once this is merged for better grouping performance.
Details, benchmark, and verbose message
There are two rows marked with ******* which identify the main performance difference: for topn, collecting discontiguous groups took 0.616s for 10000 groups, versus only 0.016s for the other method.
##remotes::install_github("Rdatatable/data.table", ref = "topn")
library(data.table)
N = 1000000L; K = 100L; set.seed(108L)
DT = data.table(id6 = sample(N/K, N, TRUE), v3 = runif(N, max = 100))
identical(DT[DT[order(-v3), head(.I, 2L), by = id6]$V1, .(id6, largest2_v3 = v3)][order(-largest2_v3, id6)],
DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6][order(-largest2_v3, id6)])
#> [1] TRUE
microbenchmark::microbenchmark(
DT[DT[order(-v3), head(.I, 2L), by = id6]$V1, .(id6, largest2_v3 = v3)],
DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6]
, times = 10L
)
#> Unit: milliseconds
#> expr
#> DT[DT[order(-v3), head(.I, 2L), by = id6]$V1, .(id6, largest2_v3 = v3)]
#> DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6]
#> min lq mean median uq max neval
#> 242.2958 248.2904 254.6175 250.4815 255.1830 288.5708 10
#> 674.9867 680.0911 722.0210 682.1045 697.4592 915.3662 10
DT[DT[order(-v3), head(.I, 2L), by = id6, verbose = T]$V1, .(id6, largest2_v3 = v3)]
#> forder.c received 1000000 rows and 1 columns
#> i clause present and columns used in by detected, only these subset: id6
#> Detected that j uses these columns: <none>
#> Finding groups using forderv ... forder.c received 1000000 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forder.c received a vector type 'integer' length 10000
#> 0.000s elapsed (0.000s cpu)
#> lapply optimization is on, j unchanged as 'head(.I, 2L)'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ...
#> collecting discontiguous groups took 0.016s for 10000 groups *********
#> eval(j) took 0.090s for 10000 calls
#> 0.110s elapsed (0.110s cpu)
DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6, verbose = TRUE]
#> Detected that j uses these columns: v3
#> Finding groups using forderv ... forder.c received 1000000 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forder.c received a vector type 'integer' length 10000
#> 0.000s elapsed (0.000s cpu)
#> lapply optimization is on, j unchanged as 'list(v3[topn(v3, 2L, decreasing = TRUE)])'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ...
#> collecting discontiguous groups took 0.616s for 10000 groups **********
#> eval(j) took 0.034s for 10000 calls
#> 0.690s elapsed (0.440s cpu)
fifelse = function(test, yes, no, na=NA) .Call(CfifelseR, test, yes, no, na)
fcase = function(..., default=NA) .Call(CfcaseR, default, parent.frame(), as.list(substitute(list(...)))[-1L])
topn = function(vec, n=6L, decreasing=FALSE) .Call(CtopnR, vec, n, decreasing)
Should the default be decreasing = TRUE? When I think top I think highest values. If it helps, dplyr::top_n uses the sign of n to determine whether the top n or bottom n are returned.
I totally don't mind. The way I implemented it was to be in line with order and sort. dplyr::top_n does not work for me.
I think decreasing=TRUE may be the more common use case, but practically speaking most sorting functions in R and elsewhere use decreasing=FALSE by default.
More practically, in other places in data.table we use lazy evaluation to allow - to mean decreasing=TRUE and switch the argument without requiring the vector to be negated (which should be efficient). In particular, for character input - doesn't work for flipping the sign.
To answer your question, on x = rnorm(5e7):
microbenchmark::microbenchmark(
topn(x, 1L),
which.max(x), times = 10L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# topn(x, 1L) 58.00159 60.38602 63.5568 63.06315 65.64793 71.91218 10
# which.max(x) 135.89879 153.44461 161.4339 163.98896 173.29448 176.23781 10
And it was not even my intention to beat which.max!
NEWS.md
Outdated
microbenchmark::microbenchmark(
  x[topn(x, 6L)],
  sort(x)[1:6],
sort has a partial argument, but it returns the full vector, which means it still needs to be subsetted. Your implementation is still 6 times faster.
x = rnorm(5e7)
microbenchmark::microbenchmark(
sort(x, partial = 1:6)[1:6],
x[topn(x, 6L)],
times = 10L
)
##Unit: milliseconds
## expr min lq mean median uq
## sort(x, partial = 1:6)[1:6] 689.2710 724.5292 730.2857 728.4854 735.3350
## x[topn(x, 6L)] 128.7904 129.3341 151.5083 134.5046 159.8133
identical(sort(x, partial = 1:6)[1:6],x[topn(x, 6L)])
## [1] TRUE
Thanks. I have updated the benchmark. Your machine is clearly faster than mine :-)
TIL about the partial argument, pretty cool.
fifelse = function(test, yes, no, na=NA) .Call(CfifelseR, test, yes, no, na)
fcase = function(..., default=NA) .Call(CfcaseR, default, parent.frame(), as.list(substitute(list(...)))[-1L])
topn = function(vec, n=6L, decreasing=FALSE) .Call(CtopnR, vec, n, decreasing)
We also have an na.last argument for data.table:::forder; we should include that here if we really want to replace order(x)[1:n].
NAs are always last. I did not look at forder, but the behaviour of topn should be similar to order when it comes to NA.
}
\arguments{
  \item{vec}{ A numeric vector of type numeric or integer. Other types are not supported yet. }
  \item{n}{ A positive integer value greater or equal to 1. Maximum value is 1000. }
Why do we put such a maximum?
Because the algorithm is not built for large numbers. Try removing the max and testing the function with a large n: you will notice a massive reduction in performance above a given threshold. It can even become 100 times slower than order for a very large n.
src/data.table.h
Outdated
//fifelse.c
SEXP fifelseR(SEXP l, SEXP a, SEXP b, SEXP na);
SEXP fcaseR(SEXP na, SEXP rho, SEXP args);
SEXP topnR(SEXP vec, SEXP n, SEXP dec);   // No newline at end of file
missing trailing newline
return ans;
}

SEXP topnR(SEXP vec, SEXP n, SEXP dec) {
Why is it in fifelse.c? forder.c seems a more natural place; otherwise its own C file would be fine.
Feel free to add it wherever you wish. I just did not want to create too many files. There are already enough :-)
src/fifelse.c
Outdated
if (isS4(vec) && !INHERITS(vec, char_nanotime)) {
  error("S4 class objects (excluding nanotime) are not supported.");
}
if (len0 > 1000) {
If the input is 1e8 values it should still be faster to do topn(x, 1500), no? I don't think a fixed value makes sense; maybe something like len1/2 makes more sense.
Or perhaps just re-route with warning -- do order(x)[1:n] if n>length(x)/2
Please see my comment above. Thanks.
src/fifelse.c
Outdated
switch(tvec) {
case INTSXP: {
  const int *restrict pvec = INTEGER(vec);
  int min_value = pvec[0];
IINM we haven't already returned for length-0 input, so this will be an out-of-bounds access.
also, if we initialize this to INT_MAX, we won't have to access pvec[0] both here and within the next for loop, does that make sense to do?
Not sure I understand. Do you mean we need to check that vec is of non-null length? If yes, I think you are right; I might need to add a check.
Would you be able to provide an example that fails? Happy to correct it if needed.
src/fifelse.c
Outdated
if (pvec[i] > min_value) {
  min_value = pvec[i];
  pans[idx] = i;
  for (j = 0; j < len0; ++j) {
I'll have to read more carefully, but it looks like we do a full re-sort of the len0 indices each time we find a new candidate, is that right? Can we try comparing against a min-heap implementation?
No, we don't do a full re-sort, we just recompute the min/max value and its position to be able to compare it to future values.
The sorting of the final vector is done in the next loop. Maybe I need to add some comments to explain what I did?
Closes #3804