Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #4187     +/-   ##
=========================================
  Coverage   99.61%   99.61%
=========================================
  Files          72       72
  Lines       13916    14046    +130
=========================================
+ Hits        13862    13992    +130
  Misses         54       54
src/fifelse.c
Outdated
error("Function 'topn' is not built for large value of 'n'. The algorithm is made for small values. Please prefer the 'order' if you want to proceed with such large value.");
}
if (len0 > len1) {
  error("'n' cannot be larger than length of 'vec'.");
A very likely use case is topn by group. I think most people familiar with top_n would expect this to work:
set.seed(123L)
dt = data.table(x = rnorm(10), grp = sample(5L, 10L, replace = T))
dt[dt[, .I[topn(x, 2L)], by = grp]$V1, ]
##Error: 'n' cannot be larger than length of 'vec'.
so you suggest we return something like below?
> dt[dt[, .I[order(x)[1:2]], by = grp]$V1, ]
x grp
1: -0.6868529 1
2: -0.5604756 1
3: -0.2301775 4
4: NA NA
5: 0.1292877 5
6: NA NA
7: -0.4456620 3
8: 1.7150650 3
9: -1.2650612 2
10: 0.4609162 2
My preference would be to return seq_len(n) and maybe have a warning that there are fewer records than n.
dt[dt[, .I[order(x)[1:min(.N,2)]], by = grp]$V1, ]
x grp
<num> <int>
1: -0.6868529 1
2: -0.5604756 1
3: -0.2301775 4
4: 0.1292877 5
5: -0.4456620 3
6: 1.7150650 3
7: -1.2650612 2
8: 0.4609162 2
P.S., this PR is great.
set.seed(123)
dt = data.table(x = rnorm(1e7), grp = sample(100, 1e7, TRUE))
microbenchmark::microbenchmark(
dt[, .I[topn(x, 5L)], by = grp],
dt[order(-x), .I[seq_len(5L)], by = grp],
dt[, .I[frank(-x, ties.method = "min") <= 5L], by = grp], ##closest to dplyr::top_n
times = 5L
)
Unit: milliseconds
expr min lq mean median uq max neval
dt[, .I[topn(x, 5L)], by = grp] 157.9859 158.9587 161.4327 160.2456 160.6012 169.3721 5
dt[order(-x), .I[seq_len(5L)], by = grp] 1336.8226 1343.3002 1347.5568 1345.3929 1350.1008 1362.1676 5
dt[, .I[frank(-x, ties.method = "min") <= 5L], by = grp] 1432.7179 1459.2029 1529.9473 1498.1096 1626.5848 1633.1211 5
It should now be working like you asked.
Thanks!
This is more for future thought. I was also going to suggest that the maintainers look at updating the Database-like ops benchmark for top n by group as dt[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6]. But as the number of groups increases, there seems to be some overhead. I assume this is a memory and GForce type of issue. I can log a separate issue once this is merged for better grouping performance.
Details, benchmark, and verbose message
There are two rows marked with ******* which identify the main performance difference: for topn, collecting discontiguous groups took 0.616s for 10000 groups, versus only 0.016s for the other method.
##remotes::install_github("Rdatatable/data.table", ref = "topn")
library(data.table)
N = 1000000L; K = 100L; set.seed(108L)
DT = data.table(id6 = sample(N/K, N, TRUE), v3 = runif(N, max = 100))
identical(DT[DT[order(-v3), head(.I, 2L), by = id6]$V1, .(id6, largest2_v3 = v3)][order(-largest2_v3, id6)],
DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6][order(-largest2_v3, id6)])
#> [1] TRUE
microbenchmark::microbenchmark(
DT[DT[order(-v3), head(.I, 2L), by = id6]$V1, .(id6, largest2_v3 = v3)],
DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6]
, times = 10L
)
#> Unit: milliseconds
#> expr
#> DT[DT[order(-v3), head(.I, 2L), by = id6]$V1, .(id6, largest2_v3 = v3)]
#> DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6]
#> min lq mean median uq max neval
#> 242.2958 248.2904 254.6175 250.4815 255.1830 288.5708 10
#> 674.9867 680.0911 722.0210 682.1045 697.4592 915.3662 10
DT[DT[order(-v3), head(.I, 2L), by = id6, verbose = T]$V1, .(id6, largest2_v3 = v3)]
#> forder.c received 1000000 rows and 1 columns
#> i clause present and columns used in by detected, only these subset: id6
#> Detected that j uses these columns: <none>
#> Finding groups using forderv ... forder.c received 1000000 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forder.c received a vector type 'integer' length 10000
#> 0.000s elapsed (0.000s cpu)
#> lapply optimization is on, j unchanged as 'head(.I, 2L)'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ...
#> collecting discontiguous groups took 0.016s for 10000 groups *********
#> eval(j) took 0.090s for 10000 calls
#> 0.110s elapsed (0.110s cpu)
DT[, .(largest2_v3 = v3[topn(v3, 2L, decreasing = TRUE)]), by = id6, verbose = TRUE]
#> Detected that j uses these columns: v3
#> Finding groups using forderv ... forder.c received 1000000 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forder.c received a vector type 'integer' length 10000
#> 0.000s elapsed (0.000s cpu)
#> lapply optimization is on, j unchanged as 'list(v3[topn(v3, 2L, decreasing = TRUE)])'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ...
#> collecting discontiguous groups took 0.616s for 10000 groups **********
#> eval(j) took 0.034s for 10000 calls
#> 0.690s elapsed (0.440s cpu)
fifelse = function(test, yes, no, na=NA) .Call(CfifelseR, test, yes, no, na)
fcase = function(..., default=NA) .Call(CfcaseR, default, parent.frame(), as.list(substitute(list(...)))[-1L])
topn = function(vec, n=6L, decreasing=FALSE) .Call(CtopnR, vec, n, decreasing)
Should the default be decreasing = TRUE? When I think top I think highest values. If it helps, dplyr::top_n uses the sign of n to determine whether the top n or bottom n are returned.
I totally don't mind. The way I implemented it was to be in line with order and sort. dplyr::top_n does not work for me.
I think decreasing=TRUE may be the more common use case, but practically speaking most sorting functions in R and elsewhere use decreasing=FALSE by default.
More practically, in other places in data.table we use lazy evaluation to allow - to mean decreasing=TRUE and switch the argument without requiring the vector to be negated (which should be efficient). In particular, for character input - doesn't work for flipping the sign.
To answer your question, on x = rnorm(5e7):
microbenchmark::microbenchmark(
topn(x, 1L),
which.max(x), times = 10L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# topn(x, 1L) 58.00159 60.38602 63.5568 63.06315 65.64793 71.91218 10
# which.max(x) 135.89879 153.44461 161.4339 163.98896 173.29448 176.23781 10
And it was not even my intention to beat which.max!
NEWS.md
Outdated
microbenchmark::microbenchmark(
  x[topn(x, 6L)],
  sort(x)[1:6],
sort has a partial argument, but it returns the full vector, which means it still needs to be subsetted. Your implementation is still 6 times faster.
x = rnorm(5e7)
microbenchmark::microbenchmark(
sort(x, partial = 1:6)[1:6],
x[topn(x, 6L)],
times = 10L
)
##Unit: milliseconds
## expr min lq mean median uq
## sort(x, partial = 1:6)[1:6] 689.2710 724.5292 730.2857 728.4854 735.3350
## x[topn(x, 6L)] 128.7904 129.3341 151.5083 134.5046 159.8133
identical(sort(x, partial = 1:6)[1:6],x[topn(x, 6L)])
## [1] TRUE
Thanks. I have updated the benchmark. Your machine is clearly faster than mine :-)
TIL about the partial argument, pretty cool.
fifelse = function(test, yes, no, na=NA) .Call(CfifelseR, test, yes, no, na)
fcase = function(..., default=NA) .Call(CfcaseR, default, parent.frame(), as.list(substitute(list(...)))[-1L])
topn = function(vec, n=6L, decreasing=FALSE) .Call(CtopnR, vec, n, decreasing)
We also have an na.last argument for data.table:::forder; we should include that here if we really want to replace order(x)[1:n].
NAs are always last. I did not look at forder, but the behaviour of topn should be similar to order when it comes to NA.
}
\arguments{
  \item{vec}{ A numeric vector of type numeric or integer. Other types are not supported yet. }
  \item{n}{ A positive integer value greater or equal to 1. Maximum value is 1000. }
Why do we put such a maximum?
Because the algorithm is not built for large numbers. Try removing the max and testing the function with a large n: you will notice a massive reduction in performance above a given threshold. It can even become 100 times slower than order for a very large n.
src/data.table.h
Outdated
//fifelse.c
SEXP fifelseR(SEXP l, SEXP a, SEXP b, SEXP na);
SEXP fcaseR(SEXP na, SEXP rho, SEXP args);
SEXP topnR(SEXP vec, SEXP n, SEXP dec);   // No newline at end of file
missing trailing newline
return ans;
}

SEXP topnR(SEXP vec, SEXP n, SEXP dec) {
Why is it in fifelse.c? forder.c seems a more natural place; otherwise its own C file would be fine.
Feel free to add it wherever you wish. I just did not want to create too many files. There are already enough :-)
src/fifelse.c
Outdated
if (isS4(vec) && !INHERITS(vec, char_nanotime)) {
  error("S4 class objects (excluding nanotime) are not supported.");
}
if (len0 > 1000) {
If the input is 1e8 values it should still be faster to do topn(x, 1500), no? I don't think a fixed value makes sense; maybe something like len1/2 makes more sense.
Or perhaps just re-route with warning -- do order(x)[1:n] if n>length(x)/2
Please see my comment above. Thanks.
src/fifelse.c
Outdated
switch(tvec) {
case INTSXP: {
  const int *restrict pvec = INTEGER(vec);
  int min_value = pvec[0];
IINM we haven't already returned for length-0 input, so this will be an out-of-bounds access.
also, if we initialize this to INT_MAX, we won't have to access pvec[0] both here and within the next for loop, does that make sense to do?
Not sure I understand. Do you mean we need to check that vec is of non-null length? If yes, I think you are right; I might need to add a check.
Would you be able to provide an example that fails? Happy to correct it if needed.
src/fifelse.c
Outdated
if (pvec[i] > min_value) {
  min_value = pvec[i];
  pans[idx] = i;
  for (j = 0; j < len0; ++j) {
I'll have to read more carefully, but it looks like we do a full re-sort of the len0 indices each time we find a new candidate, is that right? Can we try comparing against a min-heap implementation?
No, we don't do a full re-sort, we just recompute the min/max value and its position to be able to compare it to future values.
The sorting of the final vector is done in the next loop. Maybe I need to add some comments to explain what I did?
Closes #3804