diff --git a/NEWS.md b/NEWS.md index 610f678e31..3ca44e1dc8 100644 --- a/NEWS.md +++ b/NEWS.md @@ -10,6 +10,10 @@ 3. The number of logical CPUs used by default has been reduced from 100% to 50%. The previous 100% default was reported to cause significant slow downs when other non-trivial processes were also running: [#3395](https://github.com/Rdatatable/data.table/issues/3395), [#3298](https://github.com/Rdatatable/data.table/issues/3298). Two new optional environment variables (`R_DATATABLE_NUM_PROCS_PERCENT` & `R_DATATABLE_NUM_THREADS`) control this default. \code(setDTthreads()) gains \code{percent=} and \code{?setDTthreads} has been significantly revised. \code{getDTthreads(verbose=TRUE)} has been expanded. The environment variable `OMP_THREAD_LIMIT` is now respected ([#3300](https://github.com/Rdatatable/data.table/issues/3300)) in addition to `OMP_NUM_THREADS` as before. +4. `rbind` and `rbindlist` now retain the position of duplicate column names rather than grouping them together [#3373](https://github.com/Rdatatable/data.table/issues/3373), fill length 0 columns (including NULL) with NA with warning [#1871](https://github.com/Rdatatable/data.table/issues/1871), and recycle length-1 columns [#524](https://github.com/Rdatatable/data.table/issues/524). Thanks to Kun Ren for the requests which arose when parsing JSON. + +5. `rbindlist`'s `use.names=` default has changed from `FALSE` to `"check"`. This warns if the column names of each item are not identical and then proceeds as if `use.names=FALSE` for backwards compatibility; i.e., bind by column number not by column name. In future, it will warn and then proceed as if `use.names=TRUE`. Eventually the default will be changed from `"check"` to `TRUE` unless user feedback is negative. The `rbind` method for `data.table` already sets `use.names=TRUE` as does `rbind` for `data.frame` in base, and is clearly safer. 
To stack differently named columns together silently (the previous default behavior), it is now necessary to write `use.names=FALSE` for clarity to readers of your code. Thanks to Clayton Stanley who first raised the issue [here](http://lists.r-forge.r-project.org/pipermail/datatable-help/2014-April/002480.html). + #### BUG FIXES 1. `rbindlist()` of a malformed factor missing levels attribute is now a helpful error rather than a cryptic error about `STRING_ELT`, [#3315](https://github.com/Rdatatable/data.table/issues/3315). Thanks to Michael Chirico for reporting. @@ -34,6 +38,14 @@ 11. A join's result could be incorrectly keyed when a single nomatch occurred at the very beginning while all other values matched, [#3441](https://github.com/Rdatatable/data.table/issues/3441). The incorrect key would cause incorrect results in subsequent queries. Thanks to @symbalex for reporting and @franknarf1 for pinpointing the root cause. +12. `rbind` and `rbindlist(..., use.names=TRUE)` with over 255 columns could return the columns in a random order, [#3373](https://github.com/Rdatatable/data.table/issues/3373). The contents and name of each column were correct but the order that the columns appeared in the result might not match the original input. + +13. `rbind` and `rbindlist` now combine `integer64` columns together with non-`integer64` columns correctly [#1349](https://github.com/Rdatatable/data.table/issues/1349), and support `raw` columns [#2819](https://github.com/Rdatatable/data.table/issues/2819). + +14. `NULL` columns are caught and error appropriately rather than segfault in some cases, [#2303](https://github.com/Rdatatable/data.table/issues/2303) [#2305](https://github.com/Rdatatable/data.table/issues/2305). Thanks to Hugh Parsonage and @franknarf1 for reporting. + +15. `melt` would error with 'factor malformed' or segfault in the presence of duplicate column names, [#1754](https://github.com/Rdatatable/data.table/issues/1754). 
Many thanks to @franknarf1, William Marble, wligtenberg and Toby Dylan Hocking for reproducible examples. All examples have been added to the test suite. + #### NOTES 1. When upgrading to 1.12.0 some Windows users might have seen `CdllVersion not found` in some circumstances. We found a way to catch that so the [helpful message](https://twitter.com/MattDowle/status/1084528873549705217) now occurs for those upgrading from versions prior to 1.12.0 too, as well as those upgrading from 1.12.0 to a later version. See item 1 in notes section of 1.12.0 below for more background. diff --git a/R/as.data.table.R b/R/as.data.table.R index e11732067b..a03ba35bb5 100644 --- a/R/as.data.table.R +++ b/R/as.data.table.R @@ -108,6 +108,8 @@ as.data.table.array <- function(x, keep.rownames=FALSE, sorted=TRUE, value.name= } as.data.table.list <- function(x, keep.rownames=FALSE, ...) { + wn = sapply(x,is.null) + if (any(wn)) x = x[!wn] if (!length(x)) return( null.data.table() ) # fix for #833, as.data.table.list with matrix/data.frame/data.table as a list element.. # TODO: move this entire logic (along with data.table() to C @@ -125,18 +127,17 @@ as.data.table.list <- function(x, keep.rownames=FALSE, ...) { idx = which(n < mn) if (length(idx)) { for (i in idx) { - if (!is.null(x[[i]])) {# avoids warning when a list element is NULL - if (inherits(x[[i]], "POSIXlt")) { - warning("POSIXlt column type detected and converted to POSIXct. 
We do not recommend use of POSIXlt at all because it uses 40 bytes to store one date.") - x[[i]] = as.POSIXct(x[[i]]) - } - # Implementing FR #4813 - recycle with warning when nr %% nrows[i] != 0L - if (!n[i] && mn) - warning("Item ", i, " is of size 0 but maximum size is ", mn, ", therefore recycled with 'NA'") - else if (n[i] && mn %% n[i] != 0L) - warning("Item ", i, " is of size ", n[i], " but maximum size is ", mn, " (recycled leaving a remainder of ", mn%%n[i], " items)") - x[[i]] = rep(x[[i]], length.out=mn) + # any is.null(x[[i]]) were removed above, otherwise warning when a list element is NULL + if (inherits(x[[i]], "POSIXlt")) { + warning("POSIXlt column type detected and converted to POSIXct. We do not recommend use of POSIXlt at all because it uses 40 bytes to store one date.") + x[[i]] = as.POSIXct(x[[i]]) } + # Implementing FR #4813 - recycle with warning when nr %% nrows[i] != 0L + if (!n[i] && mn) + warning("Item ", i, " is of size 0 but maximum size is ", mn, ", therefore recycled with 'NA'") + else if (n[i] && mn %% n[i] != 0L) + warning("Item ", i, " is of size ", n[i], " but maximum size is ", mn, " (recycled leaving a remainder of ", mn%%n[i], " items)") + x[[i]] = rep(x[[i]], length.out=mn) } } # fix for #842 diff --git a/R/data.table.R b/R/data.table.R index 2052d1c921..9081ee0594 100644 --- a/R/data.table.R +++ b/R/data.table.R @@ -215,17 +215,6 @@ replace_dot_alias <- function(e) { } } -# A (relatively) fast (uses DT grouping) wrapper for matching two vectors, BUT: -# it behaves like 'pmatch' but only the 'exact' matching part. That is, a value in -# 'x' is matched to 'table' only once. No index will be present more than once. 
-# This should make it even clearer: -# chmatch2(c("a", "a"), c("a", "a")) # 1,2 - the second 'a' in 'x' has a 2nd match in 'table' -# chmatch2(c("a", "a"), c("a", "b")) # 1,NA - the second one doesn't 'see' the first 'a' -# chmatch2(c("a", "a"), c("a", "a.1")) # 1,NA - this is where it differs from pmatch - we don't need the partial match. -chmatch2 <- function(x, table, nomatch=NA_integer_) { - .Call(Cchmatch2, x, table, as.integer(nomatch)) # this is in 'rbindlist.c' for now. -} - "[.data.table" <- function (x, i, j, by, keyby, with=TRUE, nomatch=getOption("datatable.nomatch"), mult="all", roll=FALSE, rollends=if (roll=="nearest") c(TRUE,TRUE) else if (roll>=0) c(FALSE,TRUE) else c(TRUE,FALSE), which=FALSE, .SDcols, verbose=getOption("datatable.verbose"), allow.cartesian=getOption("datatable.allow.cartesian"), drop=NULL, on=NULL) { # ..selfcount <<- ..selfcount+1 # in dev, we check no self calls, each of which doubles overhead, or could @@ -369,14 +358,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) { } } - # To take care of duplicate column names properly (see chmatch2 function above `[data.table`) for description - dupmatch <- function(x, y, ...) { - if (anyDuplicated(x)) - pmax(chmatch(x,y, ...), chmatch2(x,y,0L)) - else chmatch(x,y) - } - - # setdiff removes duplicate entries, which'll create issues with duplicated names. Use '%chin% instead. + # setdiff removes duplicate entries, which'll create issues with duplicated names. Use %chin% instead. 
dupdiff <- function(x, y) x[!x %chin% y] if (!missing(i)) { @@ -739,7 +721,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) { if (length(tt)) jisvars[tt] = paste0("i.",jisvars[tt]) if (length(duprightcols <- rightcols[duplicated(rightcols)])) { nx = c(names(x), names(x)[duprightcols]) - rightcols = chmatch2(names(x)[rightcols], nx) + rightcols = chmatchdup(names(x)[rightcols], nx) nx = make.unique(nx) } else nx = names(x) ansvars = make.unique(c(nx, jisvars)) @@ -790,20 +772,16 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) { if (is.character(j)) { if (notj) { w = chmatch(j, names(x)) - if (anyNA(w)) { - warning("column(s) not removed because not found: ",paste(j[is.na(w)],collapse=",")) - w = w[!is.na(w)] - } - # changed names(x)[-w] to use 'setdiff'. Here, all instances of the column must be removed. - # Ex: DT <- data.table(x=1, y=2, x=3); DT[, !"x", with=FALSE] should just output 'y'. - # But keep 'dup cols' beause it's basically DT[, !names(DT) %chin% "x", with=FALSE] which'll subset all cols not 'x'. - ansvars = if (length(w)) dupdiff(names(x), names(x)[w]) else names(x) - ansvals = dupmatch(ansvars, names(x)) + if (anyNA(w)) warning("column(s) not removed because not found: ",paste(j[is.na(w)],collapse=",")) + # all duplicates of the name in names(x) must be removed; e.g. data.table(x=1, y=2, x=3)[, !"x"] should just output 'y'. + w = !names(x) %chin% j + ansvars = names(x)[w] + ansvals = which(w) } else { - # once again, use 'setdiff'. Basically, unless indices are specified in `j`, we shouldn't care about duplicated columns. - ansvars = j # x. and i. prefixes may be in here, and they'll be dealt with below - # dups = FALSE here.. even if DT[, c("x", "x"), with=FALSE], we subset only the first.. No way to tell which one the OP wants without index. - ansvals = chmatch(ansvars, names(x)) + # if DT[, c("x","x")] and "x" is duplicated in names(DT), we still subset only the first. 
Because dups are unusual and + # it's more common to select the same column a few times. A syntax would be needed to distinguish these intents. + ansvars = j # x. and i. prefixes may be in here, they'll result in NA and will be dealt with further below if length(leftcols) + ansvals = chmatch(ansvars, names(x)) # not chmatchdup() } if (!length(ansvals)) return(null.data.table()) if (!length(leftcols)) { @@ -1019,7 +997,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) { # over a subset of columns # all duplicate columns must be matched, because nothing is provided - ansvals = dupmatch(ansvars, names(x)) + ansvals = chmatchdup(ansvars, names(x)) } else { # FR #4979 - negative numeric and character indices for SDcols colsub = substitute(.SDcols) @@ -1432,6 +1410,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) { setattr(jval, 'class', class(x)) # fix for #5296 if (haskey(x) && all(key(x) %chin% names(jval)) && suppressWarnings(is.sorted(jval, by=key(x)))) # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient) setattr(jval, 'sorted', key(x)) + for (i in seq_along(jval)) if (is.null(jval[[i]])) stop("Column ",i," of j evaluates to NULL. 
A NULL column is invalid.") } return(jval) } @@ -2432,10 +2411,6 @@ copy <- function(x) { alloc.col(newx) } -copyattr <- function(from, to) { - .Call(Ccopyattr, from, to) -} - point <- function(to, to_idx, from, from_idx) { .Call(CpointWrapper, to, to_idx, from, from_idx) } @@ -2652,13 +2627,19 @@ set <- function(x,i=NULL,j,value) # low overhead, loopable invisible(x) } -chmatch <- function(x,table,nomatch=NA_integer_) - .Call(Cchmatchwrapper,x,table,as.integer(nomatch[1L]),FALSE) # [1L] to fix #1672 +chmatch <- function(x, table, nomatch=NA_integer_) + .Call(Cchmatch, x, table, as.integer(nomatch[1L])) # [1L] to fix #1672 -"%chin%" <- function(x,table) { - # TO DO if table has 'ul' then match to that - .Call(Cchmatchwrapper,x,table,NA_integer_,TRUE) -} +# chmatchdup() behaves like 'pmatch' but only the 'exact' matching part; i.e. a value in +# 'x' is matched to 'table' only once. No index will be present more than once. For example: +# chmatchdup(c("a", "a"), c("a", "a")) # 1,2 - the second 'a' in 'x' has a 2nd match in 'table' +# chmatchdup(c("a", "a"), c("a", "b")) # 1,NA - the second one doesn't 'see' the first 'a' +# chmatchdup(c("a", "a"), c("a", "a.1")) # 1,NA - this is where it differs from pmatch - we don't need the partial match. 
+chmatchdup <- function(x, table, nomatch=NA_integer_) + .Call(Cchmatchdup, x, table, as.integer(nomatch[1L])) + +"%chin%" <- function(x, table) + .Call(Cchin, x, table) # TO DO if table has 'ul' then match to that chorder <- function(x) { o = forderv(x, sort=TRUE, retGrp=FALSE) @@ -2671,7 +2652,6 @@ chgroup <- function(x) { if (length(o)) as.vector(o) else seq_along(x) # as.vector removes the attributes } - .rbind.data.table <- function(..., use.names=TRUE, fill=FALSE, idcol=NULL) { # See FAQ 2.23 # Called from base::rbind.data.frame @@ -2681,14 +2661,20 @@ chgroup <- function(x) { rbindlist(l, use.names, fill, idcol) } -rbindlist <- function(l, use.names=fill, fill=FALSE, idcol=NULL) { +rbindlist <- function(l, use.names="check", fill=FALSE, idcol=NULL) { if (isFALSE(idcol)) { idcol = NULL } else if (!is.null(idcol)) { if (isTRUE(idcol)) idcol = ".id" if (!is.character(idcol)) stop("idcol must be a logical or character vector of length 1. If logical TRUE the id column will named '.id'.") idcol = idcol[1L] } - # fix for #1467, quotes result in "not resolved in current namespace" error + miss = missing(use.names) + # more checking of use.names happens at C level; this is just minimal to massage 'check' to NA + if (identical(use.names, NA)) stop("use.names=NA invalid") # otherwise use.names=NA could creep in an usage equivalent to use.names='check' + if (identical(use.names,"check")) { + if (!miss) stop("use.names='check' cannot be used explicitly because the value 'check' is new in v1.12.2 and subject to change. It is just meant to convey default behavior. 
See ?rbindlist.") + use.names = NA + } ans = .Call(Crbindlist, l, use.names, fill, idcol) if (!length(ans)) return(null.data.table()) setDT(ans)[] diff --git a/inst/tests/melt_1754.R.gz b/inst/tests/melt_1754.R.gz new file mode 100644 index 0000000000..f9b56ded80 Binary files /dev/null and b/inst/tests/melt_1754.R.gz differ diff --git a/inst/tests/melt_1754_synth.csv b/inst/tests/melt_1754_synth.csv new file mode 100644 index 0000000000..23c4cc3227 --- /dev/null +++ b/inst/tests/melt_1754_synth.csv @@ -0,0 +1,40 @@ +state,income,retailprice,percent_15_19,beercons,smoking88,smoking80,smoking75,smoking70,smoking71,smoking72,smoking73,smoking74,smoking75,smoking76,smoking77,smoking78,smoking79,smoking80,smoking81,smoking82,smoking83,smoking84,smoking85,smoking86,smoking87,smoking88,smoking89,smoking90,smoking91,smoking92,smoking93,smoking94,smoking95,smoking96,smoking97,smoking98,smoking99,smoking00 +1,9.678973622,89.34444512,0.174801901,18.95999985,112.0999985,123.1999969,111.6999969,89.80000305,95.40000153,101.0999985,102.9000015,108.1999969,111.6999969,116.1999969,117.0999985,123,121.4000015,123.1999969,119.5999985,119.0999985,116.3000031,113,114.5,116.3000031,114,112.0999985,105.5999985,108.5999985,107.9000015,109.0999985,108.5,107.0999985,102.5999985,101.4000015,104.9000015,106.1999969,100.6999969,96.19999695 +2,9.643623246,89.8777771,0.164611373,18.52000008,121.5,131.8000031,114.8000031,100.3000031,104.0999985,103.9000015,108,109.6999969,114.8000031,119.0999985,122.5999985,127.3000031,126.5,131.8000031,128.6999969,127.4000015,128,123.0999985,125.8000031,126,122.3000031,121.5,118.3000031,113.0999985,116.8000031,126,113.8000031,108.8000031,113,110.6999969,108.6999969,109.5,104.8000031,99.40000153 
+4,9.984357198,82.62222205,0.173703247,25.08000031,94.59999847,131,131,124.8000031,125.5,134.3000031,137.8999939,132.8000031,131,134.1999969,132,129.1999969,131.5,131,133.8000031,130.5,125.3000031,119.6999969,112.4000015,109.9000015,102.4000015,94.59999847,88.80000305,87.40000153,90.19999695,88.30000305,88.59999847,89.09999847,85.40000153,83.09999847,81.30000305,81.19999695,79.59999847,73 +5,10.18803512,103.4777764,0.163659688,20.7,104.8000031,118,110.1999969,120,117.5999985,110.8000031,109.3000031,112.4000015,110.1999969,113.4000015,117.3000031,117.5,117.4000015,118,116.4000015,114.6999969,114.0999985,112.5,111,108.5,109,104.8000031,100.5999985,91.5,86.69999695,83.5,79.09999847,76.59999847,79.30000305,76,75.90000153,75.5,73.40000153,71.40000153 +6,9.974561161,90.05555513,0.178224497,26.08000031,137.1000061,150.5,147.6000061,155,161.1000061,156.3000031,154.6999969,151.3000031,147.6000061,153,153.3000031,155.5,150.1999969,150.5,152.6000061,154.1000061,149.6000061,144,144.5,142.3999939,141,137.1000061,131.6999969,127.1999969,118.8000031,120,123.8000031,126.0999985,127.1999969,128.3000031,124.0999985,132.8000031,139.5,140.6999969 +7,9.817172262,84.36666658,0.176944127,21.75999985,124.0999985,134,122.9000015,109.9000015,115.6999969,117,119.8000031,123.6999969,122.9000015,125.9000015,127.9000015,130.6000061,131,134,131.6999969,131.1999969,128.6000061,126.3000031,128.8000031,129,129.3000031,124.0999985,117.0999985,113.8000031,109.5999985,109.1999969,109.1999969,107.8000031,100.3000031,102.6999969,100.5999985,100.5,97.09999847,88.40000153 +8,9.711300956,86.07777786,0.152016721,22.22000008,84.5,115.1999969,123.3000031,102.4000015,108.5,126.0999985,121.8000031,125.5999985,123.3000031,125.0999985,125,122.8000031,117.5,115.1999969,114.0999985,111.5,111.3000031,103.5999985,100.6999969,96.69999695,95,84.5,78.40000153,90.09999847,85.40000153,85.09999847,86.69999695,93,78.19999695,73.59999847,75,78.90000153,75.09999847,66.90000153 
+9,10.00688288,89.83333249,0.17028118,24.74000015,107.5999985,135.1999969,131.8000031,124.8000031,125.5999985,126.5999985,124.4000015,131.8999939,131.8000031,134.3999939,134,136.6999969,135.3000031,135.1999969,133,130.6999969,127.9000015,124,121.5999985,118.1999969,109.5,107.5999985,104.5999985,94.09999847,96.09999847,94.80000305,94.59999847,85.69999695,84.30000305,81.80000305,79.59999847,80.30000305,72.19999695,70 +10,9.831646389,81.08888838,0.175089656,21.97999992,134,146.8999939,162.3999939,134.6000061,139.3000031,149.1999969,156,159.6000061,162.3999939,166.6000061,173,150.8999939,148.8999939,146.8999939,148.5,147.6999969,143,137.8000031,135.3000031,137.6000061,134,134,132.5,128.3000031,127.1999969,128.1999969,126.8000031,128.1999969,135.3999939,135.1000061,135.3000031,135.8999939,133.3000031,125.5 +11,9.836926672,90.65555615,0.169908937,23.23999977,100.1999969,124.5999985,120.5,108.5,108.4000015,109.4000015,110.5999985,116.0999985,120.5,124.4000015,125.5,127.0999985,124.1999969,124.5999985,132.8999939,116.1999969,115.5999985,111.1999969,109.4000015,104.0999985,101.0999985,100.1999969,94.40000153,95.40000153,97.09999847,95.19999695,92.5,93.40000153,93,94,93.90000153,94,91.69999695,88.90000153 +12,9.916982969,87.78888872,0.170582652,19.94000015,103.1999969,127.0999985,123.4000015,114,102.8000031,111,115.1999969,118.5999985,123.4000015,127.6999969,127.9000015,127.0999985,126.4000015,127.0999985,132,130.8999939,127.5999985,121.6999969,115.6999969,109.4000015,105.1999969,103.1999969,96.5,94.30000305,91.80000305,90,89.90000153,89.09999847,90.09999847,88.69999695,89.19999695,87.59999847,83.30000305,79.80000305 
+13,9.695040385,71.48888991,0.17540008,18.94000015,173.1999969,215.3000031,223,155.8000031,163.5,179.3999939,201.8999939,212.3999939,223,230.8999939,229.3999939,224.6999969,214.8999939,215.3000031,209.6999969,210.6000061,201.1000061,183.1999969,182.3999939,179.8000031,171.1999969,173.1999969,171.6000061,182.5,170.3999939,167.6000061,167.6000061,170.1000061,175.3000031,179,186.8000031,171.3000031,165.3000031,156.1999969 +14,9.747586568,90.0666665,0.181948922,23.87999992,110.9000015,143.8000031,133.6000061,115.9000015,119.8000031,125.3000031,126.6999969,129.8999939,133.6000061,139.6000061,140,142.6999969,140.1000061,143.8000031,144,143.8999939,133.6999969,128.8999939,125,121.1999969,116.5,110.9000015,103.5999985,101.5,107.1999969,108.5,106.1999969,105.3000031,105.6999969,106.8000031,105.3000031,103.1999969,101,104.3000031 +15,9.786902746,91.89999898,0.16571967,22.44000015,125,141.1999969,140.6999969,128.5,133.1999969,136.5,138,142.1000061,140.6999969,144.8999939,145.6000061,143.8999939,138.5,141.1999969,138.8999939,139.5,135.3999939,135.5,127.9000015,119,125,125,122.4000015,117.5,116.0999985,114.5,108.5,101.5999985,102.3000031,100,101.0999985,94.5,85.5,82.90000153 +16,9.938785023,96.28888872,0.172464155,23.12000008,94.09999847,117.6999969,111.5,104.3000031,116.4000015,96.80000305,106.8000031,110.5999985,111.5,116.6999969,117.1999969,118.9000015,118.3000031,117.6999969,120.8000031,119.4000015,113.1999969,110.8000031,113,104.3000031,108.8000031,94.09999847,92.30000305,90.69999695,86.19999695,83.80000305,81.59999847,83.40000153,84.09999847,81.69999695,84.09999847,83.19999695,80.69999695,76 
+17,9.546848297,88.92222214,0.181491156,21.14000015,109,127,116.8000031,93.40000153,105.4000015,112.0999985,115,117.0999985,116.8000031,120.9000015,122.0999985,124.9000015,123.9000015,127,125.3000031,125.8000031,122.3000031,116.4000015,115.3000031,113.1999969,110,109,108.3000031,101.8000031,105.5999985,103.9000015,105.4000015,106,107.5,106.9000015,106.3000031,107,103.9000015,97.19999695 +18,9.877391073,84.46666675,0.166553891,23.88000069,127.4000015,142.1000061,135.6000061,121.3000031,127.5999985,130,132.1000061,135.3999939,135.6000061,139.5,140.8000031,141.8000031,140.1999969,142.1000061,140.5,139.6999969,134.1000061,130,129.1999969,128.8000031,128.6999969,127.4000015,122.8000031,119.0999985,119.9000015,122.3000031,121.5999985,119.4000015,124,124.0999985,120.5999985,120.0999985,118,113.8000031 +19,9.753027174,85.5888888,0.16537521,27.87999992,87.09999847,122,123.6999969,111.1999969,115.5999985,122.1999969,119.9000015,121.9000015,123.6999969,124.9000015,127,127.1999969,120.3000031,122,121.0999985,122.4000015,113.6999969,110.0999985,103.5999985,97.80000305,91.69999695,87.09999847,86.19999695,84.69999695,82.90000153,86.59999847,86,88.19999695,90.5,87.30000305,88.90000153,89.09999847,82.59999847,75.5 +20,9.850039482,89.48888906,0.168586676,24.71999969,92.90000153,116.3000031,114.0999985,108.0999985,108.5999985,104.9000015,106.5999985,110.5,114.0999985,118.0999985,117.6999969,117.4000015,116.0999985,116.3000031,117,117.0999985,110.8000031,107.6999969,105.0999985,103.0999985,101.3000031,92.90000153,93.80000305,89.90000153,92.40000153,90.59999847,91.09999847,85.90000153,88.5,86.19999695,85.5,83.09999847,86.59999847,77.59999847 
+21,10.02443239,93.24444538,0.162917763,37,141.8999939,177.6999969,205.1999969,189.5,190.5,198.6000061,201.5,204.6999969,205.1999969,201.3999939,190.8000031,187,183.3000031,177.6999969,171.8999939,165.1000061,159.1999969,136.6000061,146.6999969,142.6000061,147.6999969,141.8999939,137.8999939,137.3000031,115.5,110,108.0999985,105.1999969,100.9000015,99,95.59999847,102.4000015,103.9000015,93.19999695 +22,10.00636715,83.39999941,0.169238773,34.95999985,180.3999939,247.8000031,269.1000061,265.7000122,278,296.2000122,279,269.7999878,269.1000061,290.5,278.7999878,269.6000061,254.6000061,247.8000031,245.3999939,239.8000031,232.8999939,215.1000061,201.1000061,195.8999939,195.1000061,180.3999939,172.8999939,152.3999939,144.8000031,143.6999969,148.8999939,153.8000031,158.5,158,174.3999939,173.8000031,171.6999969,147.3000031 +23,9.708400938,87.48888736,0.174311015,27.97999992,77.69999695,102.6999969,103.0999985,90,92.59999847,99.30000305,98.90000153,100.3000031,103.0999985,102.4000015,102.4000015,103.0999985,101,102.6999969,103,97.5,96.30000305,88.90000153,88,88.19999695,82.30000305,77.69999695,74.40000153,70.80000305,69.90000153,71.40000153,69,68.19999695,67,65.69999695,61.79999924,62.59999847,59.70000076,53.79999924 +24,9.751609802,71.44444402,0.179371097,19.92000008,146,187.8000031,226,172.3999939,187.6000061,214.1000061,226.5,227.3000031,226,230.1999969,217,205.5,197.3000031,187.8000031,179.3000031,179,169.8000031,160.6000061,156.3000031,154.3999939,150.5,146,139.3000031,133.6999969,132.6999969,128.8999939,129.6999969,112.6999969,124.9000015,129.6999969,125.5999985,126,113.0999985,109 
+25,9.756118351,88.90000068,0.180801443,23.5,87.09999847,123.6999969,117.9000015,93.80000305,98.5,103.8000031,108.6999969,110.5,117.9000015,125.4000015,122.1999969,121.9000015,121.3000031,123.6999969,125.6999969,126.8000031,119.5999985,109.4000015,103.1999969,99.80000305,92.30000305,87.09999847,84.09999847,77.09999847,85.19999695,74.30000305,83,81,80.59999847,80.80000305,77.5,79.09999847,74.69999695,72.5 +26,9.886301253,84.55555513,0.169848301,24.02000008,122.4000015,133.5,122.5,121.5999985,124.5999985,124.4000015,120.5,122.0999985,122.5,124.5999985,127.3000031,131.3000031,130.8999939,133.5,132.8000031,134,130,127.0999985,126.6999969,126.3000031,124.5999985,122.4000015,118.5999985,115.5,113.1999969,112.3000031,108.9000015,108.5999985,111.6999969,107.5999985,108.5999985,106.4000015,104,99.90000153 +27,9.814118067,90.47777812,0.168725929,18.13999977,103.5999985,141.6000061,132.8999939,108.4000015,115.4000015,121.6999969,124.0999985,130.5,132.8999939,138.6000061,140.3999939,143.6000061,141.6000061,141.6000061,143.6999969,147,140,128.1000061,124.1999969,119.9000015,113.0999985,103.5999985,97.5,88.40000153,87.80000305,86.30000305,86.19999695,104.8000031,109.5,110.8000031,111.8000031,112.1999969,111.4000015,108.9000015 +28,9.926845233,89.17777803,0.164436671,25.07999992,107.5999985,124,114.5999985,107.3000031,106.3000031,109,110.6999969,114.1999969,114.5999985,118.8000031,120.0999985,122.3000031,122.5999985,124,125.1999969,123.3000031,125.3000031,115.3000031,115.8000031,113.9000015,110.5999985,107.5999985,107.0999985,101.3000031,102.5,96.19999695,94.69999695,95.40000153,95.40000153,93.30000305,92.90000153,92.09999847,91.09999847,87.90000153 
+29,9.931006961,90.22222307,0.175420049,25.54000015,138,149.3000031,154.6999969,123.9000015,123.1999969,134.3999939,142,146.1000061,154.6999969,150.1999969,148.8000031,146.8000031,145.8000031,149.3000031,151.1999969,146.3000031,135.8000031,136.8999939,133.3999939,136.3000031,124.4000015,138,120.8000031,101.4000015,103.5999985,100.0999985,94.09999847,91.90000153,90.80000305,87.5,90,88.69999695,86.90000153,83.09999847 +30,9.673460537,76.61111196,0.184413918,22.9,124.4000015,138.3000031,130.5,103.5999985,115,118.6999969,125.5,129.6999969,130.5,136.8000031,137.1999969,140.3999939,135.6999969,138.3000031,136.1000061,136,131.1000061,127,125.4000015,126.5999985,126.5999985,124.4000015,122.4000015,118.5999985,121.5,112.8000031,115.1999969,112.1999969,109.1999969,102.9000015,124.5,126.9000015,109.4000015,103.9000015 +31,9.702802976,88.54444461,0.173589869,21.26000023,91.90000153,114.6999969,113.5,92.69999695,96.69999695,103,103.5,108.4000015,113.5,116.6999969,115.5999985,116.9000015,117.4000015,114.6999969,115.6999969,113,109.8000031,105.6999969,104.4000015,97,95.80000305,91.90000153,87.40000153,88.30000305,91.80000305,93,91.59999847,94.80000305,98.59999847,92.30000305,88.80000305,88.30000305,83.5,75.09999847 +32,9.737283919,85.17777846,0.171025995,20.57999992,125.3000031,130.3999939,117.4000015,99.80000305,106.3000031,111.5,109.6999969,114.8000031,117.4000015,121.6999969,124.5999985,127.3000031,127.1999969,130.3999939,129.1000061,131.3999939,129,125.0999985,128.6999969,129,130.6000061,125.3000031,124.6999969,121.8000031,120.5999985,121,120.8000031,118.8000031,125.4000015,119.1999969,118.9000015,119.6999969,115.5999985,108.6999969 
+33,9.896063487,92.47777854,0.177760008,28.57999992,96.5,129.6999969,116,106.4000015,108.9000015,108.5999985,110.4000015,114.6999969,116,121.4000015,124.1999969,126.5999985,126.4000015,129.6999969,129,131.1999969,126.4000015,117.1999969,115.9000015,113.6999969,105.8000031,96.5,94.5,85.59999847,79.40000153,77.19999695,81.30000305,78.80000305,75.19999695,74.59999847,72.59999847,73.19999695,67.59999847,69.30000305 +34,9.678585158,89.4333335,0.187830499,13.33999996,55,74.80000305,75.80000305,65.5,67.69999695,71.30000305,72.69999695,75.59999847,75.80000305,77.90000153,78,79.59999847,79.09999847,74.80000305,77.59999847,73.59999847,69,66.30000305,66.5,64.40000153,67.69999695,55,57,53.40000153,53.5,55,56.20000076,55.79999924,52,54,57,42.29999924,43.90000153,40.70000076 +35,9.821148766,88.02222273,0.177342362,27.05999985,128.6999969,161.6000061,155.5,122.5999985,124.4000015,138,146.8000031,151.8000031,155.5,171.1000061,169.3999939,162.3999939,160.8999939,161.6000061,163.8000031,162.3000031,153.8000031,144.3000031,144.5,131.1999969,128.3000031,128.6999969,120.9000015,124.3000031,120.9000015,126.5,117.1999969,120.3000031,123.1999969,102.5,97.69999695,97,94.09999847,88.90000153 +36,9.957432535,74.78888914,0.177402943,22.99999962,129.5,148.8999939,152.6999969,124.3000031,128.3999939,137,143.1000061,149.6000061,152.6999969,158.1000061,157.6999969,155.8999939,151.8000031,148.8999939,149.8999939,147.3999939,144.6999969,136.8000031,134.6000061,135.8000031,133,129.5,122.5,118.9000015,109.0999985,108.1999969,105.4000015,106.1999969,106.6999969,104.5999985,108,105.5999985,102.0999985,96.69999695 
+37,9.65476354,92.58888753,0.164830512,19.80000038,109.0999985,122.3000031,123.1999969,114.5,111.5,117.5,116.5999985,119.9000015,123.1999969,129.6999969,133.8999939,131.6000061,122.0999985,122.3000031,120.5,119.8000031,115.6999969,111.9000015,109.0999985,112.0999985,107.5,109.0999985,104,104.0999985,100.0999985,97.90000153,111,104.1999969,115.1999969,112.6999969,114.5,114.5999985,112.4000015,107.9000015 +38,9.882993592,95.15555784,0.174546391,32.04000015,102.5999985,117.5999985,113.5,106.4000015,105.4000015,108.8000031,109.5,111.8000031,113.5,115.4000015,117.1999969,116.6999969,117.0999985,117.5999985,119.9000015,115.5999985,106.3000031,105.5999985,107,105.4000015,106,102.5999985,100.3000031,94,95.5,96.19999695,91.19999695,91.80000305,93.5,92.09999847,91.90000153,88.69999695,84.40000153,80.09999847 +39,9.913661109,81.00000042,0.174207053,24.97999992,114.3000031,158.1000061,160.6999969,132.1999969,131.6999969,140,141.1999969,145.8000031,160.6999969,161.5,160.3999939,160.3000031,168.6000061,158.1000061,163.1000061,157.6999969,141.1999969,128.8999939,125.6999969,124.8000031,110.4000015,114.3000031,111.4000015,96.90000153,109.0999985,110.8000031,108.4000015,111.1999969,115,110.3000031,108.8000031,102.9000015,104.8000031,90.5 +3,10.07655864,89.42222341,0.173532382,24.28000031,90.09999847,120.1999969,127.0999985,123,121,123.5,124.4000015,126.6999969,127.0999985,128,126.4000015,126.0999985,121.9000015,120.1999969,118.5999985,115.4000015,110.8000031,104.8000031,102.8000031,99.69999695,97.5,90.09999847,82.40000153,77.80000305,68.69999695,67.5,63.40000153,58.59999847,56.40000153,54.5,53.79999924,52.29999924,47.20000076,41.59999847 diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index 2993c05a22..c5d330ab55 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -22,7 +22,7 @@ if (exists("test.data.table", .GlobalEnv, inherits=FALSE)) { as.ITime.default = data.table:::as.ITime.default binary = data.table:::binary brackify = data.table:::brackify - 
chmatch2 = data.table:::chmatch2 + chmatchdup = data.table:::chmatchdup compactprint = data.table:::compactprint cube.data.table = data.table:::cube.data.table dcast.data.table = data.table:::dcast.data.table @@ -1017,7 +1017,7 @@ test(350.6, DT[c(0,0,0), .N], 0L) # Test recycling list() on RHS of := DT = data.table(a=1:3,b=4:6,c=7:9,d=10:12) test(351, DT[,c("a","b"):=list(13:15)], data.table(a=13:15,b=13:15,c=7:9,d=10:12)) -test(352, DT[,letters[1:4]:=list(1L,NULL)], data.table(a=c(1L,1L,1L),c=c(1L,1L,1L))) +test(352, DT[,letters[1:4]:=list(1L,NULL)], error="Supplied 4 columns to be assigned 2 items. Please see NEWS for v1.12.2") # Test assigning new levels into factor columns DT = data.table(f=factor(c("a","b")),x=1:4) @@ -1313,9 +1313,9 @@ test(443, rbind(DT,data.table(a=4L,b=7L)), data.table(a=1:4,b=4:7)) test(444, rbind(DT,list(b=7L,a=4L)), data.table(a=1:4,b=4:7)) # rbind should by default check row names. Don't warn here. Add clear documentation instead. test(445, rbind(DT,data.frame(b=7L,a=4L)), data.table(a=1:4,b=4:7)) test(446, rbind(DT,data.table(b=7L,a=4L)), data.table(a=1:4,b=4:7)) -test(450, rbind(DT,list(c=4L,a=7L)), error="This could be because the items in the list may not ") -test(451, rbind(DT,data.frame(c=4L,a=7L)), error="This could be because the items in the list may not ") -test(452, rbind(DT,data.table(c=4L,a=7L)), error="This could be because the items in the list may not ") +test(450, rbind(DT,list(c=4L,a=7L)), error=tt<-"Column 1 ['c'] of item 2 is missing in item 1. 
Use fill=TRUE to fill with NA (NULL for list columns)") +test(451, rbind(DT,data.frame(c=4L,a=7L)), error=tt) +test(452, rbind(DT,data.table(c=4L,a=7L)), error=tt) test(453, rbind(DT,list(4L,7L)), data.table(a=1:4,b=4:7)) # Test new use.names argument in 1.8.0 @@ -1775,13 +1775,11 @@ t = as.ITime(strptime(c("09:10:00","09:11:00","09:11:00","09:12:00"),"%H:%M:%S") test(626, unique(t), t[c(1,2,4)]) test(627, class(unique(t)), "ITime") -# Test recycling list() rbind - with recent C-level changes, this seems not possible (like rbindlist) -# old test commented. -# test(628, rbind(data.table(a=1:3,b=5:7,c=list(1:2,1:3,1:4)), list(4L,8L,as.list(1:3))), -# data.table(a=c(1:3,rep(4L,3L)),b=c(5:7,rep(8L,3L)),c=list(1:2,1:3,1:4,1L,2L,3L))) -test(628, rbind(data.table(a=1:3,b=5:7,c=list(1:2,1:3,1:4)), list(4L,8L,as.list(1:3))), error = "inconsistent with first column of that item which is length") +# Test recycling list() rbind; #524. This was commented out until v1.12.2 when it was reinstated in PR#3455 +test(628.1, rbind(data.table(a=1:3,b=5:7,c=list(1:2,1:3,1:4)), list(4L,8L,as.list(1:3))), + data.table(a=c(1:3,rep(4L,3L)),b=c(5:7,rep(8L,3L)),c=list(1:2,1:3,1:4,1L,2L,3L))) # Test switch in .rbind.data.table for factor columns -test(628.5, rbind(data.table(a=1:3,b=factor(letters[1:3]),c=factor("foo")), list(4L,factor("d"),factor("bar"))), +test(628.2, rbind(data.table(a=1:3,b=factor(letters[1:3]),c=factor("foo")), list(4L,factor("d"),factor("bar"))), data.table(a=1:4,b=factor(letters[1:4]),c=factor(c(rep("foo",3),"bar"), levels = c("foo", "bar")))) # Test merge with common names and all.y=TRUE, #2011 @@ -1930,13 +1928,16 @@ l = list(data.table(a=1:2, b=7:8), data.table(b=13:14), list(15:16,17L), list(c(18,19),20:21)) -test(676, rbindlist(l[1:3]), data.table(a=1:6,b=7:12)) -test(677, rbindlist(l[c(10,1,10,2,10)]), data.table(a=1:4,b=7:10)) # NULL items ignored +test(676.1, rbindlist(l[1:3]), ans<-data.table(a=1:6,b=7:12), warning="Column 2 [[]'V2'[]] of item 2 is missing in 
item 1.*Use fill=TRUE.*or use.names=FALSE") +test(676.2, rbindlist(l[1:3], use.names=FALSE), ans) +test(677.1, rbindlist(l[c(10,1,10,2,10)]), ans<-data.table(a=1:4,b=7:10), warning="Column 2 [[]'V2'[]] of item 4 is missing in item 2") # NULL items ignored +test(677.2, rbindlist(l[c(10,1,10,2,10)], use.names=FALSE), ans) test(678, rbindlist(l[c(1,4)]), error="Item 2 has 1 columns, inconsistent with item 1 which has 2") -test(679, rbindlist(l[c(1:2,5)]), error="Column 2 of item 3 is length 1, inconsistent with first column of that item which is length 2.") -test(680, rbindlist(l[c(2,6)]), data.table(a=c(3,4,18,19), V2=c(9:10,20:21))) # coerces 18 and 19 to numeric (with eddi's changes in commit 1012 - highest type is preserved now) --- Caught and changed by Arun on 26th Jan 2014 (in commit 1099). -### ----> Therefore this TO DO may not be necessary here anymore (added by Arun 26th Jan 2014) ---> # TO DO when options(datatable.pedantic=TRUE): test(680.5, rbindlist(l[c(2,6)]), warning="Column 1 of item 2 is type 'double', inconsistent with column 1 of item 1's type ('integer')") -test(681, rbindlist(list(data.table(a=letters[1:2],b=c(1.2,1.3),c=1:2), list("c",1.4,3L), NULL, list(letters[4:6],c(1.5,1.6,1.7),4:6))), data.table(a=letters[1:6], b=seq(1.2,1.7,by=0.1), c=1:6)) +test(679.1, rbindlist(l[c(1:2,5)]), ans<-data.table(a=c(1:4,15:16), b=c(7:10,17L,17L)), warning="Column 2 [[]'V2'[]] of item 2 is missing in item 1") +test(679.2, rbindlist(l[c(1:2,5)], use.names=FALSE), ans) +test(680, rbindlist(l[c(2,6)]), data.table(a=c(3,4,18,19), V2=c(9:10,20:21))) # coerces 18 and 19 to numeric +test(681, rbindlist(list(data.table(a=letters[1:2],b=c(1.2,1.3),c=1:2), list("c",1.4,3L), NULL, list(letters[4:6],c(1.5,1.6,1.7),4:6))), + data.table(a=letters[1:6], b=seq(1.2,1.7,by=0.1), c=1:6)) test(682, rbindlist(NULL), data.table(NULL)) test(683, rbindlist(list()), data.table(NULL)) test(684, rbindlist(list(NULL)), data.table(NULL)) @@ -2108,7 +2109,7 @@ test(753.1, 
DT[,c("x1","x2"):=4:6, verbose = TRUE], data.table(a=letters[1:3],x= output = "RHS for item 2 has been duplicated") test(753.2, DT[2,x2:=7L], data.table(a=letters[1:3],x=3:1,x1=4:6,x2=c(4L,7L,6L),key="a")) DT = data.table(a=letters[3:1],x=1:3,y=4:6) -test(754, setkey(DT[,c("x1","y1","x2","y2"):=list(x,y)],a), data.table(a=letters[1:3],x=3:1,y=6:4,x1=3:1,y1=6:4,x2=3:1,y2=6:4,key="a")) +test(754, DT[,c("x1","y1","x2"):=list(x,y)], error="Supplied 3 columns to be assigned 2 items. Please see NEWS for v1.12.2") # And non-recycling i.e. that a single column copy does copy the column DT = data.table(a=1:3) test(754.1, DT[,b:=a][1,a:=4L][2,b:=5L], data.table(a=INT(4,2,3),b=INT(1,5,3))) @@ -2392,9 +2393,13 @@ test(863, after < before+0.5) # Even if data.table is empty, as long as there are column names, they should be considered. # Ex: What if all data.tables are empty? What'll be the column name then? # If there are no names, then the first non-empty set of names will be allocated. -test(864.1, rbindlist(list(data.table(foo=logical(0),bar=logical(0)), DT<-data.table(baz=letters[1:3],qux=4:6))), setnames(DT, c("foo", "bar"))) +test(864.1, rbindlist(list(data.table(foo=logical(0),bar=logical(0)), DT<-data.table(baz=letters[1:3],qux=4:6))), + setnames(DT, c("foo", "bar")), + warning="Column 1 [[]'baz'[]] of item 2 is missing in item 1.*Use fill=TRUE.*or use.names=FALSE.*v1.12.2") # test 676 tests no warning when use.names=FALSE test(864.2, rbindlist(list(list(logical(0),logical(0)), DT<-data.table(baz=letters[1:3],qux=4:6))), DT) -test(864.3, rbindlist(list(data.table(logical(0),logical(0)), DT<-data.table(baz=letters[1:3],qux=4:6))), setnames(DT, c("V1", "V2"))) +test(864.3, rbindlist(list(data.table(logical(0),logical(0)), DT<-data.table(baz=letters[1:3],qux=4:6))), + setnames(DT, c("V1", "V2")), + warning="Column 1 [[]'baz'[]] of item 2 is missing in item 1.*Use fill=TRUE.*or use.names=FALSE.*v1.12.2") # Steve's find that setnames failed for numeric 'old' when pointing to 
duplicated names DT = data.table(a=1:3,b=1:3,v=1:6,w=1:6) @@ -2924,11 +2929,12 @@ test(1034, as.data.table(x<-as.character(sample(letters, 5))), data.table(V1=x)) # reshape2 does not need to be loaded to run these. # We run these routinely, in dev by cc(), on Travis (coverage) and on CRAN set.seed(45) + N=18L # increased in v1.12.2 from 6 to 18 to get NA in f_1 for coverage DT <- data.table( - i_1 = c(1:5, NA), - i_2 = c(NA,6,7,8,9,10), - f_1 = factor(sample(c(letters[1:3], NA), 6, TRUE)), - c_1 = sample(c(letters[1:3], NA), 6, TRUE), + i_1 = c(1:(N-1L), NA), + i_2 = c(NA,(N:(2L*N-2L))), + f_1 = factor(sample(c(letters[1:3], NA), N, TRUE)), + c_1 = sample(c(letters[1:3], NA), N, TRUE), d_1 = as.Date(c(1:3,NA,4:5), origin="2013-09-01"), d_2 = as.Date(6:1, origin="2012-01-01")) DT[, l_1 := DT[, list(c=list(rep(i_1, sample(5,1)))), by = i_1]$c] # generate list cols @@ -2941,15 +2947,15 @@ test(1034, as.data.table(x<-as.character(sample(letters, 5))), data.table(V1=x)) test(1036, melt(DT, id=c("i_1", "i_2", "l_2"), measure=c("l_1")), ans1) # melt retains attributes if all are of same type (new) - ans2 = data.table(c_1=DT$c_1, variable=rep(c("d_1", "d_2"), each=6), value=as.Date(c(DT$d_1, DT$d_2)))[!is.na(value)] + ans2 = data.table(c_1=DT$c_1, variable=rep(c("d_1", "d_2"), each=N), value=as.Date(c(DT$d_1, DT$d_2)))[!is.na(value)] test(1037, melt(DT, id=4, measure=5:6, na.rm=TRUE, variable.factor=FALSE), ans2) DT2 <- data.table(x=1:5, y=1+5i) # unimplemented class test(1038, melt(DT2, id=1), error="Unknown column type 'complex'") # more tests - DT[, f_2 := factor(c("z", "a", "x", "z", "a", "a"), ordered=TRUE)] - DT[, id := 1:6] + DT[, f_2 := factor(sample(letters, N), ordered=TRUE)] + DT[, id := 1:N] ans1 = cbind(melt(DT, id="id", measure=5:6, value.name="value1"), melt(DT, id=integer(0), measure=7:8, value.name="value2")[, variable:=NULL]) levels(ans1$variable) = as.character(1:2) test(1038.2, ans1, melt(DT, id="id", measure=list(5:6, 7:8))) @@ -2966,6 +2972,10 @@ 
test(1034, as.data.table(x<-as.character(sample(letters, 5))), data.table(V1=x)) levels(ans$variable) = as.character(1:2) test(1038.6, melt(DT, id="id", measure=list(c("c_1", "c_1"), c("f_1", "f_2"))), ans) + # non ordered factors + DT[, f_2 := factor(sample(letters, N), ordered=FALSE)] + test(1039, melt(DT, id="id", measure=c("f_1", "f_2"), value.factor=TRUE)$value, factor(c(as.character(DT$f_1), as.character(DT$f_2)), ordered=FALSE)) + # test to ensure attributes on non-factor id-columns are preserved after melt DT <- data.table(x=1:3, y=letters[1:3], z1=8:10, z2=11:13) setattr(DT$x, 'foo', 'bla1') @@ -3013,6 +3023,72 @@ test(1034, as.data.table(x<-as.character(sample(letters, 5))), data.table(V1=x)) test(1569.3, melt(dt, id=NULL, measure=-1), error="One or more values in 'measure.vars'") test(1569.4, melt(dt, id=5, measure=-1), error="One or more values in 'id.vars'") test(1569.5, melt(dt, id=1, measure=-1), error="One or more values in 'measure.vars'") + + if (test_R.utils) { + # dup names in variable used to generate malformed factor error and/or segfault, #1754 + R.utils::decompressFile(testDir("melt_1754.R.gz"), tt<-tempfile(), remove=FALSE, FUN=gzfile, ext=NULL) + source(tt, local=TRUE) # creates DT + test(1570.01, dim(DT), INT(1,327)) + test(1570.02, dim(ans<-melt(DT, 1:2)), INT(325,4), warning="All measure variables not of type 'character' will be coerced") + test(1570.03, length(levels(ans$variable)), 317L) + test(1570.04, levels(ans$variable)[c(1,2,316,317)], + tt <- c("Geography", + "Estimate; SEX AND AGE - Total population", + "Percent; HISPANIC OR LATINO AND RACE - Total housing units", + "Percent Margin of Error; HISPANIC OR LATINO AND RACE - Total housing units")) + test(1570.05, range(as.integer(ans$variable)), INT(1,317)) + test(1570.06, as.vector(table(table(as.integer(ans$variable)))), INT(309,8)) + test(1570.07, sapply(ans, class), c(Id="character",Id2="integer",variable="factor",value="character")) + test(1570.08, dim(ans<-melt(DT, 1:2, 
variable.factor=FALSE)), INT(325,4), warning="All measure variables not of type 'character' will be coerced") + test(1570.09, sapply(ans, class), c(Id="character",Id2="integer",variable="character",value="character")) + test(1570.10, ans$variable[c(1,2,324,325)], tt) + } + + # more from #1754 + DT = fread(testDir("melt_1754_synth.csv")) + test(1571.1, names(DT)[duplicated(names(DT))], c("smoking75","smoking80","smoking88")) + test(1571.2, dim(ans<-melt(DT, id.vars=c("state","income","retailprice","percent_15_19","beercons"), measure=patterns("^smoking"))), INT(1326,7)) + test(1571.3, print(ans[c(1,1326)]), output="state.*income.*retailprice.*percent_15_19.*beercons.*variable.*value.*1.*9.6.*89.34.*smoking88.*smoking00.*41.6") + + # more from #1754 + DT = setDT(data.frame("Time.point" = seq(0, 6), "Time.(h)" = c(0.0, 0.5, 1.0, 3.0, 5.0, 7.0, 24.0), + "NEW.ME" = runif(7), "NEW.ME" = runif(7), check.names = FALSE)) + test(1572.1, dim(melt(DT, c("Time.point", "Time.(h)"), na.rm=TRUE)), INT(14, 4)) + DT = setDT(data.frame("Time.point" = seq(0, 6), "Time.(h)" = c(0.0, 0.5, 1.0, 3.0, 5.0, 7.0, 24.0), + "NEW.ME" = runif(7), "NEW.ME" = runif(7), "NEW.ME" = runif(7), "NEW.ME" = runif(7), "NEW.ME" = runif(7), + "NEW.ME" = runif(7), "NEW.ME" = runif(7), "NEW.ME" = runif(7), "NEW.MER" = runif(7), "F050" = runif(7), + "NEW.MER" = runif(7), "F16-42-123p123C" = runif(7), "F16-42-123p123C" = runif(7), "NEW.MER" = runif(7), + "F16-42-123p123C" = runif(7), check.names = FALSE)) + test(1572.2, unique(names(DT)[duplicated(names(DT))]), c("NEW.ME","NEW.MER","F16-42-123p123C")) + test(1572.3, dim(melt(DT, c("Time.point", "Time.(h)"), na.rm = TRUE)), INT(105,4)) + + # more from #1754 + DT = fread( +"month,Record high,Average high,Daily mean,Average low,Record low,Average precipitation,Average rainfall,Average snowfall,Average precipitation,Average rainy,Average snowy,Mean monthly sunshine hours +Jan,12.8,-5.4,-8.9,-12.4,-33.5,73.6,28.4,45.9,15.8,4.3,13.6,99.2 
+Feb,15,-3.7,-7.2,-10.6,-33.3,70.9,22.7,46.6,12.8,4,11.1,119.5 +Mar,25.9,2.4,-1.2,-4.8,-28.9,80.2,42.2,36.8,13.6,7.4,8.3,158.8 +Apr,30.1,11,7,2.9,-17.8,76.9,65.2,11.8,12.5,10.9,3,181.7 +May,34.2,19,14.5,10,-5,86.5,86.5,0.4,12.9,12.8,0.14,229.8 +Jun,34.5,23.7,19.3,14.9,1.1,87.5,87.5,0,13.8,13.8,0,250.1 +Jul,36.1,26.6,22.3,17.9,7.8,106.2,106.2,0,12.3,12.3,0,271.6 +Aug,35.6,24.8,20.8,16.7,6.1,100.6,100.6,0,13.4,13.4,0,230.7 +Sep,33.5,19.4,15.7,11.9,0,100.8,100.8,0,12.7,12.7,0,174.1") + test(1573, print(melt(DT, id.vars="month", verbose=TRUE)), output="'measure.vars' is missing.*Assigned.*are.*Record high.*1:.*Jan.*Record high.*12.8.*108:.*Sep.*sunshine hours.*174.1") + + # coverage of reworked fmelt.c:getvarcols; #1754 + # missing id satisfies data->lvalues!=1 at C level to test those branches + x = data.table(x1=1:2, x2=3:4, y1=5:6, y2=7:8, z1=9:10, z2=11:12) + test(1574.1, dim(ans<-melt(x, measure.vars=patterns("^y", "^z"))), INT(4,5)) + test(1574.2, ans$variable, factor(c("1","1","2","2"))) + test(1574.3, dim(ans<-melt(x, measure.vars=patterns("^y", "^z"), variable.factor=FALSE)), INT(4,5)) + test(1574.4, ans$variable, c("1","1","2","2")) + x[, c("y1","z1"):=NA] + test(1574.5, dim(melt(x, measure.vars=patterns("^y", "^z"))), INT(4,5)) + test(1574.6, dim(ans<-melt(x, measure.vars=patterns("^y", "^z"), na.rm=TRUE)), INT(2,5)) + test(1574.7, ans$variable, factor(c("1","1"))) + test(1574.8, dim(ans<-melt(x, measure.vars=patterns("^y", "^z"), na.rm=TRUE, variable.factor=FALSE)), INT(2,5)) + test(1574.9, ans$variable, c("1","1")) } # sorting and grouping of Inf, -Inf, NA and NaN, #4684, #4815 & #4883 @@ -3485,29 +3561,28 @@ test(1118, dt[, lapply(.SD, function(y) weighted.mean(y, b2, na.rm=TRUE)), by=x] DT <- data.table(x=5:1, y=1:5, key="y") test(1119, is.null(key(DT[, list(z = y, y = 1/y)]))) - ## various ordered factor rbind tests -DT = data.table(ordered('a', levels = c('a','b','c'))) -DT1 = data.table(factor('a', levels = c('b','a','f'))) -DT2 = 
data.table(ordered('b', levels = c('b','d','c'))) -DT3 = data.table(c('foo', 'bar')) -DT4 = data.table(ordered('a', levels = c('b', 'a'))) - -test(1120, rbind(DT, DT1, DT2, DT3), data.table(ordered(c('a','a','b', 'foo', 'bar'), levels = c('a','b','d','c','f', 'foo', 'bar')))) -test(1121, rbindlist(list(DT, DT1, DT2, DT3)), data.table(ordered(c('a','a','b', 'foo', 'bar'), levels = c('a','b','d','c','f', 'foo', 'bar')))) -test(1122, rbind(DT, DT4), data.table(factor(c('a','a'), levels = c('a','b','c'))), warning="ordered factor levels cannot be combined, going to convert to simple factor instead") -test(1123, rbindlist(list(DT, DT4)), data.table(factor(c('a','a'), levels = c('a','b','c'))), warning="ordered factor levels cannot be combined, going to convert to simple factor instead") -test(1124, rbind(DT1, DT1), data.table(factor(c('a','a'), levels = c('b','a','f')))) -test(1125, rbindlist(list(DT1, DT1)), data.table(factor(c('a','a'), levels = c('b','a','f')))) - -# coverage of rbindlist.c:289, #2346. -# The hashing there hashes pointer address (CHARSXP) so this test attempts to use a large enough -# sample of unique strings to generate that condition reliably to test that collision branch. 
+DT1 = data.table(ordered('a', levels = c('a','b','c'))) +DT2 = data.table(factor('a', levels = c('b','a','f'))) +DT3 = data.table(ordered('b', levels = c('b','d','c'))) +DT4 = data.table(c('foo', 'bar')) +DT5 = data.table(ordered('b', levels = c('b','a'))) +test(1120.1, rbind(DT1, DT2, DT3, DT4), ans<-data.table(factor(c('a','a','b','foo','bar'), levels = c('a','b','c','f','d', 'foo', 'bar'))), + warning=w<-"Column 1 of item 3.*level 2 [[]'d'[]] is missing from the ordered levels from column 1 of item 1.*regular factor") +test(1120.2, rbindlist(list(DT1, DT2, DT3, DT4)), ans, warning=w) +test(1121.1, rbind(DT1, DT5), ans<-data.table(factor(c('a','b'), levels = c('a','b','c'))), warning=w<-"'b'<'a'.*But 'a'<'b'.*regular factor") +test(1121.2, rbindlist(list(DT1, DT5)), ans, warning=w) +test(1122.1, rbind(DT2, DT2), data.table(factor(c('a','a'), levels = c('b','a','f')))) +test(1122.2, rbindlist(list(DT2, DT2)), data.table(factor(c('a','a'), levels = c('b','a','f')))) +test(1123.1, rbind(DT2,DT5), data.table(ordered(c('a','b'), levels=c('b','a','f')))) +test(1123.2, rbind(DT5,DT2), data.table(ordered(c('b','a'), levels=c('b','a','f')))) + +# Old test to cover pre-PR#3455 rbindlist.c:289, #2346 (hashing CHARSXP no longer done) set.seed(1) manyChars = paste0("id",sample(99999,10000)) DT1 = data.table(ordered(sample(manyChars, 1000), levels=sample(manyChars))) DT2 = data.table(factor(sample(manyChars, 1000))) -test(1125.1, rbindlist(list(DT1,DT2))[c(1,2,.N-1,.N),as.character(V1)], c("id85645","id80957","id73436","id33445")) +test(1125, rbindlist(list(DT1,DT2))[c(1,2,.N-1,.N),as.character(V1)], c("id85645","id80957","id73436","id33445")) ## test rbind(..., fill = TRUE) DT = data.table(a = 1:2, b = 1:2) @@ -3908,22 +3983,20 @@ A <- data.table(x=factor(1), key='x') B <- data.table(x=factor(), key='x') test(1168.1, rbindlist(list(B,A)), data.table(x=factor(1))) -# fix for bug #5120, it's related to rbind and factors as well - more or less similar to 1168.1 (#5355). 
Seems to have been fixed with that commit. Just adding test here. +# fix for bug #5120, it's related to rbind and factors as well - more or less similar to 1168.1 (#5355). tmp1 <- as.data.table(structure(list(Year = 2013L, Maturity = structure(1L, .Label = c("<1", -"1.0 - 1.5", "1.5 - 2.0", "2.0 - 2.5", "2.5 - 3.0", "3.0 - 4.0", -"4.0 - 5.0", ">5.0"), class = "factor"), Quality = structure(2L, .Label = c(">BBB", -"BBB", "BB", "B", "CCC", "5.0"), class = "factor"), Quality = structure(2L, .Label = c(">BBB", + "BBB", "BB", "B", "CCC", " 5] setnames(dt3 <- copy(dt2), c("A", "B")) -test(1288.03, names(rbindlist(list(dt2,dt3))), c("x", "y")) -test(1288.04, names(rbindlist(list(dt3,dt2))), c("A", "B")) -test(1288.05, names(rbindlist(list(dt1,dt3))), c("x", "y")) -test(1288.06, names(rbindlist(list(dt3,dt1))), c("A", "B")) +test(1288.03, names(rbindlist(list(dt2,dt3), use.names=FALSE)), c("x", "y")) # use.names=FALSE to avoid new warning in v1.12.2; PR#3455 +test(1288.04, names(rbindlist(list(dt3,dt2), use.names=FALSE)), c("A", "B")) +test(1288.05, names(rbindlist(list(dt1,dt3), use.names=FALSE)), c("x", "y")) +test(1288.06, names(rbindlist(list(dt3,dt1), use.names=FALSE)), c("A", "B")) # check fix for bug #5612 DT <- data.table(x=c(1,2,3)) @@ -4638,13 +4711,14 @@ test(1288.11, rbindlist(ll, use.names=TRUE), data.table(a=c(1:3, 5:7), b=c(4:6, ll <- list(list(1:3, 4:6), list(a=5:7, b=8:10)) test(1288.12, rbindlist(ll, use.names=TRUE), data.table(a=c(1:3, 5:7), b=c(4:6, 8:10))) ll <- list(list(a=1:3, 4:6), list(5:7, b=8:10)) -test(1288.13, rbindlist(ll, use.names=TRUE), error="Answer requires 3 columns whereas one or more item(s) in the input list has only 2 columns. This could be because the items in the list may not") +test(1288.13, rbindlist(ll, use.names=TRUE), error="Column 2 ['b'] of item 2 is missing in item 1. 
Use fill=TRUE to fill with NA") ll <- list(list(a=1:3, 4:6), list(5:7, b=8:10)) test(1288.14, rbindlist(ll, fill=TRUE), data.table(a=c(1:3, rep(NA_integer_,3L)), V1=c(4:6,5:7), b=c(rep(NA_integer_, 3L), 8:10))) ll <- list(list(1:3, 4:6), list(5:7, 8:10)) -test(1288.15, rbindlist(ll, fill=TRUE), error="fill=TRUE, but names of input list at position 1") -ll <- list(list(1:3, 4:6), list(a=5:7, b=8:10)) -test(1288.16, rbindlist(ll, fill=TRUE), error="fill=TRUE, but names of input list at position 1") +test(1288.15, rbindlist(ll, fill=TRUE), error="use.names=TRUE but no item of input list has any names") +ll <- list(list(1:3, 6:8), list(a=4:5, b=9:10)) +test(1288.16, rbindlist(ll), data.table(a=1:5, b=6:10)) +test(1288.17, rbindlist(ll, fill=TRUE), data.table(a=1:5, b=6:10)) # fix for #5647 dt = data.table(x=1L, y=1:10) @@ -6205,14 +6279,16 @@ test(1454.2, fread('"Foo"\n5\n',sep="`"), data.table(Foo=5L)) DT <- data.table(a=c(1, 1, 1, 0, 0), b=c("A", "B", "A1", "A", "B")) test(1455, DT[, nrow(.SD[b == 'B']), by=.(a)], data.table(a=c(1,0), V1=1L)) -# Test for chmatch2 bug fix +# chmatchdup ... 
x1 = c("b", "a", "d", "a", "c", "a") x2 = c("a", "a", "a") x3 = c("d", "a", "a", "d", "a") table = rep(letters[1:3], each=2) -test(1456.1, chmatch2(x1, table), as.integer(c(3,1,NA,2,5,NA))) -test(1456.2, chmatch2(x2, table), as.integer(c(1,2,NA))) -test(1456.3, chmatch2(x3, table), as.integer(c(NA,1,2,NA,NA))) +test(1456.1, chmatchdup(x1, table), as.integer(c(3,1,NA,2,5,NA))) +test(1456.2, chmatchdup(x2, table), as.integer(c(1,2,NA))) +test(1456.3, chmatchdup(x3, table), as.integer(c(NA,1,2,NA,NA))) +test(1457.1, chmatchdup(c("x","x","x","x"), c("x","y","x","x","y","z")), INT(1,3,4,NA)) +test(1457.2, base::pmatch(c("x","x","x","x"), c("x","y","x","x","y","z")), INT(1,3,4,NA)) # Add tests for which_ x = sample(c(-5:5, NA), 25, TRUE) @@ -6739,8 +6815,8 @@ test(1493, dt[, .(x=sum(x)),by= x %% 2, verbose=TRUE], data.table(`x%%2`=c(1,0), # Fix for #705 DT1 = data.table(date=as.POSIXct("2014-06-22", format="%Y-%m-%d", tz="GMT")) DT2 = data.table(date=as.Date("2014-06-23")) -test(1494.1, rbind(DT1, DT2), error="Class attributes at column") -test(1494.2, rbind(DT2, DT1), error="Class attributes at column") +test(1494.1, rbind(DT1, DT2), error="Class attribute on column") +test(1494.2, rbind(DT2, DT1), error="Class attribute on column") # test 1495 has been added to melt's test section (fix for #1055) @@ -13626,6 +13702,135 @@ dx = data.table(id = 1L, key = "id") di = list(z=c(2L, 1L)) test(1999.2, key(dx[di]), NULL) +# chmatchdup test from benchmark at the bottom of chmatch.c +set.seed(45L) +x = sample(letters, 1e5, TRUE) +y = sample(letters, 1e6, TRUE) +test(2000, c(head(ans<-chmatchdup(x,y,0L)),tail(ans)), INT(7,49,11,20,69,25,99365,100750,97596,99671,103320,99406)) +rm(list=c("x","y")) + +# rbindlist use.names=TRUE returned random column order when ncol>255; #3373 +DT = setDT(replicate(300, rnorm(3L), simplify = FALSE)) +test(2001.1, colnames(rbind(DT[1], DT[3])), colnames(DT)) +# and use.names=TRUE keeps dups in original location; mentioned in #3373 +DT1 = 
data.table(a=1L, b=3L, c=5L, b=7L) +DT2 = data.table(a=2L, b=4L, c=6L, b=8L) +test(2001.2, rbind(DT1, DT2, use.names = TRUE), data.table(a=1:2, b=3:4, c=5:6, b=7:8)) # dup of b at the end; was a,b,b,c + +# rbindlist now fills NULL and empty columns with NA with warning, #1871 +test(2002.1, rbindlist( list(list(a=1L, b=2L, x=NULL), list(a=2L, b=3L, x=10L)) ), + data.table(a=1:2, b=2:3, x=INT(NA,10)), + warning="Column 3 ['x'] of item 1 is length 0. This (and 0 others like it) has been filled with NA (NULL for list columns) to make each item uniform.") +test(2002.2, rbindlist( list(list(a=1L, b=2L, x=NULL), list(a=2L, b=NULL, x=10L)) ), + data.table(a=1:2, b=INT(2,NA), x=INT(NA,10)), + warning="Column 3 ['x'] of item 1 is length 0. This (and 1 other like it) has been filled with NA (NULL for list columns) to make each item uniform.") +test(2002.3, rbindlist( list(list(a=1L, b=2L, x=NULL), list(a=2L, b=NULL, x=NULL)) ), + data.table(a=1:2, b=INT(2,NA), x=c(NA,NA)), + warning="Column 3 ['x'] of item 1 is length 0. This (and 2 others like it) has been filled with NA (NULL for list columns) to make each item uniform.") +# tests from #1302 +test(2002.4, rbindlist( list(list(a=1L,z=list()), list(a=2L, z=list("m"))) ), + data.table(a=1:2, z=list(NULL, "m")), + warning="Column 2 ['z'] of item 1 is length 0. This (and 0 others like it) has been filled with NA") +test(2002.5, rbindlist( list( list(a=1L, z=list("z")), list(a=2L, z=list(c("a","b"))) )), + data.table(a=1:2, z=list("z", c("a","b")))) +test(2002.6, rbindlist( list( list(a=1:2, z=list("z",1,"k")), list(a=2, z=list(c("a","b"))) )), + error="Column 1 of item 1 is length 2 inconsistent with column 2 which is length 3. 
Only length-1 columns are recycled.") +test(2002.7, rbindlist( list(list(a=1L, z=list(list())), list(a=2L, z=list(list("m")))) ), + data.table(a=1:2, z=list(list(),list("m")))) +test(2002.8, rbindlist( list(list(a=1L, z=list(list("z"))), list(a=2L, z=list(list(c("a","b"))))) ), + data.table(a=1:2, z=list(list("z"), list(c("a","b"))))) +test(2002.9, rbindlist( list(list(a=1L, z=list(list("z",1))), list(a=2L, z=list(list(c("a","b"))))) ), + data.table(a=1:2, z=list(list("z",1), list(c("a","b"))))) +# tests from #3343 +DT1=list(a=NULL); setDT(DT1) +DT2=list(a=NULL); setDT(DT2) +test(2002.10, rbind(DT1, DT2), data.table(a=logical())) +test(2002.11, rbind(A=DT1, B=DT2, idcol='id'), data.table(id=character(), a=logical())) +test(2002.12, rbind(DT1, DT2, idcol='id'), data.table(id=integer(), a=logical())) + +#rbindlist coverage +test(2003.1, rbindlist(list(), use.names=1), error="use.names= should be TRUE, FALSE, or not used [(]\"check\" by default[)]") +test(2003.2, rbindlist(list(), fill=1), error="fill= should be TRUE or FALSE") +test(2003.3, rbindlist(list(data.table(a=1:2), data.table(b=3:4)), fill=TRUE, use.names=FALSE), + data.table(a=c(1:2,NA,NA), b=c(NA,NA,3:4)), + warning="use.names= cannot be FALSE when fill is TRUE. 
Setting use.names=TRUE") + +# chmatch coverage for two different non-ascii encodings matching; issues mentioned in comments in chmatch.c #5159 #2538 #4818 +x1 = "fa\xE7ile" +Encoding(x1) = "latin1" +x2 = iconv(x1, "latin1", "UTF-8") +test(2004.1, identical(x1,x2)) +test(2004.2, Encoding(x1)!=Encoding(x2)) +test(2004.3, chmatch(c("a",x1,"b"), x2), c(NA,1L,NA)) # x contains mixed; covers first fallback in chmatchMain +test(2004.4, c("a",x1,"b") %chin% x2, c(FALSE,TRUE,FALSE)) # and the chin switch in the same fallback +test(2004.5, chmatch(c("a","b"), c("b",x1)), c(NA,1L)) # x doesn't contain encodings so covers the second fallback in chmatchMain +test(2004.6, chmatch(c("a","b"), c("b",x2)), c(NA,1L)) # the second fallback might be redundant though; see comments in chmatch.c +test(2004.7, c("a","b") %in% c("b",x1,x2), c(FALSE, TRUE)) # the second fallback might be redundant though; see comments in chmatch.c + +# more coverage ... +test(2005.1, truelength(NULL), 0L) +DT = data.table(a=1:3, b=4:6) +test(2005.2, set(DT, 4L, "b", NA), error="i[1] is 4 which is out of range [1,nrow=3]") +test(2005.3, set(DT, 3L, 8i, NA), error="j is type 'complex'. Must be integer, character, or numeric is coerced with warning.") +test(2005.4, set(DT, 1L, 2L, expression(x+2)), error="RHS of assignment is not NULL, not an an atomic vector (see ?is.atomic) and not a list column.") +DT[,foo:=factor(c("a","b","c"))] +test(2005.5, DT[2, foo:=8i], error="Can't assign to column 'foo' (type 'factor') a value of type 'complex' (not character, factor, integer or numeric)") +test(2005.6, DT[2, a:=9, verbose=TRUE], output="Coerced length-1 RHS from double to integer to match column's type. No precision was lost. If this") +test(2005.7, DT[2, a:=NA, verbose=TRUE], output="Coerced length-1 RHS from logical to integer to match column's type. 
If this") +test(2005.8, DT[2, a:=9.9]$a, INT(1,9,3), warning="Coerced double RHS to integer.*One or more RHS values contain fractions which have been lost.*9.9.*has been truncated to 9") + +# rbindlist raw type, #2819 +test(2006.1, rbindlist(list(data.table(x = as.raw(1), y=as.raw(3)), data.table(x = as.raw(2))), fill=TRUE), data.table(x=as.raw(1:2), y=as.raw(c(3,0)))) +test(2006.2, rbindlist(list(data.table(x = as.raw(1:2), y=as.raw(5:6)), data.table(x = as.raw(3:5))), fill=TRUE), data.table(x=as.raw(1:5), y=as.raw(c(5:6,0,0,0)))) + +# rbindlist integer64, #1349 +if (test_bit64) { + test(2007.1, rbindlist(list( list(a=as.integer64(1), b=3L), list(a=2L, b=4L) )), data.table(a=as.integer64(1:2), b=3:4)) + test(2007.2, rbindlist(list( list(a=3.4, b=5L), list(a=as.integer64(4), b=6L) )), data.table(a=as.integer64(3:4), b=5:6), + warning="Column 1 of item 1: coerced to integer64 but contains a non-integer value [(]3.40.* at position 1[)]; precision lost") + test(2007.3, rbindlist(list( list(a=3.0, b=5L), list(a=as.integer64(4), b=6L) )), data.table(a=as.integer64(3:4), b=5:6)) + test(2007.4, rbindlist(list( list(b=5:6), list(a=as.integer64(4), b=7L)), fill=TRUE), data.table(b=5:7, a=as.integer64(c(NA,NA,4)))) # tests writeNA of integer64 + test(2007.5, rbindlist(list( list(a=INT(1,NA,-2)), list(a=as.integer64(c(3,NA))) )), data.table(a=as.integer64(c(1,NA,-2,3,NA)))) # int NAs combined with int64 NA + test(2007.6, rbind(data.table(a=as.raw(10), b=5L), data.table(a=as.integer64(11), b=6L)), data.table(a=as.integer64(10:11), b=5:6)) +} + +# reworked ordered-factor handling in PR#3455, expanded from test for #3032 +DT1 = data.table(x = ordered(vals<-c("b","b","e","f","c","c"), levels=c("f","b","e","c"))) +DT2 = data.table(x = ordered(vals, levels=c("f","e","b","c"))) +DT3 = data.table(x = ordered(vals, levels=c("f","b","e","c","d"))) +DT4 = data.table(x = ordered(vals, levels=c("f","b","e","c","a","p"))) +test(2008.1, DT1$x[3] < DT1$x[5]) # e1], error="Internal error: 
column 1 of data.table is NULL; malformed") +DT = null.data.table() +x = NULL +test(2009.2, DT[, .(x)], error="Column 1 of j evaluates to NULL. A NULL column is invalid.") +test(2009.3, data.table(character(0), NULL), error="column or argument 2 is NULL") +test(2009.4, as.data.table(list(y = character(0), x = NULL)), data.table(y=character())) + +# use.names=NA warning for out-of-order; https://github.com/Rdatatable/data.table/pull/3455#issuecomment-472744347 +DT1 = data.table(a=1:2, b=5:6) +DT2 = data.table(b=7:8, a=3:4) +test(2010.1, rbindlist(list(DT1,DT2)), ans<-data.table(a=c(1:2,7:8), b=c(5:6,3:4)), + warning="Column 2 [[]'a'[]] of item 2 appears in position 1 in item 1.*use.names=TRUE.*or use.names=FALSE.*v1.12.2") +test(2010.2, rbindlist(list(DT1,DT2), use.names=FALSE), ans) +test(2010.3, rbindlist(list(DT1,DT2), use.names=TRUE), data.table(a=1:4, b=5:8)) +test(2010.4, rbindlist(list(DT1,DT2), use.names=NA), error="use.names=NA invalid") +test(2010.5, rbindlist(list(DT1,DT2), use.names='check'), + error="use.names='check' cannot be used explicitly because the value 'check' is new in v1.12.2 and subject to change. It is just meant to convey default behavior.") + ################################### # Add new tests above this line # diff --git a/man/rbindlist.Rd b/man/rbindlist.Rd index 1cbe4f5608..9b11c0b0cf 100644 --- a/man/rbindlist.Rd +++ b/man/rbindlist.Rd @@ -4,32 +4,26 @@ \alias{rbind} \title{ Makes one data.table from a list of many } \description{ - Same as \code{do.call("rbind", l)} on \code{data.frame}s, but much faster. See \code{DETAILS} for more. + Same as \code{do.call("rbind", l)} on \code{data.frame}s, but much faster. } \usage{ -rbindlist(l, use.names=fill, fill=FALSE, idcol=NULL) +rbindlist(l, use.names="check", fill=FALSE, idcol=NULL) # rbind(\dots, use.names=TRUE, fill=FALSE, idcol=NULL) } \arguments{ - \item{l}{ A list containing \code{data.table}, \code{data.frame} or \code{list} objects. 
At least one of the inputs should have column names set. \code{\dots} is the same but you pass the objects by name separately. } - \item{use.names}{If \code{TRUE} items will be bound by matching column names. By default \code{FALSE} for \code{rbindlist} (for backwards compatibility) and \code{TRUE} for \code{rbind} (consistency with base). Columns with duplicate names are bound in the order of occurrence, similar to base. When TRUE, at least one item of the input list has to have non-null column names.} - \item{fill}{If \code{TRUE} fills missing columns with NAs. By default \code{FALSE}. When \code{TRUE}, \code{use.names} has to be \code{TRUE}, and all items of the input list has to have non-null column names. } - \item{idcol}{Generates an index column. Default (\code{NULL}) is not to. If \code{idcol=TRUE} then the column is auto named \code{.id}. Alternatively the column name can be directly provided, e.g., \code{idcol = "id"}. - - If input is a named list, ids are generated using them, else using integer vector from \code{1} to length of input list. See \code{examples}.} + \item{l}{ A list containing \code{data.table}, \code{data.frame} or \code{list} objects. \code{\dots} is the same but you pass the objects by name separately. } + \item{use.names}{\code{TRUE} binds by matching column name, \code{FALSE} by position. \code{"check"} (default) warns if all items don't have the same names in the same order and then currently proceeds as if \code{use.names=FALSE} for backwards compatibility (\code{TRUE} in future); see news for v1.12.2.} + \item{fill}{\code{TRUE} fills missing columns with NAs. By default \code{FALSE}. When \code{TRUE}, \code{use.names} is set to \code{TRUE}.} + \item{idcol}{Creates a column in the result showing which list item those rows came from. \code{TRUE} names this column \code{".id"}. \code{idcol="file"} names this column \code{"file"}. 
If the input list has names, those names are the values placed in this id column, otherwise the values are an integer vector \code{1:length(l)}. See \code{examples}.} } \details{ -Each item of \code{l} can be a \code{data.table}, \code{data.frame} or \code{list}, including \code{NULL} (skipped) or an empty object (0 rows). \code{rbindlist} is most useful when there are a variable number of (potentially many) objects to stack, such as returned by \code{lapply(fileNames, fread)}. \code{rbind} however is most useful to stack two or three objects which you know in advance. \code{\dots} should contain at least one \code{data.table} for \code{rbind(\dots)} to call the fast method and return a \code{data.table}, whereas \code{rbindlist(l)} always returns a \code{data.table} even when stacking a plain \code{list} with a \code{data.frame}, for example. - -In versions \code{<= v1.9.2}, each item for \code{rbindlist} should have the same number of columns as the first non empty item. \code{rbind.data.table} gained a \code{fill} argument to fill missing columns with \code{NA} in \code{v1.9.2}, which allowed for \code{rbind(\dots)} binding unequal number of columns. - -In version \code{> v1.9.2}, these functionalities were extended to \code{rbindlist} (and written entirely in C for speed). \code{rbindlist} has \code{use.names} argument, which is set to \code{FALSE} by default for backwards compatibility. It also contains \code{fill} argument as well and can bind unequal columns when set to \code{TRUE}. +Each item of \code{l} can be a \code{data.table}, \code{data.frame} or \code{list}, including \code{NULL} (skipped) or an empty object (0 rows). \code{rbindlist} is most useful when there are an unknown number of (potentially many) objects to stack, such as returned by \code{lapply(fileNames, fread)}. \code{rbind} is most useful to stack two or three objects which you know in advance. 
\code{\dots} should contain at least one \code{data.table} for \code{rbind(\dots)} to call the fast method and return a \code{data.table}, whereas \code{rbindlist(l)} always returns a \code{data.table} even when stacking a plain \code{list} with a \code{data.frame}, for example. -With these changes, the only difference between \code{rbind(\dots)} and \code{rbindlist(l)} is their \emph{default argument} \code{use.names}. +Columns with duplicate names are bound in the order of occurrence, similar to base. The position (column number) at which each duplicate name occurs is also retained. -If column \code{i} of input items do not all have the same type; e.g, a \code{data.table} may be bound with a \code{list} or a column is \code{factor} while others are \code{character} types, they are coerced to the highest type (SEXPTYPE). +If column \code{i} does not have the same type in each of the list items; e.g., the column is \code{integer} in item 1 while others are \code{numeric}, they are coerced to the highest type. -Note that any additional attributes that might exist on individual items of the input list would not be preserved in the result. +If a column contains factors then a factor is created. If any of the factors are also ordered factors then the longest set of ordered levels is found (the first if this is tied). Then the ordered levels from each list item are checked to be an ordered subset of these longest levels. If any ambiguities are found (e.g. \code{blue #include -static SEXP *saveds=NULL; -static R_len_t *savedtl=NULL, nalloc=0, nsaved=0; - static void finalizer(SEXP p) { SEXP x; @@ -136,7 +133,7 @@ static int _selfrefok(SEXP x, Rboolean checkNames, Rboolean verbose) { // because R copies the original vector's tl over despite allocating length. prot = R_ExternalPtrProtected(v); if (TYPEOF(prot) != EXTPTRSXP) // Very rare. Was error(".internal.selfref prot is not itself an extptr"). 
- return 0; // See http://stackoverflow.com/questions/15342227/getting-a-random-internal-selfref-error-in-data-table-for-r + return 0; // # nocov ; see http://stackoverflow.com/questions/15342227/getting-a-random-internal-selfref-error-in-data-table-for-r if (x != R_ExternalPtrAddr(prot)) SET_TRUELENGTH(x, LENGTH(x)); // R copied this vector not data.table, it's not actually over-allocated return checkNames ? names==tag : x==R_ExternalPtrAddr(prot); @@ -266,23 +263,13 @@ SEXP shallowwrapper(SEXP dt, SEXP cols) { } SEXP truelength(SEXP x) { - SEXP ans; - PROTECT(ans = allocVector(INTSXP, 1)); - if (!isNull(x)) { - INTEGER(ans)[0] = TRUELENGTH(x); - } else { - INTEGER(ans)[0] = 0; - } - UNPROTECT(1); - return(ans); + return ScalarInteger(isNull(x) ? 0 : TRUELENGTH(x)); } SEXP selfrefokwrapper(SEXP x, SEXP verbose) { return ScalarInteger(_selfrefok(x,FALSE,LOGICAL(verbose)[0])); } -void memrecycle(SEXP target, SEXP where, int r, int len, SEXP source); - SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP verb) { // For internal use only by := in [.data.table, and set() @@ -341,10 +328,11 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v error("i is type '%s'. Must be integer, or numeric is coerced with warning. 
If i is a logical subset, simply wrap with which(), and take the which() outside the loop if possible for efficiency.", type2char(TYPEOF(rows))); targetlen = length(rows); numToDo = 0; + const int *rowsd = INTEGER(rows); for (i=0; inrow) - error("i[%d] is %d which is out of range [1,nrow=%d].",i+1,INTEGER(rows)[i],nrow); - if (INTEGER(rows)[i]>=1) numToDo++; + if ((rowsd[i]<0 && rowsd[i]!=NA_INTEGER) || rowsd[i]>nrow) + error("i[%d] is %d which is out of range [1,nrow=%d].",i+1,rowsd[i],nrow); // set() reaches here (test 2005.2); := reaches the same error in subset.c first + if (rowsd[i]>=1) numToDo++; } if (verbose) Rprintf("Assigning to %d row subset of %d rows\n", numToDo, nrow); // TODO: include in message if any rows are assigned several times (e.g. by=.EACHI with dups in i) @@ -355,12 +343,12 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v } } if (!length(cols)) { - warning("length(LHS)==0; no columns to delete or assign RHS to."); + warning("length(LHS)==0; no columns to delete or assign RHS to."); // test 1295 covers return(dt); } // FR #2077 - set able to add new cols by reference if (isString(cols)) { - PROTECT(tmp = chmatch(cols, names, 0, FALSE)); + PROTECT(tmp = chmatch(cols, names, 0)); protecti++; buf = (int *) R_alloc(length(cols), sizeof(int)); for (i=0; i1) { - if (length(values)==0) error("Supplied %d columns to be assigned an empty list (which may be an empty data.table or data.frame since they are lists too). To delete multiple columns use NULL instead. 
To add multiple empty list columns, use list(list()).", length(cols)); - if (length(values)>length(cols)) - warning("Supplied %d columns to be assigned a list (length %d) of values (%d unused)", length(cols), length(values), length(values)-length(cols)); - else if (length(cols)%length(values) != 0) - warning("Supplied %d columns to be assigned a list (length %d) of values (recycled leaving remainder of %d items).",length(cols),length(values),length(cols)%length(values)); - } // else it's a list() column being assigned to one column - } + if (TYPEOF(values)==VECSXP) { + if (length(cols)>1) { + if (length(values)==0) error("Supplied %d columns to be assigned an empty list (which may be an empty data.table or data.frame since they are lists too). To delete multiple columns use NULL instead. To add multiple empty list columns, use list(list()).", length(cols)); + if (length(values)>1 && length(values)!=length(cols)) + error("Supplied %d columns to be assigned %d items. Please see NEWS for v1.12.2.", length(cols), length(values)); + } // else it's a list() column being assigned to one column } // Check all inputs : for (i=0; i oldncol && TYPEOF(thisvalue)!=VECSXP) { // list() is ok for new columns newcolnum = coln-length(names); if (newcolnum<0 || newcolnum>=length(newcolnames)) - error("Internal logical error. length(newcolnames)=%d, length(names)=%d, coln=%d", length(newcolnames), length(names), coln); + error("Internal error in assign.c: length(newcolnames)=%d, length(names)=%d, coln=%d", length(newcolnames), length(names), coln); // # nocov if (isNull(thisvalue)) { warning("Adding new column '%s' then assigning NULL (deleting it).",CHAR(STRING_ELT(newcolnames,newcolnum))); continue; @@ -455,13 +434,13 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v if (oldtncol>oldncol+10000L) warning("truelength (%d) is greater than 10,000 items over-allocated (length = %d). See ?truelength. 
If you didn't set the datatable.alloccol option very large, please report to data.table issue tracker including the result of sessionInfo().",oldtncol, oldncol); if (oldtncol < oldncol+LENGTH(newcolnames)) - error("Internal logical error. DT passed to assign has not been allocated enough column slots. l=%d, tl=%d, adding %d", oldncol, oldtncol, LENGTH(newcolnames)); + error("Internal error: DT passed to assign has not been allocated enough column slots. l=%d, tl=%d, adding %d", oldncol, oldtncol, LENGTH(newcolnames)); // # nocov if (!selfrefnamesok(dt,verbose)) - error("It appears that at some earlier point, names of this data.table have been reassigned. Please ensure to use setnames() rather than names<- or colnames<-. Otherwise, please report to data.table issue tracker."); + error("It appears that at some earlier point, names of this data.table have been reassigned. Please ensure to use setnames() rather than names<- or colnames<-. Otherwise, please report to data.table issue tracker."); // # nocov // Can growVector at this point easily enough, but it shouldn't happen in first place so leave it as // strong error message for now. else if (TRUELENGTH(names) != oldtncol) - error("selfrefnames is ok but tl names [%d] != tl [%d]", TRUELENGTH(names), oldtncol); + error("Internal error: selfrefnames is ok but tl names [%d] != tl [%d]", TRUELENGTH(names), oldtncol); // # nocov SETLENGTH(dt, oldncol+LENGTH(newcolnames)); SETLENGTH(names, oldncol+LENGTH(newcolnames)); for (i=0; i0) // assigning the same values to a second column. 
Have to ensure a copy #2540 ) { if (verbose) { - if (length(values)==length(cols)) { - // usual branch - Rprintf("RHS for item %d has been duplicated because NAMED is %d, but then is being plonked.\n", i+1, NAMED(thisvalue)); - } else { - // rare branch where the lhs of := is longer than the items on the rhs of := - Rprintf("RHS for item %d has been duplicated because the list of RHS values (length %d) is being recycled, but then is being plonked.\n", i+1, length(values)); - } + Rprintf("RHS for item %d has been duplicated because NAMED is %d, but then is being plonked. length(values)==%d; length(cols)==%d)\n", + i+1, NAMED(thisvalue), length(values), length(cols)); } thisvalue = duplicate(thisvalue); // PROTECT not needed as assigned as element to protected list below. } else { @@ -577,7 +551,7 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v } else { // value is either integer or numeric vector if (TYPEOF(thisvalue)!=INTSXP && TYPEOF(thisvalue)!=LGLSXP && !isReal(thisvalue)) - error("Internal logical error. Up front checks (before starting to modify DT) didn't catch type of RHS ('%s') assigning to factor column '%s'. please report to data.table issue tracker.", type2char(TYPEOF(thisvalue)), CHAR(STRING_ELT(names,coln))); + error("Internal error: up front checks (before starting to modify DT) didn't catch type of RHS ('%s') assigning to factor column '%s'. 
please report to data.table issue tracker.", type2char(TYPEOF(thisvalue)), CHAR(STRING_ELT(names,coln))); // # nocov if (isReal(thisvalue) || TYPEOF(thisvalue)==LGLSXP) { PROTECT(RHS = coerceVector(thisvalue,INTSXP)); protecti++; @@ -612,13 +586,13 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v char *s1 = (char *)type2char(TYPEOF(targetcol)); char *s2 = (char *)type2char(TYPEOF(thisvalue)); // FR #2551, added test for equality between RHS and thisvalue to not provide the warning when length(thisvalue) == 1 - if ( length(thisvalue)==1 && TYPEOF(RHS)!=VECSXP && TYPEOF(thisvalue)!=VECSXP && ( + if ( length(thisvalue)==1 && TYPEOF(RHS)!=VECSXP && ( ( isReal(thisvalue) && isInteger(targetcol) && REAL(thisvalue)[0]==INTEGER(RHS)[0] ) || // DT[,intCol:=4] rather than DT[,intCol:=4L] ( isLogical(thisvalue) && LOGICAL(thisvalue)[0] == NA_LOGICAL ) || // DT[,intCol:=NA] - ( isReal(targetcol) && isInteger(thisvalue) ) )) { + ( isInteger(thisvalue) && isReal(targetcol) ) )) { if (verbose) Rprintf("Coerced length-1 RHS from %s to %s to match column's type.%s If this assign is happening a lot inside a loop, in particular via set(), then it may be worth avoiding this coercion by using R's type postfix on the value being assigned; e.g. typeof(0) vs typeof(0L), and typeof(NA) vs typeof(NA_integer_) vs typeof(NA_real_).\n", s2, s1, - isInteger(targetcol) && isReal(thisvalue) ? "No precision was lost. " : ""); - // TO DO: datatable.pedantic could turn this into warning + isInteger(targetcol) && isReal(thisvalue) ? " No precision was lost." : ""); + // TO DO: datatable.pedantic could turn this into warning. Or we could catch and avoid the coerceVector allocation ourselves using a single int. 
} else { if (isReal(thisvalue) && isInteger(targetcol)) { int w = INTEGER(isReallyReal(thisvalue))[0]; // first fraction present (1-based), 0 if none @@ -646,7 +620,7 @@ SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values, SEXP v if (length(key)) { // if assigning to at least one key column, the key is truncated to one position before the first changed column. //any() and subsetVector() don't seem to be exposed by R API at C level, so this is done here long hand. - PROTECT(tmp = chmatch(key, assignedNames, 0, TRUE)); + PROTECT(tmp = chin(key, assignedNames)); protecti++; newKeyLength = xlength(key); for (i=0;i1 && slen!=len) error("Internal error: recycle length error not caught earlier. slen=%d len=%d", slen, len); // # nocov // Internal error because the column has already been added to the DT, so length mismatch should have been caught before adding the column. // for 5647 this used to limit slen to len, but no longer - + *memrecycle_message = '\0'; int protecti=0; if (isNewList(source)) { // A list() column; i.e. target is a column of pointers to SEXPs rather than the much more common case @@ -844,28 +819,91 @@ void memrecycle(SEXP target, SEXP where, int start, int len, SEXP source) protecti++; } } - if (!length(where)) { + if (!length(where)) { // e.g. called from rbindlist with where=R_NilValue switch (TYPEOF(target)) { - case LGLSXP: case INTSXP : + case RAWSXP: + if (TYPEOF(source)!=RAWSXP) { source = PROTECT(coerceVector(source, RAWSXP)); protecti++; } if (slen==1) { // recycle single items - int *td = INTEGER(target); + Rbyte *td = RAW(target)+start; + const Rbyte val = RAW(source)[0]; + for (int i=0; i=nalloc) { - nalloc *= 2; - char *tmp; - tmp = (char *)realloc(saveds, nalloc * sizeof(SEXP)); - if (tmp == NULL) { - savetl_end(); - error("Couldn't realloc saveds in savetl"); + if (nsaved==nalloc) { + if (nalloc==INT_MAX) { + savetl_end(); // # nocov + error("Internal error: reached maximum %d items for savetl. 
Please report to data.table issue tracker.", nalloc); // # nocov + } + nalloc = nalloc>(INT_MAX/2) ? INT_MAX : nalloc*2; + char *tmp = (char *)realloc(saveds, nalloc*sizeof(SEXP)); + if (tmp==NULL) { + // C spec states that if realloc() fails the original block is left untouched; it is not freed or moved. We rely on that here. + savetl_end(); // # nocov free(saveds) happens inside savetl_end + error("Failed to realloc saveds to %d items in savetl", nalloc); // # nocov } saveds = (SEXP *)tmp; - tmp = (char *)realloc(savedtl, nalloc * sizeof(R_len_t)); - if (tmp == NULL) { - savetl_end(); - error("Couldn't realloc savedtl in savetl"); + tmp = (char *)realloc(savedtl, nalloc*sizeof(R_len_t)); + if (tmp==NULL) { + savetl_end(); // # nocov + error("Failed to realloc savedtl to %d items in savetl", nalloc); // # nocov } savedtl = (R_len_t *)tmp; } @@ -1006,11 +1074,11 @@ void savetl_end() { // Can get called if nothing has been saved yet (nsaved==0), or even if _init() hasn't been called yet (pointers NULL). Such // as to clear up before error. Also, it might be that nothing needed to be saved anyway. for (int i=0; i0) savetl(s); // as from v1.8.0 we assume R's internal hash is positive. So in R < 2.14.0 we // don't save the uninitialised truelengths that by chance are negative, but // will save if positive. Hence R >= 2.14.0 may be faster and preferred now that R // initializes truelength to 0 from R 2.14.0. - SET_TRUELENGTH(s,0); + SET_TRUELENGTH(s,0); // TODO: do we need to set to zero first (we can rely on R 3.1.0 now)? } - for (i=length(table)-1; i>=0; i--) { - s = STRING_ELT(table,i); + const int tablelen = length(table); + const SEXP *td = STRING_PTR(table); + int nuniq=0; + for (int i=0; i0) savetl(s); - SET_TRUELENGTH(s, -i-1); + int tl = TRUELENGTH(s); + if (tl>0) { savetl(s); tl=0; } + if (tl==0) SET_TRUELENGTH(s, chmatchdup ? 
-(++nuniq) : -i-1); // first time seen this string in table } - if (in) { - for (i=0; i A(TL=1),B(2),C(3),D(4),E(5) => dupMap 1 2 3 5 6 | 8 7 4 + // dupLink 7 8 | 6 (blank=0) + int *counts = (int *)calloc(nuniq, sizeof(int)); + int *map = (int *)calloc(tablelen+nuniq, sizeof(int)); // +nuniq to store a 0 at the end of each group + if (!counts || !map) { + // # nocov start + for (int i=0; i #include -void setSizes() { - // called by init.c - int i; - for (i=0;i<100;i++) sizes[i]=0; - // only these types are currently allowed as column types : - sizes[INTSXP] = sizeof(int); // integer and factor - sizes[LGLSXP] = sizeof(int); // logical - sizes[REALSXP] = sizeof(double); // numeric - sizes[STRSXP] = sizeof(SEXP *); // character - sizes[VECSXP] = sizeof(SEXP *); // a column itself can be a list() - for (i=0;i<100;i++) { - if (sizes[i]>8) error("Type %d is sizeof() greater than 8 bytes on this machine. We haven't tested on any architecture greater than 64bit, yet.", i); - // One place we need the largest sizeof (assumed to be 8 bytes) is the working memory malloc in reorder.c - } - SelfRefSymbol = install(".internal.selfref"); -} - SEXP dogroups(SEXP dt, SEXP dtcols, SEXP groups, SEXP grpcols, SEXP jiscols, SEXP xjiscols, SEXP grporder, SEXP order, SEXP starts, SEXP lens, SEXP jexp, SEXP env, SEXP lhs, SEXP newnames, SEXP on, SEXP verbose) { R_len_t i, j, k, rownum, ngrp, nrowgroups, njval=0, ngrpcols, ansloc=0, maxn, estn=-1, thisansloc, grpn, thislen, igrp, origIlen=0, origSDnrow=0; diff --git a/src/fmelt.c b/src/fmelt.c index f54ba17fb0..100bfed3bc 100644 --- a/src/fmelt.c +++ b/src/fmelt.c @@ -133,7 +133,7 @@ SEXP measurelist(SEXP measure, SEXP dtnames) { ans = PROTECT(allocVector(VECSXP, n)); protecti++; for (i=0; i0) savetl(s); + SET_TRUELENGTH(s,-(++nlevel)); + levelsRaw[nlevel-1] = s; + } + } + for (int i=0; iisfactor[i] && TYPEOF(target) != VECSXP) { - SEXP clevels = PROTECT(combineFactorLevels(flevels, &(data->isfactor[i]), isordered)); - SEXP factorLangSxp = 
PROTECT(lang3(install(data->isfactor[i] == 1 ? "factor" : "ordered"), target, clevels)); - SET_VECTOR_ELT(ansvals, i, eval(factorLangSxp, R_GlobalEnv)); - UNPROTECT(2); // clevels, factorLangSxp + //SEXP clevels = PROTECT(combineFactorLevels(flevels, &(data->isfactor[i]), isordered)); + //SEXP factorLangSxp = PROTECT(lang3(install(data->isfactor[i] == 1 ? "factor" : "ordered"), target, clevels)); + //SET_VECTOR_ELT(ansvals, i, eval(factorLangSxp, R_GlobalEnv)); + //UNPROTECT(2); // clevels, factorLangSxp + SET_VECTOR_ELT(ansvals, i, combineFactorLevels(flevels, target, &(data->isfactor[i]), isordered)); } } UNPROTECT(2); // flevels, ansvals. Not using two protection counters (protecti and thisprotecti) to keep rchk happy. @@ -477,74 +543,82 @@ SEXP getvaluecols(SEXP DT, SEXP dtnames, Rboolean valfactor, Rboolean verbose, s } SEXP getvarcols(SEXP DT, SEXP dtnames, Rboolean varfactor, Rboolean verbose, struct processData *data) { - - int i,j,k,cnt=0,nrows=0, nlevels=0, protecti=0, thislen, zerolen=0; - SEXP ansvars, thisvaluecols, levels, target, matchvals, thisnames; - - ansvars = PROTECT(allocVector(VECSXP, 1)); protecti++; - SET_VECTOR_ELT(ansvars, 0, target=allocVector(INTSXP, data->totlen) ); - if (data->lvalues == 1) { - thisvaluecols = VECTOR_ELT(data->valuecols, 0); - // tmp fix for #1055 - thisnames = PROTECT(allocVector(STRSXP, length(thisvaluecols))); protecti++; - for (i=0; inarm) { - for (j=0; jlmax; j++) { - thislen = length(VECTOR_ELT(data->naidx, j)); - for (k=0; knrow * data->lmax == data->totlen + SEXP ansvars=PROTECT(allocVector(VECSXP, 1)); + int protecti=1; + SEXP target; + if (data->lvalues==1 && length(VECTOR_ELT(data->valuecols, 0)) != data->lmax) + error("Internal error: fmelt.c:getvarcols %d %d", length(VECTOR_ELT(data->valuecols, 0)), data->lmax); // # nocov + if (!varfactor) { + SET_VECTOR_ELT(ansvars, 0, target=allocVector(STRSXP, data->totlen)); + if (data->lvalues == 1) { + const int *thisvaluecols = INTEGER(VECTOR_ELT(data->valuecols, 
0)); + for (int j=0, ansloc=0; jlmax; ++j) { + const int thislen = data->narm ? length(VECTOR_ELT(data->naidx, j)) : data->nrow; + SEXP str = STRING_ELT(dtnames, thisvaluecols[j]-1); + for (int k=0; klmax - zerolen; } else { - for (j=0; jlmax; j++) { - for (k=0; knrow; k++) - INTEGER(target)[data->nrow*j + k] = INTEGER(matchvals)[j]; + for (int j=0, ansloc=0, level=1; jlmax; ++j) { + const int thislen = data->narm ? length(VECTOR_ELT(data->naidx, j)) : data->nrow; + if (thislen==0) continue; // so as not to bump level + char buff[20]; + sprintf(buff, "%d", level++); + SEXP str = PROTECT(mkChar(buff)); + for (int k=0; klmax; } } else { - if (data->narm) { - for (j=0; jlmax; j++) { - thislen = length(VECTOR_ELT(data->naidx, j)); - for (k=0; ktotlen)); + SEXP levels; + int *td = INTEGER(target); + if (data->lvalues == 1) { + SEXP thisvaluecols = VECTOR_ELT(data->valuecols, 0); + int len = length(thisvaluecols); + levels = PROTECT(allocVector(STRSXP, len)); protecti++; + const int *vd = INTEGER(thisvaluecols); + for (int j=0; jnarm && length(VECTOR_ELT(data->naidx, j))==0)) { numRemove++; md[j]=0; } + } + if (numRemove) { + SEXP newlevels = PROTECT(allocVector(STRSXP, len-numRemove)); protecti++; + for (int i=0, loc=0; ilmax; ++j) { + const int thislen = data->narm ? length(VECTOR_ELT(data->naidx, j)) : data->nrow; + for (int k=0; klmax; j++) { - for (k=0; knrow; k++) - INTEGER(target)[data->nrow*j + k] = j+1; + int nlevel=0; + levels = PROTECT(allocVector(STRSXP, data->lmax)); protecti++; + for (int j=0, ansloc=0; jlmax; ++j) { + const int thislen = data->narm ? 
length(VECTOR_ELT(data->naidx, j)) : data->nrow; + if (thislen==0) continue; // so as not to bump level + char buff[20]; + sprintf(buff, "%d", nlevel+1); + SET_STRING_ELT(levels, nlevel++, mkChar(buff)); // generate levels = 1:nlevels + for (int k=0; klmax; - } - } - SEXP tmp = PROTECT(mkString("factor")); protecti++; - setAttrib(target, R_ClassSymbol, tmp); - cnt = 0; - if (data->lvalues == 1) { - levels = PROTECT(allocVector(STRSXP, nlevels)); protecti++; - thisvaluecols = VECTOR_ELT(data->valuecols, 0); // levels will be column names - for (i=0; ilmax; i++) { - if (data->narm) { - if (length(VECTOR_ELT(data->naidx, i)) == 0) continue; + if (nlevel < data->lmax) { + // data->narm is true and there are some all-NA items causing at least one 'if (thislen==0) continue' above + // shrink the levels + SEXP newlevels = PROTECT(allocVector(STRSXP, nlevel)); protecti++; + for (int i=0; i8) error("Pointers are %d bytes, greater than 8. We have not tested on any architecture greater than 64bit yet.", sizeof(char *)); + // One place we need the largest sizeof is the working memory malloc in reorder.c +} + void attribute_visible R_init_datatable(DllInfo *info) // relies on pkg/src/Makevars to mv data.table.so to datatable.so { @@ -249,11 +263,12 @@ void attribute_visible R_init_datatable(DllInfo *info) char_indices = PRINTNAME(install("indices")); char_allLen1 = PRINTNAME(install("allLen1")); char_allGrp1 = PRINTNAME(install("allGrp1")); + char_factor = PRINTNAME(install("factor")); + char_ordered = PRINTNAME(install("ordered")); if (TYPEOF(char_integer64) != CHARSXP) { // checking one is enough in case of any R-devel changes - error("PRINTNAME(install(\"integer64\")) has returned %s not %s", - type2char(TYPEOF(char_integer64)), type2char(CHARSXP)); + error("PRINTNAME(install(\"integer64\")) has returned %s not %s", type2char(TYPEOF(char_integer64)), type2char(CHARSXP)); // # nocov } // create commonly used symbols, same as R_*Symbol but internal to DT @@ -267,6 +282,7 @@ 
void attribute_visible R_init_datatable(DllInfo *info) sym_index = install("index"); sym_BY = install(".BY"); sym_maxgrpn = install("maxgrpn"); + SelfRefSymbol = install(".internal.selfref"); initDTthreads(); avoid_openmp_hang_within_fork(); @@ -312,9 +328,9 @@ inline double LLtoD(long long x) { return u.d; } - +// # nocov start SEXP hasOpenMP() { - // Just for use by onAttach to avoid an RPRINTF from C level which isn't suppressable by CRAN + // Just for use by onAttach (hence nocov) to avoid an RPRINTF from C level which isn't suppressable by CRAN // There is now a 'grep' in CRAN_Release.cmd to detect any use of RPRINTF in init.c, which is // why RPRINTF is capitalized in this comment to avoid that grep. // TODO: perhaps .Platform or .Machine in R itself could contain whether OpenMP is available. @@ -324,6 +340,7 @@ SEXP hasOpenMP() { return ScalarLogical(FALSE); #endif } +// # nocov end SEXP dllVersion() { // .onLoad calls this and checks the same as packageVersion() to ensure no R/C version mismatch, #3056 diff --git a/src/rbindlist.c b/src/rbindlist.c index bec3fdd4e5..042c9c99bb 100644 --- a/src/rbindlist.c +++ b/src/rbindlist.c @@ -1,958 +1,468 @@ #include "data.table.h" #include -#include -// #include // the debugging machinery + breakpoint aidee -// raise(SIGINT); -/* Eddi's hash setup for combining factor levels appropriately - untouched from previous state (except made combineFactorLevels static) */ - -// a simple linked list, will use this when finding global order for ordered factors -// will keep two ints -struct llist { - struct llist * next; - R_len_t i, j; -}; - -// hash table code copied from main/unique.c, specialized for our particular needs -// as our table will just be strings -// took out long vector ifdefs as that relied on too much base code -// can revisit this later if there is need for more than ~1e9 length factor columns -// UTF8 and Cache bools are not set correctly for now - -typedef size_t hlen; - -/* Hash function and equality test 
for keys */ -typedef struct _HashData HashData; - -struct _HashData { - int K; - hlen M; - RLEN nmax; - hlen (*hash)(SEXP, RLEN, HashData *); - int (*equal)(SEXP, RLEN, SEXP, RLEN); - struct llist ** HashTable; - - int nomatch; - Rboolean useUTF8; - Rboolean useCache; -}; - -/* -Integer keys are hashed via a random number generator -based on Knuth's recommendations. The high order K bits -are used as the hash code. - -NB: lots of this code relies on M being a power of two and -on silent integer overflow mod 2^32. - - Integer keys are wasteful for logical and raw vectors, but -the tables are small in that case. It would be much easier to -implement long vectors, though. -*/ - -/* Currently the hash table is implemented as a (signed) integer -array. So there are two 31-bit restrictions, the length of the -array and the values. The values are initially NIL (-1). O-based -indices are inserted by isDuplicated, and invalidated by setting -to NA_INTEGER. -*/ - -static hlen scatter(unsigned int key, HashData *d) -{ - return 3141592653U * key >> (32 - d->K); -} - -/* Hash CHARSXP by address. Hash values are int, For 64bit pointers, - * we do (upper ^ lower) */ -static hlen cshash(SEXP x, RLEN indx, HashData *d) -{ - intptr_t z = (intptr_t) STRING_ELT(x, indx); - unsigned int z1 = (unsigned int)(z & 0xffffffff), z2 = 0; -#if SIZEOF_LONG == 8 - z2 = (unsigned int)(z/0x100000000L); -#endif - return scatter(z1 ^ z2, d); -} - -static hlen shash(SEXP x, RLEN indx, HashData *d) -{ - unsigned int k; - const char *p; - const void *vmax = vmaxget(); - if(!d->useUTF8 && d->useCache) return cshash(x, indx, d); - /* Not having d->useCache really should not happen anymore. 
*/ - p = translateCharUTF8(STRING_ELT(x, indx)); - k = 0; - while (*p++) - k = 11 * k + (unsigned int) *p; /* was 8 but 11 isn't a power of 2 */ - vmaxset(vmax); /* discard any memory used by translateChar */ - return scatter(k, d); -} - -static int sequal(SEXP x, RLEN i, SEXP y, RLEN j) -{ - // using our function instead of copying a lot more code from base - return !StrCmp(STRING_ELT(x, i), STRING_ELT(y, j)); -} - -/* -Choose M to be the smallest power of 2 -not less than 2*n and set K = log2(M). -Need K >= 1 and hence M >= 2, and 2^M < 2^31-1, hence n <= 2^29. - -Dec 2004: modified from 4*n to 2*n, since in the worst case we have -a 50% full table, and that is still rather efficient -- see -R. Sedgewick (1998) Algorithms in C++ 3rd edition p.606. -*/ -static void MKsetup(HashData *d, RLEN n) +SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg) { - if(n < 0 || n >= 1073741824) /* protect against overflow to -ve */ - error("length %d is too large for hashing", n); - - size_t n2 = 2U * (size_t) n; - d->M = 2; - d->K = 1; - while (d->M < n2) { - d->M *= 2; - d->K++; + if (!isLogical(fillArg) || LENGTH(fillArg) != 1 || LOGICAL(fillArg)[0] == NA_LOGICAL) + error("fill= should be TRUE or FALSE"); + if (!isLogical(usenamesArg) || LENGTH(usenamesArg)!=1) + error("use.names= should be TRUE, FALSE, or not used (\"check\" by default)"); // R levels converts "check" to NA + if (!length(l)) return(l); + if (TYPEOF(l) != VECSXP) error("Input to rbindlist must be a list. This list can contain data.tables, data.frames or plain lists."); + Rboolean usenames = LOGICAL(usenamesArg)[0]; + const bool fill = LOGICAL(fillArg)[0]; + if (fill && usenames!=TRUE) { + if (usenames==FALSE) warning("use.names= cannot be FALSE when fill is TRUE. 
Setting use.names=TRUE."); // else no warning if usenames==NA (default) + usenames=TRUE; } - d->nmax = n; -} - -#define IMAX 4294967296L -static void HashTableSetup(HashData *d, RLEN n) -{ - d->hash = shash; - d->equal = sequal; - MKsetup(d, n); - //d->HashTable = malloc(sizeof(struct llist *) * (d->M)); - //if (d->HashTable == NULL) error("malloc failed in rbindlist.c. This part of the code will be reworked."); - d->HashTable = (struct llist **)R_alloc(d->M, sizeof(struct llist *)); - for (RLEN i = 0; i < d->M; i++) d->HashTable[i] = NULL; -} -/* -static void CleanHashTable(HashData *d) -{ - struct llist * root, * tmp; - - for (RLEN i = 0; i < d->M; ++i) { - root = d->HashTable[i]; - while (root != NULL) { - tmp = root->next; - free(root); - root = tmp; + const bool idcol = !isNull(idcolArg); + if (idcol && (!isString(idcolArg) || LENGTH(idcolArg)!=1)) error("Internal error: rbindlist.c idcol is not a single string"); // # nocov + int ncol=0, first=0; + int64_t nrow=0, upperBoundUniqueNames=1; + bool anyNames=false; + int numZero=0, firstZeroCol=0, firstZeroItem=0; + int *eachMax = (int *)R_alloc(LENGTH(l), sizeof(int)); + // pre-check for any errors here to save having to get cleanup right below when usenames + for (int i=0; i0 checked above + eachMax[i] = 0; + SEXP li = VECTOR_ELT(l, i); + if (isNull(li)) continue; + if (TYPEOF(li) != VECSXP) error("Item %d of input is not a data.frame, data.table or list", i+1); + const int thisncol = length(li); + if (!thisncol) continue; + // delete as now more flexible ... if (fill && isNull(getAttrib(li, R_NamesSymbol))) error("When fill=TRUE every item of the input must have column names. Item %d does not.", i+1); + if (fill) { + if (thisncol>ncol) ncol=thisncol; // this section initializes ncol with max ncol. 
ncol may be increased when usenames is accounted for further down + } else { + if (ncol==0) { ncol=thisncol; first=i; } + else if (thisncol!=ncol) error("Item %d has %d columns, inconsistent with item %d which has %d columns. To fill missing columns use fill=TRUE.", i+1, thisncol, first+1, ncol); + } + int nNames = length(getAttrib(li, R_NamesSymbol)); + if (nNames>0 && nNames!=thisncol) error("Item %d has %d columns but %d column names. Invalid object.", i+1, thisncol, nNames); + if (nNames>0) anyNames=true; + upperBoundUniqueNames += nNames; + int maxLen=0, whichMax=0; + for (int j=0; jmaxLen) { maxLen=tt; whichMax=j; } } + for (int j=0; j1 && tt!=maxLen) error("Column %d of item %d is length %d inconsistent with column %d which is length %d. Only length-1 columns are recycled.", j+1, i+1, tt, whichMax+1, maxLen); + if (tt==0 && maxLen>0 && numZero++==0) { firstZeroCol = j; firstZeroItem=i; } } + eachMax[i] = maxLen; + nrow += maxLen; } - free(d->HashTable); -} -*/ - -// factorType is 1 for factor and 2 for ordered -// will simply unique normal factors and attempt to find global order for ordered ones -SEXP combineFactorLevels(SEXP factorLevels, int * factorType, Rboolean * isRowOrdered) { - // find total length - RLEN size = 0; - R_len_t len = LENGTH(factorLevels), n, i, j; - for (i = 0; i < len; ++i) { - SEXP elem = VECTOR_ELT(factorLevels, i); - n = LENGTH(elem); - size += n; - /* for (j = 0; j < n; ++j) { */ - /* if(IS_BYTES(STRING_ELT(elem, j))) { */ - /* data.useUTF8 = FALSE; break; */ - /* } */ - /* if(ENC_KNOWN(STRING_ELT(elem, j))) { */ - /* data.useUTF8 = TRUE; */ - /* } */ - /* if(!IS_CACHED(STRING_ELT(elem, j))) { */ - /* data.useCache = FALSE; break; */ - /* } */ - /* } */ + if (numZero) { // #1871 + SEXP names = getAttrib(VECTOR_ELT(l, firstZeroItem), R_NamesSymbol); + const char *ch = names==R_NilValue ? "" : CHAR(STRING_ELT(names, firstZeroCol)); + warning("Column %d ['%s'] of item %d is length 0. 
This (and %d other%s like it) has been filled with NA (NULL for list columns) to make each item uniform.", + firstZeroCol+1, ch, firstZeroItem+1, numZero-1, numZero==2?"":"s"); } - - // set up hash to put duplicates in - HashData data; - data.useUTF8 = FALSE; - data.useCache = TRUE; - HashTableSetup(&data, size); - - struct llist **h = data.HashTable; - hlen idx; - struct llist * pl; - R_len_t uniqlen = 0; - // we insert in opposite order because it's more convenient later to choose first of the duplicates - for (i = len-1; i >= 0; --i) { - SEXP elem = VECTOR_ELT(factorLevels, i); - n = LENGTH(elem); - for (j = n-1; j >= 0; --j) { - idx = data.hash(elem, j, &data); - while (h[idx] != NULL) { - pl = h[idx]; - if (data.equal(VECTOR_ELT(factorLevels, pl->i), pl->j, elem, j)) - break; - // it's a collision, not a match, so iterate to a new spot - idx = (idx + 1) % data.M; - } - if (data.nmax-- < 0) error("hash table is full"); - - pl = (struct llist *)R_alloc(1, sizeof(struct llist)); - pl->next = NULL; - pl->i = i; - pl->j = j; - if (h[idx] != NULL) { - pl->next = h[idx]; - } else { - ++uniqlen; + if (nrow==0 && ncol==0) return(R_NilValue); + if (nrow>INT32_MAX) error("Total rows in the list is %lld which is larger than the maximum number of rows, currently %d", nrow, INT32_MAX); + if (usenames==TRUE && !anyNames) error("use.names=TRUE but no item of input list has any names"); + + int *colMap=NULL; // maps each column in final result to the column of each list item + if (usenames==TRUE || usenames==NA_LOGICAL) { + // here we proceed as if fill=true for brevity (accounting for dups is tricky) and then catch any missings after this branch + // when use.names==NA we also proceed here as if use.names was TRUE to save new code and then check afterwards the map is 1:ncol for every item + // first find number of unique column names present; i.e. 
length(unique(unlist(lapply(l,names)))) + SEXP *uniq = (SEXP *)malloc(upperBoundUniqueNames * sizeof(SEXP)); // upperBoundUniqueNames was initialized with 1 to ensure this is defined (otherwise 0 when no item has names) + if (!uniq) error("Failed to allocate upper bound of %lld unique column names [sum(lapply(l,ncol))]", upperBoundUniqueNames); + savetl_init(); + int nuniq=0; + for (int i=0; i0) savetl(s); + uniq[nuniq++] = s; + SET_TRUELENGTH(s,-nuniq); } - h[idx] = pl; } - } - - SEXP finalLevels = PROTECT(allocVector(STRSXP, uniqlen)); // UNPROTECTed at the end of this function - R_len_t counter = 0; - if (*factorType == 2) { - int *locs = (int *)R_alloc(len, sizeof(int)); - for (int i=0; ii), pl->j, elem, j)) { - do { - if (!isRowOrdered[pl->i]) continue; - - tmp = VECTOR_ELT(factorLevels, pl->i); - if (locs[pl->i] > pl->j) { - // failed to construct global order, need to break out of too many loops - // so will use goto :o - warning("ordered factor levels cannot be combined, going to convert to simple factor instead"); - counter = 0; - *factorType = 1; - goto normalFactor; - } - - for (k = locs[pl->i]; k < pl->j; ++k) { - SET_STRING_ELT(finalLevels, counter++, STRING_ELT(tmp, k)); - } - locs[pl->i] = pl->j + 1; - } while ( (pl = pl->next) ); // added parenthesis to remove compiler warning 'suggest parentheses around assignment used as truth value' - SET_STRING_ELT(finalLevels, counter++, STRING_ELT(elem, j)); - break; - } - // it's a collision, not a match, so iterate to a new spot - idx = (idx + 1) % data.M; - } - if (h[idx] == NULL) error("internal hash error, please report to data.table issue tracker"); - } + if (nuniq>0) { + SEXP *tt = realloc(uniq, nuniq*sizeof(SEXP)); // shrink to only what we need to release the spare + if (!tt) free(uniq); // shrink never fails; just keep codacy happy + uniq = tt; } - - // fill in the rest of the unordered elements - Rboolean record; - for (i = 0; i < len; ++i) { - if (isRowOrdered[i]) continue; - SEXP elem = 
VECTOR_ELT(factorLevels, i); - n = LENGTH(elem); - for (j = 0; j < n; ++j) { - idx = data.hash(elem, j, &data); - while (h[idx] != NULL) { - pl = h[idx]; - if (data.equal(VECTOR_ELT(factorLevels, pl->i), pl->j, elem, j)) { - // Fixes #899. "rest" can have identical levels in - // more than 1 data.table. - if (!(pl->i == i && pl->j == j)) break; - record = TRUE; - do { - // if this element was in an ordered list, it's been recorded already - if (isRowOrdered[pl->i]) { - record = FALSE; - break; - } - } while ( (pl = pl->next) ); // added parenthesis to remove compiler warning 'suggest parentheses around assignment used as truth value' - if (record) - SET_STRING_ELT(finalLevels, counter++, STRING_ELT(elem, j)); - - break; - } - // it's a collision, not a match, so iterate to a new spot - idx = (idx + 1) % data.M; - } - if (h[idx] == NULL) error("internal hash error, please report to data.table issue tracker"); + // now count the dups (if any) and how they're distributed across the items + int *counts = (int *)calloc(nuniq, sizeof(int)); // counts of names for each colnames + int *maxdup = (int *)calloc(nuniq, sizeof(int)); // the most number of dups for any name within one colname vector + if (!counts || !maxdup) { + // # nocov start + for (int i=0; i maxdup[u]) maxdup[u] = counts[u]; } } - } - - normalFactor: - if (*factorType == 1) { - for (i = 0; i < len; ++i) { - SEXP elem = VECTOR_ELT(factorLevels, i); - n = LENGTH(elem); - for (j = 0; j < n; ++j) { - idx = data.hash(elem, j, &data); - while (h[idx] != NULL) { - pl = h[idx]; - if (data.equal(VECTOR_ELT(factorLevels, pl->i), pl->j, elem, j)) { - if (pl->i == i && pl->j == j) { - SET_STRING_ELT(finalLevels, counter++, STRING_ELT(elem, j)); + int ttncol = 0; + for (int u=0; uncol) ncol=ttncol; + free(maxdup); maxdup=NULL; // not needed again + // ncol is now the final number of columns accounting for unique and dups across all colnames + // allocate a matrix: nrows==length(list) each entry contains which column to 
fetch for that final column + + int *colMapRaw = (int *)malloc(LENGTH(l)*ncol * sizeof(int)); // the result of this scope used later + int *uniqMap = (int *)malloc(ncol * sizeof(int)); // maps the ith unique string to the first time it occurs in the final result + int *dupLink = (int *)malloc(ncol * sizeof(int)); // if a colname has occurred before (a dup) links from the 1st to the 2nd time in the final result, 2nd to 3rd, etc + if (!colMapRaw || !uniqMap || !dupLink) { + // # nocov start + for (int i=0; i0) { w=dupLink[w]; --wi; } // hop through the dups + if (wi && dupLink[w]==-1) { + // first time we've seen this number of dups of this name + w = dupLink[w] = lastDup--; + uniqMap[w] = nextCol++; } - break; } - // it's a collision, not a match, so iterate to a new spot - idx = (idx + 1) % data.M; + colMapRaw[i*ncol + uniqMap[w]] = j; } - if (h[idx] == NULL) error("internal hash error, please report to data.table issue tracker"); } } + for (int i=0; i0) { - // last value - INTEGER(ans)[nv-1] = n - INTEGER(v)[nv-1] + 1; + if (usenames==NA_LOGICAL) { + usenames=FALSE; // for backwards compatibility, see warning above which says this will change to TRUE in future + ncol = length(VECTOR_ELT(l, first)); // ncol was increased as if fill=true, so reduce it back given fill=false (fill==false checked above) } - UNPROTECT(1); - return(ans); -} - -static SEXP match_names(SEXP v) { - - R_len_t i, j, idx, ncols, protecti=0; - SEXP ans, dt, lnames, ti; - SEXP uorder, starts, ulens, index, firstofeachgroup, origorder; - SEXP fnames, findices, runid, grpid; - - ans = PROTECT(allocVector(VECSXP, 2)); - dt = PROTECT(unlist2(v)); protecti++; - lnames = VECTOR_ELT(dt, 0); - grpid = PROTECT(duplicate(VECTOR_ELT(dt, 1))); protecti++; // dt[1] will be reused, so backup - runid = VECTOR_ELT(dt, 2); - uorder = PROTECT(fast_order(dt, 2, 1)); protecti++; // byArg alone is set, everything else is set inside fast_order - starts = getAttrib(uorder, sym_starts); - ulens = 
PROTECT(uniq_lengths(starts, length(lnames))); protecti++; - - // seq_len(.N) for each group - index = PROTECT(VECTOR_ELT(dt, 1)); protecti++; // reuse dt[1] (in 0-index coordinate), value already backed up above. - for (i=0; ifirst = -1; data->lcount = 0; data->n_rows = 0; data->n_cols = 0; data->protecti = 0; - data->max_type = NULL; data->is_factor = NULL; data->ans_ptr = R_NilValue; data->mincol=0; - data->fn_rows = (int *)R_alloc(LENGTH(l), sizeof(int)); - data->colname = R_NilValue; - - // get first non null name, 'rbind' was doing a 'match.names' for each item.. which is a bit more time consuming. - // And warning that it'll be matched by names is not necessary, I think, as that's the default for 'rbind'. We - // should instead document it. - for (i=0; icolname = PROTECT(col_name); data->protecti++; } - if (usenames) { lnames = PROTECT(allocVector(VECSXP, LENGTH(l))); data->protecti++;} - for (i=0; ifn_rows[i] = 0; // careful to initialize before continues as R_alloc above doesn't initialize - li = VECTOR_ELT(l, i); - if (isNull(li)) continue; - if (TYPEOF(li) != VECSXP) error("Item %d of list input is not a data.frame, data.table or list",i+1); - if (!LENGTH(li)) continue; - col_name = getAttrib(li, R_NamesSymbol); - if (fill && isNull(col_name)) - error("fill=TRUE, but names of input list at position %d is NULL. 
All items of input list must have names set when fill=TRUE.", i+1); - data->lcount++; - data->fn_rows[i] = length(VECTOR_ELT(li, 0)); - if (data->first == -1) { - data->first = i; - data->n_cols = LENGTH(li); - data->mincol = LENGTH(li); - if (!usenames) { - data->ans_ptr = PROTECT(allocVector(VECSXP, 2)); data->protecti++; - if (isNull(col_name)) SET_VECTOR_ELT(data->ans_ptr, 0, data->colname); - else SET_VECTOR_ELT(data->ans_ptr, 0, col_name); - } else { - if (isNull(col_name)) SET_VECTOR_ELT(lnames, i, data->colname); - else SET_VECTOR_ELT(lnames, i, col_name); + SEXP ans=PROTECT(allocVector(VECSXP, idcol + ncol)), ansNames; + setAttrib(ans, R_NamesSymbol, ansNames=allocVector(STRSXP, idcol + ncol)); + if (idcol) { + SET_STRING_ELT(ansNames, 0, STRING_ELT(idcolArg, 0)); + SEXP idval, listNames=getAttrib(l, R_NamesSymbol); + if (length(listNames)) { + SET_VECTOR_ELT(ans, 0, idval=allocVector(STRSXP, nrow)); + for (int i=0,ansloc=0; in_rows += data->fn_rows[i]; - continue; } else { - if (!fill && LENGTH(li) != data->n_cols) - if (LENGTH(li) != data->n_cols) error("Item %d has %d columns, inconsistent with item %d which has %d columns. 
If instead you need to fill missing columns, use set argument 'fill' to TRUE.",i+1, LENGTH(li), data->first+1, data->n_cols); - } - if (data->mincol > LENGTH(li)) data->mincol = LENGTH(li); - data->n_rows += data->fn_rows[i]; - if (usenames) { - if (isNull(col_name)) SET_VECTOR_ELT(lnames, i, data->colname); - else SET_VECTOR_ELT(lnames, i, col_name); + SET_VECTOR_ELT(ans, 0, idval=allocVector(INTSXP, nrow)); + int *idvald = INTEGER(idval); + for (int i=0,ansloc=0; ians_ptr = PROTECT(match_names(lnames)); data->protecti++; - fnames = VECTOR_ELT(data->ans_ptr, 0); - findices = VECTOR_ELT(data->ans_ptr, 1); - if (isNull(data->colname) && data->n_cols > 0) - error("use.names=TRUE but no item of input list has any names.\n"); - if (!fill && length(fnames) != data->mincol) { - error("Answer requires %d columns whereas one or more item(s) in the input list has only %d columns. This could be because the items in the list may not all have identical column names or some of the items may have duplicate names. In either case, if you're aware of this and would like to fill those missing columns, set the argument 'fill=TRUE'.", length(fnames), data->mincol); - } else data->n_cols = length(fnames); - } - // decide type of each column - // initialize the max types - will possibly increment later - data->max_type = (SEXPTYPE *)R_alloc(data->n_cols, sizeof(SEXPTYPE)); - data->is_factor = (int *)R_alloc(data->n_cols, sizeof(int)); - for (i = 0; i< data->n_cols; i++) { - thisClass = R_NilValue; - data->max_type[i] = 0; - data->is_factor[i] = 0; - if (usenames) f_ind = VECTOR_ELT(findices, i); - for (j=data->first; jis_factor[i] == 2) break; - idx = (usenames) ? INTEGER(f_ind)[j] : i; - li = VECTOR_ELT(l, j); - if (isNull(li) || !LENGTH(li) || idx < 0) continue; - thiscol = VECTOR_ELT(li, idx); - // Fix for #705, check attributes - if (j == data->first) - thisClass = getAttrib(thiscol, R_ClassSymbol); - if (isFactor(thiscol)) { - data->is_factor[i] = (isOrdered(thiscol)) ? 
2 : 1; - data->max_type[i] = STRSXP; - } else { - // Fix for #705, check attributes and error if non-factor class and not identical - if (!data->is_factor[i] && - !R_compute_identical(thisClass, getAttrib(thiscol, R_ClassSymbol), 0) && !fill) { - error("Class attributes at column %d of input list at position %d does not match with column %d of input list at position %d. Coercion of objects of class 'factor' alone is handled internally by rbind/rbindlist at the moment.", i+1, j+1, i+1, data->first+1); + SEXP coercedForFactor = R_NilValue; + for(int j=0; jTYPEORDER(maxType)) maxType=thisType; + if (isFactor(thisCol)) { + if (isNull(getAttrib(thisCol,R_LevelsSymbol))) error("Column %d of item %d has type 'factor' but has no levels; i.e. malformed.", w+1, i+1); + factor = true; + if (isOrdered(thisCol)) { + orderedFactor = true; + int thisLen = length(getAttrib(thisCol, R_LevelsSymbol)); + if (thisLen>longestLen) { longestLen=thisLen; longestLevels=getAttrib(thisCol, R_LevelsSymbol); /*for warnings later ...*/longestW=w; longestI=i; } } - type = TYPEOF(thiscol); - if (type > data->max_type[i]) data->max_type[i] = type; + } else if (!isString(thisCol) && length(thisCol)) anyNotStringOrFactor=true; + if (INHERITS(thisCol, char_integer64)) { + if (firsti>=0 && !length(getAttrib(firstCol, R_ClassSymbol))) { firsti=i; firstw=w; firstCol=thisCol; } // so the integer64 attribute gets copied to target below + int64=true; + } + if (firsti==-1) { firsti=i; firstw=w; firstCol=thisCol; } + else if (!factor && !int64 && !R_compute_identical(getAttrib(thisCol, R_ClassSymbol), getAttrib(firstCol,R_ClassSymbol), 0)) { + error("Class attribute on column %d of item %d does not match with column %d of item %d.", w+1, i+1, firstw+1, firsti+1); } } - } -} - -// function does c(idcol, nm), where length(idcol)=1 -// fix for #1432, + more efficient to move the logic to C -SEXP add_idcol(SEXP nm, SEXP idcol, int cols) { - SEXP ans = PROTECT(allocVector(STRSXP, cols+1)); - SET_STRING_ELT(ans, 
0, STRING_ELT(idcol, 0)); - for (int i=0; i INT32_MAX) { - error("Total rows in the list is %lld which is larger than the maximum number of rows, currently %d", - (long long)data.n_rows, INT32_MAX); - } - fnames = VECTOR_ELT(data.ans_ptr, 0); - if (isidcol) { - fnames = PROTECT(add_idcol(fnames, idcol, data.n_cols)); - protecti++; - } - SEXP factorLevels = PROTECT(allocVector(VECSXP, data.lcount)); protecti++; - Rboolean *isRowOrdered = (Rboolean *)R_alloc(data.lcount, sizeof(Rboolean)); - for (int i=0; i z regular factor because it contains an ambiguity: is a a a regular factor because this case isn't yet implemented. a0) savetl(s); + levelsRaw[k] = s; + SET_TRUELENGTH(s,-k-1); + } + for (int i=0; i=last) { // if tl>=0 then also tl>=last because last<=0 + if (tl>=0) { + sprintf(warnStr, // not direct warning as we're inside tl region + "Column %d of item %d is an ordered factor but level %d ['%s'] is missing from the ordered levels from column %d of item %d. " \ + "Each set of ordered factor levels should be an ordered subset of the first longest. A regular factor will be created for this column.", + w+1, i+1, k+1, CHAR(s), longestW+1, longestI+1); + } else { + sprintf(warnStr, + "Column %d of item %d is an ordered factor with '%s'<'%s' in its levels. But '%s'<'%s' in the ordered levels from column %d of item %d. 
" \ + "A regular factor will be created for this column due to this ambiguity.", + w+1, i+1, CHAR(levelsD[k-1]), CHAR(s), CHAR(s), CHAR(levelsD[k-1]), longestW+1, longestI+1); + // k>=1 (so k-1 is ok) because when k==0 last==0 and this branch wouldn't happen + } + orderedFactor=false; + i=LENGTH(l); // break outer i loop + break; // break inner k loop + // we leave the tl set for the longest levels; the regular factor will be created with the longest ordered levels first in case that useful for user + } + last = tl; // negative ordinal; last should monotonically grow more negative if the levels are an ordered subset of the longest + } + } + } } - switch(TYPEOF(target)) { - case STRSXP : - isRowOrdered[resi] = FALSE; - if (isFactor(thiscol)) { - levels = getAttrib(thiscol, R_LevelsSymbol); - if (isNull(levels)) error("Column %d of item %d has type 'factor' but has no levels; i.e. malformed.", j+1, i+1); - for (r=0; r0) savetl(s); + if (allocLevel==nLevel) { // including initial time when allocLevel==nLevel==0 + SEXP *tt = NULL; + if (allocLevel(int64_t)INT_MAX) ? INT_MAX : (int)new; + tt = (SEXP *)realloc(levelsRaw, allocLevel*sizeof(SEXP)); // first time levelsRaw==NULL and realloc==malloc in that case + } + if (tt==NULL) { + // # nocov start + // C spec states that if realloc() fails (above) the original block (levelsRaw) is left untouched: it is not freed or moved. We ... + for (int k=0; k nx-1) continue; - INTEGER(ans)[oi] = (li == 2) ? 
INTEGER(index)[INTEGER(order)[si+1]-1]+1 : INTEGER(nomatch)[0]; - } - UNPROTECT(7); - return(ans); -} - -// utility function used from within chmatch2 -static SEXP listlist(SEXP x) { - - R_len_t i,j,k, nl; - SEXP lx, xo, xs, xl, tmp, ans, ans0, ans1; - - lx = PROTECT(allocVector(VECSXP, 1)); - SET_VECTOR_ELT(lx, 0, x); - xo = PROTECT(fast_order(lx, 1, 1)); - xs = getAttrib(xo, sym_starts); - xl = PROTECT(uniq_lengths(xs, length(x))); - - ans0 = PROTECT(allocVector(STRSXP, length(xs))); - ans1 = PROTECT(allocVector(VECSXP, length(xs))); - k=0; - for (i=0; i