Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
68 commits
Select commit Hold shift + click to select a range
25d96f2
interim
mattdowle Mar 4, 2019
e279db8
interim
mattdowle Mar 5, 2019
8f8d002
Merge branch 'master' into rbindlist
mattdowle Mar 6, 2019
570d33a
interim
mattdowle Mar 6, 2019
0d52e58
interim
mattdowle Mar 7, 2019
ab5288a
chatch2 rewritten and renamed chmatchdup; clearer calling of chmatch …
mattdowle Mar 9, 2019
3837d6c
interim
mattdowle Mar 9, 2019
4a8b345
interim
mattdowle Mar 9, 2019
a1c8ed8
interim
mattdowle Mar 11, 2019
860ff52
interim
mattdowle Mar 11, 2019
a40508b
interim
mattdowle Mar 11, 2019
c27fb06
interim
mattdowle Mar 11, 2019
afd700f
interim
mattdowle Mar 11, 2019
fce48e8
interim
mattdowle Mar 11, 2019
9e298b9
interim
mattdowle Mar 12, 2019
1ec90b5
interim
mattdowle Mar 12, 2019
3f6838d
interim
mattdowle Mar 12, 2019
e5b3467
interim
mattdowle Mar 12, 2019
f8cc4f7
interim
mattdowle Mar 12, 2019
51731ab
interim
mattdowle Mar 12, 2019
6ab1996
interim
mattdowle Mar 12, 2019
99a3c40
interim
mattdowle Mar 12, 2019
59cb3a8
interim
mattdowle Mar 13, 2019
3f6912d
interim
mattdowle Mar 13, 2019
43c9164
interim
mattdowle Mar 13, 2019
b2190bb
interim
mattdowle Mar 13, 2019
e6e20e1
tidy chmatch.c
mattdowle Mar 13, 2019
1fa3bdb
Merge branch 'master' into rbindlist
mattdowle Mar 13, 2019
86b9dd9
merge master with test number format-changes-only subtracted
mattdowle Mar 13, 2019
b594db9
better max alloc thanks to codacy for highlighting
mattdowle Mar 13, 2019
737929c
added tests from #3373
mattdowle Mar 13, 2019
f2f4ea5
rbindlist fill empty cols with NA with warning; #1871
mattdowle Mar 13, 2019
f9f21a2
Added tests from #1302
mattdowle Mar 14, 2019
9a71866
'input list' back to 'item' to remove test 1978 from diff
mattdowle Mar 14, 2019
ccfbf4a
news item
mattdowle Mar 14, 2019
826debb
added test from #3343
mattdowle Mar 14, 2019
43cd396
coverage
mattdowle Mar 14, 2019
78d3bbf
test 2000 size reduced 10x, and nocov, thanks to @jangorecki's comments
mattdowle Mar 14, 2019
10efd94
more tests for #1302
jangorecki Mar 14, 2019
f4f572b
added '(NULL for list columns)' to warnings about filling empty colum…
mattdowle Mar 14, 2019
002d3e3
tidier realloc and nocovs
mattdowle Mar 14, 2019
9b2c081
chmatch coverage
mattdowle Mar 14, 2019
1e42d35
Merge branch 'master' into rbindlist
mattdowle Mar 14, 2019
14192dd
coverage assign.c and turned off recycling in rare case of 1:4:=list(…
mattdowle Mar 15, 2019
44ec2f4
more coverage; removed copyattr as not used anywhere according to gre…
mattdowle Mar 15, 2019
385be77
combining integer64 with non-integer64; #1349
mattdowle Mar 18, 2019
d6a3c9f
rbind support raw type; #2819
mattdowle Mar 18, 2019
f2187aa
test coverage (rbindlist.c should drop from 100%)
mattdowle Mar 19, 2019
84e7756
test confirmed. now with '# nocov' rbindlist.c coverage should go up …
mattdowle Mar 19, 2019
eff89fe
fmelt NA factor coverage
mattdowle Mar 19, 2019
5936dfb
fmelt.c:412 coverage (regular factor not ordered)
mattdowle Mar 19, 2019
de06ac6
rbind and rbindlist recycle length-1 columns; #524
mattdowle Mar 20, 2019
c71c30f
coverage
mattdowle Mar 20, 2019
27bbc23
added test for #3032
mattdowle Mar 20, 2019
d6e254a
null columns; #2303 and #2305
mattdowle Mar 20, 2019
1e30f60
?rbindlist; #2113
mattdowle Mar 20, 2019
2972860
reworked fmelt.c:getvarcols for #1754
mattdowle Mar 21, 2019
c50b7e0
melt news item, and another test from #1754
mattdowle Mar 21, 2019
b99b270
more tests from #1754
mattdowle Mar 21, 2019
f4284a6
coverage
mattdowle Mar 21, 2019
e9c73ce
rbindlist use.names='check'; link in #524
mattdowle Mar 21, 2019
44a3e3f
news item tweaks
mattdowle Mar 21, 2019
c64d725
memory todo
mattdowle Mar 22, 2019
f850ac2
coverage
mattdowle Mar 22, 2019
4d052ab
keep codacy happy on realloc
mattdowle Mar 22, 2019
97c64ea
memory todo 2
mattdowle Mar 22, 2019
89f8098
fixed codecov typo
mattdowle Mar 22, 2019
d1c4258
ordered factors; tests reinstated and expanded
mattdowle Mar 23, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 12 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,10 @@

3. The number of logical CPUs used by default has been reduced from 100% to 50%. The previous 100% default was reported to cause significant slow downs when other non-trivial processes were also running: [#3395](https://github.com/Rdatatable/data.table/issues/3395), [#3298](https://github.com/Rdatatable/data.table/issues/3298). Two new optional environment variables (`R_DATATABLE_NUM_PROCS_PERCENT` & `R_DATATABLE_NUM_THREADS`) control this default. \code(setDTthreads()) gains \code{percent=} and \code{?setDTthreads} has been significantly revised. \code{getDTthreads(verbose=TRUE)} has been expanded. The environment variable `OMP_THREAD_LIMIT` is now respected ([#3300](https://github.com/Rdatatable/data.table/issues/3300)) in addition to `OMP_NUM_THREADS` as before.

4. `rbind` and `rbindlist` now retain the position of duplicate column names rather than grouping them together [#3373](https://github.com/Rdatatable/data.table/issues/3373), fill length 0 columns (including NULL) with NA with warning [#1871](https://github.com/Rdatatable/data.table/issues/1871), and recycle length-1 columns [#524](https://github.com/Rdatatable/data.table/issues/524). Thanks to Kun Ren for the requests which arose when parsing JSON.

5. `rbindlist`'s `use.names=` default has changed from `FALSE` to `"check"`. This warns if the column names of each item are not identical and then proceeds as if `use.names=FALSE` for backwards compatibility; i.e., bind by column number not by column name. In future, it will warn and then proceed as if `use.names=TRUE`. Eventually the default will be changed from `NA` to `TRUE` unless user feedback is negative. The `rbind` method for `data.table` already sets `use.names=TRUE` as does `rbind` for `data.frame` in base, and is clearly safer. To stack differently named columns together silently (the previous default behavior), it is now necessary to write `use.names=FALSE` for clarity to readers of your code. Thanks to Clayton Stanley who first raised the issue [here](http://lists.r-forge.r-project.org/pipermail/datatable-help/2014-April/002480.html).

#### BUG FIXES

1. `rbindlist()` of a malformed factor missing levels attribute is now a helpful error rather than a cryptic error about `STRING_ELT`, [#3315](https://github.com/Rdatatable/data.table/issues/3315). Thanks to Michael Chirico for reporting.
Expand All @@ -34,6 +38,14 @@

11. A join's result could be incorrectly keyed when a single nomatch occurred at the very beginning while all other values matched, [#3441](https://github.com/Rdatatable/data.table/issues/3441). The incorrect key would cause incorrect results in subsequent queries. Thanks to @symbalex for reporting and @franknarf1 for pinpointing the root cause.

12. `rbind` and `rbindlist(..., use.names=TRUE)` with over 255 columns could return the columns in a random order, [#3373](https://github.com/Rdatatable/data.table/issues/3373). The contents and name of each column was correct but the order that the columns appeared in the result might not match the original input.

13. `rbind` and `rbindlist` now combine `integer64` columns together with non-`integer64` columns correctly [#1349](https://github.com/Rdatatable/data.table/issues/1349), and support `raw` columns [#2819](https://github.com/Rdatatable/data.table/issues/2819).

14. `NULL` columns are caught and error appropriately rather than segfault in some cases, [#2303](https://github.com/Rdatatable/data.table/issues/2303) [#2305](https://github.com/Rdatatable/data.table/issues/2305). Thanks to Hugh Parsonage and @franknarf1 for reporting.

15. `melt` would error with 'factor malformed' or segfault in the presence of duplicate column names, [#1754](https://github.com/Rdatatable/data.table/issues/1754). Many thanks to @franknarf1, William Marble, wligtenberg and Toby Dylan Hocking for reproducible examples. All examples have been added to the test suite.

#### NOTES

1. When upgrading to 1.12.0 some Windows users might have seen `CdllVersion not found` in some circumstances. We found a way to catch that so the [helpful message](https://twitter.com/MattDowle/status/1084528873549705217) now occurs for those upgrading from versions prior to 1.12.0 too, as well as those upgrading from 1.12.0 to a later version. See item 1 in notes section of 1.12.0 below for more background.
Expand Down
23 changes: 12 additions & 11 deletions R/as.data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -108,6 +108,8 @@ as.data.table.array <- function(x, keep.rownames=FALSE, sorted=TRUE, value.name=
}

as.data.table.list <- function(x, keep.rownames=FALSE, ...) {
wn = sapply(x,is.null)
if (any(wn)) x = x[!wn]
if (!length(x)) return( null.data.table() )
# fix for #833, as.data.table.list with matrix/data.frame/data.table as a list element..
# TODO: move this entire logic (along with data.table() to C
Expand All @@ -125,18 +127,17 @@ as.data.table.list <- function(x, keep.rownames=FALSE, ...) {
idx = which(n < mn)
if (length(idx)) {
for (i in idx) {
if (!is.null(x[[i]])) {# avoids warning when a list element is NULL
if (inherits(x[[i]], "POSIXlt")) {
warning("POSIXlt column type detected and converted to POSIXct. We do not recommend use of POSIXlt at all because it uses 40 bytes to store one date.")
x[[i]] = as.POSIXct(x[[i]])
}
# Implementing FR #4813 - recycle with warning when nr %% nrows[i] != 0L
if (!n[i] && mn)
warning("Item ", i, " is of size 0 but maximum size is ", mn, ", therefore recycled with 'NA'")
else if (n[i] && mn %% n[i] != 0L)
warning("Item ", i, " is of size ", n[i], " but maximum size is ", mn, " (recycled leaving a remainder of ", mn%%n[i], " items)")
x[[i]] = rep(x[[i]], length.out=mn)
# any is.null(x[[i]]) were removed above, otherwise warning when a list element is NULL
if (inherits(x[[i]], "POSIXlt")) {
warning("POSIXlt column type detected and converted to POSIXct. We do not recommend use of POSIXlt at all because it uses 40 bytes to store one date.")
x[[i]] = as.POSIXct(x[[i]])
}
# Implementing FR #4813 - recycle with warning when nr %% nrows[i] != 0L
if (!n[i] && mn)
warning("Item ", i, " is of size 0 but maximum size is ", mn, ", therefore recycled with 'NA'")
else if (n[i] && mn %% n[i] != 0L)
warning("Item ", i, " is of size ", n[i], " but maximum size is ", mn, " (recycled leaving a remainder of ", mn%%n[i], " items)")
x[[i]] = rep(x[[i]], length.out=mn)
}
}
# fix for #842
Expand Down
80 changes: 33 additions & 47 deletions R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -215,17 +215,6 @@ replace_dot_alias <- function(e) {
}
}

# A (relatively) fast (uses DT grouping) wrapper for matching two vectors, BUT:
# it behaves like 'pmatch' but only the 'exact' matching part. That is, a value in
# 'x' is matched to 'table' only once. No index will be present more than once.
# This should make it even clearer:
# chmatch2(c("a", "a"), c("a", "a")) # 1,2 - the second 'a' in 'x' has a 2nd match in 'table'
# chmatch2(c("a", "a"), c("a", "b")) # 1,NA - the second one doesn't 'see' the first 'a'
# chmatch2(c("a", "a"), c("a", "a.1")) # 1,NA - this is where it differs from pmatch - we don't need the partial match.
chmatch2 <- function(x, table, nomatch=NA_integer_) {
.Call(Cchmatch2, x, table, as.integer(nomatch)) # this is in 'rbindlist.c' for now.
}

"[.data.table" <- function (x, i, j, by, keyby, with=TRUE, nomatch=getOption("datatable.nomatch"), mult="all", roll=FALSE, rollends=if (roll=="nearest") c(TRUE,TRUE) else if (roll>=0) c(FALSE,TRUE) else c(TRUE,FALSE), which=FALSE, .SDcols, verbose=getOption("datatable.verbose"), allow.cartesian=getOption("datatable.allow.cartesian"), drop=NULL, on=NULL)
{
# ..selfcount <<- ..selfcount+1 # in dev, we check no self calls, each of which doubles overhead, or could
Expand Down Expand Up @@ -369,14 +358,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
}
}

# To take care of duplicate column names properly (see chmatch2 function above `[data.table`) for description
dupmatch <- function(x, y, ...) {
if (anyDuplicated(x))
pmax(chmatch(x,y, ...), chmatch2(x,y,0L))
else chmatch(x,y)
}

# setdiff removes duplicate entries, which'll create issues with duplicated names. Use '%chin% instead.
# setdiff removes duplicate entries, which'll create issues with duplicated names. Use %chin% instead.
dupdiff <- function(x, y) x[!x %chin% y]

if (!missing(i)) {
Expand Down Expand Up @@ -739,7 +721,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
if (length(tt)) jisvars[tt] = paste0("i.",jisvars[tt])
if (length(duprightcols <- rightcols[duplicated(rightcols)])) {
nx = c(names(x), names(x)[duprightcols])
rightcols = chmatch2(names(x)[rightcols], nx)
rightcols = chmatchdup(names(x)[rightcols], nx)
nx = make.unique(nx)
} else nx = names(x)
ansvars = make.unique(c(nx, jisvars))
Expand Down Expand Up @@ -790,20 +772,16 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
if (is.character(j)) {
if (notj) {
w = chmatch(j, names(x))
if (anyNA(w)) {
warning("column(s) not removed because not found: ",paste(j[is.na(w)],collapse=","))
w = w[!is.na(w)]
}
# changed names(x)[-w] to use 'setdiff'. Here, all instances of the column must be removed.
# Ex: DT <- data.table(x=1, y=2, x=3); DT[, !"x", with=FALSE] should just output 'y'.
# But keep 'dup cols' beause it's basically DT[, !names(DT) %chin% "x", with=FALSE] which'll subset all cols not 'x'.
ansvars = if (length(w)) dupdiff(names(x), names(x)[w]) else names(x)
ansvals = dupmatch(ansvars, names(x))
if (anyNA(w)) warning("column(s) not removed because not found: ",paste(j[is.na(w)],collapse=","))
# all duplicates of the name in names(x) must be removed; e.g. data.table(x=1, y=2, x=3)[, !"x"] should just output 'y'.
w = !names(x) %chin% j
ansvars = names(x)[w]
ansvals = which(w)
} else {
# once again, use 'setdiff'. Basically, unless indices are specified in `j`, we shouldn't care about duplicated columns.
ansvars = j # x. and i. prefixes may be in here, and they'll be dealt with below
# dups = FALSE here.. even if DT[, c("x", "x"), with=FALSE], we subset only the first.. No way to tell which one the OP wants without index.
ansvals = chmatch(ansvars, names(x))
# if DT[, c("x","x")] and "x" is duplicated in names(DT), we still subset only the first. Because dups are unusual and
# it's more common to select the same column a few times. A syntax would be needed to distinguish these intents.
ansvars = j # x. and i. prefixes may be in here, they'll result in NA and will be dealt with further below if length(leftcols)
ansvals = chmatch(ansvars, names(x)) # not chmatchdup()
}
if (!length(ansvals)) return(null.data.table())
if (!length(leftcols)) {
Expand Down Expand Up @@ -1019,7 +997,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
# over a subset of columns

# all duplicate columns must be matched, because nothing is provided
ansvals = dupmatch(ansvars, names(x))
ansvals = chmatchdup(ansvars, names(x))
} else {
# FR #4979 - negative numeric and character indices for SDcols
colsub = substitute(.SDcols)
Expand Down Expand Up @@ -1432,6 +1410,7 @@ chmatch2 <- function(x, table, nomatch=NA_integer_) {
setattr(jval, 'class', class(x)) # fix for #5296
if (haskey(x) && all(key(x) %chin% names(jval)) && suppressWarnings(is.sorted(jval, by=key(x)))) # TO DO: perhaps this usage of is.sorted should be allowed internally then (tidy up and make efficient)
setattr(jval, 'sorted', key(x))
for (i in seq_along(jval)) if (is.null(jval[[i]])) stop("Column ",i," of j evaluates to NULL. A NULL column is invalid.")
}
return(jval)
}
Expand Down Expand Up @@ -2432,10 +2411,6 @@ copy <- function(x) {
alloc.col(newx)
}

copyattr <- function(from, to) {
.Call(Ccopyattr, from, to)
}

point <- function(to, to_idx, from, from_idx) {
.Call(CpointWrapper, to, to_idx, from, from_idx)
}
Expand Down Expand Up @@ -2652,13 +2627,19 @@ set <- function(x,i=NULL,j,value) # low overhead, loopable
invisible(x)
}

chmatch <- function(x,table,nomatch=NA_integer_)
.Call(Cchmatchwrapper,x,table,as.integer(nomatch[1L]),FALSE) # [1L] to fix #1672
chmatch <- function(x, table, nomatch=NA_integer_)
.Call(Cchmatch, x, table, as.integer(nomatch[1L])) # [1L] to fix #1672

"%chin%" <- function(x,table) {
# TO DO if table has 'ul' then match to that
.Call(Cchmatchwrapper,x,table,NA_integer_,TRUE)
}
# chmatchdup() behaves like 'pmatch' but only the 'exact' matching part; i.e. a value in
# 'x' is matched to 'table' only once. No index will be present more than once. For example:
# chmatchdup(c("a", "a"), c("a", "a")) # 1,2 - the second 'a' in 'x' has a 2nd match in 'table'
# chmatchdup(c("a", "a"), c("a", "b")) # 1,NA - the second one doesn't 'see' the first 'a'
# chmatchdup(c("a", "a"), c("a", "a.1")) # 1,NA - this is where it differs from pmatch - we don't need the partial match.
chmatchdup <- function(x, table, nomatch=NA_integer_)
.Call(Cchmatchdup, x, table, as.integer(nomatch[1L]))

"%chin%" <- function(x, table)
.Call(Cchin, x, table) # TO DO if table has 'ul' then match to that

chorder <- function(x) {
o = forderv(x, sort=TRUE, retGrp=FALSE)
Expand All @@ -2671,7 +2652,6 @@ chgroup <- function(x) {
if (length(o)) as.vector(o) else seq_along(x) # as.vector removes the attributes
}


.rbind.data.table <- function(..., use.names=TRUE, fill=FALSE, idcol=NULL) {
# See FAQ 2.23
# Called from base::rbind.data.frame
Expand All @@ -2681,14 +2661,20 @@ chgroup <- function(x) {
rbindlist(l, use.names, fill, idcol)
}

rbindlist <- function(l, use.names=fill, fill=FALSE, idcol=NULL) {
rbindlist <- function(l, use.names="check", fill=FALSE, idcol=NULL) {
if (isFALSE(idcol)) { idcol = NULL }
else if (!is.null(idcol)) {
if (isTRUE(idcol)) idcol = ".id"
if (!is.character(idcol)) stop("idcol must be a logical or character vector of length 1. If logical TRUE the id column will named '.id'.")
idcol = idcol[1L]
}
# fix for #1467, quotes result in "not resolved in current namespace" error
miss = missing(use.names)
# more checking of use.names happens at C level; this is just minimal to massage 'check' to NA
if (identical(use.names, NA)) stop("use.names=NA invalid") # otherwise use.names=NA could creep in an usage equivalent to use.names='check'
if (identical(use.names,"check")) {
if (!miss) stop("use.names='check' cannot be used explicitly because the value 'check' is new in v1.12.2 and subject to change. It is just meant to convey default behavior. See ?rbindlist.")
use.names = NA
}
ans = .Call(Crbindlist, l, use.names, fill, idcol)
if (!length(ans)) return(null.data.table())
setDT(ans)[]
Expand Down
Binary file added inst/tests/melt_1754.R.gz
Binary file not shown.
Loading