|
2 | 2 |
|
3 | 3 | # data.table [v1.15.99](https://github.com/Rdatatable/data.table/milestone/30) (in development) |
4 | 4 |
|
| 5 | +## BREAKING CHANGE |
| 6 | + |
| 7 | +1. `` `[.data.table` `` is un-exported again. This was exported to support an experimental feature (`DT()` functional form of `[`) that never made it to release, but we forgot to claw back this export in the NAMESPACE; sorry about that. We didn't find anyone calling the method directly (which is inadvisable to begin with). |
| 8 | + |
5 | 9 | ## NEW FEATURES |
6 | 10 |
|
7 | 11 | 1. `print.data.table()` shows empty (`NULL`) list column entries as `[NULL]` for emphasis. Previously they would just print nothing (same as for empty string). Part of [#4198](https://github.com/Rdatatable/data.table/issues/4198). Thanks @sritchie73 for the proposal and fix. |
|
40 | 44 |
|
41 | 45 | 14. `fread` loads `.bgz` files directly, [#5461](https://github.com/Rdatatable/data.table/issues/5461). Thanks to @TMRHarrison for the request with proposed fix, and Benjamin Schwendinger for the PR. |
42 | 46 |
|
| 47 | +15. `rbindlist(l, use.names=TRUE)` and `rbind` now work correctly on columns with different class attributes for certain classes such as `Date`, `IDate`, `ITime`, `POSIXct` and `AsIs` with other columns of similar classes, e.g., `IDate` and `Date`. The conversion is done automatically and the class attribute of the final column is determined by the first encountered class attribute in the binding list, [#5309](https://github.com/Rdatatable/data.table/issues/5309), [#4934](https://github.com/Rdatatable/data.table/issues/4934), [#5391](https://github.com/Rdatatable/data.table/issues/5391). |
| 48 | + |
| 49 | +`rbindlist(l, ignore.attr=TRUE)` and `rbind` also gain argument `ignore.attr` to manually deactivate the safety-net of binding columns with different column classes, [#3911](https://github.com/Rdatatable/data.table/issues/3911), [#5542](https://github.com/Rdatatable/data.table/issues/5542). Thanks to @dcaseykc, @fox34, @adrian-quintario, @berg-michael, @arunsrinivasan, @statquant, @pkress, @jrausch12, @therosko, @OfekShilon, @iMissile, @tdhock for the request and @ben-schwen for the PR. |
| 50 | + |
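A minimal sketch of the `ignore.attr` argument described in the entry above, assuming an installed data.table version that already includes it. The table and column names are made up for illustration, and no claim is made here about which class the bound column ends up with.

```r
library(data.table)

# Two one-row tables whose `d` columns carry different date classes.
a = data.table(d = as.Date("2024-01-01"))
b = data.table(d = as.IDate("2024-01-02"))

# ignore.attr=TRUE asks rbindlist() to skip the check that the class
# attributes of the columns being stacked agree.
res = rbindlist(list(a, b), ignore.attr=TRUE)
nrow(res)
```

Since the check is a safety net, it seems prudent to inspect `class(res$d)` after binding rather than assume a particular result.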
| 51 | +16. `fcase()` supports scalars in conditions (e.g. supplying just `TRUE`), vectors in `default=` (so the default can vary by row), and `default=` is now lazily evaluated, [#5461](https://github.com/Rdatatable/data.table/issues/5461). Thanks @sindribaldur for the feature request, which has been highly requested, @shrektan for doing most of the implementation, and @MichaelChirico for sewing things up. |
| 52 | + |
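As a rough illustration of the `fcase()` entry above, the sketch below uses a scalar `TRUE` as a catch-all condition and a vector `default=` that varies by row. It assumes a data.table version that includes this change; the vectors `x` and `fallback` are invented for the example.

```r
library(data.table)

x = c(1L, 5L, 10L)
fallback = c("a", "b", "c")   # one fallback value per row

# Scalar condition: a bare TRUE acts as a catch-all for unmatched rows.
scalar_cond = fcase(x < 3L, "low", TRUE, "other")

# Vector default: rows matching no condition take the corresponding
# element of `fallback` instead of a single recycled value.
vector_default = fcase(x < 3L, "low", x > 8L, "high", default = fallback)
```

Here `scalar_cond` should come out as `"low" "other" "other"` and `vector_default` as `"low" "b" "high"`.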
43 | 53 | ## BUG FIXES |
44 | 54 |
|
45 | 55 | 1. `unique()` returns a copy in the case when `nrow(x) <= 1` instead of a mutable alias, [#5932](https://github.com/Rdatatable/data.table/pull/5932). This is consistent with existing `unique()` behavior when the input has no duplicates but more than one row. Thanks to @brookslogan for the report and @dshemetov for the fix. |
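One way to see why the alias mattered, sketched under the assumption that the installed data.table includes this fix: modifying the result of `unique()` by reference should leave the original one-row table untouched. The names `DT` and `u` are illustrative.

```r
library(data.table)

DT = data.table(a = 1L)        # a single-row table
u = unique(DT)

# Update `u` by reference; with the fix, `u` is a copy, so `DT` keeps its value.
set(u, j = "a", value = 2L)
DT$a
```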
|
66 | 76 |
|
67 | 77 | 12. data.table's `all.equal()` method now dispatches to each column's own `all.equal()` method as appropriate, [#4543](https://github.com/Rdatatable/data.table/issues/4543). Thanks @MichaelChirico for the report and fix. Note that this had two noteworthy changes to data.table's own test suite that might affect you: (1) comparisons of POSIXct columns compare absolute, not relative differences, meaning that millisecond-scale differences might trigger a "not equal" report that was hidden before; and (2) comparisons of integer64 columns could be totally wrong since they were being compared on the basis of their representation as doubles, not long integers. The former might be a matter of preference requiring you to specify a different `tolerance=`, while the latter was clearly a bug. |
68 | 78 |
|
69 | | -13. `rbindlist` could lead to a protection stack overflow when applied to a list containing many nested lists exceeding the pointer protection stack size, [#4536](https://github.com/Rdatatable/data.table/issues/4536). Thanks to @ProfFancyPants for reporting, and Benjamin Schwendinger for the fix. |
| 79 | +13. `rbindlist` and `shift` could lead to a protection stack overflow when applied to a list containing many nested lists exceeding the pointer protection stack size, [#4536](https://github.com/Rdatatable/data.table/issues/4536). Thanks to @ProfFancyPants for reporting, and Benjamin Schwendinger (`rbindlist`) and @MichaelChirico (`shift`) for the fix. |
70 | 80 |
|
71 | 81 | 14. `fread(x, colClasses="POSIXct")` now also works for columns containing only NA values, [#6208](https://github.com/Rdatatable/data.table/issues/6208). Thanks to @markus-schaffer for the report, and Benjamin Schwendinger for the fix. |
72 | 82 |
|
| 83 | +15. `fread()` is more careful about detecting that a file is compressed in bzip2 format, [#6304](https://github.com/Rdatatable/data.table/issues/6304). In particular, we also check the 4th byte is a digit; in rare cases, a legitimate uncompressed CSV file could match 'BZh' as the first 3 bytes. We think an uncompressed CSV file matching 'BZh[1-9]' is all the more rare and unlikely to be encountered in "real" examples. Other formats (zip, gzip) are friendly enough to use non-printable characters in their magic numbers. Thanks @grainnemcguire for the report and @MichaelChirico for the fix. |
| 84 | + |
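The check described in the bzip2 entry can be sketched in base R. This is not the code `fread()` uses internally; `looks_like_bzip2()` is a hypothetical helper that only mirrors the described rule: the first three bytes are `BZh` and the fourth is a digit from 1 to 9.

```r
# Hypothetical helper (not part of data.table): TRUE when the first four
# bytes of `path` look like a bzip2 header, i.e. "BZh" followed by 1-9.
looks_like_bzip2 = function(path) {
  magic = readBin(path, what = "raw", n = 4L)
  if (length(magic) < 4L) return(FALSE)
  identical(magic[1:3], charToRaw("BZh")) &&
    as.integer(magic[4L]) %in% 49:57   # ASCII codes for '1'..'9'
}

# A plain CSV whose first column name happens to start with "BZh".
csv = tempfile(fileext = ".csv")
writeLines(c("BZhx,y", "1,2"), csv)

# A genuinely bzip2-compressed file written through a bzfile() connection.
bz = tempfile(fileext = ".bz2")
con = bzfile(bz, "w")
writeLines(c("x,y", "1,2"), con)
close(con)

is_csv_bz = looks_like_bzip2(csv)   # FALSE: 4th byte is 'x', not a digit
is_bz_bz  = looks_like_bzip2(bz)    # TRUE: header is "BZh" plus a level digit
```

Checking only the first three bytes would have misclassified the CSV above, which is the failure mode the entry describes.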
| 85 | +16. Selecting keyed list columns will retain key without a performance penalty, closes [#4498](https://github.com/Rdatatable/data.table/issues/4498). Thanks to @user9439449 on StackOverflow for the report. |
| 86 | + |
73 | 87 | ## NOTES |
74 | 88 |
|
75 | 89 | 1. `transform` method for data.table sped up substantially when creating new columns on large tables. Thanks to @OfekShilon for the report and PR. The implemented solution was proposed by @ColeMiller1. |
|
116 | 130 |
|
117 | 131 | 20. Removed a warning about the now totally-obsolete option `datatable.CJ.names`, as discussed in previous releases. |
118 | 132 |
|
119 | | -21. Refactored some non-API calls to R macros for S4 objects (#6180)[https://github.com/Rdatatable/data.table/issues/6180]. There should be no user-visible change. Thanks to various R users & R core for pushing to have a clearer definition of "API" for R, and thanks @MichaelChirico for implementing here. |
| 133 | +21. Refactored some non-API calls in the package C code, [#6180](https://github.com/Rdatatable/data.table/issues/6180). There should be no user-visible change. Thanks to various R users, R core, and especially Luke Tierney for pushing to have a clearer definition of "API" for R and for offering clear documentation and suggested workarounds. Thanks @MichaelChirico and @TysonStanley for implementing changes for this release; more will follow. |
120 | 134 |
|
121 | 135 | 22. C code was unified more in how failures to allocate memory (`malloc()`/`calloc()`) are handled, [#1115](https://github.com/Rdatatable/data.table/issues/1115). No OOM issues were reported, as these regions of code typically request relatively small blocks of memory, but it is good to handle memory pressure consistently. Thanks @elfring for the report and @MichaelChirico for the clean-up effort and future-proofing linter. |
122 | 136 |
|
| 137 | +23. The internal routine for finding sort order will now re-use any existing index. A similar optimization was already present in R code, but this has now been pushed to C, where it covers a wider range of use cases and collects more statistics about its input (e.g. whether any infinite entries were found), opening the possibility for more optimizations in other functions. |
| 138 | + |
|
| 139 | +Functions `setindex` (and `setindexv`) will now compute groups' positions as well. `setindex()` also collects the extra statistics alluded to above. |
| 140 | + |
| 141 | +Finding sort order in other routines (for example subset `d2[id==1L]`) does not include those extra statistics so as not to impose a slowdown. |
| 142 | + |
| 143 | +```r |
| 144 | +d2 = data.table(id=2:1, v2=1:2) |
| 145 | +setindexv(d2, "id") |
| 146 | +str(attr(attr(d2, "index"), "__id")) |
| 147 | +# int [1:2] 2 1 |
| 148 | +# - attr(*, "starts")= int [1:2] 1 2 |
| 149 | +# - attr(*, "maxgrpn")= int 1 |
| 150 | +# - attr(*, "anyna")= int 0 |
| 151 | +# - attr(*, "anyinfnan")= int 0 |
| 152 | +# - attr(*, "anynotascii")= int 0 |
| 153 | +# - attr(*, "anynotutf8")= int 0 |
| 154 | + |
| 155 | +d2 = data.table(id=2:1, v2=1:2) |
| 156 | +invisible(d2[id==1L]) |
| 157 | +str(attr(attr(d2, "index"), "__id")) |
| 158 | +# int [1:2] 2 1 |
| 159 | +``` |
| 160 | + |
| 161 | +This feature also enables re-use of sort index during joins, in cases where one of the calls to find sort order is made from C code. |
| 162 | + |
| 163 | +```r |
| 164 | +d1 = data.table(id=1:2, v1=1:2) |
| 165 | +d2 = data.table(id=2:1, v2=1:2) |
| 166 | +setindexv(d2, "id") |
| 167 | +d1[d2, on="id", verbose=TRUE] |
| 168 | +#... |
| 169 | +#Starting bmerge ... |
| 170 | +#forderReuseSorting: using existing index: __id |
| 171 | +#forderReuseSorting: opt=2, took 0.000s |
| 172 | +#... |
| 173 | +``` |
| 174 | + |
| 175 | +This feature resolves [#4387](https://github.com/Rdatatable/data.table/issues/4387), [#2947](https://github.com/Rdatatable/data.table/issues/2947), [#4380](https://github.com/Rdatatable/data.table/issues/4380), and [#1321](https://github.com/Rdatatable/data.table/issues/1321). Thanks to @jangorecki, @jan-glx, and @MichaelChirico for the reports and @jangorecki for implementing. |
| 176 | + |
123 | 177 | ## TRANSLATIONS |
124 | 178 |
|
125 | 179 | 1. Fix a typo in a Mandarin translation of an error message that was hiding the actual error message, [#6172](https://github.com/Rdatatable/data.table/issues/6172). Thanks @trafficfan for the report and @MichaelChirico for the fix. |
126 | 180 |
|
127 | 181 | 2. data.table is now translated into Brazilian Portuguese (`pt_BR`) as well as Mandarin (`zh_CN`). Thanks to the [new translation team](https://github.com/orgs/Rdatatable/teams/brazil) consisting initially of @rffontenelle, @leofontenelle, and @italo-07. The team is open if you'd also like to join and support maintenance of these translations. |
128 | 182 |
|
| 183 | +3. A more helpful error message for using `:=` inside the first argument (`i`) of `[.data.table` is now available in translation, [#6293](https://github.com/Rdatatable/data.table/issues/6293). Previously, the code to display this assumed an earlier message was printed in English. The fix is that calling `:=` directly (i.e., outside the second argument `j` of `[.data.table`) now throws an error of class `dt_invalid_let_error`, which can be detected regardless of the message language. Thanks to Spanish translator @rikivillalba for spotting the issue and @MichaelChirico for the fix. |
| 184 | + |
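A classed error can be handled without inspecting the (possibly translated) message text. The sketch below assumes a data.table version that includes this change; the handler return values are arbitrary labels chosen for the example.

```r
library(data.table)

# Call `:=` directly, outside of j, and dispatch on the condition class
# rather than on the wording of the message.
res = tryCatch(
  `:=`(a, 1),
  dt_invalid_let_error = function(e) "caught dt_invalid_let_error",
  error = function(e) "caught some other error"
)
res
```

On a version with this change, the first handler is the one expected to fire, whatever the session's language setting.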
129 | 185 | # data.table [v1.15.4](https://github.com/Rdatatable/data.table/milestone/33) (27 March 2024) |
130 | 186 |
|
131 | 187 | ## BUG FIXES |
|