streamline loop in GForce j optimization by MichaelChirico · Pull Request #3777 · Rdatatable/data.table

MichaelChirico · 2019-08-19T03:37:12Z

I guess the original intent of using a for loop was to avoid copying jsub when nothing in it has to be overwritten. But #1470 is a case where nothing is actually changed, and copying the whole thing is still faster by a factor of 50x.
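For readers who want to see the general shape of the "scan everything in one pass" idea, here is a minimal sketch. It is not the code from this PR: the `jsub` expression, its column names, and the variable names `elems`, `is_mean_call` and `which_mean` are all made up for illustration.

```r
# Hypothetical j expression; the column names a, b, c are made up.
jsub <- quote(list(mean(a), .N, b, mean(c)))

# One pass over a plain-list copy of the call.
# Element 1 is the symbol `list`, the rest are the arguments.
elems <- as.list(jsub)
is_mean_call <- vapply(
  elems,
  function(q) is.call(q) && identical(q[[1L]], as.name("mean")),
  logical(1L)
)

# Positions in jsub that hold a mean(...) call.
which_mean <- which(is_mean_call)
```

Because the scan runs over all of `jsub`, the positions in `which_mean` can be used directly as indices into `jsub`, with no offset from dropping the first element.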

codecov · 2019-08-19T03:48:28Z

Codecov Report

Merging #3777 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3777      +/-   ##
==========================================
+ Coverage   99.41%   99.41%   +<.01%     
==========================================
  Files          71       71              
  Lines       13241    13242       +1     
==========================================
+ Hits        13164    13165       +1     
  Misses         77       77

Impacted Files	Coverage Δ
R/data.table.R	`100% <100%> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4ee9264...ebb611a.

MichaelChirico · 2019-08-19T03:51:20Z

R/data.table.R

      if (jsub[[1L]]=="list") {
-        for (ii in seq_along(jsub)[-1L]) {
-          this_jsub = jsub[[ii]]
-          if (dotN(this_jsub)) next; # For #5760


Is this dotN check doing anything? The loop only affects is.call elements, and dotN specifically checks is.name, so the two can never both be true. Using is.call with && will accomplish the same.

+1. And all the time was being spent in dotN too. The slowdown wasn't to do with the .optmean part, per se. Rprof output here: #1470 (comment)

Good catch! Maybe we should revert to the for loop approach then as well (though timings are pretty small in both cases)?

Yep, good thought. I tried to revert to the for() loop approach but it was still 15s. Down from 30s, but not 0.5s as it should be. So now I'm not sure what's going on. Let's keep the sapply way then and revisit in the future.
Actually, this is consistent with the Rprof result. If I read it correctly, 50% was in dotN, not "all" as I wrote above.
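A small sketch of why the `dotN()` guard is redundant once the loop body only acts on calls. The `dotN` definition is copied from the diff in this PR; the sample elements in `elems` are hypothetical.

```r
# dotN as defined in R/data.table.R (see the diff in this PR).
dotN <- function(x) is.name(x) && x == ".N"

# Hypothetical elements that could appear inside a j expression.
elems <- list(quote(.N), quote(x), quote(mean(x)), quote(.N + 1L), 1L)

is_call <- vapply(elems, is.call, logical(1L))
is_dotN <- vapply(elems, dotN, logical(1L))

# A name is never a call, so the two predicates never hold together.
overlap <- any(is_call & is_dotN)
```

Since `overlap` is `FALSE` for every kind of element, skipping `.N` before an `is.call()` test changes nothing except the time spent calling `dotN()`.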

MichaelChirico · 2019-08-28T22:51:58Z

R/data.table.R

+        cat("lapply optimization is on, j unchanged as '",deparse(jsub,width.cutoff=200L, nlines=1L),"'\n",sep="")
    }
-    dotN = function(x) is.name(x) && x == ".N" # For #5760
+    dotN = function(x) is.name(x) && x==".N" # For #5760. TODO: Rprof() showed dotN() may be the culprit if iterated (#1470)?; avoid the == which converts each x to character?


x == quote(.N) works, is it any faster?

I guess not, somewhat surprisingly (?):

> microbenchmark::microbenchmark(times = 1e5,
+   quote(N) == quote(N),
+   quote(N) == 'N')
Unit: nanoseconds
                 expr min  lq     mean median  uq     max neval
 quote(N) == quote(N) 190 226 309.0476    234 244 3892045 1e+05
      quote(N) == "N" 153 182 213.9202    192 198   39180 1e+05

Ditto if we store qN = quote(N) beforehand for the RHS. identical() is much worse.
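To make the three variants discussed in this thread concrete, here is a base-R sketch that checks they agree and times them with `system.time()` instead of microbenchmark. The names `forms`, `inputs`, `res` and `elapsed` are hypothetical, and the timings will vary by machine and R version, so none are quoted here.

```r
qN <- quote(.N)

# Three ways of asking "is x the symbol .N?".
forms <- list(
  string = function(x) is.name(x) && x == ".N",
  symbol = function(x) is.name(x) && x == qN,
  ident  = function(x) identical(x, qN)
)

# Hypothetical inputs: the symbol itself, another symbol, a call, a string.
inputs <- list(quote(.N), quote(N), quote(mean(.N)), ".N")

# One column per form, one row per input.
res <- sapply(forms, function(f) vapply(inputs, f, logical(1L)))

# Rough timing of each form on the matching symbol.
elapsed <- sapply(forms, function(f) {
  system.time(for (i in seq_len(1e5)) f(qN))[["elapsed"]]
})
```

All three columns of `res` should be identical, so the choice between them is purely about speed.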

mattdowle · 2019-08-28T23:23:44Z

R/data.table.R

        if (jsub[[1L]]=="list") {
          GForce = TRUE
-          for (ii in seq_along(jsub)[-1L]) if (!.ok(jsub[[ii]])) GForce = FALSE
+          for (ii in seq_along(jsub)[-1L]) if (!.ok(jsub[[ii]])) {GForce = FALSE; break}


This change is the (good) culprit for the timing strangeness. I've been going back to master and making tweaks there to compare timings to this branch. But this .ok() also calls dotN() hence the confusion. Getting there ...
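A sketch of the effect of the added `break`, using a stand-in predicate. The real `.ok()` in data.table is more involved; `ok` below is a hypothetical simplification that only accepts `mean(...)` calls, and `check` and its counter are made up for illustration.

```r
# Hypothetical stand-in for .ok(): accept only mean(...) calls.
ok <- function(q) is.call(q) && identical(q[[1L]], as.name("mean"))

# Run the GForce eligibility loop, counting how many elements are tested.
check <- function(jsub, stop_early) {
  n_calls <- 0L
  GForce <- TRUE
  for (ii in seq_along(jsub)[-1L]) {
    n_calls <- n_calls + 1L
    if (!ok(jsub[[ii]])) {
      GForce <- FALSE
      if (stop_early) break
    }
  }
  list(GForce = GForce, n_calls = n_calls)
}

# Hypothetical j expression whose second argument fails the test.
jsub <- quote(list(mean(a), b, mean(c), mean(d)))
without_break <- check(jsub, stop_early = FALSE)
with_break    <- check(jsub, stop_early = TRUE)
```

Both runs reach the same `GForce` answer; the early exit just stops testing elements after the first failure, which matters when `jsub` is very long.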

mattdowle · 2019-08-28T23:51:57Z

50% of the 30s does seem to be the for loop. I whittled it down to the following:

for (ii in seq_along(jsub)) {   # 0.5s  (for the rest of [.data.table, not this loop)
  # this_jsub = jsub[[ii]]
}

for (ii in seq_along(jsub)) {   # 15s  (an extra 14.5s for this loop)
  this_jsub = jsub[[ii]]
}

The Rprof() output for the 15s timing doesn't help much.

> Rprof()
> system.time(DT[,.SD,by=st])
   user  system elapsed 
 13.028   1.052  13.933 
> Rprof(NULL)
> summaryRprof()
$by.self
               self.time self.pct total.time total.pct
"[.data.table"     13.98    98.73      14.08     99.44
"gc"                0.08     0.56       0.08      0.56
"list"              0.06     0.42       0.06      0.42
"FUN"               0.02     0.14       0.02      0.14
"new.env"           0.02     0.14       0.02      0.14

$by.total
                       total.time total.pct self.time self.pct
"system.time"               14.16    100.00      0.00     0.00
"[.data.table"              14.08     99.44     13.98    98.73
"["                         14.08     99.44      0.00     0.00
"gc"                         0.08      0.56      0.08     0.56
"list"                       0.06      0.42      0.06     0.42
"FUN"                        0.02      0.14      0.02     0.14
"new.env"                    0.02      0.14      0.02     0.14
".unsafe.opt"                0.02      0.14      0.00     0.00
"cmp"                        0.02      0.14      0.00     0.00
"cmpCall"                    0.02      0.14      0.00     0.00
"cmpfun"                     0.02      0.14      0.00     0.00
"compiler:::tryCmpfun"       0.02      0.14      0.00     0.00
"doTryCatch"                 0.02      0.14      0.00     0.00
"findCenvVar"                0.02      0.14      0.00     0.00
"genCode"                    0.02      0.14      0.00     0.00
"getInlineInfo"              0.02      0.14      0.00     0.00
"h"                          0.02      0.14      0.00     0.00
"lapply"                     0.02      0.14      0.00     0.00
"tryCatch"                   0.02      0.14      0.00     0.00
"tryCatchList"               0.02      0.14      0.00     0.00
"tryCatchOne"                0.02      0.14      0.00     0.00
"tryInline"                  0.02      0.14      0.00     0.00

There isn't even any sub-assignment to jsub here, so I'm not sure what's going on. The next step would be to attempt to reproduce this independently of data.table and then ask on r-devel. A long dummy jsub could be created to do that in a fresh vanilla R session.

Since the PR is working great by doing away with the for loop, I'll merge and move on.
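A possible starting point for the standalone reproduction suggested above, as a sketch only. The length `n` and the column names `V1..Vn` are hypothetical and not taken from #1470, and whether this reproduces the slowdown seen inside `[.data.table` has not been checked here. One hypothesis worth testing (not established in this thread) is that extracting by position from a call object is not constant-time, whereas extracting from a plain list copy is.

```r
# Build a long dummy call list(V1, V2, ..., Vn); n and the names are made up.
n <- 10000L
jsub <- as.call(c(list(as.name("list")),
                  lapply(paste0("V", seq_len(n)), as.name)))

# Loop that only walks the indices.
t_index <- system.time(for (ii in seq_along(jsub)) NULL)

# Loop that extracts each element from the call itself.
t_call <- system.time(for (ii in seq_along(jsub)) this_jsub <- jsub[[ii]])

# Same loop over a plain-list copy of the call.
jlist <- as.list(jsub)
t_list <- system.time(for (ii in seq_along(jlist)) this_jsub <- jlist[[ii]])

timings <- rbind(index = t_index, call = t_call, list = t_list)[, "elapsed"]
```

Running this with increasing `n` and comparing the three entries of `timings` would show whether the cost of the `jsub[[ii]]` loop grows faster than linearly.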

Closes #1470 -- streamline loop in GForce j optimization

08b4b44

MichaelChirico commented Aug 19, 2019

jangorecki reviewed Aug 20, 2019

mattdowle changed the title ~~Closes #1470 -- streamline loop in GForce j optimization~~ streamline loop in GForce j optimization Aug 28, 2019

mattdowle added 2 commits August 28, 2019 13:34

Merge branch 'master' into gforce_big_j

8b4e1be

moved news item up to the right version

55f813c

mattdowle added this to the 1.12.4 milestone Aug 28, 2019

used seq.int to avoid [-1L] copy, and similarly sapply all of jsub to avoid jsub[-1L]

ad3903e

mattdowle mentioned this pull request Aug 28, 2019

Bug report - RSession Hangs #1470

Closed

more comments and removed the if(length(whichMean))

8d0419d

MichaelChirico commented Aug 28, 2019

which() looked a bit early. changed more for readability and for coverage reasons

ebb611a

mattdowle reviewed Aug 28, 2019

mattdowle merged commit f10402c into master Aug 29, 2019

mattdowle deleted the gforce_big_j branch August 29, 2019 00:07

mattdowle mentioned this pull request Aug 29, 2019

add timing test for many .SD cols #3797

Open

2 tasks

mattdowle merged 6 commits into master from gforce_big_j


3 participants
