
Conversation

ymwangg (Contributor) commented Aug 30, 2022

As discussed in #3868, the three_fry bit generator demonstrates better performance than the default (philox) RNG on GPU, so we should consider making three_fry the default RNG on GPU.

Below are some models I've tested:

| model | default | three_fry | speedup |
|---|---|---|---|
| bert-base-uncased | 85.1333625 | 93.148043 | 1.09414265 |
| roberta-base | 77.8481562 | 84.241191 | 1.08212185 |
| facebook/bart-base | 63.1630113 | 69.2708869 | 1.0967002 |
| gpt2 | 86.7388357 | 89.7944364 | 1.0352276 |
| google/mt5-small | 53.6285788 | 54.9645302 | 1.02491118 |
| t5-small | 86.5723865 | 86.4317218 | 0.99837518 |
| microsoft/deberta-base | 52.1680586 | 53.5818296 | 1.02710032 |
| google/long-t5-local-base | 26.8973489 | 28.0964403 | 1.04458028 |
| resnet50 | 1067.582 | 1070.72111 | 1.00294039 |
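For reference, the bit generator is selected via the `XLA_RNG_BIT_GENERATOR` environment variable (seen in the repro script further down), so the two columns above could be produced by running the same benchmark under each setting. The benchmark script name here is hypothetical:

```shell
# Sketch: run an identical benchmark under each RNG bit generator.
# run_benchmark.py is a placeholder for whatever driver produced the table.
XLA_RNG_BIT_GENERATOR=default   python run_benchmark.py --model bert-base-uncased
XLA_RNG_BIT_GENERATOR=three_fry python run_benchmark.py --model bert-base-uncased
```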

cc @JackCaoG @ang868

JackCaoG self-requested a review August 30, 2022 23:28
JackCaoG (Collaborator) left a comment

There were some CI errors before; not sure if they are related.

ymwangg (Contributor, Author) commented Aug 31, 2022

I'm able to reproduce the CI error locally.

```
res = tensor([[nan, nan], [nan, inf]], device='xla:1')
ref = [[nan nan], [nan nan]]
```

It looks like changing the RNG happened to generate invalid inputs for torch.cov, and assertEqual treats inf and nan as not equal.
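A minimal illustration of why the comparison fails, assuming the harness matches nan against nan but compares everything else with plain equality:

```python
import math

# nan can match nan under an equal_nan-style comparison,
# but inf vs nan fails both the == check and the nan check.
res, ref = float("inf"), float("nan")
equal = (res == ref) or (math.isnan(res) and math.isnan(ref))
print(equal)  # False
```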

JackCaoG (Collaborator) commented

Hmm, can you give me a bit more context? I assume there is an input for which torch.cov generates nan on CPU and inf on xla:gpu. If that's the case, we can just disable that test on our end. I just want to make sure this is not a regression.

ymwangg (Contributor, Author) commented Aug 31, 2022

It looks like this issue is not specific to the three_fry RNG. Below is the minimal code to reproduce it:

```python
import os
import numpy as np
import torch
import torch_xla.core.xla_model as xm

os.environ['GPU_NUM_DEVICES'] = '1'
os.environ['XLA_RNG_BIT_GENERATOR'] = 'default'

device = xm.xla_device()
count = 1

# These first draws only advance the RNG state;
# the tensors are overwritten below.
x = torch.testing.make_tensor((2, 1), dtype=torch.float, device=device)
y = torch.randint(1, 3, (1,), device=device)
z = torch.testing.make_tensor((1,), dtype=torch.float, device=device, low=1)

# Actual inputs: frequency weights fixed at `count`.
x = torch.testing.make_tensor((2, 1), dtype=torch.float, device=device)
y = torch.tensor([count], dtype=torch.int64, device=device)
z = torch.testing.make_tensor((1,), dtype=torch.float, device=device, low=1)

xm.mark_step()
w = torch.cov(x, correction=2, fweights=y, aweights=z)
xm.mark_step()

print(w)
print(np.cov(x.cpu().numpy(), ddof=2,
             fweights=y.cpu().numpy(), aweights=z.cpu().numpy()))
```

With outputs:

```
xla:   tensor([[nan, nan], [nan, inf]], device='xla:1')
numpy: [[nan nan] [nan nan]]
```

If I change count to 3, I get:

```
xla:   tensor([[0.0000e+00, 0.0000e+00], [0.0000e+00, 1.7053e-13]], device='xla:1')
numpy: [[0. 0.], [0. 0.]]
```

I checked the PyTorch implementation, and it looks like torch.cov will force a divide by zero if the degrees of freedom is below 1 (here).
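A minimal NumPy analogue of the same degenerate case (assuming np.cov's matching handling of non-positive degrees of freedom): with a single observation and ddof=2, the normalization factor is non-positive, and the centered data is all zeros, so every entry becomes 0/0 = nan:

```python
import warnings
import numpy as np

# Two variables, one observation, ddof=2: effective dof is 1 - 2 < 0,
# so the normalization divides by zero. The centered data is all zeros
# (a single observation equals its own mean), giving 0/0 = nan.
x = np.array([[0.5], [1.5]])
with warnings.catch_warnings(), np.errstate(divide="ignore", invalid="ignore"):
    warnings.simplefilter("ignore")
    c = np.cov(x, ddof=2)
print(c)  # [[nan nan]
          #  [nan nan]]
```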

So my guess is that the discrepancy is due to 1.7053e-13/0.0 = inf while 0.0/0.0 = nan.
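The two cases can be checked directly with IEEE-754 float division (NumPy, unlike plain Python floats, does not raise on division by zero):

```python
import numpy as np

# A tiny positive numerator over zero yields inf; zero over zero yields nan.
with np.errstate(divide="ignore", invalid="ignore"):
    print(np.float64(1.7053e-13) / 0.0)  # inf
    print(np.float64(0.0) / 0.0)         # nan
```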

I've disabled this test, since this op is not officially supported by PyTorch/XLA (due to the lack of torch.equal support) and the error appears to be unrelated to this change.

JackCaoG (Collaborator) left a comment

Thanks!

JackCaoG merged commit 2d87716 into pytorch:master Sep 13, 2022