
Conversation

ymwangg (Contributor) commented Aug 30, 2022

As discussed in #3868, the three_fry bit generator demonstrates better performance than the default (philox) RNG on GPU, so we should consider making three_fry the default RNG on GPU.

Below are some models I've tested:

| model | default | three_fry | speedup |
|---|---|---|---|
| bert-base-uncased | 85.1333625 | 93.148043 | 1.09414265 |
| roberta-base | 77.8481562 | 84.241191 | 1.08212185 |
| facebook/bart-base | 63.1630113 | 69.2708869 | 1.0967002 |
| gpt2 | 86.7388357 | 89.7944364 | 1.0352276 |
| google/mt5-small | 53.6285788 | 54.9645302 | 1.02491118 |
| t5-small | 86.5723865 | 86.4317218 | 0.99837518 |
| microsoft/deberta-base | 52.1680586 | 53.5818296 | 1.02710032 |
| google/long-t5-local-base | 26.8973489 | 28.0964403 | 1.04458028 |
| resnet50 | 1067.582 | 1070.72111 | 1.00294039 |
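For reference, the bit generator is selected via the `XLA_RNG_BIT_GENERATOR` environment variable (seen in the repro script further down), so the two columns above could be produced by running the same benchmark under each setting. The benchmark script name here is hypothetical:

```shell
# Sketch: run an identical benchmark under each RNG bit generator.
# run_benchmark.py is a placeholder for whatever driver produced the table.
XLA_RNG_BIT_GENERATOR=default   python run_benchmark.py --model bert-base-uncased
XLA_RNG_BIT_GENERATOR=three_fry python run_benchmark.py --model bert-base-uncased
```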

cc @JackCaoG @ang868

JackCaoG self-requested a review August 30, 2022 23:28
JackCaoG (Collaborator) left a comment

There were some CI errors before; not sure if they are related.

ymwangg (Contributor, Author) commented Aug 31, 2022

I'm able to reproduce the CI error locally.

```
res = tensor([[nan, nan], [nan, inf]], device='xla:1')
ref = [[nan nan], [nan nan]]
```

It looks like changing the RNG happened to generate invalid inputs for torch.cov, and assertEqual treats inf and nan as not equal.
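A minimal illustration of why the comparison fails, assuming the harness matches nan against nan but compares everything else with plain equality:

```python
import math

# nan can match nan under an equal_nan-style comparison,
# but inf vs nan fails both the == check and the nan check.
res, ref = float("inf"), float("nan")
equal = (res == ref) or (math.isnan(res) and math.isnan(ref))
print(equal)  # False
```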

JackCaoG (Collaborator) commented

Hmm, can you give me a bit more context? I assume there is an input for which torch.cov generates nan on CPU and inf on xla:gpu. If that's the case, we can just disable that test on our end. I just want to make sure this is not a regression.

ymwangg (Contributor, Author) commented Aug 31, 2022

It looks like this issue is not specific to the three_fry RNG. Below is the minimal code to reproduce it:

```python
import os
import numpy as np
import torch
import torch_xla.core.xla_model as xm

os.environ['GPU_NUM_DEVICES'] = '1'
os.environ['XLA_RNG_BIT_GENERATOR'] = 'default'

device = xm.xla_device()
count = 1

# These first draws only advance the RNG state;
# the tensors are overwritten below.
x = torch.testing.make_tensor((2, 1), dtype=torch.float, device=device)
y = torch.randint(1, 3, (1,), device=device)
z = torch.testing.make_tensor((1,), dtype=torch.float, device=device, low=1)

# Actual inputs: frequency weights fixed at `count`.
x = torch.testing.make_tensor((2, 1), dtype=torch.float, device=device)
y = torch.tensor([count], dtype=torch.int64, device=device)
z = torch.testing.make_tensor((1,), dtype=torch.float, device=device, low=1)

xm.mark_step()
w = torch.cov(x, correction=2, fweights=y, aweights=z)
xm.mark_step()

print(w)
print(np.cov(x.cpu().numpy(), ddof=2,
             fweights=y.cpu().numpy(), aweights=z.cpu().numpy()))
```

With outputs:

```
xla:   tensor([[nan, nan], [nan, inf]], device='xla:1')
numpy: [[nan nan] [nan nan]]
```

If I change count to 3, I get:

```
xla:   tensor([[0.0000e+00, 0.0000e+00], [0.0000e+00, 1.7053e-13]], device='xla:1')
numpy: [[0. 0.], [0. 0.]]
```

I checked the PyTorch implementation, and it looks like torch.cov will force a divide by zero if the degrees of freedom is below 1 (here).
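A minimal NumPy analogue of the same degenerate case (assuming np.cov's matching handling of non-positive degrees of freedom): with a single observation and ddof=2, the normalization factor is non-positive, and the centered data is all zeros, so every entry becomes 0/0 = nan:

```python
import warnings
import numpy as np

# Two variables, one observation, ddof=2: effective dof is 1 - 2 < 0,
# so the normalization divides by zero. The centered data is all zeros
# (a single observation equals its own mean), giving 0/0 = nan.
x = np.array([[0.5], [1.5]])
with warnings.catch_warnings(), np.errstate(divide="ignore", invalid="ignore"):
    warnings.simplefilter("ignore")
    c = np.cov(x, ddof=2)
print(c)  # [[nan nan]
          #  [nan nan]]
```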

So my guess is that the discrepancy is due to 1.7053e-13/0.0 = inf while 0.0/0.0 = nan.
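The two cases can be checked directly with IEEE-754 float division (NumPy, unlike plain Python floats, does not raise on division by zero):

```python
import numpy as np

# A tiny positive numerator over zero yields inf; zero over zero yields nan.
with np.errstate(divide="ignore", invalid="ignore"):
    print(np.float64(1.7053e-13) / 0.0)  # inf
    print(np.float64(0.0) / 0.0)         # nan
```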

I've disabled this test, since this op is not officially supported by PyTorch/XLA (due to the lack of torch.equal support) and the error appears to be unrelated to this change.

JackCaoG (Collaborator) left a comment

Thanks!

JackCaoG merged commit 2d87716 into pytorch:master Sep 13, 2022