[WIP] Support TorchScript and graph rewrite #54
feihugis wants to merge 12 commits into microsoft:main
Conversation
Good to see this PR!
PR #43 (@NickNickGo) is for fairseq, and this PR currently only works for transformers-bart, so there will be no conflicts between the two PRs. [Jiusheng]: Is it possible to cover both fairseq and transformers?
There is a performance issue after the graph is rewritten. Once the issue is resolved, I will update the benchmark numbers and add docs.
I see this is only for BART; with JIT, we should be able to optimize multiple models from the backend?
Yes, the optimization can be applied to other models. The current limitation is that we need to check if other models are compatible with torch.jit.script.
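A minimal sketch of such a compatibility check, assuming a transformers BART checkpoint; the model name and attribute path are illustrative and not part of this PR:

```python
import torch
from transformers import BartForConditionalGeneration

# Illustrative check: try scripting the self-attention module to see whether
# it is TorchScript-compatible before applying the graph optimization.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
self_attn = model.model.encoder.layers[0].self_attn
try:
    torch.jit.script(self_attn)
    print("self_attn is scriptable")
except Exception as err:
    print(f"self_attn is not scriptable: {err}")
```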
This makes sense. Please include these in the PR. Looking forward to reviewing them.
eqn[0:4] == eqn[13:17]?
Spaces are allowed in the equation; replace them first:
eqn = eqn.replace(' ', '')
Yes, spaces are allowed here. One issue I'm still working on is that adding the replace triggers a strange issue in the IRParser.
eqn[3] == eqn[16] is unnecessary if eqn[0:4] == eqn[13:17] is used.
- Add some extra spaces in the equation
- Use a different character set, like i, j, k, etc.
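For reference, a minimal sketch of the kind of position-based check discussed above; the function names and the 18-character pattern (e.g. "bmhtd,bmhsd->bmhts") are illustrative assumptions, not the exact code in this PR:

```python
def normalize_einsum_equation(eqn: str) -> str:
    # Spaces are legal in einsum equations ("bmhtd, bmhsd -> bmhts"),
    # so strip them before matching characters by position.
    return eqn.replace(' ', '')

def matches_pattern_0(eqn: str) -> bool:
    # For an 18-character equation such as "bmhtd,bmhsd->bmhts",
    # eqn[0:4] == eqn[13:17] already implies eqn[3] == eqn[16].
    eqn = normalize_einsum_equation(eqn)
    return (len(eqn) == 18 and eqn[5] == ','
            and eqn[11:13] == '->' and eqn[0:4] == eqn[13:17])
```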
Force-pushed from cf38541 to 0575353
NickNickGo left a comment
Thanks @feihugis for this PR! Looks good in general.
- Looking forward to the speedup/profile comparison for the einsum op before and after the rewrite.
- Can more cases/shapes be covered by the "einsum_rewrite_pattern_0" function?
- Could you briefly describe the changes that make SelfAttention of the transformers-bart model compatible with JIT? Maybe add a few comments in the code.
Can we make this more general? The same pattern can be used for equations without a batch dimension.
I prefer leaving it as it is. If we run into those cases in the future, they can be added easily with a similar code block. Making it more general would move it closer to a full einsum kernel implementation.
Based on the micro-benchmark results, the runtime for large tensors is very similar with and without the optimization.
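For illustration, a minimal sketch of this kind of rewrite, assuming the equation "bmhtd,bmhsd->bmhts"; the actual einsum_rewrite_pattern_0 in this PR may handle different shapes or equations:

```python
import torch

def rewrite_bmhtd_bmhsd(t1: torch.Tensor, t2: torch.Tensor) -> torch.Tensor:
    # Equivalent to torch.einsum("bmhtd,bmhsd->bmhts", t1, t2):
    # fold the leading (b, m, h) dims into a single batch dim and use bmm.
    b, m, h, t, d = t1.shape
    s = t2.shape[3]
    a = t1.reshape(b * m * h, t, d)
    c = t2.reshape(b * m * h, s, d).transpose(1, 2)  # (b*m*h, d, s)
    return torch.bmm(a, c).reshape(b, m, h, t, s)
```

The result can be checked against torch.einsum with torch.allclose on random inputs.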
Is the returned tensor contiguous? When comparing the speedup with einsum, please take this into account as well.
No, the returned tensor is not contiguous, and the output of einsum is not contiguous either, so I think it is an apples-to-apples comparison.
Is CUDA synchronization taken care of?
Good point. torch.cuda.synchronize() is added now.
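For context, a minimal sketch of a GPU micro-benchmark that synchronizes before reading the timer; the helper name and the warmup/iteration counts are illustrative, not the benchmark code used in this PR:

```python
import time
import torch

def benchmark_ms(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    """Average runtime of fn(*args) in milliseconds on the GPU."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()  # make sure warmup kernels have finished
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()  # wait for all timed kernels before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / iters
```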
Force-pushed from c1955b7 to 5acea73
The major change is to make the code work with the limited set of data types that TorchScript supports and to handle the behavioral differences between Python and TorchScript. For example, Python can update the values of a dictionary in place, but TorchScript could not; the code logic is changed accordingly to handle these differences.
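As a rough illustration only (not the code in this PR), a hypothetical TorchScript-compatible helper that rebuilds a typed dictionary instead of updating it in place:

```python
from typing import Dict, Optional
import torch

@torch.jit.script
def update_cache(cache: Dict[str, Optional[torch.Tensor]],
                 key: str,
                 value: torch.Tensor) -> Dict[str, Optional[torch.Tensor]]:
    # TorchScript requires explicit container types; here a new dict is
    # built and returned instead of mutating the incoming one in place.
    new_cache = torch.jit.annotate(Dict[str, Optional[torch.Tensor]], {})
    for k, v in cache.items():
        new_cache[k] = v
    new_cache[key] = value
    return new_cache
```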
Based on the perf benchmarking below, the performance with and without the optimization is very similar. Micro benchmark for the optimized operation:
E2E benchmark results:
* Fix prophetnet dict loading.
* Use logger.
* Fix import.
* Generate the XML log file for each unit test
* Run all fastseq unit tests
* Add Nikhil's changes on the pipeline to publish XML
* Just use a small unit test to test the pipeline
* Change the XML folder path
* Add more tests
* Add an env var for the XML log dir and test the failures
* Enable all fastseq unit tests
* Enable all tests
* Generate XML files for fairseq and transformers unit tests
* Fix an issue in the pytest command
* Trigger the CI pipeline
… (#59)
* Update install_requires and enable fairseq to work with torch 1.6 & 1.7
* Better error message and address some warnings in torch 1.7
* Raise the error if fairseq/transformers are installed but the optimizations can not be applied
* Move transformers/fairseq to extra_require
* Remove the out-of-date build files for the ngram cuda op
* Run fastseq unit tests before transformers and fairseq
This PR makes the SelfAttention module of the transformers-bart model compatible with TorchScript and adds the graph rewriter/optimization for einsum.
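A minimal sketch of how the scripted graph exposes the einsum node that such a rewriter targets; the toy module below is illustrative and not part of this PR:

```python
import torch

class ToyAttentionScores(torch.nn.Module):
    # Toy stand-in for an attention-score computation that uses einsum.
    def forward(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        return torch.einsum("bmhtd,bmhsd->bmhts", q, k)

scripted = torch.jit.script(ToyAttentionScores())
# The TorchScript IR contains an aten::einsum node, which a pattern-based
# graph rewrite can replace with an equivalent reshape + bmm sequence.
print(scripted.graph)
```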