Cuda op for ngram repeat blocking #40
Conversation
Thanks @NickNickGo for implementing this operation! I did not look into the details yet. Leaving some general comments:
- add unit tests to make sure the op works as expected for different cases;
- add a benchmarking unit test for this op;
- validate the inputs, either in the Python API or in the backend;
- add the license header to each file;
- add docs for both the Python and C++ code;
Looking forward to the performance number!
yuyan2do
left a comment
Overall it looks good to me. Added some comments on parameter naming.
A tutorial on CUDA programming, which explains blocks, threads, and shared memory:
https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf
```cpp
int no_repeat_ngram = 3;
int threads = step - no_repeat_ngram + 2;
int shared_mem_size = (step + 1) * sizeof(long);
if (threads <= 0) return lprobs;
```
Put the return check at the beginning of this function.
This page explains the various CUDA architectures well. https://en.wikipedia.org/wiki/CUDA
```cpp
torch::Tensor tokens,
torch::Tensor lprobs,
```
Do we need to check the dimensions of `tokens` and `lprobs`? For example, is the 0th dim of `tokens` assumed to be the batch dimension?
Could `tokens` be constant, i.e. `const torch::Tensor& tokens`?
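A torch-free sketch of the kind of validation being suggested (the real extension would presumably use `TORCH_CHECK` on the actual tensors; the assumed layout here, `tokens` of shape `(batch*beam, step+1)` and `lprobs` of shape `(batch*beam, vocab)`, is an illustration, not taken from the PR):

```cpp
#include <stdexcept>
#include <vector>

// Sketch of input validation: both inputs must be 2-D and agree on
// dim 0 (the batch*beam dimension). Sizes are passed as plain vectors
// here so the example stands alone without torch headers.
void check_shapes(const std::vector<long>& tokens_sizes,
                  const std::vector<long>& lprobs_sizes) {
  if (tokens_sizes.size() != 2 || lprobs_sizes.size() != 2) {
    throw std::invalid_argument("tokens and lprobs must be 2-D");
  }
  if (tokens_sizes[0] != lprobs_sizes[0]) {
    throw std::invalid_argument("tokens and lprobs must agree on dim 0");
  }
}
```

Failing fast with a clear message here is cheaper than debugging an out-of-bounds access inside the kernel.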
Before:
After:
README.md
Outdated
```diff
- ## How it works?
- We developped a wide range of speedup techniques, including improving beam search efficiency, reducing memory footprint, speeding up calculation for key operations etc, IO speedup etc. To seamlessly connect with community, they were applied to existing models from Fairseq and Huggingface Transformers in the backend, while keeping model interface and usage same as before.
-
```
Why add "-" here? This is supposed to be an independent section :)
This PR involves 3 optimizations.
- Transformers BART large, BS 128, 1k samples: throughput improves from 9.1 to 11.8 (including model load time).
- Fairseq BART large, BS 128, 1k samples: throughput improves from 13.7 to 15.5; generation time drops from 48.4 to 39.7.