Reduce vRAM usage during generation by allowing to transfer logits to CPU #40870
SamuelBarryCS wants to merge 8 commits into
@@ -0,0 +1,123 @@
"""
(This script will of course be deleted before merging)
Thank you for the rapid response to the request! Since we’re at it, maybe it would be a good idea to do the same for output_attentions?
Very fair point @YunruiZhang
cc @gante |
Please see my full reply here: #40794 (comment)
TL;DR: the feature is desirable! But it will clash with an ongoing refactor, so it will be much simpler to add it after the refactor 💛
What
tests.generation.test_utils.test_offload_logits_to_cpu to test non-regression
How to review
Testing performed
Benchmark
memory_test.py that will be deleted before merging to showcase impact
max_new_tokens=1000: ~50% reduction of additional peak memory usage for <2% time overhead.
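For a rough sense of why accumulated logits can dominate the additional memory during long generations, here is a back-of-the-envelope sketch. It assumes generation keeps one (batch_size, vocab_size) logits tensor per new token when logits are returned. The helper name `logits_memory_bytes`, the vocabulary size, and the 4-byte dtype are illustrative assumptions, not values taken from the benchmark above.

```python
def logits_memory_bytes(max_new_tokens: int, batch_size: int,
                        vocab_size: int, bytes_per_value: int) -> int:
    """Bytes needed to keep one (batch_size, vocab_size) logits tensor per generated token."""
    return max_new_tokens * batch_size * vocab_size * bytes_per_value


# Hypothetical settings: only max_new_tokens=1000 comes from the benchmark;
# batch size, vocabulary size and float32 storage are assumptions for illustration.
est = logits_memory_bytes(max_new_tokens=1000, batch_size=1,
                          vocab_size=32_000, bytes_per_value=4)
print(f"{est / 1e6:.0f} MB of logits kept on the device")  # prints "128 MB of logits kept on the device"
```

Under these assumptions, the stored logits grow linearly with max_new_tokens and with batch size, which is the portion of memory that moving them to CPU would take off the GPU.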