[Whisper] Fix forced decoder ids#20652
Conversation
|
The documentation is not available anymore as the PR was closed or merged. |
ArthurZucker
left a comment
There was a problem hiding this comment.
LGTM, we are just gonna get whispered at again by @ydshieh for failing tests 😩 🤣
|
|
||
| expected_ids = [START_OF_TRANSCRIPT, TRANSCRIBE, NOTIMESTAMPS] | ||
| expected_ids = [TRANSCRIBE, NOTIMESTAMPS] | ||
| self.assertListEqual([ids[-1] for ids in forced_decoder_ids], expected_ids) |
There was a problem hiding this comment.
Haha nice the test was indeed worse it
|
fyi @sgugger, the final fix we hope 🤞 |
ydshieh
left a comment
There was a problem hiding this comment.
LGTM, but could you explain this a bit:
the start of label sequence:
<|startoftranscript|> <|lang_id|> <|task|> <|notimestamps|> ...
My general understanding is that, the lable sequence should NOT contain the decoder_start_token_id (here is <|startoftranscript|>).
But here you mention the start of label sequence -- I have some doubt here.
|
Yes! Let me clarify! When training, we need to encode a sentence to a sequence of label ids. Here, we need to append the 'special' beginning of sentence tokens to the label ids. This is so that the model learns to predict the correct 'special' tokens for the generation process. For a full list of the tokens added, see this PR: #19921 One of these tokens is the from transformers import BartTokenizer
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
input_str = "the cat"
tokens = tokenizer(input_str).input_ids
print(tokenizer.decode(tokens))Print Output: Now, it doesn't matter for training whether or not we append the decoder start token id to the start of our label sequence, because we cut it in our data collator: So, adding the decoder start token id is more for making the tokeniser user friendly and consistent with other tokenisers in the library. |
|
@sanchit-gandhi Thanks. Just want to point out: For In Whisper, I understand we want to be user-friendly. And as you have cut it in data collator, it is fine. But IMO, this is something a bit different from our NLP models (i.e. Bart here). Hopefully I understand it correctly. |
* [Whisper] Fix forced decoder ids * fix test
What does this PR do?
The Whisper tokenizer has a property
self.prefix_tokensthat returns the token ids appended to the start of label sequence:In the PR #20589, the method
get_decoder_prompt_idswas copied from the Whisper processor to the Whisper tokenizer, where it then made use of the tokenizer propertyself.prefix_tokens. The methodget_decoder_prompt_idsis used to set the tokens that are forced at the beginning of the generation process.However, the forced decoder ids should not contain the
<|startoftranscript|>token: this is thedecoder_start_token_idthat we use as token 0 when we start generation. If we include<|startoftranscript|>in our forced decoder ids, we'll get a double generation of<|startoftranscript|>. Thus, we only want to set the following tokens in theforced_decoder_ids:Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.