[Whisper] Fix forced decoder ids by sanchit-gandhi · Pull Request #20652 · huggingface/transformers

sanchit-gandhi · 2022-12-07T13:59:13Z

What does this PR do?

The Whisper tokenizer has a property self.prefix_tokens that returns the token ids prepended at the start of the label sequence:

<|startoftranscript|> <|lang_id|> <|task|> <|notimestamps|> ...

In PR #20589, the method get_decoder_prompt_ids was copied from the Whisper processor to the Whisper tokenizer, where it then made use of the tokenizer property self.prefix_tokens. The method get_decoder_prompt_ids is used to set the tokens that are forced at the beginning of the generation process.

However, the forced decoder ids should not contain the <|startoftranscript|> token: this is the decoder_start_token_id that we use as token 0 when we start generation. If we include <|startoftranscript|> in our forced decoder ids, we'll get a double generation of <|startoftranscript|>. Thus, we only want to set the following tokens in the forced_decoder_ids:

<|lang_id|> <|task|> <|notimestamps|> ...
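
To make the off-by-one concrete, here is a minimal sketch of how the forced decoder ids could be derived from the prefix tokens. The helper name build_forced_decoder_ids and the numeric token ids are illustrative assumptions, not the library's API; the real ids depend on the checkpoint's vocabulary.

```python
from typing import List, Tuple

# Illustrative placeholder ids (assumptions); real values come from the tokenizer vocabulary.
START_OF_TRANSCRIPT = 50258
LANG_EN = 50259
TRANSCRIBE = 50359
NO_TIMESTAMPS = 50363


def build_forced_decoder_ids(prefix_tokens: List[int]) -> List[Tuple[int, int]]:
    """Hypothetical helper: map prefix tokens to (position, token_id) pairs.

    Position 0 is already occupied by the decoder start token, so the first
    prefix token (<|startoftranscript|>) is dropped and forcing starts at 1.
    """
    forced_tokens = prefix_tokens[1:]
    return [(position + 1, token_id) for position, token_id in enumerate(forced_tokens)]


prefix_tokens = [START_OF_TRANSCRIPT, LANG_EN, TRANSCRIBE, NO_TIMESTAMPS]
forced_decoder_ids = build_forced_decoder_ids(prefix_tokens)
print(forced_decoder_ids)
```

With this layout, generation starts from the decoder start token at position 0 and the language, task and timestamp tokens are forced at positions 1 onwards, so <|startoftranscript|> is emitted only once.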

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2022-12-07T14:12:39Z

The documentation is not available anymore as the PR was closed or merged.

ArthurZucker

LGTM, we are just gonna get whispered at again by @ydshieh for failing tests 😩 🤣

ArthurZucker · 2022-12-07T14:32:34Z


-        expected_ids = [START_OF_TRANSCRIPT, TRANSCRIBE, NOTIMESTAMPS]
+        expected_ids = [TRANSCRIBE, NOTIMESTAMPS]
        self.assertListEqual([ids[-1] for ids in forced_decoder_ids], expected_ids)


Haha nice, the test was indeed worth it

sanchit-gandhi · 2022-12-07T14:53:00Z

fyi @sgugger, the final fix we hope 🤞

ydshieh

LGTM, but could you explain this a bit:

the start of label sequence:
<|startoftranscript|> <|lang_id|> <|task|> <|notimestamps|> ...

My general understanding is that the label sequence should NOT contain the decoder_start_token_id (here <|startoftranscript|>).

But here you mention the start of the label sequence -- I have some doubts here.

sanchit-gandhi · 2022-12-07T15:13:26Z

Yes! Let me clarify!

When training, we need to encode a sentence to a sequence of label ids. Here, we need to prepend the 'special' beginning-of-sentence tokens to the label ids. This is so that the model learns to predict the correct 'special' tokens for the generation process. For a full list of the tokens added, see this PR: #19921

One of these tokens is the <|startoftranscript|> token. This is consistent with other tokenisers in the library, such as the BART tokeniser:

from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
input_str = "the cat"
tokens = tokenizer(input_str).input_ids
print(tokenizer.decode(tokens))

Print Output:

<s>the cat</s>

Now, it doesn't matter for training whether or not we prepend the decoder start token id to our label sequence, because we cut it in our data collator:

transformers/examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py

Line 249 in 3ac040b

# if bos token is appended in previous tokenization step,

So, adding the decoder start token id is more for making the tokeniser user friendly and consistent with other tokenisers in the library.
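
The cut described above can be sketched on plain Python lists. This is an illustration only: the function name strip_leading_start_token and the token ids are assumptions, and the referenced script likely operates on padded tensors rather than lists.

```python
from typing import List


def strip_leading_start_token(
    labels: List[List[int]], decoder_start_token_id: int
) -> List[List[int]]:
    """Hypothetical helper: drop the first token of every label sequence,
    but only if every sequence in the batch begins with the decoder start
    token. The model prepends that token again when it builds its decoder
    inputs, so keeping it in the labels would duplicate it.
    """
    starts_with_token = len(labels) > 0 and all(
        len(seq) > 0 and seq[0] == decoder_start_token_id for seq in labels
    )
    if starts_with_token:
        return [seq[1:] for seq in labels]
    return labels


# Placeholder ids: 50258 stands in for <|startoftranscript|>.
batch = [[50258, 50259, 50359, 50363, 11, 12], [50258, 50259, 50359, 50363, 13]]
stripped = strip_leading_start_token(batch, decoder_start_token_id=50258)
print(stripped)
```

The check is applied to the whole batch so that sequences are either all trimmed or all left untouched, which keeps the label lengths consistent with whatever tokenization step produced them.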

sgugger

Thanks for fixing!

ydshieh · 2022-12-07T15:35:05Z

@sanchit-gandhi Thanks. Just want to point out: for BART, yes, we have bos <s> (id 0). But it is not the decoder start token (which is </s> for BART, with id 2) - it is just the start of the sentence (not ready for generation). The labels have bos but not decoder_start_token. The labels will be shifted and prepended with </s> to become decoder input ids.

In Whisper, I understand we want to be user-friendly. And as you have cut it in data collator, it is fine. But IMO, this is something a bit different from our NLP models (i.e. Bart here). Hopefully I understand it correctly.
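
The BART behaviour described in this comment can be illustrated with a simplified shift on plain lists. The ids 0 (<s>) and 2 (</s>) are taken from the comment; the function below is a sketch that ignores padding and ignore-index handling, and the word ids 100 and 200 are made up.

```python
from typing import List

BOS_ID = 0  # <s>, start of sentence (from the comment above)
DECODER_START_ID = 2  # </s>, used by BART as the decoder start token


def shift_right(labels: List[List[int]], decoder_start_token_id: int) -> List[List[int]]:
    """Simplified sketch: build decoder inputs by dropping the last label
    and prepending the decoder start token to each sequence."""
    return [[decoder_start_token_id] + seq[:-1] for seq in labels]


# Labels keep <s> at the front; 100 and 200 are made-up word ids.
labels = [[BOS_ID, 100, 200, DECODER_START_ID]]
decoder_input_ids = shift_right(labels, DECODER_START_ID)
print(decoder_input_ids)
```

Here the labels still begin with <s>, and the decoder start token appears only in the decoder inputs. In Whisper, <|startoftranscript|> plays both roles, which is why it has to be cut from the labels before the shift.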

* [Whisper] Fix forced decoder ids * fix test

[Whisper] Fix forced decoder ids

effaa78

sanchit-gandhi requested a review from ArthurZucker December 7, 2022 13:59

fix test

787cb82

ArthurZucker approved these changes Dec 7, 2022

ydshieh approved these changes Dec 7, 2022

sgugger approved these changes Dec 7, 2022

sanchit-gandhi merged commit 77382e9 into huggingface:main Dec 7, 2022

mpierrau pushed a commit to mpierrau/transformers that referenced this pull request Dec 15, 2022

[Whisper] Fix forced decoder ids (huggingface#20652)

8ffc9c1

* [Whisper] Fix forced decoder ids * fix test

sanchit-gandhi merged 2 commits into huggingface:main from sanchit-gandhi:whisper-ids