Special Language Token for PLBART needs to be updated

### System Info

The `FAIRSEQ_LANGUAGE_CODES` in PLBartTokenizer [here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/plbart/tokenization_plbart.py#L90) need to be as follows.

```
FAIRSEQ_LANGUAGE_CODES = {
    "base": ["__java__", "__python__", "__en_XX__"],
    "multi": ["__java__", "__python__", "__en_XX__", "__javascript__", "__php__", "__ruby__", "__go__"],
}
```

The current PLBartTokenizer treats `java` as a special token, and thus it removes the token when decoding is performed. An example is given below.

```
code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
tokenizer = model_tokenizer_class.from_pretrained("uclanlp/plbart-base", language_codes="base")
model_inputs = tokenizer([code])
print(tokenizer.decode(model_inputs['input_ids'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
# The code output is: "public void METHOD_1 ( TYPE_1 VAR_1 ) throws .lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
```

### Who can help?

@gunjan

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

```
code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
tokenizer = model_tokenizer_class.from_pretrained("uclanlp/plbart-base", language_codes="base")
model_inputs = tokenizer([code])
print(tokenizer.decode(model_inputs['input_ids'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
# public void METHOD_1 ( TYPE_1 VAR_1 ) throws .lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }
```

### Expected behavior

```
code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
tokenizer = model_tokenizer_class.from_pretrained("uclanlp/plbart-base", language_codes="base")
model_inputs = tokenizer([code])
print(tokenizer.decode(model_inputs['input_ids'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
# public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Special Language Token for PLBART needs to be updated #19505

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Special Language Token for PLBART needs to be updated #19505

Description

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions