System Info
The FAIRSEQ_LANGUAGE_CODES in PLBartTokenizer here need to be as follows.
FAIRSEQ_LANGUAGE_CODES = {
"base": ["__java__", "__python__", "__en_XX__"],
"multi": ["__java__", "__python__", "__en_XX__", "__javascript__", "__php__", "__ruby__", "__go__"],
}
The current PLBartTokenizer treats java as a special token, and thus it removes the token when decoding is performed. An example is given below.
code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
tokenizer = model_tokenizer_class.from_pretrained("uclanlp/plbart-base", language_codes="base")
model_inputs = tokenizer([code])
print(tokenizer.decode(model_inputs['input_ids'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
# The code output is: "public void METHOD_1 ( TYPE_1 VAR_1 ) throws .lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
Who can help?
@gunjan
Information
Tasks
Reproduction
code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
tokenizer = model_tokenizer_class.from_pretrained("uclanlp/plbart-base", language_codes="base")
model_inputs = tokenizer([code])
print(tokenizer.decode(model_inputs['input_ids'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
# public void METHOD_1 ( TYPE_1 VAR_1 ) throws .lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }
Expected behavior
code = "public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }"
tokenizer = model_tokenizer_class.from_pretrained("uclanlp/plbart-base", language_codes="base")
model_inputs = tokenizer([code])
print(tokenizer.decode(model_inputs['input_ids'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
# public void METHOD_1 ( TYPE_1 VAR_1 ) throws java.lang.Exception { super . METHOD_1 ( VAR_1 ) ; METHOD_2 ( VAR_1 ) ; }
System Info
The
FAIRSEQ_LANGUAGE_CODESin PLBartTokenizer here need to be as follows.The current PLBartTokenizer treats
javaas a special token, and thus it removes the token when decoding is performed. An example is given below.Who can help?
@gunjan
Information
Tasks
examplesfolder (such as GLUE/SQuAD, ...)Reproduction
Expected behavior