fix: pass DOM tree features to model and fix preprocessing bugs #9
Open
LahadMbacke wants to merge 4 commits into
83a01f6 to
1ca73fd
torch_call/tf_call/numpy_call were returning only {input_ids, labels},
silently dropping node_ids, parent_node_ids, sibling_node_ids, depth_ids
and tag_ids. DOMLMEmbeddings.forward() falls back to padding when these
are absent, so TreePositionEmbeddings (P0-P4) never trained — model was
effectively plain RoBERTa.
1ca73fd to
1739a75
DataCollator Does Not Pass DOM Features to the Model
torch_call, tf_call and numpy_call in DataCollatorForDOMNodeMask were returning only {input_ids, labels}, silently dropping the five DOM tree positional features:
- node_ids
- parent_node_ids
- sibling_node_ids
- depth_ids
- tag_ids
Impact
In DOMLMEmbeddings.forward(), when these keys are missing, the model falls back to padding values for all tree-position embeddings (P0–P4). As a result, the TreePositionEmbeddings never received the actual DOM structure during training. The model was effectively trained as a standard RoBERTa model, without any awareness of DOM hierarchy or structural relationships.
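The fallback described above can be illustrated with a small, dependency-free sketch. It is not the code of DOMLMEmbeddings.forward(); the helper name resolve_feature and the padding index 0 are assumptions made only for illustration. It shows why a batch without the DOM keys gives every token the same padding id, so the tree-position embeddings see no structure.

```python
from typing import Dict, List

PAD_ID = 0  # assumed padding index; the actual value is not stated in the PR


def resolve_feature(batch: Dict[str, List[int]], key: str, seq_len: int) -> List[int]:
    """Hypothetical helper: use the feature if the collator supplied it,
    otherwise fall back to a row made entirely of the padding id."""
    values = batch.get(key)
    if values is None:
        return [PAD_ID] * seq_len
    return values


# What the collator returned before the fix (labels omitted for brevity).
collated = {"input_ids": [101, 7, 8, 102]}

# depth_ids is missing, so every position receives the same padding id.
depth_ids = resolve_feature(collated, "depth_ids", seq_len=4)
```

With every position mapped to the same id, the corresponding embedding lookup returns one identical vector per token, which carries no information about the DOM hierarchy.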
Fix
After computing inputs and labels, collect each DOM feature from the batch and collate it in the same way as input_ids:
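As a rough illustration of this step (not the PR's actual diff), the sketch below collates the five DOM features alongside input_ids using plain Python lists. The function names pad_to_longest and collate_with_dom_features and the padding value 0 are assumptions; the real collator would produce framework tensors and would also apply masking and build labels, which are left out here.

```python
from typing import Dict, List, Sequence

# Feature names as listed in the PR description.
DOM_FEATURE_KEYS = (
    "node_ids",
    "parent_node_ids",
    "sibling_node_ids",
    "depth_ids",
    "tag_ids",
)


def pad_to_longest(rows: Sequence[Sequence[int]], pad_value: int = 0) -> List[List[int]]:
    """Right-pad every row to the length of the longest row in the batch."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [pad_value] * (width - len(row)) for row in rows]


def collate_with_dom_features(
    examples: Sequence[Dict[str, List[int]]], pad_value: int = 0
) -> Dict[str, List[List[int]]]:
    """Hypothetical collate step: pad input_ids, then pad each DOM feature
    the same way so it reaches the model instead of being dropped."""
    batch = {"input_ids": pad_to_longest([ex["input_ids"] for ex in examples], pad_value)}
    for key in DOM_FEATURE_KEYS:
        # Only forward a feature when every example in the batch carries it.
        if all(key in ex for ex in examples):
            batch[key] = pad_to_longest([ex[key] for ex in examples], pad_value)
    return batch


examples = [
    {"input_ids": [5, 6, 7], "node_ids": [1, 1, 2], "parent_node_ids": [0, 0, 1],
     "sibling_node_ids": [0, 0, 1], "depth_ids": [1, 1, 2], "tag_ids": [3, 3, 4]},
    {"input_ids": [8, 9], "node_ids": [1, 2], "parent_node_ids": [0, 1],
     "sibling_node_ids": [0, 0], "depth_ids": [1, 2], "tag_ids": [3, 5]},
]
batch = collate_with_dom_features(examples)
```

Padding each feature to the same width as input_ids keeps the tree-position ids aligned token by token, which is what the embedding layer needs in order to look them up per position.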