fix: pass DOM tree features to model and fix preprocessing bugs #9

Open
LahadMbacke wants to merge 4 commits into ilyalasy:main from LahadMbacke:fix/dom-features-and-bugs
LahadMbacke commented Jun 2, 2026

DataCollator Does Not Pass DOM Features to the Model

torch_call, tf_call, and numpy_call in DataCollatorForDOMNodeMask returned only {input_ids, labels}, silently dropping the five DOM tree positional features:

  • node_ids
  • parent_node_ids
  • sibling_node_ids
  • depth_ids
  • tag_ids

Impact

In DOMLMEmbeddings.forward(), when these keys are missing, the model falls back to padding values for all tree-position embeddings (P0–P4). As a result, TreePositionEmbeddings never received the actual DOM structure during training, and the model was effectively trained as a standard RoBERTa model with no awareness of DOM hierarchy or structural relationships.
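
To make the failure mode concrete, here is a minimal pure-Python sketch of the fallback described above. It is not the actual DOMLMEmbeddings code: `resolve_tree_features`, `drop_dom_keys`, and the padding index `PAD = 0` are hypothetical names and values chosen for illustration. It shows that two pages with different DOM depth become indistinguishable once the collator drops the DOM keys.

```python
PAD = 0  # hypothetical padding index, assumed for illustration only
TREE_KEYS = ["node_ids", "parent_node_ids", "sibling_node_ids", "depth_ids", "tag_ids"]


def resolve_tree_features(batch, seq_len):
    """Return each tree feature, or an all-padding row when the key is absent."""
    return {key: batch.get(key, [PAD] * seq_len) for key in TREE_KEYS}


def drop_dom_keys(example):
    """Mimic the old collator output: only input_ids survive."""
    return {"input_ids": example["input_ids"]}


# Same tokens, different DOM structure.
flat_page = {"input_ids": [10, 11, 12], "depth_ids": [1, 1, 1]}
nested_page = {"input_ids": [10, 11, 12], "depth_ids": [1, 2, 3]}

seen_flat = resolve_tree_features(drop_dom_keys(flat_page), seq_len=3)
seen_nested = resolve_tree_features(drop_dom_keys(nested_page), seq_len=3)

# With the DOM keys dropped, both pages look identical to the embeddings.
assert seen_flat == seen_nested
```

When the keys are kept, `resolve_tree_features(nested_page, seq_len=3)` returns the real depth values for `depth_ids`, so the structural signal is available again.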

Fix

After computing inputs and labels, collect each DOM feature from the batch and collate it the same way as input_ids:

# Keep the DOM tree features alongside input_ids and labels.
batch = {"input_ids": inputs, "labels": labels}
for key in ["node_ids", "parent_node_ids", "sibling_node_ids", "depth_ids", "tag_ids"]:
    # Pad each feature the same way input_ids is padded.
    batch[key] = _torch_collate_batch(
        [e[key] for e in examples], self.tokenizer,
        pad_to_multiple_of=self.pad_to_multiple_of
    )
return batch
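
A regression check along the following lines could guard against the keys being dropped again. This is a sketch only: it uses plain Python lists instead of tensors, and `pad_batch`, `collate_with_dom_features`, and the `pad_value=0` default are hypothetical stand-ins, not the collator or the `_torch_collate_batch` helper from this PR.

```python
DOM_KEYS = ["node_ids", "parent_node_ids", "sibling_node_ids", "depth_ids", "tag_ids"]


def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


def collate_with_dom_features(examples, pad_value=0):
    """Hypothetical stand-in for the collator: keeps input_ids and all DOM keys."""
    batch = {"input_ids": pad_batch([e["input_ids"] for e in examples], pad_value)}
    for key in DOM_KEYS:
        batch[key] = pad_batch([e[key] for e in examples], pad_value)
    return batch


examples = [
    {"input_ids": [5, 6, 7], **{k: [1, 2, 3] for k in DOM_KEYS}},
    {"input_ids": [8, 9], **{k: [4, 5] for k in DOM_KEYS}},
]
batch = collate_with_dom_features(examples)

# Every DOM feature must survive collation and share the padded length.
missing = [k for k in DOM_KEYS if k not in batch]
assert not missing
assert all(len(row) == 3 for key in batch for row in batch[key])
```

In the real collator the padding value is whatever `_torch_collate_batch` uses, so it may be worth confirming that this value matches the padding index each DOM embedding expects.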

LahadMbacke force-pushed the fix/dom-features-and-bugs branch 3 times, most recently from 83a01f6 to 1ca73fd, on June 2, 2026 09:55
torch_call/tf_call/numpy_call were returning only {input_ids, labels},
silently dropping node_ids, parent_node_ids, sibling_node_ids, depth_ids
and tag_ids. DOMLMEmbeddings.forward() falls back to padding when these
are absent, so TreePositionEmbeddings (P0-P4) never trained — model was
effectively plain RoBERTa.
LahadMbacke force-pushed the fix/dom-features-and-bugs branch from 1ca73fd to 1739a75 on June 2, 2026 09:56