Add Molmo2 by SangbumChoi · Pull Request #43451 · huggingface/transformers

SangbumChoi · 2026-01-23T14:47:55Z

What does this PR do?

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Adds AllenAI Molmo2 multimodal VLM to transformers, supporting:
- Molmo2ForConditionalGeneration (image+video+text → text)
- Molmo2TextModel / Molmo2TextForCausalLM (text-only)
- Molmo2ImageProcessor and Molmo2VideoProcessor
- Molmo2Processor

Key implementation details:
- Uses is_first_iteration (v5 API) for prepare_inputs_for_generation
- Custom Molmo2Embedding with embedding + new_embedding parameters
- Vision backbone with pooling adapter and multi-layer ViT features
- Dynamic full cache support for generation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@strict

…odel_prefix

- Replace einops.rearrange with native numpy reshape+transpose+reshape
- Add @strict decorator to all 4 config classes (Molmo2VitConfig, Molmo2AdapterConfig, Molmo2TextConfig, Molmo2Config) to satisfy TRF010
- Set Molmo2Model.base_model_prefix = "model" (was empty, violating TRF002)
- Fix image_mean/image_std mutable shared list (copy constants on init)
- Fix test_image_processing: use image_processing_class instead of image_processor_list; skip CHW torch and 4-channel unsupported tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
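The image_mean/image_std fix named in this commit is an instance of the shared-mutable-default problem. The following is a minimal sketch of the "copy constants on init" pattern, not the actual Molmo2 code: `DEFAULT_IMAGE_MEAN`, `DEFAULT_IMAGE_STD`, their values, and `ToyImageProcessor` are hypothetical stand-ins.

```python
# Hypothetical module-level defaults (names and values are assumptions for illustration).
DEFAULT_IMAGE_MEAN = [0.5, 0.5, 0.5]
DEFAULT_IMAGE_STD = [0.5, 0.5, 0.5]


class ToyImageProcessor:
    """Stand-in processor that copies the shared constants instead of aliasing them."""

    def __init__(self, image_mean=None, image_std=None):
        # list(...) creates a fresh list, so mutating one instance's attribute
        # cannot leak into the module constants or into other instances.
        self.image_mean = list(DEFAULT_IMAGE_MEAN if image_mean is None else image_mean)
        self.image_std = list(DEFAULT_IMAGE_STD if image_std is None else image_std)


first = ToyImageProcessor()
second = ToyImageProcessor()
first.image_mean[0] = 0.0  # only affects `first`
```

Without the copy, both instances would hold a reference to the same list object, and the in-place assignment on `first` would silently change `second` and the module constant as well.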

- Re-sort _toctree.yml to place Molmo2 after mllama alphabetically
- Add None guard in test_video_processor_from_dict_with_kwargs to skip when fast_video_processing_class is not defined

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Molmo2TextModel is an internal sub-component used by Molmo2Model and Molmo2ForConditionalGeneration and is tested implicitly through those. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

requests is not part of the standard library and caused ImportError in minimal environments (e.g. HuggingFace Jobs). Use urllib.request instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
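A minimal sketch of fetching raw bytes with only the standard library, in the spirit of this commit. `fetch_bytes` and its `timeout` default are hypothetical; the actual helper in the video processor is not shown in this thread. The usage line reads a `data:` URL so the example runs without network access.

```python
import urllib.request


def fetch_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Download `url` and return the response body as bytes (stdlib only)."""
    # urlopen returns a context manager, so the connection is closed on exit.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# "aGVsbG8=" is the base64 encoding of b"hello".
payload = fetch_bytes("data:text/plain;base64,aGVsbG8=")
```

For an http(s) video URL the call is the same; only the URL changes.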

Molmo2's processor has several behaviors that are incompatible with the default ProcessorTesterMixin assumptions:
- Chat template enforces strict user/assistant alternation (no system role)
- Processor inserts BOS token, shifting sequence length by 1
- Image processor patchifies output, so rescale_factor passthrough fails
- Video processor requires FPS metadata not provided by base tests
- Hub processor_config.json contains auto_map not preserved in save/load

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add @auto_docstring(checkpoint="allenai/Molmo2-8B") decorator to Molmo2TextConfig and Molmo2Config with custom_args for documenting non-standard parameters. This fixes check_config_docstrings CI check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… date Add parameter docstrings to Molmo2TextConfig and Molmo2Config __init__ methods so @strict-wrapped classes pass config docstring CI checks. Update model doc date to 2026-03-28. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move top-level `import torch` and `import torchvision.transforms` behind `is_torch_available()` / `is_torchvision_available()` guards in both image and video processors to prevent ModuleNotFoundError when torchvision is not installed. Also skip test_kwargs_overrides_default_image_processor_kwargs since Molmo2's patchifying image processor doesn't support rescale_factor passthrough. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
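The guard pattern described here can be sketched as follows. In the PR the checks are the library's own `is_torch_available()` / `is_torchvision_available()`; the `is_package_available` helper and the `to_tensor` function below are hypothetical stand-ins so the sketch runs without transformers installed.

```python
import importlib.util


def is_package_available(name: str) -> bool:
    """Return True if the top-level package `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# The optional dependency is only imported when it is actually installed,
# so importing this module never raises ModuleNotFoundError.
if is_package_available("torch"):
    import torch


def to_tensor(values):
    """Fail at call time, with a clear message, rather than at import time."""
    if not is_package_available("torch"):
        raise ImportError("to_tensor requires torch to be installed")
    return torch.tensor(values)
```

The point of the design is that the failure moves from module import to the first call that really needs the optional package.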

Convert all absolute imports (from transformers.xxx) to relative imports (from ...xxx) in image_processing, video_processing, and processing modules to match the convention used by all other in-library models. Remove register_for_auto_class() calls which are only needed for custom hub models and were causing dynamic_module_utils to incorrectly scan local files for relative imports during save_pretrained. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…n_available The processor's top-level imports from image_processing_molmo2 and video_processing_molmo2 pull in PILImageResampling which requires PIL. Guard these imports with is_vision_available() so `from transformers import *` works when only torch is installed (no PIL/torchvision). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…L imports Move Molmo2ImagesKwargs and Molmo2VideosKwargs definitions directly into processing_molmo2.py instead of importing them from image/video processor modules which require PIL. Also remove Molmo2ImageProcessor/VideoProcessor type hints from __init__ to avoid NameError when vision is unavailable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

SangbumChoi · 2026-03-29T01:49:56Z

@molbap Hi, I am still working on this: I still need to build an example visualizer, and most of the code was generated by Claude Code. However, you can already start with a brief, high-level code review! cc @merveenoyan

Add integration tests for Molmo2-8B covering:
- Image generation with exact expected text verification
- Video QA (penguin identification)
- Video pointing (coordinate output)
- Multi-image comparison

All expected values derived from actual model inference on A10G.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zucchini-nlp

Hey @SangbumChoi

Great model to add to Transformers. After reviewing, I see that using modular would be much better, since a lot of parts are copy-pasted from different models. I left comments on each class about where it can be copied from. Apart from that, there are a few places where we need to clean up and align the API with the rest of the VLMs for consistency.

If you have questions, ping me on Slack. I will unsubscribe from this PR so I don't get a notification for each commit, so when you want another review, ping me again with an @ mention.


…odel

Both were vestigial leftovers from the CLIP/SigLIP vision template:
- self.scale (hidden_size**-0.5): Molmo2GQAAttention computes its own head_dim scaling, so this model-level attention scale was never read.
- self.num_prefix_tokens (= 0): counts CLS/register tokens in CLIP-style ViTs; Molmo2 has no class embedding, so it was hardcoded to 0 and never read.

Verified zero reads anywhere. self.positional_embedding (the only real entry under the comment) is kept.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014Z1uph57N8WrnxSL69bYL4

@molbap

Per @molbap's naming review: spell out the abbreviated variables in the image tiling path while keeping the two distinct quantities clearly separate:
- original_h/original_w -> original_height/original_width (the input image size)
- src_h/src_w -> src_height/src_width (the tiling-aligned resize target)

Naming-only change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014Z1uph57N8WrnxSL69bYL4

Molmo2Model inherits EmbeddingAccessMixin (via Molmo2PreTrainedModel -> LlamaPreTrainedModel -> PreTrainedModel), so its manual get_input_embeddings / set_input_embeddings overrides were redundant: the mixin's path-4 resolution (self.language_model.get_input_embeddings()) already returns self.language_model.wte because Molmo2TextModel sets _input_embed_layer = 'wte'.
- Remove the redundant Molmo2Model.get_input_embeddings / set_input_embeddings.
- forward() now uses self.get_input_embeddings()(input_ids) instead of hardcoding self.language_model.wte(input_ids).

Molmo2ForConditionalGeneration continues to resolve via the mixin's path-5 (self.model.get_input_embeddings()).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014Z1uph57N8WrnxSL69bYL4

@molbap

…_after flag

Per @molbap's review: collapse Molmo2PostNormDecoderLayer into a single Molmo2DecoderLayer that branches on config.norm_after (post-norm normalizes each sublayer's output; pre-norm normalizes its input) -- the same flag-on-one-class style used elsewhere, instead of a subclass that only reorders the norm.

Also fixes a real bug @molbap flagged: the old post-norm forward dropped **kwargs in the self_attn call (the pre-norm path passed them), so attention kwargs never reached the attention module on post-norm checkpoints. The merged forward passes **kwargs in both cases.

Removes Molmo2PostNormDecoderLayer from _no_split_modules and the decoder_layer = ... if config.norm_after else ... dispatch in Molmo2TextModel.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014Z1uph57N8WrnxSL69bYL4

@molbap

Per @molbap: the skip was only needed for the base test's assertIsInstance(get_input_embeddings(), nn.Embedding) -- Molmo2 uses a custom Molmo2Embedding (concatenated base + extra-vocab tables), not nn.Embedding. Override the test to relax that one type assertion to nn.Module while still verifying the get/set round-trip (the behavior that actually matters). Confirmed passing on a tiny model. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014Z1uph57N8WrnxSL69bYL4

SangbumChoi · 2026-06-25T14:54:23Z

I assume #43451 (review) is because the resize logic is more efficient for images with a similar aspect ratio or shape? Is there a blog post on this?

Addresses two of guarin's image-processor review comments (the image_num_crops device fix was done separately by the maintainer):

1. do_rescale / float images: resize_and_normalize_image raised on any non-uint8 input and gated rescale on uint8, so float32 images were rejected and never rescaled. Route rescale/normalize through the base TorchvisionBackend rescale_and_normalize, which honors do_rescale/do_normalize for any dtype (we trust the user's do_rescale for their float scale) and builds the mean/std on the input device. The uint8 path is numerically unchanged.
2. device placement: resize_idx / patch_idx / image_grid were created on CPU while pixel_values live on the input image's device. Create them on image_chw.device so the returned tensors are device-consistent.

The group-images-by-shape batching from the same review is a larger refactor, handled separately.

make fix-repo side-effect: the 'remove unnecessary slicing' change was applied to modular_molmo2.py but modeling_molmo2.py still had enumerate(self.blocks[: self.config.num_hidden_layers]). Regenerated so the decoder loop matches the modular (self.blocks already holds exactly num_hidden_layers layers).

Implements guarin's optimization: flatten the images, group_images_by_shape, process each unique shape as one [N,C,H,W] batch (tiling + all patch/pooling indices are shape-dependent only, computed once; resize/unfold vectorized over N), then reorder_images back to the original order. Semantically identical to the former per-image loop (when grouping is disabled each group is N=1). build_resized_image / _build_overlapping_crops are now batch-native; the video _build_frame_patches passes a 1-frame batch.
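The group, process, reorder flow described in this commit can be sketched without tensors. The helpers below are simplified stand-ins written for illustration; they are not the library's `group_images_by_shape` / `reorder_images`, and `process_group` with its patch size of 2 is a hypothetical placeholder for the tiling and index computation.

```python
from collections import defaultdict


def group_by_shape(images):
    """Group images by shape and remember where each one came from."""
    groups = defaultdict(list)
    index_map = []  # original position -> (shape, position inside that group)
    for image in images:
        shape = image["shape"]
        index_map.append((shape, len(groups[shape])))
        groups[shape].append(image)
    return dict(groups), index_map


def process_group(shape, batch):
    """Shape-dependent work is computed once per group, then applied to the whole batch."""
    height, width = shape
    num_patches = (height // 2) * (width // 2)  # hypothetical patch size of 2
    return [{"name": image["name"], "num_patches": num_patches} for image in batch]


def reorder(processed_groups, index_map):
    """Restore the original input order from the per-shape results."""
    return [processed_groups[shape][position] for shape, position in index_map]


images = [
    {"name": "a", "shape": (4, 4)},
    {"name": "b", "shape": (2, 6)},
    {"name": "c", "shape": (4, 4)},
]
groups, index_map = group_by_shape(images)
processed = {shape: process_group(shape, batch) for shape, batch in groups.items()}
results = reorder(processed, index_map)
```

Here "a" and "c" share one group, so the shape-dependent computation runs twice instead of three times, and `results` still comes back in the order a, b, c.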

`get_placeholder_mask` returns a [batch, seq, 1] mask, but Molmo2 adds the image features onto the placeholder-token embeddings via `inputs_embeds[special_image_mask] + image_features`. Direct boolean indexing does not broadcast, so a [batch, seq, 1] mask cannot reach the hidden dim and raises `IndexError: shape of the mask [.., .., 1] at index 2 does not match the indexed tensor [.., .., hidden]` on every image/video input (text-only inputs skip this path, which is why they worked). Expand the mask to the hidden dim before the masked_scatter, restoring the residual-add semantics of the original Molmo2 implementation. Fixes 29 of the fast Molmo2ModelTest failures (test_model, save/load, all generate tests).
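A small reproduction of the failure and the fix, assuming PyTorch is installed. The tensor sizes, `image_token_id = 9`, and the variable names are made up for illustration; only the mechanism (expand the `[batch, seq, 1]` mask to the hidden dim before indexing and `masked_scatter`) follows the commit message.

```python
import torch

hidden = 3
inputs_embeds = torch.arange(12.0).view(1, 4, hidden)  # [batch=1, seq=4, hidden=3]
input_ids = torch.tensor([[5, 9, 9, 7]])
image_token_id = 9                                      # hypothetical placeholder id
image_features = torch.ones(2, hidden)                  # one row per placeholder token

mask = (input_ids == image_token_id).unsqueeze(-1)      # [1, 4, 1]

# Direct boolean indexing does not broadcast the trailing size-1 dim.
try:
    inputs_embeds[mask]
    raised = False
except IndexError:
    raised = True

# Fix: expand the mask to the hidden dim, then add and scatter back.
expanded = mask.expand_as(inputs_embeds)                # [1, 4, 3]
summed = inputs_embeds[expanded].view(-1, hidden) + image_features
merged = inputs_embeds.masked_scatter(expanded, summed)
```

After the merge, the two placeholder positions hold their original embedding plus the image feature, and the text positions are unchanged.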

SangbumChoi · 2026-06-28T15:25:05Z

@molbap @guarin The PR is about 70% ready for review (I also attached a demo in chat). It might be done this week.

- test_tied_weights_keys: set `_tied_weights_keys = None` (Molmo2 ties no weights); an empty list broke the dict-based tied-weights handling. Removes the skip.
- test_model_outputs_equivalence: un-skip; the old "shape mismatch" was the image-merge bug (placeholder mask not expanded), already fixed.
- test_generate_from_inputs_embeds: allow image features to merge into a provided `inputs_embeds` instead of forbidding it, so the multimodal greedy path and a new text-only override (greedy + beam) run. Multimodal beam search stays skipped (it would need the flat-concatenated crops/pooling offsets expanded by beam width).
- test_resize_tokens_embeddings / _untied: keep skipped but document the real reason (Molmo2Embedding's two-table layout with fixed special-token ids makes standard resize ill-defined), rather than the vague "custom embedding" note.

Fast Molmo2ModelTest: 126 passed, 119 skipped (3 pre-existing assisted-decoding failures are unrelated).

`token_type_ids_mask_function` only clamped `kv_idx` against the length of `mm_token_type_ids` (which covers just the original prompt), but indexed `token_type_ids[batch_idx, q_idx]` directly. When generation verifies several new tokens in one forward (e.g. assisted/speculative decoding), the query length exceeds the prompt length and `q_idx` runs out of bounds: `IndexError: index N is out of bounds for dimension 1 with size N`. Newly generated positions are always text, never image patches, so clamp the `q_idx` lookup the same way as `kv_idx` and treat out-of-range positions as token-type 0. Fixes test_assisted_decoding_sample and test_assisted_decoding_matches_greedy_search. Fast Molmo2ModelTest is now fully green (129 passed, 119 skipped).
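The clamping fix can be sketched with plain lists. This is a simplified illustration, not the library function: `make_token_type_mask` is a hypothetical name, and the rule that a query/key pair is allowed when both positions are image tokens (type 1) is an assumption made so the mask returns something concrete.

```python
def make_token_type_mask(token_type_ids):
    """Build a mask function that tolerates positions beyond the original prompt."""
    prompt_len = len(token_type_ids[0])

    def token_type_at(batch_idx, idx):
        # Clamp the lookup so it never goes out of bounds, then treat any
        # position past the prompt as text (token type 0).
        safe_idx = min(idx, prompt_len - 1)
        value = token_type_ids[batch_idx][safe_idx]
        return value if idx < prompt_len else 0

    def inner_mask(batch_idx, head_idx, q_idx, kv_idx):
        # Both the query and the key lookup go through the same clamped helper.
        q_type = token_type_at(batch_idx, q_idx)
        kv_type = token_type_at(batch_idx, kv_idx)
        return q_type == 1 and kv_type == 1

    return inner_mask


mask_fn = make_token_type_mask([[0, 1, 1, 0]])  # prompt of length 4
inside_prompt = mask_fn(0, 0, 1, 2)             # both positions are image tokens
beyond_prompt = mask_fn(0, 0, 5, 2)             # q_idx past the prompt: no IndexError
```

Indexing `token_type_ids[batch_idx][q_idx]` directly would raise for `q_idx = 5`, which is the situation that arises when several new tokens are verified in one forward pass.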

SangbumChoi · 2026-06-29T23:41:34Z

molmo2_finetune.ipynb
molmo2_demo_(5).ipynb

SangbumChoi · 2026-07-02T13:27:38Z

molmo2_finetune.ipynb

github-actions · 2026-07-02T13:29:11Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, molmo2

github-actions · 2026-07-02T13:41:11Z

CI recap

Dashboard: View test results in Grafana
Latest run: 28376151610:2
Result: failure | Jobs: 13 | Tests: 27,702 | Failures: 1 | Duration: 1h 37m

merveenoyan requested a review from molbap February 27, 2026 05:55

SangbumChoi force-pushed the molmo2 branch from b15107c to 3fee343 Compare March 26, 2026 23:28

SangbumChoi and others added 14 commits March 27, 2026 08:56

Merge branch 'main' into molmo2

bc04776

fix(molmo2): add Molmo2TextModel to IGNORE_NON_TESTED

e38b0a3


fix(molmo2): replace requests with stdlib urllib in video processor

cc06cbe


Merge branch 'main' into molmo2

f000173

Merge branch 'main' into molmo2

de8c268

zucchini-nlp self-requested a review March 30, 2026 15:23

zucchini-nlp reviewed Apr 2, 2026

View reviewed changes

SangbumChoi added 5 commits April 4, 2026 09:13

add comments fix

5d9811f

remove safe loading

3d6ec36

fix configuration with large refactor

2604a49

fix tests

e257e17

make formalize to standard way

d75fdbe

SangbumChoi force-pushed the molmo2 branch from 77a8c8c to 99c8582 Compare April 4, 2026 07:47

Merge branch 'main' into molmo2

91b6a78

SangbumChoi force-pushed the molmo2 branch from 99c8582 to 91b6a78 Compare April 4, 2026 07:51

claude and others added 10 commits June 23, 2026 16:08

use terinity

a69b0ad

remove unused function

c6dbec6

make style && make repo-fix

32957ba

make style

a37b6d4

Merge branch 'main' into molmo2

a4b300d

SangbumChoi and others added 9 commits June 26, 2026 00:04

change image_num_crop to have same tensor device

7a25c5e

remove unnecessary slicing

5c44c10

Merge branch 'main' into molmo2

945072a

Merge branch 'main' into molmo2

2ed208a

fix auto_check.py

e04e7ca

SangbumChoi requested review from guarin and molbap June 28, 2026 15:24

claude added 2 commits June 29, 2026 13:19

Merge branch 'main' into molmo2

8d83a8a


8 participants
