Add model arcinstitute state by drbh · Pull Request #39480 · huggingface/transformers

drbh · 2025-07-17T14:55:18Z

This PR adds the Arc Institute State model.

Run embedding model via transformers

git clone https://github.com/huggingface/transformers
cd transformers
git checkout add-model-arcinstitute-state
uv run sanity.py

sanity.py

# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "torch",
#     "transformers"
# ]
#
# [tool.uv.sources]
# transformers = { path = ".", editable = true }
# ///
import torch
from transformers import StateEmbeddingModel


model_name = "arcinstitute/SE-600M"
model = StateEmbeddingModel.from_pretrained(model_name)

torch.manual_seed(0)
input_ids = torch.randn((1, 1, 5120), dtype=torch.float32)
mask = torch.ones((1, 1, 5120), dtype=torch.bool)
mask[:, :, 2560:] = False # simulate half masking
print("Input sum:\t", input_ids.sum())
print("Mask sum:\t", mask.sum())

outputs = model(input_ids, mask)
print("Output sum:\t", outputs["gene_output"].sum())

outputs

Input sum:	 tensor(-38.6611)
Mask sum:	 tensor(2560)
Output sum:	 tensor(-19.6819, grad_fn=<SumBackward0>)

Compare to reference

git clone https://github.com/ArcInstitute/state.git
cd state
curl -OL https://huggingface.co/arcinstitute/SE-600M/resolve/main/se600m_epoch16.ckpt

Next, apply this small patch so the model file can be run directly with a fixed input and compared against the implementation above.

file `compare.patch`

diff --git a/src/state/emb/nn/model.py b/src/state/emb/nn/model.py
index dbbefb3..42167a1 100644
--- a/src/state/emb/nn/model.py
+++ b/src/state/emb/nn/model.py
@@ -23,20 +23,20 @@ from torch.nn import TransformerEncoder, TransformerEncoderLayer, BCEWithLogitsL
 from tqdm.auto import tqdm
 from torch.optim.lr_scheduler import ChainedScheduler, LinearLR, CosineAnnealingLR, ReduceLROnPlateau
 
-from ..data import create_dataloader
-from ..utils import (
+from state.emb.data import create_dataloader
+from state.emb.utils import (
     compute_gene_overlap_cross_pert,
     get_embedding_cfg,
     get_dataset_cfg,
     compute_pearson_delta,
     compute_perturbation_ranking_score,
 )
-from ..eval.emb import cluster_embedding
-from .loss import WassersteinLoss, KLDivergenceLoss, MMDLoss, TabularLoss
+from state.emb.eval.emb import cluster_embedding
+from loss import WassersteinLoss, KLDivergenceLoss, MMDLoss, TabularLoss
 
 
-from .flash_transformer import FlashTransformerEncoderLayer
-from .flash_transformer import FlashTransformerEncoder
+from flash_transformer import FlashTransformerEncoderLayer
+from flash_transformer import FlashTransformerEncoder
 
 
 class SkipBlock(nn.Module):
@@ -196,7 +196,8 @@ class StateEmbeddingModel(L.LightningModule):
             self.dataset_embedder = nn.Linear(output_dim, 10)
 
             # Assume self.cfg.model.num_datasets is set to the number of unique datasets.
-            num_dataset = get_dataset_cfg(self.cfg).num_datasets
+            # num_dataset = get_dataset_cfg(self.cfg).num_datasets
+            num_dataset = 14420 
             self.dataset_encoder = nn.Sequential(
                 nn.Linear(output_dim, d_model),
                 nn.SiLU(),
@@ -686,3 +687,18 @@ class StateEmbeddingModel(L.LightningModule):
             "optimizer": optimizer,
             "lr_scheduler": {"scheduler": scheduler, "monitor": "train_loss", "interval": "step", "frequency": 1},
         }
+
+if __name__ == "__main__":
+    checkpoint = "/Users/drbh/Projects/state/se600m_epoch16.ckpt"
+    model = StateEmbeddingModel.load_from_checkpoint(checkpoint, dropout=0.0, strict=False)
+
+    torch.manual_seed(0)
+
+    input_ids = torch.randn((1, 1, 5120), dtype=torch.float32)
+    mask = torch.ones((1, 1, 5120), dtype=torch.bool)
+    mask[:, :, 2560:] = False
+    print("Input sum:\t", input_ids.sum())
+    print("Mask sum:\t", mask.sum())
+
+    output, embedding, dataset_emb = model(input_ids, mask)
+    print("Output shape:\t", output.sum())

The patch can be applied like this:

# save above as compare.patch
git apply compare.patch

run the model

.venv/bin/python src/state/emb/nn/model.py

output

!!! Using Flash Attention !!!
Input sum: tensor(-38.6611)
Mask sum: tensor(2560)
Output shape: tensor(-19.6819, grad_fn=<SumBackward0>)
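
Matching sums are a quick sanity check, but two different outputs can share the same sum. A stricter parity check compares values element by element. The sketch below is dependency-free; the helper names (`max_abs_diff`, `outputs_match`), the sample values, and the tolerance are all assumptions for illustration. With real tensors, `torch.testing.assert_close` plays the same role.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def outputs_match(a, b, atol=1e-4):
    """True when every element of `a` is within `atol` of the matching element of `b`."""
    return max_abs_diff(a, b) <= atol


# Hypothetical flattened outputs; not values produced by either implementation.
reference = [0.25, -1.5, 3.0]
ported = [0.25, -1.5, 3.0]
permuted = [3.0, 0.25, -1.5]  # same sum as `reference`, different values

print("sums agree:", math.isclose(sum(reference), sum(permuted)))
print("ported matches:", outputs_match(reference, ported))
print("permuted matches:", outputs_match(reference, permuted))
```

The permuted case passes a sum comparison but fails the element-wise one, which is why the stricter check is worth adding once the sums line up.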

FL33TW00D · 2025-07-17T14:59:15Z

Paper for reference! https://www.biorxiv.org/content/10.1101/2025.06.26.661135v2

HuggingFaceDocBuilderDev · 2025-07-17T15:08:31Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions · 2025-07-17T15:37:26Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

FL33TW00D · 2025-07-18T11:20:04Z

@abhinadduri for ref

ArthurZucker · 2025-07-25T08:42:42Z

cc @Cyrilvallez can you have a look?!

Cyrilvallez · 2025-08-06T08:41:40Z

Hey! I'm a bit confused with the PR right now! What model are we adding? If it's arcinstitute, we only want modeling files referring to it! No general names such as StateEmbedding!
But I see that you incorporated modular, which is very good! So let's fix a bit/clarify the model names, fix the consistency issues and then we'll be good for a first review! 🤗

FL33TW00D · 2025-08-06T10:46:28Z

Hey @Cyrilvallez,
Thanks for taking a look!

Arc has created 2 models, StateEmbedding and StateTransition which would be good to add. They are the first group to surpass linear baselines for this problem.

We will clean up the PR before pinging for a proper review!

Cyrilvallez · 2025-08-07T12:57:45Z

Nice, thanks for the explanations @FL33TW00D! Makes sense! Let me know when you believe this is ready for review then 🤗 Just a heads-up that we want the folder/file names in full snake_case, e.g. state_embedding, and all class names prefixed with the CamelCase model name, e.g. StateEmbeddingModule! 👌
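
The naming convention described above can be checked mechanically. The sketch below is only an illustration: `to_snake_case` and `misnamed_classes` are hypothetical helpers, not part of transformers, and the class list is made up for the example.

```python
import re


def to_snake_case(name):
    """Convert a CamelCase model name such as 'StateEmbedding' to 'state_embedding'."""
    # Insert an underscore at each lowercase-or-digit to uppercase boundary, then lowercase.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def misnamed_classes(prefix, class_names):
    """Return the class names that do not start with the CamelCase model prefix."""
    return [name for name in class_names if not name.startswith(prefix)]


# Illustrative class list; the real modeling file may contain different names.
classes = ["StateEmbeddingModel", "StateEmbeddingConfig", "SkipBlock"]

print(to_snake_case("StateEmbedding"))
print(misnamed_classes("StateEmbedding", classes))
```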

abhinadduri · 2025-08-11T18:23:16Z

thanks everyone! we are starting our review now, cc @Rive-001

Rive-001 · 2025-08-11T18:23:52Z

+
+class StateTxConfig(PretrainedConfig):
+    r"""
+    Configuration class for StateTx (State Transformer) model based on PertSetsPerturbationModel.


Low priority: We renamed the model class from PertSetsPerturbationModel to StateTransitionPerturbationModel.

Rive-001 · 2025-08-11T18:26:18Z

+    r"""
+    Configuration class for StateTx (State Transformer) model based on PertSetsPerturbationModel.
+
+    This model uses a bidirectional Llama transformer backbone to process perturbation data.


Low priority: We currently support bi-directional versions of both Llama and GPT-2.

Rive-001 · 2025-08-11T18:46:48Z

+from .configuration_state_tx import LlamaBidirectionalConfig, StateTxConfig
+
+
+class SamplesLoss(nn.Module):


We currently support MSE and MMD loss functions. We use SamplesLoss from the geomloss library; see the SamplesLoss documentation.

https://github.com/ArcInstitute/state/blob/be0006c4556327431bda29b6db1b7b223d9eda8c/src/state/tx/models/state_transition.py#L163-L177

https://github.com/ArcInstitute/state/blob/be0006c4556327431bda29b6db1b7b223d9eda8c/src/state/tx/models/state_transition.py#L20-L34
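
To make the difference between the two loss families concrete, here is a dependency-free sketch. It uses a plain Gaussian-kernel MMD with a biased estimator, which is an assumption chosen for brevity and not the implementation in the library referenced above; the function names, bandwidth, and sample data are all hypothetical.

```python
import math


def mse(pred, target):
    """Mean squared error between paired cells (lists of equal-length vectors)."""
    total = sum((p - t) ** 2 for pv, tv in zip(pred, target) for p, t in zip(pv, tv))
    count = sum(len(pv) for pv in pred)
    return total / count


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from two sets."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sets of vectors."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


predicted = [[0.0, 1.0], [1.0, 0.0]]
shuffled = [[1.0, 0.0], [0.0, 1.0]]  # same set of cells, different order

print("MSE:", mse(predicted, shuffled))
print("MMD^2:", mmd_squared(predicted, shuffled))
```

MSE compares cells pairwise, so reordering the same set changes it, while the set-level MMD stays at (numerically) zero. That property is the usual reason to prefer a distributional loss when predictions and targets are unordered sets of cells.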

Rive-001 · 2025-08-11T19:11:38Z

+        return F.mse_loss(predictions, targets)
+
+
+class LatentToGeneDecoder(nn.Module):


We currently support user input for the number of layers in the decoder, the dimensions of those layers, and optional residual connections between layers.

https://github.com/ArcInstitute/state/blob/be0006c4556327431bda29b6db1b7b223d9eda8c/src/state/tx/models/base.py#L15-L116

Rive-001 · 2025-08-11T20:59:49Z

+    def __init__(self, config: StateTxConfig):
+        super().__init__()
+        self.decoder = nn.Sequential(
+            nn.Linear(config.gene_dim, 1024, bias=True),


This doesn't seem correct, as the input dimension would be a latent dimension and not the gene dimensions. The gene dimension is the output dimension.

https://github.com/ArcInstitute/state/blob/be0006c4556327431bda29b6db1b7b223d9eda8c/src/state/tx/models/base.py#L47
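
The two decoder comments above (configurable layers, and latent input versus gene output) can be sketched as a small shape-planning helper. This is a dependency-free illustration under stated assumptions: `decoder_layer_dims` and `residual_eligible` are hypothetical names, and the dimensions are made up. In the model each pair would correspond to one `nn.Linear(in_features, out_features)`.

```python
def decoder_layer_dims(latent_dim, hidden_dims, gene_dim):
    """Return (in_features, out_features) pairs for a latent -> gene MLP decoder."""
    if latent_dim <= 0 or gene_dim <= 0 or any(h <= 0 for h in hidden_dims):
        raise ValueError("all dimensions must be positive")
    # The first layer consumes the latent width; the last layer emits the gene width.
    widths = [latent_dim, *hidden_dims, gene_dim]
    return list(zip(widths[:-1], widths[1:]))


def residual_eligible(layer_dims):
    """Indices of layers whose input and output widths match, so a skip needs no projection."""
    return [i for i, (d_in, d_out) in enumerate(layer_dims) if d_in == d_out]


# Hypothetical sizes, chosen only to show the shape flow.
dims = decoder_layer_dims(latent_dim=128, hidden_dims=[1024, 1024, 512], gene_dim=2000)

print(dims)
print(residual_eligible(dims))
```

Deriving every width from the arguments keeps the first layer tied to the latent dimension, which is the mismatch the comment above points out.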

Rive-001 · 2025-08-11T23:19:56Z

+        #     batch_embeds = self.batch_encoder(batch_ids)  # (batch_size, hidden_dim)
+        #     # Add batch embedding to each position
+        #     combined_input = combined_input + batch_embeds.unsqueeze(1)
+        batch_embeddings = self.batch_encoder(torch.zeros([512]).long()).unsqueeze(1)


Hardcoding batch embeddings to 0s might not be correct.
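
One possible shape for a fix, sketched without dependencies: use the caller's batch indices when they are given, and fall back to index 0 sized from the actual input rather than a fixed length of 512. `resolve_batch_ids` and its parameters are hypothetical; in the model the result would be a long tensor passed to `self.batch_encoder`.

```python
def resolve_batch_ids(batch_ids, batch_size, num_batches):
    """Return one batch index per sample, defaulting to index 0 only when none are given."""
    if batch_ids is None:
        # Fallback length follows the real input, not a hard-coded constant.
        return [0] * batch_size
    if len(batch_ids) != batch_size:
        raise ValueError(f"expected {batch_size} batch ids, got {len(batch_ids)}")
    if any(not 0 <= i < num_batches for i in batch_ids):
        raise ValueError(f"batch ids must lie in [0, {num_batches})")
    return list(batch_ids)


print(resolve_batch_ids(None, batch_size=3, num_batches=4))
print(resolve_batch_ids([2, 0, 3], batch_size=3, num_batches=4))
```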

Rive-001 · 2025-08-13T21:33:32Z

+
+        # Binary classification decoder
+        # binary_input_dim = config.output_dim + config.d_model + config.z_dim_rd + config.z_dim_ds
+        binary_input_dim = 4107


We might want the dimensions of this decoder to be based on the config values.
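
The commented-out line in the hunk above already gives the formula, so the width could be derived instead of hard-coded. A minimal sketch follows; the `BinaryDecoderDims` class is hypothetical, and the four values are illustrative numbers chosen so that they sum to the hard-coded 4107, not values confirmed from the checkpoint's config.

```python
from dataclasses import dataclass


@dataclass
class BinaryDecoderDims:
    """Config fields that feed the binary decoder's input width."""

    output_dim: int
    d_model: int
    z_dim_rd: int
    z_dim_ds: int

    @property
    def binary_input_dim(self):
        # Same sum as the commented-out line in the diff.
        return self.output_dim + self.d_model + self.z_dim_rd + self.z_dim_ds


# Illustrative values only (assumption): picked to reproduce 4107.
dims = BinaryDecoderDims(output_dim=2048, d_model=2048, z_dim_rd=1, z_dim_ds=10)
print(dims.binary_input_dim)  # 4107
```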

drbh added 4 commits July 17, 2025 10:51

feat: new modular model setup for state embed

bd30a59

feat: correct logit output

3629402

feat: update script to run model

01bd941

fix: state transition model based on st parse

ced0995

drbh requested review from ArthurZucker, FL33TW00D and cyrilzakka July 17, 2025 15:10

drbh added 2 commits July 17, 2025 11:36

fix: adjuist model naming

74892c1

fix: remove dev test script

c304452

ArthurZucker added the New model label Jul 17, 2025

Rive-001 reviewed Aug 11, 2025

fix: improve naming

6fc3736

Rive-001 reviewed Aug 11, 2025

fix: improve latent gene decoder and config

0dfbdd2

ArthurZucker requested review from Cyrilvallez and removed request for ArthurZucker August 13, 2025 09:02

Rive-001 reviewed Aug 13, 2025

evalstate mentioned this pull request Apr 29, 2026

Cumulative feature and defect updates from recent Transformers PRs evalstate/transformers#42

Open

7 participants