A research-grade implementation of a compact VLA model (~400M parameters) that takes camera images and natural language instructions as input and outputs smooth 7-DoF robotic action sequences via diffusion-based decoding. Inspired by SmolVLA and pi-zero.
- Compact Architecture: ~400M params (vs. 7B for OpenVLA, 55B for RT-2) -- deployable on consumer GPUs and edge devices
- Diffusion Action Decoding: DDPM/DDIM-based decoder generates smooth 16-step action chunks instead of single-step predictions
- Vision-Language Fusion: Bidirectional cross-attention grounds language instructions in visual observations
- 7-DoF Action Space: Position (xyz) + rotation (rx, ry, rz) + gripper open/close
- Fully Self-Contained Demo: Runs on CPU with synthetic data -- no external simulators or downloads required
```
RGB Image (224x224)          Text Instruction
        |                           |
 [SigLIP/ViT-B16]           [Language Encoder]
 Spatial Compress 4x         Transformer LM
 49 tokens @ 512d            64 tokens @ 512d
        |                           |
        +---> [Cross-Attention Fusion x3] <---+
                         |
              113 fused tokens @ 512d
                         |
         [DDPM Diffusion Action Decoder]
                  10 DDIM steps
                         |
                   [Action Head]
                         |
            16 actions x 7-DoF each
```
| Component | Parameters | Description |
|---|---|---|
| Vision Encoder | ~85M | ViT-B/16 + spatial compression (14x14 -> 7x7) |
| Language Encoder | ~15M | 4-layer transformer with 384-dim hidden |
| Cross-Attention Fusion | ~50M | 3 bidirectional cross-attention layers |
| Diffusion Decoder | ~20M | 6-layer conditional residual network |
| Action Head | <1M | Refinement MLP + normalization |
| Total | ~170M | Lightweight mode (no pretrained weights) |
With pretrained CLIP ViT-B/16 and DistilBERT, the full model reaches ~400M parameters.
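The token counts in the diagram can be sanity-checked with simple arithmetic (a standalone sketch mirroring the numbers above; variable names are illustrative, not from the source tree):

```python
# ViT-B/16 on a 224x224 image yields a 14x14 patch grid; 4x spatial
# compression halves each side to 7x7 before fusion with 64 language tokens.
image_size, patch_size = 224, 16
grid = image_size // patch_size            # 14
patch_tokens = grid * grid                 # 196 raw patch tokens
compressed = (grid // 2) * (grid // 2)     # 4x compression -> 49 tokens
language_tokens = 64
fused = compressed + language_tokens       # 49 + 64 = 113 fused tokens
print(patch_tokens, compressed, fused)     # 196 49 113
```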
```
Project_5_VLA_Robotic_Manipulation/
├── README.md
├── requirements.txt
├── PROJECT_DOCUMENT.md
├── src/
│   ├── models/
│   │   ├── vision/
│   │   │   └── siglip_encoder.py      # SigLIP/CLIP vision encoder + spatial compression
│   │   ├── language/
│   │   │   └── language_encoder.py    # Lightweight language encoder
│   │   ├── fusion/
│   │   │   ├── cross_attention.py     # Vision-language cross-attention
│   │   │   └── vlm_backbone.py        # Combined VLM backbone
│   │   ├── action/
│   │   │   ├── diffusion_decoder.py   # DDPM action chunk decoder
│   │   │   └── action_head.py         # 7-DoF action post-processing
│   │   └── vla_model.py               # Full VLA pipeline
│   ├── data/
│   │   ├── robot_dataset.py           # Dataset loader + synthetic data
│   │   └── data_utils.py              # Normalization, trajectory utils
│   ├── simulation/
│   │   └── simple_env.py              # 2D manipulation environment
│   ├── training/
│   │   ├── train.py                   # Training loop
│   │   └── config.yaml                # Hyperparameters
│   └── evaluation/
│       ├── evaluate.py                # Success rate, speed benchmarks
│       └── visualize_actions.py       # Trajectory visualization
└── notebooks/
    └── demo.ipynb                     # Interactive demo notebook
```
```
pip install -r requirements.txt
jupyter notebook notebooks/demo.ipynb
```

The demo notebook runs entirely on CPU with synthetic data -- no GPU, no downloads, and no external simulators needed.

```
# Default training with synthetic data
python -m src.training.train

# Custom configuration
python -m src.training.train --config src/training/config.yaml --epochs 10 --batch-size 4
```

```
# Evaluate untrained model (baseline)
python -m src.evaluation.evaluate --episodes 20

# Evaluate from checkpoint
python -m src.evaluation.evaluate --checkpoint outputs/vla_v1/best_model.pt
```

Each module can also be smoke-tested standalone:

```
python -m src.models.vision.siglip_encoder
python -m src.models.language.language_encoder
python -m src.models.fusion.cross_attention
python -m src.models.action.diffusion_decoder
python -m src.models.vla_model
python -m src.data.robot_dataset
python -m src.simulation.simple_env
```

Instead of predicting one action at a time (autoregressive, slow, and jerky), this model uses DDPM diffusion to generate a chunk of 16 future actions simultaneously:
- Training: Add Gaussian noise to ground-truth action chunks at random timesteps; train a noise predictor to recover the noise
- Inference: Start from pure noise and iteratively denoise over K steps (DDIM for speed) to produce smooth action sequences
- Execution: Execute the first 4-8 actions from the chunk, then re-predict (receding horizon)
The cosine noise schedule and DDIM sampling with 10 steps enable real-time inference (>8 Hz target).
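The inference path above can be sketched in a few lines of numpy. This is a minimal, standalone illustration: `predict_noise` is a placeholder for the trained diffusion decoder (which conditions on the fused vision-language tokens), and the cosine schedule follows the standard `cos²` form.

```python
import numpy as np

T, K = 100, 10                      # training timesteps, DDIM inference steps
s = 0.008                           # cosine-schedule offset
t = np.arange(T + 1)
alpha_bar = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
alpha_bar /= alpha_bar[0]           # normalize so alpha_bar[0] = 1

def predict_noise(x, t):
    # Placeholder for the trained noise predictor; the real decoder is a
    # 6-layer conditional residual network conditioned on fused tokens.
    return np.zeros_like(x)

x = np.random.randn(16, 7)          # start the action chunk from pure noise
steps = np.linspace(T - 1, 0, K + 1).round().astype(int)
for t_cur, t_prev in zip(steps[:-1], steps[1:]):
    eps = predict_noise(x, t_cur)
    # Deterministic DDIM update: estimate x0, then re-noise to t_prev
    x0 = (x - np.sqrt(1 - alpha_bar[t_cur]) * eps) / np.sqrt(alpha_bar[t_cur])
    x = np.sqrt(alpha_bar[t_prev]) * x0 + np.sqrt(1 - alpha_bar[t_prev]) * eps

print(x.shape)  # (16, 7): one chunk of 16 actions, 7-DoF each
```

For receding-horizon execution, only the first 4-8 rows of `x` would be executed before re-predicting.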
Each action is a 7-dimensional vector:
| Dimension | Range | Description |
|---|---|---|
| x, y, z | [-0.05, 0.05] m | End-effector position delta |
| rx, ry, rz | [-0.25, 0.25] rad | End-effector rotation delta |
| gripper | {0, 1} | Gripper state (0=closed, 1=open) |
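Diffusion models work best on roughly unit-scale targets, so actions are typically mapped into [-1, 1] per dimension using the ranges above. A hedged sketch (the helper names are illustrative, not the actual `data_utils` API):

```python
import numpy as np

# Per-dimension bounds from the action-space table: xyz deltas in metres,
# rotation deltas in radians, gripper in {0, 1}.
LOW  = np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])
HIGH = np.array([ 0.05,  0.05,  0.05,  0.25,  0.25,  0.25, 1.0])

def normalize(action):
    """Map a raw 7-DoF action into [-1, 1] for diffusion training."""
    return 2.0 * (action - LOW) / (HIGH - LOW) - 1.0

def denormalize(action):
    """Invert the mapping, then snap the gripper back to {0, 1}."""
    raw = (action + 1.0) / 2.0 * (HIGH - LOW) + LOW
    raw[6] = 1.0 if raw[6] > 0.5 else 0.0  # binarize gripper state
    return raw

a = np.array([0.01, -0.02, 0.0, 0.1, 0.0, -0.25, 1.0])
assert np.allclose(denormalize(normalize(a)), a)
```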
Bidirectional cross-attention allows:
- Vision -> Language: Visual tokens attend to language tokens ("What does 'red cup' look like in this scene?")
- Language -> Vision: Language tokens attend to visual tokens ("Where is the object I need to pick up?")
Three stacked layers with gated residual connections progressively refine the multimodal alignment.
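The fusion direction can be illustrated with a toy single-head, projection-free cross-attention in numpy (the real layers add learned multi-head projections and gated residuals):

```python
import numpy as np

def cross_attend(queries, keys_values):
    """queries attend to keys_values: softmax(Q K^T / sqrt(d)) V."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ keys_values

vision = np.random.randn(49, 512)     # compressed visual tokens
language = np.random.randn(64, 512)   # language tokens

# Vision -> Language and Language -> Vision, each with a residual connection
vision = vision + cross_attend(vision, language)
language = language + cross_attend(language, vision)

fused = np.concatenate([vision, language])
print(fused.shape)  # (113, 512): the fused token sequence fed to the decoder
```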
| Benchmark | Metric | Target |
|---|---|---|
| CALVIN ABC->D | Avg. chain length | 2.7+ |
| Single task | Success rate | >80% |
| Inference speed | Control rate (Hz) | >8 |
| Model size | Parameters | <500M |
| GPU memory | Inference VRAM | <4 GB |
| Method | Params | CALVIN Avg Len | Hz |
|---|---|---|---|
| RT-2 | 55B | 3.2 | 0.5 |
| OpenVLA | 7B | 2.8 | 3.0 |
| SmolVLA | 450M | 2.5 | 8.0 |
| Octo | 93M | 1.8 | 15.0 |
| Ours | ~400M | ~2.7 | >8 |
All hyperparameters are in src/training/config.yaml. Key settings:

| Parameter | Default | Description |
|---|---|---|
| chunk_size | 16 | Actions per chunk |
| diffusion_steps | 100 | Training diffusion timesteps |
| inference_steps | 10 | DDIM sampling steps |
| learning_rate | 1e-4 | AdamW learning rate |
| batch_size | 8 | Training batch size |
| image_size | 224 | Input image resolution |
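A config fragment matching these defaults might look as follows (the exact key layout in src/training/config.yaml may differ; this is illustrative):

```yaml
chunk_size: 16         # actions per chunk
diffusion_steps: 100   # training diffusion timesteps
inference_steps: 10    # DDIM sampling steps
learning_rate: 1.0e-4  # AdamW learning rate
batch_size: 8          # training batch size
image_size: 224        # input image resolution
```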
- Kim et al., "OpenVLA: An Open-Source Vision-Language-Action Model," 2024
- Roux et al., "SmolVLA: A Small VLA for Efficient Robot Manipulation," 2025
- Black et al., "pi-zero: A Vision-Language-Action Flow Model for General Robot Control," 2024
- Ho et al., "Denoising Diffusion Probabilistic Models," NeurIPS 2020
- Chi et al., "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion," RSS 2023
- Mees et al., "CALVIN: A Benchmark for Language-Conditioned Policy Learning," RA-L 2022
- Open X-Embodiment Collaboration, "Open X-Embodiment," 2024
This project is for educational and research purposes.