Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model
Long Le, Jason Xie, William Liang, Hung-Ju Wang, Yue Yang, Yecheng Jason Ma, Kyle Vedder, Arjun Krishna, Dinesh Jayaraman, Eric Eaton
University of Pennsylvania
[Project Website] [Paper] [Twitter threads]
Articulate Anything is a powerful VLM system for articulating 3D objects using various input modalities.
articulate_anything_tiktokified_2_V3.mp4
Articulate 3D objects from text descriptions
Articulate 3D objects from images
Articulate 3D objects from videos
We use Hydra for configuration management. You can easily customize the system by modifying the configuration files in configs/ or by overriding parameters from the command line. You can automatically articulate a variety of input modalities with a single command:
python articulate.py modality={partnet, text, image, video} prompt={prompt} out_dir={output_dir}
Articulate-Anything uses an actor-critic system, allowing for self-correction and self-improvement over iterations.
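Any value in conf/config.yaml can also be overridden directly on the command line using Hydra's key=value syntax. The examples below are illustrative; modality, prompt, out_dir, and model_name are the keys documented in this README, but check your checkout for the exact names:
python articulate.py modality=text prompt="suitcase with a retractable handle" out_dir=results/text/suitcase
python articulate.py modality=video prompt="datasets/in-the-wild-dataset/videos/suitcase.mp4" out_dir=results/video/suitcase model_name=gpt-4o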
- Download the preprocessed PartNet-Mobility dataset from the 🤗 Articulate-Anything Dataset on Hugging Face.
- To use the interactive demo, run
python gradio_app.py
articulate_anything_gradio_demo.mp4
See below for more detailed guides.
Note
Skip the raw-dataset download step if you have already downloaded our preprocessed dataset from the 🤗 Articulate-Anything Dataset on Hugging Face.
- Clone the repository:
git clone https://github.com/vlongle/articulate-anything.git
cd articulate-anything
- Set up the Python environment:
conda create -n articulate-anything python=3.9
conda activate articulate-anything
pip install -e .
- Download and extract the PartNet-Mobility dataset:
# Download from https://sapien.ucsd.edu/downloads
mkdir datasets
mv partnet-mobility-v0.zip datasets/partnet-mobility-v0.zip
cd datasets
mkdir partnet-mobility-v0
unzip partnet-mobility-v0 -d partnet-mobility-v0
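A quick sanity check after extraction can catch a misplaced dataset early. This is a minimal sketch, assuming the standard PartNet-Mobility layout of one directory per object ID containing a mobility.urdf (the file referenced later in this README); adjust the path if your extraction is nested differently.

# Sanity-check sketch: assumed layout datasets/partnet-mobility-v0/<object_id>/mobility.urdf
from pathlib import Path

dataset_dir = Path("datasets/partnet-mobility-v0")
object_dirs = [d for d in dataset_dir.iterdir() if d.is_dir()]
with_urdf = [d for d in object_dirs if (d / "mobility.urdf").exists()]
print(f"Found {len(object_dirs)} object directories; "
      f"{len(with_urdf)} contain a mobility.urdf")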
Our system supports Google Gemini, OpenAI GPT, and Anthropic Claude. You can set the model_name in the config file conf/config.yaml to gemini-1.5-flash-latest, gpt-4o, or claude-3-5-sonnet-20241022. Get your API key from the respective website and set it as an environment variable:
export API_KEY=YOUR_API_KEY
We support reconstruction from in-the-wild text, images, or videos, as well as masked reconstruction from the PartNet-Mobility dataset.
Note
Skip all the processing steps if you have downloaded our preprocessed dataset from the 🤗 Articulate-Anything Dataset on Hugging Face.
- First, preprocess the PartNet dataset by running
python preprocess_partnet.py parallel={int} modality={}
- Run the interactive demo
python gradio_app.py
It's articulation time! For a step-by-step guide on articulating a PartNet-Mobility object, see the notebook:
or run
python articulate.py modality=partnet prompt=45384 out_dir=results additional_prompt=joint_0
to articulate a specific object and joint (here, object 45384 and joint_0).
- Preprocess the dataset:
python articulate_anything/preprocess/preprocess_partnet.py parallel={int} modality=text
Our precomputed CLIP embeddings are available in our repo as partnet_mobility_embeddings.csv. If you prefer to generate your own embeddings, follow these steps:
- Run the preprocessing with render_part_views=true to render part views for later part annotation:
python articulate_anything/preprocess/preprocess_partnet.py parallel={int} modality=text render_part_views=true
- Annotate mesh parts using a VLM (skip if using our precomputed embeddings):
python articulate_anything/preprocess/annotate_partnet_parts.py parallel={int}
- Extract CLIP embeddings (skip if using our precomputed embeddings):
python articulate_anything/preprocess/create_partnet_embeddings.py
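If you are curious what this step amounts to, the sketch below shows one way to compute and store CLIP text embeddings for part annotations using Hugging Face transformers. It is a sketch only, not the repo's create_partnet_embeddings.py: the annotation texts, the CLIP variant, and the schema of partnet_mobility_embeddings.csv are all assumptions.

# Illustrative sketch of CLIP text-embedding extraction (NOT the repo's script).
import torch
import pandas as pd
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

annotations = {
    # hypothetical object_id -> part-annotation text; replace with your own annotations
    "0001": "a suitcase body with a retractable handle",
    "0002": "a cabinet frame with two hinged doors",
}

rows = []
with torch.no_grad():
    for obj_id, text in annotations.items():
        inputs = tokenizer([text], padding=True, return_tensors="pt")
        emb = model.get_text_features(**inputs)        # (1, 512)
        emb = emb / emb.norm(dim=-1, keepdim=True)     # L2-normalize for cosine similarity
        rows.append({"object_id": obj_id, "embedding": emb.squeeze(0).tolist()})

pd.DataFrame(rows).to_csv("my_partnet_embeddings.csv", index=False)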
- It's articulation time! For a detailed guide, see:
or run
python articulate.py modality=text prompt="suitcase with a retractable handle" out_dir=results/text/suitcase joint_actor.targetted_affordance=false
- Render images for each object:
python articulate_anything/preprocess/preprocess_partnet.py parallel={int} modality=image
This renders a front-view image for each object in the PartNet-Mobility dataset. It is necessary for our mesh retrieval: we compare the visual similarity of the input image or video against each rendered template object (see the retrieval sketch after these steps).
- It's articulation time! For a detailed guide, see:
or run
python articulate.py modality=video prompt="datasets/in-the-wild-dataset/videos/suitcase.mp4" out_dir=results/video/suitcase
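For intuition, here is a minimal sketch of the visual-similarity retrieval described in the rendering step above: embed the input image and each rendered template with CLIP, then pick the most similar template. The per-object file name (front_view.png) and the CLIP variant are assumptions; the repo's actual retrieval code and paths may differ.

# Illustrative sketch of visual-similarity mesh retrieval (NOT the repo's code).
from pathlib import Path
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(path):
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)  # normalize for cosine similarity

query = embed_image("my_input_image.png")  # hypothetical input image
templates = {d.name: embed_image(d / "front_view.png")
             for d in Path("datasets/partnet-mobility-v0").iterdir()
             if (d / "front_view.png").exists()}

# Pick the template object whose rendering is most similar to the query.
best_id = max(templates, key=lambda k: (query @ templates[k].T).item())
print("Closest template object:", best_id)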
Note: Please download a CoTracker checkpoint for video articulation to visualize the motion traces.
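If you want to inspect the point tracks yourself, the sketch below follows CoTracker's published torch.hub quick-start. This loads the hub checkpoint rather than the local checkpoint mentioned above, and the model variant (cotracker2) is an assumption; articulate-anything wires CoTracker in through its own config.

# CoTracker sketch following the facebookresearch/co-tracker quick-start.
# Assumption: the "cotracker2" torch.hub entry point; not how articulate-anything
# loads its own checkpoint internally.
import torch
import imageio.v3 as iio

frames = iio.imread("datasets/in-the-wild-dataset/videos/suitcase.mp4", plugin="FFMPEG")  # (T, H, W, 3)
video = torch.tensor(frames).permute(0, 3, 1, 2)[None].float()  # (B, T, C, H, W)

cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker2")
pred_tracks, pred_visibility = cotracker(video, grid_size=10)  # tracks: (B, T, N, 2)
print(pred_tracks.shape, pred_visibility.shape)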
Some implementation peculiarities of the PartNet-Mobility dataset:
- Raise above ground: The meshes are centered at the origin (0,0,0). We use pybullet to raise the links above the ground. This is done automatically in sapien_simulate.
- Rotate meshes: All the meshes start out lying on the ground, so we have to bring them into an upright orientation. Specifically, we need to add a fixed joint
<origin rpy="1.570796326794897 0 1.570796326794897" xyz="0 0 0"/>
between the first link and the base link. The original PartNet-Mobility dataset almost provides this already: render_partnet_obj, which calls rotate_urdf, saves the original URDF under mobility.urdf.backup and writes the correctly rotated version to mobility.urdf. Our generated Python program must include this joint as well; this is done automatically by the compiler odio_urdf.py using the align_robot_orientation function.
Feel free to reach me at [email protected] if you'd like to collaborate or have any questions. You can also open a GitHub issue if you encounter any problems.
If you find this work useful, please consider citing our paper:
@article{le2024articulate,
title={Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model},
author={Le, Long and Xie, Jason and Liang, William and Wang, Hung-Ju and Yang, Yue and Ma, Yecheng Jason and Vedder, Kyle and Krishna, Arjun and Jayaraman, Dinesh and Eaton, Eric},
journal={arXiv preprint arXiv:2410.13882},
year={2024}
}
For more information, visit our project website.