- Talking Face
- Image Animation
- Video Generation
- TryOn
- Visual Edit
- Others
- Music2Dance and Co-speech
- Speech and Interaction
- Post Training
Talking Face
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-12-04 | LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging | Zhijian Shu et.al. | 2512.04939 | null |
| 2025-12-04 | Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild | Yigui Feng et.al. | 2512.04728 | null |
| 2025-12-02 | DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions | Yifan Zhou et.al. | 2512.02727 | null |
| 2025-12-01 | EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans | Yingjie Zhou et.al. | 2512.01340 | null |
| 2025-11-30 | EmoDiffTalk:Emotion-aware Diffusion for Editable 3D Gaussian Talking Head | Chang Liu et.al. | 2512.05991 | null |
| 2025-11-30 | TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model | Alireza Javanmardi et.al. | 2512.00909 | null |
| 2025-11-29 | MVAD : A Comprehensive Multimodal Video-Audio Dataset for AIGC Detection | Mengxue Hu et.al. | 2512.00336 | null |
| 2025-11-28 | AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement | Zhizhou Zhong et.al. | 2511.23475 | null |
| 2025-11-28 | CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation | Fengyi Fang et.al. | 2511.22863 | null |
| 2025-11-27 | AI killed the video star. Audio-driven diffusion model for expressive talking head generation | Baptiste Chopin et.al. | 2511.22488 | null |
| 2025-11-27 | VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task | Yuyue Wang et.al. | 2511.22229 | null |
| 2025-11-27 | IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer | Bo Chen et.al. | 2511.22167 | null |
| 2025-11-27 | Lips-Jaw and Tongue-Jaw Articulatory Tradeoff in DYNARTmo | Bernd J. Kröger et.al. | 2511.22155 | null |
| 2025-11-26 | Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video | Filippo Cenacchi et.al. | 2511.13802 | null |
| 2025-11-24 | Blinking Beyond EAR: A Stable Eyelid Angle Metric for Driver Drowsiness Detection and Data Augmentation | Mathis Wolter et.al. | 2511.19519 | null |
| 2025-11-24 | Assessing the alignment between infants' visual and linguistic experience using multimodal language models | Alvin Wei Ming Tan et.al. | 2511.18824 | null |
| 2025-11-23 | SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model | Kaidi Wang et.al. | 2512.05126 | null |
| 2025-11-23 | The Locally Deployable Virtual Doctor: LLM Based Human Interface for Automated Anamnesis and Database Conversion | Jan Benedikt Ruhland et.al. | 2511.18632 | null |
| 2025-11-23 | RigAnyFace: Scaling Neural Facial Mesh Auto-Rigging with Unlabeled Data | Wenchao Ma et.al. | 2511.18601 | null |
| 2025-11-22 | A superpersuasive autonomous policy debating system | Allen Roush et.al. | 2511.17854 | null |
| 2025-11-21 | Investigating self-supervised representations for audio-visual deepfake detection | Dragos-Alexandru Boldisor et.al. | 2511.17181 | null |
| 2025-11-20 | Motion Transfer-Enhanced StyleGAN for Generating Diverse Macaque Facial Expressions | Takuya Igaue et.al. | 2511.16711 | null |
| 2025-11-19 | StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model | Yifan Yang et.al. | 2511.14223 | null |
| 2025-11-18 | Blur-Robust Detection via Feature Restoration: An End-to-End Framework for Prior-Guided Infrared UAV Target Detection | Xiaolin Wang et.al. | 2511.14371 | null |
| 2025-11-18 | Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning | Rui Liu et.al. | 2511.14249 | null |
| 2025-11-17 | B2F: End-to-End Body-to-Face Motion Generation with Style Reference | Bokyung Jang et.al. | 2511.13988 | null |
| 2025-11-17 | Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views | Junyi Ma et.al. | 2511.12878 | null |
| 2025-11-12 | GRACE: Designing Generative Face Video Codec via Agile Hardware-Centric Workflow | Rui Wan et.al. | 2511.09272 | null |
| 2025-11-11 | Is It Truly Necessary to Process and Fit Minutes-Long Reference Videos for Personalized Talking Face Generation? | Rui-Qing Sun et.al. | 2511.07940 | null |
| 2025-11-10 | LiveNeRF: Efficient Face Replacement Through Neural Radiance Fields Integration | Tung Vu et.al. | 2511.07552 | null |
| 2025-11-10 | The Inner Kernel of the Classical Kuiper Belt | Amir Siraj et.al. | 2511.07512 | null |
| 2025-11-10 | ConsistTalk: Intensity Controllable Temporally Consistent Talking Head Generation with Diffusion Noise Search | Zhenjie Liu et.al. | 2511.06833 | null |
| 2025-11-08 | DiLO: Disentangled Latent Optimization for Learning Shape and Deformation in Grouped Deforming 3D Objects | Mostofa Rafid Uddin et.al. | 2511.06115 | null |
| 2025-11-08 | Reperio-rPPG: Relational Temporal Graph Neural Networks for Periodicity Learning in Remote Physiological Measurement | Ba-Thinh Nguyen et.al. | 2511.05946 | null |
| 2025-11-07 | Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis | Dogucan Yaman et.al. | 2511.05432 | null |
| 2025-11-07 | THEval. Evaluation Framework for Talking Head Video Generation | Nabyl Quignon et.al. | 2511.04520 | null |
| 2025-11-05 | Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework | Dogucan Yaman et.al. | 2511.08613 | null |
| 2025-11-05 | Laugh, Relate, Engage: Stylized Comment Generation for Short Videos | Xuan Ouyang et.al. | 2511.03757 | null |
| 2025-11-05 | UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions | Guozhen Zhang et.al. | 2511.03334 | null |
| 2025-11-04 | Densemarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks | Dmitrii Pozdeev et.al. | 2511.02830 | null |
| 2025-11-01 | Beyond the Uncanny Valley: A Mixed-Method Investigation of Anthropomorphism in Protective Responses to Robot Abuse | Fan Yang et.al. | 2510.26082 | null |
| 2025-11-01 | Audio Driven Real-Time Facial Animation for Social Telepresence | Jiye Lee et.al. | 2510.01176 | null |
| 2025-10-29 | Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation | Yuxiang Mao et.al. | 2510.25234 | null |
| 2025-10-28 | See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement | Jinting Wang et.al. | 2510.26819 | null |
| 2025-10-28 | The Divine Software Engineering Comedy -- Inferno: The Okinawa Files | Michele Lanza et.al. | 2510.24483 | null |
| 2025-10-28 | GenTrack: A New Generation of Multi-Object Tracking | Toan Van Nguyen et.al. | 2510.24399 | null |
| 2025-10-28 | Variable Projected Augmented Lagrangian Methods for Generalized Lasso Problems | Stefano Aleotti et.al. | 2510.24140 | null |
| 2025-10-27 | Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation | Junyoung Seo et.al. | 2510.23581 | null |
| 2025-10-27 | Revising Second Order Terms in Deep Animation Video Coding | Konstantin Schmidt et.al. | 2510.23561 | null |
| 2025-10-26 | MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control | Fatemeh Nazarieh et.al. | 2510.22810 | null |
| 2025-10-26 | DeepfakeBench-MM: A Comprehensive Benchmark for Multimodal Deepfake Detection | Kangran Zhao et.al. | 2510.22622 | null |
| 2025-10-24 | Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing | Danial Samadi Vahdati et.al. | 2510.03548 | null |
| 2025-10-23 | LSF-Animation: Label-Free Speech-Driven Facial Animation via Implicit Feature Representation | Xin Lu et.al. | 2510.21864 | null |
| 2025-10-16 | PIA: Deepfake Detection Using Phoneme-Temporal and Identity-Dynamic Analysis | Soumyya Kanti Datta et.al. | 2510.14241 | null |
| 2025-10-14 | Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback | Xingpei Ma et.al. | 2510.12089 | null |
| 2025-10-12 | DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis | Peiyin Chen et.al. | 2510.10650 | null |
| 2025-10-11 | VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework | Donglin Huang et.al. | 2510.10269 | null |
| 2025-10-11 | SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation | Zeyu Ling et.al. | 2510.10069 | null |
| 2025-10-09 | Paper2Video: Automatic Video Generation from Scientific Papers | Zeyu Zhu et.al. | 2510.05096 | null |
| 2025-10-08 | A Bridge from Audio to Video: Phoneme-Viseme Alignment Allows Every Face to Speak Multiple Languages | Zibo Su et.al. | 2510.06612 | null |
| 2025-10-03 | EGSTalker: Real-Time Audio-Driven Talking Head Generation with Efficient Gaussian Deformation | Tianheng Zhu et.al. | 2510.08587 | null |
| 2025-10-02 | Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation | Beijia Lu et.al. | 2510.02617 | null |
| 2025-09-30 | 3DiFACE: Synthesizing and Editing Holistic 3D Facial Animation | Balamurugan Thambiraja et.al. | 2509.26233 | null |
| 2025-09-28 | Durian: Dual Reference Image-Guided Portrait Animation with Attribute Transfer | Hyunsoo Cha et.al. | 2509.04434 | null |
| 2025-09-26 | StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing | Liyang Chen et.al. | 2509.21887 | null |
| 2025-09-25 | Unlocking Financial Insights: An advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos | Sarmistha Das et.al. | 2509.20961 | null |
| 2025-09-24 | KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation | Tianle Lyu et.al. | 2509.20128 | null |
| 2025-09-24 | Comparative Study of Subjective Video Quality Assessment Test Methods in Crowdsourcing for Varied Use Cases | Babak Naderi et.al. | 2509.20118 | null |
| 2025-09-24 | SynchroRaMa : Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding | Phyo Thet Yee et.al. | 2509.19965 | null |
| 2025-09-24 | Talking Head Generation via AU-Guided Landmark Prediction | Shao-Yu Chang et.al. | 2509.19749 | null |
| 2025-09-24 | EAI-Avatar: Emotion-Aware Interactive Talking Head Generation | Haijie Yang et.al. | 2508.18337 | null |
| 2025-09-23 | Audio-Driven Universal Gaussian Head Avatars | Kartik Teotia et.al. | 2509.18924 | null |
| 2025-09-22 | "I don't like my avatar": Investigating Human Digital Doubles | Siyi Liu et.al. | 2509.17748 | null |
| 2025-09-22 | Stable Video-Driven Portraits | Mallikarjun B. R. et.al. | 2509.17476 | null |
| 2025-09-21 | Beat on Gaze: Learning Stylized Generation of Gaze and Head Dynamics | Chengwei Shi et.al. | 2509.17168 | null |
| 2025-09-21 | PGSTalker: Real-Time Audio-Driven Talking Head Generation via 3D Gaussian Splatting with Pixel-Aware Density Control | Tianheng Zhu et.al. | 2509.16922 | null |
| 2025-09-20 | Follow-Your-Emoji-Faster: Towards Efficient, Fine-Controllable, and Expressive Freestyle Portrait Animation | Yue Ma et.al. | 2509.16630 | null |
| 2025-09-17 | Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis | Yikang Ding et.al. | 2509.09595 | null |
| 2025-09-16 | A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis | Javeria Amir et.al. | 2509.12831 | null |
| 2025-09-15 | AvatarSync: Rethinking Talking-Head Animation through Autoregressive Perspective | Yuchen Deng et.al. | 2509.12052 | null |
| 2025-09-10 | Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video | Xiao Li et.al. | 2509.08376 | null |
| 2025-08-28 | EmoCAST: Emotional Talking Portrait via Emotive Text Description | Yiguo Jiang et.al. | 2508.20615 | null |
| 2025-08-27 | InfinityHuman: Towards Long-Term Audio-Driven Human | Xiaodi Li et.al. | 2508.20210 | null |
| 2025-08-27 | Improving Generalization in Deepfake Detection with Face Foundation Models and Metric Learning | Stelios Mylonas et.al. | 2508.19730 | null |
| 2025-08-26 | OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation | Jianwen Jiang et.al. | 2508.19209 | null |
| 2025-08-26 | Wan-S2V: Audio-Driven Cinematic Video Generation | Xin Gao et.al. | 2508.18621 | null |
| 2025-08-25 | Lightning Fast Caching-based Parallel Denoising Prediction for Accelerating Talking Head Generation | Jianzhi Long et.al. | 2509.00052 | null |
| 2025-08-22 | Audio2Face-3D: Audio-driven Realistic Facial Animation For Digital Avatars | NVIDIA et.al. | 2508.16401 | null |
| 2025-08-20 | D^3-Talker: Dual-Branch Decoupled Deformation Fields for Few-Shot 3D Talking Head Synthesis | Yuhang Guo et.al. | 2508.14449 | null |
| 2025-08-20 | Taming Transformer for Emotion-Controllable Talking Face Generation | Ziqi Zhang et.al. | 2508.14359 | null |
| 2025-08-19 | TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis | Shunian Chen et.al. | 2508.13618 | null |
| 2025-08-19 | EDTalk++: Full Disentanglement for Controllable Talking Head Synthesis | Shuai Tan et.al. | 2508.13442 | null |
| 2025-08-18 | Human Feedback Driven Dynamic Speech Emotion Recognition | Ilya Fedorov et.al. | 2508.14920 | null |
| 2025-08-17 | CEM-Net: Cross-Emotion Memory Network for Emotional Talking Face Generation | Kangyi Wu et.al. | 2508.12368 | null |
| 2025-08-16 | RealTalk: Realistic Emotion-Aware Lifelike Talking-Head Synthesis | Wenqing Wang et.al. | 2508.12163 | null |
| 2025-08-16 | SimInterview: Transforming Business Education through Large Language Model-Based Simulated Multilingual Interview Training System | Truong Thanh Hung Nguyen et.al. | 2508.11873 | null |
| 2025-08-15 | FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation | MengChao Wang et.al. | 2508.11255 | null |
| 2025-08-14 | HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis | Shiyu Liu et.al. | 2508.10566 | null |
| 2025-08-13 | LIA-X: Interpretable Latent Portrait Animator | Yaohui Wang et.al. | 2508.09959 | null |
| 2025-08-12 | Preview WB-DH: Towards Whole Body Digital Human Bench for the Generation of Whole-body Talking Avatar Videos | Chaoyi Wang et.al. | 2508.08891 | null |
| 2025-08-11 | Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation | Hyung Kyu Kim et.al. | 2507.20568 | null |
| 2025-08-10 | KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features | Ivan Kukanov et.al. | 2508.07337 | null |
| 2025-08-08 | MotionSwap | Om Patil et.al. | 2508.06430 | null |
| 2025-08-07 | Evaluation of a Sign Language Avatar on Comprehensibility, User Experience & Acceptability | Fenya Wasserroth et.al. | 2508.05358 | null |
| 2025-08-07 | RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer | Fangyu Du et.al. | 2508.05115 | null |
| 2025-08-07 | UniTalker: Conversational Speech-Visual Synthesis | Yifan Hu et.al. | 2508.04585 | null |
| 2025-08-07 | AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation | Le Wang et.al. | 2508.00733 | null |
| 2025-08-06 | MienCap: Realtime Performance-Based Facial Animation with Live Mood Dynamics | Ye Pan et.al. | 2508.04687 | null |
| 2025-08-06 | READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation | Haotian Wang et.al. | 2508.03457 | null |
| 2025-08-05 | Multi-human Interactive Talking Dataset | Zeyu Zhu et.al. | 2508.03050 | null |
| 2025-08-04 | X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio | Chenxu Zhang et.al. | 2508.02944 | null |
| 2025-08-04 | Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering | Xu Wang et.al. | 2508.02362 | null |
| 2025-08-04 | Is It Really You? Exploring Biometric Verification Scenarios in Photorealistic Talking-Head Avatar Videos | Laura Pedrouzo-Rodriguez et.al. | 2508.00748 | null |
| 2025-07-31 | Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads | Yingjie Zhou et.al. | 2507.23343 | null |
| 2025-07-30 | X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention | Xiaochen Zhao et.al. | 2507.23143 | null |
| 2025-07-30 | Robust Deepfake Detection for Electronic Know Your Customer Systems Using Registered Images | Takuma Amada et.al. | 2507.22601 | null |
| 2025-07-29 | DiTalker: A Unified DiT-based Framework for High-Quality and Speaking Styles Controllable Portrait Animation | He Feng et.al. | 2508.06511 | null |
| 2025-07-29 | JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1 | Xinhan Di et.al. | 2507.20987 | null |
| 2025-07-29 | Versatile Multimodal Controls for Expressive Talking Human Animation | Zheng Qin et.al. | 2503.08714 | null |
| 2025-07-28 | Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation | Dogucan Yaman et.al. | 2507.20953 | null |
| 2025-07-28 | MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization | Hyung Kyu Kim et.al. | 2507.20562 | null |
| 2025-07-28 | JOLT3D: Joint Learning of Talking Heads and 3DMM Parameters with Application to Lip-Sync | Sungjoon Park et.al. | 2507.20452 | null |
| 2025-07-25 | Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation | Fang Kang et.al. | 2507.19225 | null |
| 2025-07-24 | Tiny is not small enough: High-quality, low-resource facial animation models through hybrid knowledge distillation | Zhen Han et.al. | 2507.18352 | null |
| 2025-07-24 | Celeb-DF++: A Large-scale Challenging Video DeepFake Benchmark for Generalizable Forensics | Yuezun Li et.al. | 2507.18015 | null |
| 2025-07-24 | MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding | Chang Liu et.al. | 2507.06071 | null |
| 2025-07-23 | MoDA: Multi-modal Diffusion Architecture for Talking Head Generation | Xinyang Li et.al. | 2507.03256 | null |
| 2025-07-22 | Livatar-1: Real-Time Talking Heads Generation with Tailored Flow Matching | Haiyang Liu et.al. | 2507.18649 | null |
| 2025-07-22 | Navigating Large-Pose Challenge for High-Fidelity Face Reenactment with Video Diffusion Model | Mingtao Guo et.al. | 2507.16341 | null |
| 2025-07-21 | VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis | Alexandre Symeonidis-Herzig et.al. | 2507.06060 | null |
| 2025-07-18 | FantasyPortrait: Enhancing Multi-Character Portrait Animation with Expression-Augmented Diffusion Transformers | Qiang Wang et.al. | 2507.12956 | null |
| 2025-07-17 | ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion | Hoang-Son Vo et.al. | 2507.12804 | null |
| 2025-07-17 | Think-Before-Draw: Decomposing Emotion Semantics & Fine-Grained Controllable Expressive Talking Head Generation | Hanlei Shi et.al. | 2507.12761 | null |
| 2025-07-17 | Cross-Modal Watermarking for Authentic Audio Recovery and Tamper Localization in Synthesized Audiovisual Forgeries | Minyoung Kim et.al. | 2507.12723 | null |
| 2025-07-16 | AU-Blendshape for Fine-grained Stylized 3D Facial Expression Manipulation | Hao Li et.al. | 2507.12001 | null |
| 2025-07-14 | M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation | Kui Jiang et.al. | 2507.08307 | null |
| 2025-07-11 | Detecting Deepfake Talking Heads from Facial Biometric Anomalies | Justin D. Norman et.al. | 2507.08917 | null |
| 2025-07-10 | GGTalker: Talking Head Systhesis with Generalizable Gaussian Priors and Identity-Specific Adaptation | Wentao Hu et.al. | 2506.21513 | null |
| 2025-07-07 | MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation | Yucheng Wang et.al. | 2507.05092 | null |
| 2025-07-05 | EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation | Rang Meng et.al. | 2507.03905 | null |
| 2025-07-03 | CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation | Xiangyang Luo et.al. | 2507.02691 | null |
| 2025-07-02 | FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases | Shuai Tan et.al. | 2507.01390 | null |
| 2025-07-01 | ICME 2025 Grand Challenge on Video Super-Resolution for Video Conferencing | Babak Naderi et.al. | 2506.12269 | link |
| 2025-06-30 | JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching | Mingi Kwon et.al. | 2506.23552 | null |
| 2025-06-27 | MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation | Dechao Meng et.al. | 2506.22065 | null |
| 2025-06-27 | Few-Shot Identity Adaptation for 3D Talking Heads via Global Gaussian Field | Hong Nie et.al. | 2506.22044 | null |
| 2025-06-27 | RiverEcho: Real-Time Interactive Digital System for Ancient Yellow River Culture | Haofeng Wang et.al. | 2506.21865 | null |
| 2025-06-24 | Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router | Yubo Huang et.al. | 2506.19833 | null |
| 2025-06-23 | Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions | Vineet Kumar Rakesh et.al. | 2507.02900 | null |
| 2025-06-23 | OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation | Qijun Gan et.al. | 2506.18866 | null |
| 2025-06-17 | SyncTalk++: High-Fidelity and Efficient Synchronized Talking Heads Synthesis Using Gaussian Splatting | Ziqiao Peng et.al. | 2506.14742 | null |
| 2025-06-17 | Compressed Video Super-Resolution based on Hierarchical Encoding | Yuxuan Jiang et.al. | 2506.14381 | null |
| 2025-06-16 | Audio-Visual Driven Compression for Low-Bitrate Talking Head Videos | Riku Takahashi et.al. | 2506.13419 | null |
| 2025-06-15 | iDiT-HOI: Inpainting-based Hand Object Interaction Reenactment via Video Diffusion Transformer | Zhelun Shen et.al. | 2506.12847 | null |
| 2025-06-10 | HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation | Ziyao Huang et.al. | 2506.08797 | null |
| 2025-06-03 | NTIRE 2025 XGC Quality Assessment Challenge: Methods and Results | Xiaohong Liu et.al. | 2506.02875 | null |
| 2025-06-02 | Cocktail-Party Audio-Visual Speech Recognition | Thai-Binh Nguyen et.al. | 2506.02178 | null |
| 2025-06-02 | Low-Rank Head Avatar Personalization with Registers | Sai Tanmay Reddy Chakkera et.al. | 2506.01935 | null |
| 2025-06-02 | Silence is Golden: Leveraging Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation | Yuan Gan et.al. | 2506.01591 | link |
| 2025-06-01 | SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers | Zhengcong Fei et.al. | 2506.00830 | null |
| 2025-05-30 | TalkingHeadBench: A Multi-Modal Benchmark & Analysis of Talking-Head DeepFake Detection | Xinqi Xiong et.al. | 2505.24866 | null |
| 2025-05-29 | Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization and Temporal Motion Modulation | Jiahao Cui et.al. | 2505.23525 | link |
| 2025-05-29 | Video Editing for Audio-Visual Dubbing | Binyamin Manela et.al. | 2505.23406 | link |
| 2025-05-29 | Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation | Hao Li et.al. | 2505.23290 | link |
| 2025-05-29 | MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation | Siyuan Wang et.al. | 2505.23120 | link |
| 2025-05-28 | Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation | Zhe Kong et.al. | 2505.22647 | link |
| 2025-05-28 | Tell me Habibi, is it Real or Fake? | Kartik Kuckreja et.al. | 2505.22581 | null |
| 2025-05-28 | Neural Face Skinning for Mesh-agnostic Facial Expression Cloning | Sihun Cha et.al. | 2505.22416 | null |
| 2025-05-28 | FaceEditTalker: Interactive Talking Head Generation with Facial Attribute Editing | Guanwen Feng et.al. | 2505.22141 | null |
| 2025-05-28 | RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling | Long-Khanh Pham et.al. | 2505.22024 | null |
| 2025-05-27 | OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers | Ziqiao Peng et.al. | 2505.21448 | null |
| 2025-05-26 | Total-Editing: Head Avatar with Editable Appearance, Motion, and Lighting | Yizhou Zhao et.al. | 2505.20582 | null |
| 2025-05-26 | DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations | Ziqiao Peng et.al. | 2505.18096 | null |
| 2025-05-22 | Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis | Radek Daněček et.al. | 2504.13386 | null |
| 2025-05-14 | Test-Time Augmentation for Pose-invariant Face Recognition | Jaemin Jung et.al. | 2505.09256 | null |
| 2025-05-10 | VTutor: An Animated Pedagogical Agent SDK that Provide Real Time Multi-Model Feedback | Eason Chen et.al. | 2505.06676 | null |
| 2025-05-10 | OT-Talk: Animating 3D Talking Head with Optimal Transportation | Xinmu Wang et.al. | 2505.01932 | null |
| 2025-05-10 | MagicPortrait: Temporally Consistent Face Reenactment with 3D Geometric Guidance | Mengting Wei et.al. | 2504.21497 | link |
| 2025-05-08 | OXSeg: Multidimensional attention UNet-based lip segmentation using semi-supervised lip contours | Hanie Moghaddasi et.al. | 2505.05531 | null |
| 2025-05-03 | GenSync: A Generalized Talking Head Framework for Audio-driven Multi-Subject Lip-Sync using 3D Gaussian Splatting | Anushka Agarwal et.al. | 2505.01928 | null |
| 2025-05-02 | Model See Model Do: Speech-Driven Facial Animation with Style Control | Yifang Pan et.al. | 2505.01319 | null |
| 2025-05-02 | FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing | Gaoxiang Cong et.al. | 2505.01263 | null |
| 2025-05-01 | KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution | Antoni Bigata et.al. | 2505.00497 | null |
| 2025-04-29 | IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos | Yuan Li et.al. | 2504.19165 | null |
| 2025-04-27 | Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions | Mohammad Mahdi Abootorabi et.al. | 2504.19056 | link |
| 2025-04-26 | Audio-Driven Talking Face Video Generation with Joint Uncertainty Learning | Yifan Xie et.al. | 2504.18810 | null |
| 2025-04-25 | Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation | Weipeng Tan et.al. | 2504.18087 | null |
| 2025-04-14 | SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models | Stathis Galanakis et.al. | 2504.10716 | null |
| 2025-04-10 | ChildlikeSHAPES: Semantic Hierarchical Region Parsing for Animating Figure Drawings | Astitva Srivastava et.al. | 2504.08022 | null |
| 2025-04-08 | VideoSPatS: Video SPatiotemporal Splines for Disentangled Occlusion, Appearance and Motion Modeling and Editing | Juan Luis Gonzalez Bello et.al. | 2504.07146 | null |
| 2025-04-08 | SE4Lip: Speech-Lip Encoder for Talking Head Synthesis to Solve Phoneme-Viseme Alignment Ambiguity | Yihuan Huang et.al. | 2504.05803 | null |
| 2025-04-08 | Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation | Zhihua Xu et.al. | 2504.05746 | null |
| 2025-04-08 | Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation | Tianshui Chen et.al. | 2504.05672 | null |
| 2025-04-07 | Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation | Fa-Ting Hong et.al. | 2504.02542 | link |
| 2025-04-06 | FluentLip: A Phonemes-Based Two-stage Approach for Audio-Driven Lip Synthesis with Optical Flow Consistency | Shiyan Liu et.al. | 2504.04427 | null |
| 2025-04-04 | A Human Digital Twin Architecture for Knowledge-based Interactions and Context-Aware Conversations | Abdul Mannan Mohammed et.al. | 2504.03147 | null |
| 2025-04-03 | OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication | Zhongjian Wang et.al. | 2504.02433 | null |
| 2025-04-03 | VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models | Kim Sung-Bin et.al. | 2504.02386 | null |
| 2025-04-02 | Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies | Soumyya Kanti Datta et.al. | 2504.01470 | link |
| 2025-04-02 | EmoHead: Emotional Talking Head via Manipulating Semantic Expression Parameters | Xuli Shen et.al. | 2503.19416 | null |
| 2025-04-01 | Monocular and Generalizable Gaussian Talking Head Animation | Shengjie Gong et.al. | 2504.00665 | null |
| 2025-04-01 | Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics | Lee Chae-Yeon et.al. | 2503.20308 | null |
| 2025-03-30 | MoCha: Towards Movie-Grade Talking Character Synthesis | Cong Wei et.al. | 2503.23307 | null |
| 2025-03-29 | STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing | Zijun Ding et.al. | 2503.23039 | link |
| 2025-03-28 | Audio-Plane: Audio Factorization Plane Gaussian Splatting for Real-Time Talking Head Synthesis | Shuai Shen et.al. | 2503.22605 | null |
| 2025-03-28 | Follow Your Motion: A Generic Temporal Consistency Portrait Editing Framework with Trajectory Guidance | Haijie Yang et.al. | 2503.22225 | null |
| 2025-03-27 | ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model | Jinwei Qi et.al. | 2503.21144 | null |
| 2025-03-26 | Dual Audio-Centric Modality Coupling for Talking Head Generation | Ao Fu et.al. | 2503.22728 | null |
| 2025-03-25 | AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers | Jiazhi Guan et.al. | 2503.19824 | null |
| 2025-03-25 | MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation | Yukang Lin et.al. | 2503.19383 | null |
| 2025-03-25 | HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation | Zunnan Xu et.al. | 2503.18860 | null |
| 2025-03-25 | Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model | Yingying Fan et.al. | 2503.16942 | null |
| 2025-03-24 | DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model | Kangwei Liu et.al. | 2503.19001 | null |
| 2025-03-24 | Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation | Dingcheng Zhen et.al. | 2503.18429 | null |
| 2025-03-23 | DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation | Peng Chen et.al. | 2503.18159 | link |
| 2025-03-21 | TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting | Jianchuan Chen et.al. | 2503.17032 | null |
| 2025-03-21 | From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech | Ji-Hoon Kim et.al. | 2503.16956 | null |
| 2025-03-20 | UniSync: A Unified Framework for Audio-Visual Synchronization | Tao Feng et.al. | 2503.16357 | null |
| 2025-03-20 | PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation | Baiqin Wang et.al. | 2503.14295 | null |
| 2025-03-19 | DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis | Yuming Gu et.al. | 2503.15667 | link |
| 2025-03-19 | KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation | Antoni Bigata et.al. | 2503.01715 | null |
| 2025-03-17 | SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization | Xulin Fan et.al. | 2503.13371 | null |
| 2025-03-17 | Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait | Chaolong Yang et.al. | 2503.12963 | link |
| 2025-03-14 | Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control | Hejia Chen et.al. | 2503.14517 | null |
| 2025-03-14 | EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models | Yixuan Zhang et.al. | 2503.11028 | null |
| 2025-03-12 | StyleSpeaker: Audio-Enhanced Fine-Grained Style Modeling for Speech-Driven 3D Facial Animation | An Yang et.al. | 2503.09852 | null |
| 2025-03-12 | Bidirectional Learned Facial Animation Codec for Low Bitrate Talking Head Videos | Riku Takahashi et.al. | 2503.09787 | null |
| 2025-03-09 | Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter | Yanyu Zhu et.al. | 2503.06397 | null |
| 2025-03-07 | MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice | Hongwei Yi et.al. | 2503.05978 | null |
| 2025-03-06 | FREAK: Frequency-modulated High-fidelity and Real-time Audio-driven Talking Portrait Synthesis | Ziqi Ni et.al. | 2503.04067 | null |
| 2025-03-02 | FaceShot: Bring Any Character into Life | Junyao Gao et.al. | 2503.00740 | null |
| 2025-03-01 | Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture | Xuanchen Li et.al. | 2503.00495 | null |
| 2025-02-28 | Two-Stream Spatial-Temporal Transformer Framework for Person Identification via Natural Conversational Keypoints | Masoumeh Chapariniya et.al. | 2502.20803 | null |
| 2025-02-28 | ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model | Xuangeng Chu et.al. | 2502.20323 | null |
| 2025-02-27 | InsTaG: Learning Personalized 3D Talking Head from Few-Second Video | Jiahe Li et.al. | 2502.20387 | link |
| 2025-02-27 | High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model | Mingtao Guo et.al. | 2502.19894 | link |
| 2025-02-26 | FLAP: Fully-controllable Audio-driven Portrait Video Generation through 3D head conditioned diffusion mode | Lingzhou Mu et.al. | 2502.19455 | null |
| 2025-02-24 | Dimitra: Audio-driven Diffusion model for Expressive Talking Head Generation | Baptiste Chopin et.al. | 2502.17198 | null |
| 2025-02-20 | NeRF-3DTalker: Neural Radiance Field with 3D Prior Aided Audio Disentanglement for Talking Head Synthesis | Xiaoxing Liu et.al. | 2502.14178 | null |
| 2025-02-18 | AV-Flow: Transforming Text to Audio-Visual Human-like Interactions | Aggelina Chatziagapi et.al. | 2502.13133 | null |
| 2025-02-17 | SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion | Junxian Ma et.al. | 2502.11515 | null |
| 2025-02-15 | SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers | Di Qiu et.al. | 2502.10841 | link |
| 2025-02-13 | Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model | Fei Shen et.al. | 2502.09533 | null |
| 2025-02-13 | VTutor: An Open-Source SDK for Generative AI-Powered Animated Pedagogical Agents with Multi-Media Output | Eason Chen et.al. | 2502.04103 | null |
| 2025-02-11 | Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion | Xingpei Ma et.al. | 2502.07203 | null |
| 2025-02-07 | Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark | Han Zhang et.al. | 2502.04976 | null |
| 2025-02-02 | EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis | Junuk Cha et.al. | 2502.00654 | null |
| 2025-01-24 | SyncAnimation: A Real-Time End-to-End Framework for Audio-Driven Human Pose and Talking Head Animation | Yujian Liu et.al. | 2501.14646 | null |
| 2025-01-21 | A Lightweight and Interpretable Deepfakes Detection Framework | Muhammad Umar Farooq et.al. | 2501.11927 | null |
| 2025-01-18 | EMO2: End-Effector Guided Audio-Driven Avatar Video Generation | Linrui Tian et.al. | 2501.10687 | null |
| 2025-01-17 | TalkingEyes: Pluralistic Speech-Driven 3D Eye Gaze Animation | Yixiang Zhuang et.al. | 2501.09921 | null |
| 2025-01-15 | Joint Learning of Depth and Appearance for Portrait Image Animation | Xinya Ji et.al. | 2501.08649 | null |
| 2025-01-15 | Make-A-Character 2: Animatable 3D Character Generation From a Single Image | Lin Liu et.al. | 2501.07870 | null |
| 2025-01-09 | Towards Dynamic Neural Communication and Speech Neuroprosthesis Based on Viseme Decoding | Ji-Ha Park et.al. | 2501.14790 | null |
| 2025-01-09 | Identity-Preserving Video Dubbing Using Motion Warping | Runzhen Liu et.al. | 2501.04586 | null |
| 2025-01-09 | MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation | Huaize Liu et.al. | 2501.01808 | null |
| 2025-01-07 | Generating and Detecting Various Types of Fake Image and Audio Content: A Review of Modern Deep Learning Technologies and Tools | Arash Dehghani et.al. | 2501.06227 | null |
| 2025-01-07 | VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control | Yuanpeng Tu et.al. | 2501.01427 | null |
| 2025-01-06 | RDD4D: 4D Attention-Guided Road Damage Detection And Classification | Asma Alkalbani et.al. | 2501.02822 | link |
| 2025-01-06 | Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study | Mykola Maslych et.al. | 2501.00168 | null |
| 2025-01-03 | JoyGen: Audio-Driven 3D Depth-Aware Talking-Face Video Editing | Qili Wang et.al. | 2501.01798 | link |
| 2024-12-28 | DEGSTalk: Decomposed Per-Embedding Gaussian Fields for Hair-Preserving Talking Face Synthesis | Kaijun Deng et.al. | 2412.20148 | link |
| 2024-12-26 | UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control | Wenzhang Sun et.al. | 2412.19860 | null |
| 2024-12-26 | Generating Editable Head Avatars with 3D Gaussian GANs | Guohao Li et.al. | 2412.19149 | link |
| 2024-12-23 | FaceLift: Single Image to 3D Head with View Generation and GS-LRM | Weijie Lyu et.al. | 2412.17812 | null |
| 2024-12-22 | FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation | Tianyun Zhong et.al. | 2412.16915 | null |
| 2024-12-18 | Joint Co-Speech Gesture and Expressive Talking Face Generation using Diffusion with Adapters | Steven Hogue et.al. | 2412.14333 | link |
| 2024-12-18 | GLCF: A Global-Local Multimodal Coherence Analysis Framework for Talking Face Generation Detection | Xiaocan Chen et.al. | 2412.13656 | null |
| 2024-12-18 | Learning to Control an Android Robot Head for Facial Animation | Marcel Heisler et.al. | 2412.13641 | null |
| 2024-12-18 | Real-time One-Step Diffusion-based Expressive Portrait Videos Generation | Hanzhong Guo et.al. | 2412.13479 | link |
| 2024-12-18 | VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization | Tao Liu et.al. | 2412.09892 | null |
| 2024-12-16 | Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content | Rohit Kundu et.al. | 2412.12278 | null |
| 2024-12-13 | GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression | Ziqi Zhou et.al. | 2412.09296 | link |
| 2024-12-12 | LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync | Chunyu Li et.al. | 2412.09262 | link |
| 2024-12-12 | EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing | Gaoxiang Cong et.al. | 2412.08988 | null |
| 2024-12-12 | PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based Talking Head Synthesis | Yifan Xie et.al. | 2412.08504 | null |
| 2024-12-10 | PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation | Fatemeh Nazarieh et.al. | 2412.07754 | null |
| 2024-12-10 | IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation | Sejong Yang et.al. | 2412.04000 | null |
| 2024-12-05 | MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation | Longtao Zheng et.al. | 2412.04448 | null |
| 2024-12-05 | Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks | Jiahao Cui et.al. | 2412.00733 | link |
| 2024-12-04 | SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model | Yan Li et.al. | 2412.03430 | null |
| 2024-12-02 | One Shot, One Talk: Whole-body Talking Avatar from a Single Image | Jun Xiang et.al. | 2412.01106 | null |
| 2024-12-01 | Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation | Shuling Zhao et.al. | 2412.00719 | null |
| 2024-11-29 | LokiTalk: Learning Fine-Grained and Generalizable Correspondences to Enhance NeRF-based Talking Head Synthesis | Tianqi Li et.al. | 2411.19525 | null |
| 2024-11-29 | Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis | Tianqi Li et.al. | 2411.19509 | link |
| 2024-11-29 | V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow | Jeongsoo Choi et.al. | 2411.19486 | link |
| 2024-11-26 | Passive Deepfake Detection Across Multi-modalities: A Comprehensive Survey | Hong-Hanh Nguyen-Le et.al. | 2411.17911 | null |
| 2024-11-25 | Sonic: Shifting Focus to Global Audio Perception in Portrait Animation | Xiaozhong Ji et.al. | 2411.16331 | null |
| 2024-11-25 | ESARM: 3D Emotional Speech-to-Animation via Reward Model from Automatically-Ranked Demonstrations | Xulong Zhang et.al. | 2411.13089 | null |
| 2024-11-24 | LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis | Haojie Zhang et.al. | 2411.16748 | null |
| 2024-11-23 | EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion | Haotian Wang et.al. | 2411.16726 | null |
| 2024-11-23 | ConsistentAvatar: Learning to Diffuse Fully Consistent Talking Head Avatar with Temporal Guidance | Haijie Yang et.al. | 2411.15436 | null |
| 2024-11-20 | Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis | Pegah Salehi et.al. | 2411.13209 | link |
| 2024-11-20 | JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation | Xuyang Cao et.al. | 2411.09209 | link |
| 2024-11-14 | LES-Talker: Fine-Grained Emotion Editing for Talking Head Generation in Linear Emotion Space | Guanwen Feng et.al. | 2411.09268 | null |
| 2024-11-06 | Large Generative Model-assisted Talking-face Semantic Communication System | Feibo Jiang et.al. | 2411.03876 | null |
| 2024-11-05 | SPEAK: Speech-Driven Pose and Emotion-Adjustable Talking Head Generation | Changpeng Cai et.al. | 2405.07257 | null |
| 2024-10-31 | Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts | Xiang Deng et.al. | 2410.23836 | null |
| 2024-10-29 | Multimodal Semantic Communication for Generative Audio-Driven Video Conferencing | Haonan Tong et.al. | 2410.22112 | null |
| 2024-10-24 | Real-time 3D-aware Portrait Video Relighting | Ziqi Cai et.al. | 2410.18355 | link |
| 2024-10-21 | Joker: Conditional 3D Head Synthesis with Extreme Facial Expressions | Malte Prinzler et.al. | 2410.16395 | null |
| 2024-10-18 | Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization | Bin Lin et.al. | 2410.14283 | null |
| 2024-10-18 | DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation | Hanbo Cheng et.al. | 2410.13726 | link |
| 2024-10-16 | MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting | Yue Zhang et.al. | 2410.10122 | link |
| 2024-10-15 | Titanic Calling: Low Bandwidth Video Conference from the Titanic Wreck | Fevziye Irem Eyiokur et.al. | 2410.11434 | null |
| 2024-10-15 | MimicTalk: Mimicking a personalized and expressive 3D talking face in minutes | Zhenhui Ye et.al. | 2410.06734 | null |
| 2024-10-14 | Character-aware audio-visual subtitling in context | Jaesung Huh et.al. | 2410.11068 | null |
| 2024-10-14 | Beyond Fixed Topologies: Unregistered Training and Comprehensive Evaluation Metrics for 3D Talking Heads | Federico Nocentini et.al. | 2410.11041 | null |
| 2024-10-14 | TALK-Act: Enhance Textural-Awareness for 2D Speaking Avatar Reenactment with Diffusion Model | Jiazhi Guan et.al. | 2410.10696 | null |
| 2024-10-14 | Generative Human Video Compression with Multi-granularity Temporal Trajectory Factorization | Shanzhi Yin et.al. | 2410.10171 | null |
| 2024-10-10 | MMHead: Towards Fine-grained Multi-modal 3D Facial Animation | Sijing Wu et.al. | 2410.07757 | null |
| 2024-10-09 | FreeAvatar: Robust 3D Facial Animation Transfer by Learning an Expression Foundation Model | Feng Qiu et.al. | 2409.13180 | null |
| 2024-10-01 | LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details | Jian Yang et.al. | 2410.00990 | null |
| 2024-09-29 | Learning Frame-Wise Emotion Intensity for Audio-Driven Talking-Head Generation | Jingyi Xu et.al. | 2409.19501 | null |
| 2024-09-27 | Diverse Code Query Learning for Speech-Driven Facial Animation | Chunzhi Gu et.al. | 2409.19143 | null |
| 2024-09-26 | Stable Video Portraits | Mirela Ostrek et.al. | 2409.18083 | null |
| 2024-09-25 | ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE | Sichun Wu et.al. | 2409.07966 | link |
| 2024-09-24 | FastTalker: Jointly Generating Speech and Conversational Gestures from Text | Zixin Guo et.al. | 2409.16404 | null |
| 2024-09-23 | FaceVid-1K: A Large-Scale High-Quality Multiracial Human Face Video Dataset | Donglin Di et.al. | 2410.07151 | null |
| 2024-09-23 | MIMAFace: Face Animation via Motion-Identity Modulated Appearance Feature Learning | Yue Han et.al. | 2409.15179 | null |
| 2024-09-18 | JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation | Sai Tanmay Reddy Chakkera et.al. | 2409.12156 | null |
| 2024-09-18 | GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations | Kartik Teotia et.al. | 2409.11951 | null |
| 2024-09-17 | 3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy | Xuanmeng Sha et.al. | 2409.10848 | null |
| 2024-09-16 | DreamHead: Learning Spatial-Temporal Correspondence via Hierarchical Diffusion for Audio-driven Talking Head Synthesis | Fa-Ting Hong et.al. | 2409.10281 | null |
| 2024-09-14 | StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads | Suzhen Wang et.al. | 2409.09292 | null |
| 2024-09-11 | DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures | Steven Hogue et.al. | 2409.07649 | null |
| 2024-09-11 | EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion | Jian Zhang et.al. | 2409.07255 | link |
| 2024-09-09 | PersonaTalk: Bring Attention to Your Persona in Visual Dubbing | Longhao Zhang et.al. | 2409.05379 | null |
| 2024-09-09 | KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation | Hoang-Son Vo-Thanh et.al. | 2409.05330 | link |
| 2024-09-05 | SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing | Lingyu Xiong et.al. | 2409.03605 | null |
| 2024-09-05 | SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model | Weipeng Tan et.al. | 2409.03270 | null |
| 2024-09-04 | PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation | Jun Ling et.al. | 2409.02657 | null |
| 2024-09-02 | KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding | Zhihao Xu et.al. | 2409.01113 | link |
| 2024-08-28 | Micro and macro facial expressions by driven animations in realistic Virtual Humans | Rubens Halbig Montanha et.al. | 2408.16110 | null |
| 2024-08-28 | MegActor- | Shurong Yang et.al. | 2408.14975 | null |
| 2024-08-25 | TalkLoRA: Low-Rank Adaptation for Speech-Driven Animation | Jack Saunders et.al. | 2408.13714 | null |
| 2024-08-23 | G3FA: Geometry-guided GAN for Face Animation | Alireza Javanmardi et.al. | 2408.13049 | null |
| 2024-08-21 | AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition | Minheng Ni et.al. | 2408.11564 | null |
| 2024-08-21 | EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention | Yihong Lin et.al. | 2408.11518 | null |
| 2024-08-20 | DEGAS: Detailed Expressions on Full-Body Gaussian Avatars | Zhijing Shao et.al. | 2408.10588 | link |
| 2024-08-18 | FD2Talk: Towards Generalized Talking Head Generation with Facial Decoupled Diffusion Model | Ziyu Yao et.al. | 2408.09384 | null |
| 2024-08-18 | Meta-Learning Empowered Meta-Face: Personalized Speaking Style Adaptation for Audio-Driven 3D Talking Face Animation | Xukun Zhou et.al. | 2408.09357 | null |
| 2024-08-18 | S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis | Dongze Li et.al. | 2408.09347 | null |
| 2024-08-16 | GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer | Yihong Lin et.al. | 2408.01826 | null |
| 2024-08-14 | Content and Style Aware Audio-Driven Facial Animation | Qingju Liu et.al. | 2408.07005 | null |
| 2024-08-12 | DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation | Jisoo Kim et.al. | 2408.06010 | null |
| 2024-08-10 | High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model | Weizhi Zhong et.al. | 2408.05416 | null |
| 2024-08-10 | Style-Preserving Lip Sync via Audio-Aware Style Reference | Weizhi Zhong et.al. | 2408.05412 | null |
| 2024-08-09 | DeepSpeak Dataset v1.0 | Sarah Barrington et.al. | 2408.05366 | null |
| 2024-08-06 | ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer | Jiazhi Guan et.al. | 2408.03284 | null |
| 2024-08-03 | Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation | Jintao Tan et.al. | 2408.01732 | null |
| 2024-08-03 | JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model | Farzaneh Jafari et.al. | 2408.01627 | null |
| 2024-08-01 | UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model | Xiangyu Fan et.al. | 2408.00762 | null |
| 2024-08-01 | Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion | Manuel Kansy et.al. | 2408.00458 | null |
| 2024-08-01 | EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head | Qianyun He et.al. | 2408.00297 | null |
| 2024-07-31 | Deformable 3D Shape Diffusion Model | Dengsheng Chen et.al. | 2407.21428 | null |
| 2024-07-26 | LinguaLinker: Audio-Driven Portraits Animation with Implicit Facial Control Enhancement | Rui Zhang et.al. | 2407.18595 | null |
| 2024-07-24 | A Comprehensive Review and Taxonomy of Audio-Visual Synchronization Techniques for Realistic Speech Animation | Jose Geraldo Fernandes et.al. | 2407.17430 | null |
| 2024-07-24 | The impact of differences in facial features between real speakers and 3D face models on synthesized lip motions | Rabab Algadhy et.al. | 2407.17253 | null |
| 2024-07-22 | PAV: Personalized Head Avatar from Unstructured Video Collection | Akin Caliskan et.al. | 2407.21047 | null |
| 2024-07-21 | Anchored Diffusion for Video Face Reenactment | Idan Kligvasser et.al. | 2407.15153 | null |
| 2024-07-20 | Text-based Talking Video Editing with Cascaded Conditional Diffusion | Bo Han et.al. | 2407.14841 | null |
| 2024-07-17 | Universal Facial Encoding of Codec Avatars from VR Headsets | Shaojie Bai et.al. | 2407.13038 | null |
| 2024-07-17 | EmoFace: Audio-driven Emotional 3D Face Animation | Chang Liu et.al. | 2407.12501 | link |
| 2024-07-13 | Learning Online Scale Transformation for Talking Head Video Generation | Fa-Ting Hong et.al. | 2407.09965 | null |
| 2024-07-12 | Real Face Video Animation Platform | Xiaokai Chen et.al. | 2407.18955 | null |
| 2024-07-12 | One-Shot Pose-Driving Face Animation Platform | He Feng et.al. | 2407.08949 | null |
| 2024-07-12 | EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions | Zhiyuan Chen et.al. | 2407.08136 | link |
| 2024-07-08 | MobilePortrait: Real-Time One-Shot Neural Head Avatars on Mobile Devices | Jianwen Jiang et.al. | 2407.05712 | null |
| 2024-07-08 | Audio-driven High-resolution Seamless Talking Head Video Editing via StyleGAN | Jiacheng Su et.al. | 2407.05577 | null |
| 2024-07-04 | Compressed Skinning for Facial Blendshapes | Ladislav Kavan et.al. | 2406.11597 | null |
| 2024-07-03 | LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control | Jianzhu Guo et.al. | 2407.03168 | link |
| 2024-07-02 | Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert | Han EunGi et.al. | 2407.01034 | null |
| 2024-06-26 | RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network | Xiaozhong Ji et.al. | 2406.18284 | null |
| 2024-06-24 | The Effects of Embodiment and Personality Expression on Learning in LLM-based Educational Agents | Sinan Sonlu et.al. | 2407.10993 | null |
| 2024-06-21 | EmpathyEar: An Open-source Avatar Multimodal Empathetic Chatbot | Hao Fei et.al. | 2406.15177 | link |
| 2024-06-20 | MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset | Kim Sung-Bin et.al. | 2406.14272 | null |
| 2024-06-19 | DF40: Toward Next-Generation Deepfake Detection | Zhiyuan Yan et.al. | 2406.13495 | link |
| 2024-06-19 | AniFaceDiff: High-Fidelity Face Reenactment via Facial Parametric Conditioned Diffusion Models | Ken Chen et.al. | 2406.13272 | null |
| 2024-06-18 | RITA: A Real-time Interactive Talking Avatars Framework | Wuxinlin Cheng et.al. | 2406.13093 | null |
| 2024-06-18 | A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing | Ming Meng et.al. | 2406.10553 | null |
| 2024-06-17 | NLDF: Neural Light Dynamic Fields for Efficient 3D Talking Head Generation | Niu Guanchen et.al. | 2406.11259 | null |
| 2024-06-17 | Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement | Runyi Yu et.al. | 2406.08096 | null |
| 2024-06-16 | Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation | Mingwang Xu et.al. | 2406.08801 | null |
| 2024-06-14 | DNPM: A Neural Parametric Model for the Synthesis of Facial Geometric Details | Haitao Cao et.al. | 2405.19688 | null |
| 2024-06-13 | Talking Heads: Understanding Inter-layer Communication in Transformer Language Models | Jack Merullo et.al. | 2406.09519 | null |
| 2024-06-13 | DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing | Neha Sahipjohn et.al. | 2406.08802 | null |
| 2024-06-12 | Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation | Jiadong Liang et.al. | 2406.07895 | null |
| 2024-06-07 | Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation | Yue Ma et.al. | 2406.01900 | null |
| 2024-06-05 | Controllable Talking Face Generation by Implicit Facial Keypoints Editing | Dong Zhao et.al. | 2406.02880 | link |
| 2024-05-31 | MunchSonic: Tracking Fine-grained Dietary Actions through Active Acoustic Sensing on Eyeglasses | Saif Mahmud et.al. | 2405.21004 | null |
| 2024-05-31 | MegActor: Harness the Power of Raw Video for Vivid Portrait Animation | Shurong Yang et.al. | 2405.20851 | link |
| 2024-05-30 | Audio2Rig: Artist-oriented deep learning tool for facial animation | Bastien Arcelin et.al. | 2405.20412 | null |
| 2024-05-28 | OpFlowTalker: Realistic and Natural Talking Face Generation via Optical Flow Guidance | Shuheng Ge et.al. | 2405.14709 | null |
| 2024-05-24 | InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation | Yuchi Wang et.al. | 2405.15758 | link |
| 2024-05-22 | Metabook: An Automatically Generated Augmented Reality Storybook Interaction System to Improve Children's Engagement in Storytelling | Yibo Wang et.al. | 2405.13701 | null |
| 2024-05-21 | Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control | Yue Han et.al. | 2405.12970 | null |
| 2024-05-16 | Faces that Speak: Jointly Synthesising Talking Face and Speech from Text | Youngjoon Jang et.al. | 2405.10272 | null |
| 2024-05-14 | PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset | Yang Hou et.al. | 2405.08838 | link |
| 2024-05-10 | NeRFFaceSpeech: One-shot Audio-driven 3D Talking Head Synthesis via Generative Prior | Gihoon Kim et.al. | 2405.05749 | null |
| 2024-05-09 | SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space | Zeren Zhang et.al. | 2405.05636 | null |
| 2024-05-08 | Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention | Ruijie Tao et.al. | 2404.18501 | link |
| 2024-05-07 | Audio-Visual Speech Representation Expert for Enhanced Talking Face Video Generation and Evaluation | Dogucan Yaman et.al. | 2405.04327 | null |
| 2024-05-07 | AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding | Tao Liu et.al. | 2405.03121 | null |
| 2024-04-29 | EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars | Nikita Drobyshev et.al. | 2404.19110 | null |
| 2024-04-29 | GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting | Bo Chen et.al. | 2404.19040 | null |
| 2024-04-29 | Embedded Representation Learning Network for Animating Styled Video Portrait | Tianyong Wang et.al. | 2404.19038 | null |
| 2024-04-29 | CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation | Xiangyu Liang et.al. | 2404.18604 | null |
| 2024-04-28 | GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting | Hongyun Yu et.al. | 2404.14037 | null |
| 2024-04-25 | GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting | Kyusun Cho et.al. | 2404.16012 | link |
| 2024-04-23 | TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting | Jiahe Li et.al. | 2404.15264 | link |
| 2024-04-19 | Learn2Talk: 3D Talking Face Learns from 2D Talking Face | Yixiang Zhuang et.al. | 2404.12888 | null |
| 2024-04-16 | VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time | Sicheng Xu et.al. | 2404.10667 | null |
| 2024-04-15 | FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features | Andre Rochow et.al. | 2404.09736 | null |
| 2024-04-13 | THQA: A Perceptual Quality Assessment Database for Talking Heads | Yingjie Zhou et.al. | 2404.09003 | link |
| 2024-04-11 | EFHQ: Multi-purpose ExtremePose-Face-HQ dataset | Trung Tuan Dao et.al. | 2312.17205 | null |
| 2024-04-09 | Deepfake Generation and Detection: A Benchmark and Survey | Gan Pei et.al. | 2403.17881 | link |
| 2024-04-08 | SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation | Heyuan Li et.al. | 2404.05680 | null |
| 2024-04-07 | GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets | Dongjing Shan et.al. | 2404.04924 | null |
| 2024-04-07 | Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation | Renshuai Liu et.al. | 2401.01207 | null |
| 2024-04-03 | MI-NeRF: Learning a Single Face NeRF from Multiple Identities | Aggelina Chatziagapi et.al. | 2403.19920 | null |
| 2024-04-02 | EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis | Shuai Tan et.al. | 2404.01647 | null |
| 2024-04-02 | Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation | Taekyung Ki et.al. | 2404.00636 | null |
| 2024-04-02 | Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation | Se Jin Park et.al. | 2305.19556 | null |
| 2024-04-01 | FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio | Chao Xu et.al. | 2403.01901 | link |
| 2024-03-29 | Talk3D: High-Fidelity Talking Portrait Synthesis via Personalized 3D Generative Prior | Jaehoon Ko et.al. | 2403.20153 | link |
| 2024-03-28 | MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation | Seyeon Kim et.al. | 2403.19144 | link |
| 2024-03-28 | GOTCHA: Real-Time Video Deepfake Detection via Challenge-Response | Govind Mittal et.al. | 2210.06186 | link |
| 2024-03-27 | X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention | You Xie et.al. | 2403.15931 | null |
| 2024-03-26 | Superior and Pragmatic Talking Face Generation with Teacher-Student Framework | Chao Liang et.al. | 2403.17883 | null |
| 2024-03-26 | AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation | Huawei Wei et.al. | 2403.17694 | link |
| 2024-03-26 | Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis | Zhenhui Ye et.al. | 2401.08503 | null |
| 2024-03-25 | DiffusionAct: Controllable Diffusion Autoencoder for One-shot Face Reenactment | Stella Bounareli et.al. | 2403.17217 | null |
| 2024-03-25 | AnimateMe: 4D Facial Expressions via Diffusion Models | Dimitrios Gerogiannis et.al. | 2403.17213 | null |
| 2024-03-25 | Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework | Ziyao Huang et.al. | 2403.16510 | link |
| 2024-03-23 | Adaptive Super Resolution For One-Shot Talking-Head Generation | Luchuan Song et.al. | 2403.15944 | link |
| 2024-03-22 | LeGO: Leveraging a Surface Deformation Network for Animatable Stylized Face Generation with One Example | Soyeon Yoon et.al. | 2403.15227 | link |
| 2024-03-22 | Virbo: Multimodal Multilingual Avatar Video Generation in Digital Marketing | Juan Zhang et.al. | 2403.11700 | null |
| 2024-03-19 | EmoVOCA: Speech-Driven Emotional 3D Talking Heads | Federico Nocentini et.al. | 2403.12886 | link |
| 2024-03-19 | ScanTalk: 3D Talking Heads from Unregistered Scans | Federico Nocentini et.al. | 2403.10942 | link |
| 2024-03-15 | StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation | Dongchan Min et.al. | 2208.10922 | null |
| 2024-03-14 | GAIA: Zero-shot Talking Avatar Generation | Tianyu He et.al. | 2311.15230 | null |
| 2024-03-13 | Say Anything with Any Style | Shuai Tan et.al. | 2403.06363 | null |
| 2024-03-12 | FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization | Shuai Tan et.al. | 2403.06375 | null |
| 2024-03-12 | Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style | Shuai Tan et.al. | 2403.06365 | null |
| 2024-03-11 | A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos | Weixia Zhang et.al. | 2403.06421 | link |
| 2024-03-05 | Memories are One-to-Many Mapping Alleviators in Talking Face Generation | Anni Tang et.al. | 2212.05005 | null |
| 2024-03-02 | G4G:A Generic Framework for High Fidelity Talking Face Generation with Fine-grained Intra-modal Alignment | Juan Zhang et.al. | 2402.18122 | null |
| 2024-03-01 | DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder | Chenpeng Du et.al. | 2303.17550 | null |
| 2024-02-29 | Learning a Generalized Physical Face Model From Data | Lingchen Yang et.al. | 2402.19477 | null |
| 2024-02-28 | Context-aware Talking Face Video Generation | Meidai Xuanyuan et.al. | 2402.18092 | null |
| 2024-02-27 | EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions | Linrui Tian et.al. | 2402.17485 | null |
| 2024-02-27 | Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis | Zicheng Zhang et.al. | 2402.17364 | link |
| 2024-02-26 | Resolution-Agnostic Neural Compression for High-Fidelity Portrait Video Conferencing via Implicit Radiance Fields | Yifei Li et.al. | 2402.16599 | null |
| 2024-02-25 | AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation | Yasheng Sun et.al. | 2402.16124 | null |
| 2024-02-21 | Bring Your Own Character: A Holistic Solution for Automatic Facial Animation Generation of Customized Characters | Zechen Bai et.al. | 2402.13724 | link |
| 2024-02-21 | StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing | Gaoxiang Cong et.al. | 2402.12636 | link |
| 2024-02-12 | StyleLipSync: Style-based Personalized Lip-sync Video Generation | Taekyung Ki et.al. | 2305.00521 | null |
| 2024-02-08 | DiffSpeaker: Speech-Driven 3D Facial Animation with Diffusion Transformer | Zhiyuan Ma et.al. | 2402.05712 | link |
| 2024-02-05 | One-shot Neural Face Reenactment via Finding Directions in GAN's Latent Space | Stella Bounareli et.al. | 2402.03553 | null |
| 2024-02-02 | EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation | Guanwen Feng et.al. | 2402.01422 | null |
| 2024-01-31 | MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis | Wenhao Guan et.al. | 2312.10687 | null |
| 2024-01-30 | Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance | Qingcheng Zhao et.al. | 2401.15687 | null |
| 2024-01-28 | Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes | Weifeng Liu et.al. | 2401.15668 | link |
| 2024-01-27 | An Implicit Physical Face Model Driven by Expression and Style | Lingchen Yang et.al. | 2401.15414 | null |
| 2024-01-26 | Implicit Neural Representation for Physics-driven Actuated Soft Bodies | Lingchen Yang et.al. | 2401.14861 | null |
| 2024-01-25 | SAiD: Speech-driven Blendshape Facial Animation with Diffusion | Inkyu Park et.al. | 2401.08655 | link |
| 2024-01-23 | NeRF-AD: Neural Radiance Field with Attention-based Disentanglement for Talking Face Synthesis | Chongke Bi et.al. | 2401.12568 | null |
| 2024-01-19 | Fast Registration of Photorealistic Avatars for VR Facial Animation | Chaitanya Patel et.al. | 2401.11002 | null |
| 2024-01-18 | Exposing Lip-syncing Deepfakes from Mouth Inconsistencies | Soumyya Kanti Datta et.al. | 2401.10113 | link |
| 2024-01-18 | Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models | Jeongsoo Choi et.al. | 2306.16003 | null |
| 2024-01-16 | EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model | Bingyuan Zhang et.al. | 2401.08049 | null |
| 2024-01-12 | DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-encoder | Tao Liu et.al. | 2311.01811 | link |
| 2024-01-11 | Dubbing for Everyone: Data-Efficient Visual Dubbing using Neural Rendering Priors | Jack Saunders et.al. | 2401.06126 | null |
| 2024-01-11 | Jump Cut Smoothing for Talking Heads | Xiaojuan Wang et.al. | 2401.04718 | null |
| 2024-01-08 | AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation | Liyang Chen et.al. | 2310.07236 | null |
| 2024-01-07 | Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness | Sicheng Yang et.al. | 2401.03476 | null |
| 2024-01-04 | Expressive Speech-driven Facial Animation with controllable emotions | Yutong Chen et.al. | 2301.02008 | link |
| 2023-12-23 | TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation | Xize Cheng et.al. | 2312.15197 | null |
| 2023-12-22 | DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation | Chenxu Zhang et.al. | 2312.13578 | null |
| 2023-12-20 | FAAC: Facial Animation Generation with Anchor Frame and Conditional Control for Superior Fidelity and Editability | Linze Li et.al. | 2312.03775 | null |
| 2023-12-19 | Learning Dense Correspondence for NeRF-Based Face Reenactment | Songlin Yang et.al. | 2312.10422 | null |
| 2023-12-19 | Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing | Yushi Lan et.al. | 2312.03763 | null |
| 2023-12-18 | VectorTalker: SVG Talking Face Generation with Progressive Vectorisation | Hao Hu et.al. | 2312.11568 | null |
| 2023-12-18 | AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis | Dongze Li et.al. | 2312.10921 | null |
| 2023-12-18 | Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation | Hui Fu et.al. | 2312.10877 | null |
| 2023-12-15 | DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models | Yifeng Ma et.al. | 2312.09767 | link |
| 2023-12-15 | Attention-Based VR Facial Animation with Visual Mouth Camera Guidance for Immersive Telepresence Avatars | Andre Rochow et.al. | 2312.09750 | null |
| 2023-12-13 | uTalk: Bridging the Gap Between Humans and AI | Hussam Azzuni et.al. | 2310.02739 | null |
| 2023-12-13 | MMFace4D: A Large-Scale Multi-Modal 4D Face Dataset for Audio-Driven 3D Face Animation | Haozhe Wu et.al. | 2303.09797 | null |
| 2023-12-12 | GMTalker: Gaussian Mixture based Emotional talking video Portraits | Yibo Xia et.al. | 2312.07669 | null |
| 2023-12-12 | GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance | Haiming Zhang et.al. | 2312.07385 | null |
| 2023-12-11 | Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism | Georgios Milis et.al. | 2312.06613 | link |
| 2023-12-11 | Study of Non-Verbal Behavior in Conversational Agents | Camila Vicari Maccari et.al. | 2312.06530 | null |
| 2023-12-11 | DiT-Head: High-Resolution Talking Head Synthesis using Diffusion Transformers | Aaron Mir et.al. | 2312.06400 | null |
| 2023-12-11 | Audio-driven Talking Face Generation by Overcoming Unintended Information Flow | Dogucan Yaman et.al. | 2307.09368 | null |
| 2023-12-10 | DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation | Fa-Ting Hong et.al. | 2305.06225 | link |
| 2023-12-09 | R2-Talker: Realistic Real-Time Talking Head Synthesis with Hash Grid Landmarks Encoding and Progressive Multilayer Conditioning | Zhiling Ye et.al. | 2312.05572 | null |
| 2023-12-09 | FT2TF: First-Person Statement Text-To-Talking Face Generation | Xingjian Diao et.al. | 2312.05430 | null |
| 2023-12-08 | SingingHead: A Large-scale 4D Dataset for Singing Head Animation | Sijing Wu et.al. | 2312.04369 | null |
| 2023-12-07 | VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior | Xusen Sun et.al. | 2312.01841 | null |
| 2023-12-05 | PMMTalk: Speech-Driven 3D Facial Animation from Complementary Pseudo Multi-modal Features | Tianshun Han et.al. | 2312.02781 | null |
| 2023-12-05 | MyPortrait: Morphable Prior-Guided Personalized Portrait Generation | Bo Ding et.al. | 2312.02703 | null |
| 2023-12-02 | DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser | Peng Chen et.al. | 2311.16565 | null |
| 2023-12-01 | 3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing | Balamurugan Thambiraja et.al. | 2312.00870 | null |
| 2023-11-30 | Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data | Yu Deng et.al. | 2311.18729 | null |
| 2023-11-30 | Talking Head(?) Anime from a Single Image 4: Improved Model and Its Distillation | Pramook Khungurn et.al. | 2311.17409 | null |
| 2023-11-29 | SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis | Ziqiao Peng et.al. | 2311.17590 | link |
| 2023-11-28 | THInImg: Cross-modal Steganography for Presenting Talking Heads in Images | Lin Zhao et.al. | 2311.17177 | null |
| 2023-11-28 | BakedAvatar: Baking Neural Fields for Real-Time Head Avatar Synthesis | Hao-Bin Duan et.al. | 2311.05521 | link |
| 2023-11-28 | Continuously Controllable Facial Expression Editing in Talking Face Videos | Zhiyao Sun et.al. | 2209.08289 | null |
| 2023-11-20 | MemoryCompanion: A Smart Healthcare Solution to Empower Efficient Alzheimer's Care Via Unleashing Generative AI | Lifei Zheng et.al. | 2311.14730 | null |
| 2023-11-15 | CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding | Jianzong Wang et.al. | 2311.08673 | null |
| 2023-11-13 | DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation | Guinan Su et.al. | 2311.04766 | null |
| 2023-11-12 | ChatAnything: Facetime Chat with LLM-Enhanced Personas | Yilin Zhao et.al. | 2311.06772 | null |
| 2023-11-08 | Synthetic Speaking Children -- Why We Need Them and How to Make Them | Muhammad Ali Farooq et.al. | 2311.06307 | null |
| 2023-11-06 | RADIO: Reference-Agnostic Dubbing Video Synthesis | Dongyeun Lee et.al. | 2309.01950 | null |
| 2023-11-05 | 3D-Aware Talking-Head Video Motion Transfer | Haomiao Ni et.al. | 2311.02549 | null |
| 2023-11-03 | Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading | Songtao Luo et.al. | 2310.05058 | link |
| 2023-11-02 | LaughTalk: Expressive 3D Talking Head Generation with Laughter | Kim Sung-Bin et.al. | 2311.00994 | null |
| 2023-11-02 | High-Fidelity and Freely Controllable Talking Head Video Generation | Yue Gao et.al. | 2304.10168 | null |
| 2023-10-31 | Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape | Wei Zhao et.al. | 2310.20240 | null |
| 2023-10-29 | On the Vulnerability of DeepFake Detectors to Attacks Generated by Denoising Diffusion Models | Marija Ivanovska et.al. | 2307.05397 | null |
| 2023-10-25 | Personalized Speech-driven Expressive 3D Facial Animation Synthesis with Style Control | Elif Bozkurt et.al. | 2310.17011 | null |
| 2023-10-23 | The Self 2.0: How AI-Enhanced Self-Clones Transform Self-Perception and Improve Presentation Skills | Qingxiao Zheng et.al. | 2310.15112 | null |
| 2023-10-19 | Gemino: Practical and Robust Neural Compression for Video Conferencing | Vibhaalakshmi Sivaraman et.al. | 2209.10507 | null |
| 2023-10-17 | CorrTalk: Correlation Between Hierarchical Speech and Facial Activity Variances for 3D Animation | Zhaojie Chu et.al. | 2310.11295 | null |
| 2023-10-15 | HyperLips: Hyper Control Lips with High Resolution Decoder for Talking Face Generation | Yaosen Chen et.al. | 2310.05720 | link |
| 2023-10-12 | CleftGAN: Adapting A Style-Based Generative Adversarial Network To Create Images Depicting Cleft Lip Deformity | Abdullah Hayajneh et.al. | 2310.07969 | link |
| 2023-10-12 | Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation | Yuan Gan et.al. | 2309.04946 | link |
| 2023-10-08 | GestSync: Determining who is speaking without a talking head | Sindhu B Hegde et.al. | 2310.05304 | link |
| 2023-09-30 | DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models | Zhiyao Sun et.al. | 2310.00434 | null |
| 2023-09-28 | OSM-Net: One-to-Many One-shot Talking Head Generation with Spontaneous Head Motions | Jin Liu et.al. | 2309.16148 | null |
| 2023-09-26 | Emotional Speech-Driven Animation with Content-Emotion Disentanglement | Radek Daněček et.al. | 2306.08990 | null |
| 2023-09-20 | FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion | Stefan Stan et.al. | 2309.11306 | link |
| 2023-09-20 | Context-Aware Talking-Head Video Editing | Songlin Yang et.al. | 2308.00462 | null |
| 2023-09-18 | That's What I Said: Fully-Controllable Talking Face Generation | Youngjoon Jang et.al. | 2304.03275 | null |
| 2023-09-15 | Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech | Junjie Li et.al. | 2309.08408 | link |
| 2023-09-14 | DT-NeRF: Decomposed Triplane-Hash Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis | Yaoyu Su et.al. | 2309.07752 | null |
| 2023-09-14 | DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks | Zipeng Qi et.al. | 2309.07509 | null |
| 2023-09-14 | HDTR-Net: A Real-Time High-Definition Teeth Restoration Network for Arbitrary Talking Face Generation Methods | Yongyuan Li et.al. | 2309.07495 | link |
| 2023-09-13 | PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network | Qinghua Liu et.al. | 2309.06723 | null |
| 2023-09-12 | DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio Cross-Attention and Facial Self-Attention | Aaditya Kharel et.al. | 2309.06511 | null |
| 2023-09-12 | Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos | Ekta Prashnani et.al. | 2305.03713 | null |
| 2023-09-11 | ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment | Yicheng Zhong et.al. | 2308.14448 | null |
| 2023-09-10 | MaskRenderer: 3D-Infused Multi-Mask Realistic Face Reenactment | Tina Behrouzi et.al. | 2309.05095 | null |
| 2023-09-09 | Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video | Xiuzhe Wu et.al. | 2309.04814 | link |
| 2023-09-01 | Unsupervised Learning of Style-Aware Facial Animation from Real Acting Performances | Wolfgang Paier et.al. | 2306.10006 | null |
| 2023-08-30 | From Pixels to Portraits: A Comprehensive Survey of Talking Head Generation Techniques and Applications | Shreyank N Gowda et.al. | 2308.16041 | null |
| 2023-08-30 | SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces | Ziqiao Peng et.al. | 2306.10799 | link |
| 2023-08-30 | Laughing Matters: Introducing Laughing-Face Generation using Diffusion Models | Antoni Bigata Casademunt et.al. | 2305.08854 | link |
| 2023-08-29 | Papeos: Augmenting Research Papers with Talk Videos | Tae Soo Kim et.al. | 2308.15224 | null |
| 2023-08-25 | EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation | Ziqiao Peng et.al. | 2303.11089 | link |
| 2023-08-24 | ToonTalker: Cross-Domain Face Reenactment | Yuan Gong et.al. | 2308.12866 | null |
| 2023-08-24 | Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis | Jiahe Li et.al. | 2307.09323 | link |
| 2023-08-23 | DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion | Se Jin Park et.al. | 2310.05934 | null |
| 2023-08-21 | Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis | Tong Sha et.al. | 2109.02081 | null |
| 2023-08-18 | Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization | Soumik Mukhopadhyay et.al. | 2308.09716 | link |
| 2023-08-18 | Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head video Generation | Fa-Ting Hong et.al. | 2307.09906 | link |
| 2023-08-17 | A Survey on Deep Multi-modal Learning for Body Language Recognition and Generation | Li Liu et.al. | 2308.08849 | link |
| 2023-08-16 | Instruct-NeuralTalker: Editing Audio-Driven Talking Radiance Fields with Instructions | Yuqi Sun et.al. | 2306.10813 | null |
| 2023-08-12 | Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation | Zhichao Wang et.al. | 2308.06457 | link |
| 2023-08-12 | DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation | Yichao Yan et.al. | 2203.07931 | null |
| 2023-08-11 | Versatile Face Animator: Driving Arbitrary 3D Facial Avatar in RGBD Space | Haoyu Wang et.al. | 2308.06076 | link |
| 2023-08-11 | VAST: Vivify Your Talking Avatar via Zero-Shot Expressive Facial Style Transfer | Liyang Chen et.al. | 2308.04830 | null |
| 2023-08-10 | Near-realtime Facial Animation by Deep 3D Simulation Super-Resolution | Hyojoon Park et.al. | 2305.03216 | null |
| 2023-08-02 | Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis | Zhenhui Ye et.al. | 2306.03504 | null |
| 2023-07-29 | Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation | Michał Stypułkowski et.al. | 2301.03396 | null |
| 2023-07-26 | Learning Landmarks Motion from Speech for Speaker-Agnostic 3D Talking Heads Generation | Federico Nocentini et.al. | 2306.01415 | link |
| 2023-07-20 | HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces | Stella Bounareli et.al. | 2307.10797 | link |
| 2023-07-20 | MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions | Yunfei Liu et.al. | 2307.10008 | null |
| 2023-07-19 | Hierarchical Semantic Perceptual Listener Head Video Generation: A High-performance Pipeline | Zhigang Chang et.al. | 2307.09821 | null |
| 2023-07-19 | OPHAvatars: One-shot Photo-realistic Head Avatars | Shaoxu Li et.al. | 2307.09153 | link |
| 2023-07-18 | FACTS: Facial Animation Creation using the Transfer of Styles | Jack Saunders et.al. | 2307.09480 | null |
| 2023-07-09 | Predictive Coding For Animation-Based Video Compression | Goluck Konuko et.al. | 2307.04187 | null |
| 2023-07-08 | FTFDNet: Learning to Detect Talking Face Video Manipulation with Tri-Modality Interaction | Ganglai Wang et.al. | 2307.03990 | null |
| 2023-07-05 | Interactive Conversational Head Generation | Mohan Zhou et.al. | 2307.02090 | null |
| 2023-07-04 | A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony in Talking Head Generation | Louis Airale et.al. | 2307.03270 | link |
| 2023-07-04 | Generating Animatable 3D Cartoon Faces from Single Portraits | Chuanyu Pan et.al. | 2307.01468 | null |
| 2023-07-03 | RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations | Neha Sahipjohn et.al. | 2307.01233 | null |
| 2023-06-20 | Audio-Driven 3D Facial Animation from In-the-Wild Videos | Liying Lu et.al. | 2306.11541 | null |
| 2023-06-13 | Parametric Implicit Face Representation for Audio-Driven Facial Reenactment | Ricong Huang et.al. | 2306.07579 | null |
| 2023-06-13 | AniFaceDrawing: Anime Portrait Exploration during Your Sketching | Zhengyu Huang et.al. | 2306.07476 | null |
| 2023-06-12 | NPVForensics: Jointing Non-critical Phonemes and Visemes for Deepfake Detection | Yu Chen et.al. | 2306.06885 | null |
| 2023-06-10 | StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles | Yifeng Ma et.al. | 2301.01081 | link |
| 2023-06-08 | ReliableSwap: Boosting General Face Swapping Via Reliable Supervision | Ge Yuan et.al. | 2306.05356 | link |
| 2023-06-06 | Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks | Jianrong Wang et.al. | 2306.03594 | null |
| 2023-06-05 | Instruct-Video2Avatar: Video-to-Avatar Generation with Instructions | Shaoxu Li et.al. | 2306.02903 | link |
| 2023-05-31 | High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning | Chao Xu et.al. | 2305.02572 | null |
| 2023-05-23 | CPNet: Exploiting CLIP-based Attention Condenser and Probability Map Guidance for High-fidelity Talking Face Generation | Jingning Xu et.al. | 2305.13962 | null |
| 2023-05-22 | RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-fidelity Head Avatars | Dongwei Pan et.al. | 2305.13353 | link |
| 2023-05-19 | UniFLG: Unified Facial Landmark Generator from Text or Speech | Kentaro Mitsui et.al. | 2302.14337 | null |
| 2023-05-18 | An Android Robot Head as Embodied Conversational Agent | Marcel Heisler et.al. | 2305.10945 | null |
| 2023-05-18 | Audio-Visual Person-of-Interest DeepFake Detection | Davide Cozzolino et.al. | 2204.03083 | link |
| 2023-05-17 | INCLG: Inpainting for Non-Cleft Lip Generation with a Multi-Task Image Processing Network | Shuang Chen et.al. | 2305.10589 | null |
| 2023-05-17 | LPMM: Intuitive Pose Control for Neural Talking-Head Model via Landmark-Parameter Morphable Model | Kwangho Lee et.al. | 2305.10456 | null |
| 2023-05-15 | Identity-Preserving Talking Face Generation with Landmark and Appearance Priors | Weizhi Zhong et.al. | 2305.08293 | link |
| 2023-05-09 | Zero-shot personalized lip-to-speech synthesis with face image based voice control | Zheng-Yan Sheng et.al. | 2305.14359 | null |
| 2023-05-09 | StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator | Jiazhi Guan et.al. | 2305.05445 | null |
| 2023-05-09 | Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator | Chao Xu et.al. | 2305.02594 | null |
| 2023-05-01 | StyleAvatar: Real-time Photo-realistic Portrait Avatar from a Single Video | Lizhen Wang et.al. | 2305.00942 | link |
| 2023-05-01 | GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation | Zhenhui Ye et.al. | 2305.00787 | null |
| 2023-04-28 | A Unified Compression Framework for Efficient Speech-Driven Talking-Face Generation | Bo-Kyeong Kim et.al. | 2304.00471 | null |
| 2023-04-27 | Controllable One-Shot Face Video Synthesis With Semantic Aware Prior | Kangning Liu et.al. | 2304.14471 | null |
| 2023-04-25 | AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | Rongjie Huang et.al. | 2304.12995 | link |
| 2023-04-24 | VR Facial Animation for Immersive Telepresence Avatars | Andre Rochow et.al. | 2304.12051 | null |
| 2023-04-21 | Implicit Neural Head Synthesis via Controllable Local Deformation Fields | Chuhan Chen et.al. | 2304.11113 | null |
| 2023-04-20 | DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation | Shuai Shen et.al. | 2301.03786 | link |
| 2023-04-18 | Audio-Driven Talking Face Generation with Diverse yet Realistic Facial Animations | Rongliang Wu et.al. | 2304.08945 | null |
| 2023-04-17 | Autoregressive GAN for Semantic Unconditional Head Motion Generation | Louis Airale et.al. | 2211.00987 | link |
| 2023-04-11 | One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field | Weichuang Li et.al. | 2304.05097 | null |
| 2023-04-06 | Face Animation with an Attribute-Guided Diffusion Model | Bohan Zeng et.al. | 2304.03199 | link |
| 2023-04-06 | 4D Agnostic Real-Time Facial Animation Pipeline for Desktop Scenarios | Wei Chen et.al. | 2304.02814 | null |
| 2023-04-03 | CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior | Jinbo Xing et.al. | 2301.02379 | link |
| 2023-04-01 | DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance | Longwen Zhang et.al. | 2304.03117 | null |
| 2023-04-01 | TalkCLIP: Talking Head Generation with Text-Guided Expressive Speaking Styles | Yifeng Ma et.al. | 2304.00334 | null |
| 2023-03-31 | FONT: Flow-guided One-shot Talking Head Generation with Natural Head Motions | Jin Liu et.al. | 2303.17789 | null |
| 2023-03-31 | Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert | Jiadong Wang et.al. | 2303.17480 | null |
| 2023-03-27 | OmniAvatar: Geometry-Guided Controllable 3D Head Synthesis | Hongyi Xu et.al. | 2303.15539 | null |
| 2023-03-27 | Accurate and Interpretable Solution of the Inverse Rig for Realistic Blendshape Models with Quadratic Corrective Terms | Stevo Racković et.al. | 2302.04843 | null |
| 2023-03-27 | MetaPortrait: Identity-Preserving Talking Head Generation with Fast Personalized Adaptation | Bowen Zhang et.al. | 2212.08062 | link |
| 2023-03-27 | A Majorization-Minimization Based Method for Nonconvex Inverse Rig Problems in Facial Animation: Algorithm Derivation | Stevo Racković et.al. | 2205.04289 | null |
| 2023-03-26 | OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering | Zhiyuan Ma et.al. | 2303.14662 | link |
| 2023-03-26 | Emotionally Enhanced Talking Face Generation | Sahil Goyal et.al. | 2303.11548 | link |
| 2023-03-26 | Distributed Solution of the Inverse Rig Problem in Blendshape Facial Animation | Stevo Racković et.al. | 2303.06370 | null |
| 2023-03-24 | Synthesizing Photorealistic Virtual Humans Through Cross-modal Disentanglement | Siddarth Ravichandran et.al. | 2209.01320 | null |
| 2023-03-23 | PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360° | Sizhe An et.al. | 2303.13071 | null |
| 2023-03-22 | Style Transfer for 2D Talking Head Animation | Trong-Thang Pham et.al. | 2303.09799 | link |
| 2023-03-22 | MARLIN: Masked Autoencoder for facial video Representation LearnINg | Zhixi Cai et.al. | 2211.06627 | link |
| 2023-03-14 | DisCoHead: Audio-and-Video-Driven Talking Head Generation by Disentangled Control of Head Pose and Facial Expressions | Geumbyeol Hwang et.al. | 2303.07697 | link |
| 2023-03-13 | SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation | Wenxuan Zhang et.al. | 2211.12194 | link |
| 2023-03-09 | FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning | Kazi Injamamul Haque et.al. | 2303.05416 | link |
| 2023-03-09 | Improving Few-Shot Learning for Talking Face System with TTS Data Augmentation | Qi Chen et.al. | 2303.05322 | link |
| 2023-03-07 | DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video | Zhimeng Zhang et.al. | 2303.03988 | link |
| 2023-03-05 | Cyber Vaccine for Deepfake Immunity | Ching-Chun Chang et.al. | 2303.02659 | null |
| 2023-03-04 | High-fidelity Facial Avatar Reconstruction from Monocular Video with Generative Priors | Yunpeng Bai et.al. | 2211.15064 | null |
| 2023-03-01 | DPE: Disentanglement of Pose and Expression for General Video Portrait Editing | Youxin Pang et.al. | 2301.06281 | link |
| 2023-02-27 | Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video | Minsu Kim et.al. | 2303.08670 | null |
| 2023-02-27 | Memory-augmented Contrastive Learning for Talking Head Generation | Jianrong Wang et.al. | 2302.13469 | link |
| 2023-02-24 | Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention | Bin Liu et.al. | 2302.12532 | null |
| 2023-02-16 | OPT: One-shot Pose-Controllable Talking Head Generation | Jin Liu et.al. | 2302.08197 | null |
| 2023-02-14 | Expressive Talking Head Video Encoding in StyleGAN2 Latent-Space | Trevine Oorloff et.al. | 2203.14512 | link |
| 2023-01-31 | GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis | Zhenhui Ye et.al. | 2301.13430 | null |
| 2023-01-23 | Data standardization for robust lip sync | Chun Wang et.al. | 2202.06198 | null |
| 2023-01-20 | Neural Volumetric Blendshapes: Computationally Efficient Physics-Based Facial Blendshapes | Nicolas Wagner et.al. | 2212.14784 | null |
| 2023-01-15 | Learning Audio-Driven Viseme Dynamics for 3D Face Animation | Linchao Bao et.al. | 2301.06059 | null |
| 2022-12-30 | Imitator: Personalized Speech-driven 3D Facial Animation | Balamurugan Thambiraja et.al. | 2301.00023 | null |
| 2022-12-28 | All's well that FID's well? Result quality and metric scores in GAN models for lip-sychronization tasks | Carina Geldhauser et.al. | 2212.13810 | null |
| 2022-12-23 | Dubbing in Practice: A Large Scale Study of Human Localization With Insights for Automatic Dubbing | William Brannon et.al. | 2212.12137 | null |
| 2022-12-09 | Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers | Yasheng Sun et.al. | 2212.04970 | null |
| 2022-12-07 | Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors | Zhentao Yu et.al. | 2212.04248 | null |
| 2022-12-07 | SPACE: Speech-driven Portrait Animation with Controllable Expression | Siddharth Gururani et.al. | 2211.09809 | null |
| 2022-11-30 | Extracting Semantic Knowledge from GANs with Unsupervised Learning | Jianjin Xu et.al. | 2211.16710 | null |
| 2022-11-29 | VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild | Kun Cheng et.al. | 2211.14758 | null |
| 2022-11-26 | Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis | Duomin Wang et.al. | 2211.14506 | link |
| 2022-11-22 | Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition | Jiaxiang Tang et.al. | 2211.12368 | null |
| 2022-11-10 | On the role of Lip Articulation in Visual Speech Perception | Zakaria Aldeneh et.al. | 2203.10117 | null |
| 2022-11-04 | SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory | Se Jin Park et.al. | 2211.00924 | null |
| 2022-10-21 | Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection | Alexandros Haliassos et.al. | 2201.07131 | link |
| 2022-10-14 | Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar | Aolan Sun et.al. | 2210.06877 | null |
| 2022-10-13 | Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors | Vladimir Iashin et.al. | 2210.07055 | link |
| 2022-10-07 | Compressing Video Calls using Synthetic Talking Heads | Madhav Agarwal et.al. | 2210.03692 | null |
| 2022-10-07 | A Keypoint Based Enhancement Method for Audio Driven Free View Talking Head Synthesis | Yichen Han et.al. | 2210.03335 | null |
| 2022-10-06 | Audio-Visual Face Reenactment | Madhav Agarwal et.al. | 2210.02755 | link |
| 2022-10-06 | Finding Directions in GAN's Latent Space for Neural Face Reenactment | Stella Bounareli et.al. | 2202.00046 | link |
| 2022-10-04 | Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale | Aditya Agarwal et.al. | 2208.09796 | null |
| 2022-09-29 | Facial Landmark Predictions with Applications to Metaverse | Qiao Han et.al. | 2209.14698 | link |
| 2022-09-27 | StyleMask: Disentangling the Style Space of StyleGAN2 for Neural Face Reenactment | Stella Bounareli et.al. | 2209.13375 | link |
| 2022-09-23 | EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model | Xinya Ji et.al. | 2205.15278 | null |
| 2022-09-21 | FNeVR: Neural Volume Rendering for Face Animation | Bohan Zeng et.al. | 2209.10340 | link |
| 2022-09-19 | AutoLV: Automatic Lecture Video Generator | Wenbin Wang et.al. | 2209.08795 | null |
| 2022-09-09 | Talking Head from Speech Audio using a Pre-trained Image Generator | Mohammed M. Alghamdi et.al. | 2209.04252 | null |
| 2022-09-07 | Restructurable Activation Networks | Kartikeya Bhardwaj et.al. | 2208.08562 | link |
| 2022-08-29 | StableFace: Analyzing and Improving Motion Stability for Talking Face Generation | Jun Ling et.al. | 2208.13717 | null |
| 2022-08-17 | Extreme-scale Talking-Face Video Upsampling with Audio-Visual Priors | Sindhu B Hegde et.al. | 2208.08118 | link |
| 2022-08-03 | Free-HeadGAN: Neural Talking Head Synthesis with Explicit Gaze Control | Michail Christos Doukas et.al. | 2208.02210 | null |
| 2022-08-02 | Perceptual Conversational Head Generation with Regularized Driver and Enhanced Renderer | Ailin Huang et.al. | 2206.12837 | link |
| 2022-08-01 | A Feasibility Study on Image Inpainting for Non-cleft Lip Generation from Patients with Cleft Lip | Shuang Chen et.al. | 2208.01149 | link |
| 2022-07-27 | A Hybrid Deep Animation Codec for Low-bitrate Video Conferencing | Goluck Konuko et.al. | 2207.13530 | null |
| 2022-07-24 | Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis | Shuai Shen et.al. | 2207.11770 | link |
| 2022-07-22 | Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos | Panagiotis P. Filntisis et.al. | 2207.11094 | link |
| 2022-07-20 | NARRATE: A Normal Assisted Free-View Portrait Stylizer | Youjia Wang et.al. | 2207.00974 | null |
| 2022-07-20 | VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection | Joanna Hong et.al. | 2206.07458 | null |
| 2022-07-20 | Responsive Listening Head Generation: A Benchmark Dataset and Baseline | Mohan Zhou et.al. | 2112.13548 | null |
| 2022-07-13 | FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis | Yongqi Wang et.al. | 2207.03800 | link |
| 2022-06-29 | Cut Inner Layers: A Structured Pruning Strategy for Efficient U-Net GANs | Bo-Kyeong Kim et.al. | 2206.14658 | null |
| 2022-06-09 | Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos | Alexander Waibel et.al. | 2206.04523 | null |
| 2022-05-31 | Text/Speech-Driven Full-Body Animation | Wenlin Zhuang et.al. | 2205.15573 | null |
| 2022-05-27 | Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast | Boqing Zhu et.al. | 2204.14057 | link |
| 2022-05-26 | One-Shot Face Reenactment on Megapixels | Wonjun Kang et.al. | 2205.13368 | null |
| 2022-05-24 | Merkel Podcast Corpus: A Multimodal Dataset Compiled from 16 Years of Angela Merkel's Weekly Video Podcasts | Debjoy Saha et.al. | 2205.12194 | link |
| 2022-05-20 | MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement | Alexander Richard et.al. | 2104.08223 | link |
| 2022-05-13 | Talking Face Generation with Multilingual TTS | Hyoung-Kyu Song et.al. | 2205.06421 | null |
| 2022-05-02 | Emotion-Controllable Generalized Talking Face Generation | Sanjana Sinha et.al. | 2205.01155 | null |
| 2022-05-02 | A Novel Speech-Driven Lip-Sync Model with CNN and LSTM | Xiaohong Li et.al. | 2205.00916 | null |
| 2022-04-27 | Talking Head Generation Driven by Speech-Related Facial Action Units and Audio- Based on Multimodal Representation Fusion | Sen Chen et.al. | 2204.12756 | null |
| 2022-04-25 | Fast Facial Landmark Detection and Applications: A Survey | Kostiantyn Khabarlak et.al. | 2101.10808 | null |
| 2022-04-13 | Dynamic Neural Textures: Generating Talking-Face Videos with Continuously Controllable Expressions | Zipeng Ye et.al. | 2204.06180 | null |
| 2022-04-12 | Attention-Based Lip Audio-Visual Synthesis for Talking Face Generation in the Wild | Ganglai Wang et.al. | 2203.03984 | null |
| 2022-04-06 | Transformer-S2A: Robust and Efficient Speech-to-Animation | Liyang Chen et.al. | 2111.09771 | null |
| 2022-04-03 | Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text | Pulkit Tandon et.al. | 2106.14014 | link |
| 2022-03-30 | End to End Lip Synchronization with a Temporal AutoEncoder | Yoav Shalev et.al. | 2203.16224 | link |
| 2022-03-29 | Thin-Plate Spline Motion Model for Image Animation | Jian Zhao et.al. | 2203.14367 | link |
| 2022-03-17 | StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN | Fei Yin et.al. | 2203.04036 | link |
| 2022-03-17 | FaceFormer: Speech-Driven 3D Facial Animation with Transformers | Yingruo Fan et.al. | 2112.05329 | link |
| 2022-03-16 | Efficient conditioned face animation using frontally-viewed embedding | Maxime Oquab et.al. | 2203.08765 | null |
| 2022-03-15 | Depth-Aware Generative Adversarial Network for Talking Head Video Generation | Fa-Ting Hong et.al. | 2203.06605 | link |
| 2022-03-10 | An Audio-Visual Attention Based Multimodal Network for Fake Talking Face Videos Detection | Ganglai Wang et.al. | 2203.05178 | null |
| 2022-03-04 | Multi-modality Deep Restoration of Extremely Compressed Face Videos | Xi Zhang et.al. | 2107.05548 | null |
| 2022-03-01 | FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset | Hasam Khalid et.al. | 2108.05080 | link |
| 2022-02-25 | FSGANv2: Improved Subject Agnostic Face Swapping and Reenactment | Yuval Nirkin et.al. | 2202.12972 | null |
| 2022-02-22 | Thinking the Fusion Strategy of Multi-reference Face Reenactment | Takuya Yashima et.al. | 2202.10758 | null |
| 2022-01-24 | Selective Listening by Synchronizing Speech with Lips | Zexu Pan et.al. | 2106.07150 | link |
| 2022-01-22 | Text2Video: Text-driven Talking-head Video Synthesis with Personalized Phoneme-Pose Dictionary | Sibo Zhang et.al. | 2104.14631 | null |
| 2022-01-21 | Stitch it in Time: GAN-Based Facial Editing of Real Videos | Rotem Tzaban et.al. | 2201.08361 | link |
| 2022-01-17 | Towards Realistic Visual Dubbing with Heterogeneous Sources | Tianyi Xie et.al. | 2201.06260 | null |
| 2022-01-16 | Audio-Driven Talking Face Video Generation with Dynamic Convolution Kernels | Zipeng Ye et.al. | 2201.05986 | null |
| 2022-01-03 | DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering | Shunyu Yao et.al. | 2201.00791 | null |
| 2021-12-20 | Parallel and High-Fidelity Text-to-Lip Generation | Jinglin Liu et.al. | 2107.06831 | link |
| 2021-12-19 | Initiative Defense against Facial Manipulation | Qidong Huang et.al. | 2112.10098 | link |
| 2021-12-07 | Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation | Yingruo Fan et.al. | 2112.02214 | null |
| 2021-12-06 | One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning | Suzhen Wang et.al. | 2112.02749 | null |
| 2021-11-29 | Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates | Shenhan Qian et.al. | 2108.08020 | link |
| 2021-11-04 | FEAFA+: An Extended Well-Annotated Dataset for Facial Expression Analysis and 3D Facial Animation | Wei Gan et.al. | 2111.02751 | null |
| 2021-11-02 | BiosecurID: a multimodal biometric database | Julian Fierrez et.al. | 2111.03472 | null |
| 2021-10-30 | Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis | Haozhe Wu et.al. | 2111.00203 | link |
| 2021-10-26 | Emotion recognition in talking-face videos using persistent entropy and neural networks | Eduardo Paluzo-Hidalgo et.al. | 2110.13571 | link |
| 2021-10-26 | ViDA-MAN: Visual Dialog with Digital Humans | Tong Shen et.al. | 2110.13384 | null |
| 2021-10-22 | Invertible Frowns: Video-to-Video Facial Emotion Translation | Ian Magnusson et.al. | 2109.08061 | null |
| 2021-10-19 | Talking Head Generation with Audio and Speech Related Facial Action Units | Sen Chen et.al. | 2110.09951 | null |
| 2021-10-16 | Intelligent Video Editing: Incorporating Modern Talking Face Generation Algorithms in a Video Editor | Anchit Gupta et.al. | 2110.08580 | null |
| 2021-10-12 | Fine-grained Identity Preserving Landmark Synthesis for Face Reenactment | Haichao Zhang et.al. | 2110.04708 | null |
| 2021-10-07 | Streaming Transformer Transducer Based Speech Recognition Using Non-Causal Convolution | Yangyang Shi et.al. | 2110.05241 | null |
| 2021-09-24 | Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation | Yuanxun Lu et.al. | 2109.10595 | null |
| 2021-09-20 | Accurate, Interpretable, and Fast Animation: An Iterative, Sparse, and Nonconvex Approach | Stevo Rackovic et.al. | 2109.08356 | null |
| 2021-09-17 | Detection of GAN-synthesized street videos | Omran Alamayreh et.al. | 2109.04991 | null |
| 2021-08-30 | Audiovisual Speech Synthesis using Tacotron2 | Ahmed Hussen Abdelaziz et.al. | 2008.00620 | null |
| 2021-08-23 | KoDF: A Large-scale Korean DeepFake Detection Dataset | Patrick Kwon et.al. | 2103.10094 | null |
| 2021-08-23 | HeadGAN: One-shot Neural Head Synthesis and Editing | Michail Christos Doukas et.al. | 2012.08261 | null |
| 2021-08-19 | AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis | Yudong Guo et.al. | 2103.11078 | link |
| 2021-08-18 | DeepFake MNIST+: A DeepFake Facial Animation Dataset | Jiajun Huang et.al. | 2108.07949 | link |
| 2021-08-18 | FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning | Chenxu Zhang et.al. | 2108.07938 | link |
| 2021-08-12 | UniFaceGAN: A Unified Framework for Temporally Consistent Facial Video Editing | Meng Cao et.al. | 2108.05650 | null |
| 2021-08-11 | AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person | Xinsheng Wang et.al. | 2108.04325 | null |
| 2021-08-06 | SofGAN: A Portrait Image Generator with Dynamic Styling | Anpei Chen et.al. | 2007.03780 | link |
| 2021-07-27 | Beyond Voice Identity Conversion: Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations | Laurent Benaroya et.al. | 2107.12346 | null |
| 2021-07-21 | Speech Driven Talking Face Generation from a Single Image and an Emotion Condition | Sefik Emre Eskimez et.al. | 2008.03592 | link |
| 2021-07-20 | Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion | Suzhen Wang et.al. | 2107.09293 | link |
| 2021-07-10 | Speech2Video: Cross-Modal Distillation for Speech to Video Generation | Shijing Si et.al. | 2107.04806 | null |
| 2021-07-07 | Egocentric Videoconferencing | Mohamed Elgharib et.al. | 2107.03109 | null |
| 2021-06-09 | LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization | Avisek Lahiri et.al. | 2106.04185 | null |
| 2021-05-20 | Audio-Driven Emotional Video Portraits | Xinya Ji et.al. | 2104.07452 | null |
| 2021-05-07 | Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation | Lincheng Li et.al. | 2104.07995 | link |
| 2021-05-05 | A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors | Ruobing Zheng et.al. | 2002.08700 | null |
| 2021-04-29 | Learned Spatial Representations for Few-shot Talking-Head Synthesis | Moustafa Meshry et.al. | 2104.14557 | null |
| 2021-04-26 | One-shot Face Reenactment Using Appearance Adaptive Normalization | Guangming Yao et.al. | 2102.03984 | null |
| 2021-04-25 | 3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head | Qianyun Wang et.al. | 2104.12051 | null |
| 2021-04-23 | Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation | Hang Zhou et.al. | 2104.11116 | null |
| 2021-04-07 | Single Source One Shot Reenactment using Weighted motion From Paired Feature Points | Soumya Tripathy et.al. | 2104.03117 | null |
| 2021-04-07 | Everything's Talkin': Pareidolia Face Reenactment | Linsen Song et.al. | 2104.03061 | link |
| 2021-04-07 | LI-Net: Large-Pose Identity-Preserving Face Reenactment Network | Jin Liu et.al. | 2104.02850 | null |
| 2021-04-02 | One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing | Ting-Chun Wang et.al. | 2011.15126 | null |
| 2021-03-20 | Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization | Komal Chugh et.al. | 2005.14405 | link |
| 2021-03-19 | End-to-End Lip Synchronisation Based on Pattern Classification | You Jin Kim et.al. | 2005.08606 | null |
| 2021-03-05 | Real-time RGBD-based Extended Body Pose Estimation | Renat Bashirov et.al. | 2103.03663 | link |
| 2021-03-03 | Estimating Uniqueness of I-Vector Representation of Human Voice | Erkam Sinan Tandogan et.al. | 2008.11985 | null |
| 2021-02-25 | MakeItTalk: Speaker-Aware Talking-Head Animation | Yang Zhou et.al. | 2004.12992 | null |
| 2021-02-19 | One Shot Audio to Animated Video Generation | Neeraj Kumar et.al. | 2102.09737 | null |
| 2021-02-18 | AudioVisual Speech Synthesis: A brief literature review | Efthymios Georgiou et.al. | 2103.03927 | null |
| 2020-12-14 | Robust One Shot Audio to Video Generation | Neeraj Kumar et.al. | 2012.07842 | null |
| 2020-12-14 | Multi Modal Adaptive Normalization for Audio to Video Generation | Neeraj Kumar et.al. | 2012.07304 | null |
| 2020-11-30 | Adaptive Compact Attention For Few-shot Video-to-video Translation | Risheng Huang et.al. | 2011.14695 | null |
| 2020-11-21 | Stochastic Talking Face Generation Using Latent Distribution Matching | Ravindra Yadav et.al. | 2011.10727 | link |
| 2020-11-21 | Iterative Text-based Editing of Talking-heads Using Neural Retargeting | Xinwei Yao et.al. | 2011.10688 | null |
| 2020-11-09 | FACEGAN: Facial Attribute Controllable rEenactment GAN | Soumya Tripathy et.al. | 2011.04439 | null |
| 2020-11-06 | Large-scale multilingual audio visual dubbing | Yi Yang et.al. | 2011.03530 | null |
| 2020-11-02 | Facial Keypoint Sequence Generation from Audio | Prateek Manocha et.al. | 2011.01114 | null |
| 2020-10-25 | APB2FaceV2: Real-Time Audio-Guided Multi-Face Reenactment | Jiangning Zhang et.al. | 2010.13017 | link |
| 2020-10-12 | Intuitive Facial Animation Editing Based On A Generative RNN Framework | Eloïse Berson et.al. | 2010.05655 | null |
| 2020-10-05 | SMILE: Semantically-guided Multi-attribute Image and Layout Editing | Andrés Romero et.al. | 2010.02315 | link |
| 2020-10-05 | Dynamic Facial Asset and Rig Generation from a Single Scan | Jiaman Li et.al. | 2010.00560 | null |
| 2020-09-20 | An Improved Approach of Intention Discovery with Machine Learning for POMDP-based Dialogue Management | Ruturaj Raval et.al. | 2009.09354 | null |
| 2020-09-18 | Mesh Guided One-shot Face Reenactment using Graph Convolutional Networks | Guangming Yao et.al. | 2008.07783 | null |
| 2020-09-12 | DualLip: A System for Joint Lip Reading and Generation | Weicong Chen et.al. | 2009.05784 | null |
| 2020-09-02 | Seeing wake words: Audio-visual Keyword Spotting | Liliane Momeni et.al. | 2009.01225 | null |
| 2020-08-29 | "It took me almost 30 minutes to practice this". Performance and Production Practices in Dance Challenge Videos on TikTok | Daniel Klug et.al. | 2008.13040 | null |
| 2020-08-25 | A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild | K R Prajwal et.al. | 2008.10010 | null |
| 2020-08-11 | Audio- and Gaze-driven Facial Animation of Codec Avatars | Alexander Richard et.al. | 2008.05023 | null |
| 2020-08-04 | Speaker dependent acoustic-to-articulatory inversion using real-time MRI of the vocal tract | Tamás Gábor Csapó et.al. | 2008.02098 | link |
| 2020-08-04 | Real-Time Cleaning and Refinement of Facial Animation Signals | Eloïse Berson et.al. | 2008.01332 | null |
| 2020-08-02 | Deep Multi-modality Soft-decoding of Very Low Bit-rate Face Videos | Yanhui Guo et.al. | 2008.01652 | null |
| 2020-07-29 | Neural Voice Puppetry: Audio-driven Facial Reenactment | Justus Thies et.al. | 1912.05566 | link |
| 2020-07-20 | Deformable Style Transfer | Sunnie S. Y. Kim et.al. | 2003.11038 | link |
| 2020-07-18 | A Robust Interactive Facial Animation Editing System | Eloïse Berson et.al. | 2007.09367 | null |
| 2020-07-16 | Talking-head Generation with Rhythmic Head Motion | Lele Chen et.al. | 2007.08547 | link |
| 2020-07-08 | Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision | Abhinav Shukla et.al. | 2007.04134 | null |
| 2020-06-20 | Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams | Huirong Huang et.al. | 2006.11610 | null |
| 2020-05-27 | Modality Dropout for Improved Performance-driven Talking Faces | Ahmed Hussen Abdelaziz et.al. | 2005.13616 | null |
| 2020-05-25 | Identity-Preserving Realistic Talking Face Generation | Sanjana Sinha et.al. | 2005.12318 | null |
| 2020-05-22 | Head2Head: Video-based Neural Head Synthesis | Mohammad Rami Koujan et.al. | 2005.10954 | null |
| 2020-05-16 | FReeNet: Multi-Identity Face Reenactment | Jiangning Zhang et.al. | 1905.11805 | null |
| 2020-05-13 | FaR-GAN for One-Shot Face Reenactment | Hanxiang Hao et.al. | 2005.06402 | null |
| 2020-05-13 | Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning | Hao Zhu et.al. | 1812.06589 | null |
| 2020-05-11 | Dancing to the Partisan Beat: A First Analysis of Political Communication on TikTok | Juan Carlos Medina Serrano et.al. | 2004.05478 | link |
| 2020-05-07 | What comprises a good talking-head video generation?: A Survey and Benchmark | Lele Chen et.al. | 2005.03201 | link |
| 2020-05-04 | Disentangled Speech Embeddings using Cross-modal Self-supervision | Arsha Nagrani et.al. | 2002.08742 | null |
| 2020-04-30 | APB2Face: Audio-guided face reenactment with auxiliary pose and blink signals | Jiangning Zhang et.al. | 2004.14569 | null |
| 2020-03-30 | ActGAN: Flexible and Efficient One-shot Face Reenactment | Ivan Kosarevych et.al. | 2003.13840 | null |
| 2020-03-29 | Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose | Xianfang Zeng et.al. | 2003.12957 | null |
| 2020-03-26 | High-Accuracy Facial Depth Models derived from 3D Synthetic Data | Faisal Khan et.al. | 2003.06211 | null |
| 2020-03-06 | Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose | Ran Yi et.al. | 2002.10137 | null |
| 2020-03-05 | Talking-Heads Attention | Noam Shazeer et.al. | 2003.02436 | link |
| 2020-03-01 | Towards Automatic Face-to-Face Translation | Prajwal K R et.al. | 2003.00418 | link |
| 2020-02-19 | Speech-driven facial animation using polynomial fusion of features | Triantafyllos Kefalas et.al. | 1912.05833 | null |
| 2020-01-17 | ICface: Interpretable and Controllable Face Reenactment Using GANs | Soumya Tripathy et.al. | 1904.01909 | null |
| 2019-12-20 | Disentangling Style and Content in Anime Illustrations | Sitao Xiang et.al. | 1905.10742 | null |
| 2019-11-21 | FLNet: Landmark Driven Fetching and Learning Network for Faithful Talking Facial Animation Synthesis | Kuangxiao Gu et.al. | 1911.09224 | null |
| 2019-11-19 | MarioNETte: Few-shot Face Reenactment Preserving Identity of Unseen Targets | Sungjoo Ha et.al. | 1911.08139 | null |
| 2019-10-28 | Few-shot Video-to-Video Synthesis | Ting-Chun Wang et.al. | 1910.12713 | null |
| 2019-10-19 | Real-Time Lip Sync for Live 2D Animation | Deepali Aneja et.al. | 1910.08685 | link |
| 2019-10-16 | Designing Style Matching Conversational Agents | Deepali Aneja et.al. | 1910.07514 | null |
| 2019-10-15 | A High-Fidelity Open Embodied Avatar with Lip Syncing and Expression Capabilities | Deepali Aneja et.al. | 1909.08766 | link |
| 2019-10-09 | EmoCo: Visual Analysis of Emotion Coherence in Presentation Videos | Haipeng Zeng et.al. | 1907.12918 | null |
| 2019-10-02 | Animating Face using Disentangled Audio Representations | Gaurav Mittal et.al. | 1910.00726 | null |
| 2019-09-25 | Few-Shot Adversarial Learning of Realistic Neural Talking Head Models | Egor Zakharov et.al. | 1905.08233 | null |
| 2019-09-06 | Neural Style-Preserving Visual Dubbing | Hyeongwoo Kim et.al. | 1909.02518 | null |
| 2019-08-29 | 3D Face Pose and Animation Tracking via Eigen-Decomposition based Bayesian Approach | Ngoc-Trung Tran et.al. | 1908.11039 | null |
| 2019-08-20 | Prosodic Phrase Alignment for Machine Dubbing | Alp Öktem et.al. | 1908.07226 | link |
| 2019-08-16 | FSGAN: Subject Agnostic Face Swapping and Reenactment | Yuval Nirkin et.al. | 1908.05932 | link |
| 2019-08-11 | Emotion Dependent Facial Animation from Affective Speech | Rizwan Sadiq et.al. | 1908.03904 | null |
| 2019-08-05 | One-shot Face Reenactment | Yunxuan Zhang et.al. | 1908.03251 | link |
| 2019-07-25 | Talking Face Generation by Conditional Recurrent Adversarial Network | Yang Song et.al. | 1804.04786 | link |
| 2019-07-24 | Data-Driven Physical Face Inversion | Yeara Kozlov et.al. | 1907.10402 | null |
| 2019-07-23 | A system for efficient 3D printed stop-motion face animation | Rinat Abdrashitov et.al. | 1907.10163 | null |
| 2019-06-14 | Realistic Speech-Driven Facial Animation with GANs | Konstantinos Vougioukas et.al. | 1906.06337 | null |
| 2019-06-04 | Text-based Editing of Talking-head Video | Ohad Fried et.al. | 1906.01524 | null |
| 2019-05-27 | Audio2Face: Generating Speech/Face Animation from Single Audio with Attention-Based Bidirectional LSTM Networks | Guanzhong Tian et.al. | 1905.11142 | null |
| 2019-05-09 | Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss | Lele Chen et.al. | 1905.03820 | link |
| 2019-05-08 | Capture, Learning, and Synthesis of 3D Speaking Styles | Daniel Cudeiro et.al. | 1905.03079 | link |
| 2019-04-23 | Talking Face Generation by Adversarially Disentangled Audio-Visual Representation | Hang Zhou et.al. | 1807.07860 | null |
| 2019-04-02 | FEAFA: A Well-Annotated Dataset for Facial Expression Analysis and 3D Facial Animation | Yanfu Yan et.al. | 1904.01509 | null |
| 2019-03-13 | Animating an Autonomous 3D Talking Avatar | Dominik Borer et.al. | 1903.05448 | null |
| 2018-12-22 | Deep Audio-Visual Speech Recognition | Triantafyllos Afouras et.al. | 1809.02108 | null |
| 2018-12-20 | DeepFakes: a New Threat to Face Recognition? Assessment and Detection | Pavel Korshunov et.al. | 1812.08685 | null |
| 2018-11-22 | Towards Highly Accurate and Stable Face Alignment for High-Resolution Videos | Ying Tai et.al. | 1811.00342 | link |
| 2018-11-16 | Influence of visual cues on head and eye movements during listening tasks in multi-talker audiovisual environments with animated characters | Maartje M. E. Hendrikse et.al. | 1812.02088 | null |
| 2018-08-28 | GANimation: Anatomically-aware Facial Animation from a Single Image | Albert Pumarola et.al. | 1807.09251 | link |
| 2018-08-19 | Dynamic Temporal Alignment of Speech to Lips | Tavi Halperin et.al. | 1808.06250 | link |
| 2018-07-29 | ReenactGAN: Learning to Reenact Faces via Boundary Transfer | Wayne Wu et.al. | 1807.11079 | link |
| 2018-07-26 | Learnable PINs: Cross-Modal Embeddings for Person Identity | Arsha Nagrani et.al. | 1805.00833 | null |
| 2018-07-19 | End-to-End Speech-Driven Facial Animation with Temporal GANs | Konstantinos Vougioukas et.al. | 1805.09313 | null |
| 2018-05-29 | Deep Video Portraits | Hyeongwoo Kim et.al. | 1805.11714 | null |
| 2018-05-24 | VisemeNet: Audio-Driven Animator-Centric Speech Animation | Yang Zhou et.al. | 1805.09488 | null |
| 2018-05-21 | Anime Style Space Exploration Using Metric Learning and Generative Adversarial Networks | Sitao Xiang et.al. | 1805.07997 | null |
| 2018-04-23 | Generating Talking Face Landmarks from Speech | Sefik Emre Eskimez et.al. | 1803.09803 | null |
| 2018-03-28 | Generative Adversarial Talking Head: Bringing Portraits to Life with a Weakly Supervised Neural Network | Hai X. Pham et.al. | 1803.07716 | null |
| 2018-03-20 | Speech-Driven Facial Reenactment Using Conditional Generative Adversarial Networks | Seyed Ali Jalalifar et.al. | 1803.07461 | null |
| 2017-12-07 | End-to-end Learning for 3D Facial Animation from Raw Waveforms of Speech | Hai X. Pham et.al. | 1710.00920 | null |
| 2017-12-06 | ObamaNet: Photo-realistic lip-sync from text | Rithesh Kumar et.al. | 1801.01442 | null |
| 2017-07-30 | Kernel Projection of Latent Structures Regression for Facial Animation Retargeting | Christos Ouzounis et.al. | 1707.09629 | null |
| 2017-07-26 | Fast Deep Matting for Portrait Animation on Mobile Phone | Bingke Zhu et.al. | 1707.08289 | null |
| 2017-07-21 | Multichannel Attention Network for Analyzing Visual Behavior in Public Speaking | Rahul Sharma et.al. | 1707.06830 | null |
| 2017-07-18 | You said that? | Joon Son Chung et.al. | 1705.02966 | null |
| 2017-01-30 | Lip Reading Sentences in the Wild | Joon Son Chung et.al. | 1611.05358 | link |
| 2016-10-28 | Galaxy gas as obscurer: II. Separating the galaxy-scale and nuclear obscurers of Active Galactic Nuclei | Johannes Buchner et.al. | 1610.09380 | link |
| 2016-07-11 | Large-Scale MIMO is Capable of Eliminating Power-Thirsty Channel Coding for Wireless Transmission of HEVC/H.265 Video | Shaoshi Yang et.al. | 1601.06684 | null |
| 2016-05-22 | Improving Facial Analysis and Performance Driven Animation through Disentangling Identity and Expression | David Rim et.al. | 1512.08212 | null |
| 2016-02-08 | Automatic Face Reenactment | Pablo Garrido et.al. | 1602.02651 | null |
| 2015-11-20 | ExpressionBot: An Emotive Lifelike Robotic Face for Face-to-Face Communication | Ali Mollahosseini et.al. | 1511.06502 | null |
| 2014-09-03 | Visual Speech Recognition | Ahmad B. A. Hassanat et.al. | 1409.1411 | null |
| 2012-09-22 | Using multimodal speech production data to evaluate articulatory animation for audiovisual speech synthesis | Ingmar Steiner et.al. | 1209.4982 | null |
| 2012-03-30 | Face Expression Recognition and Analysis: The State of the Art | Vinay Bettadapura et.al. | 1203.6722 | null |
| 2012-01-19 | Progress in animation of an EMA-controlled tongue model for acoustic-visual speech synthesis | Ingmar Steiner et.al. | 1201.4080 | null |
| 2010-03-01 | Re-verification of a Lip Synchronization Protocol using Robust Reachability | Piotr Kordy et.al. | 1003.0431 | null |
Image Animation
| Publish Date | Title | Authors | PDF | Code |
|---|---|---|---|---|
| 2025-12-05 | SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations | Wenhao Yan et.al. | 2512.05905 | null |
| 2025-12-05 | Learning High-Fidelity Cloth Animation via Skinning-Free Image Transfer | Rong Wang et.al. | 2512.05593 | null |
| 2025-12-04 | ShadowDraw: From Any Object to Shadow-Drawing Compositional Art | Rundong Luo et.al. | 2512.05110 | null |
| 2025-12-04 | Efficient Spatially-Variant Convolution via Differentiable Sparse Kernel Complex | Zhizhen Wu et.al. | 2512.04556 | null |
| 2025-12-03 | Artificial Microsaccade Compensation: Stable Vision for an Ornithopter | Levi Burner et.al. | 2512.03995 | null |
| 2025-12-02 | PPTArena: A Benchmark for Agentic PowerPoint Editing | Michael Ofengenden et.al. | 2512.03042 | null |
| 2025-12-01 | Know Thyself by Knowing Others: Learning Neuron Identity from Population Context | Vinam Arora et.al. | 2512.01199 | null |
| 2025-12-01 | One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer | Shijun Shi et.al. | 2511.22940 | null |
| 2025-11-30 | TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model | Alireza Javanmardi et.al. | 2512.00909 | null |
| 2025-11-29 | Astro-Animation -- How Artists and Scientists Envision the Universe | Laurence Arcadias et.al. | 2512.00535 | null |
| 2025-11-28 | MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation | Yuta Oshima et.al. | 2511.22989 | null |
| 2025-11-28 | OmniAID: Decoupling Semantic and Artifacts for Universal AI-Generated Image Detection in the Wild | Yuncheng Guo et.al. | 2511.08423 | null |
| 2025-11-27 | A Progressive Evaluation Framework for Multicultural Analysis of Story Visualization | Janak Kapuriya et.al. | 2511.22576 | null |
| 2025-11-27 | INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts | Anshul Bagaria et.al. | 2511.22351 | null |
| 2025-11-25 | MotionV2V: Editing Motion in a Video | Ryan Burgert et.al. | 2511.20640 | null |
| 2025-11-25 | New York Smells: A Large Multimodal Dataset for Olfaction | Ege Ozguroglu et.al. | 2511.20544 | null |
| 2025-11-24 | SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation | Jiaming Zhang et.al. | 2511.19320 | null |
| 2025-11-22 | AnimAgents: Coordinating Multi-Stage Animation Pre-Production with Human-Multi-Agent Collaboration | Wen-Fan Wang et.al. | 2511.17906 | null |
| 2025-11-20 | Motion Transfer-Enhanced StyleGAN for Generating Diverse Macaque Facial Expressions | Takuya Igaue et.al. | 2511.16711 | null |
| 2025-11-20 | Integrating Deep Learning and Spatial Statistics in Marine Ecosystem Monitoring | Gian Mario Sangiovanni et.al. | 2511.16447 | null |
| 2025-11-20 | How Robot Dogs See the Unseeable | Oliver Bimber et.al. | 2511.16262 | null |
| 2025-11-18 | PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos | Dianbing Xi et.al. | 2511.12935 | null |
| 2025-11-16 | Sketch2PoseNet: Efficient and Generalized Sketch to 3D Human Pose Prediction | Li Wang et.al. | 2510.26196 | null |
| 2025-11-14 | EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation | Zongyang Qiu et.al. | 2511.11002 | null |
| 2025-11-11 | oboro: Text-to-Image Synthesis on Limited Data using Flow-based Diffusion Transformer with MMH Attention | Ryusuke Mizutani et.al. | 2511.08168 | null |
| 2025-11-11 | Beyond the Pixels: VLM-based Evaluation of Identity Preservation in Reference-Guided Synthesis | Aditi Singhania et.al. | 2511.08087 | null |
| 2025-11-09 | Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising | Assaf Singer et.al. | 2511.08633 | null |
| 2025-11-04 | Video Text Preservation with Synthetic Text-Rich Videos | Ziyang Liu et.al. | 2511.05573 | null |
| 2025-11-03 | FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion | Chuhao Chen et.al. | 2510.25765 | null |
| 2025-11-02 | A Hybrid YOLOv5-SSD IoT-Based Animal Detection System for Durian Plantation Protection | Anis Suttan Shahrir et.al. | 2511.00777 | null |
| 2025-10-31 | DANCER: Dance ANimation via Condition Enhancement and Rendering with diffusion model | Yucheng Xing et.al. | 2510.27169 | null |
| 2025-10-29 | 4-Doodle: Text to 3D Sketches that Move! | Hao Chen et.al. | 2510.25319 | null |
| 2025-10-28 | DogMo: A Large-Scale Multi-View RGB-D Dataset for 4D Canine Motion Recovery | Zan Wang et.al. | 2510.24117 | null |
| 2025-10-27 | Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation | Junyoung Seo et.al. | 2510.23581 | null |
| 2025-10-27 | Revising Second Order Terms in Deep Animation Video Coding | Konstantin Schmidt et.al. | 2510.23561 | null |
| 2025-10-26 | Cross-Species Transfer Learning in Agricultural AI: Evaluating ZebraPose Adaptation for Dairy Cattle Pose Estimation | Mackenzie Tapp et.al. | 2510.22618 | null |
| 2025-10-26 | DynaPose4D: High-Quality 4D Dynamic Content Generation via Pose Alignment Loss | Jing Yang et.al. | 2510.22473 | null |
| 2025-10-20 | From Volume Rendering to 3D Gaussian Splatting: Theory and Applications | Vitor Pereira Matias et.al. | 2510.18101 | null |
| 2025-10-16 | Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation | Shaowei Liu et.al. | 2510.14976 | null |
| 2025-10-16 | Zero-Shot Wildlife Sorting Using Vision Transformers: Evaluating Clustering and Continuous Similarity Ordering | Hugo Markoff et.al. | 2510.14596 | null |
| 2025-10-16 | Hierarchical Re-Classification: Combining Animal Classification Models with Vision Transformers | Hugo Markoff et.al. | 2510.14594 | null |
| 2025-10-16 | Evaluating plastic scintillator performance as a substitute of LYSO in SiPM based animal PET scanners: A GEANT4 simulation analysis | Davinder Siwal et.al. | 2510.14437 | null |
| 2025-10-16 | Multi-identity Human Image Animation with Structural Video Diffusion | Zhenzhi Wang et.al. | 2504.04126 | null |
| 2025-09-19 | TT-DF: A Large-Scale Diffusion-Based Dataset and Benchmark for Human Body Forgery Detection | Wenkui Yang et.al. | 2505.08437 | null |
| 2025-09-09 | LINR Bridge: Vector Graphic Animation via Neural Implicits and Video Diffusion Priors | Wenshuo Gao et.al. | 2509.07484 | null |
| 2025-08-23 | AnimateAnywhere: Rouse the Background in Human Image Animation | Xiaoyu Liu et.al. | 2504.19834 | null |
| 2025-08-13 | Animate-X++: Universal Character Image Animation with Dynamic Backgrounds | Shuai Tan et.al. | 2508.09454 | null |
| 2025-08-10 | Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers | Xin Ma et.al. | 2508.07246 | null |
| 2025-07-20 | StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation | Shuyuan Tu et.al. | 2507.15064 | null |
| 2025-07-11 | X-Dancer: Expressive Music to Human Dance Video Generation | Zeyuan Chen et.al. | 2502.17414 | null |
| 2025-07-01 | DAM-VSR: Disentanglement of Appearance and Motion for Video Super-Resolution | Zhe Kong et.al. | 2507.01012 | null |
| 2025-07-01 | Recomposed realities: animating still images via patch clustering and randomness | Markus Juvonen et.al. | 2506.22556 | null |
| 2025-05-30 | MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation | Yanbo Ding et.al. | 2505.10238 | null |
| 2025-05-29 | HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions | Shuolin Xu et.al. | 2505.22977 | null |
| 2025-05-24 | EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation | Qiang Qu et.al. | 2503.18552 | null |
| 2025-05-18 | DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation | Haoyu Zhao et.al. | 2503.21246 | null |
| 2025-04-20 | DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance | Yuxuan Luo et.al. | 2504.01724 | null |
| 2025-04-15 | UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer | Xiang Wang et.al. | 2504.11289 | null |
| 2025-04-15 | Taming Consistency Distillation for Accelerated Human Image Animation | Xiang Wang et.al. | 2504.11143 | null |
| 2025-04-04 | Optimizing 4D Gaussians for Dynamic Scene Video from Single Landscape Images | In-Hwan Jin et.al. | 2504.05458 | null |
| 2025-04-01 | VFX Creator: Animated Visual Effect Generation with Controllable Diffusion Transformer | Xinyu Liu et.al. | 2502.05979 | null |
| 2025-03-23 | MotiF: Making Text Count in Image Animation with Motion Focal Loss | Shijie Wang et.al. | 2412.16153 | null |
| 2025-03-13 | Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer | Jiahao Cui et.al. | 2412.00733 | link |
| 2025-03-10 | Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation | Yingjie Chen et.al. | 2501.05020 | null |
| 2025-02-25 | DisPose: Disentangling Pose Guidance for Controllable Human Image Animation | Hongxiang Li et.al. | 2412.09349 | link |
| 2025-02-15 | SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers | Di Qiu et.al. | 2502.10841 | null |
| 2025-02-10 | Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance | Li Hu et.al. | 2502.06145 | null |
| 2025-02-06 | MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation | Jinbo Xing et.al. | 2502.04299 | null |
| 2025-02-03 | Every Image Listens, Every Image Dances: Music-Driven Image Animation | Zhikang Dong et.al. | 2501.18801 | null |
| 2025-01-20 | X-Dyna: Expressive Dynamic Human Image Animation | Di Chang et.al. | 2501.10021 | null |
| 2025-01-15 | Joint Learning of Depth and Appearance for Portrait Image Animation | Xinya Ji et.al. | 2501.08649 | null |
| 2024-12-12 | Animate-X: Universal Character Image Animation with Enhanced Motion Representation | Shuai Tan et.al. | 2410.10306 | null |
| 2024-12-04 | FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait | Taekyung Ki et.al. | 2412.01064 | null |
| 2024-11-30 | DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses | Yatian Pang et.al. | 2412.00397 | null |
| 2024-11-28 | JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation | Xuyang Cao et.al. | 2411.09209 | link |
| 2024-11-27 | StableAnimator: High-Quality Identity-Preserving Human Image Animation | Shuyuan Tu et.al. | 2411.17697 | link |
| 2024-11-24 | LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis | Haojie Zhang et.al. | 2411.16748 | null |
| 2024-11-22 | HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation | Zhenzhi Wang et.al. | 2407.17438 | null |
| 2024-10-31 | TPC: Test-time Procrustes Calibration for Diffusion-based Human Image Animation | Sunjae Yoon et.al. | 2410.24037 | null |
| 2024-10-20 | FrameBridge: Improving Image-to-Video Generation with Bridge Models | Yuji Wang et.al. | 2410.15371 | null |
| 2024-10-14 | Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation | Jiahao Cui et.al. | 2410.07718 | link |
| 2024-09-30 | Illustrious: an Open Advanced Illustration Model | Sang Hyun Park et.al. | 2409.19946 | null |
| 2024-09-29 | High Quality Human Image Animation using Regional Supervision and Motion Blur Condition | Zhongcong Xu et.al. | 2409.19580 | null |
| 2024-09-22 | Dormant: Defending against Pose-driven Human Image Animation | Jiachen Zhou et.al. | 2409.14424 | link |
| 2024-07-23 | Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models | Xin Ma et.al. | 2407.15642 | link |
| 2024-07-12 | TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models | Jeongho Kim et.al. | 2407.09012 | null |
| 2024-07-12 | EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions | Zhiyuan Chen et.al. | 2407.08136 | link |
| 2024-07-11 | MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model | Muyao Niu et.al. | 2405.20222 | link |
| 2024-06-16 | Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation | Mingwang Xu et.al. | 2406.08801 | null |
| 2024-06-14 | Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation | Li Hu et.al. | 2311.17117 | null |
| 2024-06-13 | Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control | Jingyun Xue et.al. | 2406.03035 | null |
| 2024-06-03 | UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation | Xiang Wang et.al. | 2406.01188 | null |
| 2024-06-01 | Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance | Shenhao Zhu et.al. | 2403.14781 | link |
| 2024-05-29 | Evaluating the effectiveness of sonification in science education using Edukoi | Lucrezia Guiotto Nai Fovino et.al. | 2405.18908 | null |
| 2024-05-28 | VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation | Qilin Wang et.al. | 2405.18156 | null |
| 2024-05-28 | Controllable Longer Image Animation with Diffusion Models | Qiang Wang et.al. | 2405.17306 | null |
| 2024-03-26 | PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models | Yiming Zhang et.al. | 2312.13964 | null |
| 2024-03-13 | Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts | Yue Ma et.al. | 2403.08268 | link |
| 2024-03-08 | Audio-Synchronized Visual Animation | Lin Zhang et.al. | 2403.05659 | link |
| 2024-03-05 | Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation | Weijie Li et.al. | 2403.02827 | null |
| 2024-01-17 | Continuous Piecewise-Affine Based Motion Model for Image Animation | Hexiang Wang et.al. | 2401.09146 | link |
| 2024-01-03 | Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions | David Junhao Zhang et.al. | 2401.01827 | link |
| 2023-12-08 | AnimateZero: Video Diffusion Models are Zero-Shot Image Animators | Jiwen Yu et.al. | 2312.03793 | null |
| 2023-12-06 | AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance | Zuozhuo Dai et.al. | 2311.12886 | null |
| 2023-12-05 | LivePhoto: Real Image Animation with Text-guided Motion Control | Xi Chen et.al. | 2312.02928 | null |
| 2023-11-30 | Motion-Conditioned Image Animation for Video Editing | Wilson Yan et.al. | 2311.18827 | null |
| 2023-11-27 | MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model | Zhongcong Xu et.al. | 2311.16498 | null |
| 2023-11-27 | DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors | Jinbo Xing et.al. | 2310.12190 | link |
| 2023-11-19 | Differential Motion Evolution for Fine-Grained Motion Deformation in Unsupervised Image Animation | Peirong Liu et.al. | 2110.04658 | null |
| 2023-10-16 | LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation | Ruiqi Wu et.al. | 2310.10769 | link |
| 2023-10-11 | LEO: Generative Latent Image Animator for Human Video Synthesis | Yaohui Wang et.al. | 2305.03989 | link |
| 2023-09-26 | Text-Guided Synthesis of Eulerian Cinemagraphs | Aniruddha Mahapatra et.al. | 2307.03190 | link |
| 2023-09-25 | Automatic Animation of Hair Blowing in Still Portrait Photos | Wenpeng Xiao et.al. | 2309.14207 | null |
| 2023-07-10 | AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning | Yuwei Guo et.al. | 2307.04725 | link |
| 2023-07-09 | Predictive Coding For Animation-Based Video Compression | Goluck Konuko et.al. | 2307.04187 | null |
| 2023-04-12 | VidStyleODE: Disentangled Video Editing via StyleGAN and NeuralODEs | Moayed Haji Ali et.al. | 2304.06020 | null |
| 2023-03-10 | 3D Cinemagraphy from a Single Image | Xingyi Li et.al. | 2303.05724 | null |
| 2023-02-02 | Dreamix: Video Diffusion Models are General Video Editors | Eyal Molad et.al. | 2302.01329 | null |
| 2023-01-27 | Animating Still Images | Kushagr Batra et.al. | 2209.10497 | null |
| 2023-01-14 | Continuous odor profile monitoring to study olfactory navigation in small animals | Kevin S. Chen et.al. | 2301.05905 | null |
| 2022-11-30 | NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-shot Real Image Animation | Yu Yin et.al. | 2211.17235 | null |
| 2022-10-05 | Implicit Warping for Animation with Image Sets | Arun Mallya et.al. | 2210.01794 | null |
| 2022-09-28 | Motion Transformer for Unsupervised Image Animation | Jiale Tao et.al. | 2209.14024 | link |
| 2022-07-19 | Single Stage Virtual Try-on via Deformable Attention Flows | Shuai Bai et.al. | 2207.09161 | link |
| 2022-07-08 | Jointly Harnessing Prior Structures and Temporal Consistency for Sign Language Video Generation | Yucheng Suo et.al. | 2207.03714 | null |
| 2022-06-11 | Bayesian Statistics Guided Label Refurbishment Mechanism: Mitigating Label Noise in Medical Image Classification | Mengdi Gao et.al. | 2106.12284 | link |
| 2022-04-05 | Neural Fields in Visual Computing and Beyond | Yiheng Xie et.al. | 2111.11426 | null |
| 2022-03-30 | Image Animation with Perturbed Masks | Yoav Shalev et.al. | 2011.06922 | null |
| 2022-03-29 | Thin-Plate Spline Motion Model for Image Animation | Jian Zhao et.al. | 2203.14367 | link |
| 2022-03-25 | 3D GAN Inversion for Controllable Portrait Image Animation | Connor Z. Lin et.al. | 2203.13441 | null |
| 2022-03-18 | Latent Image Animator: Learning to Animate Images via Latent Space Navigation | Yaohui Wang et.al. | 2203.09043 | null |
| 2021-12-21 | Image Animation with Keypoint Mask | Or Toledano et.al. | 2112.10457 | link |
| 2021-12-19 | Move As You Like: Image Animation in E-Commerce Scenario | Borun Xu et.al. | 2112.13647 | null |
| 2021-12-17 | AI-Empowered Persuasive Video Generation: A Survey | Chang Liu et.al. | 2112.09401 | null |
| 2021-12-01 | Deep Spatial Transformation for Pose-Guided Person Image Generation and Animation | Yurui Ren et.al. | 2008.12606 | null |
| 2021-10-28 | Application of Time Separation Technique to Enhance C-arm CT Dynamic Liver Perfusion Imaging | Hana Haseljić et.al. | 2110.14318 | null |
| 2021-10-26 | Incremental Learning for Animal Pose Estimation using RBF k-DPP | Gaurav Kumar Nayak et.al. | 2110.13598 | null |
| 2021-10-07 | Enhancement of Anime Imaging Enlargement using Modified Super-Resolution CNN | Tanakit Intaniyom et.al. | 2110.02321 | null |
| 2021-09-06 | Sparse to Dense Motion Transfer for Face Image Animation | Ruiqi Zhao et.al. | 2109.00471 | null |
| 2021-08-18 | DeepFake MNIST+: A DeepFake Facial Animation Dataset | Jiajun Huang et.al. | 2108.07949 | link |
| 2021-06-23 | Analisis Kualitas Layanan Website E-Commerce Bukalapak Terhadap Kepuasan Pengguna Mahasiswa Universitas Bina Darma Menggunakan Metode Webqual 4.0 | Adellia et.al. | 2106.15342 | null |
| 2021-04-07 | Single Source One Shot Reenactment using Weighted motion From Paired Feature Points | Soumya Tripathy et.al. | 2104.03117 | null |
| 2021-03-23 | PriorityCut: Occlusion-guided Regularization for Warp-based Image Animation | Wai Ting Cheung et.al. | 2103.11600 | null |
| 2020-12-01 | Ultra-low bitrate video conferencing using deep image animation | Goluck Konuko et.al. | 2012.00346 | null |
| 2020-10-01 | First Order Motion Model for Image Animation | Aliaksandr Siarohin et.al. | 2003.00196 | link |
| 2019-08-30 | Animating Arbitrary Objects via Deep Motion Transfer | Aliaksandr Siarohin et.al. | 1812.08861 | link |
| 2019-07-01 | Style Generator Inversion for Image Enhancement and Animation | Aviv Gabbay et.al. | 1906.11880 | null |
| 2018-10-09 | 3D model silhouette-based tracking in depth images for puppet suit dynamic video-mapping | Guillaume Caron et.al. | 1810.03956 | null |
| 2018-06-24 | A Design of FPGA Based Small Animal PET Real Time Digital Signal Processing and Correction Logic | Jiaming Lu et.al. | 1806.09117 | null |
| 2018-01-31 | RAPTOR I: Time-dependent radiative transfer in arbitrary spacetimes | Thomas Bronzwaer et.al. | 1801.10452 | null |
| 2017-10-23 | Quasi-random Agents for Image Transition and Animation | Aneta Neumann et.al. | 1710.07421 | null |
| 2016-06-23 | Gender and Interest Targeting for Sponsored Post Advertising at Tumblr | Mihajlo Grbovic et.al. | 1606.07189 | null |
| 2015-03-16 | Use of Effective Audio in E-learning Courseware | Kisor Ray et.al. | 1503.04837 | null |
| 2015-02-04 | Multimedia-Video for Learning | Kah Hean Chua et.al. | 1502.01090 | null |
| 2013-01-25 | Measurements of Martian Dust Devil Winds with HiRISE | David S. Choi et.al. | 1301.6130 | null |
| 2010-01-04 | Tutoring System for Dance Learning | Rajkumar Kannan et.al. | 1001.0440 | null |
Video Generation
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-12-08 | UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation | Jiehui Huang et.al. | 2512.07831 | null |
| 2025-12-08 | WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling | Shaoheng Fang et.al. | 2512.07821 | null |
| 2025-12-08 | OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory | Zhaochong An et.al. | 2512.07802 | null |
| 2025-12-08 | ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation | Fan Yang et.al. | 2512.07720 | null |
| 2025-12-08 | Communication-Efficient Serving for Video Diffusion Models with Latent Parallelism | Zhiyuan Wu et.al. | 2512.07350 | null |
| 2025-12-08 | ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation | Ziyang Mai et.al. | 2512.07328 | null |
| 2025-12-08 | Unified Camera Positional Encoding for Controlled Video Generation | Cheng Zhang et.al. | 2512.07237 | null |
| 2025-12-07 | VideoVLA: Video Generators Can Be Generalizable Robot Manipulators | Yichao Shen et.al. | 2512.06963 | null |
| 2025-12-07 | Scaling Zero-Shot Reference-to-Video Generation | Zijian Zhou et.al. | 2512.06905 | null |
| 2025-12-07 | RunawayEvil: Jailbreaking the Image-to-Video Generative Models | Songping Wang et.al. | 2512.06674 | null |
| 2025-12-07 | MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment | Ruicheng Zhang et.al. | 2512.06628 | null |
| 2025-12-06 | Are AI-Generated Driving Videos Ready for Autonomous Driving? A Diagnostic Evaluation Framework | Xinhao Xiang et.al. | 2512.06376 | null |
| 2025-12-05 | Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation | Su Sun et.al. | 2512.06158 | null |
| 2025-12-05 | AQUA-Net: Adaptive Frequency Fusion and Illumination Aware Network for Underwater Image Enhancement | Munsif Ali et.al. | 2512.05960 | null |
| 2025-12-05 | World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty | Zhiting Mei et.al. | 2512.05927 | null |
| 2025-12-05 | Bring Your Dreams to Life: Continual Text-to-Video Customization | Jiahua Dong et.al. | 2512.05802 | null |
| 2025-12-05 | USV: Unified Sparsification for Accelerating Video Diffusion Models | Xinjian Wu et.al. | 2512.05754 | null |
| 2025-12-05 | ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior | Weikai Lu et.al. | 2512.05745 | null |
| 2025-12-05 | InverseCrafter: Efficient Video ReCapture as a Latent Domain Inverse Problem | Yeobin Hong et.al. | 2512.05672 | null |
| 2025-12-05 | ProPhy: Progressive Physical Alignment for Dynamic World Simulation | Zijun Wang et.al. | 2512.05564 | null |
| 2025-12-05 | VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation | Chinthani Sugandhika et.al. | 2512.05524 | null |
| 2025-12-05 | User Negotiations of Authenticity, Ownership, and Governance on AI-Generated Video Platforms: Evidence from Sora | Bohui Shen et.al. | 2512.05519 | null |
| 2025-12-05 | WaterWave: Bridging Underwater Image Enhancement into Video Streams via Wavelet-based Temporal Consistency Field | Qi Zhu et.al. | 2512.05492 | null |
| 2025-12-05 | Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability | Shizhan Liu et.al. | 2512.05394 | null |
| 2025-12-04 | IE2Video: Adapting Pretrained Diffusion Models for Event-Based Video Reconstruction | Dmitrii Torbunov et.al. | 2512.05240 | null |
| 2025-12-04 | Invariance Co-training for Robot Visual Generalization | Jonathan Yang et.al. | 2512.05230 | null |
| 2025-12-04 | Light-X: Generative 4D Video Rendering with Camera and Illumination Control | Tianqi Liu et.al. | 2512.05115 | null |
| 2025-12-04 | NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation | Yu Zeng et.al. | 2512.05106 | null |
| 2025-12-04 | TV2TV: A Unified Framework for Interleaved Language and Video Generation | Xiaochuang Han et.al. | 2512.05103 | null |
| 2025-12-04 | From Generated Human Videos to Physically Plausible Robot Trajectories | James Ni et.al. | 2512.05094 | null |
| 2025-12-04 | Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression | Jung Yi et.al. | 2512.05081 | null |
| 2025-12-04 | Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints | Minghan Zhu et.al. | 2512.05079 | null |
| 2025-12-04 | BulletTime: Decoupled Control of Time and Camera Pose for Video Generation | Yiming Wang et.al. | 2512.05076 | null |
| 2025-12-04 | Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image | Yanran Zhang et.al. | 2512.05044 | null |
| 2025-12-04 | Generative Neural Video Compression via Video Diffusion Prior | Qi Mao et.al. | 2512.05016 | null |
| 2025-12-04 | Exploring YouTube's Political Communication Networks during the 2024 French Elections | Caroline Violot et.al. | 2512.04971 | null |
| 2025-12-04 | Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing | Maria-Paola Forte et.al. | 2512.04862 | null |
| 2025-12-04 | Multi Task Denoiser Training for Solving Linear Inverse Problems | Clément Bled et.al. | 2512.04709 | null |
| 2025-12-04 | Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation | Yunhong Lu et.al. | 2512.04678 | null |
| 2025-12-04 | Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length | Yubo Huang et.al. | 2512.04677 | null |
| 2025-12-04 | SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding | Chang-Hsun Wu et.al. | 2512.04643 | null |
| 2025-12-04 | VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management | Hongbo Jin et.al. | 2512.04540 | null |
| 2025-12-04 | X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale | Pei Yang et.al. | 2512.04537 | null |
| 2025-12-04 | PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement | Yu-Wei Zhan et.al. | 2512.04532 | null |
| 2025-12-04 | VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory | Yifei Yu et.al. | 2512.04519 | null |
| 2025-12-04 | EgoLCD: Egocentric Video Generation with Long Context Diffusion | Liuzhou Zhang et.al. | 2512.04515 | null |
| 2025-12-03 | Stable Signer: Hierarchical Sign Language Generative Model | Sen Fang et.al. | 2512.04048 | null |
| 2025-12-03 | RELIC: Interactive Video World Model with Long-Horizon Memory | Yicong Hong et.al. | 2512.04040 | null |
| 2025-12-03 | PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation | Xiaolong Li et.al. | 2512.04025 | null |
| 2025-12-03 | TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning | Tao Wu et.al. | 2512.03963 | null |
| 2025-12-03 | UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework | Youxin Pang et.al. | 2512.03918 | null |
| 2025-12-03 | Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence | Shuai Yang et.al. | 2512.03905 | null |
| 2025-12-03 | ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos | Qi'ao Xu et.al. | 2512.03666 | null |
| 2025-12-03 | The promising potential of vision language models for the generation of textual weather forecasts | Edward C. C. Steele et.al. | 2512.03623 | null |
| 2025-12-03 | ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation | Yaokun Li et.al. | 2512.03621 | null |
| 2025-12-03 | LAMP: Language-Assisted Motion Planning for Controllable Video Generation | Muhammed Burak Kizil et.al. | 2512.03619 | null |
| 2025-12-03 | Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding | Haoran Zhou et.al. | 2512.03601 | null |
| 2025-12-03 | Beyond Boundary Frames: Audio-Visual Semantic Guidance for Context-Aware Video Interpolation | Yuchen Deng et.al. | 2512.03590 | null |
| 2025-12-03 | Dynamic Optical Test for Bot Identification (DOT-BI): A simple check to identify bots in surveys and online processes | Malte Bleeker et.al. | 2512.03580 | null |
| 2025-12-03 | Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching | Wei Chee Yew et.al. | 2512.03553 | null |
| 2025-12-03 | Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation | Subin Kim et.al. | 2512.03534 | null |
| 2025-12-03 | FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation | Yiyi Cai et.al. | 2512.03520 | null |
| 2025-12-03 | Towards Object-centric Understanding for Instructional Videos | Wenliang Guo et.al. | 2512.03479 | null |
| 2025-12-03 | GeoVideo: Introducing Geometric Regularization into Video Generation Model | Yunpeng Bai et.al. | 2512.03453 | null |
| 2025-12-03 | GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers | Zhiye Song et.al. | 2512.03451 | null |
| 2025-12-03 | FireSentry: A Multi-Modal Spatio-temporal Benchmark Dataset for Fine-Grained Wildfire Spread Forecasting | Nan Zhou et.al. | 2512.03369 | null |
| 2025-12-02 | Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling | Yueru Jia et.al. | 2512.03044 | null |
| 2025-12-02 | OneThinker: All-in-one Reasoning Model for Image and Video | Kaituo Feng et.al. | 2512.03043 | null |
| 2025-12-02 | MultiShotMaster: A Controllable Multi-Shot Video Generation Framework | Qinghe Wang et.al. | 2512.03041 | null |
| 2025-12-02 | Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation | Zeqi Xiao et.al. | 2512.03040 | null |
| 2025-12-02 | ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation | Mengchen Zhang et.al. | 2512.03036 | null |
| 2025-12-02 | MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation | Youxin Pang et.al. | 2512.03034 | null |
| 2025-12-02 | SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control | Yuxuan Mu et.al. | 2512.03028 | null |
| 2025-12-02 | Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks | Matthew Dutson et.al. | 2512.03014 | null |
| 2025-12-02 | In-Context Sync-LoRA for Portrait Video Editing | Sagi Polaczek et.al. | 2512.03013 | null |
| 2025-12-02 | Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench | Lanxiang Hu et.al. | 2512.02942 | null |
| 2025-12-02 | LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization | Zhihan Xiao et.al. | 2512.02933 | null |
| 2025-12-02 | Taming Camera-Controlled Video Generation with Verifiable Geometry Reward | Zhaoqing Wang et.al. | 2512.02870 | null |
| 2025-12-02 | Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video? | Manuel Benavent-Lledo et.al. | 2512.02846 | null |
| 2025-12-02 | ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning | Yifan Li et.al. | 2512.02835 | null |
| 2025-12-02 | From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity | Haoming Liu et.al. | 2512.02826 | null |
| 2025-12-02 | FiMMIA: scaling semantic perturbation-based membership inference across modalities | Anton Emelyanov et.al. | 2512.02786 | null |
| 2025-12-02 | Rethinking Surgical Smoke: A Smoke-Type-Aware Laparoscopic Video Desmoking Method and Dataset | Qifan Liang et.al. | 2512.02780 | null |
| 2025-12-02 | Reasoning-Aware Multimodal Fusion for Hateful Video Detection | Shuonan Yang et.al. | 2512.02743 | null |
| 2025-12-02 | Hear What Matters! Text-conditioned Selective Video-to-Audio Generation | Junwon Lee et.al. | 2512.02650 | null |
| 2025-12-02 | RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence | Xuming He et.al. | 2512.02622 | null |
| 2025-12-01 | Objects in Generated Videos Are Slower Than They Appear: Models Suffer Sub-Earth Gravity and Don't Know Galileo's Principle...for now | Varun Varma Thozhiyoor et.al. | 2512.02016 | null |
| 2025-12-01 | Generative Video Motion Editing with 3D Point Tracks | Yao-Chih Lee et.al. | 2512.02015 | null |
| 2025-12-01 | TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models | Zhiheng Liu et.al. | 2512.02014 | null |
| 2025-12-01 | Learning Dexterous Manipulation Skills from Imperfect Simulations | Elvis Hsieh et.al. | 2512.02011 | null |
| 2025-12-01 | Learning Visual Affordance from Audio | Lidong Lu et.al. | 2512.02005 | null |
| 2025-12-01 | PAI-Bench: A Comprehensive Benchmark For Physical AI | Fengzhe Zhou et.al. | 2512.01989 | null |
| 2025-12-01 | SpriteHand: Real-Time Versatile Hand-Object Interaction with Autoregressive Video Generation | Zisu Li et.al. | 2512.01960 | null |
| 2025-12-01 | GrndCtrl: Grounding World Models via Self-Supervised Reward Alignment | Haoyang He et.al. | 2512.01952 | null |
| 2025-12-01 | Script: Graph-Structured and Query-Conditioned Semantic Token Pruning for Multimodal Large Language Models | Zhongyu Yang et.al. | 2512.01949 | null |
| 2025-12-01 | COACH: Collaborative Agents for Contextual Highlighting - A Multi-Agent Framework for Sports Video Analysis | Tsz-To Wong et.al. | 2512.01853 | null |
| 2025-12-01 | JPEGs Just Got Snipped: Croppable Signatures Against Deepfake Images | Pericle Perazzo et.al. | 2512.01845 | null |
| 2025-12-01 | PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models | Zeqing Wang et.al. | 2512.01843 | null |
| 2025-12-01 | Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling | Meng Cao et.al. | 2512.01821 | null |
| 2025-12-01 | Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos | Xavier Thomas et.al. | 2512.01803 | null |
| 2025-12-01 | Evaluating SAM2 for Video Semantic Segmentation | Syed Hesham Syed Ariff et.al. | 2512.01774 | null |
| 2025-12-01 | VideoScoop: A Non-Traditional Domain-Independent Framework For Video Analysis | Hafsa Billah et.al. | 2512.01769 | null |
| 2025-12-01 | StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos | Daeun Lee et.al. | 2512.01707 | null |
| 2025-12-01 | DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models | Patrick Kwon et.al. | 2512.01686 | null |
| 2025-12-01 | Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation | Haodong Yan et.al. | 2512.01677 | null |
| 2025-12-01 | Exploring Scavenging Strategies and Cognitive Problem-Solving in Indian Free-Ranging Dogs | Tuhin Subhra Pal et.al. | 2512.01637 | null |
| 2025-11-30 | CycliST: A Video Language Model Benchmark for Reasoning on Cyclical State Transitions | Simon Kohaut et.al. | 2512.01095 | null |
| 2025-11-30 | Med-CRAFT: Automated Construction of Interpretable and Multi-Hop Video Workloads via Knowledge Graph Traversal | Shenxi Liu et.al. | 2512.01045 | null |
| 2025-11-30 | VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference | Jiaming Tang et.al. | 2512.01031 | null |
| 2025-11-30 | Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning | Qi Wang et.al. | 2512.00961 | null |
| 2025-11-30 | Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction | Boran Wen et.al. | 2512.00960 | null |
| 2025-11-30 | TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model | Alireza Javanmardi et.al. | 2512.00909 | null |
| 2025-11-30 | PanFlow: Decoupled Motion Control for Panoramic Video Generation | Cheng Zhang et.al. | 2512.00832 | null |
| 2025-11-30 | Seeing the Wind from a Falling Leaf | Zhiyuan Gao et.al. | 2512.00762 | null |
| 2025-11-29 | Image Generation as a Visual Planner for Robotic Manipulation | Ye Pang et.al. | 2512.00532 | null |
| 2025-11-29 | Structured Context Learning for Generic Event Boundary Detection | Xin Gu et.al. | 2512.00475 | null |
| 2025-11-29 | What about gravity in video generation? Post-Training Newton's Laws with Verifiable Rewards | Minh-Quan Le et.al. | 2512.00425 | null |
| 2025-11-29 | SplatFont3D: Structure-Aware Text-to-3D Artistic Font Generation with Part-Level Style Control | Ji Gan et.al. | 2512.00413 | null |
| 2025-11-29 | Low-Bitrate Video Compression through Semantic-Conditioned Diffusion | Lingdong Wang et.al. | 2512.00408 | null |
| 2025-11-29 | MVAD : A Comprehensive Multimodal Video-Audio Dataset for AIGC Detection | Mengxue Hu et.al. | 2512.00336 | null |
| 2025-11-29 | Comparative Evaluation of Generative AI Models for Chest Radiograph Report Generation in the Emergency Department | Woo Hyeon Lim et.al. | 2512.00271 | null |
| 2025-11-29 | "Why the face?": Exploring Robot Error Detection Using Instrumented Bystander Reactions | Maria Teresa Parreira et.al. | 2512.00262 | null |
| 2025-11-29 | Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views | Kunwar Maheep Singh et.al. | 2512.00255 | null |
| 2025-11-28 | Chunking Strategies for Multimodal AI Systems | Shashanka B R et.al. | 2512.00185 | null |
| 2025-11-28 | Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models | Muhammad Maaz et.al. | 2511.23478 | null |
| 2025-11-28 | AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement | Zhizhou Zhong et.al. | 2511.23475 | null |
| 2025-11-28 | Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model | Junshu Tang et.al. | 2511.23429 | null |
| 2025-11-28 | DisMo: Disentangled Motion Representations for Open-World Motion Transfer | Thomas Ressler-Antal et.al. | 2511.23428 | null |
| 2025-11-28 | Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach | Haruki Sakajo et.al. | 2511.23311 | null |
| 2025-11-28 | Vision Bridge Transformer at Scale | Zhenxiong Tan et.al. | 2511.23199 | null |
| 2025-11-28 | GeoWorld: Unlocking the Potential of Geometry Models to Facilitate High-Fidelity 3D Scene Generation | Yuhao Wan et.al. | 2511.23191 | null |
| 2025-11-28 | Fast Multi-view Consistent 3D Editing with Video Priors | Liyi Chen et.al. | 2511.23172 | null |
| 2025-11-28 | InstanceV: Instance-Level Video Generation | Yuheng Chen et.al. | 2511.23146 | null |
| 2025-11-28 | DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation | Hongfei Zhang et.al. | 2511.23127 | null |
| 2025-11-28 | LatBot: Distilling Universal Latent Actions for Vision-Language-Action Models | Zuolei Li et.al. | 2511.23034 | null |
| 2025-11-28 | McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning | Qiushi Yang et.al. | 2511.22974 | null |
| 2025-11-28 | BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation | Zeyu Zhang et.al. | 2511.22973 | null |
| 2025-11-28 | RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video | Haiyang Mei et.al. | 2511.22950 | null |
| 2025-11-28 | One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer | Shijun Shi et.al. | 2511.22940 | null |
| 2025-11-28 | TARFVAE: Efficient One-Step Generative Time Series Forecasting via TARFLOW based VAE | Jiawen Wei et.al. | 2511.22853 | null |
| 2025-11-28 | Captain Safari: A World Engine | Yu-Cheng Chou et.al. | 2511.22815 | null |
| 2025-11-27 | ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering | Alberto Compagnoni et.al. | 2511.22715 | null |
| 2025-11-27 | Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration | Mengyu Yang et.al. | 2511.22533 | null |
| 2025-11-27 | AI killed the video star. Audio-driven diffusion model for expressive talking head generation | Baptiste Chopin et.al. | 2511.22488 | null |
| 2025-11-26 | TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos | Seungjae Lee et.al. | 2511.21690 | null |
| 2025-11-26 | MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training | Haotian Xue et.al. | 2511.21592 | null |
| 2025-11-26 | Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy | Teng Hu et.al. | 2511.21579 | null |
| 2025-11-26 | Video Generation Models Are Good Latent Reward Models | Xiaoyue Mi et.al. | 2511.21541 | null |
| 2025-11-26 | MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices | Shuai Zhang et.al. | 2511.21475 | null |
| 2025-11-26 | Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning | Xin Gu et.al. | 2511.21375 | null |
| 2025-11-26 | AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs | Shuhan Xia et.al. | 2511.21251 | null |
| 2025-11-26 | AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control | Xinyue Guo et.al. | 2511.21146 | null |
| 2025-11-26 | TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models | Jiaming He et.al. | 2511.21145 | null |
| 2025-11-26 | Referring Video Object Segmentation with Cross-Modality Proxy Queries | Baoli Sun et.al. | 2511.21139 | null |
| 2025-11-26 | Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning | Changlin Li et.al. | 2511.21136 | null |
| 2025-11-26 | SocialNav: Training Human-Inspired Foundation Model for Socially-Aware Embodied Navigation | Ziyi Chen et.al. | 2511.21135 | null |
| 2025-11-26 | CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion | Dianbing Xi et.al. | 2511.21129 | null |
| 2025-11-26 | CartoonSing: Unifying Human and Nonhuman Timbres in Singing Generation | Jionghao Han et.al. | 2511.21045 | null |
| 2025-11-26 | TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs | Md Adnan Arefeen et.al. | 2511.20965 | null |
| 2025-11-25 | V | Jiancheng Pan et.al. | 2511.20886 | null |
| 2025-11-25 | Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries | Sree Bhattacharyya et.al. | 2511.20854 | null |
| 2025-11-25 | MODEST: Multi-Optics Depth-of-Field Stereo Dataset | Nisarg K. Trivedi et.al. | 2511.20853 | null |
| 2025-11-25 | Layer-Aware Video Composition via Split-then-Merge | Ozgur Kara et.al. | 2511.20809 | null |
| 2025-11-25 | Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout | Hidir Yesiltepe et.al. | 2511.20649 | null |
| 2025-11-25 | Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization | Tahira Kazimi et.al. | 2511.20647 | null |
| 2025-11-25 | MotionV2V: Editing Motion in a Video | Ryan Burgert et.al. | 2511.20640 | null |
| 2025-11-25 | iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation | Zhoujie Fu et.al. | 2511.20635 | null |
| 2025-11-25 | MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models | Chieh-Yun Chen et.al. | 2511.20629 | null |
| 2025-11-25 | ShapeGen: Towards High-Quality 3D Shape Synthesis | Yangguang Li et.al. | 2511.20624 | null |
| 2025-11-25 | Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI | Xinhao Liu et.al. | 2511.20620 | null |
| 2025-11-25 | E2E-GRec: An End-to-End Joint Training Framework for Graph Neural Networks and Recommender Systems | Rui Xue et.al. | 2511.20564 | null |
| 2025-11-25 | A Reason-then-Describe Instruction Interpreter for Controllable Video Generation | Shengqiong Wu et.al. | 2511.20563 | null |
| 2025-11-25 | PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding | Haoze Zhang et.al. | 2511.20562 | null |
| 2025-11-25 | STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow | Jiatao Gu et.al. | 2511.20462 | null |
| 2025-11-25 | Block Cascading: Training Free Acceleration of Block-Causal Video Models | Hmrishav Bandyopadhyay et.al. | 2511.20426 | null |
| 2025-11-25 | TReFT: Taming Rectified Flow Models For One-Step Image Translation | Shengqian Li et.al. | 2511.20307 | null |
| 2025-11-25 | Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations | Chao Wang et.al. | 2511.20295 | null |
| 2025-11-25 | Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement | Yang Liu et.al. | 2511.20280 | null |
| 2025-11-25 | Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation | Daniel Kienzle et.al. | 2511.20250 | null |
| 2025-11-25 | SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA | Haibin He et.al. | 2511.20190 | null |
| 2025-11-25 | Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis | Mohammad Mahdi et.al. | 2511.20186 | null |
| 2025-11-25 | UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers | Min Zhao et.al. | 2511.20123 | null |
| 2025-11-25 | Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos | Youngseo Kim et.al. | 2511.19936 | null |
| 2025-11-24 | VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection | Qiang Wang et.al. | 2511.19436 | null |
| 2025-11-24 | Are Image-to-Video Models Good Zero-Shot Image Editors? | Zechuan Zhang et.al. | 2511.19435 | null |
| 2025-11-24 | In-Video Instructions: Visual Signals as Generative Control | Gongfan Fang et.al. | 2511.19401 | null |
| 2025-11-24 | Growing with the Generator: Self-paced GRPO for Video Generation | Rui Li et.al. | 2511.19356 | null |
| 2025-11-24 | SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation | Jiaming Zhang et.al. | 2511.19320 | null |
| 2025-11-24 | SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis | Lingwei Dang et.al. | 2511.19319 | null |
| 2025-11-24 | LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models | Shuai Wang et.al. | 2511.19261 | null |
| 2025-11-24 | IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes | Carl Lindström et.al. | 2511.19235 | null |
| 2025-11-24 | Learning Plug-and-play Memory for Guiding Video Diffusion Models | Selena Song et.al. | 2511.19229 | null |
| 2025-11-24 | AvatarBrush: Monocular Reconstruction of Gaussian Avatars with Intuitive Local Editing | Mengtian Li et.al. | 2511.19189 | null |
| 2025-11-24 | RAVEN++: Pinpointing Fine-Grained Violations in Advertisement Videos with Active Reinforcement Reasoning | Deyi Ji et.al. | 2511.19168 | null |
| 2025-11-24 | HABIT: Human Action Benchmark for Interactive Traffic in CARLA | Mohan Ramesh et.al. | 2511.19109 | null |
| 2025-11-24 | Beyond Reward Margin: Rethinking and Resolving Likelihood Displacement in Diffusion Models via Video Generation | Ruojun Xu et.al. | 2511.19049 | null |
| 2025-11-24 | View-Consistent Diffusion Representations for 3D-Consistent Video Generation | Duolikun Danier et.al. | 2511.18991 | null |
| 2025-11-24 | Eevee: Towards Close-up High-resolution Video-based Virtual Try-on | Jianhao Zeng et.al. | 2511.18957 | null |
| 2025-11-24 | One4D: Unified 4D Generation and Reconstruction via Decoupled LoRA Control | Zhenxing Mi et.al. | 2511.18922 | null |
| 2025-11-24 | EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models | Wenhao Xu et.al. | 2511.18920 | null |
| 2025-11-24 | Learning What to Trust: Bayesian Prior-Guided Optimization for Visual Generation | Ruiying Liu et.al. | 2511.18919 | null |
| 2025-11-24 | MagicWorld: Interactive Geometry-driven Video World Exploration | Guangyuan Li et.al. | 2511.18886 | null |
| 2025-11-24 | HunyuanVideo 1.5 Technical Report | Bing Wu et.al. | 2511.18870 | null |
| 2025-11-23 | ViMix-14M: A Curated Multi-Source Video-Text Dataset with Long-Form, High-Quality Captions and Crawl-Free Access | Timing Yang et.al. | 2511.18382 | null |
| 2025-11-23 | MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models | Xiyang Wu et.al. | 2511.18373 | null |
| 2025-11-23 | Alias-free 4D Gaussian Splatting | Zilong Chen et.al. | 2511.18367 | null |
| 2025-11-23 | TRANSPORTER: Transferring Visual Semantics from VLM Manifolds | Alexandros Stergiou et.al. | 2511.18359 | null |
| 2025-11-23 | MagicWand: A Universal Agent for Generation and Evaluation Aligned with User Preference | Zitong Xu et.al. | 2511.18352 | null |
| 2025-11-23 | FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement | Wenshuo Gao et.al. | 2511.18346 | null |
| 2025-11-23 | AnyExperts: On-Demand Expert Allocation for Multimodal Language Models with Mixture of Expert | Yuting Gao et.al. | 2511.18314 | null |
| 2025-11-23 | Point-to-Point: Sparse Motion Guidance for Controllable Video Editing | Yeji Song et.al. | 2511.18277 | null |
| 2025-11-23 | SatSAM2: Motion-Constrained Video Object Tracking in Satellite Imagery using Promptable SAM2 and Kalman Priors | Ruijie Fan et.al. | 2511.18264 | null |
| 2025-11-23 | EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning | Yogesh Kulkarni et.al. | 2511.18242 | null |
| 2025-11-22 | MotionDuet: Dual-Conditioned 3D Human Motion Generation with Video-Regularized Text Learning | Yi-Yang Zhang et.al. | 2511.18209 | null |
| 2025-11-22 | InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity | Haoming Wang et.al. | 2511.18200 | null |
| 2025-11-22 | EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses | Enrico Pallotta et.al. | 2511.18173 | null |
| 2025-11-22 | Video4Edit: Viewing Image Editing as a Degenerate Temporal Process | Xiaofan Li et.al. | 2511.18131 | null |
| 2025-11-22 | Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning | Xiaohong Liu et.al. | 2511.18104 | null |
| 2025-11-22 | Spotlight: Identifying and Localizing Video Generation Errors Using VLMs | Aditya Chinchure et.al. | 2511.18102 | null |
| 2025-11-22 | Hybrid Event Frame Sensors: Modeling, Calibration, and Simulation | Yunfan Lu et.al. | 2511.18037 | null |
| 2025-11-22 | Diverse Instance Generation via Diffusion Models for Enhanced Few-Shot Object Detection in Remote Sensing Images | Yanxing Liu et.al. | 2511.18031 | null |
| 2025-11-22 | Plan-X: Instruct Video Generation via Semantic Planning | Lun Huang et.al. | 2511.17986 | null |
| 2025-11-22 | VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment | Ziheng Jia et.al. | 2511.17962 | null |
| 2025-11-21 | EvDiff: High Quality Video with an Event Camera | Weilun Li et.al. | 2511.17492 | null |
| 2025-11-21 | Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination | Yolo Yunlong Tang et.al. | 2511.17490 | null |
| 2025-11-21 | Counterfactual World Models via Digital Twin-conditioned Video Diffusion | Yiqing Shen et.al. | 2511.17481 | null |
| 2025-11-21 | Planning with Sketch-Guided Verification for Physics-Aware Video Generation | Yidong Huang et.al. | 2511.17450 | null |
| 2025-11-21 | Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal | Xiaolong Qian et.al. | 2511.17353 | null |
| 2025-11-21 | Loomis Painter: Reconstructing the Painting Process | Markus Pobitzer et.al. | 2511.17344 | null |
| 2025-11-21 | Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM | Chiori Hori et.al. | 2511.17335 | null |
| 2025-11-21 | FORWARD: Dataset of a forwarder operating in rough terrain | Mikael Lundbäck et.al. | 2511.17318 | null |
| 2025-11-21 | PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention | Yipeng Chen et.al. | 2511.17185 | null |
| 2025-11-21 | Investigating self-supervised representations for audio-visual deepfake detection | Dragos-Alexandru Boldisor et.al. | 2511.17181 | null |
| 2025-11-21 | Sparse Reasoning is Enough: Biological-Inspired Framework for Video Anomaly Detection with Large Pre-trained Models | He Huang et.al. | 2511.17094 | null |
| 2025-11-21 | H-GAR: A Hierarchical Interaction Framework via Goal-Driven Observation-Action Refinement for Robotic Manipulation | Yijie Zhu et.al. | 2511.17079 | null |
| 2025-11-21 | MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis | Di Luo et.al. | 2511.16957 | null |
| 2025-11-21 | Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models | Dailan He et.al. | 2511.16955 | null |
| 2025-11-21 | Point-Supervised Facial Expression Spotting with Gaussian-Based Instance-Adaptive Intensity Modeling | Yicheng Deng et.al. | 2511.16952 | null |
| 2025-11-21 | FingerCap: Fine-grained Finger-level Hand Motion Captioning | Xin Shen et.al. | 2511.16951 | null |
| 2025-11-21 | R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios | Lu Zhu et.al. | 2511.16901 | null |
| 2025-11-21 | Avoiding Quality Saturation in UGC Compression Using Denoised References | Xin Xiong et.al. | 2511.16876 | null |
| 2025-11-20 | Vorion: A RISC-V GPU with Hardware-Accelerated 3D Gaussian Rendering and Training | Yipeng Wang et.al. | 2511.16831 | null |
| 2025-11-20 | Generative Augmented Reality: Paradigms, Technologies, and Future Applications | Chen Liang et.al. | 2511.16783 | null |
| 2025-11-20 | Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO | Junhao Cheng et.al. | 2511.16669 | null |
| 2025-11-20 | V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models | Yang Luo et.al. | 2511.16668 | null |
| 2025-11-20 | SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking | Haofeng Liu et.al. | 2511.16618 | null |
| 2025-11-20 | YOWO: You Only Walk Once to Jointly Map An Indoor Scene and Register Ceiling-mounted Cameras | Fan Yang et.al. | 2511.16521 | null |
| 2025-11-20 | An analytical and experimental study of the energy transition discourse on YouTube | Aleix Bassolas et.al. | 2511.16497 | null |
| 2025-11-20 | Flow and Depth Assisted Video Prediction with Latent Transformer | Eliyas Suleyman et.al. | 2511.16484 | null |
| 2025-11-20 | PIPHEN: Physical Interaction Prediction with Hamiltonian Energy Networks | Kewei Chen et.al. | 2511.16200 | null |
| 2025-11-20 | FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos | Jeremie Ochin et.al. | 2511.16183 | null |
| 2025-11-20 | Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight | Yi Yang et.al. | 2511.16175 | null |
| 2025-11-20 | Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning | Yibin Huang et.al. | 2511.16160 | null |
| 2025-11-20 | MagBotSim: Physics-Based Simulation and Reinforcement Learning Environments for Magnetic Robotics | Lara Bergmann et.al. | 2511.16158 | null |
| 2025-11-20 | Degradation-Aware Hierarchical Termination for Blind Quality Enhancement of Compressed Video | Li Yu et.al. | 2511.16137 | null |
| 2025-11-20 | VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation | Chenyang Wu et.al. | 2511.16124 | null |
| 2025-11-20 | Decoupling Complexity from Scale in Latent Diffusion Model | Tianxiong Zhong et.al. | 2511.16117 | null |
| 2025-11-20 | VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning | Zishan Xu et.al. | 2511.16077 | null |
| 2025-11-20 | Panel-by-Panel Souls: A Performative Workflow for Expressive Faces in AI-Assisted Manga Creation | Qing Zhang et.al. | 2511.16038 | null |
| 2025-11-20 | Physically Realistic Sequence-Level Adversarial Clothing for Robust Human-Detection Evasion | Dingkun Zhou et.al. | 2511.16020 | null |
| 2025-11-20 | Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click | Raphael Ruschel et.al. | 2511.15948 | null |
| 2025-11-20 | Automated Interpretable 2D Video Extraction from 3D Echocardiography | Milos Vukadinovic et.al. | 2511.15946 | null |
| 2025-11-19 | RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification | Meilong Xu et.al. | 2511.15923 | null |
| 2025-11-19 | First Frame Is the Place to Go for Video Content Customization | Jingxi Chen et.al. | 2511.15700 | null |
| 2025-11-19 | Joint Semantic-Channel Coding and Modulation for Token Communications | Jingkai Ying et.al. | 2511.15699 | null |
| 2025-11-19 | The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification | Dante Francisco Wasmuht et.al. | 2511.15622 | null |
| 2025-11-19 | Multimodal Evaluation of Russian-language Architectures | Artem Chervyakov et.al. | 2511.15552 | null |
| 2025-11-19 | Deep Learning for Accurate Vision-based Catch Composition in Tropical Tuna Purse Seiners | Xabier Lekunberri et.al. | 2511.15468 | null |
| 2025-11-19 | ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation | Simon Boeder et.al. | 2511.15396 | null |
| 2025-11-19 | PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback | Sirui Chen et.al. | 2511.15253 | null |
| 2025-11-19 | Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation | Firdavs Nasriddinov et.al. | 2511.15159 | null |
| 2025-11-19 | Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks | Cheng Yang et.al. | 2511.15065 | null |
| 2025-11-19 | Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation | Vladimir Arkhipkin et.al. | 2511.14993 | null |
| 2025-11-18 | SVBRD-LLM: Self-Verifying Behavioral Rule Discovery for Autonomous Vehicle Identification | Xiangyu Li et.al. | 2511.14977 | null |
| 2025-11-18 | RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems | Jaro Meyer et.al. | 2511.14948 | null |
| 2025-11-18 | CPSL: Representing Volumetric Video via Content-Promoted Scene Layers | Kaiyuan Hu et.al. | 2511.14927 | null |
| 2025-11-18 | GeoSceneGraph: Geometric Scene Graph Diffusion Model for Text-guided 3D Indoor Scene Synthesis | Antonio Ruiz et.al. | 2511.14884 | null |
| 2025-11-18 | Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising | Yifan Wang et.al. | 2511.14719 | null |
| 2025-11-18 | FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation | Yunfeng Wu et.al. | 2511.14712 | null |
| 2025-11-18 | ForensicFlow: A Tri-Modal Adaptive Network for Robust Deepfake Detection | Mohammad Romani et.al. | 2511.14554 | null |
| 2025-11-18 | DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation | Xiangchen Yin et.al. | 2511.14530 | null |
| 2025-11-18 | FlowRoI A Fast Optical Flow Driven Region of Interest Extraction Framework for High-Throughput Image Compression in Immune Cell Migration Analysis | Xiaowei Xu et.al. | 2511.14419 | null |
| 2025-11-18 | ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries | Junfu Pu et.al. | 2511.14349 | null |
| 2025-11-18 | Dental3R: Geometry-Aware Pairing for Intraoral 3D Reconstruction from Sparse-View Photographs | Yiyi Miao et.al. | 2511.14315 | null |
| 2025-11-18 | Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning | Rui Liu et.al. | 2511.14249 | null |
| 2025-11-18 | Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion | Zhuo Li et.al. | 2511.14178 | null |
| 2025-11-18 | Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning Framework with Vision-Language Models | Hao Zhen et.al. | 2511.14120 | null |
| 2025-11-18 | Real-Time Mobile Video Analytics for Pre-arrival Emergency Medical Services | Liuyi Jin et.al. | 2511.14119 | null |
| 2025-11-18 | A Patient-Independent Neonatal Seizure Prediction Model Using Reduced Montage EEG and ECG | Sithmini Ranasingha et.al. | 2511.14110 | null |
| 2025-11-18 | Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations | Yiqing Shen et.al. | 2511.14100 | null |
| 2025-11-18 | Privis: Towards Content-Aware Secure Volumetric Video Delivery | Kaiyuan Hu et.al. | 2511.14005 | null |
| 2025-11-17 | Learning Skill-Attributes for Transferable Assessment in Video | Kumar Ashutosh et.al. | 2511.13993 | null |
| 2025-11-17 | PoCGM: Poisson-Conditioned Generative Model for Sparse-View CT Reconstruction | Changsheng Fang et.al. | 2511.13967 | null |
| 2025-11-17 | SAE-MCVT: A Real-Time and Scalable Multi-Camera Vehicle Tracking Framework Powered by Edge Computing | Yuqiang Lin et.al. | 2511.13904 | null |
| 2025-11-17 | Temporal Realism Evaluation of Generated Videos Using Compressed-Domain Motion Vectors | Mert Onur Cakiroglu et.al. | 2511.13897 | null |
| 2025-11-17 | Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark | Xinxin Liu et.al. | 2511.13853 | null |
| 2025-11-17 | Segment Anything Across Shots: A Method and Benchmark | Hengrui Hu et.al. | 2511.13715 | null |
| 2025-11-17 | UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity | Junwei Yu et.al. | 2511.13714 | null |
| 2025-11-17 | TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models | Harold Haodong Chen et.al. | 2511.13704 | null |
| 2025-11-17 | Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting | Jiangnan Ye et.al. | 2511.13684 | null |
| 2025-11-17 | CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding | Shrenik Patel et.al. | 2511.13644 | null |
| 2025-11-17 | Computer Vision based group activity detection and action spotting | Narthana Sivalingam et.al. | 2511.13315 | null |
| 2025-11-17 | CorrectAD: A Self-Correcting Agentic System to Improve End-to-end Planning in Autonomous Driving | Enhui Ma et.al. | 2511.13297 | null |
| 2025-11-17 | FoleyBench: A Benchmark For Video-to-Audio Models | Satvik Dixit et.al. | 2511.13219 | null |
| 2025-11-17 | Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification | Rifen Lin et.al. | 2511.13150 | null |
| 2025-11-17 | VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language | Zonghao Ying et.al. | 2511.13127 | null |
| 2025-11-17 | CloseUpShot: Close-up Novel View Synthesis from Sparse-views via Point-conditioned Diffusion Model | Yuqi Zhang et.al. | 2511.13121 | null |
| 2025-11-17 | Semantics and Content Matter: Towards Multi-Prior Hierarchical Mamba for Image Deraining | Zhaocheng Yu et.al. | 2511.13113 | null |
| 2025-11-17 | Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention | Taiye Chen et.al. | 2511.12940 | null |
| 2025-11-17 | Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models | Guoyan Wang et.al. | 2511.12937 | null |
| 2025-11-17 | PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos | Dianbing Xi et.al. | 2511.12935 | null |
| 2025-11-17 | Generative Photographic Control for Scene-Consistent Video Cinematic Editing | Huiqiang Sun et.al. | 2511.12921 | null |
| 2025-11-17 | Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views | Junyi Ma et.al. | 2511.12878 | null |
| 2025-11-17 | Video Finetuning Improves Reasoning Between Frames | Ruiqi Yang et.al. | 2511.12868 | null |
| 2025-11-16 | SAGA: Source Attribution of Generative AI Videos | Rohit Kundu et.al. | 2511.12834 | null |
| 2025-11-16 | Toward Real-world Text Image Forgery Localization: Structured and Interpretable Data Synthesis | Zeqin Yu et.al. | 2511.12658 | null |
| 2025-11-16 | Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data | Yunxin Li et.al. | 2511.12609 | null |
| 2025-11-16 | TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction | Yukuo Ma et.al. | 2511.12578 | null |
| 2025-11-16 | ReaSon: Reinforced Causal Search with Information Bottleneck for Video Understanding | Yuan Zhou et.al. | 2511.12530 | null |
| 2025-11-16 | DualGR: Generative Retrieval with Long and Short-Term Interests Modeling | Zhongchao Yi et.al. | 2511.12518 | null |
| 2025-11-16 | DINO-Detect: A Simple yet Effective Framework for Blur-Robust AI-Generated Image Detection | Jialiang Shen et.al. | 2511.12511 | null |
| 2025-11-16 | VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving | Hyunki Seong et.al. | 2511.12405 | null |
| 2025-11-16 | SynthGuard: An Open Platform for Detecting AI-Generated Multimedia with Multimodal LLMs | Shail Desai et.al. | 2511.12404 | null |
| 2025-11-15 | Fast Reasoning Segmentation for Images and Videos | Yiqing Shen et.al. | 2511.12368 | null |
| 2025-11-15 | Constructing and Interpreting Digital Twin Representations for Visual Reasoning via Reinforcement Learning | Yiqing Shen et.al. | 2511.12365 | null |
| 2025-11-15 | AURA: Development and Validation of an Augmented Unplanned Removal Alert System using Synthetic ICU Videos | Junhyuk Seo et.al. | 2511.12241 | null |
| 2025-11-15 | Cross-View Cross-Modal Unsupervised Domain Adaptation for Driver Monitoring System | Aditi Bhalla et.al. | 2511.12196 | null |
| 2025-11-15 | Towards Obstacle-Avoiding Control of Planar Snake Robots Exploring Neuro-Evolution of Augmenting Topologies | Advik Sinha et.al. | 2511.12148 | null |
| 2025-11-15 | Adaptive Begin-of-Video Tokens for Autoregressive Video Diffusion Models | Tianle Cheng et.al. | 2511.12099 | null |
| 2025-11-15 | Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound | Dengming Zhang et.al. | 2511.12077 | null |
| 2025-11-15 | ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation | Jiahui Sun et.al. | 2511.12072 | null |
| 2025-11-15 | PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling | Sijie Wang et.al. | 2511.12056 | null |
| 2025-11-15 | TIMERIPPLE: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space | Wenxuan Miao et.al. | 2511.12035 | null |
| 2025-11-14 | Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models | Siyou Li et.al. | 2511.11910 | null |
| 2025-11-14 | KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference | Huawei Zhang et.al. | 2511.11907 | null |
| 2025-11-14 | Scalable Policy Evaluation with Video World Models | Wei-Cheng Tseng et.al. | 2511.11520 | null |
| 2025-11-14 | Disentangling Emotional Bases and Transient Fluctuations: A Low-Rank Sparse Decomposition Approach for Video Affective Analysis | Feng-Qi Cui et.al. | 2511.11406 | null |
| 2025-11-14 | YCB-Ev SD: Synthetic event-vision dataset for 6DoF object pose estimation | Pavel Rojtberg et.al. | 2511.11344 | null |
| 2025-11-14 | RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian Splatting | Ruocheng Wu et.al. | 2511.11213 | null |
| 2025-11-14 | VIDEOP2R: Video Understanding from Perception to Reasoning | Yifan Jiang et.al. | 2511.11113 | null |
| 2025-11-14 | LiteAttention: A Temporal Sparse Attention for Diffusion Transformers | Dor Shmilovich et.al. | 2511.11062 | null |
| 2025-11-14 | EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation | Zongyang Qiu et.al. | 2511.11002 | null |
| 2025-11-14 | Dexterous Manipulation Transfer via Progressive Kinematic-Dynamic Alignment | Wenbin Bai et.al. | 2511.10987 | null |
| 2025-11-14 | Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition | Gunho Jung et.al. | 2511.10958 | null |
| 2025-11-14 | Language-Guided Graph Representation Learning for Video Summarization | Wenrui Li et.al. | 2511.10953 | null |
| 2025-11-14 | Short-Window Sliding Learning for Real-Time Violence Detection via LLM-based Auto-Labeling | Seoik Jung et.al. | 2511.10866 | null |
| 2025-11-13 | Expert Consensus-based Video-Based Assessment Tool for Workflow Analysis in Minimally Invasive Colorectal Surgery: Development and Validation of ColoWorkflow | Pooja P Jain et.al. | 2511.10766 | null |
| 2025-11-13 | Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals | Shruti Singh Baghel et.al. | 2511.10615 | null |
| 2025-11-13 | TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding | Jinxuan Li et.al. | 2511.10241 | null |
| 2025-11-13 | Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization | Ashutosh Anshul et.al. | 2511.10212 | null |
| 2025-11-13 | SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition | Qilang Ye et.al. | 2511.10091 | null |
| 2025-11-13 | When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? | Qilang Ye et.al. | 2511.10059 | null |
| 2025-11-13 | Reinforcing Trustworthiness in Multimodal Emotional Support Systems | Huy M. Le et.al. | 2511.10011 | null |
| 2025-11-13 | AHA! Animating Human Avatars in Diverse Scenes with Gaussian Splatting | Aymen Mir et.al. | 2511.09827 | null |
| 2025-11-12 | Density Estimation and Crowd Counting | Balachandra Devarangadi Sunil et.al. | 2511.09723 | null |
| 2025-11-12 | PriVi: Towards A General-Purpose Video Model For Primate Behavior In The Wild | Felix B. Mueller et.al. | 2511.09675 | null |
| 2025-11-12 | TempRetinex: Retinex-based Unsupervised Enhancement for Low-light Video Under Diverse Lighting Conditions | Yini Li et.al. | 2511.09609 | null |
| 2025-11-12 | Bridging the Data Gap: Spatially Conditioned Diffusion Model for Anomaly Generation in Photovoltaic Electroluminescence Images | Shiva Hanifi et.al. | 2511.09604 | null |
| 2025-11-12 | Diffusion-Based Quality Control of Medical Image Segmentations across Organs | Vincenzo Marcianò et.al. | 2511.09588 | null |
| 2025-11-12 | Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation | Xinyi Tong et.al. | 2511.09585 | null |
| 2025-11-12 | SPIDER: Scalable Physics-Informed Dexterous Retargeting | Chaoyi Pan et.al. | 2511.09484 | null |
| 2025-11-12 | MCAD: Multimodal Context-Aware Audio Description Generation For Soccer | Lipisha Chaudhary et.al. | 2511.09448 | null |
| 2025-11-12 | A cross-modal pre-training framework with video data for improving performance and generalization of distributed acoustic sensing | Junyi Duan et.al. | 2511.09342 | null |
| 2025-11-12 | GRACE: Designing Generative Face Video Codec via Agile Hardware-Centric Workflow | Rui Wan et.al. | 2511.09272 | null |
| 2025-11-12 | Unveiling the Impact of Data and Model Scaling on High-Level Control for Humanoid Robots | Yuxi Wei et.al. | 2511.09241 | null |
| 2025-11-12 | AILINKPREVIEWER: Enhancing Code Reviews with LLM-Powered Link Previews | Panya Trakoolgerntong et.al. | 2511.09223 | null |
| 2025-11-12 | DBINDS -- Can Initial Noise from Diffusion Model Inversion Help Reveal AI-Generated Videos? | Yanlin Wu et.al. | 2511.09184 | null |
| 2025-11-10 | Robot Learning from a Physical World Model | Jiageng Mao et.al. | 2511.07416 | null |
| 2025-11-10 | StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation | Tianrui Feng et.al. | 2511.07399 | null |
| 2025-11-10 | Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation | Jie Du et.al. | 2511.01450 | null |
| 2025-11-09 | GenAI vs. Human Creators: Procurement Mechanism Design in Two-/Three-Layer Markets | Rui Ai et.al. | 2511.06559 | null |
| 2025-11-09 | RelightMaster: Precise Video Relighting with Multi-plane Light Images | Weikang Bian et.al. | 2511.06271 | null |
| 2025-11-08 | Neodragon: Mobile Video Generation using Diffusion Transformer | Animesh Karnewar et.al. | 2511.06055 | null |
| 2025-11-07 | THEval. Evaluation Framework for Talking Head Video Generation | Nabyl Quignon et.al. | 2511.04520 | null |
| 2025-11-06 | InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation | Jinlai Liu et.al. | 2511.04675 | null |
| 2025-11-06 | Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm | Jingqi Tong et.al. | 2511.04570 | null |
| 2025-11-06 | RISE-T2V: Rephrasing and Injecting Semantics with LLM for Expansive Text-to-Video Generation | Xiangjun Zhang et.al. | 2511.04317 | null |
| 2025-11-06 | PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection | Peiyao Wang et.al. | 2511.03997 | null |
| 2025-11-05 | UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions | Guozhen Zhang et.al. | 2511.03334 | null |
| 2025-11-05 | Unified Long Video Inpainting and Outpainting via Overlapping High-Order Co-Denoising | Shuangquan Lyu et.al. | 2511.03272 | null |
| 2025-11-04 | Video Text Preservation with Synthetic Text-Rich Videos | Ziyang Liu et.al. | 2511.05573 | null |
| 2025-11-04 | ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation | Panwang Pan et.al. | 2511.00511 | null |
| 2025-11-03 | How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment | Zhen Chen et.al. | 2511.01775 | null |
| 2025-11-03 | Driving scenario generation and evaluation using a structured layer representation and foundational models | Arthur Hubert et.al. | 2511.01541 | null |
| 2025-11-03 | Towards One-step Causal Video Generation via Adversarial Self-Distillation | Yongqi Yang et.al. | 2511.01419 | null |
| 2025-11-03 | MotionStream: Real-Time Video Generation with Interactive Motion Controls | Joonghyuk Shin et.al. | 2511.01266 | null |
| 2025-11-01 | Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models | Panwang Pan et.al. | 2511.00503 | null |
| 2025-10-31 | Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals | Xiangyu Fan et.al. | 2510.27684 | null |
| 2025-10-31 | Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V | Meftun Akarsu et.al. | 2510.27364 | null |
| 2025-10-31 | DANCER: Dance ANimation via Condition Enhancement and Rendering with diffusion model | Yucheng Xing et.al. | 2510.27169 | null |
| 2025-10-31 | Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark | Ziyu Guo et.al. | 2510.26802 | null |
| 2025-10-30 | AI Powered High Quality Text to Video Generation with Enhanced Temporal Consistency | Piyushkumar Patel et.al. | 2511.00107 | null |
| 2025-10-30 | LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation | Huanlin Gao et.al. | 2511.00090 | null |
| 2025-10-30 | SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting | Dongyue Lu et.al. | 2510.26796 | null |
| 2025-10-30 | The Quest for Generalizable Motion Generation: Data, Model, and Evaluation | Jing Lin et.al. | 2510.26794 | null |
| 2025-10-30 | Co-Evolving Latent Action World Models | Yucen Wang et.al. | 2510.26433 | null |
| 2025-10-30 | LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation | Xiangqing Zheng et.al. | 2510.26412 | null |
| 2025-10-29 | VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning | Baolu Li et.al. | 2510.25772 | null |
| 2025-10-29 | VC4VG: Optimizing Video Captions for Text-to-Video Generation | Yang Du et.al. | 2510.24134 | null |
| 2025-10-28 | World Simulation with Video Foundation Models for Physical AI | NVIDIA et.al. | 2511.00062 | null |
| 2025-10-28 | VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos | Qiucheng Wu et.al. | 2510.24904 | null |
| 2025-10-28 | Generative View Stitching | Chonghyuk Song et.al. | 2510.24718 | null |
| 2025-10-28 | Uniform Discrete Diffusion with Metric Path for Video Generation | Haoge Deng et.al. | 2510.24717 | null |
| 2025-10-28 | MC-SJD: Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration | Junhyuk So et.al. | 2510.24211 | null |
| 2025-10-28 | LongCat-Video Technical Report | Meituan LongCat Team et.al. | 2510.22200 | null |
| 2025-10-27 | CoMo: Compositional Motion Customization for Text-to-Video Generation | Youcan Xu et.al. | 2510.23007 | null |
| 2025-10-27 | Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method | Bohan Li et.al. | 2510.22973 | null |
| 2025-10-26 | MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control | Fatemeh Nazarieh et.al. | 2510.22810 | null |
| 2025-10-25 | Hollywood Town: Long-Video Generation via Cross-Modal Multi-Agent Orchestration | Zheng Wei et.al. | 2510.22431 | null |
| 2025-10-24 | Two-Steps Diffusion Policy for Robotic Manipulation via Genetic Denoising | Mateo Clemente et.al. | 2510.21991 | null |
| 2025-10-24 | BachVid: Training-Free Video Generation with Consistent Background and Character | Han Yan et.al. | 2510.21696 | null |
| 2025-10-24 | Epipolar Geometry Improves Video Generation Models | Orest Kupyn et.al. | 2510.21615 | null |
| 2025-10-24 | OmniNWM: Omniscient Driving Navigation World Models | Bohan Li et.al. | 2510.18313 | null |
| 2025-10-23 | Generative AI in Depth: A Survey of Recent Advances, Model Variants, and Real-World Applications | Shamim Yazdani et.al. | 2510.21887 | null |
| 2025-10-23 | Video-As-Prompt: Unified Semantic Control for Video Generation | Yuxuan Bian et.al. | 2510.20888 | null |
| 2025-10-23 | Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers | Dean L Slack et.al. | 2510.20807 | null |
| 2025-10-23 | RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling | Bingjie Gao et.al. | 2510.20206 | null |
| 2025-10-23 | Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories | Aaron Appelle et.al. | 2510.20182 | null |
| 2025-10-23 | Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning | Takehiro Aoshima et.al. | 2510.19193 | null |
| 2025-10-23 | A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition | Peiqin Zhuang et.al. | 2510.18705 | null |
| 2025-10-22 | Improving the Physics of Video Generation with VJEPA-2 Reward Signal | Jianhao Yuan et.al. | 2510.21840 | null |
| 2025-10-22 | A new wave of vehicle insurance fraud fueled by generative AI | Amir Hever et.al. | 2510.19957 | null |
| 2025-10-22 | PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis | Qing Mao et.al. | 2510.19527 | null |
| 2025-10-22 | GigaBrain-0: A World Model-Powered Vision-Language-Action Model | GigaBrain Team et.al. | 2510.19430 | null |
| 2025-10-22 | FeatureFool: Zero-Query Fooling of Video Models via Feature Map | Duoxun Tang et.al. | 2510.18362 | null |
| 2025-10-22 | MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models | Yongshun Zhang et.al. | 2510.17519 | null |
| 2025-10-22 | ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints | Meiqi Wu et.al. | 2510.14847 | null |
| 2025-10-21 | MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models | Aritra Bhowmik et.al. | 2510.19022 | null |
| 2025-10-21 | UltraGen: High-Resolution Video Generation with Hierarchical Attention | Teng Hu et.al. | 2510.18775 | null |
| 2025-10-21 | MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation | Weinan Jia et.al. | 2510.18692 | null |
| 2025-10-21 | Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model | Zhenxing Zhang et.al. | 2510.18573 | null |
| 2025-10-20 | World-in-World: World Models in a Closed-Loop World | Jiahan Zhang et.al. | 2510.18135 | null |
| 2025-10-20 | Demystifying Transition Matching: When and Why It Can Beat Flow Matching | Jaihoon Kim et.al. | 2510.17991 | null |
| 2025-10-20 | From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models | Zefan Cai et.al. | 2510.17247 | null |
| 2025-10-20 | DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion | Weijie Wang et.al. | 2510.15264 | null |
| 2025-10-20 | Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization | Liao Shen et.al. | 2510.14255 | null |
| 2025-10-19 | An empirical study of the effect of video encoders on Temporal Video Grounding | Ignacio M. De la Jara et.al. | 2510.17007 | null |
| 2025-10-19 | From Mannequin to Human: A Pose-Aware and Identity-Preserving Video Generation Framework for Lifelike Clothing Display | Xiangyu Mu et.al. | 2510.16833 | null |
| 2025-10-19 | STANCE: Motion Coherent Video Generation Via Sparse-to-Dense Anchored Encoding | Zhifei Chen et.al. | 2510.14588 | null |
| 2025-10-17 | VISTA: A Test-Time Self-Improving Video Generation Agent | Do Xuan Long et.al. | 2510.15831 | null |
| 2025-10-17 | Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset | Qingyan Bai et.al. | 2510.15742 | null |
| 2025-10-17 | Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement Learning | Xiangyu Meng et.al. | 2510.14256 | null |
| 2025-10-17 | Ctrl-VI: Controllable Video Synthesis via Variational Inference | Haoyi Duan et.al. | 2510.07670 | null |
| 2025-10-16 | TGT: Text-Grounded Trajectories for Locally Controlled Video Generation | Guofeng Zhang et.al. | 2510.15104 | null |
| 2025-10-16 | RealDPO: Real or Not Real, that is the Preference | Guo Cheng et.al. | 2510.14955 | null |
| 2025-10-16 | DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation | Yu Zhou et.al. | 2510.14949 | null |
| 2025-10-16 | 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation | JoungBin Lee et.al. | 2510.14945 | null |
| 2025-10-16 | In-Context Learning with Unpaired Clips for Instruction-based Video Editing | Xinyao Liao et.al. | 2510.14648 | null |
| 2025-10-16 | Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures | Yuancheng Xu et.al. | 2510.14179 | null |
| 2025-10-15 | PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning | Sihui Ji et.al. | 2510.13809 | null |
| 2025-10-15 | CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas | Zian Li et.al. | 2510.13669 | null |
| 2025-10-15 | VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator | Hyojun Go et.al. | 2510.13454 | null |
| 2025-10-15 | Counting Hallucinations in Diffusion Models | Shuai Fu et.al. | 2510.13080 | null |
| 2025-10-14 | SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models | Zhengxu Tang et.al. | 2510.13042 | null |
| 2025-10-14 | MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars | Felix Taubner et.al. | 2510.12785 | null |
| 2025-10-14 | Time-Correlated Video Bridge Matching | Viacheslav Vasilev et.al. | 2510.12453 | null |
| 2025-10-14 | BIGFix: Bidirectional Image Generation with Token Fixing | Victor Besnier et.al. | 2510.12231 | null |
| 2025-10-14 | Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback | Xingpei Ma et.al. | 2510.12089 | null |
| 2025-10-13 | Point Prompting: Counterfactual Tracking with Video Diffusion Models | Ayush Shrivastava et.al. | 2510.11715 | null |
| 2025-10-13 | MoMaps: Semantics-Aware Scene Motion Generation with Motion Maps | Jiahui Lei et.al. | 2510.11107 | null |
| 2025-10-13 | Q-Router: Agentic Video Quality Assessment with Expert Model Routing and Artifact Localization | Shuo Xing et.al. | 2510.08789 | null |
| 2025-10-12 | AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes | Yu Li et.al. | 2510.10670 | null |
| 2025-10-12 | DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis | Peiyin Chen et.al. | 2510.10650 | null |
| 2025-10-11 | EditCast3D: Single-Frame-Guided 3D Editing with Video Propagation and View Selection | Huaizhi Qu et.al. | 2510.13652 | null |
| 2025-10-11 | MultiCOIN: Multi-Modal COntrollable Video INbetweening | Maham Tanveer et.al. | 2510.08561 | null |
| 2025-10-10 | Stable Video Infinity: Infinite-Length Video Generation with Error Recycling | Wuyang Li et.al. | 2510.09212 | null |
| 2025-10-10 | MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling | Qian Wang et.al. | 2508.08487 | null |
| 2025-10-09 | SkipSR: Faster Super Resolution with Token Skipping | Rohan Choudhury et.al. | 2510.08799 | null |
| 2025-10-09 | NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos | Hongyu Li et.al. | 2510.08568 | null |
| 2025-10-09 | VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning | Minghong Cai et.al. | 2510.08555 | null |
| 2025-10-09 | X2Video: Adapting Diffusion Models for Multimodal Controllable Neural Video Rendering | Zhitong Huang et.al. | 2510.08530 | null |
| 2025-10-09 | FlexTraj: Image-to-Video Generation with Flexible Point Trajectory Control | Zhiyuan Zhang et.al. | 2510.08527 | null |
| 2025-10-09 | UniVideo: Unified Understanding, Generation, and Editing for Videos | Cong Wei et.al. | 2510.08377 | null |
| 2025-10-09 | LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation | Yushi Huang et.al. | 2510.08318 | null |
| 2025-10-09 | UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution | Shian Du et.al. | 2510.08143 | null |
| 2025-10-09 | Real-Time Motion-Controllable Autoregressive Video Diffusion | Kesen Zhao et.al. | 2510.08131 | null |
| 2025-10-09 | CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving | Tianrui Zhang et.al. | 2510.07944 | null |
| 2025-10-09 | TTOM: Test-Time Optimization and Memorization for Compositional Video Generation | Leigang Qu et.al. | 2510.07940 | null |
| 2025-10-09 | Once Is Enough: Lightweight DiT-Based Video Virtual Try-On via One-Time Garment Appearance Injection | Yanjie Pan et.al. | 2510.07654 | null |
| 2025-10-09 | Paper2Video: Automatic Video Generation from Scientific Papers | Zeyu Zhu et.al. | 2510.05096 | null |
| 2025-10-08 | TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility | Saman Motamed et.al. | 2510.07550 | null |
| 2025-10-08 | DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis | Nithin C. Babu et.al. | 2510.07441 | null |
| 2025-10-08 | WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation | Zezhong Qian et.al. | 2510.07313 | null |
| 2025-10-08 | MATRIX: Mask Track Alignment for Interaction-aware Video Generation | Siyoon Jin et.al. | 2510.07310 | null |
| 2025-10-08 | TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation | Jiaben Chen et.al. | 2510.07249 | null |
| 2025-10-08 | MV-Performer: Taming Video Diffusion Model for Faithful and Synchronized Multi-view Performer Synthesis | Yihao Zhi et.al. | 2510.07190 | null |
| 2025-10-08 | Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report | Riccardo Mereu et.al. | 2510.07092 | null |
| 2025-10-08 | Addressing the ID-Matching Challenge in Long Video Captioning | Zhantao Yang et.al. | 2510.06973 | null |
| 2025-10-07 | Drive&Gen: Co-Evaluating End-to-End Driving and Video Generation Models | Jiahao Wang et.al. | 2510.06209 | null |
| 2025-10-07 | When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach | Daniel Gonzálbez-Biosca et.al. | 2510.05661 | null |
| 2025-10-06 | LightCache: Memory-Efficient, Training-Free Acceleration for Video Generation | Yang Xiao et.al. | 2510.05367 | null |
| 2025-10-06 | VChain: Chain-of-Visual-Thought for Reasoning in Video Generation | Ziqi Huang et.al. | 2510.05094 | null |
| 2025-10-06 | Character Mixing for Video Generation | Tingting Liao et.al. | 2510.05093 | null |
| 2025-10-06 | Bridging Text and Video Generation: A Survey | Nilay Kumar et.al. | 2510.04999 | null |
| 2025-10-06 | What Drives Compositional Generalization in Visual Generative Models? | Karim Farid et.al. | 2510.03075 | null |
| 2025-10-05 | ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation | Jay Zhangjie Wu et.al. | 2510.04290 | null |
| 2025-10-05 | Let Features Decide Their Own Solvers: Hybrid Feature Caching for Diffusion Transformers | Shikang Zheng et.al. | 2510.04188 | null |
| 2025-10-04 | Generating Human Motion Videos using a Cascaded Text-to-Video Framework | Hyelin Nam et.al. | 2510.03909 | null |
| 2025-10-03 | Mask2IV: Interaction-Centric Video Generation via Mask Trajectories | Gen Li et.al. | 2510.03135 | null |
| 2025-10-03 | Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction | Kaisi Guan et.al. | 2510.03117 | null |
| 2025-10-03 | When and Where do Events Switch in Multi-Event Video Generation? | Ruotong Liao et.al. | 2510.03049 | null |
| 2025-10-03 | Pack and Force Your Memory: Long-form and Consistent Video Generation | Xiaofei Wu et.al. | 2510.01784 | null |
| 2025-10-02 | Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation | Beijia Lu et.al. | 2510.02617 | null |
| 2025-10-02 | How Confident are Video Models? Empowering Video Models to Express their Uncertainty | Zhiting Mei et.al. | 2510.02571 | null |
| 2025-10-02 | Inferring Dynamic Physical Properties from Video Foundation Models | Guanqi Zhan et.al. | 2510.02311 | null |
| 2025-10-02 | MultiModal Action Conditioned Video Generation | Yichen Li et.al. | 2510.02287 | null |
| 2025-10-02 | Learning to Generate Object Interactions with Physics-Guided Video Diffusion | David Romero et.al. | 2510.02284 | null |
| 2025-10-02 | Self-Forcing++: Towards Minute-Scale High-Quality Video Generation | Justin Cui et.al. | 2510.02283 | null |
| 2025-10-02 | TempoControl: Temporal Attention Guidance for Text-to-Video Models | Shira Schiber et.al. | 2510.02226 | null |
| 2025-10-02 | Multi-marginal temporal Schrödinger Bridge Matching for video generation from unpaired data | Thomas Gravier et.al. | 2510.01894 | null |
| 2025-10-01 | IMAGEdit: Let Any Subject Transform | Fei Shen et.al. | 2510.01186 | null |
| 2025-10-01 | EvoWorld: Evolving Panoramic World Generation with Explicit 3D Memory | Jiahao Wang et.al. | 2510.01183 | null |
| 2025-10-01 | Code2Video: A Code-centric Paradigm for Educational Video Generation | Yanzhe Chen et.al. | 2510.01174 | null |
| 2025-10-01 | From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation | Fan Yang et.al. | 2510.00806 | null |
| 2025-10-01 | Arbitrary Generative Video Interpolation | Guozhen Zhang et.al. | 2510.00578 | null |
| 2025-10-01 | BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration | Zhaoyang Li et.al. | 2510.00438 | null |
| 2025-09-30 | Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation | Chetwin Low et.al. | 2510.01284 | null |
| 2025-09-30 | Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation | Agneet Chatterjee et.al. | 2509.26555 | null |
| 2025-09-30 | MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation | Chenhui Zhu et.al. | 2509.26391 | null |
| 2025-09-30 | PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution | Shian Du et.al. | 2509.26025 | null |
| 2025-09-30 | Wan-Alpha: High-Quality Text-to-Video Generation with Alpha Channel | Haotian Dong et.al. | 2509.24979 | null |
| 2025-09-30 | QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification | Weilun Feng et.al. | 2509.23681 | null |
| 2025-09-29 | FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation | Yunyang Ge et.al. | 2509.25187 | null |
| 2025-09-29 | DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder | Junyu Chen et.al. | 2509.25182 | null |
| 2025-09-29 | Rolling Forcing: Autoregressive Long Video Diffusion in Real Time | Kunhao Liu et.al. | 2509.25161 | null |
| 2025-09-29 | PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion | Yuyang Yin et.al. | 2509.24997 | null |
| 2025-09-29 | SDPose: Exploiting Diffusion Priors for Out-of-Domain and Robust Pose Estimation | Shuang Liang et.al. | 2509.24980 | null |
| 2025-09-29 | Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer | Mohsen Ghafoorian et.al. | 2509.24899 | null |
| 2025-09-29 | Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility | Yutong Hao et.al. | 2509.24702 | null |
| 2025-09-29 | SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer | Junsong Chen et.al. | 2509.24695 | null |
| 2025-09-29 | Learning Object-Centric Representations Based on Slots in Real World Scenarios | Adil Kaan Akan et.al. | 2509.24652 | null |
| 2025-09-29 | UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark | Ailing Zhang et.al. | 2509.24427 | null |
| 2025-09-29 | CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers | Kai Liu et.al. | 2509.24416 | null |
| 2025-09-29 | NeRV-Diffusion: Diffuse Implicit Neural Representations for Video Synthesis | Yixuan Ren et.al. | 2509.24353 | null |
| 2025-09-29 | FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation | Seungwook Kim et.al. | 2509.24241 | null |
| 2025-09-28 | Autoregressive Video Generation beyond Next Frames Prediction | Sucheng Ren et.al. | 2509.24081 | null |
| 2025-09-28 | SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention | Jintao Zhang et.al. | 2509.24006 | null |
| 2025-09-28 | VividFace: High-Quality and Efficient One-Step Diffusion For Video Face Enhancement | Shulian Zhang et.al. | 2509.23584 | null |
| 2025-09-27 | Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing | Rohit Chowdhury et.al. | 2509.23279 | null |
| 2025-09-27 | Sparse2Dense: A Keypoint-driven Generative Framework for Human Video Compression and Vertex Prediction | Bolin Chen et.al. | 2509.23169 | null |
| 2025-09-26 | Physically Plausible Multi-System Trajectory Generation and Symmetry Discovery | Jiayin Liu et.al. | 2509.23003 | null |
| 2025-09-26 | VideoScore2: Think before You Score in Generative Video Evaluation | Xuan He et.al. | 2509.22799 | null |
| 2025-09-26 | Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs | Xingyu Fu et.al. | 2509.22646 | null |
| 2025-09-26 | LongLive: Real-time Interactive Long Video Generation | Shuai Yang et.al. | 2509.22622 | null |
| 2025-09-26 | EgoDemoGen: Novel Egocentric Demonstration Generation Enables Viewpoint-Robust Manipulation | Yuan Xu et.al. | 2509.22578 | null |
| 2025-09-26 | EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer | Zhehao Dong et.al. | 2509.22407 | null |
| 2025-09-26 | Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers | Jibin Song et.al. | 2509.21893 | null |
| 2025-09-26 | DiTraj: training-free trajectory control for video diffusion transformer | Cheng Lei et.al. | 2509.21839 | null |
| 2025-09-26 | MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation | Yu Shang et.al. | 2509.21797 | null |
| 2025-09-26 | LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE | Yu Shang et.al. | 2509.21790 | null |
| 2025-09-26 | UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models | Lan Chen et.al. | 2509.21760 | null |
| 2025-09-25 | FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction | Yixiang Dai et.al. | 2509.21657 | null |
| 2025-09-25 | What Happens Next? Anticipating Future Motion by Generating Point Trajectories | Gabrijel Boduljak et.al. | 2509.21592 | null |
| 2025-09-25 | ControlHair: Physically-based Video Diffusion for Controllable Dynamic Hair Rendering | Weikai Lin et.al. | 2509.21541 | null |
| 2025-09-25 | NewtonGen: Physics-Consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics | Yu Yuan et.al. | 2509.21309 | null |
| 2025-09-25 | MotionFlow: Learning Implicit Motion Flow for Complex Camera Trajectory Control in Video Generation | Guojun Lei et.al. | 2509.21119 | null |
| 2025-09-25 | EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning | Xuan Ju et.al. | 2509.20360 | null |
| 2025-09-24 | PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation | Chen Wang et.al. | 2509.20358 | null |
| 2025-09-24 | 4D Driving Scene Generation With Stereo Forcing | Hao Lu et.al. | 2509.20251 | null |
| 2025-09-24 | CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion | Chenhao Ji et.al. | 2509.19979 | null |
| 2025-09-24 | OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling | Yang Zhou et.al. | 2509.12201 | null |
| 2025-09-23 | Text Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters | Pin-Yen Chiu et.al. | 2509.18831 | null |
| 2025-09-22 | VideoFrom3D: 3D Scene Video Generation via Complementary Image and Video Diffusion Models | Geonung Kim et.al. | 2509.17985 | null |
| 2025-09-22 | I2VWM: Robust Watermarking for Image to Video Generation | Guanjie Wang et.al. | 2509.17773 | null |
| 2025-09-21 | Echo-Path: Pathology-Conditioned Echo Video Generation | Kabir Hamzah Muhammad et.al. | 2509.17190 | null |
| 2025-09-21 | VidCLearn: A Continual Learning Approach for Text-to-Video Generation | Luca Zanchetta et.al. | 2509.16956 | null |
| 2025-09-21 | | Yuanzhi Li et.al. | 2509.16873 | null |
| 2025-09-20 | RLGF: Reinforcement Learning with Geometric Feedback for Autonomous Driving Video Generation | Tianyi Yan et.al. | 2509.16500 | null |
| 2025-09-19 | Lynx: Towards High-Fidelity Personalized Video Generation | Shen Sang et.al. | 2509.15496 | null |
| 2025-09-19 | AToken: A Unified Tokenizer for Vision | Jiasen Lu et.al. | 2509.14476 | null |
| 2025-09-18 | OpenViGA: Video Generation for Automotive Driving Scenes by Streamlining and Fine-Tuning Open Source Models with Public Data | Björn Möller et.al. | 2509.15479 | null |
| 2025-09-18 | RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation | Yuming Jiang et.al. | 2509.15212 | null |
| 2025-09-18 | WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance | Chenxi Song et.al. | 2509.15130 | null |
| 2025-09-18 | DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images | Kazuma Nagata et.al. | 2509.14685 | null |
| 2025-09-18 | BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching | Hanshuai Cui et.al. | 2509.13789 | null |
| 2025-09-17 | PhysicalAgent: Towards General Cognitive Robotics with Foundation World Models | Artem Lykov et.al. | 2509.13903 | null |
| 2025-09-17 | TeraSim-World: Worldwide Safety-Critical Data Synthesis for End-to-End Autonomous Driving | Jiawei Wang et.al. | 2509.13164 | null |
| 2025-09-17 | Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis | Yikang Ding et.al. | 2509.09595 | null |
| 2025-09-16 | Gen2Real: Towards Demo-Free Dexterous Manipulation by Harnessing Generated Video | Kai Ye et.al. | 2509.14178 | null |
| 2025-09-16 | BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models | Yuming Li et.al. | 2509.06040 | null |
| 2025-09-15 | AvatarSync: Rethinking Talking-Head Animation through Autoregressive Perspective | Yuchen Deng et.al. | 2509.12052 | null |
| 2025-09-15 | SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching | Jiacheng Liu et.al. | 2509.11628 | null |
| 2025-09-15 | MVQA-68K: A Multi-dimensional and Causally-annotated Dataset with Quality Interpretability for Video Assessment | Yanyun Pu et.al. | 2509.11589 | null |
| 2025-09-14 | VideoAgent: Personalized Synthesis of Scientific Videos | Xiao Liang et.al. | 2509.11253 | null |
| 2025-09-14 | PanoLora: Bridging Perspective and Panoramic Video Generation with LoRA Adaptation | Zeyu Dong et.al. | 2509.11092 | null |
| 2025-09-12 | Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation | Hao Zhang et.al. | 2509.10687 | null |
| 2025-09-12 | T2Bs: Text-to-Character Blendshapes via Video Generation | Jiahao Luo et.al. | 2509.10678 | null |
| 2025-09-12 | Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching | Zhixin Zheng et.al. | 2509.10312 | null |
| 2025-09-11 | Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders | Dohun Lee et.al. | 2509.09547 | null |
| 2025-09-11 | Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training | Ruicheng Zhang et.al. | 2509.06723 | null |
| 2025-09-10 | RewardDance: Reward Scaling in Visual Generation | Jie Wu et.al. | 2509.08826 | null |
| 2025-09-10 | GeneVA: A Dataset of Human Annotations for Generative Text to Video Artifacts | Jenna Kang et.al. | 2509.08818 | null |
| 2025-09-10 | HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning | Liyang Chen et.al. | 2509.08519 | null |
| 2025-09-09 | ANYPORTAL: Zero-Shot Consistent Video Background Replacement | Wenshuo Gao et.al. | 2509.07472 | null |
| 2025-09-09 | Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching | Feng Wang et.al. | 2509.05952 | null |
| 2025-09-09 | Attention of a Kiss: Exploring Attention Maps in Video Diffusion for XAIxArts | Adam Cole et.al. | 2509.05323 | null |
| 2025-09-07 | UniVerse-1: Unified Audio-Video Generation via Stitching of Experts | Duomin Wang et.al. | 2509.06155 | null |
| 2025-09-04 | Virtual Fitting Room: Generating Arbitrarily Long Videos of Virtual Try-On from a Single Image -- Technical Preview | Jun-Kun Chen et.al. | 2509.04450 | null |
| 2025-09-04 | Human Motion Video Generation: A Survey | Haiwei Xue et.al. | 2509.03883 | null |
| 2025-09-03 | CompSlider: Compositional Slider for Disentangled Multiple-Attribute Image Generation | Zixin Zhu et.al. | 2509.01028 | null |
| 2025-09-01 | Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement | Jiayi Gao et.al. | 2509.01362 | null |
| 2025-09-01 | Communicative Agents for Slideshow Storytelling Video Generation based on LLMs | Jingxing Fan et.al. | 2509.01277 | null |
| 2025-09-01 | FantasyHSI: Video-Generation-Centric 4D Human Synthesis In Any Scene through A Graph-based Multi-Agent Framework | Lingzhou Mu et.al. | 2509.01232 | null |
| 2025-08-30 | DevilSight: Augmenting Monocular Human Avatar Reconstruction through a Virtual Perspective | Yushuo Chen et.al. | 2509.00403 | null |
| 2025-08-28 | Mixture of Contexts for Long Video Generation | Shengqu Cai et.al. | 2508.21058 | null |
| 2025-08-28 | POSE: Phased One-Step Adversarial Equilibrium for Video Diffusion Models | Jiaxiang Cheng et.al. | 2508.21019 | null |
| 2025-08-28 | Learning Primitive Embodied World Models: Towards Scalable Robotic Learning | Qiao Sun et.al. | 2508.20840 | null |
| 2025-08-28 | Realistic and Controllable 3D Gaussian-Guided Object Editing for Driving Video Generation | Jiusi Li et.al. | 2508.20471 | null |
| 2025-08-28 | Ego-centric Predictive Model Conditioned on Hand Trajectories | Binjie Zhang et.al. | 2508.19852 | null |
| 2025-08-28 | MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation | Ming Chen et.al. | 2508.19320 | null |
| 2025-08-27 | ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion | Xurui Peng et.al. | 2508.21091 | null |
| 2025-08-26 | ROSE: Remove Objects with Side Effects in Videos | Chenxuan Miao et.al. | 2508.18633 | null |
| 2025-08-26 | Wan-S2V: Audio-Driven Cinematic Video Generation | Xin Gao et.al. | 2508.18621 | null |
| 2025-08-26 | Waver: Wave Your Way to Lifelike Video Generation | Yifu Zhang et.al. | 2508.15761 | null |
| 2025-08-25 | SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling | Fanjiang Ye et.al. | 2508.17756 | null |
| 2025-08-25 | OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models | Huanpeng Chu et.al. | 2508.16212 | null |
| 2025-08-24 | A Synthetic Dataset for Manometry Recognition in Robotic Applications | Pedro Antonio Rabelo Saraiva et.al. | 2508.17468 | null |
| 2025-08-24 | MoCo: Motion-Consistent Human Video Generation via Structure-Appearance Decoupling | Haoyu Wang et.al. | 2508.17404 | null |
| 2025-08-24 | DiCache: Let Diffusion Model Determine Its Own Cache | Jiazi Bu et.al. | 2508.17356 | null |
| 2025-08-23 | SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation | Peng Hu et.al. | 2508.17062 | null |
| 2025-08-23 | HiCache: Training-free Acceleration of Diffusion Models via Hermite Polynomial-based Feature Caching | Liang Feng et.al. | 2508.16984 | null |
| 2025-08-23 | HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation | Sizhe Shan et.al. | 2508.16930 | null |
| 2025-08-22 | Seeing Clearly, Forgetting Deeply: Revisiting Fine-Tuned Video Generators for Driving Simulation | Chun-Peng Chang et.al. | 2508.16512 | null |
| 2025-08-22 | Forecast then Calibrate: Feature Caching as ODE for Efficient Diffusion Transformers | Shikang Zheng et.al. | 2508.16211 | null |
| 2025-08-21 | Spatial Policy: Guiding Visuomotor Robotic Manipulation with Spatial-Aware Modeling and Reasoning | Yijun Liu et.al. | 2508.15874 | null |
| 2025-08-21 | CineScale: Free Lunch in High-Resolution Cinematic Visual Generation | Haonan Qiu et.al. | 2508.15774 | null |
| 2025-08-21 | Scaling Group Inference for Diverse and High-Quality Generation | Gaurav Parmar et.al. | 2508.15773 | null |
| 2025-08-21 | WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception | Zhiheng Liu et.al. | 2508.15720 | null |
| 2025-08-21 | TiP4GEN: Text to Immersive Panorama 4D Scene Generation | Ke Xing et.al. | 2508.12415 | null |
| 2025-08-20 | DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing | Weitao Wang et.al. | 2508.14465 | null |
| 2025-08-20 | MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation | Guile Wu et.al. | 2508.14327 | null |
| 2025-08-19 | xDiff: Online Diffusion Model for Collaborative Inter-Cell Interference Management in 5G O-RAN | Peihao Yan et.al. | 2508.15843 | null |
| 2025-08-19 | InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing | Shaoshu Yang et.al. | 2508.14033 | null |
| 2025-08-19 | Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment | Samuel Seligardi et.al. | 2508.13989 | null |
| 2025-08-18 | 4DNeX: Feed-Forward 4D Generative Modeling Made Easy | Zhaoxi Chen et.al. | 2508.13154 | null |
| 2025-08-18 | Precise Action-to-Video Generation Through Visual Action Prompts | Yuang Wang et.al. | 2508.13104 | null |
| 2025-08-18 | EgoTwin: Dreaming Body and View in First Person | Jingqiao Xiu et.al. | 2508.13013 | null |
| 2025-08-18 | Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model | Xianglong He et.al. | 2508.13009 | null |
| 2025-08-18 | Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation | Qirui Li et.al. | 2508.12969 | null |
| 2025-08-18 | Lumen: Consistent Video Relighting and Harmonious Background Replacement with Video Generative Models | Jianshu Zeng et.al. | 2508.12945 | null |
| 2025-08-18 | S^2-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models | Chubin Chen et.al. | 2508.12880 | null |
| 2025-08-18 | E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model | Ronghao Lin et.al. | 2508.12854 | null |
| 2025-08-18 | MixCache: Mixture-of-Cache for Video Diffusion Transformer Acceleration | Yuanxin Wei et.al. | 2508.12691 | null |
| 2025-08-15 | CineTrans: Learning to Generate Videos with Cinematic Transitions via Masked Diffusion Models | Xiaoxue Wu et.al. | 2508.11484 | null |
| 2025-08-15 | Preacher: Paper-to-Video Agentic System | Jingwei Liu et.al. | 2508.09632 | null |
| 2025-08-14 | GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning | Kelin Yu et.al. | 2508.11049 | null |
| 2025-08-14 | EVCtrl: Efficient Control Adapter for Visual Generation | Zixiang Yang et.al. | 2508.10963 | null |
| 2025-08-14 | Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation | Harold Haodong Chen et.al. | 2508.10858 | null |
| 2025-08-14 | Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation | Youping Gu et.al. | 2508.10774 | null |
| 2025-08-14 | AEGIS: Authenticity Evaluation Benchmark for AI-Generated Video Sequences | Jieyu Li et.al. | 2508.10771 | null |
| 2025-08-14 | HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis | Shiyu Liu et.al. | 2508.10566 | null |
| 2025-08-14 | From Large Angles to Consistent Faces: Identity-Preserving Video Generation via Mixture of Facial Experts | Yuji Wang et.al. | 2508.09476 | null |
| 2025-08-14 | Yan: Foundational Interactive Video Generation | Deheng Ye et.al. | 2508.08601 | null |
| 2025-08-13 | Physical Autoregressive Model for Robotic Manipulation without Action Pretraining | Zijian Song et.al. | 2508.09822 | null |
| 2025-08-12 | X-UniMotion: Animating Human Images with Expressive, Unified and Identity-Agnostic Motion Latents | Guoxian Song et.al. | 2508.09383 | null |
| 2025-08-12 | Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices | Ya Zou et.al. | 2508.09136 | null |
| 2025-08-12 | TaoCache: Structure-Maintained Video Generation Acceleration | Zhentao Fan et.al. | 2508.08978 | null |
| 2025-08-12 | Subjective and Objective Quality Assessment of Banding Artifacts on Compressed Videos | Qi Zheng et.al. | 2508.08700 | null |
| 2025-08-12 | RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space | Jingyun Liang et.al. | 2508.08588 | null |
| 2025-08-12 | S^2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix | Peng Dai et.al. | 2508.08048 | null |
| 2025-08-12 | Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation | Fangyuan Mao et.al. | 2508.07981 | null |
| 2025-08-12 | Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation | Bowen Xue et.al. | 2508.07901 | null |
| 2025-08-11 | VSF: Simple, Efficient, and Effective Negative Guidance in Few-Step Image Generation Models By Value Sign Flip | Wenqi Guo et.al. | 2508.10931 | null |
| 2025-08-11 | StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation | Shuyuan Tu et.al. | 2508.08248 | null |
| 2025-08-11 | Matrix-3D: Omnidirectional Explorable 3D World Generation | Zhongqi Yang et.al. | 2508.08086 | null |
| 2025-08-11 | Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation | Xiaoyan Liu et.al. | 2508.07769 | null |
| 2025-08-11 | ShoulderShot: Generating Over-the-Shoulder Dialogue Videos | Yuang Zhang et.al. | 2508.07597 | null |
| 2025-08-08 | Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video | Jixuan He et.al. | 2508.06715 | null |
| 2025-08-08 | SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment | Yanxiao Sun et.al. | 2508.06082 | null |
| 2025-08-08 | DreamVE: Unified Instruction-based Image and Video Editing | Bin Xia et.al. | 2508.06080 | null |
| 2025-08-07 | Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation | Yue Liao et.al. | 2508.05635 | null |
| 2025-08-07 | B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding | Changho Choi et.al. | 2508.05269 | null |
| 2025-08-07 | PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation | Jingxuan He et.al. | 2508.05091 | null |
| 2025-08-07 | S | Weilun Feng et.al. | 2508.04016 | null |
| 2025-08-06 | MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning | Quang-Trung Truong et.al. | 2508.04549 | null |
| 2025-08-06 | LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation | Kangrui Cen et.al. | 2508.04228 | null |
| 2025-08-06 | Motion is the Choreographer: Learning Latent Pose Dynamics for Seamless Sign Language Generation | Jiayi He et.al. | 2508.04049 | null |
| 2025-08-06 | Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation | Xunzhi Xiang et.al. | 2508.03334 | null |
| 2025-08-05 | Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm | Lin Zhang et.al. | 2508.03955 | null |
| 2025-08-05 | LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation | Jianxiong Gao et.al. | 2508.03694 | null |
| 2025-08-05 | RAAG: Ratio Aware Adaptive Guidance | Shangwen Zhu et.al. | 2508.03442 | null |
| 2025-08-05 | V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models | Jisoo Kim et.al. | 2508.03254 | null |
| 2025-08-05 | Multi-human Interactive Talking Dataset | Zeyu Zhu et.al. | 2508.03050 | null |
| 2025-08-05 | MoCA: Identity-Preserving Text-to-Video Generation via Mixture of Cross Attention | Qi Xie et.al. | 2508.03034 | null |
| 2025-08-05 | D3: Training-Free AI-Generated Video Detection Using Second-Order Features | Chende Zheng et.al. | 2508.00701 | null |
| 2025-08-04 | X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio | Chenxu Zhang et.al. | 2508.02944 | null |
| 2025-08-04 | DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework | Tongchun Zuo et.al. | 2508.02807 | null |
| 2025-08-04 | QuaDreamer: Controllable Panoramic Video Generation for Quadruped Robots | Sheng Wu et.al. | 2508.02512 | null |
| 2025-08-04 | PoseGuard: Pose-Guided Generation with Safety Guardrails | Kongxin Wang et.al. | 2508.02476 | null |
| 2025-08-04 | Talking Surveys: How Photorealistic Embodied Conversational Agents Shape Response Quality, Engagement, and Satisfaction | Matus Krajcovic et.al. | 2508.02376 | null |
| 2025-08-03 | Versatile Transition Generation with Image-to-Video Diffusion | Zuhao Yang et.al. | 2508.01698 | null |
| 2025-08-01 | Video Generators are Robot Policies | Junbang Liang et.al. | 2508.00795 | null |
| 2025-08-01 | SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation | Kien T. Pham et.al. | 2508.00782 | null |
| 2025-08-01 | Video Forgery Detection with Optical Flow Residuals and Spatial-Temporal Consistency | Xi Xue et.al. | 2508.00397 | null |
| 2025-08-01 | GV-VAD : Exploring Video Generation for Weakly-Supervised Video Anomaly Detection | Suhang Cai et.al. | 2508.00312 | null |
| 2025-08-01 | Controllable Pedestrian Video Editing for Multi-View Driving Scenarios via Motion Sequence | Danzhen Fu et.al. | 2508.00299 | null |
| 2025-08-01 | HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly | Chang Liu et.al. | 2507.19924 | null |
| 2025-07-31 | World Consistency Score: A Unified Metric for Video Generation Quality | Akshat Rakheja et.al. | 2508.00144 | null |
| 2025-07-30 | GVD: Guiding Video Diffusion Model for Scalable Video Distillation | Kunyang Li et.al. | 2507.22360 | null |
| 2025-07-29 | JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1 | Xinhan Di et.al. | 2507.20987 | null |
| 2025-07-28 | Compositional Video Synthesis by Temporal Object-Centric Learning | Adil Kaan Akan et.al. | 2507.20855 | null |
| 2025-07-27 | MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation | Shuolin Xu et.al. | 2507.20368 | null |
| 2025-07-26 | ChoreoMuse: Robust Music-to-Dance Video Generation with Style Transfer and Beat-Adherent Motion | Xuanchen Wang et.al. | 2507.19836 | null |
| 2025-07-25 | ScenePainter: Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment | Chong Xia et.al. | 2507.19058 | null |
| 2025-07-24 | Captain Cinema: Towards Short Movie Generation | Junfei Xiao et.al. | 2507.18634 | null |
| 2025-07-24 | Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis | Yanzuo Lu et.al. | 2507.18569 | null |
| 2025-07-24 | Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows | Simin Huo et.al. | 2507.18405 | null |
| 2025-07-24 | T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation | Yubin Chen et.al. | 2507.18107 | null |
| 2025-07-24 | Enhancing Scene Transition Awareness in Video Generation via Post-Training | Hanwen Shen et.al. | 2507.18046 | null |
| 2025-07-24 | Celeb-DF++: A Large-scale Challenging Video DeepFake Benchmark for Generalizable Forensics | Yuezun Li et.al. | 2507.18015 | null |
| 2025-07-24 | Controllable Video Generation: A Survey | Yue Ma et.al. | 2507.16869 | null |
| 2025-07-23 | Zero-Shot Dynamic Concept Personalization with Grid-Based LoRA | Rameen Abdal et.al. | 2507.17963 | null |
| 2025-07-23 | Bob's Confetti: Phonetic Memorization Attacks in Music and Video Generation | Jaechul Roh et.al. | 2507.17937 | null |
| 2025-07-23 | Yume: An Interactive World Generation Model | Xiaofeng Mao et.al. | 2507.17744 | null |
| 2025-07-23 | EndoGen: Conditional Autoregressive Endoscopic Video Generation | Xinyu Liu et.al. | 2507.17388 | null |
| 2025-07-22 | Livatar-1: Real-Time Talking Heads Generation with Tailored Flow Matching | Haiyang Liu et.al. | 2507.18649 | null |
| 2025-07-22 | MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation | Yanchen Liu et.al. | 2507.16310 | null |
| 2025-07-22 | PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation | Yaofang Liu et.al. | 2507.16116 | null |
| 2025-07-21 | Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models | Enes Sanli et.al. | 2507.15824 | null |
| 2025-07-21 | TokensGen: Harnessing Condensed Tokens for Long Video Generation | Wenqi Ouyang et.al. | 2507.15728 | null |
| 2025-07-21 | Conditional Video Generation for High-Efficiency Video Compression | Fangqiu Yi et.al. | 2507.15269 | null |
| 2025-07-19 | BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM | Haiquan Wen et.al. | 2507.14632 | null |
| 2025-07-19 | Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey | Jiahui Zhang et.al. | 2507.14501 | null |
| 2025-07-18 | Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis | Tongtong Su et.al. | 2507.13753 | null |
| 2025-07-17 | | Dmitrii Mikhailov et.al. | 2507.13546 | null |
| 2025-07-17 | "PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models | Jing Gu et.al. | 2507.13428 | null |
| 2025-07-17 | Taming Diffusion Transformer for Real-Time Mobile Video Generation | Yushu Wu et.al. | 2507.13343 | null |
| 2025-07-17 | Leveraging Pre-Trained Visual Models for AI-Generated Video Detection | Keerthi Veeramachaneni et.al. | 2507.13224 | null |
| 2025-07-17 | LoViC: Efficient Long Video Generation with Context Compression | Jiaxiu Jiang et.al. | 2507.12952 | null |
| 2025-07-17 | World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving | Yanchen Guan et.al. | 2507.12762 | null |
| 2025-07-16 | EC-Diff: Fast and High-Quality Edge-Cloud Collaborative Inference for Diffusion Models | Jiajian Xie et.al. | 2507.11980 | null |
| 2025-07-15 | NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models | X. Feng et.al. | 2507.11245 | null |
| 2025-07-14 | Flows and Diffusions on the Neural Manifold | Daniel Saragih et.al. | 2507.10623 | null |
| 2025-07-14 | M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation | Kui Jiang et.al. | 2507.08307 | null |
| 2025-07-14 | Democratizing High-Fidelity Co-Speech Gesture Video Generation | Xu Yang et.al. | 2507.06812 | null |
| 2025-07-12 | | Zhimin Liao et.al. | 2507.09144 | null |
| 2025-07-11 | Taming generative video models for zero-shot optical flow extraction | Seungwoo Kim et.al. | 2507.09082 | null |
| 2025-07-11 | Detecting Deepfake Talking Heads from Facial Biometric Anomalies | Justin D. Norman et.al. | 2507.08917 | null |
| 2025-07-11 | Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective | Hangjie Yuan et.al. | 2507.08801 | null |
| 2025-07-11 | Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers | Wongi Jeong et.al. | 2507.08422 | null |
| 2025-07-11 | T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates | Zhitao Wang et.al. | 2507.07633 | null |
| 2025-07-10 | Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling | Haoyu Wu et.al. | 2507.07982 | null |
| 2025-07-10 | Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions | Longfei Li et.al. | 2507.07978 | null |
| 2025-07-10 | Scaling RL to Long Videos | Yukang Chen et.al. | 2507.07966 | null |
| 2025-07-09 | A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality | Mohamed Elmoghany et.al. | 2507.07202 | null |
| 2025-07-09 | Physics-Grounded Motion Forecasting via Equation Discovery for Trajectory-Guided Image-to-Video Generation | Tao Feng et.al. | 2507.06830 | null |
| 2025-07-09 | PromptTea: Let Prompts Tell TeaCache the Optimal Threshold | Zishen Huang et.al. | 2507.06739 | null |
| 2025-07-09 | Spatial-Temporal Graph Mamba for Music-Guided Dance Video Synthesis | Hao Tang et.al. | 2507.06689 | null |
| 2025-07-09 | FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation | Liqiang Jing et.al. | 2507.06523 | null |
| 2025-07-09 | Omni-Video: Democratizing Unified Video Understanding and Generation | Zhiyu Tan et.al. | 2507.06119 | null |
| 2025-07-09 | Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation | Zhenghao Zhang et.al. | 2507.05963 | null |
| 2025-07-09 | LongAnimation: Long Animation Generation with Dynamic Global-Local Memory | Nan Chen et.al. | 2507.01945 | null |
| 2025-07-08 | Bridging Sequential Deep Operator Network and Video Diffusion: Residual Refinement of Spatio-Temporal PDE Solutions | Jaewan Park et.al. | 2507.06133 | null |
| 2025-07-08 | MedGen: Unlocking Medical Video Generation by Scaling Granularly-annotated Medical Videos | Rongsheng Wang et.al. | 2507.05675 | null |
| 2025-07-08 | StreamDiT: Real-Time Streaming Text-to-Video Generation | Akio Kodaira et.al. | 2507.03745 | null |
| 2025-07-07 | HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding | Yuxuan Cai et.al. | 2507.04909 | null |
| 2025-07-07 | Music2Palette: Emotion-aligned Color Palette Generation via Cross-Modal Representation Learning | Jiayun Hu et.al. | 2507.04758 | null |
| 2025-07-07 | Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations | Yuji Wang et.al. | 2507.04705 | null |
| 2025-07-06 | MambaVideo for Discrete Video Tokenization with Channel-Split Quantization | Dawit Mureja Argaw et.al. | 2507.04559 | null |
| 2025-07-06 | CLIP-RL: Surgical Scene Segmentation Using Contrastive Language-Vision Pretraining & Reinforcement Learning | Fatmaelzahraa Ali Ahmed et.al. | 2507.04317 | null |
| 2025-07-05 | PresentAgent: Multimodal Agent for Presentation Video Generation | Jingwei Shi et.al. | 2507.04036 | null |
| 2025-07-05 | EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation | Rang Meng et.al. | 2507.03905 | null |
| 2025-07-03 | RefTok: Reference-Based Tokenization for Video Generation | Xiang Fan et.al. | 2507.02862 | null |
| 2025-07-03 | Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching | Xin Zhou et.al. | 2507.02860 | null |
| 2025-07-03 | AnyI2V: Animating Any Conditional Image with Motion Control | Ziye Li et.al. | 2507.02857 | null |
| 2025-07-03 | Lost in Latent Space: An Empirical Study of Latent Diffusion Models for Physics Emulation | François Rozet et.al. | 2507.02608 | null |
| 2025-07-03 | RGC-VQA: An Exploration Database for Robotic-Generated Video Quality Assessment | Jianing Jin et.al. | 2506.23852 | null |
| 2025-07-02 | SD-Acc: Accelerating Stable Diffusion through Phase-aware Sampling and Hardware Co-Optimizations | Zhican Wang et.al. | 2507.01309 | null |
| 2025-07-02 | LLM-based Realistic Safety-Critical Driving Video Generation | Yongjie Fu et.al. | 2507.01264 | null |
| 2025-07-02 | AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation | Xiao Liu et.al. | 2507.01255 | null |
| 2025-07-01 | Geometry-aware 4D Video Generation for Robot Manipulation | Zeyi Liu et.al. | 2507.01099 | null |
| 2025-07-01 | Populate-A-Scene: Affordance-Aware Human Video Generation | Mengyi Shan et.al. | 2507.00334 | null |
| 2025-07-01 | Listener-Rewarded Thinking in VLMs for Image Preferences | Alexander Gambashidze et.al. | 2506.22832 | null |
| 2025-06-30 | FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion | Yu Lu et.al. | 2507.00162 | null |
| 2025-06-30 | Epona: Autoregressive Diffusion World Model for Autonomous Driving | Kaiwen Zhang et.al. | 2506.24113 | null |
| 2025-06-30 | VMoBA: Mixture-of-Block Attention for Video Diffusion Models | Jianzong Wu et.al. | 2506.23858 | null |
| 2025-06-30 | SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation | Shuai Tan et.al. | 2506.23690 | null |
| 2025-06-30 | ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models | Zixun Fang et.al. | 2506.23513 | null |
| 2025-06-29 | Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis | Lei-lei Li et.al. | 2506.23263 | null |
| 2025-06-29 | RoboScape: Physics-informed Embodied World Model | Yu Shang et.al. | 2506.23135 | null |
| 2025-06-27 | Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy | Yuhao Liu et.al. | 2506.22432 | null |
| 2025-06-27 | RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation | Liudi Yang et.al. | 2506.22007 | null |
| 2025-06-27 | ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models | Hongbo Liu et.al. | 2506.21356 | null |
| 2025-06-27 | DFVEdit: Conditional Delta Flow Vector for Zero-shot Video Editing | Lingling Cai et.al. | 2506.20967 | null |
| 2025-06-26 | SmoothSinger: A Conditional Diffusion Model for Singing Voice Synthesis with Multi-Resolution Architecture | Kehan Sui et.al. | 2506.21478 | null |
| 2025-06-26 | HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation | Diego Biagini et.al. | 2506.21287 | null |
| 2025-06-26 | Video Virtual Try-on with Conditional Diffusion Transformer Inpainter | Cheng Zou et.al. | 2506.21270 | null |
| 2025-06-26 | Consistent Zero-shot 3D Texture Synthesis Using Geometry-aware Diffusion and Temporal Video Models | Donggoo Kang et.al. | 2506.20946 | null |
| 2025-06-25 | Video Perception Models for 3D Scene Synthesis | Rui Huang et.al. | 2506.20601 | null |
| 2025-06-25 | BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos | Jiahao Lin et.al. | 2506.20103 | null |
| 2025-06-24 | Radial Attention: | Xingyang Li et.al. | 2506.19852 | null |
| 2025-06-24 | GenHSI: Controllable Generation of Human-Scene Interaction Videos | Zekun Li et.al. | 2506.19840 | null |
| 2025-06-24 | SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution | Liangbin Xie et.al. | 2506.19838 | null |
| 2025-06-24 | Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router | Yubo Huang et.al. | 2506.19833 | null |
| 2025-06-24 | Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation | Jintao Rong et.al. | 2506.19348 | null |
| 2025-06-23 | VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory | Runjia Li et.al. | 2506.18903 | null |
| 2025-06-23 | From Virtual Games to Real-World Play | Wenqiang Sun et.al. | 2506.18901 | null |
| 2025-06-23 | FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation | Kaiyi Huang et.al. | 2506.18899 | null |
| 2025-06-23 | MinD: Unified Visual Imagination and Control via Hierarchical World Models | Xiaowei Chi et.al. | 2506.18897 | null |
| 2025-06-23 | OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation | Qijun Gan et.al. | 2506.18866 | null |
| 2025-06-23 | Phantom-Data : Towards a General Subject-Consistent Video Generation Dataset | Zhuowei Chen et.al. | 2506.18851 | null |
| 2025-06-23 | Matrix-Game: Interactive World Foundation Model | Yifan Zhang et.al. | 2506.18701 | null |
| 2025-06-23 | RDPO: Real Data Preference Optimization for Physics Consistency Video Generation | Wenxu Qian et.al. | 2506.18655 | null |
| 2025-06-23 | BulletGen: Improving 4D Reconstruction with Bullet-Time Generation | Denys Rozumnyi et.al. | 2506.18601 | null |
| 2025-06-23 | VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning | Xuanyu Zhang et.al. | 2506.18564 | null |
| 2025-06-23 | Emergent Temporal Correspondences from Video Diffusion Transformers | Jisu Nam et.al. | 2506.17220 | link |
| 2025-06-21 | STAGE: A Stream-Centric Generative World Model for Long-Horizon Driving-Scene Simulation | Jiamin Wang et.al. | 2506.13138 | null |
| 2025-06-20 | Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition | Jiaqi Li et.al. | 2506.17201 | null |
| 2025-06-20 | Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation | Riccardo Corvi et.al. | 2506.16802 | null |
| 2025-06-20 | Sekai: A Video Dataset towards World Exploration | Zhen Li et.al. | 2506.15675 | null |
| 2025-06-20 | Show-o2: Improved Native Unified Multimodal Models | Jinheng Xie et.al. | 2506.15564 | link |
| 2025-06-19 | VideoGAN-based Trajectory Proposal for Automated Vehicles | Annajoyce Mariani et.al. | 2506.16209 | link |
| 2025-06-19 | FastInit: Fast Noise Initialization for Temporally Consistent Video Generation | Chengyu Bai et.al. | 2506.16119 | null |
| 2025-06-19 | PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models | Tianchen Zhao et.al. | 2506.16054 | null |
| 2025-06-19 | Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization | Cong Wang et.al. | 2506.15980 | link |
| 2025-06-18 | VideoMAR: Autoregressive Video Generation with Continuous Tokens | Hu Yu et.al. | 2506.14168 | null |
| 2025-06-18 | Cosmos-Drive-Dreams: Scalable Synthetic Driving Data Generation with World Foundation Models | Xuanchi Ren et.al. | 2506.09042 | link |
| 2025-06-17 | Causally Steered Diffusion for Automated Video Counterfactual Generation | Nikos Spyrou et.al. | 2506.14404 | link |
| 2025-06-17 | CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation | Jia-Chen Zhang et.al. | 2506.14206 | null |
| 2025-06-16 | EchoShot: Multi-Shot Portrait Video Generation | Jiahao Wang et.al. | 2506.15838 | null |
| 2025-06-16 | UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions | Zhucun Xue et.al. | 2506.13691 | null |
| 2025-06-15 | iDiT-HOI: Inpainting-based Hand Object Interaction Reenactment via Video Diffusion Transformer | Zhelun Shen et.al. | 2506.12847 | null |
| 2025-06-13 | SignAligner: Harmonizing Complementary Pose Modalities for Coherent Sign Language Generation | Xu Wang et.al. | 2506.11621 | null |
| 2025-06-13 | Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models | Sridhar S et.al. | 2506.10005 | null |
| 2025-06-12 | GenWorld: Towards Detecting AI-generated Real-world Simulation Videos | Weiliang Chen et.al. | 2506.10975 | null |
| 2025-06-12 | M4V: Multi-Modal Mamba for Text-to-Video Generation | Jiancheng Huang et.al. | 2506.10915 | null |
| 2025-06-12 | GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning | Xiaoyi Bao et.al. | 2506.10639 | null |
| 2025-06-12 | DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers | Lizhen Wang et.al. | 2506.10568 | null |
| 2025-06-12 | AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation | Haoyuan Shi et.al. | 2506.10540 | null |
| 2025-06-11 | AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation | Chao Liang et.al. | 2506.11144 | null |
| 2025-06-11 | PlayerOne: Egocentric World Simulator | Yuanpeng Tu et.al. | 2506.09995 | null |
| 2025-06-11 | InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions | Zhenzhi Wang et.al. | 2506.09984 | null |
| 2025-06-11 | ReSim: Reliable World Simulation for Autonomous Driving | Jiazhi Yang et.al. | 2506.09981 | null |
| 2025-06-11 | DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning | Dongxu Liu et.al. | 2506.09644 | null |
| 2025-06-11 | Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation | Shanchuan Lin et.al. | 2506.09350 | null |
| 2025-06-10 | Seedance 1.0: Exploring the Boundaries of Video Generation Models | Yu Gao et.al. | 2506.09113 | null |
| 2025-06-10 | FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation | Zheqi He et.al. | 2506.09081 | link |
| 2025-06-10 | VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | Xinlong Chen et.al. | 2506.09079 | null |
| 2025-06-10 | MagCache: Fast Video Generation with Magnitude-Aware Cache | Zehong Ma et.al. | 2506.09045 | link |
| 2025-06-10 | Product of Experts for Visual Generation | Yunzhi Zhang et.al. | 2506.08894 | null |
| 2025-06-10 | HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation | Ziyao Huang et.al. | 2506.08797 | null |
| 2025-06-10 | RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping | Yang Bai et.al. | 2506.08632 | null |
| 2025-06-10 | How Much To Guide: Revisiting Adaptive Guidance in Classifier-Free Guidance Text-to-Vision Diffusion Models | Huixuan Zhang et.al. | 2506.08351 | null |
| 2025-06-10 | From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models | Pablo Acuaviva et.al. | 2506.07280 | null |
| 2025-06-09 | Seeing Voices: Generating A-Roll Video from Audio with Mirage | Aditi Sundararaman et.al. | 2506.08279 | null |
| 2025-06-09 | Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion | Xun Huang et.al. | 2506.08009 | null |
| 2025-06-09 | Dreamland: Controllable World Creation with Simulator and Generative Models | Sicheng Mo et.al. | 2506.08006 | null |
| 2025-06-09 | Audio-Sync Video Generation with Multi-Stream Temporal Control | Shuchen Weng et.al. | 2506.08003 | null |
| 2025-06-09 | Generative Modeling of Weights: Generalization or Memorization? | Boya Zeng et.al. | 2506.07998 | link |
| 2025-06-09 | Video Unlearning via Low-Rank Refusal Vector | Simone Facchiano et.al. | 2506.07891 | null |
| 2025-06-09 | EgoM2P: Egocentric Multimodal Multitask Pretraining | Gen Li et.al. | 2506.07886 | null |
| 2025-06-09 | PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement | Teng Hu et.al. | 2506.07848 | null |
| 2025-06-09 | Consistent Video Editing as Flow-Driven Image-to-Video Generation | Ge Wang et.al. | 2506.07713 | null |
| 2025-06-09 | Evaluating Robustness in Latent Diffusion Models via Embedding Level Augmentation | Boris Martirosyan et.al. | 2506.07706 | null |
| 2025-06-09 | Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers | Haosong Liu et.al. | 2506.05096 | null |
| 2025-06-08 | TV-LiVE: Training-Free, Text-Guided Video Editing via Layer Informed Vitality Exploitation | Min-Jung Kim et.al. | 2506.07205 | null |
| 2025-06-08 | Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models | Sangwon Jang et.al. | 2506.07177 | null |
| 2025-06-08 | Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion | Huaize Liu et.al. | 2506.07136 | null |
| 2025-06-07 | Self-Adapting Improvement Loops for Robotic Learning | Calvin Luo et.al. | 2506.06658 | null |
| 2025-06-06 | Restereo: Diffusion stereo video generation and restoration | Xingchang Huang et.al. | 2506.06023 | null |
| 2025-06-06 | LLIA -- Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models | Haojie Yu et.al. | 2506.05806 | null |
| 2025-06-06 | FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion | Akide Liu et.al. | 2506.04648 | null |
| 2025-06-05 | EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh | Tao Hu et.al. | 2506.05554 | null |
| 2025-06-05 | ContentV: Efficient Training of Video Generation Models with Limited Compute | Wenfeng Lin et.al. | 2506.05343 | null |
| 2025-06-05 | FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation | Huihan Wang et.al. | 2506.04956 | link |
| 2025-06-05 | DualX-VSR: Dual Axial Spatial | Shuo Cao et.al. | 2506.04830 | null |
| 2025-06-05 | Follow-Your-Creation: Empowering 4D Creation through Video Inpainting | Yue Ma et.al. | 2506.04590 | null |
| 2025-06-05 | FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers | Xuanhua He et.al. | 2506.04213 | null |
| 2025-06-05 | SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios | Lingwei Dang et.al. | 2506.02444 | link |
| 2025-06-04 | LayerFlow: A Unified Model for Layer-aware Video Generation | Sihui Ji et.al. | 2506.04228 | null |
| 2025-06-04 | UNIC: Unified In-Context Video Editing | Zixuan Ye et.al. | 2506.04216 | null |
| 2025-06-04 | DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models | Ziyi Wu et.al. | 2506.03517 | null |
| 2025-06-03 | Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas | Austin Silveria et.al. | 2506.03275 | null |
| 2025-06-03 | IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation | Yuanze Lin et.al. | 2506.03150 | null |
| 2025-06-03 | Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval | Jiwen Yu et.al. | 2506.03141 | null |
| 2025-06-03 | CamCloneMaster: Enabling Reference-based Camera Control for Video Generation | Yawen Luo et.al. | 2506.03140 | null |
| 2025-06-03 | AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation | Lu Qiu et.al. | 2506.03126 | null |
| 2025-06-03 | DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation | Zhengyao Lv et.al. | 2506.03123 | null |
| 2025-06-03 | TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models | Chetwin Low et.al. | 2506.03099 | null |
| 2025-06-03 | SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis | Ssharvien Kumar Sivakumar et.al. | 2506.03082 | null |
| 2025-06-03 | ORV: 4D Occupancy-centric Robot Video Generation | Xiuyu Yang et.al. | 2506.03079 | link |
| 2025-06-03 | Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers | Pengtao Chen et.al. | 2506.03065 | null |
| 2025-06-03 | LinkTo-Anime: A 2D Animation Optical Flow Dataset from 3D Model Rendering | Xiaoyi Feng et.al. | 2506.02733 | null |
| 2025-06-03 | LumosFlow: Motion-Guided Long Video Generation | Jiahao Chen et.al. | 2506.02497 | null |
| 2025-06-02 | Motion aware video generative model | Bowen Xue et.al. | 2506.02244 | null |
| 2025-06-02 | Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control | Xiao Fu et.al. | 2506.01943 | null |
| 2025-06-02 | OmniV2V: Versatile Video Generation and Editing via Dynamic Content Manipulation | Sen Liang et.al. | 2506.01801 | null |
| 2025-06-02 | Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks | Tao Yang et.al. | 2506.01758 | null |
| 2025-06-02 | Respond Beyond Language: A Benchmark for Video Generation in Response to Realistic User Intents | Shuting Wang et.al. | 2506.01689 | null |
| 2025-06-02 | LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model | Xiaodong Wang et.al. | 2506.01546 | null |
| 2025-06-02 | Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark | Shuyu Yang et.al. | 2506.01466 | null |
| 2025-06-02 | DiffuseSlide: Training-Free High Frame Rate Video Generation Diffusion | Geunmin Hwang et.al. | 2506.01454 | null |
| 2025-05-30 | MiniMax-Remover: Taming Bad Noise Helps Video Object Removal | Bojia Zi et.al. | 2505.24873 | null |
| 2025-05-30 | DreamDance: Animating Character Art via Inpainting Stable Gaussian Worlds | Jiaxu Zhang et.al. | 2505.24733 | null |
| 2025-05-30 | UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation | Yang-Tian Sun et.al. | 2505.24521 | null |
| 2025-05-30 | Interactive Video Generation via Domain Adaptation | Ishaan Rawal et.al. | 2505.24253 | null |
| 2025-05-30 | STORK: Improving the Fidelity of Mid-NFE Sampling for Diffusion and Flow Matching Models | Zheng Tan et.al. | 2505.24210 | link |
| 2025-05-29 | MAGREF: Masked Guidance for Any-Reference Video Generation | Yufan Deng et.al. | 2505.23742 | link |
| 2025-05-29 | VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos | Tingyu Song et.al. | 2505.23693 | link |
| 2025-05-29 | VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models | Xiangdong Zhang et.al. | 2505.23656 | link |
| 2025-05-29 | VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation | Shi-Xue Zhang et.al. | 2505.23484 | link |
| 2025-05-29 | Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis | Hengyuan Cao et.al. | 2505.23325 | null |
| 2025-05-29 | RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer | Liu Liu et.al. | 2505.23171 | null |
| 2025-05-29 | Zero-to-Hero: Zero-Shot Initialization Empowering Reference-Based Video Appearance Editing | Tongtong Su et.al. | 2505.23134 | link |
| 2025-05-29 | MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation | Siyuan Wang et.al. | 2505.23120 | link |
| 2025-05-29 | GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion | Gwanghyun Kim et.al. | 2505.23085 | null |
| 2025-05-29 | MOVi: Training-free Text-conditioned Multi-Object Video Generation | Aimon Rahman et.al. | 2505.22980 | null |
| 2025-05-29 | HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions | Shuolin Xu et.al. | 2505.22977 | link |
| 2025-05-29 | Minute-Long Videos with Dual Parallelisms | Zeqing Wang et.al. | 2505.21070 | link |
| 2025-05-28 | ATI: Any Trajectory Instruction for Controllable Video Generation | Angtian Wang et.al. | 2505.22944 | null |
| 2025-05-28 | Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation | Zhe Kong et.al. | 2505.22647 | link |
| 2025-05-28 | Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers | Weilun Feng et.al. | 2505.22167 | null |
| 2025-05-28 | FaceEditTalker: Interactive Talking Head Generation with Facial Attribute Editing | Guanwen Feng et.al. | 2505.22141 | null |
| 2025-05-28 | LatentMove: Towards Complex Human Movement Video Generation | Ashkan Taghipour et.al. | 2505.22046 | null |
| 2025-05-28 | PanoWan: Lifting Diffusion Video Generation Models to 360° with Latitude/Longitude-aware Mechanisms | Yifei Xia et.al. | 2505.22016 | null |
| 2025-05-28 | Learning World Models for Interactive Video Generation | Taiye Chen et.al. | 2505.21996 | null |
| 2025-05-28 | SageAttention2++: A More Efficient Implementation of SageAttention2 | Jintao Zhang et.al. | 2505.21136 | link |
| 2025-05-28 | OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation | Shenghai Yuan et.al. | 2505.20292 | link |
| 2025-05-27 | HDRSDR-VQA: A Subjective Video Quality Dataset for HDR and SDR Comparative Evaluation | Bowen Chen et.al. | 2505.21831 | null |
| 2025-05-27 | Think Before You Diffuse: LLMs-Guided Physics-Aware Video Generation | Ke Zhang et.al. | 2505.21653 | null |
| 2025-05-27 | VideoMarkBench: Benchmarking Robustness of Video Watermarking | Zhengyuan Jiang et.al. | 2505.21620 | link |
| 2025-05-27 | Frame In-N-Out: Unbounded Controllable Image-to-Video Generation | Boyang Wang et.al. | 2505.21491 | null |
| 2025-05-27 | Dynamic Vision from EEG Brain Recordings: How much does EEG know? | Prajwal Singh et.al. | 2505.21385 | null |
| 2025-05-27 | RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy | Aiyue Chen et.al. | 2505.21036 | null |
| 2025-05-27 | Frame-Level Captions for Long Video Generation with Complex Multi Scenes | Guangcong Zheng et.al. | 2505.20827 | null |
| 2025-05-27 | Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt | Xiang Zhu et.al. | 2505.20795 | null |
| 2025-05-27 | Photography Perspective Composition: Towards Aesthetic Perspective Recommendation | Lujian Yao et.al. | 2505.20655 | null |
| 2025-05-27 | Incorporating Flexible Image Conditioning into Text-to-Video Diffusion Models without Training | Bolin Lai et.al. | 2505.20629 | null |
| 2025-05-27 | Dynamic-I2V: Exploring Image-to-Video Generation Models via Multimodal LLM | Peng Liu et.al. | 2505.19901 | null |
| 2025-05-26 | MotionPro: A Precise Motion Controller for Image-to-Video Generation | Zhongwei Zhang et.al. | 2505.20287 | null |
| 2025-05-26 | DriveCamSim: Generalizable Camera Simulation via Explicit Camera Modeling for Autonomous Driving | Wenchao Sun et.al. | 2505.19692 | link |
| 2025-05-26 | TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs | Juntong Wang et.al. | 2505.19535 | null |
| 2025-05-26 | The Role of Video Generation in Enhancing Data-Limited Action Understanding | Wei Li et.al. | 2505.19495 | null |
| 2025-05-26 | Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals | Nate Gillman et.al. | 2505.19386 | null |
| 2025-05-26 | DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation | Junhao Chen et.al. | 2505.18078 | null |
| 2025-05-25 | From Single Images to Motion Policies via Video-Generation Environment Representations | Weiming Zhi et.al. | 2505.19306 | null |
| 2025-05-25 | SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation | Shenggan Cheng et.al. | 2505.19151 | null |
| 2025-05-25 | WorldEval: World Model as Real-World Robot Policies Evaluator | Yaxuan Li et.al. | 2505.19017 | null |
| 2025-05-25 | Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency | Hyunho Ha et.al. | 2505.18932 | null |
| 2025-05-25 | Interspatial Attention for Efficient 4D Human Video Generation | Ruizhi Shao et.al. | 2505.15800 | null |
| 2025-05-24 | Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation | Shuo Yang et.al. | 2505.18875 | null |
| 2025-05-24 | VORTA: Efficient Video Diffusion via Routing Sparse Attention | Wenhao Sun et.al. | 2505.18809 | link |
| 2025-05-24 | DVD-Quant: Data-free Video Diffusion Transformers Quantization | Zhiteng Li et.al. | 2505.18663 | link |
| 2025-05-24 | ProphetDWM: A Driving World Model for Rolling Out Future Actions and Videos | Xiaodong Wang et.al. | 2505.18650 | null |
| 2025-05-23 | WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions | Zizhang Li et.al. | 2505.18151 | null |
| 2025-05-23 | SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain | Jiawei Zhou et.al. | 2505.17727 | null |
| 2025-05-23 | Scaling Image and Video Generation via Test-Time Evolutionary Search | Haoran He et.al. | 2505.17618 | null |
| 2025-05-23 | InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO | Xueji Fang et.al. | 2505.17574 | link |
| 2025-05-23 | Challenger: Affordable Adversarial Driving Video Generation | Zhiyuan Xu et.al. | 2505.15880 | null |
| 2025-05-22 | Temporal Differential Fields for 4D Motion Modeling via Image-to-Video Synthesis | Xin You et.al. | 2505.17333 | null |
| 2025-05-22 | Training-Free Efficient Video Generation via Dynamic Token Carving | Yuechen Zhang et.al. | 2505.16864 | link |
| 2025-05-22 | Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts | Taewon Kang et.al. | 2505.16819 | null |
| 2025-05-22 | MAGIC: Motion-Aware Generative Inference via Confidence-Guided LLM | Siwei Meng et.al. | 2505.16456 | null |
| 2025-05-21 | Generative AI for Autonomous Driving: A Review | Katharina Winter et.al. | 2505.15863 | null |
| 2025-05-21 | AvatarShield: Visual Reinforcement Learning for Human-Centric Video Forgery Detection | Zhipei Xu et.al. | 2505.15173 | null |
| 2025-05-21 | CineTechBench: A Benchmark for Cinematographic Technique Understanding and Generation | Xinran Wang et.al. | 2505.15145 | link |
| 2025-05-21 | BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation | Haiquan Wen et.al. | 2505.12620 | link |
| 2025-05-21 | Video-GPT via Next Clip Diffusion | Shaobin Zhuang et.al. | 2505.12489 | null |
| 2025-05-20 | Programmatic Video Prediction Using Large Language Models | Hao Tang et.al. | 2505.14948 | link |
| 2025-05-20 | Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers | Sucheng Ren et.al. | 2505.14687 | link |
| 2025-05-20 | LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer | Changgu Chen et.al. | 2505.14167 | null |
| 2025-05-20 | Hunyuan-Game: Industrial-grade Intelligent Game Creation Model | Ruihuang Li et.al. | 2505.14135 | null |
| 2025-05-20 | MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation | Yanbo Ding et.al. | 2505.10238 | link |
| 2025-05-19 | FinePhys: Fine-grained Human Action Generation by Explicitly Incorporating Physical Laws for Effective Skeletal Guidance | Dian Shao et.al. | 2505.13437 | null |
| 2025-05-19 | MAGI-1: Autoregressive Video Generation at Scale | Sand.ai et.al. | 2505.13211 | link |
| 2025-05-19 | DreamGen: Unlocking Generalization in Robot Learning through Neural Trajectories | Joel Jang et.al. | 2505.12705 | link |
| 2025-05-19 | Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking | Zihan Su et.al. | 2505.12667 | null |
| 2025-05-18 | EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models | Hu Yue et.al. | 2505.09694 | link |
| 2025-05-17 | FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge | Xuan Shen et.al. | 2505.14709 | link |
| 2025-05-17 | DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance | Xuan Shen et.al. | 2505.14708 | link |
| 2025-05-17 | LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation | Jiarui Wang et.al. | 2505.12098 | link |
| 2025-05-17 | VFRTok: Variable Frame Rates Video Tokenizer with Duration-Proportional Information Assumption | Tianxiong Zhong et.al. | 2505.12053 | null |
| 2025-05-17 | STORYANCHORS: Generating Consistent Multi-Scene Story Frames for Long-Form Narratives | Bo Wang et.al. | 2505.08350 | null |
| 2025-05-16 | QVGen: Pushing the Limit of Quantized Video Generative Models | Yushi Huang et.al. | 2505.11497 | null |
| 2025-05-16 | Face Consistency Benchmark for GenAI Video | Michal Podstawski et.al. | 2505.11425 | null |
| 2025-05-16 | Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model | Wei Li et.al. | 2505.07449 | link |
| 2025-05-15 | ToonifyGB: StyleGAN-based Gaussian Blendshapes for 3D Stylized Head Avatars | Rui-Yang Ju et.al. | 2505.10072 | null |
| 2025-05-15 | Generating time-consistent dynamics with discriminator-guided image diffusion models | Philipp Hess et.al. | 2505.09089 | null |
| 2025-05-15 | Generative Pre-trained Autoregressive Diffusion Transformer | Yuan Zhang et.al. | 2505.07344 | null |
| 2025-05-14 | Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios | Huafeng Shi et.al. | 2505.10584 | null |
| 2025-05-13 | Generative AI for Autonomous Driving: Frontiers and Opportunities | Yuping Wang et.al. | 2505.08854 | link |
| 2025-05-13 | Symbolically-Guided Visual Plan Inference from Uncurated Video Data | Wenyan Yang et.al. | 2505.08444 | null |
| 2025-05-12 | DanceGRPO: Unleashing GRPO on Visual Generation | Zeyue Xue et.al. | 2505.07818 | null |
| 2025-05-12 | ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models | Ozgur Kara et.al. | 2505.07652 | null |
| 2025-05-11 | DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models | Junhao Xia et.al. | 2505.07057 | null |
| 2025-05-11 | BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation | Panwen Hu et.al. | 2505.06985 | null |
| 2025-05-10 | Jailbreaking the Text-to-Video Generative Models | Jiayang Liu et.al. | 2505.06679 | null |
| 2025-05-10 | ProFashion: Prototype-guided Fashion Video Generation with Multiple Reference Images | Xianghao Kong et.al. | 2505.06537 | null |
| 2025-05-08 | 3D Scene Generation: A Survey | Beichen Wen et.al. | 2505.05474 | link |
| 2025-05-08 | T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models | Xuyang Guo et.al. | 2505.04946 | null |
| 2025-05-08 | HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation | Teng Hu et.al. | 2505.04512 | null |
| 2025-05-06 | Real-Time Person Image Synthesis Using a Flow Matching Model | Jiwoo Jeong et.al. | 2505.03562 | link |
| 2025-05-06 | Transformers for Learning on Noisy and Task-Level Manifolds: Approximation and Generalization Insights | Zhaiming Shen et.al. | 2505.03205 | null |
| 2025-05-04 | DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization | Wenchuan Wang et.al. | 2505.02192 | null |
| 2025-05-03 | GenSync: A Generalized Talking Head Framework for Audio-driven Multi-Subject Lip-Sync using 3D Gaussian Splatting | Anushka Agarwal et.al. | 2505.01928 | null |
| 2025-05-03 | PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth | Bu Jin et.al. | 2505.01729 | null |
| 2025-05-02 | VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos | Zongxia Li et.al. | 2505.01481 | link |
| 2025-05-02 | FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis | Jiangtong Tan et.al. | 2505.01172 | link |
| 2025-05-01 | Controllable Weather Synthesis and Removal with Video Diffusion Models | Chih-Hao Lin et.al. | 2505.00704 | null |
| 2025-05-01 | T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation | Xuyang Guo et.al. | 2505.00337 | null |
| 2025-04-30 | Direct Motion Models for Assessing Generated Videos | Kelsey Allen et.al. | 2505.00209 | null |
| 2025-04-30 | Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis | Michal Geyer et.al. | 2505.00135 | null |
| 2025-04-30 | ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction | Qihao Liu et.al. | 2504.21855 | null |
| 2025-04-30 | HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation | Haiyang Zhou et.al. | 2504.21650 | link |
| 2025-04-30 | Simple Visual Artifact Detection in Sora-Generated Videos | Misora Sugiyama et.al. | 2504.21334 | null |
| 2025-04-30 | Capturing Conditional Dependence via Auto-regressive Diffusion Models | Xunpeng Huang et.al. | 2504.21314 | null |
| 2025-04-29 | TesserAct: Learning 4D Embodied World Models | Haoyu Zhen et.al. | 2504.20995 | null |
| 2025-04-29 | DDPS: Discrete Diffusion Posterior Sampling for Paths in Layered Graphs | Hao Luan et.al. | 2504.20754 | null |
| 2025-04-29 | Advance Fake Video Detection via Vision Transformers | Joy Battocchio et.al. | 2504.20669 | null |
| 2025-04-28 | CineVerse: Consistent Keyframe Synthesis for Cinematic Scene Composition | Quynh Phung et.al. | 2504.19894 | null |
| 2025-04-28 | DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer | Junpeng Jiang et.al. | 2504.19614 | null |
| 2025-04-26 | Audio-Driven Talking Face Video Generation with Joint Uncertainty Learning | Yifan Xie et.al. | 2504.18810 | null |
| 2025-04-26 | Stealing Creator's Workflow: A Creator-Inspired Agentic Framework with Iterative Feedback Loop for Improved Scientific Short-form Generation | Jong Inn Park et.al. | 2504.18805 | null |
| 2025-04-25 | NoiseController: Towards Consistent Multi-view Video Generation via Noise Decomposition and Collaboration | Haotian Dong et.al. | 2504.18448 | null |
| 2025-04-25 | We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback | Minkyu Choi et.al. | 2504.17180 | null |
| 2025-04-24 | Dynamic Camera Poses and Where to Find Them | Chris Rockwell et.al. | 2504.17788 | null |
| 2025-04-24 | MV-Crafter: An Intelligent System for Music-guided Video Generation | Chuer Chen et.al. | 2504.17267 | null |
| 2025-04-24 | DIVE: Inverting Conditional Diffusion Models for Discriminative Tasks | Yinqi Li et.al. | 2504.17253 | link |
| 2025-04-23 | Subject-driven Video Generation via Disentangled Identity and Motion | Daneul Kim et.al. | 2504.17816 | null |
| 2025-04-23 | BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation | Ruotong Wang et.al. | 2504.16907 | null |
| 2025-04-23 | ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance | Ying Li et.al. | 2504.16464 | null |
| 2025-04-23 | VideoMark: A Distortion-Free Robust Watermarking Framework for Video Diffusion Models | Xuming Hu et.al. | 2504.16359 | null |
| 2025-04-22 | DriVerse: Navigation World Model for Driving Simulation via Multimodal Trajectory Prompting and Motion Alignment | Xiaofan Li et.al. | 2504.18576 | link |
| 2025-04-22 | Survey of Video Diffusion Models: Foundations, Implementations, and Applications | Yimu Wang et.al. | 2504.16081 | link |
| 2025-04-22 | Efficient Temporal Consistency in Diffusion-Based Video Editing with Adaptor Modules: A Theoretical Framework | Xinyuan Song et.al. | 2504.16016 | null |
| 2025-04-22 | Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning | Wang Lin et.al. | 2504.15932 | null |
| 2025-04-22 | Satellite to GroundScape -- Large-scale Consistent Ground View Generation from Satellite Views | Ningli Xu et.al. | 2504.15786 | null |
| 2025-04-22 | DiTPainter: Efficient Video Inpainting with Diffusion Transformers | Xian Wu et.al. | 2504.15661 | null |
| 2025-04-21 | Solving New Tasks by Adapting Internet Video Knowledge | Calvin Luo et.al. | 2504.15369 | null |
| 2025-04-21 | Tiger200K: Manually Curated High Visual Quality Video Dataset from UGC Platform | Xianpan Zhou et.al. | 2504.15182 | null |
| 2025-04-21 | DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation | Weijie He et.al. | 2504.15032 | null |
| 2025-04-21 | Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation | Chenjie Cao et.al. | 2504.14899 | link |
| 2025-04-21 | SkyReels-V2: Infinite-length Film Generative Model | Guibin Chen et.al. | 2504.13074 | link |
| 2025-04-21 | Packing Input Frame Context in Next-Frame Prediction Models for Video Generation | Lvmin Zhang et.al. | 2504.12626 | link |
| 2025-04-20 | Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis | Jingjing Ren et.al. | 2504.14470 | null |
| 2025-04-19 | SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation | Minho Park et.al. | 2504.14396 | link |
| 2025-04-18 | Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting | Jiaxin Huang et.al. | 2504.11092 | null |
| 2025-04-17 | Understanding Attention Mechanism in Video Diffusion Models | Bingyan Liu et.al. | 2504.12027 | null |
| 2025-04-17 | VideoPanda: Video Panoramic Diffusion with Multi-view Attention | Kevin Xie et.al. | 2504.11389 | null |
| 2025-04-17 | StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text | Roberto Henschel et.al. | 2403.14773 | null |
| 2025-04-16 | VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate | Zhihang Yuan et.al. | 2504.12259 | link |
| 2025-04-16 | Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM | Zirui Pan et.al. | 2504.12048 | null |
| 2025-04-16 | The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation | Bingjie Gao et.al. | 2504.11739 | null |
| 2025-04-16 | ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation | Zongyi Li et.al. | 2410.20502 | null |
| 2025-04-15 | InterAnimate: Taming Region-aware Diffusion Model for Realistic Human Interaction Animation | Yukang Lin et.al. | 2504.10905 | null |
| 2025-04-15 | OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding | Dianbing Xi et.al. | 2504.10825 | null |
| 2025-04-14 | H-MoRe: Learning Human-centric Motion Representation for Action Analysis | Zhanbo Huang et.al. | 2504.10676 | link |
| 2025-04-14 | H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models | Yushu Wu et.al. | 2504.10567 | null |
| 2025-04-14 | FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos | Rui Chen et.al. | 2504.10358 | null |
| 2025-04-14 | Aligning Anime Video Generation with Human Feedback | Bingwen Zhu et.al. | 2504.10044 | null |
| 2025-04-14 | EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise | Chao Liu et.al. | 2504.09789 | null |
| 2025-04-13 | CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models | Pooja Guhan et.al. | 2504.09472 | null |
| 2025-04-11 | Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model | Team Seawead et.al. | 2504.08685 | null |
| 2025-04-11 | Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization | Jialu Li et.al. | 2504.08641 | null |
| 2025-04-11 | Diffusion Models for Robotic Manipulation: A Survey | Rosa Wolf et.al. | 2504.08438 | null |
| 2025-04-11 | EasyGenNet: An Efficient Framework for Audio-Driven Gesture Video Generation Based on Diffusion Model | Renda Li et.al. | 2504.08344 | null |
| 2025-04-11 | RealCam-Vid: High-resolution Video Dataset with Dynamic Scenes and Metric-scale Camera Movements | Guangcong Zheng et.al. | 2504.08212 | link |
| 2025-04-11 | TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation | Ruineng Li et.al. | 2504.08181 | null |
| 2025-04-10 | Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction | Zeren Jiang et.al. | 2504.07961 | link |
| 2025-04-10 | Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos | Rundong Luo et.al. | 2504.07940 | null |
| 2025-04-10 | Diffusion Transformers for Tabular Data Time Series Generation | Fabrizio Garuti et.al. | 2504.07566 | link |
| 2025-04-09 | EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation | Diljeet Jagpal et.al. | 2504.06861 | null |
| 2025-04-09 | DyDiT++: Dynamic Diffusion Transformers for Efficient Visual Generation | Wangbo Zhao et.al. | 2504.06803 | link |
| 2025-04-09 | RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism | Elia Peruzzo et.al. | 2504.06672 | null |
| 2025-04-09 | Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception | Ruotian Peng et.al. | 2504.06666 | null |
| 2025-04-08 | CamContextI2V: Context-aware Controllable Video Generation | Luis Denninger et.al. | 2504.06022 | link |
| 2025-04-08 | Physics-aware generative models for turbulent fluid flows through energy-consistent stochastic interpolants | Nikolaj T. Mücke et.al. | 2504.05852 | link |
| 2025-04-07 | One-Minute Video Generation with Test-Time Training | Karan Dalal et.al. | 2504.05298 | null |
| 2025-04-07 | Video-Bench: Human-Aligned Video Generation Benchmark | Hui Han et.al. | 2504.04907 | null |
| 2025-04-07 | Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation | Fa-Ting Hong et.al. | 2504.02542 | link |
| 2025-04-05 | Video4DGen: Enhancing Video and 4D Generation through Mutual Optimization | Yikai Wang et.al. | 2504.04153 | link |
| 2025-04-05 | Multi-identity Human Image Animation with Structural Video Diffusion | Zhenzhi Wang et.al. | 2504.04126 | null |
| 2025-04-05 | Can You Count to Nine? A Human Evaluation Benchmark for Counting Limits in Modern Text-to-Video Models | Xuyang Guo et.al. | 2504.04051 | null |
| 2025-04-05 | DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion | Maksim Siniukov et.al. | 2504.04010 | null |
| 2025-04-04 | Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models | Xuran Ma et.al. | 2504.03140 | link |
| 2025-04-04 | MG-Gen: Single Image to Motion Graphics Generation with Layer Decomposition | Takahiro Shirakawa et.al. | 2504.02361 | null |
| 2025-04-03 | How I Warped Your Noise: a Temporally-Correlated Noise Prior for Diffusion Models | Pascal Chang et.al. | 2504.03072 | null |
| 2025-04-03 | Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments | Chenyu Zhang et.al. | 2504.02918 | null |
| 2025-04-03 | Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets | Chuning Zhu et.al. | 2504.02792 | null |
| 2025-04-03 | Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model | Shengjun Zhang et.al. | 2504.02764 | null |
| 2025-04-03 | ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer | Jiayi Gao et.al. | 2504.02451 | link |
| 2025-04-03 | SkyReels-A2: Compose Anything in Video Diffusion Transformers | Zhengcong Fei et.al. | 2504.02436 | link |
| 2025-04-03 | OmniCam: Unified Multimodal Video Generation via Camera Control | Xiaoda Yang et.al. | 2504.02312 | null |
| 2025-04-03 | VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step | Hanyang Wang et.al. | 2504.01956 | null |
| 2025-04-03 | Loong: Generating Minute-level Long Videos with Autoregressive Language Models | Yuqing Wang et.al. | 2410.02757 | null |
| 2025-04-02 | Proof of Humanity: A Multi-Layer Network Framework for Certifying Human-Originated Content in an AI-Dominated Internet | Sebastian Barros et.al. | 2504.03752 | null |
| 2025-04-02 | WorldPrompter: Traversable Text-to-Scene Generation | Zhaoyang Zhang et.al. | 2504.02045 | null |
| 2025-04-02 | Towards Physically Plausible Video Generation via VLM Planning | Xindi Yang et.al. | 2503.23368 | null |
| 2025-04-01 | AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction | Junhao Cheng et.al. | 2504.01014 | link |
| 2025-04-01 | WorldScore: A Unified Evaluation Benchmark for World Generation | Haoyi Duan et.al. | 2504.00983 | null |
| 2025-04-01 | DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding | Chong Li et.al. | 2504.00432 | null |
| 2025-04-01 | HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation | Boyuan Wang et.al. | 2503.24026 | null |
| 2025-04-01 | On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices | Bosung Kim et.al. | 2503.23796 | link |
| 2025-03-31 | GazeLLM: Multimodal LLMs incorporating Human Visual Attention | Jun Rekimoto et.al. | 2504.00221 | null |
| 2025-03-31 | Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation | Shengqiong Wu et.al. | 2503.24379 | null |
| 2025-03-31 | JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation | Fangda Chen et.al. | 2503.23951 | null |
| 2025-03-31 | HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation | Kun Liu et.al. | 2503.23715 | null |
| 2025-03-30 | VideoGen-Eval: Agent-based System for Video Generation Evaluation | Yuhang Yang et.al. | 2503.23452 | link |
| 2025-03-30 | JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization | Kai Liu et.al. | 2503.23377 | null |
| 2025-03-30 | MoCha: Towards Movie-Grade Talking Character Synthesis | Cong Wei et.al. | 2503.23307 | null |
| 2025-03-30 | SketchVideo: Sketch-based Video Generation and Editing | Feng-Lin Liu et.al. | 2503.23284 | null |
| 2025-03-29 | Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models | Prin Phunyaphibarn et.al. | 2503.20240 | null |
| 2025-03-28 | Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Model | Jangho Park et.al. | 2503.22622 | null |
| 2025-03-28 | EchoFlow: A Foundation Model for Cardiac Ultrasound Image and Video Generation | Hadrien Reynaud et.al. | 2503.22357 | null |
| 2025-03-28 | CoGen: 3D Consistent Video Generation via Adaptive Conditioning for Autonomous Driving | Yishen Ji et.al. | 2503.22231 | null |
| 2025-03-27 | VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models | Chi-Pin Huang et.al. | 2503.21781 | null |
| 2025-03-27 | Exploring the Evolution of Physics Cognition in Video Generation: A Survey | Minghui Lin et.al. | 2503.21765 | link |
| 2025-03-27 | VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness | Dian Zheng et.al. | 2503.21755 | link |
| 2025-03-27 | Audio-driven Gesture Generation via Deviation Feature in the Latent Space | Jiahui Chen et.al. | 2503.21616 | null |
| 2025-03-27 | ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model | Jinwei Qi et.al. | 2503.21144 | null |
| 2025-03-26 | Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations | Haitong Liu et.al. | 2503.21824 | link |
| 2025-03-26 | Synthetic Video Enhances Physical Fidelity in Video Synthesis | Qi Zhao et.al. | 2503.20822 | null |
| 2025-03-26 | RecTable: Fast Modeling Tabular Data with Rectified Flow | Masane Fuchi et.al. | 2503.20731 | link |
| 2025-03-26 | AccidentSim: Generating Physically Realistic Vehicle Collision Videos from Real-World Accident Reports | Xiangwen Zhang et.al. | 2503.20654 | null |
| 2025-03-26 | GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving | Lloyd Russell et.al. | 2503.20523 | null |
| 2025-03-26 | VPO: Aligning Text-to-Video Generation Models with Prompt Optimization | Jiale Cheng et.al. | 2503.20491 | link |
| 2025-03-26 | Wan: Open and Advanced Large-Scale Video Generative Models | WanTeam et.al. | 2503.20314 | link |
| 2025-03-26 | Video Motion Graphs | Haiyang Liu et.al. | 2503.20218 | null |
| 2025-03-26 | Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing | Jaihoon Kim et.al. | 2503.19385 | null |
| 2025-03-26 | EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models | Yufei Cai et.al. | 2503.19369 | link |
| 2025-03-25 | Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors | Yuke Lou et.al. | 2503.20118 | null |
| 2025-03-25 | Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals | Stefan Stojanov et.al. | 2503.19953 | null |
| 2025-03-25 | FuXi-RTM: A Physics-Guided Prediction Framework with Radiative Transfer Modeling | Qiusheng Huang et.al. | 2503.19940 | null |
| 2025-03-25 | FullDiT: Multi-Task Video Generative Foundation Model with Full Attention | Xuan Ju et.al. | 2503.19907 | null |
| 2025-03-25 | Mask | Tianhao Qi et.al. | 2503.19881 | null |
| 2025-03-25 | AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers | Jiazhi Guan et.al. | 2503.19824 | null |
| 2025-03-25 | AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset | Haiyu Zhang et.al. | 2503.19462 | null |
| 2025-03-25 | MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation | Yukang Lin et.al. | 2503.19383 | null |
| 2025-03-25 | Long-Context Autoregressive Video Modeling with Next-Frame Prediction | Yuchao Gu et.al. | 2503.19325 | link |
| 2025-03-25 | Aether: Geometric-Aware Unified World Modeling | Aether Team et.al. | 2503.18945 | null |
| 2025-03-25 | AMD-Hummingbird: Towards an Efficient Text-to-Video Model | Takashi Isobe et.al. | 2503.18559 | link |
| 2025-03-25 | Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model | Yingying Fan et.al. | 2503.16942 | null |
| 2025-03-24 | Video-T1: Test-Time Scaling for Video Generation | Fangfu Liu et.al. | 2503.18942 | null |
| 2025-03-24 | Training-free Diffusion Acceleration with Bottleneck Sampling | Ye Tian et.al. | 2503.18940 | null |
| 2025-03-24 | EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation | Qiang Qu et.al. | 2503.18552 | null |
| 2025-03-24 | Can Text-to-Video Generation help Video-Language Alignment? | Luca Zanella et.al. | 2503.18507 | null |
| 2025-03-24 | Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation | Dingcheng Zhen et.al. | 2503.18429 | null |
| 2025-03-24 | Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance | Sicong Feng et.al. | 2503.18386 | null |
| 2025-03-23 | LongDiff: Training-Free Long Video Generation in One Go | Zhuoling Li et.al. | 2503.18150 | null |
| 2025-03-23 | TransAnimate: Taming Layer Diffusion to Generate RGBA Video | Xuewei Chen et.al. | 2503.17934 | null |
| 2025-03-22 | RDTF: Resource-efficient Dual-mask Training Framework for Multi-frame Animated Sticker Generation | Zhiqiang Yuan et.al. | 2503.17735 | null |
| 2025-03-21 | Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks | Bhishma Dedhia et.al. | 2503.17539 | null |
| 2025-03-21 | Position: Interactive Generative Video as Next-Generation Game Engine | Jiwen Yu et.al. | 2503.17359 | null |
| 2025-03-21 | AnimatePainter: A Self-Supervised Rendering Framework for Reconstructing Painting Process | Junjie Hu et.al. | 2503.17029 | null |
| 2025-03-21 | Enabling Versatile Controls for Video Diffusion Models | Xu Zhang et.al. | 2503.16983 | link |
| 2025-03-21 | SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation | Chun-Han Yao et.al. | 2503.16396 | null |
| 2025-03-20 | A Recipe for Generating 3D Worlds From a Single Image | Katja Schwarz et.al. | 2503.16611 | null |
| 2025-03-20 | XAttention: Block Sparse Attention with Antidiagonal Scoring | Ruyi Xu et.al. | 2503.16428 | link |
| 2025-03-20 | MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance | Quanhao Li et.al. | 2503.16421 | null |
| 2025-03-20 | ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos | Haolin Yang et.al. | 2503.16400 | null |
| 2025-03-20 | PoseTraj: Pose-Aware Trajectory Control in Video Diffusion | Longbin Ji et.al. | 2503.16068 | null |
| 2025-03-20 | Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models | Marc Benedí San Millán et.al. | 2503.15996 | null |
| 2025-03-20 | MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving | Haiguang Wang et.al. | 2503.15875 | link |
| 2025-03-20 | VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling | Hyojun Go et.al. | 2503.15855 | null |
| 2025-03-20 | VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention | Mingzhe Zheng et.al. | 2503.15138 | null |
| 2025-03-19 | Temporal Regularization Makes Your Video Generator Stronger | Harold Haodong Chen et.al. | 2503.15417 | null |
| 2025-03-19 | Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models | Tingxiu Chen et.al. | 2503.14966 | link |
| 2025-03-18 | MusicInfuser: Making Video Diffusion Listen and Dance | Susung Hong et.al. | 2503.14505 | null |
| 2025-03-18 | MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation | Hongyu Zhang et.al. | 2503.14428 | null |
| 2025-03-18 | Impossible Videos | Zechen Bai et.al. | 2503.14378 | null |
| 2025-03-18 | LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models | Yu Cheng et.al. | 2503.14325 | link |
| 2025-03-18 | Concat-ID: Towards Universal Identity-Preserving Video Synthesis | Yong Zhong et.al. | 2503.14151 | null |
| 2025-03-18 | Fast Autoregressive Video Generation with Diagonal Decoding | Yang Ye et.al. | 2503.14070 | null |
| 2025-03-18 | AIGVE-Tool: AI-Generated Video Evaluation Toolkit with Multifaceted Benchmark | Xinhao Xiang et.al. | 2503.14064 | link |
| 2025-03-17 | MagicDistillation: Weak-to-Strong Video Distillation for Large-Scale Portrait Few-Step Synthesis | Shitong Shao et.al. | 2503.13319 | null |
| 2025-03-17 | Language-guided Open-world Video Anomaly Detection | Zihao Liu et.al. | 2503.13160 | null |
| 2025-03-17 | Frame-wise Conditioning Adaptation for Fine-Tuning Diffusion Models in Text-to-Video Prediction | Zheyuan Liu et.al. | 2503.12953 | null |
| 2025-03-17 | AUTV: Creating Underwater Video Datasets with Pixel-wise Annotations | Quang Trung Truong et.al. | 2503.12828 | null |
| 2025-03-17 | Long-Video Audio Synthesis with Multi-Agent Collaboration | Yehang Zhang et.al. | 2503.10719 | null |
| 2025-03-16 | SPC-GS: Gaussian Splatting with Semantic-Prompt Consistency for Indoor Open-World Free-view Synthesis from Sparse Inputs | Guibiao Liao et.al. | 2503.12535 | null |
| 2025-03-16 | VMBench: A Benchmark for Perception-Aligned Video Motion Generation | Xinran Ling et.al. | 2503.10076 | link |
| 2025-03-15 | ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis | Yu Fang et.al. | 2503.14526 | null |
| 2025-03-15 | A Speech-to-Video Synthesis Approach Using Spatio-Temporal Diffusion for Vocal Tract MRI | Paula Andrea Pérez-Toro et.al. | 2503.12102 | null |
| 2025-03-15 | SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering | Byeongjun Park et.al. | 2503.12024 | link |
| 2025-03-14 | ReCamMaster: Camera-Controlled Generative Rendering from A Single Video | Jianhong Bai et.al. | 2503.11647 | null |
| 2025-03-14 | HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models | Ziqin Zhou et.al. | 2503.11513 | null |
| 2025-03-14 | TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation | Hongxiang Zhao et.al. | 2503.11423 | null |
| 2025-03-14 | Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model | Haoyang Huang et.al. | 2503.11251 | link |
| 2025-03-14 | Cross-Modal Learning for Music-to-Music-Video Description Generation | Zhuoyuan Mao et.al. | 2503.11190 | null |
| 2025-03-14 | Long Context Tuning for Video Generation | Yuwei Guo et.al. | 2503.10589 | null |
| 2025-03-14 | On the Limitations of Vision-Language Models in Understanding Image Transforms | Ahmad Mustafa Anis et.al. | 2503.09837 | null |
| 2025-03-13 | CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models | Hao He et.al. | 2503.10592 | null |
| 2025-03-13 | CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance | Yufan Deng et.al. | 2503.10391 | null |
| 2025-03-13 | Semantic Latent Motion for Portrait Video Generation | Qiyuan Zhang et.al. | 2503.10096 | null |
| 2025-03-13 | UVE: Are MLLMs Unified Evaluators for AI-Generated Videos? | Yuanxin Liu et.al. | 2503.09949 | link |
| 2025-03-13 | Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers | Yasheng Sun et.al. | 2503.09942 | null |
| 2025-03-13 | VideoMerge: Towards Training-free Long Video Generation | Siyang Zhang et.al. | 2503.09926 | null |
| 2025-03-13 | WonderVerse: Extendable 3D Scene Generation with Video Generative Models | Hao Feng et.al. | 2503.09160 | null |
| 2025-03-12 | Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework | Jing Wang et.al. | 2503.10704 | null |
| 2025-03-12 | LuciBot: Automated Robot Policy Learning from Generated Videos | Xiaowen Qiu et.al. | 2503.09871 | null |
| 2025-03-12 | I2V3D: Controllable image-to-video generation with 3D guidance | Zhiyuan Zhang et.al. | 2503.09733 | null |
| 2025-03-12 | Accelerating Diffusion Sampling via Exploiting Local Transition Coherence | Shangwen Zhu et.al. | 2503.09675 | null |
| 2025-03-12 | Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k | Xiangyu Peng et.al. | 2503.09642 | link |
| 2025-03-12 | PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop | Chenyu Li et.al. | 2503.09595 | link |
| 2025-03-12 | Unified Dense Prediction of Video Diffusion | Lehan Yang et.al. | 2503.09344 | null |
| 2025-03-12 | Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latant Space | Jian Zhu et.al. | 2503.09215 | null |
| 2025-03-12 | SwapAnyone: Consistent and Realistic Video Synthesis for Swapping Any Person into Any Video | Chengshu Zhao et.al. | 2503.09154 | link |
| 2025-03-12 | Reangle-A-Video: 4D Video Generation as Video-to-Video Translation | Hyeonho Jeong et.al. | 2503.09151 | null |
| 2025-03-12 | | Alex Ergasti et.al. | 2503.08307 | link |
| 2025-03-12 | Object-Centric World Model for Language-Guided Manipulation | Youngjoon Jeong et.al. | 2503.06170 | null |
| 2025-03-11 | V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video | Jianqi Chen et.al. | 2503.09631 | null |
| 2025-03-11 | REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder | Yitian Zhang et.al. | 2503.08665 | null |
| 2025-03-11 | Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling | Subin Kim et.al. | 2503.08605 | null |
| 2025-03-11 | WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation | Jing Wang et.al. | 2503.08153 | null |
| 2025-03-11 | ObjectMover: Generative Object Movement with Video Prior | Xin Yu et.al. | 2503.08037 | null |
| 2025-03-11 | How Can Video Generative AI Transform K-12 Education? Examining Teachers' Perspectives through TPACK and TAM | Unggi Lee et.al. | 2503.08003 | null |
| 2025-03-11 | VACE: All-in-One Video Creation and Editing | Zeyinzi Jiang et.al. | 2503.07598 | null |
| 2025-03-11 | LightMotion: A Light and Tuning-free Method for Simulating Camera Motion in Video Generation | Quanjian Song et.al. | 2503.06508 | link |
| 2025-03-10 | DreamRelation: Relation-Centric Video Customization | Yujie Wei et.al. | 2503.07602 | null |
| 2025-03-10 | AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion | Mingzhen Sun et.al. | 2503.07418 | null |
| 2025-03-10 | Automated Movie Generation via Multi-Agent CoT Planning | Weijia Wu et.al. | 2503.07314 | link |
| 2025-03-10 | From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers | Jiacheng Liu et.al. | 2503.06923 | link |
| 2025-03-09 | VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation | Hritik Bansal et.al. | 2503.06800 | null |
| 2025-03-09 | TR-DQ: Time-Rotation Diffusion Quantization | Yihua Shao et.al. | 2503.06564 | null |
| 2025-03-09 | QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation | Junyi Wu et.al. | 2503.06545 | link |
| 2025-03-09 | Generative Video Bi-flow | Chen Liu et.al. | 2503.06364 | null |
| 2025-03-08 | Text2Story: Advancing Video Storytelling with Text Guidance | Taewon Kang et.al. | 2503.06310 | null |
| 2025-03-08 | ROCM: RLHF on consistency models | Shivanshu Shekhar et.al. | 2503.06171 | null |
| 2025-03-08 | VACT: A Video Automatic Causal Testing System and a Benchmark | Haotong Yang et.al. | 2503.06163 | null |
| 2025-03-08 | GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation | Ye Tao et.al. | 2503.06136 | null |
| 2025-03-08 | DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation | Runze Zhang et.al. | 2503.06053 | null |
| 2025-03-08 | The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation | Aoxiong Yin et.al. | 2503.04606 | link |
| 2025-03-08 | Rethinking Video Tokenization: A Conditioned Diffusion-based Approach | Nianzu Yang et.al. | 2503.03708 | link |
| 2025-03-07 | MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice | Hongwei Yi et.al. | 2503.05978 | null |
| 2025-03-07 | MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio | Xuenan Xu et.al. | 2503.05242 | link |
| 2025-03-07 | Unified Reward Model for Multimodal Understanding and Generation | Yibin Wang et.al. | 2503.05236 | null |
| 2025-03-07 | Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos | Zhiyu Tan et.al. | 2502.21314 | null |
| 2025-03-06 | Toward Lightweight and Fast Decoders for Diffusion Models in Image and Video Generation | Alexey Buzovkin et.al. | 2503.04871 | link |
| 2025-03-06 | FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video | Yue Gao et.al. | 2503.04720 | null |
| 2025-03-06 | What Are You Doing? A Closer Look at Controllable Human Video Generation | Emanuele Bugliarello et.al. | 2503.04666 | null |
| 2025-03-05 | ProReflow: Progressive Reflow with Decomposed Velocity | Lei Ke et.al. | 2503.04824 | null |
| 2025-03-05 | GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control | Xuanchi Ren et.al. | 2503.03751 | link |
| 2025-03-05 | DualDiff+: Dual-Branch Diffusion for High-Fidelity Video Generation with Reward Guidance | Zhao Yang et.al. | 2503.03689 | link |
| 2025-03-05 | High-Quality Virtual Single-Viewpoint Surgical Video: Geometric Autocalibration of Multiple Cameras in Surgical Lights | Yuna Kato et.al. | 2503.03558 | link |
| 2025-03-05 | Video Super-Resolution: All You Need is a Video Diffusion Model | Zhihao Zhan et.al. | 2503.03355 | null |
| 2025-03-04 | GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning | Zhun Mou et.al. | 2503.02341 | null |
| 2025-03-04 | Unified Video Action Model | Shuang Li et.al. | 2503.00200 | null |
| 2025-03-03 | VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation | Wenhao Wang et.al. | 2503.01739 | link |
| 2025-03-03 | VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors | Juil Koo et.al. | 2503.01107 | null |
| 2025-03-03 | TransVDM: Motion-Constrained Video Diffusion Model for Transparent Video Synthesis | Menghao Li et.al. | 2502.19454 | null |
| 2025-03-02 | Extrapolating and Decoupling Image-to-Video Generation Models: Motion Modeling is Easier Than You Think | Jie Tian et.al. | 2503.00948 | link |
| 2025-03-01 | Learning to Animate Images from A Few Videos to Portray Delicate Human Actions | Haoxin Li et.al. | 2503.00276 | null |
| 2025-02-28 | Training-free and Adaptive Sparse Attention for Efficient Long Video Generation | Yifei Xia et.al. | 2502.21079 | null |
| 2025-02-28 | HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models | Xiao Wang et.al. | 2502.20811 | null |
| 2025-02-28 | WorldModelBench: Judging Video Generation Models As World Models | Dacheng Li et.al. | 2502.20694 | null |
| 2025-02-28 | RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers | Ke Cao et.al. | 2502.14377 | null |
| 2025-02-27 | Mobius: Text to Seamless Looping Video Generation via Latent Shift | Xiuli Bi et.al. | 2502.20307 | link |
| 2025-02-27 | FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute | Sotiris Anagnostidis et.al. | 2502.20126 | null |
| 2025-02-27 | C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation | Yuhao Li et.al. | 2502.19868 | link |
| 2025-02-26 | Online Pseudo-average Shifting Attention(PASA) for Robust Low-precision LLM Inference: Algorithms and Numerical Analysis | Long Cheng et.al. | 2503.01873 | null |
| 2025-02-26 | Glad: A Streaming Scene Generator for Autonomous Driving | Bin Xie et.al. | 2503.00045 | null |
| 2025-02-26 | FLAP: Fully-controllable Audio-driven Portrait Video Generation through 3D head conditioned diffusion mode | Lingzhou Mu et.al. | 2502.19455 | null |
| 2025-02-25 | SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference | Jintao Zhang et.al. | 2502.18137 | link |
| 2025-02-25 | ASurvey: Spatiotemporal Consistency in Video Generation | Zhiyu Yin et.al. | 2502.17863 | null |
| 2025-02-24 | X-Dancer: Expressive Music to Human Dance Video Generation | Zeyuan Chen et.al. | 2502.17414 | null |
| 2025-02-24 | VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing | Xiangpeng Yang et.al. | 2502.17258 | null |
| 2025-02-24 | Diffusion Models for Tabular Data: Challenges, Current Progress, and Future Directions | Zhong Li et.al. | 2502.17119 | link |
| 2025-02-21 | RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers | Min Zhao et.al. | 2502.15894 | null |
| 2025-02-21 | VaViM and VaVAM: Autonomous Driving through Video Generative Modeling | Florent Bartoccioni et.al. | 2502.15672 | link |
| 2025-02-21 | LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities | Florian Sestak et.al. | 2502.12128 | link |
| 2025-02-20 | Hardware-Friendly Static Quantization Method for Video Diffusion Transformers | Sanghyun Yi et.al. | 2502.15077 | null |
| 2025-02-20 | LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection | Qingyuan Liu et.al. | 2502.14994 | null |
| 2025-02-20 | Improving the Diffusability of Autoencoders | Ivan Skorokhodov et.al. | 2502.14831 | null |
| 2025-02-20 | Designing Parameter and Compute Efficient Diffusion Transformers using Distillation | Vignesh Sundaresha et.al. | 2502.14226 | null |
| 2025-02-19 | FantasyID: Face Knowledge Enhanced ID-Preserving Video Generation | Yunpeng Zhang et.al. | 2502.13995 | link |
| 2025-02-19 | LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation | Junchen Fu et.al. | 2502.12945 | null |
| 2025-02-18 | VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation | Xinlong Chen et.al. | 2502.12782 | link |
| 2025-02-18 | MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation | Sihyun Yu et.al. | 2502.12632 | null |
| 2025-02-17 | DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation | Zhihang Yuan et.al. | 2502.11897 | link |
| 2025-02-17 | Object-Centric Image to Video Generation with Language Guidance | Angel Villar-Corrales et.al. | 2502.11655 | null |
| 2025-02-17 | Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model | Guoqing Ma et.al. | 2502.10248 | link |
| 2025-02-17 | Magic 1-For-1: Generating One Minute Video Clips within One Minute | Hongwei Yi et.al. | 2502.07701 | link |
| 2025-02-16 | MaskFlow: Discrete Flows For Flexible and Efficient Long Video Generation | Michael Fuest et.al. | 2502.11234 | null |
| 2025-02-16 | Phantom: Subject-consistent video generation via cross-modal alignment | Lijie Liu et.al. | 2502.11079 | null |
| 2025-02-15 | SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers | Di Qiu et.al. | 2502.10841 | link |
| 2025-02-14 | RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control | Teng Li et.al. | 2502.10059 | null |
| 2025-02-14 | GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation | Hongyin Zhang et.al. | 2502.09268 | null |
| 2025-02-13 | Enhance-A-Video: Better Generated Video for Free | Yang Luo et.al. | 2502.07508 | link |
| 2025-02-12 | CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation | Qinghe Wang et.al. | 2502.08639 | null |
| 2025-02-12 | FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis | Wonjoon Jin et.al. | 2502.08244 | null |
| 2025-02-12 | Learning Human Skill Generators at Key-Step Levels | Yilu Wu et.al. | 2502.08234 | null |
| 2025-02-12 | AnyCharV: Bootstrap Controllable Character Video Generation with Fine-to-Coarse Guidance | Zhao Wang et.al. | 2502.08189 | null |
| 2025-02-12 | Next Block Prediction: Video Generation via Semi-Autoregressive Modeling | Shuhuai Ren et.al. | 2502.07737 | null |
| 2025-02-12 | VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation | Sixiao Zheng et.al. | 2502.07531 | null |
| 2024-05-07 | LLM-grounded Video Diffusion Models | Long Lian et.al. | 2309.17444 | null |
| 2023-10-12 | Echocardiography video synthesis from end diastolic semantic map via diffusion model | Phi Nguyen Van et.al. | 2310.07131 | null |
| 2023-05-30 | Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising | Fu-Yun Wang et.al. | 2305.18264 | null |
| 2023-03-21 | Latent Video Diffusion Models for High-Fidelity Long Video Generation | Yingqing He et.al. | 2211.13221 | null |
| 2022-07-12 | Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis | Long Zhuo et.al. | 2207.05049 | null |
| 2021-12-02 | Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image | Andrew Liu et.al. | 2012.09855 | null |
| 2020-11-10 | Audeo: Audio Generation for a Silent Performance Video | Kun Su et.al. | 2006.14348 | null |
| 2019-10-15 | TruNet: Short Videos Generation from Long Videos via Story-Preserving Truncation | Fan Yang et.al. | 1910.05899 | null |
TryOn
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-12-08 | Comparing quantum channels using Hermitian-preserving trace-preserving linear maps: A physically meaningful approach | Arindam Mitra et.al. | 2512.07822 | null |
| 2025-12-08 | Training-free Clothing Region of Interest Self-correction for Virtual Try-On | Shengjie Lu et.al. | 2512.07126 | null |
| 2025-12-08 | VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack | Shiji Zhao et.al. | 2512.05853 | null |
| 2025-12-05 | Where to Fly, What to Send: Communication-Aware Aerial Support for Ground Robots | Harshil Suthar et.al. | 2512.06207 | null |
| 2025-12-05 | Learning High-Fidelity Cloth Animation via Skinning-Free Image Transfer | Rong Wang et.al. | 2512.05593 | null |
| 2025-12-04 | Not All Birds Look The Same: Identity-Preserving Generation For Birds | Aaron Sun et.al. | 2512.04485 | null |
| 2025-12-03 | Tuning for TraceTarnish: Techniques, Trends, and Testing Tangible Traits | Robert Dilworth et.al. | 2512.03465 | null |
| 2025-12-02 | Methods in complete intersections in corank one | Satya Mandal et.al. | 2512.02373 | null |
| 2025-11-30 | Asymptotic and nonlinear geometries of Banach spaces and their interactions | Florent P. Baudier et.al. | 2512.00817 | null |
| 2025-11-29 | Password-Activated Shutdown Protocols for Misaligned Frontier Agents | Kai Williams et.al. | 2512.03089 | null |
| 2025-11-29 | Kicking for Goal or Touch? An Expected Points Framework for Penalty Decisions in Rugby Union | Kenny Watts et.al. | 2512.00312 | null |
| 2025-11-26 | On a form of intrinsic optimism in Set Theory | M. Muñoz Pérez et.al. | 2512.02045 | null |
| 2025-11-24 | Systematic assessment of the Hubble tension via Bayesian jackknife testing | Thomas Hughes et.al. | 2511.19341 | null |
| 2025-11-24 | Solar-GECO: Perovskite Solar Cell Property Prediction with Geometric-Aware Co-Attention | Lucas Li et.al. | 2511.19263 | null |
| 2025-11-24 | Can we detect treatment effect waning from time-to-event data? | Eni Musta et.al. | 2511.19096 | null |
| 2025-11-24 | Eevee: Towards Close-up High-resolution Video-based Virtual Try-on | Jianhao Zeng et.al. | 2511.18957 | null |
| 2025-11-24 | Rethinking Garment Conditioning in Diffusion-based Virtual Try-On | Kihyun Na et.al. | 2511.18775 | null |
| 2025-11-23 | Projective deduction of the non-trivial first integral to the Euler problem: an explicit computation | Gabriella Pinzari et.al. | 2511.18569 | null |
| 2025-11-22 | Towards a General Framework for HTN Modeling with LLMs | Israel Puerta-Merino et.al. | 2511.18165 | null |
| 2025-11-22 | Active Learning with Selective Time-Step Acquisition for PDEs | Yegon Kim et.al. | 2511.18107 | null |
| 2025-11-21 | Pre-cache: A Microarchitectural Solution to prevent Meltdown and Spectre | Subhash Sethumurugan et.al. | 2511.17726 | null |
| 2025-11-20 | Data-Driven Stellar Spectral Modelling with GSPICE | Douglas P. Finkbeiner et.al. | 2511.16754 | null |
| 2025-11-19 | UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment | Wei Zhang et.al. | 2511.15831 | null |
| 2025-11-19 | Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution | N Dinesh Reddy et.al. | 2511.14210 | null |
| 2025-11-18 | A System Dynamics Approach to Evaluating Sludge Management Strategies in Vinasse Treatment: Cost-Benefit Analysis and Scenario Assessment | Agustin Olivares et.al. | 2511.14607 | null |
| 2025-11-18 | PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos | Dianbing Xi et.al. | 2511.12935 | null |
| 2025-11-17 | Multi-Objective Statistical Model Checking using Lightweight Strategy Sampling (extended version) | Pedro R. D'Argenio et.al. | 2511.13460 | null |
| 2025-11-16 | Nonlocal action in Everettian Quantum Mechanics | Mordecai Waegell et.al. | 2511.12403 | null |
| 2025-11-16 | RefVTON: person-to-person Try on with Additional Unpaired Visual Reference | Liuzhuozheng Li et.al. | 2511.00956 | null |
| 2025-11-14 | Learning Fair Representations with Kolmogorov-Arnold Networks | Amisha Priyadarshini et.al. | 2511.11767 | null |
| 2025-11-14 | Discovering Meaningful Units with Visually Grounded Semantics from Image Captions | Melika Behjati et.al. | 2511.11262 | null |
| 2025-11-14 | Power Ensemble Aggregation for Improved Extreme Event AI Prediction | Julien Collard et.al. | 2511.11170 | null |
| 2025-11-13 | Optimal Welfare in Noncooperative Network Formation under Attack | Natan Doubez et.al. | 2511.10845 | null |
| 2025-11-13 | Generating optimal Gravitational-Wave template banks with metric-preserving autoencoders | Giovanni Cabass et.al. | 2511.10466 | null |
| 2025-11-12 | Efficiently Transforming Neural Networks into Decision Trees: A Path to Ground Truth Explanations with RENTT | Helena Monke et.al. | 2511.09299 | null |
| 2025-11-12 | Food as Soft Power: Taiwanese Gastrodiplomacy on Social Media and Algorithmic Suppression | Andrew Yen Chang et.al. | 2511.05729 | null |
| 2025-11-10 | Detecting Suicidal Ideation in Text with Interpretable Deep Learning: A CNN-BiGRU with Attention Mechanism | Mohaiminul Islam Bhuiyan et.al. | 2511.08636 | null |
| 2025-11-10 | On maximizing private neighbors in graphs | Stephen T. Hedetniemi et.al. | 2511.07248 | null |
| 2025-11-06 | Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts | Ellis Brown et.al. | 2511.04655 | null |
| 2025-11-06 | IntelliProof: An Argumentation Network-based Conversational Helper for Organized Reflection | Kaveh Eskandari Miandoab et.al. | 2511.04528 | null |
| 2025-11-06 | The truth is no diaper: Human and AI-generated associations to emotional words | Špela Vintar et.al. | 2511.04077 | null |
| 2025-11-04 | Effective Test-Time Scaling of Discrete Diffusion through Iterative Refinement | Sanghyun Lee et.al. | 2511.05562 | null |
| 2025-11-04 | FLAME: Flexible and Lightweight Biometric Authentication Scheme in Malicious Environments | Fuyi Wang et.al. | 2511.02176 | null |
| 2025-11-03 | Confounding Factors in Relating Model Performance to Morphology | Wessel Poelman et.al. | 2511.01380 | null |
| 2025-11-02 | AGRAG: Advanced Graph-based Retrieval-Augmented Generation for LLMs | Yubo Wang et.al. | 2511.05549 | null |
| 2025-11-01 | Sparse and nonparametric estimation of equations governing dynamical systems with applications to biology | G. Pillonetto et.al. | 2511.00579 | null |
| 2025-10-31 | Quantum-dot single photon source performance with off-resonant pulse preparation schemes | Gavin Crowder et.al. | 2511.00243 | null |
| 2025-10-31 | EL-MIA: Quantifying Membership Inference Risks of Sensitive Entities in LLMs | Ali Satvaty et.al. | 2511.00192 | null |
| 2025-10-31 | Consistency Training Helps Stop Sycophancy and Jailbreaks | Alex Irpan et.al. | 2510.27062 | null |
| 2025-10-30 | Ring-polymer instanton theory for tunneling between asymmetric wells | Marit R. Fiechter et.al. | 2510.26592 | null |
| 2025-10-29 | Heuristic Quantum Advantage with Peaked Circuits | Hrant Gharibyan et.al. | 2510.25838 | null |
| 2025-10-29 | Tackling the Algorithmic Control Crisis -- the Technical, Legal, and Ethical Challenges of Research into Algorithmic Agents | B. Bodo et.al. | 2510.25337 | null |
| 2025-10-16 | ART-VITON: Measurement-Guided Latent Diffusion for Artifact-Free Virtual Try-On | Junseo Park et.al. | 2509.25749 | null |
| 2025-10-09 | Once Is Enough: Lightweight DiT-Based Video Virtual Try-On via One-Time Garment Appearance Injection | Yanjie Pan et.al. | 2510.07654 | null |
| 2025-10-06 | AvatarVTON: 4D Virtual Try-On for Animatable Avatars | Zicheng Jiang et.al. | 2510.04822 | null |
| 2025-10-03 | DiT-VTON: Diffusion Transformer Framework for Unified Multi-Category Virtual Try-On and Virtual Try-All with Integrated Image Editing | Qi Li et.al. | 2510.04797 | null |
| 2025-10-01 | Virtual Fashion Photo-Shoots: Building a Large-Scale Garment-Lookbook Dataset | Yannick Hauri et.al. | 2510.00633 | null |
| 2025-09-29 | UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections | Zeyu Cai et.al. | 2509.24817 | null |
| 2025-09-29 | ControlHair: Physically-based Video Diffusion for Controllable Dynamic Hair Rendering | Weikai Lin et.al. | 2509.21541 | null |
| 2025-09-24 | InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On | Julien Han et.al. | 2509.20524 | null |
| 2025-09-24 | Efficient Encoder-Free Pose Conditioning and Pose Control for Virtual Try-On | Qi Li et.al. | 2509.20343 | null |
| 2025-09-23 | Clothing agnostic Pre-inpainting Virtual Try-ON | Sehyun Kim et.al. | 2509.17654 | null |
| 2025-09-21 | SemanticGarment: Semantic-Controlled Generation and Editing of 3D Gaussian Garments | Ruiyan Wang et.al. | 2509.16960 | null |
| 2025-09-16 | DEFT-VTON: Efficient Virtual Try-On with Consistent Generalised H-Transform | Xingzi Xu et.al. | 2509.13506 | null |
| 2025-09-05 | LUIVITON: Learned Universal Interoperable VIrtual Try-ON | Cong Cao et.al. | 2509.05030 | null |
| 2025-09-04 | Virtual Fitting Room: Generating Arbitrarily Long Videos of Virtual Try-On from a Single Image -- Technical Preview | Jun-Kun Chen et.al. | 2509.04450 | null |
| 2025-09-04 | Towards High-Fidelity, Identity-Preserving Real-Time Makeup Transfer: Decoupling Style Generation | Lydia Kin Ching Chau et.al. | 2509.02445 | null |
| 2025-08-30 | IC-Custom: Diverse Image Customization via In-Context Learning | Yaowei Li et.al. | 2507.01926 | null |
| 2025-08-28 | Dress&Dance: Dress up and Dance as You Like It - Technical Preview | Jun-Kun Chen et.al. | 2508.21070 | null |
| 2025-08-28 | FastFit: Accelerating Multi-Reference Virtual Try-On via Cacheable Diffusion Models | Zheng Chong et.al. | 2508.20586 | null |
| 2025-08-25 | JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on | Aowen Wang et.al. | 2508.17614 | null |
| 2025-08-19 | OmniTry: Virtual Try-On Anything without Masks | Yutong Feng et.al. | 2508.13632 | null |
| 2025-08-16 | DualFit: A Two-Stage Virtual Try-On via Warping and Synthesis | Minh Tran et.al. | 2508.12131 | null |
| 2025-08-12 | StyleTailor: Towards Personalized Fashion Styling via Hierarchical Negative Feedback | Hongbo Ma et.al. | 2508.06555 | null |
| 2025-08-11 | MuGa-VTON: Multi-Garment Virtual Try-On via Diffusion Transformers with Prompt Customization | Ankan Deria et.al. | 2508.08488 | null |
| 2025-08-11 | Undress to Redress: A Training-Free Framework for Virtual Try-On | Zhiying Li et.al. | 2508.07680 | null |
| 2025-08-07 | One Model For All: Partial Diffusion for Unified Try-On and Try-Off in Any Pose | Jinxi Liu et.al. | 2508.04559 | null |
| 2025-08-06 | Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off | Seungyong Lee et.al. | 2508.04825 | null |
| 2025-08-06 | Two-Way Garment Transfer: Unified Diffusion Framework for Dressing and Undressing Synthesis | Angang Zhang et.al. | 2508.04551 | null |
| 2025-08-06 | FFHQ-Makeup: Paired Synthetic Makeup Dataset with Facial Consistency Across Multiple Styles | Xingchao Yang et.al. | 2508.03241 | null |
| 2025-08-04 | DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework | Tongchun Zuo et.al. | 2508.02807 | null |
| 2025-07-29 | From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos | Chenjian Gao et.al. | 2507.20331 | null |
| 2025-07-29 | Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism | Jun Zheng et.al. | 2412.09822 | null |
| 2025-07-21 | FW-VTON: Flattening-and-Warping for Person-to-Person Virtual Try-on | Zheng Wang et.al. | 2507.16010 | null |
| 2025-07-20 | OmniVTON: Training-Free Universal Virtual Try-On | Zhaotong Yang et.al. | 2507.15037 | null |
| 2025-07-11 | Scalable and Realistic Virtual Try-on Application for Foundation Makeup with Kubelka-Munk Theory | Hui Pang et.al. | 2507.07333 | null |
| 2025-07-08 | TalkFashion: Intelligent Virtual Try-On Assistant Based on Multimodal Large Language Model | Yujie Hu et.al. | 2507.05790 | null |
| 2025-07-02 | FreeLoRA: Enabling Training-Free LoRA Fusion for Autoregressive Multi-Subject Personalization | Peng Zheng et.al. | 2507.01792 | null |
| 2025-06-30 | KiseKloset: Comprehensive System For Outfit Retrieval, Recommendation, And Try-On | Thanh-Tung Phan-Nguyen et.al. | 2506.23471 | null |
| 2025-06-29 | DiffFit: Disentangled Garment Warping and Texture Refinement for Virtual Try-On | Xiang Xu et.al. | 2506.23295 | null |
| 2025-06-26 | Video Virtual Try-on with Conditional Diffusion Transformer Inpainter | Cheng Zou et.al. | 2506.21270 | null |
| 2025-06-23 | InstructAttribute: Fine-grained Object Attributes editing with Instruction | Xingxi Yin et.al. | 2505.00751 | null |
| 2025-06-14 | Real-Time Per-Garment Virtual Try-On with Temporal Consistency for Loose-Fitting Garments | Zaiqiang Wu et.al. | 2506.12348 | null |
| 2025-06-13 | HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment | Ming Meng et.al. | 2505.19638 | null |
| 2025-06-12 | Low-Barrier Dataset Collection with Real Human Body for Interactive Per-Garment Virtual Try-On | Zaiqiang Wu et.al. | 2506.10468 | null |
| 2025-06-06 | ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On | Jinjuan Wang et.al. | 2506.05858 | null |
| 2025-06-02 | OmniV2V: Versatile Video Generation and Editing via Dynamic Content Manipulation | Sen Liang et.al. | 2506.01801 | null |
| 2025-06-01 | DS-VTON: High-Quality Virtual Try-on via Disentangled Dual-Scale Generation | Xianbing Sun et.al. | 2506.00908 | null |
| 2025-05-29 | VITON-DRR: Details Retention Virtual Try-on via Non-rigid Registration | Ben Li et.al. | 2505.23439 | null |
| 2025-05-28 | MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on | Guangyuan Li et.al. | 2505.21325 | null |
| 2025-05-27 | Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals | Davide Lobba et.al. | 2505.21062 | null |
| 2025-05-26 | VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models | Hu Xiaobin et.al. | 2505.19571 | null |
| 2025-05-22 | Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction | Dong Li et.al. | 2505.16980 | null |
| 2025-05-22 | Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On | Siqi Wan et.al. | 2505.16977 | link |
| 2025-05-15 | Single View Garment Reconstruction Using Diffusion Mapping Via Pattern Coordinates | Ren Li et.al. | 2504.08353 | link |
| 2025-04-29 | Creating Your Editable 3D Photorealistic Avatar with Tetrahedron-constrained Gaussian Splatting | Hanxi Liu et.al. | 2504.20403 | null |
| 2025-04-24 | FashionM3: Multimodal, Multitask, and Multiround Fashion Assistant based on Unified Vision-Language Model | Kaicheng Pang et.al. | 2504.17826 | null |
| 2025-04-24 | 3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models | Min Wei et.al. | 2504.17414 | null |
| 2025-04-21 | Shape-Guided Clothing Warping for Virtual Try-On | Xiaoyu Han et.al. | 2504.15232 | link |
| 2025-04-21 | Insert Anything: Image Insertion via In-Context Editing in DiT | Wensong Song et.al. | 2504.15009 | null |
| 2025-04-19 | Flux Already Knows -- Activating Subject-Driven Image Generation without Training | Hao Kang et.al. | 2504.11478 | link |
| 2025-04-19 | Concat-ID: Towards Universal Identity-Preserving Video Synthesis | Yong Zhong et.al. | 2503.14151 | null |
| 2025-04-18 | Fashion-RAG: Multimodal Fashion Image Editing via Retrieval-Augmented Generation | Fulvio Sanguigni et.al. | 2504.14011 | null |
| 2025-04-17 | Enhancing Person-to-Person Virtual Try-On with Multi-Garment Virtual Try-Off | Riza Velioglu et.al. | 2504.13078 | link |
| 2025-04-15 | ReZero: Enhancing LLM search ability by trying one-more-time | Alan Dao et.al. | 2504.11001 | null |
| 2025-04-11 | VTON 360: High-Fidelity Virtual Try-On from Any Viewing Direction | Zijian He et.al. | 2503.12165 | null |
| 2025-04-04 | From Keypoints to Realism: A Realistic and Accurate Virtual Try-on Network from 2D Images | Maliheh Toozandehjani et.al. | 2504.03807 | null |
| 2025-04-03 | MAD: Makeup All-in-One with Cross-Domain Diffusion Model | Bo-Kai Ruan et.al. | 2504.02545 | null |
| 2025-04-01 | Diffusion Model-Based Size Variable Virtual Try-On Technology and Evaluation Method | Shufang Zhang et.al. | 2504.00562 | null |
| 2025-03-26 | ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On | Ji Woo Hong et.al. | 2503.20418 | null |
| 2025-03-26 | Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks | Hailong Guo et.al. | 2501.15891 | null |
| 2025-03-25 | Exploring Disentangled and Controllable Human Image Synthesis: From End-to-End to Stage-by-Stage | Zhengwentai Sun et.al. | 2503.19486 | null |
| 2025-03-20 | Shining Yourself: High-Fidelity Ornaments Virtual Try-on with Diffusion Model | Yingmao Miao et.al. | 2503.16065 | null |
| 2025-03-18 | Limb-Aware Virtual Try-On Network with Progressive Clothing Warping | Shengping Zhang et.al. | 2503.14074 | link |
| 2025-03-16 | Progressive Limb-Aware Virtual Try-On | Xiaoyu Han et.al. | 2503.12588 | link |
| 2025-03-15 | ITVTON: Virtual Try-On Diffusion Transformer Based on Integrated Image and Text | Haifeng Ni et.al. | 2501.16757 | null |
| 2025-03-11 | MF-VITON: High-Fidelity Mask-Free Virtual Try-On with Minimal Input | Zhenchen Wan et.al. | 2503.08650 | null |
| 2025-03-11 | RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency | Siqi Li et.al. | 2501.08682 | null |
| 2025-02-20 | CrossVTON: Mimicking the Logic Reasoning on Cross-category Virtual Try-on guided by Tri-zone Priors | Donghao Luo et.al. | 2502.14373 | null |
| 2025-02-05 | Dress-1-to-3: Single Image to Simulation-Ready 3D Outfit with Diffusion Prior and Differentiable Physics | Xuan Li et.al. | 2502.03449 | null |
| 2025-02-03 | MFP-VTON: Enhancing Mask-Free Person-to-Person Virtual Try-On via Diffusion Transformer | Le Shen et.al. | 2502.01626 | null |
| 2025-01-26 | IPVTON: Image-based 3D Virtual Try-on with Image Prompt Adapter | Xiaojing Zhong et.al. | 2501.15616 | null |
| 2025-01-26 | Cross-Cultural Fashion Design via Interactive Large Language Models and Diffusion Models | Spencer Ramsey et.al. | 2501.15571 | null |
| 2025-01-20 | EfficientVITON: An Efficient Virtual Try-On Model using Optimized Diffusion Process | Mostafa Atef et.al. | 2501.11776 | null |
| 2025-01-20 | CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation | Zheng Chong et.al. | 2501.11325 | link |
| 2025-01-17 | Disharmony: Forensics using Reverse Lighting Harmonization | Philip Wootaek Shin et.al. | 2501.10212 | null |
| 2025-01-12 | ODPG: Outfitting Diffusion with Pose Guided Condition | Seohyun Lee et.al. | 2501.06769 | null |
| 2025-01-10 | MC-VTON: Minimal Control Virtual Try-On Diffusion Transformer | Junsheng Luan et.al. | 2501.03630 | null |
| 2025-01-09 | 1-2-1: Renaissance of Single-Network Paradigm for Virtual Try-On | Shuliang Ning et.al. | 2501.05369 | null |
| 2025-01-08 | Enhancing Virtual Try-On with Synthetic Pairs and Error-Aware Noise Scheduling | Nannan Li et.al. | 2501.04666 | null |
| 2025-01-07 | HYB-VITON: A Hybrid Approach to Virtual Try-On Combining Explicit and Implicit Warping | Kosuke Takemoto et.al. | 2501.03910 | link |
| 2025-01-07 | VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control | Yuanpeng Tu et.al. | 2501.01427 | null |
| 2024-12-25 | DRDM: A Disentangled Representations Diffusion Model for Synthesizing Realistic Person Images | Enbo Huang et.al. | 2412.18797 | null |
| 2024-12-22 | PromptDresser: Improving the Quality and Controllability of Virtual Try-On via Generative Textual Prompt and Prompt-aware Mask | Jeongho Kim et.al. | 2412.16978 | link |
| 2024-12-19 | DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On | Wengyi Zhan et.al. | 2412.14465 | null |
| 2024-12-19 | FashionComposer: Compositional Fashion Image Generation | Sihui Ji et.al. | 2412.14168 | null |
| 2024-11-18 | Try-On-Adapter: A Simple and Flexible Try-On Paradigm | Hanzhong Guo et.al. | 2411.10187 | null |
| 2024-07-18 | Time-Efficient and Identity-Consistent Virtual Try-On Using A Variant of Altered Diffusion Models | Phuong Dam et.al. | 2403.07371 | null |
| 2024-07-18 | Street TryOn: Learning In-the-Wild Virtual Try-On from Unpaired Person Images | Aiyu Cui et.al. | 2311.16094 | null |
| 2024-06-05 | GraVITON: Graph based garment warping with attention guided inversion for Virtual-tryon | Sanhita Pathak et.al. | 2406.02184 | null |
| 2024-05-28 | Single Stage Warped Cloth Learning and Semantic-Contextual Attention Feature Fusion for Virtual TryOn | Sanhita Pathak et.al. | 2310.05024 | null |
| 2024-05-08 | VTON-IT: Virtual Try-On using Image Translation | Santosh Adhikari et.al. | 2310.04558 | null |
| 2024-04-29 | Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos | Zhengze Xu et.al. | 2404.17571 | null |
| 2024-04-02 | TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On | Jiazheng Xing et.al. | 2404.00878 | null |
| 2023-04-03 | Learning Garment DensePose for Robust Warping in Virtual Try-On | Aiyu Cui et.al. | 2303.17688 | null |
| 2021-09-13 | Per Garment Capture and Synthesis for Real-time Virtual Try-on | Toby Chong et.al. | 2109.04654 | null |
| 2021-08-25 | ARShoe: Real-Time Augmented Reality Shoe Try-on System on Smartphones | Shan An et.al. | 2108.10515 | null |
| 2021-06-01 | An Efficient Style Virtual Try on Network for Clothing Business Industry | Shanchen Pang et.al. | 2105.13183 | null |
| 2021-01-14 | ShineOn: Illuminating Design Choices for Practical Video-based Virtual Clothing Try-on | Gaurav Kuppa et.al. | 2012.10495 | null |
| 2016-02-22 | Issues in the Multiple Try Metropolis mixing | L. Martino et.al. | 1508.04253 | null |
| 2015-02-27 | Trying to understand dark matter | B. Hoeneisen et.al. | 1502.07375 | null |
| 2014-05-20 | On the flexibility of the design of Multiple Try Metropolis schemes | Luca Martino et.al. | 1201.0646 | null |
Visual Edit
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-12-08 | OpenVE-3M: A Large-Scale High-Quality Dataset for Instruction-Guided Video Editing | Haoyang He et.al. | 2512.07826 | null |
| 2025-12-08 | LongCat-Image Technical Report | Meituan LongCat Team et.al. | 2512.07584 | null |
| 2025-12-08 | Unified Video Editing with Temporal Reasoner | Xiangpeng Yang et.al. | 2512.07469 | null |
| 2025-12-08 | MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition | Xinyu Wei et.al. | 2512.07348 | null |
| 2025-12-08 | AdLift: Lifting Adversarial Perturbations to Safeguard 3D Gaussian Splatting Assets Against Instruction-Driven Editing | Ziming Hong et.al. | 2512.07247 | null |
| 2025-12-08 | Coherent Audio-Visual Editing via Conditional Audio Generation Following Video Edits | Masato Ishii et.al. | 2512.07209 | null |
| 2025-12-05 | EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing | Runjia Li et.al. | 2512.06065 | null |
| 2025-12-05 | EditThinker: Unlocking Iterative Reasoning for Any Image Editor | Hongyu Li et.al. | 2512.05965 | null |
| 2025-12-05 | World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty | Zhiting Mei et.al. | 2512.05927 | null |
| 2025-12-05 | Edit-aware RAW Reconstruction | Abhijith Punnappurath et.al. | 2512.05859 | null |
| 2025-12-05 | InverseCrafter: Efficient Video ReCapture as a Latent Domain Inverse Problem | Yeobin Hong et.al. | 2512.05672 | null |
| 2025-12-05 | 2K-Characters-10K-Stories: A Quality-Gated Stylized Narrative Dataset with Disentangled Control and Sequence Consistency | Xingxi Yin et.al. | 2512.05557 | null |
| 2025-12-05 | SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling | Elisabetta Fedele et.al. | 2512.05343 | null |
| 2025-12-05 | EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture | Xin He et.al. | 2512.04810 | null |
| 2025-12-04 | DEAR: Dataset for Evaluating the Aesthetics of Rendering | Vsevolod Plohotnuk et.al. | 2512.05209 | null |
| 2025-12-04 | Your Latent Mask is Wrong: Pixel-Equivalent Latent Compositing for Diffusion Models | Rowan Bradbury et.al. | 2512.05198 | null |
| 2025-12-04 | Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints | Minghan Zhu et.al. | 2512.05079 | null |
| 2025-12-04 | I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models | Juntong Wang et.al. | 2512.04660 | null |
| 2025-12-04 | X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale | Pei Yang et.al. | 2512.04537 | null |
| 2025-12-04 | Refaçade: Editing Object with Given Reference Texture | Youze Huang et.al. | 2512.04534 | null |
| 2025-12-04 | SlideGen: Collaborative Multimodal Agents for Scientific Slide Generation | Xin Liang et.al. | 2512.04529 | null |
| 2025-12-04 | Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation | Sidan Zhu et.al. | 2512.04426 | null |
| 2025-12-04 | ViDiC: Video Difference Captioning | Jiangtao Wu et.al. | 2512.03405 | null |
| 2025-12-03 | PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design | Jiazhe Wei et.al. | 2512.04082 | null |
| 2025-12-03 | DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment | Sheng-Hao Liao et.al. | 2512.03981 | null |
| 2025-12-03 | Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence | Shuai Yang et.al. | 2512.03905 | null |
| 2025-12-03 | GaussianBlender: Instant Stylization of 3D Gaussians with Disentangled Latent Spaces | Melis Ocal et.al. | 2512.03683 | null |
| 2025-12-03 | Global-Local Aware Scene Text Editing | Fuxiang Yang et.al. | 2512.03574 | null |
| 2025-12-03 | Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models | Shojiro Yamabe et.al. | 2512.03463 | null |
| 2025-12-03 | SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation | Yu Yuan et.al. | 2512.03350 | null |
| 2025-12-03 | LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization | Zhihan Xiao et.al. | 2512.02933 | null |
| 2025-12-03 | Flowchart2Mermaid: A Vision-Language Model Powered System for Converting Flowcharts into Editable Diagram Code | Pritam Deka et.al. | 2512.02170 | null |
| 2025-12-02 | PixPerfect: Seamless Latent Diffusion Local Editing with Discriminative Pixel-Space Refinement | Haitian Zheng et.al. | 2512.03247 | null |
| 2025-12-02 | MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues | Zichen Liu et.al. | 2512.03046 | null |
| 2025-12-02 | PPTArena: A Benchmark for Agentic PowerPoint Editing | Michael Ofengenden et.al. | 2512.03042 | null |
| 2025-12-02 | In-Context Sync-LoRA for Portrait Video Editing | Sagi Polaczek et.al. | 2512.03013 | null |
| 2025-12-02 | Are Detectors Fair to Indian IP-AIGC? A Cross-Generator Study | Vishal Dubey et.al. | 2512.02850 | null |
| 2025-12-02 | Hear What Matters! Text-conditioned Selective Video-to-Audio Generation | Junwon Lee et.al. | 2512.02650 | null |
| 2025-12-02 | PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding | Zheng Huang et.al. | 2512.02624 | null |
| 2025-12-02 | PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing | Junyi Hou et.al. | 2512.02589 | null |
| 2025-12-01 | UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits | Keming Ye et.al. | 2512.02790 | null |
| 2025-12-01 | DepthScape: Authoring 2.5D Designs via Depth Estimation, Semantic Understanding, and Geometry Extraction | Xia Su et.al. | 2512.02263 | null |
| 2025-12-01 | CoatFusion: Controllable Material Coating in Images | Sagie Levy et.al. | 2512.02143 | null |
| 2025-12-01 | Generative Video Motion Editing with 3D Point Tracks | Yao-Chih Lee et.al. | 2512.02015 | null |
| 2025-12-01 | TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models | Zhiheng Liu et.al. | 2512.02014 | null |
| 2025-12-01 | FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing | Yucheng Liao et.al. | 2512.01755 | null |
| 2025-12-01 | Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval | Xin Wang et.al. | 2512.01636 | null |
| 2025-12-01 | Reversible Inversion for Training-Free Exemplar-guided Image Editing | Yuke Li et.al. | 2512.01382 | null |
| 2025-12-01 | Fast Multi-view Consistent 3D Editing with Video Priors | Liyi Chen et.al. | 2511.23172 | null |
| 2025-11-30 | Graph Queries from Natural Language using Constrained Language Models and Visual Editing | Benedikt Kantz et.al. | 2512.00948 | null |
| 2025-11-30 | Hybrid-DMKG: A Hybrid Reasoning Framework over Dynamic Multimodal Knowledge Graphs for Multimodal Multihop QA with Knowledge Editing | Li Yuan et.al. | 2512.00881 | null |
| 2025-11-30 | PanFlow: Decoupled Motion Control for Panoramic Video Generation | Cheng Zhang et.al. | 2512.00832 | null |
| 2025-11-30 | Seeing the Wind from a Falling Leaf | Zhiyuan Gao et.al. | 2512.00762 | null |
| 2025-11-30 | Charts Are Not Images: On the Challenges of Scientific Chart Editing | Shawn Li et.al. | 2512.00752 | null |
| 2025-11-30 | Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer | Dong In Lee et.al. | 2512.00677 | null |
| 2025-11-29 | NeuroVolve: Evolving Visual Stimuli toward Programmable Neural Objectives | Haomiao Chen et.al. | 2512.00557 | null |
| 2025-11-29 | Recognizing Pneumonia in Real-World Chest X-rays with a Classifier Trained with Images Synthetically Generated by Nano Banana | Jiachuan Peng et.al. | 2512.00428 | null |
| 2025-11-29 | WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing | Kaihang Pan et.al. | 2512.00387 | null |
| 2025-11-29 | POLARIS: Projection-Orthogonal Least Squares for Robust and Adaptive Inversion in Diffusion Models | Wenshuo Chen et.al. | 2512.00369 | null |
| 2025-11-29 | USB: Unified Synthetic Brain Framework for Bidirectional Pathology-Healthy Generation and Editing | Jun Wang et.al. | 2512.00269 | null |
| 2025-11-28 | DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline | Rui Zhang et.al. | 2511.23377 | null |
| 2025-11-28 | Vision Bridge Transformer at Scale | Zhenxiong Tan et.al. | 2511.23199 | null |
| 2025-11-28 | NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing | Zhenyu Xu et.al. | 2511.23105 | null |
| 2025-11-28 | Evaluating the Clinical Impact of Generative Inpainting on Bone Age Estimation | Felipe Akio Matsuoka et.al. | 2511.23066 | null |
| 2025-11-28 | JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization | Yunlong Lin et.al. | 2511.23002 | null |
| 2025-11-28 | MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation | Yuta Oshima et.al. | 2511.22989 | null |
| 2025-11-27 | Improving Robotic Manipulation Robustness via NICE Scene Surgery | Sajjad Pakdamansavoji et.al. | 2511.22777 | null |
| 2025-11-27 | Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer | Z-Image Team et.al. | 2511.22699 | null |
| 2025-11-27 | Test-time scaling of diffusions with flow maps | Amirmojtaba Sabour et.al. | 2511.22688 | null |
| 2025-11-27 | REASONEDIT: Towards Reasoning-Enhanced Image Editing Models | Fukun Yin et.al. | 2511.22625 | null |
| 2025-11-27 | Creating Blank Canvas Against AI-enabled Image Forgery | Qi Song et.al. | 2511.22237 | null |
| 2025-11-27 | 3D-Consistent Multi-View Editing by Diffusion Guidance | Josef Bengtson et.al. | 2511.22228 | null |
| 2025-11-27 | G | Wenbo Hu et.al. | 2511.21688 | null |
| 2025-11-26 | PAT3D: Physics-Augmented Text-to-3D Scene Generation | Guying Lin et.al. | 2511.21978 | null |
| 2025-11-26 | Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation | Joonhyung Park et.al. | 2511.21185 | null |
| 2025-11-26 | AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control | Xinyue Guo et.al. | 2511.21146 | null |
| 2025-11-26 | CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion | Dianbing Xi et.al. | 2511.21129 | null |
| 2025-11-26 | FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain | YuAn Wang et.al. | 2511.21113 | null |
| 2025-11-26 | MIRA: Multimodal Iterative Reasoning Agent for Image Editing | Ziyun Zeng et.al. | 2511.21087 | null |
| 2025-11-26 | MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization | Yingjie Xia et.al. | 2511.21051 | null |
| 2025-11-26 | CameraMaster: Unified Camera Semantic-Parameter Control for Photography Retouching | Qirui Yang et.al. | 2511.21024 | null |
| 2025-11-26 | From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition | Jingxi Chen et.al. | 2511.20996 | null |
| 2025-11-26 | Inversion-Free Style Transfer with Dual Rectified Flows | Yingying Deng et.al. | 2511.20986 | null |
| 2025-11-26 | Beyond Realism: Learning the Art of Expressive Composition with StickerNet | Haoming Lu et.al. | 2511.20957 | null |
| 2025-11-25 | GaINeR: Geometry-Aware Implicit Network Representation | Weronika Jakubowska et.al. | 2511.20924 | null |
| 2025-11-25 | DinoLizer: Learning from the Best for Generative Inpainting Localization | Minh Thong Doi et.al. | 2511.20722 | null |
| 2025-11-25 | MotionV2V: Editing Motion in a Video | Ryan Burgert et.al. | 2511.20640 | null |
| 2025-11-25 | iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation | Zhoujie Fu et.al. | 2511.20635 | null |
| 2025-11-25 | The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment | Ziheng Ouyang et.al. | 2511.20614 | null |
| 2025-11-25 | PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding | Haoze Zhang et.al. | 2511.20562 | null |
| 2025-11-25 | OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation | Hao Yu et.al. | 2511.20211 | null |
| 2025-11-25 | UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers | Min Zhao et.al. | 2511.20123 | null |
| 2025-11-25 | Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization | Xingyue Lin et.al. | 2511.20034 | null |
| 2025-11-25 | OmniRefiner: Reinforcement-Guided Local Diffusion Refinement | Yaoli Liu et.al. | 2511.19990 | null |
| 2025-11-25 | Low-Resolution Editing is All You Need for High-Resolution Editing | Junsung Lee et.al. | 2511.19945 | null |
| 2025-11-25 | Are Image-to-Video Models Good Zero-Shot Image Editors? | Zechuan Zhang et.al. | 2511.19435 | null |
| 2025-11-24 | Agint: Agentic Graph Compilation for Software Engineering Agents | Abhi Chivukula et.al. | 2511.19635 | null |
| 2025-11-24 | Vidi2: Large Multimodal Models for Video Understanding and Creation | Vidi Team et.al. | 2511.19529 | null |
| 2025-11-24 | Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction | Yun Zhou et.al. | 2511.19426 | null |
| 2025-11-24 | AvatarBrush: Monocular Reconstruction of Gaussian Avatars with Intuitive Local Editing | Mengtian Li et.al. | 2511.19189 | null |
| 2025-11-24 | DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection | Hai Ci et.al. | 2511.19111 | null |
| 2025-11-24 | Pre-Filtering Code Suggestions using Developer Behavioral Telemetry to Optimize LLM-Assisted Programming | Mohammad Nour Al Awad et.al. | 2511.18849 | null |
| 2025-11-24 | NI-Tex: Non-isometric Image-based Garment Texture Generation | Hui Shan et.al. | 2511.18765 | null |
| 2025-11-24 | DriveFlow: Rectified Flow Adaptation for Robust 3D Object Detection in Autonomous Driving | Hongbin Lin et.al. | 2511.18713 | null |
| 2025-11-24 | ObjectAlign: Neuro-Symbolic Object Consistency Verification and Correction | Mustafa Munir et.al. | 2511.18701 | null |
| 2025-11-24 | Now You See It, Now You Don't - Instant Concept Erasure for Safe Text-to-Image and Video Generation | Shristi Das Biswas et.al. | 2511.18684 | null |
| 2025-11-24 | Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers | Yiqing Shi et.al. | 2511.18673 | null |
| 2025-11-23 | FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement | Wenshuo Gao et.al. | 2511.18346 | null |
| 2025-11-23 | Point-to-Point: Sparse Motion Guidance for Controllable Video Editing | Yeji Song et.al. | 2511.18277 | null |
| 2025-11-23 | MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation | Tao Shen et.al. | 2511.18262 | null |
| 2025-11-22 | Video4Edit: Viewing Image Editing as a Degenerate Temporal Process | Xiaofan Li et.al. | 2511.18131 | null |
| 2025-11-22 | IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment | Bowen Qu et.al. | 2511.18055 | null |
| 2025-11-22 | Paper2SysArch: Structure-Constrained System Architecture Generation from Scientific Papers | Ziyi Guo et.al. | 2511.18036 | null |
| 2025-11-21 | Show Me: Unifying Instructional Image and Video Generation with Diffusion Models | Yujiang Pu et.al. | 2511.17839 | null |
| 2025-11-21 | Native 3D Editing with Full Attention | Weiwei Cai et.al. | 2511.17501 | null |
| 2025-11-21 | Illustrator's Depth: Monocular Layer Index Prediction for Image Decomposition | Nissim Maruani et.al. | 2511.17454 | null |
| 2025-11-21 | Range-Edit: Semantic Mask Guided Outdoor LiDAR Scene Editing | Suchetan G. Uppur et.al. | 2511.17269 | null |
| 2025-11-21 | PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention | Yipeng Chen et.al. | 2511.17185 | null |
| 2025-11-21 | Spanning Tree Autoregressive Visual Generation | Sangkyu Lee et.al. | 2511.17089 | null |
| 2025-11-21 | RoomPlanner: Explicit Layout Planner for Easier LLM-Driven 3D Room Generation | Wenzhuo Sun et.al. | 2511.17048 | null |
| 2025-11-21 | DeltaDeno: Zero-Shot Anomaly Generation via Delta-Denoising Attribution | Chaoran Xu et.al. | 2511.16920 | null |
| 2025-11-21 | FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation | Yueru He et.al. | 2511.14998 | null |
| 2025-11-20 | WorldGen: From Text to Traversable and Interactive 3D Worlds | Dilin Wang et.al. | 2511.16825 | null |
| 2025-11-20 | SVG360: Multi-View SVG Generation with Geometric and Color Consistency from a Single SVG | Mengnan Jiang et.al. | 2511.16766 | null |
| 2025-11-20 | Motion Transfer-Enhanced StyleGAN for Generating Diverse Macaque Facial Expressions | Takuya Igaue et.al. | 2511.16711 | null |
| 2025-11-20 | Controllable Layer Decomposition for Reversible Multi-Layer Image Generation | Zihao Liu et.al. | 2511.16249 | null |
| 2025-11-19 | One algebra for all: Geometric Algebra methods for neurosymbolic XR scene authoring, animation and neural rendering | Manos Kamarianakis et.al. | 2511.15398 | null |
| 2025-11-19 | ChartEditor: A Reinforcement Learning Framework for Robust Chart Editing | Liangyu Chen et.al. | 2511.15266 | null |
| 2025-11-18 | InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization | Daniel Gilo et.al. | 2511.14899 | null |
| 2025-11-18 | UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning | Rui Tian et.al. | 2511.14760 | null |
| 2025-11-18 | Task Addition and Weight Disentanglement in Closed-Vocabulary Models | Adam Hazimeh et.al. | 2511.14569 | null |
| 2025-11-18 | ManipShield: A Unified Framework for Image Manipulation Detection, Localization and Explanation | Zitong Xu et.al. | 2511.14259 | null |
| 2025-11-18 | InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior | Weimin Bai et.al. | 2511.14208 | null |
| 2025-11-18 | UniSER: A Foundation Model for Unified Soft Effects Removal | Jingdong Zhang et.al. | 2511.14183 | null |
| 2025-11-18 | Text-Driven Reasoning Video Editing via Reinforcement Learning on Digital Twin Representations | Yiqing Shen et.al. | 2511.14100 | null |
| 2025-11-18 | Error-Driven Scene Editing for 3D Grounding in Large Language Models | Yue Zhang et.al. | 2511.14086 | null |
| 2025-11-18 | Semantic Context Matters: Improving Conditioning for Autoregressive Models | Dongyang Jin et.al. | 2511.14063 | null |
| 2025-11-18 | Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline | Rui Zuo et.al. | 2511.13442 | null |
| 2025-11-18 | MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation | Junjie Yang et.al. | 2511.13135 | null |
| 2025-11-17 | Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine | Xincheng Shuai et.al. | 2511.13713 | null |
| 2025-11-17 | Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting | Jiangnan Ye et.al. | 2511.13684 | null |
| 2025-11-17 | Language-Guided Invariance Probing of Vision-Language Models | Jae Joong Lee et.al. | 2511.13494 | null |
| 2025-11-17 | Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling | Adam Hazimeh et.al. | 2511.13478 | null |
| 2025-11-17 | TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing | Yuchen Bao et.al. | 2511.13399 | null |
| 2025-11-17 | SkyReels-Text: Fine-grained Font-Controllable Text Editing for Poster Design | Yunjie Yu et.al. | 2511.13285 | null |
| 2025-11-17 | Uncovering and Mitigating Transient Blindness in Multimodal Model Editing | Xiaoqi Han et.al. | 2511.13243 | null |
| 2025-11-17 | InteractiveGNNExplainer: A Visual Analytics Framework for Multi-Faceted Understanding and Probing of Graph Neural Network Predictions | TC Singh et.al. | 2511.13160 | null |
| 2025-11-17 | Semantic Prioritization in Visual Counterfactual Explanations with Weighted Segmentation and Auto-Adaptive Region Selection | Lintong Zhang et.al. | 2511.12992 | null |
| 2025-11-17 | Text2Traffic: A Text-to-Image Generation and Editing Method for Traffic Scenes | Feng Lv et.al. | 2511.12932 | null |
| 2025-11-17 | Generative Photographic Control for Scene-Consistent Video Cinematic Editing | Huiqiang Sun et.al. | 2511.12921 | null |
| 2025-11-16 | Catastrophic Forgetting in Kolmogorov-Arnold Networks | Mohammad Marufur Rahman et.al. | 2511.12828 | null |
| 2025-11-16 | Toward Real-world Text Image Forgery Localization: Structured and Interpretable Data Synthesis | Zeqin Yu et.al. | 2511.12658 | null |
| 2025-11-16 | Designed to Spread: Generative Approaches to Enhance Information Diffusion | Ziqing Qian et.al. | 2511.12516 | null |
| 2025-11-15 | ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks | Ruixun Liu et.al. | 2511.12267 | null |
| 2025-11-15 | Mixture of States: Routing Token-Level Dynamics for Multimodal Generation | Haozhe Liu et.al. | 2511.12207 | null |
| 2025-11-15 | FIA-Edit: Frequency-Interactive Attention for Efficient and High-Fidelity Inversion-Free Text-Guided Image Editing | Kaixiang Yang et.al. | 2511.12151 | null |
| 2025-11-15 | Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing | Hossein Mohebbi et.al. | 2511.11780 | null |
| 2025-11-14 | PEtab-GUI: A graphical user interface to create, edit and inspect PEtab parameter estimation problems | Paul Jonas Jost et.al. | 2511.11515 | null |
| 2025-11-14 | ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation | Kaishen Wang et.al. | 2511.11483 | null |
| 2025-11-14 | WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation | Wei Chow et.al. | 2511.11434 | null |
| 2025-11-14 | SimuFreeMark: A Noise-Simulation-Free Robust Watermarking Against Image Editing | Yichao Tang et.al. | 2511.11295 | null |
| 2025-11-14 | Parameter-Efficient MoE LoRA for Few-Shot Multi-Style Editing | Cong Cao et.al. | 2511.11236 | null |
| 2025-11-14 | On the Information-Theoretic Fragility of Robust Watermarking under Diffusion Editing | Yunyi Ni et.al. | 2511.10933 | null |
| 2025-11-14 | STELLAR: Scene Text Editor for Low-Resource Languages and Real-World Data | Yongdeuk Seo et.al. | 2511.09977 | null |
| 2025-11-14 | UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation | Zhen Yang et.al. | 2511.08195 | null |
| 2025-11-13 | IPCD: Intrinsic Point-Cloud Decomposition | Shogo Sato et.al. | 2511.09866 | null |
| 2025-11-13 | AHA! Animating Human Avatars in Diverse Scenes with Gaussian Splatting | Aymen Mir et.al. | 2511.09827 | null |
| 2025-11-12 | SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control | Arman Zarei et.al. | 2511.09715 | null |
| 2025-11-11 | RePose-NeRF: Robust Radiance Fields for Mesh Reconstruction under Noisy Camera Poses | Sriram Srinivasan et.al. | 2511.08545 | null |
| 2025-11-11 | 3D4D: An Interactive, Editable, 4D World Model via 3D Video Generation | Yunhong He et.al. | 2511.08536 | null |
| 2025-11-11 | UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist | Zhengyang Liang et.al. | 2511.08521 | null |
| 2025-11-11 | HardFlow: Hard-Constrained Sampling for Flow-Matching Models via Trajectory Optimization | Zeyang Li et.al. | 2511.08425 | null |
| 2025-11-11 | LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning | Fengyi Fu et.al. | 2511.08251 | null |
| 2025-11-11 | VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics | Daniel Cher et.al. | 2511.07744 | null |
| 2025-11-09 | Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising | Assaf Singer et.al. | 2511.08633 | null |
| 2025-11-09 | AesTest: Measuring Aesthetic Intelligence from Perception to Production | Guolong Wang et.al. | 2511.06360 | null |
| 2025-11-09 | RelightMaster: Precise Video Relighting with Multi-plane Light Images | Weikang Bian et.al. | 2511.06271 | null |
| 2025-11-07 | On the Brittleness of CLIP Text Encoders | Allie Tran et.al. | 2511.04247 | null |
| 2025-11-07 | Med-Banana-50K: A Cross-modality Large-Scale Dataset for Text-guided Medical Image Editing | Zhihui Chen et.al. | 2511.00801 | null |
| 2025-11-06 | Personalized Image Editing in Text-to-Image Diffusion Models via Collaborative Direct Preference Optimization | Connor Dunlop et.al. | 2511.05616 | null |
| 2025-11-06 | MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers | Ali Boudaghi et.al. | 2511.04376 | null |
| 2025-11-06 | Improving Multi-View Reconstruction via Texture-Guided Gaussian-Mesh Joint Optimization | Zhejia Cai et.al. | 2511.03950 | null |
| 2025-11-05 | Diffusion-Based Image Editing: An Unforeseen Adversary to Robust Invisible Watermarks | Wenkai Fu et.al. | 2511.05598 | null |
| 2025-11-05 | Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition | Jongseo Lee et.al. | 2511.03725 | null |
| 2025-11-05 | Unified Long Video Inpainting and Outpainting via Overlapping High-Order Co-Denoising | Shuangquan Lyu et.al. | 2511.03272 | null |
| 2025-11-05 | ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing | Yaosen Chen et.al. | 2511.02505 | null |
| 2025-11-03 | UniREditBench: A Unified Reasoning-based Image Editing Benchmark | Feng Han et.al. | 2511.01295 | null |
| 2025-10-31 | BlurGuard: A Simple Approach for Robustifying Image Protection Against AI-Powered Editing | Jinsu Kim et.al. | 2511.00143 | null |
| 2025-10-31 | Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing | Yijia Wang et.al. | 2510.27335 | null |
| 2025-10-30 | Security Risk of Misalignment between Text and Image in Multi-modal Model | Xiaosen Wang et.al. | 2510.26105 | null |
| 2025-10-29 | LGCC: Enhancing Flow Matching Based Text-Guided Image Editing with Local Gaussian Coupling and Context Consistency | Fangbing Liu et.al. | 2511.01894 | null |
| 2025-10-29 | SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing | Sung-Hoon Yoon et.al. | 2510.25970 | null |
| 2025-10-29 | RegionE: Adaptive Region-Aware Generation for Efficient Image Editing | Pengtao Chen et.al. | 2510.25590 | null |
| 2025-10-29 | LightBagel: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation | Zeyu Wang et.al. | 2510.22946 | null |
| 2025-10-28 | Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation | Inclusion AI et.al. | 2510.24821 | null |
| 2025-10-28 | Group Relative Attention Guidance for Image Editing | Xuanpu Zhang et.al. | 2510.24657 | null |
| 2025-10-28 | Diffusion Adaptive Text Embedding for Text-to-Image Diffusion Models | Byeonghu Na et.al. | 2510.23974 | null |
| 2025-10-27 | Autoregressive Styled Text Image Generation, but Make it Reliable | Carmine Zaccagnino et.al. | 2510.23240 | null |
| 2025-10-27 | UniAIDet: A Unified and Universal Benchmark for AI-Generated Image Content Detection and Localization | Huixuan Zhang et.al. | 2510.23023 | null |
| 2025-10-27 | VALA: Learning Latent Anchors for Training-Free and Temporally Consistent | Zhangkai Wu et.al. | 2510.22970 | null |
| 2025-10-27 | FAME: Fairness-aware Attention-modulated Video Editing | Zhangkai Wu et.al. | 2510.22960 | null |
| 2025-10-27 | LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas | Guocheng Gordon Qian et.al. | 2510.20820 | null |
| 2025-10-25 | GeoDiffusion: A Training-Free Framework for Accurate 3D Geometric Conditioning in Image Generation | Phillip Mueller et.al. | 2510.22337 | null |
| 2025-10-24 | FlowOpt: Fast Optimization Through Whole Flow Processes for Training-Free Editing | Or Ronai et.al. | 2510.22010 | null |
| 2025-10-24 | SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation | Alec Helbling et.al. | 2510.21120 | null |
| 2025-10-24 | EditInfinity: Image Editing with Binary-Quantized Generative Models | Jiahuan Wang et.al. | 2510.20217 | null |
| 2025-10-24 | Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks | Kai Zeng et.al. | 2510.19195 | null |
| 2025-10-23 | Positional Encoding Field | Yunpeng Bai et.al. | 2510.20385 | null |
| 2025-10-23 | FlowCycle: Pursuing Cycle-Consistent Flows for Text-based Editing | Yanghao Wang et.al. | 2510.20212 | null |
| 2025-10-22 | Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing | Yusu Qian et.al. | 2510.19808 | null |
| 2025-10-21 | PICABench: How Far Are We from Physically Realistic Image Editing? | Yuandong Pu et.al. | 2510.17681 | null |
| 2025-10-21 | Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback | Zongjian Li et.al. | 2510.16888 | null |
| 2025-10-20 | ConsistEdit: Highly Consistent and Precise Training-free Visual Editing | Zixin Yin et.al. | 2510.17803 | null |
| 2025-10-19 | Region in Context: Text-condition Image editing with Human-like semantic reasoning | Thuy Phuong Vu et.al. | 2510.16772 | null |
| 2025-10-17 | BLIP3o-NEXT: Next Frontier of Native Image Generation | Jiuhai Chen et.al. | 2510.15857 | null |
| 2025-10-17 | Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset | Qingyan Bai et.al. | 2510.15742 | null |
| 2025-10-16 | Coupled Diffusion Sampling for Training-Free Multi-View Image Editing | Hadi Alzayer et.al. | 2510.14981 | null |
| 2025-10-16 | Learning an Image Editing Model without Image Editing Pairs | Nupur Kumari et.al. | 2510.14978 | null |
| 2025-10-16 | In-Context Learning with Unpaired Clips for Instruction-based Video Editing | Xinyao Liao et.al. | 2510.14648 | null |
| 2025-10-15 | Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation | Yi Zuo et.al. | 2510.13084 | null |
| 2025-10-14 | UniFusion: Vision-Language Model as Unified Encoder in Image Generation | Kevin Li et.al. | 2510.12789 | null |
| 2025-10-14 | Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding | Ye Chen et.al. | 2510.12256 | null |
| 2025-10-14 | VIDMP3: Video Editing by Representing Motion with Pose and Position Priors | Sandeep Mishra et.al. | 2510.12069 | null |
| 2025-10-13 | IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment | Yinan Chen et.al. | 2510.11647 | null |
| 2025-10-13 | Zero-shot Face Editing via ID-Attribute Decoupled Inversion | Yang Hou et.al. | 2510.11050 | null |
| 2025-10-13 | GeoVLMath: Enhancing Geometry Reasoning in Vision-Language Models via Cross-Modal Reward for Auxiliary Line Creation | Shasha Guo et.al. | 2510.11020 | null |
| 2025-10-13 | DreamMakeup: Face Makeup Customization using Latent Diffusion Models | Geon Yeong Park et.al. | 2510.10918 | null |
| 2025-10-11 | EditCast3D: Single-Frame-Guided 3D Editing with Video Propagation and View Selection | Huaizhi Qu et.al. | 2510.13652 | null |
| 2025-10-11 | ReMix: Towards a Unified View of Consistent Character Generation and Editing | Benjia Zhou et.al. | 2510.10156 | null |
| 2025-10-11 | MultiCOIN: Multi-Modal COntrollable Video INbetweening | Maham Tanveer et.al. | 2510.08561 | null |
| 2025-10-10 | Mono4DEditor: Text-Driven 4D Scene Editing from Monocular Video via Point-Level Localization of Language-Embedded Gaussians | Jin-Chuan Shi et.al. | 2510.09438 | null |
| 2025-10-10 | TBStar-Edit: From Image Editing Pattern Shifting to Consistency Enhancement | Hao Fang et.al. | 2510.04483 | null |
| 2025-10-09 | FreqCa: Accelerating Diffusion Models via Frequency-Aware Caching | Jiacheng Liu et.al. | 2510.08669 | null |
| 2025-10-09 | Kontinuous Kontext: Continuous Strength Control for Instruction-based Image Editing | Rishubh Parihar et.al. | 2510.08532 | null |
| 2025-10-09 | InstructX: Towards Unified Visual Editing with MLLM Guidance | Chong Mou et.al. | 2510.08485 | null |
| 2025-10-09 | UniVideo: Unified Understanding, Generation, and Editing for Videos | Cong Wei et.al. | 2510.08377 | null |
| 2025-10-09 | InstructUDrag: Joint Text Instructions and Object Dragging for Interactive Image Editing | Haoran Yu et.al. | 2510.08181 | null |
| 2025-10-09 | Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing | Zhentao Zou et.al. | 2510.08157 | null |
| 2025-10-08 | DreamOmni2: Multimodal Instruction-based Editing and Generation | Bin Xia et.al. | 2510.06679 | null |
| 2025-10-07 | Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding | Yi Xin et.al. | 2510.06308 | null |
| 2025-10-07 | Efficient High-Resolution Image Editing with Hallucination-Aware Loss and Adaptive Tiling | Young D. Kwon et.al. | 2510.06295 | null |
| 2025-10-07 | Diffusion-Based Image Editing for Breaking Robust Watermarks | Yunyi Ni et.al. | 2510.05978 | null |
| 2025-10-07 | When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach | Daniel Gonzálbez-Biosca et.al. | 2510.05661 | null |
| 2025-10-06 | SAEdit: Token-level control for continuous image editing via Sparse AutoEncoder | Ronen Kamenetsky et.al. | 2510.05081 | null |
| 2025-10-05 | ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation | Jay Zhangjie Wu et.al. | 2510.04290 | null |
| 2025-10-05 | Let Features Decide Their Own Solvers: Hybrid Feature Caching for Diffusion Transformers | Shikang Zheng et.al. | 2510.04188 | null |
| 2025-10-05 | Prompt-to-Prompt: Text-Based Image Editing Via Cross-Attention Mechanisms -- The Research of Hyperparameters and Novel Mechanisms to Enhance Existing Frameworks | Linn Bieske et.al. | 2510.04034 | null |
| 2025-10-04 | From Filters to VLMs: Benchmarking Defogging Methods through Object Detection and Segmentation Performance | Ardalan Aryashad et.al. | 2510.03906 | null |
| 2025-10-04 | Rare Text Semantics Were Always There in Your Diffusion Transformer | Seil Kang et.al. | 2510.03886 | null |
| 2025-10-03 | DiT-VTON: Diffusion Transformer Framework for Unified Multi-Category Virtual Try-On and Virtual Try-All with Integrated Image Editing | Qi Li et.al. | 2510.04797 | null |
| 2025-10-03 | OTR: Synthesizing Overlay Text Dataset for Text Removal | Jan Zdenek et.al. | 2510.02787 | null |
| 2025-10-02 | DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing | Zihan Zhou et.al. | 2510.02253 | null |
| 2025-10-02 | Towards Better Optimization For Listwise Preference in Diffusion Models | Jiamu Bai et.al. | 2510.01540 | null |
| 2025-10-02 | VRWKV-Editor: Reducing quadratic complexity in transformer-based video editing | Abdelilah Aitrouga et.al. | 2509.25998 | null |
| 2025-10-01 | IMAGEdit: Let Any Subject Transform | Fei Shen et.al. | 2510.01186 | null |
| 2025-10-01 | EditTrack: Detecting and Attributing AI-assisted Image Editing | Zhengyuan Jiang et.al. | 2510.01173 | null |
| 2025-10-01 | DIA: The Adversarial Exposure of Deterministic Inversion in Diffusion Models | Seunghoo Hong et.al. | 2510.00778 | null |
| 2025-10-01 | CAMILA: Context-Aware Masking for Image Editing with Language Alignment | Hyunseung Kim et.al. | 2509.19731 | null |
| 2025-09-30 | EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing | Keming Wu et.al. | 2509.26346 | null |
| 2025-09-30 | Training-Free Reward-Guided Image Editing via Trajectory Optimal Control | Jinho Chang et.al. | 2509.25845 | null |
| 2025-09-30 | Editable Noise Map Inversion: Encoding Target-image into Noise For High-Fidelity Image Manipulation | Mingyu Kang et.al. | 2509.25776 | null |
| 2025-09-30 | Dragging with Geometry: From Pixels to Geometry-Guided Image Editing | Xinyu Pu et.al. | 2509.25740 | null |
| 2025-09-30 | EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling | Xin Luo et.al. | 2509.23909 | null |
| 2025-09-30 | FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing | Junyi Wu et.al. | 2509.22244 | null |
| 2025-09-29 | Training-Free Multimodal Guidance for Video to Audio Generation | Eleonora Grassucci et.al. | 2509.24550 | null |
| 2025-09-29 | Instruction Guided Multi Object Image Editing with Quantity and Layout Consistency | Jiaqi Tan et.al. | 2509.24514 | null |
| 2025-09-29 | Latent Visual Reasoning | Bangzheng Li et.al. | 2509.24251 | null |
| 2025-09-28 | Visual CoT Makes VLMs Smarter but More Fragile | Chunxue Xu et.al. | 2509.23789 | null |
| 2025-09-28 | Seedream 4.0: Toward Next-generation Multimodal Image Generation | Team Seedream et.al. | 2509.20427 | null |
| 2025-09-27 | Object-AVEdit: An Object-level Audio-Visual Editing Model | Youquan Fu et.al. | 2510.00050 | null |
| 2025-09-26 | EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer | Zhehao Dong et.al. | 2509.22407 | null |
| 2025-09-26 | SAGE: Scene Graph-Aware Guidance and Execution for Long-Horizon Manipulation Tasks | Jialiang Li et.al. | 2509.21928 | null |
| 2025-09-26 | Taming Flow-based I2V Models for Creative Video Editing | Xianghao Kong et.al. | 2509.21917 | null |
| 2025-09-26 | TDEdit: A Unified Diffusion Framework for Text-Drag Guided Image Manipulation | Qihang Wang et.al. | 2509.21905 | null |
| 2025-09-25 | FreeInsert: Personalized Object Insertion with Geometric and Style Control | Yuhong Zhang et.al. | 2509.20756 | null |
| 2025-09-25 | ArtUV: Artist-style UV Unwrapping | Yuguang Chen et.al. | 2509.20710 | null |
| 2025-09-25 | EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning | Xuan Ju et.al. | 2509.20360 | null |
| 2025-09-25 | Understanding-in-Generation: Reinforcing Generative Capability of Unified Model via Infusing Understanding into Generation | Yuanhuiyi Lyu et.al. | 2509.18639 | null |
| 2025-09-24 | Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation | Shufan Li et.al. | 2509.19244 | null |
| 2025-09-23 | Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation | Yanzuo Lu et.al. | 2509.18824 | null |
| 2025-09-23 | GeoRemover: Removing Objects and Their Causal Visual Artifacts | Zixin Zhu et.al. | 2509.18538 | null |
| 2025-09-22 | Multi-Agent Amodal Completion: Direct Synthesis with Fine-Grained Semantic Guidance | Hongxing Fan et.al. | 2509.17757 | null |
| 2025-09-20 | Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media | Zihan Ding et.al. | 2509.16811 | null |
| 2025-09-20 | V-CECE: Visual Counterfactual Explanations via Conceptual Edits | Nikolaos Spanos et.al. | 2509.16567 | null |
| 2025-09-19 | Neural Atlas Graphs for Dynamic Scene Decomposition and Editing | Jan Philipp Schneider et.al. | 2509.16336 | null |
| 2025-09-19 | Enriched Feature Representation and Motion Prediction Module for MOSEv2 Track of 7th LSVOS Challenge: 3rd Place Solution | Chang Soo Lim et.al. | 2509.15781 | null |
| 2025-09-18 | AutoEdit: Automatic Hyperparameter Tuning for Image Editing | Chau Pham et.al. | 2509.15031 | null |
| 2025-09-18 | MultiEdit: Advancing Instruction-based Image Editing on Diverse and Challenging Tasks | Mingsong Li et.al. | 2509.14638 | null |
| 2025-09-18 | End4: End-to-end Denoising Diffusion for Diffusion-Based Inpainting Detection | Fei Wang et.al. | 2509.13214 | null |
| 2025-09-17 | Controllable-Continuous Color Editing in Diffusion Model via Color Mapping | Yuqi Yang et.al. | 2509.13756 | null |
| 2025-09-17 | LLM-I: LLMs are Naturally Interleaved Multimodal Creators | Zirun Guo et.al. | 2509.13642 | null |
| 2025-09-16 | EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing | Tianyu Chen et.al. | 2509.13399 | null |
| 2025-09-16 | Lego-Edit: A General Image Editing Framework with Model-Level Bricks and MLLM Builder | Qifei Jia et.al. | 2509.12883 | null |
| 2025-09-16 | Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations | Jinjie Shen et.al. | 2509.12653 | null |
| 2025-09-15 | LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence | Zixin Yin et.al. | 2509.12203 | null |
| 2025-09-13 | EditDuet: A Multi-Agent System for Video Non-Linear Editing | Marcelo Sandoval-Castaneda et.al. | 2509.10761 | null |
| 2025-09-12 | Immunizing Images from Text to Image Editing via Adversarial Cross-Attention | Matteo Trippodo et.al. | 2509.10359 | null |
| 2025-09-10 | RoentMod: A Synthetic Chest X-Ray Modification Model to Identify and Correct Image Interpretation Model Shortcuts | Lauren H. Cooke et.al. | 2509.08640 | null |
| 2025-09-09 | Delta Velocity Rectified Flow for Text-to-Image Editing | Gaspard Beaudouin et.al. | 2509.05342 | null |
| 2025-09-04 | Improved 3D Scene Stylization via Text-Guided Generative Image Editing with Region-Based Control | Haruo Fujiwara et.al. | 2509.05285 | null |
| 2025-09-04 | Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping | Jingyi Lu et.al. | 2509.04582 | null |
| 2025-09-04 | From Editor to Dense Geometry Estimator | JiYuan Wang et.al. | 2509.04338 | null |
| 2025-09-03 | Discrete Noise Inversion for Next-scale Autoregressive Text-based Image Editing | Quan Dao et.al. | 2509.01984 | null |
| 2025-09-02 | Fidelity-preserving enhancement of ptychography with foundational text-to-image models | Ming Du et.al. | 2509.04513 | null |
| 2025-09-02 | Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination | Ziyun Zeng et.al. | 2509.01986 | null |
| 2025-09-01 | O-DisCo-Edit: Object Distortion Control for Unified Realistic Video Editing | Yuqing Chen et.al. | 2509.01596 | null |
| 2025-09-01 | Neural Scene Designer: Self-Styled Semantic Image Manipulation | Jianman Lin et.al. | 2509.01405 | null |
| 2025-08-30 | LatentEdit: Adaptive Latent Control for Consistent Semantic Editing | Siyi Liu et.al. | 2509.00541 | null |
| 2025-08-28 | Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation | Chenfan Qu et.al. | 2508.20987 | null |
| 2025-08-28 | Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent | En Ci et.al. | 2508.20505 | null |
| 2025-08-28 | Audio-Guided Visual Editing with Complex Multi-Modal Prompts | Hyeonyu Kim et.al. | 2508.20379 | null |
| 2025-08-27 | Not Every Gift Comes in Gold Paper or with a Red Ribbon: Exploring Color Perception in Text-to-Image Models | Shay Shomer Chai et.al. | 2508.19791 | null |
| 2025-08-25 | ObjFiller-3D: Consistent Multi-view 3D Inpainting via Video Diffusion Models | Haitang Feng et.al. | 2508.18271 | null |
| 2025-08-25 | SpotEdit: Evaluating Visually-Guided Image Editing Methods | Sara Ghazanfari et.al. | 2508.18159 | null |
| 2025-08-24 | An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing | Zihan Liang et.al. | 2508.17435 | null |
| 2025-08-24 | Defending Deepfake via Texture Feature Perturbation | Xiao Zhang et.al. | 2508.17315 | null |
| 2025-08-24 | PosBridge: Multi-View Positional Embedding Transplant for Identity-Aware Image Editing | Peilin Xiong et.al. | 2508.17302 | null |
| 2025-08-21 | Visual Autoregressive Modeling for Instruction-Guided Image Editing | Qingyang Mao et.al. | 2508.15772 | null |
| 2025-08-20 | AnchorSync: Global Consistency Optimization for Long Video Editing | Zichi Liu et.al. | 2508.14609 | null |
| 2025-08-20 | DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing | Weitao Wang et.al. | 2508.14465 | null |
| 2025-08-19 | Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing | Feng-Lin Liu et.al. | 2508.13797 | null |
| 2025-08-18 | Single-Reference Text-to-Image Manipulation with Dual Contrastive Denoising Score | Syed Muhmmad Israr et.al. | 2508.12718 | null |
| 2025-08-18 | TimeMachine: Fine-Grained Facial Age Editing with Identity Preservation | Yilin Mi et.al. | 2508.11284 | null |
| 2025-08-18 | NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale | NextStep Team et.al. | 2508.10711 | null |
| 2025-08-16 | PEdger++: Practical Edge Detection via Assembling Cross Information | Yuanbin Fu et.al. | 2508.11961 | null |
| 2025-08-14 | LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters | Haomin Zhang et.al. | 2508.11074 | null |
| 2025-08-14 | A Segmentation-driven Editing Method for Bolt Defect Augmentation and Detection | Yangjie Xiao et.al. | 2508.10509 | null |
| 2025-08-14 | TweezeEdit: Consistent and Efficient Image Editing with Path Regularization | Jianda Mao et.al. | 2508.10498 | null |
| 2025-08-13 | LIA-X: Interpretable Latent Portrait Animator | Yaohui Wang et.al. | 2508.09959 | null |
| 2025-08-12 | Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control | Zeqian Long et.al. | 2508.08134 | null |
| 2025-08-12 | Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation | Fangyuan Mao et.al. | 2508.07981 | null |
| 2025-08-11 | X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning | Jian Ma et.al. | 2508.07607 | null |
| 2025-08-11 | Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing | Joonghyuk Shin et.al. | 2508.07519 | null |
| 2025-08-10 | CLUE: Leveraging Low-Rank Adaptation to Capture Latent Uncovered Evidence for Image Forgery Localization | Youqi Wang et.al. | 2508.07413 | null |
| 2025-08-10 | Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers | Xin Ma et.al. | 2508.07246 | null |
| 2025-08-09 | CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing | Weiyan Xie et.al. | 2508.06937 | null |
| 2025-08-09 | Talk2Image: A Multi-Agent System for Multi-Turn Image Generation and Editing | Shichao Ma et.al. | 2508.06916 | null |
| 2025-08-08 | UGD-IML: A Unified Generative Diffusion-based Framework for Constrained and Unconstrained Image Manipulation Localization | Yachun Mi et.al. | 2508.06101 | null |
| 2025-08-08 | DreamVE: Unified Instruction-based Image and Video Editing | Bin Xia et.al. | 2508.06080 | null |
| 2025-08-08 | NEP: Autoregressive Image Editing via Next Editing Token Prediction | Huimin Wu et.al. | 2508.06044 | null |
| 2025-08-08 | InstantEdit: Text-Guided Few-Step Image Editing with Piecewise Rectified Flow | Yiming Gong et.al. | 2508.06033 | null |
| 2025-08-05 | Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation | Peiyu Wang et.al. | 2508.03320 | null |
| 2025-08-05 | Zero Shot Domain Adaptive Semantic Segmentation by Synthetic Data Generation and Progressive Adaptation | Jun Luo et.al. | 2508.03300 | null |
| 2025-08-05 | LORE: Latent Optimization for Precise Semantic Control in Rectified Flow-based Image Editing | Liangyang Ouyang et.al. | 2508.03144 | null |
| 2025-08-05 | UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding, Editing and Verifying | Chengyu Bai et.al. | 2508.03142 | null |
| 2025-08-05 | The Promise of RL for Autoregressive Image Editing | Saba Ahmadi et.al. | 2508.01119 | null |
| 2025-08-04 | Transport-Guided Rectified Flow Inversion: Improved Image Editing Using Optimal Transport Theory | Marian Lupascu et.al. | 2508.02363 | null |
| 2025-08-04 | Qwen-Image Technical Report | Chenfei Wu et.al. | 2508.02324 | null |
| 2025-08-01 | Controllable Pedestrian Video Editing for Multi-View Driving Scenarios via Motion Sequence | Danzhen Fu et.al. | 2508.00299 | null |
| 2025-08-01 | Towards Robust Semantic Correspondence: A Benchmark and Insights | Wenyue Chong et.al. | 2508.00272 | null |
| 2025-08-01 | Training-free Geometric Image Editing on Diffusion Models | Hanshen Zhu et.al. | 2507.23300 | null |
| 2025-07-31 | UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing | Hao Tang et.al. | 2507.23278 | null |
| 2025-07-29 | Low-Cost Test-Time Adaptation for Robust Video Editing | Jianhui Wang et.al. | 2507.21858 | null |
| 2025-07-29 | From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos | Chenjian Gao et.al. | 2507.20331 | null |
| 2025-07-28 | GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset | Yuhan Wang et.al. | 2507.21033 | null |
| 2025-07-28 | ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation | Sherry X. Chen et.al. | 2507.07317 | null |
| 2025-07-25 | HQ-SMem: Video Segmentation and Tracking Using Memory Efficient Object Embedding With Selective Update and Self-Supervised Distillation Feedback | Elham Soltani Kazemi et.al. | 2507.18921 | null |
| 2025-07-23 | Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling | Yi Xin et.al. | 2507.17801 | null |
| 2025-07-22 | ADCD-Net: Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement | Kahim Wong et.al. | 2507.16397 | null |
| 2025-07-22 | Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling | Chao Zhou et.al. | 2507.16240 | null |
| 2025-07-22 | LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs | Zitong Xu et.al. | 2507.16193 | null |
| 2025-07-20 | Light Future: Multimodal Action Frame Prediction via InstructPix2Pix | Zesen Zhong et.al. | 2507.14809 | null |
| 2025-07-18 | NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining | Maksim Kuprashevich et.al. | 2507.14119 | null |
| 2025-07-18 | Moodifier: MLLM-Enhanced Emotion-Driven Image Editing | Jiarong Ye et.al. | 2507.14024 | null |
| 2025-07-16 | MADI: Masking-Augmented Diffusion with Inference-Time Scaling for Visual Editing | Shreya Kadambi et.al. | 2507.13401 | null |
| 2025-07-15 | EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing | Vassilis Sioros et.al. | 2507.11096 | null |
| 2025-07-14 | Sparse Fine-Tuning of Transformers for Generative Tasks | Wei Chen et.al. | 2507.10855 | null |
| 2025-07-14 | LayLens: Improving Deepfake Understanding through Simplified Explanations | Abhijeet Narang et.al. | 2507.10066 | null |
| 2025-07-11 | FlowDrag: 3D-aware Drag-based Image Editing with Mesh-guided Deformation Vector Flow Fields | Gwanhyeong Koo et.al. | 2507.08285 | null |
| 2025-07-08 | 2D Instance Editing in 3D Space | Yuhuan Xie et.al. | 2507.05819 | null |
| 2025-07-07 | Neural-Driven Image Editing | Pengfei Zhou et.al. | 2507.05397 | null |
| 2025-07-07 | Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing | Chun-Hsiao Yeh et.al. | 2507.05259 | null |
| 2025-07-07 | S | Xudong Liu et.al. | 2507.04584 | null |
| 2025-07-04 | Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images | Yuran Dong et.al. | 2507.03402 | null |
| 2025-07-04 | LACONIC: A 3D Layout Adapter for Controllable Image Creation | Léopold Maillard et.al. | 2507.03257 | null |
| 2025-07-03 | From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding | Xiangfeng Wang et.al. | 2507.02790 | null |
| 2025-07-02 | Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning | Qingdong He et.al. | 2507.01908 | null |
| 2025-07-02 | ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation | Jimyeong Kim et.al. | 2507.01496 | null |
| 2025-07-02 | QC-OT: Optimal Transport with Quasiconformal Mapping | Yuping Lv et.al. | 2507.01456 | null |
| 2025-07-01 | Ovis-U1 Technical Report | Guo-Hua Wang et.al. | 2506.23044 | null |
| 2025-06-30 | A Unified Framework for Stealthy Adversarial Generation via Latent Optimization and Transferability Enhancement | Gaozheng Pei et.al. | 2506.23676 | null |
| 2025-06-30 | TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity | Yuzhuo Chen et.al. | 2506.23484 | null |
| 2025-06-29 | OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions | Yuanhao Cai et.al. | 2506.23361 | null |
| 2025-06-29 | Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis | Lei-lei Li et.al. | 2506.23263 | null |
| 2025-06-28 | Towards Explainable Bilingual Multimodal Misinformation Detection and Localization | Yiwei He et.al. | 2506.22930 | null |
| 2025-06-28 | STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing | Junsung Lee et.al. | 2506.22868 | null |
| 2025-06-27 | Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy | Yuhao Liu et.al. | 2506.22432 | null |
| 2025-06-27 | GenEscape: Hierarchical Multi-Agent Generation of Escape Room Puzzles | Mengyi Shan et.al. | 2506.21839 | null |
| 2025-06-27 | DFVEdit: Conditional Delta Flow Vector for Zero-shot Video Editing | Lingling Cai et.al. | 2506.20967 | null |
| 2025-06-26 | Controllable 3D Placement of Objects with Scene-Aware Diffusion Models | Mohamed Omran et.al. | 2506.21446 | null |
| 2025-06-26 | Improving Diffusion-Based Image Editing Faithfulness via Guidance and Scheduling | Hansam Cho et.al. | 2506.21045 | null |
| 2025-06-26 | M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization | Ju-Hyeon Nam et.al. | 2506.20922 | null |
| 2025-06-26 | FaSTA | Advait Gupta et.al. | 2506.20911 | null |
| 2025-06-26 | BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing | Jiacheng Chen et.al. | 2506.17450 | null |
| 2025-06-25 | EditP23: 3D Editing via Propagation of Image Prompts to Multi-View | Roi Bar-On et.al. | 2506.20652 | null |
| 2025-06-25 | Towards Efficient Exemplar Based Image Editing with Multimodal VLMs | Avadhoot Jadhav et.al. | 2506.20155 | null |
| 2025-06-25 | OmniGen2: Exploration to Advanced Multimodal Generation | Chenyuan Wu et.al. | 2506.18871 | null |
| 2025-06-24 | SceneCrafter: Controllable Multi-View Driving Scene Editing | Zehao Zhu et.al. | 2506.19488 | null |
| 2025-06-24 | LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning | Chenjian Gao et.al. | 2506.10082 | null |
| 2025-06-23 | Inverse-and-Edit: Effective and Fast Image Editing by Cycle Consistency Models | Ilia Beletskii et.al. | 2506.19103 | null |
| 2025-06-23 | Let Your Video Listen to Your Music! | Xinyu Zhang et.al. | 2506.18881 | null |
| 2025-06-23 | CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing | Dinh-Khoi Vo et.al. | 2506.18438 | null |
| 2025-06-23 | Instability in Diffusion ODEs: An Explanation for Inaccurate Image Reconstruction | Han Zhang et.al. | 2506.18290 | null |
| 2025-06-20 | FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation | Fan Yang et.al. | 2506.16806 | null |
| 2025-06-19 | Arch-Router: Aligning LLM Routing with Human Preferences | Co Tran et.al. | 2506.16655 | null |
| 2025-06-18 | VectorEdits: A Dataset and Benchmark for Instruction-Based Editing of Vector Graphics | Josef Kuchař et.al. | 2506.15903 | null |
| 2025-06-17 | Causally Steered Diffusion for Automated Video Counterfactual Generation | Nikos Spyrou et.al. | 2506.14404 | link |
| 2025-06-16 | AttentionDrag: Exploiting Latent Correlation Knowledge in Pre-trained Diffusion Models for Image Editing | Biao Yang et.al. | 2506.13301 | null |
| 2025-06-15 | Balancing Preservation and Modification: A Region and Semantic Aware Metric for Instruction-Based Image Editing | Zhuoying Li et.al. | 2506.13827 | null |
| 2025-06-15 | ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Chenglin Wang et.al. | 2506.12830 | null |
| 2025-06-14 | Good Noise Makes Good Edits: A Training-Free Diffusion-Based Video Editing with Image and Text Prompts | Saemee Choi et.al. | 2506.12520 | null |
| 2025-06-13 | SphereDrag: Spherical Geometry-Aware Panoramic Image Editing | Zhiao Feng et.al. | 2506.11863 | null |
| 2025-06-13 | **Consistent Video Editing a |