
👁️ Computer Vision Research Portfolio

5 Research-Level Projects | Interactive Demos | Production-Ready Code



Milan Amrut Joshi — Computer Vision Scientist

Covering 3D Vision • Video Understanding • Medical Imaging • Autonomous Driving • Embodied AI


┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   🏗️ 3D Reconstruction   🎬 Video Reasoning   🏥 Medical Imaging    │
│                                                                     │
│        🚗 Autonomous Driving        🤖 Robotic Manipulation         │
│                                                                     │
│   📓 50+ Image Processing Techniques  🎯 YOLO v8→v11 Projects      │
│                                                                     │
│   141 Files  •  31,000+ Lines  •  5 Streamlit Apps  •  8 Notebooks  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

🔬 Research Projects

🏗️ Project 1: Feed-Forward 3D Reconstruction

Subfield: 3D Vision / Geometric Deep Learning

Transformer that predicts 3D point clouds, depth maps, and camera poses from 2-10 images in a single forward pass — no iterative optimization.

Inspired by: CVPR 2025 Best Paper (VGGT)

| Component | Detail |
|-----------|--------|
| Backbone | DINOv2 ViT-B/14 |
| Architecture | Cross-view attention + epipolar encoding |
| Datasets | CO3D v2, DTU MVS, ScanNet |
| Metrics | Chamfer Distance, F-score, Depth Error |
Input: N images → [ViT Encoder] → [Cross-View Attention]
  → Point Maps + Depth Maps + Camera Poses
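The single-pass pipeline above can be sketched in miniature. The NumPy toy below (all shapes, random weights, and the `point_head` readout are illustrative assumptions, standing in for the trained DINOv2 features and decoders) fuses patch tokens from all views in one full cross-view attention step and reads out a per-patch 3D point map:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(feats, d=64):
    """Fuse per-view patch tokens: every patch attends to all patches of all views.

    feats: (N_views, P_patches, d) ViT patch features. Returns same shape.
    """
    n, p, _ = feats.shape
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    tokens = feats.reshape(n * p, d)          # flatten views into one joint sequence
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))      # (N*P, N*P) full cross-view attention
    return (attn @ v).reshape(n, p, d)

# Toy forward pass: 4 views, 16 patches each, 64-dim tokens
feats = np.random.default_rng(1).standard_normal((4, 16, 64))
fused = cross_view_attention(feats)
point_head = np.random.default_rng(2).standard_normal((64, 3)) * 0.01
points = fused @ point_head                   # (4, 16, 3) per-patch 3D point map
print(points.shape)
```

The key property the sketch shows is that geometry comes out of a single attention pass over all views jointly, with no bundle-adjustment loop.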

📂 Code • 🚀 Streamlit Demo • 📓 Notebook

🎬 Project 2: Video Temporal Reasoning

Subfield: Video Understanding

Two-stage system: adaptive keyframe selection followed by temporal reasoning, for answering temporal and causal questions about long-form video, a task on which even GPT-4V scores only 60-65%.

Inspired by: Stanford's T* (CVPR 2025)

| Component | Detail |
|-----------|--------|
| Frame Selector | CLIP ViT-L/14 + learned MLP |
| Reasoner | Temporal attention + VLM |
| Datasets | Video-MME, Ego4D, NExT-QA, STAR |
| Metrics | QA Accuracy by type (temporal, causal) |
Long Video + Question → [Frame Selector] → K keyframes
  → [Temporal Reasoner] → Answer with reasoning
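The frame-selection stage can be illustrated with a small NumPy sketch: a greedy MMR-style selector (an assumption; plain cosine similarity stands in for the repo's learned CLIP + MLP scorer) that trades relevance to the question against redundancy with frames already chosen:

```python
import numpy as np

def select_keyframes(frame_emb, query_emb, k=4, lam=0.5):
    """Greedy relevance-plus-diversity keyframe selection.

    frame_emb: (T, d) per-frame embeddings; query_emb: (d,) question embedding.
    Picks frames scoring high against the query but dissimilar to picks so far.
    """
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    relevance = f @ q                               # cosine similarity to the question
    chosen = [int(np.argmax(relevance))]
    while len(chosen) < k:
        redundancy = (f @ f[chosen].T).max(axis=1)  # similarity to already-picked frames
        score = lam * relevance - (1 - lam) * redundancy
        score[chosen] = -np.inf                     # never re-pick a frame
        chosen.append(int(np.argmax(score)))
    return sorted(chosen)

rng = np.random.default_rng(0)
frames = rng.standard_normal((120, 32))             # 120 frames of a long video
question = rng.standard_normal(32)
print(select_keyframes(frames, question, k=4))
```

Sorting the result preserves temporal order for the downstream reasoner.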

📂 Code • 🚀 Streamlit Demo • 📓 Notebook

🏥 Project 3: Medical Segmentation via Distillation

Subfield: Medical Imaging + Foundation Models

Distill SAM 2.1 (312M params) into a compact student (8M params) that retains 98.9% of the teacher's accuracy at 8.4x the teacher's speed — fast enough for real-time surgical guidance.

Inspired by: MedSAM2 / MM-DINOv2 (MICCAI 2025)

| Component | Detail |
|-----------|--------|
| Teacher | SAM 2.1 + DINOv2 |
| Student | EfficientNet-B0 + UNet decoder |
| Datasets | Kvasir-SEG, ISIC 2019, TotalSegmentator |
| Metrics | Dice, IoU, Hausdorff-95, FPS |
Teacher (312M, 5 FPS) → Knowledge Distillation
  → Student (8M, 42 FPS) — same accuracy!
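A minimal sketch of the distillation objective, assuming the common recipe of temperature-scaled logit matching (Hinton et al., 2015) plus feature-space MSE; array shapes and the `T`/`alpha` values are illustrative, not the repo's actual hyperparameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(t_logits, s_logits, t_feat, s_feat, T=4.0, alpha=0.5):
    """Logit distillation (soft-target KL at temperature T) + feature matching.

    t_*: teacher outputs, s_*: student outputs. Logits are (N, C);
    intermediate features are (N, d) after a projection to a shared width.
    """
    pt = softmax(t_logits / T)                      # softened teacher targets
    ps = softmax(s_logits / T)
    kd = (pt * (np.log(pt + 1e-9) - np.log(ps + 1e-9))).sum(axis=1).mean() * T * T
    feat = ((t_feat - s_feat) ** 2).mean()          # feature-space MSE
    return alpha * kd + (1 - alpha) * feat

rng = np.random.default_rng(0)
loss = distillation_loss(rng.standard_normal((8, 2)), rng.standard_normal((8, 2)),
                         rng.standard_normal((8, 16)), rng.standard_normal((8, 16)))
print(round(float(loss), 4))
```

The `T * T` factor keeps the soft-target gradient magnitude comparable across temperatures, per the original KD paper.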

📂 Code • 🚀 Streamlit Demo • 📓 Notebook

🚗 Project 4: 3D Occupancy for Autonomous Driving

Subfield: Autonomous Driving Perception

Camera-only 3D voxel occupancy prediction (200×200×16, 17 classes) with temporal fusion via scene flow — no expensive LiDAR needed.

Inspired by: FSF-Net (2026) / TPVFormer / BEVFormer

| Component | Detail |
|-----------|--------|
| Backbone | ResNet-50 + FPN |
| View Transform | Lift-Splat-Shoot (LSS) |
| Temporal | Scene flow + deformable attention |
| Datasets | nuScenes, Occ3D, KITTI, Waymo |
6 Cameras → [ResNet] → [LSS Lifting] → [BEV]
  → [Temporal Fusion] → 3D Occupancy Grid (17 classes)
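The LSS lifting-and-splatting step can be sketched in NumPy: each pixel's feature is distributed along its camera ray according to a predicted depth distribution, and the lifted points are scatter-added into a grid. Shapes, depth-bin spacing, and grid size here are toy assumptions:

```python
import numpy as np

def lift_splat(feat, depth_prob, rays, bev_shape=(10, 10), cell=1.0):
    """Minimal Lift-Splat step: lift pixels along depth bins, splat into a grid.

    feat: (P, d) per-pixel features; depth_prob: (P, D) softmax over D depth bins;
    rays: (P, 3) unit ray directions in the ego frame.
    """
    P, D = depth_prob.shape
    depths = np.linspace(1.0, float(D), D)                  # depth bin centres (m)
    pts = rays[:, None, :] * depths[None, :, None]          # (P, D, 3) lifted points
    w = depth_prob[:, :, None] * feat[:, None, :]           # depth-weighted features
    bev = np.zeros(bev_shape + (feat.shape[1],))
    ix = np.clip((pts[..., 0] / cell).astype(int), 0, bev_shape[0] - 1)
    iy = np.clip((pts[..., 1] / cell).astype(int), 0, bev_shape[1] - 1)
    np.add.at(bev, (ix, iy), w)                             # splat: scatter-add into grid
    return bev

rng = np.random.default_rng(0)
prob = rng.random((32, 8)); prob /= prob.sum(axis=1, keepdims=True)
rays = rng.standard_normal((32, 3)); rays /= np.linalg.norm(rays, axis=1, keepdims=True)
feat = rng.standard_normal((32, 4))
bev = lift_splat(feat, prob, rays)
print(bev.shape)                                            # (10, 10, 4)
```

`np.add.at` is the unbuffered scatter-add that makes the splat correct when many lifted points land in the same cell.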

📂 Code • 🚀 Streamlit Demo • 📓 Notebook

🤖 Project 5: Compact Vision-Language-Action Model for Robotic Manipulation

Subfield: Embodied AI / Robotics

A compact VLA model (~400M params) that takes camera images + language instructions and outputs smooth robotic actions via diffusion-based decoding. Runs at 8+ Hz on consumer hardware.

Inspired by: OpenVLA, SmolVLA (HuggingFace), pi-zero

| Component | Detail |
|-----------|--------|
| Vision | SigLIP ViT-B/16 with spatial compression |
| Language | Lightweight transformer encoder |
| Action | DDPM diffusion decoder (16-step chunks × 7-DoF) |
| Datasets | Open X-Embodiment, CALVIN, RLBench |
RGB Image + "pick up the red cup" → [VLM Fusion] → [Diffusion Decoder]
  → 16 smooth actions (x, y, z, rx, ry, rz, gripper)
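The diffusion decoder boils down to a standard DDPM reverse loop over an action chunk. The sketch below uses a toy `denoise_fn` in place of the image-and-language-conditioned VLA network, and the beta schedule and step count are illustrative:

```python
import numpy as np

def ddpm_sample_actions(denoise_fn, steps=16, chunk=16, dof=7, seed=0):
    """Reverse DDPM loop producing a (chunk, dof) action sequence from noise.

    denoise_fn(x, t) stands in for the learned noise-prediction network.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.1, steps)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)                      # cumulative signal retention
    x = rng.standard_normal((chunk, dof))          # start from pure Gaussian noise
    for t in range(steps - 1, -1, -1):
        eps = denoise_fn(x, t)                     # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1 - abar[t]) * eps) / np.sqrt(alphas[t])
        # Add sampling noise at every step except the last
        x = mean + (np.sqrt(betas[t]) * rng.standard_normal(x.shape) if t > 0 else 0)
    return x

# Toy denoiser that nudges actions toward zero (stand-in for the VLA network)
actions = ddpm_sample_actions(lambda x, t: x * 0.1)
print(actions.shape)                               # (16, 7)
```

Decoding a whole 16-step chunk per inference call is what lets a few-Hz network drive a smooth, higher-rate controller.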

📂 Code • 🚀 Streamlit Demo • 📓 Notebook


📓 Bonus: Comprehensive Notebooks

🖼️ Image Processing

50+ Techniques on 10 Images

From basic pixel operations to Fourier transforms, superpixels, and panorama stitching.

Color spaces • Filtering • Edges • Thresholding • Morphology • Contours • SIFT/ORB • FFT • Denoising • Inpainting

Open Notebook →
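As a taste of the thresholding material, here is Otsu's method from scratch in NumPy. This is a sketch of the underlying idea only; the notebook would typically reach for OpenCV instead:

```python
import numpy as np

def otsu_threshold(img):
    """Pick the threshold maximising between-class variance (Otsu's method).

    img: 2-D uint8 array. Returns (threshold, binary mask).
    """
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    omega = np.cumsum(p)                           # class-0 probability per threshold
    mu = np.cumsum(p * np.arange(256))             # class-0 mean times omega
    mu_t = mu[-1]                                  # global mean intensity
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    t = int(np.nanargmax(sigma_b))                 # NaNs mark empty classes
    return t, img > t

# Bimodal toy image: dark background with a bright square
img = np.full((64, 64), 40, dtype=np.uint8)
img[16:48, 16:48] = 200
t, mask = otsu_threshold(img)
print(t, int(mask.sum()))                          # threshold falls between the two modes
```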

🎥 Video Processing

30+ Techniques with Synthetic Videos

Motion analysis, object tracking, optical flow, video stabilization, and feature extraction.

Frame differencing • MOG2/KNN • Lucas-Kanade • Farneback • MeanShift • CSRT • Heatmaps • Scene detection

Open Notebook →
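Frame differencing, the simplest motion cue in the list above, fits in a few lines of NumPy; the synthetic frame pair here mirrors the notebook's synthetic-video approach:

```python
import numpy as np

def motion_mask(prev, curr, thresh=25):
    """Frame differencing: flag pixels whose intensity changed by more than thresh.

    prev, curr: 2-D uint8 grayscale frames.
    """
    # Widen to int16 so the subtraction cannot wrap around at 0/255
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return diff > thresh

# Synthetic pair: a bright 8x8 block moves 4 pixels to the right
f0 = np.zeros((32, 32), dtype=np.uint8); f0[12:20, 8:16] = 255
f1 = np.zeros((32, 32), dtype=np.uint8); f1[12:20, 12:20] = 255
mask = motion_mask(f0, f1)
print(int(mask.sum()))                             # 64: 32 vacated + 32 newly covered pixels
```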

🎯 YOLO v8 → v11

All Tasks, All Versions

Detection, segmentation, pose estimation, classification, OBB — with cross-version benchmarks.

YOLOv8 • YOLOv9 (PGI) • YOLOv10 (NMS-free) • YOLO11 • Custom training • Tracking • Export

Open Notebook →


🚀 Quick Start

# Clone the repository
git clone https://github.com/mlnjsh/Computer_Vision_Projects.git
cd Computer_Vision_Projects

# Install base dependencies
pip install -r requirements_streamlit.txt

# Run any project's interactive Streamlit demo (from the repository root)
streamlit run Project_1_3D_Reconstruction/app.py        # 3D point cloud viewer
streamlit run Project_2_Video_Temporal_Reasoning/app.py # Frame selection timeline
streamlit run Project_3_Medical_Segmentation/app.py     # Medical segmentation overlay
streamlit run Project_4_3D_Occupancy_Prediction/app.py  # 3D voxel driving scene
streamlit run Project_5_VLA_Robotic_Manipulation/app.py # Robot arm trajectory

# Or run the landing page
streamlit run app.py

Note: All demos use synthetic data — no dataset downloads or GPU required!


📁 Repository Structure

Computer_Vision_Projects/
│
├── 📄 app.py                          # Main Streamlit landing page
├── 📄 requirements_streamlit.txt       # Shared dependencies
│
├── 🏗️ Project_1_3D_Reconstruction/     # 22 files — DINOv2 + cross-view attention
│   ├── app.py                          # Streamlit: interactive 3D point clouds (Plotly)
│   ├── src/models/                     # Encoder, cross-attention, decoders
│   ├── src/data/                       # CO3D, DTU loaders + synthetic data
│   ├── src/losses/                     # Chamfer, depth, pose, reprojection
│   ├── src/training/                   # Full training loop + config
│   ├── src/evaluation/                 # Metrics + 3D visualization
│   └── notebooks/demo.ipynb            # Interactive demo
│
├── 🎬 Project_2_Video_Temporal_Reasoning/ # 20 files — CLIP + temporal attention
│   ├── app.py                          # Streamlit: frame selection + attention maps
│   ├── src/frame_selection/            # CLIP scorer, samplers, diversity selector
│   ├── src/temporal_reasoning/         # Temporal encoding + before/after attention
│   └── notebooks/demo.ipynb
│
├── 🏥 Project_3_Medical_Segmentation/  # 24 files — SAM2 distillation
│   ├── app.py                          # Streamlit: polyp/skin/organ segmentation
│   ├── src/models/teacher/             # SAM 2.1 + DINOv2 wrappers
│   ├── src/models/student/             # EfficientNet + UNet decoder
│   ├── src/models/distillation/        # Feature + logit distillation
│   └── notebooks/demo.ipynb
│
├── 🚗 Project_4_3D_Occupancy_Prediction/ # 23 files — LSS + BEV + temporal
│   ├── app.py                          # Streamlit: 3D voxels + BEV view
│   ├── src/models/backbone/            # ResNet-50 + FPN
│   ├── src/models/view_transform/      # Lift-Splat-Shoot, BEV encoder
│   ├── src/models/temporal/            # Scene flow + deformable attention
│   └── notebooks/demo.ipynb
│
├── 🤖 Project_5_VLA_Robotic_Manipulation/ # 23 files — VLM + diffusion actions
│   ├── app.py                          # Streamlit: robot trajectory + diffusion viz
│   ├── src/models/vision/              # SigLIP encoder
│   ├── src/models/language/            # Lightweight LM
│   ├── src/models/action/              # DDPM diffusion decoder
│   ├── src/simulation/                 # 2D tabletop environment
│   └── notebooks/demo.ipynb
│
├── 🖼️ Image_Processing_Fundamentals/   # 50+ techniques on 10 images
│   └── Image_Processing_Complete_Guide.ipynb
│
├── 🎥 Video_Processing_Fundamentals/   # 30+ techniques with synthetic video
│   └── Video_Processing_Complete_Guide.ipynb
│
└── 🎯 YOLO_Projects/                   # YOLOv8 → v9 → v10 → v11
    └── YOLO_Complete_Projects.ipynb

🛠️ Tech Stack

| Category | Technologies |
|----------|--------------|
| Deep Learning | PyTorch 2.x, torchvision, timm, HuggingFace Transformers |
| CV Libraries | OpenCV, scikit-image, Ultralytics YOLO, albumentations |
| Foundation Models | DINOv2, CLIP, SAM 2.1, SigLIP, EfficientNet |
| Visualization | Streamlit, Plotly 3D, Matplotlib, Open3D |
| Training | AMP, DDP, TensorBoard, cosine-warmup scheduling |
| Datasets | CO3D, nuScenes, COCO, Kvasir-SEG, Open X-Embodiment |

📊 Project Comparison

| Project | Subfield | Params | Key Innovation | Venue Inspiration |
|---------|----------|--------|----------------|-------------------|
| 3D Reconstruction | 3D Vision | ~85M | Feed-forward, no iterative SfM | CVPR 2025 Best Paper |
| Video Reasoning | Video Understanding | ~350M | Adaptive frame selection | CVPR 2025 (Stanford) |
| Medical Segmentation | Medical AI | 8M | 39x compression, 98.9% accuracy | MICCAI 2025 |
| 3D Occupancy | Autonomous Driving | ~45M | Camera-only, no LiDAR | Pattern Recognition 2026 |
| VLA Robotics | Embodied AI | ~400M | Diffusion action decoding | CVPR 2026 Workshop |

📚 References & Inspiration


Project 1 — 3D Reconstruction:

  • Wang et al., "VGGT: Visual Geometry Grounded Transformer," CVPR 2025 (Best Paper)
  • Wang et al., "DUSt3R: Geometric 3D Vision Made Easy," CVPR 2024
  • Oquab et al., "DINOv2: Learning Robust Visual Features," 2023

Project 2 — Video Temporal Reasoning:

  • Liu et al., "T*: Long Video Understanding via Temporal Search," CVPR 2025
  • Fu et al., "Video-MME: Multi-modal LLMs in Video Analysis," 2024
  • Xiao et al., "NExT-QA: Temporal and Causal Reasoning," CVPR 2021

Project 3 — Medical Segmentation:

  • Ravi et al., "SAM 2: Segment Anything in Images and Videos," ICLR 2025
  • Shin et al., "MM-DINOv2 for Medical Imaging," MICCAI 2025
  • Hinton et al., "Distilling the Knowledge in a Neural Network," 2015

Project 4 — 3D Occupancy:

  • Li et al., "FSF-Net: Scene Flow Guided 3D Occupancy," Pattern Recognition 2026
  • Huang et al., "TPVFormer: Tri-Perspective View," CVPR 2023
  • Philion & Fidler, "Lift, Splat, Shoot," ECCV 2020

Project 5 — VLA Robotics:

  • Kim et al., "OpenVLA: Vision-Language-Action Model," 2024
  • Roux et al., "SmolVLA: Efficient Robot Manipulation," 2025
  • Black et al., "π0: Vision-Language-Action Flow Model," 2024

⭐ Star this repo if you find it useful!

Built with passion for computer vision research

Milan Amrut Joshi • 2026



Contributors & Domain Experts

Milan Amrut Joshi (Project Author)
Meta FAIR (PyTorch, Detectron2, SAM)
Phil Wang (Prolific CV/Transformer implementations)
