
👁️ Computer Vision Research Portfolio

5 Research-Level Projects | Interactive Demos | Production-Ready Code



Milan Amrut Joshi — Computer Vision Scientist

Covering 3D Vision • Video Understanding • Medical Imaging • Autonomous Driving • Embodied AI


┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   🏗️ 3D Reconstruction   🎬 Video Reasoning   🏥 Medical Imaging    │
│                                                                     │
│        🚗 Autonomous Driving        🤖 Robotic Manipulation         │
│                                                                     │
│   📓 50+ Image Processing Techniques  🎯 YOLO v8→v11 Projects      │
│                                                                     │
│   141 Files  •  31,000+ Lines  •  5 Streamlit Apps  •  8 Notebooks  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

🔬 Research Projects

🏗️ Project 1: Feed-Forward 3D Reconstruction

Subfield: 3D Vision / Geometric Deep Learning

Transformer that predicts 3D point clouds, depth maps, and camera poses from 2-10 images in a single forward pass — no iterative optimization.

Inspired by: CVPR 2025 Best Paper (VGGT)

| Component | Detail |
|-----------|--------|
| Backbone | DINOv2 ViT-B/14 |
| Architecture | Cross-view attention + epipolar encoding |
| Datasets | CO3D v2, DTU MVS, ScanNet |
| Metrics | Chamfer Distance, F-score, Depth Error |
Input: N images → [ViT Encoder] → [Cross-View Attention]
  → Point Maps + Depth Maps + Camera Poses
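The single-pass pipeline above can be sketched in miniature. The NumPy toy below (all shapes, random weights, and the `point_head` readout are illustrative assumptions, standing in for the trained DINOv2 features and decoders) fuses patch tokens from all views in one full cross-view attention step and reads out a per-patch 3D point map:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(feats, d=64):
    """Fuse per-view patch tokens: every patch attends to all patches of all views.

    feats: (N_views, P_patches, d) ViT patch features. Returns same shape.
    """
    n, p, _ = feats.shape
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    tokens = feats.reshape(n * p, d)          # flatten views into one joint sequence
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))      # (N*P, N*P) full cross-view attention
    return (attn @ v).reshape(n, p, d)

# Toy forward pass: 4 views, 16 patches each, 64-dim tokens
feats = np.random.default_rng(1).standard_normal((4, 16, 64))
fused = cross_view_attention(feats)
point_head = np.random.default_rng(2).standard_normal((64, 3)) * 0.01
points = fused @ point_head                   # (4, 16, 3) per-patch 3D point map
print(points.shape)
```

The key property the sketch shows is that geometry comes out of a single attention pass over all views jointly, with no bundle-adjustment loop.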

📂 Code • 🚀 Streamlit Demo • 📓 Notebook

🎬 Project 2: Video Temporal Reasoning

Subfield: Video Understanding

Two-stage system: adaptive keyframe selection followed by temporal reasoning, for answering temporal and causal questions about long-form video, a task on which even GPT-4V scores only 60-65%.

Inspired by: Stanford's T* (CVPR 2025)

| Component | Detail |
|-----------|--------|
| Frame Selector | CLIP ViT-L/14 + learned MLP |
| Reasoner | Temporal attention + VLM |
| Datasets | Video-MME, Ego4D, NExT-QA, STAR |
| Metrics | QA Accuracy by type (temporal, causal) |
Long Video + Question → [Frame Selector] → K keyframes
  → [Temporal Reasoner] → Answer with reasoning
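The frame-selection stage can be illustrated with a small NumPy sketch: a greedy MMR-style selector (an assumption; plain cosine similarity stands in for the repo's learned CLIP + MLP scorer) that trades relevance to the question against redundancy with frames already chosen:

```python
import numpy as np

def select_keyframes(frame_emb, query_emb, k=4, lam=0.5):
    """Greedy relevance-plus-diversity keyframe selection.

    frame_emb: (T, d) per-frame embeddings; query_emb: (d,) question embedding.
    Picks frames scoring high against the query but dissimilar to picks so far.
    """
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    relevance = f @ q                               # cosine similarity to the question
    chosen = [int(np.argmax(relevance))]
    while len(chosen) < k:
        redundancy = (f @ f[chosen].T).max(axis=1)  # similarity to already-picked frames
        score = lam * relevance - (1 - lam) * redundancy
        score[chosen] = -np.inf                     # never re-pick a frame
        chosen.append(int(np.argmax(score)))
    return sorted(chosen)

rng = np.random.default_rng(0)
frames = rng.standard_normal((120, 32))             # 120 frames of a long video
question = rng.standard_normal(32)
print(select_keyframes(frames, question, k=4))
```

Sorting the result preserves temporal order for the downstream reasoner.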

📂 Code • 🚀 Streamlit Demo • 📓 Notebook

🏥 Project 3: Medical Segmentation via Distillation

Subfield: Medical Imaging + Foundation Models

Distill SAM 2.1 (312M params) into a compact student (8M params) that retains 98.9% of the teacher's accuracy at 8.4x the teacher's speed — fast enough for real-time surgical guidance.

Inspired by: MedSAM2 / MM-DINOv2 (MICCAI 2025)

| Component | Detail |
|-----------|--------|
| Teacher | SAM 2.1 + DINOv2 |
| Student | EfficientNet-B0 + UNet decoder |
| Datasets | Kvasir-SEG, ISIC 2019, TotalSegmentator |
| Metrics | Dice, IoU, Hausdorff-95, FPS |
Teacher (312M, 5 FPS) → Knowledge Distillation
  → Student (8M, 42 FPS) — same accuracy!
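A minimal sketch of the distillation objective, assuming the common recipe of temperature-scaled logit matching (Hinton et al., 2015) plus feature-space MSE; array shapes and the `T`/`alpha` values are illustrative, not the repo's actual hyperparameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(t_logits, s_logits, t_feat, s_feat, T=4.0, alpha=0.5):
    """Logit distillation (soft-target KL at temperature T) + feature matching.

    t_*: teacher outputs, s_*: student outputs. Logits are (N, C);
    intermediate features are (N, d) after a projection to a shared width.
    """
    pt = softmax(t_logits / T)                      # softened teacher targets
    ps = softmax(s_logits / T)
    kd = (pt * (np.log(pt + 1e-9) - np.log(ps + 1e-9))).sum(axis=1).mean() * T * T
    feat = ((t_feat - s_feat) ** 2).mean()          # feature-space MSE
    return alpha * kd + (1 - alpha) * feat

rng = np.random.default_rng(0)
loss = distillation_loss(rng.standard_normal((8, 2)), rng.standard_normal((8, 2)),
                         rng.standard_normal((8, 16)), rng.standard_normal((8, 16)))
print(round(float(loss), 4))
```

The `T * T` factor keeps the soft-target gradient magnitude comparable across temperatures, per the original KD paper.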

📂 Code • 🚀 Streamlit Demo • 📓 Notebook

🚗 Project 4: 3D Occupancy for Autonomous Driving

Subfield: Autonomous Driving Perception

Camera-only 3D voxel occupancy prediction (200×200×16, 17 classes) with temporal fusion via scene flow — no expensive LiDAR needed.

Inspired by: FSF-Net (2026) / TPVFormer / BEVFormer

| Component | Detail |
|-----------|--------|
| Backbone | ResNet-50 + FPN |
| View Transform | Lift-Splat-Shoot (LSS) |
| Temporal | Scene flow + deformable attention |
| Datasets | nuScenes, Occ3D, KITTI, Waymo |
6 Cameras → [ResNet] → [LSS Lifting] → [BEV]
  → [Temporal Fusion] → 3D Occupancy Grid (17 classes)
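The LSS lifting-and-splatting step can be sketched in NumPy: each pixel's feature is distributed along its camera ray according to a predicted depth distribution, and the lifted points are scatter-added into a grid. Shapes, depth-bin spacing, and grid size here are toy assumptions:

```python
import numpy as np

def lift_splat(feat, depth_prob, rays, bev_shape=(10, 10), cell=1.0):
    """Minimal Lift-Splat step: lift pixels along depth bins, splat into a grid.

    feat: (P, d) per-pixel features; depth_prob: (P, D) softmax over D depth bins;
    rays: (P, 3) unit ray directions in the ego frame.
    """
    P, D = depth_prob.shape
    depths = np.linspace(1.0, float(D), D)                  # depth bin centres (m)
    pts = rays[:, None, :] * depths[None, :, None]          # (P, D, 3) lifted points
    w = depth_prob[:, :, None] * feat[:, None, :]           # depth-weighted features
    bev = np.zeros(bev_shape + (feat.shape[1],))
    ix = np.clip((pts[..., 0] / cell).astype(int), 0, bev_shape[0] - 1)
    iy = np.clip((pts[..., 1] / cell).astype(int), 0, bev_shape[1] - 1)
    np.add.at(bev, (ix, iy), w)                             # splat: scatter-add into grid
    return bev

rng = np.random.default_rng(0)
prob = rng.random((32, 8)); prob /= prob.sum(axis=1, keepdims=True)
rays = rng.standard_normal((32, 3)); rays /= np.linalg.norm(rays, axis=1, keepdims=True)
feat = rng.standard_normal((32, 4))
bev = lift_splat(feat, prob, rays)
print(bev.shape)                                            # (10, 10, 4)
```

`np.add.at` is the unbuffered scatter-add that makes the splat correct when many lifted points land in the same cell.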

📂 Code • 🚀 Streamlit Demo • 📓 Notebook

🤖 Project 5: Compact Vision-Language-Action Model for Robotic Manipulation

Subfield: Embodied AI / Robotics

A compact VLA model (~400M params) that takes camera images + language instructions and outputs smooth robotic actions via diffusion-based decoding. Runs at 8+ Hz on consumer hardware.

Inspired by: OpenVLA, SmolVLA (HuggingFace), pi-zero

| Component | Detail |
|-----------|--------|
| Vision | SigLIP ViT-B/16 with spatial compression |
| Language | Lightweight transformer encoder |
| Action | DDPM diffusion decoder (16-step chunks × 7-DoF) |
| Datasets | Open X-Embodiment, CALVIN, RLBench |
RGB Image + "pick up the red cup" → [VLM Fusion] → [Diffusion Decoder]
  → 16 smooth actions (x, y, z, rx, ry, rz, gripper)
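The diffusion decoder boils down to a standard DDPM reverse loop over an action chunk. The sketch below uses a toy `denoise_fn` in place of the image-and-language-conditioned VLA network, and the beta schedule and step count are illustrative:

```python
import numpy as np

def ddpm_sample_actions(denoise_fn, steps=16, chunk=16, dof=7, seed=0):
    """Reverse DDPM loop producing a (chunk, dof) action sequence from noise.

    denoise_fn(x, t) stands in for the learned noise-prediction network.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.1, steps)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)                      # cumulative signal retention
    x = rng.standard_normal((chunk, dof))          # start from pure Gaussian noise
    for t in range(steps - 1, -1, -1):
        eps = denoise_fn(x, t)                     # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1 - abar[t]) * eps) / np.sqrt(alphas[t])
        # Add sampling noise at every step except the last
        x = mean + (np.sqrt(betas[t]) * rng.standard_normal(x.shape) if t > 0 else 0)
    return x

# Toy denoiser that nudges actions toward zero (stand-in for the VLA network)
actions = ddpm_sample_actions(lambda x, t: x * 0.1)
print(actions.shape)                               # (16, 7)
```

Decoding a whole 16-step chunk per inference call is what lets a few-Hz network drive a smooth, higher-rate controller.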

📂 Code • 🚀 Streamlit Demo • 📓 Notebook


📓 Bonus: Comprehensive Notebooks

🖼️ Image Processing

50+ Techniques on 10 Images

From basic pixel operations to Fourier transforms, superpixels, and panorama stitching.

Color spaces • Filtering • Edges • Thresholding • Morphology • Contours • SIFT/ORB • FFT • Denoising • Inpainting

Open Notebook →
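As a taste of the thresholding material, here is Otsu's method from scratch in NumPy. This is a sketch of the underlying idea only; the notebook would typically reach for OpenCV instead:

```python
import numpy as np

def otsu_threshold(img):
    """Pick the threshold maximising between-class variance (Otsu's method).

    img: 2-D uint8 array. Returns (threshold, binary mask).
    """
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    omega = np.cumsum(p)                           # class-0 probability per threshold
    mu = np.cumsum(p * np.arange(256))             # class-0 mean times omega
    mu_t = mu[-1]                                  # global mean intensity
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    t = int(np.nanargmax(sigma_b))                 # NaNs mark empty classes
    return t, img > t

# Bimodal toy image: dark background with a bright square
img = np.full((64, 64), 40, dtype=np.uint8)
img[16:48, 16:48] = 200
t, mask = otsu_threshold(img)
print(t, int(mask.sum()))                          # threshold falls between the two modes
```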

🎥 Video Processing

30+ Techniques with Synthetic Videos

Motion analysis, object tracking, optical flow, video stabilization, and feature extraction.

Frame differencing • MOG2/KNN • Lucas-Kanade • Farneback • MeanShift • CSRT • Heatmaps • Scene detection

Open Notebook →
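Frame differencing, the simplest motion cue in the list above, fits in a few lines of NumPy; the synthetic frame pair here mirrors the notebook's synthetic-video approach:

```python
import numpy as np

def motion_mask(prev, curr, thresh=25):
    """Frame differencing: flag pixels whose intensity changed by more than thresh.

    prev, curr: 2-D uint8 grayscale frames.
    """
    # Widen to int16 so the subtraction cannot wrap around at 0/255
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return diff > thresh

# Synthetic pair: a bright 8x8 block moves 4 pixels to the right
f0 = np.zeros((32, 32), dtype=np.uint8); f0[12:20, 8:16] = 255
f1 = np.zeros((32, 32), dtype=np.uint8); f1[12:20, 12:20] = 255
mask = motion_mask(f0, f1)
print(int(mask.sum()))                             # 64: 32 vacated + 32 newly covered pixels
```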

🎯 YOLO v8 → v11

All Tasks, All Versions

Detection, segmentation, pose estimation, classification, OBB — with cross-version benchmarks.

YOLOv8 • YOLOv9 (PGI) • YOLOv10 (NMS-free) • YOLO11 • Custom training • Tracking • Export

Open Notebook →


🚀 Quick Start

# Clone the repository
git clone https://github.com/mlnjsh/Computer_Vision_Projects.git
cd Computer_Vision_Projects

# Install base dependencies
pip install -r requirements_streamlit.txt

# Run any project's interactive Streamlit demo (from the repository root)
streamlit run Project_1_3D_Reconstruction/app.py        # 3D point cloud viewer
streamlit run Project_2_Video_Temporal_Reasoning/app.py # Frame selection timeline
streamlit run Project_3_Medical_Segmentation/app.py     # Medical segmentation overlay
streamlit run Project_4_3D_Occupancy_Prediction/app.py  # 3D voxel driving scene
streamlit run Project_5_VLA_Robotic_Manipulation/app.py # Robot arm trajectory

# Or run the landing page
streamlit run app.py

Note: All demos use synthetic data — no dataset downloads or GPU required!


📁 Repository Structure

Computer_Vision_Projects/
│
├── 📄 app.py                          # Main Streamlit landing page
├── 📄 requirements_streamlit.txt       # Shared dependencies
│
├── 🏗️ Project_1_3D_Reconstruction/     # 22 files — DINOv2 + cross-view attention
│   ├── app.py                          # Streamlit: interactive 3D point clouds (Plotly)
│   ├── src/models/                     # Encoder, cross-attention, decoders
│   ├── src/data/                       # CO3D, DTU loaders + synthetic data
│   ├── src/losses/                     # Chamfer, depth, pose, reprojection
│   ├── src/training/                   # Full training loop + config
│   ├── src/evaluation/                 # Metrics + 3D visualization
│   └── notebooks/demo.ipynb            # Interactive demo
│
├── 🎬 Project_2_Video_Temporal_Reasoning/ # 20 files — CLIP + temporal attention
│   ├── app.py                          # Streamlit: frame selection + attention maps
│   ├── src/frame_selection/            # CLIP scorer, samplers, diversity selector
│   ├── src/temporal_reasoning/         # Temporal encoding + before/after attention
│   └── notebooks/demo.ipynb
│
├── 🏥 Project_3_Medical_Segmentation/  # 24 files — SAM2 distillation
│   ├── app.py                          # Streamlit: polyp/skin/organ segmentation
│   ├── src/models/teacher/             # SAM 2.1 + DINOv2 wrappers
│   ├── src/models/student/             # EfficientNet + UNet decoder
│   ├── src/models/distillation/        # Feature + logit distillation
│   └── notebooks/demo.ipynb
│
├── 🚗 Project_4_3D_Occupancy_Prediction/ # 23 files — LSS + BEV + temporal
│   ├── app.py                          # Streamlit: 3D voxels + BEV view
│   ├── src/models/backbone/            # ResNet-50 + FPN
│   ├── src/models/view_transform/      # Lift-Splat-Shoot, BEV encoder
│   ├── src/models/temporal/            # Scene flow + deformable attention
│   └── notebooks/demo.ipynb
│
├── 🤖 Project_5_VLA_Robotic_Manipulation/ # 23 files — VLM + diffusion actions
│   ├── app.py                          # Streamlit: robot trajectory + diffusion viz
│   ├── src/models/vision/              # SigLIP encoder
│   ├── src/models/language/            # Lightweight LM
│   ├── src/models/action/              # DDPM diffusion decoder
│   ├── src/simulation/                 # 2D tabletop environment
│   └── notebooks/demo.ipynb
│
├── 🖼️ Image_Processing_Fundamentals/   # 50+ techniques on 10 images
│   └── Image_Processing_Complete_Guide.ipynb
│
├── 🎥 Video_Processing_Fundamentals/   # 30+ techniques with synthetic video
│   └── Video_Processing_Complete_Guide.ipynb
│
└── 🎯 YOLO_Projects/                   # YOLOv8 → v9 → v10 → v11
    └── YOLO_Complete_Projects.ipynb

🛠️ Tech Stack

| Category | Technologies |
|----------|--------------|
| Deep Learning | PyTorch 2.x, torchvision, timm, HuggingFace Transformers |
| CV Libraries | OpenCV, scikit-image, Ultralytics YOLO, albumentations |
| Foundation Models | DINOv2, CLIP, SAM 2.1, SigLIP, EfficientNet |
| Visualization | Streamlit, Plotly 3D, Matplotlib, Open3D |
| Training | AMP, DDP, TensorBoard, cosine-warmup scheduling |
| Datasets | CO3D, nuScenes, COCO, Kvasir-SEG, Open X-Embodiment |

📊 Project Comparison

| Project | Subfield | Params | Key Innovation | Venue Inspiration |
|---------|----------|--------|----------------|-------------------|
| 3D Reconstruction | 3D Vision | ~85M | Feed-forward, no iterative SfM | CVPR 2025 Best Paper |
| Video Reasoning | Video Understanding | ~350M | Adaptive frame selection | CVPR 2025 (Stanford) |
| Medical Segmentation | Medical AI | 8M | 39x compression, 98.9% accuracy | MICCAI 2025 |
| 3D Occupancy | Autonomous Driving | ~45M | Camera-only, no LiDAR | Pattern Recognition 2026 |
| VLA Robotics | Embodied AI | ~400M | Diffusion action decoding | CVPR 2026 Workshop |

📚 References & Inspiration


Project 1 — 3D Reconstruction:

  • Wang et al., "VGGT: Visual Geometry Grounded Transformer," CVPR 2025 (Best Paper)
  • Wang et al., "DUSt3R: Geometric 3D Vision Made Easy," CVPR 2024
  • Oquab et al., "DINOv2: Learning Robust Visual Features," 2023

Project 2 — Video Temporal Reasoning:

  • Liu et al., "T*: Long Video Understanding via Temporal Search," CVPR 2025
  • Fu et al., "Video-MME: Multi-modal LLMs in Video Analysis," 2024
  • Xiao et al., "NExT-QA: Temporal and Causal Reasoning," CVPR 2021

Project 3 — Medical Segmentation:

  • Ravi et al., "SAM 2: Segment Anything in Images and Videos," ICLR 2025
  • Shin et al., "MM-DINOv2 for Medical Imaging," MICCAI 2025
  • Hinton et al., "Distilling the Knowledge in a Neural Network," 2015

Project 4 — 3D Occupancy:

  • Li et al., "FSF-Net: Scene Flow Guided 3D Occupancy," Pattern Recognition 2026
  • Huang et al., "TPVFormer: Tri-Perspective View," CVPR 2023
  • Philion & Fidler, "Lift, Splat, Shoot," ECCV 2020

Project 5 — VLA Robotics:

  • Kim et al., "OpenVLA: Vision-Language-Action Model," 2024
  • Roux et al., "SmolVLA: Efficient Robot Manipulation," 2025
  • Black et al., "π0: Vision-Language-Action Flow Model," 2024

⭐ Star this repo if you find it useful!

Built with passion for computer vision research

Milan Amrut Joshi • 2026



Contributors & Domain Experts

Milan Amrut Joshi (Project Author)
Meta FAIR (PyTorch, Detectron2, SAM)
Phil Wang (Prolific CV/Transformer implementations)
