Milan Amrut Joshi — Computer Vision Scientist
Covering 3D Vision • Video Understanding • Medical Imaging • Autonomous Driving • Embodied AI
```
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   🏗️ 3D Reconstruction    🎬 Video Reasoning    🏥 Medical Imaging  │
│                                                                     │
│   🚗 Autonomous Driving   🤖 Robotic Manipulation                   │
│                                                                     │
│   📓 50+ Image Processing Techniques   🎯 YOLO v8→v11 Projects      │
│                                                                     │
│   141 Files • 31,000+ Lines • 5 Streamlit Apps • 8 Notebooks        │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
**Five flagship projects**

- **3D Vision / Geometric Deep Learning** (inspired by the CVPR 2025 Best Paper, VGGT)
- **Video Understanding** (inspired by Stanford's T*, CVPR 2025)
- **Medical Imaging + Foundation Models** (inspired by MedSAM2 / MM-DINOv2, MICCAI 2025)
- **Autonomous Driving Perception** (inspired by FSF-Net (2026), TPVFormer, BEVFormer)
- **Embodied AI / Robotics** (inspired by OpenVLA, SmolVLA from HuggingFace, and pi-zero)
**Fundamentals and applied notebooks**

- **Image Processing Fundamentals** — 50+ techniques on 10 images, from basic pixel operations to Fourier transforms, superpixels, and panorama stitching. Covers color spaces, filtering, edges, thresholding, morphology, contours, SIFT/ORB, FFT, denoising, and inpainting.
- **Video Processing Fundamentals** — 30+ techniques on synthetic videos: motion analysis, object tracking, optical flow, video stabilization, and feature extraction. Covers frame differencing, MOG2/KNN background subtraction, Lucas-Kanade, Farneback, MeanShift, CSRT, heatmaps, and scene detection.
- **YOLO Projects** — all tasks, all versions: detection, segmentation, pose estimation, classification, and oriented bounding boxes (OBB), with cross-version benchmarks. Covers YOLOv8, YOLOv9 (PGI), YOLOv10 (NMS-free), YOLO11, custom training, tracking, and export.
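To give a flavor of the video notebook's frame-differencing technique, here is a minimal standalone sketch. The notebook itself uses OpenCV on synthetic videos; this NumPy-only version (with a hypothetical `motion_mask` helper and synthetic frames) illustrates the same idea:

```python
import numpy as np

def motion_mask(prev_frame: np.ndarray, frame: np.ndarray, thresh: int = 25) -> np.ndarray:
    """Binary motion mask via absolute frame differencing.

    Pixels whose grayscale intensity changed by more than `thresh`
    between consecutive frames are marked as moving (255).
    """
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return np.where(diff > thresh, 255, 0).astype(np.uint8)

# Synthetic example: a bright 10x10 "object" shifts right by 5 pixels.
prev = np.zeros((64, 64), dtype=np.uint8)
curr = np.zeros((64, 64), dtype=np.uint8)
prev[20:30, 20:30] = 200
curr[20:30, 25:35] = 200

mask = motion_mask(prev, curr)
print(int(mask.sum()) // 255)  # → 100 changed pixels (trailing + leading edge of the object)
```

Real pipelines usually blur the frames first and apply morphological opening to suppress sensor noise; the notebook's MOG2/KNN background subtractors generalize this to adaptive per-pixel background models.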
```bash
# Clone the repository
git clone https://github.com/mlnjsh/Computer_Vision_Projects.git
cd Computer_Vision_Projects

# Install base dependencies
pip install -r requirements_streamlit.txt

# Run any project's interactive Streamlit demo
cd Project_1_3D_Reconstruction && streamlit run app.py          # 3D point cloud viewer
cd Project_2_Video_Temporal_Reasoning && streamlit run app.py   # Frame selection timeline
cd Project_3_Medical_Segmentation && streamlit run app.py       # Medical segmentation overlay
cd Project_4_3D_Occupancy_Prediction && streamlit run app.py    # 3D voxel driving scene
cd Project_5_VLA_Robotic_Manipulation && streamlit run app.py   # Robot arm trajectory

# Or run the landing page
streamlit run app.py
```

> Note: all demos use synthetic data, so no dataset downloads or GPU are required.
```
Computer_Vision_Projects/
│
├── 📄 app.py                                # Main Streamlit landing page
├── 📄 requirements_streamlit.txt            # Shared dependencies
│
├── 🏗️ Project_1_3D_Reconstruction/          # 22 files — DINOv2 + cross-view attention
│   ├── app.py                               # Streamlit: interactive 3D point clouds (Plotly)
│   ├── src/models/                          # Encoder, cross-attention, decoders
│   ├── src/data/                            # CO3D, DTU loaders + synthetic data
│   ├── src/losses/                          # Chamfer, depth, pose, reprojection
│   ├── src/training/                        # Full training loop + config
│   ├── src/evaluation/                      # Metrics + 3D visualization
│   └── notebooks/demo.ipynb                 # Interactive demo
│
├── 🎬 Project_2_Video_Temporal_Reasoning/   # 20 files — CLIP + temporal attention
│   ├── app.py                               # Streamlit: frame selection + attention maps
│   ├── src/frame_selection/                 # CLIP scorer, samplers, diversity selector
│   ├── src/temporal_reasoning/              # Temporal encoding + before/after attention
│   └── notebooks/demo.ipynb
│
├── 🏥 Project_3_Medical_Segmentation/       # 24 files — SAM2 distillation
│   ├── app.py                               # Streamlit: polyp/skin/organ segmentation
│   ├── src/models/teacher/                  # SAM 2.1 + DINOv2 wrappers
│   ├── src/models/student/                  # EfficientNet + UNet decoder
│   ├── src/models/distillation/             # Feature + logit distillation
│   └── notebooks/demo.ipynb
│
├── 🚗 Project_4_3D_Occupancy_Prediction/    # 23 files — LSS + BEV + temporal
│   ├── app.py                               # Streamlit: 3D voxels + BEV view
│   ├── src/models/backbone/                 # ResNet-50 + FPN
│   ├── src/models/view_transform/           # Lift-Splat-Shoot, BEV encoder
│   ├── src/models/temporal/                 # Scene flow + deformable attention
│   └── notebooks/demo.ipynb
│
├── 🤖 Project_5_VLA_Robotic_Manipulation/   # 23 files — VLM + diffusion actions
│   ├── app.py                               # Streamlit: robot trajectory + diffusion viz
│   ├── src/models/vision/                   # SigLIP encoder
│   ├── src/models/language/                 # Lightweight LM
│   ├── src/models/action/                   # DDPM diffusion decoder
│   ├── src/simulation/                      # 2D tabletop environment
│   └── notebooks/demo.ipynb
│
├── 🖼️ Image_Processing_Fundamentals/        # 50+ techniques on 10 images
│   └── Image_Processing_Complete_Guide.ipynb
│
├── 🎥 Video_Processing_Fundamentals/        # 30+ techniques with synthetic video
│   └── Video_Processing_Complete_Guide.ipynb
│
└── 🎯 YOLO_Projects/                        # YOLOv8 → v9 → v10 → v11
    └── YOLO_Complete_Projects.ipynb
```
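Project 1's `src/losses/` lists a Chamfer loss for comparing predicted and ground-truth point clouds. As a rough illustration (a brute-force NumPy sketch, not the repo's implementation, which would typically be a batched PyTorch version):

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds P and Q.

    For each point in P, find the squared distance to its nearest
    neighbour in Q (and vice versa), then average both directions.
    O(|P|*|Q|) memory and time; real pipelines use KD-trees or
    CUDA kernels for large clouds.
    """
    # Pairwise squared distances, shape (|P|, |Q|)
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

# Identical clouds have zero Chamfer distance.
pts = np.random.default_rng(0).normal(size=(128, 3))
print(chamfer_distance(pts, pts))            # → 0.0
print(chamfer_distance(pts, pts + 1.0) > 0)  # → True
```

Because it only needs nearest neighbours, Chamfer distance is permutation-invariant and does not require the two clouds to have the same number of points, which is why it is the default reconstruction loss for feed-forward 3D models.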
| Category | Technologies |
|---|---|
| Deep Learning | PyTorch 2.x, torchvision, timm, HuggingFace Transformers |
| CV Libraries | OpenCV, scikit-image, Ultralytics YOLO, albumentations |
| Foundation Models | DINOv2, CLIP, SAM 2.1, SigLIP, EfficientNet |
| Visualization | Streamlit, Plotly 3D, Matplotlib, Open3D |
| Training | AMP, DDP, TensorBoard, cosine-warmup scheduling |
| Datasets | CO3D, nuScenes, COCO, Kvasir-SEG, Open X-Embodiment |
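The training row above mentions cosine-warmup scheduling. A minimal sketch of that schedule (a hypothetical `cosine_warmup_lr` helper; the projects presumably wire this into a PyTorch `LambdaLR`):

```python
import math

def cosine_warmup_lr(step: int, total_steps: int, warmup_steps: int,
                     base_lr: float = 1e-3, min_lr: float = 1e-6) -> float:
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [cosine_warmup_lr(s, total_steps=100, warmup_steps=10) for s in range(100)]
print(max(schedule))  # → 0.001 (peak LR reached at the end of warmup)
```

Warmup avoids large, noisy updates while batch-norm statistics and Adam moments are still settling; the cosine tail anneals smoothly instead of using stepwise drops.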
| Project | Subfield | Params | Key Innovation | Venue Inspiration |
|---|---|---|---|---|
| 3D Reconstruction | 3D Vision | ~85M | Feed-forward, no iterative SfM | CVPR 2025 Best Paper |
| Video Reasoning | Video Understanding | ~350M | Adaptive frame selection | CVPR 2025 (Stanford) |
| Medical Segmentation | Medical AI | 8M | 39x compression, 98.9% accuracy | MICCAI 2025 |
| 3D Occupancy | Autonomous Driving | ~45M | Camera-only, no LiDAR | Pattern Recognition 2026 |
| VLA Robotics | Embodied AI | ~400M | Diffusion action decoding | CVPR 2026 Workshop |
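The 39x-compressed medical segmentation student is trained with feature and logit distillation. A minimal NumPy sketch of the soft-target logit loss from Hinton et al. (2015), using made-up example logits (the repo's version would operate on batched PyTorch tensors alongside a hard-label term):

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Numerically stable softmax over temperature-scaled logits."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits: np.ndarray, teacher_logits: np.ndarray,
                      T: float = 4.0) -> float:
    """KL(teacher || student) on temperature-softened distributions.

    Scaled by T^2 so the soft-target gradients stay comparable in
    magnitude to the hard-label cross-entropy term.
    """
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return float(T * T * kl.mean())

teacher = np.array([[4.0, 1.0, 0.5]])
print(distillation_loss(teacher, teacher))               # → 0.0 (identical logits)
print(distillation_loss(np.zeros((1, 3)), teacher) > 0)  # → True
```

A high temperature flattens the teacher's distribution, exposing the "dark knowledge" in the relative probabilities of wrong classes, which is what lets an 8M-parameter student approach a much larger teacher.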
**Full reference list**
Project 1 — 3D Reconstruction:
- Wang et al., "VGGT: Visual Geometry Grounded Transformer," CVPR 2025 (Best Paper)
- Wang et al., "DUSt3R: Geometric 3D Vision Made Easy," CVPR 2024
- Oquab et al., "DINOv2: Learning Robust Visual Features," 2023
Project 2 — Video Temporal Reasoning:
- Liu et al., "T*: Long Video Understanding via Temporal Search," CVPR 2025
- Fu et al., "Video-MME: Multi-modal LLMs in Video Analysis," 2024
- Xiao et al., "NExT-QA: Temporal and Causal Reasoning," CVPR 2021
Project 3 — Medical Segmentation:
- Ravi et al., "SAM 2: Segment Anything in Images and Videos," ICLR 2025
- Shin et al., "MM-DINOv2 for Medical Imaging," MICCAI 2025
- Hinton et al., "Distilling the Knowledge in a Neural Network," 2015
Project 4 — 3D Occupancy:
- Li et al., "FSF-Net: Scene Flow Guided 3D Occupancy," Pattern Recognition 2026
- Huang et al., "TPVFormer: Tri-Perspective View," CVPR 2023
- Philion & Fidler, "Lift, Splat, Shoot," ECCV 2020
Project 5 — VLA Robotics:
- Kim et al., "OpenVLA: Vision-Language-Action Model," 2024
- Roux et al., "SmolVLA: Efficient Robot Manipulation," 2025
- Black et al., "π0: Vision-Language-Action Flow Model," 2024
Built with passion for computer vision research
Milan Amrut Joshi • 2026
**Credits**

- Milan Amrut Joshi (project author)
- Meta FAIR (PyTorch, Detectron2, SAM)
- Phil Wang (prolific CV/transformer implementations)