ghostiee-11/Multimodal-RAG-System-for-Visual-Shopping-Assistance

Multimodal RAG System for Visual Shopping Assistance

Live deployment: https://multimodal-deployment.vercel.app/

This project implements a progressively developed Multimodal Retrieval-Augmented Generation (RAG) system to answer e-commerce product queries using both textual and visual data. Built for the headphones category on Amazon.in, it leverages product specifications, user reviews, images, BLIP captions, and OCR-derived text to provide grounded, context-aware responses via Google Gemini.


Project Overview

The system evolves across four iterations; the fourth adds hybrid search to the third:

| Iteration | Description | Entry Point |
|---|---|---|
| 1 | Basic RAG using text-only chunks and a single BLIP caption per image | main_assistant.py |
| 2 | Adds multiple BLIP captions and ViLT-based reranking | main_assistant_new.py |
| 3 | Integrates filtered BLIP captions and OCR text into a unified retrieval pipeline | main_assistant_new_one.py (Recommended) |
| 4 | Adds hybrid search to the unified retrieval pipeline of filtered BLIP captions and OCR text | main_assistant_new_one_hyb.py (Recommended) |
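
This README does not describe how the hybrid variant combines its signals. One common approach is a weighted sum of a dense (embedding) similarity and a sparse (keyword) score. The sketch below illustrates that idea only; `hybrid_score`, `alpha`, `keyword_overlap`, and `rank` are hypothetical names and are not taken from retriever_new_one_hyb.py.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query, text):
    """Toy sparse score: fraction of query tokens that also appear in the text."""
    q_tokens = query.lower().split()
    if not q_tokens:
        return 0.0
    t_counts = Counter(text.lower().split())
    return sum(1 for tok in q_tokens if t_counts[tok] > 0) / len(q_tokens)


def hybrid_score(dense, sparse, alpha=0.5):
    """Convex combination: alpha weights the dense score, 1 - alpha the sparse one."""
    return alpha * dense + (1 - alpha) * sparse


def rank(query, query_vec, docs, alpha=0.5):
    """docs is a list of (text, vector) pairs; returns texts ordered best first."""
    scored = [
        (hybrid_score(cosine(query_vec, vec), keyword_overlap(query, text), alpha), text)
        for text, vec in docs
    ]
    return [text for _, text in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

In a real pipeline the dense score would come from the vector database and the sparse score from a keyword index such as BM25; `alpha` is a tuning knob, not a value documented by this project.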

Recommended File Structure

For smooth execution across embedding, retrieval, and assistant pipelines, flatten all scripts and folders into a single directory structure:

├── scrapped_dataset/              # Final CSVs: metadata, reviews, specs, image info
├── scraper/                       # Web scraping scripts
├── embedding/                     # Embedding generation, BLIP, OCR scripts

├── retriever.py                   # Iteration 1 retriever
├── retriever_new.py               # Iteration 2 retriever
├── retriever_new_one.py           # Iteration 3 retriever
├── retriever_new_one_hyb.py       # Iteration 3 retriever with hybrid search

├── llm_handler.py                 # Iteration 1 LLM handler
├── llm_handler_new.py             # Iteration 2 LLM handler
├── llm_handler_new_one.py         # Iteration 3 LLM handler

├── main_assistant.py              # Iteration 1 chatbot
├── main_assistant_new.py          # Iteration 2 chatbot
├── main_assistant_new_one.py      # Iteration 3 chatbot (recommended)
├── main_assistant_new_one_hyb.py  # Iteration 3 chatbot with hybrid search (recommended)
├── requirements.txt               # Dependencies
└── README.md                      # Project documentation

Installation

  1. Clone the repository:

    git clone <your-repo-url>
    cd Multimodal-RAG-System-for-Visual-Shopping-Assistance

  2. Install dependencies:

    pip install -r requirements.txt

  3. Set environment variables:

    export PINECONE_API_KEY=your_pinecone_api_key
    export PINECONE_ENVIRONMENT=your_pinecone_environment
    export GEMINI_API_KEY=your_gemini_api_key

Required API Keys

Google Gemini Flash API Key

Pinecone API Key and Environment

These are used for language generation and vector database storage/retrieval, respectively.
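
Because the assistants depend on all three variables, it can help to fail early with a clear message when one is missing. A minimal sketch, assuming the scripts read these values from the process environment; `load_api_config` and `REQUIRED_VARS` are hypothetical helpers, not part of the repository.

```python
import os

# Variable names as listed in the installation steps.
REQUIRED_VARS = ("PINECONE_API_KEY", "PINECONE_ENVIRONMENT", "GEMINI_API_KEY")


def load_api_config(env=None):
    """Return the required settings as a dict, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_api_config()` at the top of an entry point would surface a missing key before any model is loaded.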

Models Used

The following pre-trained models are used throughout the pipeline:

| Purpose | Model Name |
|---|---|
| Text Embedding | all-mpnet-base-v2 |
| Cross Encoder | cross-encoder/stsb-roberta-base |
| Image Embedding | openai/clip-vit-base-patch32 |
| Image Captioning | Salesforce/blip-image-captioning-large |
| Image Reranking | dandelin/vilt-b32-finetuned-vqa |
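
The model names above can be kept in one mapping so every script refers to the same checkpoints. The sketch below assumes the `sentence-transformers` and `transformers` packages are installed and that these loader classes match how the project loads each model, which this README does not confirm; `MODEL_NAMES` and `load_model` are hypothetical.

```python
# Checkpoint names copied from the table of models used.
MODEL_NAMES = {
    "text_embedding": "all-mpnet-base-v2",
    "cross_encoder": "cross-encoder/stsb-roberta-base",
    "image_embedding": "openai/clip-vit-base-patch32",
    "image_captioning": "Salesforce/blip-image-captioning-large",
    "image_reranking": "dandelin/vilt-b32-finetuned-vqa",
}


def load_model(purpose):
    """Load one model by purpose; an unknown purpose raises KeyError.

    Imports are deferred so the mapping can be used without the heavy libraries.
    Vision models are returned as a (model, processor) pair.
    """
    name = MODEL_NAMES[purpose]
    if purpose == "text_embedding":
        from sentence_transformers import SentenceTransformer
        return SentenceTransformer(name)
    if purpose == "cross_encoder":
        from sentence_transformers import CrossEncoder
        return CrossEncoder(name)
    if purpose == "image_embedding":
        from transformers import CLIPModel, CLIPProcessor
        return CLIPModel.from_pretrained(name), CLIPProcessor.from_pretrained(name)
    if purpose == "image_captioning":
        from transformers import BlipForConditionalGeneration, BlipProcessor
        return (BlipForConditionalGeneration.from_pretrained(name),
                BlipProcessor.from_pretrained(name))
    # Only "image_reranking" remains at this point.
    from transformers import ViltForQuestionAnswering, ViltProcessor
    return (ViltForQuestionAnswering.from_pretrained(name),
            ViltProcessor.from_pretrained(name))
```

The first call for each purpose downloads the checkpoint from the Hugging Face Hub, so expect a delay and several gigabytes of disk use.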

How to Run the Chatbot

Run one of the following assistant scripts:

Iteration 1 (Basic RAG)

    python main_assistant.py

Iteration 2 (BLIP + ViLT reranking)

    python main_assistant_new.py

Iteration 3 (BLIP + OCR + Semantic Filtering) - Recommended

    python main_assistant_new_one.py

Iteration 3 with hybrid search (BLIP + OCR + Semantic Filtering) - Recommended

    python main_assistant_new_one_hyb.py
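
If you switch between iterations often, a small launcher can map a short label to the right script. This is a convenience sketch, not something shipped with the repository; `ASSISTANT_SCRIPTS`, `build_command`, `run_assistant`, and the label "3-hybrid" are hypothetical, and it assumes the scripts sit in the current working directory.

```python
import subprocess
import sys

# Labels are illustrative; script names come from the list of entry points.
ASSISTANT_SCRIPTS = {
    "1": "main_assistant.py",
    "2": "main_assistant_new.py",
    "3": "main_assistant_new_one.py",
    "3-hybrid": "main_assistant_new_one_hyb.py",
}


def build_command(iteration):
    """Return the argv list that runs the assistant for the given iteration label."""
    if iteration not in ASSISTANT_SCRIPTS:
        raise ValueError(f"Unknown iteration {iteration!r}; choose from {sorted(ASSISTANT_SCRIPTS)}")
    # sys.executable keeps the launcher on the same interpreter (and virtualenv).
    return [sys.executable, ASSISTANT_SCRIPTS[iteration]]


def run_assistant(iteration):
    """Run the chosen assistant and return its exit code."""
    return subprocess.run(build_command(iteration), check=False).returncode
```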

Dataset Details

All cleaned and processed dataset files are located in the scrapped_dataset/ directory:

products_final.csv: Product metadata (title, price, category, image paths)

customer_reviews_scraped_v3.csv: User reviews and aspect summaries

all_documents.csv: Descriptions and specifications

all_product_images_info_scraped.csv: Image metadata

valid_product_images.csv: Filtered list of usable images

image_captions_multiple.csv: Multiple BLIP-generated captions per image

image_ocr_texts_cleaned.csv: Cleaned OCR outputs

image_combined_blip_ocr_filtered_final.csv: Final filtered captions + OCR metadata
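
Before running the embedding or assistant scripts, it may be worth confirming that every file listed above is present. A minimal sketch, assuming the files sit directly inside scrapped_dataset/ with no subfolders; `DATASET_FILES` and `missing_dataset_files` are hypothetical helpers.

```python
from pathlib import Path

# File names as listed in the dataset details.
DATASET_FILES = (
    "products_final.csv",
    "customer_reviews_scraped_v3.csv",
    "all_documents.csv",
    "all_product_images_info_scraped.csv",
    "valid_product_images.csv",
    "image_captions_multiple.csv",
    "image_ocr_texts_cleaned.csv",
    "image_combined_blip_ocr_filtered_final.csv",
)


def missing_dataset_files(dataset_dir="scrapped_dataset"):
    """Return the expected dataset files that are not present in dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in DATASET_FILES if not (root / name).is_file()]
```

An empty list means all eight files were found; anything else names the files to regenerate or re-download.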

Notes

Flatten all subfolders into a single directory before running scripts. Update all relative paths in the scripts if you restructure the project.

Use the latest iteration (main_assistant_new_one.py, or main_assistant_new_one_hyb.py for hybrid search) for the most accurate and visually grounded responses.

Embeddings can be regenerated using scripts in the embedding/ directory.
