AI-Scraper is a web application that generates web scraping scripts automatically using AI. Users input a URL and the data fields they want, and the system outputs a working Python script or runs it in isolated cloud containers for safe execution.
- AI-Powered Script Generation: Automatically generates Python web scraping scripts using AI
- Multiple Output Formats: Download scripts or run them in the cloud for CSV/JSON output
- Containerized Execution: Safe, isolated script execution in Docker containers using run_scraper.py
- Scraper Runner Service: Dedicated service for executing generated scraping scripts
- Security-First: Scripts run in isolated containers with timeout and resource limits
- Real-time Monitoring: Execution tracking with status updates and error handling
- User Management: Authentication and user dashboard
- Modern Stack: FastAPI + React with PostgreSQL database
- Backend: Python + FastAPI
- Frontend: React with TypeScript
- Database: PostgreSQL (SQLite for local development)
- AI Integration: OpenAI GPT / Local LLM support
- Scraper Runner: Dedicated service (run_scraper.py) for isolated script execution
- Deployment: Docker containers for easy deployment
- Containerization: Multi-service orchestration with Docker Compose
- Docker and Docker Compose
- Python 3.9+
- Node.js 16+
- PostgreSQL (or use SQLite for development)
- Clone the repository
- Set up environment variables
- Start the development servers
# Backend
cd backend
pip install -r requirements.txt
uvicorn app.main:app --reload
# Frontend
cd frontend
npm install
npm run dev
# Database
docker-compose up db
# Full stack (all services)
docker-compose up --build

The scraper runner system provides secure, isolated execution of generated scraping scripts. The run_scraper.py service ensures that scripts run in controlled environments without affecting the main application.
- Container Isolation: Scripts execute in separate Docker containers
- Timeout Protection: 5-minute execution timeout with graceful failure handling
- Resource Management: Memory and CPU limits for script execution
- Output Handling: Supports JSON, CSV, and XML output formats
- Error Handling: Comprehensive error reporting and logging
- Security: Scripts run with restricted permissions and no system access
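
The isolation features above map fairly directly onto standard `docker run` flags. The following is a minimal sketch of how such a command could be assembled, not the actual contents of run_scraper.py. The function name, image tag, limit values, user ID, and mount path are all assumptions made for illustration.

```python
from pathlib import Path


def build_docker_command(
    script_path: str,
    image: str = "scraper-runner:latest",  # hypothetical image tag
    memory: str = "256m",                   # assumed memory limit
    cpus: str = "0.5",                      # assumed CPU limit
) -> list[str]:
    """Return a `docker run` argument list that applies isolation limits."""
    host_path = Path(script_path).resolve()
    return [
        "docker", "run",
        "--rm",                                 # remove the container on exit
        "--memory", memory,                     # hard memory limit
        "--cpus", cpus,                         # CPU limit
        "--pids-limit", "64",                   # cap the number of processes
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--read-only",                          # read-only root filesystem
        "--tmpfs", "/tmp",                      # writable scratch space only
        "--user", "1000:1000",                  # assumed non-root user
        "-v", f"{host_path}:/work/script.py:ro",  # mount the script read-only
        image,
        "python", "/work/script.py",
    ]


if __name__ == "__main__":
    print(" ".join(build_docker_command("generated_scraper.py")))
```

Networking is left enabled in this sketch because a scraper has to reach the target site. Note that a timeout applied to the `docker` client process alone may not stop the container itself, so a real runner would probably also need to stop or kill the container explicitly.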
- User generates or uploads a scraping script via the web interface
- Backend API queues the execution request
- run_scraper.py receives execution instructions
- Script is executed in an isolated container with controlled environment
- Output is captured and stored for user download
- Execution status and logs are returned to the user interface
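
The capture-and-report part of this flow can be sketched with the standard library alone. This is an illustration under simplifying assumptions: it runs the script with the local Python interpreter instead of a container, and `ExecutionResult` and `run_script` are hypothetical names, not the project's API.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    status: str      # "success", "failed", or "timeout"
    output: str      # captured stdout (the scraped data)
    logs: str        # captured stderr or a timeout message
    duration: float  # wall-clock seconds


def run_script(script_path: str, timeout: float = 300.0) -> ExecutionResult:
    """Run a script, capture its output, and enforce a time limit."""
    started = time.monotonic()
    try:
        proc = subprocess.run(
            [sys.executable, script_path],  # assumption: local run, no container
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        elapsed = time.monotonic() - started
        return ExecutionResult("timeout", "", f"timed out after {timeout}s", elapsed)
    status = "success" if proc.returncode == 0 else "failed"
    return ExecutionResult(status, proc.stdout, proc.stderr, time.monotonic() - started)
```

The default of 300 seconds mirrors the 5-minute limit described above. In a containerized runner, the command passed to `subprocess.run` would be a container invocation rather than the local interpreter.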
- Real-time execution status updates
- Comprehensive logging for troubleshooting
- Execution time tracking and performance metrics
- Error reporting with detailed stack traces
Auth:
- POST /register - User registration
- POST /login - User authentication
- POST /refresh - Token refresh
- GET /me - Get current user profile

Scrapers:
- GET / - List user's scrapers
- POST / - Create new scraper
- GET /{id} - Get scraper details
- PUT /{id} - Update scraper
- DELETE /{id} - Delete scraper
- POST /{id}/generate - Generate AI script
- GET /{id}/script - Download generated script
- POST /{id}/execute - Execute scraper
- GET /{id}/executions - Get execution history

Users:
- GET /profile - Get user profile with statistics
- PUT /profile - Update user profile
- GET /credits - Get user credits
- GET /scrapers - Get user's scrapers
- GET /executions - Get user's execution history

Admin:
- GET /stats - System statistics
- GET /users - List all users with filtering
- GET /executions/recent - Recent execution logs
- GET /ai-logs/recent - AI generation history
- GET /system/health - System health check

- Generate Script: User provides URL and desired fields
- AI Processing: Backend generates Python scraping script
- Security Validation: Script is validated for safety
- Execution Queue: Script is queued for containerized execution
- Container Execution: run_scraper.py executes script in isolated environment
- Result Handling: Output is processed and stored
- User Notification: Results are available for download or view
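
The README does not say how the Security Validation step works. One plausible approach is a static check with Python's `ast` module before the script is queued, sketched below. The blocked module and call lists are assumptions, `validate_script` is a hypothetical name, and the size limit is treated as bytes, which is also an assumption.

```python
import ast

BLOCKED_MODULES = {"os", "subprocess", "shutil", "socket", "ctypes"}  # assumed list
BLOCKED_CALLS = {"eval", "exec", "compile", "__import__"}             # assumed list
MAX_SCRIPT_SIZE = 10000  # same value as MAX_SCRIPT_SIZE in backend/.env


def validate_script(source: str) -> list[str]:
    """Return a list of problems found in the script; empty means it passed."""
    problems = []
    if len(source.encode("utf-8")) > MAX_SCRIPT_SIZE:
        problems.append("script exceeds MAX_SCRIPT_SIZE")
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return problems + [f"syntax error: {exc.msg}"]
    for node in ast.walk(tree):
        # Collect module names from both `import x` and `from x import y`.
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            root = name.split(".")[0]
            if root in BLOCKED_MODULES:
                problems.append(f"blocked import: {root}")
        # Flag direct calls to dynamic-execution builtins.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in BLOCKED_CALLS
        ):
            problems.append(f"blocked call: {node.func.id}")
    return problems
```

A static check like this is easy to bypass, so it works as a first filter at best. The container isolation in the following steps remains the actual safety boundary.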
AI-Scraper/
├── backend/ # FastAPI Backend Application
│ ├── app/
│ │ ├── main.py # Application entry point
│ │ ├── api/ # API routes (auth, scrapers, users, admin)
│ │ ├── models.py # Database models
│ │ ├── ai_agent.py # AI script generation
│ │ └── database.py # Database configuration
│ ├── requirements.txt # Python dependencies
│ └── Dockerfile # Backend container
│
├── frontend/ # React Frontend Application
│ ├── src/
│ │ ├── App.tsx # Main application component
│ │ ├── components/ # Reusable UI components
│ │ ├── pages/ # Application pages
│ │ └── stores/ # State management (Zustand)
│ ├── package.json # Node.js dependencies
│ └── Dockerfile # Frontend container
│
├── scraper-runner/ # Script Execution Service
│ ├── run_scraper.py # Main execution service with container isolation
│ ├── scrape-requirements.txt # Python dependencies for scraping
│ └── Dockerfile # Runner container
│
├── docker-compose.yml # Full stack orchestration
├── README.md # Project documentation
└── LICENSE # MIT License
Configure the following in backend/.env:
# AI Configuration
OPENAI_API_KEY=your_openai_api_key
OPENAI_MODEL=gpt-3.5-turbo
# Database
DATABASE_URL=postgresql://user:password@localhost:5432/ai_scraper
# Security
SECRET_KEY=your_secret_key_here
# Script Execution
SCRIPT_TIMEOUT=300
MAX_SCRIPT_SIZE=10000

The scraper runner includes specialized dependencies for web scraping:
requests==2.31.0
beautifulsoup4==4.12.2
selenium==4.15.2
lxml==4.9.3
pandas==2.1.4
urllib3==2.0.7
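
Returning to the backend/.env settings listed earlier, the following sketch shows one way they could be read into typed values at startup. The `Settings` class, the `load_settings` helper, and the SQLite fallback URL are assumptions for illustration; the project may load its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    openai_model: str
    database_url: str
    script_timeout: int   # seconds
    max_script_size: int  # unit not stated in the README


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, with development defaults."""
    return Settings(
        openai_model=env.get("OPENAI_MODEL", "gpt-3.5-turbo"),
        # Assumed fallback: a local SQLite file for development.
        database_url=env.get("DATABASE_URL", "sqlite:///./dev.db"),
        script_timeout=int(env.get("SCRIPT_TIMEOUT", "300")),
        max_script_size=int(env.get("MAX_SCRIPT_SIZE", "10000")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Secrets such as OPENAI_API_KEY and SECRET_KEY are left out of the sketch on purpose. They would normally be required values with no default.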
MIT License