SQL Query Generator with Self-Improving AI

Author: Markus van Kempen (markus.van.kempen@gmail.com)
Last Updated: January 24, 2026
Version: 2.0

BeeAI Agent Demo

A Streamlit application that converts natural language queries into SQL using IBM WatsonX LLMs with self-improving capabilities powered by embeddings.

Features

🧠 Self-Improving System

  • Embedding-based learning: Uses IBM Slate embeddings to find semantically similar past queries
  • Few-shot prompting: Automatically includes successful examples in prompts
  • Error avoidance: Learns from common mistakes to improve future queries
  • User feedback: 👍/👎 buttons help the system learn what works
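The learning loop can be sketched as a small pattern store: embed each successful query, retrieve the nearest past patterns by cosine similarity, and splice them into the prompt as few-shot examples. The names below (`LearningStore`, `build_prompt`) are hypothetical stand-ins for the repo's `learning_store.py`, not its actual API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class LearningStore:
    """Keeps (question, sql, embedding) patterns from successful queries."""
    def __init__(self):
        self.patterns = []

    def add(self, question, sql, embedding):
        self.patterns.append({"q": question, "sql": sql, "emb": embedding})

    def top_k(self, query_emb, k=3):
        """Most semantically similar past patterns, best first."""
        ranked = sorted(self.patterns,
                        key=lambda p: cosine(query_emb, p["emb"]),
                        reverse=True)
        return ranked[:k]

def build_prompt(question, examples):
    """Few-shot prompt: similar solved queries first, new question last."""
    shots = "\n".join(f"Q: {p['q']}\nSQL: {p['sql']}" for p in examples)
    return f"{shots}\nQ: {question}\nSQL:"
```

User feedback (👍/👎) would then adjust a rating on each stored pattern so poor examples fall out of the top-k results.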

🤖 Four Query Generation Modes

| Mode | Description | Best For |
|------|-------------|----------|
| Direct SQL | Prompt engineering + SQLGlot validation | Speed-critical queries |
| LangChain Agent | LangChain SQL chain with schema injection | Complex queries |
| BeeAI Agent | IBM BeeAI Framework with native WatsonX | Highest reliability |
| Self-Improving | Full learning system with pattern storage | Continuous improvement |

All modes share:

| Component | Technology | Benefit |
|-----------|------------|---------|
| SQL Validation | SQLGlot (AST parsing) | Robustness & safety |
| Agent Frameworks | LangChain, BeeAI Framework | Flexibility & extensibility |

💬 Hybrid Query Classification

Not all questions need SQL! The system uses a 4-layer hybrid approach:

| Layer | Description |
|-------|-------------|
| 1. Semantic | IBM Slate embeddings (79 examples); understands "per" = "by" |
| 2. Keywords | Database/general/help keyword matching |
| 3. Schema-Aware | Validates against actual DB content |
| 4. Empty Analysis | Explains why queries return no results |

Query Types Handled:

  • Database queries → Generate SQL (e.g., "show all products")
  • General questions → Direct response (e.g., "what time is it")
  • Help questions → System guidance (e.g., "what can you do")
  • Unknown entities → Helpful feedback (e.g., "show BMW sales" → "BMW not in database")

The semantic classifier understands that "show total revenue per country" = "revenue by country" = "sales grouped by country" without keyword rules!
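The keyword and schema-aware layers of such a cascade can be sketched as below. This is a hypothetical simplification: the real classifier runs an embedding-based semantic layer first, and the keyword sets and `KNOWN_ENTITIES` sample here are invented for illustration:

```python
# Hypothetical keyword + schema-aware layers; the real system checks an
# embedding-based semantic layer before falling through to these.
DB_KEYWORDS = {"show", "list", "total", "revenue", "sales", "top", "count",
               "how", "many"}
HELP_PHRASES = ("what can you do", "help")
KNOWN_ENTITIES = {"laptop", "smartphone", "usa", "germany", "europe"}  # sampled from the DB

def classify(query: str) -> str:
    q = query.lower()
    if any(p in q for p in HELP_PHRASES):   # help keywords
        return "help"
    words = set(q.replace("?", "").split())
    if words & DB_KEYWORDS:                 # database keywords
        # schema-aware layer: flag capitalized entities the DB has never seen
        capitalized = {w.lower() for w in query.split() if w[:1].isupper()}
        unknown = capitalized - KNOWN_ENTITIES
        if unknown:
            return "unknown_entity"         # e.g. "show BMW sales"
        return "database"
    return "general"                        # e.g. "what time is it"
```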

See COMPARISON_RESULTS.md for full implementation details and known gaps.

🔍 Product Disambiguation

When a search matches multiple products (e.g., "Wireless Mouse" matches both "Wireless Mouse" and "Wireless Mouse Pro"), the system shows a helpful disambiguation message.
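A minimal sketch of that behavior, with a hypothetical `disambiguate` helper (the repo's `product_mapper.py` is the real implementation):

```python
def disambiguate(search_term, product_names):
    """Return (product, message): an unambiguous match, or a message
    listing the candidates for the user to choose from."""
    term = search_term.lower()
    matches = [p for p in product_names if term in p.lower()]
    if len(matches) == 1:
        return matches[0], None
    if matches:
        return None, (f"'{search_term}' matches {len(matches)} products: "
                      + ", ".join(matches) + ". Which one did you mean?")
    return None, f"No product matching '{search_term}' found."
```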

⚡ Semantic Caching (NEW)

  • Memory + disk caching of embedding vectors
  • First query: API call (~450ms) → Cached for reuse (<1ms)
  • Reduces WatsonX API calls by 80%+ for repeated patterns
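A two-tier cache of this shape can be built with a dict in front of JSON files keyed by a hash of the text. The class below is a hedged sketch, not the repo's actual cache code:

```python
import hashlib
import json
import os

class EmbeddingCache:
    """Memory-first, disk-second cache for embedding vectors (sketch)."""
    def __init__(self, cache_dir):
        self.mem = {}
        self.dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _key(self, text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text):
        k = self._key(text)
        if k in self.mem:                       # hot path: in-memory hit
            return self.mem[k]
        path = os.path.join(self.dir, k + ".json")
        if os.path.exists(path):                # warm path: disk hit
            with open(path) as f:
                vec = json.load(f)
            self.mem[k] = vec
            return vec
        return None                             # miss: caller hits the API

    def put(self, text, vec):
        k = self._key(text)
        self.mem[k] = vec
        with open(os.path.join(self.dir, k + ".json"), "w") as f:
            json.dump(vec, f)
```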

💬 Context Awareness (NEW)

Understands follow-up queries in conversation:

  • "show more" → expands previous results
  • "same for USA" → applies new filter to previous query
  • "top 5" → limits previous results
  • "those products" → resolves references to previous data

🛡️ Robust Validation (NEW)

  • SQLGlot-powered: AST-based parsing ensures only valid SQL is executed
  • Safety checks: Blocks SELECT * with no FROM clause, implicit cross-joins (ON TRUE), and joins missing their conditions
  • Fallback mechanism: Gracefully degrades if strict validation fails

📊 Schema Enrichment (NEW)

  • Semantic Context: Columns have descriptions (e.g., "status: 1=New")
  • Dynamic Sampling: Injects real valid values into prompt (e.g., country: 'USA', 'Germany')
  • Reduced Hallucinations: LLM knows exactly which values exist in the DB
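Dynamic sampling can be as simple as pulling a few DISTINCT values per column and formatting them into the prompt. The helpers below are a hypothetical sketch (table/column names come from a trusted schema spec, never from user input, since identifiers can't be parameterized):

```python
import sqlite3

def sample_column_values(conn, table, column, limit=5):
    """Pull a few real values so the LLM only sees values that exist."""
    rows = conn.execute(
        f"SELECT DISTINCT {column} FROM {table} LIMIT {limit}"
    ).fetchall()
    return [r[0] for r in rows]

def enrich_schema_prompt(conn, spec):
    """spec: {table: [columns]} -> text block to inject into the prompt."""
    lines = []
    for table, cols in spec.items():
        for col in cols:
            vals = sample_column_values(conn, table, col)
            lines.append(f"{table}.{col}: e.g. {', '.join(map(repr, vals))}")
    return "\n".join(lines)
```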

Quick Start

1. Setup Environment

```shell
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

2. Configure Credentials

Create a .env file:

```
WATSONX_API_KEY=your_api_key
WATSONX_URL=https://us-south.ml.cloud.ibm.com
WATSONX_PROJECT_ID=your_project_id
DATABASE_PATH=./data/database.db
```

3. Initialize Database

```shell
python scripts/init_database.py
```

4. Run the App

```shell
streamlit run app.py
```

Project Structure

```
├── app.py                    # Main Streamlit UI (entry point)
├── watsonx_client.py         # Direct SQL mode with learning
├── langchain_agent.py        # LangChain mode with learning
├── beeai_agent.py            # BeeAI mode with learning
├── self_improving_agent.py   # Full self-improving agent
├── learning_store.py         # Embedding-based pattern storage
├── query_classifier.py       # Hybrid query classification
├── semantic_classifier.py    # Embedding-based intent detection + caching
├── context_manager.py        # Conversation context awareness
├── product_mapper.py         # Product disambiguation
├── schema_loader.py          # Database schema utilities
├── logging_config.py         # Centralized logging
│
├── scripts/                  # Utility scripts
│   └── init_database.py      # Database initialization
│
├── tests/                    # Test suite
│   ├── test_comparison.py    # SQLGlot comparison tests
│   ├── test_self_improving.py    # 100-query test suite
│   ├── test_model_comparison.py  # LLM model benchmark
│   ├── test_mode_comparison.py   # Mode comparison tests
│   └── test_model_mode_comparison.py  # Combined comparison
│
├── docs/                     # Documentation
│   ├── COMPARISON_RESULTS.md # Detailed mode & model comparison
│   └── test_results_summary.md   # Test findings
│
├── data/                     # Data storage
│   ├── database.db           # SQLite database
│   ├── learning.db           # Learning store (patterns, errors)
│   ├── cache/                # Embedding cache
│   └── results/              # Test comparison results (JSON)
│
└── .cursor/rules/            # Cursor AI rules
```

Database Schema

Tables

  • customers: customer_id, customer_name, email, city, country, region
  • products: product_id, product_code, product_name, category, price
  • orders: order_id, customer_id, product_id, order_date, quantity, total_amount

Views

  • sales: Pre-joined view combining all tables for simplified queries
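The tables and view can be sketched in a few lines of Python + SQLite. Column lists here are trimmed for illustration; the repo's `scripts/init_database.py` defines the full schema:

```python
import sqlite3

# Simplified sketch of the schema, not the repo's exact DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT, country TEXT);
CREATE TABLE products  (product_id INTEGER PRIMARY KEY, product_name TEXT, price REAL);
CREATE TABLE orders    (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                        product_id INTEGER, quantity INTEGER, total_amount REAL);

-- the pre-joined view lets generated SQL skip the joins entirely
CREATE VIEW sales AS
SELECT o.order_id, c.customer_name, c.country,
       p.product_name, o.quantity, o.total_amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
JOIN products  p ON p.product_id  = o.product_id;
""")
```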

Self-Improving Architecture

```
User Query
    ↓
🔍 Query Classifier ──────────────────┐
    │                                  │ (general/help)
    │ (database)                       ↓
    ↓                              📝 Direct Response
🧠 Embedding Search
    │
    ↓
📚 Find Similar Patterns (top 3)
    │
    ↓
⚠️ Find Common Errors to Avoid
    │
    ↓
📝 Build Enhanced Prompt
    │ (includes few-shot examples + error avoidance)
    ↓
🤖 Generate SQL → Execute → Display Results
    │                           │
    ↓                           ↓
💾 Store Success Pattern    ⚠️ Store Error Pattern
    │
    ↓
👍/👎 User Feedback → Updates Pattern Rating
```
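One pass through this loop might look like the sketch below. Every keyword argument is a hypothetical callable standing in for the corresponding module (classifier, embedder, learning store, LLM client, SQL executor); none of these signatures come from the repo:

```python
def answer(query, *, classify, embed, store, llm, execute):
    """Classify, retrieve similar patterns, generate SQL, learn from the result."""
    if classify(query) != "database":
        return {"type": "direct"}                       # general/help path
    emb = embed(query)
    shots = "\n".join(f"Q: {p['q']}\nSQL: {p['sql']}"   # few-shot examples
                      for p in store.top_k(emb, k=3))
    sql = llm(f"{shots}\nQ: {query}\nSQL:")
    try:
        rows = execute(sql)
    except Exception as err:
        store.add_error(query, sql, str(err))           # learn from the failure
        raise
    store.add_success(query, sql, emb)                  # learn from the success
    return {"type": "sql", "sql": sql, "rows": rows}
```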

Test Results (100 Queries)

Results using Direct SQL Mode with Self-Improving Learning enabled.

| Metric | Value |
|--------|-------|
| Success Rate | 81% |
| Patterns Learned | 81 |
| Errors Logged | 20 |
| Embedding Utilization | 99% |

By Category

| Category | Success Rate |
|----------|--------------|
| Product Filtering | 100% |
| Basic Sales | 100% |
| Basic Products | 95% |
| Aggregation | 90% |
| Complex Queries | 80% |

Model + Mode Comparison

Best Combinations (After Prompt Optimization)

| Rank | Combination | Accuracy | Speed | Notes |
|------|-------------|----------|-------|-------|
| 🥇 | Granite 4 Small + LangChain | 100% | 604ms | Best overall |
| 🥇 | Granite 4 Small + BeeAI | 100% | ~650ms | After prompt fix |
| 🥇 | Mistral Small + BeeAI | 100% | 512ms | Fastest 100% combo |
| 🥈 | Llama 70B + Any Mode | 100% | 1200ms+ | Most accurate model |

Accuracy by Mode (7 test queries)

| Mode | Accuracy | Notes |
|------|----------|-------|
| LangChain | 100% | Retry logic compensates for errors |
| BeeAI | 100% | After prompt optimization |
| Direct SQL | 86% | Fast but struggles with complex |

LLM Model Benchmark (6 queries across modes)

| Model | Direct SQL | LangChain | BeeAI |
|-------|------------|-----------|-------|
| Llama 3.3 70B | 100% | 100% | 100% |
| Mistral Small 24B | 67% | 100% | 100% |
| Granite 4 Small | 83% | 100% | 100% |
| Granite 3.3 8B | 50% | 100% | 83% |

Key Finding

LangChain achieves 100% accuracy with ALL models because it handles retries and schema injection automatically.

Success by Query Type

| Model | Simple | Filter | Aggregation | Complex |
|-------|--------|--------|-------------|---------|
| Granite 4 Small | 100% | 100% | 100% | 75% |
| Mistral Small 24B | 100% | 100% | 75% | 75% |
| Llama 3.3 70B | 100% | 100% | 75% | 50% |

All Supported Models

  • ibm/granite-4-h-small - Recommended - Best accuracy
  • ibm/granite-3-3-8b-instruct - Fast but struggles with complex
  • meta-llama/llama-3-3-70b-instruct - Powerful but slower
  • meta-llama/llama-3-405b-instruct - Largest available
  • mistralai/mistral-small-3-1-24b-instruct-2503 - Fastest
  • mistralai/mistral-medium-2505 - Balanced

API Keys & Services

  • IBM WatsonX: LLM inference (Granite, Llama, Mistral models)
  • IBM Slate Embeddings: ibm/slate-125m-english-rtrvr-v2 for semantic search

Example Queries

Product Queries:

  • "show all laptop products"
  • "list products under $100"
  • "what wireless products do we have"

Sales Queries:

  • "show sales by country"
  • "total revenue from laptops in USA"
  • "top 5 products by sales"

Customer Queries:

  • "how many customers are in Europe"
  • "list customers who ordered smartphones"

General Questions (no SQL):

  • "what time is it"
  • "hello"
  • "what can you do"

Running Tests

```shell
# Run all tests with pytest
pytest

# Run specific test file
pytest tests/test_model_comparison.py -v
```

LLM Model Comparison

```shell
# Quick test (5 queries, 3 models)
python tests/test_model_comparison.py --quick

# Full benchmark (12 queries, 4 models)
python tests/test_model_comparison.py

# Specific models
python tests/test_model_comparison.py --models ibm/granite-4-h-small meta-llama/llama-3-3-70b-instruct
```

Self-Improving Test

```shell
# Run 100-query test
python tests/test_self_improving.py --mode direct_sql --queries 100

# Quick test (10 queries)
python tests/test_self_improving.py --quick
```

Documentation

| Document | Description |
|----------|-------------|
| ARCHITECTURE.md | System architecture, data flow, components |
| API.md | API reference for all modules |
| USAGE.md | Usage guide with examples |
| COMPARISON_RESULTS.md | Detailed mode & model comparison |
| test_results_summary.md | Test findings |
| CONTRIBUTING.md | Contribution guidelines |
| SHOWCASE.md | Self-Improving Agent Showcase |

License

MIT License
