246 changes: 246 additions & 0 deletions doc/multi_agent_database_discovery.md
@@ -0,0 +1,246 @@
# Multi-Agent Database Discovery System

## Overview

This document describes a multi-agent database discovery system built on Claude Code's autonomous agent capabilities. Four specialized subagents share their findings through the MCP (Model Context Protocol) catalog and together produce a comprehensive analysis of a database.

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                      Main Agent (Orchestrator)                      │
│  - Launches 4 specialized subagents in parallel                     │
│  - Coordinates via MCP catalog                                      │
│  - Synthesizes final report                                         │
└────────────────┬────────────────────────────────────────────────────┘
    ┌────────────┼────────────┬────────────┬────────────┐
    │            │            │            │            │
    ▼            ▼            ▼            ▼            ▼
┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐
│Struct. │   │Statist.│   │Semantic│   │Query   │   │  MCP   │
│ Agent  │   │ Agent  │   │ Agent  │   │ Agent  │   │Catalog │
└────────┘   └────────┘   └────────┘   └────────┘   └────────┘
    │            │            │            │            │
    └────────────┴────────────┴────────────┴────────────┘
                 ▼                         ▼
            ┌─────────┐             ┌─────────────┐
            │ Database│             │   Catalog   │
            │ (testdb)│             │ (Shared Mem)│
            └─────────┘             └─────────────┘
```
Comment on lines +9 to +32
medium

The architecture diagram is a bit confusing regarding the catalog. It shows an 'MCP Catalog' box at the same level as the agents, and also a 'Catalog (Shared Mem)' box at the bottom that the agents interact with. This suggests there might be two different catalogs. Could you clarify if these are the same entity and perhaps simplify the diagram to show a single, shared catalog that all agents communicate with via MCP?


## The Four Discovery Agents

### 1. Structural Agent
**Mission**: Map tables, relationships, indexes, and constraints

**Responsibilities**:
- Complete ERD documentation
- Table schema analysis (columns, types, constraints)
- Foreign key relationship mapping
- Index inventory and assessment
- Architectural pattern identification

**Catalog Entries**: `structural_discovery`

**Key Deliverables**:
- Entity Relationship Diagram
- Complete table definitions
- Index inventory with recommendations
- Relationship cardinality mapping

### 2. Statistical Agent
**Mission**: Profile data distributions, patterns, and anomalies

**Responsibilities**:
- Table row counts and cardinality analysis
- Data distribution profiling
- Anomaly detection (duplicates, outliers)
- Statistical summaries (min/max/avg/stddev)
- Business metrics calculation

**Catalog Entries**: `statistical_discovery`

**Key Deliverables**:
- Data quality score
- Duplicate detection reports
- Statistical distributions
- True vs inflated metrics

### 3. Semantic Agent
**Mission**: Infer business domain and entity types

**Responsibilities**:
- Business domain identification
- Entity type classification (master vs transactional)
- Business rule discovery
- Entity lifecycle analysis
- State machine identification

**Catalog Entries**: `semantic_discovery`

**Key Deliverables**:
- Complete domain model
- Business rules documentation
- Entity lifecycle definitions
- Missing capabilities identification

### 4. Query Agent
**Mission**: Analyze access patterns and optimization opportunities

**Responsibilities**:
- Query pattern identification
- Index usage analysis
- Performance bottleneck detection
- N+1 query risk assessment
- Optimization recommendations

**Catalog Entries**: `query_discovery`

**Key Deliverables**:
- Access pattern analysis
- Index recommendations (prioritized)
- Query optimization strategies
- EXPLAIN analysis results

## Discovery Process

### Round Structure

Each agent runs 4 rounds of analysis:

#### Round 1: Blind Exploration
- Initial schema/data analysis
- First observations cataloged
- Initial hypotheses formed

#### Round 2: Pattern Recognition
- Read other agents' findings from catalog
- Identify patterns and anomalies
- Form and test hypotheses

#### Round 3: Hypothesis Testing
- Validate business rules against actual data
- Cross-reference findings with other agents
- Confirm or reject hypotheses

#### Round 4: Final Synthesis
- Compile comprehensive findings
- Generate actionable recommendations
- Create final mission summary
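
The round structure above can be sketched as a simple loop. This is only an illustration under stated assumptions: the `catalog` dictionary is a hypothetical in-memory stand-in for the shared MCP catalog, `run_round` is an invented helper, the agents run in lockstep rather than truly in parallel, and peers are read in every round from the second onward.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the shared MCP catalog:
# catalog[kind][key] -> finding
catalog = defaultdict(dict)

AGENTS = ["structural", "statistical", "semantic", "query"]
ROUNDS = [
    "blind_exploration",
    "pattern_recognition",
    "hypothesis_testing",
    "final_synthesis",
]


def run_round(agent, round_no):
    """Record one finding for this round; from round 2 on, read peers first."""
    own_kind = f"{agent}_discovery"
    peers_seen = []
    if round_no >= 2:
        # Assumption: later rounds begin by listing the other agents' entries.
        peers_seen = sorted(
            kind for kind in catalog if kind != own_kind and catalog[kind]
        )
    catalog[own_kind][f"round_{round_no}"] = {
        "phase": ROUNDS[round_no - 1],
        "peers_seen": peers_seen,
    }


# Lockstep schedule: every agent finishes round N before round N + 1 starts.
for round_no in range(1, len(ROUNDS) + 1):
    for agent in AGENTS:
        run_round(agent, round_no)
```

In this sketch round 1 is "blind" because no agent reads the catalog, while every later round sees the three peer kinds. A real run would replace the dictionary with `catalog_upsert` and `catalog_list` calls.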

### Catalog-Based Collaboration

```python
# Agent writes findings
catalog_upsert(
kind="structural_discovery",
key="table_customers",
document="...",
tags="structural,table,schema"
)

# Agent reads other agents' findings
findings = catalog_list(kind="statistical_discovery")
```
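
For experimenting with this pattern outside Claude Code, the two calls can be imitated with a small in-memory store. The sketch below is an assumption-laden stand-in, not the MCP server's implementation: `_store` is hypothetical, and the `{"total": ..., "results": [...]}` shape simply mirrors what `MySQL_Catalog::list` builds in this PR.

```python
import json

# Hypothetical in-memory store keyed by (kind, key); not the real MCP catalog.
_store = {}


def catalog_upsert(kind, key, document, tags=""):
    """Insert the entry identified by (kind, key), or overwrite it if present."""
    _store[(kind, key)] = {
        "kind": kind,
        "key": key,
        "document": document,
        "tags": tags,
    }


def catalog_list(kind):
    """Return the entries of one kind together with a total count."""
    results = [entry for (k, _), entry in sorted(_store.items()) if k == kind]
    return {"total": len(results), "results": results}


# One agent writes, another reads.
catalog_upsert(
    kind="statistical_discovery",
    key="table_customers",
    document=json.dumps({"row_count": 15, "distinct_rows": 5}),
    tags="statistical,table",
)
findings = catalog_list(kind="statistical_discovery")
```

Keying on `(kind, key)` is what makes the write an upsert: a second call with the same pair replaces the earlier document instead of adding a new entry.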

## Example Discovery Output

### Database: testdb (E-commerce Order Management)

#### True Statistics (After Deduplication)
| Metric | Current (with duplicates) | Actual (deduplicated) |
|--------|---------------------------|-----------------------|
| Customers | 15 | 5 |
| Products | 15 | 5 |
| Orders | 15 | 5 |
| Order Items | 27 | 9 |
| Revenue | $10,886.67 | $3,628.85 |
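
The row counts in the table imply a fixed duplication factor. As a rough check, the arithmetic can be written out as below; the `redundancy` helper is hypothetical and only the row counts from the table are used.

```python
def redundancy(raw_count, true_count):
    """Return (duplication factor, share of rows that are redundant copies)."""
    factor = raw_count / true_count
    redundant_share = (raw_count - true_count) / raw_count
    return factor, redundant_share


# (raw count, deduplicated count) pairs taken from the table above.
row_counts = {
    "customers": (15, 5),
    "products": (15, 5),
    "orders": (15, 5),
    "order_items": (27, 9),
}

for name, (raw, true) in row_counts.items():
    factor, share = redundancy(raw, true)
    print(f"{name}: stored {factor:.0f}x, {share:.0%} redundant")
```

Every table comes out at a factor of 3, which means two of every three rows (about 67%) are redundant copies.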

#### Critical Findings
1. **Data Quality**: 5/100 (Catastrophic) - every record is stored three times, so roughly 67% of rows are redundant copies
2. **Missing Index**: orders.order_date (P0 critical)
3. **Missing Constraints**: No UNIQUE or FK constraints
4. **Business Domain**: E-commerce order management system
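
A duplicate check of the kind behind the first finding might look like the following. This is a minimal sketch under assumptions: it uses an in-memory SQLite database instead of `testdb`, and the `customers` columns and sample rows are invented for illustration. An agent would issue a comparable query through `run_sql_readonly`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema with no UNIQUE constraint, so duplicates can be inserted.
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")
rows = [(i, f"user{i % 5}", f"user{i % 5}@example.com") for i in range(15)]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

# Groups of rows that share the same business identity.
duplicates = conn.execute(
    """
    SELECT name, email, COUNT(*) AS copies
    FROM customers
    GROUP BY name, email
    HAVING COUNT(*) > 1
    ORDER BY name
    """
).fetchall()

# "True" row count once duplicates are collapsed.
distinct_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT name, email FROM customers)"
).fetchone()[0]
```

Which columns define a duplicate is a judgment call. Here `name` and `email` are assumed to identify a customer, which is also the kind of column set a UNIQUE constraint would cover.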

## Launching the Discovery System

```python
# In Claude Code, launch 4 agents in parallel:
Task(
description="Structural Discovery",
prompt=STRUCTURAL_AGENT_PROMPT,
subagent_type="general-purpose"
)

Task(
description="Statistical Discovery",
prompt=STATISTICAL_AGENT_PROMPT,
subagent_type="general-purpose"
)

Task(
description="Semantic Discovery",
prompt=SEMANTIC_AGENT_PROMPT,
subagent_type="general-purpose"
)

Task(
description="Query Discovery",
prompt=QUERY_AGENT_PROMPT,
subagent_type="general-purpose"
)
```

## MCP Tools Used

The agents use these MCP tools for database analysis:

- `list_schemas` - List all databases
- `list_tables` - List tables in a schema
- `describe_table` - Get table schema
- `sample_rows` - Get sample data from table
- `column_profile` - Get column statistics
- `run_sql_readonly` - Execute read-only queries
- `catalog_upsert` - Store findings in catalog
- `catalog_list` / `catalog_get` - Retrieve findings from catalog

## Benefits of Multi-Agent Approach

1. **Parallel Execution**: All 4 agents run simultaneously
2. **Specialized Expertise**: Each agent focuses on its domain
3. **Cross-Validation**: Agents validate each other's findings
4. **Comprehensive Coverage**: All aspects of the database are analyzed
5. **Knowledge Synthesis**: Final report combines all perspectives

## Output Format

The system produces:

1. **40+ Catalog Entries** - Detailed findings organized by agent
2. **Comprehensive Report** - Executive summary with:
- Structure & Schema (ERD, table definitions)
- Business Domain (entity model, business rules)
- Key Insights (data quality, performance)
- Data Quality Assessment (score, recommendations)

## Future Enhancements

- [ ] Additional specialized agents (Security, Performance, Compliance)
- [ ] Automated remediation scripts
- [ ] Continuous monitoring mode
- [ ] Integration with CI/CD pipelines
- [ ] Web-based dashboard for findings

## Related Files

- `simple_discovery.py` - Simplified demo of multi-agent pattern
- `mcp_catalog.db` - Catalog database for storing findings

## References

- Claude Code Task Tool Documentation
- MCP (Model Context Protocol) Specification
- ProxySQL MCP Server Implementation
85 changes: 53 additions & 32 deletions lib/MySQL_Catalog.cpp
@@ -3,6 +3,7 @@
 #include "proxysql.h"
 #include <sstream>
 #include <algorithm>
+#include "../deps/json/json.hpp"

 MySQL_Catalog::MySQL_Catalog(const std::string& path)
     : db(NULL), db_path(path)
@@ -220,31 +221,40 @@ std::string MySQL_Catalog::search(
         return "[]";
     }

-    // Build JSON result
-    std::ostringstream json;
-    json << "[";
-    bool first = true;
+    // Build JSON result using nlohmann::json
+    nlohmann::json results = nlohmann::json::array();

     if (resultset) {
         for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin();
             it != resultset->rows.end(); ++it) {
Comment on lines 228 to 229

medium

For improved readability and to use a more modern C++ idiom, you could replace the traditional iterator-based for loop with a range-based for loop.

for (SQLite3_row* row : resultset->rows) {

             SQLite3_row* row = *it;
-            if (!first) json << ",";
-            first = false;

-            json << "{"
-                << "\"kind\":\"" << (row->fields[0] ? row->fields[0] : "") << "\","
-                << "\"key\":\"" << (row->fields[1] ? row->fields[1] : "") << "\","
-                << "\"document\":" << (row->fields[2] ? row->fields[2] : "null") << ","
-                << "\"tags\":\"" << (row->fields[3] ? row->fields[3] : "") << "\","
-                << "\"links\":\"" << (row->fields[4] ? row->fields[4] : "") << "\""
-                << "}";
+            nlohmann::json entry;
+            entry["kind"] = std::string(row->fields[0] ? row->fields[0] : "");
+            entry["key"] = std::string(row->fields[1] ? row->fields[1] : "");
+
+            // Parse the stored JSON document - nlohmann::json handles escaping
+            const char* doc_str = row->fields[2];
+            if (doc_str) {
+                try {
+                    entry["document"] = nlohmann::json::parse(doc_str);
+                } catch (const nlohmann::json::parse_error& e) {
+                    // If document is not valid JSON, store as string
+                    entry["document"] = std::string(doc_str);
+                }
+            } else {
+                entry["document"] = nullptr;
+            }
+
+            entry["tags"] = std::string(row->fields[3] ? row->fields[3] : "");
+            entry["links"] = std::string(row->fields[4] ? row->fields[4] : "");
+
+            results.push_back(entry);
         }
         delete resultset;
     }

-    json << "]";
-    return json.str();
+    return results.dump();
 }

std::string MySQL_Catalog::list(
@@ -282,31 +292,42 @@ std::string MySQL_Catalog::list(
     resultset = NULL;
     db->execute_statement(sql.str().c_str(), &error, &cols, &affected, &resultset);

-    // Build JSON result with total count
-    std::ostringstream json;
-    json << "{\"total\":" << total << ",\"results\":[";
+    // Build JSON result using nlohmann::json
+    nlohmann::json result;
+    result["total"] = total;
+    nlohmann::json results = nlohmann::json::array();

-    bool first = true;
     if (resultset) {
         for (std::vector<SQLite3_row*>::iterator it = resultset->rows.begin();
             it != resultset->rows.end(); ++it) {
Comment on lines 301 to 302

medium

Similar to the search method, this traditional for loop can be simplified by using a more modern C++ range-based for loop. This enhances code readability and maintainability.

for (SQLite3_row* row : resultset->rows) {

             SQLite3_row* row = *it;
-            if (!first) json << ",";
-            first = false;

-            json << "{"
-                << "\"kind\":\"" << (row->fields[0] ? row->fields[0] : "") << "\","
-                << "\"key\":\"" << (row->fields[1] ? row->fields[1] : "") << "\","
-                << "\"document\":" << (row->fields[2] ? row->fields[2] : "null") << ","
-                << "\"tags\":\"" << (row->fields[3] ? row->fields[3] : "") << "\","
-                << "\"links\":\"" << (row->fields[4] ? row->fields[4] : "") << "\""
-                << "}";
+            nlohmann::json entry;
+            entry["kind"] = std::string(row->fields[0] ? row->fields[0] : "");
+            entry["key"] = std::string(row->fields[1] ? row->fields[1] : "");
+
+            // Parse the stored JSON document
+            const char* doc_str = row->fields[2];
+            if (doc_str) {
+                try {
+                    entry["document"] = nlohmann::json::parse(doc_str);
+                } catch (const nlohmann::json::parse_error& e) {
+                    entry["document"] = std::string(doc_str);
+                }
+            } else {
+                entry["document"] = nullptr;
+            }
+
+            entry["tags"] = std::string(row->fields[3] ? row->fields[3] : "");
+            entry["links"] = std::string(row->fields[4] ? row->fields[4] : "");
+
+            results.push_back(entry);
         }
         delete resultset;
     }

-    json << "]}";
-    return json.str();
+    result["results"] = results;
+    return result.dump();
 }

 int MySQL_Catalog::merge(
8 changes: 7 additions & 1 deletion lib/MySQL_Tool_Handler.cpp
@@ -910,7 +910,13 @@ std::string MySQL_Tool_Handler::catalog_get(const std::string& kind, const std::
     if (rc == 0) {
         result["kind"] = kind;
         result["key"] = key;
-        result["document"] = json::parse(document);
+        // Parse as raw JSON value to preserve nested structure
+        try {
+            result["document"] = json::parse(document);
+        } catch (const json::parse_error& e) {
+            // If not valid JSON, store as string
+            result["document"] = document;
+        }
     } else {
         result["error"] = "Entry not found";
     }
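
The two behaviours this PR changes, escaping and parse-with-fallback, can be reproduced in a few lines of Python for quick experimentation. This is only an analogy to the C++ above, using Python's standard `json` module. `build_by_concatenation` and `build_entry` are hypothetical names, and the entry is reduced to three fields.

```python
import json


def build_by_concatenation(kind, key):
    # Mirrors the removed approach: values are pasted between quotes unescaped.
    return '{"kind":"' + kind + '","key":"' + key + '"}'


def build_entry(kind, key, document):
    """Serialize an entry, embedding `document` as JSON when it parses."""
    try:
        parsed = json.loads(document)
    except json.JSONDecodeError:
        # Not valid JSON: keep the raw text as a plain string value.
        parsed = document
    return json.dumps({"kind": kind, "key": key, "document": parsed})


# A key containing a double quote breaks the hand-built output...
broken = build_by_concatenation("structural_discovery", 'table "customers"')
# ...while the serializer escapes it and still embeds nested documents.
safe = build_entry("structural_discovery", 'table "customers"', '{"columns": 3}')
fallback = build_entry("structural_discovery", "notes", "free-form text")
```

Here `broken` is not parseable JSON, `safe` round-trips with the document as a nested object, and `fallback` carries the document as a string.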