Commit 0ecbb84

Add design spec for executor engine (sub-project 7)
1 parent 894c437 commit 0ecbb84

1 file changed: 363 additions, 0 deletions.

# SQL Engine Executor — Design Specification

## Overview

The executor runs logical plans against data sources and produces result sets. It uses the Volcano (iterator) model: each plan node becomes an operator with `open`/`next`/`close` methods, and rows are pulled on demand, with the root operator requesting rows from its children down to the leaves.

This is sub-project 7 of the query engine. It depends on the type system, expression evaluator, catalog, row format, and logical plan.

### Goals

- **Volcano iterator model**: `open`/`next`/`close` operator interface
- **9 operator types**: Scan, Filter, Project, Join (nested loop), Aggregate, Sort, Limit, Distinct, SetOp
- **DataSource abstraction**: composable data input (in-memory, cached results, remote)
- **PlanExecutor**: builds an operator tree from a logical plan, executes it, and returns a ResultSet
- **End-to-end milestone**: SQL string → parse → plan → execute → result rows

### Constraints

- C++17, arena-compatible where possible
- Operators own their state; the arena is used for row allocation
- Materialized ResultSet (all rows in memory); streaming is deferred
- Uses the expression evaluator for WHERE, HAVING, and Project expressions
- Uses the catalog for column resolution

### Non-Goals

- Optimizer (separate sub-project, added after the executor works)
- INSERT/UPDATE/DELETE execution (needs writable storage)
- Subquery execution
- Index-based scans
- Streaming/cursor output
- Vectorized execution

---

## Operator Interface

```cpp
class Operator {
public:
    virtual ~Operator() = default;
    virtual void open() = 0;
    virtual bool next(Row& out) = 0;  // returns false when exhausted
    virtual void close() = 0;
};
```

Execution flow, where `root` points at the root of the operator tree (`operator` itself is a reserved word in C++ and cannot be used as a variable name):

```cpp
root->open();
Row row;
while (root->next(row)) {
    // process row
}
root->close();
```

---

60+
## DataSource Interface
61+
62+
```cpp
63+
class DataSource {
64+
public:
65+
virtual ~DataSource() = default;
66+
virtual const TableInfo* table_info() const = 0;
67+
virtual void open() = 0;
68+
virtual bool next(Row& out) = 0;
69+
virtual void close() = 0;
70+
};
71+
```
72+
73+
### InMemoryDataSource
74+
75+
Reference implementation: yields rows from a `std::vector<Row>`.
76+
77+
```cpp
78+
class InMemoryDataSource : public DataSource {
79+
public:
80+
InMemoryDataSource(const TableInfo* table, std::vector<Row> rows);
81+
const TableInfo* table_info() const override;
82+
void open() override;
83+
bool next(Row& out) override;
84+
void close() override;
85+
private:
86+
const TableInfo* table_;
87+
std::vector<Row> rows_;
88+
size_t cursor_ = 0;
89+
};
90+
```
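
The member functions declared above could be implemented roughly as follows. This is a minimal sketch, not the spec's implementation: `TableInfo` and `Row` are simplified stand-ins for the catalog and row-format types, the choice to rewind the cursor in `open()` is an assumption made so that a source can be scanned more than once, and `drain` is a hypothetical helper used only to exercise the class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-ins; the real types come from the catalog and row-format sub-projects.
struct TableInfo { const char* name; };
using Row = std::vector<int64_t>;

class DataSource {
public:
    virtual ~DataSource() = default;
    virtual const TableInfo* table_info() const = 0;
    virtual void open() = 0;
    virtual bool next(Row& out) = 0;
    virtual void close() = 0;
};

class InMemoryDataSource : public DataSource {
public:
    InMemoryDataSource(const TableInfo* table, std::vector<Row> rows)
        : table_(table), rows_(std::move(rows)) {}

    const TableInfo* table_info() const override { return table_; }

    // Assumption: open() rewinds, so the same source can be scanned again.
    void open() override { cursor_ = 0; }

    // Copies the next row into `out`; returns false once every row has been yielded.
    bool next(Row& out) override {
        if (cursor_ >= rows_.size()) return false;
        out = rows_[cursor_++];
        return true;
    }

    // After close(), next() reports exhaustion until open() is called again.
    void close() override { cursor_ = rows_.size(); }

private:
    const TableInfo* table_;
    std::vector<Row> rows_;
    std::size_t cursor_ = 0;
};

// Hypothetical helper: runs the open/next/close protocol and collects every row.
std::vector<Row> drain(DataSource& src) {
    std::vector<Row> out;
    Row row;
    src.open();
    while (src.next(row)) out.push_back(row);
    src.close();
    return out;
}
```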
91+
92+
---
93+
94+
## Operator Types
95+
96+
### ScanOperator
97+
98+
Wraps a DataSource. Yields all rows from the source.
99+
100+
```cpp
101+
class ScanOperator : public Operator {
102+
DataSource* source_;
103+
};
104+
```
105+
106+
### FilterOperator
107+
108+
Evaluates a WHERE/HAVING expression for each input row. Skips rows that don't match.
109+
110+
```cpp
111+
class FilterOperator : public Operator {
112+
Operator* child_;
113+
const AstNode* expr_; // WHERE expression AST
114+
// + evaluator context (functions, resolver, arena)
115+
};
116+
```
117+
118+
`next()` calls `child_->next()` in a loop, evaluates `expr_` for each row. Returns the first row where the expression evaluates to a truthy value (not NULL, not FALSE).
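
The filtering loop described above might look like the following sketch. It makes several assumptions: `Row` is a vector of optional integers with `std::nullopt` standing in for SQL NULL, the expression AST and evaluator are replaced by a hypothetical `Predicate` callback that returns true, false, or NULL, and `VectorOperator` is a hypothetical child operator used only to feed rows in.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

using Row = std::vector<std::optional<int64_t>>;  // hypothetical row; nullopt models NULL

class Operator {
public:
    virtual ~Operator() = default;
    virtual void open() = 0;
    virtual bool next(Row& out) = 0;
    virtual void close() = 0;
};

// Hypothetical child operator that replays a fixed vector of rows.
class VectorOperator : public Operator {
public:
    explicit VectorOperator(std::vector<Row> rows) : rows_(std::move(rows)) {}
    void open() override { pos_ = 0; }
    bool next(Row& out) override {
        if (pos_ >= rows_.size()) return false;
        out = rows_[pos_++];
        return true;
    }
    void close() override {}
private:
    std::vector<Row> rows_;
    std::size_t pos_ = 0;
};

// Stand-in for evaluating expr_: true, false, or nullopt for NULL.
using Predicate = std::function<std::optional<bool>(const Row&)>;

class FilterOperator : public Operator {
public:
    FilterOperator(Operator* child, Predicate pred)
        : child_(child), pred_(std::move(pred)) {}
    void open() override { child_->open(); }
    bool next(Row& out) override {
        while (child_->next(out)) {
            std::optional<bool> result = pred_(out);
            if (result.has_value() && *result) return true;  // NULL and FALSE are both skipped
        }
        return false;  // child exhausted
    }
    void close() override { child_->close(); }
private:
    Operator* child_;
    Predicate pred_;
};
```

With a predicate equivalent to `age > 18`, a row whose age is NULL yields NULL from the predicate and is dropped, the same as a row that fails the comparison.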
119+
120+
### ProjectOperator
121+
122+
Evaluates a list of expressions, produces a new row with computed columns.
123+
124+
```cpp
125+
class ProjectOperator : public Operator {
126+
Operator* child_; // null if no FROM (e.g., SELECT 1+2)
127+
const AstNode** exprs_; // expression list
128+
uint16_t expr_count_;
129+
};
130+
```
131+
132+
If `child_` is null, produces one row (evaluates expressions with no input row).
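
The no-FROM case needs some way to remember that the single row has already been emitted. The sketch below shows one possibility under stated assumptions: `Row` is a plain vector of integers, the expression list is modelled as hypothetical `Expr` callbacks that receive the input row (or `nullptr` when there is no FROM), and `emitted_constant_row_` is a flag that does not appear in the spec.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Row = std::vector<int64_t>;  // hypothetical row type

// Stand-in for an expression AST: receives the input row, or nullptr when there is no FROM.
using Expr = std::function<int64_t(const Row*)>;

class Operator {
public:
    virtual ~Operator() = default;
    virtual void open() = 0;
    virtual bool next(Row& out) = 0;
    virtual void close() = 0;
};

class ProjectOperator : public Operator {
public:
    ProjectOperator(Operator* child, std::vector<Expr> exprs)
        : child_(child), exprs_(std::move(exprs)) {}

    void open() override {
        emitted_constant_row_ = false;
        if (child_) child_->open();
    }

    bool next(Row& out) override {
        Row input;
        const Row* in = nullptr;
        if (child_) {
            if (!child_->next(input)) return false;   // child exhausted
            in = &input;
        } else {
            if (emitted_constant_row_) return false;  // no FROM: exactly one row
            emitted_constant_row_ = true;
        }
        out.clear();
        for (const Expr& e : exprs_) out.push_back(e(in));
        return true;
    }

    void close() override {
        if (child_) child_->close();
    }

private:
    Operator* child_;  // null if no FROM
    std::vector<Expr> exprs_;
    bool emitted_constant_row_ = false;  // hypothetical flag, not part of the spec
};
```

For `SELECT 1 + 2`, the first `next()` call produces a one-column row and the second returns `false`.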
133+
134+
### NestedLoopJoinOperator
135+
136+
For each row from the left child, scans all rows from the right child. Emits combined rows where the join condition matches.
137+
138+
```cpp
139+
class NestedLoopJoinOperator : public Operator {
140+
Operator* left_;
141+
Operator* right_;
142+
uint8_t join_type_; // INNER, LEFT, RIGHT, FULL, CROSS
143+
const AstNode* condition_; // ON expression (null for CROSS)
144+
};
145+
```
146+
147+
For LEFT JOIN: if no right row matches, emit left row + NULLs for right columns.
148+
For CROSS JOIN: no condition check, emit all combinations.
149+
150+
Right side is materialized on first `open()` (stored in a vector) since it's scanned multiple times.
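
The matching rules can be sketched as a plain function over two already materialized inputs. This is a simplification, not the iterator-style operator the spec calls for: it returns all output rows at once, it covers only INNER, LEFT, and CROSS, and the names `Value`, `JoinType`, `Condition`, and `nested_loop_join` are hypothetical. `std::nullopt` stands in for SQL NULL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

using Value = std::optional<int64_t>;  // nullopt models SQL NULL (assumption)
using Row = std::vector<Value>;

enum class JoinType { Inner, Left, Cross };  // RIGHT and FULL are not covered here

// Stand-in for evaluating the ON expression against a candidate row pair.
using Condition = std::function<bool(const Row& left, const Row& right)>;

std::vector<Row> nested_loop_join(const std::vector<Row>& left,
                                  const std::vector<Row>& right,  // materialized right side
                                  std::size_t right_width,        // number of right columns
                                  JoinType type,
                                  const Condition& on) {
    std::vector<Row> out;
    for (const Row& l : left) {
        bool matched = false;
        for (const Row& r : right) {
            // CROSS JOIN skips the condition check entirely.
            if (type != JoinType::Cross && !on(l, r)) continue;
            Row combined = l;
            combined.insert(combined.end(), r.begin(), r.end());
            out.push_back(std::move(combined));
            matched = true;
        }
        if (type == JoinType::Left && !matched) {
            Row padded = l;
            padded.resize(l.size() + right_width);  // new elements are empty optionals (NULL)
            out.push_back(std::move(padded));
        }
    }
    return out;
}
```

Joining two user rows against a single matching order row yields one row for INNER, two rows for LEFT (the second padded with NULLs), and two rows for CROSS.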
151+
152+
### AggregateOperator
153+
154+
Buffers all input rows, groups by key, computes aggregate functions.
155+
156+
```cpp
157+
class AggregateOperator : public Operator {
158+
Operator* child_;
159+
const AstNode** group_by_exprs_;
160+
uint16_t group_count_;
161+
const AstNode** agg_exprs_;
162+
uint16_t agg_count_;
163+
};
164+
```
165+
166+
On `open()`: consume all child rows, build groups (hash map keyed by group-by values).
167+
On `next()`: yield one row per group with computed aggregates.
168+
169+
**Aggregate state per group:**
170+
- COUNT: increment counter
171+
- SUM: accumulate value
172+
- AVG: accumulate sum + count
173+
- MIN/MAX: track extreme value
174+
175+
Detects aggregate function calls in the expression AST by checking for `NODE_FUNCTION_CALL` with names COUNT/SUM/AVG/MIN/MAX.
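
The per-group state listed above can be sketched with a hash map from group key to a small accumulator. The example assumes a single integer group-by column and a single integer value column; `AggState`, `InputRow`, and `aggregate` are hypothetical names, and NULL handling is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Running state for one group; covers COUNT, SUM, AVG, MIN, and MAX.
struct AggState {
    int64_t count = 0;
    int64_t sum = 0;
    int64_t min = 0;
    int64_t max = 0;

    void add(int64_t v) {
        if (count == 0) {
            min = v;  // first value seeds both extremes
            max = v;
        } else {
            min = std::min(min, v);
            max = std::max(max, v);
        }
        ++count;
        sum += v;
    }

    // AVG is derived from the accumulated sum and count; a group always has count > 0.
    double avg() const { return static_cast<double>(sum) / static_cast<double>(count); }
};

struct InputRow {  // hypothetical two-column input row
    int64_t group_key;
    int64_t value;
};

// Equivalent of open(): consume every input row and build the groups.
std::unordered_map<int64_t, AggState> aggregate(const std::vector<InputRow>& rows) {
    std::unordered_map<int64_t, AggState> groups;
    for (const InputRow& r : rows) {
        groups[r.group_key].add(r.value);  // operator[] creates the group on first sight
    }
    return groups;
}
```

A subsequent `next()` would then walk the map and emit one output row per entry.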
176+
177+
### SortOperator
178+
179+
Buffers all input, sorts by key(s), yields in order.
180+
181+
```cpp
182+
class SortOperator : public Operator {
183+
Operator* child_;
184+
const AstNode** keys_;
185+
uint8_t* directions_; // 0=ASC, 1=DESC
186+
uint16_t key_count_;
187+
};
188+
```
189+
190+
Uses `std::sort` with a custom comparator that evaluates key expressions.
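
A multi-key comparator with per-key direction might be sketched as follows. The sketch assumes rows of plain integers and replaces key expressions with column indexes (`SortKey` and `sort_rows` are hypothetical names). It uses `std::stable_sort` rather than `std::sort` so that rows with equal keys keep their input order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Row = std::vector<int64_t>;  // hypothetical row type

struct SortKey {
    std::size_t column;  // stands in for a key expression
    uint8_t direction;   // 0=ASC, 1=DESC
};

void sort_rows(std::vector<Row>& rows, const std::vector<SortKey>& keys) {
    std::stable_sort(rows.begin(), rows.end(),
                     [&keys](const Row& a, const Row& b) {
        for (const SortKey& k : keys) {
            if (a[k.column] == b[k.column]) continue;  // tie: decide on the next key
            bool less = a[k.column] < b[k.column];
            return k.direction == 0 ? less : !less;
        }
        return false;  // equal on every key: neither row sorts before the other
    });
}
```

Returning `false` when all keys compare equal keeps the comparator a strict weak ordering, which both `std::sort` and `std::stable_sort` require.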
191+
192+
### LimitOperator
193+
194+
Counts rows, skips offset, stops at count.
195+
196+
```cpp
197+
class LimitOperator : public Operator {
198+
Operator* child_;
199+
int64_t count_;
200+
int64_t offset_;
201+
int64_t emitted_ = 0;
202+
int64_t skipped_ = 0;
203+
};
204+
```
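
Using the counters declared above, `next()` could be written along these lines. This is a sketch under assumptions: `Row` is a plain vector of integers, and `VectorOperator` is a hypothetical child that replays a fixed set of rows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Row = std::vector<int64_t>;  // hypothetical row type

class Operator {
public:
    virtual ~Operator() = default;
    virtual void open() = 0;
    virtual bool next(Row& out) = 0;
    virtual void close() = 0;
};

// Hypothetical child operator that replays a fixed vector of rows.
class VectorOperator : public Operator {
public:
    explicit VectorOperator(std::vector<Row> rows) : rows_(std::move(rows)) {}
    void open() override { pos_ = 0; }
    bool next(Row& out) override {
        if (pos_ >= rows_.size()) return false;
        out = rows_[pos_++];
        return true;
    }
    void close() override {}
private:
    std::vector<Row> rows_;
    std::size_t pos_ = 0;
};

class LimitOperator : public Operator {
public:
    LimitOperator(Operator* child, int64_t count, int64_t offset)
        : child_(child), count_(count), offset_(offset) {}

    void open() override {
        emitted_ = 0;
        skipped_ = 0;
        child_->open();
    }

    bool next(Row& out) override {
        if (emitted_ >= count_) return false;      // limit reached: stop pulling from the child
        while (skipped_ < offset_) {
            if (!child_->next(out)) return false;  // offset lies beyond the data
            ++skipped_;
        }
        if (!child_->next(out)) return false;
        ++emitted_;
        return true;
    }

    void close() override { child_->close(); }

private:
    Operator* child_;
    int64_t count_;
    int64_t offset_;
    int64_t emitted_ = 0;
    int64_t skipped_ = 0;
};
```

Checking `emitted_` before touching the child means a satisfied limit never pulls further rows, which is where the pull-based model saves work.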
205+
206+
### DistinctOperator
207+
208+
Tracks seen row values. Skips duplicates.
209+
210+
```cpp
211+
class DistinctOperator : public Operator {
212+
Operator* child_;
213+
// hash set of seen row value combinations
214+
};
215+
```
216+
217+
Uses a hash set keyed by a hash of all column values in the row.
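
One way to realize the hash set is `std::unordered_set` with a custom row hash, sketched below for rows of plain integers. `RowHash` and `distinct` are hypothetical names. The set stores the rows themselves and compares them with `operator==`, so two different rows that happen to share a hash value are still treated as distinct; storing only the hash would not give that guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

using Row = std::vector<int64_t>;  // hypothetical row type

// Combines the hashes of all column values into one hash for the row.
struct RowHash {
    std::size_t operator()(const Row& row) const {
        std::size_t h = 0;
        for (int64_t v : row) {
            // Widely used hash-combine mixing step.
            h ^= std::hash<int64_t>{}(v) + 0x9e3779b9u + (h << 6) + (h >> 2);
        }
        return h;
    }
};

// Returns the input rows with duplicates removed, keeping first occurrences in order.
std::vector<Row> distinct(const std::vector<Row>& input) {
    std::unordered_set<Row, RowHash> seen;
    std::vector<Row> out;
    for (const Row& row : input) {
        // insert().second is true only when the row was not already in the set.
        if (seen.insert(row).second) out.push_back(row);
    }
    return out;
}
```

An iterator-style `next()` would apply the same `insert` check to each row pulled from the child and return the first row that is new.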
218+
219+
### SetOpOperator
220+
221+
UNION: yield all rows from left, then all from right.
222+
UNION ALL: same but skip deduplication.
223+
INTERSECT: yield rows that appear in both (hash-based).
224+
EXCEPT: yield rows from left that don't appear in right (hash-based).
225+
226+
```cpp
227+
class SetOpOperator : public Operator {
228+
Operator* left_;
229+
Operator* right_;
230+
uint8_t op_; // UNION=0, INTERSECT=1, EXCEPT=2
231+
bool all_;
232+
};
233+
```
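
The four behaviors can be sketched as one function over materialized inputs. The sketch assumes rows of plain integers; `SetOp`, `RowHash`, and `apply_set_op` are hypothetical names. It honors `all` only for UNION: INTERSECT ALL and EXCEPT ALL have multiset semantics that are not modelled here, so those two always return deduplicated output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

using Row = std::vector<int64_t>;  // hypothetical row type

struct RowHash {
    std::size_t operator()(const Row& row) const {
        std::size_t h = 0;
        for (int64_t v : row) {
            h ^= std::hash<int64_t>{}(v) + 0x9e3779b9u + (h << 6) + (h >> 2);
        }
        return h;
    }
};

enum class SetOp : uint8_t { Union = 0, Intersect = 1, Except = 2 };

std::vector<Row> apply_set_op(const std::vector<Row>& left,
                              const std::vector<Row>& right,
                              SetOp op, bool all) {
    std::vector<Row> out;
    if (op == SetOp::Union && all) {
        // UNION ALL: plain concatenation, duplicates kept.
        out = left;
        out.insert(out.end(), right.begin(), right.end());
        return out;
    }
    std::unordered_set<Row, RowHash> right_set(right.begin(), right.end());
    std::unordered_set<Row, RowHash> emitted;  // deduplicates the output
    auto emit = [&](const Row& r) {
        if (emitted.insert(r).second) out.push_back(r);
    };
    for (const Row& r : left) {
        bool in_right = right_set.count(r) > 0;
        if (op == SetOp::Union ||
            (op == SetOp::Intersect && in_right) ||
            (op == SetOp::Except && !in_right)) {
            emit(r);
        }
    }
    if (op == SetOp::Union) {
        for (const Row& r : right) emit(r);  // rows that only appear on the right
    }
    return out;
}
```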
234+
235+
---
236+
237+
## PlanExecutor
238+
239+
Converts a logical plan tree into an operator tree and executes it.
240+
241+
```cpp
242+
template <Dialect D>
243+
class PlanExecutor {
244+
public:
245+
PlanExecutor(FunctionRegistry<D>& functions,
246+
const Catalog& catalog,
247+
Arena& arena);
248+
249+
void add_data_source(const char* table_name, DataSource* source);
250+
251+
ResultSet execute(PlanNode* plan);
252+
253+
private:
254+
Operator* build_operator(PlanNode* node);
255+
256+
FunctionRegistry<D>& functions_;
257+
const Catalog& catalog_;
258+
Arena& arena_;
259+
std::unordered_map<std::string, DataSource*> sources_;
260+
};
261+
```
262+
263+
`build_operator` recursively walks the plan tree:
264+
- SCAN → ScanOperator(find data source by table name)
265+
- FILTER → FilterOperator(build child, expr)
266+
- PROJECT → ProjectOperator(build child, exprs)
267+
- JOIN → NestedLoopJoinOperator(build left, build right, condition)
268+
- AGGREGATE → AggregateOperator(build child, group_by, agg_exprs)
269+
- SORT → SortOperator(build child, keys, directions)
270+
- LIMIT → LimitOperator(build child, count, offset)
271+
- DISTINCT → DistinctOperator(build child)
272+
- SET_OP → SetOpOperator(build left, build right, op, all)
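
The recursive walk can be illustrated with a heavily reduced executor that knows only SCAN and LIMIT. Everything here is an assumption made for the sketch: the `PlanNode` layout, the `MiniExecutor` name, data sources modelled as pointers to row vectors, a LIMIT without offset, and ownership through `std::unique_ptr` rather than the arena the spec describes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::vector<int64_t>;  // hypothetical row type

class Operator {
public:
    virtual ~Operator() = default;
    virtual void open() = 0;
    virtual bool next(Row& out) = 0;
    virtual void close() = 0;
};

enum class PlanType { Scan, Limit };  // hypothetical: only two node kinds
struct PlanNode {
    PlanType type;
    std::string table;          // used by Scan
    int64_t count = 0;          // used by Limit
    PlanNode* child = nullptr;  // used by Limit
};

class ScanOperator : public Operator {
public:
    explicit ScanOperator(const std::vector<Row>* rows) : rows_(rows) {}
    void open() override { pos_ = 0; }
    bool next(Row& out) override {
        if (pos_ >= rows_->size()) return false;
        out = (*rows_)[pos_++];
        return true;
    }
    void close() override {}
private:
    const std::vector<Row>* rows_;
    std::size_t pos_ = 0;
};

class LimitOperator : public Operator {
public:
    LimitOperator(Operator* child, int64_t count) : child_(child), count_(count) {}
    void open() override { emitted_ = 0; child_->open(); }
    bool next(Row& out) override {
        if (emitted_ >= count_ || !child_->next(out)) return false;
        ++emitted_;
        return true;
    }
    void close() override { child_->close(); }
private:
    Operator* child_;
    int64_t count_;
    int64_t emitted_ = 0;
};

class MiniExecutor {
public:
    void add_data_source(const std::string& table, const std::vector<Row>* rows) {
        sources_[table] = rows;
    }
    // Builds the operator tree, then drains the root into a materialized result.
    std::vector<Row> execute(const PlanNode* plan) {
        Operator* root = build_operator(plan);
        std::vector<Row> result;
        Row row;
        root->open();
        while (root->next(row)) result.push_back(row);
        root->close();
        return result;
    }
private:
    // Children are built first, so the tree is constructed bottom-up.
    Operator* build_operator(const PlanNode* node) {
        std::unique_ptr<Operator> op;
        if (node->type == PlanType::Scan) {
            auto it = sources_.find(node->table);
            if (it == sources_.end()) throw std::runtime_error("unknown table: " + node->table);
            op = std::make_unique<ScanOperator>(it->second);
        } else {
            op = std::make_unique<LimitOperator>(build_operator(node->child), node->count);
        }
        owned_.push_back(std::move(op));
        return owned_.back().get();
    }
    std::unordered_map<std::string, const std::vector<Row>*> sources_;
    std::vector<std::unique_ptr<Operator>> owned_;  // keeps operators alive
};
```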
273+
274+
---
275+
276+
## ResultSet
277+
278+
```cpp
279+
struct ResultSet {
280+
std::vector<Row> rows;
281+
std::vector<std::string> column_names;
282+
uint16_t column_count = 0;
283+
284+
size_t row_count() const { return rows.size(); }
285+
bool empty() const { return rows.empty(); }
286+
};
287+
```
288+
289+
All result rows are materialized in memory. The arena used for row Value storage must outlive the ResultSet.
290+
291+
---
292+
293+
## File Organization
294+
295+
```
296+
include/sql_engine/
297+
operator.h — Operator base class
298+
data_source.h — DataSource interface + InMemoryDataSource
299+
result_set.h — ResultSet struct
300+
operators/
301+
scan_op.h
302+
filter_op.h
303+
project_op.h
304+
join_op.h
305+
aggregate_op.h
306+
sort_op.h
307+
limit_op.h
308+
distinct_op.h
309+
set_op_op.h
310+
plan_executor.h — PlanExecutor<D>
311+
312+
tests/
313+
test_operators.cpp — Unit tests per operator
314+
test_plan_executor.cpp — End-to-end SQL → results
315+
```
316+
317+
---
318+
## Testing Strategy

### Operator unit tests

Each operator is tested in isolation with hand-built inputs:

- **ScanOperator:** yields all rows, empty source, reopen
- **FilterOperator:** keeps matching, filters all, NULL in condition
- **ProjectOperator:** column subset, computed expression, no-FROM single row
- **NestedLoopJoinOperator:** inner match, inner no-match, left join NULLs, cross join
- **AggregateOperator:** COUNT(*), SUM, AVG, GROUP BY with multiple groups, no-group aggregate
- **SortOperator:** ASC, DESC, multi-key, stable, single row
- **LimitOperator:** count only, offset+count, offset beyond data, zero limit
- **DistinctOperator:** removes dupes, all unique, all same
- **SetOpOperator:** UNION ALL, UNION (dedup), INTERSECT, EXCEPT

### End-to-end integration tests

Full pipeline: SQL → parse → plan → execute → verify result rows.

| SQL | Expected |
|---|---|
| `SELECT * FROM users` | All rows |
| `SELECT name FROM users WHERE age > 18` | Filtered + projected |
| `SELECT * FROM users ORDER BY age DESC LIMIT 2` | Sorted + limited |
| `SELECT dept, COUNT(*) FROM users GROUP BY dept` | Aggregated |
| `SELECT * FROM users u JOIN orders o ON u.id = o.user_id` | Joined |
| `SELECT name FROM users WHERE name LIKE 'A%'` | LIKE filter |
| `SELECT 1 + 2` | Single row: [3] |
| `SELECT DISTINCT status FROM users` | Deduplicated |
| `SELECT * FROM t1 UNION ALL SELECT * FROM t2` | Combined rows |

---

## Performance Targets

| Operation | Target |
|---|---|
| ScanOperator::next() | <20 ns per row (pointer increment) |
| FilterOperator::next() | <100 ns per row (expression evaluation) |
| ProjectOperator::next() | <200 ns per row (N expression evaluations) |
| LimitOperator::next() | <10 ns per row (counter check) |
| SortOperator (1000 rows, 1 key) | <100 µs total |
| AggregateOperator (1000 rows, 10 groups) | <200 µs total |
| Full pipeline: simple SELECT WHERE (100 rows) | <50 µs total |
