Statistical SQL functions for fleet-wide and cohort analysis

## Summary

Fill gaps in QuestDB's statistical SQL surface to enable fleet-wide observability, cohort comparison, and distribution analysis. The core aggregate functions (stddev, variance, corr, covar) already exist — the main gaps are making them usable as window functions, adding distribution and regression functions, and adding bucketing utilities.

## Motivation

Fleet observability in robotics, IoT, and capital markets follows a common pattern: compare an individual (robot, device, symbol) against a cohort (fleet, site, market). This requires statistical functions that work both as aggregates (per-group) and as window functions (individual vs. population in the same query).

Example — the most natural fleet comparison query:

```sql
SELECT robot_id,
  avg(motor_temp) AS robot_avg,
  avg(motor_temp) OVER () AS fleet_avg,
  stddev(motor_temp) OVER () AS fleet_stddev,
  (avg(motor_temp) - avg(motor_temp) OVER ()) 
    / stddev(motor_temp) OVER () AS z_score
FROM telemetry
WHERE sensor = 'motor_temp' AND ts > now() - 1d
GROUP BY robot_id
ORDER BY z_score DESC
```

Today this doesn't work because `stddev()` cannot be used with `OVER()`.

## Gap 1: Statistical aggregates as window functions

These aggregates exist but **cannot be used with OVER()**:

| Function | As aggregate | As window function |
|----------|:---:|:---:|
| `stddev()` / `stddev_samp()` / `stddev_pop()` | Yes | **No** |
| `variance()` / `var_samp()` / `var_pop()` | Yes | **No** |
| `corr()` | Yes | **No** |
| `covar_samp()` / `covar_pop()` | Yes | **No** |

This is the highest-impact gap. Enabling these as window functions unlocks individual-vs-cohort queries — z-scores, fleet-relative ranking, anomaly detection — without subquery workarounds.

### What this enables

```sql
-- Z-score: how many standard deviations from fleet mean?
SELECT robot_id,
  avg(metric) AS robot_avg,
  (avg(metric) - avg(metric) OVER ()) / stddev(metric) OVER () AS z_score
FROM telemetry
GROUP BY robot_id

-- Rolling correlation between two metrics per robot
SELECT ts, robot_id,
  corr(motor_temp, joint_velocity) OVER (
    PARTITION BY robot_id 
    ORDER BY ts 
    RANGE '1' HOUR PRECEDING
  ) AS rolling_corr
FROM telemetry_wide

-- Per-robot variance relative to fleet variance
SELECT robot_id,
  variance(metric) AS robot_var,
  variance(metric) OVER () AS fleet_var
FROM telemetry
GROUP BY robot_id
```

## Gap 2: Missing window functions

| Function | What it does | Use case |
|----------|-------------|----------|
| `ntile(n)` | Divide rows into n equal buckets | Fleet tiering: "top quartile", "bottom decile" of robots by health score |
| `cume_dist()` | Cumulative distribution (0–1) | "What fraction of the fleet is below this threshold?" |
| `nth_value(expr, n)` | Value at position n in window frame | "What's the 3rd worst robot?" |

### What this enables

```sql
-- Tier robots into quartiles by average motor temperature
SELECT robot_id, avg_temp,
  ntile(4) OVER (ORDER BY avg_temp DESC) AS quartile
FROM (
  SELECT robot_id, avg(motor_temp) AS avg_temp
  FROM telemetry WHERE ts > now() - 1d
  GROUP BY robot_id
)

-- What percentile is each robot at?
SELECT robot_id, avg_temp,
  cume_dist() OVER (ORDER BY avg_temp) AS fleet_percentile
FROM (
  SELECT robot_id, avg(motor_temp) AS avg_temp
  FROM telemetry WHERE ts > now() - 1d
  GROUP BY robot_id
)
```

## Gap 3: Regression completeness

`regr_slope()` and `regr_intercept()` exist. The rest of the standard SQL regression family is missing:

| Function | What it does | Use case |
|----------|-------------|----------|
| `regr_r2()` | Coefficient of determination | "Is motor temp actually trending up, or is it noise?" |
| `regr_count()` | Number of non-null pairs | Regression diagnostics |
| `regr_avgx()` / `regr_avgy()` | Average of independent/dependent variable | Regression diagnostics |
| `regr_sxx()` / `regr_sxy()` / `regr_syy()` | Sum of squares | Advanced regression analysis |

### What this enables

```sql
-- Detect robots with a statistically significant temperature trend
SELECT robot_id,
  regr_slope(motor_temp, elapsed_seconds) AS temp_trend,
  regr_r2(motor_temp, elapsed_seconds) AS r_squared
FROM telemetry
WHERE ts > now() - 7d
GROUP BY robot_id
HAVING regr_r2(motor_temp, elapsed_seconds) > 0.7  -- strong linear trend
   AND regr_slope(motor_temp, elapsed_seconds) > 0  -- upward
ORDER BY temp_trend DESC
```

## Gap 4: Distribution shape

| Function | What it does | Use case |
|----------|-------------|----------|
| `kurtosis()` | Tail heaviness of distribution | "Are fleet-wide temperature distributions heavy-tailed?" (frequent extreme values) |
| `skewness()` | Asymmetry of distribution | "Is the fleet skewed toward degradation?" |

### What this enables

```sql
-- Fleet health distribution shape per site
SELECT site,
  avg(motor_temp) AS mean,
  stddev(motor_temp) AS stddev,
  skewness(motor_temp) AS skew,
  kurtosis(motor_temp) AS kurt
FROM telemetry
WHERE ts > now() - 1d
GROUP BY site
```

Positive skewness + high kurtosis at one site vs. others = localized environmental issue.

## Gap 5: Histogram bucketing

| Function | What it does | Use case |
|----------|-------------|----------|
| `width_bucket(value, min, max, n)` | Assign value to one of n equal-width buckets | Fleet distribution dashboards, histogram construction |

### What this enables

```sql
-- Temperature distribution histogram across fleet
SELECT 
  width_bucket(avg_temp, 40, 100, 12) AS bucket,
  count(*) AS robot_count
FROM (
  SELECT robot_id, avg(motor_temp) AS avg_temp
  FROM telemetry WHERE ts > now() - 1d
  GROUP BY robot_id
)
GROUP BY bucket
ORDER BY bucket
```

## Priority

1. **Statistical aggregates as window functions** — highest impact, unlocks the core individual-vs-cohort pattern
2. **ntile(), cume_dist(), nth_value()** — fleet tiering and ranking
3. **regr_r2()** — trend detection quality
4. **width_bucket()** — histogram construction
5. **kurtosis(), skewness()** — distribution shape (nice-to-have)
6. **regr_count(), regr_avgx/avgy, regr_sxx/sxy/syy** — regression completeness (nice-to-have)

## Related

- #119 — Compound sort keys (provides efficient per-series access that these functions operate over)
- #103 — HORIZON JOIN (event impact analysis across fleet, complements statistical comparison)
- #114 — Materialized views (pre-compute fleet statistics for dashboards)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Statistical SQL functions for fleet-wide and cohort analysis #120

Summary

Motivation

Gap 1: Statistical aggregates as window functions

What this enables

Gap 2: Missing window functions

What this enables

Gap 3: Regression completeness

What this enables

Gap 4: Distribution shape

What this enables

Gap 5: Histogram bucketing

What this enables

Priority

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Function	As aggregate	As window function
`stddev()` / `stddev_samp()` / `stddev_pop()`	Yes	No
`variance()` / `var_samp()` / `var_pop()`	Yes	No
`corr()`	Yes	No
`covar_samp()` / `covar_pop()`	Yes	No

Function	What it does	Use case
`ntile(n)`	Divide rows into n equal buckets	Fleet tiering: "top quartile", "bottom decile" of robots by health score
`cume_dist()`	Cumulative distribution (0–1)	"What fraction of the fleet is below this threshold?"
`nth_value(expr, n)`	Value at position n in window frame	"What's the 3rd worst robot?"

Function	What it does	Use case
`regr_r2()`	Coefficient of determination	"Is motor temp actually trending up, or is it noise?"
`regr_count()`	Number of non-null pairs	Regression diagnostics
`regr_avgx()` / `regr_avgy()`	Average of independent/dependent variable	Regression diagnostics
`regr_sxx()` / `regr_sxy()` / `regr_syy()`	Sum of squares	Advanced regression analysis

Function	What it does	Use case
`kurtosis()`	Tail heaviness of distribution	"Are fleet-wide temperature distributions heavy-tailed?" (frequent extreme values)
`skewness()`	Asymmetry of distribution	"Is the fleet skewed toward degradation?"

Statistical SQL functions for fleet-wide and cohort analysis #120

Description

Summary

Motivation

Gap 1: Statistical aggregates as window functions

What this enables

Gap 2: Missing window functions

What this enables

Gap 3: Regression completeness

What this enables

Gap 4: Distribution shape

What this enables

Gap 5: Histogram bucketing

What this enables

Priority

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions