Summary
Fill gaps in QuestDB's statistical SQL surface to enable fleet-wide observability, cohort comparison, and distribution analysis. The core aggregate functions (stddev, variance, corr, covar) already exist — the main gaps are making them usable as window functions, adding distribution and regression functions, and adding bucketing utilities.
Motivation
Fleet observability in robotics, IoT, and capital markets follows a common pattern: compare an individual (robot, device, symbol) against a cohort (fleet, site, market). This requires statistical functions that work both as aggregates (per-group) and as window functions (individual vs. population in the same query).
Example — the most natural fleet comparison query:
SELECT robot_id,
avg(motor_temp) AS robot_avg,
avg(motor_temp) OVER () AS fleet_avg,
stddev(motor_temp) OVER () AS fleet_stddev,
(avg(motor_temp) - avg(motor_temp) OVER ())
/ stddev(motor_temp) OVER () AS z_score
FROM telemetry
WHERE sensor = 'motor_temp' AND ts > now() - 1d
GROUP BY robot_id
ORDER BY z_score DESC
Today this doesn't work because stddev() cannot be used with OVER().
Gap 1: Statistical aggregates as window functions
These aggregates exist but cannot be used with OVER():
| Function |
As aggregate |
As window function |
stddev() / stddev_samp() / stddev_pop() |
Yes |
No |
variance() / var_samp() / var_pop() |
Yes |
No |
corr() |
Yes |
No |
covar_samp() / covar_pop() |
Yes |
No |
This is the highest-impact gap. Enabling these as window functions unlocks individual-vs-cohort queries — z-scores, fleet-relative ranking, anomaly detection — without subquery workarounds.
What this enables
-- Z-score: how many standard deviations from fleet mean?
SELECT robot_id,
avg(metric) AS robot_avg,
(avg(metric) - avg(metric) OVER ()) / stddev(metric) OVER () AS z_score
FROM telemetry
GROUP BY robot_id
-- Rolling correlation between two metrics per robot
SELECT ts, robot_id,
corr(motor_temp, joint_velocity) OVER (
PARTITION BY robot_id
ORDER BY ts
RANGE '1' HOUR PRECEDING
) AS rolling_corr
FROM telemetry_wide
-- Per-robot variance relative to fleet variance
SELECT robot_id,
variance(metric) AS robot_var,
variance(metric) OVER () AS fleet_var
FROM telemetry
GROUP BY robot_id
Gap 2: Missing window functions
| Function |
What it does |
Use case |
ntile(n) |
Divide rows into n equal buckets |
Fleet tiering: "top quartile", "bottom decile" of robots by health score |
cume_dist() |
Cumulative distribution (0–1) |
"What fraction of the fleet is below this threshold?" |
nth_value(expr, n) |
Value at position n in window frame |
"What's the 3rd worst robot?" |
What this enables
-- Tier robots into quartiles by average motor temperature
SELECT robot_id, avg_temp,
ntile(4) OVER (ORDER BY avg_temp DESC) AS quartile
FROM (
SELECT robot_id, avg(motor_temp) AS avg_temp
FROM telemetry WHERE ts > now() - 1d
GROUP BY robot_id
)
-- What percentile is each robot at?
SELECT robot_id, avg_temp,
cume_dist() OVER (ORDER BY avg_temp) AS fleet_percentile
FROM (
SELECT robot_id, avg(motor_temp) AS avg_temp
FROM telemetry WHERE ts > now() - 1d
GROUP BY robot_id
)
Gap 3: Regression completeness
regr_slope() and regr_intercept() exist. The rest of the standard SQL regression family is missing:
| Function |
What it does |
Use case |
regr_r2() |
Coefficient of determination |
"Is motor temp actually trending up, or is it noise?" |
regr_count() |
Number of non-null pairs |
Regression diagnostics |
regr_avgx() / regr_avgy() |
Average of independent/dependent variable |
Regression diagnostics |
regr_sxx() / regr_sxy() / regr_syy() |
Sum of squares |
Advanced regression analysis |
What this enables
-- Detect robots with a statistically significant temperature trend
SELECT robot_id,
regr_slope(motor_temp, elapsed_seconds) AS temp_trend,
regr_r2(motor_temp, elapsed_seconds) AS r_squared
FROM telemetry
WHERE ts > now() - 7d
GROUP BY robot_id
HAVING regr_r2(motor_temp, elapsed_seconds) > 0.7 -- strong linear trend
AND regr_slope(motor_temp, elapsed_seconds) > 0 -- upward
ORDER BY temp_trend DESC
Gap 4: Distribution shape
| Function |
What it does |
Use case |
kurtosis() |
Tail heaviness of distribution |
"Are fleet-wide temperature distributions heavy-tailed?" (frequent extreme values) |
skewness() |
Asymmetry of distribution |
"Is the fleet skewed toward degradation?" |
What this enables
-- Fleet health distribution shape per site
SELECT site,
avg(motor_temp) AS mean,
stddev(motor_temp) AS stddev,
skewness(motor_temp) AS skew,
kurtosis(motor_temp) AS kurt
FROM telemetry
WHERE ts > now() - 1d
GROUP BY site
Positive skewness + high kurtosis at one site vs. others = localized environmental issue.
Gap 5: Histogram bucketing
| Function |
What it does |
Use case |
width_bucket(value, min, max, n) |
Assign value to one of n equal-width buckets |
Fleet distribution dashboards, histogram construction |
What this enables
-- Temperature distribution histogram across fleet
SELECT
width_bucket(avg_temp, 40, 100, 12) AS bucket,
count(*) AS robot_count
FROM (
SELECT robot_id, avg(motor_temp) AS avg_temp
FROM telemetry WHERE ts > now() - 1d
GROUP BY robot_id
)
GROUP BY bucket
ORDER BY bucket
Priority
- Statistical aggregates as window functions — highest impact, unlocks the core individual-vs-cohort pattern
- ntile(), cume_dist(), nth_value() — fleet tiering and ranking
- regr_r2() — trend detection quality
- width_bucket() — histogram construction
- kurtosis(), skewness() — distribution shape (nice-to-have)
- regr_count(), regr_avgx/avgy, regr_sxx/sxy/syy — regression completeness (nice-to-have)
Related
Summary
Fill gaps in QuestDB's statistical SQL surface to enable fleet-wide observability, cohort comparison, and distribution analysis. The core aggregate functions (stddev, variance, corr, covar) already exist — the main gaps are making them usable as window functions, adding distribution and regression functions, and adding bucketing utilities.
Motivation
Fleet observability in robotics, IoT, and capital markets follows a common pattern: compare an individual (robot, device, symbol) against a cohort (fleet, site, market). This requires statistical functions that work both as aggregates (per-group) and as window functions (individual vs. population in the same query).
Example — the most natural fleet comparison query:
Today this doesn't work because
stddev()cannot be used withOVER().Gap 1: Statistical aggregates as window functions
These aggregates exist but cannot be used with OVER():
stddev()/stddev_samp()/stddev_pop()variance()/var_samp()/var_pop()corr()covar_samp()/covar_pop()This is the highest-impact gap. Enabling these as window functions unlocks individual-vs-cohort queries — z-scores, fleet-relative ranking, anomaly detection — without subquery workarounds.
What this enables
Gap 2: Missing window functions
ntile(n)cume_dist()nth_value(expr, n)What this enables
Gap 3: Regression completeness
regr_slope()andregr_intercept()exist. The rest of the standard SQL regression family is missing:regr_r2()regr_count()regr_avgx()/regr_avgy()regr_sxx()/regr_sxy()/regr_syy()What this enables
Gap 4: Distribution shape
kurtosis()skewness()What this enables
Positive skewness + high kurtosis at one site vs. others = localized environmental issue.
Gap 5: Histogram bucketing
width_bucket(value, min, max, n)What this enables
Priority
Related