Skip to content

Statistical SQL functions for fleet-wide and cohort analysis #120

@bluestreak01

Description

@bluestreak01

Summary

Fill gaps in QuestDB's statistical SQL surface to enable fleet-wide observability, cohort comparison, and distribution analysis. The core aggregate functions (stddev, variance, corr, covar) already exist — the main gaps are making them usable as window functions, adding distribution and regression functions, and adding bucketing utilities.

Motivation

Fleet observability in robotics, IoT, and capital markets follows a common pattern: compare an individual (robot, device, symbol) against a cohort (fleet, site, market). This requires statistical functions that work both as aggregates (per-group) and as window functions (individual vs. population in the same query).

Example — the most natural fleet comparison query:

SELECT robot_id,
  avg(motor_temp) AS robot_avg,
  avg(motor_temp) OVER () AS fleet_avg,
  stddev(motor_temp) OVER () AS fleet_stddev,
  (avg(motor_temp) - avg(motor_temp) OVER ()) 
    / stddev(motor_temp) OVER () AS z_score
FROM telemetry
WHERE sensor = 'motor_temp' AND ts > now() - 1d
GROUP BY robot_id
ORDER BY z_score DESC

Today this doesn't work because stddev() cannot be used with OVER().

Gap 1: Statistical aggregates as window functions

These aggregates exist but cannot be used with OVER():

Function As aggregate As window function
stddev() / stddev_samp() / stddev_pop() Yes No
variance() / var_samp() / var_pop() Yes No
corr() Yes No
covar_samp() / covar_pop() Yes No

This is the highest-impact gap. Enabling these as window functions unlocks individual-vs-cohort queries — z-scores, fleet-relative ranking, anomaly detection — without subquery workarounds.

What this enables

-- Z-score: how many standard deviations from fleet mean?
SELECT robot_id,
  avg(metric) AS robot_avg,
  (avg(metric) - avg(metric) OVER ()) / stddev(metric) OVER () AS z_score
FROM telemetry
GROUP BY robot_id

-- Rolling correlation between two metrics per robot
SELECT ts, robot_id,
  corr(motor_temp, joint_velocity) OVER (
    PARTITION BY robot_id 
    ORDER BY ts 
    RANGE '1' HOUR PRECEDING
  ) AS rolling_corr
FROM telemetry_wide

-- Per-robot variance relative to fleet variance
SELECT robot_id,
  variance(metric) AS robot_var,
  variance(metric) OVER () AS fleet_var
FROM telemetry
GROUP BY robot_id

Gap 2: Missing window functions

Function What it does Use case
ntile(n) Divide rows into n equal buckets Fleet tiering: "top quartile", "bottom decile" of robots by health score
cume_dist() Cumulative distribution (0–1) "What fraction of the fleet is below this threshold?"
nth_value(expr, n) Value at position n in window frame "What's the 3rd worst robot?"

What this enables

-- Tier robots into quartiles by average motor temperature
SELECT robot_id, avg_temp,
  ntile(4) OVER (ORDER BY avg_temp DESC) AS quartile
FROM (
  SELECT robot_id, avg(motor_temp) AS avg_temp
  FROM telemetry WHERE ts > now() - 1d
  GROUP BY robot_id
)

-- What percentile is each robot at?
SELECT robot_id, avg_temp,
  cume_dist() OVER (ORDER BY avg_temp) AS fleet_percentile
FROM (
  SELECT robot_id, avg(motor_temp) AS avg_temp
  FROM telemetry WHERE ts > now() - 1d
  GROUP BY robot_id
)

Gap 3: Regression completeness

regr_slope() and regr_intercept() exist. The rest of the standard SQL regression family is missing:

Function What it does Use case
regr_r2() Coefficient of determination "Is motor temp actually trending up, or is it noise?"
regr_count() Number of non-null pairs Regression diagnostics
regr_avgx() / regr_avgy() Average of independent/dependent variable Regression diagnostics
regr_sxx() / regr_sxy() / regr_syy() Sum of squares Advanced regression analysis

What this enables

-- Detect robots with a statistically significant temperature trend
SELECT robot_id,
  regr_slope(motor_temp, elapsed_seconds) AS temp_trend,
  regr_r2(motor_temp, elapsed_seconds) AS r_squared
FROM telemetry
WHERE ts > now() - 7d
GROUP BY robot_id
HAVING regr_r2(motor_temp, elapsed_seconds) > 0.7  -- strong linear trend
   AND regr_slope(motor_temp, elapsed_seconds) > 0  -- upward
ORDER BY temp_trend DESC

Gap 4: Distribution shape

Function What it does Use case
kurtosis() Tail heaviness of distribution "Are fleet-wide temperature distributions heavy-tailed?" (frequent extreme values)
skewness() Asymmetry of distribution "Is the fleet skewed toward degradation?"

What this enables

-- Fleet health distribution shape per site
SELECT site,
  avg(motor_temp) AS mean,
  stddev(motor_temp) AS stddev,
  skewness(motor_temp) AS skew,
  kurtosis(motor_temp) AS kurt
FROM telemetry
WHERE ts > now() - 1d
GROUP BY site

Positive skewness + high kurtosis at one site vs. others = localized environmental issue.

Gap 5: Histogram bucketing

Function What it does Use case
width_bucket(value, min, max, n) Assign value to one of n equal-width buckets Fleet distribution dashboards, histogram construction

What this enables

-- Temperature distribution histogram across fleet
SELECT 
  width_bucket(avg_temp, 40, 100, 12) AS bucket,
  count(*) AS robot_count
FROM (
  SELECT robot_id, avg(motor_temp) AS avg_temp
  FROM telemetry WHERE ts > now() - 1d
  GROUP BY robot_id
)
GROUP BY bucket
ORDER BY bucket

Priority

  1. Statistical aggregates as window functions — highest impact, unlocks the core individual-vs-cohort pattern
  2. ntile(), cume_dist(), nth_value() — fleet tiering and ranking
  3. regr_r2() — trend detection quality
  4. width_bucket() — histogram construction
  5. kurtosis(), skewness() — distribution shape (nice-to-have)
  6. regr_count(), regr_avgx/avgy, regr_sxx/sxy/syy — regression completeness (nice-to-have)

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status

    In Progress

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions