diff --git a/.gitignore b/.gitignore index be1bbfddf..1432765b6 100644 --- a/.gitignore +++ b/.gitignore @@ -96,6 +96,7 @@ criterion/ # Claude Code specific .claude/ +memory # R specific *.Rproj.user diff --git a/CHANGELOG.md b/CHANGELOG.md index 591173955..1a52e9651 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,10 @@ ### Added +- New `aggregate` SETTING on Identity-stat layers (point, line, area, bar, ribbon, + range, segment, arrow, rule, text). By default it collapses each group to a + single row by replacing every numeric mapping in place with its aggregated + value. See the `DRAW` documentation for details. - Added panel decorations (grid lines, axes, background) for polar coordinates (#156). - Added `radar` setting to polar coordinates for making radar plots (#418). @@ -11,7 +15,7 @@ - Side effects like `CREATE TEMP TABLE` before the `VISUALISE` statement are now separated from directly feeding into the visualisation data (#415) -- Fixed bug where panel axes were unintentionally anchored to zero when using +- Fixed bug where panel axes were unintentionally anchored to zero when using `FACET ... SETTING free => 'x'/'y'` (#410). - Fixed bug where faceted data were matched to the incorrect panels (#409) diff --git a/doc/syntax/clause/draw.qmd b/doc/syntax/clause/draw.qmd index 31bf8f3a9..7baad32e3 100644 --- a/doc/syntax/clause/draw.qmd +++ b/doc/syntax/clause/draw.qmd @@ -76,6 +76,48 @@ The `SETTING` clause can be used for two different things: #### Position A special setting is `position` which controls how overlapping objects are repositioned to avoid overlapping etc. Position adjustments have special mapping requirements so all position adjustments will not be relevant for all layer types. Different layers have different defaults as detailed in their documentation. You can read about each different position adjustment at [their own documentation sites](../index.qmd#position-adjustments). 
+#### Aggregate +Some layers support aggregation of their data through the `aggregate` setting. Their documentation will state this. `aggregate` collapses each group to a single row, replacing every numeric mapping in place with its aggregated value. Groups are defined by `PARTITION BY` together with all discrete mappings. + +The `aggregate` setting takes a single string or an array of strings. Each string is one of: + +* **Untargeted** — `''` (no prefix). With one untargeted aggregation, the function applies to every numeric mapping that doesn't have a targeted aggregation. With two untargeted aggregations, the first is used for the lower side of range layers (e.g. `x`/`xmin`) plus all non-range layers, and the second is used for the upper side of range layers (e.g. `xend`/`xmax`). More than two untargeted aggregations is not allowed. +* **Targeted** — `':'`. Applies `func` to the named aesthetic only (`` is a name like `x`, `y`, `xmin`, `xmax`, `xend`, `yend`, `color`, `size`, …). A target overrides any untargeted aggregation for that aesthetic. + +A numeric mapping is dropped from the layer with a warning, when it has neither a target nor an applicable default. + +##### Aggregate functions +Aggregation can either be a simple function or a band function. The simple functions are: + +* `'count'`: Non-null tally of the bound column. +* `'sum'` and `'prod'`: The sum or product +* `'min'`, `'max'`: Extremes +* `'range'` (max - min), `'mid'` (min + max) / 2 +* `'mean'`, and `'median'`: Central tendency +* `'geomean'`, `'harmean'`, and `'rms'`: Geometric, harmonic, and root-mean-square +* `'sdev'`, `'var'`, `'iqr'`, and `'se'`: Standard deviation, variance, interquartile range, and standard error +* `'p05'`, `'p10'`, `'p25'`, `'p50'`, `'p75'`, `'p90'`, and `'p95'`: Percentiles +* `'first'` and `'last'`: The first or last value in the group, in row order. 
Note that the row order within a group is engine-defined unless the source query has an `ORDER BY` — these are most useful when the upstream SQL provides an explicit ordering. +* `'diff'`: `last - first`. The change between the first and last value in row order — same ordering caveat applies. + +For band functions you combine an offset with an expansion, potentially multiplied. An example could be `'mean-1.96sdev'` which computes the mean minus 1.96 standard deviations. The general form is `±` with `` being optional (defaults to `1`). + +Allowed offsets are: `'mean'`, `'median'`, `'geomean'`, `'harmean'`, `'rms'`, `'sum'`, `'prod'`, `'min'`, `'max'`, `'mid'`, and `'p05'`–`'p95'` + +Allowed expansions are: `'sdev'`, `'se'`, `'var'`, `'iqr'`, and `'range'` + +##### Exploded aggregation +You can also target the same aesthetic more than once to produce *multiple rows per group* — one for each function. We call that *exploded aggregation*. For example `aggregate => ('y:min', 'y:max')` emits a min row and a max row per group, so a single `DRAW line` produces two summary lines that connect within each group rather than across them. When multiple rows are created, a synthetic `aggregate` column is made that tags each row with the name of the aggregation function. You can use this with a `REMAPPING` to drive another aesthetic — e.g. `REMAPPING aggregate AS stroke` to colour the two lines differently. The column's value is built from the per-row function names of the *exploded* targets, deduplicated, and joined with `/`: + +* `aggregate => ('y:min', 'y:max')` → rows tagged `'min'`, `'max'`. +* `aggregate => ('y:min', 'y:max', 'color:median')` → rows tagged `'min'`, `'max'` (the single-function `color` target is recycled across rows and is not part of the label). +* `aggregate => ('y:min', 'y:max', 'color:sum', 'color:prod')` → rows tagged `'min/sum'`, `'max/prod'`. 
+* `aggregate => ('y:mean', 'y:max', 'color:mean', 'color:prod')` → rows tagged `'mean'`, `'max/prod'` (the duplicate `'mean'` collapses). + +When several aesthetics are targeted with the same number of functions, they explode in lockstep: row 1 uses each aesthetic's first function, row 2 the second, and so on. Aesthetics with a single function — and the unprefixed defaults — are reused unchanged across every row. Mixing different numbers of aggregation metrics above 1 across aesthetics is not allowed. + +In the single-row (reduction) case aggregation applies in place — no `REMAPPING` is needed and no synthetic column is added. Only the multi-row (explosion) case described above introduces the synthetic `aggregate` column. + ### `FILTER` ```ggsql FILTER diff --git a/doc/syntax/layer/type/area.qmd b/doc/syntax/layer/type/area.qmd index a72b059ff..623b00a29 100644 --- a/doc/syntax/layer/type/area.qmd +++ b/doc/syntax/layer/type/area.qmd @@ -25,9 +25,14 @@ The following aesthetics are recognised by the area layer. * `orientation`: The orientation of the layer, see the [Orientation section](#orientation). One of the following: * `'aligned'` to align the layer's primary axis with the coordinate system's first axis. * `'transposed'` to align the layer's primary axis with the coordinate system's second axis. +* `aggregate` Aggregation functions to apply per group: + * `null` apply no group aggregation (default). + * A single string or an array of strings. See an overview of aggregation function in [the `DRAW` documentation](../../clause/draw.qmd#aggregate) and more information in the *Data transformation* section below. ## Data transformation -The area layer sorts the data along its primary axis +This layer supports aggregation through the `aggregate` setting. Aggregation groups are defined by `PARTITION BY`, all discrete mappings, but also the primary axis. Within each group, every numeric mapping is replaced in place by its aggregated value. 
Use a default like `'mean'` or target individual aesthetics with `':'`. See [the `DRAW` documentation](../../clause/draw.qmd#aggregate) for the full setting shape. + +Further, the area layer sorts the data along its primary axis before returning it. ## Orientation Area plots are sorted and connected along their primary axis. Since the primary axis cannot be deduced from the mapping it must be specified using the `orientation` setting. E.g. if you wish to create a vertical area plot you need to set `orientation => 'transposed'` to indicate that the primary layer axis follows the second axis of the coordinate system. diff --git a/doc/syntax/layer/type/bar.qmd b/doc/syntax/layer/type/bar.qmd index d34a4953c..d32b1f88a 100644 --- a/doc/syntax/layer/type/bar.qmd +++ b/doc/syntax/layer/type/bar.qmd @@ -25,10 +25,15 @@ The bar layer has no required aesthetics ## Settings * `position`: Position adjustment. One of `'identity'`, `'stack'` (default), `'dodge'`, or `'jitter'` * `width`: The width of the bars as a proportion of the available width (0 to 1) +* `aggregate` Aggregation functions to apply per group: + * `null` apply no group aggregation (default). + * A single string or an array of strings. See an overview of aggregation function in [the `DRAW` documentation](../../clause/draw.qmd#aggregate) and more information in the *Data transformation* section below. ## Data transformation If the secondary axis has not been mapped the layer will calculate counts for you and display these as the secondary axis. +This layer supports aggregation through the `aggregate` setting. Aggregation groups are defined by `PARTITION BY` and all discrete mappings. Within each group, every numeric mapping is replaced in place by its aggregated value. Use a default like `'mean'` or target individual aesthetics with `':'`. See [the `DRAW` documentation](../../clause/draw.qmd#aggregate) for the full setting shape. 
+ ### Properties * `weight`: If mapped, the sum of the weights within each group is calculated instead of the count in each group @@ -116,3 +121,15 @@ DRAW bar MAPPING species AS fill PROJECT TO polar ``` + +Use a different type of aggregation for the bars through the `aggregate` setting. The `range` layer needs both `ymin` and `ymax` mapped; with two defaults, the first is applied to the lower bound and the second to the upper bound. + +```{ggsql} +VISUALISE species AS x FROM ggsql:penguins +DRAW bar + MAPPING body_mass AS y + SETTING aggregate => 'mean', fill => 'steelblue' +DRAW range + MAPPING body_mass AS ymin, body_mass AS ymax + SETTING aggregate => ('mean-1.96sdev', 'mean+1.96sdev') +``` diff --git a/doc/syntax/layer/type/line.qmd b/doc/syntax/layer/type/line.qmd index 3ec9ec212..898647151 100644 --- a/doc/syntax/layer/type/line.qmd +++ b/doc/syntax/layer/type/line.qmd @@ -24,9 +24,15 @@ The following aesthetics are recognised by the line layer. * `orientation`: The orientation of the layer, see the [Orientation section](#orientation). One of the following: * `'aligned'` to align the layer's primary axis with the coordinate system's first axis. * `'transposed'` to align the layer's primary axis with the coordinate system's second axis. +* `aggregate` Aggregation functions to apply per group: + * `null` apply no group aggregation (default). + * A single string or an array of strings. See an overview of aggregation function in [the `DRAW` documentation](../../clause/draw.qmd#aggregate) and more information in the *Data transformation* section below. ## Data transformation -The line layer sorts the data along its primary axis. +This layer supports aggregation through the `aggregate` setting. Aggregation groups are defined by `PARTITION BY`, all discrete mappings, but also the primary axis. Within each group, every numeric mapping is replaced in place by its aggregated value. Use a default like `'mean'` or target individual aesthetics with `':'`. 
See [the `DRAW` documentation](../../clause/draw.qmd#aggregate) for the full setting shape. + +Further, the line layer sorts the data along its primary axis before returning it. + If the line has a variable `stroke` or `opacity` aesthetic within groups, the line is broken into segments. Each segment gets the property of the preceding datapoint, so the last datapoint in a group does not transfer these properties. @@ -89,4 +95,14 @@ VISUALISE x, y FROM data DRAW line MAPPING z AS linewidth SCALE linewidth TO (0, 30) -``` \ No newline at end of file +``` + +Use aggregation to draw min and max lines from a set of observations on a single layer. Targeting `y` twice produces one summary row per function within the same group. A synthetic `aggregate` column tags each row with its function name, which you can remap to colour the lines distinctly: + +```{ggsql} +VISUALISE Day AS x, Temp AS y FROM ggsql:airquality +DRAW line + REMAPPING aggregate AS stroke + SETTING aggregate => ('y:min', 'y:max') +DRAW point +``` diff --git a/doc/syntax/layer/type/point.qmd b/doc/syntax/layer/type/point.qmd index a64ca2580..b9fd50163 100644 --- a/doc/syntax/layer/type/point.qmd +++ b/doc/syntax/layer/type/point.qmd @@ -23,9 +23,12 @@ The following aesthetics are recognised by the point layer. ## Settings * `position`: Position adjustment. One of `'identity'` (default), `'stack'`, `'dodge'`, or `'jitter'` +* `aggregate` Aggregation functions to apply per group: + * `null` apply no group aggregation (default). + * A single string or an array of strings. See an overview of aggregation function in [the `DRAW` documentation](../../clause/draw.qmd#aggregate) and more information in the *Data transformation* section below. ## Data transformation -The point layer does not transform its data but passes it through unchanged +This layer supports aggregation through the `aggregate` setting. Aggregation groups are defined by `PARTITION BY` and all discrete mappings. 
Within each group, every numeric mapping is replaced in place by its aggregated value. Use a default like `'mean'` or target individual aesthetics with `':'`. See [the `DRAW` documentation](../../clause/draw.qmd#aggregate) for the full setting shape. ## Orientation The point layer has no orientation. The axes are treated symmetrically. @@ -72,3 +75,13 @@ VISUALISE species AS x, bill_dep AS y FROM ggsql:penguins DRAW point SETTING position => 'jitter', distribution => 'density' ``` + +Use aggregation to show a single point per group + +```{ggsql} +VISUALISE species AS x, island AS y, body_mass AS fill, body_mass AS size + FROM ggsql:penguins +DRAW point + SETTING aggregate => ('fill:mean', 'size:count') +SCALE size TO (5, 20) +``` diff --git a/doc/syntax/layer/type/range.qmd b/doc/syntax/layer/type/range.qmd index d3982bd66..35771ef9f 100644 --- a/doc/syntax/layer/type/range.qmd +++ b/doc/syntax/layer/type/range.qmd @@ -22,9 +22,12 @@ The following aesthetics are recognised by the range layer. ## Settings * `width`: The width of the hinges in points (must be >= 0). Defaults to 10. Can be set to `null` to not display hinges. +* `aggregate` Aggregation functions to apply per group: + * `null` apply no group aggregation (default). + * A single string or an array of strings. See an overview of aggregation function in [the `DRAW` documentation](../../clause/draw.qmd#aggregate) and more information in the *Data transformation* section below. ## Data transformation -The range layer does not transform its data but passes it through unchanged. +This layer supports aggregation through the `aggregate` setting. Within each group, defined by `PARTITION BY` and all discrete mappings, every numeric mapping is replaced in place by its aggregated value, producing one range per group. Range is a range layer with two defaults: the first applies to the start point (`xmin`/`ymin`) and the second applies to the end point (`xmax`/`ymax`). 
Use a single default like `'mean'` to apply the same function to all values, or target individual aesthetics with `':'`. See [the `DRAW` documentation](../../clause/draw.qmd#aggregate) for the full setting shape. ## Orientation The orientation of range layers is deduced directly from the mapping, because the interval is mapped to the secondary axis. To create a horizontal range layer, you map the independent variable to `y` instead of `x` and the interval to `xmin` and `xmax` (assuming a default Cartesian coordinate system). @@ -108,3 +111,25 @@ DRAW range MAPPING low AS ymin, high AS ymax SETTING width => null ``` + +Rather than precomputing the values and plotting them, you can use the aggregate functionality to calculate the relevant statistics dynamically: + +```{ggsql} +VISUALISE Date AS x, Temp AS ymin, Temp AS ymax, Temp AS color + FROM ggsql:airquality +DRAW range + REMAPPING aggregate AS linewidth + SETTING + aggregate => ( + 'x:first', + 'ymin:first', 'ymin:min', + 'ymax:last', 'ymax:max', + 'color:diff' + ), + width => null + PARTITION BY Week +SCALE linewidth TO (5, 1) +SCALE BINNED color TO ('steelblue', 'firebrick') + SETTING breaks => (-20, 0, 20) +``` + diff --git a/doc/syntax/layer/type/ribbon.qmd b/doc/syntax/layer/type/ribbon.qmd index 50a38d258..d46aa02a3 100644 --- a/doc/syntax/layer/type/ribbon.qmd +++ b/doc/syntax/layer/type/ribbon.qmd @@ -23,9 +23,12 @@ The following aesthetics are recognised by the ribbon layer. ## Settings * `position`: Position adjustment. One of `'identity'` (default), `'stack'`, `'dodge'`, or `'jitter'` +* `aggregate` Aggregation functions to apply per group: + * `null` apply no group aggregation (default). + * A single string or an array of strings. See an overview of aggregation function in [the `DRAW` documentation](../../clause/draw.qmd#aggregate) and more information in the *Data transformation* section below. 
## Data transformation -The ribbon layer sorts the data along its primary axis +This layer supports aggregation through the `aggregate` setting. Within each group, defined by `PARTITION BY` and all discrete mappings, every numeric mapping is replaced in place by its aggregated value, producing one ribbon per group. Ribbon is a range layer with two defaults: the first applies to the start point (`xmin`/`ymin`) and the second applies to the end point (`xmax`/`ymax`). Use a single default like `'mean'` to apply the same function to all values, or target individual aesthetics with `':'`. See [the `DRAW` documentation](../../clause/draw.qmd#aggregate) for the full setting shape. ## Orientation Ribbon layers are sorted and connected along their primary axis. The orientation is deduced directly from the mapping, because the interval is mapped to the secondary axis. To create a vertical ribbon layer you map the independent variable to `y` instead of `x` and the interval to `xmin` and `xmax` (assuming a default Cartesian coordinate system). @@ -59,3 +62,11 @@ DRAW ribbon DRAW line MAPPING MeanTemp AS y ``` + +Use aggregation to calculate bounds on the fly. The two untargeted aggregation functions target the `ymin` and `ymax` aesthetics automatically. + +```{ggsql} +VISUALISE Day AS x, Temp AS ymin, Temp AS ymax FROM ggsql:airquality +DRAW ribbon + SETTING aggregate => ('min', 'max') +``` diff --git a/doc/syntax/layer/type/rule.qmd b/doc/syntax/layer/type/rule.qmd index 71a2ceb46..032e5ea14 100644 --- a/doc/syntax/layer/type/rule.qmd +++ b/doc/syntax/layer/type/rule.qmd @@ -25,8 +25,12 @@ The following aesthetics are recognised by the rule layer. ## Settings * `position`: Position adjustment. One of `'identity'` (default), `'stack'`, `'dodge'`, or `'jitter'` +* `aggregate` Aggregation functions to apply per group: + * `null` apply no group aggregation (default). + * A single string or an array of strings. 
See an overview of aggregation function in [the `DRAW` documentation](../../clause/draw.qmd#aggregate) and more information in the *Data transformation* section below. ## Data transformation +This layer supports aggregation through the `aggregate` setting. Aggregation groups are defined by `PARTITION BY` and all discrete mappings. Within each group, every numeric mapping is replaced in place by its aggregated value. Use a default like `'mean'` or target individual aesthetics with `':'`. See [the `DRAW` documentation](../../clause/draw.qmd#aggregate) for the full setting shape. For diagonal lines, the position aesthetic determines the intercept: @@ -110,4 +114,14 @@ VISUALISE FROM ggsql:penguins intercept AS y, label AS colour FROM lines -``` \ No newline at end of file +``` + +Show a max rule for a timeseries + +```{ggsql} +VISUALISE Temp AS y FROM ggsql:airquality +DRAW line + MAPPING Date AS x +DRAW rule + SETTING aggregate => 'max' +``` diff --git a/doc/syntax/layer/type/segment.qmd b/doc/syntax/layer/type/segment.qmd index 7553aef96..f2aab57b4 100644 --- a/doc/syntax/layer/type/segment.qmd +++ b/doc/syntax/layer/type/segment.qmd @@ -25,9 +25,12 @@ For axis-aligned intervals where one coordinate is shared between the start and ## Settings * `position`: Position adjustment. One of `'identity'` (default), `'stack'`, `'dodge'`, or `'jitter'` +* `aggregate` Aggregation functions to apply per group: + * `null` apply no group aggregation (default). + * A single string or an array of strings. See an overview of aggregation function in [the `DRAW` documentation](../../clause/draw.qmd#aggregate) and more information in the *Data transformation* section below. ## Data transformation -The segment layer does not transform its data but passes it through unchanged. +This layer supports aggregation through the `aggregate` setting. 
Within each group, defined by `PARTITION BY` and all discrete mappings, every numeric mapping is replaced in place by its aggregated value, producing one segment per group. Segment is a range layer with two defaults: the first applies to the start point (`x`/`y`) and the second applies to the end point (`xend`/`yend`). Use a single default like `'mean'` to apply the same function to all four endpoints, or target individual aesthetics with `':'`. See [the `DRAW` documentation](../../clause/draw.qmd#aggregate) for the full setting shape. ## Orientation The segment layer has no orientations. The axes are treated symmetrically. diff --git a/doc/syntax/layer/type/text.qmd b/doc/syntax/layer/type/text.qmd index fa9d010ac..6a4310337 100644 --- a/doc/syntax/layer/type/text.qmd +++ b/doc/syntax/layer/type/text.qmd @@ -35,6 +35,9 @@ The following aesthetics are recognised by the text layer. * a 2-element numeric array `[h, v]` where the first number is the horizontal offset and the second number is the vertical offset. * `format` Formatting specifier, see explanation below. * `position`: Position adjustment. One of `'identity'` (default), `'stack'`, `'dodge'`, or `'jitter'` +* `aggregate` Aggregation functions to apply per group: + * `null` apply no group aggregation (default). + * A single string or an array of strings. See an overview of aggregation function in [the `DRAW` documentation](../../clause/draw.qmd#aggregate) and more information in the *Data transformation* section below. ### Format The `format` setting can take a string that will be used in formatting the `label` aesthetic. @@ -66,7 +69,7 @@ Known formatters are: * `x`/`X`: Unsigned hexadecimal ## Data transformation -The text layer does not transform its data but passed it through unchanged. +This layer supports aggregation through the `aggregate` setting. Aggregation groups are defined by `PARTITION BY` and all discrete mappings. 
Within each group, every numeric mapping is replaced in place by its aggregated value. Use a default like `'mean'` or target individual aesthetics with `':'`. See [the `DRAW` documentation](../../clause/draw.qmd#aggregate) for the full setting shape. ## Orientation The text layer has no orientation. The axes are treated symmetrically. @@ -146,3 +149,14 @@ PLACE text x => (40, 50, 50), y => (19, 19, 15) ``` + +Use aggregation to place labels at their centroid. + +```{ggsql} +VISUALISE bill_len AS x, bill_dep AS y FROM ggsql:penguins +DRAW point + MAPPING species AS fill +DRAW text + MAPPING species AS label + SETTING aggregate => 'mean', stroke => 'white', fontweight => 'bold', fontsize => 20 +``` diff --git a/doc/syntax/layer/type/tile.qmd b/doc/syntax/layer/type/tile.qmd index e700092c7..2107b2fb9 100644 --- a/doc/syntax/layer/type/tile.qmd +++ b/doc/syntax/layer/type/tile.qmd @@ -37,11 +37,16 @@ Alternatively, use only the center, which will set `height` to 1 by default. ## Settings * `position`: Position adjustment. One of `'identity'` (default), `'stack'`, `'dodge'`, or `'jitter'` +* `aggregate` Aggregation functions to apply per group: + * `null` apply no group aggregation (default). + * A single string or an array of strings. See an overview of aggregation function in [the `DRAW` documentation](../../clause/draw.qmd#aggregate) and more information in the *Data transformation* section below. ## Data transformation. When the primary aesthetics are continuous, primary data is reparameterised to {start, end}, e.g. `xmin` and `xmax`. When the secondary aesthetics are continuous, secondary data is reparameterised to {start, end}, e.g. `ymin` and `ymax`. +This layer also supports aggregation through the `aggregate` setting. Aggregation groups are defined by `PARTITION BY` and all discrete mappings. Within each group, every numeric mapping is replaced in place by its aggregated value. Use a default like `'mean'` or target individual aesthetics with `':'`. 
See [the `DRAW` documentation](../../clause/draw.qmd#aggregate) for the full setting shape. The position parameterisation runs after aggregation, so a heatmap from raw rows is just one `aggregate => ''` setting away. + ## Orientation The tile layer has no orientation. The axes are treated symmetrically. @@ -91,6 +96,15 @@ VISUALISE start AS xmin, end AS xmax, min AS ymin, max AS ymax DRAW tile ``` +Building a heatmap from raw rows by aggregating per cell. + +```{ggsql} +VISUALISE FROM ggsql:penguins +DRAW tile + MAPPING species AS x, sex AS y, body_mass AS fill + SETTING aggregate => 'mean', opacity => 1 +``` + Using a tile as an annotation. Note we're using the `PLACE` clause here instead of `DRAW` because we're not mapping from data. ```{ggsql} diff --git a/doc/vendor/SKILL.md b/doc/vendor/SKILL.md index 90c41dcf4..6039f00cd 100644 --- a/doc/vendor/SKILL.md +++ b/doc/vendor/SKILL.md @@ -129,6 +129,35 @@ SETTING position => 'dodge' -- side by side (default for boxplot, violin) SETTING position => 'jitter' -- random offset ``` +**Aggregate** collapses each group to a single row, replacing every numeric mapping in place with its aggregated value. Groups = `PARTITION BY` columns + all discrete mappings. Supported by `point`, `line`, `path`, `bar`, `area`, `ribbon`, `range`, `segment`, `rule`, `text`, `tile`. Not supported by `histogram`, `density`, `smooth`, `boxplot`, `violin` (they have their own stats). + +```ggsql +SETTING aggregate => '' -- single +SETTING aggregate => ('', '', …) -- list +``` + +Each `` is either: +- **Untargeted** — `''`. Applies to every numeric mapping without an explicit target. With two untargeted defaults, the first applies to lower-side aesthetics (`x`/`xmin`/etc.) plus all non-range layers, the second to upper-side (`xend`/`xmax`). More than two untargeted defaults is an error. +- **Targeted** — `':'`. Applies `func` to the named aesthetic only. Overrides any untargeted default for that aesthetic. 
+ +Functions: +- Standard reductions: `count`, `sum`, `prod`, `min`, `max`, `range` (max−min), `mid` ((min+max)/2), `mean`, `median`, `geomean`, `harmean`, `rms`, `sdev`, `var`, `iqr`, `se`, `p05`–`p95`. +- Positional (rely on upstream `ORDER BY` for deterministic order): `first`, `last`, `diff` (last − first). +- Band: `±[]`, e.g. `'mean+1.96sdev'`, `'median-iqr'`. Offsets: `mean`, `median`, `geomean`, `harmean`, `rms`, `sum`, `prod`, `min`, `max`, `mid`, `p05`–`p95`. Expansions: `sdev`, `se`, `var`, `iqr`, `range`. + +**Explosion** — targeting the same aesthetic with multiple functions emits one row per function per group. A synthetic `aggregate` column tags each row with the function name. Use `REMAPPING aggregate AS ` to drive another aesthetic from it. When several aesthetics are exploded with the same length, they explode in lockstep (row 1 = each target's first function, row 2 = second, …); single-function targets are reused on every row. Mixing target lengths > 1 is an error. + +```ggsql +-- min/max envelope as two lines per group, coloured by function +DRAW line + MAPPING Date AS x, Temp AS y + REMAPPING aggregate AS color + SETTING aggregate => ('y:min', 'y:max') + PARTITION BY Year +``` + +**Scale interaction** — for an aesthetic that is *targeted* by aggregate, `SCALE BINNED ` runs **after** aggregation (otherwise the diff/mean/etc. would cancel within a bin). Untargeted `SCALE BINNED` still bins pre-aggregate so the bins can drive grouping. Continuous censoring (`SCALE FROM (lo, hi)`) and discrete OOB filtering defer to post-aggregate whenever the aesthetic is being aggregated (targeted or untargeted default). + ### FILTER SQL WHERE condition applied to layer data. 
Content is passed to the database: @@ -461,6 +490,21 @@ VISUALISE DRAW line MAPPING Date AS x, value AS y, 'Temperature' AS color FROM temps DRAW point MAPPING Date AS x, value AS y, 'Ozone' AS color FROM ozone SCALE x VIA date + +-- Per-week summary: open/close range, weekly temperature change (binned post-aggregate) +VISUALISE Date AS x, Temp AS ymin, Temp AS ymax, Temp AS color + FROM ggsql:airquality +DRAW range + SETTING aggregate => ('x:first', 'ymin:first', 'ymax:last', 'color:diff'), + width => null + PARTITION BY Week +SCALE BINNED color + +-- Mean ± 1.96·sdev band per group, drawn as a ribbon +VISUALISE Day AS x, Temp AS ymin, Temp AS ymax FROM ggsql:airquality +DRAW ribbon + SETTING aggregate => ('mean-1.96sdev', 'mean+1.96sdev') + PARTITION BY Month ``` --- diff --git a/src/execute/layer.rs b/src/execute/layer.rs index 6af5c6418..7a75a1d48 100644 --- a/src/execute/layer.rs +++ b/src/execute/layer.rs @@ -187,11 +187,16 @@ pub fn apply_remappings_post_query(df: DataFrame, layer: &Layer) -> Result = df .get_column_names() .into_iter() - .filter(|name| naming::is_stat_column(name)) + .filter(|name| { + naming::is_stat_column(name) && !layer.partition_by.contains(&name.to_string()) + }) .collect(); if !stat_cols.is_empty() { df = df.drop_many(&stat_cols)?; @@ -271,10 +276,24 @@ pub fn apply_pre_stat_transform( aesthetic_schema: &Schema, scales: &[Scale], dialect: &dyn SqlDialect, + aesthetic_ctx: &AestheticContext, ) -> String { let mut transform_exprs: Vec<(String, String)> = vec![]; let mut transformed_columns: HashSet = HashSet::new(); + // When a layer has `aggregate => …`, scale-driven rewrites are deferred to + // after the stat for aesthetics where running them up-front would defeat + // the aggregate. The post-stat machinery (`apply_post_stat_binning`, + // `apply_scale_oob`) picks the deferred ones up against the aggregated + // values in the materialised DataFrame. 
+ let agg_buckets = crate::plot::layer::geom::stat_aggregate::aggregated_aesthetics( + &layer.parameters, + &layer.mappings, + aesthetic_schema, + aesthetic_ctx, + layer.geom.aggregate_domain_aesthetics().unwrap_or(&[]), + ); + // Check layer mappings for aesthetics with scales that need pre-stat transformation // Handles both column mappings and literal mappings (which are injected as synthetic columns) for (aesthetic, value) in &layer.mappings.aesthetics { @@ -303,6 +322,27 @@ pub fn apply_pre_stat_transform( // Find scale for this aesthetic if let Some(scale) = scales.iter().find(|s| s.aesthetic == *aesthetic) { if let Some(ref scale_type) = scale.scale_type { + // Defer this rewrite when the layer aggregates and the scale + // semantics call for it (see post-stat machinery for how the + // deferred rewrite actually runs). `Binned` only defers when + // the aesthetic is *explicitly* targeted (untargeted Binned + // still drives meaningful pre-stat grouping); OOB-flavoured + // rewrites defer whenever the aesthetic is being aggregated. 
+ if let Some((ref targeted, ref aggregated)) = agg_buckets { + use crate::plot::scale::ScaleTypeKind; + let kind = scale_type.scale_type_kind(); + let defer = match kind { + ScaleTypeKind::Binned => targeted.contains(aesthetic), + ScaleTypeKind::Continuous + | ScaleTypeKind::Discrete + | ScaleTypeKind::Ordinal => aggregated.contains(aesthetic), + ScaleTypeKind::Identity => false, + }; + if defer { + continue; + } + } + // Get pre-stat SQL transformation from scale type (if applicable) // Each scale type's pre_stat_transform_sql() returns None if not applicable if let Some(sql) = @@ -436,6 +476,7 @@ pub fn apply_layer_transforms( scales: &[Scale], dialect: &dyn SqlDialect, execute_query: &F, + aesthetic_ctx: &AestheticContext, ) -> Result where F: Fn(&str) -> Result, @@ -482,6 +523,7 @@ where &aesthetic_schema, scales, dialect, + aesthetic_ctx, ); // Build group_by columns from partition_by @@ -511,6 +553,7 @@ where &layer.parameters, execute_query, dialect, + aesthetic_ctx, )?; // Flip user remappings BEFORE merging defaults for Transposed orientation. @@ -584,6 +627,37 @@ where layer.mappings.aesthetics.remove(aes); } + // Auto-remap stat columns whose names match aesthetics that were + // consumed by the stat (e.g. Aggregate's per-aesthetic outputs). The + // geom can't list these in `default_remappings` because the set of + // mapped aesthetics is dynamic per layer. + for stat in &stat_columns { + if final_remappings.contains_key(stat) { + continue; + } + if consumed_aesthetics.contains(stat) { + final_remappings.insert(stat.clone(), stat.clone()); + } + } + + // The synthetic `aggregate` stat column produced by an exploded + // Aggregate stat tags each row with its function name. For mark + // types that connect rows within a group (line, area, path, + // polygon) we add this column to `layer.partition_by` so e.g. + // `aggregate => ('y:min', 'y:max')` renders as two separate lines + // rather than one zigzag through both. 
Resolves to the post-rename + // data-column name: if the user remapped `aggregate AS `, the + // prefixed aesthetic column; otherwise the stat column. + if stat_columns.iter().any(|s| s == "aggregate") { + let partition_col = match final_remappings.get("aggregate") { + Some(aes) => naming::aesthetic_column(aes), + None => naming::stat_column("aggregate"), + }; + if !layer.partition_by.contains(&partition_col) { + layer.partition_by.push(partition_col); + } + } + // Apply stat_columns to layer aesthetics using the remappings for stat in &stat_columns { if let Some(aesthetic) = final_remappings.get(stat) { diff --git a/src/execute/mod.rs b/src/execute/mod.rs index 41f6f1ee9..04963f94b 100644 --- a/src/execute/mod.rs +++ b/src/execute/mod.rs @@ -124,24 +124,38 @@ fn validate( } } - // Validate remapping source columns are valid stat columns for this geom + // Validate remapping source columns are valid stat columns for this geom. + // Geoms that opt into the Aggregate stat (`supports_aggregate`) also accept + // `aggregate`, `count`, and any position aesthetic name as a stat source. 
let valid_stat_columns = layer.geom.valid_stat_columns(); + let supports_aggregate = layer.geom.supports_aggregate(); for stat_value in layer.remappings.aesthetics.values() { if let Some(stat_col) = stat_value.column_name() { - if !valid_stat_columns.contains(&stat_col) { - if valid_stat_columns.is_empty() { + let is_aggregate_stat_col = supports_aggregate + && (stat_col == "aggregate" + || stat_col == "count" + || crate::plot::aesthetic::is_position_aesthetic(stat_col)); + if !valid_stat_columns.contains(&stat_col) && !is_aggregate_stat_col { + if valid_stat_columns.is_empty() && !supports_aggregate { return Err(GgsqlError::ValidationError(format!( "Layer {}: REMAPPING not supported for geom '{}' (no stat transform)", idx + 1, layer.geom ))); } else { + let mut valid: Vec = + valid_stat_columns.iter().map(|s| s.to_string()).collect(); + if supports_aggregate { + valid.push("aggregate".to_string()); + valid.push("count".to_string()); + } + let valid_refs: Vec<&str> = valid.iter().map(|s| s.as_str()).collect(); return Err(GgsqlError::ValidationError(format!( "Layer {}: REMAPPING references unknown stat column '{}'. Valid stat columns for geom '{}' are: {}", idx + 1, stat_col, layer.geom, - crate::and_list(valid_stat_columns) + crate::and_list(&valid_refs) ))); } } @@ -842,9 +856,30 @@ fn add_discrete_columns_to_partition_by( // Build set of excluded aesthetics that should not trigger auto-grouping: // - Stat-consumed aesthetics (transformed, not grouped) // - 'label' aesthetic (text content to display, not grouping categories) + // — except when `aggregate` is set on the layer, in which case label + // becomes a legitimate grouping key (e.g. "mean per species, place + // species name at the centroid"). 
let consumed_aesthetics = layer.geom.stat_consumed_aesthetics(); let mut excluded_aesthetics: HashSet<&str> = consumed_aesthetics.iter().copied().collect(); - excluded_aesthetics.insert("label"); + if !crate::plot::layer::geom::has_aggregate_param(&layer.parameters) { + excluded_aesthetics.insert("label"); + } + + // When aggregate is active, an explicitly-targeted Binned aesthetic + // shouldn't auto-promote to a group key — the user is summarising the + // raw values and the binning runs post-stat against the aggregate + // output. Untargeted Binned still groups, so binning can drive + // meaningful aggregation buckets in the common case. + let agg_targeted: HashSet = + crate::plot::layer::geom::stat_aggregate::aggregated_aesthetics( + &layer.parameters, + &layer.mappings, + schema, + aesthetic_ctx, + layer.geom.aggregate_domain_aesthetics().unwrap_or(&[]), + ) + .map(|(t, _)| t) + .unwrap_or_default(); for (aesthetic, value) in &layer.mappings.aesthetics { // Skip position aesthetics - these should not trigger auto-grouping. @@ -877,9 +912,8 @@ fn add_discrete_columns_to_partition_by( let is_discrete = if let Some(scale) = scale_map.get(primary_aes) { if let Some(ref scale_type) = scale.scale_type { match scale_type.scale_type_kind() { - ScaleTypeKind::Discrete - | ScaleTypeKind::Binned - | ScaleTypeKind::Ordinal => true, + ScaleTypeKind::Discrete | ScaleTypeKind::Ordinal => true, + ScaleTypeKind::Binned => !agg_targeted.contains(aesthetic), ScaleTypeKind::Continuous => false, ScaleTypeKind::Identity => discrete_columns.contains(col), } @@ -1329,6 +1363,7 @@ pub fn prepare_data_with_reader(query: &str, reader: &dyn Reader) -> Result Result<()> { let aesthetic_ctx = spec.get_aesthetic_context(); + // Per-layer set of aesthetics that the aggregate stat *explicitly targets*. 
+ // Targeted aesthetics had their pre-stat binning deferred (see + // `apply_pre_stat_transform`), so the materialised DataFrame still holds + // raw aggregate output for them — we need to bin those columns here. + // Untargeted aesthetics were binned pre-stat and the SQL `CASE WHEN` is + // already baked into the column, so the existing `__ggsql_aes_*` skip + // still applies for those. + let targeted_per_layer: Vec> = spec + .layers + .iter() + .map(|layer| { + crate::plot::layer::geom::stat_aggregate::targeted_aesthetics( + &layer.parameters, + &layer.mappings, + &aesthetic_ctx, + ) + }) + .collect(); + for scale in &spec.scales { // Only process Binned scales match &scale.scale_type { @@ -177,32 +196,49 @@ pub fn apply_post_stat_binning( _ => true, }; - // Find columns for this aesthetic across layers - let column_sources = find_columns_for_aesthetic_with_sources( - &spec.layers, - &scale.aesthetic, - data_map, - &aesthetic_ctx, - ); + // Walk layers directly so we can decide per-layer whether an + // aesthetic-named column was deferred (needs binning here) or + // already binned upstream by the pre-stat SQL. + let aesthetics_to_check = aesthetic_ctx + .internal_position_family(&scale.aesthetic) + .map(|f| f.to_vec()) + .unwrap_or_else(|| vec![scale.aesthetic.clone()]); - // Apply binning to each column - for (data_key, col_name) in column_sources { - if let Some(df) = data_map.get(&data_key) { - // Skip if column doesn't exist in this data source + for (idx, layer) in spec.layers.iter().enumerate() { + let data_key = naming::layer_key(idx); + if !data_map.contains_key(&data_key) { + continue; + } + + for aes_name in &aesthetics_to_check { + let col_name = match layer.mappings.get(aes_name) { + Some(crate::AestheticValue::Column { name, .. 
}) => name.clone(), + _ => continue, + }; + + let df = match data_map.get(&data_key) { + Some(d) => d, + None => continue, + }; if df.column(&col_name).is_err() { continue; } - // Skip post-stat binning for aesthetic columns (like __ggsql_aes_x__) - // because pre_stat_transform already binned them via SQL. - // Post-stat binning only applies to stat columns or remapped aesthetics. - if naming::is_aesthetic_column(&col_name) { + // Skip post-stat binning for aesthetic columns that were + // already binned via pre_stat_transform's CASE WHEN. The + // exception is when the layer's aggregate explicitly targets + // this aesthetic — in that case binning was deferred and the + // column holds the raw aggregate output that needs binning + // now. + if naming::is_aesthetic_column(&col_name) + && !targeted_per_layer[idx].contains(aes_name) + { continue; } let binned_df = apply_binning_to_dataframe(df, &col_name, &break_values, closed_left)?; - data_map.insert(data_key, binned_df); + data_map.insert(data_key.clone(), binned_df); } } } @@ -489,6 +525,22 @@ pub fn apply_pre_stat_resolve(spec: &mut Plot, layer_schemas: &[Schema]) -> Resu let aesthetic_ctx = spec.get_aesthetic_context(); + // Aesthetics that any layer's `aggregate` setting explicitly targets. Their + // BINNED scales must be resolved post-stat — the relevant column range is + // the aggregated output, not the raw input. Leaving them un-resolved here + // means `resolved == false` and `resolve_scales` will pick them up after + // the data is materialised. 
+ let mut targeted_in_any_layer: HashSet = HashSet::new(); + for layer in &spec.layers { + for aes in crate::plot::layer::geom::stat_aggregate::targeted_aesthetics( + &layer.parameters, + &layer.mappings, + &aesthetic_ctx, + ) { + targeted_in_any_layer.insert(aes); + } + } + for scale in &mut spec.scales { // Only pre-resolve Binned scales let scale_type = match &scale.scale_type { @@ -496,6 +548,12 @@ pub fn apply_pre_stat_resolve(spec: &mut Plot, layer_schemas: &[Schema]) -> Resu _ => continue, }; + // Defer resolution for aesthetics targeted by aggregate so breaks + // come from the post-stat range. + if targeted_in_any_layer.contains(&scale.aesthetic) { + continue; + } + // Find all ColumnInfos for this aesthetic from schemas let column_infos = find_schema_columns_for_aesthetic( &spec.layers, diff --git a/src/naming.rs b/src/naming.rs index a25cbdc49..5aca72937 100644 --- a/src/naming.rs +++ b/src/naming.rs @@ -240,6 +240,22 @@ pub fn quote_ident(name: &str) -> String { format!("\"{}\"", name.replace('"', "\"\"")) } +/// Quote a SQL string literal: wraps in single quotes and escapes embedded +/// single quotes by doubling them, per the SQL standard. +/// +/// Use this when interpolating user-supplied (or otherwise variable) string +/// values into SQL literals — e.g. `WHERE name = 'O''Brien'`. 
+/// +/// # Example +/// ``` +/// use ggsql::naming; +/// assert_eq!(naming::quote_literal("foo"), "'foo'"); +/// assert_eq!(naming::quote_literal("O'Brien"), "'O''Brien'"); +/// ``` +pub fn quote_literal(s: &str) -> String { + format!("'{}'", s.replace('\'', "''")) +} + // ============================================================================ // Detection Functions // ============================================================================ diff --git a/src/plot/layer/geom/area.rs b/src/plot/layer/geom/area.rs index a9df6bffd..6fc357063 100644 --- a/src/plot/layer/geom/area.rs +++ b/src/plot/layer/geom/area.rs @@ -3,10 +3,11 @@ use crate::plot::layer::orientation::{ALIGNED, ORIENTATION_VALUES}; use crate::plot::types::DefaultAestheticValue; use crate::plot::{DefaultParamValue, ParamDefinition}; -use crate::{naming, Mappings}; +use crate::Mappings; -use super::types::{ParamConstraint, POSITION_VALUES}; -use super::{DefaultAesthetics, GeomTrait, GeomType, StatResult}; +use super::stat_aggregate; +use super::types::{wrap_with_order_by, ParamConstraint, POSITION_VALUES}; +use super::{has_aggregate_param, DefaultAesthetics, GeomTrait, GeomType, StatResult}; /// Area geom - filled area charts #[derive(Debug, Clone, Copy)] @@ -50,10 +51,15 @@ impl GeomTrait for Area { default: DefaultParamValue::String(ALIGNED), constraint: ParamConstraint::string_option(ORIENTATION_VALUES), }, + super::types::AGGREGATE_PARAM, ]; PARAMS } + fn aggregate_domain_aesthetics(&self) -> Option<&'static [&'static str]> { + Some(&["pos1"]) + } + fn needs_stat_transform(&self, _aesthetics: &Mappings) -> bool { true } @@ -61,21 +67,31 @@ impl GeomTrait for Area { fn apply_stat_transform( &self, query: &str, - _schema: &crate::plot::Schema, - _aesthetics: &Mappings, - _group_by: &[String], - _parameters: &std::collections::HashMap, + schema: &crate::plot::Schema, + aesthetics: &Mappings, + group_by: &[String], + parameters: &std::collections::HashMap, _execute_query: &dyn Fn(&str) -> 
crate::Result, - _dialect: &dyn crate::reader::SqlDialect, + dialect: &dyn crate::reader::SqlDialect, + aesthetic_ctx: &crate::plot::aesthetic::AestheticContext, ) -> crate::Result { - // Area geom needs ordering by pos1 (domain axis) for proper rendering - let order_col = naming::aesthetic_column("pos1"); - Ok(StatResult::Transformed { - query: format!("{} ORDER BY {}", query, naming::quote_ident(&order_col)), - stat_columns: vec![], - dummy_columns: vec![], - consumed_aesthetics: vec![], - }) + let result = if has_aggregate_param(parameters) { + stat_aggregate::apply( + query, + schema, + aesthetics, + group_by, + parameters, + dialect, + aesthetic_ctx, + self.aggregate_domain_aesthetics().unwrap_or(&[]), + )? + } else { + StatResult::Identity + }; + // Area needs ordering by pos1 (domain axis) for proper rendering, in both + // the Identity and Aggregate paths. + Ok(wrap_with_order_by(query, result, "pos1")) } } diff --git a/src/plot/layer/geom/arrow.rs b/src/plot/layer/geom/arrow.rs index 375d97543..ccfb961b9 100644 --- a/src/plot/layer/geom/arrow.rs +++ b/src/plot/layer/geom/arrow.rs @@ -32,13 +32,20 @@ impl GeomTrait for Arrow { } fn default_params(&self) -> &'static [ParamDefinition] { - const PARAMS: &[ParamDefinition] = &[ParamDefinition { - name: "position", - default: DefaultParamValue::String("identity"), - constraint: ParamConstraint::string_option(POSITION_VALUES), - }]; + const PARAMS: &[ParamDefinition] = &[ + ParamDefinition { + name: "position", + default: DefaultParamValue::String("identity"), + constraint: ParamConstraint::string_option(POSITION_VALUES), + }, + super::types::AGGREGATE_PARAM, + ]; PARAMS } + + fn aggregate_domain_aesthetics(&self) -> Option<&'static [&'static str]> { + Some(&[]) + } } impl std::fmt::Display for Arrow { diff --git a/src/plot/layer/geom/bar.rs b/src/plot/layer/geom/bar.rs index d64bce9f8..211e89a08 100644 --- a/src/plot/layer/geom/bar.rs +++ b/src/plot/layer/geom/bar.rs @@ -3,10 +3,11 @@ use 
std::collections::HashMap; use std::collections::HashSet; +use super::stat_aggregate; use super::types::{get_column_name, POSITION_VALUES}; use super::{ - DefaultAesthetics, DefaultParamValue, GeomTrait, GeomType, ParamConstraint, ParamDefinition, - StatResult, + has_aggregate_param, DefaultAesthetics, DefaultParamValue, GeomTrait, GeomType, + ParamConstraint, ParamDefinition, StatResult, }; use crate::naming; use crate::plot::types::{DefaultAestheticValue, ParameterValue}; @@ -71,6 +72,7 @@ impl GeomTrait for Bar { default: DefaultParamValue::String("stack"), constraint: ParamConstraint::string_option(POSITION_VALUES), }, + super::types::AGGREGATE_PARAM, ]; PARAMS } @@ -79,6 +81,10 @@ impl GeomTrait for Bar { &["pos1", "pos2", "weight"] } + fn aggregate_domain_aesthetics(&self) -> Option<&'static [&'static str]> { + Some(&[]) + } + fn needs_stat_transform(&self, _aesthetics: &Mappings) -> bool { true // Bar stat decides COUNT vs identity based on y mapping } @@ -89,10 +95,23 @@ impl GeomTrait for Bar { schema: &Schema, aesthetics: &Mappings, group_by: &[String], - _parameters: &HashMap, + parameters: &HashMap, _execute_query: &dyn Fn(&str) -> Result, - _dialect: &dyn SqlDialect, + dialect: &dyn SqlDialect, + aesthetic_ctx: &crate::plot::aesthetic::AestheticContext, ) -> Result { + if has_aggregate_param(parameters) { + return stat_aggregate::apply( + query, + schema, + aesthetics, + group_by, + parameters, + dialect, + aesthetic_ctx, + self.aggregate_domain_aesthetics().unwrap_or(&[]), + ); + } stat_bar_count(query, schema, aesthetics, group_by) } } diff --git a/src/plot/layer/geom/boxplot.rs b/src/plot/layer/geom/boxplot.rs index fdc7bae6b..5d99b358c 100644 --- a/src/plot/layer/geom/boxplot.rs +++ b/src/plot/layer/geom/boxplot.rs @@ -95,6 +95,7 @@ impl GeomTrait for Boxplot { parameters: &HashMap, _execute_query: &dyn Fn(&str) -> Result, dialect: &dyn SqlDialect, + _aesthetic_ctx: &crate::plot::aesthetic::AestheticContext, ) -> Result { stat_boxplot(query, 
aesthetics, group_by, parameters, dialect) } diff --git a/src/plot/layer/geom/density.rs b/src/plot/layer/geom/density.rs index c33970b12..3fe62f9af 100644 --- a/src/plot/layer/geom/density.rs +++ b/src/plot/layer/geom/density.rs @@ -111,6 +111,7 @@ impl GeomTrait for Density { parameters: &std::collections::HashMap, _execute_query: &dyn Fn(&str) -> crate::Result, dialect: &dyn SqlDialect, + _aesthetic_ctx: &crate::plot::aesthetic::AestheticContext, ) -> crate::Result { // Density geom: no tails limit (don't set tails parameter, defaults to None) stat_density( diff --git a/src/plot/layer/geom/histogram.rs b/src/plot/layer/geom/histogram.rs index 66400e562..bfb800502 100644 --- a/src/plot/layer/geom/histogram.rs +++ b/src/plot/layer/geom/histogram.rs @@ -97,6 +97,7 @@ impl GeomTrait for Histogram { parameters: &HashMap, execute_query: &dyn Fn(&str) -> Result, dialect: &dyn SqlDialect, + _aesthetic_ctx: &crate::plot::aesthetic::AestheticContext, ) -> Result { stat_histogram( query, diff --git a/src/plot/layer/geom/line.rs b/src/plot/layer/geom/line.rs index a8ded3b18..624034586 100644 --- a/src/plot/layer/geom/line.rs +++ b/src/plot/layer/geom/line.rs @@ -1,12 +1,14 @@ //! 
Line geom implementation +use super::stat_aggregate; +use super::types::wrap_with_order_by; use super::{ - DefaultAesthetics, DefaultParamValue, GeomTrait, GeomType, ParamConstraint, ParamDefinition, - StatResult, + has_aggregate_param, DefaultAesthetics, DefaultParamValue, GeomTrait, GeomType, + ParamConstraint, ParamDefinition, StatResult, }; use crate::plot::layer::orientation::{ALIGNED, ORIENTATION_VALUES}; use crate::plot::types::DefaultAestheticValue; -use crate::{naming, Mappings}; +use crate::Mappings; /// Line geom - line charts with connected points #[derive(Debug, Clone, Copy)] @@ -31,14 +33,21 @@ impl GeomTrait for Line { } fn default_params(&self) -> &'static [ParamDefinition] { - const PARAMS: &[ParamDefinition] = &[ParamDefinition { - name: "orientation", - default: DefaultParamValue::String(ALIGNED), - constraint: ParamConstraint::string_option(ORIENTATION_VALUES), - }]; + const PARAMS: &[ParamDefinition] = &[ + ParamDefinition { + name: "orientation", + default: DefaultParamValue::String(ALIGNED), + constraint: ParamConstraint::string_option(ORIENTATION_VALUES), + }, + super::types::AGGREGATE_PARAM, + ]; PARAMS } + fn aggregate_domain_aesthetics(&self) -> Option<&'static [&'static str]> { + Some(&["pos1"]) + } + fn needs_stat_transform(&self, _aesthetics: &Mappings) -> bool { true } @@ -46,21 +55,31 @@ impl GeomTrait for Line { fn apply_stat_transform( &self, query: &str, - _schema: &crate::plot::Schema, - _aesthetics: &Mappings, - _group_by: &[String], - _parameters: &std::collections::HashMap, + schema: &crate::plot::Schema, + aesthetics: &Mappings, + group_by: &[String], + parameters: &std::collections::HashMap, _execute_query: &dyn Fn(&str) -> crate::Result, - _dialect: &dyn crate::reader::SqlDialect, + dialect: &dyn crate::reader::SqlDialect, + aesthetic_ctx: &crate::plot::aesthetic::AestheticContext, ) -> crate::Result { - // Line geom needs ordering by pos1 (domain axis) for proper rendering - let order_col = 
naming::aesthetic_column("pos1"); - Ok(StatResult::Transformed { - query: format!("{} ORDER BY {}", query, naming::quote_ident(&order_col)), - stat_columns: vec![], - dummy_columns: vec![], - consumed_aesthetics: vec![], - }) + let result = if has_aggregate_param(parameters) { + stat_aggregate::apply( + query, + schema, + aesthetics, + group_by, + parameters, + dialect, + aesthetic_ctx, + self.aggregate_domain_aesthetics().unwrap_or(&[]), + )? + } else { + StatResult::Identity + }; + // Line needs ordering by pos1 (domain axis) for proper rendering, in both + // the Identity and Aggregate paths. + Ok(wrap_with_order_by(query, result, "pos1")) } } diff --git a/src/plot/layer/geom/mod.rs b/src/plot/layer/geom/mod.rs index 42069abb7..74004da5c 100644 --- a/src/plot/layer/geom/mod.rs +++ b/src/plot/layer/geom/mod.rs @@ -44,6 +44,7 @@ mod rule; mod segment; mod smooth; mod spatial; +pub(crate) mod stat_aggregate; mod text; mod tile; mod violin; @@ -74,6 +75,7 @@ pub use text::Text; pub use tile::Tile; pub use violin::Violin; +use crate::plot::aesthetic::AestheticContext; use crate::plot::types::{ParameterValue, Schema}; use crate::reader::SqlDialect; @@ -196,20 +198,62 @@ pub trait GeomTrait: std::fmt::Debug + std::fmt::Display + Send + Sync { false } + /// Whether the Aggregate stat applies to this geom, and which aesthetics + /// stay as group keys when it does. + /// + /// - `None` — geom doesn't accept the `aggregate` SETTING. Used by the + /// statistical geoms (`histogram`, `density`, `smooth`, `boxplot`, + /// `violin`) that have their own bespoke stats. + /// - `Some(&[])` — geom opts in; the stat groups by discrete mappings + + /// `PARTITION BY` only. Most non-statistical geoms. + /// - `Some(&[, …])` — geom opts in *and* pins the listed aesthetics + /// as group keys regardless of their column's continuity. Used by + /// `line`/`area`/`ribbon` (domain axis) and `tile` (every spatial slot). 
+ /// + /// `supports_aggregate()` is derived from this; geoms only override one + /// method to opt in. + fn aggregate_domain_aesthetics(&self) -> Option<&'static [&'static str]> { + None + } + + /// Whether this geom accepts the `aggregate` SETTING parameter. + /// Derived from `aggregate_domain_aesthetics`; do not override. + fn supports_aggregate(&self) -> bool { + self.aggregate_domain_aesthetics().is_some() + } + /// Apply statistical transformation to the layer query. /// - /// The default implementation returns identity (no transformation). + /// The default implementation dispatches to the Aggregate stat when + /// `supports_aggregate()` is true and the `aggregate` parameter is set; + /// otherwise returns identity (no transformation). #[allow(clippy::too_many_arguments)] fn apply_stat_transform( &self, - _query: &str, - _schema: &Schema, - _aesthetics: &Mappings, - _group_by: &[String], - _parameters: &HashMap, + query: &str, + schema: &Schema, + aesthetics: &Mappings, + group_by: &[String], + parameters: &HashMap, _execute_query: &dyn Fn(&str) -> Result, - _dialect: &dyn SqlDialect, + dialect: &dyn SqlDialect, + aesthetic_ctx: &AestheticContext, ) -> Result { + if let (Some(domain), true) = ( + self.aggregate_domain_aesthetics(), + has_aggregate_param(parameters), + ) { + return stat_aggregate::apply( + query, + schema, + aesthetics, + group_by, + parameters, + dialect, + aesthetic_ctx, + domain, + ); + } Ok(StatResult::Identity) } @@ -256,6 +300,14 @@ pub trait GeomTrait: std::fmt::Debug + std::fmt::Display + Send + Sync { } } +/// True when `parameters["aggregate"]` is set to a non-null string or array. 
+pub(crate) fn has_aggregate_param(parameters: &HashMap) -> bool { + matches!( + parameters.get("aggregate"), + Some(ParameterValue::String(_)) | Some(ParameterValue::Array(_)) + ) +} + /// Wrapper struct for geom trait objects /// /// This provides a convenient interface for working with geoms while hiding @@ -430,6 +482,7 @@ impl Geom { parameters: &HashMap, execute_query: &dyn Fn(&str) -> Result, dialect: &dyn SqlDialect, + aesthetic_ctx: &AestheticContext, ) -> Result { self.0.apply_stat_transform( query, @@ -439,6 +492,7 @@ impl Geom { parameters, execute_query, dialect, + aesthetic_ctx, ) } @@ -465,6 +519,18 @@ impl Geom { self.0.valid_settings() } + /// Whether this geom accepts the `aggregate` SETTING parameter. + pub fn supports_aggregate(&self) -> bool { + self.0.supports_aggregate() + } + + /// Aesthetics the Aggregate stat must keep as group keys rather than + /// aggregating, even if their bound column is continuous. `None` when + /// the geom doesn't accept the `aggregate` setting. 
+ pub fn aggregate_domain_aesthetics(&self) -> Option<&'static [&'static str]> { + self.0.aggregate_domain_aesthetics() + } + /// Validate aesthetic mappings pub fn validate_aesthetics(&self, mappings: &Mappings) -> std::result::Result<(), String> { self.0.validate_aesthetics(mappings) diff --git a/src/plot/layer/geom/point.rs b/src/plot/layer/geom/point.rs index 3dafde2a7..f6b454c9e 100644 --- a/src/plot/layer/geom/point.rs +++ b/src/plot/layer/geom/point.rs @@ -31,13 +31,20 @@ impl GeomTrait for Point { } fn default_params(&self) -> &'static [ParamDefinition] { - const PARAMS: &[ParamDefinition] = &[ParamDefinition { - name: "position", - default: DefaultParamValue::String("identity"), - constraint: ParamConstraint::string_option(POSITION_VALUES), - }]; + const PARAMS: &[ParamDefinition] = &[ + ParamDefinition { + name: "position", + default: DefaultParamValue::String("identity"), + constraint: ParamConstraint::string_option(POSITION_VALUES), + }, + super::types::AGGREGATE_PARAM, + ]; PARAMS } + + fn aggregate_domain_aesthetics(&self) -> Option<&'static [&'static str]> { + Some(&[]) + } } impl std::fmt::Display for Point { diff --git a/src/plot/layer/geom/range.rs b/src/plot/layer/geom/range.rs index 3fee68150..d547187b6 100644 --- a/src/plot/layer/geom/range.rs +++ b/src/plot/layer/geom/range.rs @@ -41,9 +41,14 @@ impl GeomTrait for Range { default: DefaultParamValue::Number(10.0), constraint: ParamConstraint::number_min(0.0), }, + super::types::AGGREGATE_PARAM, ]; PARAMS } + + fn aggregate_domain_aesthetics(&self) -> Option<&'static [&'static str]> { + Some(&[]) + } } impl std::fmt::Display for Range { diff --git a/src/plot/layer/geom/ribbon.rs b/src/plot/layer/geom/ribbon.rs index 87d4636c1..5b3ca13a3 100644 --- a/src/plot/layer/geom/ribbon.rs +++ b/src/plot/layer/geom/ribbon.rs @@ -1,10 +1,11 @@ //! 
Ribbon geom implementation -use super::types::POSITION_VALUES; -use super::{DefaultAesthetics, GeomTrait, GeomType, StatResult}; +use super::stat_aggregate; +use super::types::{wrap_with_order_by, POSITION_VALUES}; +use super::{has_aggregate_param, DefaultAesthetics, GeomTrait, GeomType, StatResult}; use crate::plot::types::DefaultAestheticValue; use crate::plot::{DefaultParamValue, ParamConstraint, ParamDefinition}; -use crate::{naming, Mappings}; +use crate::Mappings; /// Ribbon geom - confidence bands and ranges #[derive(Debug, Clone, Copy)] @@ -31,14 +32,21 @@ impl GeomTrait for Ribbon { } fn default_params(&self) -> &'static [ParamDefinition] { - const PARAMS: &[ParamDefinition] = &[ParamDefinition { - name: "position", - default: DefaultParamValue::String("identity"), - constraint: ParamConstraint::string_option(POSITION_VALUES), - }]; + const PARAMS: &[ParamDefinition] = &[ + ParamDefinition { + name: "position", + default: DefaultParamValue::String("identity"), + constraint: ParamConstraint::string_option(POSITION_VALUES), + }, + super::types::AGGREGATE_PARAM, + ]; PARAMS } + fn aggregate_domain_aesthetics(&self) -> Option<&'static [&'static str]> { + Some(&["pos1"]) + } + fn needs_stat_transform(&self, _aesthetics: &Mappings) -> bool { true } @@ -46,21 +54,31 @@ impl GeomTrait for Ribbon { fn apply_stat_transform( &self, query: &str, - _schema: &crate::plot::Schema, - _aesthetics: &Mappings, - _group_by: &[String], - _parameters: &std::collections::HashMap, + schema: &crate::plot::Schema, + aesthetics: &Mappings, + group_by: &[String], + parameters: &std::collections::HashMap, _execute_query: &dyn Fn(&str) -> crate::Result, - _dialect: &dyn crate::reader::SqlDialect, + dialect: &dyn crate::reader::SqlDialect, + aesthetic_ctx: &crate::plot::aesthetic::AestheticContext, ) -> crate::Result { - // Ribbon geom needs ordering by pos1 (domain axis) for proper rendering - let order_col = naming::aesthetic_column("pos1"); - Ok(StatResult::Transformed { - query: 
format!("{} ORDER BY {}", query, naming::quote_ident(&order_col)), - stat_columns: vec![], - dummy_columns: vec![], - consumed_aesthetics: vec![], - }) + let result = if has_aggregate_param(parameters) { + stat_aggregate::apply( + query, + schema, + aesthetics, + group_by, + parameters, + dialect, + aesthetic_ctx, + self.aggregate_domain_aesthetics().unwrap_or(&[]), + )? + } else { + StatResult::Identity + }; + // Ribbon needs ordering by pos1 (domain axis) for proper rendering, in both + // the Identity and Aggregate paths. + Ok(wrap_with_order_by(query, result, "pos1")) } } diff --git a/src/plot/layer/geom/rule.rs b/src/plot/layer/geom/rule.rs index be434f7a6..8549808ef 100644 --- a/src/plot/layer/geom/rule.rs +++ b/src/plot/layer/geom/rule.rs @@ -1,6 +1,6 @@ //! Rule geom implementation -use super::{DefaultAesthetics, GeomTrait, GeomType}; +use super::{DefaultAesthetics, GeomTrait, GeomType, ParamDefinition}; use crate::plot::types::DefaultAestheticValue; /// Rule geom - horizontal and vertical reference lines @@ -25,6 +25,15 @@ impl GeomTrait for Rule { } } + fn default_params(&self) -> &'static [ParamDefinition] { + const PARAMS: &[ParamDefinition] = &[super::types::AGGREGATE_PARAM]; + PARAMS + } + + fn aggregate_domain_aesthetics(&self) -> Option<&'static [&'static str]> { + Some(&[]) + } + fn validate_aesthetics(&self, mappings: &crate::Mappings) -> std::result::Result<(), String> { // Rule requires exactly one of pos1 or pos2 (XOR logic) let has_pos1 = mappings.contains_key("pos1"); diff --git a/src/plot/layer/geom/segment.rs b/src/plot/layer/geom/segment.rs index 3066d8a56..3a76dc97b 100644 --- a/src/plot/layer/geom/segment.rs +++ b/src/plot/layer/geom/segment.rs @@ -31,13 +31,20 @@ impl GeomTrait for Segment { } fn default_params(&self) -> &'static [ParamDefinition] { - const PARAMS: &[ParamDefinition] = &[ParamDefinition { - name: "position", - default: DefaultParamValue::String("identity"), - constraint: ParamConstraint::string_option(POSITION_VALUES), 
- }]; + const PARAMS: &[ParamDefinition] = &[ + ParamDefinition { + name: "position", + default: DefaultParamValue::String("identity"), + constraint: ParamConstraint::string_option(POSITION_VALUES), + }, + super::types::AGGREGATE_PARAM, + ]; PARAMS } + + fn aggregate_domain_aesthetics(&self) -> Option<&'static [&'static str]> { + Some(&[]) + } } impl std::fmt::Display for Segment { diff --git a/src/plot/layer/geom/smooth.rs b/src/plot/layer/geom/smooth.rs index 94cac7bed..c523201a4 100644 --- a/src/plot/layer/geom/smooth.rs +++ b/src/plot/layer/geom/smooth.rs @@ -100,6 +100,7 @@ impl GeomTrait for Smooth { parameters: &std::collections::HashMap, _execute_query: &dyn Fn(&str) -> crate::Result, dialect: &dyn SqlDialect, + _aesthetic_ctx: &crate::plot::aesthetic::AestheticContext, ) -> crate::Result { // Get method from parameters (validated by ParamConstraint::string_option) let ParameterValue::String(method) = parameters.get("method").unwrap() else { diff --git a/src/plot/layer/geom/spatial.rs b/src/plot/layer/geom/spatial.rs index 74ae703c6..3ce1df9a4 100644 --- a/src/plot/layer/geom/spatial.rs +++ b/src/plot/layer/geom/spatial.rs @@ -36,6 +36,7 @@ impl GeomTrait for Spatial { _parameters: &std::collections::HashMap, execute_query: &dyn Fn(&str) -> crate::Result, dialect: &dyn crate::reader::SqlDialect, + _aesthetic_ctx: &crate::plot::aesthetic::AestheticContext, ) -> crate::Result { for stmt in dialect.sql_spatial_setup() { execute_query(&stmt)?; diff --git a/src/plot/layer/geom/stat_aggregate.rs b/src/plot/layer/geom/stat_aggregate.rs new file mode 100644 index 000000000..312f68d49 --- /dev/null +++ b/src/plot/layer/geom/stat_aggregate.rs @@ -0,0 +1,2303 @@ +//! Aggregate stat — collapse each group to a single row by applying an +//! aggregate function per numeric mapping. +//! +//! When a layer's `aggregate` SETTING is set, this stat groups by discrete +//! mappings + PARTITION BY columns and emits one row per group. Each numeric +//! 
column-mapping (positional *and* material) is replaced in place by the +//! aggregated value of its bound column. Discrete mappings stay as group keys; +//! literal mappings pass through unchanged. +//! +//! # Setting shape +//! +//! `aggregate` accepts a single string or array of strings. Each string is +//! either: +//! +//! - **default** — `''` (no prefix). Up to two defaults may be supplied. +//! With one default it applies to every untargeted numeric mapping. With two +//! defaults the first applies to *lower-half* aesthetics (no suffix or `min` +//! suffix) plus all non-range geoms, and the second applies to *upper-half* +//! aesthetics (`max` or `end` suffix). More than two defaults is an error. +//! - **target** — `':'`. Applies `func` to the named aesthetic only. +//! `` is a user-facing name (`x`, `y`, `xmin`, `xmax`, `xend`, `yend`, +//! `color`, `size`, …); the stat resolves it to the internal name through +//! `AestheticContext`. +//! +//! Numeric mappings without a target *or* applicable default are dropped with +//! a warning to stderr. + +use std::collections::{HashMap, HashSet}; +use std::sync::OnceLock; + +use regex::Regex; + +use super::types::StatResult; +use crate::naming; +use crate::plot::aesthetic::AestheticContext; +use crate::plot::types::{ArrayElement, ParameterValue, Schema}; +use crate::reader::SqlDialect; +use crate::{GgsqlError, Mappings, Result}; + +/// All simple-aggregation function names accepted by the `aggregate` SETTING. +/// +/// Band names (e.g. `mean+sdev`, `median-0.5iqr`) are validated separately by +/// `parse_agg_name`, which checks the offset against `OFFSET_STATS` and the +/// expansion against `EXPANSION_STATS`. 
+pub const AGG_NAMES: &[&str] = &[ + // Tallies & sums + "count", "sum", "prod", // Extremes + "min", "max", "range", "mid", // Central tendency + "mean", "geomean", "harmean", "rms", "median", // Spread (standalone) + "sdev", "var", "iqr", // Percentiles + "p05", "p10", "p25", "p50", "p75", "p90", "p95", // Positional (row order in the group) + "first", "last", "diff", +]; + +/// Stats that can appear as the *offset* (left of `±`) in a band name like +/// `mean+sdev`. Single-value central or representative quantities only; +/// counts/spreads are excluded. +pub const OFFSET_STATS: &[&str] = &[ + "mean", "median", "geomean", "harmean", "rms", "sum", "prod", "min", "max", "mid", "p05", + "p10", "p25", "p50", "p75", "p90", "p95", +]; + +/// Stats that can appear as the *expansion* (right of `±[mod]`) in a band name. +/// Spread / dispersion measures only. +pub const EXPANSION_STATS: &[&str] = &["sdev", "se", "var", "iqr", "range"]; + +/// Parsed representation of any aggregate-function name. +/// +/// Simple aggregates (`mean`, `count`, `p25`) have `band == None`. Band names +/// (`mean+sdev`, `median-0.5iqr`) have `band == Some(...)` with the offset +/// stored in `offset` and the spread/multiplier in `band`. +#[derive(Debug, Clone, PartialEq)] +pub struct AggSpec { + pub offset: &'static str, + pub band: Option<Band>, +} + +#[derive(Debug, Clone, PartialEq)] +pub struct Band { + /// Signed multiplier on the expansion. `+1.0` corresponds to `+<expansion>`; + /// `-1.96` corresponds to `-1.96<expansion>`. The sign and magnitude are + /// folded together so there's a single source of truth. + pub mod_value: f64, + pub expansion: &'static str, +} + +fn resolve_static(name: &str, vocab: &'static [&'static str]) -> Option<&'static str> { + vocab.iter().copied().find(|v| *v == name) +} + +/// Single regex covering one `aggregate` entry: optional `<aesthetic>:` prefix, +/// required offset name, optional `±[<mod>]<expansion>` band suffix. +/// +/// Capture groups: +/// 1. 
aesthetic prefix (anything up to the first `:`; structural-only, with full +/// aesthetic resolution happening in `apply()`) +/// 2. offset name +/// 3. sign: present iff the entry has a band +/// 4. magnitude: optional, defaults to `1.0` +/// 5. expansion name +fn entry_re() -> &'static Regex { + static RE: OnceLock<Regex> = OnceLock::new(); + RE.get_or_init(|| { + Regex::new(r"^(?:([^:]+):)?([a-z]+\d*)(?:([+-])(\d+(?:\.\d+)?)?([a-z]+))?$").unwrap() + }) +} + +/// Parsed shape of a single `aggregate` array entry. +struct ParsedEntry { + /// `Some(name)` when the entry has an `<aesthetic>:` prefix; `None` for an + /// unprefixed default. Resolution to internal aesthetic names happens in + /// `apply()` via `resolve_target_aesthetic`. + aesthetic: Option<String>, + spec: AggSpec, +} + +fn parse_entry(entry: &str) -> std::result::Result<ParsedEntry, String> { + let caps = entry_re() + .captures(entry) + .ok_or_else(|| format!("could not parse aggregate entry '{}'", entry))?; + + let aesthetic = caps.get(1).map(|m| m.as_str().to_string()); + let offset_str = caps.get(2).unwrap().as_str(); + let band_present = caps.get(3).is_some(); + + let band = if band_present { + let expansion_str = caps.get(5).unwrap().as_str(); + let expansion = resolve_static(expansion_str, EXPANSION_STATS).ok_or_else(|| { + format!( + "'{}': '{}' is not a valid expansion stat. Allowed expansions: {}", + entry, + expansion_str, + crate::or_list_quoted(EXPANSION_STATS, '\''), + ) + })?; + let magnitude: f64 = caps + .get(4) + .map_or(1.0, |m| m.as_str().parse().unwrap_or(1.0)); + let mod_value = if caps.get(3).unwrap().as_str() == "-" { + -magnitude + } else { + magnitude + }; + Some(Band { + mod_value, + expansion, + }) + } else { + None + }; + + let offset = if band.is_some() { + resolve_static(offset_str, OFFSET_STATS).ok_or_else(|| { + if AGG_NAMES.contains(&offset_str) { + format!( + "'{}': '{}' is not a valid offset stat.
 Allowed offsets: {}", + entry, + offset_str, + crate::or_list_quoted(OFFSET_STATS, '\''), + ) + } else { + format!( + "'{}': '{}' is not a known stat. Allowed offsets: {}", + entry, + offset_str, + crate::or_list_quoted(OFFSET_STATS, '\''), + ) + } + })? + } else { + resolve_static(offset_str, AGG_NAMES).ok_or_else(|| { + format!( + "unknown aggregate function '{}'. Allowed: {} (or use a band like `mean+sdev`)", + offset_str, + crate::or_list_quoted(AGG_NAMES, '\''), + ) + })? + }; + + Ok(ParsedEntry { + aesthetic, + spec: AggSpec { offset, band }, + }) +} + +// ============================================================================= +// AggregateSpec: parsed representation of the `aggregate` SETTING. +// ============================================================================= + +/// Parsed `aggregate` SETTING. +/// +/// Up to two unprefixed defaults plus per-aesthetic targets. A target may be +/// named more than once; the multiple functions cause that aesthetic to +/// *explode* into multiple rows per group. +#[derive(Debug, Clone, PartialEq)] +pub struct AggregateSpec { + pub default_lower: Option<AggSpec>, + pub default_upper: Option<AggSpec>, + /// Targets in declaration order. Each entry is `(user-facing aesthetic, + /// non-empty list of functions)`. Multiple SETTING entries with the same + /// aesthetic are merged into one list during parsing; the cumulative + /// length determines that aesthetic's explosion factor. + pub targets: Vec<(String, Vec<AggSpec>)>, +} + +impl AggregateSpec { + fn new() -> Self { + Self { + default_lower: None, + default_upper: None, + targets: Vec::new(), + } + } + + /// Maximum target list length, or `1` if every target has a single function. + /// This is the number of exploded rows the stat will emit per group. + pub fn explosion_factor(&self) -> usize { + self.targets + .iter() + .map(|(_, fns)| fns.len()) + .max() + .unwrap_or(1) + .max(1) + } + + /// Per-row labels for the synthetic `aggregate` column. 
`None` for the + /// single-row case (no explosion), since the column only makes sense as a + /// row-differentiator and there's nothing to differentiate. + /// + /// For each row in `0..explosion_factor`, walks every *exploded* target + /// (length == n; length-1 recycled targets are skipped because they take + /// the same value on every row), collects each target's function name at + /// that row, deduplicates them while preserving declaration order, and + /// joins with `/`. + /// + /// Examples (with `n = 2`): + /// - `('y:min', 'y:max')` → `['min', 'max']` + /// - `('y:min', 'y:max', 'color:sum', 'color:prod')` → `['min/sum', 'max/prod']` + /// - `('y:mean', 'y:max', 'color:mean', 'color:prod')` → `['mean', 'max/prod']` + /// - `('y:min', 'y:max', 'color:median')` → `['min', 'max']` (color is recycled) + pub fn explosion_labels(&self) -> Option<Vec<String>> { + let n = self.explosion_factor(); + if n <= 1 { + return None; + } + let exploded: Vec<&Vec<AggSpec>> = self + .targets + .iter() + .filter(|(_, fns)| fns.len() == n) + .map(|(_, fns)| fns) + .collect(); + let labels = (0..n) + .map(|row| { + let mut parts: Vec<String> = Vec::new(); + for fns in &exploded { + let label = agg_label(&fns[row]); + if !parts.contains(&label) { + parts.push(label); + } + } + parts.join("/") + }) + .collect(); + Some(labels) + } +} + +/// Human-readable label for an `AggSpec`. Re-emits simple names verbatim and +/// reconstructs band names like `mean+sdev` / `mean-1.96sdev`. +fn agg_label(spec: &AggSpec) -> String { + match &spec.band { + None => spec.offset.to_string(), + Some(b) => { + let sign = if b.mod_value < 0.0 { '-' } else { '+' }; + let magnitude = b.mod_value.abs(); + if magnitude == 1.0 { + format!("{}{}{}", spec.offset, sign, b.expansion) + } else { + format!("{}{}{}{}", spec.offset, sign, magnitude, b.expansion) + } + } + } +} + +/// Parse the `aggregate` SETTING value into an `AggregateSpec`. Returns +/// `Ok(None)` when the parameter is unset, null, or empty. 
Returns `Err(...)` +/// for malformed input. +pub fn parse_aggregate_param( + value: &ParameterValue, +) -> std::result::Result<Option<AggregateSpec>, String> { + let entries: Vec<&str> = match value { + ParameterValue::Null => return Ok(None), + ParameterValue::String(s) => vec![s.as_str()], + ParameterValue::Array(arr) => { + let mut out = Vec::with_capacity(arr.len()); + for el in arr { + match el { + ArrayElement::String(s) => out.push(s.as_str()), + ArrayElement::Null => continue, + _ => { + return Err("'aggregate' array entries must be strings or null".to_string()); + } + } + } + if out.is_empty() { + return Ok(None); + } + out + } + _ => return Err("'aggregate' must be a string, array of strings, or null".to_string()), + }; + + let mut spec = AggregateSpec::new(); + for entry in entries { + let parsed = parse_entry(entry)?; + match parsed.aesthetic { + Some(aes) => { + if let Some((_, fns)) = spec.targets.iter_mut().find(|(a, _)| *a == aes) { + fns.push(parsed.spec); + } else { + spec.targets.push((aes, vec![parsed.spec])); + } + } + None => { + if spec.default_lower.is_none() { + spec.default_lower = Some(parsed.spec); + } else if spec.default_upper.is_none() { + spec.default_upper = Some(parsed.spec); + } else { + return Err(format!( + "'aggregate' accepts at most two unprefixed defaults; got a third: '{}'", + entry + )); + } + } + } + } + + if spec.default_lower.is_none() && spec.default_upper.is_none() && spec.targets.is_empty() { + return Ok(None); + } + + // Validate recycling: every target list must be length 1 or N (the max). 
+ let n = spec.explosion_factor(); + if n > 1 { + for (aes, fns) in &spec.targets { + if fns.len() != 1 && fns.len() != n { + return Err(format!( + "aggregate target '{}' has {} functions; targets in an exploded layer must \ + have either 1 or {} functions (the longest target's count)", + aes, + fns.len(), + n + )); + } + } + } + + Ok(Some(spec)) +} + +// ============================================================================= +// SQL fragment helpers (per-column aggregate expressions). +// ============================================================================= + +/// Map a percentile function name (`p05`..`p95`, `median`) to its fraction. +fn percentile_fraction(func: &str) -> Option<f64> { + match func { + "median" | "p50" => Some(0.50), + "p05" => Some(0.05), + "p10" => Some(0.10), + "p25" => Some(0.25), + "p75" => Some(0.75), + "p90" => Some(0.90), + "p95" => Some(0.95), + _ => None, + } +} + +/// Build the inline SQL fragment for a *simple* stat (no band) applied to a +/// quoted column. Returns `None` when the dialect cannot express this +/// aggregate inline. For the percentile/iqr family that means the caller +/// switches to the correlated `sql_percentile` fallback; for other names it +/// means the dialect doesn't support that function and the stat layer raises +/// a clear error before SQL is built (see `unsupported_functions`). +fn simple_stat_sql_inline(name: &str, qcol: &str, dialect: &dyn SqlDialect) -> Option<String> { + if let Some(frac) = percentile_fraction(name) { + let unquoted = unquote(qcol); + return dialect.sql_quantile_inline(&unquoted, frac); + } + if name == "iqr" { + let unquoted = unquote(qcol); + let p75 = dialect.sql_quantile_inline(&unquoted, 0.75)?; + let p25 = dialect.sql_quantile_inline(&unquoted, 0.25)?; + return Some(format!("({} - {})", p75, p25)); + } + dialect.sql_aggregate(name, qcol) +} + +/// Whether the dialect can produce SQL for this aggregate (inline or via the +/// percentile fallback). 
Used to surface a clear error before SQL is built. +fn dialect_supports(name: &str, dialect: &dyn SqlDialect) -> bool { + if percentile_fraction(name).is_some() || name == "iqr" { + // Always supported: percentile path falls back to a correlated subquery + // built from `sql_percentile`, which has a portable default. + return true; + } + dialect.sql_aggregate(name, "x").is_some() +} + +/// Walk every aggregate that will be emitted and confirm the dialect supports +/// it. Returns the list of unsupported function names, deduplicated. +fn unsupported_functions( + aggregated: &[(String, String, Vec<AggSpec>)], + dialect: &dyn SqlDialect, +) -> Vec<String> { + let mut missing: Vec<String> = Vec::new(); + for (_, _, specs) in aggregated { + for spec in specs { + for name in [Some(spec.offset), spec.band.as_ref().map(|b| b.expansion)] + .into_iter() + .flatten() + { + if !dialect_supports(name, dialect) && !missing.iter().any(|m| m == name) { + missing.push(name.to_string()); + } + } + } + } + missing +} + +fn agg_sql_inline(spec: &AggSpec, qcol: &str, dialect: &dyn SqlDialect) -> Option<String> { + let offset_sql = simple_stat_sql_inline(spec.offset, qcol, dialect)?; + match &spec.band { + None => Some(offset_sql), + Some(band) => { + let exp_sql = simple_stat_sql_inline(band.expansion, qcol, dialect)?; + Some(format_band(&offset_sql, band.mod_value, &exp_sql)) + } + } +} + +/// Format a band expression `(offset ± [magnitude *] expansion)`. The sign and +/// magnitude come folded together in `mod_value`; this splits them back out +/// only when emitting SQL so the output is readable (e.g. `(mean - 1.96 * sdev)` +/// rather than `(mean + -1.96 * sdev)`). 
+fn format_band(offset: &str, mod_value: f64, exp: &str) -> String { + let sign = if mod_value < 0.0 { '-' } else { '+' }; + let magnitude = mod_value.abs(); + if magnitude == 1.0 { + format!("({} {} {})", offset, sign, exp) + } else { + format!("({} {} {} * {})", offset, sign, magnitude, exp) + } +} + +/// Fallback SQL for a simple stat — used when a percentile component lacks +/// inline support. Emits a correlated `sql_percentile` subquery; falls +/// through to the inline form for everything else. +fn simple_stat_sql_fallback( + name: &str, + raw_col: &str, + dialect: &dyn SqlDialect, + src_alias: &str, + group_cols: &[String], +) -> String { + if let Some(frac) = percentile_fraction(name) { + return dialect.sql_percentile(raw_col, frac, src_alias, group_cols); + } + if name == "iqr" { + let p75 = dialect.sql_percentile(raw_col, 0.75, src_alias, group_cols); + let p25 = dialect.sql_percentile(raw_col, 0.25, src_alias, group_cols); + return format!("({} - {})", p75, p25); + } + let qcol = naming::quote_ident(raw_col); + simple_stat_sql_inline(name, &qcol, dialect).unwrap_or_else(|| "NULL".to_string()) +} + +fn agg_sql_fallback( + spec: &AggSpec, + raw_col: &str, + dialect: &dyn SqlDialect, + src_alias: &str, + group_cols: &[String], +) -> String { + let offset_sql = simple_stat_sql_fallback(spec.offset, raw_col, dialect, src_alias, group_cols); + match &spec.band { + None => offset_sql, + Some(band) => { + let exp_sql = + simple_stat_sql_fallback(band.expansion, raw_col, dialect, src_alias, group_cols); + format_band(&offset_sql, band.mod_value, &exp_sql) + } + } +} + +fn needs_quantile_fallback(spec: &AggSpec, probe_col: &str, dialect: &dyn SqlDialect) -> bool { + if simple_needs_fallback(spec.offset, probe_col, dialect) { + return true; + } + if let Some(band) = &spec.band { + if simple_needs_fallback(band.expansion, probe_col, dialect) { + return true; + } + } + false +} + +fn simple_needs_fallback(name: &str, probe_col: &str, dialect: &dyn SqlDialect) -> 
bool { + if let Some(frac) = percentile_fraction(name) { + return dialect.sql_quantile_inline(probe_col, frac).is_none(); + } + if name == "iqr" { + return dialect.sql_quantile_inline(probe_col, 0.5).is_none(); + } + false +} + +fn unquote(qcol: &str) -> String { + let trimmed = qcol.trim_start_matches('"').trim_end_matches('"'); + trimmed.replace("\"\"", "\"") +} + +// ============================================================================= +// apply — entry point. +// ============================================================================= + +/// Resolve a user-facing target aesthetic name to one or more internal names +/// that are actually mapped on the layer. Handles three cases: +/// 1. The name maps directly through `AestheticContext` (e.g. `y` → `pos2`). +/// 2. The name is an alias from `AESTHETIC_ALIASES` (e.g. `color` → `stroke`, +/// `fill`); each target whose internal counterpart is mapped is included. +/// 3. The name is a material aesthetic with the same internal name (e.g. `size`). +/// +/// Returns the empty vector if no resolution finds a mapped aesthetic. 
+fn resolve_target_aesthetic( + user_aes: &str, + aesthetics: &Mappings, + aesthetic_ctx: &AestheticContext, +) -> Vec<String> { + use crate::plot::layer::geom::types::AESTHETIC_ALIASES; + let mut out = Vec::new(); + if let Some(internal) = aesthetic_ctx.map_user_to_internal(user_aes) { + if aesthetics.aesthetics.contains_key(internal) { + out.push(internal.to_string()); + return out; + } + } + for (alias, targets) in AESTHETIC_ALIASES { + if *alias == user_aes { + for t in *targets { + let internal = aesthetic_ctx + .map_user_to_internal(t) + .map(|s| s.to_string()) + .unwrap_or_else(|| (*t).to_string()); + if aesthetics.aesthetics.contains_key(&internal) && !out.contains(&internal) { + out.push(internal); + } + } + return out; + } + } + if aesthetics.aesthetics.contains_key(user_aes) { + out.push(user_aes.to_string()); + } + out +} + +/// Classify an internal aesthetic name as upper-half or lower-half for the +/// purpose of default-aggregate routing. +/// +/// `min` suffix → lower; `max`/`end` → upper; no suffix → lower. Material +/// aesthetics (no position prefix) are always lower. +fn is_upper_half(internal_aes: &str) -> bool { + internal_aes.ends_with("max") || internal_aes.ends_with("end") +} + +/// Resolve every user-facing target in `spec` to its internal aesthetic +/// name(s) on the layer. Returns a map keyed by internal aesthetic name with +/// the function list each target supplies, or an error when a target doesn't +/// match any mapped aesthetic or when two targets resolve to the same +/// aesthetic. +/// +/// Shared between [`apply`] (which uses the returned map) and +/// `Layer::validate_aggregate_setting` (which discards the map and just +/// surfaces the error). Callers pass an already-parsed [`AggregateSpec`] to +/// avoid re-parsing the raw setting. 
+pub(crate) fn resolve_aggregate_targets( + spec: &AggregateSpec, + aesthetics: &Mappings, + aesthetic_ctx: &AestheticContext, +) -> std::result::Result<HashMap<String, Vec<AggSpec>>, String> { + let mut targets_internal: HashMap<String, Vec<AggSpec>> = HashMap::new(); + for (user_aes, fns) in &spec.targets { + let resolved = resolve_target_aesthetic(user_aes, aesthetics, aesthetic_ctx); + if resolved.is_empty() { + return Err(format!( + "aggregate target '{}' is not mapped on this layer", + user_aes + )); + } + for internal in resolved { + if targets_internal.contains_key(&internal) { + return Err(format!( + "aggregate target '{}' resolves to aesthetic '{}' which is already targeted", + user_aes, internal + )); + } + targets_internal.insert(internal, fns.clone()); + } + } + Ok(targets_internal) +} + +/// Compute the set of internal aesthetic names that the layer's `aggregate` +/// setting *explicitly targets*. Lighter than [`aggregated_aesthetics`] +/// (it doesn't need a schema), so post-stat callers can use it without rebuilding +/// type information from a materialised DataFrame. +pub fn targeted_aesthetics( + parameters: &HashMap<String, ParameterValue>, + aesthetics: &Mappings, + aesthetic_ctx: &AestheticContext, +) -> HashSet<String> { + let raw = match parameters.get("aggregate") { + Some(v) if !matches!(v, ParameterValue::Null) => v, + _ => return HashSet::new(), + }; + let spec = match parse_aggregate_param(raw).ok().flatten() { + Some(s) => s, + None => return HashSet::new(), + }; + let mut targeted: HashSet<String> = HashSet::new(); + for (user_aes, _fns) in &spec.targets { + for internal in resolve_target_aesthetic(user_aes, aesthetics, aesthetic_ctx) { + targeted.insert(internal); + } + } + targeted +} + +/// Compute, for a layer's `aggregate` setting, which internal aesthetic names +/// will be (a) *explicitly targeted* by `aggregate => '<aesthetic>:<func>'` and +/// (b) *aggregated* by the stat (either targeted OR a numeric mapping that an +/// untargeted default applies to). 
+/// +/// The execute pipeline uses this to decide whether to defer scale-driven +/// pre-stat rewrites (`SCALE BINNED `, `SCALE FROM […]`, …) until +/// after the stat. The bucketing here mirrors the per-mapping branching in +/// [`apply`]; both must stay in sync. +/// +/// Returns `None` when `aggregate` is unset, null, or fails to parse — i.e. +/// when the stat will return `Identity` and no aesthetic is touched. Parse +/// errors are swallowed; the stat itself surfaces a clean diagnostic. +pub fn aggregated_aesthetics( + parameters: &HashMap, + aesthetics: &Mappings, + schema: &Schema, + aesthetic_ctx: &AestheticContext, + domain_aesthetics: &[&'static str], +) -> Option<(HashSet, HashSet)> { + let raw = parameters.get("aggregate")?; + if matches!(raw, ParameterValue::Null) { + return None; + } + let spec = parse_aggregate_param(raw).ok()??; + + let mut targeted: HashSet = HashSet::new(); + for (user_aes, _fns) in &spec.targets { + for internal in resolve_target_aesthetic(user_aes, aesthetics, aesthetic_ctx) { + targeted.insert(internal); + } + } + + let mut aggregated: HashSet = targeted.clone(); + let mut entries: Vec<(&String, &crate::AestheticValue)> = + aesthetics.aesthetics.iter().collect(); + entries.sort_by(|a, b| a.0.cmp(b.0)); + for (aes, value) in entries { + let col = match value.column_name() { + Some(c) => c, + None => continue, + }; + if domain_aesthetics.contains(&aes.as_str()) { + continue; + } + let is_discrete = schema + .iter() + .find(|c| c.name == col) + .map(|c| c.is_discrete) + .unwrap_or(false); + if is_discrete { + continue; + } + if targeted.contains(aes) { + continue; + } + let default_applies = if is_upper_half(aes) { + spec.default_upper.is_some() || spec.default_lower.is_some() + } else { + spec.default_lower.is_some() + }; + if default_applies { + aggregated.insert(aes.clone()); + } + } + + Some((targeted, aggregated)) +} + +/// Apply the Aggregate stat to a layer query. 
+/// +/// Returns `StatResult::Identity` when the `aggregate` parameter is unset, null, +/// or empty. Otherwise, builds a `GROUP BY` query producing one row per group +/// (the *reduce* path), or, when at least one target lists multiple functions, +/// `N` rows per group with a synthetic `aggregate` column tagging each row +/// (the *explode* path). +#[allow(clippy::too_many_arguments)] +pub fn apply( + query: &str, + schema: &Schema, + aesthetics: &Mappings, + group_by: &[String], + parameters: &HashMap<String, ParameterValue>, + dialect: &dyn SqlDialect, + aesthetic_ctx: &AestheticContext, + domain_aesthetics: &[&'static str], +) -> Result<StatResult> { + let raw = match parameters.get("aggregate") { + None | Some(ParameterValue::Null) => return Ok(StatResult::Identity), + Some(v) => v, + }; + let spec = parse_aggregate_param(raw).map_err(GgsqlError::ValidationError)?; + let spec = match spec { + Some(s) => s, + None => return Ok(StatResult::Identity), + }; + let n = spec.explosion_factor(); + let labels = spec.explosion_labels(); + + // Resolve target keys (user-facing) → internal aesthetic names. An alias + // like `color` expands to whichever of its targets (stroke/fill) is mapped + // on the layer; the same function list applies to all of them. + let targets_internal = resolve_aggregate_targets(&spec, aesthetics, aesthetic_ctx) + .map_err(GgsqlError::ValidationError)?; + + // Walk mappings. 
Three buckets: + // - aggregated: (internal_aes, raw_col, fns of length n); each emits one column per row + // - kept_cols: discrete column-mappings, kept as group keys + // - dropped: numeric mapping with no applicable function (warn & skip) + let mut aggregated: Vec<(String, String, Vec<AggSpec>)> = Vec::new(); + let mut kept_cols: Vec<String> = Vec::new(); + let mut dropped: Vec<String> = Vec::new(); + + let mut entries: Vec<(&String, &crate::AestheticValue)> = + aesthetics.aesthetics.iter().collect(); + entries.sort_by(|a, b| a.0.cmp(b.0)); + + for (aes, value) in entries { + let col = match value.column_name() { + Some(c) => c.to_string(), + None => continue, // literals & annotation columns pass through + }; + // Geom-declared domain aesthetics (e.g. `pos1` for line/area/ribbon) + // always become group keys: they identify each row, never get + // aggregated, never get dropped. + if domain_aesthetics.contains(&aes.as_str()) { + if !kept_cols.contains(&col) { + kept_cols.push(col); + } + continue; + } + let info = schema.iter().find(|c| c.name == col); + let is_discrete = info.map(|c| c.is_discrete).unwrap_or(false); + if is_discrete { + if !kept_cols.contains(&col) { + kept_cols.push(col); + } + continue; + } + + // Numeric mapping. Look up the function list (recycling to length n). + let fns: Option<Vec<AggSpec>> = if let Some(list) = targets_internal.get(aes) { + if list.len() == n { + Some(list.clone()) + } else { + // Validated to be 1 or n during parsing; guard with a sanity check. 
+ debug_assert_eq!(list.len(), 1); + Some(vec![list[0].clone(); n]) + } + } else { + let default = if is_upper_half(aes) { + spec.default_upper + .clone() + .or_else(|| spec.default_lower.clone()) + } else { + spec.default_lower.clone() + }; + default.map(|d| vec![d; n]) + }; + + match fns { + Some(list) => aggregated.push((aes.clone(), col, list)), + None => dropped.push(aes.clone()), + } + } + + for d in &dropped { + let user_aes = aesthetic_ctx.map_internal_to_user(d); + eprintln!( + "Warning: aggregate dropped numeric mapping for aesthetic '{}' \ + (no applicable default and no targeted function). \ + Suggestion: add an unprefixed default like `aggregate => 'mean'` \ + to apply one function to every numeric mapping, or target this \ + aesthetic with `'{0}:<func>'`.", + user_aes, + ); + } + + // No aggregate functions to apply → the stat has nothing to do. Whether + // the layer has group keys or not is irrelevant: emitting a `SELECT keys + // FROM src GROUP BY keys` query would be a distinct-rows transform the + // user didn't ask for. + if aggregated.is_empty() { + return Ok(StatResult::Identity); + } + + // Group columns: PARTITION BY + discrete column-mappings, deduped. 
+ let mut group_cols: Vec<String> = Vec::new(); + for g in group_by { + if !group_cols.contains(g) { + group_cols.push(g.clone()); + } + } + for c in &kept_cols { + if !group_cols.contains(c) { + group_cols.push(c.clone()); + } + } + + let missing = unsupported_functions(&aggregated, dialect); + if !missing.is_empty() { + return Err(GgsqlError::ValidationError(format!( + "aggregate function(s) {} are not supported by this database backend", + crate::or_list_quoted(&missing, '\''), + ))); + } + + let transformed_query = match &labels { + Some(ls) => build_aggregate_query(query, &aggregated, &group_cols, ls, dialect), + None => build_group_by_query(query, &aggregated, &group_cols, dialect), + }; + + let mut stat_columns: Vec<String> = aggregated.iter().map(|(a, _, _)| a.clone()).collect(); + let consumed_aesthetics: Vec<String> = stat_columns.clone(); + // The synthetic `aggregate` column is only emitted for the multi-row + // (explosion) case, where it differentiates rows that share the same + // group key. + if labels.is_some() { + stat_columns.push("aggregate".to_string()); + } + + Ok(StatResult::Transformed { + query: transformed_query, + stat_columns, + dummy_columns: vec![], + consumed_aesthetics, + }) +} + +/// CTE preamble plus the alias the caller should `FROM`. When any emitted +/// aggregate references the `__ggsql_rn__` / `__ggsql_max_rn__` columns +/// (the dialect-portable form of `first` / `last`), wrap the source CTE in a +/// row-numbered layer. 
+fn source_cte_chain( + query: &str, + aggregated: &[(String, String, Vec)], + group_cols: &[String], + dialect: &dyn SqlDialect, +) -> (String, &'static str) { + let raw_src = "\"__ggsql_stat_src__\""; + if !needs_row_position(aggregated, dialect) { + return (format!("WITH {raw_src} AS ({query})"), raw_src); + } + let rn_src = "\"__ggsql_stat_src_rn__\""; + let group_select: Vec = group_cols.iter().map(|c| naming::quote_ident(c)).collect(); + // ORDER BY (SELECT 1) is the canonical "no real ordering" stand-in: it + // satisfies the standard's required ORDER BY for window functions while + // letting the engine pick the row order — same indeterminacy as DuckDB's + // native FIRST() without a user ORDER BY. + let partition = if group_select.is_empty() { + String::new() + } else { + format!("PARTITION BY {} ", group_select.join(", ")) + }; + let cte = format!( + "WITH {raw_src} AS ({query}), {rn_src} AS (\ + SELECT *, \ + ROW_NUMBER() OVER ({partition}ORDER BY (SELECT 1)) AS \"__ggsql_rn__\", \ + COUNT(*) OVER ({partition_no_order}) AS \"__ggsql_max_rn__\" \ + FROM {raw_src}\ + )", + partition_no_order = partition.trim_end(), + ); + (cte, rn_src) +} + +/// True iff at least one aggregate spec, after the dialect emits its SQL, +/// references the row-position columns. Backends with native `FIRST`/`LAST` +/// (DuckDB) emit a string that doesn't mention `__ggsql_rn__`, and so don't +/// pay for the extra window functions. +fn needs_row_position( + aggregated: &[(String, String, Vec)], + dialect: &dyn SqlDialect, +) -> bool { + for (_, _, specs) in aggregated { + for spec in specs { + for name in [Some(spec.offset), spec.band.as_ref().map(|b| b.expansion)] + .into_iter() + .flatten() + { + if let Some(sql) = dialect.sql_aggregate(name, "x") { + if sql.contains("__ggsql_rn__") { + return true; + } + } + } + } + } + false +} + +/// Build the single-row `WITH src AS () SELECT , +/// FROM src AS "__ggsql_qt__" GROUP BY ` query. 
Each aggregated +/// aesthetic's function list is length 1 here. +/// +/// Falls back to `dialect.sql_percentile()` per-column when an aggregate's +/// percentile component lacks inline support. +fn build_group_by_query( + query: &str, + aggregated: &[(String, String, Vec)], + group_cols: &[String], + dialect: &dyn SqlDialect, +) -> String { + let outer_alias = "\"__ggsql_qt__\""; + let (with_clause, src_alias) = source_cte_chain(query, aggregated, group_cols, dialect); + + let group_select: Vec = group_cols.iter().map(|c| naming::quote_ident(c)).collect(); + let group_by_clause = if group_cols.is_empty() { + String::new() + } else { + format!(" GROUP BY {}", group_select.join(", ")) + }; + + let mut select_parts: Vec = group_select.clone(); + + for (aes, raw_col, fns) in aggregated { + let agg = &fns[0]; + let stat_col = naming::stat_column(aes); + let qcol = naming::quote_ident(raw_col); + let expr = if needs_quantile_fallback(agg, raw_col, dialect) { + agg_sql_fallback(agg, raw_col, dialect, src_alias, group_cols) + } else { + agg_sql_inline(agg, &qcol, dialect) + .expect("agg_sql_inline must succeed when needs_quantile_fallback is false") + }; + select_parts.push(format!("{} AS {}", expr, naming::quote_ident(&stat_col))); + } + + format!( + "{with_clause} SELECT {sel} FROM {src} AS {outer}{gb}", + sel = select_parts.join(", "), + src = src_alias, + outer = outer_alias, + gb = group_by_clause, + ) +} + +/// Build the exploded `WITH src AS () UNION ALL +/// ...` query. One branch per row in `0..labels.len()`, each branch its own +/// `GROUP BY` with the row's aggregation functions and a literal label tagged +/// to `__ggsql_stat_aggregate__`. 
+fn build_aggregate_query( + query: &str, + aggregated: &[(String, String, Vec)], + group_cols: &[String], + labels: &[String], + dialect: &dyn SqlDialect, +) -> String { + let outer_alias = "\"__ggsql_qt__\""; + let (with_clause, src_alias) = source_cte_chain(query, aggregated, group_cols, dialect); + + let group_select: Vec = group_cols.iter().map(|c| naming::quote_ident(c)).collect(); + let group_by_clause = if group_cols.is_empty() { + String::new() + } else { + format!(" GROUP BY {}", group_select.join(", ")) + }; + + let stat_aggregate_col = naming::stat_column("aggregate"); + + let branches: Vec = labels + .iter() + .enumerate() + .map(|(row_idx, label)| { + let mut select_parts: Vec = group_select.clone(); + + for (aes, raw_col, fns) in aggregated { + let agg = &fns[row_idx]; + let stat_col = naming::stat_column(aes); + let qcol = naming::quote_ident(raw_col); + let expr = if needs_quantile_fallback(agg, raw_col, dialect) { + agg_sql_fallback(agg, raw_col, dialect, src_alias, group_cols) + } else { + agg_sql_inline(agg, &qcol, dialect) + .expect("agg_sql_inline must succeed when needs_quantile_fallback is false") + }; + select_parts.push(format!("{} AS {}", expr, naming::quote_ident(&stat_col))); + } + + select_parts.push(format!( + "{} AS {}", + naming::quote_literal(label), + naming::quote_ident(&stat_aggregate_col) + )); + + format!( + "SELECT {} FROM {} AS {}{}", + select_parts.join(", "), + src_alias, + outer_alias, + group_by_clause, + ) + }) + .collect(); + + format!("{with_clause} {body}", body = branches.join(" UNION ALL "),) +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::plot::aesthetic::AestheticContext; + use crate::plot::types::{AestheticValue, ColumnInfo}; + use arrow::datatypes::DataType; + + /// A test dialect that mimics DuckDB: native QUANTILE_CONT plus the + /// row-positional FIRST / LAST aggregates. 
+ struct InlineQuantileDialect; + impl SqlDialect for InlineQuantileDialect { + fn sql_quantile_inline(&self, column: &str, fraction: f64) -> Option<String> { + Some(format!( + "QUANTILE_CONT({}, {})", + naming::quote_ident(column), + fraction + )) + } + + fn sql_aggregate(&self, name: &str, qcol: &str) -> Option<String> { + match name { + "first" => Some(format!("FIRST({})", qcol)), + "last" => Some(format!("LAST({})", qcol)), + "diff" => Some(format!("(LAST({c}) - FIRST({c}))", c = qcol)), + _ => crate::reader::default_sql_aggregate(name, qcol), + } + } + } + + /// A test dialect with no inline quantile support, exercising the + /// per-column `sql_percentile` fallback. + struct NoInlineQuantileDialect; + impl SqlDialect for NoInlineQuantileDialect {} + + fn col(name: &str) -> AestheticValue { + AestheticValue::Column { + name: name.to_string(), + original_name: None, + is_dummy: false, + } + } + + fn schema_for(cols: &[(&str, bool)]) -> Schema { + cols.iter() + .map(|(name, is_discrete)| ColumnInfo { + name: name.to_string(), + dtype: if *is_discrete { + DataType::Utf8 + } else { + DataType::Float64 + }, + is_discrete: *is_discrete, + min: None, + max: None, + }) + .collect() + } + + fn cartesian_ctx() -> AestheticContext { + AestheticContext::from_static(&["x", "y"], &[]) + } + + fn run( + params: ParameterValue, + aes: &Mappings, + schema: &Schema, + group_by: &[String], + dialect: &dyn SqlDialect, + ) -> Result<StatResult> { + run_with_domain(params, aes, schema, group_by, dialect, &[]) + } + + fn run_with_domain( + params: ParameterValue, + aes: &Mappings, + schema: &Schema, + group_by: &[String], + dialect: &dyn SqlDialect, + domain: &[&'static str], + ) -> Result<StatResult> { + let mut p = HashMap::new(); + p.insert("aggregate".to_string(), params); + let ctx = cartesian_ctx(); + apply( + "SELECT * FROM t", + schema, + aes, + group_by, + &p, + dialect, + &ctx, + domain, + ) + } + + fn arr(items: &[&str]) -> ParameterValue { + ParameterValue::Array( + items + .iter() + .map(|s| 
ArrayElement::String(s.to_string())) + .collect(), + ) + } + + // ---------- parser tests ---------- + + #[test] + fn parses_unset_and_null() { + assert_eq!(parse_aggregate_param(&ParameterValue::Null).unwrap(), None); + assert_eq!(parse_aggregate_param(&arr(&[])).unwrap(), None); + } + + #[test] + fn parses_single_default() { + let s = parse_aggregate_param(&ParameterValue::String("mean".to_string())) + .unwrap() + .unwrap(); + assert_eq!(s.default_lower.as_ref().map(|a| a.offset), Some("mean")); + assert!(s.default_upper.is_none()); + assert!(s.targets.is_empty()); + } + + #[test] + fn parses_two_defaults_in_order() { + let s = parse_aggregate_param(&arr(&["min", "max"])) + .unwrap() + .unwrap(); + assert_eq!(s.default_lower.as_ref().map(|a| a.offset), Some("min")); + assert_eq!(s.default_upper.as_ref().map(|a| a.offset), Some("max")); + } + + #[test] + fn three_unprefixed_defaults_is_error() { + let err = parse_aggregate_param(&arr(&["mean", "min", "max"])).unwrap_err(); + assert!(err.contains("at most two"), "got: {}", err); + } + + fn target_funcs<'a>(spec: &'a AggregateSpec, aes: &str) -> Option<&'a [AggSpec]> { + spec.targets + .iter() + .find(|(a, _)| a == aes) + .map(|(_, fns)| fns.as_slice()) + } + + #[test] + fn parses_targeted_entries() { + let s = parse_aggregate_param(&arr(&["mean", "y:max", "color:median"])) + .unwrap() + .unwrap(); + assert_eq!(s.default_lower.as_ref().map(|a| a.offset), Some("mean")); + assert_eq!(target_funcs(&s, "y").map(|fs| fs[0].offset), Some("max")); + assert_eq!( + target_funcs(&s, "color").map(|fs| fs[0].offset), + Some("median") + ); + } + + #[test] + fn duplicate_target_explodes_into_a_list() { + let s = parse_aggregate_param(&arr(&["y:min", "y:max"])) + .unwrap() + .unwrap(); + let fns = target_funcs(&s, "y").unwrap(); + assert_eq!(fns.len(), 2); + assert_eq!(fns[0].offset, "min"); + assert_eq!(fns[1].offset, "max"); + assert_eq!(s.explosion_factor(), 2); + assert_eq!( + s.explosion_labels(), + 
Some(vec!["min".to_string(), "max".to_string()]) + ); + } + + #[test] + fn multi_aesthetic_explosion_joins_unique_function_names() { + // Two exploded targets contribute distinct function names per row → 'min/sum', 'max/prod'. + let s = parse_aggregate_param(&arr(&["y:min", "y:max", "color:sum", "color:prod"])) + .unwrap() + .unwrap(); + assert_eq!( + s.explosion_labels(), + Some(vec!["min/sum".to_string(), "max/prod".to_string()]) + ); + } + + #[test] + fn multi_aesthetic_explosion_dedups_repeats() { + // y and color both use 'mean' at row 0 → label is just 'mean' (deduped). + let s = parse_aggregate_param(&arr(&["y:mean", "y:max", "color:mean", "color:prod"])) + .unwrap() + .unwrap(); + assert_eq!( + s.explosion_labels(), + Some(vec!["mean".to_string(), "max/prod".to_string()]) + ); + } + + #[test] + fn recycled_target_excluded_from_label() { + // color has length 1 → recycled, not exploded; label only reflects y's functions. + let s = parse_aggregate_param(&arr(&["y:min", "y:max", "color:median"])) + .unwrap() + .unwrap(); + assert_eq!( + s.explosion_labels(), + Some(vec!["min".to_string(), "max".to_string()]) + ); + } + + #[test] + fn single_row_returns_no_labels() { + // The aggregate column only makes sense as a row-differentiator, and a + // single-row aggregation has nothing to differentiate, so no labels. + let s = parse_aggregate_param(&ParameterValue::String("mean".to_string())) + .unwrap() + .unwrap(); + assert_eq!(s.explosion_labels(), None); + + let s = parse_aggregate_param(&arr(&["mean", "color:median"])) + .unwrap() + .unwrap(); + assert_eq!(s.explosion_labels(), None); + } + + #[test] + fn recycling_violation_is_error() { + // y has 2, color has 3 → mismatched, neither is 1 nor matches the longest. 
+ let err = parse_aggregate_param(&arr(&[ + "y:min", + "y:max", + "color:p10", + "color:p50", + "color:p90", + ])) + .unwrap_err(); + assert!(err.contains("longest target"), "got: {}", err); + } + + #[test] + fn length_one_target_recycles_in_explosion() { + let s = parse_aggregate_param(&arr(&["y:min", "y:max", "color:median"])) + .unwrap() + .unwrap(); + assert_eq!(s.explosion_factor(), 2); + assert_eq!(target_funcs(&s, "color").map(|f| f.len()), Some(1)); + } + + #[test] + fn empty_prefix_is_error() { + let err = parse_aggregate_param(&ParameterValue::String(":mean".to_string())).unwrap_err(); + assert!(err.contains("could not parse"), "got: {}", err); + } + + #[test] + fn unknown_function_is_error() { + let err = parse_aggregate_param(&ParameterValue::String("nope".to_string())).unwrap_err(); + assert!(err.contains("unknown aggregate"), "got: {}", err); + } + + #[test] + fn band_functions_parse() { + let s = parse_aggregate_param(&arr(&["mean-sdev", "mean+sdev"])) + .unwrap() + .unwrap(); + assert_eq!(s.default_lower.as_ref().unwrap().offset, "mean"); + assert_eq!( + s.default_lower + .as_ref() + .unwrap() + .band + .as_ref() + .unwrap() + .expansion, + "sdev" + ); + assert_eq!( + s.default_lower + .as_ref() + .unwrap() + .band + .as_ref() + .unwrap() + .mod_value, + -1.0, + ); + assert_eq!(s.default_upper.as_ref().unwrap().offset, "mean"); + assert_eq!( + s.default_upper + .as_ref() + .unwrap() + .band + .as_ref() + .unwrap() + .mod_value, + 1.0, + ); + } + + // ---------- apply tests ---------- + + #[test] + fn returns_identity_when_param_unset() { + let aes = Mappings::new(); + let schema: Schema = vec![]; + let p: HashMap<String, ParameterValue> = HashMap::new(); + let ctx = cartesian_ctx(); + let result = apply( + "SELECT * FROM t", + &schema, + &aes, + &[], + &p, + &InlineQuantileDialect, + &ctx, + &[], + ) + .unwrap(); + assert_eq!(result, StatResult::Identity); + } + + #[test] + fn returns_identity_when_param_null() { + let aes = Mappings::new(); + let schema: Schema = vec![]; 
+ let result = run( + ParameterValue::Null, + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + assert_eq!(result, StatResult::Identity); + } + + #[test] + fn single_default_applies_to_every_numeric_mapping() { + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + let schema = schema_for(&[("__ggsql_aes_pos1__", false), ("__ggsql_aes_pos2__", false)]); + let result = run( + ParameterValue::String("mean".to_string()), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { + query, + stat_columns, + consumed_aesthetics, + .. + } => { + assert!(query.contains("AVG(\"__ggsql_aes_pos1__\")"), "{}", query); + assert!(query.contains("AVG(\"__ggsql_aes_pos2__\")"), "{}", query); + // No GROUP BY when no discrete mappings or PARTITION BY — SQL + // collapses to a single row per query, which is correct. + assert!(!query.contains("CROSS JOIN")); + assert!(!query.contains("UNION ALL")); + assert_eq!(stat_columns.len(), 2); + assert!(stat_columns.contains(&"pos1".to_string())); + assert!(stat_columns.contains(&"pos2".to_string())); + assert_eq!(consumed_aesthetics.len(), 2); + } + _ => panic!("expected Transformed"), + } + } + + #[cfg(feature = "sqlite")] + #[test] + fn sqlite_dialect_emits_portable_stddev_and_first() { + use crate::reader::sqlite::SqliteDialect; + + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + let schema = schema_for(&[("__ggsql_aes_pos1__", false), ("__ggsql_aes_pos2__", false)]); + + // sdev must not emit STDDEV_POP (SQLite has no such function). + let result = run( + ParameterValue::String("sdev".to_string()), + &aes, + &schema, + &[], + &SqliteDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. 
} => { + assert!( + !query.contains("STDDEV_POP"), + "SQLite dialect must not emit STDDEV_POP, got: {query}" + ); + assert!(query.contains("SQRT") && query.contains("AVG"), "{query}"); + } + _ => panic!("expected Transformed"), + } + + // first now uses the portable ROW_NUMBER + MAX(CASE) form. It must run + // on SQLite without `FIRST` ever appearing as an aggregate call. + let result = run( + ParameterValue::String("first".to_string()), + &aes, + &schema, + &[], + &SqliteDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. } => { + assert!( + query.contains("ROW_NUMBER()"), + "expected ROW_NUMBER prep, got: {query}" + ); + assert!( + query.contains("\"__ggsql_rn__\" = 1"), + "expected first via rn=1, got: {query}" + ); + assert!( + !query.contains("FIRST(\""), + "must not call FIRST as an aggregate, got: {query}" + ); + } + _ => panic!("expected Transformed"), + } + } + + /// End-to-end run against an actual SQLite reader. Pins that the rn-CTE + /// path (`MAX(CASE WHEN __ggsql_rn__ = …)`) returns the correct first / + /// last / diff values per discrete group — the SQL-string assertions in + /// `sqlite_dialect_emits_portable_stddev_and_first` only verify the + /// generated query's shape, not that it actually computes the right thing + /// when executed. + #[cfg(feature = "sqlite")] + #[test] + fn sqlite_first_last_diff_return_correct_values() { + use crate::naming; + use crate::reader::SqliteReader; + + let reader = SqliteReader::new().unwrap(); + + // Stable row order via an explicit ORDER BY so first/last are + // deterministic. Group A: 10, 30, 20 → first=10, last=20, diff=10. + // Group B: 100, 50 → first=100, last=50, diff=-50. + let body = "WITH t(g, ord, v) AS (\ + SELECT 'A', 1, 10 UNION ALL SELECT 'A', 2, 30 \ + UNION ALL SELECT 'A', 3, 20 \ + UNION ALL SELECT 'B', 1, 100 UNION ALL SELECT 'B', 2, 50) \ + SELECT g, v FROM t ORDER BY g, ord"; + + // Helper: run the query with a given aggregate and return per-group values. 
+ let run_agg = |func: &str| -> Vec<(String, f64)> { + let query = format!( + "{body} VISUALISE \ + DRAW point MAPPING g AS x, v AS y \ + SETTING aggregate => '{func}'" + ); + let prepared = crate::execute::prepare_data_with_reader(&query, &reader).unwrap(); + let df = prepared + .data + .get(prepared.specs[0].layers[0].data_key.as_ref().unwrap()) + .unwrap(); + let xs = df.column("__ggsql_aes_pos1__").unwrap(); + let ys = df.column("__ggsql_aes_pos2__").unwrap(); + let mut out: Vec<(String, f64)> = (0..df.height()) + .map(|i| { + let x = crate::array_util::value_to_string(xs, i); + let y = crate::array_util::value_to_string(ys, i) + .parse::<f64>() + .unwrap(); + (x, y) + }) + .collect(); + out.sort_by(|a, b| a.0.cmp(&b.0)); + out + }; + + assert_eq!( + run_agg("first"), + vec![("A".to_string(), 10.0), ("B".to_string(), 100.0)], + "first should pick the group's first row in ORDER BY ord" + ); + assert_eq!( + run_agg("last"), + vec![("A".to_string(), 20.0), ("B".to_string(), 50.0)], + "last should pick the group's last row" + ); + assert_eq!( + run_agg("diff"), + vec![("A".to_string(), 10.0), ("B".to_string(), -50.0)], + "diff should be last - first per group" + ); + + // While we're here, sanity-check that the generated stat SQL goes + // through the rn-CTE path (no native FIRST/LAST aggregate call). + let _ = naming::layer_key(0); // exercise the import, keeps `naming` used + } + + #[test] + fn unsupported_aggregate_errors_with_dialect_that_lacks_function() { + // A dialect that explicitly opts out of `first` (returns None) must + // produce the validation error rather than emitting broken SQL. 
+ struct OptOutDialect; + impl SqlDialect for OptOutDialect { + fn sql_aggregate(&self, name: &str, qcol: &str) -> Option<String> { + if name == "first" { + return None; + } + crate::reader::default_sql_aggregate(name, qcol) + } + } + + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + let schema = schema_for(&[("__ggsql_aes_pos1__", false), ("__ggsql_aes_pos2__", false)]); + let err = run( + ParameterValue::String("first".to_string()), + &aes, + &schema, + &[], + &OptOutDialect, + ) + .unwrap_err(); + let msg = format!("{}", err); + assert!( + msg.contains("first") && msg.contains("not supported"), + "expected unsupported-function error mentioning 'first', got: {msg}" + ); + } + + #[test] + fn mid_emits_min_max_midpoint() { + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + let schema = schema_for(&[("__ggsql_aes_pos1__", false), ("__ggsql_aes_pos2__", false)]); + let result = run( + ParameterValue::String("mid".to_string()), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. } => { + assert!( + query.contains( + "(MIN(\"__ggsql_aes_pos1__\") + MAX(\"__ggsql_aes_pos1__\")) / 2.0" + ), + "{}", + query + ); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn diff_uses_row_position_and_subtracts_first_from_last() { + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + let schema = schema_for(&[("__ggsql_aes_pos1__", false), ("__ggsql_aes_pos2__", false)]); + + // AnsiDialect path: portable rn-based form for last - first. 
+ struct AnsiTestDialect; + impl SqlDialect for AnsiTestDialect {} + let result = run( + ParameterValue::String("diff".to_string()), + &aes, + &schema, + &[], + &AnsiTestDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. } => { + assert!(query.contains("ROW_NUMBER()"), "{query}"); + assert!( + query.contains("\"__ggsql_rn__\" = \"__ggsql_max_rn__\""), + "{query}" + ); + assert!(query.contains("\"__ggsql_rn__\" = 1"), "{query}"); + assert!(query.contains(" - "), "expected subtraction, got: {query}"); + } + _ => panic!("expected Transformed"), + } + + // Native-FIRST/LAST path: no rn CTE. + let result = run( + ParameterValue::String("diff".to_string()), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. } => { + assert!( + query.contains("LAST(") && query.contains("FIRST("), + "expected native LAST/FIRST: {query}" + ); + assert!( + !query.contains("__ggsql_rn__"), + "native dialect must not add ROW_NUMBER prep: {query}" + ); + } + _ => panic!("expected Transformed"), + } + } + + #[cfg(feature = "duckdb")] + #[test] + fn duckdb_first_skips_row_number_cte() { + use crate::reader::duckdb::DuckDbDialect; + + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + let schema = schema_for(&[("__ggsql_aes_pos1__", false), ("__ggsql_aes_pos2__", false)]); + let result = run( + ParameterValue::String("first".to_string()), + &aes, + &schema, + &[], + &DuckDbDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. 
} => { + assert!( + query.contains("FIRST(\""), + "expected native FIRST aggregate, got: {query}" + ); + assert!( + !query.contains("__ggsql_rn__"), + "DuckDB has native FIRST, must not add ROW_NUMBER prep: {query}" + ); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn last_with_discrete_group_partitions_row_number_over_group() { + // Pins build_group_by_query's behaviour for the rn-CTE + non-empty + // group_cols combo: every other test that exercises the rn CTE has + // empty group_cols (so windows emit `OVER ()`). A bug that + // forgot to thread `PARTITION BY ` through wouldn't + // surface in those tests. + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + let schema = schema_for(&[ + ("__ggsql_aes_pos1__", true), // discrete group key + ("__ggsql_aes_pos2__", false), + ]); + + // Native-FIRST/LAST dialect: no rn CTE, GROUP BY uses the discrete key. + let result = run( + ParameterValue::String("last".to_string()), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. } => { + assert!( + !query.contains("__ggsql_rn__"), + "native LAST must not add ROW_NUMBER prep: {query}" + ); + assert!(query.contains("LAST(\"__ggsql_aes_pos2__\")"), "{query}"); + assert!(query.contains("GROUP BY \"__ggsql_aes_pos1__\""), "{query}"); + } + _ => panic!("expected Transformed"), + } + + // Default dialect: rn CTE must partition by the discrete group key. + struct AnsiTestDialect; + impl SqlDialect for AnsiTestDialect {} + let result = run( + ParameterValue::String("last".to_string()), + &aes, + &schema, + &[], + &AnsiTestDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. 
} => { + assert!( + query.contains( + "ROW_NUMBER() OVER (PARTITION BY \"__ggsql_aes_pos1__\" ORDER BY (SELECT 1))" + ), + "{query}" + ); + assert!( + query.contains("COUNT(*) OVER (PARTITION BY \"__ggsql_aes_pos1__\")"), + "{query}" + ); + assert!(query.contains("GROUP BY \"__ggsql_aes_pos1__\""), "{query}"); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn first_and_last_emit_positional_aggregates() { + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2min", col("__ggsql_aes_pos2min__")); + aes.insert("pos2max", col("__ggsql_aes_pos2max__")); + let schema = schema_for(&[ + ("__ggsql_aes_pos1__", false), + ("__ggsql_aes_pos2min__", false), + ("__ggsql_aes_pos2max__", false), + ]); + let result = run( + arr(&["first", "last"]), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. } => { + assert!( + query.contains("FIRST(\"__ggsql_aes_pos2min__\")"), + "{}", + query + ); + assert!( + query.contains("LAST(\"__ggsql_aes_pos2max__\")"), + "{}", + query + ); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn two_defaults_split_lower_and_upper_for_segment() { + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + aes.insert("pos1end", col("__ggsql_aes_pos1end__")); + aes.insert("pos2end", col("__ggsql_aes_pos2end__")); + let schema = schema_for(&[ + ("__ggsql_aes_pos1__", false), + ("__ggsql_aes_pos2__", false), + ("__ggsql_aes_pos1end__", false), + ("__ggsql_aes_pos2end__", false), + ]); + let result = run( + arr(&["min", "max"]), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. } => { + // pos1, pos2 use MIN; pos1end, pos2end use MAX. 
+ assert!(query.contains("MIN(\"__ggsql_aes_pos1__\")"), "{}", query); + assert!(query.contains("MIN(\"__ggsql_aes_pos2__\")"), "{}", query); + assert!( + query.contains("MAX(\"__ggsql_aes_pos1end__\")"), + "{}", + query + ); + assert!( + query.contains("MAX(\"__ggsql_aes_pos2end__\")"), + "{}", + query + ); + assert!(!query.contains("MIN(\"__ggsql_aes_pos1end__\")")); + assert!(!query.contains("MAX(\"__ggsql_aes_pos1__\")")); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn two_defaults_split_for_ribbon() { + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2min", col("__ggsql_aes_pos2min__")); + aes.insert("pos2max", col("__ggsql_aes_pos2max__")); + let schema = schema_for(&[ + ("__ggsql_aes_pos1__", false), + ("__ggsql_aes_pos2min__", false), + ("__ggsql_aes_pos2max__", false), + ]); + let result = run( + arr(&["mean-sdev", "mean+sdev"]), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. } => { + assert!(query.contains("STDDEV_POP(\"__ggsql_aes_pos2max__\")")); + assert!(query.contains("AVG(\"__ggsql_aes_pos2min__\")")); + // upper default (mean+sdev) goes to pos2max → '+' between AVG and STDDEV + let pos2max_section = query.split("__ggsql_aes_pos2max__\")").next().unwrap_or(""); + assert!(pos2max_section.contains('+') || query.contains("+ STDDEV_POP")); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn targeted_prefix_overrides_default() { + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + let schema = schema_for(&[("__ggsql_aes_pos1__", false), ("__ggsql_aes_pos2__", false)]); + let result = run( + arr(&["mean", "y:max"]), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. 
} => { + assert!(query.contains("AVG(\"__ggsql_aes_pos1__\")"), "{}", query); + assert!(query.contains("MAX(\"__ggsql_aes_pos2__\")"), "{}", query); + assert!(!query.contains("AVG(\"__ggsql_aes_pos2__\")")); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn material_aesthetic_targeted_by_user_facing_name() { + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + aes.insert("size", col("__ggsql_aes_size__")); + let schema = schema_for(&[ + ("__ggsql_aes_pos1__", false), + ("__ggsql_aes_pos2__", false), + ("__ggsql_aes_size__", false), + ]); + let result = run( + arr(&["mean", "size:median"]), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { + query, + stat_columns, + .. + } => { + assert!(query.contains("QUANTILE_CONT(\"__ggsql_aes_size__\", 0.5)")); + assert!(stat_columns.contains(&"size".to_string())); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn color_alias_targets_stroke_and_fill() { + // `color` is an alias that resolves to whichever of `stroke`/`fill` + // is actually mapped on the layer. + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + aes.insert("fill", col("__ggsql_aes_fill__")); + let schema = schema_for(&[ + ("__ggsql_aes_pos1__", false), + ("__ggsql_aes_pos2__", false), + ("__ggsql_aes_fill__", false), + ]); + let result = run( + arr(&["mean", "color:max"]), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { + query, + stat_columns, + .. 
+ } => { + assert!(query.contains("MAX(\"__ggsql_aes_fill__\")"), "{}", query); + assert!(query.contains("AVG(\"__ggsql_aes_pos1__\")")); + assert!(stat_columns.contains(&"fill".to_string())); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn explosion_emits_union_all_with_aggregate_label_column() { + // ('y:min', 'y:max') on a line-style layer → 2 rows per group, each + // tagged with the function name in __ggsql_stat_aggregate__. + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + let schema = schema_for(&[("__ggsql_aes_pos1__", false), ("__ggsql_aes_pos2__", false)]); + let result = run( + arr(&["y:min", "y:max"]), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { + query, + stat_columns, + consumed_aesthetics, + .. + } => { + assert!(query.contains("UNION ALL"), "{}", query); + assert!(query.contains("MIN(\"__ggsql_aes_pos2__\")"), "{}", query); + assert!(query.contains("MAX(\"__ggsql_aes_pos2__\")"), "{}", query); + assert!(query.contains("'min' AS \"__ggsql_stat_aggregate\"")); + assert!(query.contains("'max' AS \"__ggsql_stat_aggregate\"")); + // Aesthetics consumed: pos2. The synthetic `aggregate` is in + // stat_columns but NOT consumed (it's a new column). 
+ assert!(consumed_aesthetics.contains(&"pos2".to_string())); + assert!(!consumed_aesthetics.contains(&"aggregate".to_string())); + assert!(stat_columns.contains(&"aggregate".to_string())); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn explosion_recycles_length_one_targets_and_defaults() { + // ('mean', 'y:min', 'y:max', 'color:median'): + // - default 'mean' applies to non-targeted aesthetics, recycled + // - y is exploded into [min, max] → N=2 + // - color is targeted with one function → recycled to [median, median] + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + aes.insert("fill", col("__ggsql_aes_fill__")); + aes.insert("size", col("__ggsql_aes_size__")); + let schema = schema_for(&[ + ("__ggsql_aes_pos1__", false), + ("__ggsql_aes_pos2__", false), + ("__ggsql_aes_fill__", false), + ("__ggsql_aes_size__", false), + ]); + let result = run( + arr(&["mean", "y:min", "y:max", "color:median"]), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. 
} => { + // y is exploded → MIN and MAX appear in different branches + assert!(query.contains("MIN(\"__ggsql_aes_pos2__\")"), "{}", query); + assert!(query.contains("MAX(\"__ggsql_aes_pos2__\")")); + // color (alias → fill) is recycled → QUANTILE_CONT(.5) appears in BOTH branches + let median_count = query + .matches("QUANTILE_CONT(\"__ggsql_aes_fill__\", 0.5)") + .count(); + assert_eq!( + median_count, 2, + "color median should appear once per branch: {}", + query + ); + // size has no target → uses default 'mean' → AVG appears in both branches + let avg_size = query.matches("AVG(\"__ggsql_aes_size__\")").count(); + assert_eq!( + avg_size, 2, + "size mean should appear once per branch: {}", + query + ); + // pos1 (no target) → mean → AVG appears in both branches + let avg_pos1 = query.matches("AVG(\"__ggsql_aes_pos1__\")").count(); + assert_eq!(avg_pos1, 2); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn domain_aesthetic_kept_as_group_key_even_when_continuous() { + // Regression test for the line/area/ribbon case: the user writes + // DRAW line ... SETTING aggregate => ('y:min', 'y:max') + // and expects pos1 (the continuous time-axis column) to be a group + // key, not a dropped numeric mapping. The geom declares pos1 as a + // domain aesthetic; the stat keeps it as a group column. + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + let schema = schema_for(&[ + ("__ggsql_aes_pos1__", false), // continuous, would be dropped without the domain hint + ("__ggsql_aes_pos2__", false), + ]); + let result = run_with_domain( + arr(&["y:min", "y:max"]), + &aes, + &schema, + &[], + &InlineQuantileDialect, + &["pos1"], + ) + .unwrap(); + match result { + StatResult::Transformed { + query, + stat_columns, + consumed_aesthetics, + .. + } => { + // pos1 is in the GROUP BY, not aggregated. 
+ assert!( + query.contains("GROUP BY \"__ggsql_aes_pos1__\""), + "{}", + query + ); + assert!(!query.contains("MIN(\"__ggsql_aes_pos1__\")")); + assert!(!query.contains("MAX(\"__ggsql_aes_pos1__\")")); + // pos2 is exploded into MIN and MAX branches. + assert!(query.contains("MIN(\"__ggsql_aes_pos2__\")")); + assert!(query.contains("MAX(\"__ggsql_aes_pos2__\")")); + // pos1 is NOT consumed (kept), pos2 IS consumed. + assert!(!consumed_aesthetics.contains(&"pos1".to_string())); + assert!(consumed_aesthetics.contains(&"pos2".to_string())); + // synthetic aggregate column emitted in the explosion case. + assert!(stat_columns.contains(&"aggregate".to_string())); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn explosion_with_range_geom_two_defaults() { + // For ribbon: pos1 + pos2min (lower) + pos2max (upper). + // ('mean-sdev', 'mean+sdev', 'color:p25', 'color:p75'): + // - two defaults split lower/upper for range aesthetics + // - color is exploded → N=2 + // Result: two rows, with color taking p25 in row 0 and p75 in row 1; + // pos1/pos2min always use mean-sdev, pos2max always uses mean+sdev. + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2min", col("__ggsql_aes_pos2min__")); + aes.insert("pos2max", col("__ggsql_aes_pos2max__")); + aes.insert("fill", col("__ggsql_aes_fill__")); + let schema = schema_for(&[ + ("__ggsql_aes_pos1__", false), + ("__ggsql_aes_pos2min__", false), + ("__ggsql_aes_pos2max__", false), + ("__ggsql_aes_fill__", false), + ]); + let result = run( + arr(&["mean-sdev", "mean+sdev", "color:p25", "color:p75"]), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { + query, + stat_columns, + .. 
+ } => { + assert!(query.contains("UNION ALL")); + // pos2max always uses mean+sdev (upper default) — a `+` between AVG and STDDEV + let upper_branch_marker = "AVG(\"__ggsql_aes_pos2max__\") + STDDEV_POP"; + assert!(query.contains(upper_branch_marker), "{}", query); + // color uses p25 in one branch, p75 in another + assert!(query.contains("QUANTILE_CONT(\"__ggsql_aes_fill__\", 0.25)")); + assert!(query.contains("QUANTILE_CONT(\"__ggsql_aes_fill__\", 0.75)")); + // Synthetic aggregate column is present + assert!(stat_columns.contains(&"aggregate".to_string())); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn discrete_mapping_becomes_group_key() { + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + aes.insert("color", col("__ggsql_aes_color__")); + let schema = schema_for(&[ + ("__ggsql_aes_pos1__", false), + ("__ggsql_aes_pos2__", false), + ("__ggsql_aes_color__", true), // discrete! + ]); + let result = run( + ParameterValue::String("mean".to_string()), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { + query, + stat_columns, + .. 
+ } => { + assert!( + query.contains("GROUP BY \"__ggsql_aes_color__\""), + "{}", + query + ); + assert!(!stat_columns.contains(&"color".to_string())); + assert!(query.contains("AVG(\"__ggsql_aes_pos1__\")")); + assert!(query.contains("AVG(\"__ggsql_aes_pos2__\")")); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn literal_mapping_passes_through() { + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + aes.insert( + "fill", + AestheticValue::Literal(ParameterValue::String("steelblue".to_string())), + ); + let schema = schema_for(&[("__ggsql_aes_pos1__", false), ("__ggsql_aes_pos2__", false)]); + let result = run( + ParameterValue::String("mean".to_string()), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. } => { + assert!(!query.contains("AVG(\"__ggsql_aes_fill__\")")); + assert!(query.contains("AVG(\"__ggsql_aes_pos1__\")")); + assert!(query.contains("AVG(\"__ggsql_aes_pos2__\")")); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn untargeted_numeric_mapping_dropped_when_no_default() { + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + let schema = schema_for(&[("__ggsql_aes_pos1__", false), ("__ggsql_aes_pos2__", false)]); + // Only `y` targeted, no default → x is dropped. + let result = run( + ParameterValue::String("y:mean".to_string()), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { + query, + stat_columns, + .. 
+ } => { + assert!(query.contains("AVG(\"__ggsql_aes_pos2__\")")); + assert!(!query.contains("\"__ggsql_aes_pos1__\"")); + assert_eq!(stat_columns, vec!["pos2".to_string()]); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn quantile_uses_dialect_inline_when_available() { + let mut aes = Mappings::new(); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + let schema = schema_for(&[("__ggsql_aes_pos2__", false)]); + let result = run( + ParameterValue::String("p25".to_string()), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. } => { + assert!(query.contains("QUANTILE_CONT")); + assert!(query.contains("0.25")); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn quantile_falls_back_to_correlated_subquery_without_inline() { + let mut aes = Mappings::new(); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + let schema = schema_for(&[("__ggsql_aes_pos2__", false)]); + let result = run( + ParameterValue::String("p25".to_string()), + &aes, + &schema, + &[], + &NoInlineQuantileDialect, + ) + .unwrap(); + match result { + StatResult::Transformed { query, .. } => { + // The fallback dialect's sql_percentile uses NTILE. + assert!(query.contains("NTILE(4)")); + // No explosion any more — single SELECT, no UNION ALL. 
+ assert!(!query.contains("UNION ALL")); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn unknown_targeted_aesthetic_is_error() { + let mut aes = Mappings::new(); + aes.insert("pos1", col("__ggsql_aes_pos1__")); + aes.insert("pos2", col("__ggsql_aes_pos2__")); + let schema = schema_for(&[("__ggsql_aes_pos1__", false), ("__ggsql_aes_pos2__", false)]); + let err = run( + ParameterValue::String("size:mean".to_string()), + &aes, + &schema, + &[], + &InlineQuantileDialect, + ) + .unwrap_err(); + let msg = format!("{}", err); + assert!(msg.contains("not mapped"), "got: {}", msg); + } +} diff --git a/src/plot/layer/geom/text.rs b/src/plot/layer/geom/text.rs index 6ceb45f91..ae71ca55e 100644 --- a/src/plot/layer/geom/text.rs +++ b/src/plot/layer/geom/text.rs @@ -59,10 +59,15 @@ impl GeomTrait for Text { default: DefaultParamValue::Null, constraint: ParamConstraint::string(), }, + super::types::AGGREGATE_PARAM, ]; PARAMS } + fn aggregate_domain_aesthetics(&self) -> Option<&'static [&'static str]> { + Some(&[]) + } + fn post_process( &self, df: DataFrame, diff --git a/src/plot/layer/geom/tile.rs b/src/plot/layer/geom/tile.rs index 3633f944e..b4a639022 100644 --- a/src/plot/layer/geom/tile.rs +++ b/src/plot/layer/geom/tile.rs @@ -2,12 +2,16 @@ use std::collections::HashMap; +use super::stat_aggregate; use super::types::POSITION_VALUES; use super::types::{get_column_name, get_quoted_column_name}; -use super::{DefaultAesthetics, GeomTrait, GeomType, ParamConstraint, StatResult}; +use super::{ + has_aggregate_param, DefaultAesthetics, GeomTrait, GeomType, ParamConstraint, StatResult, +}; use crate::naming; -use crate::plot::types::{DefaultAestheticValue, ParameterValue}; +use crate::plot::types::{ColumnInfo, DefaultAestheticValue, ParameterValue}; use crate::plot::{DefaultParamValue, ParamDefinition}; +use crate::reader::SqlDialect; use crate::{DataFrame, GgsqlError, Mappings, Result}; use super::types::Schema; @@ -70,11 +74,14 @@ impl GeomTrait for Tile { } 
fn default_params(&self) -> &'static [ParamDefinition] { - const PARAMS: &[ParamDefinition] = &[ParamDefinition { - name: "position", - default: DefaultParamValue::String("identity"), - constraint: ParamConstraint::string_option(POSITION_VALUES), - }]; + const PARAMS: &[ParamDefinition] = &[ + ParamDefinition { + name: "position", + default: DefaultParamValue::String("identity"), + constraint: ParamConstraint::string_option(POSITION_VALUES), + }, + super::types::AGGREGATE_PARAM, + ]; PARAMS } @@ -95,6 +102,16 @@ impl GeomTrait for Tile { true } + /// Every spatial slot is pinned as a group key — the rectangle's position + /// and size *define* the group, they are never the thing being summarised. + /// Material aesthetics (fill, stroke, opacity, …) pass through to the + /// aggregate as normal. + fn aggregate_domain_aesthetics(&self) -> Option<&'static [&'static str]> { + Some(&[ + "pos1", "pos1min", "pos1max", "width", "pos2", "pos2min", "pos2max", "height", + ]) + } + fn apply_stat_transform( &self, query: &str, @@ -103,12 +120,124 @@ impl GeomTrait for Tile { group_by: &[String], parameters: &HashMap, _execute_query: &dyn Fn(&str) -> Result, - _dialect: &dyn crate::reader::SqlDialect, + dialect: &dyn SqlDialect, + aesthetic_ctx: &crate::plot::aesthetic::AestheticContext, ) -> Result { - stat_tile(query, schema, aesthetics, group_by, parameters) + // When `aggregate` is set, collapse rows first, then run the standard + // tile parameter consolidation over the aggregated result. The wrapper + // re-aliases stat-prefixed columns back to `__ggsql_aes_*` so stat_tile + // sees the same column shape as it does in the unaggregated path. When + // aggregate explodes (multi-function), stat_tile is given an extended + // schema so it passes the synthetic `__ggsql_stat_aggregate__` tag + // through to layer.rs (which uses it to drive `partition_by`). 
+ let (working_query, exploded) = if has_aggregate_param(parameters) { + let agg = stat_aggregate::apply( + query, + schema, + aesthetics, + group_by, + parameters, + dialect, + aesthetic_ctx, + self.aggregate_domain_aesthetics().unwrap_or(&[]), + )?; + match agg { + StatResult::Transformed { + query: agg_query, + stat_columns: agg_stats, + consumed_aesthetics, + .. + } => { + let exploded = agg_stats.iter().any(|s| s == "aggregate"); + ( + rename_agg_stats_to_aes(agg_query, &consumed_aesthetics), + exploded, + ) + } + StatResult::Identity => (query.to_string(), false), + } + } else { + (query.to_string(), false) + }; + + // For exploded aggregate, splice the synthetic stat column into the + // schema so stat_tile's pass-through projection emits it. Avoids + // dropping the per-row function tag that `partition_by` needs. + let extended_schema: Schema; + let schema_for_tile = if exploded { + extended_schema = schema + .iter() + .cloned() + .chain(std::iter::once(ColumnInfo { + name: naming::stat_column("aggregate"), + dtype: arrow::datatypes::DataType::Utf8, + is_discrete: true, + min: None, + max: None, + })) + .collect(); + &extended_schema + } else { + schema + }; + + let tile_result = stat_tile( + &working_query, + schema_for_tile, + aesthetics, + group_by, + parameters, + )?; + + if exploded { + if let StatResult::Transformed { + query, + mut stat_columns, + dummy_columns, + consumed_aesthetics, + } = tile_result + { + if !stat_columns.iter().any(|s| s == "aggregate") { + stat_columns.push("aggregate".to_string()); + } + return Ok(StatResult::Transformed { + query, + stat_columns, + dummy_columns, + consumed_aesthetics, + }); + } + } + Ok(tile_result) } } +/// Wrap an aggregated query so each `__ggsql_stat___` column is also +/// exposed as `__ggsql_aes___`. 
Lets downstream stages treat the +/// aggregated values as if they were original aesthetic columns, which is +/// exactly the substitution the tile layer wants when only material +/// aesthetics get aggregated. +fn rename_agg_stats_to_aes(agg_query: String, consumed: &[String]) -> String { + if consumed.is_empty() { + return agg_query; + } + let aliases: Vec<String> = consumed + .iter() + .map(|aes| { + format!( + "{} AS {}", + naming::quote_ident(&naming::stat_column(aes)), + naming::quote_ident(&naming::aesthetic_column(aes)), + ) + }) + .collect(); + format!( + "SELECT *, {} FROM ({}) AS \"__ggsql_post_agg__\"", + aliases.join(", "), + agg_query + ) +} + impl std::fmt::Display for Tile { fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { write!(f, "tile") @@ -898,6 +1027,145 @@ mod tests { } } + #[test] + fn test_aggregate_dispatches_to_aggregate_then_tile() { + use crate::plot::aesthetic::AestheticContext; + use crate::reader::AnsiDialect; + + let mut aesthetics = Mappings::new(); + for aes in ["pos1", "pos2", "fill"] { + aesthetics.insert( + aes.to_string(), + AestheticValue::standard_column(naming::aesthetic_column(aes)), + ); + } + // Heatmap shape: discrete x and y, continuous fill. + let mut schema = create_schema(&["pos1", "pos2"]); + schema.push(ColumnInfo { + name: "__ggsql_aes_fill__".to_string(), + dtype: DataType::Float64, + is_discrete: false, + min: None, + max: None, + }); + let ctx = AestheticContext::from_static(&["x", "y"], &[]); + let mut parameters = HashMap::new(); + parameters.insert( + "aggregate".to_string(), + ParameterValue::String("mean".to_string()), + ); + + let result = Tile + .apply_stat_transform( + "SELECT * FROM data", + &schema, + &aesthetics, + &[], + &parameters, + &|_| panic!("execute_query should not run during stat building"), + &AnsiDialect, + &ctx, + ) + .unwrap(); + + match result { + StatResult::Transformed { query, .. } => { + // Aggregate stage: GROUP BY pos1/pos2, AVG of fill into a stat column. 
+ assert!( + query.contains("GROUP BY"), + "expected GROUP BY, got: {query}" + ); + assert!( + query.contains("AVG(\"__ggsql_aes_fill__\")"), + "expected AVG over fill, got: {query}" + ); + // Re-alias stage: stat fill column re-exposed as the aesthetic name. + let expected_alias = format!( + "{} AS {}", + naming::quote_ident(&naming::stat_column("fill")), + naming::quote_ident(&naming::aesthetic_column("fill")), + ); + assert!( + query.contains(&expected_alias), + "expected re-alias '{expected_alias}', got: {query}" + ); + // Tile stage: discrete-x position computation runs on top. + assert!( + query.contains("\"__ggsql_aes_pos1__\" AS \"__ggsql_stat_pos1"), + "expected tile pos1 stat, got: {query}" + ); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn test_aggregate_explosion_propagates_synthetic_column() { + use crate::plot::aesthetic::AestheticContext; + use crate::reader::AnsiDialect; + + let mut aesthetics = Mappings::new(); + for aes in ["pos1", "pos2", "fill"] { + aesthetics.insert( + aes.to_string(), + AestheticValue::standard_column(naming::aesthetic_column(aes)), + ); + } + let mut schema = create_schema(&["pos1", "pos2"]); + schema.push(ColumnInfo { + name: "__ggsql_aes_fill__".to_string(), + dtype: DataType::Float64, + is_discrete: false, + min: None, + max: None, + }); + let ctx = AestheticContext::from_static(&["x", "y"], &[]); + let mut parameters = HashMap::new(); + parameters.insert( + "aggregate".to_string(), + ParameterValue::Array(vec![ + crate::plot::types::ArrayElement::String("fill:min".to_string()), + crate::plot::types::ArrayElement::String("fill:max".to_string()), + ]), + ); + + let result = Tile + .apply_stat_transform( + "SELECT * FROM data", + &schema, + &aesthetics, + &[], + &parameters, + &|_| panic!("execute_query should not run during stat building"), + &AnsiDialect, + &ctx, + ) + .unwrap(); + + match result { + StatResult::Transformed { + query, + stat_columns, + .. 
+ } => { + assert!( + query.contains("UNION ALL"), + "expected UNION ALL, got: {query}" + ); + let synth = naming::stat_column("aggregate"); + assert!( + query.contains(&naming::quote_ident(&synth)), + "synthetic aggregate column dropped from query: {query}" + ); + assert!( + stat_columns.iter().any(|s| s == "aggregate"), + "stat_columns missing 'aggregate' tag: {stat_columns:?}" + ); + } + _ => panic!("expected Transformed"), + } + } + #[test] fn test_setting_width_as_fallback() { // Test that SETTING width/height are used when no MAPPING is provided diff --git a/src/plot/layer/geom/types.rs b/src/plot/layer/geom/types.rs index fb0ab5b88..8b390547a 100644 --- a/src/plot/layer/geom/types.rs +++ b/src/plot/layer/geom/types.rs @@ -28,6 +28,17 @@ pub const CLOSED_VALUES: &[&str] = &["left", "right"]; /// time by `normalise_aes_name()` in `parser/builder.rs`. pub const AESTHETIC_ALIASES: &[(&str, &[&str])] = &[("color", &["stroke", "fill"])]; +/// Shared `aggregate` parameter definition. Every geom that opts into the +/// Aggregate stat (i.e. `aggregate_domain_aesthetics()` returns `Some(_)`) +/// includes this entry in its `default_params()`. The constraint validates +/// only the structural shape (string / array-of-strings / null); the +/// per-entry vocabulary is checked by `stat_aggregate::parse_aggregate_param`. +pub const AGGREGATE_PARAM: ParamDefinition = ParamDefinition { + name: "aggregate", + default: DefaultParamValue::Null, + constraint: ParamConstraint::string_or_string_array_unconstrained(), +}; + /// Default aesthetic values for a geom type /// /// This struct describes which aesthetics a geom supports, requires, and their default values. @@ -175,6 +186,43 @@ pub use crate::plot::types::ColumnInfo; /// Schema of a data source - list of columns with type info pub use crate::plot::types::Schema; +/// Wrap a stat result with `ORDER BY `. 
+/// +/// Used by line/area/ribbon to ensure the rendered output is sorted along the +/// domain axis whether or not the layer also goes through the Aggregate stat. +/// +/// - `Identity` → becomes `Transformed` with ` ORDER BY `, +/// empty `stat_columns`/`dummy_columns`/`consumed_aesthetics`. +/// - `Transformed` → wraps the existing query in +/// `SELECT * FROM () AS "__ggsql_ord__" ORDER BY ` and preserves +/// the stat metadata. +pub fn wrap_with_order_by(input_query: &str, result: StatResult, aesthetic: &str) -> StatResult { + let order_col = naming::aesthetic_column(aesthetic); + let order_quoted = naming::quote_ident(&order_col); + match result { + StatResult::Identity => StatResult::Transformed { + query: format!("{} ORDER BY {}", input_query, order_quoted), + stat_columns: vec![], + dummy_columns: vec![], + consumed_aesthetics: vec![], + }, + StatResult::Transformed { + query, + stat_columns, + dummy_columns, + consumed_aesthetics, + } => StatResult::Transformed { + query: format!( + "SELECT * FROM ({}) AS \"__ggsql_ord__\" ORDER BY {}", + query, order_quoted + ), + stat_columns, + dummy_columns, + consumed_aesthetics, + }, + } +} + /// Helper to extract column name from aesthetic value pub fn get_column_name(aesthetics: &Mappings, aesthetic: &str) -> Option { use crate::AestheticValue; @@ -260,6 +308,56 @@ mod tests { assert!(!aes.is_required("yend")); } + #[test] + fn wrap_with_order_by_identity_appends_order() { + let result = wrap_with_order_by("SELECT * FROM t", StatResult::Identity, "pos1"); + match result { + StatResult::Transformed { + query, + stat_columns, + dummy_columns, + consumed_aesthetics, + } => { + assert_eq!(query, "SELECT * FROM t ORDER BY \"__ggsql_aes_pos1__\""); + assert!(stat_columns.is_empty()); + assert!(dummy_columns.is_empty()); + assert!(consumed_aesthetics.is_empty()); + } + _ => panic!("expected Transformed"), + } + } + + #[test] + fn wrap_with_order_by_transformed_wraps_query_and_preserves_metadata() { + let inner = 
StatResult::Transformed { + query: "SELECT * FROM grouped".to_string(), + stat_columns: vec!["pos2".to_string(), "aggregate".to_string()], + dummy_columns: vec!["pos1".to_string()], + consumed_aesthetics: vec!["pos2".to_string()], + }; + let result = wrap_with_order_by("SELECT * FROM raw", inner, "pos1"); + match result { + StatResult::Transformed { + query, + stat_columns, + dummy_columns, + consumed_aesthetics, + } => { + assert_eq!( + query, + "SELECT * FROM (SELECT * FROM grouped) AS \"__ggsql_ord__\" ORDER BY \"__ggsql_aes_pos1__\"" + ); + assert_eq!( + stat_columns, + vec!["pos2".to_string(), "aggregate".to_string()] + ); + assert_eq!(dummy_columns, vec!["pos1".to_string()]); + assert_eq!(consumed_aesthetics, vec!["pos2".to_string()]); + } + _ => panic!("expected Transformed"), + } + } + #[test] fn test_color_alias_requires_stroke_or_fill() { // Geom with neither stroke nor fill: color alias should NOT be supported diff --git a/src/plot/layer/geom/violin.rs b/src/plot/layer/geom/violin.rs index 7fef55eb3..6ee8d95b6 100644 --- a/src/plot/layer/geom/violin.rs +++ b/src/plot/layer/geom/violin.rs @@ -123,6 +123,7 @@ impl GeomTrait for Violin { parameters: &HashMap, _execute_query: &dyn Fn(&str) -> crate::Result, dialect: &dyn crate::reader::SqlDialect, + _aesthetic_ctx: &crate::plot::aesthetic::AestheticContext, ) -> Result { stat_violin(query, aesthetics, group_by, parameters, dialect) } diff --git a/src/plot/layer/mod.rs b/src/plot/layer/mod.rs index 33321a486..b2a130614 100644 --- a/src/plot/layer/mod.rs +++ b/src/plot/layer/mod.rs @@ -199,7 +199,10 @@ impl Layer { format!("`{}`", name) }; - // Check if all required aesthetics exist. + // Check if all required aesthetics exist. The Aggregate stat replaces + // mapped values in place — it never synthesises new aesthetics — so + // every required aesthetic must be mapped by the user regardless of + // the `aggregate` setting. 
let mut missing = Vec::new(); let mut position_reqs: Vec<(&str, u8, &str)> = Vec::new(); @@ -419,12 +422,68 @@ impl Layer { { validate_parameter(param_name, value, &param.constraint)?; } - // Otherwise it's a valid aesthetic setting (no constraint validation needed) + // Otherwise it's a valid aesthetic setting (no constraint validation needed). + // + // `aggregate` is registered in each supporting geom's `default_params` + // so its structural shape (string / array of strings / null) is + // checked through the standard `validate_parameter` path above. The + // per-entry vocabulary check (function names exist in `AGG_NAMES`, + // band syntax, recycling rules) lives in + // `stat_aggregate::parse_aggregate_param` and runs at execute time + // (`apply`) and at validate time via + // [`validate_aggregate_setting`] (called from `validate.rs::validate` + // so `ggsql validate` surfaces vocab errors without executing). + } Ok(()) } + /// Validate the `aggregate` SETTING in isolation: per-entry vocabulary + /// (function names exist in `AGG_NAMES`, band syntax, recycling rules) + /// **and**, when `aesthetic_ctx` is supplied, target resolution (every + /// `:` target maps to a layer aesthetic; no two targets + /// resolve to the same aesthetic). The structural shape (string / array + /// of strings / null) is validated through the standard `default_params` + /// path in [`validate_settings`]; this function adds the layers the + /// static `ParamConstraint` can't express. + /// + /// Used by the standalone validate path (`ggsql validate`); the execute + /// path catches the same errors at execute time inside + /// `stat_aggregate::apply` (avoiding a redundant parse). 
+ pub fn validate_aggregate_setting( + &self, + aesthetic_ctx: Option<&AestheticContext>, + ) -> std::result::Result<(), String> { + if !self.geom.supports_aggregate() { + return Ok(()); + } + let value = match self.parameters.get("aggregate") { + Some(v) => v, + None => return Ok(()), + }; + // Skip when the value is the wrong shape — `validate_settings` will + // already have surfaced that error via the `default_params` path; we + // shouldn't add a second, redundant message. + if !matches!( + value, + ParameterValue::String(_) | ParameterValue::Array(_) | ParameterValue::Null + ) { + return Ok(()); + } + let spec = match crate::plot::layer::geom::stat_aggregate::parse_aggregate_param(value)? { + Some(s) => s, + None => return Ok(()), + }; + if let Some(ctx) = aesthetic_ctx { + crate::plot::layer::geom::stat_aggregate::resolve_aggregate_targets( + &spec, + &self.mappings, + ctx, + )?; + } + Ok(()) + } + /// Update layer mappings to use prefixed aesthetic column names. /// /// After building a layer query that creates aesthetic columns with prefixed names, diff --git a/src/plot/scale/scale_type/discrete.rs b/src/plot/scale/scale_type/discrete.rs index 58a8e8e7e..1f7f6bdd3 100644 --- a/src/plot/scale/scale_type/discrete.rs +++ b/src/plot/scale/scale_type/discrete.rs @@ -261,7 +261,7 @@ impl ScaleTypeTrait for Discrete { let allowed_values: Vec = input_range .iter() .filter_map(|e| match e { - ArrayElement::String(s) => Some(format!("'{}'", s.replace('\'', "''"))), + ArrayElement::String(s) => Some(naming::quote_literal(s)), ArrayElement::Boolean(b) => Some(if *b { "true".into() } else { "false".into() }), _ => None, }) diff --git a/src/plot/scale/scale_type/ordinal.rs b/src/plot/scale/scale_type/ordinal.rs index 0b103ba70..9adf64b40 100644 --- a/src/plot/scale/scale_type/ordinal.rs +++ b/src/plot/scale/scale_type/ordinal.rs @@ -292,7 +292,7 @@ impl ScaleTypeTrait for Ordinal { let allowed_values: Vec = input_range .iter() .filter_map(|e| match e { - 
ArrayElement::String(s) => Some(format!("'{}'", s.replace('\'', "''"))), + ArrayElement::String(s) => Some(naming::quote_literal(s)), ArrayElement::Boolean(b) => Some(if *b { "true".into() } else { "false".into() }), ArrayElement::Number(n) => Some(n.to_string()), _ => None, diff --git a/src/plot/types.rs b/src/plot/types.rs index d7ee7379d..534dc1027 100644 --- a/src/plot/types.rs +++ b/src/plot/types.rs @@ -4,6 +4,7 @@ //! settings, and values. These are the building blocks used in AST types //! to capture what the user specified in their query. +use crate::naming; use crate::reader::SqlDialect; use arrow::datatypes::DataType; use chrono::{DateTime, Datelike, NaiveDate, NaiveDateTime, NaiveTime, Timelike}; @@ -673,7 +674,7 @@ impl ArrayElement { /// Used for generating SQL expressions from literal values. pub fn to_sql(&self, dialect: &dyn SqlDialect) -> String { match self { - Self::String(s) => format!("'{}'", s.replace('\'', "''")), + Self::String(s) => naming::quote_literal(s), Self::Number(n) => n.to_string(), Self::Boolean(b) => dialect.sql_boolean_literal(*b), Self::Date(d) => dialect.sql_date_literal(*d), @@ -902,7 +903,7 @@ impl ParameterValue { /// Arrays are handled separately in annotation layer VALUES clause generation. pub fn to_sql(&self, dialect: &dyn SqlDialect) -> String { match self { - ParameterValue::String(s) => format!("'{}'", s.replace('\'', "''")), + ParameterValue::String(s) => naming::quote_literal(s), ParameterValue::Number(n) => n.to_string(), ParameterValue::Boolean(b) => dialect.sql_boolean_literal(*b), ParameterValue::Array(_) => { @@ -1355,6 +1356,25 @@ impl ParamConstraint { } } + /// Any string or array of any strings (with null elements allowed) — for + /// the `aggregate` parameter. Per-entry vocabulary is checked downstream + /// by `stat_aggregate::parse_aggregate_param`; this constraint just pins + /// the structural shape (string / array-of-strings / null). 
+ pub const fn string_or_string_array_unconstrained() -> Self { + Self { + number: TypeConstraint::Forbidden, + string: TypeConstraint::Any, + boolean: TypeConstraint::Forbidden, + array: TypeConstraint::Constrained(ArrayConstraint { + element: ArrayElementConstraint::String(StringConstraint::unconstrained()), + min_len: None, + max_len: None, + allow_null_elements: true, + }), + allow_null: true, + } + } + /// Builder method to disallow null values #[allow(dead_code)] pub const fn required(mut self) -> Self { diff --git a/src/reader/duckdb.rs b/src/reader/duckdb.rs index 00b8939dd..3aadcc265 100644 --- a/src/reader/duckdb.rs +++ b/src/reader/duckdb.rs @@ -106,6 +106,23 @@ impl super::SqlDialect for DuckDbDialect { ) } + fn sql_quantile_inline(&self, column: &str, fraction: f64) -> Option { + Some(format!( + "QUANTILE_CONT({}, {})", + naming::quote_ident(column), + fraction + )) + } + + fn sql_aggregate(&self, name: &str, qcol: &str) -> Option { + match name { + "first" => Some(format!("FIRST({})", qcol)), + "last" => Some(format!("LAST({})", qcol)), + "diff" => Some(format!("(LAST({c}) - FIRST({c}))", c = qcol)), + _ => super::default_sql_aggregate(name, qcol), + } + } + fn sql_percentile(&self, column: &str, fraction: f64, from: &str, groups: &[String]) -> String { let group_filter = groups .iter() diff --git a/src/reader/mod.rs b/src/reader/mod.rs index fc320a5dc..01ecc5481 100644 --- a/src/reader/mod.rs +++ b/src/reader/mod.rs @@ -188,6 +188,31 @@ pub trait SqlDialect { ) } + /// Inline-form quantile aggregate, usable directly in a `SELECT` list. + /// + /// Returns `Some(sql_fragment)` when the dialect supports a native quantile + /// aggregate that can be combined with other aggregates in the same `GROUP BY` + /// query (e.g. DuckDB's `QUANTILE_CONT`). Returns `None` when no native + /// inline form exists; callers should then fall back to [`sql_percentile`], + /// which produces a correlated scalar subquery. 
+ fn sql_quantile_inline(&self, _column: &str, _fraction: f64) -> Option { + None + } + + /// SQL fragment for a simple aggregate function applied to an + /// already-quoted column expression. + /// + /// Returns `Some(expr)` when the dialect can express this aggregate inline + /// in a `GROUP BY` query. Returns `None` when the aggregate is not + /// supported by this backend; the stat layer surfaces a clear error. + /// + /// Names handled here are the entries of `stat_aggregate::AGG_NAMES` other + /// than the percentile/iqr family, which goes through [`sql_quantile_inline`] + /// / [`sql_percentile`] instead. + fn sql_aggregate(&self, name: &str, qcol: &str) -> Option { + default_sql_aggregate(name, qcol) + } + /// SQL literal for a date value (days since Unix epoch). fn sql_date_literal(&self, days_since_epoch: i32) -> String { format!( @@ -264,6 +289,44 @@ pub(crate) fn wrap_with_column_aliases(body_sql: &str, column_aliases: &[String] ) } +/// Default aggregate SQL emission, shared so dialects can opt into the standard +/// portable forms while overriding selected functions. +/// +/// `first` / `last` are expressed as `MAX(CASE WHEN __ggsql_rn__ = … THEN col END)`, +/// which depends on the row-number columns the stat layer injects when any +/// aggregate references them. Backends with a cheaper native equivalent +/// (e.g. DuckDB's `FIRST`/`LAST`) override [`SqlDialect::sql_aggregate`]. 
+pub fn default_sql_aggregate(name: &str, qcol: &str) -> Option { + let s = match name { + "count" => format!("COUNT({})", qcol), + "sum" => format!("SUM({})", qcol), + "prod" => format!("EXP(SUM(LN({})))", qcol), + "min" => format!("MIN({})", qcol), + "max" => format!("MAX({})", qcol), + "range" => format!("(MAX({c}) - MIN({c}))", c = qcol), + "mid" => format!("((MIN({c}) + MAX({c})) / 2.0)", c = qcol), + "mean" => format!("AVG({})", qcol), + "geomean" => format!("EXP(AVG(LN({})))", qcol), + "harmean" => format!("(COUNT({c}) * 1.0 / SUM(1.0 / {c}))", c = qcol), + "rms" => format!("SQRT(AVG({c} * {c}))", c = qcol), + "sdev" => format!("STDDEV_POP({})", qcol), + "se" => format!("(STDDEV_POP({c}) / SQRT(COUNT({c})))", c = qcol), + "var" => format!("VAR_POP({})", qcol), + "first" => format!("MAX(CASE WHEN \"__ggsql_rn__\" = 1 THEN {} END)", qcol), + "last" => format!( + "MAX(CASE WHEN \"__ggsql_rn__\" = \"__ggsql_max_rn__\" THEN {} END)", + qcol + ), + "diff" => format!( + "(MAX(CASE WHEN \"__ggsql_rn__\" = \"__ggsql_max_rn__\" THEN {c} END) \ + - MAX(CASE WHEN \"__ggsql_rn__\" = 1 THEN {c} END))", + c = qcol + ), + _ => return None, + }; + Some(s) +} + pub struct AnsiDialect; impl SqlDialect for AnsiDialect {} @@ -502,8 +565,8 @@ pub trait Reader { fn list_schemas(&self, catalog: &str) -> Result> { let df = self.execute_sql(&format!( "SELECT DISTINCT schema_name FROM information_schema.schemata \ - WHERE catalog_name = '{}' ORDER BY schema_name", - catalog.replace('\'', "''") + WHERE catalog_name = {} ORDER BY schema_name", + naming::quote_literal(catalog) ))?; let col = df.column("schema_name")?; let mut results = Vec::with_capacity(df.height()); @@ -518,9 +581,9 @@ pub trait Reader { fn list_tables(&self, catalog: &str, schema: &str) -> Result> { let df = self.execute_sql(&format!( "SELECT DISTINCT table_name, table_type FROM information_schema.tables \ - WHERE table_catalog = '{}' AND table_schema = '{}' ORDER BY table_name", - catalog.replace('\'', "''"), - 
schema.replace('\'', "''") + WHERE table_catalog = {} AND table_schema = {} ORDER BY table_name", + naming::quote_literal(catalog), + naming::quote_literal(schema) ))?; let name_col = df.column("table_name")?; let type_col = df.column("table_type")?; @@ -539,11 +602,11 @@ pub trait Reader { fn list_columns(&self, catalog: &str, schema: &str, table: &str) -> Result> { let df = self.execute_sql(&format!( "SELECT column_name, data_type FROM information_schema.columns \ - WHERE table_catalog = '{}' AND table_schema = '{}' AND table_name = '{}' \ + WHERE table_catalog = {} AND table_schema = {} AND table_name = {} \ ORDER BY ordinal_position", - catalog.replace('\'', "''"), - schema.replace('\'', "''"), - table.replace('\'', "''") + naming::quote_literal(catalog), + naming::quote_literal(schema), + naming::quote_literal(table) ))?; let name_col = df.column("column_name")?; let type_col = df.column("data_type")?; diff --git a/src/reader/sqlite.rs b/src/reader/sqlite.rs index 1b780629f..3332236aa 100644 --- a/src/reader/sqlite.rs +++ b/src/reader/sqlite.rs @@ -70,6 +70,23 @@ impl super::SqlDialect for SqliteDialect { } } + /// Stock SQLite has no `STDDEV_POP` / `VAR_POP`, so express variance, + /// standard deviation, and standard error in portable arithmetic. Every + /// other aggregate falls through to the shared default. + fn sql_aggregate(&self, name: &str, qcol: &str) -> Option { + // Population variance with a `MAX(0, …)` floor against tiny negative + // floats from catastrophic cancellation. Both `MAX(a, b)` and `SQRT` + // are scalar functions in modern bundled SQLite (math-functions build). 
+ let var_pop = || format!("MAX(0.0, AVG({c} * {c}) - AVG({c}) * AVG({c}))", c = qcol); + let s = match name { + "var" => var_pop(), + "sdev" => format!("SQRT({})", var_pop()), + "se" => format!("(SQRT({}) / SQRT(COUNT({c})))", var_pop(), c = qcol), + _ => return super::default_sql_aggregate(name, qcol), + }; + Some(s) + } + /// SQLite does not support `CREATE OR REPLACE`, so emit a drop-then-create /// pair. Column aliases are preserved portably via the default CTE wrapper. fn create_or_replace_temp_table_sql( @@ -549,8 +566,8 @@ impl Reader for SqliteReader { table: &str, ) -> Result> { let df = self.execute_sql(&format!( - "SELECT name, type FROM pragma_table_info('{}') ORDER BY cid", - table.replace('\'', "''") + "SELECT name, type FROM pragma_table_info({}) ORDER BY cid", + naming::quote_literal(table) ))?; let name_col = df.column("name")?; let type_col = df.column("type")?; diff --git a/src/validate.rs b/src/validate.rs index 5cce990b9..8c7e715c6 100644 --- a/src/validate.rs +++ b/src/validate.rs @@ -250,6 +250,18 @@ pub fn validate(query: &str) -> Result { location: None, }); } + + // The aggregate setting is validated in isolation here so the + // standalone validate path (which doesn't run the stat) still + // catches malformed `aggregate` values and unmapped/duplicate + // targets. The execute path skips this; `stat_aggregate::apply` + // parses + reports there. + if let Err(e) = layer.validate_aggregate_setting(plot.aesthetic_context.as_ref()) { + errors.push(ValidationError { + message: format!("{}: {}", context, e), + location: None, + }); + } } }
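Review note: the SQLite override in this diff swaps `STDDEV_POP` / `VAR_POP` for portable arithmetic and defers every other aggregate to the shared default. The standalone sketch below illustrates that override-and-fall-through pattern under stated assumptions: `portable_aggregate` and `sqlite_like_aggregate` are hypothetical names standing in for `default_sql_aggregate` and `SqliteDialect::sql_aggregate`, only a small subset of the function vocabulary is included, and it is an illustration rather than the crate's actual code.

```rust
/// Hypothetical stand-in for the shared portable default: maps an aggregate
/// name to a SQL fragment over an already-quoted column expression.
/// Only three names are covered here, for illustration.
fn portable_aggregate(name: &str, qcol: &str) -> Option<String> {
    let sql = match name {
        "mean" => format!("AVG({})", qcol),
        "range" => format!("(MAX({c}) - MIN({c}))", c = qcol),
        "sdev" => format!("STDDEV_POP({})", qcol),
        // Unknown names are reported as unsupported, not as an error string.
        _ => return None,
    };
    Some(sql)
}

/// Hypothetical dialect override in the style of the SQLite change:
/// population variance is written as E[x^2] - E[x]^2, floored at zero so
/// that tiny negative results from floating-point cancellation never reach
/// SQRT. Every other name falls through to the portable default.
fn sqlite_like_aggregate(name: &str, qcol: &str) -> Option<String> {
    let var_pop = || format!("MAX(0.0, AVG({c} * {c}) - AVG({c}) * AVG({c}))", c = qcol);
    match name {
        "var" => Some(var_pop()),
        "sdev" => Some(format!("SQRT({})", var_pop())),
        _ => portable_aggregate(name, qcol),
    }
}
```

In this sketch, asking for `sdev` on the quoted column `"v"` produces the `SQRT(MAX(0.0, ...))` form, `mean` falls through to `AVG("v")`, and an unknown name yields `None`, which is the signal the caller would turn into a user-facing error.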