Introducing StateSchemaV3 for the TransformWithState operator by ericm-db · Pull Request #8 · ericm-db/spark

ericm-db · 2024-05-22T20:48:55Z

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

This is a trivial change to replace the loop index from `int` to `long`. Surprisingly, microbenchmark shows more than double performance uplift. Analysis -------- The hot loop of `arrayEquals` method is simplifed as below. Loop index `i` is defined as `int`, it's compared with `length`, which is a `long`, to determine if the loop should end. ``` public static boolean arrayEquals( Object leftBase, long leftOffset, Object rightBase, long rightOffset, final long length) { ...... int i = 0; while (i <= length - 8) { if (Platform.getLong(leftBase, leftOffset + i) != Platform.getLong(rightBase, rightOffset + i)) { return false; } i += 8; } ...... } ``` Strictly speaking, there's a code bug here. If `length` is greater than 2^31 + 8, this loop will never end because `i` as a 32 bit integer is at most 2^31 - 1. But compiler must consider this behaviour as intentional and generate code strictly match the logic. It prevents compiler from generating optimal code. Defining loop index `i` as `long` corrects this issue. Besides more accurate code logic, JIT is able to optimize this code much more aggressively. From microbenchmark, this trivial change improves performance significantly on both Arm and x86 platforms. Benchmark --------- Source code: https://gist.github.com/cyb70289/258e261f388e22f47e4d961431786d1a Result on Arm Neoverse N2: ``` Benchmark Mode Cnt Score Error Units ArrayEqualsBenchmark.arrayEqualsInt avgt 10 674.313 ± 0.213 ns/op ArrayEqualsBenchmark.arrayEqualsLong avgt 10 313.563 ± 2.338 ns/op ``` Result on Intel Cascake Lake: ``` Benchmark Mode Cnt Score Error Units ArrayEqualsBenchmark.arrayEqualsInt avgt 10 1130.695 ± 0.168 ns/op ArrayEqualsBenchmark.arrayEqualsLong avgt 10 461.979 ± 0.097 ns/op ``` Deep dive --------- Dive deep to the machine code level, we can see why the big gap. Listed below are arm64 assembly generated by Openjdk-17 C2 compiler. For `int i`, the machine code is similar to source code, no deep optimization. Safepoint polling is expensive in this short loop. ``` // jit c2 machine code snippet 0x0000ffff81ba8904: mov w15, wzr // int i = 0 0x0000ffff81ba8908: nop 0x0000ffff81ba890c: nop loop: 0x0000ffff81ba8910: ldr x10, [x13, w15, sxtw] // Platform.getLong(leftBase, leftOffset + i) 0x0000ffff81ba8914: ldr x14, [x12, w15, sxtw] // Platform.getLong(rightBase, rightOffset + i) 0x0000ffff81ba8918: cmp x10, x14 0x0000ffff81ba891c: b.ne 0x0000ffff81ba899c // return false if not equal 0x0000ffff81ba8920: ldr x14, [x28, apache#848] // x14 -> safepoint 0x0000ffff81ba8924: add w15, w15, #0x8 // i += 8 0x0000ffff81ba8928: ldr wzr, [x14] // safepoint polling 0x0000ffff81ba892c: sxtw x10, w15 // extend i to long 0x0000ffff81ba8930: cmp x10, x11 0x0000ffff81ba8934: b.le 0x0000ffff81ba8910 // if (i <= length - 8) goto loop ``` For `long i`, JIT is able to do much more aggressive optimization. E.g, below code snippet unrolls the loop by four. ``` // jit c2 machine code snippet unrolled_loop: 0x0000ffff91de6fe0: sxtw x10, w7 0x0000ffff91de6fe4: add x23, x22, x10 0x0000ffff91de6fe8: add x24, x21, x10 0x0000ffff91de6fec: ldr x13, [x23] // unroll-1 0x0000ffff91de6ff0: ldr x14, [x24] 0x0000ffff91de6ff4: cmp x13, x14 0x0000ffff91de6ff8: b.ne 0x0000ffff91de70a8 0x0000ffff91de6ffc: ldr x13, [x23, #8] // unroll-2 0x0000ffff91de7000: ldr x14, [x24, #8] 0x0000ffff91de7004: cmp x13, x14 0x0000ffff91de7008: b.ne 0x0000ffff91de70b4 0x0000ffff91de700c: ldr x13, [x23, #16] // unroll-3 0x0000ffff91de7010: ldr x14, [x24, #16] 0x0000ffff91de7014: cmp x13, x14 0x0000ffff91de7018: b.ne 0x0000ffff91de70a4 0x0000ffff91de701c: ldr x13, [x23, #24] // unroll-4 0x0000ffff91de7020: ldr x14, [x24, #24] 0x0000ffff91de7024: cmp x13, x14 0x0000ffff91de7028: b.ne 0x0000ffff91de70b0 0x0000ffff91de702c: add w7, w7, #0x20 0x0000ffff91de7030: cmp w7, w11 0x0000ffff91de7034: b.lt 0x0000ffff91de6fe0 ``` ### What changes were proposed in this pull request? A trivial change to replace loop index `i` of method `arrayEquals` from `int` to `long`. ### Why are the changes needed? To improve performance and fix a possible bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#49568 from cyb70289/arrayEquals. Authored-by: Yibo Cai <cyb70289@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>

…onicalized expressions ### What changes were proposed in this pull request? Make PullOutNonDeterministic use canonicalized expressions to dedup group and aggregate expressions. This affects pyspark udfs in particular. Example: ``` from pyspark.sql.functions import col, avg, udf pythonUDF = udf(lambda x: x).asNondeterministic() spark.range(10)\ .selectExpr("id", "id % 3 as value")\ .groupBy(pythonUDF(col("value")))\ .agg(avg("id"), pythonUDF(col("value")))\ .explain(extended=True) ``` Currently results in a plan like this: ``` Aggregate [_nondeterministic#15](#15), [_nondeterministic#15 AS dummyNondeterministicUDF(value)#12, avg(id#0L) AS avg(id)#13, dummyNondeterministicUDF(value#6L)#8 AS dummyNondeterministicUDF(value)#14](#15%20AS%20dummyNondeterministicUDF(value)#12,%20avg(id#0L)%20AS%20avg(id)#13,%20dummyNondeterministicUDF(value#6L)#8%20AS%20dummyNondeterministicUDF(value)#14) +- Project [id#0L, value#6L, dummyNondeterministicUDF(value#6L)#7 AS _nondeterministic#15](#0L,%20value#6L,%20dummyNondeterministicUDF(value#6L)#7%20AS%20_nondeterministic#15) +- Project [id#0L, (id#0L % cast(3 as bigint)) AS value#6L](#0L,%20(id#0L%20%%20cast(3%20as%20bigint))%20AS%20value#6L) +- Range (0, 10, step=1, splits=Some(2)) ``` and then it throws: ``` [[MISSING_AGGREGATION] The non-aggregating expression "value" is based on columns which are not participating in the GROUP BY clause. Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value(value)" if you do not care which of the values within a group is returned. SQLSTATE: 42803 ``` - how canonicalized fixes this: - nondeterministic PythonUDF expressions always have distinct resultIds per udf - The fix is to canonicalize the expressions when matching. Canonicalized means that we're setting the resultIds to -1, allowing us to dedup the PythonUDF expressions. - for deterministic UDFs, this rule does not apply and "Post Analysis" batch extracts and deduplicates the expressions, as expected ### Why are the changes needed? - the output of the query with the fix applied still makes sense - the nondeterministic UDF is invoked only once, in the project. ### Does this PR introduce _any_ user-facing change? Yes, it's additive, it enables queries to run that previously threw errors. ### How was this patch tested? - added unit test ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#52061 from benrobby/adhoc-fix-pull-out-nondeterministic. Authored-by: Ben Hurdelhey <ben.hurdelhey@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

### What changes were proposed in this pull request? Address the open follow-ups from [SPARK-56681](https://issues.apache.org/jira/browse/SPARK-56681) (umbrella for PATH / SPARK-56605 cleanup) in a single cleanup PR. Items #1 and #2 were already wired by SPARK-56639; this PR covers the remainder. | # | Item | Resolution | |---|---|---| | #1 | `FunctionResolution.resolveProcedure` was dead code | Already wired by SPARK-56639 (no action). | | #2 | Frozen view / SQL-function PATH wiring unfinished | Already done by SPARK-56639 (no action). | | #3 | `AnalysisContext.resolutionPathEntries` threadlocal | Audit only: confirmed `withNewAnalysisContext` / `reset()` correctly clear it. Full removal needs a coordinated refactor to plumb the path through `RelationResolution` / `FunctionResolution` method calls; flagged as a follow-up. | | #4 | `Analyzer.executeAndCheck` clobbers outer `SQLConf.withExistingConf` | Extracted `runWithSessionConf` helper, added `SQLConf.getExistingConfIfSet`. `executeAndCheck` and `executeSameContext` now share one path that yields to any outer scope. | | #5 | `VariableResolution.allowUnqualifiedSessionTempVariableLookup` force-loads default catalog | Replaced the hot-path catalog read with `CatalogManager.isSystemSessionOnPath`, which inspects stored session-path entries directly. No catalog load on column resolution. | | #6 | `DROP VARIABLE` PATH gate asymmetric with `DECLARE` / `CREATE` | Removed the gate. DDL on session variables (`DECLARE` / `CREATE` / `DROP`) always targets `system.session` directly; only DML (`SET VAR`, `SELECT x`) goes through PATH. | | #7 | `lookupFunctionType` exception swallow too broad | Narrowed from `NonFatal` to the explicit not-found list (`NoSuchFunctionException`, `NoSuchNamespaceException`, `CatalogNotFoundException`, `FORBIDDEN_OPERATION`). Other exceptions propagate. | | #8 | `lookupFunctionType` fan-out had wasteful `system.*` candidates | Filtered them out — `system.session`, `system.builtin`, `system.ai` are already resolved earlier in the same method. | | #9 | Three near-duplicate path-resolution helpers | Lifted into `CatalogManager.resolutionPathEntriesForAnalysis(pinnedEntries, viewCatalogAndNamespace)`. Relation, routine, and procedure resolution all route through it. | | #10 | Tests for the new error paths and gates | Added a DECLARE / SET VAR / DROP cycle test under non-default PATH and a struct-variable field-vs-qualified ambiguity test in `sql-session-variables.sql`. | | #11 | `ProtoToParsedPlanTestSuite.analyzerIsolationConf` was a bare `SQLConf` | Clone `spark.sessionState.conf` and only override `PATH_ENABLED=false`, so all `sparkConf` overrides (ANSI, alias config, ...) propagate automatically. | | Bonus | `ResolveSetVariable` hardcoded `SYSTEM.SESSION` regardless of actual PATH | `unresolvedVariableError` now takes `Seq[Seq[String]]` path entries with **required** `Origin` (no overloads). DML lookup failures (`SET VAR`, `FETCH ... INTO`) report the full SQL path as a bracketed list, byte-for-byte consistent with `UNRESOLVED_ROUTINE` and `TABLE_OR_VIEW_NOT_FOUND`. DDL name validation in `ResolveCatalogs` continues to report `[system.session]` since PATH does not apply there. Origin is plumbed through `VariableManager.set` so all error sites carry a `queryContext` pointing at the offending variable identifier (parser opt-ins via `withOrigin(identifierReference)` so the highlight is the variable name, not the whole statement). | ### Why are the changes needed? These are the cleanup items called out on SPARK-56681 from the post-merge source review of SPARK-56605. They eliminate dead code paths, plug user-visible bugs (force-loading a misconfigured default catalog on column resolution; clobbering pinned session configs; swallowing real catalog errors as `UNRESOLVED_ROUTINE`), remove the asymmetry between DDL and DML on session variables, and make `UNRESOLVED_VARIABLE` self-consistent with the other "not found" errors. ### Does this PR introduce _any_ user-facing change? Yes. - **`UNRESOLVED_VARIABLE.searchPath`** is now rendered as a bracketed list. For DML lookups (`SET VAR`, `FETCH ... INTO`), the list reflects the actual SQL PATH that was consulted instead of a hardcoded `SYSTEM.SESSION`. For DDL name validation (`DECLARE` / `DROP` with a non-session namespace), the list is `[`` `system`.`session` ``]` since PATH does not apply. - **`UNRESOLVED_VARIABLE`** now always carries a `queryContext` that highlights just the offending variable identifier (e.g. `"builtin.var1"`, `"ses.var1"`), not the whole `DECLARE` / `SET VAR` statement. - **`DROP TEMPORARY VARIABLE`** no longer raises `UNRESOLVED_VARIABLE` when the SQL PATH does not contain `system.session`. DDL on session variables ignores PATH, matching the existing behaviour of `DECLARE OR REPLACE VARIABLE`. - **`lookupFunctionType`** no longer swallows non–`NotFound` errors. A catalog reporting `PERMISSION_DENIED` (or similar) for a function lookup now propagates instead of silently producing `UNRESOLVED_ROUTINE`. ### How was this patch tested? - Added `sql-session-variables.sql` regression test for the struct-variable field-vs-qualified ambiguity (`DECLARE VARIABLE session STRUCT<a INT>` → `SELECT session.a` succeeds → `DROP` → `SELECT session.a` falls through to `UNRESOLVED_COLUMN`). - Updated `SetPathSuite`: DECLARE / SET VAR / DROP cycle under a non-default PATH; bonus test asserts the actual rendered search path and the variable-identifier `queryContext`. - Updated `SqlScriptingExecutionSuite` for the new bracketed `searchPath` and identifier-pinned `queryContext`. - Regenerated `sql-session-variables.sql.out` for the new error shape. - Added `resolutionPathEntriesForAnalysis` stubs to mocked `CatalogManager` instances in `PlanResolutionSuite`, `AlignAssignmentsSuiteBase`, and `TableLookupCacheSuite`. - Ran focused suites locally; all pass: - `build/sbt 'sql/testOnly *SetPathSuite *SqlScriptingExecutionSuite *ExecuteImmediateEndToEndSuite'` - `build/sbt 'sql/testOnly *SimpleSQLViewSuite *SQLFunctionSuite'` - `build/sbt 'sql/testOnly *PlanResolutionSuite *UpdateTableAlignAssignmentsSuite *MergeIntoTableAlignAssignmentsSuite'` - `build/sbt 'catalyst/testOnly *TableLookupCacheSuite *AnalysisSuite *AnalysisErrorSuite *LookupFunctionsSuite'` - `build/sbt 'sql/testOnly *FunctionQualificationSuite *RelationQualificationSuite *DataSourceV2FunctionSuite'` - `build/sbt 'sql/testOnly *SQLQuerySuite'` - `build/sbt 'connect/testOnly *ProtoToParsedPlanTestSuite'` - `build/sbt 'sql/testOnly *SQLQueryTestSuite -- -z sql-session-variables.sql'` - Full `org.apache.spark.sql.catalyst.analysis.*`, `org.apache.spark.sql.catalyst.parser.*`, and `org.apache.spark.sql.analysis.resolver.*` suites. - `scalastyle` and `scalafmt` clean across catalyst, sql, and connect modules. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor Claude Opus 4.7 Closes apache#55647 from srielau/SPARK-56681-patch-clean-up. Authored-by: Serge Rielau <serge@rielau.com> Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>

github-actions Bot added SQL STRUCTURED STREAMING labels May 22, 2024

ericm-db added 9 commits June 3, 2024 14:38

mapstateimpl

575cfba

adding support for StateSchemaV3

b9fd42f

schemav3

a621ba8

able to serialize, deserialize schemas

8feb83a

adding colfamilymetadata trait

6a68d6b

using arraylist

8f75943

moving stateVariables.add

970cc13

renaming to ColumnFamilySchema

abf8c04

moving getListState/getMapState methods

37526de

ericm-db force-pushed the tws-state-schema-changes branch from d6cd20f to 37526de Compare June 3, 2024 21:41

removing the columnFamilyAccumulator

67acea8

ericm-db force-pushed the tws-state-schema-changes branch from ccee64e to 67acea8 Compare June 4, 2024 04:22

propagating information to executor through broadcast

411c5b8

ericm-db closed this Jul 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introducing StateSchemaV3 for the TransformWithState operator#8

Introducing StateSchemaV3 for the TransformWithState operator#8
ericm-db wants to merge 11 commits into
tws-operator-state-metadatafrom
tws-state-schema-changes

ericm-db commented May 22, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ericm-db commented May 22, 2024

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant