Encoder ttl poc by ericm-db · Pull Request #1 · ericm-db/spark

ericm-db · 2024-02-21T00:28:42Z

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

…unit3-driver` instead of `net.sourceforge.htmlunit`" This reverts commit f5b1e37.

### What changes were proposed in this pull request? This PR uses SMALLINT (as TINYINT ranges [0, 255]) instead of BYTE to fix the ByteType mapping for MsSQLServer JDBC ```java [info] com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or variable #1: Cannot find data type BYTE. [info] at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:265) [info] at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1662) [info] at com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:898) [info] at com.microsoft.sqlserver.jdbc.SQLServerStatement$StmtExecCmd.doExecute(SQLServerStatement.java:793) [info] at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7417) [info] at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3488) [info] at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:262) [info] at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:237) [info] at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeUpdate(SQLServerStatement.java:733) [info] at org.apache.spark.sql.jdbc.JdbcDialect.createTable(JdbcDialects.scala:267) ``` ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#46164 from yaooqinn/SPARK-47938. Lead-authored-by: Kent Yao <yao@apache.org> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

…rtition data results should return user-facing error ### What changes were proposed in this pull request? Create an example parquet table with partitions and insert data in Spark: ``` create table t(col1 string, col2 string, col3 string) using parquet location 'some/path/parquet-test' partitioned by (col1, col2); insert into t (col1, col2, col3) values ('a', 'b', 'c'); ``` Go into the `parquet-test` path in the filesystem and try to copy parquet data file from path `col1=a/col2=b` directory into `col1=a`. After that, try to create new table based on parquet data in Spark: ``` create table broken_table using parquet location 'some/path/parquet-test'; ``` This query errors with internal error. Stack trace excerpts: ``` org.apache.spark.SparkException: [INTERNAL_ERROR] Eagerly executed command failed. You hit a bug in Spark or the Spark plugins you use. Please, report this bug to the corresponding communities or vendors, and provide the full stack trace. SQLSTATE: XX000 ... Caused by: java.lang.AssertionError: assertion failed: Conflicting partition column names detected: Partition column name list #0: col1 Partition column name list #1: col1, col2For partitioned table directories, data files should only live in leaf directories. And directories at the same level should have the same partition column name. Please check the following directories for unexpected files or inconsistent partition column names: file:some/path/parquet-test/col1=a file:some/path/parquet-test/col1=a/col2=b at scala.Predef$.assert(Predef.scala:279) at org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:391) ... ``` Fix this by changing internal error to user-facing error. ### Why are the changes needed? Replace internal error with user-facing one for valid sequence of Spark SQL operations. ### Does this PR introduce _any_ user-facing change? Yes, it presents the user with regular error instead of internal error. ### How was this patch tested? Added checks to `ParquetPartitionDiscoverySuite` which simulate the described scenario by manually breaking parquet table in the filesystem. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#47668 from nikolamand-db/SPARK-49163. Authored-by: Nikola Mandic <nikola.mandic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…in/load-spark-env.sh ### What changes were proposed in this pull request? The last action in [bin/load-spark-env.sh](https://github.com/apache/spark/blob/d5da49d56d7dec5f8a96c5252384d865f7efd4d9/bin/load-spark-env.sh#L68) performs a test to determine whether running in a terminal or not, and whether `stdin` is reading from a pipe. A more portable test is needed. ### Why are the changes needed? The current approach relies on `ps` with options that vary significantly between different Unix-like systems. Specifically, it prints an error message in both `cygwin` and `msys2` (and by extension, in all of the variations of `git-for-windows`). It doesn't print an error message, but fails to detect a terminal session in `Linux` and `Osx/Darwin homebrew` (always thinks STDIN is a pipe). Here's what the problem looks like in a `cygwin64` session (with `set -x` just ahead of the section of interest): If called directly: ```bash $ bin/load-spark-env.sh ++ ps -o stat= -p 1947 ps: unknown option -- o Try `ps --help' for more information. + [[ ! '' =~ \+ ]] + [[ -p /dev/stdin ]] + export 'SPARK_BEELINE_OPTS= -Djline.terminal=jline.UnsupportedTerminal' + SPARK_BEELINE_OPTS=' -Djline.terminal=jline.UnsupportedTerminal' ``` Interestingly, due to the 2-part test, it does the right thing w.r.t. the Terminal test, the main problem being the error message. If called downstream from a pipe: ```bash $ echo "yo" | bin/load-spark-env.sh ++ ps -o stat= -p 1955 ps: unknown option -- o Try `ps --help' for more information. + [[ ! '' =~ \+ ]] + [[ -p /dev/stdin ]] ``` Again, it correctly detects the pipe environment, but with an error message. In WSL2 Ubuntu, the test doesn't correctly detect a non-pipe terminal session: ```bash # /opt/spark$ bin/load-spark-env.sh ++ ps -o stat= -p 1423 + [[ ! S+ =~ \+ ]] # echo "yo!" | bin/load-spark-env.sh ++ ps -o stat= -p 1416 + [[ ! S+ =~ \+ ]] ``` In `apache#134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024`, the same failure occurs (it doesn't recognize terminal environments). ### Does this PR introduce _any_ user-facing change? This is a proposed bug fix, and, other than fixing the bug, should be invisible to users. ### How was this patch tested? The patch was verified to behave as intended in terminal sessions, both interactive and piped, in the following 5 environments. ``` - Linux quadd 5.15.0-124-generic apache#134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux - Linux d5 5.15.153.1-microsoft-standard-WSL2 #1 SMP Fri Mar 29 23:14:13 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux - MINGW64_NT-10.0-22631 d5 3.5.4-0bc1222b.x86_64 2024-09-04 18:28 UTC x86_64 Msys - CYGWIN_NT-10.0-22631 d5 3.5.3-1.x86_64 2024-04-03 17:25 UTC x86_64 Cygwin - Darwin suemac.local 23.6.0 Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:21 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T8103 arm64 ``` The test was to manually run the following script, verifying the expected response to both pipe and terminal sessions. ```bash #!/bin/bash if [ -e /usr/bin/tty -a "`tty`" != "not a tty" -a ! -p /dev/stdin ]; then echo "not a pipe" else echo "is a pipe" fi ``` The output of the manual test in all 5 tested environments. ``` philwalkquadd:/opt/spark $ isPipe not a pipe # $ echo "yo" | isPipe is a pipe # ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#48937 from philwalk/portability-fix-for-load-spark-env.sh. Authored-by: philwalk <philwalk9@gmail.com> Signed-off-by: yangjie01 <yangjie01@baidu.com>

### What changes were proposed in this pull request? This PR aims to disable `SparkBloomFilterSuite` due to the excessive running time. - SPARK-53077 is filed to re-enable this with the reasonable running time. ### Why are the changes needed? Previously, `common/sketch` module took less than 10s. ``` $ mvn package --pl common/sketch ... [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 7.177 s [INFO] Finished at: 2025-08-02T08:25:43-07:00 [INFO] ------------------------------------------------------------------------ ``` After `SparkBloomFilterSuite` was added newly, `SparkBloomFilterSuite` took over 12 minutes. It's too long as a unit test. - apache#50933 ``` [info] Test org.apache.spark.util.sketch.SparkBloomFilterSuite#testAccuracyRandomDistribution(long, double, int, org.junit.jupiter.api.TestInfo):#1 started [info] Test org.apache.spark.util.sketch.SparkBloomFilterSuite#testAccuracyEvenOdd(long, double, int, org.junit.jupiter.api.TestInfo):#1 started [info] Test run finished: 0 failed, 0 ignored, 2 total, 721.939s ``` ### Does this PR introduce _any_ user-facing change? No, this is a test change. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#51788 from dongjoon-hyun/SPARK-53076. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: yangjie01 <yangjie01@baidu.com>

… `SparkBloomFilterSuite` ### What changes were proposed in this pull request? This pr adds an environment variable named `SPARK_TEST_SPARK_BLOOM_FILTER_SUITE_ENABLED` to control whether the test case `SparkBloomFilterSuite` is executed. It also ensures that this test case is only run for validation in the daily tests specified in `build_non_ansi.yml`. ### Why are the changes needed? The `SparkBloomFilterSuite` requires periodic validation, but due to its excessively long execution time (over 10 minutes), it is not suitable for execution in the Change Pipeline. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual verification: - maven ``` build/mvn package --pl common/sketch [INFO] Running org.apache.spark.util.sketch.SparkBloomFilterSuite [WARNING] Tests run: 2, Failures: 0, Errors: 0, Skipped: 2, Time elapsed: 0.001 s -- in org.apache.spark.util.sketch.SparkBloomFilterSuite ``` ``` SPARK_TEST_SPARK_BLOOM_FILTER_SUITE_ENABLED=true build/mvn package --pl common/sketch [INFO] Running org.apache.spark.util.sketch.SparkBloomFilterSuite [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 401.9 s -- in org.apache.spark.util.sketch.SparkBloomFilterSuite ``` - sbt ``` build/sbt clean "sketch/test" [info] Test run started (JUnit Jupiter) [info] Test org.apache.spark.util.sketch.SparkBloomFilterSuite ignored: Environment variable [SPARK_TEST_SPARK_BLOOM_FILTER_SUITE_ENABLED] does not exist [info] Test run finished: 0 failed, 0 ignored, 0 total, 0.016s ``` ``` SPARK_TEST_SPARK_BLOOM_FILTER_SUITE_ENABLED=true build/sbt clean "sketch/test" [info] Test run started (JUnit Jupiter) [info] Test org.apache.spark.util.sketch.SparkBloomFilterSuite#testAccuracyRandomDistribution(long, double, int, org.junit.jupiter.api.TestInfo):#1 started [info] Test org.apache.spark.util.sketch.SparkBloomFilterSuite#testAccuracyEvenOdd(long, double, int, org.junit.jupiter.api.TestInfo):#1 started [info] Test run finished: 0 failed, 0 ignored, 2 total, 359.776s ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#51806 from LuciferYang/SPARK-53077. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com>

…e of the closed PRs ### What changes were proposed in this pull request? This PR aims to improve `merge_spark_pr.py` to stop early in case of the closed PRs. ### Why are the changes needed? To help committers by removing the useless interactions. **BEFORE** ``` $ dev/merge_spark_pr.py git rev-parse --abbrev-ref HEAD Which pull request would you like to merge? (e.g. 34): 1 I've re-written the title as follows to match the standard format: Original: Removed reference to incubation in README.md. Modified: Removed reference to incubation in README.md Would you like to use the modified title? (y/N): ... ``` **AFTER** ``` $ dev/merge_spark_pr.py git rev-parse --abbrev-ref HEAD Which pull request would you like to merge? (e.g. 34): 1 #1 is closed already. Restoring head pointer to SPARK-53277 git checkout SPARK-53277 Already on 'SPARK-53277' git branch ``` ### Does this PR introduce _any_ user-facing change? No, this is used by committers only. ### How was this patch tested? Manually check. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52019 from dongjoon-hyun/SPARK-53277. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

### What changes were proposed in this pull request? Address the open follow-ups from [SPARK-56681](https://issues.apache.org/jira/browse/SPARK-56681) (umbrella for PATH / SPARK-56605 cleanup) in a single cleanup PR. Items #1 and #2 were already wired by SPARK-56639; this PR covers the remainder. | # | Item | Resolution | |---|---|---| | #1 | `FunctionResolution.resolveProcedure` was dead code | Already wired by SPARK-56639 (no action). | | #2 | Frozen view / SQL-function PATH wiring unfinished | Already done by SPARK-56639 (no action). | | #3 | `AnalysisContext.resolutionPathEntries` threadlocal | Audit only: confirmed `withNewAnalysisContext` / `reset()` correctly clear it. Full removal needs a coordinated refactor to plumb the path through `RelationResolution` / `FunctionResolution` method calls; flagged as a follow-up. | | #4 | `Analyzer.executeAndCheck` clobbers outer `SQLConf.withExistingConf` | Extracted `runWithSessionConf` helper, added `SQLConf.getExistingConfIfSet`. `executeAndCheck` and `executeSameContext` now share one path that yields to any outer scope. | | #5 | `VariableResolution.allowUnqualifiedSessionTempVariableLookup` force-loads default catalog | Replaced the hot-path catalog read with `CatalogManager.isSystemSessionOnPath`, which inspects stored session-path entries directly. No catalog load on column resolution. | | #6 | `DROP VARIABLE` PATH gate asymmetric with `DECLARE` / `CREATE` | Removed the gate. DDL on session variables (`DECLARE` / `CREATE` / `DROP`) always targets `system.session` directly; only DML (`SET VAR`, `SELECT x`) goes through PATH. | | #7 | `lookupFunctionType` exception swallow too broad | Narrowed from `NonFatal` to the explicit not-found list (`NoSuchFunctionException`, `NoSuchNamespaceException`, `CatalogNotFoundException`, `FORBIDDEN_OPERATION`). Other exceptions propagate. | | #8 | `lookupFunctionType` fan-out had wasteful `system.*` candidates | Filtered them out — `system.session`, `system.builtin`, `system.ai` are already resolved earlier in the same method. | | #9 | Three near-duplicate path-resolution helpers | Lifted into `CatalogManager.resolutionPathEntriesForAnalysis(pinnedEntries, viewCatalogAndNamespace)`. Relation, routine, and procedure resolution all route through it. | | #10 | Tests for the new error paths and gates | Added a DECLARE / SET VAR / DROP cycle test under non-default PATH and a struct-variable field-vs-qualified ambiguity test in `sql-session-variables.sql`. | | #11 | `ProtoToParsedPlanTestSuite.analyzerIsolationConf` was a bare `SQLConf` | Clone `spark.sessionState.conf` and only override `PATH_ENABLED=false`, so all `sparkConf` overrides (ANSI, alias config, ...) propagate automatically. | | Bonus | `ResolveSetVariable` hardcoded `SYSTEM.SESSION` regardless of actual PATH | `unresolvedVariableError` now takes `Seq[Seq[String]]` path entries with **required** `Origin` (no overloads). DML lookup failures (`SET VAR`, `FETCH ... INTO`) report the full SQL path as a bracketed list, byte-for-byte consistent with `UNRESOLVED_ROUTINE` and `TABLE_OR_VIEW_NOT_FOUND`. DDL name validation in `ResolveCatalogs` continues to report `[system.session]` since PATH does not apply there. Origin is plumbed through `VariableManager.set` so all error sites carry a `queryContext` pointing at the offending variable identifier (parser opt-ins via `withOrigin(identifierReference)` so the highlight is the variable name, not the whole statement). | ### Why are the changes needed? These are the cleanup items called out on SPARK-56681 from the post-merge source review of SPARK-56605. They eliminate dead code paths, plug user-visible bugs (force-loading a misconfigured default catalog on column resolution; clobbering pinned session configs; swallowing real catalog errors as `UNRESOLVED_ROUTINE`), remove the asymmetry between DDL and DML on session variables, and make `UNRESOLVED_VARIABLE` self-consistent with the other "not found" errors. ### Does this PR introduce _any_ user-facing change? Yes. - **`UNRESOLVED_VARIABLE.searchPath`** is now rendered as a bracketed list. For DML lookups (`SET VAR`, `FETCH ... INTO`), the list reflects the actual SQL PATH that was consulted instead of a hardcoded `SYSTEM.SESSION`. For DDL name validation (`DECLARE` / `DROP` with a non-session namespace), the list is `[`` `system`.`session` ``]` since PATH does not apply. - **`UNRESOLVED_VARIABLE`** now always carries a `queryContext` that highlights just the offending variable identifier (e.g. `"builtin.var1"`, `"ses.var1"`), not the whole `DECLARE` / `SET VAR` statement. - **`DROP TEMPORARY VARIABLE`** no longer raises `UNRESOLVED_VARIABLE` when the SQL PATH does not contain `system.session`. DDL on session variables ignores PATH, matching the existing behaviour of `DECLARE OR REPLACE VARIABLE`. - **`lookupFunctionType`** no longer swallows non–`NotFound` errors. A catalog reporting `PERMISSION_DENIED` (or similar) for a function lookup now propagates instead of silently producing `UNRESOLVED_ROUTINE`. ### How was this patch tested? - Added `sql-session-variables.sql` regression test for the struct-variable field-vs-qualified ambiguity (`DECLARE VARIABLE session STRUCT<a INT>` → `SELECT session.a` succeeds → `DROP` → `SELECT session.a` falls through to `UNRESOLVED_COLUMN`). - Updated `SetPathSuite`: DECLARE / SET VAR / DROP cycle under a non-default PATH; bonus test asserts the actual rendered search path and the variable-identifier `queryContext`. - Updated `SqlScriptingExecutionSuite` for the new bracketed `searchPath` and identifier-pinned `queryContext`. - Regenerated `sql-session-variables.sql.out` for the new error shape. - Added `resolutionPathEntriesForAnalysis` stubs to mocked `CatalogManager` instances in `PlanResolutionSuite`, `AlignAssignmentsSuiteBase`, and `TableLookupCacheSuite`. - Ran focused suites locally; all pass: - `build/sbt 'sql/testOnly *SetPathSuite *SqlScriptingExecutionSuite *ExecuteImmediateEndToEndSuite'` - `build/sbt 'sql/testOnly *SimpleSQLViewSuite *SQLFunctionSuite'` - `build/sbt 'sql/testOnly *PlanResolutionSuite *UpdateTableAlignAssignmentsSuite *MergeIntoTableAlignAssignmentsSuite'` - `build/sbt 'catalyst/testOnly *TableLookupCacheSuite *AnalysisSuite *AnalysisErrorSuite *LookupFunctionsSuite'` - `build/sbt 'sql/testOnly *FunctionQualificationSuite *RelationQualificationSuite *DataSourceV2FunctionSuite'` - `build/sbt 'sql/testOnly *SQLQuerySuite'` - `build/sbt 'connect/testOnly *ProtoToParsedPlanTestSuite'` - `build/sbt 'sql/testOnly *SQLQueryTestSuite -- -z sql-session-variables.sql'` - Full `org.apache.spark.sql.catalyst.analysis.*`, `org.apache.spark.sql.catalyst.parser.*`, and `org.apache.spark.sql.analysis.resolver.*` suites. - `scalastyle` and `scalafmt` clean across catalyst, sql, and connect modules. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor Claude Opus 4.7 Closes apache#55647 from srielau/SPARK-56681-patch-clean-up. Authored-by: Serge Rielau <serge@rielau.com> Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>

ericm-db added 8 commits February 14, 2024 16:21

init

ded6a6e

Revert "[SPARK-44445][BUILD][TESTS] Use `org.seleniumhq.selenium.html…

c4404cd

…unit3-driver` instead of `net.sourceforge.htmlunit`" This reverts commit f5b1e37.

column family not created, removing key

3766755

removing log statements

1e91582

storing ttl as long

6ded3c4

actually checking ttl

a8233cf

deadlock, specifying RocksDB version

de568a9

ttl before commit

4c8a017

github-actions Bot added BUILD CORE SQL DSTREAM STRUCTURED STREAMING WEB UI labels Feb 21, 2024

ericm-db added 3 commits February 21, 2024 10:35

before commit, cannot commit error

4b5ccd0

maintenance works

98a5b1c

removing log lines

a1558b6

ericm-db closed this Apr 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Encoder ttl poc#1

Encoder ttl poc#1
ericm-db wants to merge 11 commits into
masterfrom
encoder-ttl-poc

ericm-db commented Feb 21, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ericm-db commented Feb 21, 2024

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant