Map State init, test cases pass by ericm-db · Pull Request #4 · ericm-db/spark

ericm-db · 2024-04-01T22:47:42Z

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

…nsure it ### What changes were proposed in this pull request? This PR adds a unit test to ensure that the fields of the `LogKey` enumeration are sorted alphabetically, as specified by https://issues.apache.org/jira/browse/SPARK-47705. ### Why are the changes needed? This will make sure that the fields of the enumeration remain easy to read in the future as we add more cases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR adds testing coverage only. ### Was this patch authored or co-authored using generative AI tooling? GitHub copilot offered some suggestions, but I rejected them Closes apache#45857 from dtenedor/logs. Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>

### What changes were proposed in this pull request? Fix formatting of error messages. Example: In several error messages, we are concatenating the plan without any separation: `in this locationFilter (dept_id#652`. We should add a colon and space or newline in between. Before: ``` org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED] Unsupported subquery expression: Accessing outer query column is not allowed in this locationFilter (dept_id#22 = outer(dept_id#16)) +- SubqueryAlias dept +- View (`DEPT`, [dept_id#22, dept_name#23, state#24]) +- Project [cast(dept_id#25 as int) AS dept_id#22, cast(dept_name#26 as string) AS dept_name#23, cast(state#27 as string) AS state#24] +- Project [dept_id#25, dept_name#26, state#27] +- SubqueryAlias DEPT +- LocalRelation [dept_id#25, dept_name#26, state#27] . SQLSTATE: 0A000; line 3 pos 19; ``` After: ``` org.apache.spark.sql.catalyst.ExtendedAnalysisException: [UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED] Unsupported subquery expression: Accessing outer query column is not allowed in this location: Filter (dept_id#71 = outer(dept_id#65)) +- SubqueryAlias dept +- View (`DEPT`, [dept_id#71, dept_name#72, state#73]) +- Project [cast(dept_id#74 as int) AS dept_id#71, cast(dept_name#75 as string) AS dept_name#72, cast(state#76 as string) AS state#73] +- Project [dept_id#74, dept_name#75, state#76] +- SubqueryAlias DEPT +- LocalRelation [dept_id#74, dept_name#75, state#76] . SQLSTATE: 0A000; line 3 pos 19; ``` ### Why are the changes needed? Improve error messages readability. ### Does this PR introduce _any_ user-facing change? Improve error messages readability. ### How was this patch tested? Unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#45825 from jchen5/treenode-error-format. Authored-by: Jack Chen <jack.chen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…osing ### What changes were proposed in this pull request? When closing the rocksdb instance, we need to wait until all background work finish. If not, the following error could be observed: ``` 24/03/29 06:47:11 INFO RocksDB StateStoreId(opId=0,partId=0,name=default): [NativeRocksDB-2] [/error_handler.cc:396] Background IO error IO error: No such file or directory: While open a file for appending: /ephemeral/tmp/spark-efd53c17-2b8a-4f80-aca0-b767dc06be3d/StateStoreId(opId=0,partId=0,name=default)-732271d8-03b3-4046-a911-5797804df25c/workingDir-3a85e625-e4fb-4a78-b668-941ca16cc7a2/000008.sst: No such file or directory 24/03/29 06:47:11 ERROR RocksDB StateStoreId(opId=0,partId=0,name=default): [NativeRocksDB-3] [/db_impl/db_impl_compaction_flush.cc:3021] Waiting after background flush error: IO error: No such file or directory: While open a file for appending: /ephemeral/tmp/spark-efd53c17-2b8a-4f80-aca0-b767dc06be3d/StateStoreId(opId=0,partId=0,name=default)-732271d8-03b3-4046-a911-5797804df25c/workingDir-3a85e625-e4fb-4a78-b668-941ca16cc7a2/000008.sst: No such file or directoryAccumulated background error counts: 1 <TRUNCATED LOG> 24/03/29 11:54:09 INFO ShutdownHookManager: Deleting directory /ephemeral/tmp/spark-b5dac908-59cc-4276-80f7-34dab79716b7/StateStoreId(opId=0,partId=0,name=default)-702d3c8f-245e-4119-a763-b8e963d07e7b 24/03/29 06:47:12 INFO ShutdownHookManager: Deleting directory /ephemeral/tmp/spark-efd53c17-2b8a-4f80-aca0-b767dc06be3d/StateStoreId(opId=0,partId=4,name=default)-0eb30b1b-b92f-4744-aff6-85f9efd2bcf2 24/03/29 06:47:12 INFO ShutdownHookManager: Deleting directory /ephemeral/tmp/streaming.metadata-d281c16c-89c7-49b3-b65a-6eb2de6ddb6f pthread lock: Invalid argument ``` In the source code, after this error is thrown, there is a sleep for 1 second and then re lock the original mutex: https://github.com/facebook/rocksdb/blob/e46ab9d4f0a0e63bfc668421e2994efa918d6570/db/db_impl/db_impl_compaction_flush.cc#L2613 From the logs of RocksDB and ShutdownHookManager , we can see that exactly 1 second after rocks db throws, the pthread lock: Invalid argument is thrown. So it is likely that this mutex throws. ### Why are the changes needed? Bug fix for a transient issue ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test should be enough. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#45863 from WweiL/SPARK-47722-rocksdb-cleanup. Authored-by: Wei Liu <wei.liu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

### What changes were proposed in this pull request? As suggested in https://github.com/apache/spark/pull/45834/files#r1549565157, I am creating initial guidelines for the structured logging framework. ### Why are the changes needed? We need guidelines to align the logging migration works in the community. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It's just doc change. ### Was this patch authored or co-authored using generative AI tooling? Yes. Generated-by: GitHub Copilot 1.2.17.2887 Closes apache#45862 from gengliangwang/logREADME. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

… support ### What changes were proposed in this pull request? This PR adds automatic casting and collations resolution as per `PGSQL` behaviour: 1. Collations set on the metadata level are implicit 2. Collations set using the `COLLATE` expression are explicit 3. When there is a combination of expressions of multiple collations the output will be: - if there are explicit collations and all of them are equal then that collation will be the output - if there are multiple different explicit collations `COLLATION_MISMATCH.EXPLICIT` will be thrown - if there are no explicit collations and only a single type of non default collation, that one will be used - if there are no explicit collations and multiple non-default implicit ones `COLLATION_MISMATCH.IMPLICIT` will be thrown INDETERMINATE_COLLATION should only be thrown on comparison operations and memory storing of data, and we should be able to combine different implicit collations for certain operations like concat and possible others in the future. This is why we have to add another predefined collation id named INDETERMINATE_COLLATION_ID which means that the result is a combination of conflicting non-default implicit collations. Right now it would an id of -1 so it fail if it ever goes to the CollatorFactory. ### Why are the changes needed? We need to be able to compare columns and values with different collations and set a way of explicitly changing the collation we want to use. ### Does this PR introduce _any_ user-facing change? Yes. We add 3 new errors and enable collation casting. ### How was this patch tested? Tests in `CollationSuite` were done to check code validity. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#45383 from mihailom-db/SPARK-47210. Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

### What changes were proposed in this pull request? Currently if there's any gRPC exception, instead of just handling it, the PySpark's gRPC error handler is going to print it out to the stderr, not allowing the user to cleanly ignore the exception by using try/except control flow statement. In this PR we are removing the logger.exception call and we rely on the downstream exception mechanism to report this to the user. ### Why are the changes needed? Without this change, there's no way that the user ignores the gRPC error without piping the stderr to /dev/null or equivalent. ### Does this PR introduce _any_ user-facing change? Yes, the stderr will not have the exception trace written twice. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#45840 from nemanja-boric-databricks/no-log. Authored-by: Nemanja Boric <nemanja.boric@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…park.speculation.quantile` to 0.9 ### What changes were proposed in this pull request? This PR aims to adjust the following in order to make Spark speculative execution behavior less aggressive from Apache Spark 4.0.0. - `spark.speculation.multiplier`: 1.5 -> 3 - `spark.speculation.quantile`: 0.75 -> 0.9 ### Why are the changes needed? Although `spark.speculation` is disabled by default, this has been used in many production use cases. ### Does this PR introduce _any_ user-facing change? This will make a speculative execution less agressive. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#45858 from dongjoon-hyun/SPARK-47720. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

…e package ### What changes were proposed in this pull request? This PR proposes to release a separate `pyspark-connect` package, see also [SPIP: Pure Python Package in PyPI (Spark Connect)](https://docs.google.com/document/d/1Pund40wGRuB72LX6L7cliMDVoXTPR-xx4IkPmMLaZXk/edit?usp=sharing). Today's PySpark package is roughly as follows: ``` pyspark ├── *.py # *Core / No Spark Connect support* ├── mllib # MLlib / No Spark Connect support ├── resource # Resource profile API / No Spark Connect support ├── streaming # DStream (deprecated) / No Spark Connect support ├── ml # ML │ └── connect # Spark Connect for ML ├── pandas # API on Spark with/without Spark Connect support └── sql # SQL └── connect # Spark Connect for SQL └── streaming # Spark Connect for Structured Streaming ``` There will be two packages available, `pyspark` and `pyspark-connect`. #### `pyspark` Same as today’s PySpark. But Core module is factored out to `pyspark.core.*`. User-facing interface stays the same at `pyspark.*`. ``` pyspark ├── core # *Core / No Spark Connect support* ├── mllib # MLlib / No Spark Connect support ├── resource # Resource profile API / No Spark Connect support ├── streaming # DStream (deprecated) / No Spark Connect support ├── ml # ML │ └── connect # Spark Connect for ML ├── pandas # API on Spark with/without Spark Connect support └── sql # SQL └── connect # Spark Connect for SQL └── streaming # Spark Connect for Structured Streaming ``` #### `pyspark-connect` Package after excluding modules that do not support Spark Connect, also excluding jars, that are, ml without jars: ``` pyspark ├── ml │ └── connect ├── pandas └── sql └── connect └── streaming ``` ### Why are the changes needed? To provide a pure Python library that does not depend on JVM. See also [SPIP: Pure Python Package in PyPI (Spark Connect)](https://docs.google.com/document/d/1Pund40wGRuB72LX6L7cliMDVoXTPR-xx4IkPmMLaZXk/edit?usp=sharing). ### Does this PR introduce _any_ user-facing change? Yes, users can install pure Python library via `pip install pyspark-connect`. ### How was this patch tested? Manually tested the basic set of tests. ```bash ./sbin/start-connect-server.sh --jars `ls connector/connect/server/target/**/spark-connect*SNAPSHOT.jar` ``` ```bash cd python python packaging/connect/setup.py sdist cd dist conda create -y -n clean-py-3.11 python=3.11 conda activate clean-py-3.11 pip install pyspark-connect-4.0.0.dev0.tar.gz python ``` ```python >>> import pyspark >>> from pyspark.sql import SparkSession >>> spark = SparkSession.builder.remote("sc://localhost").getOrCreate() >>> spark.range(10).show() ``` ``` +---+ | id| +---+ | 0| | 1| | 2| | 3| | 4| | 5| | 6| | 7| | 8| | 9| +---+ ``` They will be separated added, and set as a scheduled job in CI. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#45053 from HyukjinKwon/refactoring-core. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

… test to ensure it ### What changes were proposed in this pull request? The pr aims to fix bug about apache#45857 ### Why are the changes needed? In fact, `LogKey.values.toSeq.sorted` did not sort alphabetically as expected. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. - Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#45864 from panbingkun/fix_sort_logkey. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>

…remote pure Python library ### What changes were proposed in this pull request? This PR proposes to add an environment variable called `SPARK_CONNECT_TESTING_REMOTE` to set `remote` URL. ### Why are the changes needed? In order to test pure Python library with a remote server. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#45868 from HyukjinKwon/SPARK-47724. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…f pandas is not available ### What changes were proposed in this pull request? This is a follow-up of the followings to skip `pandas`-related tests if pandas is not available. - apache#44852 - apache#45232 ### Why are the changes needed? `pandas` is an optional dependency. We had better skip it without causing failures. To recover the PyPy 3.8 CI, - https://github.com/apache/spark/actions/runs/8541011879/job/23421483071 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually test. ``` $ python/run-tests --modules=pyspark-resource --parallelism=1 --python-executables=python3.10 Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log Will test against the following Python executables: ['python3.10'] Will test the following Python modules: ['pyspark-resource'] python3.10 python_implementation is CPython python3.10 version is: Python 3.10.13 Starting test(python3.10): pyspark.resource.profile (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/021bc7bb-242f-4cb4-8584-11ed6e711f78/python3.10__pyspark.resource.profile__jn89f1hh.log) Finished test(python3.10): pyspark.resource.profile (1s) Starting test(python3.10): pyspark.resource.tests.test_connect_resources (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/244d6c6f-8799-4a2a-b7a7-20d7c50d643d/python3.10__pyspark.resource.tests.test_connect_resources__5ta1tf6e.log) Finished test(python3.10): pyspark.resource.tests.test_connect_resources (0s) ... 1 tests were skipped Starting test(python3.10): pyspark.resource.tests.test_resources (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/671e7afa-e764-443f-bc40-7e940d7342ea/python3.10__pyspark.resource.tests.test_resources__lhbp6y5f.log) Finished test(python3.10): pyspark.resource.tests.test_resources (2s) ... 1 tests were skipped Tests passed in 4 seconds Skipped tests in pyspark.resource.tests.test_connect_resources with python3.10: test_profile_before_sc_for_connect (pyspark.resource.tests.test_connect_resources.ResourceProfileTests) ... skip (0.005s) Skipped tests in pyspark.resource.tests.test_resources with python3.10: test_profile_before_sc_for_sql (pyspark.resource.tests.test_resources.ResourceProfileTests) ... skip (0.001s) ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#45869 from dongjoon-hyun/SPARK-46812. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

…red logging framework ### What changes were proposed in this pull request? Migrate logError with variables of core module to structured logging framework. This is part1 which transforms the logError entries of the following API ``` def logError(msg: => String): Unit ``` to ``` def logError(entry: LogEntry): Unit ``` ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? Yes, Spark core logs will contain additional MDC ### How was this patch tested? Compiler and scala style checks, as well as code review. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#45834 from gengliangwang/coreError. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org>

### What changes were proposed in this pull request? This patch adss a new mechanism to push query execution progress for batch queries. We add a new response message type and periodically push query progress to the client. The client can consume this data to for example display a progress bar. This patch adds support for displaying a progress bar in the PySpark shell when started with Spark Connect. The proto message is defined as follows: ``` // This message is used to communicate progress about the query progress during the execution. // This message is used to communicate progress about the query progress during the execution. message ExecutionProgress { // Captures the progress of each individual stage. repeated StageInfo stages = 1; // Captures the currently in progress tasks. int64 num_inflight_tasks = 2; message StageInfo { int64 stage_id = 1; int64 num_tasks = 2; int64 num_completed_tasks = 3; int64 input_bytes_read = 4; bool done = 5; } } ``` Clients can simply ignore the messages or consume them. On top of that this adds additional capabilities to register a callback for progress tracking to the SparkSession. ``` handler = lambda **kwargs: print(kwargs) spark.register_progress_handler(handler) spark.range(100).collect() spark.remove_progress_handler(handler) ``` #### Example 1 ![progress_medium_query_multi_stage mp4](https://github.com/apache/spark/assets/3421/5eff1ec4-def2-4d39-8a75-13a6af784c99) #### Example 2 ![progress_bar mp4](https://github.com/apache/spark/assets/3421/20638511-2da4-4bd6-83f2-da3b9f500bde) ### Why are the changes needed? Usability and Experience ### Does this PR introduce _any_ user-facing change? When the user opens the PySpark shell with Spark Connect mode, it will use the progress bar by default. ### How was this patch tested? Added new tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#45150 from grundprinzip/SPARK-47081. Authored-by: Martin Grund <martin.grund@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

### What changes were proposed in this pull request? PySpark worker processes may die while they are idling. Here we aim to provide some resilience, by validating process and selectionkey aliveness prior to returning the process from idle pool. ### Why are the changes needed? To not fail queries when a python process crashed while idling. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added appropriate testcase. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#45635 from sebastianhillig-db/python-worker-factory-crash. Authored-by: Ubuntu <sebastian.hillig@ip-10-110-16-229.us-west-2.compute.internal> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

… PySpark This reverts commit d87ac8e. Turns out the sparkconnect logger is disabled by default. This make us loose the ability to log the error message if required (by explicitly turning on the logger). ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? Closes apache#45878 from nemanja-boric-databricks/revert-logger. Authored-by: Nemanja Boric <nemanja.boric@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…ILED error ### What changes were proposed in this pull request? This is a follow-up of apache#45797 . Instead of detecting query execution errors and not wrapping them, it's better to do the error wrapping only in the data writer, which has more context. We can provide the specific file path when the error happened, instead of the destination directory name. ### Why are the changes needed? better error message ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? updated tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#45844 from cloud-fan/write-error. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…connect testcases ### What changes were proposed in this pull request? This PR proposes to get the proper default port for `pyspark-connect` testcases. ### Why are the changes needed? `pyspark-connect` cannot access to the JVM, so cannot get the randomized port assigned from JVM. ### Does this PR introduce _any_ user-facing change? No, `pyspark-connect` is not published yet, and this is a test-only change. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#45875 from HyukjinKwon/SPARK-47729. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

### What changes were proposed in this pull request? This is to document G1 Concurrent GC metrics introduced with https://issues.apache.org/jira/browse/SPARK-44162 ### Why are the changes needed? To improve the documentation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#45874 from LucaCanali/documentG1GCCurrentMetrics. Authored-by: Luca Canali <luca.canali@cern.ch> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

…sible=true for spark-connect-scala-client ### What changes were proposed in this pull request? Add `-Dio.netty.tryReflectionSetAccessible=true` for `spark-connect-scala-client` ### Why are the changes needed? The previous change missed spark-connect-scala-client, may be due to bad IDEA index. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#45860 from pan3793/SPARK-47610-followup. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

…red logging framework ### What changes were proposed in this pull request? The pr aims to migrate `logError` in module `MLLib` with variables to `structured logging framework`. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#45837 from panbingkun/SPARK-47598. Authored-by: panbingkun <panbingkun@baidu.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>

…pyarrow to skip tests ### What changes were proposed in this pull request? This is a follow-up of SPARK-46812 to skip the tests more robustly and to recover PyPy CIs. - https://github.com/apache/spark/actions/runs/8556900899/job/23447948557 ### Why are the changes needed? - `should_test_connect` covers more edge cases than `have_pandas`. - `test_resources.py` has Arrow usage too. https://github.com/apache/spark/blob/25fc67fa114d2c34099c3ab50396870f543c338b/python/pyspark/resource/tests/test_resources.py#L85 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tests with `pandas` and without `pyarrow`. ``` $ pip3 freeze | grep pyarrow $ pip3 freeze | grep pandas pandas==2.2.1 pandas-stubs==1.2.0.53 $ python/run-tests --modules=pyspark-resource --parallelism=1 --python-executables=python3.10 Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark-merge/python/unit-tests.log Will test against the following Python executables: ['python3.10'] Will test the following Python modules: ['pyspark-resource'] python3.10 python_implementation is CPython python3.10 version is: Python 3.10.13 Starting test(python3.10): pyspark.resource.profile (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/db9cb886-2698-49d9-a663-9b8bea79caba/python3.10__pyspark.resource.profile__8mg46xru.log) Finished test(python3.10): pyspark.resource.profile (1s) Starting test(python3.10): pyspark.resource.tests.test_connect_resources (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/53f979bd-1073-41e6-99ba-8e787edc415b/python3.10__pyspark.resource.tests.test_connect_resources__hrgrs5sk.log) Finished test(python3.10): pyspark.resource.tests.test_connect_resources (0s) ... 1 tests were skipped Starting test(python3.10): pyspark.resource.tests.test_resources (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/2b06c671-0199-4827-a0e5-f852a28313fd/python3.10__pyspark.resource.tests.test_resources__jis6mk9a.log) Finished test(python3.10): pyspark.resource.tests.test_resources (2s) ... 1 tests were skipped Tests passed in 4 seconds Skipped tests in pyspark.resource.tests.test_connect_resources with python3.10: test_profile_before_sc_for_connect (pyspark.resource.tests.test_connect_resources.ResourceProfileTests) ... skip (0.002s) Skipped tests in pyspark.resource.tests.test_resources with python3.10: test_profile_before_sc_for_sql (pyspark.resource.tests.test_resources.ResourceProfileTests) ... skip (0.001s) ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#45880 from dongjoon-hyun/SPARK-46812-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

…lly enumeration field in `LogEntry` automatically ### What changes were proposed in this pull request? The pr aims to `introduce` a `tool` that can `sort alphabetically` enumeration field in `LogEntry` automatically. ### Why are the changes needed? Enable developers to more conveniently write the enumeration values in `LogEntry` in alphabetical order according to the requirements of structured log development documents. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Manually test. ``` SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "common-utils/testOnly *LogKeySuite -- -t \"LogKey enumeration fields are correctly sorted\"" ``` - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#45867 from panbingkun/SPARK-47723. Lead-authored-by: panbingkun <panbingkun@baidu.com> Co-authored-by: panbingkun <pbk1982@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>

…ckage ### What changes were proposed in this pull request? This PR is a followup of apache#45150 that adds the new `shell` module into PyPI package. ### Why are the changes needed? So PyPI package contains `shell` module. ### Does this PR introduce _any_ user-facing change? No, the main change has not been released yet. ### How was this patch tested? The test case will be added at apache#45870. It was found out during working on that PR. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#45882 from HyukjinKwon/SPARK-47081-followup. Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

### What changes were proposed in this pull request? In the `Window` node, both `partitionSpec` and `orderSpec` must be orderable, but the current type check only verifies `orderSpec` is orderable. This can cause an error in later optimizing phases. Given a query: ``` with t as (select id, map(id, id) as m from range(0, 10)) select rank() over (partition by m order by id) from t ``` Before the PR, it fails with an `INTERNAL_ERROR`: ``` org.apache.spark.SparkException: [INTERNAL_ERROR] grouping/join/window partition keys cannot be map type. SQLSTATE: XX000 at org.apache.spark.SparkException$.internalError(SparkException.scala:92) at org.apache.spark.SparkException$.internalError(SparkException.scala:96) at org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers$.needNormalize(NormalizeFloatingNumbers.scala:103) at org.apache.spark.sql.catalyst.optimizer.NormalizeFloatingNumbers$.org$apache$spark$sql$catalyst$optimizer$NormalizeFloatingNumbers$$needNormalize(NormalizeFloatingNumbers.scala:94) ... ``` After the PR, it fails with a `EXPRESSION_TYPE_IS_NOT_ORDERABLE`, which is expected: ``` org.apache.spark.sql.catalyst.ExtendedAnalysisException: [EXPRESSION_TYPE_IS_NOT_ORDERABLE] Column expression "m" cannot be sorted because its type "MAP<BIGINT, BIGINT>" is not orderable. SQLSTATE: 42822; line 2 pos 53; Project [RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4] +- Project [id#1L, m#0, RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4, RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4] +- Window [rank(id#1L) windowspecdefinition(m#0, id#1L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS RANK() OVER (PARTITION BY m ORDER BY id ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#4], [m#0], [id#1L ASC NULLS FIRST] +- Project [id#1L, m#0] +- SubqueryAlias t +- SubqueryAlias t +- Project [id#1L, map(id#1L, id#1L) AS m#0] +- Range (0, 10, step=1, splits=None) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52) ... ``` ### How was this patch tested? Unit test. Closes apache#45730 from chenhao-db/SPARK-47572. Authored-by: Chenhao Li <chenhao.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

… throw internal error ### What changes were proposed in this pull request? This PR fixes the error messages and classes when Python UDFs are used in higher order functions. ### Why are the changes needed? To show the proper user-facing exceptions with error classes. ### Does this PR introduce _any_ user-facing change? Yes, previously it threw internal error such as: ```python from pyspark.sql.functions import transform, udf, col, array spark.range(1).select(transform(array("id"), lambda x: udf(lambda y: y)(x))).collect() ``` Before: ``` py4j.protocol.Py4JJavaError: An error occurred while calling o74.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 0.0 failed 1 times, most recent failure: Lost task 15.0 in stage 0.0 (TID 15) (ip-192-168-123-103.ap-northeast-2.compute.internal executor driver): org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: <lambda>(lambda x_0#3L)#2 SQLSTATE: XX000 at org.apache.spark.SparkException$.internalError(SparkException.scala:92) at org.apache.spark.SparkException$.internalError(SparkException.scala:96) ``` After: ``` pyspark.errors.exceptions.captured.AnalysisException: [INVALID_LAMBDA_FUNCTION_CALL.UNEVALUABLE] Invalid lambda function call. Python UDFs should be used in a lambda function at a higher order function. However, "<lambda>(lambda x_0#3L)" was a Python UDF. SQLSTATE: 42K0D; Project [transform(array(id#0L), lambdafunction(<lambda>(lambda x_0#3L)#2, lambda x_0#3L, false)) AS transform(array(id), lambdafunction(<lambda>(lambda x_0#3L), namedlambdavariable()))#4] +- Range (0, 1, step=1, splits=Some(16)) ``` ### How was this patch tested? Unittest was added ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#47079 from HyukjinKwon/SPARK-48706. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Kent Yao <yao@apache.org>

### What changes were proposed in this pull request? Address the open follow-ups from [SPARK-56681](https://issues.apache.org/jira/browse/SPARK-56681) (umbrella for PATH / SPARK-56605 cleanup) in a single cleanup PR. Items #1 and #2 were already wired by SPARK-56639; this PR covers the remainder. | # | Item | Resolution | |---|---|---| | #1 | `FunctionResolution.resolveProcedure` was dead code | Already wired by SPARK-56639 (no action). | | #2 | Frozen view / SQL-function PATH wiring unfinished | Already done by SPARK-56639 (no action). | | #3 | `AnalysisContext.resolutionPathEntries` threadlocal | Audit only: confirmed `withNewAnalysisContext` / `reset()` correctly clear it. Full removal needs a coordinated refactor to plumb the path through `RelationResolution` / `FunctionResolution` method calls; flagged as a follow-up. | | #4 | `Analyzer.executeAndCheck` clobbers outer `SQLConf.withExistingConf` | Extracted `runWithSessionConf` helper, added `SQLConf.getExistingConfIfSet`. `executeAndCheck` and `executeSameContext` now share one path that yields to any outer scope. | | #5 | `VariableResolution.allowUnqualifiedSessionTempVariableLookup` force-loads default catalog | Replaced the hot-path catalog read with `CatalogManager.isSystemSessionOnPath`, which inspects stored session-path entries directly. No catalog load on column resolution. | | #6 | `DROP VARIABLE` PATH gate asymmetric with `DECLARE` / `CREATE` | Removed the gate. DDL on session variables (`DECLARE` / `CREATE` / `DROP`) always targets `system.session` directly; only DML (`SET VAR`, `SELECT x`) goes through PATH. | | #7 | `lookupFunctionType` exception swallow too broad | Narrowed from `NonFatal` to the explicit not-found list (`NoSuchFunctionException`, `NoSuchNamespaceException`, `CatalogNotFoundException`, `FORBIDDEN_OPERATION`). Other exceptions propagate. | | #8 | `lookupFunctionType` fan-out had wasteful `system.*` candidates | Filtered them out — `system.session`, `system.builtin`, `system.ai` are already resolved earlier in the same method. | | #9 | Three near-duplicate path-resolution helpers | Lifted into `CatalogManager.resolutionPathEntriesForAnalysis(pinnedEntries, viewCatalogAndNamespace)`. Relation, routine, and procedure resolution all route through it. | | #10 | Tests for the new error paths and gates | Added a DECLARE / SET VAR / DROP cycle test under non-default PATH and a struct-variable field-vs-qualified ambiguity test in `sql-session-variables.sql`. | | #11 | `ProtoToParsedPlanTestSuite.analyzerIsolationConf` was a bare `SQLConf` | Clone `spark.sessionState.conf` and only override `PATH_ENABLED=false`, so all `sparkConf` overrides (ANSI, alias config, ...) propagate automatically. | | Bonus | `ResolveSetVariable` hardcoded `SYSTEM.SESSION` regardless of actual PATH | `unresolvedVariableError` now takes `Seq[Seq[String]]` path entries with **required** `Origin` (no overloads). DML lookup failures (`SET VAR`, `FETCH ... INTO`) report the full SQL path as a bracketed list, byte-for-byte consistent with `UNRESOLVED_ROUTINE` and `TABLE_OR_VIEW_NOT_FOUND`. DDL name validation in `ResolveCatalogs` continues to report `[system.session]` since PATH does not apply there. Origin is plumbed through `VariableManager.set` so all error sites carry a `queryContext` pointing at the offending variable identifier (parser opt-ins via `withOrigin(identifierReference)` so the highlight is the variable name, not the whole statement). | ### Why are the changes needed? These are the cleanup items called out on SPARK-56681 from the post-merge source review of SPARK-56605. They eliminate dead code paths, plug user-visible bugs (force-loading a misconfigured default catalog on column resolution; clobbering pinned session configs; swallowing real catalog errors as `UNRESOLVED_ROUTINE`), remove the asymmetry between DDL and DML on session variables, and make `UNRESOLVED_VARIABLE` self-consistent with the other "not found" errors. ### Does this PR introduce _any_ user-facing change? Yes. - **`UNRESOLVED_VARIABLE.searchPath`** is now rendered as a bracketed list. For DML lookups (`SET VAR`, `FETCH ... INTO`), the list reflects the actual SQL PATH that was consulted instead of a hardcoded `SYSTEM.SESSION`. For DDL name validation (`DECLARE` / `DROP` with a non-session namespace), the list is `[`` `system`.`session` ``]` since PATH does not apply. - **`UNRESOLVED_VARIABLE`** now always carries a `queryContext` that highlights just the offending variable identifier (e.g. `"builtin.var1"`, `"ses.var1"`), not the whole `DECLARE` / `SET VAR` statement. - **`DROP TEMPORARY VARIABLE`** no longer raises `UNRESOLVED_VARIABLE` when the SQL PATH does not contain `system.session`. DDL on session variables ignores PATH, matching the existing behaviour of `DECLARE OR REPLACE VARIABLE`. - **`lookupFunctionType`** no longer swallows non–`NotFound` errors. A catalog reporting `PERMISSION_DENIED` (or similar) for a function lookup now propagates instead of silently producing `UNRESOLVED_ROUTINE`. ### How was this patch tested? - Added `sql-session-variables.sql` regression test for the struct-variable field-vs-qualified ambiguity (`DECLARE VARIABLE session STRUCT<a INT>` → `SELECT session.a` succeeds → `DROP` → `SELECT session.a` falls through to `UNRESOLVED_COLUMN`). - Updated `SetPathSuite`: DECLARE / SET VAR / DROP cycle under a non-default PATH; bonus test asserts the actual rendered search path and the variable-identifier `queryContext`. - Updated `SqlScriptingExecutionSuite` for the new bracketed `searchPath` and identifier-pinned `queryContext`. - Regenerated `sql-session-variables.sql.out` for the new error shape. - Added `resolutionPathEntriesForAnalysis` stubs to mocked `CatalogManager` instances in `PlanResolutionSuite`, `AlignAssignmentsSuiteBase`, and `TableLookupCacheSuite`. - Ran focused suites locally; all pass: - `build/sbt 'sql/testOnly *SetPathSuite *SqlScriptingExecutionSuite *ExecuteImmediateEndToEndSuite'` - `build/sbt 'sql/testOnly *SimpleSQLViewSuite *SQLFunctionSuite'` - `build/sbt 'sql/testOnly *PlanResolutionSuite *UpdateTableAlignAssignmentsSuite *MergeIntoTableAlignAssignmentsSuite'` - `build/sbt 'catalyst/testOnly *TableLookupCacheSuite *AnalysisSuite *AnalysisErrorSuite *LookupFunctionsSuite'` - `build/sbt 'sql/testOnly *FunctionQualificationSuite *RelationQualificationSuite *DataSourceV2FunctionSuite'` - `build/sbt 'sql/testOnly *SQLQuerySuite'` - `build/sbt 'connect/testOnly *ProtoToParsedPlanTestSuite'` - `build/sbt 'sql/testOnly *SQLQueryTestSuite -- -z sql-session-variables.sql'` - Full `org.apache.spark.sql.catalyst.analysis.*`, `org.apache.spark.sql.catalyst.parser.*`, and `org.apache.spark.sql.analysis.resolver.*` suites. - `scalastyle` and `scalafmt` clean across catalyst, sql, and connect modules. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor Claude Opus 4.7 Closes apache#55647 from srielau/SPARK-56681-patch-clean-up. Authored-by: Serge Rielau <serge@rielau.com> Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>

github-actions Bot added SQL STRUCTURED STREAMING labels Apr 1, 2024

ericm-db force-pushed the list-state-ttl branch from 073572e to 727dbee Compare April 2, 2024 21:21

ericm-db force-pushed the map-state-ttl branch from 303c57d to 1e156f0 Compare April 2, 2024 23:36

ericm-db force-pushed the list-state-ttl branch from 4bb6eb9 to 4e29a4a Compare April 3, 2024 18:02

ericm-db force-pushed the map-state-ttl branch from 1e156f0 to b7f65ac Compare April 3, 2024 20:17

dtenedor and others added 22 commits April 3, 2024 14:38

ericm-db force-pushed the list-state-ttl branch from 88bb413 to 3a83265 Compare April 5, 2024 00:42

ericm-db added 10 commits April 8, 2024 13:32

using NextIterator instead

998764c

refactor feedback

1dcb7d8

undoing unnecessary change

47867e7

refactor get_ttl_value

cfd30c3

refactor test case

4a19cb7

specific doc for clearIfExpired

993125c

moving isExpired to common place

fd5200f

refactoring to use common utils

d43ffb1

updating interface header

30f6094

Map State TTL, Initial Commit

e9376d9

ericm-db force-pushed the map-state-ttl branch from 3674d27 to e9376d9 Compare April 9, 2024 23:45

github-actions Bot added BUILD CORE DSTREAM WEB UI DOCS ML PYTHON PANDAS API ON SPARK YARN CONNECT PROTOBUF EXAMPLES AVRO MLLIB R labels Apr 10, 2024

ericm-db closed this Apr 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Map State init, test cases pass#4

Map State init, test cases pass#4
ericm-db wants to merge 74 commits into
list-state-ttlfrom
map-state-ttl

ericm-db commented Apr 1, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

ericm-db commented Apr 1, 2024

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants