[SPARK-57021][CONNECT][PYTHON] Add SQLContext wrapper for Spark Connect by dbtsai · Pull Request #55574 · apache/spark

dbtsai · 2026-04-27T22:05:27Z

What changes were proposed in this pull request?

This PR adds a Spark Connect-compatible SQLContext (and HiveContext) implementation in
pyspark.sql.connect.context so that legacy code using SQLContext continues to work
transparently when running against a Connect server.

Key changes:

- New pyspark.sql.connect.context.SQLContext: wraps a Connect SparkSession directly
  (no SparkContext required). Delegates all supported operations to the session:
  sql, table, range, createDataFrame, conf, udf, udtf, read,
  readStream, streams, and catalog operations (cacheTable, uncacheTable,
  clearCache, tables, tableNames, registerDataFrameAsTable, dropTempTable,
  createExternalTable).
  - newSession() uses cloneSession() (the Connect equivalent of SparkSession.newSession()).
  - JVM-only APIs (registerJavaFunction, HiveContext.__init__) raise PySparkNotImplementedError.
- Connect dispatch in classic SQLContext.getOrCreate(): when running in remote-only mode
  (is_remote_only()), the classic getOrCreate() now automatically returns a
  Connect SQLContext wrapping the active Connect session, so callers do not need to
  import from pyspark.sql.connect directly.
- Shared test mixin: SQLContextTestsMixin extracted to test_sql_context.py so the same
  suite runs against both the classic and Connect implementations via SQLContextParityTests.
- API reference docs: new python/docs/source/reference/pyspark.sql/legacy.rst page
  listing SQLContext and HiveContext in the public API reference.
- CI registration: test_connect_context registered in modules.py.

Why are the changes needed?

SQLContext is deprecated since Spark 2.0 in favor of SparkSession, but many existing
PySpark applications still instantiate it directly. Without this wrapper, those applications
fail entirely on Spark Connect because the classic SQLContext.__init__ requires a live
SparkContext (JVM), which is not available in Connect mode. This patch closes that
compatibility gap.

Does this PR introduce any user-facing change?

Yes. Previously, calling SQLContext(spark) or SQLContext.getOrCreate() in a Spark Connect
environment raised an error because the classic implementation requires a SparkContext.
After this PR, both calls succeed and return a fully functional (but still deprecated)
SQLContext backed by the active Connect session.

JVM-specific methods (registerJavaFunction, HiveContext) now raise a clear
PySparkNotImplementedError instead of a cryptic JVM/attribute error.

How was this patch tested?

- Added SQLContextTestsMixin in python/pyspark/sql/tests/test_sql_context.py covering:
  setConf/getConf, createDataFrame, sql, table, tables/tableNames,
  cacheTable/uncacheTable/clearCache, registerDataFrameAsTable/dropTempTable,
  range, read, readStream, streams, udf/udtf, newSession, registerFunction.
- SQLContextConnectTests in python/pyspark/sql/tests/connect/test_connect_context.py
  adds Connect-specific cases: deprecation warning on __init__, getOrCreate() in
  remote-only mode returns a Connect-backed context and emits a deprecation warning,
  HiveContext.getOrCreate() in remote-only mode raises PySparkNotImplementedError,
  registerJavaFunction raises PySparkNotImplementedError, and HiveContext.__init__
  raises PySparkNotImplementedError. The is_remote_only() code path is tested by
  patching pyspark.util.is_remote_only.
- Registered in dev/sparktestsupport/modules.py so the Connect test is picked up by CI.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (claude-sonnet-4-6), via Anthropic Claude Code

HyukjinKwon · 2026-04-27T23:12:06Z

+        :class:`DataFrame`
+        """
+        listed = self.sparkSession.catalog.listTables(dbName)
+        rows = [


Why do we need this?

Replaced with spark.sql("SHOW TABLES [IN dbName]")

HyukjinKwon · 2026-04-27T23:12:39Z

+from pyspark.testing.connectutils import ReusedConnectTestCase
+
+
+class SQLContextConnectTests(ReusedConnectTestCase):


We should share the tests with Spark Classic

SQLContextTestsMixin extracted into test_context.py; Connect test is now SQLContextParityTests(SQLContextTestsMixin, ReusedConnectTestCase)

HyukjinKwon · 2026-04-27T23:13:29Z

+__all__ = ["SQLContext", "HiveContext"]
+
+
+class SQLContext:


We should do something in SQLContext.getOrCreate. Otherwise, users would have to manually import pyspark.sql.connect to use Spark Connect.

Classic getOrCreate now detects is_remote_only() and dispatches to ConnectSQLContext._get_or_create_from_session(); sc made optional

can HiveContext.getOrCreate(spark) work? it might be able to bypass the not supported HiveContext?

Good catch — confirmed this was a real bypass. HiveContext.getOrCreate(spark) dispatches to SQLContext.getOrCreate(cls=HiveContext, ...), which calls _from_session → object.__new__(HiveContext), creating an instance without ever invoking __init__, so the PySparkNotImplementedError was silently skipped.

Fixed in c967950 by overriding _from_session in HiveContext to unconditionally raise. Since getOrCreate, _get_or_create_from_session, and newSession all route through _from_session, this closes all bypass paths. Added test_hive_context_get_or_create_raises to cover this case.
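
The bypass and its fix can be reproduced without Spark. The following is a minimal sketch using stand-in classes: the method names mirror the thread, but `NotSupportedError`, the two `*HiveContext` class names, and the simplified method bodies are assumptions for illustration, not the PR's code.

```python
class NotSupportedError(Exception):
    """Hypothetical stand-in for PySparkNotImplementedError."""


class SQLContext:
    def __init__(self, session):
        self.sparkSession = session

    @classmethod
    def _from_session(cls, session):
        # object.__new__ allocates the instance without running __init__,
        # so a guard placed only in a subclass __init__ is skipped.
        ctx = object.__new__(cls)
        ctx.sparkSession = session
        return ctx

    @classmethod
    def getOrCreate(cls, session):
        return cls._from_session(session)


class UnguardedHiveContext(SQLContext):
    def __init__(self, session):
        raise NotSupportedError("HiveContext is not supported")


class GuardedHiveContext(UnguardedHiveContext):
    @classmethod
    def _from_session(cls, session):
        # Construction paths that go through _from_session now raise too.
        raise NotSupportedError("HiveContext is not supported")


# A guard in __init__ alone is bypassed by getOrCreate:
leaked = UnguardedHiveContext.getOrCreate("session")
bypassed = isinstance(leaked, UnguardedHiveContext)

# With _from_session overridden, getOrCreate raises as intended:
try:
    GuardedHiveContext.getOrCreate("session")
    guarded = False
except NotSupportedError:
    guarded = True
```

The sketch only covers paths that reach `_from_session`; a dispatch that never calls it (as raised later in Finding #2) is not closed by this override.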

Implements a Connect-compatible SQLContext in pyspark.sql.connect.context that wraps a Connect SparkSession instead of requiring a SparkContext. All SparkSession-delegate methods (sql, table, range, createDataFrame, conf, udf, udtf, read, readStream, streams, catalog ops) are wired up. JVM-only APIs (registerJavaFunction, HiveContext) raise PySparkNotImplementedError. Adds 20 unit tests. Co-authored-by: Isaac

- Replace manual Row construction in tables() with SHOW TABLES SQL
- Fix all versionadded annotations to 4.0.0 (Connect-era version)
- Add _get_or_create_from_session() and dispatch in classic SQLContext.getOrCreate() so users need not import from pyspark.sql.connect directly (sc arg made optional)
- Extract SQLContextTestsMixin into test_context.py; Connect test now inherits the shared suite via SQLContextParityTests

Co-authored-by: Isaac

- Use base pyspark.sql.dataframe.DataFrame as return type in connect/context.py since SparkSession.sql/range/table/createDataFrame are annotated to return the parent class, not the Connect subclass
- Remove now-unnecessary # type: ignore[call-overload] on createDataFrame
- Add # type: ignore[return-value, arg-type] to the classic SQLContext.getOrCreate Connect dispatch path where the two SQLContext classes are structurally equivalent but not in the same hierarchy

Co-authored-by: Isaac

…atting Co-authored-by: Isaac

…ints Co-authored-by: Isaac

SparkSession.newSession() is JVM-only and not supported in Spark Connect. Use cloneSession() which is the Connect equivalent. Co-authored-by: Isaac

Yicong-Huang

I think there are some changes needed, please see inline comments.

Yicong-Huang · 2026-05-23T06:47:27Z

+__all__ = ["SQLContext", "HiveContext"]
+
+
+class SQLContext:


can HiveContext.getOrCreate(spark) work? it might be able to bypass the not supported HiveContext?

…x setUp/tearDown
- tables() in Connect SQLContext now uses catalog.listTables() to always return consistent (namespace, tableName, isTemporary) columns instead of SHOW TABLES whose column names vary across catalogs
- Replace type: ignore[return-value/arg-type] in classic getOrCreate() with explicit cast() calls for clarity
- Override setUp/tearDown in SQLContextParityTests to reset Connect SQLContext._instantiatedContext between tests, not just the classic one

HiveContext.__init__ raised PySparkNotImplementedError, but HiveContext.getOrCreate(spark) bypassed it because getOrCreate routes through _from_session which uses object.__new__(cls), skipping __init__. Override _from_session in HiveContext to raise unconditionally, closing all bypass paths (getOrCreate, _get_or_create_from_session, newSession). Add test_hive_context_get_or_create_raises to cover the bypass.

dbtsai · 2026-05-26T21:22:54Z

cc @HyukjinKwon @Yicong-Huang @zhengruifeng @haoyangeng-db

zhengruifeng · 2026-05-27T06:43:01Z

        reload(window)


+class SQLContextTestsMixin:


This is just a mixin, not a real test, so it never runs on the classic side.
I think we can reorganize the tests this way to follow the existing Connect parity test convention:

1. python/pyspark/sql/tests/test_sql_context.py

class SQLContextTestsMixin

class SQLContextTests(SQLContextTestsMixin, ReusedSQLTestCase)

2. python/pyspark/sql/tests/connect/test_parity_sql_context.py

class SQLContextParityTests(SQLContextTestsMixin, ReusedConnectTestCase)

Fixed: created python/pyspark/sql/tests/test_sql_context.py with SQLContextTestsMixin + SQLContextTests(SQLContextTestsMixin, ReusedSQLTestCase) so the mixin now runs against classic Spark too. Added python/pyspark/sql/tests/connect/test_parity_sql_context.py following the existing Connect parity convention (SQLContextParityTests(SQLContextTestsMixin, ReusedConnectTestCase)). test_connect_context.py is now Connect-specific tests only.
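
The mixin layout described here can be sketched with plain `unittest`. This is an illustrative sketch only: `FakeClassicBase` and `FakeConnectBase` are hypothetical stand-ins for `ReusedSQLTestCase` and `ReusedConnectTestCase`, and the single test body is invented to show that each concrete class collects the shared tests.

```python
import unittest


class FakeClassicBase(unittest.TestCase):
    """Hypothetical stand-in for ReusedSQLTestCase."""

    backend = "classic"


class FakeConnectBase(unittest.TestCase):
    """Hypothetical stand-in for ReusedConnectTestCase."""

    backend = "connect"


class SQLContextTestsMixin:
    # Not a TestCase, so unittest does not collect it on its own.
    def test_backend_is_set(self):
        self.assertIn(self.backend, ("classic", "connect"))


class SQLContextTests(SQLContextTestsMixin, FakeClassicBase):
    """Runs the shared tests against the classic stand-in."""


class SQLContextParityTests(SQLContextTestsMixin, FakeConnectBase):
    """Runs the same shared tests against the Connect stand-in."""


def run(case):
    # Load and run one TestCase class, returning (tests run, all passed).
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result.testsRun, result.wasSuccessful()


classic_result = run(SQLContextTests)
parity_result = run(SQLContextParityTests)
```

Because the mixin does not inherit from `unittest.TestCase`, the shared tests only run through the two concrete subclasses, which is the point of the comment above.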

cloud-fan

Summary

Prior state and problem. SQLContext has been deprecated since Spark 2.0 in favor of SparkSession, but legacy code still constructs it directly via SQLContext(sc) or SQLContext.getOrCreate(). Both classic paths hard-require a JVM SparkContext (__init__ dereferences sparkContext._jsc, _jvm, and sparkSession._jsparkSession.sqlContext()). In a Spark Connect remote-only install (pyspark-client) there is no SparkContext, so any of these calls crashes. The Connect SparkSession itself goes further — __getattr__ raises JVM_ATTRIBUTE_NOT_SUPPORTED for _jsc, _jconf, _jvm, _jsparkSession, sparkContext, and newSession — so the classic class can't be made to work by handing it a Connect session.

Design approach.

A new pyspark.sql.connect.context.SQLContext that wraps a Connect SparkSession and pure-delegates almost every method; HiveContext is a sentinel that raises everywhere.
Classic SQLContext.getOrCreate(sc=None) checks is_remote_only() and dispatches to ConnectSQLContext._get_or_create_from_session(active_session). sc becomes optional; the classic branch keeps an assert.

Key design decisions made by this PR.

Dispatch keyed on is_remote_only(), not session type: only fires for pyspark-client installs; full installs still take the JVM path.
HiveContext raises on every Connect construction: closes the direct-Connect bypass (commit c967950). The classic-dispatch path still bypasses this — see Finding #2.
SQLContext.newSession() is implemented via cloneSession(): opposite state semantics from classic newSession() — see Finding #3.
tables() materializes catalog rows on the driver rather than using SHOW TABLES, for column-name stability across catalog versions (per the 2026-05-26 follow-up to @Yicong-Huang).

Notes on the existing review thread.

HiveContext bypass: c967950 closes the direct Connect path. The classic-dispatch path remains open — see Finding #2.
Test reorg (@zhengruifeng, 2026-05-27): still open and the right call — the SQLContextTestsMixin currently never runs on the classic side. The current file naming test_connect_context.py also doesn't match the parity-mixin convention (test_parity_*.py).

PR description. The opening sentence is broken — "...implementation in continues to work transparently...". Please fix the missing clause.

cloud-fan · 2026-05-27T08:05:04Z

+__all__ = ["SQLContext", "HiveContext"]
+
+
+class SQLContext:


Finding #1 — design / top-level question: what is the public pyspark.sql.connect.context.SQLContext for?

The classic-getOrCreate dispatch already covers the only transparent-compat case: legacy code that calls SQLContext.getOrCreate() in a remote-only install keeps working unchanged. Every other path requires the user to edit their imports — and once they're editing, migrating the call site to SparkSession is a strictly smaller, less-deprecated diff than switching to from pyspark.sql.connect.context import SQLContext (the deprecation recommendation since 2.0 already points there). I can't see a user journey where the public Connect-side class is the right answer over SparkSession.

Two cleaner shapes worth considering:

(a) Make it implementation-private — rename _SQLContext, or inline the wrapper into the is_remote_only() branch of classic getOrCreate. Drop the per-method entries in legacy.rst. The contract becomes: legacy getOrCreate keeps working; everyone else writes SparkSession.

(b) Keep it public but trim to a one-paragraph stub in legacy.rst that points to SparkSession, instead of advertising the full method surface.

If (a): the signature-divergence and getOrCreate/_get_or_create_from_session duplication concerns in Finding #4 also disappear, since there are no public callers — we keep _get_or_create_from_session and delete the public getOrCreate.

Fixed via option (a): removed the public getOrCreate classmethod from the Connect class entirely. The Connect SQLContext is now an internal implementation detail, accessible only through _get_or_create_from_session (invoked by the classic dispatch). Legacy code calling SQLContext.getOrCreate() via the classic import path continues to work unchanged; new code uses SparkSession.

cloud-fan · 2026-05-27T08:05:04Z

            FutureWarning,
        )
+        if is_remote_only():
+            from pyspark.sql.connect.context import SQLContext as ConnectSQLContext


Finding #2 — HiveContext bypass through classic dispatch.

This is_remote_only() branch hardcodes ConnectSQLContext regardless of cls. A remote-only user calling from pyspark.sql import HiveContext; HiveContext.getOrCreate() reaches here with cls=HiveContext, gets handed back a plain ConnectSQLContext, and never sees the PySparkNotImplementedError they should.

c967950 closed the direct-Connect bypass via _from_session, but the classic dispatch bypasses _from_session entirely because it routes to ConnectSQLContext, not ConnectHiveContext.

Two fixes:

(a) route based on cls — if cls is HiveContext: from pyspark.sql.connect.context import HiveContext as ConnectHiveContext; ...; or

(b) override HiveContext.getOrCreate in the classic file to raise in remote-only mode before delegating.

Fixed via your option (a): the classic SQLContext.getOrCreate now does connect_cls = getattr(_connect_context, cls.__name__, _connect_context.SQLContext) before calling _get_or_create_from_session. When cls is HiveContext, this routes to ConnectHiveContext._get_or_create_from_session, which calls ConnectHiveContext._from_session — and that raises PySparkNotImplementedError.
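
The name-based routing described in this reply can be sketched as follows. This is a hedged illustration, not the PR's code: `_connect_context` is modelled with a `SimpleNamespace` instead of the real module, and `NotSupportedError`, `CustomContext`, and the return values are hypothetical.

```python
from types import SimpleNamespace


class NotSupportedError(Exception):
    """Hypothetical stand-in for PySparkNotImplementedError."""


class ConnectSQLContext:
    @classmethod
    def _get_or_create_from_session(cls, session):
        return ("connect-sqlcontext", session)


class ConnectHiveContext:
    @classmethod
    def _get_or_create_from_session(cls, session):
        raise NotSupportedError("HiveContext is not supported in Spark Connect")


# Stand-in for the Connect context module: the Connect classes are exposed
# under the same names as their classic counterparts.
_connect_context = SimpleNamespace(
    SQLContext=ConnectSQLContext, HiveContext=ConnectHiveContext
)


class SQLContext:
    @classmethod
    def getOrCreate(cls, session="active-session"):
        # Look up the Connect class named like cls; unknown subclasses fall
        # back to the Connect SQLContext.
        connect_cls = getattr(
            _connect_context, cls.__name__, _connect_context.SQLContext
        )
        return connect_cls._get_or_create_from_session(session)


class HiveContext(SQLContext):
    pass


class CustomContext(SQLContext):
    pass


plain = SQLContext.getOrCreate()
fallback = CustomContext.getOrCreate()
try:
    HiveContext.getOrCreate()
    hive_raised = False
except NotSupportedError:
    hive_raised = True
```

One consequence visible in the sketch: a user-defined subclass with no Connect counterpart falls back to the plain Connect context rather than raising.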

cloud-fan · 2026-05-27T08:05:05Z

+
+        .. versionadded:: 4.0.0
+        """
+        return self._from_session(self.sparkSession.cloneSession())


Finding #3 — cloneSession() is not equivalent to newSession().

Classic SparkSession.newSession() returns a session with separate SQLConf, temp views, and UDFs and a shared cache — i.e. a fresh-empty session (pyspark/sql/session.py:717-735).

Connect cloneSession() does the opposite: it copies the current session's conf, temp views, registered functions, and catalog state into an independent server session (pyspark/sql/connect/session.py:1282-1314).

The PR description and the docstring at line 124-126 both promise classic "separate SQLConf / temp views / UDFs" semantics, but users get the cloned state. Either update the docstring to describe the cloneSession behavior accurately, or reach for a server-side construct that gives fresh state.

Note also that cloneSession is documented as a "developer API" (pyspark/sql/connect/session.py:1306) — coupling a user-facing method to it makes us fragile to changes there.

Fixed: updated the newSession() docstring to accurately describe cloneSession() semantics — it creates a new independent server session with the current session's configuration, temporary views, and registered functions copied in, rather than returning a fresh-empty session.
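
The difference in state semantics from Finding #3 can be illustrated with a toy model. `ToySession` and its two methods are hypothetical and only mimic the behaviour described in the review (fresh state versus copied state); they do not call any Spark API.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToySession:
    """Toy model of per-session state; not the real SparkSession."""

    conf: dict = field(default_factory=dict)
    temp_views: set = field(default_factory=set)

    def new_session(self):
        # Models classic newSession(): fresh, empty per-session state.
        return ToySession()

    def clone_session(self):
        # Models Connect cloneSession(): an independent copy of current state.
        return copy.deepcopy(self)


base = ToySession(conf={"spark.sql.shuffle.partitions": "8"}, temp_views={"people"})
fresh = base.new_session()
cloned = base.clone_session()

# Later changes to the clone do not leak back into the original.
cloned.temp_views.add("orders")
```

Legacy code that relied on newSession() to start without existing temp views would see them in the clone, which is why the docstring needed to describe the copied state.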

cloud-fan · 2026-05-27T08:05:05Z

+        return cls._instantiatedContext
+
+    @classmethod
+    def getOrCreate(cls: Type["SQLContext"], sparkSession: "SparkSession") -> "SQLContext":


Finding #4 — signature divergence + duplication (subsumed if Finding #1 takes path (a)).

(a) Classic getOrCreate(sc=None) makes the session optional and resolves the active one inside; this Connect signature is getOrCreate(sparkSession) — positional, required. Code that does SQLContext.getOrCreate() after importing from pyspark.sql.connect.context directly hits a TypeError.

(b) The body of this method is identical to _get_or_create_from_session except for warnings.warn — it could just call through.

Both concerns stop mattering if the class is made implementation-private (Finding #1).

Fixed as a consequence of Finding #1: with getOrCreate removed from the Connect class, both (a) the signature divergence and (b) the duplication disappear.

cloud-fan · 2026-05-27T08:05:05Z

+        return cls._instantiatedContext
+
+    def newSession(self) -> "SQLContext":
+        """Returns a new SQLContext as new session, that has separate SQLConf,


Grammar — missing article:

Suggested change

"""Returns a new SQLContext as new session, that has separate SQLConf,

"""Returns a new SQLContext as a new session, that has separate SQLConf,

cloud-fan · 2026-05-27T08:05:05Z

+            :class:`tuple`, ``int``, ``boolean``, etc.), :class:`list`,
+            :class:`pandas.DataFrame`, or :class:`pyarrow.Table`.
+        schema : :class:`~pyspark.sql.types.DataType`, str or list, optional
+            a :class:`~pyspark.sql.types.DataType` or a datatype string or a list of


The signature accepts Tuple[str, ...] too:

Suggested change

a :class:`~pyspark.sql.types.DataType` or a datatype string or a list of

a :class:`~pyspark.sql.types.DataType` or a datatype string or a list/tuple of

cloud-fan · 2026-05-27T08:05:05Z

+            a :class:`~pyspark.sql.types.DataType` or a datatype string or a list of
+            column names.
+        samplingRatio : float, optional
+            the sample ratio of rows used for inferring


Sentence is incomplete — inferring what?

Suggested change

the sample ratio of rows used for inferring

the sample ratio of rows used for inferring the schema.

cloud-fan · 2026-05-27T08:05:05Z

+        Returns
+        -------
+        list
+            list of table names, in string


Awkward phrasing:

Suggested change

list of table names, in string

list of table names as strings

cloud-fan · 2026-05-27T08:05:05Z

+    @property
+    def streams(self) -> StreamingQueryManager:
+        """Returns a :class:`StreamingQueryManager` that allows managing all the
+        :class:`~pyspark.sql.streaming.StreamingQuery` StreamingQueries active on


"StreamingQuery StreamingQueries" duplicates the class name; the backticks around this read oddly:

Suggested change

:class:`~pyspark.sql.streaming.StreamingQuery` StreamingQueries active on

:class:`~pyspark.sql.streaming.StreamingQuery` instances active on this

cloud-fan · 2026-05-27T08:05:05Z

+                StructField("isTemporary", BooleanType(), nullable=False),
+            ]
+        )
+        rows = [


Worth a one-line comment that this materializes the catalog list on the driver (chosen over SHOW TABLES for column-name parity per the earlier @Yicong-Huang thread). For catalogs with very many tables this is in-memory work that classic delegates server-side; future readers will appreciate the rationale being captured here rather than only in PR history.

Added a two-line comment explaining the choice: SHOW TABLES returns database vs namespace depending on the active catalog, so catalog.listTables() is used to guarantee the column names always match the classic implementation.

…ntext, fix HiveContext bypass, reorganize tests
- Remove public `getOrCreate` from Connect SQLContext; internal dispatch uses `_get_or_create_from_session` only (fixes Finding #1 / #4)
- Fix HiveContext bypass in classic dispatch: route getOrCreate to the Connect counterpart by class name so ConnectHiveContext._from_session raises as expected (fixes Finding #2)
- Fix newSession() docstring to accurately describe cloneSession() semantics (fixes Finding #3)
- Fix docstring nits: missing article, list/tuple, inferring the schema, table names as strings, streams wording
- Add comment explaining catalog.listTables() over SHOW TABLES
- Reorganize tests: add test_sql_context.py with mixin + classic runner, test_parity_sql_context.py for Connect parity, slim test_connect_context.py to Connect-specific tests only

Co-authored-by: DB Tsai <db.tsai@databricks.com>

… modules in CI
- Add _instantiatedContext = None override on ConnectHiveContext to prevent Python MRO from finding the parent SQLContext's cached instance; without this, HiveContext.getOrCreate() silently returns a SQLContext when called after SQLContext.getOrCreate() instead of raising PySparkNotImplementedError
- Register test_parity_sql_context and test_sql_context in modules.py so the shared mixin tests actually run in CI

Co-authored-by: DB Tsai <db.tsai@databricks.com>

…ass name Using SQLContext._instantiatedContext in __init__ would set the parent's cache for any subclass that reaches __init__ without overriding it. Using type(self) ensures each class maintains its own singleton correctly. Co-authored-by: DB Tsai <db.tsai@databricks.com>
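
The class-attribute behaviour behind these two singleton fixes can be shown in isolation. This is a minimal sketch with simplified stand-in classes; `InheritingContext` and `IsolatedContext` are hypothetical names, and the caching logic is reduced to the attribute lookup itself.

```python
class SQLContext:
    _instantiatedContext = None

    def __init__(self, session):
        self.session = session
        # type(self) stores the singleton on the concrete class instead of
        # always writing to SQLContext.
        if type(self)._instantiatedContext is None:
            type(self)._instantiatedContext = self


class InheritingContext(SQLContext):
    # No override: attribute lookup falls through to SQLContext via the MRO.
    pass


class IsolatedContext(SQLContext):
    # Own slot: the parent's cached instance is no longer visible here.
    _instantiatedContext = None


parent = SQLContext("session")
leaks_parent_cache = InheritingContext._instantiatedContext is parent
isolated_is_empty = IsolatedContext._instantiatedContext is None

child = IsolatedContext("session")
parent_unchanged = SQLContext._instantiatedContext is parent
child_cached = IsolatedContext._instantiatedContext is child
```

Without the subclass override, a cached parent instance is what a subclass lookup returns, which matches the silent fallback described in the commit message.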

…Connect SQLContext The `.. deprecated:: 2.3.0` note implied spark.udf.registerJavaFunction is a working alternative in Connect, but the method simply raises PySparkNotImplementedError. The annotation is from the classic implementation and should not appear here. Co-authored-by: DB Tsai <db.tsai@databricks.com>

…LContext

tables(): t.namespace[-1] instead of t.namespace[0]. For multi-level namespaces (e.g. Unity Catalog ['catalog', 'db']), the database is the innermost component (last element), not the first. Using index 0 was silently returning the catalog name instead of the database name.

stop(): clear connect.SQLContext._instantiatedContext when the session being stopped is the one wrapped by the cached context. The classic SparkSession.stop() already does this (session.py:2158); the Connect variant was missing the equivalent cleanup, causing getOrCreate() to return a stale context wrapping a closed session after stop().

Co-authored-by: DB Tsai <db.tsai@databricks.com>
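
The indexing fix in tables() comes down to which end of the namespace list holds the database. A small sketch follows; `database_of` is a hypothetical helper, and the handling of an empty or missing namespace is an assumption not taken from the PR.

```python
def database_of(namespace):
    """Return the innermost namespace component, or None if there is none."""
    # Assumption: an empty or missing namespace maps to None.
    return namespace[-1] if namespace else None


single_level = ["default"]
multi_level = ["catalog", "db"]

single_db = database_of(single_level)
multi_db = database_of(multi_level)
first_component = multi_level[0]  # what index 0 would have returned
```

For a single-level namespace both indexes agree, which is presumably why the original index-0 version went unnoticed until a multi-level catalog was considered.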

cloud-fan

Re-review — 11 addressed, 1 remaining, 5 new (5 new = 3 newly introduced, 2 late catches). 1 blocking, 1 non-blocking, 3 nits.

Thanks for the thorough rework — Findings #1–#4, the doc nits, and the @zhengruifeng test reorg all look correctly addressed. One gap surfaced from the reorg itself, plus a few nits.

Correctness (1)

context.py:168: the is_remote_only() dispatch and the Finding #2 fix (HiveContext.getOrCreate() raising) are now untested — the reorg dropped test_getOrCreate_emits_deprecation_warning and test_hive_context_get_or_create_raises without replacement. See inline.

Design / architecture (1)

connect/context.py:49: class declared internal but still exported in __all__. See inline.

Nits: 3 minor items (see inline comments).

PR description suggestions

The opening sentence is still broken ("…implementation in continues to work transparently…") — flagged last round, still unfixed. Add the missing clause (e.g. "implementation in pyspark.sql.connect.context so that legacy code using SQLContext continues to work…").
The "How was this patch tested?" section claims Connect-specific cases include "deprecation warnings on __init__ and getOrCreate", but the getOrCreate test was dropped in the reorg and no longer exists — restore it (see the inline comment) or update the text.

cloud-fan · 2026-05-29T19:24:00Z

            "Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.",
            FutureWarning,
        )
+        if is_remote_only():


This is_remote_only() dispatch is the PR's central user-facing behavior, and the Finding #2 fix routes HiveContext.getOrCreate() here to raise — but neither is tested. The reorg dropped test_getOrCreate_emits_deprecation_warning and test_hive_context_get_or_create_raises (both present at the prior review commit 6af283f, the latter added specifically to cover Finding #2) without replacing them. is_remote_only() is mockable (pyspark.util.is_remote_only), so a ReusedConnectTestCase can patch it True and assert (1) SQLContext.getOrCreate() returns a Connect-backed context and (2) HiveContext.getOrCreate() raises PySparkNotImplementedError.

Fixed in e33dd66 — added test_getOrCreate_emits_deprecation_and_returns_connect_context and test_hive_context_getOrCreate_raises to test_connect_context.py, both using unittest.mock.patch('pyspark.util.is_remote_only', return_value=True) to exercise the dispatch path.
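
The patching approach can be sketched without Spark. In this hedged example, `util` is a `SimpleNamespace` standing in for the `pyspark.util` module, `get_or_create` is a hypothetical simplification of the dispatch, and `mock.patch.object` is used in place of the string-based `mock.patch` target quoted above.

```python
from types import SimpleNamespace
from unittest import mock

# Stand-in for the pyspark.util module.
util = SimpleNamespace(is_remote_only=lambda: False)


def get_or_create():
    # The function is looked up through the module object at call time, so a
    # patch applied to the module attribute is visible here.
    if util.is_remote_only():
        return "connect-backed context"
    return "classic context"


default_path = get_or_create()
with mock.patch.object(util, "is_remote_only", return_value=True):
    patched_path = get_or_create()
restored_path = get_or_create()
```

The patch only affects callers that resolve the name through the patched object at call time; a name bound earlier with a `from ... import` would keep pointing at the original function.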

This comment was generated with GitHub MCP.

cloud-fan · 2026-05-29T19:24:00Z

+    from pyspark.sql.connect.udtf import UDTFRegistration
+    from pyspark.sql._typing import UserDefinedFunctionLike
+
+__all__ = ["SQLContext", "HiveContext"]


Finding #1 made these classes an internal implementation detail, but they're still exported in __all__, which signals public API. If they're meant to be internal, drop them from __all__ — the classic dispatch in context.py imports the module attribute directly (getattr(_connect_context, cls.__name__, ...)), so it doesn't rely on __all__. Non-blocking.

Fixed in e33dd66 — removed __all__ entirely and replaced it with a comment marking the module as an internal implementation detail.

This comment was generated with GitHub MCP.

cloud-fan · 2026-05-29T19:24:00Z

 from pyspark.sql import Row, SparkSession
 from pyspark.sql.types import StructType, StringType, StructField
 from pyspark.testing.sqlutils import ReusedSQLTestCase
+from pyspark.sql.tests.test_sql_context import SQLContextTestsMixin  # noqa: F401


This import is dead — SQLContextTestsMixin isn't a TestCase and isn't subclassed in this file, so it's never collected or run (hence the # noqa: F401). Looks like a leftover from the reorg; please remove it.

Fixed in e33dd66 — removed the dead import.

This comment was generated with GitHub MCP.

cloud-fan · 2026-05-29T19:24:00Z

+        with the current session's configuration, temporary views, and registered functions
+        copied into it.
+
+        .. versionadded:: 4.0.0


Confirming @Yicong-Huang's earlier question: the repo is at 5.0.0.dev0 and 4.0.0 is already released, so these brand-new methods can't be versionadded:: 4.0.0 — they should be 5.0.0. Applies to every .. versionadded:: 4.0.0 in this file (lines 107, 120, 133, 141, 153, 172, 197, 210, …).

Fixed in e33dd66 — changed all 22 occurrences of versionadded:: 4.0.0 to versionadded:: 5.0.0 in connect/context.py.

This comment was generated with GitHub MCP.

Correction in 634fddb — changed to 4.2.0 (this PR targets the 4.2 release, not 5.0).

This comment was generated with GitHub MCP.

Changed to versionadded:: 4.2.0, as Claude Code doesn't have context about our new release model.

cloud-fan · 2026-05-29T19:24:00Z

+
+            session = SparkSession._getActiveSessionOrCreate()
+            # Route to the Connect counterpart so subclasses (e.g. HiveContext) are handled
+            # correctly: ConnectHiveContext._from_session raises PySparkNotImplementedError.


There's no class named ConnectHiveContext — the Connect-side class is HiveContext (in pyspark.sql.connect.context), and the routing reaches it via cls.__name__.

Suggested change

# correctly: ConnectHiveContext._from_session raises PySparkNotImplementedError.

# correctly: the Connect HiveContext._from_session raises PySparkNotImplementedError.

Fixed in e33dd66 — applied the suggestion exactly.

This comment was generated with GitHub MCP.

…ate tests
- Remove __all__ from connect/context.py (internal module, not public API)
- Change all versionadded:: 4.0.0 to 5.0.0 in connect/context.py
- Fix comment: s/ConnectHiveContext/the Connect HiveContext/ in context.py
- Remove dead SQLContextTestsMixin import from test_context.py
- Add test_getOrCreate_emits_deprecation_and_returns_connect_context and test_hive_context_getOrCreate_raises to test_connect_context.py, both using unittest.mock.patch to mock is_remote_only

Co-authored-by: DB Tsai

This PR targets the 4.2 release; 5.0.0 was incorrect. Co-authored-by: DB Tsai

Yicong-Huang

LGTM. Thanks for the thorough rework. All my earlier change requests are addressed. Approving with a few non-blocking nits left inline.

Yicong-Huang · 2026-05-29T23:41:42Z

+        return ctx
+
+    @classmethod
+    def _get_or_create_from_session(cls, sparkSession: "SparkSession") -> "SQLContext":


Cached instance is never validated against the currently-active session - it only resets via the stop() hook. If a different Connect session becomes active after the first getOrCreate(), this keeps returning a context bound to the old session. Classic _get_or_create re-creates when the cached context is dead. Intentional?

Yicong-Huang · 2026-05-29T23:41:42Z

+    def test_readStream_is_available(self) -> None:
+        self.assertIsNotNone(self._make_ctx().readStream)
+
+    def test_streams_is_available(self) -> None:


The mixin has no coverage for udf, udtf, registerFunction, or createExternalTable, yet the PR description claims udf/udtf/registerFunction are covered. Please consider adding the cases or correcting the description.

HyukjinKwon reviewed Apr 27, 2026

dbtsai force-pushed the connect-sqlcontext-wrapper branch from 0d26795 to d99ee95 Compare April 28, 2026 05:14

dbtsai added 6 commits May 22, 2026 13:49

Fix CI: register test_connect_context in modules.py and fix ruff form…

4c88987

…atting Co-authored-by: Isaac

Add API reference docs for SQLContext and HiveContext legacy entry po…

ceb50fa

…ints Co-authored-by: Isaac

Fix newSession() in Connect SQLContext to use cloneSession()

8febe99

SparkSession.newSession() is JVM-only and not supported in Spark Connect. Use cloneSession() which is the Connect equivalent. Co-authored-by: Isaac

dbtsai force-pushed the connect-sqlcontext-wrapper branch from 3496379 to 8febe99 Compare May 22, 2026 20:50

dbtsai changed the title ~~[WIP][CONNECT][PYTHON] Add SQLContext wrapper for Spark Connect~~ [SPARK-57021][CONNECT][PYTHON] Add SQLContext wrapper for Spark Connect May 23, 2026

Yicong-Huang requested changes May 23, 2026

dbtsai added 2 commits May 26, 2026 12:00

dbtsai added 2 commits May 26, 2026 15:27

Fix ruff format: collapse inner cast onto one line

3d02674

Trigger CI

6af283f

zhengruifeng reviewed May 27, 2026

cloud-fan reviewed May 27, 2026

dbtsai and others added 5 commits May 28, 2026 15:26

cloud-fan reviewed May 29, 2026

dbtsai added 2 commits May 29, 2026 16:15

fix: versionadded tags should be 4.2.0, not 5.0.0

634fddb

This PR targets the 4.2 release; 5.0.0 was incorrect. Co-authored-by: DB Tsai

Yicong-Huang approved these changes May 29, 2026
