[SPARK-57021][CONNECT][PYTHON] Add SQLContext wrapper for Spark Connect #55574

Open
dbtsai wants to merge 17 commits into apache:master from dbtsai:connect-sqlcontext-wrapper

Member

@dbtsai dbtsai commented Apr 27, 2026

What changes were proposed in this pull request?

This PR adds a Spark Connect-compatible SQLContext (and HiveContext) implementation in
pyspark.sql.connect.context so that legacy code using SQLContext continues to work
transparently when running against a Connect server.

Key changes:

  1. New pyspark.sql.connect.context.SQLContext — wraps a Connect SparkSession directly
    (no SparkContext required). Delegates all supported operations to the session:
    sql, table, range, createDataFrame, conf, udf, udtf, read,
    readStream, streams, and catalog operations (cacheTable, uncacheTable,
    clearCache, tables, tableNames, registerDataFrameAsTable, dropTempTable,
    createExternalTable).

    • newSession() uses cloneSession() (the Connect equivalent of SparkSession.newSession()).
    • JVM-only APIs (registerJavaFunction, HiveContext.__init__) raise PySparkNotImplementedError.
  2. Connect dispatch in classic SQLContext.getOrCreate() — when running in remote-only mode
    (is_remote_only()), the classic getOrCreate() now automatically returns a
    Connect SQLContext wrapping the active Connect session, so callers do not need to
    import from pyspark.sql.connect directly.

  3. Shared test mixin: SQLContextTestsMixin extracted to test_sql_context.py so the same
    suite runs against both the classic and Connect implementations via SQLContextParityTests.

  4. API reference docs — new python/docs/source/reference/pyspark.sql/legacy.rst page
    listing SQLContext and HiveContext in the public API reference.

  5. CI registration: test_connect_context registered in modules.py.
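
The delegation shape described in item 1 can be sketched without a Spark installation. This is only an illustration under stated assumptions: `FakeSession` is a hypothetical stand-in for a Connect SparkSession, `SketchSQLContext` is an invented name (not the class added by this PR), the warning category on `__init__` is assumed, and the built-in `NotImplementedError` stands in for `PySparkNotImplementedError`.

```python
import warnings


class FakeSession:
    """Hypothetical stand-in for a Connect SparkSession (not a pyspark class)."""

    def __init__(self):
        self.conf = {}

    def sql(self, query):
        return f"DataFrame<{query}>"

    def range(self, start, end=None):
        return list(range(start)) if end is None else list(range(start, end))


class SketchSQLContext:
    """Illustrative wrapper: holds a session and forwards every call to it."""

    def __init__(self, sparkSession):
        # Assumption: the deprecated constructor warns with FutureWarning.
        warnings.warn("Deprecated. Use SparkSession instead.", FutureWarning)
        self.sparkSession = sparkSession

    def sql(self, query):
        return self.sparkSession.sql(query)

    def range(self, start, end=None):
        return self.sparkSession.range(start, end)

    def setConf(self, key, value):
        self.sparkSession.conf[key] = value

    def getConf(self, key, default=None):
        return self.sparkSession.conf.get(key, default)

    def registerJavaFunction(self, name, javaClassName):
        # JVM-only API: raise a clear error instead of failing on a JVM attribute.
        raise NotImplementedError("registerJavaFunction is not supported in Spark Connect")
```

Because the wrapper holds no state of its own beyond the session reference, anything set through it (for example via `setConf`) is visible on the wrapped session as well.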

Why are the changes needed?

SQLContext has been deprecated since Spark 2.0 in favor of SparkSession, but many existing
PySpark applications still instantiate it directly. Without this wrapper, those applications
fail entirely on Spark Connect because the classic SQLContext.__init__ requires a live
SparkContext (JVM), which is not available in Connect mode. This patch closes that
compatibility gap.

Does this PR introduce any user-facing change?

Yes. Previously, calling SQLContext(spark) or SQLContext.getOrCreate() in a Spark Connect
environment raised an error because the classic implementation requires a SparkContext.
After this PR, both calls succeed and return a fully functional (but still deprecated)
SQLContext backed by the active Connect session.

JVM-specific methods (registerJavaFunction, HiveContext) now raise a clear
PySparkNotImplementedError instead of a cryptic JVM/attribute error.

How was this patch tested?

  • Added SQLContextTestsMixin in python/pyspark/sql/tests/test_sql_context.py covering:
    setConf/getConf, createDataFrame, sql, table, tables/tableNames,
    cacheTable/uncacheTable/clearCache, registerDataFrameAsTable/dropTempTable,
    range, read, readStream, streams, udf/udtf, newSession, registerFunction.
  • SQLContextConnectTests in python/pyspark/sql/tests/connect/test_connect_context.py
    adds Connect-specific cases: deprecation warning on __init__, getOrCreate() in
    remote-only mode returns a Connect-backed context and emits a deprecation warning,
    HiveContext.getOrCreate() in remote-only mode raises PySparkNotImplementedError,
    registerJavaFunction raises PySparkNotImplementedError, and HiveContext.__init__
    raises PySparkNotImplementedError. The is_remote_only() code path is tested by
    patching pyspark.util.is_remote_only.
  • Registered in dev/sparktestsupport/modules.py so the Connect test is picked up by CI.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (claude-sonnet-4-6), via Anthropic Claude Code

:class:`DataFrame`
"""
listed = self.sparkSession.catalog.listTables(dbName)
rows = [
Member

Why do we need this?

Member Author

Replaced with spark.sql("SHOW TABLES [IN dbName]")

from pyspark.testing.connectutils import ReusedConnectTestCase


class SQLContextConnectTests(ReusedConnectTestCase):
Member

We should share the tests with Spark Classic

Member Author

SQLContextTestsMixin extracted into test_context.py; Connect test is now SQLContextParityTests(SQLContextTestsMixin, ReusedConnectTestCase)

__all__ = ["SQLContext", "HiveContext"]


class SQLContext:
Member

We should do something in SQLContext.getOrCreate. Otherwise, users would have to manually import pyspark.sql.connect to use Spark Connect.

Member Author

Classic getOrCreate now detects is_remote_only() and dispatches to ConnectSQLContext._get_or_create_from_session(); sc made optional
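
A minimal sketch of that dispatch, assuming nothing beyond what the reply states. The class names, the `REMOTE_ONLY` flag, and the `active_session` parameter are hypothetical; the real code calls `is_remote_only()` and resolves the active Connect session itself.

```python
# Stand-in for pyspark.util.is_remote_only(); a mutable flag keeps the sketch testable.
REMOTE_ONLY = {"value": True}


def is_remote_only():
    return REMOTE_ONLY["value"]


class ConnectContextSketch:
    """Hypothetical Connect-side wrapper around a session."""

    def __init__(self, session):
        self.sparkSession = session


class ClassicContextSketch:
    """Hypothetical classic-side context that needs a SparkContext."""

    def __init__(self, sc):
        self.sc = sc

    @classmethod
    def getOrCreate(cls, sc=None, active_session="active-connect-session"):
        if is_remote_only():
            # Remote-only install: no JVM, so hand back the Connect wrapper.
            return ConnectContextSketch(active_session)
        # Classic path: sc is optional in the signature but still required here.
        assert sc is not None, "classic getOrCreate needs a SparkContext"
        return cls(sc)
```

Calling `ClassicContextSketch.getOrCreate()` with the flag set returns the Connect-side object, which is why legacy callers would not need to change their imports.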

Contributor

Can HiveContext.getOrCreate(spark) work? It might be able to bypass the unsupported HiveContext.

Member Author

Good catch: confirmed this was a real bypass. HiveContext.getOrCreate(spark) dispatches to SQLContext.getOrCreate(cls=HiveContext, ...), which calls _from_session. That in turn uses object.__new__(HiveContext), creating an instance without ever invoking __init__, so the PySparkNotImplementedError was silently skipped.

Fixed in c967950 by overriding _from_session in HiveContext to unconditionally raise. Since getOrCreate, _get_or_create_from_session, and newSession all route through _from_session, this closes all bypass paths. Added test_hive_context_get_or_create_raises to cover this case.
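
The bypass and its fix can be reproduced in plain Python. In this sketch all class names are illustrative and `NotSupported` stands in for `PySparkNotImplementedError`; only the `object.__new__` mechanism and the `_from_session` override are taken from the reply above.

```python
class NotSupported(Exception):
    """Stand-in for PySparkNotImplementedError."""


class Context:
    def __init__(self, session):
        self.session = session

    @classmethod
    def _from_session(cls, session):
        # Allocates the instance without running __init__,
        # so any guard placed in __init__ is skipped.
        obj = object.__new__(cls)
        obj.session = session
        return obj

    @classmethod
    def getOrCreate(cls, session):
        return cls._from_session(session)


class LeakyHive(Context):
    def __init__(self, session):
        raise NotSupported("HiveContext is not supported")


class GuardedHive(LeakyHive):
    @classmethod
    def _from_session(cls, session):
        # Overriding the factory closes every path that routes through it.
        raise NotSupported("HiveContext is not supported")
```

`LeakyHive.getOrCreate("s")` quietly returns an instance even though `LeakyHive("s")` raises, while `GuardedHive.getOrCreate("s")` raises as intended.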

Comment thread python/pyspark/sql/connect/context.py Outdated
@dbtsai dbtsai force-pushed the connect-sqlcontext-wrapper branch from 0d26795 to d99ee95 April 28, 2026 05:14
dbtsai added 6 commits May 22, 2026 13:49
Implements a Connect-compatible SQLContext in pyspark.sql.connect.context
that wraps a Connect SparkSession instead of requiring a SparkContext.
All SparkSession-delegate methods (sql, table, range, createDataFrame,
conf, udf, udtf, read, readStream, streams, catalog ops) are wired up.
JVM-only APIs (registerJavaFunction, HiveContext) raise
PySparkNotImplementedError. Adds 20 unit tests.

Co-authored-by: Isaac
- Replace manual Row construction in tables() with SHOW TABLES SQL
- Fix all versionadded annotations to 4.0.0 (Connect-era version)
- Add _get_or_create_from_session() and dispatch in classic
  SQLContext.getOrCreate() so users need not import from
  pyspark.sql.connect directly (sc arg made optional)
- Extract SQLContextTestsMixin into test_context.py; Connect test
  now inherits the shared suite via SQLContextParityTests

Co-authored-by: Isaac
- Use base pyspark.sql.dataframe.DataFrame as return type in
  connect/context.py since SparkSession.sql/range/table/createDataFrame
  are annotated to return the parent class, not the Connect subclass
- Remove now-unnecessary # type: ignore[call-overload] on createDataFrame
- Add # type: ignore[return-value, arg-type] to the classic
  SQLContext.getOrCreate Connect dispatch path where the two SQLContext
  classes are structurally equivalent but not in the same hierarchy

Co-authored-by: Isaac
SparkSession.newSession() is JVM-only and not supported in Spark Connect.
Use cloneSession() which is the Connect equivalent.

Co-authored-by: Isaac
@dbtsai dbtsai force-pushed the connect-sqlcontext-wrapper branch from 3496379 to 8febe99 May 22, 2026 20:50
@dbtsai dbtsai changed the title from "[WIP][CONNECT][PYTHON] Add SQLContext wrapper for Spark Connect" to "[SPARK-57021][CONNECT][PYTHON] Add SQLContext wrapper for Spark Connect" May 23, 2026
Contributor

@Yicong-Huang Yicong-Huang left a comment

I think there are some changes needed, please see inline comments.

Comment thread python/pyspark/sql/connect/context.py Outdated
__all__ = ["SQLContext", "HiveContext"]


class SQLContext:
Contributor

Can HiveContext.getOrCreate(spark) work? It might be able to bypass the unsupported HiveContext.

Comment thread python/pyspark/sql/tests/test_context.py Outdated
Comment thread python/pyspark/sql/context.py Outdated
dbtsai added 2 commits May 26, 2026 12:00
…x setUp/tearDown

- tables() in Connect SQLContext now uses catalog.listTables() to always
  return consistent (namespace, tableName, isTemporary) columns instead of
  SHOW TABLES whose column names vary across catalogs
- Replace type: ignore[return-value/arg-type] in classic getOrCreate() with
  explicit cast() calls for clarity
- Override setUp/tearDown in SQLContextParityTests to reset Connect
  SQLContext._instantiatedContext between tests, not just the classic one
HiveContext.__init__ raised PySparkNotImplementedError, but
HiveContext.getOrCreate(spark) bypassed it because getOrCreate routes
through _from_session which uses object.__new__(cls), skipping __init__.
Override _from_session in HiveContext to raise unconditionally, closing
all bypass paths (getOrCreate, _get_or_create_from_session, newSession).
Add test_hive_context_get_or_create_raises to cover the bypass.
Member Author

dbtsai commented May 26, 2026

reload(window)


class SQLContextTestsMixin:
Contributor

@zhengruifeng zhengruifeng May 27, 2026

This is just a mixin, not a real test, so it is never run on the classic side.
I think we can reorganize the tests as follows, to match the existing Connect parity test convention:

1. python/pyspark/sql/tests/test_sql_context.py

  • class SQLContextTestsMixin
  • class SQLContextTests(SQLContextTestsMixin, ReusedSQLTestCase)

2. python/pyspark/sql/tests/connect/test_parity_sql_context.py

  • class SQLContextParityTests(SQLContextTestsMixin, ReusedConnectTestCase)

Member Author

Fixed: created python/pyspark/sql/tests/test_sql_context.py with SQLContextTestsMixin + SQLContextTests(SQLContextTestsMixin, ReusedSQLTestCase) so the mixin now runs against classic Spark too. Added python/pyspark/sql/tests/connect/test_parity_sql_context.py following the existing Connect parity convention (SQLContextParityTests(SQLContextTestsMixin, ReusedConnectTestCase)). test_connect_context.py is now Connect-specific tests only.
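
The mixin-plus-runner layout can be sketched with plain `unittest`. The backends and the `make_context` hook below are hypothetical; in the PR the fixtures come from `ReusedSQLTestCase` and `ReusedConnectTestCase` rather than from a factory method.

```python
import unittest


class ContextTestsMixin:
    """Shared test bodies. Not collected on its own: it has no TestCase base."""

    def test_range(self):
        self.assertEqual(self.make_context().range(3), [0, 1, 2])


class ClassicBackend:
    """Hypothetical classic implementation."""

    def range(self, n):
        return list(range(n))


class ConnectBackend:
    """Hypothetical Connect implementation."""

    def range(self, n):
        return [i for i in range(n)]


class ClassicContextTests(ContextTestsMixin, unittest.TestCase):
    def make_context(self):
        return ClassicBackend()


class ContextParityTests(ContextTestsMixin, unittest.TestCase):
    def make_context(self):
        return ConnectBackend()
```

Each concrete class inherits the same `test_range`, so one test body runs once per implementation. Without a concrete runner on the classic side, the mixin's tests would never execute there, which is the gap the review comment points out.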

Contributor

@cloud-fan cloud-fan left a comment

Summary

Prior state and problem. SQLContext has been deprecated since Spark 2.0 in favor of SparkSession, but legacy code still constructs it directly via SQLContext(sc) or SQLContext.getOrCreate(). Both classic paths hard-require a JVM SparkContext (__init__ dereferences sparkContext._jsc, _jvm, and sparkSession._jsparkSession.sqlContext()). In a Spark Connect remote-only install (pyspark-client) there is no SparkContext, so any of these calls crashes. The Connect SparkSession itself goes further — __getattr__ raises JVM_ATTRIBUTE_NOT_SUPPORTED for _jsc, _jconf, _jvm, _jsparkSession, sparkContext, and newSession — so the classic class can't be made to work by handing it a Connect session.

Design approach.

  1. A new pyspark.sql.connect.context.SQLContext that wraps a Connect SparkSession and pure-delegates almost every method; HiveContext is a sentinel that raises everywhere.
  2. Classic SQLContext.getOrCreate(sc=None) checks is_remote_only() and dispatches to ConnectSQLContext._get_or_create_from_session(active_session). sc becomes optional; the classic branch keeps an assert.

Key design decisions made by this PR.

  • Dispatch keyed on is_remote_only(), not session type: only fires for pyspark-client installs; full installs still take the JVM path.
  • HiveContext raises on every Connect construction: closes the direct-Connect bypass (commit c967950). The classic-dispatch path still bypasses this — see Finding #2.
  • SQLContext.newSession() is implemented via cloneSession(): opposite state semantics from classic newSession() — see Finding #3.
  • tables() materializes catalog rows on the driver rather than using SHOW TABLES, for column-name stability across catalog versions (per the 2026-05-26 follow-up to @Yicong-Huang).

Notes on the existing review thread.

  • HiveContext bypass: c967950 closes the direct Connect path. The classic-dispatch path remains open — see Finding #2.
  • Test reorg (@zhengruifeng, 2026-05-27): still open and the right call — the SQLContextTestsMixin currently never runs on the classic side. The current file naming test_connect_context.py also doesn't match the parity-mixin convention (test_parity_*.py).

PR description. The opening sentence is broken — "...implementation in continues to work transparently...". Please fix the missing clause.

__all__ = ["SQLContext", "HiveContext"]


class SQLContext:
Contributor

Finding #1 — design / top-level question: what is the public pyspark.sql.connect.context.SQLContext for?

The classic-getOrCreate dispatch already covers the only transparent-compat case: legacy code that calls SQLContext.getOrCreate() in a remote-only install keeps working unchanged. Every other path requires the user to edit their imports — and once they're editing, migrating the call site to SparkSession is a strictly smaller, less-deprecated diff than switching to from pyspark.sql.connect.context import SQLContext (the deprecation recommendation since 2.0 already points there). I can't see a user journey where the public Connect-side class is the right answer over SparkSession.

Two cleaner shapes worth considering:

(a) Make it implementation-private — rename _SQLContext, or inline the wrapper into the is_remote_only() branch of classic getOrCreate. Drop the per-method entries in legacy.rst. The contract becomes: legacy getOrCreate keeps working; everyone else writes SparkSession.

(b) Keep it public but trim to a one-paragraph stub in legacy.rst that points to SparkSession, instead of advertising the full method surface.

If (a): the signature-divergence and getOrCreate/_get_or_create_from_session duplication concerns in Finding #4 also disappear, since there are no public callers — we keep _get_or_create_from_session and delete the public getOrCreate.

Member Author

Fixed via option (a): removed the public getOrCreate classmethod from the Connect class entirely. The Connect SQLContext is now an internal implementation detail, accessible only through _get_or_create_from_session (invoked by the classic dispatch). Legacy code calling SQLContext.getOrCreate() via the classic import path continues to work unchanged; new code uses SparkSession.

Comment thread python/pyspark/sql/context.py Outdated
FutureWarning,
)
if is_remote_only():
from pyspark.sql.connect.context import SQLContext as ConnectSQLContext
Contributor

Finding #2: HiveContext bypass through classic dispatch.

This is_remote_only() branch hardcodes ConnectSQLContext regardless of cls. A remote-only user calling from pyspark.sql import HiveContext; HiveContext.getOrCreate() reaches here with cls=HiveContext, gets handed back a plain ConnectSQLContext, and never sees the PySparkNotImplementedError they should.

c967950 closed the direct-Connect bypass via _from_session, but the classic dispatch bypasses _from_session entirely because it routes to ConnectSQLContext, not ConnectHiveContext.

Two fixes:

  • (a) route based on cls: if cls is HiveContext: from pyspark.sql.connect.context import HiveContext as ConnectHiveContext; ...; or
  • (b) override HiveContext.getOrCreate in the classic file to raise in remote-only mode before delegating.

Member Author

Fixed via your option (a): the classic SQLContext.getOrCreate now does connect_cls = getattr(_connect_context, cls.__name__, _connect_context.SQLContext) before calling _get_or_create_from_session. When cls is HiveContext, this routes to ConnectHiveContext._get_or_create_from_session, which calls ConnectHiveContext._from_session — and that raises PySparkNotImplementedError.
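
The name-based routing can be sketched as follows. The `getattr(_connect_context, cls.__name__, _connect_context.SQLContext)` expression is quoted from the reply; everything else is an assumption for illustration: `_connect_context` is a `SimpleNamespace` standing in for the imported module, `NotSupported` stands in for `PySparkNotImplementedError`, and the session is passed explicitly.

```python
from types import SimpleNamespace


class NotSupported(Exception):
    """Stand-in for PySparkNotImplementedError."""


class ConnectSQLContext:
    @classmethod
    def _get_or_create_from_session(cls, session):
        return cls._from_session(session)

    @classmethod
    def _from_session(cls, session):
        obj = object.__new__(cls)
        obj.sparkSession = session
        return obj


class ConnectHiveContext(ConnectSQLContext):
    @classmethod
    def _from_session(cls, session):
        raise NotSupported("HiveContext is not supported in Spark Connect")


# Stand-in for the imported pyspark.sql.connect.context module.
_connect_context = SimpleNamespace(
    SQLContext=ConnectSQLContext, HiveContext=ConnectHiveContext
)


class SQLContext:  # classic side
    @classmethod
    def getOrCreate(cls, session):
        # Look up the Connect counterpart by class name; unknown names fall back.
        connect_cls = getattr(_connect_context, cls.__name__, _connect_context.SQLContext)
        return connect_cls._get_or_create_from_session(session)


class HiveContext(SQLContext):
    pass
```

One consequence of the `getattr` default is that a user-defined subclass with an unrelated name falls back to the plain Connect `SQLContext` instead of failing.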


.. versionadded:: 4.0.0
"""
return self._from_session(self.sparkSession.cloneSession())
Contributor

Finding #3: cloneSession() is not equivalent to newSession().

Classic SparkSession.newSession() returns a session with separate SQLConf, temp views, and UDFs and a shared cache — i.e. a fresh-empty session (pyspark/sql/session.py:717-735).

Connect cloneSession() does the opposite: it copies the current session's conf, temp views, registered functions, and catalog state into an independent server session (pyspark/sql/connect/session.py:1282-1314).

The PR description and the docstring at line 124-126 both promise classic "separate SQLConf / temp views / UDFs" semantics, but users get the cloned state. Either update the docstring to describe the cloneSession behavior accurately, or reach for a server-side construct that gives fresh state.

Note also that cloneSession is documented as a "developer API" (pyspark/sql/connect/session.py:1306) — coupling a user-facing method to it makes us fragile to changes there.

Member Author

Fixed: updated the newSession() docstring to accurately describe cloneSession() semantics — it creates a new independent server session with the current session's configuration, temporary views, and registered functions copied in, rather than returning a fresh-empty session.
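
The semantic difference raised in Finding #3 can be modelled with a toy class. `ToySession` is not a pyspark class, and its two methods only mimic the behavior as the reviewer describes it (fresh state versus copied state); no server or shared cache is modelled.

```python
import copy


class ToySession:
    """Toy model of session-local state; not a pyspark class."""

    def __init__(self, conf=None, temp_views=None):
        self.conf = dict(conf or {})
        self.temp_views = dict(temp_views or {})

    def new_session(self):
        # Classic newSession() as described: session-local state starts empty.
        return ToySession()

    def clone_session(self):
        # Connect cloneSession() as described: state is copied into an
        # independent session, so later changes do not flow back.
        return ToySession(copy.deepcopy(self.conf), copy.deepcopy(self.temp_views))
```

A caller that relied on `newSession()` to get a clean slate would, under the cloned semantics, still see the parent's temp views and configuration, which is why the docstring had to change.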

Comment thread python/pyspark/sql/connect/context.py Outdated
return cls._instantiatedContext

@classmethod
def getOrCreate(cls: Type["SQLContext"], sparkSession: "SparkSession") -> "SQLContext":
Contributor

Finding #4 — signature divergence + duplication (subsumed if Finding #1 takes path (a)).

(a) Classic getOrCreate(sc=None) makes the session optional and resolves the active one inside; this Connect signature is getOrCreate(sparkSession) — positional, required. Code that does SQLContext.getOrCreate() after importing from pyspark.sql.connect.context directly hits a TypeError.

(b) The body of this method is identical to _get_or_create_from_session except for warnings.warn — it could just call through.

Both concerns stop mattering if the class is made implementation-private (Finding #1).

Member Author

Fixed as a consequence of Finding #1: with getOrCreate removed from the Connect class, both (a) the signature divergence and (b) the duplication disappear.

Comment thread python/pyspark/sql/connect/context.py Outdated
return cls._instantiatedContext

def newSession(self) -> "SQLContext":
"""Returns a new SQLContext as new session, that has separate SQLConf,
Contributor

Grammar — missing article:

Suggested change
- """Returns a new SQLContext as new session, that has separate SQLConf,
+ """Returns a new SQLContext as a new session, that has separate SQLConf,

Member Author

Fixed.

Comment thread python/pyspark/sql/connect/context.py Outdated
:class:`tuple`, ``int``, ``boolean``, etc.), :class:`list`,
:class:`pandas.DataFrame`, or :class:`pyarrow.Table`.
schema : :class:`~pyspark.sql.types.DataType`, str or list, optional
a :class:`~pyspark.sql.types.DataType` or a datatype string or a list of
Contributor

The signature accepts Tuple[str, ...] too:

Suggested change
- a :class:`~pyspark.sql.types.DataType` or a datatype string or a list of
+ a :class:`~pyspark.sql.types.DataType` or a datatype string or a list/tuple of

Member Author

Fixed.

Comment thread python/pyspark/sql/connect/context.py Outdated
a :class:`~pyspark.sql.types.DataType` or a datatype string or a list of
column names.
samplingRatio : float, optional
the sample ratio of rows used for inferring
Contributor

Sentence is incomplete — inferring what?

Suggested change
- the sample ratio of rows used for inferring
+ the sample ratio of rows used for inferring the schema.

Member Author

Fixed.

Comment thread python/pyspark/sql/connect/context.py Outdated
Returns
-------
list
list of table names, in string
Contributor

Awkward phrasing:

Suggested change
- list of table names, in string
+ list of table names as strings

Member Author

Fixed.

Comment thread python/pyspark/sql/connect/context.py Outdated
@property
def streams(self) -> StreamingQueryManager:
"""Returns a :class:`StreamingQueryManager` that allows managing all the
:class:`~pyspark.sql.streaming.StreamingQuery` StreamingQueries active on
Contributor

"StreamingQuery StreamingQueries" duplicates the class name; the backticks around this read oddly:

Suggested change
- :class:`~pyspark.sql.streaming.StreamingQuery` StreamingQueries active on
+ :class:`~pyspark.sql.streaming.StreamingQuery` instances active on this

Member Author

Fixed.

StructField("isTemporary", BooleanType(), nullable=False),
]
)
rows = [
Contributor

Worth a one-line comment that this materializes the catalog list on the driver (chosen over SHOW TABLES for column-name parity per the earlier @Yicong-Huang thread). For catalogs with very many tables this is in-memory work that classic delegates server-side; future readers will appreciate the rationale being captured here rather than only in PR history.

Member Author

Added a two-line comment explaining the choice: SHOW TABLES returns database vs namespace depending on the active catalog, so catalog.listTables() is used to guarantee the column names always match the classic implementation.
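
A rough sketch of the driver-side row building discussed in this thread. `TableEntry` is a hypothetical stand-in for the entries returned by `catalog.listTables()`, its field names are assumed, and the empty-string fallback for a missing namespace is an assumption rather than the PR's confirmed behavior. The column names follow the (namespace, tableName, isTemporary) layout mentioned in the commit notes.

```python
from collections import namedtuple

# Hypothetical stand-in for the entries returned by catalog.listTables().
TableEntry = namedtuple("TableEntry", ["name", "namespace", "isTemporary"])

# Fixed output columns, independent of which catalog is active.
COLUMNS = ("namespace", "tableName", "isTemporary")


def tables_rows(entries):
    """Build rows with a stable column layout on the driver."""
    rows = []
    for t in entries:
        # The database is the innermost namespace component (last element).
        # Assumption: entries without a namespace map to an empty string.
        db = t.namespace[-1] if t.namespace else ""
        rows.append((db, t.name, t.isTemporary))
    return rows
```

Because the whole list is materialized in memory before a DataFrame is built from it, this trades server-side execution for predictable column names, as the review comment notes.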

dbtsai and others added 5 commits May 28, 2026 15:26
…ntext, fix HiveContext bypass, reorganize tests

- Remove public `getOrCreate` from Connect SQLContext; internal dispatch
  uses `_get_or_create_from_session` only (fixes Finding #1 / #4)
- Fix HiveContext bypass in classic dispatch: route getOrCreate to the
  Connect counterpart by class name so ConnectHiveContext._from_session
  raises as expected (fixes Finding #2)
- Fix newSession() docstring to accurately describe cloneSession()
  semantics (fixes Finding #3)
- Fix docstring nits: missing article, list/tuple, inferring the schema,
  table names as strings, streams wording
- Add comment explaining catalog.listTables() over SHOW TABLES
- Reorganize tests: add test_sql_context.py with mixin + classic runner,
  test_parity_sql_context.py for Connect parity, slim test_connect_context.py
  to Connect-specific tests only

Co-authored-by: DB Tsai <db.tsai@databricks.com>
… modules in CI

- Add _instantiatedContext = None override on ConnectHiveContext to prevent
  Python MRO from finding the parent SQLContext's cached instance; without
  this, HiveContext.getOrCreate() silently returns a SQLContext when called
  after SQLContext.getOrCreate() instead of raising PySparkNotImplementedError
- Register test_parity_sql_context and test_sql_context in modules.py so
  the shared mixin tests actually run in CI

Co-authored-by: DB Tsai <db.tsai@databricks.com>
…ass name

Using SQLContext._instantiatedContext in __init__ would set the parent's
cache for any subclass that reaches __init__ without overriding it. Using
type(self) ensures each class maintains its own singleton correctly.

Co-authored-by: DB Tsai <db.tsai@databricks.com>
…Connect SQLContext

The `.. deprecated:: 2.3.0` note implied spark.udf.registerJavaFunction is a
working alternative in Connect, but the method simply raises
PySparkNotImplementedError. The annotation is from the classic implementation
and should not appear here.

Co-authored-by: DB Tsai <db.tsai@databricks.com>
…LContext

tables(): t.namespace[-1] instead of t.namespace[0] — for multi-level
namespaces (e.g. Unity Catalog ['catalog', 'db']), the database is the
innermost component (last element), not the first. Using index 0 was
silently returning the catalog name instead of the database name.

stop(): clear connect.SQLContext._instantiatedContext when the session
being stopped is the one wrapped by the cached context. The classic
SparkSession.stop() already does this (session.py:2158); the Connect
variant was missing the equivalent cleanup, causing getOrCreate() to
return a stale context wrapping a closed session after stop().

Co-authored-by: DB Tsai <db.tsai@databricks.com>
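
The two singleton-cache fixes in the commit notes above rest on how Python resolves class attributes. The sketch below uses invented class names and is only meant to show the lookup rules: a subclass that does not define `_instantiatedContext` sees the parent's cached value, and assigning through `type(self)` writes to the instance's own class.

```python
class Base:
    _instantiatedContext = None

    def __init__(self):
        # type(self) keeps each class's cache separate; writing to
        # Base._instantiatedContext would always set the parent's slot.
        type(self)._instantiatedContext = self


class SharedCacheChild(Base):
    # No override: attribute lookup falls through to Base via the MRO.
    pass


class OwnCacheChild(Base):
    # Shadows the parent's attribute, so the parent's cache is never visible here.
    _instantiatedContext = None
```

After `Base()` is constructed, `SharedCacheChild._instantiatedContext` resolves to that `Base` instance while `OwnCacheChild._instantiatedContext` stays `None`. That lookup behavior is what the commit describes for a `HiveContext` whose cache attribute was not overridden.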
Contributor

@cloud-fan cloud-fan left a comment

Re-review — 11 addressed, 1 remaining, 5 new (5 new = 3 newly introduced, 2 late catches). 1 blocking, 1 non-blocking, 3 nits.

Thanks for the thorough rework. Findings #1 through #4, the doc nits, and the @zhengruifeng test reorg all look correctly addressed. One gap surfaced from the reorg itself, plus a few nits.

Correctness (1)

  • context.py:168: the is_remote_only() dispatch and the Finding #2 fix (HiveContext.getOrCreate() raising) are now untested — the reorg dropped test_getOrCreate_emits_deprecation_warning and test_hive_context_get_or_create_raises without replacement. See inline.

Design / architecture (1)

  • connect/context.py:49: class declared internal but still exported in __all__. See inline.

Nits: 3 minor items (see inline comments).

PR description suggestions

  • The opening sentence is still broken ("…implementation in continues to work transparently…") — flagged last round, still unfixed. Add the missing clause (e.g. "implementation in pyspark.sql.connect.context so that legacy code using SQLContext continues to work…").
  • The "How was this patch tested?" section claims Connect-specific cases include "deprecation warnings on __init__ and getOrCreate", but the getOrCreate test was dropped in the reorg and no longer exists — restore it (see the inline comment) or update the text.

"Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.",
FutureWarning,
)
if is_remote_only():
Contributor

This is_remote_only() dispatch is the PR's central user-facing behavior, and the Finding #2 fix routes HiveContext.getOrCreate() here to raise — but neither is tested. The reorg dropped test_getOrCreate_emits_deprecation_warning and test_hive_context_get_or_create_raises (both present at the prior review commit 6af283f, the latter added specifically to cover Finding #2) without replacing them. is_remote_only() is mockable (pyspark.util.is_remote_only), so a ReusedConnectTestCase can patch it True and assert (1) SQLContext.getOrCreate() returns a Connect-backed context and (2) HiveContext.getOrCreate() raises PySparkNotImplementedError.

Member Author

Fixed in e33dd66 — added test_getOrCreate_emits_deprecation_and_returns_connect_context and test_hive_context_getOrCreate_raises to test_connect_context.py, both using unittest.mock.patch('pyspark.util.is_remote_only', return_value=True) to exercise the dispatch path.
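
The patching approach can be sketched with the standard library alone. Here `util` is a `SimpleNamespace` standing in for the `pyspark.util` module, the two `*_get_or_create` functions are hypothetical reductions of the dispatch, and `NotSupported` stands in for `PySparkNotImplementedError`. The real tests patch the dotted name `'pyspark.util.is_remote_only'` instead of using `patch.object`.

```python
import unittest
from types import SimpleNamespace
from unittest import mock

# Stand-in for the pyspark.util module.
util = SimpleNamespace(is_remote_only=lambda: False)


class NotSupported(Exception):
    """Stand-in for PySparkNotImplementedError."""


def sql_context_get_or_create():
    return "connect-context" if util.is_remote_only() else "classic-context"


def hive_context_get_or_create():
    if util.is_remote_only():
        raise NotSupported("HiveContext is not supported in Spark Connect")
    return "classic-hive-context"


class DispatchTests(unittest.TestCase):
    def test_sql_context_dispatches_to_connect(self):
        with mock.patch.object(util, "is_remote_only", return_value=True):
            self.assertEqual(sql_context_get_or_create(), "connect-context")

    def test_hive_context_raises(self):
        with mock.patch.object(util, "is_remote_only", return_value=True):
            with self.assertRaises(NotSupported):
                hive_context_get_or_create()
```

The patch is scoped to the `with` block, so the original function is restored afterwards and other tests in the same process keep seeing the unpatched value.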


This comment was generated with GitHub MCP.

Comment thread python/pyspark/sql/connect/context.py Outdated
from pyspark.sql.connect.udtf import UDTFRegistration
from pyspark.sql._typing import UserDefinedFunctionLike

__all__ = ["SQLContext", "HiveContext"]
Contributor

Finding #1 made these classes an internal implementation detail, but they're still exported in __all__, which signals public API. If they're meant to be internal, drop them from __all__ — the classic dispatch in context.py imports the module attribute directly (getattr(_connect_context, cls.__name__, ...)), so it doesn't rely on __all__. Non-blocking.

Member Author

Fixed in e33dd66 — removed __all__ entirely and replaced it with a comment marking the module as an internal implementation detail.


This comment was generated with GitHub MCP.

from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StringType, StructField
from pyspark.testing.sqlutils import ReusedSQLTestCase
from pyspark.sql.tests.test_sql_context import SQLContextTestsMixin # noqa: F401
Contributor

This import is dead — SQLContextTestsMixin isn't a TestCase and isn't subclassed in this file, so it's never collected or run (hence the # noqa: F401). Looks like a leftover from the reorg; please remove it.

Member Author

Fixed in e33dd66 — removed the dead import.


This comment was generated with GitHub MCP.

Comment thread python/pyspark/sql/connect/context.py Outdated
with the current session's configuration, temporary views, and registered functions
copied into it.

.. versionadded:: 4.0.0
Contributor

Confirming @Yicong-Huang's earlier question: the repo is at 5.0.0.dev0 and 4.0.0 is already released, so these brand-new methods can't be versionadded:: 4.0.0 — they should be 5.0.0. Applies to every .. versionadded:: 4.0.0 in this file (lines 107, 120, 133, 141, 153, 172, 197, 210, …).

Member Author

Fixed in e33dd66 — changed all 22 occurrences of versionadded:: 4.0.0 to versionadded:: 5.0.0 in connect/context.py.


This comment was generated with GitHub MCP.

Member Author

Correction in 634fddb — changed to 4.2.0 (this PR targets the 4.2 release, not 5.0).


This comment was generated with GitHub MCP.

Member Author

Changed to versionadded:: 4.2.0, as Claude Code doesn't have context about our new release model.

Comment thread python/pyspark/sql/context.py Outdated

session = SparkSession._getActiveSessionOrCreate()
# Route to the Connect counterpart so subclasses (e.g. HiveContext) are handled
# correctly: ConnectHiveContext._from_session raises PySparkNotImplementedError.
Contributor

There's no class named ConnectHiveContext — the Connect-side class is HiveContext (in pyspark.sql.connect.context), and the routing reaches it via cls.__name__.

Suggested change
# correctly: ConnectHiveContext._from_session raises PySparkNotImplementedError.
# correctly: the Connect HiveContext._from_session raises PySparkNotImplementedError.

Member Author

Fixed in e33dd66 — applied the suggestion exactly.


This comment was generated with GitHub MCP.

dbtsai added 2 commits May 29, 2026 16:15
…ate tests

- Remove __all__ from connect/context.py (internal module, not public API)
- Change all versionadded:: 4.0.0 to 5.0.0 in connect/context.py
- Fix comment: s/ConnectHiveContext/the Connect HiveContext/ in context.py
- Remove dead SQLContextTestsMixin import from test_context.py
- Add test_getOrCreate_emits_deprecation_and_returns_connect_context
  and test_hive_context_getOrCreate_raises to test_connect_context.py,
  both using unittest.mock.patch to mock is_remote_only

Co-authored-by: DB Tsai
This PR targets the 4.2 release; 5.0.0 was incorrect.

Co-authored-by: DB Tsai
Contributor

@Yicong-Huang Yicong-Huang left a comment

LGTM. Thanks for the thorough rework. All my earlier change requests are addressed. Approving with a few non-blocking nits left inline.

return ctx

@classmethod
def _get_or_create_from_session(cls, sparkSession: "SparkSession") -> "SQLContext":
Contributor

The cached instance is never validated against the currently active session; it only resets via the stop() hook. If a different Connect session becomes active after the first getOrCreate(), this keeps returning a context bound to the old session. Classic _get_or_create re-creates the context when the cached one is dead. Is this intentional?
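
One possible shape for a validated cache, offered only as a sketch of the concern and not as the PR's code: `Session`, `Context`, `is_stopped`, and `get_or_create_from_session` are all hypothetical names, and whether the real method should behave this way is the open question above.

```python
from typing import Optional


class Session:
    """Hypothetical stand-in for a Connect SparkSession."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.is_stopped = False


class Context:
    """Hypothetical stand-in for the Connect SQLContext."""

    _cached: Optional["Context"] = None

    def __init__(self, session: Session) -> None:
        self.session = session

    @classmethod
    def get_or_create_from_session(cls, session: Session) -> "Context":
        cached = cls._cached
        # Reuse the cached context only when it is bound to this exact,
        # still-running session; otherwise build and cache a new one.
        if (
            cached is None
            or cached.session is not session
            or cached.session.is_stopped
        ):
            cached = cls(session)
            cls._cached = cached
        return cached


first = Session("first")
second = Session("second")
ctx_a = Context.get_or_create_from_session(first)
ctx_b = Context.get_or_create_from_session(first)   # same session: reused
ctx_c = Context.get_or_create_from_session(second)  # new session: rebuilt
```

Comparing by identity keeps the check cheap, at the cost of rebuilding the context whenever the active session object changes.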

def test_readStream_is_available(self) -> None:
    self.assertIsNotNone(self._make_ctx().readStream)

def test_streams_is_available(self) -> None:
Contributor

The mixin has no coverage for udf, udtf, registerFunction, or createExternalTable, yet the PR description claims udf/udtf/registerFunction are covered. Please consider adding the cases or correcting the description.
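
The missing cases could follow the same availability pattern as the existing mixin tests. The sketch below assumes a `_make_ctx()` hook like the one visible in the diff; `FakeContext`, `ContextCoverageMixin`, and `FakeContextTests` are hypothetical names, and the fake context only imitates the attributes named in this comment so the block can run without Spark. createExternalTable is left out because it needs a real catalog.

```python
import unittest


class FakeContext:
    """Hypothetical stand-in exposing udf, udtf, and registerFunction."""

    def __init__(self):
        self.udf = object()
        self.udtf = object()
        self._functions = {}

    def registerFunction(self, name, f):
        # The fake simply records and returns the callable.
        self._functions[name] = f
        return f


class ContextCoverageMixin:
    """Shared cases; each concrete test class supplies _make_ctx()."""

    def _make_ctx(self):
        raise NotImplementedError

    def test_udf_is_available(self):
        self.assertIsNotNone(self._make_ctx().udf)

    def test_udtf_is_available(self):
        self.assertIsNotNone(self._make_ctx().udtf)

    def test_registerFunction_returns_callable(self):
        ctx = self._make_ctx()
        plus_one = ctx.registerFunction("plus_one", lambda x: x + 1)
        self.assertEqual(plus_one(1), 2)


class FakeContextTests(ContextCoverageMixin, unittest.TestCase):
    def _make_ctx(self):
        return FakeContext()
```

Because the mixin itself is not a TestCase, only the concrete subclass is collected, which is what lets one set of cases run against more than one implementation.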


5 participants