[SPARK-56661] Introducing logical and physical planning nodes for language-agnostic Spark UDFs #55768

Closed
sven-weber-db wants to merge 3 commits into apache:master from sven-weber-db:sven-weber_data/spark-56661-catalyst-and-udf

@sven-weber-db
Contributor

What changes were proposed in this pull request?

This PR introduces new logical and physical Catalyst nodes for language-agnostic User Defined Functions (UDF) as part of SPIP SPARK-55278, which proposes language-agnostic UDFs.

As a first step towards the goal of language-agnostic UDFs, we want to target mapPartition UDFs like pyspark.sql.DataFrame.mapInArrow or pyspark.RDD.mapPartitions. The overarching goal is to deprecate the current, language-specific Catalyst nodes (like mapInArrow). However, for now, the new nodes will exist in addition to the old ones until the new framework has reached maturity.

In summary, this PR introduces:

  • A new Catalyst Expression, ExternalUDFExpression, which captures language-agnostic UDF properties (payload, name, etc.)
  • A new Catalyst logical node, ExternalUDF, which serves as a base class for all language-agnostic UDF nodes
  • A new Catalyst logical node, MapPartitionExternalUDF, which is the new, language-agnostic map partition node
  • Catalyst physical nodes for both logical nodes
  • WorkerDispatcherManager - A manager class which manages UDF Dispatchers based on the target UDFWorkerSpecification

None of the changes introduced above are currently consumed in Spark.

Why are the changes needed?

This is the first step toward language-agnostic UDF execution for Spark. Existing physical and logical planning nodes need to be replaced eventually to achieve this goal as they make language-specific assumptions.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests were added.

Was this patch authored or co-authored using generative AI tooling?

Partially. However, the code was manually reviewed and adjusted.

Comment thread sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala Outdated
@sven-weber-db sven-weber-db force-pushed the sven-weber_data/spark-56661-catalyst-and-udf branch from 5a35dee to df7bde7 Compare May 11, 2026 14:28
* Creates a [[WorkerSession]] via [[SparkEnv#getExternalUDFDispatcher]]
* and registers cancellation on task failure. The provided function
* receives the session and must return the result iterator. Moreover,
* the function MUST close the session once all input data has been sent.
Contributor

"all input data has been sent"
What does this mean? Are you trying to say that all UDF results have been consumed?

Contributor Author

No, we should call close once all the input rows have been sent to the UDF. This is the signal that no more input is to be expected, and the UDF can finish processing after it has consumed all of this data. This is aligned with what we discussed offline earlier today.

I changed the comment slightly to make this point clearer. Could you have a look at this new comment?
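A minimal sketch of the contract described above: the caller sends every input row, closes the input side, and only then returns the results. `SketchSession`, `UpperCaseSession`, and `CloseContract` are hypothetical names for illustration and are not the `WorkerSession` API from this PR. The toy session buffers rows in memory, whereas a real worker would presumably stream results while input is still arriving.

```scala
// Hypothetical stand-in for a worker session; the real WorkerSession API
// introduced by this PR may look different.
trait SketchSession {
  def send(row: String): Unit      // push one input row to the UDF
  def close(): Unit                // signal that no more input will follow
  def results(): Iterator[String]  // rows produced by the UDF
}

// Toy in-memory session: results are only complete after close().
final class UpperCaseSession extends SketchSession {
  private val buffer = scala.collection.mutable.ArrayBuffer.empty[String]
  private var closed = false

  def send(row: String): Unit = {
    require(!closed, "cannot send after close()")
    buffer += row
  }

  def close(): Unit = {
    closed = true
  }

  def results(): Iterator[String] = {
    require(closed, "input must be closed before results are complete")
    buffer.iterator.map(_.toUpperCase)
  }
}

object CloseContract {
  // The caller sends all input rows, then closes the input side.
  // close() means "no more input", not "all results consumed".
  def run(session: SketchSession, input: Iterator[String]): Iterator[String] = {
    input.foreach(session.send)
    session.close()
    session.results()
  }
}
```

In this sketch, calling `results()` before `close()` fails, which mirrors the idea that the UDF can only finish once it knows the input is complete.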

val session = dispatcher.createSession(securityScope)

// Make sure to cancel the session, if the task fails
taskContext.addTaskFailureListener { (_, _) =>
Contributor

@haiyangsun-db haiyangsun-db May 12, 2026

We may need to add another completion listener as well to call session.close().

The reason is that Spark doesn't have to consume the whole result iterator, e.g., in the case of 'limit'. So if we rely on the iterator's last element being consumed, we may miss the close.

Contributor

This seems unused. Either:

  1. do not introduce this class in this PR, or
  2. use it in MapPartitionsExternalUDFExec, but pass a function f that throws an unimplemented error.

Contributor Author

> We may need to add another completion listener as well to call session.close()

As discussed offline, this is actually not needed. In case of early termination of the task (e.g., through a limit), we cancel the execution instead. The close() call on the session should be done by the user of this function when all input has been sent to the UDF.

> This seems unused, either

Actually, ExternalUDFExec is used as the parent class of MapPartitionsExternalUDFExec. However, I agree with your point that we could make the future use much clearer by calling withUDFWorkerSession in doExecute of MapPartitionsExternalUDFExec. I changed the PR to do exactly this and then throw the NotImplementedError when we have received the session.
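The division of responsibilities discussed in this thread could be sketched roughly as follows: the helper wires cancellation to task failure, while closing the input side is left to the function it is given. `SketchTaskContext`, `CancellableSession`, and `WithSession` are hypothetical stand-ins and not Spark's `TaskContext` or the classes from this PR. The sketch only covers the failure listener visible in the diff; how early termination (e.g., a limit) triggers cancellation is not shown here.

```scala
// Hypothetical stand-ins; not Spark's TaskContext or this PR's WorkerSession.
final class SketchTaskContext {
  private var failureListeners = List.empty[Throwable => Unit]

  def addFailureListener(f: Throwable => Unit): Unit = {
    failureListeners = f :: failureListeners
  }

  // Simulates a task failure by notifying all registered listeners.
  def fail(t: Throwable): Unit = failureListeners.foreach(listener => listener(t))
}

final class CancellableSession {
  var cancelled = false
  var closed = false

  def cancel(): Unit = { cancelled = true }
  def close(): Unit = { closed = true }
}

object WithSession {
  // Registers cancellation on task failure and hands the session to `f`.
  // Closing the session after all input was sent is the job of `f`.
  def withSession[T](ctx: SketchTaskContext, session: CancellableSession)(
      f: CancellableSession => Iterator[T]): Iterator[T] = {
    ctx.addFailureListener(_ => session.cancel())
    f(session)
  }
}
```

A caller would send its input inside `f`, call `close()`, and return the result iterator; a failure reported through the context cancels the session regardless of how much of the iterator was consumed.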

DirectUnixSocketWorkerDispatcher, DirectWorkerProcess,
DirectWorkerSession}

/**
Contributor

@haiyangsun-db haiyangsun-db May 12, 2026

Any chance we can reuse the testing dispatcher defined in https://github.com/apache/spark/blob/master/udf/worker/core/src/test/scala/org/apache/spark/udf/worker/core/DirectWorkerDispatcherSuite.scala (it can be updated if necessary)? It is supposed to be agnostic to the worker spec.

That way we reduce duplication, and in case of API changes we only need to update one place.

Contributor Author

Yes, good idea! I moved the TestDispatcher into a test-only shared file that can be reused here. There are still some parts of the implementation that remain in this suite, as this test relies on an actual socket connection, and the test in /udf/ only checks for file existence. It would be weird to move the logic from this test into /udf/ as well, as this logic is not consumed in the /udf package.

@sven-weber-db sven-weber-db force-pushed the sven-weber_data/spark-56661-catalyst-and-udf branch from df7bde7 to 2605573 Compare May 13, 2026 11:23
@sven-weber-db sven-weber-db marked this pull request as ready for review May 13, 2026 11:25
* Dispatcher factory to generate UDF worker dispatchers
* using the new UDF framework proposed in SPARK-55278
*/
private val udfDispatcherManager: UDFDispatcherManager =
Contributor

Do we need to create this on the driver as well? In general the pattern in SparkEnv is that we initialize variables.

Contributor Author

Good point. On the driver, this would only be required when running a single-node cluster. I changed the val to be lazily initialized. This way, we will only acquire the resources that are actually needed. This approach also follows the current implementation of pythonWorkers. Do you think this is better?

> In general the pattern in SparkEnv is that we initialize variables.
Could you elaborate on this statement? The udfDispatcherManager is initialized in the code above. Should we initialize it directly instead of moving the initialization logic into a separate function?

My reasoning for the existence of createUDFDispatcherManager() was that this approach makes it easier to exchange the implementation with a different UDFDispatcherManager, e.g., depending on some Spark conf value.
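To illustrate the reasoning above, here is a small sketch of lazy initialization through a factory method whose result depends on a configuration value. `SketchEnv`, `SketchDispatcherManager`, the two manager classes, and the `sketch.udf.manager` key are all made up for this example; only the name `createUDFDispatcherManager` is taken from the discussion, and its real signature may differ.

```scala
// Hypothetical manager types; not the UDFDispatcherManager from this PR.
trait SketchDispatcherManager { def name: String }
final class DefaultManager extends SketchDispatcherManager { val name = "default" }
final class AlternativeManager extends SketchDispatcherManager { val name = "alternative" }

final class SketchEnv(conf: Map[String, String]) {
  // Counts factory invocations so the laziness is observable.
  var created = 0

  // Factory method: one place to swap the implementation, here based on
  // a made-up configuration key.
  private def createUDFDispatcherManager(): SketchDispatcherManager = {
    created += 1
    conf.get("sketch.udf.manager") match {
      case Some("alternative") => new AlternativeManager
      case _ => new DefaultManager
    }
  }

  // Nothing is created until the manager is first requested, so a driver
  // that never runs UDFs never acquires the resources.
  lazy val udfDispatcherManager: SketchDispatcherManager =
    createUDFDispatcherManager()
}
```

With this shape, constructing `SketchEnv` is free of side effects, and the first access to `udfDispatcherManager` runs the factory exactly once.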

@sven-weber-db sven-weber-db changed the title from "[WIP][SPARK-56661] Introducing logical and physical planning nodes for language-agnostic Spark UDFs" to "[SPARK-56661] Introducing logical and physical planning nodes for language-agnostic Spark UDFs" May 18, 2026
@sven-weber-db sven-weber-db force-pushed the sven-weber_data/spark-56661-catalyst-and-udf branch from 2605573 to 07264e8 Compare May 18, 2026 14:04
@sven-weber-db sven-weber-db requested a review from hvanhovell May 18, 2026 14:04
sparkSession,
val output = toAttributes(func.dataType.asInstanceOf[StructType])

if (SQLConf.get.getConf(SQLConf.UNIFIED_UDF_EXECUTION_ENABLED)) {
Contributor

I know this is early days, but can you do me a favor here? Can we define an interface for UDF planning, with one implementation for the current approach and one for the new one? This way we need only one if/else statement, and we can keep the implementations separate...

Contributor Author

Yes, as discussed offline: I introduced a new planning node, which captures the whole UDF planning. There is a single branch when building the session state, which decides whether the new or the legacy planning will be used.
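The "one interface, two implementations, one branch" idea could look roughly like the sketch below. All names here (`UDFPlanner`, `LegacyUDFPlanner`, `UnifiedUDFPlanner`, `SketchSessionState`, and the plan case classes) are hypothetical and chosen for illustration; the planning node actually added in the PR may be structured differently.

```scala
// Hypothetical plan results, standing in for Catalyst nodes.
sealed trait SketchPlan
final case class LegacyMapPartitionsPlan(udfName: String) extends SketchPlan
final case class ExternalUDFPlan(udfName: String) extends SketchPlan

// One interface for UDF planning.
trait UDFPlanner {
  def planMapPartitions(udfName: String): SketchPlan
}

// Current, language-specific planning.
object LegacyUDFPlanner extends UDFPlanner {
  def planMapPartitions(udfName: String): SketchPlan =
    LegacyMapPartitionsPlan(udfName)
}

// New, language-agnostic planning.
object UnifiedUDFPlanner extends UDFPlanner {
  def planMapPartitions(udfName: String): SketchPlan =
    ExternalUDFPlan(udfName)
}

object SketchSessionState {
  // The only if/else: chosen once when the session state is built.
  def plannerFor(unifiedEnabled: Boolean): UDFPlanner =
    if (unifiedEnabled) UnifiedUDFPlanner else LegacyUDFPlanner
}
```

Call sites then depend only on `UDFPlanner`, so the flag check does not have to be repeated at every place that plans a UDF.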

Comment thread core/src/main/scala/org/apache/spark/SparkEnv.scala Outdated
case class MapPartitionsExternalUDFExec(
workerSpec: UDFWorkerSpecification,
function: ExternalUserDefinedFunction,
isBarrier: Boolean,
Contributor

Is this supported? It does not seem like it.

Contributor

We may need to support it as PySpark does, but we can probably start without it and keep the field for future use.

Contributor Author

The parameter is defined - as per earlier review comments - but not yet consumed. It will definitely be required in the future.

// TODO [SPARK-55278]: Stream rows to/from the worker
// via session.process().
// scalastyle:off throwerror
throw new NotImplementedError("doExecute() is not yet implemented.")
Contributor

Is this waiting for #55657?

Contributor

Yeah, and also for a gRPC-based session/dispatcher implementation on top of that.

Contributor Author

Yes, this is waiting for the WorkerSession to be implemented end-to-end for the DirectDispatcher.

@Evolving
public interface InsertSummary extends WriteSummary {
@Experimental
trait UDFDispatcherFactory {
Contributor

Do we have an implementation of this?

Contributor Author

Not yet. Implementing this trait requires an end-to-end implementation of the Dispatcher. However, the only Dispatcher to exist at the moment is the DirectWorkerDispatcher, which still has abstract, non-implemented functions for session creation. We need to wait for @haiyangsun-db's second PR to land before we can implement this trait.

Contributor

@hvanhovell hvanhovell left a comment

Looks pretty good. Let me know how you want to proceed here.

@sven-weber-db sven-weber-db force-pushed the sven-weber_data/spark-56661-catalyst-and-udf branch 2 times, most recently from 80615e1 to 8ef226a Compare May 21, 2026 13:40
Contributor Author

This change is required due to the new dependencies in udf/worker that we are now consuming.

@sven-weber-db sven-weber-db requested a review from hvanhovell May 21, 2026 13:41
@sven-weber-db
Contributor Author

Hey @hvanhovell, thank you very much for your review. I addressed all of your comments. Could you have another look? Happy to adjust the PR further if there are any more questions or anything unclear.

synchronized {
// Get or Else synchronized to protect
// against concurrent creation requests.
udfDispatcherManager.getOrElse {
Contributor

Clever...
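For readers without the full diff, the pattern in the snippet above can be sketched as a synchronized get-or-create around an `Option`. `ManagerHolder` and `getOrCreate` are hypothetical names for this illustration; the actual field and method in `SparkEnv` may differ.

```scala
// Hypothetical holder; illustrates a synchronized get-or-create.
final class ManagerHolder[T](create: () => T) {
  private var instance: Option[T] = None

  // The whole check-then-create step runs under the lock, so concurrent
  // callers cannot both observe None and create two instances.
  def getOrCreate(): T = synchronized {
    instance.getOrElse {
      val created = create()
      instance = Some(created)
      created
    }
  }
}
```

Because both the read of `instance` and the assignment happen inside `synchronized`, the factory passed as `create` runs at most once even when several threads request the manager at the same time.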

dispatcher.close()
} catch {
case e: Exception =>
workerLogger.warn(
Contributor

Is there any value in collecting all errors and throwing a combined error (using addSuppressed) if there are multiple? Or do you think logging is good enough?

Contributor Author

This function will be called on Spark shutdown in the SparkEnv.close() function. If we throw here, other cleanup code will not run, and Spark won't shut down or clean up properly. It is probably better to log here and continue with the other cleanup steps than to abort the whole shutdown procedure. What do you think?
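The two options weighed in this thread can be compared in a small sketch. `CloseAll` and both method names are hypothetical; `Throwable.addSuppressed` is the standard JDK method the reviewer refers to. Note that the second variant also closes every resource before throwing; the concern raised above is about what the caller does once an exception escapes.

```scala
object CloseAll {
  // Variant favoured in the thread: log each failure and keep going,
  // so the caller's remaining shutdown steps still run.
  def closeAllLogging(resources: Seq[AutoCloseable], log: String => Unit): Unit = {
    resources.foreach { r =>
      try r.close()
      catch {
        case e: Exception => log(s"Failed to close resource: ${e.getMessage}")
      }
    }
  }

  // Alternative raised in review: close everything, then throw the first
  // error with all later ones attached via addSuppressed.
  def closeAllCollecting(resources: Seq[AutoCloseable]): Unit = {
    var first: Option[Exception] = None
    resources.foreach { r =>
      try r.close()
      catch {
        case e: Exception =>
          first match {
            case Some(f) => f.addSuppressed(e)
            case None => first = Some(e)
          }
      }
    }
    first.foreach(e => throw e)
  }
}
```

With the collecting variant, a caller such as a shutdown routine would have to catch the combined exception itself to avoid skipping later cleanup, which is the trade-off described in the reply above.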

* A test [[UnixSocketWorkerConnection]] that opens a real Unix
* domain socket channel to the worker.
*/
private class RealSocketConnection(
Contributor

These socket connection classes could potentially be moved to the /udf/worker module, as they are not specific to Spark and we might reuse them in other unit tests in that module.

Contributor Author

I had the same thought. However, in the current state it's a bit weird: the class would exist in the UDF package without being used there, while being consumed in another package.

Contributor

@hvanhovell hvanhovell left a comment

LGTM

@hvanhovell
Contributor

Merging this as soon as CI completes...

@sven-weber-db sven-weber-db force-pushed the sven-weber_data/spark-56661-catalyst-and-udf branch from 8ef226a to de26f00 Compare May 21, 2026 16:25
@sven-weber-db sven-weber-db force-pushed the sven-weber_data/spark-56661-catalyst-and-udf branch from de26f00 to 50881be Compare May 21, 2026 16:25
asf-gitbox-commits pushed a commit that referenced this pull request May 22, 2026
…guage-agnostic Spark UDFs

### What changes were proposed in this pull request?

This PR introduces new logical and physical Catalyst nodes for language-agnostic User Defined Functions (UDF) as part of [SPIP SPARK-55278](https://issues.apache.org/jira/browse/SPARK-55278), which proposes language-agnostic UDFs.

As a first step towards the goal of language-agnostic UDFs, we want to target mapPartition UDFs like `pyspark.sql.DataFrame.mapInArrow` or `pyspark.RDD.mapPartitions`. The overarching goal is to deprecate the current, language-specific Catalyst nodes (like `mapInArrow`). However, for now, the new nodes will exist in addition to the old ones until the new framework has reached maturity.

In summary, this PR introduces:

- A new Catalyst Expression, `ExternalUDFExpression`, which captures language-agnostic UDF properties (payload, name, etc.)
- A new Catalyst logical node, `ExternalUDF`, which serves as a base class for all language-agnostic UDF nodes
- A new Catalyst logical node, `MapPartitionExternalUDF`, which is the new, language-agnostic map partition node
- Catalyst physical nodes for both logical nodes
- `WorkerDispatcherManager` - A manager class which manages UDF Dispatchers based on the target `UDFWorkerSpecification`

None of the changes introduced above are currently consumed in Spark.

### Why are the changes needed?

This is the first step toward language-agnostic UDF execution for Spark. Existing physical and logical planning nodes need to be replaced eventually to achieve this goal as they make language-specific assumptions.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New unit-tests were added.

### Was this patch authored or co-authored using generative AI tooling?

Partially. However, the code was manually reviewed and adjusted.

Closes #55768 from sven-weber-db/sven-weber_data/spark-56661-catalyst-and-udf.

Authored-by: Sven Weber <sven.weber@databricks.com>
Signed-off-by: Herman van Hövell <herman@databricks.com>
(cherry picked from commit c2057a3)
Signed-off-by: Herman van Hövell <herman@databricks.com>