[SPARK-48756][CONNECT][PYTHON] Support for df.debug() in Connect Mode #47153
grundprinzip wants to merge 2 commits into
|
|
||
| =========== | ||
| Spark Connect - Execution Info and Debug | ||
| =========== |
The =========== lines should match the length of the title; otherwise Sphinx emits a warning about it.
| the usage. In addition, it makes sure that the captured metrics are properly collected | ||
| as part of the execution info. | ||
|
|
||
| .. versionadded:: 4.0.0 |
| .. versionadded:: 4.0.0 | |
| .. versionadded:: 4.0.0 | |
Otherwise the HTML output is malformed
| from pyspark.errors import PySparkValueError | ||
| from pyspark.errors import PySparkValueError, PySparkTypeError | ||
| from pyspark.sql import Observation, Column | ||
|
|
|
|
||
| @classmethod | ||
| def count_values(cls) -> "DataDebugOp": | ||
| return DataDebugOp("count_values", F.count(F.lit(1)).alias("count_values")) |
So this is a wrapper around the observe API. I don't think it simplifies much compared with the existing use case:
observation = Observation("my metrics")
observed_df = df.observe(observation, count(lit(1)).alias("count"), max(col("age")))
observation.get
and this won't work for streaming.
itholic left a comment
Good catch. Let me address _capture_call_site as well
| self._execution_info.setObservations(self._plan.observations) | ||
| return self._execution_info | ||
|
|
||
| def debug(self, *other: List["DataDebugOp"]) -> "DataFrame": |
If the usage is:
spark.range(100).debug(DataDebugOp.max_value("id"), DataDebugOp.count_null_values("id"))
instead of:
spark.range(100).debug([DataDebugOp.max_value("id"), DataDebugOp.count_null_values("id")])
| def debug(self, *other: List["DataDebugOp"]) -> "DataFrame": | |
| def debug(self, *other: "DataDebugOp") -> "DataFrame": |
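The difference between the two annotations can be illustrated with a small, Spark-free sketch. The DataDebugOp class and the two function names below are hypothetical stand-ins for illustration only, not the code from this PR.

```python
from typing import List, Tuple


class DataDebugOp:
    """Hypothetical stand-in for the PR's DataDebugOp; it only stores a name."""

    def __init__(self, name: str) -> None:
        self.name = name


def debug_varargs(*other: DataDebugOp) -> Tuple[str, ...]:
    # `*other: DataDebugOp` annotates each positional argument, so inside the
    # function `other` is a Tuple[DataDebugOp, ...].
    return tuple(op.name for op in other)


def debug_lists(*other: List[DataDebugOp]) -> Tuple[str, ...]:
    # `*other: List[DataDebugOp]` says every positional argument is itself a
    # list, so `other` is a Tuple[List[DataDebugOp], ...] and must be flattened.
    return tuple(op.name for ops in other for op in ops)
```

With this sketch, debug_varargs(a, b) matches the first call style and debug_lists([a, b]) matches the second; a type checker would reject debug_lists(a, b).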
| """ | ||
| ... | ||
|
|
||
| def debug(self) -> "DataFrame": |
The signatures should all be the same:
| def debug(self) -> "DataFrame": | |
| def debug(self, *other: "DataDebugOp") -> "DataFrame": |
| message_parameters={"member": "queryExecution"}, | ||
| ) | ||
|
|
||
| def debug(self) -> "DataFrame": |
ditto.
| def debug(self) -> "DataFrame": | |
| def debug(self, *other: "DataDebugOp") -> "DataFrame": |
| data debug operations. | ||
| """ | ||
|
|
||
| @classmethod |
nit: @staticmethod if cls is not used?
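As a rough illustration of the trade-off behind this nit, the sketch below contrasts the two decorators. All class and method names are hypothetical stand-ins, not the PR's implementation; the point is only that @classmethod pays off when cls is actually used.

```python
class DataDebugOp:
    """Hypothetical stand-in for the PR's DataDebugOp; it only stores a name."""

    def __init__(self, name: str) -> None:
        self.name = name

    @classmethod
    def count_values(cls) -> "DataDebugOp":
        # Uses `cls`, so a subclass gets an instance of its own type.
        return cls("count_values")

    @staticmethod
    def count_values_static() -> "DataDebugOp":
        # Never touches the class object, so @staticmethod is sufficient;
        # it always builds the base class.
        return DataDebugOp("count_values")


class TracingDebugOp(DataDebugOp):
    """Hypothetical subclass used to show the difference."""
```

If the factory hardcodes DataDebugOp(...) as in the hunk above, the two decorators behave the same and @staticmethod states the intent more directly.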
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
At times users want to evaluate the properties of their data flow graph and understand how certain transformations behave. Today this is more complex than necessary. Even though the df.observe() API has been around in Spark since Spark 3.3, its usage is not widespread.
To give users a more visible API for understanding the data flow execution in Spark, this patch adds a new method to the DataFrame API called df.debug(). Debug will by default do the following: register an observation named debug:<uuid> and attach a count(1) observation to it.
After the execution, users can access the observation through the execution info property of the DataFrame.
The debug String contains the reference to the observation, the call site and the values.
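For comparison, the default behaviour described above can be approximated today with the existing df.observe() API. The following is a minimal sketch, not the PR's implementation: the helper names debug_observation_name and observe_row_count are hypothetical, and the debug:<uuid> naming simply follows the description above.

```python
import uuid
from typing import Any, Tuple


def debug_observation_name() -> str:
    # Assumption: mirrors the "debug:<uuid>" naming from the PR description.
    return f"debug:{uuid.uuid4()}"


def observe_row_count(df: Any) -> Tuple[Any, Any]:
    """Attach a count(1) observation to `df` using the existing observe API."""
    # Imported lazily so the naming helper above works without Spark installed.
    from pyspark.sql import Observation
    from pyspark.sql import functions as F

    observation = Observation(debug_observation_name())
    observed_df = df.observe(
        observation, F.count(F.lit(1)).alias("count_values")
    )
    return observed_df, observation
```

After an action such as observed_df.collect() has run, observation.get returns the collected metrics as a dictionary keyed by the alias. As noted in the review, an Observation-based approach like this applies to batch DataFrames only.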
In addition to the count, we have defined several useful additional debug observations that can be easily injected.
Produces the following output:
Why are the changes needed?
User-support
Does this PR introduce any user-facing change?
Adds new method.
How was this patch tested?
New UT
Was this patch authored or co-authored using generative AI tooling?
No