Skip to content

[SPARK-34805][SQL] Propagate metadata from nested columns in Alias#35270

Closed
kevinwallimann wants to merge 11 commits intoapache:masterfrom
kevinwallimann:SPARK-34805
Closed

[SPARK-34805][SQL] Propagate metadata from nested columns in Alias#35270
kevinwallimann wants to merge 11 commits intoapache:masterfrom
kevinwallimann:SPARK-34805

Conversation

@kevinwallimann
Copy link
Contributor

What changes were proposed in this pull request?

The metadata of a GetStructField expression is propagated in the Alias expression.

Why are the changes needed?

Currently, in a dataframe with nested structs, when selecting an inner struct, the metadata of that inner struct is lost. For example, suppose
df.schema.head.dataType.head.metadata
returns a non-empty Metadata object, then
df.select('Field0.SubField0').schema.head.metadata
returns an empty Metadata object

The following snippet demonstrates the issue

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, MetadataBuilder, StructField, StructType}
val metadataAbc = new MetadataBuilder().putString("my-metadata", "abc").build()
val metadataXyz = new MetadataBuilder().putString("my-metadata", "xyz").build()
val schema = StructType(Seq(
  StructField("abc",
    StructType(Seq(
      StructField("xyz", LongType, nullable = true, metadataXyz)
    )), metadata = metadataAbc)))
import scala.collection.JavaConverters._
val data = Seq(Row(Row(1L))).asJava
val df = spark.createDataFrame(data, schema)

println(df.select("abc").schema.head.metadata) // OK, metadata is {"my-metadata":"abc"}
println(df.select("abc.xyz").schema.head.metadata) // NOT OK, metadata is {}, expected {"my-metadata","xyz"}

The issue can be reproduced in versions 3.2.0, 3.1.2 and 2.4.8

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new test

@github-actions github-actions bot added the SQL label Jan 21, 2022
@kevinwallimann kevinwallimann marked this pull request as ready for review January 21, 2022 13:13
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@HyukjinKwon
Copy link
Member

cc @viirya FYI

assert(newCol.expr.asInstanceOf[NamedExpression].metadata.getString("key") === "value")
}

test("as propagates metadata from nested column") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add jira number as test name prefix? I.e. SPARK-34805: ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, done.

nonInheritableMetadataKeys.foreach(builder.remove)
builder.build()

case structField: GetStructField => structField.metadata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to remove keys in nonInheritableMetadataKeys?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. I fixed it and added an additional test.

@kevinwallimann kevinwallimann requested a review from viirya January 25, 2022 15:29
@kevinwallimann
Copy link
Contributor Author

Hi @viirya Could you please review my changes again?

@Zejnilovic
Copy link

Hello, @viirya any news on this review?

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.3!

@cloud-fan cloud-fan closed this in f8fd023 Mar 21, 2022
cloud-fan pushed a commit that referenced this pull request Mar 21, 2022
### What changes were proposed in this pull request?
The metadata of a `GetStructField` expression is propagated in the `Alias` expression.

### Why are the changes needed?
Currently, in a dataframe with nested structs, when selecting an inner struct, the metadata of that inner struct is lost. For example, suppose
`df.schema.head.dataType.head.metadata`
returns a non-empty Metadata object, then
`df.select('Field0.SubField0').schema.head.metadata`
returns an empty Metadata object

The following snippet demonstrates the issue
```
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, MetadataBuilder, StructField, StructType}
val metadataAbc = new MetadataBuilder().putString("my-metadata", "abc").build()
val metadataXyz = new MetadataBuilder().putString("my-metadata", "xyz").build()
val schema = StructType(Seq(
  StructField("abc",
    StructType(Seq(
      StructField("xyz", LongType, nullable = true, metadataXyz)
    )), metadata = metadataAbc)))
import scala.collection.JavaConverters._
val data = Seq(Row(Row(1L))).asJava
val df = spark.createDataFrame(data, schema)

println(df.select("abc").schema.head.metadata) // OK, metadata is {"my-metadata":"abc"}
println(df.select("abc.xyz").schema.head.metadata) // NOT OK, metadata is {}, expected {"my-metadata","xyz"}
```

The issue can be reproduced in versions 3.2.0, 3.1.2 and 2.4.8

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a new test

Closes #35270 from kevinwallimann/SPARK-34805.

Authored-by: Kevin Wallimann <kevin.wallimann@absa.africa>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f8fd023)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants