# [SPARK-24906][SQL] Adaptively enlarge split / partition size for Parq… #21868
```diff
@@ -31,10 +31,11 @@ import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
 import org.apache.spark.sql.catalyst.plans.QueryPlan
 import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning, UnknownPartitioning}
 import org.apache.spark.sql.execution.datasources._
+import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
 import org.apache.spark.sql.execution.datasources.parquet.{ParquetFileFormat => ParquetSource}
 import org.apache.spark.sql.execution.metric.SQLMetrics
 import org.apache.spark.sql.sources.{BaseRelation, Filter}
-import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.types._
 import org.apache.spark.util.Utils
 import org.apache.spark.util.collection.BitSet
```
```diff
@@ -425,12 +426,44 @@ case class FileSourceScanExec(
       fsRelation: HadoopFsRelation): RDD[InternalRow] = {
     val defaultMaxSplitBytes =
       fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
-    val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
+    var openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
     val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
     val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
     val bytesPerCore = totalBytes / defaultParallelism

-    val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
+    var maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
+
+    if (fsRelation.sparkSession.sessionState.conf.isColumnarStorageSplitSizeAdaptiveEnabled &&
+      (fsRelation.fileFormat.isInstanceOf[ParquetSource] ||
+        fsRelation.fileFormat.isInstanceOf[OrcFileFormat])) {
+      if (relation.dataSchema.map(_.dataType).forall(dataType =>
+        dataType.isInstanceOf[CalendarIntervalType] || dataType.isInstanceOf[StructType]
+          || dataType.isInstanceOf[MapType] || dataType.isInstanceOf[NullType]
+          || dataType.isInstanceOf[AtomicType] || dataType.isInstanceOf[ArrayType])) {
+
+        def getTypeLength(dataType: DataType): Int = {
+          if (dataType.isInstanceOf[StructType]) {
+            fsRelation.sparkSession.sessionState.conf.columnarStructTypeLength
+          } else if (dataType.isInstanceOf[ArrayType]) {
+            fsRelation.sparkSession.sessionState.conf.columnarArrayTypeLength
+          } else if (dataType.isInstanceOf[MapType]) {
+            fsRelation.sparkSession.sessionState.conf.columnarMapTypeLength
+          } else {
+            dataType.defaultSize
+          }
+        }
+
+        val selectedColumnSize = requiredSchema.map(_.dataType).map(getTypeLength(_))
+          .reduceOption(_ + _).getOrElse(StringType.defaultSize)
+        val totalColumnSize = relation.dataSchema.map(_.dataType).map(getTypeLength(_))
+          .reduceOption(_ + _).getOrElse(StringType.defaultSize)
+        val multiplier = totalColumnSize / selectedColumnSize
```
> **Member**: It seems that here you can only get the ratio of selected columns to total columns. The actual type sizes are not taken into consideration.

> **Author**: There are many data types: CalendarIntervalType, StructType, MapType, NullType, UserDefinedType, AtomicType (TimestampType, StringType, HiveStringType, BooleanType, DateType, BinaryType, NumericType), ObjectType, and ArrayType. For an AtomicType, the size is fixed to its defaultSize. For complex types such as StructType, MapType, and ArrayType, the size is variable, so I made it configurable with a default value. With the data type sizes, the multiplier is not just the ratio of selected columns to total columns, but the ratio of the total size of the selected columns to the total size of all columns.

> **Author**: @viirya As defined in `getTypeLength`, the user can configure the complex types' lengths according to the data statistics, and the length of an AtomicType is determined by `AtomicType.defaultSize`. So the multiplier is the ratio of the total length of the selected columns to the total length of all columns.

> **Author**: @viirya Now it also supports ORC. Please help to review.
```diff
+
+        maxSplitBytes = maxSplitBytes * multiplier
+        openCostInBytes = openCostInBytes * multiplier
+      }
+    }

     logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
       s"open cost is considered as scanning $openCostInBytes bytes.")
```
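For illustration, the multiplier logic in the hunk above can be sketched as a standalone program. This is a simplified model, not Spark's actual `DataType` hierarchy, and the configured struct length below is an assumed stand-in for the `columnarStructTypeLength` conf:

```scala
// Standalone sketch of the patch's multiplier computation.
// The type hierarchy and the struct length are simplified assumptions.
object SplitSizeMultiplierSketch {
  sealed trait DataType { def defaultSize: Int }
  case object IntType extends DataType { val defaultSize = 4 }
  case object LongType extends DataType { val defaultSize = 8 }
  case object StringType extends DataType { val defaultSize = 20 }
  case class StructType(fields: Seq[DataType]) extends DataType { val defaultSize = 0 }

  // Hypothetical value for the configurable complex-type length
  // (columnarStructTypeLength in the patch).
  val structTypeLength: Int = 100

  def getTypeLength(dt: DataType): Int = dt match {
    case _: StructType => structTypeLength
    case other         => other.defaultSize
  }

  // Ratio of the estimated width of all columns to the estimated width
  // of the selected (post-pruning) columns, using integer division as
  // in the patch.
  def multiplier(dataSchema: Seq[DataType], requiredSchema: Seq[DataType]): Int = {
    val selected = requiredSchema.map(getTypeLength).sum.max(1)
    val total = dataSchema.map(getTypeLength).sum
    total / selected
  }

  def main(args: Array[String]): Unit = {
    val dataSchema = Seq(IntType, LongType, StringType, StructType(Nil))
    val requiredSchema = Seq(IntType) // only one narrow column selected
    val m = multiplier(dataSchema, requiredSchema) // (4+8+20+100)/4 = 33
    val maxSplitBytes = 128L * 1024 * 1024 * m     // enlarged split size
    println(s"multiplier=$m maxSplitBytes=$maxSplitBytes")
  }
}
```

Selecting one 4-byte column out of a 132-byte-wide row yields a multiplier of 33, so the 128 MB default split is enlarged accordingly before bin packing.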
> **Reviewer**: The type-based estimation is very rough. It is still hard for end users to decide the initial size.

> **Author**: @gatorsmile The goal of this change is not to make it easier for users to set the partition size. Rather, once the user has set the partition size, this change tries its best to keep the actual read size close to the configured value. Without this change, when the user sets the partition size to 128MB, the actual read size may be 1MB or even smaller because of column pruning.

> **Reviewer**: I think his point is that the estimation is super rough, which I agree with. I am less sure whether we should go ahead, partly for this reason as well.

> **Author**: @HyukjinKwon I agree that the estimation is rough, especially for complex types. For AtomicType it works better, and at least it takes column pruning into consideration.