ARROW-9603: [C++] Fix parquet write to not assume leaf-array validity bitmaps have the same values as parent structs#8219
There is a better solution. I'll update the PR.
Never mind, I think this is likely the only reasonable approach. We might consider pushing bitmap building up the stack at some point.
I'm not sure I have enough mental context to review this PR carefully |
this is unused now.
@xhochy might be the only one. I can do my best to provide some comments.
wesm
left a comment
I reviewed only the new parts -- overall seemed pretty reasonable. Can you update the PR title to explain the issue?
It's regrettable that this change has to touch so much code -- makes me think there could be some code restructurings possible in column_writer.cc, but not sure it's worth the expense right now
Seems like there might be a helper function opportunity if this pattern is repeated in other test functions
it turns out this could be simplified as well, so I don't think a helper function is necessary.
You can use ArrayData::Make for nicer syntax (don't have to write out std::vector<std::shared_ptr<Buffer>>)
thanks, I somehow keep forgetting this.
cpp/src/parquet/arrow/writer.cc
Since WriteArrow returns Status, should we adopt the convention that APIs either return Status or throw an exception, but not both? (FWIW I regret that we chose to allow exceptions in the Parquet C++ project back in 2016)
Done. I suppose it is too late to revisit this? Perhaps provide Status/Result-returning methods in one PR and then deprecate the exception-throwing ones?
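For illustration, a minimal sketch of what a Status-returning wrapper over an exception-throwing method could look like. `SketchStatus`, `WriteBatchOrThrow`, and `WriteBatchStatus` are hypothetical names invented for this example; they are not part of the Arrow or Parquet APIs.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a Status type; not the real arrow::Status.
struct SketchStatus {
  bool ok;
  std::string message;
  static SketchStatus OK() { return {true, ""}; }
  static SketchStatus Error(std::string msg) { return {false, std::move(msg)}; }
};

// Hypothetical legacy method that reports failure by throwing.
inline void WriteBatchOrThrow(int num_values) {
  if (num_values < 0) throw std::invalid_argument("negative value count");
}

// Status-returning wrapper: exceptions are converted at this boundary,
// so callers see exactly one error-reporting mechanism.
inline SketchStatus WriteBatchStatus(int num_values) {
  try {
    WriteBatchOrThrow(num_values);
  } catch (const std::exception& e) {
    return SketchStatus::Error(e.what());
  }
  return SketchStatus::OK();
}
```

Under this assumption the throwing variant could later be deprecated without changing the behavior seen by callers of the Status-returning one.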
cpp/src/parquet/column_writer.cc
The fact that maybe_has_nulls is false whenever nested is false seems odd
yeah, I renamed maybe_has_nulls to maybe_has_parent_nulls which is hopefully clearer? Happy to pick another name that makes sense.
cpp/src/parquet/column_writer.cc
Might be useful someday to have a helper function to make an array copy with a particular buffer replaced, I seem to recall a JIRA issue about this
Agreed, it looks like https://issues.apache.org/jira/browse/ARROW-7071 might be it?
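A rough sketch of the kind of helper being discussed, written against a hypothetical miniature `SketchArrayData` rather than the real `arrow::ArrayData` (the struct, the `SketchBuffer` alias, and `CopyWithReplacedBuffer` are all assumptions made for this example):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical miniature of an array's data; not arrow::ArrayData.
using SketchBuffer = std::vector<uint8_t>;

struct SketchArrayData {
  int64_t length;
  // By assumption, slot 0 holds the validity bitmap and slot 1 the values.
  std::vector<std::shared_ptr<SketchBuffer>> buffers;
};

// Shallow copy that shares every buffer with `data` except slot `index`,
// which is swapped for `replacement`. The original is left untouched.
inline SketchArrayData CopyWithReplacedBuffer(
    const SketchArrayData& data, std::size_t index,
    std::shared_ptr<SketchBuffer> replacement) {
  SketchArrayData copy = data;
  copy.buffers.at(index) = std::move(replacement);  // throws if out of range
  return copy;
}
```

Because only shared pointers are copied, replacing the validity bitmap this way would not duplicate the values buffer.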
26745a9 to 96d2ad5
emkornfield
left a comment
@wesm thanks for the review. I addressed comments and rebased off of master to remove the first commit.
@xhochy did you want to review?
cpp/src/parquet/column_writer.cc
        ArrowWriteContext* ctx, bool nested, bool array_nullable) override {
    BEGIN_PARQUET_CATCH_EXCEPTIONS
    bool leaf_is_not_nullable = !level_info_.HasNullableValues();
    // Leaf nulls are canonical when there is only a single null element and it is at the
"single nullable element" perhaps?
cpp/src/parquet/column_writer.cc
    // leaf.
    bool leaf_nulls_are_canonical =
        (level_info_.def_level == level_info_.repeated_ancestor_def_level + 1) &&
        array_nullable;
Does array_nullable refer to the parent, the root, or the leaf? This is difficult to follow.
Perhaps rename to parent_nullable or root_nullable or...
It is the leaf. I will do some renaming to make this clearer.
cpp/src/parquet/column_writer.cc
        ArrowWriteContext* ctx) override {
        ArrowWriteContext* ctx, bool nested, bool array_nullable) override {
    BEGIN_PARQUET_CATCH_EXCEPTIONS
    bool leaf_is_not_nullable = !level_info_.HasNullableValues();
cpp/src/parquet/column_writer.cc
        (level_info_.def_level == level_info_.repeated_ancestor_def_level + 1) &&
        array_nullable;
    bool maybe_parent_nulls =
        nested && !(leaf_is_not_nullable || leaf_nulls_are_canonical);
Wait, if nested is false, is all this complicated dance required?
nested is actually unnecessary; I've removed it. The only thing that matters is if the column is nullable according to columninfo and it isn't the only nullable column.
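To make the resulting condition easier to follow, here is a minimal sketch of the check with `nested` dropped. `SketchLevelInfo` and `MaybeParentNulls` are hypothetical names for this example, and the body of `HasNullableValues` is an assumption about what the real method does, not a copy of it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the writer's level info; the field names follow
// the snippet under review.
struct SketchLevelInfo {
  int16_t def_level;
  int16_t repeated_ancestor_def_level;
  // Assumption: values can be null when at least one optional level sits
  // below the nearest repeated ancestor.
  bool HasNullableValues() const {
    return repeated_ancestor_def_level < def_level;
  }
};

// Returns true when nulls may come from a parent rather than from the leaf's
// own validity bitmap, so the leaf bitmap cannot be trusted on its own.
inline bool MaybeParentNulls(const SketchLevelInfo& info, bool leaf_nullable) {
  const bool leaf_is_not_nullable = !info.HasNullableValues();
  // The leaf's nulls are canonical when the leaf is the single nullable
  // element below the repeated ancestor.
  const bool leaf_nulls_are_canonical =
      (info.def_level == info.repeated_ancestor_def_level + 1) && leaf_nullable;
  return !(leaf_is_not_nullable || leaf_nulls_are_canonical);
}
```

Under these assumptions, a required leaf inside a nullable struct (def_level 1, leaf not nullable) is the case where parent nulls have to be accounted for separately.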
cpp/src/parquet/column_writer.cc
        arrow::AllocateResizableBuffer(
            BitUtil::BytesForBits(properties_->write_batch_size()), ctx->memory_pool));
    bits_buffer_->ZeroPadding();
        std::static_pointer_cast<ResizableBuffer>(AllocateBuffer(allocator_, 0));
Is this allocating a new (temporary?) validity buffer for each write batch?
This line should be removed. But yes, above we do allocate a new buffer for each WriteArrow call. I think this object might only live for one WriteArrow call. Internally there is a concept of batching, and the allocation should only happen once here for each of those batches.
I reserved myself an hour tomorrow to review this. I haven't touched this code for over a year, but this is the code path that actually got me into the Arrow/Parquet project, so I'm happy to carve out time for it.
      3);
}

TEST(ArrowReadWrite, NestedRequiredField) {
The test cases look very similar; only the name and the values used differ. I would have expected that we would also set nullable=false somewhere in this one.
Are you looking at the latest version?
Right below this comment is:
auto int_field = ::arrow::field("int_array", ::arrow::int32(), /*nullable=*/false);
(note the last parameter?)
Ah, didn't see that when reviewing this code. Now this makes sense!
Don't rely on nullability values of leaf nodes matching their parents.
In general it feels like the WriteArrow code path in column_writer.cc could use some cleanup to remove duplicated code, but while ugly I think this fix works.
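As a rough illustration of the underlying issue (not the code in this PR), the sketch below combines a parent struct's validity bitmap with a leaf's bitmap so that a leaf slot only counts as valid when its parent slot is valid too. `GetBit` and `EffectiveLeafValidity` are hypothetical helpers, and the sketch assumes LSB-first bitmaps with no offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads bit `i` from an LSB-first bitmap (the layout Arrow uses for validity).
inline bool GetBit(const std::vector<uint8_t>& bitmap, std::size_t i) {
  return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}

// Hypothetical helper: a leaf slot is effectively valid only if both the
// parent struct slot and the leaf slot are marked valid. The leaf bitmap may
// say "valid" under a null parent, which is why it cannot be used directly.
inline std::vector<uint8_t> EffectiveLeafValidity(
    const std::vector<uint8_t>& parent, const std::vector<uint8_t>& leaf,
    std::size_t length) {
  std::vector<uint8_t> out((length + 7) / 8, 0);
  for (std::size_t i = 0; i < length; ++i) {
    if (GetBit(parent, i) && GetBit(leaf, i)) {
      out[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return out;
}
```

For example, with a parent bitmap of 0x05 (slots 0 and 2 valid) and a leaf bitmap of 0x0F (all four slots marked valid), the combined bitmap over four slots is 0x05.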