HDDS-14246. Change fsync boundary for FilePerBlockStrategy to block level #9570
base: master
Conversation
Thanks @ivandika3 for the patch!
Quoted lines from the diff:
    in the container happen as sync I/0 or buffered I/O operation. For FilePerBlockStrategy, this
    the sync I/O operation only happens before block file is closed.
Suggested change:
    - in the container happen as sync I/0 or buffered I/O operation. For FilePerBlockStrategy, this
    - the sync I/O operation only happens before block file is closed.
    + in the container happen as sync I/O or buffered I/O operation. For FilePerBlockStrategy, this
    + sync I/O operation only happens before block file is closed.
Thanks, updated.
@vyalamar Do you wanna take a look at this patch?
@rnblough Do you wanna take a look at this issue?
siddhantsangwan left a comment:
@ivandika3 Thanks for working on this. I agree with the overall idea, will do a deeper review soon.
siddhantsangwan left a comment:
What about the Ratis streaming writes? Will this change also affect that code path and do we need any handling there? CC @szetszwo
Please add some tests to verify this change.
Quoted code from the diff:
    if (eob) {
      chunkManager.finishWriteChunks(kvContainer, blockData);
    }
Do we also need to sync in the else case here, when eob is false? Similar to the else case that you added in handlePutBlock.
FilePerBlockStrategy#finishWriteChunks calls FilePerBlockStrategy.OpenFiles#close, which in turn calls OpenFile#close, which syncs before closing the block file.
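For illustration, a minimal sketch of that close path; the class and method names below are simplified stand-ins, not the actual Ozone source:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;

    // Sketch of an open block file that is synced before it is closed,
    // mirroring the finishWriteChunks -> OpenFiles#close -> OpenFile#close chain.
    class OpenBlockFileSketch {
      private final RandomAccessFile file;

      OpenBlockFileSketch(File blockFile) throws IOException {
        this.file = new RandomAccessFile(blockFile, "rw");
      }

      FileChannel getChannel() {
        return file.getChannel();
      }

      // Invoked when the block file is evicted/closed at end of block.
      void close() throws IOException {
        // force(true) flushes data and metadata to storage; close() alone
        // would not guarantee that the buffer cache is flushed.
        file.getChannel().force(true);
        file.close();
      }
    }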
Quoted lines from the diff:
    in the container happen as sync I/0 or buffered I/O operation. For FilePerBlockStrategy, this
    the sync I/O operation only happens before block file is closed.
Remove "the".
For FilePerBlockStrategy, this the sync
Updated.
Streaming Write Pipeline sync is triggered by the client, and I have made it configurable in #9533 through the ozone.client.datastream.sync.size configuration. In the future, we might need to revisit this. I expect this to require 1) a way of keeping track of the
Let me think about this. This requires some fault injection to trigger a datanode crash just after PutBlock. Let me check if we can use Byteman under ozone-fi for this.
What changes were proposed in this pull request?
Currently, the datanode has an option to flush writes on the chunk boundary (hdds.container.chunk.write.sync), which is disabled by default since it might affect DN write throughput and latency. However, disabling it means that if the datanode machine suddenly goes down (e.g. power failure, or the process is reaped by the OOM killer), the block file may end up with incomplete data even if PutBlock (the write commit) was successful, which violates our durability guarantee. Although PutBlock triggers FilePerBlockStrategy#finishWriteChunks, which closes the file (RandomAccessFile#close), the buffer cache might not have been flushed yet, since closing a file does not imply that its buffer cache is flushed (see https://man7.org/linux/man-pages/man2/close.2.html). So there is a chance that the user's key is committed but the data does not exist on the datanodes.
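For context, the existing per-chunk sync switch can already be enabled in ozone-site.xml; the snippet below is only an illustration of that setting:

    <!-- Illustrative: fsync on every chunk write (defaults to false). -->
    <property>
      <name>hdds.container.chunk.write.sync</name>
      <value>true</value>
    </property>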
However, flushing on every WriteChunk might cause unnecessary overhead. We might need to consider calling FileChannel#force on PutBlock instead of on WriteChunk, since the data only becomes visible to users when PutBlock returns successfully (the data is committed), and on failure the client will replace the block (allocate another block). Therefore, we can guarantee that after the user has successfully uploaded the key, the data has been persistently stored on the leader and at least one follower has promised to flush the data (MAJORITY_COMMITTED).
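A minimal sketch of the proposed block-level sync point (illustrative only, not the actual patch; the helper name and parameters are assumptions):

    import java.io.IOException;
    import java.nio.channels.FileChannel;

    // Sketch: the block file's channel stays open across WriteChunk calls and
    // is forced when PutBlock commits the block, rather than on every chunk.
    class BlockCommitSketch {
      void onPutBlock(FileChannel blockChannel, boolean endOfBlock) throws IOException {
        if (endOfBlock) {
          // Final PutBlock for the block: sync, then close the block file.
          blockChannel.force(true);
          blockChannel.close();
        } else {
          // Intermediate PutBlock: the data is now committed and visible to the
          // client, so flush the buffer cache without closing the file.
          blockChannel.force(true);
        }
      }
    }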
This might still affect write throughput and latency due to waiting for the buffer cache to be flushed to persistent storage (SSD or disk), but it will strengthen our data durability guarantee (which should be our priority). Flushing the buffer cache might also reduce the memory usage of the datanode.
In the future, we should consider enabling hdds.container.chunk.write.sync by default.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14246
How was this patch tested?
CI when sync is enabled (https://github.com/ivandika3/ozone/actions/runs/20535392231)