HDDS-13415. Support DeleteRange operation as part of the rocksdb batch write #8774
Conversation
I changed the initial patch to use a HashMap and to no longer apply deleteRange eagerly to the operation map. We only have to skip the stale entries when we commit the batch. This has more memory overhead, since deleteRange no longer removes the covered keys from the operation map. A TreeMap, by contrast, pays a performance cost to keep all entries in sorted order, and that cost would be paid even by batches that contain no deleteRange at all, which should be the common case since deleteRange is infrequent.
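A minimal sketch of that approach, with hypothetical names rather than the PR's actual classes: ops are appended in sequence order, a HashMap tracks only the latest index per key, and commit replays the survivors.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BatchSketch {
  private final List<String[]> ops = new ArrayList<>();             // {key, value} in sequence order
  private final Map<String, Integer> latestIndex = new HashMap<>(); // key -> last sequence index

  void put(String key, String value) {
    ops.add(new String[] {key, value});
    latestIndex.put(key, ops.size() - 1); // O(1) overwrite tracking, no sorted structure needed
  }

  void commit() {
    for (int i = 0; i < ops.size(); i++) {
      String[] op = ops.get(i);
      if (latestIndex.get(op[0]) != i) {
        continue; // a later op on the same key wins; skip the stale entry
      }
      System.out.println("write " + op[0] + " = " + op[1]); // stand-in for the real batch write
    }
  }
}
```

This keeps put/delete at O(1) while the batch is built; the price is that overwritten entries linger in memory until commit.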
```java
  startKey = null;
  endKey = null;
}
for (int i = startIndex; i < deleteRangeIndex; i++) {
```
@swamirishi, shouldn't we check all operations prior to deleteRangeIndex, i.e. [0, deleteRangeIndex), to discard operations that fall within the delete range? If the order of ops is as follows, I think put<2, val1> is not discarded even though it falls in delRange<1, 3>:

```
put<2, val1>   // should be discarded
delRange<1, 2>
delRange<1, 3>
```
@SaketaChalamchala
The outer loop will have deleteRangeIndices [1, 2, 3].
The inner loop will perform the put operation <2, val1> between indices [0, 1).
After the inner loop, the outer loop applies delete range [1, 2) at index 1 [line 430].
The inner loop then runs between indices [2, 2), which is nothing.
After the inner loop, the outer loop applies delete range [1, 3) at index 2 [line 430].
So it will perform all ops. I agree that what we could have done is check for consecutive delete ranges like delRange[1, 2) and delRange[1, 3) and take the union of overlapping ranges before performing the put operation, avoiding the stale write altogether: [1, 2) combined with [1, 3) is effectively [1, 3). See the sketch below.
I didn't want a lot of complexity in this logic, but if you feel we should do it, I can make that change.
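For reference, a sketch of that union idea under simplifying assumptions (integer endpoints instead of the byte[] keys the real code compares; RangeUnion is a hypothetical name):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RangeUnion {
  /** Merge overlapping half-open [start, end) ranges into a disjoint list. */
  static List<long[]> merge(List<long[]> ranges) {
    ranges.sort(Comparator.comparingLong((long[] r) -> r[0]));
    List<long[]> merged = new ArrayList<>();
    for (long[] r : ranges) {
      long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
      if (last != null && r[0] <= last[1]) {
        last[1] = Math.max(last[1], r[1]);   // overlapping: extend the previous range
      } else {
        merged.add(new long[] {r[0], r[1]}); // disjoint: start a new range
      }
    }
    return merged; // merge([[1,2], [1,3]]) -> [[1,3]], so put<2, val1> is discarded once
  }
}
```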
```java
}

void deleteRange(byte[] startKey, byte[] endKey) {
  delRangeCount++;
```
Would it be better to move the "discard keys that fall in the delete range" logic here?
Iterate through the opsKeys map, remove the keys that fall in the range, and then put the operation in batchOps.
It would make the deleteRange operation less performant. The initial patch used a TreeMap, which performs deleteRange efficiently, in O(log n) as in a BST. But puts and deletes would then pay O(log n) in the TreeMap versus O(1) in a HashMap. Instead of optimizing deleteRange, which is going to be a far less frequent operation than individual puts and deletes, I made it a HashMap and just take a hit at commit, where I sort keys by sequence number; that is much faster at the cost of more memory.
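A toy illustration of that trade-off, with assumed names rather than the actual RDBBatchOperation fields: inserts stay O(1), and the one-time O(n log n) sort by sequence number happens at commit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CommitOrderSketch {
  public static void main(String[] args) {
    // key -> sequence number of the latest op on that key (HashMap: O(1) per put/delete)
    Map<String, Integer> keyToSeq = new HashMap<>();
    keyToSeq.put("key02", 3);
    keyToSeq.put("key01", 1);
    keyToSeq.put("key05", 2);

    // At commit: one O(n log n) sort restores write order. A TreeMap would instead
    // charge O(log n) on every single put/delete while the batch is being built.
    List<Map.Entry<String, Integer>> commitOrder = new ArrayList<>(keyToSeq.entrySet());
    commitOrder.sort((a, b) -> Integer.compare(a.getValue(), b.getValue()));
    commitOrder.forEach(e -> System.out.println(e.getValue() + " -> " + e.getKey()));
    // prints: 1 -> key01, 2 -> key05, 3 -> key02
  }
}
```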
jojochuang
left a comment
A few cosmetic nitpicks; otherwise it seems good.
```java
}
for (int i = startIndex; i < deleteRangeIndex; i++) {
  Operation op = ops.get(i);
  Bytes key = op.getKey();
```
Suggest adding a check to verify that the key is not null. That is expected to hold because the key is between startIndex and deleteRangeIndex.
Suggested change:

```java
Bytes key = op.getKey();
Preconditions.checkArgument(key != null);
```
getKey() should never return a null value
```java
  Preconditions.checkState(overwritten == null);
}

int overWriteOpIfExist(Bytes key, Operation operation) {
```
Consider updating overWriteOpIfExist() so that the DeleteRange API is included. Otherwise delete-range operations don't contribute to batchSize tracking, and that inconsistency could lead to inaccurate batch-size reporting.
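A hypothetical sketch of that suggestion; all field and method names here are illustrative, not the PR's:

```java
import java.util.HashMap;
import java.util.Map;

class FamilyBatchSketch {
  private final Map<Integer, byte[][]> batchOps = new HashMap<>();
  private long batchSize;
  private int delRangeCount;
  private int nextSequence;

  void deleteRange(byte[] startKey, byte[] endKey) {
    delRangeCount++;
    // Count the range op toward batchSize, just like put/delete, so that
    // size-based batch reporting stays accurate.
    batchSize += startKey.length + endKey.length;
    batchOps.put(nextSequence++, new byte[][] {startKey, endKey});
  }
}
```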
```java
 */
private final Map<Bytes, Object> ops = new HashMap<>();
private final Map<Bytes, Integer> opsKeys = new HashMap<>();
private final Map<Integer, Operation> batchOps = new HashMap<>();
```
add a description for batchOps similar to opsKeys?
Suggested change:

```java
/**
 * Cache of all batch operations to be applied for this family, keyed by their insertion order index.
 *
 * Each entry maps a unique integer index (representing the order in which the operation was added)
 * to an {@link Operation} object (such as put, delete, or delete range).
 *
 * This ensures that all database operations are executed in the same order as they were added,
 * which is crucial for correctly handling overlapping or dependent operations, such as delete ranges.
 */
private final Map<Integer, Operation> batchOps = new HashMap<>();
```
```java
  Preconditions.checkState(overwritten == null);
}

int overWriteOpIfExist(Bytes key, Operation operation) {
```
The return value is never used, and the method can be made private.
@jojochuang @SaketaChalamchala Can you review the code again? I have addressed the review comments and also added test cases.
CI check (SpotBugs): H I Dm: Found reliance on default encoding in org.apache.hadoop.hdds.utils.db.TestRDBBatchOperation.testBatchOperationWithDeleteRange(): String.getBytes() At TestRDBBatchOperation.java:[line 105]
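The usual fix for that Dm warning is to pass an explicit charset. A sketch (the test's actual string2Bytes helper may differ):

```java
import java.nio.charset.StandardCharsets;

final class StringCodec {
  // Avoids the SpotBugs Dm warning: never rely on the JVM's platform-default charset.
  static byte[] string2Bytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }
}
```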
SaketaChalamchala
left a comment
Thanks for the patch @swamirishi. The updated patch includes overwriting operations covered by continuous delete ranges.
LGTM overall.
There's a JDK compilation error. Trigger a rebuild just in case.
@jojochuang Can this be merged?

+1 It passed in my local tree: https://github.com/jojochuang/ozone/actions/runs/17170949740
…rocksdb batch write (apache#8774) (cherry picked from commit ae92a18)

Conflicts:
- hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/metadata/DatanodeTable.java
- hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBBatchOperation.java
- hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBTable.java
- hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/TypedTable.java
- hadoop-hdds/framework/src/test/java/org/apache/hadoop/hdds/utils/db/TestRDBTableStore.java
```java
// This method mimics the ByteWiseComparator in RocksDB.
@Override
public int compareTo(RDBBatchOperation.Bytes that) {
```
@swamirishi, let's just use ByteWiseComparator. There is no reason to implement a new compareTo method.
I realized that and was going to raise a patch for this.
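A sketch of what that follow-up could look like, assuming delegation to RocksJava's org.rocksdb.util.BytewiseComparator (the actual patch may differ):

```java
import java.nio.ByteBuffer;
import org.rocksdb.ComparatorOptions;
import org.rocksdb.util.BytewiseComparator;

final class Bytes implements Comparable<Bytes> {
  // Delegate to RocksDB's own bytewise comparator instead of re-implementing it.
  private static final BytewiseComparator COMPARATOR =
      new BytewiseComparator(new ComparatorOptions());

  private final byte[] array;

  Bytes(byte[] array) {
    this.array = array;
  }

  @Override
  public int compareTo(Bytes that) {
    return COMPARATOR.compare(ByteBuffer.wrap(this.array), ByteBuffer.wrap(that.array));
  }
}
```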
szetszwo
left a comment
@swamirishi, this change in general is good. Thanks!
However, it would have been much better not to mix the refactoring with the new DeleteRange feature. The change became unnecessarily big and difficult to review (hope that we can respect reviewers' time and effort).
Also, the tests are way too simple compared with the changes. Could you add a better test?
```java
final List<Pair<Pair<String, String>, Integer>> putKeys = new ArrayList<>();
final List<Pair<String, Integer>> deleteKeys = new ArrayList<>();
AtomicInteger cnt = new AtomicInteger(0);
try (MockedConstruction<ManagedWriteBatch> mockedConstruction = Mockito.mockConstruction(ManagedWriteBatch.class,
```
Why mock instead of testing with a real DB? How do you test the mock to see whether it mocks everything correctly?
```java
final List<Pair<Pair<String, String>, Integer>> deleteKeyRangePairs = new ArrayList<>();
final List<Pair<Pair<String, String>, Integer>> putKeys = new ArrayList<>();
```
What is a List<Pair<Pair<String, String>, Integer>>? How could other people easily understand your code?
```java
batchOperation.put(columnFamily, string2Bytes("key01"), string2Bytes("value01"));
batchOperation.put(columnFamily, string2Bytes("key02"), string2Bytes("value02"));
batchOperation.put(columnFamily, string2Bytes("key03"), string2Bytes("value03"));
batchOperation.put(columnFamily, string2Bytes("key03"), string2Bytes("value04"));
batchOperation.delete(columnFamily, string2Bytes("key05"));
batchOperation.deleteRange(columnFamily, string2Bytes("key01"), string2Bytes("key02"));
batchOperation.deleteRange(columnFamily, string2Bytes("key02"), string2Bytes("key03"));
batchOperation.put(columnFamily, string2Bytes("key04"), string2Bytes("value04"));
batchOperation.put(columnFamily, string2Bytes("key06"), string2Bytes("value05"));
batchOperation.deleteRange(columnFamily, string2Bytes("key06"), string2Bytes("key12"));
batchOperation.deleteRange(columnFamily, string2Bytes("key09"), string2Bytes("key10"));
RocksDatabase db = Mockito.mock(RocksDatabase.class);
doNothing().when(db).batchWrite(any());
batchOperation.commit(db);
assertEquals(deleteKeys, Collections.singletonList(Pair.of("key05", 1)));
assertEquals(deleteKeyRangePairs, asList(Pair.of(Pair.of("key01", "key02"), 2),
    Pair.of(Pair.of("key02", "key03"), 3),
    Pair.of(Pair.of("key06", "key12"), 5),
    Pair.of(Pair.of("key09", "key10"), 6)));
assertEquals(putKeys, Arrays.asList(Pair.of(Pair.of("key03", "value04"), 0),
    Pair.of(Pair.of("key04", "value04"), 4)));
```
This hard-coded test seems way too simple compared to the code added, and all the keys have the same length. I suggest a random test that runs at least 10,000 operations. Run them against two different DBs:
- one with RDBBatchOperation, and
- another one directly with RocksDB (without RDBBatchOperation).

Then compare the contents of the two DBs.
Oh nice! This is a great test to have. Let me add it.
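Roughly what such a differential test could look like. This is a sketch under assumptions: both sides are plain RocksDB handles here, and in the real test one side would be written through RDBBatchOperation; the zero-padded key format keeps beginKey < endKey for deleteRange.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Random;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

final class DifferentialTestSketch {
  static void run(RocksDB plain, RocksDB batched) throws RocksDBException {
    Random random = new Random(42); // fixed seed keeps failures reproducible
    for (int i = 0; i < 10_000; i++) {
      byte[] key = bytes(String.format("key%04d", random.nextInt(1000)));
      switch (random.nextInt(3)) {
      case 0:
        byte[] value = bytes("value" + i);
        plain.put(key, value);   // apply the same op to both sides...
        batched.put(key, value); // ...the real test routes this one through the batch
        break;
      case 1:
        plain.delete(key);
        batched.delete(key);
        break;
      default:
        byte[] end = bytes(String.format("key%04d", 1000 + random.nextInt(1000)));
        plain.deleteRange(key, end);
        batched.deleteRange(key, end);
        break;
      }
    }
    assertSameContents(plain, batched);
  }

  static void assertSameContents(RocksDB a, RocksDB b) {
    try (RocksIterator ia = a.newIterator(); RocksIterator ib = b.newIterator()) {
      ia.seekToFirst();
      ib.seekToFirst();
      while (ia.isValid() && ib.isValid()) {
        if (!Arrays.equals(ia.key(), ib.key()) || !Arrays.equals(ia.value(), ib.value())) {
          throw new AssertionError("mismatch at " + new String(ia.key(), StandardCharsets.UTF_8));
        }
        ia.next();
        ib.next();
      }
      if (ia.isValid() || ib.isValid()) {
        throw new AssertionError("the two DBs contain different numbers of keys");
      }
    }
  }

  static byte[] bytes(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }
}
```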
Filed HDDS-13415 for cleaning up the code.
What changes were proposed in this pull request?
Added support for DeleteRange operation in RocksDB batch writes
Key Changes:
- New Operation Type: a `DELETE_RANGE` `Op`, alongside the existing `DELETE` and `PUT`.
- Data Structure Improvements: `Bytes` is now `Comparable`.
- New API Methods: `deleteRange` in `RDBBatchOperation` and the `Table` interface, `deleteRangeWithBatch`, and `batchDeleteRange` in `RocksDatabase.ColumnFamily`.
- Operation Handling: batched writes are represented as `Operation` objects.
- Interface Updates: `Table.java` (`deleteRangeWithBatch`), `TypedTable`, and `RDBTable`.

The changes improve the RocksDB integration by adding support for batch range deletions, which allows more efficient deletion of multiple consecutive keys in a single operation. The implementation includes proper resource management and maintains operation ordering within batches.
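For illustration, a hedged usage sketch based only on the method names listed above; the exact signatures and parameter order in the PR may differ:

```java
import java.io.IOException;
import org.apache.hadoop.hdds.utils.db.BatchOperation;
import org.apache.hadoop.hdds.utils.db.DBStore;
import org.apache.hadoop.hdds.utils.db.Table;

class DeleteRangeUsageSketch {
  // Deletes every key in [beginKey, endKey) atomically with the rest of the batch.
  void deleteKeyRange(DBStore store, Table<byte[], byte[]> table,
      byte[] beginKey, byte[] endKey) throws IOException {
    try (BatchOperation batch = store.initBatchOperation()) {
      table.deleteRangeWithBatch(batch, beginKey, endKey); // new API from this PR
      store.commitBatchOperation(batch);
    }
  }
}
```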
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-13415
How was this patch tested?
Working on adding unit tests