
Conversation

@swamirishi (Contributor) commented Jul 10, 2025:

What changes were proposed in this pull request?

Added support for DeleteRange operation in RocksDB batch writes
Key Changes:

  1. New Operation Type:

    • Added `DELETE_RANGE` to the `Op` enum alongside the existing `DELETE` and `PUT` operations
    • Added a new `DeleteRangeOperation` to perform `deleteRange` on a RocksDB column family.
  2. Data Structure Improvements:

    • Implemented the `Comparable` interface for the `Bytes` class to support ordering
  3. New API Methods:

    • Added a `deleteRange` method to `RDBBatchOperation`
    • Added `deleteRangeWithBatch` to the `Table` interface
    • Added `batchDeleteRange` to `RocksDatabase.ColumnFamily`
  4. Operation Handling:

    • Introduced a new `Operation` interface for better operation abstraction
    • Added proper cleanup and resource management for operations
    • Implemented ordered operation execution based on operation index
  5. Interface Updates:

    • Modified `Table.java` to include the new `deleteRangeWithBatch` method
    • Updated `TypedTable` and `RDBTable` to implement the new range deletion functionality

The changes improve the RocksDB integration by adding support for batch range deletions, which allows for more efficient deletion of multiple consecutive keys in a single operation. The implementation includes proper resource management and maintains operation ordering within batches.
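
As a rough usage sketch of how these pieces fit together (initBatchOperation, putWithBatch, and deleteWithBatch are Ozone's existing batch API; the deleteRangeWithBatch call is inferred from the list above, so treat the exact signature as an assumption):

import java.io.IOException;
import org.apache.hadoop.hdds.utils.db.BatchOperation;
import org.apache.hadoop.hdds.utils.db.DBStore;
import org.apache.hadoop.hdds.utils.db.Table;

// Sketch: mixing puts, deletes, and the new range deletion in one batch.
void writeBatchWithRangeDelete(DBStore store, Table<String, String> table)
    throws IOException {
  try (BatchOperation batch = store.initBatchOperation()) {
    table.putWithBatch(batch, "key1", "value1");   // ordinary put
    table.deleteWithBatch(batch, "key2");          // ordinary single delete
    // New in this PR: delete every key in [key3, key9) within the batch.
    table.deleteRangeWithBatch(batch, "key3", "key9");
    store.commitBatchOperation(batch);             // applied atomically
  }
}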

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13415

How was this patch tested?

Working on adding unit tests

…h write

Change-Id: I65444c0155b471c31ea1283d563d9a948f5e4005
…impact

Change-Id: I3b34e1bb6ca7587243f6eef176c396ca0132dbc3
@swamirishi (Contributor Author):

I changed the initial patch to use a HashMap instead of eagerly applying deleteRange to the operation map. We only have to skip the superseded entries when we commit the batch. This has more memory overhead, since deleteRange no longer deletes keys from the operation map. A TreeMap, by contrast, has a performance overhead because it maintains all values in sorted order, and that cost would be paid even by batches that don't have a deleteRange operation, which should be the common case since deleteRange is infrequent.

Change-Id: Iab23d0b1316eac9c376fd06226d234b97e908b4f
Change-Id: I74fed5aebb5c0588cbf97f70a0f9acf94edfa87c
startKey = null;
endKey = null;
}
for (int i = startIndex; i < deleteRangeIndex; i++) {
@SaketaChalamchala (Contributor):

@swamirishi, shouldn't we check all operations prior to deleteRangeIndex, i.e. [0, deleteRangeIndex), to discard operations that fall within the delete range?

// If the order of ops is as below, I think put<2, val1> is not discarded even though it falls in delRange<1, 3>
put<2, val1> // Should be discarded
delRange<1, 2>
delRange<1, 3>

@swamirishi (Contributor Author) commented Jul 17, 2025:

@SaketaChalamchala
The outer loop will have deleteRangeIndices [1, 2, 3].
The inner loop will perform the put operation <2, val1> between indices [0, 1).
After the inner loop, the outer loop deletes range [1,2) at index 1 [line 430].
The inner loop then runs between indices [2, 2), which is nothing.
After the inner loop, the outer loop deletes range [1,3) at index 2 [line 430].
So it will perform all ops. Yeah, I agree: what we could have done is check for consecutive delRange indices, delRange[1, 2) and delRange[1, 3), and take the union of the overlapping ranges before performing the put operation to avoid it altogether.
Effectively, [1,2) + [1,3) becomes [1,3).

I didn't want a lot of complexity in this logic, but if you feel we should do it I can make that change.
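
For reference, the range-union idea mentioned above would look roughly like this (a generic merge of half-open intervals over int keys for brevity; the patch itself would compare Bytes in RocksDB's bytewise order, so this is a sketch, not the patch's code):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Union overlapping half-open ranges [start, end), e.g. [1,2) + [1,3) -> [1,3).
static List<int[]> unionRanges(List<int[]> ranges) {
  List<int[]> sorted = new ArrayList<>(ranges);
  sorted.sort(Comparator.comparingInt(r -> r[0]));
  List<int[]> merged = new ArrayList<>();
  for (int[] r : sorted) {
    if (!merged.isEmpty() && r[0] <= merged.get(merged.size() - 1)[1]) {
      // Overlapping or adjacent: extend the previous range.
      int[] last = merged.get(merged.size() - 1);
      last[1] = Math.max(last[1], r[1]);
    } else {
      merged.add(new int[] {r[0], r[1]});
    }
  }
  return merged;
}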

}

void deleteRange(byte[] startKey, byte[] endKey) {
delRangeCount++;
Contributor:

Would it be better to move the "discard keys that fall in deleteRange" logic here?
Iterate through the opsKeys map, remove keys that fall in the range, and then put the operation in batchOps.

Contributor Author:

That would make the deleteRange operation itself not performant. The initial patch used a TreeMap, which performs deleteRange efficiently in O(log(N)) since it is a BST. But because of that, puts and deletes would suffer: O(1) in a HashMap vs O(log(N)) in a TreeMap. Instead of optimizing deleteRange, which is going to be a less frequent operation than individual puts and deletes, I made it a HashMap, and I just take a hit on commit, where I sort keys by sequence number. That is going to be much faster overall, at the cost of more memory.
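
A minimal sketch of that trade-off with illustrative names (not the patch's actual fields): writes land in a HashMap in O(1), and the ordering cost is paid once per batch at commit time:

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Puts/deletes cost O(1); sorting by insertion index happens only at commit.
class OrderedBatchSketch {
  private final Map<Integer, Runnable> batchOps = new HashMap<>();
  private int nextIndex = 0;

  void add(Runnable op) {
    batchOps.put(nextIndex++, op);  // O(1), no sorted structure maintained
  }

  void commit() {
    // Pay the ordering cost once: sort indices, then apply ops in order.
    new TreeMap<>(batchOps).values().forEach(Runnable::run);
  }
}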

@smengcl added the snapshot label (https://issues.apache.org/jira/browse/HDDS-6517) on Jul 17, 2025
@jojochuang (Contributor) left a comment:

A few cosmetic nitpicks; otherwise it seems good.

}
for (int i = startIndex; i < deleteRangeIndex; i++) {
Operation op = ops.get(i);
Bytes key = op.getKey();
Contributor:

Suggest adding a check to verify that the key is not null, which is expected to hold because the key is between startIndex and deleteRangeIndices.

Suggested change
Bytes key = op.getKey();
Bytes key = op.getKey();
Preconditions.checkNotNull(key);

Contributor Author:

getKey() should never return a null value

Preconditions.checkState(overwritten == null);
}

int overWriteOpIfExist(Bytes key, Operation operation) {
Contributor:

Consider updating overWriteOpIfExist() so that the DeleteRange API usage is included.

Otherwise delete-range operations don't contribute to batchSize tracking. This inconsistency could lead to inaccurate batch size reporting.
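
Hypothetically, that could look like the following (field names illustrative; the real accounting in RDBBatchOperation may differ):

// Sketch: make deleteRange contribute to size tracking like put/delete do.
class SizeTrackingSketch {
  private long batchSize;
  private int delRangeCount;

  void deleteRange(byte[] startKey, byte[] endKey) {
    delRangeCount++;
    batchSize += startKey.length + endKey.length;  // count this op too
  }
}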

*/
private final Map<Bytes, Object> ops = new HashMap<>();
private final Map<Bytes, Integer> opsKeys = new HashMap<>();
private final Map<Integer, Operation> batchOps = new HashMap<>();
Contributor:

add a description for batchOps similar to opsKeys?

Suggested change
private final Map<Integer, Operation> batchOps = new HashMap<>();
/**
* Cache of all batch operations to be applied for this family, keyed by their insertion order index.
*
* Each entry maps a unique integer index (representing the order in which the operation was added)
* to an {@link Operation} object (such as put, delete, or delete range).
*
* This ensures that all database operations are executed in the same order as they were added,
* which is crucial for correctly handling overlapping or dependent operations, such as delete ranges.
*/
private final Map<Integer, Operation> batchOps = new HashMap<>();

Preconditions.checkState(overwritten == null);
}

int overWriteOpIfExist(Bytes key, Operation operation) {
@jojochuang (Contributor) commented Jul 31, 2025:

The return value is never used, and this can be made a private method.

Change-Id: I54cdc1733e7f23e449b7336fbdff90c11b382bad
Change-Id: I0395802eed3ca7514d668948fee62ab21c3ac8d2
@swamirishi (Contributor Author):

@jojochuang @SaketaChalamchala Can you review the code again? I have addressed the review comments and also added the test cases.

@jojochuang jojochuang marked this pull request as ready for review August 20, 2025 22:37
@jojochuang (Contributor):

H I Dm: Found reliance on default encoding in org.apache.hadoop.hdds.utils.db.TestRDBBatchOperation.testBatchOperationWithDeleteRange(): String.getBytes() At TestRDBBatchOperation.java:[line 105]
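
That warning is about relying on the platform default charset; assuming the test's string2Bytes helper wraps the conversion, the usual fix is to pass the charset explicitly:

import java.nio.charset.StandardCharsets;

// Avoid the default-encoding warning by naming the charset explicitly.
static byte[] string2Bytes(String s) {
  return s.getBytes(StandardCharsets.UTF_8);  // instead of s.getBytes()
}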

Change-Id: I93add67453d71ae576b457b7ec72ddd30982e22e
Change-Id: I6d50300c1af91db9afdd63c437f30889bd8f6fc9
@SaketaChalamchala (Contributor) left a comment:

Thanks for the patch @swamirishi. The updated patch handles overwriting operations covered by consecutive delete ranges.
LGTM overall.

@jojochuang (Contributor):

There's a JDK compilation error. Trigger a rebuild just in case.

Change-Id: I0ab3ec95ff18e84182f7412758d5da98fd979c13
@swamirishi (Contributor Author):

@jojochuang Can this be merged?

@jojochuang (Contributor):

+1 it passed in my local tree https://github.com/jojochuang/ozone/actions/runs/17170949740

@jojochuang jojochuang merged commit ae92a18 into apache:master Aug 24, 2025
42 checks passed
swamirishi added a commit to swamirishi/ozone that referenced this pull request Dec 3, 2025
…rocksdb batch write (apache#8774)

(cherry picked from commit ae92a18)

 Conflicts:
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/metadata/DatanodeTable.java
	hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBBatchOperation.java
	hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/RDBTable.java
	hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/db/TypedTable.java
	hadoop-hdds/framework/src/test/java/org/apache/hadoop/hdds/utils/db/TestRDBTableStore.java

Change-Id: I1f6eadab38ead1d3ba5a13dfdd00ec0c0913e722
Comment on lines +135 to +137
// This method mimics the ByteWiseComparator in RocksDB.
@Override
public int compareTo(RDBBatchOperation.Bytes that) {
Contributor:

@swamirishi, let's just use ByteWiseComparator. There is no reason to implement a new compareTo method.

Contributor Author:

I realized that and was going to raise a patch for this
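
For context, RocksDB's ByteWiseComparator orders keys as unsigned lexicographic byte strings; a generic Java equivalent of what the custom compareTo mimics (a sketch, not the PR's code) is:

// Unsigned lexicographic byte comparison, the ordering ByteWiseComparator uses.
static int compareUnsignedBytes(byte[] a, byte[] b) {
  int n = Math.min(a.length, b.length);
  for (int i = 0; i < n; i++) {
    int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);  // unsigned compare
    if (cmp != 0) {
      return cmp;
    }
  }
  return Integer.compare(a.length, b.length);  // shorter prefix sorts first
}

On Java 9+, java.util.Arrays.compareUnsigned(byte[], byte[]) provides the same ordering, which reinforces the point that this need not be hand-rolled.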

@szetszwo (Contributor) left a comment:

@swamirishi , this change in general was good. Thanks!

However, it would be much better if you didn't mix the refactoring with the new DeleteRange feature. The change became unnecessarily big and difficult to review (hope that we can respect reviewers' time and effort).

Also, the tests are way too simple compared with the changes. Could you add a better test?

final List<Pair<Pair<String, String>, Integer>> putKeys = new ArrayList<>();
final List<Pair<String, Integer>> deleteKeys = new ArrayList<>();
AtomicInteger cnt = new AtomicInteger(0);
try (MockedConstruction<ManagedWriteBatch> mockedConstruction = Mockito.mockConstruction(ManagedWriteBatch.class,
Contributor:

Why mock instead of testing with a real DB? How do you test the mock to see whether it mocks everything correctly?

Comment on lines +58 to +59
final List<Pair<Pair<String, String>, Integer>> deleteKeyRangePairs = new ArrayList<>();
final List<Pair<Pair<String, String>, Integer>> putKeys = new ArrayList<>();
Contributor:

What is a List<Pair<Pair<String, String>, Integer>>? How could other people easily understand your code?
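
One way to address the readability concern, sketched with a hypothetical type (not in the patch): give the tuple a name instead of nesting Pairs.

// "A key range plus the index of the op that produced it", with a real name
// instead of List<Pair<Pair<String, String>, Integer>>.
final class RangeOpRecord {
  final String startKey;
  final String endKey;
  final int opIndex;

  RangeOpRecord(String startKey, String endKey, int opIndex) {
    this.startKey = startKey;
    this.endKey = endKey;
    this.opIndex = opIndex;
  }
}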

Comment on lines +106 to +126
batchOperation.put(columnFamily, string2Bytes("key01"), string2Bytes("value01"));
batchOperation.put(columnFamily, string2Bytes("key02"), string2Bytes("value02"));
batchOperation.put(columnFamily, string2Bytes("key03"), string2Bytes("value03"));
batchOperation.put(columnFamily, string2Bytes("key03"), string2Bytes("value04"));
batchOperation.delete(columnFamily, string2Bytes("key05"));
batchOperation.deleteRange(columnFamily, string2Bytes("key01"), string2Bytes("key02"));
batchOperation.deleteRange(columnFamily, string2Bytes("key02"), string2Bytes("key03"));
batchOperation.put(columnFamily, string2Bytes("key04"), string2Bytes("value04"));
batchOperation.put(columnFamily, string2Bytes("key06"), string2Bytes("value05"));
batchOperation.deleteRange(columnFamily, string2Bytes("key06"), string2Bytes("key12"));
batchOperation.deleteRange(columnFamily, string2Bytes("key09"), string2Bytes("key10"));
RocksDatabase db = Mockito.mock(RocksDatabase.class);
doNothing().when(db).batchWrite(any());
batchOperation.commit(db);
assertEquals(deleteKeys, Collections.singletonList(Pair.of("key05", 1)));
assertEquals(deleteKeyRangePairs, asList(Pair.of(Pair.of("key01", "key02"), 2),
Pair.of(Pair.of("key02", "key03"), 3),
Pair.of(Pair.of("key06", "key12"), 5),
Pair.of(Pair.of("key09", "key10"), 6)));
assertEquals(putKeys, Arrays.asList(Pair.of(Pair.of("key03", "value04"), 0),
Pair.of(Pair.of("key04", "value04"), 4)));
Contributor:

This hard-coded test seems way too simple compared to the code added. All the keys just have the same length?

I suggest having a random test that runs at least 10,000 operations. Run them against two different DBs:

  1. one with RDBBatchOperation, and
  2. another one directly with RocksDB (without RDBBatchOperation).

Then, compare the contents of the two DBs.

@swamirishi (Contributor Author) commented Dec 24, 2025:

Oh nice! This is a great test to have. Let me add it.
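
A rough sketch of that differential test, showing only the direct-to-RocksDB side; replaying the same seeded op stream through RDBBatchOperation into a second DB and diffing the contents is Ozone-specific wiring and is left out:

import java.nio.charset.StandardCharsets;
import java.util.Random;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Apply a seeded stream of random put/delete/deleteRange ops directly.
// Driving the RDBBatchOperation side with the same seed should leave both
// DBs with identical contents.
final class DifferentialSketch {
  static void applyRandomOpsDirectly(RocksDB db, long seed, int opCount)
      throws RocksDBException {
    Random rnd = new Random(seed);
    for (int i = 0; i < opCount; i++) {
      switch (rnd.nextInt(3)) {
      case 0:
        db.put(key(rnd.nextInt(10000)), key(rnd.nextInt(10000)));  // put
        break;
      case 1:
        db.delete(key(rnd.nextInt(10000)));  // single delete
        break;
      default:
        int x = rnd.nextInt(10000);
        int y = rnd.nextInt(10000);
        if (x != y) {  // deleteRange requires begin < end
          db.deleteRange(key(Math.min(x, y)), key(Math.max(x, y)));
        }
      }
    }
  }

  static byte[] key(int n) {
    // Zero-padded so numeric order matches RocksDB's bytewise key order.
    return String.format("key%08d", n).getBytes(StandardCharsets.UTF_8);
  }
}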

@szetszwo (Contributor):

Filed HDDS-13415 for cleaning up the code.

