Skip to content

tests merge idempotence and fixes found bug#4643

Merged
keith-turner merged 1 commit into
apache:elasticityfrom
keith-turner:flaky-fate
Jun 6, 2024
Merged

tests merge idempotence and fixes found bug#4643
keith-turner merged 1 commit into
apache:elasticityfrom
keith-turner:flaky-fate

Conversation

@keith-turner
Copy link
Copy Markdown
Contributor

Adds a way to make Fate in the manager always runs steps multiple times to test idempotence. Created versions of merge and delete rows ITs that uses this test manager. Found and fixed a bug in the merge code when running these new test. Changed mini accumulo to allow specifying a different manager class to use prio to startup. There was an existing method that would allow starting a manager with a different class, but using it would have meant letting the actuall manager start and then killing it and starting another which would have added time to test. The change to mini accumulo was not made to its public API, only its implementation.

fixes #4642

Adds a way to make Fate in the manager always runs steps multiple times to test
idempotence. Created versions of merge and delete rows ITs that uses this test
manager. Found and fixed a bug in the merge code when running these new test.
Changed mini accumulo to allow specifying a different manager class to use prio
to startup.  There was an existing method that would allow starting a manager
with a different class, but using it would have meant letting the actuall
manager start and then killing it and starting another which would have added
time to test.  The change to mini accumulo was not made to its public API, only
its implementation.

fixes apache#4642
@keith-turner keith-turner linked an issue Jun 6, 2024 that may be closed by this pull request
@keith-turner keith-turner requested review from cshannon and dlmarion June 6, 2024 19:43
@keith-turner
Copy link
Copy Markdown
Contributor Author

I manually verified that when the test runs that fate steps run twice by looking at the logs. Saw logs like the following when doing this. It would be good to verify this at run time during the test, but have not been able to figure out a good way to do this.

$ grep Running org.apache.accumulo.test.functional.MergeFlakyFateIT_merge/logs/FlakyFateManager_491616213.out | tail
2024-06-06T15:03:22,813 109 [fate.Fate] DEBUG: Running DeleteTablets.isReady() FATE:USER:c0599ae5-9671-4118-977e-b5529fd93983 took 0 ms and returned 0
2024-06-06T15:03:22,814 109 [fate.Fate] DEBUG: Running DeleteTablets.isReady() FATE:USER:c0599ae5-9671-4118-977e-b5529fd93983 took 0 ms and returned 0
2024-06-06T15:03:22,822 109 [fate.Fate] DEBUG: Running DeleteTablets.call() FATE:USER:c0599ae5-9671-4118-977e-b5529fd93983 took 8 ms and returned FinishTableRangeOp
2024-06-06T15:03:22,826 109 [fate.Fate] DEBUG: Running DeleteTablets.call() FATE:USER:c0599ae5-9671-4118-977e-b5529fd93983 took 4 ms and returned FinishTableRangeOp
2024-06-06T15:03:22,833 109 [fate.Fate] DEBUG: Running FinishTableRangeOp.isReady() FATE:USER:c0599ae5-9671-4118-977e-b5529fd93983 took 0 ms and returned 0
2024-06-06T15:03:22,833 109 [fate.Fate] DEBUG: Running FinishTableRangeOp.isReady() FATE:USER:c0599ae5-9671-4118-977e-b5529fd93983 took 0 ms and returned 0
2024-06-06T15:03:22,860 109 [fate.Fate] DEBUG: Running FinishTableRangeOp.call() FATE:USER:c0599ae5-9671-4118-977e-b5529fd93983 took 26 ms and returned null
2024-06-06T15:03:22,862 109 [fate.Fate] DEBUG: Running FinishTableRangeOp.call() FATE:USER:c0599ae5-9671-4118-977e-b5529fd93983 took 2 ms and returned null
2024-06-06T15:03:23,447 105 [fate.Fate] DEBUG: Running RenameCompactionFile.isReady() FATE:META:6bff1bc6-09e0-30f8-9d48-b4d04e59c402 took 0 ms and returned 0
2024-06-06T15:03:23,447 105 [fate.Fate] DEBUG: Running RenameCompactionFile.isReady() FATE:META:6bff1bc6-09e0-30f8-9d48-b4d04e59c402 took 0 ms and returned 0

@keith-turner
Copy link
Copy Markdown
Contributor Author

keith-turner commented Jun 6, 2024

Could expand the usage of this new testing technique to cover more of the Fate operations like bulk import, tablet compaction, etc. Will look into that after this, should be easy to do, just need to find the right existing test to extend.

public class DeleteRowsFlakyFateIT extends DeleteRowsIT {
@Override
public void configureMiniCluster(MiniAccumuloConfigImpl cfg, Configuration hadoopCoreSite) {
cfg.setServerClass(ServerType.MANAGER, FlakyFateManager.class);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd bet that there are some other tests, related to Scan Servers and/or External Compactions, that could use this new method instead of stopping the "normal" implementation and starting a different one for the test. It's probably worth a new ticket for this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to look into and see if it could simplify those test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll open an issue

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opened #4644

@cshannon
Copy link
Copy Markdown
Contributor

cshannon commented Jun 7, 2024

@keith-turner - I took a look over this fix this morning since I worked on no-chop merge and added the merge marker to handle the indempotent case in #3975 and it looks good to me.

This does not affect 2.1 since the merge code is entirely different, but I was curious about main because the merged marker and no-chop merge also exists there in #3957. I took a look and found the following that I had added so I do not think this bug is currently an issue in main. In the merge code in main we are only checking if the last tablet in the merge range has the merged marker and then returning because we are assuming it is already merged if it does. We are not checking for a correct linked list. This makes sense as I tested the idempotent case manually when working on this and didn't see an issue.

I guess the only outstanding question if if we should add a similar check to the merge code in main to verify the linked list or not worry about it.

@keith-turner
Copy link
Copy Markdown
Contributor Author

I guess the only outstanding question if if we should add a similar check to the merge code in main to verify the linked list or not worry about it.

The expectation is that the tablets being merged form a linked list. Did not realize that was only checked in elasticity. I think it would be good to add an issue for that. If the persisted tablet metadata does not meet expectations its probably best to avoid proceeding w/ merge.

@keith-turner
Copy link
Copy Markdown
Contributor Author

This does not affect 2.1 since the merge code is entirely different, but I was curious about main because the merged marker and no-chop merge also exists there in #3957.

Thanks for looking into that. Made a comment on #4646 that while this exact bug does not exist in earlier branches that other bugs related to idempotence may exists. Your suggestion there about back porting FlakyFate would help test that.

@cshannon
Copy link
Copy Markdown
Contributor

cshannon commented Jun 7, 2024

The expectation is that the tablets being merged form a linked list. Did not realize that was only checked in elasticity. I think it would be good to add an issue for that. If the persisted tablet metadata does not meet expectations its probably best to avoid proceeding w/ merge.

I created a new issue #4651

cshannon added a commit to cshannon/accumulo that referenced this pull request Jun 8, 2024
This backports the FlakyFate and FlakyFateManager impl from elasticity
that was added in apache#4643 so that fate operations can be easily tested to
check if they are idempotent. DeleteRowsFlakyFateIT and MergeFlayFateIT
were also backported and pass verifying the operations are idempotent.
cshannon added a commit to cshannon/accumulo that referenced this pull request Jun 8, 2024
This backports the FlakyFate and FlakyFateManager impl from elasticity
that was added in apache#4643 so that fate operations can be easily tested to
check if they are idempotent. DeleteRowsFlakyFateIT and MergeFlakyFateIT
were also backported and pass verifying the operations are idempotent.
cshannon added a commit that referenced this pull request Jun 8, 2024
This backports the FlakyFate and FlakyFateManager impl from elasticity
that was added in #4643 so that fate operations can be easily tested to
check if they are idempotent. DeleteRowsFlakyFateIT and MergeFlakyFateIT
were also backported and pass verifying the operations are idempotent.
@keith-turner keith-turner added this to the 4.0.0 milestone Jul 12, 2024
cshannon added a commit to cshannon/accumulo that referenced this pull request Aug 16, 2024
Update a few ITs that were stopping a normal server instance and
starting up a test impl by taking advantage of changes in apache#4643 to set
the server instance type before start up. This simplifies the tests and
also prevents weird behavior by having servers start up and be shut down
quickly.
cshannon added a commit to cshannon/accumulo that referenced this pull request Aug 16, 2024
Update a few ITs that were stopping a normal server instance and
starting up a test impl by taking advantage of changes in apache#4643 to set
the server instance type before start up. This simplifies the tests and
also prevents weird behavior by having servers start up and be shut down
quickly.

This closes apache#4644
cshannon added a commit that referenced this pull request Aug 19, 2024
Update a few ITs that were stopping a normal server instance and
starting up a test impl by taking advantage of changes in #4643 to set
the server instance type before start up. This simplifies the tests and
also prevents weird behavior by having servers start up and be shut down
quickly.

This closes #4644
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Test the merge code is idempotent

3 participants