*: rewrite snapshot merging and sending for v3 #3970
Conversation
Where does the buffer size of 128 come from? If it is arbitrary, we should add a brief comment.
|
New code looks cleaner. Looking forward to seeing more updates. I left some comments. |
|
@gyuho I cleaned up a bunch of things. Now it is reviewable. |
|
Thanks, I will have a look |
Does this mean that it forwards to the leader?
This (changing To to 0) means the message will be dropped; see line 823 too. We should probably document it.
We drop the original snapshot message since etcdserver will send it later, once it has the merged snapshot.
Oh got it. Yeah I couldn't find documentation on that.
It is documented somewhere in rafthttp. :(
|
@xiang90 I read all the changes and they make sense to me, but it's a big change. Do you plan any more updates on this PR? Otherwise, I will run an actual cluster with this code tomorrow to double-check. |
It will fail since there is an issue with boltdb. I will update you with the details tomorrow. |
No. I do not. |
|
Got it. What was the issue with boltdb? Code-wise, LGTM. |
|
@gyuho I will update you with more details about boltdb tomorrow or the day after tomorrow. |
|
Got it. Thanks! I will keep reviewing this to understand more. |
|
@gyuho The issue with boltdb is that its readers might block writers. So when sending out a large snapshot (which opens a long-running read txn in bolt), writes might get blocked. This is what we do not want to see. We have a pending PR at boltdb/bolt#432 |
|
Other than that, feel free to test it out. What I have done is set up a 3-node cluster with v3 enabled, then use the benchmark tool to generate enough puts to trigger a snapshot while killing nodes to force snapshot sending. |
|
Cool, I will test it out as well. |
|
I tested this with benchmark tool and it works well. LGTM. |
|
@gyuho Did you trigger a snapshot sending? |
|
Yes |
|
@gyuho Great. I will keep this open until tomorrow night. You can try to play with it more. |
|
Sure, thanks! |
|
BTW, this is not snapshot sending. This is just etcdserver taking a snapshot. You have to kill an etcd member and force the etcd leader to compact its log. Then, when the killed member comes back up, the leader no longer has the required log entries to send to it (due to compaction), so the leader has to send a snapshot to that member. You will see a log line like `recovering from incoming snapshot` on the other side. |
|
@xiang90 +1. Thanks! |
|
@xiang90 I tested that scenario, and I checked that it triggers log compaction. Please go over this log. |
|
And I ran it with this command: |
|
@gyuho Yea. That is what I expected. The log is still a little bit noisy. We need to improve it over time. |
|
@xiang90 Yeah other than that, code works great. Thanks, |
|
1million_first.txt @xiang90 So I tried to observe the blocking behavior in boltdb. The first case is with 1 million keys, without boltdb/bolt#432. The second case is with 1 million keys, with boltdb/bolt#432. Do you have any suggestions on how to benchmark the blocking behavior? Please let me know. |
|
@gyuho There should not be any perf difference. You should try to create a case where you can observe the blocking behavior. |
|
OK, I will try that. Thanks!
|
|
@gyuho My suggestion would be: try with boltdb itself first.
Then you can understand why we need this in etcd. We may have a read routine (the snapshot sender, which can take up to a minute to send a large snapshot) that blocks writes for tens of seconds. |
|
OK, got it. Will try. Thanks!
|
I am rewriting the snapshot merging and sending logic for v3 (#3666).
The previous approach was complicated and introduced a few unnecessary components (snapshot_store in both etcdserver and transport), and the coordination between these components was error-prone. We have fixed a few bugs, but deadlocks still lurk. Moreover, the current snapshot-creating logic is wrong: it does not stop the apply loop while getting the v3 snapshot.
The previous complexity came from delaying the snapshot merging to the final transportation phase. I moved the merging to the very beginning, in etcdserver, which makes things a lot simpler, mainly because etcdserver has full control of the workflow and does not require any coordination.
This pull request tries to simplify that logic. It is currently a work in progress, but the approach works and is significantly simpler.
The snapshot sending logic should be as simple as: