[SPARK-12004] Preserve the RDD partitioner through RDD checkpointing #9983
Closed
tdas wants to merge 4 commits into
Conversation
Contributor
Author
@zsxwing @andrewor14 Can you take a look at this?
Contributor
Author
All this code has been moved into `ReliableCheckpointRDD.createCheckpointedRDD`.
Contributor
LGTM. Just style and naming nits.
Test build #46724 has finished for PR 9983 at commit
Contributor
Author
jenkins test this please
Test build #2132 has finished for PR 9983 at commit
Test build #46931 has finished for PR 9983 at commit
Test build #46935 has finished for PR 9983 at commit
Contributor
retest this please
Test build #46969 has finished for PR 9983 at commit
Contributor
m1.6
asfgit pushed a commit that referenced this pull request Dec 1, 2015
The solution is to save the RDD partitioner in a separate file in the RDD checkpoint directory, namely `<checkpoint dir>/_partitioner`. In most cases, whether or not the RDD partitioner is recovered does not affect correctness; losing it only reduces performance. So this solution makes a best-effort attempt to save and recover the partitioner. If either step fails, checkpointing itself is not affected. This makes the patch safe and backward compatible.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #9983 from tdas/SPARK-12004.
(cherry picked from commit 60b541e)
Signed-off-by: Andrew Or <andrew@databricks.com>
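The best-effort behaviour described above can be illustrated with a minimal, Spark-free sketch. This is not the code from the patch: the helper names `save_partitioner` and `load_partitioner`, the use of `pickle`, and the dictionary standing in for a partitioner are all assumptions made for illustration. Only the `_partitioner` file name comes from the description.

```python
import os
import pickle
import tempfile

# File name taken from the PR description; everything else here is hypothetical.
PARTITIONER_FILE = "_partitioner"


def save_partitioner(checkpoint_dir, partitioner):
    """Best-effort save: return True on success, False on any failure."""
    try:
        path = os.path.join(checkpoint_dir, PARTITIONER_FILE)
        with open(path, "wb") as f:
            pickle.dump(partitioner, f)
        return True
    except Exception:
        # A failed save must not fail the checkpoint itself.
        return False


def load_partitioner(checkpoint_dir):
    """Best-effort load: return the partitioner, or None if unavailable."""
    try:
        path = os.path.join(checkpoint_dir, PARTITIONER_FILE)
        with open(path, "rb") as f:
            return pickle.load(f)
    except Exception:
        # Missing or unreadable file (for example, a checkpoint written
        # before this file existed): recover without a partitioner.
        return None
```

Because both helpers swallow failures, a checkpoint directory without the side file still loads; the caller simply sees no partitioner, which is how the sketch models backward compatibility.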
asfgit pushed a commit that referenced this pull request Dec 7, 2015
…ner not present
The reason is that TrackStateRDDs generated by trackStateByKey expect the previous batch's TrackStateRDDs to have a partitioner. However, when recovering from DStream checkpoints, the RDDs recovered from RDD checkpoints do not have a partitioner attached to them. This is because RDD checkpoints do not preserve the partitioner (SPARK-12004). While #9983 solves SPARK-12004 by preserving the partitioner through RDD checkpoints, there is still a non-zero chance that saving or recovering the partitioner fails. To be resilient, this PR repartitions the previous state RDD if the partitioner is not detected.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #9988 from tdas/SPARK-11932.
asfgit pushed a commit that referenced this pull request Dec 7, 2015
…ner not present
The reason is that TrackStateRDDs generated by trackStateByKey expect the previous batch's TrackStateRDDs to have a partitioner. However, when recovering from DStream checkpoints, the RDDs recovered from RDD checkpoints do not have a partitioner attached to them. This is because RDD checkpoints do not preserve the partitioner (SPARK-12004). While #9983 solves SPARK-12004 by preserving the partitioner through RDD checkpoints, there is still a non-zero chance that saving or recovering the partitioner fails. To be resilient, this PR repartitions the previous state RDD if the partitioner is not detected.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #9988 from tdas/SPARK-11932.
(cherry picked from commit 5d80d8c)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
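The fallback described in this commit message (repartition the previous state when no partitioner is detected) can be sketched without Spark. The class `StateData` and the functions `hash_partition` and `ensure_partitioned` are hypothetical stand-ins, not Spark APIs, and plain Python lists model RDD partitions.

```python
def hash_partition(pairs, num_partitions):
    """Group (key, value) pairs into num_partitions buckets by key hash."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets


class StateData:
    """Hypothetical stand-in for a state RDD recovered from a checkpoint."""

    def __init__(self, partitions, partitioner=None):
        self.partitions = partitions
        # None models a partitioner that was lost during recovery.
        self.partitioner = partitioner


def ensure_partitioned(state, num_partitions):
    """Return state as-is if it has a partitioner; otherwise re-bucket it."""
    if state.partitioner is not None:
        return state
    # Flatten all partitions, then redistribute the pairs by key hash.
    pairs = [kv for part in state.partitions for kv in part]
    return StateData(hash_partition(pairs, num_partitions),
                     ("hash", num_partitions))
```

The repartitioning cost is paid only on the recovery path where the partitioner is missing; state that already carries a partitioner passes through untouched.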
The solution is to save the RDD partitioner in a separate file in the RDD checkpoint directory, namely `<checkpoint dir>/_partitioner`. In most cases, whether or not the RDD partitioner is recovered does not affect correctness; losing it only reduces performance. So this solution makes a best-effort attempt to save and recover the partitioner. If either step fails, checkpointing itself is not affected. This makes the patch safe and backward compatible.