[SPARK-29043][Core] Improve the concurrent performance of History Server#25797
turboFei wants to merge 7 commits into
|
ok to test cc @wangyum |
|
retest please |
|
ok to test |
|
cc @vanzin too |
HeartSaVioR
left a comment
The code change looks OK, but I may be missing the effects of "listing" and "reloading applications" running concurrently. Some places that were previously accessed by a single thread would now be accessed by multiple threads, so this may need to be checked.
|
ok to test |
|
Test build #110636 has finished for PR 25797 at commit
|
|
gentle ping @dongjoon-hyun @HyukjinKwon |
|
I just changed a method from protected to private in the new commit; it does not affect the test result. |
|
Test build #111038 has finished for PR 25797 at commit
|
|
I just added some comments and renamed a method, which does not affect the test result. |
|
Test build #111074 has finished for PR 25797 at commit
|
|
@cloud-fan Could you help take a look? |
|
Seems good to me. |
I have tested it for a week, and it works well.
|
By unblocking the |
Thanks, I will check it. |
I have checked the code of In this PR, if a path is contained in So, should we interrupt the tasks that are processing a path that has expired? |
|
Test build #111606 has finished for PR 25797 at commit
|
| .asScala | ||
| .toList | ||
| stale.foreach { log => | ||
| stale.filterNot(isProcessing(_)).foreach { log => |
filterNot(isProcessing)
But there's a problem here. Let's say you're unlucky and the cleaner task gets here for a log that should be cleaned up, but for some reason is still being processed by the tasks created by "checkForLogs".
Instead of being cleaned, that log will be skipped until the next time the cleaner task runs, which may be quite a long time later.
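To make the concern concrete, here is a minimal sketch of a single cleaner pass that skips in-flight logs. `CleanerPass`, `Result`, `runOnce`, and the string paths are hypothetical names chosen for illustration and are not taken from the PR; only the `filterNot(isProcessing)` shape mirrors the diff above.

```scala
object CleanerPass {
  // Outcome of one cleaner run: what was deleted now, and what was left behind.
  final case class Result(deleted: List[String], skipped: List[String])

  // Same effect as stale.filterNot(isProcessing(_)): stale logs that are still
  // being replayed are not deleted in this run.
  def runOnce(stale: List[String], isProcessing: String => Boolean): Result = {
    val (skipped, deletable) = stale.partition(isProcessing)
    Result(deleted = deletable, skipped = skipped)
  }
}
```

In this sketch, anything that ends up in `skipped` is not touched again until the next scheduled cleaner run, which is the delay described above.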
There was a problem hiding this comment.
Yes, I think we do not need to filter out the logs that are contained in processing.
A log can be in both listing and processing only when it has been processed and written to the listing but the finally block (which removes it from processing)
has not yet been executed.
In that case, even if we clean up these logs, it would have no impact.
So, I think we don't need to filter these logs here.
@vanzin
There was a problem hiding this comment.
Is it safe if "cleaner task" and "replay task" access the same log directory concurrently? I guess we still need to filter out these logs for safety, but we need to also retry cleaning up these logs sooner, at least sooner than the interval of cleaner.
There was a problem hiding this comment.
Sorry for the late reply; I was on my National Day holiday for the past several days.
Instead of being cleaned, that log will be skipped until the next time the cleaner task runs, which may be quite a long time later.
but we need to also retry cleaning up these logs sooner, at least sooner than the interval of cleaner.
That sounds a little complex. Should we retry cleaning these logs after half the cleaner interval?
There was a problem hiding this comment.
@turboFei
Sorry, I lost track of this and am revisiting it now.
I agree it could be complicated, as it's really just a cleaner. As a workaround, we can do two iterations in a single call: on the first iteration, try to clean all logs; on the second, try only the skipped logs.
@vanzin Would the workaround work for you? Or would we have to reschedule the interval?
There was a problem hiding this comment.
I don't think that solves the problem. To solve that problem, the task that's processing the logs should check at the end whether the log file has expired.
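A rough sketch of the control flow suggested here is shown below. It is not the code that was merged; `ReplayTaskSketch` and its four callback parameters are hypothetical stand-ins for the real replay, the "processing" bookkeeping, the expiration test, and the cleanup.

```scala
object ReplayTaskSketch {
  // Runs one replay, releases the path, then re-checks expiration so that an
  // expired log does not have to wait for the next cleaner run.
  // Returns true if the cleanup callback was invoked.
  def run(
      replay: () => Unit,
      endProcessing: () => Unit,
      isExpired: () => Boolean,
      cleanUp: () => Unit): Boolean = {
    try replay()
    finally endProcessing() // always release the path, even if replay throws
    // Only reached when replay completed normally.
    val expired = isExpired()
    if (expired) cleanUp()
    expired
  }
}
```

The design choice in this sketch is that the task that held the log is the one responsible for cleaning it up, so the cleaner can safely skip in-flight logs.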
|
Test build #113321 has finished for PR 25797 at commit
|
|
Test build #113328 has finished for PR 25797 at commit
|
|
Test build #113330 has finished for PR 25797 at commit
|
|
Test build #113332 has finished for PR 25797 at commit
|
| endProcessing(reader.rootPath) | ||
| pendingReplayTasksCount.decrementAndGet() | ||
|
|
||
| val isExpired = scanTime + conf.get(MAX_LOG_AGE_S) * 1000 < clock.getTimeMillis() |
This is not right. Expiration is based on the time the log was last updated, not the time it was last scanned.
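The difference between the two definitions can be illustrated with a small sketch. `LogRecord` and both predicate names are hypothetical; the arithmetic follows the `scanTime + MAX_LOG_AGE_S * 1000 < now` shape from the diff, with the maximum age given in seconds and timestamps in milliseconds.

```scala
object Expiration {
  // Hypothetical record: when the log was last updated and when it was last scanned.
  final case class LogRecord(lastUpdatedMs: Long, lastScanMs: Long)

  // The check in the diff under review: age measured from the last scan.
  def expiredByScanTime(log: LogRecord, maxAgeSec: Long, nowMs: Long): Boolean =
    log.lastScanMs + maxAgeSec * 1000L < nowMs

  // The check the review asks for: age measured from the last update.
  def expiredByLastUpdate(log: LogRecord, maxAgeSec: Long, nowMs: Long): Boolean =
    log.lastUpdatedMs + maxAgeSec * 1000L < nowMs
}
```

For a log last updated at 0 ms and rescanned at 9000 ms, with a 5 second maximum age and the clock at 10000 ms, the scan-based check reports "not expired" while the update-based check reports "expired". A log that is rescanned regularly would never expire under the scan-based check.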
|
|
||
| val isExpired = scanTime + conf.get(MAX_LOG_AGE_S) * 1000 < clock.getTimeMillis() | ||
| if (isExpired) { | ||
| listing.delete(classOf[LogInfo], reader.rootPath.toString) |
You may also need to remove the application attempt that refers to the log from the listing database.
Basically you have to do what cleanLogs does, both to determine whether the log is expired and to decide what needs to be deleted.
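As an illustration of deleting both records together, here is a sketch against an in-memory stand-in for the listing database. `Listing`, `Attempt`, and `cleanLog` are hypothetical and much simpler than the real `cleanLogs`; in particular, treating an attempt as expired when its last update is older than `maxTimeMs` is an assumption based on the comment above.

```scala
import scala.collection.mutable

object ListingCleanupSketch {
  final case class Attempt(logPath: String, lastUpdatedMs: Long)

  // Hypothetical in-memory stand-in for the listing database.
  final class Listing {
    val logs = mutable.Map.empty[String, String]            // log path -> app id
    val attempts = mutable.Map.empty[String, List[Attempt]] // app id -> attempts
  }

  // Removes the log entry and the expired attempt that refers to it.
  // Assumption: an attempt is expired when lastUpdatedMs < maxTimeMs.
  def cleanLog(listing: Listing, logPath: String, maxTimeMs: Long): Unit =
    listing.logs.get(logPath).foreach { appId =>
      val all = listing.attempts.getOrElse(appId, Nil)
      val (expired, remaining) =
        all.partition(a => a.logPath == logPath && a.lastUpdatedMs < maxTimeMs)
      if (expired.nonEmpty) {
        listing.logs.remove(logPath)
        // Drop the application entry entirely once it has no attempts left.
        if (remaining.isEmpty) listing.attempts.remove(appId)
        else listing.attempts.update(appId, remaining)
      }
    }
}
```

Deleting only the log entry, as in the diff above, would leave an application attempt in the listing that points at a log which is no longer tracked.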
|
Test build #113379 has finished for PR 25797 at commit
|
|
retest this please. |
|
rebased. |
|
Test build #113625 has finished for PR 25797 at commit
|
|
Test build #113626 has finished for PR 25797 at commit
|
|
@turboFei Hi, could you address the review comments? This is good to have and seems close to being merged (according to #26416 (review) ). |
Thanks, I will address it as soon as possible. |
|
|
||
| log.appId.foreach { appId => | ||
| val app = listing.read(classOf[ApplicationInfoWrapper], appId) | ||
| if (app.oldestAttempt() <= maxTime) { |
Here the logic is consistent with cleanLogs().
However, I think there is an overlap between app.oldestAttempt() <= maxTime and attempt.info.lastUpdated.getTime() >= maxTime, although it does not matter.
|
Test build #115300 has finished for PR 25797 at commit
|
|
Test build #115334 has finished for PR 25797 at commit
|
vanzin
left a comment
Looks ok now. Merging to master.
|
|
||
| test("SPARK-29043: clean up specified event log") { | ||
| val clock = new ManualClock() | ||
| val conf = createTestConf().set(MAX_LOG_AGE_S.key, "0").set(CLEANER_ENABLED.key, "true") |
No need to use .key here.
(I'll fix this during merge.)
What changes were proposed in this pull request?
Even when spark.history.fs.numReplayThreads is set to a large number, such as 30, the history server still replays logs slowly.
We found that if there is a straggler in a batch of replay tasks, all the other threads wait for that straggler.
In this PR, we create processing to track the logs that are being replayed,
so that the replay tasks can execute asynchronously.
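The idea can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `ProcessingTracker`, `tryStart`, `end`, and `ProcessingTrackerDemo.scan` are hypothetical names, and the `submit` callback stands in for handing a log to the replay thread pool.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for the "processing" bookkeeping described above.
final class ProcessingTracker {
  // Concurrent set of log paths that currently have a replay task in flight.
  private val processing = ConcurrentHashMap.newKeySet[String]()

  def isProcessing(path: String): Boolean = processing.contains(path)

  // Returns true only for the caller that actually claimed the path.
  def tryStart(path: String): Boolean = processing.add(path)

  def end(path: String): Unit = processing.remove(path)
}

object ProcessingTrackerDemo {
  // One scan: submit only the logs that are not already in flight, and return
  // immediately instead of waiting for the whole batch to finish.
  def scan(tracker: ProcessingTracker, logs: Seq[String])(submit: String => Unit): Seq[String] =
    logs.filter(tracker.tryStart).map { p => submit(p); p }
}
```

With this shape, a straggler only blocks its own path: the next scan skips it because it is still marked as in flight, while every other log can be picked up again.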
Why are the changes needed?
It speeds up log replay for the history server.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests.