Skip to content

[SPARK-29043][Core] Improve the concurrent performance of History Server#25797

Closed
turboFei wants to merge 7 commits into
apache:masterfrom
turboFei:SPARK-29043
Closed

[SPARK-29043][Core] Improve the concurrent performance of History Server#25797
turboFei wants to merge 7 commits into
apache:masterfrom
turboFei:SPARK-29043

Conversation

@turboFei
Copy link
Copy Markdown
Member

@turboFei turboFei commented Sep 15, 2019

What changes were proposed in this pull request?

Even we set spark.history.fs.numReplayThreads to a large number, such as 30.
The history server still replays logs slowly.
We found that, if there is a straggler in a batch of replay tasks, all the other threads will wait for this
straggler.

In this PR, we create processing to save the logs which are being replayed.
So that the replay tasks can execute Asynchronously.

Why are the changes needed?

It can accelerate the speed to replay logs for history server.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT.

@turboFei turboFei changed the title [SPARK-29043] Improve the concurrent performance of History Server [WIP][SPARK-29043] Improve the concurrent performance of History Server Sep 15, 2019
@turboFei turboFei changed the title [WIP][SPARK-29043] Improve the concurrent performance of History Server [WIP][SPARK-29043][Core] Improve the concurrent performance of History Server Sep 15, 2019
@turboFei turboFei changed the title [WIP][SPARK-29043][Core] Improve the concurrent performance of History Server [SPARK-29043][Core] Improve the concurrent performance of History Server Sep 15, 2019
@dongjoon-hyun
Copy link
Copy Markdown
Member

ok to test

cc @wangyum

@turboFei
Copy link
Copy Markdown
Member Author

retest please

@HyukjinKwon
Copy link
Copy Markdown
Member

ok to test

@HyukjinKwon
Copy link
Copy Markdown
Member

cc @vanzin too

Copy link
Copy Markdown
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code change looks OK, but I might be missing the effect where "listing" and "reloading applications" run concurrently. It would bring accessing some places via multi-threads which they weren't, so may need to check.

Comment thread core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala Outdated
Comment thread core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala Outdated
@wangyum
Copy link
Copy Markdown
Member

wangyum commented Sep 16, 2019

ok to test

Comment thread core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala Outdated
@SparkQA
Copy link
Copy Markdown

SparkQA commented Sep 16, 2019

Test build #110636 has finished for PR 25797 at commit ec387f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@turboFei turboFei requested a review from kiszk September 17, 2019 03:05
@turboFei
Copy link
Copy Markdown
Member Author

gentle ping @dongjoon-hyun @HyukjinKwon

@kiszk
Copy link
Copy Markdown
Member

kiszk commented Sep 19, 2019

Looks good to me
Want to wait for comments from others cc @vanzin @wangyum

@turboFei
Copy link
Copy Markdown
Member Author

I just modify the a method from protected to private in new commit, it would not impact the test result.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Sep 20, 2019

Test build #111038 has finished for PR 25797 at commit 1c36bfe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Copy Markdown
Member

@Ngone51 Ngone51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable.

Comment thread core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala Outdated
@turboFei
Copy link
Copy Markdown
Member Author

I just add some comment and rename a method, which would not impact the test result.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Sep 20, 2019

Test build #111074 has finished for PR 25797 at commit 12f6300.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@turboFei
Copy link
Copy Markdown
Member Author

@cloud-fan Could you help take a look?

@kiszk
Copy link
Copy Markdown
Member

kiszk commented Sep 26, 2019

ping @vanzin @wangyum

@gengliangwang
Copy link
Copy Markdown
Member

Seems good to me.
@turboFei Have you tested the changes manually?

@turboFei
Copy link
Copy Markdown
Member Author

Seems good to me.
@turboFei Have you tested the changes manually?

I have tested it for a week, it works well.

@vanzin
Copy link
Copy Markdown
Contributor

vanzin commented Sep 26, 2019

By unblocking the cleanForLogs call, you also unblock the cleaner task which may run after it. I didn't see any changes to that code; it should probably check whether some path is being processed before trying to do things with it.

@turboFei
Copy link
Copy Markdown
Member Author

turboFei commented Sep 27, 2019

By unblocking the cleanForLogs call, you also unblock the cleaner task which may run after it. I didn't see any changes to that code; it should probably check whether some path is being processed before trying to do things with it.

Thanks, I will check it.

@turboFei
Copy link
Copy Markdown
Member Author

By unblocking the cleanForLogs call, you also unblock the cleaner task which may run after it. I didn't see any changes to that code; it should probably check whether some path is being processed before trying to do things with it.

I have checked the code of cleanLogs, it will clean up the expired paths in listing and clear relative record in blackList.

In this PR, if a path is contained in processing, it would not be added into listing.

So, should we interrupt the tasks which is processing a path has been expired?

Comment thread core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala Outdated
@SparkQA
Copy link
Copy Markdown

SparkQA commented Sep 30, 2019

Test build #111606 has finished for PR 25797 at commit 55ad333.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment thread core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala Outdated
.asScala
.toList
stale.foreach { log =>
stale.filterNot(isProcessing(_)).foreach { log =>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filterNot(isProcessing)

But there's a problem here. Let's say you're unlucky and the cleaner task gets here for a log that should be cleanup up, but for some reason is still being processed by the tasks created by "checkForLogs".

Instead of cleaning that log, it will instead be skipped until the next time the cleaner task run, which may be quote a long time later.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think we do not need filter these logs which is contained in processing.

The scene that log is both in listing and processing would only happen when a log has been processed and written to the listing but the finally block(remove it from processing)
has not been executed.

But for this case, even if we cleanup these logs, it would has no impaction.
So, I think we don't need filter these logs here.
@vanzin

Copy link
Copy Markdown
Contributor

@HeartSaVioR HeartSaVioR Oct 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe if "cleaner task" and "replay task" access the same log directory concurrently? I guess we still need to filter out these logs for safety, but we need to also retry cleaning up these logs sooner, at least sooner than the interval of cleaner.

Copy link
Copy Markdown
Member Author

@turboFei turboFei Oct 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for late reply, I was on my National Day holiday for past several days.

Instead of cleaning that log, it will instead be skipped until the next time the cleaner task run, which may be quote a long time later.

but we need to also retry cleaning up these logs sooner, at least sooner than the interval of cleaner.

It sound a little complex, should we clean these logs with half interval of cleaner later?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@turboFei
Sorry I got lost on this and revisit this now.
I agree it could be complicated, as it's really just a cleaner. For workaround, we can just do two iterations in a single call, on first iteration try to do all, on second iteration try only with skipped logs.
@vanzin Would the workaround work for you? Or would we have to reschedule the interval?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that solves the problem. To solve that problem, the task that's processing the logs should check at the end whether the log file has expired.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, make sense.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Nov 6, 2019

Test build #113321 has finished for PR 25797 at commit 611ea30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Nov 6, 2019

Test build #113328 has finished for PR 25797 at commit 832dd7a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Nov 6, 2019

Test build #113330 has finished for PR 25797 at commit 8c0b532.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Nov 6, 2019

Test build #113332 has finished for PR 25797 at commit 66e9dd1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

endProcessing(reader.rootPath)
pendingReplayTasksCount.decrementAndGet()

val isExpired = scanTime + conf.get(MAX_LOG_AGE_S) * 1000 < clock.getTimeMillis()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not right. Expiration is based on the time the log was last updated, not the time it was last scanned.


val isExpired = scanTime + conf.get(MAX_LOG_AGE_S) * 1000 < clock.getTimeMillis()
if (isExpired) {
listing.delete(classOf[LogInfo], reader.rootPath.toString)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may also need to remove the application attempt that refers to the log from the listing database.

Basically you have to do what cleanLogs does, both to define whether the log is expired, and what needs to be deleted.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Nov 7, 2019

Test build #113379 has finished for PR 25797 at commit d5ff72c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@turboFei
Copy link
Copy Markdown
Member Author

turboFei commented Nov 7, 2019

retest this please.

@turboFei
Copy link
Copy Markdown
Member Author

rebased.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Nov 12, 2019

Test build #113625 has finished for PR 25797 at commit 579c1ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Nov 12, 2019

Test build #113626 has finished for PR 25797 at commit 7d1301c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment thread core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala Outdated
Comment thread core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala Outdated
@HeartSaVioR
Copy link
Copy Markdown
Contributor

@turboFei Hi, could you address the review comments? This is good to have and seems close to be merged (according to #26416 (review) ).

@turboFei
Copy link
Copy Markdown
Member Author

@turboFei Hi, could you address the review comments? This is good to have and seems close to be merged (according to #26416 (review) ).

Thanks, I will address it as soon as possible.
Thanks for your reminder. @HeartSaVioR


log.appId.foreach { appId =>
val app = listing.read(classOf[ApplicationInfoWrapper], appId)
if (app.oldestAttempt() <= maxTime) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here the logic is consistent with cleanLogs().
But, I think there is an overlap between app.oldestAttempt() <= maxTime and attempt.info.lastUpdated.getTime() >= maxTime, even it does not matter.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Dec 13, 2019

Test build #115300 has finished for PR 25797 at commit c6ac35e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment thread core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala Outdated
@SparkQA
Copy link
Copy Markdown

SparkQA commented Dec 14, 2019

Test build #115334 has finished for PR 25797 at commit e9ebb6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Copy Markdown
Contributor

@vanzin vanzin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok now. Merging to master.


test("SPARK-29043: clean up specified event log") {
val clock = new ManualClock()
val conf = createTestConf().set(MAX_LOG_AGE_S.key, "0").set(CLEANER_ENABLED.key, "true")
Copy link
Copy Markdown
Contributor

@vanzin vanzin Dec 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to use .key here.

(I'll fix this during merge.)

@vanzin vanzin closed this in 5954311 Dec 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.