From 981aa13c879dcf3a6bc2aaac23c407093c8dabf9 Mon Sep 17 00:00:00 2001 From: "yi.wu" Date: Wed, 17 Jun 2020 13:28:47 +0000 Subject: [PATCH] [SPARK-32000][CORE][TESTS] Fix the flaky test for partially launched task in barrier-mode This PR changes the test to get an active executorId and set it as preferred location instead of setting a fixed preferred location. The test is flaky. After checking the [log](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124086/artifact/core/), I find the root cause is: Two test cases from different test suites got submitted at the same time because of concurrent execution. In this particular case, the two test cases (from DistributedSuite and BarrierTaskContextSuite) both launch under local-cluster mode. The two applications are submitted at the SAME time so they have the same applications(app-20200615210132-0000). Thus, when the cluster of BarrierTaskContextSuite is launching executors, it failed to create the directory for the executor 0, because the path (/home/jenkins/workspace/work/app-app-20200615210132-0000/0) has been used by the cluster of DistributedSuite. Therefore, it has to launch executor 1 and 2 instead, that lead to non of the tasks can get preferred locality thus they got scheduled together and lead to the test failure. No. The test can not be reproduced locally. We can only know it's been fixed when it's no longer flaky on Jenkins. Closes #28849 from Ngone51/fix-spark-32000. Authored-by: yi.wu Signed-off-by: Wenchen Fan --- .../apache/spark/scheduler/BarrierTaskContextSuite.scala | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala index 469cc4ad66be9..92a97d1eba51d 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala @@ -159,11 +159,13 @@ class BarrierTaskContextSuite extends SparkFunSuite with LocalSparkContext { .setAppName("test-cluster") .set("spark.test.noStageRetry", "true") sc = new SparkContext(conf) + TestUtils.waitUntilExecutorsUp(sc, 2, 6000) + val id = sc.getExecutorIds().head val rdd0 = sc.parallelize(Seq(0, 1, 2, 3), 2) val dep = new OneToOneDependency[Int](rdd0) - // set up a barrier stage with 2 tasks and both tasks prefer executor 0 (only 1 core) for + // set up a barrier stage with 2 tasks and both tasks prefer the same executor (only 1 core) for // scheduling. So, one of tasks won't be scheduled in one round of resource offer. - val rdd = new MyRDD(sc, 2, List(dep), Seq(Seq("executor_h_0"), Seq("executor_h_0"))) + val rdd = new MyRDD(sc, 2, List(dep), Seq(Seq(s"executor_h_$id"), Seq(s"executor_h_$id"))) val errorMsg = intercept[SparkException] { rdd.barrier().mapPartitions { iter => BarrierTaskContext.get().barrier()