[SPARK-56666][INFRA] Reduce unidoc CI log noise with -Xdoclint:-missing and -verbose post-filter by cloud-fan · Pull Request #55605 · apache/spark

cloud-fan · 2026-04-29T11:40:39Z

JIRA: https://issues.apache.org/jira/browse/SPARK-56666

What changes were proposed in this pull request?

Refines the unidoc javacOptions in JavaUnidoc / unidoc / javacOptions and the post-process stream filter in docs/_plugins/build_api_docs.rb so that the Documentation generation CI log is small enough to scan visually while still surfacing per-file error: reference not found diagnostics on broken {@link} references.

Builds on the -Xmaxerrs and -verbose insight from #55581 (SPARK-56630 follow-up): javadoc's default -Xmaxerrs 100 cap was hit by the ~100 inert genjavadoc-stub errors during source loading, so doclint never ran on the real sources, and the per-file error: reference not found diagnostics surfaced only with -verbose. That PR's flag set (-Xmaxerrs 999999, -Xmaxwarns 999999, -verbose) achieved the diagnostic goal but at a ~77K-line CI log per run.

This PR keeps the diagnostic visibility and brings the visible CI log down to ~4K lines (95% reduction), with four changes:

1. -Xmaxerrs 0 instead of -Xmaxerrs 999999. The 0 value is treated as unlimited by javadoc (locally verified) and reads cleaner than the magic number.
2. -Xdoclint:all + -Xdoclint:-missing (two separate flags, matching the existing Compile / doc / javacOptions pattern in SparkBuild.scala). Suppresses the missing doclint group at javadoc level: the ~22K no comment / no @param / no @return / no @throws warnings (each rendered as a 3-line block) that dominate the log on every Spark unidoc run. The two-flag form is load-bearing: bare -Xdoclint:-missing alone demotes other doclint groups (notably reference) to warning level, making broken {@link} non-fatal; the explicit -Xdoclint:all first keeps reference at error level. Locally verified.
3. Drop -Xmaxwarns 999999. Warnings don't fail CI; error visibility is governed by -Xmaxerrs, not -Xmaxwarns. javadoc's default cap of 100 is sufficient: it shows a sample of any remaining warnings without flooding the log. Saves ~4K lines beyond -Xdoclint:-missing alone.
4. Post-filter -verbose progress lines from the build_api_docs.rb stream. -verbose itself stays (it is load-bearing for per-file error: reference not found emission per #55581), but its progress noise (Loading source file ..., [parsing started/completed], [loading /path/X.class], Generating /path/X.html) carries no diagnostic signal. The existing stream filter is extended with a verbose_line regex that drops these single-line progress entries from stdout. Saves ~13K lines.
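The post-filter in the last item can be sketched roughly as follows. This is a minimal illustration and not the actual code in build_api_docs.rb: the constant names, the `drop_verbose_progress` helper, and the optional `[info]`/`[warn]`/`[error]` sbt prefix are assumptions, and the exact wording of javadoc's progress lines may differ between JDK versions.

```ruby
# Hypothetical sketch of a -verbose progress filter; the real verbose_line
# regex in build_api_docs.rb is not shown in this PR text.
VERBOSE_PATTERNS = [
  /Loading source file .+/,                  # source-loading progress
  /\[parsing (?:started|completed).*\]/,     # parser start/finish markers
  /\[loading .+\.class\]/,                   # class-file loading markers
  /Generating .+\.html(?:\.\.\.)?/           # per-page HTML generation
].freeze

# A line is dropped only if the WHOLE line (after an optional, assumed sbt
# log-level prefix) is one of the progress patterns.
VERBOSE_LINE = /\A(?:\[(?:info|warn|error)\] )?#{Regexp.union(VERBOSE_PATTERNS)}\s*\z/

# Returns the lines that should still be printed to stdout.
def drop_verbose_progress(lines)
  lines.reject { |line| VERBOSE_LINE.match?(line) }
end
```

Anchoring the pattern to the whole line is a deliberate choice in this sketch: a diagnostic such as `error: reference not found` can never be dropped by accident, and a progress message wrapped in a different prefix is left in the log.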

Why are the changes needed?

Documentation generation CI logs were ~77K lines per run after SPARK-56630's flag set. That is large enough that scanning for diagnostics by eye is impractical, and grep-piping is the only reasonable workflow. Most of the volume is structural noise (genjavadoc stub errors, no comment warnings, -verbose progress markers) with no diagnostic signal. After this PR the log is ~4K lines on a real-failure run; the per-file error: reference not found diagnostics PR #55581 added are the dominant content.

Empirical breakdown of the reduction (verified end-to-end on this branch's earlier test commits with deliberately broken {@link} plants in both a real .java source and a Scala source):

| State | Log lines | Vs baseline |
| --- | ---: | ---: |
| PR #55581's flag set (baseline) | 77K | |
| Add `-Xdoclint:all,-missing` | 22K | -71% |
| Drop `-Xmaxwarns 999999` | 18K | -77% |
| Post-filter `-verbose` progress | ~4K | -95% |
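The percentages in the table follow directly from the rounded line counts. A small sketch of the arithmetic, using only the 77K / 22K / 18K / ~4K figures quoted above (the constant and helper names are illustrative):

```ruby
# Rounded log-line counts from the table above.
BASELINE = 77_000
STATES = {
  "Add -Xdoclint:all + -Xdoclint:-missing" => 22_000,
  "Drop -Xmaxwarns 999999"                 => 18_000,
  "Post-filter -verbose progress"          => 4_000
}.freeze

# Signed percentage change relative to the baseline, rounded to an integer.
def percent_change(baseline, value)
  ((value - baseline) * 100.0 / baseline).round
end

STATES.each do |state, lines|
  puts format("%-40s %6d  %+d%%", state, lines, percent_change(BASELINE, lines))
end
```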

All diagnostic targets remain visible: per-file error: reference not found for both Java sources and Scala sources via the genjavadoc stub.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Validated end-to-end on earlier (now reverted) test commits of this branch with planted broken {@link} references in both code paths:

ColumnarMap.java (real Java source): {@link org.apache.spark.deliberately.NoSuchClass} and {@link ColumnVector#nonExistentMethod()}.
Partition.scala (Scala source via genjavadoc): [[Partition.index]], which uses the wrong . separator that javadoc reads as inner-class lookup and fails to resolve. This is the case that the AGENTS.md note from PR #55581 documents as the most common scaladoc-side cause of unidoc failure.

Both surfaced as per-file error: reference not found diagnostics in the CI log on the test commit, doc gen failed as expected, log size dropped to 3,977 lines, and zero Loading source file / [parsing started] / [loading X.class] / Generating *.html / no comment lines remained visible. See the test-result comment below for the full breakdown.

-Xmaxerrs 0 and the demotion behavior of a bare -Xdoclint:-missing were verified locally with standalone javadoc invocations on a minimal test file.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (Anthropic)


cloud-fan · 2026-04-29T18:32:34Z

End-to-end test result

Validated on earlier branch commit a2bbe807909 with three deliberately broken {@link} references planted across one Java source and one Scala source (later reverted). The Documentation generation CI job ran and produced this log:

Doc gen status: FAILURE ✅ (broken refs are fatal)

Per-file diagnostics — all 4 visible:

[warn] /__w/spark/spark/core/target/java/org/apache/spark/Partition.java:3: error: reference not found
[warn] /__w/spark/spark/core/target/java/org/apache/spark/Partition.java:5: error: reference not found
[warn] /__w/spark/spark/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarMap.java:26: error: reference not found
[warn] /__w/spark/spark/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarMap.java:27: error: reference not found

The Scala plant [[Partition.index]] (with the wrong . separator) was correctly translated by genjavadoc into a stub {@link Partition.index}, and javadoc surfaced the resolution failure on the stub path core/target/java/org/apache/spark/Partition.java — exactly the failure mode #55581's AGENTS.md note documents.

Log composition:

| Pattern | Count |
| --- | --- |
| Total lines | 3,977 |
| `Loading source file` | 1 (a wrapped `Unexpected javac output: Loading source file ...` from sbt; different prefix, unfilterable) |
| `[parsing started]` / `[parsing completed]` | 0 |
| `[loading X.class]` | 0 |
| `Generating *.html` | 0 |
| `no comment` / `no @param` / `no @return` / `no @throws` | 0 each |
| Final summary | `4 errors / 2,408 warnings.` (default 100 cap on warning printing) |
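A tally like the one above can be reproduced from a saved log with a few lines of Ruby. This is only a sketch of one way to do it, not the method used for the numbers in this comment; the pattern set, the `tally` helper, and the `unidoc.log` file name in the usage note are assumptions.

```ruby
# Illustrative patterns for the categories listed in the table above.
LOG_PATTERNS = {
  "Loading source file"                     => /Loading source file/,
  "[parsing started] / [parsing completed]" => /\[parsing (?:started|completed)/,
  "[loading X.class]"                       => /\[loading .+\.class\]/,
  "Generating *.html"                       => /Generating .+\.html/,
  "no comment"                              => /no comment/,
  "error: reference not found"              => /error: reference not found/
}.freeze

# Counts, for each named pattern, how many lines contain a match.
def tally(lines)
  LOG_PATTERNS.transform_values { |re| lines.count { |line| re.match?(line) } }
end
```

With a saved log, `tally(File.readlines("unidoc.log"))` returns a hash from pattern name to line count, and `File.readlines("unidoc.log").size` gives the total.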

Reduction journey:

| State | Log lines | Vs baseline |
| --- | ---: | ---: |
| #55581's full flag set (`-Xmaxerrs 999999 -Xmaxwarns 999999 -verbose`) | 77K | baseline |
| + `-Xdoclint:all -Xdoclint:-missing` | 22K | -71% |
| + drop `-Xmaxwarns 999999` | 18K | -77% |
| + post-filter `-verbose` progress lines | 3,977 | -95% |

The current CI run on this PR's clean state (no plants) is expected to pass.

These pre-dated SPARK-14790 (2016) and quieted the per-task log output when scalastyle was hooked into (Compile/compile) and (Test/compile). After SPARK-56636 decoupled scalastyle from compile, the tasks are only invoked from dev/lint-scala, so the per-task logLevel settings reference unread keys (sbt's lintUnused surfaces them as warnings).

cloud-fan · 2026-04-30T06:18:34Z

The timeout issue is unrelated. Thanks for the review, merging to master!

…ry and CI annotations

After the noise filters from apache#55605, the Documentation generation CI log is about 4K lines. The two-line per-file fatal diagnostics (`error: reference not found`) are still buried in the middle of the log and the GitHub Actions check panel only shows "Process completed with exit code 1", which leaves reviewers grepping through the raw log to find the actual problem.

This change is purely additive -- it drops no existing log lines. After the unidoc pipe closes, `build_api_docs.rb` prints a trailing `Fatal javadoc errors (N):` block listing each captured diagnostic, then emits a `::error file=,line=::` GitHub Actions workflow command per diagnostic so they appear as inline annotations on the PR check panel.

Diagnostics are captured strictly within the Standard Doclet phase bracketed by `Building tree for all the packages and classes...` and `Building index for all classes...`, which is where doclint emits the build-failing diagnostics that count toward javadoc's exit code. Source-loading "error:" chatter outside that window is excluded.

The captured count is cross-checked against javadoc's own `N errors` summary line. If they diverge -- e.g. because a future JDK changes the Standard Doclet phase wording -- a `::warning::` workflow command is emitted so the drift is surfaced without silently masking real failures.

Co-authored-by: Isaac

… fatal-error summary Mirrors PR apache#55605's testing pattern. Plants two unresolvable references on the real Java path (ColumnarMap.java) and one on the genjavadoc stub path (Partition.scala) so the fatal-error summary added in the previous commit gets exercised end-to-end in CI. To be dropped before merge. Co-authored-by: Isaac


…ry and CI annotations

### What changes were proposed in this pull request?

After the noise filters from #55605, the Documentation generation CI log is around 4K lines on a failure run. The two-line per-file `error: reference not found` diagnostics are still buried in the middle of the log, and the GitHub Actions check panel for a failed doc-gen job only surfaces `Process completed with exit code 1`. Reviewers end up scrolling the raw log to find what actually broke.

This PR is purely additive in `docs/_plugins/build_api_docs.rb` -- no existing log lines are dropped. After the unidoc pipe closes:

1. A trailing `Fatal javadoc errors (N):` block is printed, listing each captured diagnostic with file, line, and message.
2. One `::error file=<path>,line=<line>,title=javadoc::<msg>` GitHub Actions workflow command is emitted per diagnostic, so they appear as inline annotations on the PR check panel instead of as a single opaque `exit code 1`.

Diagnostics are captured strictly within the Standard Doclet phase bracketed by `Building tree for all the packages and classes...` and `Building index for all classes...`, which is where doclint emits the build-failing diagnostics that count toward javadoc's exit code. Source-loading `error:` chatter outside that window is excluded -- it's already non-fatal and matches what javadoc's own `N errors` summary line counts.

As a self-check, the captured count is compared against javadoc's own `N errors` summary line. If they diverge -- e.g. because a future JDK changes the Standard Doclet phase wording -- a `::warning::` workflow command is emitted so the drift is surfaced without silently masking real failures.

### Why are the changes needed?

PR #55605 made the doc-gen log small enough to read, but the failure path is still discoverable only via grep. The per-file diagnostics emitted by doclint are the actionable content; promoting them to the PR check panel and a clearly delimited summary block makes a doc-gen failure self-explanatory without leaving the PR.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

End-to-end on this branch with deliberately broken references planted in two code paths (mirroring the test pattern from PR #55605):

- `ColumnarMap.java` (real Java source): `{link org.apache.spark.deliberately.NoSuchClass}` and `{link ColumnVector#nonExistentMethod()}`.
- `Partition.scala` (Scala source via genjavadoc): `[[Partition.index]]` -- the `.`-separator case that javadoc treats as inner-class lookup.

The Documentation generation job will fail with the expected `Fatal javadoc errors` summary block in the log and per-file inline annotations on this PR's check panel. The plant commit will be dropped before this PR is taken out of draft.

The state machine was also exercised locally against a captured log from a prior failing doc-gen run; the captured fatal count matches javadoc's `N errors` summary line.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (Anthropic)

Closes #55814 from cloud-fan/unidoc-fatal-summary.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
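The capture window and cross-check described in this commit message can be sketched as a small state machine. This is an illustration under stated assumptions, not the code merged in `build_api_docs.rb`: the method names, the diagnostic regex, the optional sbt prefix on the summary line, and the wording of the `::warning::` message are all hypothetical. Only the two phase-marker strings and the `::error file=<path>,line=<line>,title=javadoc::<msg>` shape come from the text above.

```ruby
# Phase markers quoted in the commit message.
TREE_START  = "Building tree for all the packages and classes..."
INDEX_START = "Building index for all classes..."

# Assumed shapes of a per-file diagnostic and of javadoc's summary line.
DIAGNOSTIC = /(?<file>\S+\.java):(?<line>\d+): error: (?<msg>.+)/
SUMMARY    = /\A(?:\[\w+\] )?(?<count>\d+) errors?\s*\z/

# Returns [captured diagnostics, error count reported by javadoc or nil].
def collect_fatal_diagnostics(lines)
  in_doclet = false
  captured = []
  reported = nil
  lines.each do |line|
    if line.include?(TREE_START)
      in_doclet = true                       # entering the Standard Doclet phase
    elsif line.include?(INDEX_START)
      in_doclet = false                      # leaving it again
    elsif in_doclet && (m = DIAGNOSTIC.match(line))
      captured << { file: m[:file], line: m[:line].to_i, msg: m[:msg].strip }
    elsif (m = SUMMARY.match(line))
      reported = m[:count].to_i              # javadoc's own "N errors" line
    end
  end
  [captured, reported]
end

# Builds one ::error command per diagnostic, plus a ::warning:: on count drift.
def workflow_commands(captured, reported)
  cmds = captured.map do |d|
    "::error file=#{d[:file]},line=#{d[:line]},title=javadoc::#{d[:msg]}"
  end
  if reported && reported != captured.size
    cmds << "::warning::captured #{captured.size} fatal javadoc errors, javadoc reported #{reported}"
  end
  cmds
end
```

Keeping the capture strictly inside the two markers mirrors the reasoning in the text: `error:` lines printed during source loading are ignored, and a mismatch with the summary count produces a warning instead of silently hiding a failure.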

…ry and CI annotations (cherry picked from commit 12b2595) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan force-pushed the unidoc-fixes-test branch 8 times, most recently from 41cb984 to 277354d Compare April 29, 2026 18:04

cloud-fan force-pushed the unidoc-fixes-test branch from a2bbe80 to 2618e91 Compare April 29, 2026 18:31

cloud-fan changed the title ~~[DO NOT MERGE] Test PR #55581's unidoc diagnostic banner~~ [SPARK-56666][INFRA] Reduce unidoc CI log noise with -Xdoclint:-missing and -verbose post-filter Apr 29, 2026

juliuszsompolski mentioned this pull request Apr 29, 2026

[SPARK-56630][INFRA][FOLLOWUP] Make unidoc surface real javadoc failures #55581

Closed

HyukjinKwon approved these changes Apr 30, 2026

View reviewed changes

cloud-fan closed this in b039387 Apr 30, 2026

srielau mentioned this pull request May 11, 2026

[SPARK-56750] default path config #55717

Closed

cloud-fan mentioned this pull request May 12, 2026

[SPARK-56832][INFRA] Surface fatal javadoc errors in unidoc log summary and CI annotations #55814

Closed

cloud-fan wants to merge 2 commits into apache:master from cloud-fan:unidoc-fixes-test
