
Change conversion mechanism to create better markdown files #71

Merged
umago merged 8 commits into openstack-lightspeed:main from Akrog:pandoc-conversion
Nov 5, 2025

Contributor

@Akrog Akrog commented Oct 27, 2025

We detected a series of issues with the markdown files we generate from the AsciiDoc sources. In this PR we (Cursor and I) aim to fix them.

The major change in this PR is a different conversion process: we now use asciidoctor to convert the sources to DocBook5 XML and then use pandoc to convert that XML to Markdown.
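
As a rough illustration of that two-step process, here is a minimal Python sketch. It is not the code from this PR: the function names, the output layout, the `markdown` target format, and the use of `--lua-filter` are assumptions made for the example.

```python
import subprocess
from pathlib import Path


def build_commands(adoc_path, out_dir, lua_filters=()):
    """Return the (asciidoctor, pandoc) command lines for one document.

    Hypothetical helper: paths and filter names are illustrative only.
    """
    adoc = Path(adoc_path)
    out = Path(out_dir)
    xml = out / (adoc.stem + ".xml")
    md = out / (adoc.stem + ".md")

    # Step 1: AsciiDoc -> DocBook5 XML
    asciidoctor_cmd = ["asciidoctor", "-b", "docbook5", "-o", str(xml), str(adoc)]

    # Step 2: DocBook XML -> Markdown, optionally through Lua filters
    pandoc_cmd = ["pandoc", "-f", "docbook", "-t", "markdown", "-o", str(md)]
    pandoc_cmd += [f"--lua-filter={name}" for name in lua_filters]
    pandoc_cmd.append(str(xml))
    return asciidoctor_cmd, pandoc_cmd


def convert(adoc_path, out_dir, lua_filters=()):
    """Run both steps; raises CalledProcessError if either tool fails."""
    for cmd in build_commands(adoc_path, out_dir, lua_filters):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes the pipeline easy to inspect or log before anything is run.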

Unfortunately we encountered multiple issues in every part of the process:

  • Source files that don't conform to the AsciiDoc specifications
  • asciidoctor bugs/issues/limitations
  • pandoc bugs/issues/limitations

So we had to work around them in some hackish ways such as manually processing source documents and using filters.

We also had to get creative when we wanted to convert things like Jira bug references to headers so they could be adequately chunked.

@Akrog Akrog requested a review from a team as a code owner October 27, 2025 13:57
@Akrog Akrog force-pushed the pandoc-conversion branch from 20d7c6f to c678a8a Compare October 27, 2025 18:34
Contributor Author

Akrog commented Oct 27, 2025

I should have fixed all ruff issues; I just hope I didn't break anything while doing so...

@Akrog Akrog force-pushed the pandoc-conversion branch from c678a8a to 93a68d4 Compare October 28, 2025 10:29
Contributor

@lpiwowar lpiwowar left a comment

Honestly, I feel a bit puzzled about this PR 🙈 .

It's mostly about the AI-generated code. I don't have an issue with using AI-generated code at all.

But I'm thinking, especially for this PR and repo. I would personally prefer that code that is fully AI-generated (you know, the kind where you didn't write a single bit of it and just used AI with almost closed eyes) gets separated out.

  • For example, when it comes to the filters, I would move them to ./scripts/filters.
  • Also, for the rhoso_adoc_docs_to_text.py file, I would move the AI-generated code at least to a separate file and mark it properly with a very detailed docstring at the beginning explaining what it does.

This would make it clearer at least for me what's what and easier to maintain. I do not know. I just felt overwhelmed by this PR. Maybe tomorrow I would feel differently about this, but this is my opinion now.

I think it is a pretty useful PR. Thank you 🙏 I guess there will be a lot more conversations about AI-generated code in the near future.

Change how the release notes are generated to improve the input to the
RAG chunking.

We change how the docs are generated. Now we:
- Convert AsciiDoc files to DocBook5 XML using `asciidoctor`
- Convert DocBook5 XML to Markdown files using `pandoc`

As part of this improvement we create a Pandoc filter that converts
Jira tickets that appear only in bold into headers, so they can be
properly split.

AI: Cursor generated the Pandoc filter
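
To make the idea concrete, here is a hedged Python sketch of the transformation on Pandoc's JSON AST. It is not the filter from this PR: the Jira key pattern, the header level, and the function names are assumptions chosen for illustration.

```python
import re

# Assumed shape of a Jira key, e.g. OSPRH-1234 (hypothetical pattern).
JIRA_KEY = re.compile(r"^[A-Z][A-Z0-9]+-\d+$")


def inline_text(inlines):
    """Flatten Str and Space inline nodes into plain text."""
    parts = []
    for node in inlines:
        if node["t"] == "Str":
            parts.append(node["c"])
        elif node["t"] == "Space":
            parts.append(" ")
    return "".join(parts)


def promote_jira_paragraphs(blocks, level=3):
    """Turn paragraphs that are only a bold Jira key into headers."""
    out = []
    for block in blocks:
        is_bold_only = (
            block["t"] == "Para"
            and len(block["c"]) == 1
            and block["c"][0]["t"] == "Strong"
        )
        if is_bold_only:
            strong = block["c"][0]
            text = inline_text(strong["c"])
            if JIRA_KEY.match(text):
                # Header content is [level, [id, classes, key-values], inlines]
                attr = [text.lower(), [], []]
                out.append({"t": "Header", "c": [level, attr, strong["c"]]})
                continue
        out.append(block)
    return out
```

With each ticket promoted to a header, a header-based chunker can split the release notes at ticket boundaries.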
Contributor

@jpodivin jpodivin left a comment

I realize that I'm not active in the project anymore. But I got a review request, so why not review?

@Akrog Akrog force-pushed the pandoc-conversion branch from 1f0b32c to 1087785 Compare November 4, 2025 11:15
Akrog added 7 commits November 4, 2025 18:46
In this patch we move the document conversion to use `asciidoctor` and
`pandoc` like we did with the release notes.

There are some things that need to be worked around (fixing list
formatting, not increasing header depth because of the doc title, making
lists more compact) and we do this with 2 Pandoc filters.

Additionally this patch works around some limitations and bugs in both
`asciidoctor` and `pandoc` with malformed nested inline elements:
  - <link><literal>...</link></literal> (wrong closing order)
  - <link><literal>text1</link>text2</literal> (content spans closing
    tags)
  - Other improperly nested inline elements
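
The first two patterns above can be repaired with plain text substitution before the XML reaches `pandoc`. The following is a minimal sketch under that assumption, not the code from this patch; the regular expressions only cover a `<literal>` directly inside a `<link>` with no further nested tags.

```python
import re
import xml.etree.ElementTree as ET

# <link><literal>text1</link>text2</literal>: content spans the closing tags
SPANNING = re.compile(
    r"(<link\b[^>]*>)<literal>([^<]*)</link>([^<]+)</literal>"
)
# <link><literal>...</link></literal>: wrong closing order
SWAPPED = re.compile(r"(<link\b[^>]*>)<literal>([^<]*)</link></literal>")


def fix_nested_inlines(xml_text):
    """Rewrite the two malformed link/literal patterns into valid nesting."""
    xml_text = SPANNING.sub(
        r"\1<literal>\2</literal></link><literal>\3</literal>", xml_text
    )
    return SWAPPED.sub(r"\1<literal>\2</literal></link>", xml_text)


def is_well_formed(xml_text):
    """Return True if the fragment parses as XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

In the spanning case the trailing text is kept as a separate `<literal>` outside the link, which preserves the monospace formatting without guessing at the intended link target.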

It also fixes some source document issues with empty tables and tables
that are not properly closed.

AI: Code generated by Cursor

Some documents have headers in bold, which is unnecessary, so we improve
the filter to remove this formatting in headers.

Unfortunately our source documentation is not 100% conformant with the
AsciiDoc spec, so we need to work around this situation.

In this patch we add code that modifies the asciidoc sources to resolve
non-conformant files such as:

- Callouts in non-source blocks
- Non-sequential callouts
- Link text containing square brackets
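
As an illustration of the non-sequential callout case, here is a small Python sketch that renumbers callouts in order of first appearance. It is not the code from this patch, and it assumes the text passed in contains exactly one code block together with its callout list, and that every `<N>` in it is a callout marker.

```python
import re

CALLOUT = re.compile(r"<(\d+)>")


def renumber_callouts(text):
    """Renumber <N> markers to 1, 2, 3... in order of first appearance."""
    mapping = {}
    for match in CALLOUT.finditer(text):
        # The first time a number is seen it gets the next sequential value.
        mapping.setdefault(match.group(1), str(len(mapping) + 1))
    return CALLOUT.sub(lambda m: f"<{mapping[m.group(1)]}>", text)
```

Because the same mapping is applied to the code lines and to the callout list, the references stay paired after renumbering.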

AI: Code generated by Cursor

More complex callout fixes in our source documents:

- Each code block must have its own callouts; we cannot have 2 code
  blocks with the callouts for both at the end of the second one.
- Ensure there are blank lines after callout definitions.
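
The blank-line rule can be sketched as follows. This is an illustrative Python example, not the code from this patch, and it assumes each callout definition fits on a single line starting with `<N>`.

```python
import re

CALLOUT_DEF = re.compile(r"^<\d+>\s")


def blank_after_callouts(lines):
    """Insert a blank line after the last definition of each callout list."""
    out = []
    for index, line in enumerate(lines):
        out.append(line)
        following = lines[index + 1] if index + 1 < len(lines) else None
        ends_list = (
            CALLOUT_DEF.match(line)
            and following is not None
            and following.strip() != ""
            and not CALLOUT_DEF.match(following)
        )
        if ends_list:
            out.append("")
    return out
```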

AI: Code generated by Cursor

Fix some Pandoc table issues/limitations:

- Fix XML parsing issues in tables
- Fix tables appearing inside code blocks (with a Lua filter)
- Pandoc has table length limitations that make it dump XML/HTML in the
  output instead of markdown.

Fix issues in our documentation:

- Pipe characters on standalone lines
- Malformed table structures (missing cells, incorrect colspan/rowspan)

And while we are at it we compact table formatting for better
readability.
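
One way to compact a Markdown pipe table is to strip the padding from every cell and shorten the separator row while keeping its alignment colons. The sketch below shows that approach in Python; it is not the code from this patch, and it assumes cells contain no escaped pipe characters.

```python
import re

SEPARATOR_CELL = re.compile(r"^:?-+:?$")


def compact_table_line(line):
    """Strip cell padding from one pipe-table row; leave other lines as-is."""
    stripped = line.strip()
    is_row = (
        len(stripped) > 1
        and stripped.startswith("|")
        and stripped.endswith("|")
    )
    if not is_row:
        return line

    compact = []
    for cell in stripped[1:-1].split("|"):
        cell = cell.strip()
        if SEPARATOR_CELL.match(cell):
            # Keep alignment markers, shorten the dashes to three.
            left = ":" if cell.startswith(":") else ""
            right = ":" if cell.endswith(":") else ""
            cell = f"{left}---{right}"
        compact.append(cell)
    return "| " + " | ".join(compact) + " |"
```

A lone `|` on its own line is deliberately left untouched here, since it is not a table row.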

AI: Code generated by Cursor

This patch improves/fixes logs:

- Skip backup documentation directories to avoid showing unnecessary
  errors
- Prevent concurrent file edits
- Fix incorrect error messages when there's nothing wrong with the
  source files
- Remove duplicated log messages
- Show correct status of applied fixes

AI: Code generated by Cursor

Attributes were being dumped at the beginning of documents like this:

```

Ceph: Red Hat Ceph Storage
CephCluster: Red Hat Ceph Storage
CephVernum: 7
MessageBus: AMQ-Interconnect
```

This was because we were passing an attributes file to the processor
instead of an asciidoctor file with the attributes.

Upon inspection we don't even need to pass these attributes, because the
source documents already use the include directive to include their
specific attribute files, which in some cases is different from the
global one we were previously using.

Example:

```
include::assemblies/common/global/adoption-attributes.adoc[]
```

In this patch we stop passing the `--attributes-file` argument to the
`rhoso_adoc_docs_to_text.py` script.

We also remove the `RHOSO_DOCS_ATTRIBUTES_FILE_URL` environment
variable that is no longer needed.
Contributor

@lpiwowar lpiwowar left a comment

LGTM! 👍 This improves the chunks quite a lot! ➕

Three new issues we might want to address if the repository has a longer future ahead:

Thank you! 🙏

@umago umago merged commit 62439fb into openstack-lightspeed:main Nov 5, 2025
2 checks passed

4 participants