Change conversion mechanism to create better markdown files by Akrog · Pull Request #71 · openstack-lightspeed/rag-content

Akrog · 2025-10-27T13:57:41Z

We detected a series of issues with the Markdown files we generate from the AsciiDoc sources. In this PR we (Cursor and I) aim to fix them.

The major change in this PR is a different conversion process: we now use asciidoctor to convert the sources to DocBook5 XML and then use pandoc to convert that XML to Markdown.
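To make the two-step conversion concrete, here is a minimal Python sketch of how such a pipeline could be driven. It is not the code from this PR: the helper names `build_commands` and `convert` and the output layout are assumptions for illustration. Only the `asciidoctor -b docbook5` backend option and the `pandoc -f docbook -t markdown --lua-filter=...` options are standard CLI flags.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def build_commands(adoc: Path, out_dir: Path,
                   lua_filters: Sequence[Path] = ()) -> List[List[str]]:
    """Return the argv lists for AsciiDoc -> DocBook5 XML -> Markdown.

    Hypothetical helper: file naming and layout are assumptions.
    """
    xml_path = out_dir / f"{adoc.stem}.xml"
    md_path = out_dir / f"{adoc.stem}.md"
    # Step 1: asciidoctor renders the AsciiDoc source with the docbook5 backend.
    to_docbook = ["asciidoctor", "-b", "docbook5", "-o", str(xml_path), str(adoc)]
    # Step 2: pandoc reads DocBook and writes Markdown, applying any Lua filters.
    to_markdown = ["pandoc", "-f", "docbook", "-t", "markdown", "-o", str(md_path)]
    to_markdown += [f"--lua-filter={path}" for path in lua_filters]
    to_markdown.append(str(xml_path))
    return [to_docbook, to_markdown]


def convert(adoc: Path, out_dir: Path, lua_filters: Sequence[Path] = ()) -> None:
    """Run both steps, raising CalledProcessError if either tool fails."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for command in build_commands(adoc, out_dir, lua_filters):
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes the pipeline easy to inspect or test without having either tool installed.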

Unfortunately we encountered multiple issues in every part of the process:

- Source files that don't conform to the AsciiDoc specification
- asciidoctor bugs/issues/limitations
- pandoc bugs/issues/limitations

So we had to work around them in some hackish ways such as manually processing source documents and using filters.

We also had to get creative when we wanted to convert things like Jira bug references to headers so they could be adequately chunked.

Akrog · 2025-10-27T18:35:30Z

I should have fixed all ruff issues, I just hope I didn't break anything while doing so...

lpiwowar

Honestly, I feel a bit puzzled about this PR 🙈 .

It's mostly about the AI-generated code. I don't have an issue with using AI-generated code at all.

But I'm thinking, especially for this PR and repo, that I would personally prefer that code that is fully AI-generated (you know, the kind where you didn't write a single bit of it and just used AI with almost closed eyes) gets separated out.

For example, when it comes to the filters, I would move them to ./scripts/filters.
Also, for the rhoso_adoc_docs_to_text.py file, I would move the AI-generated code at least to a separate file and mark it properly with a very detailed docstring at the beginning explaining what it does.

This would make it clearer at least for me what's what and easier to maintain. I do not know. I just felt overwhelmed by this PR. Maybe tomorrow I would feel differently about this, but this is my opinion now.

I think it is a pretty useful PR. Thank you 🙏 I guess there will be a lot more conversations about AI-generated code in the near future.

Containerfile

scripts/rhoso_adoc_docs_to_text.py

scripts/filters/tightlists.lua

scripts/rhoso_adoc_docs_to_text.py

Change how the release notes are generated to improve the input to the RAG chunking.

We change how the docs are generated, now we:
- Convert AsciiDoc files to DocBook5 XML using `asciidoctor`
- Convert DocBook5 XML to Markdown files using `pandoc`

As part of this improvement we create a Pandoc filter that converts Jira tickets that appear only in bold into headers, so they can be properly split.

AI: Cursor generated the Pandoc filter
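The bold-to-header idea can be sketched against Pandoc's JSON AST. The filter in this PR is a Cursor-generated Pandoc filter whose source is not shown here, so the Python below is only an illustration of the technique: the function names, the header level, and the Jira key pattern (with `OSPRH-1234` as a made-up example key) are all assumptions.

```python
import re

# Assumed ticket format: uppercase project key, dash, number (e.g. OSPRH-1234).
JIRA_KEY = re.compile(r"^[A-Z][A-Z0-9]+-\d+\b")


def plain_text(inlines):
    """Flatten Str/Space nodes (and nested Strong/Emph) into a plain string."""
    parts = []
    for node in inlines:
        kind = node["t"]
        if kind == "Str":
            parts.append(node["c"])
        elif kind == "Space":
            parts.append(" ")
        elif kind in ("Strong", "Emph"):
            parts.append(plain_text(node["c"]))
    return "".join(parts)


def promote_jira_paragraphs(blocks, level=3):
    """Turn paragraphs that are a single bold Jira reference into headers.

    Only top-level blocks are inspected; `level` is an arbitrary choice.
    """
    result = []
    for block in blocks:
        is_bold_only = (
            block["t"] == "Para"
            and len(block["c"]) == 1
            and block["c"][0]["t"] == "Strong"
        )
        if is_bold_only:
            inlines = block["c"][0]["c"]
            if JIRA_KEY.match(plain_text(inlines)):
                # Header content is [level, [id, classes, key-values], inlines].
                result.append({"t": "Header", "c": [level, ["", [], []], inlines]})
                continue
        result.append(block)
    return result
```

A JSON filter built on this would read the document from stdin, apply the function to its `blocks` list, and write the result to stdout.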

jpodivin

I realize that I'm not active in the project anymore. But I've got a request, so why not review?

Containerfile

scripts/rhoso_adoc_docs_to_text.py

In this patch we move the document conversion to use `asciidoctor` and `pandoc` like we did with the release notes.

There are some things that need to be worked around (list formatting, not increasing header depth due to the doc title, making lists more compact) and we do this with 2 Pandoc filters.

Additionally this patch works around some limitations and bugs in both `asciidoctor` and `pandoc` with malformed nested inline elements:
- <link><literal>...</link></literal> (wrong closing order)
- <link><literal>text1</link>text2</literal> (content spans closing tags)
- Other improperly nested inline elements

It also fixes some source document issues with empty tables and tables that are not properly closed.

AI: Code generated by Cursor
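As an illustration of the first two nesting problems listed in this commit message, the following sketch repairs them with a regular expression before the XML is handed to pandoc. It is an assumption about one possible approach, not the PR's implementation. `fix_link_literal_nesting` is a hypothetical name, and the pattern only handles a `<literal>` that directly follows `<link>` and contains plain text.

```python
import re

# <link ...><literal>TEXT</link>TRAILING</literal>, where TEXT and TRAILING
# contain no further tags. TRAILING may be empty (plain wrong closing order).
_BAD_NESTING = re.compile(
    r"(<link\b[^>]*>\s*<literal>)([^<]*)</link>([^<]*)</literal>"
)


def _repair(match):
    opening, inner, trailing = match.group(1), match.group(2), match.group(3)
    fixed = f"{opening}{inner}</literal></link>"
    if trailing.strip():
        # Text that spilled past </link> gets its own <literal> element.
        fixed += f"<literal>{trailing}</literal>"
    else:
        fixed += trailing
    return fixed


def fix_link_literal_nesting(xml: str) -> str:
    """Close <literal> before <link> so the inline elements nest properly."""
    return _BAD_NESTING.sub(_repair, xml)
```

Well-formed input such as `<link><literal>x</literal></link>` does not match the pattern and passes through unchanged.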

Some documents have headers in bold, which is unnecessary, so we improve the filter to remove this formatting in headers.
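A rough sketch of removing bold from headers, again expressed on Pandoc's JSON AST rather than as the PR's actual filter (which is not shown here). The function names are hypothetical, and only `Strong` nodes at the top level of the header, or nested directly inside another `Strong`, are unwrapped.

```python
def unwrap_strong(inlines):
    """Replace each Strong node with its children, keeping other nodes as-is."""
    result = []
    for node in inlines:
        if node["t"] == "Strong":
            result.extend(unwrap_strong(node["c"]))
        else:
            result.append(node)
    return result


def strip_bold_from_headers(blocks):
    """Return blocks where Header content no longer carries bold formatting."""
    result = []
    for block in blocks:
        if block["t"] == "Header":
            # Header content is [level, attributes, inlines].
            level, attr, inlines = block["c"]
            block = {"t": "Header", "c": [level, attr, unwrap_strong(inlines)]}
        result.append(block)
    return result
```

Paragraphs and other block types are left untouched, so bold text in the body of the document is preserved.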

Unfortunately our source documentation is not 100% conformant with the AsciiDoc spec, so we need to work around this situation.

In this patch we add code that modifies the AsciiDoc sources to resolve non-conformant files such as:
- Callouts in non-source blocks
- Non-sequential callouts
- Link text containing square brackets

AI: Code generated by Cursor
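For the non-sequential callout case, one possible fix is to renumber callouts in order of first appearance. The sketch below is an assumption about how that could work, not the PR's code. `renumber_callouts` is a hypothetical helper that expects a snippet containing a single code block plus its callout list, and it would also rewrite any other `<N>` text in that snippet.

```python
import re

CALLOUT = re.compile(r"<(\d+)>")


def renumber_callouts(snippet: str) -> str:
    """Renumber <N> callout markers to 1, 2, 3... by first appearance."""
    mapping = {}
    for match in CALLOUT.finditer(snippet):
        # The first time a number is seen it gets the next sequential value.
        mapping.setdefault(match.group(1), str(len(mapping) + 1))
    return CALLOUT.sub(lambda m: f"<{mapping[m.group(1)]}>", snippet)
```

Because markers in the code and their definitions share one mapping, a block using `<1>` and `<3>` ends up with `<1>` and `<2>` in both places.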

More complex callout fixes in our source documents:
- Each code block shall have its own callouts; we cannot have 2 code blocks and the callouts for both at the end of the second one.
- Ensure there are blank lines after callout definitions.

AI: Code generated by Cursor
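The blank-line rule could be enforced with a small line-based pass like the one below. This is a simplified sketch, not the PR's implementation. It assumes every callout definition fits on a single line starting with `<N> `, so multi-line definitions would need extra handling, and `ensure_blank_after_callouts` is a hypothetical name.

```python
import re

# A callout definition line, e.g. "<1> Explanation of the first marker".
CALLOUT_DEF = re.compile(r"^<\d+>\s")


def ensure_blank_after_callouts(lines):
    """Insert an empty line after the last definition of each callout list."""
    result = []
    for index, line in enumerate(lines):
        result.append(line)
        following = lines[index + 1] if index + 1 < len(lines) else None
        ends_list = (
            CALLOUT_DEF.match(line)
            and following is not None
            and following.strip() != ""
            and not CALLOUT_DEF.match(following)
        )
        if ends_list:
            result.append("")
    return result
```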

Fix some Pandoc table issues/limitations:
- Fix XML parsing issues in tables
- Fix tables appearing inside code blocks (with a Lua filter)
- Pandoc has table length limitations that make it dump XML/HTML in the output instead of Markdown.

Fix issues in our documentation:
- Pipe characters on standalone lines
- Malformed table structures (missing cells, incorrect colspan/rowspan)

And while we are at it, we compact table formatting for better readability.

AI: Code generated by Cursor

This patch improves/fixes logs:
- Skip backup documentation directories to avoid showing unnecessary errors
- Prevent concurrent file edits
- Fix incorrect error messages when there's nothing wrong with the source files
- Remove duplicated log messages
- Show correct status of applied fixes

AI: Code generated by Cursor

Attributes were being dumped at the beginning of documents like this:

```
Ceph: Red Hat Ceph Storage
CephCluster: Red Hat Ceph Storage
CephVernum: 7
MessageBus: AMQ-Interconnect
```

This was because we were passing an attributes file to the processor instead of an asciidoctor file with the attributes.

Upon inspection we don't even need to pass these attributes, because the source documents already use the include directive to include their specific attribute files, which in some cases are different from the global one we were previously using. Example:

```
include::assemblies/common/global/adoption-attributes.adoc[]
```

In this patch we stop passing the `--attributes-file` argument to the `rhoso_adoc_docs_to_text.py` script. We also remove the `RHOSO_DOCS_ATTRIBUTES_FILE_URL` environment variable, which is no longer needed.
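One way to confirm that a source document already pulls in its own attributes is to scan it for include directives. The helper below is a hypothetical check, not part of the PR, and the `attributes.adoc` suffix is an assumption based on the example path in the commit message.

```python
import re

# Matches lines of the form: include::path/to/file.adoc[optional,attrs]
INCLUDE = re.compile(r"^include::([^\[\s]+)\[[^\]]*\]\s*$", re.MULTILINE)


def attribute_includes(adoc_text: str):
    """Return include targets whose file name ends in 'attributes.adoc'."""
    return [
        target
        for target in INCLUDE.findall(adoc_text)
        if target.endswith("attributes.adoc")
    ]
```

A document for which this returns an empty list would be a candidate for closer inspection before dropping a global attributes file.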

lpiwowar

LGTM! 👍 This improves the chunks quite a lot! ➕

Three new issues we might want to address if the repository has a longer future ahead:

Thank you! 🙏

scripts/rhoso_adoc_docs_to_text.py

Akrog requested a review from a team as a code owner October 27, 2025 13:57

Akrog force-pushed the pandoc-conversion branch from 20d7c6f to c678a8a on October 27, 2025 18:34

Akrog force-pushed the pandoc-conversion branch from c678a8a to 93a68d4 on October 28, 2025 10:29

lpiwowar reviewed Oct 30, 2025

Akrog force-pushed the pandoc-conversion branch from 79c8600 to 87a17a5 on November 3, 2025 09:54

lpiwowar mentioned this pull request Nov 3, 2025

Update filters to be compatible with the latest version of pandoc #73

Open

Akrog force-pushed the pandoc-conversion branch 3 times, most recently from bbb2508 to 1f0b32c on November 3, 2025 10:58

jpodivin reviewed Nov 3, 2025

Akrog force-pushed the pandoc-conversion branch from 1f0b32c to 1087785 on November 4, 2025 11:15

Akrog added 7 commits November 4, 2025 18:46

Fix doc header formatting

2b71ddf

Some documents have headers in bold, which is unnecessary, so we improve the filter to remove this formatting in headers.

Akrog force-pushed the pandoc-conversion branch from 1087785 to 974322f on November 4, 2025 17:49

This was referenced Nov 4, 2025

Polish OUTPUT_FILE_EXTENSION #74

Open

Investigate whether we want to install Pandoc from EPEL #75

Open

lpiwowar approved these changes Nov 4, 2025

umago approved these changes Nov 5, 2025

umago merged commit 62439fb into openstack-lightspeed:main Nov 5, 2025
2 checks passed

umago merged 8 commits into openstack-lightspeed:main from Akrog:pandoc-conversion

4 participants
