Change conversion mechanism to create better markdown files #71
umago merged 8 commits into openstack-lightspeed:main
I should have fixed all ruff issues; I just hope I didn't break anything while doing so...
lpiwowar left a comment
Honestly, I feel a bit puzzled about this PR 🙈 .
It's mostly about the AI-generated code. I don't have an issue with using AI-generated code at all.
But, especially for this PR and repo, I would personally prefer that code that is fully AI-generated (you know, the kind where you didn't write a single bit of it and just used AI with almost closed eyes) gets separated out.
- For example, when it comes to the filters, I would move them to `./scripts/filters`.
- Also, for the `rhoso_adoc_docs_to_text.py` file, I would move the AI-generated code at least to a separate file and mark it properly with a very detailed docstring at the beginning explaining what it does.
This would make it clearer at least for me what's what and easier to maintain. I do not know. I just felt overwhelmed by this PR. Maybe tomorrow I would feel differently about this, but this is my opinion now.
I think it is a pretty useful PR. Thank you 🙏 I guess there will be a lot more conversations about AI-generated code in the near future.
Change how the release notes are generated to improve the input to the RAG chunking.

We change how the docs are generated, now we:

- Convert AsciiDoc files to DocBook5 XML using `asciidoctor`
- Convert DocBook5 XML to Markdown files using `pandoc`

As part of this improvement we create a Pandoc filter that converts the Jira tickets that are just in bold to headers, so they can be properly split.

AI: Cursor generated the Pandoc filter
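The two conversion steps described in this commit message can be sketched roughly as follows. This is only an illustration, not the code in the PR: the helper names, the `gfm` output format, and the filter path are assumptions, while `asciidoctor -b docbook5` and `pandoc -f docbook ... --lua-filter` are standard options of those tools.

```python
import subprocess
from pathlib import Path
from typing import List, Optional, Tuple


def build_commands(
    adoc: Path, out_dir: Path, lua_filter: Optional[Path] = None
) -> Tuple[List[str], List[str]]:
    """Return the asciidoctor and pandoc command lines for one document."""
    xml = out_dir / (adoc.stem + ".xml")
    md = out_dir / (adoc.stem + ".md")
    # Step 1: AsciiDoc -> DocBook5 XML
    asciidoctor_cmd = ["asciidoctor", "-b", "docbook5", "-o", str(xml), str(adoc)]
    # Step 2: DocBook XML -> Markdown ("gfm" is an assumed flavour)
    pandoc_cmd = ["pandoc", "-f", "docbook", "-t", "gfm", "-o", str(md), str(xml)]
    if lua_filter is not None:
        # Hypothetical filter path; the real filters live in the repository
        pandoc_cmd += ["--lua-filter", str(lua_filter)]
    return asciidoctor_cmd, pandoc_cmd


def convert(adoc: Path, out_dir: Path, lua_filter: Optional[Path] = None) -> None:
    """Run both steps, stopping at the first tool that fails."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(adoc, out_dir, lua_filter):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes the pipeline easy to inspect or log before anything is run.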
jpodivin left a comment
I realize that I'm not active in the project anymore. But I've got a review request, so why not review?
In this patch we move the document conversion to use `asciidoctor` and
`pandoc` like we did with the release notes.
There are some things that need to be worked around (such as fixing list
formatting, not increasing header depth because of the doc title, and making
lists more compact) and we do this with 2 Pandoc filters.
Additionally this patch works around some limitations and bugs in both
`asciidoctor` and `pandoc` with malformed nested inline elements:
- <link><literal>...</link></literal> (wrong closing order)
- <link><literal>text1</link>text2</literal> (content spans closing
tags)
- Other improperly nested inline elements
It also fixes some source document issues with empty tables and tables
that are not properly closed.
AI: Code generated by Cursor
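As an illustration of the first malformed pattern listed above (wrong closing order), a text-level repair could look like the sketch below. It is an assumption about one possible approach, not the PR's actual code; the function name is hypothetical, and it deliberately handles only the simple case where the literal contains plain text.

```python
import re

# <link ...><literal>text</link></literal>  (closing tags in the wrong order)
SWAPPED_CLOSE = re.compile(
    r"(<link\b[^>]*>\s*<literal>)([^<]*)</link>\s*</literal>"
)


def fix_swapped_closers(xml: str) -> str:
    """Swap the closing tags so that <literal> closes before <link>."""
    return SWAPPED_CLOSE.sub(r"\1\2</literal></link>", xml)
```

Well-formed input is left untouched because the pattern only matches when `</link>` appears directly before `</literal>`. The second pattern in the list, where content spans the closing tags, would need a separate rule.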
Some documents have headers in bold, which is unnecessary, so we improve the filter to remove this formatting in headers.
Unfortunately our source documentation is not 100% conformant with the AsciiDoc spec, so we need to work around this situation.

In this patch we add code that modifies the AsciiDoc sources to resolve non-conformant files such as:

- Callouts in non-source blocks
- Non-sequential callouts
- Link text containing square brackets

AI: Code generated by Cursor
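The "non-sequential callouts" fix can be pictured with a small sketch like this one. It is not the code from the patch: the function name is hypothetical, and it assumes the text passed in is a single code block plus its callout list, with `<N>` used only for callouts.

```python
import re

CALLOUT = re.compile(r"<(\d+)>")


def renumber_callouts(text: str) -> str:
    """Renumber <N> callouts to 1, 2, 3... in order of first appearance."""
    mapping = {}

    def replace(match):
        old = match.group(1)
        if old not in mapping:
            # First time we see this number: assign the next sequential one
            mapping[old] = str(len(mapping) + 1)
        return "<" + mapping[old] + ">"

    return CALLOUT.sub(replace, text)
```

Because the same mapping is applied to both the code block markers and the definitions that follow, each marker keeps pointing at its own explanation after renumbering.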
More complex callout fixes in our source documents:

- Each code block shall have its own callouts; we cannot have 2 code blocks and the callouts for both at the end of the second one.
- Ensure there are blank lines after callout definitions.

AI: Code generated by Cursor
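The second fix, a blank line after callout definitions, might be approximated as below. This is a simplified sketch with a hypothetical function name, and it assumes every callout definition fits on a single line starting with `<N> `.

```python
import re

CALLOUT_DEF = re.compile(r"^<\d+>\s")


def ensure_blank_after_callouts(text: str) -> str:
    """Insert a blank line after the last definition of each callout list."""
    lines = text.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        if (
            CALLOUT_DEF.match(line)
            and nxt is not None
            and nxt.strip()  # next line is not already blank
            and not CALLOUT_DEF.match(nxt)  # and is not another definition
        ):
            out.append("")
    return "\n".join(out)
```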
Fix some Pandoc table issues/limitations:

- Fix XML parsing issues in tables
- Fix tables appearing inside code blocks (with Lua filter)
- Pandoc has table length limitations that make it dump XML/HTML in the output instead of markdown.

Fix issues in our documentation:

- Pipe characters on standalone lines
- Malformed table structures (missing cells, incorrect colspan/rowspan)

And while we are at it we compact table formatting for better readability.

AI: Code generated by Cursor
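To give an idea of what "compact table formatting" could mean for a Markdown pipe table, here is a minimal sketch. It is an assumption about the intent, not the patch's implementation; the function names are hypothetical and cells containing escaped pipes (`\|`) are not handled.

```python
import re

SEPARATOR_CELL = re.compile(r"^(:?)-+(:?)$")


def compact_row(row: str) -> str:
    """Strip cell padding and shorten separator dashes in one table row."""
    row = row.strip()
    if row.startswith("|"):
        row = row[1:]
    if row.endswith("|"):
        row = row[:-1]
    cells = [cell.strip() for cell in row.split("|")]
    # Only shorten dashes when the whole row is a header separator
    if all(SEPARATOR_CELL.match(cell) for cell in cells):
        cells = [SEPARATOR_CELL.sub(r"\1---\2", cell) for cell in cells]
    return "| " + " | ".join(cells) + " |"


def compact_table(table: str) -> str:
    """Apply compact_row to every line of a pipe table."""
    return "\n".join(compact_row(row) for row in table.splitlines())
```

Alignment markers (`:---`, `:---:`, `---:`) are preserved, so the rendered table should look the same while the Markdown source gets shorter.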
This patch improves/fixes logs:

- Skip backup documentation directories to avoid showing unnecessary errors
- Prevent concurrent file edits
- Fix incorrect error messages when there's nothing wrong with the source files
- Remove duplicated log messages
- Show correct status of applied fixes

AI: Code generated by Cursor
Attributes were being dumped at the beginning of documents like this:

```
Ceph: Red Hat Ceph Storage
CephCluster: Red Hat Ceph Storage
CephVernum: 7
MessageBus: AMQ-Interconnect
```

This was because we were passing an attributes file to the processor instead of an asciidoctor file with the attributes.

Upon inspection we don't even need to pass these attributes, because the source documents already use the include directive to include their specific attribute files, which in some cases are different from the global one we were previously using. Example:

```
include::assemblies/common/global/adoption-attributes.adoc[]
```

In this patch we stop passing the `--attributes-file` argument to the `rhoso_adoc_docs_to_text.py` script.

We also remove the `RHOSO_DOCS_ATTRIBUTES_FILE_URL` environment variable that is no longer needed.
We detected a series of issues with the markdown files we generate from the AsciiDoc sources. In this PR we (Cursor and I) aim to fix these issues.
The major change in this PR is a different conversion process: we now use `asciidoctor` to convert to DocBook5 XML and then use `pandoc` to convert it to Markdown.

Unfortunately we encountered multiple issues in every part of the process:

- `asciidoctor` bugs/issues/limitations
- `pandoc` bugs/issues/limitations

So we had to work around them in some hackish ways such as manually processing source documents and using filters.
We also had to get creative when we wanted to convert things like Jira bug references to headers so they could be adequately chunked.
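The idea of turning bold Jira references into headers can be illustrated with the text-level sketch below. The PR does this inside a Pandoc filter working on the document tree; this Python stand-in only shows the transformation. The function name, the `PROJECT-123` key pattern, and the header level are all assumptions.

```python
import re

# A line that starts with a bold Jira-style key, e.g. "**OSPRH-1234**: text"
JIRA_BOLD_LINE = re.compile(r"^\*\*([A-Z][A-Z0-9]+-\d+)\*\*:?\s*(.*)$")


def promote_jira_refs(markdown: str, level: int = 3) -> str:
    """Turn lines that open with a bold Jira key into headers."""
    out = []
    for line in markdown.splitlines():
        match = JIRA_BOLD_LINE.match(line)
        if match is None:
            out.append(line)
            continue
        out.append("#" * level + " " + match.group(1))
        if match.group(2):
            # Keep the rest of the line as the body under the new header
            out.append("")
            out.append(match.group(2))
    return "\n".join(out)
```

With each ticket under its own header, a header-based chunker can split the release notes per ticket instead of per section. Bold keys that appear mid-sentence are left alone because the pattern is anchored to the start of the line.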