Add indexdata + automatic indexing of PDF items #182
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff            @@
##              main      #182   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           32        33     +1
  Lines         1452      1531    +79
  Branches       251       273    +22
=========================================
+ Hits          1452      1531    +79
2b753ea to d50b67b
Some files, such as https://irp.fas.org/doddir/milmed/milderm.pdf, raise a "MuPDF error: format error: cmsOpenProfileFromMem failed" error. It looks fixable, since it is an ICC profile issue (which we do not care about): pymupdf/PyMuPDF#3572. I will fix this.
The fix is different than expected, but at least it works. The PR is ready for review again.
3da83f5 to 7d519e1
rgaudin left a comment
Thank you, this is great.
I think we should extend `StaticItem` instead.
As for the API, I think we can add the following to `add_item_for()`:
- `auto_index: bool = True`: keeps libzim auto index + our PDF autoindex. Setting this to `False` would skip the PDF but would also overload the item with an empty `IndexData` so that libzim doesn't index it.
- `index_content: str | None = None`, which would generate the appropriate indexdata with wordcount if set. It doesn't handle keywords, but the current PDF impl doesn't either, and we can extend it in the future.
WDYT?
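To make the proposal above concrete, here is a minimal sketch of how the two suggested arguments could map to index data. It is not the code from this PR: the `IndexData` dataclass is a hypothetical stand-in, `index_data_for` is an illustrative helper name, counting words by splitting on whitespace is an assumption, and so is letting `index_content` take precedence when `auto_index` is also `False`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class IndexData:
    """Hypothetical stand-in for the IndexData class discussed in this PR."""

    title: str
    content: str
    wordcount: int = 0
    keywords: str = ""


def index_data_for(
    title: str,
    *,
    auto_index: bool = True,
    index_content: str | None = None,
) -> IndexData | None:
    """Return index data to attach to an item, or None to attach nothing."""
    if index_content is not None:
        # Explicit content: index it and derive the word count from it.
        # Assumption: this wins even if auto_index is False.
        return IndexData(
            title=title,
            content=index_content,
            wordcount=len(index_content.split()),
        )
    if not auto_index:
        # Empty index data, so that libzim has nothing to index for this item.
        return IndexData(title="", content="")
    # Nothing attached: libzim auto index (and the PDF autoindex) apply.
    return None
```

Returning `None` in the default case keeps the default path untouched, which matches the intent that `auto_index=True` changes nothing for existing callers.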
I did not pass
And I also modified
Other than that, I think the change will please you.
I see it's missing from my comment but I meant
There are a couple of unresolved discussions…
Then I get what you meant, and I agree the extra import is not very lean.
I finally decided to keep using
Fix #167
Fix #168
Edited description
Changes:
- `IndexData` to hold indexing data (title, content, keywords) before passing it to libzim
- `index_data: IndexData | None` and `auto_index: bool | None` for customizing indexing in `StaticItem` and `add_item_for`:
  - `index_data` from caller for customized indexing
  - `auto_index` to False to disable indexing (both in python-scraperlib and libzim)
Former description and points to discuss
Changes:
- `IndexingItem` class capable of customizing index data, either from data passed by the scraper or automatically from PDF content
- `IndexData` class holding the index data
Open points to discuss:
- `IndexingItem` class or should we simply embed all this logic in `StaticItem`?
- `add_indexing_item_for`, similar to `add_item_for`? Or just enrich `add_item_for` with new arguments?
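The edited description settles these open points by adding `index_data` and `auto_index` to `StaticItem` and `add_item_for`. The selection logic it describes could be sketched roughly as below. This is an illustration, not the merged implementation: `choose_index_data`, its `mimetype` and `payload` parameters, and the injected `pdf_to_index_data` callable are hypothetical names; `IndexData` is a simplified stand-in; `auto_index` is modelled as a plain `bool` (the description types it as `bool | None`, and the meaning of `None` is not stated there); and giving caller-supplied data priority over `auto_index=False` is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class IndexData:
    """Hypothetical stand-in: title, content and keywords, as in the description."""

    title: str
    content: str
    keywords: str = ""


def choose_index_data(
    *,
    mimetype: str,
    payload: bytes,
    index_data: IndexData | None = None,
    auto_index: bool = True,
    pdf_to_index_data: Callable[[bytes], IndexData],
) -> IndexData | None:
    """Pick the index data for one item; None means 'leave it to libzim'."""
    if index_data is not None:
        # Customized indexing: the caller's data is used as-is (assumed priority).
        return index_data
    if not auto_index:
        # Indexing disabled: empty data, so neither side indexes the item.
        return IndexData(title="", content="")
    if mimetype == "application/pdf":
        # Automatic PDF indexing; the extractor is injected to keep this testable.
        return pdf_to_index_data(payload)
    # Any other item: no override, libzim's own auto index applies.
    return None
```

Passing the PDF extractor in as a callable keeps the sketch free of any PDF library, so the branching can be exercised with a fake extractor.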