Add indexdata + automatic indexing of PDF items by benoit74 · Pull Request #182 · openzim/python-scraperlib

benoit74 · 2024-07-11T15:44:46Z

Fix #167
Fix #168

Edited description

Changes:

create helper class IndexData to hold indexing data (title, content, keywords) before passing it to libzim
add support for automated indexing of PDFs in python-scraperlib, building an IndexData object automatically
- it relies on PyMuPDF for now to extract PDF metadata and content
- item_data title is populated based on title + author + subject, separated by dashes
- item_data content is simply all pages content concatenated with new line to separate them
add support for index_data: IndexData | None and auto_index: bool | None for customizing indexing in StaticItem and add_item_for:
- pass custom index_data from calller for customized indexing
- set auto_index to False to disable indexing (both in python-scraperlib and libzim)
- otherwise, indexing is automated in python-scraperlib for PDF documents (for now) or in libzim (for others, text or html for now)

Former description and points to discuss

Changes:

add new IndexingItem class capable to customize index data from data passed from the scraper or automatically from PDF content
- this uses a new IndexData class holding the index data
- for PDF, it relies on PyMuPDF for now to extract PDF metadata and content
  - item title is populated based on title + author + subject, separated by dashes
  - item content is simply all pages content concatenated with new line to separate them

Open points to discuss:

do we really need this new IndexingItem class or should we simply embed all this logic in StaticItem?
if we keep the separate class, do we need a new add_indexing_item_for, similar to add_item_for? Or just enrich the add_item_for with new arguments?

codecov · 2024-07-11T15:47:26Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (1eddabc) to head (c91646f).

Additional details and impacted files

@@            Coverage Diff            @@
##              main      #182   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           32        33    +1     
  Lines         1452      1531   +79     
  Branches       251       273   +22     
=========================================
+ Hits          1452      1531   +79

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

benoit74 · 2024-07-17T09:37:15Z

Some files like https://irp.fas.org/doddir/milmed/milderm.pdf are raising "MuPDF error: format error: cmsOpenProfileFromMem failed" error. Looks like it could be fixed since it is an ICC profile issue (for which we do not care): pymupdf/PyMuPDF#3572. I will fix this.

benoit74 · 2024-07-17T10:16:41Z

Fix is different than expected, but at least it is working, PR is again ready for review

rgaudin

Thank you, this is great.
I think we should extend StaticItem instead.

As for the API, I think we can add the following to add_item_for():

auto_index: bool = True: keeps libzim auto index + our pdf autoindex. Setting this to False would skip the PDF but would also overload the item with an empty IndexData so that libzim doesn't index it.
index_content: str | None = None which would generate the appropriate indexdata with wordcount if set. It doesn't handle keywords but current PDF impl doesn't either and we can extend in the future.

WDYT?

tests/conftest.py

tests/zim/test_indexing_item.py

src/zimscraperlib/zim/indexing_item.py

benoit74 · 2024-07-29T13:37:51Z

I did not passed index_content: str | None = None but index_data: IndexData | None since it also allows to set the title which is used for suggestions, which is quite important (item title is not used for suggestions when index data is passed)

And I also modified add_item_for since this is quite heavily used in scrapers.

Other than that, I think the change will please you.

rgaudin · 2024-07-29T17:45:21Z

I did not passed index_content: str | None = None but index_data: IndexData | None since it also allows to set the title which is used for suggestions

I see it's missing from my comment but I meant index_content and index_title. I think requiring this extra import is in opposition with what add_item_for tries to achieve but you're the judge of that.

There are a couple of unresolved discussions…

src/zimscraperlib/zim/indexing_item.py

benoit74 · 2024-07-29T17:53:41Z

I see it's missing from my comment but I meant index_content and index_title. I think requiring this extra import is in opposition with what add_item_for tries to achieve but you're the judge of that.

Then I get what you meant, and I agree the extra import is not very lean

benoit74 · 2024-07-30T06:37:30Z

I finally decided to keep using index_data in add_item_for and StaticItem because it is a convenient way to force user to pass both title and content should he decide to customize index_data and to detect when this is not done with pyright. Otherwise one might be tempted to pass only an index_title or only an index_content and this is not what we want.

benoit74 mentioned this pull request Jul 15, 2024

Restore search functionality by indexing content in JSON files openzim/youtube#265

Merged

benoit74 force-pushed the pdf_support branch 2 times, most recently from 2b753ea to d50b67b Compare July 15, 2024 12:37

benoit74 self-assigned this Jul 15, 2024

benoit74 force-pushed the pdf_support branch from d50b67b to 97e1fa3 Compare July 15, 2024 12:44

benoit74 marked this pull request as ready for review July 15, 2024 12:46

benoit74 requested a review from rgaudin July 15, 2024 12:46

benoit74 force-pushed the pdf_support branch from add1ec4 to a22a266 Compare July 17, 2024 12:22

benoit74 mentioned this pull request Jul 17, 2024

WIP: Add support for PDF indexing + consider them as front article openzim/warc2zim#350

Closed

benoit74 force-pushed the pdf_support branch 2 times, most recently from 3da83f5 to 7d519e1 Compare July 17, 2024 13:44

rgaudin requested changes Jul 26, 2024

View reviewed changes

benoit74 changed the title ~~Add indexing item + automatic indexing of PDF items~~ Add indexdata + automatic indexing of PDF items Jul 29, 2024

benoit74 force-pushed the pdf_support branch from a47bdc4 to 3e1df16 Compare July 29, 2024 13:29

benoit74 requested a review from rgaudin July 29, 2024 13:38

rgaudin approved these changes Jul 29, 2024

View reviewed changes

Add support for automatically indexing PDF title and content

31558d0

benoit74 force-pushed the pdf_support branch from dcb1f26 to 31558d0 Compare July 30, 2024 06:34

benoit74 merged commit efc2942 into main Jul 30, 2024

benoit74 deleted the pdf_support branch July 30, 2024 06:37

Uh oh!

Conversation

benoit74 commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Edited description

Former description and points to discuss

Uh oh!

codecov bot commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

benoit74 commented Jul 17, 2024

Uh oh!

benoit74 commented Jul 17, 2024

Uh oh!

rgaudin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

benoit74 commented Jul 29, 2024

Uh oh!

rgaudin commented Jul 29, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

benoit74 commented Jul 29, 2024

Uh oh!

benoit74 commented Jul 30, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

benoit74 commented Jul 11, 2024 •

edited

Loading

codecov bot commented Jul 11, 2024 •

edited

Loading