
added more book data across different genres #4

Merged
Shubhamnpk merged 6 commits into main from v1 on May 24, 2026

@Shubhamnpk Shubhamnpk commented May 24, 2026

Owner

Summary by CodeRabbit

  • New Features

    • Added audio streaming endpoint and a health check endpoint.
  • Documentation

    • Expanded README with CEHRD-first source strategy, scraping/merge/validation steps, updated API notes, and project structure.
    • Added issue/PR templates, updated contributing checklist, and updated changelog.
  • Content Expansion

    • Large import of Pustakalaya datasets across Course Materials, Literature & Arts, Reference, Teaching, and Other collections.
  • Chores

    • Added CI workflow and example env settings.


feat(api): add catalog-validated audio streaming support

Introduce a new `GET /api/audio?url=<audioUrl>` endpoint to proxy audio files through the same origin, matching how PDF proxying works for the reader/player UX.

- Generalize catalog URL validation to accept specific fields and enforce `audioUrl`/`pdfUrl` checks per endpoint
- Add `audioUrl` and `level` to list/book response fields
- Update API docs and README examples to document `/api/audio` and the `audioUrl` data shape
- Expand source priority ordering to include new CEHRD sources (`cehrd-stories`, `cehrd-nfe`, `cehrd-audio`)

This improves secure media access by only allowing catalog-backed audio URLs while extending metadata needed for new CEHRD content.

feat(scraper): expand Pustakalaya story collection sources

Add several Literature & Arts collections to `scrape_pustakalaya_stories`
to broaden scraping coverage and capture more relevant books. Also remove
the BOM character at the top of `scripts/scraper.py` for cleaner parsing
and file consistency.

Expand the Pustakalaya scraper to support multiple specialized collections
including Course Materials, Literature and Arts, Reference Materials,
Teaching Materials, and Other Educational Materials.

- Add new scraping scripts for specific Pustakalaya categories
- Implement a hierarchical data directory structure for categorized JSON files
- Update API to support recursive data loading and new source priorities
- Update UI to display new Pustakalaya source names and handle PDF/readUrl
- Update OpenAPI documentation and playground to reflect new sources
@vercel

vercel Bot commented May 24, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| yobook-api | Ready | Preview, Comment | May 24, 2026 4:43pm |

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented May 24, 2026


Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Updated (UTC) |
| --- | --- | --- | --- |
| ❌ Deployment failed (View logs) | yobook-api | 59b8ce7 | May 24 2026, 04:44 PM |

@coderabbitai

coderabbitai Bot commented May 24, 2026


Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 63c7eaff-105b-4f40-a57e-75a6c2bf4b94

📥 Commits

Reviewing files that changed from the base of the PR and between 082afc9 and 59b8ce7.

⛔ Files ignored due to path filters (3)
  • data/reports/pdf_links_missing_no_stored_url.csv is excluded by !**/*.csv
  • data/reports/pdf_links_not_opening.csv is excluded by !**/*.csv
  • data/reports/pdf_missing_rerun_results.csv is excluded by !**/*.csv
📒 Files selected for processing (3)
  • data/Literature and Arts/pus_english-children-s-literature.json
  • data/Literature and Arts/pus_inspirational-materials.json
  • data/all_books.json

📝 Walkthrough


This PR introduces audio streaming support to the API catalog system and substantially expands the available educational content. The core changes generalize URL validation, add audio proxy functionality with HTTP Range support, and populate the catalog with 40+ JSON datasets across multiple educational categories.

Changes

Catalog audio support and data expansion

| Layer / File(s) | Summary |
| --- | --- |
| Catalog model and field extensions (`api.py`) | `LIST_BOOK_FIELDS` extended with `level` and `audioUrl` fields; `SOURCE_PRIORITY` mapping expanded with additional sources and updated ranks for improved sorting/filtering. |
| Parameterized URL validation and recursive catalog loading (`api.py`) | `is_catalog_resource_url` generalized to accept configurable catalog fields for validation; `load_all_books` refactored to prefer pre-merged `all_books.json`, recursively discover JSON files via `os.walk`, deduplicate by `id`, and sort by source/grade/subject/title. |
| Audio proxy route and API documentation (`api.py`) | New `GET /api/audio` endpoint validates requested URL against `audioUrl` catalog entries, forwards HTTP Range requests, propagates response headers (Content-Length, Content-Range, Accept-Ranges), and streams upstream content; `/api/pdf` updated to use parameterized validation; `/api` docs updated. |
| API health and docs (`api.py`) | New `GET /api/health` endpoint returns book and distinct source counts; `/api` documentation JSON updated to list `/api/health` and `/api/audio`. |
| README & contributor docs (`README.md`, `CONTRIBUTING.md`, `CHANGELOG.md`, `.env.example`, `.github/*`, `.github/workflows/ci.yml`) | README updated with Source Strategy, scraping commands, audio endpoint docs, example data shape (`audioUrl`), and project tree; CONTRIBUTING updated merge/validation steps; PR/issue templates and CI workflow added/updated. |
| Course materials datasets (`data/Course Materials/*`) | 18 JSON files added/updated with course material metadata (accounting, animal-science, civics, civil-engineering, computer-engineering, e-paath, education, electrical-engineering, geography, moral-education, music, occupation-business-and-technology-education, our-surroundings, plant-science, population, rural-development, sanskrit, technical-and-vocational). |
| Literature and Arts datasets (`data/Literature and Arts/*`) | 3 JSON files added (do-it-yourself, inspirational-materials, traditional-art) containing story/media metadata with URLs, keywords, and optional descriptive fields. |
| Other Educational Materials datasets (`data/Other Educational Materials/*`) | 7 JSON files added (computer, philosophy-and-religion, photo-essay, sports, tourism, plus related entries) with educational material metadata including source/cover/read URLs, keywords, and descriptions. |
| Reference Materials datasets (`data/Reference Materials/*`) | 3 JSON files added (atlas, children-s-encyclopedia, dictionary) containing reference material metadata with URLs, keywords, page counts, and descriptions. |
| Teaching Materials datasets (`data/Teaching Materials/*`) | 4 JSON files added (educational-theory-and-philosophy, literacy-resources, local-curriculum, quality-education-support-material) with teaching material metadata including author info, URLs, keywords, and descriptions. |
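
The Range forwarding and header propagation described for `GET /api/audio` can be sketched roughly as below. This is a minimal illustration, not the code in `api.py`: the helper names `build_upstream_headers` and `select_response_headers` are hypothetical, and passing `Content-Type` through is an assumption (the walkthrough lists only Content-Length, Content-Range, and Accept-Ranges).

```python
# Headers copied from the upstream response back to the client.
# The last three come from the walkthrough; Content-Type is an assumption.
PASSTHROUGH_HEADERS = (
    "Content-Type",
    "Content-Length",
    "Content-Range",
    "Accept-Ranges",
)


def build_upstream_headers(incoming_headers):
    """Forward only the client's Range header (if any) to the upstream server."""
    range_value = incoming_headers.get("Range")
    return {"Range": range_value} if range_value else {}


def select_response_headers(upstream_headers):
    """Keep only the allow-listed headers from the upstream response."""
    # Plain dicts are case-sensitive; real header mappings (Flask, requests)
    # are case-insensitive, so this lookup is simplified.
    return {
        name: upstream_headers[name]
        for name in PASSTHROUGH_HEADERS
        if name in upstream_headers
    }
```

Dropping everything except `Range` on the way out, and everything except the allow-list on the way back, keeps cookies and other upstream headers from leaking through the proxy.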

🎯 3 (Moderate) | ⏱️ ~20 minutes

🐰 A rabbit hops through data, now with audio streams so clear,
Catalogs multiply like clover, spreading learning far and near,
From Course Materials to Teaching guides so bright,
The API proxy flows with Range headers, a technical delight! 🎵📚

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The title is misleading and does not accurately describe the main changes in the pull request. | Revise the title to reflect the actual scope: include infrastructure changes (API audio proxy, health check, catalog loading refactor) and README/documentation updates alongside data additions. Example: 'Add audio streaming API, improve catalog loading, and expand Pustakalaya data collections'. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 15

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api.py`:
- Around line 111-113: is_catalog_resource_url currently calls load_all_books()
on every request causing repeated disk reads; change it to use a cached catalog
and precomputed allowlists for the fields (e.g., pdfUrl/readUrl/audioUrl) stored
at module-level (or in a simple CatalogCache class) and have load_all_books()
populate/update that cache instead of re-parsing each time; implement cache
invalidation by checking file mtimes (or a single catalog last-modified
timestamp) and refresh the cached book list and derived sets only when files
change so is_catalog_resource_url() simply checks membership in the precomputed
set rather than iterating load_all_books() each call.
- Around line 130-134: The active-catalog discovery loop currently collects all
JSON files under DATA_DIR (using os.walk) including archived datasets; modify
the loop that builds filepaths so it skips files located in the archive
directory (e.g., any root path matching os.path.join(DATA_DIR, "archive_data")
or contains "/archive" segment) and still excludes "all_books.json" and
non-JSONs; update the condition in the for root, _, files in os.walk(DATA_DIR)
block (the code that appends to filepaths) to continue when the file's root or
full path indicates an archived dataset so archived JSONs are not added to
filepaths.
- Around line 354-358: The current Response uses
stream_with_context(upstream.iter_content(...)) but never closes the
requests.Response object "upstream", which can leak connections; wrap
iter_content in a generator that yields chunks and calls upstream.close() in a
finally block (or use contextlib.closing) and pass that generator to
stream_with_context so "upstream" is always closed on iterator exhaustion or
client disconnect; update the return to use
Response(stream_with_context(your_chunk_generator()), headers=headers,
status=upstream.status_code) and ensure the generator references
upstream.iter_content(chunk_size=64 * 1024) and closes upstream in its finally.

In `@data/Course Materials/pus_geography.json`:
- Line 27: The PUS JSON uses "educationLevel" but the API expects "level" (see
LIST_BOOK_FIELDS in api.py and openapi.json); update the PUS ingestion/merge or
add fallback logic so entries get a numeric "level" field: convert/normalize
data/Course Materials/pus_geography.json (and other PUS inputs) by mapping
educationLevel values (e.g., "Primary"/"Secondary"/"Tertiary"/etc.) to the
numeric level used by the API, and ensure the scraper/merge output emits "level"
(or have the API layer that builds LIST_BOOK_FIELDS check for educationLevel and
populate level before serialization). Reference the PUS scraper/merge step that
produces the catalog documents and the API builder that uses LIST_BOOK_FIELDS to
implement this mapping.

In `@data/Course Materials/pus_occupation-business-and-technology-education.json`:
- Around line 108-109: Normalize the publisher values in
pus_occupation-business-and-technology-education.json by removing embedded
newlines and indentation artifacts and collapsing multiple whitespace (including
non-breaking spaces) into a single regular space for the "publisher" fields;
locate the entries where "publisher" currently contains newline/indentation
(e.g. the value starting with "नेपाल सरकार,") and replace them with a
single-line, trimmed string like "नेपाल सरकार, पाठ्यक्रम विकास केन्द्र" so
exact-match filtering and rendering are not broken.

In `@data/Course Materials/pus_our-surroundings.json`:
- Line 257: The pageCount fields currently contain mixed-script numerals ("३0");
find the "pageCount" entries with the value "३0" and normalize them to use ASCII
digits (e.g., "30") or, preferably, a JSON number (pageCount: 30) to ensure
consistent parsing/sorting; update both occurrences so all pageCount values use
the same script/type.

In `@data/Course Materials/pus_plant-science.json`:
- Around line 471-478: This record (id:
pus-79b624b9-a60e-48ee-aaeb-e6eda45d2d1d, title: "Operation and Maintenance of
Microhydro Plant and Photovoltaic System :Learning Resource Material - Grade
12") is missing a subject field so it won’t be returned by GET
/api/books?subject=...; open the JSON object for that record and add an
appropriate "subject" property (for example "subject": "Plant Science", or
"subject": "Renewable Energy / Microhydro" depending on taxonomy) and optionally
adjust "category"/"keywords" to match catalog filters so it appears in
subject-based queries.

In `@data/Course Materials/pus_rural-development.json`:
- Line 1: The file pus_rural-development.json currently contains an empty array
([]) which makes this category always return zero items; confirm whether this is
intentional and either (a) populate pus_rural-development.json with the expected
dataset entries, (b) remove/rename the file so the catalog bucket isn't dead
until data is ready, or (c) replace the empty array with a short placeholder
object/metadata indicating "data pending" so callers can handle the empty state
explicitly.

In `@data/Literature and Arts/pus_traditional-art.json`:
- Around line 31-51: The record uses a non-unique id
"pus-592843a1-7e1a-4769-b30c-cc2ffa030b53" which is duplicated elsewhere and
will cause ingest collisions; update the "id" field in this JSON object (and the
other duplicate record that reuses that same id) to a newly generated unique id
(e.g., a new UUID with the "pus-" prefix) so each record has a globally unique
"id" value and ensure any references to that id are updated consistently.

In `@data/Other Educational Materials/pus_computer.json`:
- Around line 906-919: The JSON record with id
"pus-6a28b603-c348-4799-a4a5-e3664bc12107" is incomplete (missing readUrl and
other core metadata), so update this object to include the same required fields
used across the dataset (at minimum add "readUrl" with a valid document URL and
any missing core metadata such as publisher/publishedDate/format/size/rights as
your schema requires) and run the same validation/normalization used elsewhere
before persisting to ensure consistency and prevent broken read/download
behavior.

In `@data/Other Educational Materials/pus_philosophy-and-religion.json`:
- Around line 74-103: The JSON records in pus_philosophy-and-religion.json reuse
existing UUID values in the Sanskrit course dataset causing id-based
deduplication collisions; update the "id" fields for the conflicting records
(e.g., the entry with "id": "pus-0430b451-b53e-4ac0-a868-c35b5269b703") to
unique UUIDs (or a namespaced id scheme) so each record across all categories is
globally unique, then re-run the loader to verify no id collisions remain.

In `@data/Other Educational Materials/pus_sports.json`:
- Around line 142-165: The record in this file uses the same id
"pus-94caee86-1d2c-474e-b0a4-d2be3a45ec28" that already exists in the other
catalog entry (referenced in pus_photo-essay.json), causing one to be dropped;
fix it by giving this entry a unique id (replace the "id" value in this object
with a new UUID or canonical unique identifier) or merge the two records if they
represent the same item, and then verify no other entries share that id.

In `@data/Teaching Materials/pus_educational-theory-and-philosophy.json`:
- Line 625: The JSON records have "pageCount" fields containing years ("2008",
"2006") instead of numeric page totals; locate the offending "pageCount" keys
and correct them by replacing the year strings with the actual integer page
counts or null if unknown, or move the value to the correct field such as
"publicationYear" if that was intended; add/adjust a validation step (schema or
script) that enforces pageCount as an integer >0 and scan the file for any
4-digit year patterns in "pageCount" to fix or flag for manual review (refer to
the "pageCount" key and the specific values "2008" and "2006" to find the
records).

In `@data/Teaching Materials/pus_local-curriculum.json`:
- Line 73: The educationLevel field contains embedded newlines and extra
whitespace; normalize it by trimming whitespace and collapsing internal
newlines/extra spaces and store as a stable representation (preferably an array
of trimmed values like ["Primary","Middle"] or a comma-joined string
"Primary,Middle"); update the entries where educationLevel appears (the current
key "educationLevel" at the shown diff and the other occurrences called out) to
parse the raw string by splitting on commas/newlines, trimming each token,
filtering out empties, and then serialize the cleaned array or joined string
consistently.

In `@data/Teaching Materials/pus_quality-education-support-material.json`:
- Around line 212-229: This record (id
"pus-6a1b7c47-f833-41db-878e-360c359bc838", title "Study Report on Effectiveness
of Reimbursement System of Free Textbook Distribution") is missing the publisher
field; add a "publisher" key to this JSON object (either a string with the
publisher name or explicit null) so the schema matches other entries and
preserves dataset shape consistency.
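
Several of the data findings above (whitespace in `publisher`, mixed-script `pageCount` values, years stored as page counts, multi-value `educationLevel`, reused `id` values) could be caught by a small validation pass. The following is only a sketch under assumptions: all function names are hypothetical, the year range in `looks_like_year` is a heuristic chosen for illustration, and none of this is taken from the repository's scraper or merge scripts.

```python
import re
from collections import defaultdict


def clean_text(value):
    """Collapse newlines, indentation, and non-breaking spaces into single spaces."""
    # str.split() with no argument splits on any Unicode whitespace, including U+00A0.
    return " ".join(value.split())


def parse_page_count(value):
    """Return pageCount as an int, or None when it is missing or not purely numeric."""
    text = clean_text(str(value))
    if not text.isdecimal():
        return None
    # int() accepts any Unicode decimal digit, so mixed-script input such as "३0" works.
    return int(text)


def looks_like_year(page_count):
    """Heuristic flag for values such as 2006 or 2008 that are probably years."""
    return page_count is not None and 1900 <= page_count <= 2100


def split_levels(raw):
    """Split an educationLevel string on commas/newlines into trimmed tokens."""
    tokens = (clean_text(part) for part in re.split(r"[,\n]", raw))
    return [token for token in tokens if token]


def find_duplicate_ids(records_by_file):
    """Map each id that appears more than once to the files that contain it."""
    seen = defaultdict(list)
    for filename, records in records_by_file.items():
        for record in records:
            seen[record.get("id")].append(filename)
    return {book_id: files for book_id, files in seen.items() if len(files) > 1}
```

Flagged values still need manual review: a four-digit page count can be legitimate for a dictionary or encyclopedia, and two records sharing an `id` may be the same item listed in two collections rather than a collision.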

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a4240fdd-d216-4b11-a4a9-a19d59e40d3e

📥 Commits

Reviewing files that changed from the base of the PR and between 6780ffb and 60d3046.

📒 Files selected for processing (87)
  • README.md
  • api.py
  • data/Course Materials/pus_accounting.json
  • data/Course Materials/pus_animal-science.json
  • data/Course Materials/pus_civics.json
  • data/Course Materials/pus_civil-engineering.json
  • data/Course Materials/pus_computer-engineering.json
  • data/Course Materials/pus_e-paath.json
  • data/Course Materials/pus_economics.json
  • data/Course Materials/pus_education.json
  • data/Course Materials/pus_electrical-engineering.json
  • data/Course Materials/pus_english.json
  • data/Course Materials/pus_environmental-studies.json
  • data/Course Materials/pus_geography.json
  • data/Course Materials/pus_health-and-physical-education.json
  • data/Course Materials/pus_history-and-culture.json
  • data/Course Materials/pus_mathematics.json
  • data/Course Materials/pus_moral-education.json
  • data/Course Materials/pus_music.json
  • data/Course Materials/pus_nepali.json
  • data/Course Materials/pus_occupation-business-and-technology-education.json
  • data/Course Materials/pus_old-textbooks.json
  • data/Course Materials/pus_our-surroundings.json
  • data/Course Materials/pus_plant-science.json
  • data/Course Materials/pus_political-science-and-philosophy.json
  • data/Course Materials/pus_population.json
  • data/Course Materials/pus_rural-development.json
  • data/Course Materials/pus_sanskrit.json
  • data/Course Materials/pus_science.json
  • data/Course Materials/pus_social-studies.json
  • data/Course Materials/pus_sociology-and-anthropology.json
  • data/Course Materials/pus_technical-and-vocational.json
  • data/Course Materials/pus_textbook-chapters.json
  • data/Course Materials/pus_textbooks.json
  • data/Literature and Arts/pus_do-it-yourself.json
  • data/Literature and Arts/pus_english-children-s-literature.json
  • data/Literature and Arts/pus_english-literature.json
  • data/Literature and Arts/pus_inspirational-materials.json
  • data/Literature and Arts/pus_literature-in-other-nepali-languages.json
  • data/Literature and Arts/pus_nepali-children-s-literature.json
  • data/Literature and Arts/pus_nepali-literature.json
  • data/Literature and Arts/pus_traditional-art.json
  • data/Other Educational Materials/pus_agriculture-and-biodiversity.json
  • data/Other Educational Materials/pus_civics-related-materials.json
  • data/Other Educational Materials/pus_computer.json
  • data/Other Educational Materials/pus_cottage-and-small-industries.json
  • data/Other Educational Materials/pus_education-related-materials.json
  • data/Other Educational Materials/pus_environment-related-materials.json
  • data/Other Educational Materials/pus_health-and-security-related-materials.json
  • data/Other Educational Materials/pus_law-and-government.json
  • data/Other Educational Materials/pus_philosophy-and-religion.json
  • data/Other Educational Materials/pus_photo-essay.json
  • data/Other Educational Materials/pus_science-and-technology.json
  • data/Other Educational Materials/pus_sports.json
  • data/Other Educational Materials/pus_tourism.json
  • data/Reference Materials/pus_atlas.json
  • data/Reference Materials/pus_children-s-encyclopedia.json
  • data/Reference Materials/pus_dictionary.json
  • data/Teaching Materials/pus_additional-reading-material-for-teachers.json
  • data/Teaching Materials/pus_curriculum.json
  • data/Teaching Materials/pus_educational-theory-and-philosophy.json
  • data/Teaching Materials/pus_journals-magazines-newsletters-and-pamphlets.json
  • data/Teaching Materials/pus_literacy-resources.json
  • data/Teaching Materials/pus_local-curriculum.json
  • data/Teaching Materials/pus_professional-development.json
  • data/Teaching Materials/pus_quality-education-support-material.json
  • data/Teaching Materials/pus_teacher-training-material.json
  • data/Teaching Materials/pus_teachers-guides-old.json
  • data/Teaching Materials/pus_teachers-guides.json
  • data/Teaching Materials/pus_teaching-support-material.json
  • data/all_books.json
  • data/archive_data/archive_org.json
  • data/archive_data/cdc_nepal.json
  • data/archive_data/open_library.json
  • data/archive_data/pustakalaya.json
  • data/cehrd_audio.json
  • data/cehrd_nfe.json
  • data/cehrd_stories.json
  • index.html
  • openapi.json
  • playground.html
  • scripts/scrape_pustakalaya_course_materials.py
  • scripts/scrape_pustakalaya_literature.py
  • scripts/scrape_pustakalaya_literature_copy.py
  • scripts/scrape_pustakalaya_other_educational_materials.py
  • scripts/scrape_pustakalaya_teaching_materials.py
  • scripts/scraper.py

Comment thread api.py
Comment on lines 111 to 113
     for book in load_all_books():
-        if url in {book.get("pdfUrl"), book.get("readUrl")}:
+        if url in {book.get(field) for field in fields}:
             return True


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid full catalog reload for each URL authorization check.

is_catalog_resource_url() re-parses all JSON files on every /api/pdf and /api/audio request via load_all_books(). That adds avoidable disk I/O and latency on a hot path. Cache the loaded catalog (or precomputed allowlists for pdfUrl/readUrl/audioUrl) and refresh only when files change.

Also applies to: 118-150
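
One way to approach the caching suggested here is to keep per-field URL allowlists at module level and rebuild them only when the set of JSON files or their modification times change. The sketch below is an assumption-laden illustration, not the project's implementation: `_cache`, `URL_FIELDS`, `_catalog_signature`, `_build_allowlists`, and `is_allowed_url` are hypothetical names, and it scans every JSON file under the directory without the `all_books.json` preference that `load_all_books` is described as having.

```python
import json
import os

# Hypothetical module-level cache; not taken from api.py.
_cache = {"signature": None, "allowlists": {}}
URL_FIELDS = ("pdfUrl", "readUrl", "audioUrl")


def _catalog_signature(data_dir):
    """Return sorted (path, mtime_ns) pairs for every JSON file under data_dir."""
    entries = []
    for root, _, files in os.walk(data_dir):
        for filename in files:
            if filename.endswith(".json"):
                path = os.path.join(root, filename)
                entries.append((path, os.stat(path).st_mtime_ns))
    return tuple(sorted(entries))


def _build_allowlists(signature):
    """Parse each file once and collect the URL values per field."""
    allowlists = {field: set() for field in URL_FIELDS}
    for path, _ in signature:
        with open(path, encoding="utf-8") as handle:
            records = json.load(handle)
        if not isinstance(records, list):
            continue
        for record in records:
            if not isinstance(record, dict):
                continue
            for field in URL_FIELDS:
                if record.get(field):
                    allowlists[field].add(record[field])
    return allowlists


def is_allowed_url(url, fields, data_dir):
    """Check url against cached allowlists, refreshing only when files changed."""
    signature = _catalog_signature(data_dir)
    if signature != _cache["signature"]:
        _cache["allowlists"] = _build_allowlists(signature)
        _cache["signature"] = signature
    return any(url in _cache["allowlists"].get(field, ()) for field in fields)
```

This still walks and stats the data directory on every call, which is much cheaper than re-parsing JSON but not free. Checking a single merged file's timestamp, or refreshing on a timer, would reduce that further.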


Comment thread api.py
Comment on lines +130 to +134
    for root, _, files in os.walk(DATA_DIR):
        for filename in sorted(files):
            if not filename.endswith(".json") or filename == "all_books.json":
                continue
            filepaths.append(os.path.join(root, filename))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Exclude archived datasets from active catalog discovery.

Lines 130-134 currently ingest all JSON files under data/, which will include data/archive_data/*.json and contradict the “not part of active merged catalog” behavior documented in README.md (line 27). This changes result quality and source composition unexpectedly.

Suggested fix
-    for root, _, files in os.walk(DATA_DIR):
+    for root, _, files in os.walk(DATA_DIR):
+        if os.path.basename(root) == "archive_data":
+            continue
         for filename in sorted(files):
             if not filename.endswith(".json") or filename == "all_books.json":
                 continue
             filepaths.append(os.path.join(root, filename))
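
A possible variant of the fix above prunes the directory list in place, so that `os.walk` never descends into the archive directory at all. The `continue` form skips files directly inside `archive_data` but would still visit any subdirectories beneath it. This is a sketch only; `discover_catalog_files` and `EXCLUDED_DIRS` are hypothetical names, and the directory name comes from the review comment.

```python
import os

EXCLUDED_DIRS = {"archive_data"}  # directory name taken from the review comment


def discover_catalog_files(data_dir):
    """Collect active catalog JSON paths, skipping archived datasets entirely."""
    filepaths = []
    for root, dirs, files in os.walk(data_dir):
        # With the default topdown=True, modifying dirs in place controls
        # which subdirectories os.walk visits next.
        dirs[:] = sorted(d for d in dirs if d not in EXCLUDED_DIRS)
        for filename in sorted(files):
            if not filename.endswith(".json") or filename == "all_books.json":
                continue
            filepaths.append(os.path.join(root, filename))
    return filepaths
```

Sorting `dirs` as well as `files` also makes the discovery order deterministic across platforms, which matters if later deduplication keeps the first record seen for each `id`.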

Comment thread api.py
Comment on lines +354 to +358
    return Response(
        stream_with_context(upstream.iter_content(chunk_size=64 * 1024)),
        headers=headers,
        status=upstream.status_code,
    )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify other streamed requests responses are explicitly closed.
rg -n -C3 'iter_content\(|requests\.get\(' api.py

Repository: Shubhamnpk/yobook-api

Length of output: 949


Close the upstream requests response when streaming iter_content to avoid connection leaks.
In api.py (lines 354-358), the code streams upstream.iter_content(...) via Response(stream_with_context(...)) but does not explicitly close the requests.get(..., stream=True) response (upstream). Under early client disconnects, this can keep connections from being released back to the pool.

Suggested fix
-    return Response(
-        stream_with_context(upstream.iter_content(chunk_size=64 * 1024)),
-        headers=headers,
-        status=upstream.status_code,
-    )
+    def generate():
+        try:
+            for chunk in upstream.iter_content(chunk_size=64 * 1024):
+                if chunk:
+                    yield chunk
+        finally:
+            upstream.close()
+
+    return Response(
+        stream_with_context(generate()),
+        headers=headers,
+        status=upstream.status_code,
+    )
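
The behavior this fix relies on can be checked without Flask or a network connection. In the sketch below, `FakeUpstream` is a hypothetical stand-in that mimics only the `iter_content` and `close` methods of a streamed `requests.Response`, and `stream_and_close` mirrors the `generate()` function from the suggestion.

```python
class FakeUpstream:
    """Stand-in for a streamed requests.Response (only the methods used here)."""

    def __init__(self, chunks):
        self._chunks = chunks
        self.closed = False

    def iter_content(self, chunk_size=1):
        yield from self._chunks

    def close(self):
        self.closed = True


def stream_and_close(upstream, chunk_size=64 * 1024):
    """Yield non-empty chunks and always close the upstream response."""
    try:
        for chunk in upstream.iter_content(chunk_size=chunk_size):
            if chunk:
                yield chunk
    finally:
        # Runs on normal exhaustion and when the generator is closed early,
        # for example after a client disconnect.
        upstream.close()


# Early termination: consume one chunk, then close the generator.
upstream = FakeUpstream([b"first", b"second"])
stream = stream_and_close(upstream)
next(stream)
stream.close()
print(upstream.closed)
```

One caveat: the `finally` block only runs once the generator has started. If the server never begins iterating the response body, the generator body is never entered and `upstream.close()` is not called by this code path.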

"fileSize": "4.11 MB",
"publisher": "पाठ्यक्रम विकास केन्द्र",
"pageCount": "२४१",
"educationLevel": "Secondary"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check which level-related keys API code expects
fd -i 'api.py' --exec rg -n --no-heading 'LIST_BOOK_FIELDS|educationLevel|\"level\"|\blevel\b' {}

# List JSON files that use educationLevel without level
python - <<'PY'
import json, pathlib
for p in pathlib.Path("data").rglob("*.json"):
    try:
        arr = json.loads(p.read_text(encoding="utf-8"))
    except Exception:
        continue
    if not isinstance(arr, list):
        continue
    keys = set()
    for item in arr:
        if isinstance(item, dict):
            keys.update(item.keys())
    if "educationLevel" in keys and "level" not in keys:
        print(p)
PY

Repository: Shubhamnpk/yobook-api

Length of output: 3086


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find code references to educationLevel and level
echo "=== educationLevel references ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' 'educationLevel' .

echo "=== level field references ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' '"level"' .

# Check for mapping/back-compat logic (common patterns)
echo "=== Possible mapping logic ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
  'map(ping)?|compat|transform|normalize|educationLevel.*level|level.*educationLevel' .

# Locate the likely request/response schema definitions
echo "=== Likely schema/field lists ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
  'LIST_BOOK_FIELDS|BOOK_FIELDS|education.*level|educationLevel|level catalog|catalog.*level' .

Repository: Shubhamnpk/yobook-api

Length of output: 50378


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find all relevant source files mentioning each key
echo "=== Files mentioning educationLevel (code only) ==="
rg -l --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
  'educationLevel' .

echo "=== Files mentioning \"level\" (code only) ==="
rg -l --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
  '"level"' .

echo "=== Files mentioning mapping/transform keywords around levels ==="
rg -l --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
  'educationLevel.*level|level.*educationLevel|normalize.*level|transform.*level|map.*level|compat' .

# Narrow to likely request/response/schema code
echo "=== Candidate API schema/field lists ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
  'LIST_BOOK_FIELDS|BOOK_FIELDS|field.*level|educationLevel.*field|level.*field' . || true

# Show a small window around any mapping we find
MAP_FILES=$(rg -l --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
  'educationLevel.*level|level.*educationLevel|normalize.*level|transform.*level|map.*level|compat' . || true)

if [ -n "${MAP_FILES}" ]; then
  echo "=== Context for mapping candidates ==="
  echo "${MAP_FILES}" | head -n 20 | while read -r f; do
    echo "--- ${f} ---"
    rg -n --no-heading \
      'educationLevel.*level|level.*educationLevel|normalize.*level|transform.*level|map.*level|compat' "$f" | head -n 5
  done
fi

Repository: Shubhamnpk/yobook-api

Length of output: 31446


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show LIST_BOOK_FIELDS and surrounding code
python - <<'PY'
from pathlib import Path
p = Path("api.py")
lines = p.read_text(encoding="utf-8").splitlines()
# print around LIST_BOOK_FIELDS
for i,l in enumerate(lines):
    if "LIST_BOOK_FIELDS" in l:
        start=max(0,i-20); end=min(len(lines), i+80)
        for j in range(start,end):
            print(f"{j+1:5d}:{lines[j]}")
        break
PY

# Search for where fields are read/filtered for level metadata
rg -n --no-heading 'educationLevel|\"level\"|level\b|LIST_BOOK_FIELDS|book\[[' api.py

# Also inspect scraper.py/openapi.json for schema expectation if present
echo "=== openapi.json (presence of level/educationLevel) ==="
rg -n --no-heading '"level"' openapi.json || true
rg -n --no-heading 'educationLevel' openapi.json || true

Repository: Shubhamnpk/yobook-api

Length of output: 3550


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== api.py: all educationLevel occurrences ==="
rg -n --no-heading 'educationLevel' api.py || true

echo "=== api.py: all occurrences of field 'level' ==="
rg -n --no-heading '\blevel\b' api.py

echo "=== api.py: openapi schema field 'level' / educationLevel ==="
rg -n --no-heading '"level"' openapi.json || true
rg -n --no-heading 'educationLevel' openapi.json || true

echo "=== data: JSON files that contain a top-level or item key \"level\" ==="
# crude but effective: find any file containing the literal string "level":
rg -l --no-heading --hidden --glob 'data/**/*.json' '"level"' .

echo "=== data: JSON files that contain \"educationLevel\" and do NOT contain \"level\" ==="
python3 - <<'PY'
import json, pathlib
paths_with_no_level=[]
for p in pathlib.Path("data").rglob("*.json"):
    try:
        txt=p.read_text(encoding="utf-8")
    except Exception:
        continue
    if '"educationLevel"' not in txt:
        continue
    if '"level"' in txt:
        continue
    # If file is huge still count by substring; better to also parse if it's a list of dicts
    try:
        arr=json.loads(txt)
    except Exception:
        continue
    keys=set()
    if isinstance(arr,list):
        for item in arr:
            if isinstance(item,dict):
                keys.update(item.keys())
    if "educationLevel" in keys and "level" not in keys:
        paths_with_no_level.append(str(p))
for x in sorted(paths_with_no_level):
    print(x)
PY

echo "=== Inspect pus_geography.json keys (first list item) ==="
python3 - <<'PY'
import json
from pathlib import Path
p=Path("data/Course Materials/pus_geography.json")
data=json.loads(p.read_text(encoding="utf-8"))
print("type:", type(data))
if isinstance(data,list) and data and isinstance(data[0],dict):
    d=data[0]
    print("first item keys:", sorted(d.keys()))
    print("educationLevel:", d.get("educationLevel"))
    print("level:", d.get("level"))
else:
    print("not a list of dicts; top-level type:", type(data))
PY

echo "=== Codebase: any educationLevel<->level mapping logic in scripts/ ==="
rg -n --no-heading --hidden --glob 'scripts/**/*' 'educationLevel.*level|level.*educationLevel|normalize.*level|transform.*level' .

Repository: Shubhamnpk/yobook-api

Length of output: 3809


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Find how the API uses `level` beyond just listing it in LIST_BOOK_FIELDS
rg -n --no-heading 'level' api.py

# 2) Find any conversion/mapping from educationLevel to level in scripts
rg -n --no-heading 'educationLevel' scripts
rg -n --no-heading '"level"' scripts
rg -n --no-heading 'educationLevel.*level|level.*educationLevel|map.*level|normalize.*level|transform.*level' scripts || true

# 3) Inspect openapi.json for educationLevel vs level usage
rg -n --no-heading 'educationLevel' openapi.json || true
rg -n --no-heading '"level"' openapi.json | head -n 50

# 4) Inspect api.py for where request params are handled for filtering/sorting (search common query patterns)
rg -n --no-heading 'request\.args\.get\("level"\)|args\.get\("level"\)|level.*request\.args|get.*level' api.py || true
rg -n --no-heading 'sort|filter' api.py

Repository: Shubhamnpk/yobook-api

Length of output: 1672


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate /api/books handler in api.py
rg -n --no-heading 'GET /api/books|def .*books|/api/books' api.py

# Print a focused window around the likely books search/filter logic
python3 - <<'PY'
from pathlib import Path
p=Path("api.py")
lines=p.read_text(encoding="utf-8").splitlines()
def dump_around(needle, radius=80):
    for i,l in enumerate(lines):
        if needle in l:
            start=max(0,i-radius); end=min(len(lines), i+radius)
            for j in range(start,end):
                print(f"{j+1:5d}:{lines[j]}")
            return True
    return False

# common needles
dump_around('def get_books')
dump_around('books')
dump_around('request.args')
dump_around('"level"')
PY

# Inspect scripts/scraper.py where it assigns "level"
python3 - <<'PY'
from pathlib import Path
p=Path("scripts/scraper.py")
lines=p.read_text(encoding="utf-8").splitlines()
for i,l in enumerate(lines):
    if '"level":' in l and 'level' in l:
        start=max(0,i-40); end=min(len(lines), i+60)
        for j in range(start,end):
            print(f"{j+1:5d}:{lines[j]}")
        break
PY

# Search for any mapping logic in code (not just scripts) around educationLevel->level
rg -n --no-heading 'educationLevel.*level|level.*educationLevel|educationLevel.*int|educationLevel.*catalog|catalog.*educationLevel|educationLevel.*map' . || true

Repository: Shubhamnpk/yobook-api

Length of output: 25082


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Any query-param usage of "level" in api.py
rg -n --no-heading 'request\.args\.get\("level"\)|args\.get\("level"\)|\blevel\b.*request\.args' api.py || true

# Any API code that reads book["level"] (besides listing it in LIST_BOOK_FIELDS / openapi)
rg -n --no-heading '\bget\("level"\)|\b\["level"\]|LIST_BOOK_FIELDS' api.py

# Is educationLevel referenced anywhere in non-data code (besides scripts)?
rg -n --no-heading --glob '!data/**' 'educationLevel' .

# openapi: is level used as a query param or just a response field?
rg -n --no-heading '"level"' openapi.json
rg -n --no-heading 'educationLevel' openapi.json || true

# Identify which data sources actually contain "level"
python3 - <<'PY'
import json, pathlib
with_level=[]
for p in pathlib.Path("data").rglob("*.json"):
    try:
        txt=p.read_text(encoding="utf-8")
    except Exception:
        continue
    if '"level"' not in txt:
        continue
    # quick parse to see if it's list-of-dicts and level exists in items
    try:
        arr=json.loads(txt)
    except Exception:
        with_level.append(str(p))
        continue
    ok=False
    if isinstance(arr,list):
        for item in arr:
            if isinstance(item,dict) and 'level' in item:
                ok=True
                break
    if ok:
        with_level.append(str(p))
print("\n".join(sorted(set(with_level))))
PY

Repository: Shubhamnpk/yobook-api

Length of output: 846


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== educationLevel usage outside data/** ==="
rg -n --no-heading --glob '!data/**' 'educationLevel' .

echo "=== scripts/code touching all_books / catalog merge ==="
rg -n --no-heading --glob 'scripts/**' 'all_books\.json|merge|normalize|transform' .

echo "=== any code setting level from educationLevel ==="
rg -n --no-heading --glob '!data/**' 'educationLevel.*level|level.*educationLevel' .

Repository: Shubhamnpk/yobook-api

Length of output: 12603


Map educationLevel data to the API/catalog level field (PUS JSON currently omits level)

data/Course Materials/pus_geography.json contains only:

"educationLevel": "Secondary"

But the API/catalog fields are built around level (api.py includes "level" in LIST_BOOK_FIELDS, and openapi.json exposes level), and there is no educationLevel → level conversion found in the codebase/scrapers. As a result, PUS entries like this will have level missing/null in API responses, breaking any level-based sorting/filtering that depends on the level field.

  • Update the PUS scrapers/merge step to emit the numeric level field (preferred), or add backward-compatible fallback logic to interpret educationLevel as level.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@data/Course Materials/pus_geography.json` at line 27, The PUS JSON uses
"educationLevel" but the API expects "level" (see LIST_BOOK_FIELDS in api.py and
openapi.json); update the PUS ingestion/merge or add fallback logic so entries
get a numeric "level" field: convert/normalize data/Course
Materials/pus_geography.json (and other PUS inputs) by mapping educationLevel
values (e.g., "Primary"/"Secondary"/"Tertiary"/etc.) to the numeric level used
by the API, and ensure the scraper/merge output emits "level" (or have the API
layer that builds LIST_BOOK_FIELDS check for educationLevel and populate level
before serialization). Reference the PUS scraper/merge step that produces the
catalog documents and the API builder that uses LIST_BOOK_FIELDS to implement
this mapping.
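
One way to sketch the fallback option is shown below. The numeric values in `HYPOTHETICAL_LEVELS` are placeholders, since the catalog's actual level scheme is not shown in this review, and the function name `with_level_fallback` is an assumption rather than existing code.

```python
from typing import Any, Dict, Mapping

# Placeholder values: the numeric scheme actually used by the catalog is not
# shown in this review, so these numbers are assumptions for illustration only.
HYPOTHETICAL_LEVELS: Mapping[str, int] = {"primary": 1, "middle": 2, "secondary": 3}


def with_level_fallback(
    book: Dict[str, Any],
    mapping: Mapping[str, int] = HYPOTHETICAL_LEVELS,
) -> Dict[str, Any]:
    """Return a copy of book with "level" filled from "educationLevel" if absent."""
    if book.get("level") is not None:
        return dict(book)  # an explicit level always wins
    raw = book.get("educationLevel")
    if not isinstance(raw, str):
        return dict(book)
    level = mapping.get(raw.strip().lower())
    if level is None:
        return dict(book)  # unknown label: leave "level" unset rather than guess
    return {**book, "level": level}
```

Returning a copy keeps the loaded catalog data unchanged, so the fallback could be applied at serialization time without touching the JSON files.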

Comment thread data/Course Materials/pus_occupation-business-and-technology-education.json Outdated
Comment on lines +74 to +103
"id": "pus-0430b451-b53e-4ac0-a868-c35b5269b703",
"title": "कर्मकाण्ड: कक्षा १०",
"author": "Unknown",
"language": "ne",
"country": "np",
"source": "pustakalaya-other-educational",
"sourceUrl": "https://pustakalaya.org/documents/detail/0430b451-b53e-4ac0-a868-c35b5269b703/",
"coverUrl": "https://pustakalaya.org/media/uploads/thumbnails/document/2023/09/11/करमकणड_ककष_१०.jpg",
"category": "Other Educational Materials",
"keywords": [
"Karmakanda",
"कक्षा १०",
"Textbook",
"Grade 10",
"गुरुकुल",
"Gurukul",
"कर्मकाण्ड",
"Sanskrit",
"New Textbook",
"Ritual",
"संस्कृतम्",
"पाठ्यपुस्तक"
],
"scrapedAt": "2026-05-24T11:34:44.382540Z",
"readUrl": "https://pustakalaya.org/media/uploads/documents/2023/09/11/Karmakanda10_124cc050/1685512467.pdf",
"fileSize": "1.0 MB",
"publisher": "पाठ्यक्रम विकास केन्द्र",
"pageCount": "२४६",
"educationLevel": "Secondary"
},

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Cross-category ID reuse will overwrite/drop records at load time.

Starting at Line 74, this file reuses IDs already present in the Sanskrit course dataset (multiple occurrences in this file). With id-based deduplication, these entries will collide and one side will be lost, causing incorrect category/source exposure.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@data/Other Educational Materials/pus_philosophy-and-religion.json` around
lines 74 - 103, The JSON records in pus_philosophy-and-religion.json reuse
existing UUID values in the Sanskrit course dataset causing id-based
deduplication collisions; update the "id" fields for the conflicting records
(e.g., the entry with "id": "pus-0430b451-b53e-4ac0-a868-c35b5269b703") to
unique UUIDs (or a namespaced id scheme) so each record across all categories is
globally unique, then re-run the loader to verify no id collisions remain.
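
A collision check of this kind could be sketched as follows. The function name `find_duplicate_ids` is hypothetical, and reading the JSON files into `records_by_file` is left out and assumed to happen elsewhere.

```python
from collections import defaultdict
from typing import Any, Dict, List, Mapping, Sequence


def find_duplicate_ids(
    records_by_file: Mapping[str, Sequence[Dict[str, Any]]],
) -> Dict[str, List[str]]:
    """Map each id that occurs more than once to the files it appears in."""
    seen: Dict[str, List[str]] = defaultdict(list)
    for filename, records in records_by_file.items():
        for record in records:
            record_id = record.get("id")
            if isinstance(record_id, str):
                seen[record_id].append(filename)
    # Keep only ids seen in more than one place (a file repeats if the
    # duplicate is inside the same file).
    return {rid: files for rid, files in seen.items() if len(files) > 1}
```

Run before the loader, a non-empty result lists exactly which files need a decision: assign a new id, or merge the records if they describe the same item.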

Comment on lines +142 to +165
"id": "pus-94caee86-1d2c-474e-b0a4-d2be3a45ec28",
"title": "Motor Sports",
"author": "Prabhat Bhattarai",
"language": "en",
"country": "np",
"source": "pustakalaya-other-educational",
"sourceUrl": "https://pustakalaya.org/documents/detail/94caee86-1d2c-474e-b0a4-d2be3a45ec28/",
"coverUrl": "https://pustakalaya.org/media/uploads/thumbnails/document/2019/02/22/Motor_Sports.jpg",
"category": "Other Educational Materials",
"keywords": [
"Photo Essays",
"Prabhat Bhattarai",
"साझा शिक्षा ई-पाटी",
"OLE Nepal",
"प्रभात भट्टराई",
"Motor Sports"
],
"scrapedAt": "2026-05-24T11:34:44.382540Z",
"readUrl": "https://pustakalaya.org/media/uploads/op/pdf/OLENepal2012_MotorSports.pdf/OLENepal2012_MotorSports.pdf",
"fileSize": "2.01 MB",
"publisher": "OLE Nepal",
"pageCount": "23",
"description": "This photo essay has been prepared by Prabhat Bhattarai. The text and photos for this photo essay were taken from www.wikipedia.com, and edited to suit the audience."
},

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Duplicate id will cause one catalog record to be dropped.

pus-94caee86-1d2c-474e-b0a4-d2be3a45ec28 is already present in data/Other Educational Materials/pus_photo-essay.json (Line 306). Since load_all_books deduplicates by id, one category entry will be silently discarded at load time.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@data/Other Educational Materials/pus_sports.json` around lines 142 - 165, The
record in this file uses the same id "pus-94caee86-1d2c-474e-b0a4-d2be3a45ec28"
that already exists in the other catalog entry (referenced in
pus_photo-essay.json), causing one to be dropped; fix it by giving this entry a
unique id (replace the "id" value in this object with a new UUID or canonical
unique identifier) or merge the two records if they represent the same item, and
then verify no other entries share that id.

"readUrl": "https://pustakalaya.org/media/uploads/op/pdf/ToruAndVijay2008_Opening_up_education.pdf/ToruAndVijay2008_Opening_up_education.pdf",
"fileSize": "4.33 MB",
"publisher": "The MIT Press",
"pageCount": "2008",

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

pageCount appears to contain publication years instead of pages.

For these records, pageCount values ("2008", "2006") look like years, not page totals. This will skew page-based filtering/sorting and mislead consumers.

Also applies to: 1288-1288

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@data/Teaching Materials/pus_educational-theory-and-philosophy.json` at line
625, The JSON records have "pageCount" fields containing years ("2008", "2006")
instead of numeric page totals; locate the offending "pageCount" keys and
correct them by replacing the year strings with the actual integer page counts
or null if unknown, or move the value to the correct field such as
"publicationYear" if that was intended; add/adjust a validation step (schema or
script) that enforces pageCount as an integer >0 and scan the file for any
4-digit year patterns in "pageCount" to fix or flag for manual review (refer to
the "pageCount" key and the specific values "2008" and "2006" to find the
records).
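
The scan suggested above might look like the sketch below. The helper names and the 1900 to 2100 year window are assumptions, and the check is only a heuristic, since a document could legitimately have a page count in that range.

```python
from typing import Optional


def parse_page_count(value: object) -> Optional[int]:
    """Parse a pageCount value to a positive int, or None if it is not one.

    int() accepts Unicode decimal digits, so Devanagari values parse as well.
    """
    try:
        count = int(str(value).strip())
    except ValueError:
        return None
    return count if count > 0 else None


def looks_like_year(value: object, first_year: int = 1900, last_year: int = 2100) -> bool:
    """Flag page counts that fall in a plausible publication-year range."""
    count = parse_page_count(value)
    return count is not None and first_year <= count <= last_year
```

Flagged records would still need manual review before the value is replaced or moved to another field.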

"fileSize": "2.0 MB",
"publisher": "हरिहरपुरगढी गाउँपालिका",
"pageCount": "१५४",
"educationLevel": "Primary , \n \n Middle"

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize educationLevel values to a stable format.

These values include embedded newlines/indentation, which can break exact-match filters and create inconsistent facets. Store a normalized value (e.g., "Primary,Middle" or an array like ["Primary","Middle"]) instead of whitespace-heavy text blobs.

Also applies to: 122-122, 217-217, 289-289

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@data/Teaching Materials/pus_local-curriculum.json` at line 73, The
educationLevel field contains embedded newlines and extra whitespace; normalize
it by trimming whitespace and collapsing internal newlines/extra spaces and
store as a stable representation (preferably an array of trimmed values like
["Primary","Middle"] or a comma-joined string "Primary,Middle"); update the
entries where educationLevel appears (the current key "educationLevel" at the
shown diff and the other occurrences called out) to parse the raw string by
splitting on commas/newlines, trimming each token, filtering out empties, and
then serialize the cleaned array or joined string consistently.
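
The parsing steps described above can be sketched as a small helper. The name `normalize_education_level` is illustrative, and the array form is only one of the two representations the comment proposes.

```python
import re
from typing import List


def normalize_education_level(raw: str) -> List[str]:
    """Split on commas and newlines, trim each token, and drop empty ones."""
    tokens = (token.strip() for token in re.split(r"[,\n]", raw))
    return [token for token in tokens if token]
```

If a string is preferred over an array, `",".join(normalize_education_level(raw))` yields the comma-joined form.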

Comment thread data/Teaching Materials/pus_quality-education-support-material.json
Introduce catalog validation tooling, API health monitoring, and updated project documentation to support the expanded data sources.

- Add `scripts/validate_catalog.py` to ensure data integrity before commits
- Implement `/api/health` endpoint for monitoring catalog size and source counts
- Update `README.md` and `CONTRIBUTING.md` with new scraping workflows and validation steps
- Add `.env.example`, `CHANGELOG.md`, and GitHub configuration files
- Update OpenAPI specification to include the new health check endpoint
Update the catalog with categorized Pustakalaya collections, clean up metadata formatting in the JSON database, and adjust the API source priority. Additionally, improve the frontend UI to handle audio-specific book displays.

- Update `api.py` to remove deprecated source priorities
- Clean up whitespace and newline artifacts in `data/all_books.json`
- Update `index.html` to hide the read button when an audio URL is present
- Synchronize multiple Pustakalaya category JSON files with new scraped data

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
README.md (1)

164-171: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add /api/health to the endpoint list for consistency.

The “Other Endpoints” block documents /api/audio but omits /api/health, which is now part of the API surface and used in CI smoke checks.

Suggested doc patch
 GET /api/books/<id>
 GET /api/pdf?url=<catalog-pdf-url>
 GET /api/audio?url=<catalog-audio-url>
+GET /api/health
 GET /api/sources
 GET /api/stats
 GET /docs
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 164 - 171, Update the "Other Endpoints" list to
include the /api/health endpoint for consistency with the API surface;
specifically add "/api/health" alongside the existing entries (e.g., the GET
/api/audio line) in the same code block so CI smoke checks and docs match the
implemented endpoint.
api.py (1)

142-143: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t silently ignore catalog parse/load failures.

Line 142 catches everything and Line 143 drops it, which can silently serve a partial catalog and break allowlist checks unpredictably.

Suggested minimal fix
-        except Exception:
-            pass
+        except (OSError, json.JSONDecodeError) as exc:
+            app.logger.warning("Skipping unreadable catalog file %s: %s", filepath, exc)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api.py` around lines 142 - 143, The except Exception: pass block that
swallows catalog parse/load failures must be removed and replaced with explicit
error handling: catch specific exceptions raised by the catalog parsing/loading
code (or Exception if unknown), log the full error (e.g., logger.error("Failed
to load/parse catalog", exc_info=True)) and either re-raise a wrapped exception
or set the catalog to a safe empty/closed state so allowlist checks fail-safe;
locate the try/except around the catalog parse/load call in api.py and update it
to log and propagate or safely fallback instead of silently passing.
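
A hedged sketch of the log-and-skip variant follows. It is not the code in `api.py`: the function name, the `read_text` parameter, and the `"catalog"` logger are assumptions made so the example stays self-contained.

```python
import json
import logging
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("catalog")  # hypothetical; api.py would use app.logger


def _read_utf8(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")


def load_catalog_files(
    paths: Iterable[str],
    read_text: Callable[[str], str] = _read_utf8,
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Load list-of-dict JSON files, logging and recording any that fail."""
    books: List[Dict[str, Any]] = []
    skipped: List[str] = []
    for path in paths:
        try:
            data = json.loads(read_text(path))
        except (OSError, json.JSONDecodeError) as exc:
            # Only read and parse failures are handled; other errors propagate.
            logger.warning("Skipping unreadable catalog file %s: %s", path, exc)
            skipped.append(path)
            continue
        if isinstance(data, list):
            books.extend(item for item in data if isinstance(item, dict))
    return books, skipped
```

Returning the skipped paths lets the caller decide whether a partial catalog is acceptable or whether startup should fail.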
♻️ Duplicate comments (2)
data/Course Materials/pus_occupation-business-and-technology-education.json (2)

266-266: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize spacing in publisher field.

The publisher field contains double spaces after the first comma, which should be collapsed to a single space for consistency.

🧹 Proposed fix
-    "publisher": "The World Bank,  The World Bank Group, Nepal",
+    "publisher": "The World Bank, The World Bank Group, Nepal",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@data/Course Materials/pus_occupation-business-and-technology-education.json`
at line 266, The "publisher" field currently has double spaces after the first
comma in the string "The World Bank,  The World Bank Group, Nepal"; update that
value to collapse the double space into a single space so it reads "The World
Bank, The World Bank Group, Nepal" to normalize spacing and maintain
consistency.

108-108: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add missing space after comma in publisher field.

The publisher field is missing a space after the comma, which affects readability and consistency with other entries.

🧹 Proposed fix
-    "publisher": "नेपाल सरकार,पाठ्यक्रम विकास केन्द्र",
+    "publisher": "नेपाल सरकार, पाठ्यक्रम विकास केन्द्र",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@data/Course Materials/pus_occupation-business-and-technology-education.json`
at line 108, Update the publisher JSON value for the "publisher" key by
inserting a space after the comma so the string reads "नेपाल सरकार, पाठ्यक्रम
विकास केन्द्र" instead of "नेपाल सरकार,पाठ्यक्रम विकास केन्द्र"; locate the
"publisher" field in the object and adjust the value accordingly to match the
spacing conventions used elsewhere.
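
The spacing fixes in these comments could also be applied mechanically. The sketch below is an illustration with a hypothetical helper name, and it assumes commas in these fields are always list separators, so it would also rewrite values such as "1,000".

```python
import re


def normalize_comma_spacing(value: str) -> str:
    """Collapse whitespace runs and put exactly one space after each comma."""
    collapsed = re.sub(r"\s+", " ", value).strip()
    return re.sub(r"\s*,\s*", ", ", collapsed).strip()
```

Applied to the `publisher` and `editor` fields at scrape or merge time, this would keep later imports from reintroducing the same inconsistencies.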
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci.yml:
- Around line 14-20: Update the two GitHub Actions uses to pinned commit SHAs
and harden checkout: replace actions/checkout@v4 with
actions/checkout@fd084cde189b7b76ec305d52e27be545a0172823 and add the with key
persist-credentials: false to the checkout step; replace actions/setup-python@v5
with actions/setup-python@e9d6f990972a57673cdb72ec29e19d42ba28880f so both
actions are pinned to specific SHAs rather than floating tags.

In `@data/Course Materials/pus_occupation-business-and-technology-education.json`:
- Line 188: The "editor" JSON field currently uses double spaces after commas;
update the value for the editor key ("editor") to use single spaces after each
comma and remove any leading/trailing extra spaces so it reads as "अग्नि प्रसाद
नेपाल, विवेक सापकोटा, तुलसी पौडेल, यज्ञ प्रसाद गिरी, ध्रुवराज भट्टराई, बालकृष्ण
जोशी"; ensure the normalized string replaces all instances of ",  " with ", "
and trim surrounding whitespace.

In `@data/Literature and Arts/pus_inspirational-materials.json`:
- Line 137: Update the JSON entry where the "publisher" field currently has the
misspelled value "unkown" and correct it to "unknown" (i.e., replace the string
value for the "publisher" key from "unkown" to "unknown") so the record uses the
proper spelling.

In `@data/Other Educational Materials/pus_tourism.json`:
- Line 64: The keywords list contains an inconsistent spelling: replace the
incorrect string "Tourist Accomodation" with the correct "Tourist Accommodation"
wherever it appears (e.g., the entry matching "Tourist Accomodation") so all
occurrences match the correctly spelled "Tourist Accommodation" and avoid
fragmenting keyword indexing.

In `@scripts/validate_catalog.py`:
- Around line 46-47: Narrow the broad except in the JSON file load block:
replace the bare "except Exception as exc" that appends to errors with specific
exception handlers for expected failure modes (e.g.,
FileNotFoundError/PermissionError (or more generally OSError/IOError) and
json.JSONDecodeError) so only read/parse errors are caught; for each handler
append the same "{relpath}: could not read JSON: {exc}" message (or slightly
different messages if you want to distinguish read vs parse) and let any other
unexpected exceptions propagate.

---

Outside diff comments:
In `@api.py`:
- Around line 142-143: The except Exception: pass block that swallows catalog
parse/load failures must be removed and replaced with explicit error handling:
catch specific exceptions raised by the catalog parsing/loading code (or
Exception if unknown), log the full error (e.g., logger.error("Failed to
load/parse catalog", exc_info=True)) and either re-raise a wrapped exception or
set the catalog to a safe empty/closed state so allowlist checks fail-safe;
locate the try/except around the catalog parse/load call in api.py and update it
to log and propagate or safely fallback instead of silently passing.

In `@README.md`:
- Around line 164-171: Update the "Other Endpoints" list to include the
/api/health endpoint for consistency with the API surface; specifically add
"/api/health" alongside the existing entries (e.g., the GET /api/audio line) in
the same code block so CI smoke checks and docs match the implemented endpoint.

---

Duplicate comments:
In `@data/Course Materials/pus_occupation-business-and-technology-education.json`:
- Line 266: The "publisher" field currently has double spaces after the first
comma in the string "The World Bank,  The World Bank Group, Nepal"; update that
value to collapse the double space into a single space so it reads "The World
Bank, The World Bank Group, Nepal" to normalize spacing and maintain
consistency.
- Line 108: Update the publisher JSON value for the "publisher" key by inserting
a space after the comma so the string reads "नेपाल सरकार, पाठ्यक्रम विकास
केन्द्र" instead of "नेपाल सरकार,पाठ्यक्रम विकास केन्द्र"; locate the
"publisher" field in the object and adjust the value accordingly to match the
spacing conventions used elsewhere.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d8662c4-91ae-4aa8-af9d-d0ae78b9fe2d

📥 Commits

Reviewing files that changed from the base of the PR and between 60d3046 and 082afc9.

📒 Files selected for processing (62)
  • .env.example
  • .github/ISSUE_TEMPLATE/bug_report.md
  • .github/ISSUE_TEMPLATE/source_request.md
  • .github/PULL_REQUEST_TEMPLATE.md
  • .github/workflows/ci.yml
  • CHANGELOG.md
  • CONTRIBUTING.md
  • README.md
  • api.py
  • data/Course Materials/pus_economics.json
  • data/Course Materials/pus_english.json
  • data/Course Materials/pus_environmental-studies.json
  • data/Course Materials/pus_health-and-physical-education.json
  • data/Course Materials/pus_history-and-culture.json
  • data/Course Materials/pus_mathematics.json
  • data/Course Materials/pus_nepali.json
  • data/Course Materials/pus_occupation-business-and-technology-education.json
  • data/Course Materials/pus_plant-science.json
  • data/Course Materials/pus_political-science-and-philosophy.json
  • data/Course Materials/pus_sanskrit.json
  • data/Course Materials/pus_science.json
  • data/Course Materials/pus_social-studies.json
  • data/Course Materials/pus_sociology-and-anthropology.json
  • data/Course Materials/pus_technical-and-vocational.json
  • data/Course Materials/pus_textbooks.json
  • data/Literature and Arts/pus_do-it-yourself.json
  • data/Literature and Arts/pus_english-children-s-literature.json
  • data/Literature and Arts/pus_english-literature.json
  • data/Literature and Arts/pus_inspirational-materials.json
  • data/Literature and Arts/pus_literature-in-other-nepali-languages.json
  • data/Literature and Arts/pus_nepali-children-s-literature.json
  • data/Literature and Arts/pus_nepali-literature.json
  • data/Literature and Arts/pus_traditional-art.json
  • data/Other Educational Materials/pus_agriculture-and-biodiversity.json
  • data/Other Educational Materials/pus_civics-related-materials.json
  • data/Other Educational Materials/pus_computer.json
  • data/Other Educational Materials/pus_education-related-materials.json
  • data/Other Educational Materials/pus_environment-related-materials.json
  • data/Other Educational Materials/pus_health-and-security-related-materials.json
  • data/Other Educational Materials/pus_law-and-government.json
  • data/Other Educational Materials/pus_philosophy-and-religion.json
  • data/Other Educational Materials/pus_science-and-technology.json
  • data/Other Educational Materials/pus_sports.json
  • data/Other Educational Materials/pus_tourism.json
  • data/Reference Materials/pus_atlas.json
  • data/Reference Materials/pus_children-s-encyclopedia.json
  • data/Reference Materials/pus_dictionary.json
  • data/Teaching Materials/pus_additional-reading-material-for-teachers.json
  • data/Teaching Materials/pus_educational-theory-and-philosophy.json
  • data/Teaching Materials/pus_journals-magazines-newsletters-and-pamphlets.json
  • data/Teaching Materials/pus_literacy-resources.json
  • data/Teaching Materials/pus_local-curriculum.json
  • data/Teaching Materials/pus_professional-development.json
  • data/Teaching Materials/pus_quality-education-support-material.json
  • data/Teaching Materials/pus_teacher-training-material.json
  • data/Teaching Materials/pus_teachers-guides-old.json
  • data/Teaching Materials/pus_teachers-guides.json
  • data/Teaching Materials/pus_teaching-support-material.json
  • data/all_books.json
  • index.html
  • openapi.json
  • scripts/validate_catalog.py
💤 Files with no reviewable changes (1)
  • data/Other Educational Materials/pus_computer.json
✅ Files skipped from review due to trivial changes (13)
  • .github/ISSUE_TEMPLATE/bug_report.md
  • .github/ISSUE_TEMPLATE/source_request.md
  • data/Literature and Arts/pus_do-it-yourself.json
  • data/Course Materials/pus_plant-science.json
  • data/Literature and Arts/pus_traditional-art.json
  • data/Reference Materials/pus_dictionary.json
  • data/Teaching Materials/pus_literacy-resources.json
  • data/Other Educational Materials/pus_sports.json
  • data/Teaching Materials/pus_local-curriculum.json
  • data/Teaching Materials/pus_quality-education-support-material.json
  • data/Reference Materials/pus_children-s-encyclopedia.json
  • data/Teaching Materials/pus_educational-theory-and-philosophy.json
  • data/Course Materials/pus_sanskrit.json

Comment thread .github/workflows/ci.yml
Comment on lines +14 to +20
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

What are the currently recommended pinned commit SHAs for actions/checkout@v4 and actions/setup-python@v5, and does GitHub Security guidance recommend persist-credentials: false for checkout?

💡 Result:

Currently recommended pinned commit SHAs (for the major versions you asked about):

1) actions/checkout@v4: GitHub-hosted docs for the v4 release show the v4 tree pinned at commit fd084cde189b7b76ec305d52e27be545a0172823. [1]
2) actions/setup-python@v5: GitHub-hosted docs for the actions/setup-python repo show v5 being introduced by commit e9d6f990972a57673cdb72ec29e19d42ba28880f (tagging v5 in the repo’s history). [2]

Does GitHub Security guidance recommend persist-credentials: false for checkout?

  • I did not find a GitHub Security document in the retrieved sources that explicitly recommends setting persist-credentials: false for actions/checkout. The only “guidance” I saw directly addressing persist-credentials was from actions/checkout issue discussions and a compiler/security-related change request, not official GitHub Security guidance. [3][4][5]
  • Therefore, based on the evidence retrieved here, I cannot confirm that GitHub’s Security guidance recommends persist-credentials: false for checkout. [3][4]

If you want, I can do one more targeted search specifically for GitHub’s official security guidance pages mentioning persist-credentials (e.g., the securitylab.github.com / documentation pages) to try to locate the exact recommendation.

Citations:


Harden CI by pinning GitHub Actions to commit SHAs (and consider disabling checkout credential persistence).

  • Replace floating tags:
    • actions/checkout@v4 → actions/checkout@fd084cde189b7b76ec305d52e27be545a0172823
    • actions/setup-python@v5 → actions/setup-python@e9d6f990972a57673cdb72ec29e19d42ba28880f
  • Add persist-credentials: false under with: on the actions/checkout step as a hardening measure. Note that I couldn’t find explicit official GitHub Security guidance recommending it.
🧰 Tools
🪛 zizmor (1.25.2)

[warning] 14-15: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 15-15: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)


[error] 18-18: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yml around lines 14 - 20, Update the two GitHub Actions
uses to pinned commit SHAs and harden checkout: replace actions/checkout@v4 with
actions/checkout@fd084cde189b7b76ec305d52e27be545a0172823 and add the with key
persist-credentials: false to the checkout step; replace actions/setup-python@v5
with actions/setup-python@e9d6f990972a57673cdb72ec29e19d42ba28880f so both
actions are pinned to specific SHAs rather than floating tags.
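
To keep the pinning from regressing, a small check could run alongside the existing validation. The following is a minimal sketch, not part of this PR: the function name `find_unpinned_actions` and both regular expressions are hypothetical, and it assumes unquoted `uses:` values written on a single line.

```python
import re

# Matches `uses: owner/repo[/path]@ref`, optionally preceded by a list dash,
# and captures the action name and the ref (tag, branch, or SHA).
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([\w.\-/]+)@(\S+)")
# A full-length commit SHA is exactly 40 lowercase hexadecimal characters.
SHA_RE = re.compile(r"[0-9a-f]{40}")


def find_unpinned_actions(workflow_text):
    """Return (line_number, action, ref) for each `uses:` not pinned to a SHA."""
    findings = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if match and not SHA_RE.fullmatch(match.group(2)):
            findings.append((lineno, match.group(1), match.group(2)))
    return findings
```

Calling it on the text of `.github/workflows/ci.yml` would list each floating tag with its line number. Local actions (`./path`) and quoted values are not matched by this sketch and would need extra handling.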

    "scrapedAt": "2026-05-24T11:05:36.549054Z",
    "readUrl": "https://pustakalaya.org/media/uploads/op/pdf/NARC2013_AgricultureTechnologyPart2.pdf/NARC2013_AgricultureTechnologyPart2.pdf",
    "fileSize": "5.11 MB",
    "editor": "अग्नि प्रसाद नेपाल,  विवेक सापकोटा,  तुलसी पौडेल,  यज्ञ प्रसाद गिरी,  ध्रुवराज भट्टराई,  बालकृष्ण जोशी",

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize spacing in editor field.

The editor field contains double spaces after commas instead of single spaces, which is inconsistent with standard formatting conventions.

🧹 Proposed fix
-    "editor": "अग्नि प्रसाद नेपाल,  विवेक सापकोटा,  तुलसी पौडेल,  यज्ञ प्रसाद गिरी,  ध्रुवराज भट्टराई,  बालकृष्ण जोशी",
+    "editor": "अग्नि प्रसाद नेपाल, विवेक सापकोटा, तुलसी पौडेल, यज्ञ प्रसाद गिरी, ध्रुवराज भट्टराई, बालकृष्ण जोशी",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    "editor": "अग्नि प्रसाद नेपाल,  विवेक सापकोटा,  तुलसी पौडेल,  यज्ञ प्रसाद गिरी,  ध्रुवराज भट्टराई,  बालकृष्ण जोशी",
+    "editor": "अग्नि प्रसाद नेपाल, विवेक सापकोटा, तुलसी पौडेल, यज्ञ प्रसाद गिरी, ध्रुवराज भट्टराई, बालकृष्ण जोशी",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@data/Course Materials/pus_occupation-business-and-technology-education.json`
at line 188, the "editor" JSON field currently uses double spaces after commas.
Update its value to use a single space after each comma and remove any leading
or trailing whitespace, so it reads "अग्नि प्रसाद नेपाल, विवेक सापकोटा, तुलसी पौडेल,
यज्ञ प्रसाद गिरी, ध्रुवराज भट्टराई, बालकृष्ण जोशी". Ensure the normalized string
replaces every ",  " with ", ".
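
If the same spacing issue shows up in other scraped records, the normalization could be applied at scrape or merge time rather than by hand. A minimal sketch, assuming the field is a plain string; the helper name `normalize_comma_spacing` is hypothetical and does not exist in the repository:

```python
import re


def normalize_comma_spacing(value):
    """Collapse any run of whitespace after a comma to one space, then trim."""
    # Commas with no following whitespace are left untouched on purpose,
    # so values such as "1,000" are not altered.
    return re.sub(r",\s+", ", ", value).strip()


# Example with two of the editor names from the record under review.
cleaned = normalize_comma_spacing("अग्नि प्रसाद नेपाल,  विवेक सापकोटा")
```

The regular expression only targets whitespace that follows a comma, which matches the fix proposed above and avoids touching spaces inside the names themselves.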

Comment thread data/Literature and Arts/pus_inspirational-materials.json Outdated
    "keywords": [
      "Great Himalaya Trail",
      "Climate Change",
      "Tourist Accommodation",

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inconsistent keyword spelling: "Accommodation" vs "Accomodation".

Line 64 uses the correct spelling "Tourist Accommodation" while Line 194 has "Tourist Accomodation" (missing the second 'm'). This inconsistency could fragment search results if keywords are indexed.

📝 Proposed fix

At Line 194, correct the spelling:

-      "Tourist Accomodation",
+      "Tourist Accommodation",

Also applies to: 194-194

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@data/Other Educational Materials/pus_tourism.json`, the keywords lists spell
the same term inconsistently: line 64 has the correct "Tourist Accommodation",
while line 194 has "Tourist Accomodation". Replace the misspelled string with
"Tourist Accommodation" wherever it appears so that all occurrences match and
keyword indexing is not fragmented.
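
Near-miss spellings like this one are easy to overlook in large keyword lists, so a fuzzy comparison could flag them for review. The following is a minimal sketch using the standard library's `difflib`; the function name `find_near_duplicate_keywords` and the 0.9 threshold are assumptions, not something the catalog scripts currently do.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicate_keywords(keywords, threshold=0.9):
    """Return pairs of distinct keywords whose similarity ratio meets the threshold."""
    unique = sorted(set(keywords))
    pairs = []
    for first, second in combinations(unique, 2):
        # Compare case-insensitively so capitalization alone is not reported.
        ratio = SequenceMatcher(None, first.lower(), second.lower()).ratio()
        if ratio >= threshold:
            pairs.append((first, second))
    return pairs
```

The comparison is quadratic in the number of unique keywords, which should be acceptable for a per-file check but may need batching if run across the aggregated catalog. Flagged pairs still need a human decision, since two similar keywords can both be legitimate.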

Comment on lines +46 to +47
        except Exception as exc:
            errors.append(f"{relpath}: could not read JSON: {exc}")

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Narrow the broad exception catch during file load.

Line 46 catches all exceptions; this can hide unexpected failures. Restrict to expected read/JSON parse errors.

Suggested fix
-        except Exception as exc:
+        except (OSError, json.JSONDecodeError) as exc:
             errors.append(f"{relpath}: could not read JSON: {exc}")
             continue
📝 Committable suggestion

Suggested change
-        except Exception as exc:
-            errors.append(f"{relpath}: could not read JSON: {exc}")
+        except (OSError, json.JSONDecodeError) as exc:
+            errors.append(f"{relpath}: could not read JSON: {exc}")
🧰 Tools
🪛 Ruff (0.15.13)

[warning] 46-46: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/validate_catalog.py` around lines 46 - 47, narrow the broad except in
the JSON file load block: replace "except Exception as exc" with handlers for
the expected failure modes only, namely OSError (which covers FileNotFoundError
and PermissionError) and json.JSONDecodeError. For each handler, append the same
"{relpath}: could not read JSON: {exc}" message (or slightly different messages
if you want to distinguish read errors from parse errors) and let any other
unexpected exceptions propagate.
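
For reference, the narrowed handler could look like the sketch below. It is a standalone illustration and not the actual code in `scripts/validate_catalog.py`: the function name `load_catalog_files` and its return shape are hypothetical, and only the error message format is taken from the snippet above.

```python
import json


def load_catalog_files(paths):
    """Load JSON files, collecting read and parse failures instead of raising."""
    loaded = {}
    errors = []
    for relpath in paths:
        try:
            with open(relpath, encoding="utf-8") as handle:
                loaded[relpath] = json.load(handle)
        except (OSError, json.JSONDecodeError) as exc:
            # Only expected failures are recorded: missing or unreadable files
            # (OSError) and malformed JSON (json.JSONDecodeError).
            errors.append(f"{relpath}: could not read JSON: {exc}")
            continue
    return loaded, errors
```

One consequence of the narrower catch is that a file with invalid UTF-8 raises `UnicodeDecodeError`, which is neither an `OSError` nor a `json.JSONDecodeError`, so it would propagate. If such files should be reported rather than abort the run, that exception could be added to the tuple.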

@Shubhamnpk Shubhamnpk self-assigned this May 24, 2026
fix(data): correct metadata typos and add missing file info

Clean up book metadata for consistency and completeness by removing an unintended line break in a description, fixing a `publisher` typo (`unkown` → `unknown`), and adding missing `readUrl` and `fileSize` fields for the Stage Fright entry in both category and aggregated datasets.
@Shubhamnpk Shubhamnpk merged commit 6daccc5 into main May 24, 2026
2 of 5 checks passed