added more book data across different geners by Shubhamnpk · Pull Request #4 · Shubhamnpk/yobook-api

Shubhamnpk · 2026-05-24T13:33:25Z

Summary by CodeRabbit

New Features
- Added audio streaming endpoint and a health check endpoint.
Documentation
- Expanded README with CEHRD-first source strategy, scraping/merge/validation steps, updated API notes, and project structure.
- Added issue/PR templates, updated contributing checklist, and updated changelog.
Content Expansion
- Large import of Pustakalaya datasets across Course Materials, Literature & Arts, Reference, Teaching, and Other collections.
Chores
- Added CI workflow and example env settings.

feat(api): add catalog-validated audio streaming support

Introduce a new `GET /api/audio?url=<audioUrl>` endpoint to proxy audio files through the same origin, matching how PDF proxying works for the reader/player UX.

- Generalize catalog URL validation to accept specific fields and enforce `audioUrl`/`pdfUrl` checks per endpoint
- Add `audioUrl` and `level` to list/book response fields
- Update API docs and README examples to document `/api/audio` and the `audioUrl` data shape
- Expand source priority ordering to include new CEHRD sources (`cehrd-stories`, `cehrd-nfe`, `cehrd-audio`)

This improves secure media access by only allowing catalog-backed audio URLs while extending metadata needed for new CEHRD content.
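The per-endpoint field check described in this commit can be sketched roughly as below. This is a minimal illustration, not the code in `api.py`: passing the catalog in as a `books` argument, the sample `CATALOG`, and the `AUDIO_FIELDS`/`PDF_FIELDS` tuples are assumptions made so the example runs on its own.

```python
from typing import Iterable, Mapping, Sequence


def is_catalog_resource_url(
    url: str,
    books: Iterable[Mapping[str, object]],
    fields: Sequence[str],
) -> bool:
    """Return True only if `url` exactly matches one of `fields` on a catalog record."""
    if not url:
        return False
    for book in books:
        # Compare field by field so unhashable values cannot raise.
        if any(book.get(field) == url for field in fields):
            return True
    return False


# Hypothetical two-record catalog, for illustration only.
CATALOG = [
    {"id": "a", "pdfUrl": "https://example.org/a.pdf"},
    {"id": "b", "audioUrl": "https://example.org/b.mp3"},
]
AUDIO_FIELDS = ("audioUrl",)          # assumed field set for /api/audio
PDF_FIELDS = ("pdfUrl", "readUrl")    # assumed field set for /api/pdf
```

With separate field sets, a URL that is only known as a `pdfUrl` would be rejected by the audio endpoint, which is the point of enforcing the check per endpoint.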

feat(scraper): expand Pustakalaya story collection sources

Add several Literature & Arts collections to `scrape_pustakalaya_stories` to broaden scraping coverage and capture more relevant books. Also remove the BOM character at the top of `scripts/scraper.py` for cleaner parsing and file consistency.

Expand the Pustakalaya scraper to support multiple specialized collections including Course Materials, Literature and Arts, Reference Materials, Teaching Materials, and Other Educational Materials.

- Add new scraping scripts for specific Pustakalaya categories
- Implement a hierarchical data directory structure for categorized JSON files
- Update API to support recursive data loading and new source priorities
- Update UI to display new Pustakalaya source names and handle PDF/readUrl
- Update OpenAPI documentation and playground to reflect new sources
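As a rough sketch of how recursive loading might combine with source priorities, the snippet below deduplicates records by `id` and sorts them by source rank, then grade, subject, and title. The function name `merge_books`, the first-record-wins rule, and the rank values in `SOURCE_PRIORITY` are assumptions for illustration; the actual ranks in `api.py` are not shown in this PR.

```python
def merge_books(records, source_priority):
    """Drop repeated ids (first occurrence wins) and sort by source rank."""
    seen = set()
    merged = []
    for book in records:
        book_id = book.get("id")
        if book_id is None or book_id in seen:
            continue
        seen.add(book_id)
        merged.append(book)

    unknown_rank = len(source_priority)  # unlisted sources sort last
    merged.sort(
        key=lambda b: (
            source_priority.get(b.get("source"), unknown_rank),
            str(b.get("grade", "")),
            str(b.get("subject", "")),
            str(b.get("title", "")),
        )
    )
    return merged


# Hypothetical ranks; only the source names come from the commit message.
SOURCE_PRIORITY = {"cehrd-stories": 0, "cehrd-nfe": 1, "cehrd-audio": 2}
```

Because the first occurrence wins, two different books that share an `id` across files would silently lose one record, so globally unique ids matter for this kind of merge.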

vercel · 2026-05-24T13:33:30Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
yobook-api	Ready	Preview, Comment	May 24, 2026 4:43pm

cloudflare-workers-and-pages · 2026-05-24T13:33:31Z

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status	Name	Latest Commit	Updated (UTC)
❌ Deployment failed View logs	yobook-api	`59b8ce7`	May 24 2026, 04:44 PM

coderabbitai · 2026-05-24T13:33:38Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 63c7eaff-105b-4f40-a57e-75a6c2bf4b94

📥 Commits

Reviewing files that changed from the base of the PR and between 082afc9 and 59b8ce7.

⛔ Files ignored due to path filters (3)

data/reports/pdf_links_missing_no_stored_url.csv is excluded by !**/*.csv
data/reports/pdf_links_not_opening.csv is excluded by !**/*.csv
data/reports/pdf_missing_rerun_results.csv is excluded by !**/*.csv

📒 Files selected for processing (3)

data/Literature and Arts/pus_english-children-s-literature.json
data/Literature and Arts/pus_inspirational-materials.json
data/all_books.json

📝 Walkthrough

Walkthrough

This PR introduces audio streaming support to the API catalog system and substantially expands the available educational content. The core changes generalize URL validation, add audio proxy functionality with HTTP Range support, and populate the catalog with 40+ JSON datasets across multiple educational categories.

Changes

Catalog audio support and data expansion

Layer / File(s)	Summary
Catalog model and field extensions `api.py`	`LIST_BOOK_FIELDS` extended with `level` and `audioUrl` fields; `SOURCE_PRIORITY` mapping expanded with additional sources and updated ranks for improved sorting/filtering.
Parameterized URL validation and recursive catalog loading `api.py`	`is_catalog_resource_url` generalized to accept configurable catalog fields for validation; `load_all_books` refactored to prefer pre-merged `all_books.json`, recursively discover JSON files via `os.walk`, deduplicate by `id`, and sort by source/grade/subject/title.
Audio proxy route and API documentation `api.py`	New `GET /api/audio` endpoint validates requested URL against `audioUrl` catalog entries, forwards HTTP `Range` requests, propagates response headers (`Content-Length`, `Content-Range`, `Accept-Ranges`), and streams upstream content; `/api/pdf` updated to use parameterized validation; `/api` docs updated.
API health and docs `api.py`	New `GET /api/health` endpoint returns book and distinct source counts; `/api` documentation JSON updated to list `/api/health` and `/api/audio`.
README & contributor docs `README.md`, `CONTRIBUTING.md`, `CHANGELOG.md`, `.env.example`, `.github/*`, `.github/workflows/ci.yml`	README updated with Source Strategy, scraping commands, audio endpoint docs, example data shape (`audioUrl`), and project tree; CONTRIBUTING updated merge/validation steps; PR/issue templates and CI workflow added/updated.
Course materials datasets `data/Course Materials/*`	18 JSON files added/updated with course material metadata (accounting, animal-science, civics, civil-engineering, computer-engineering, e-paath, education, electrical-engineering, geography, moral-education, music, occupation-business-and-technology-education, our-surroundings, plant-science, population, rural-development, sanskrit, technical-and-vocational).
Literature and Arts datasets `data/Literature and Arts/*`	3 JSON files added (do-it-yourself, inspirational-materials, traditional-art) containing story/media metadata with URLs, keywords, and optional descriptive fields.
Other Educational Materials datasets `data/Other Educational Materials/*`	7 JSON files added (computer, philosophy-and-religion, photo-essay, sports, tourism, plus related entries) with educational material metadata including source/cover/read URLs, keywords, and descriptions.
Reference Materials datasets `data/Reference Materials/*`	3 JSON files added (atlas, children-s-encyclopedia, dictionary) containing reference material metadata with URLs, keywords, page counts, and descriptions.
Teaching Materials datasets `data/Teaching Materials/*`	4 JSON files added (educational-theory-and-philosophy, literacy-resources, local-curriculum, quality-education-support-material) with teaching material metadata including author info, URLs, keywords, and descriptions.
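The Range handling summarized for `GET /api/audio` can be pictured with two small helpers: one that forwards the client's `Range` header upstream, and one that copies back only the three response headers named in the table. This is a sketch under assumptions, not the code in `api.py`; the helper names are hypothetical, and plain dicts are used here even though real Flask and `requests` header objects are case-insensitive.

```python
# Response headers the walkthrough says are propagated to the client.
PASSTHROUGH_HEADERS = ("Content-Length", "Content-Range", "Accept-Ranges")


def build_upstream_headers(request_headers):
    """Forward the client's Range header, if any, to the upstream request."""
    range_value = request_headers.get("Range")
    return {"Range": range_value} if range_value else {}


def select_response_headers(upstream_headers):
    """Keep only the allow-listed headers from the upstream response."""
    return {
        name: upstream_headers[name]
        for name in PASSTHROUGH_HEADERS
        if name in upstream_headers
    }
```

Copying an explicit allow-list, rather than every upstream header, keeps unrelated headers such as cookies from leaking through the proxy.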

🎯 3 (Moderate) | ⏱️ ~20 minutes

🐰 A rabbit hops through data, now with audio streams so clear,
Catalogs multiply like clover, spreading learning far and near,
From Course Materials to Teaching guides so bright,
The API proxy flows with Range headers, a technical delight! 🎵📚

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The title is misleading and does not accurately describe the main changes in the pull request.	Revise the title to reflect the actual scope: include infrastructure changes (API audio proxy, health check, catalog loading refactor) and README/documentation updates alongside data additions. Example: 'Add audio streaming API, improve catalog loading, and expand Pustakalaya data collections'.
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 15

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api.py`:
- Around line 111-113: is_catalog_resource_url currently calls load_all_books()
on every request causing repeated disk reads; change it to use a cached catalog
and precomputed allowlists for the fields (e.g., pdfUrl/readUrl/audioUrl) stored
at module-level (or in a simple CatalogCache class) and have load_all_books()
populate/update that cache instead of re-parsing each time; implement cache
invalidation by checking file mtimes (or a single catalog last-modified
timestamp) and refresh the cached book list and derived sets only when files
change so is_catalog_resource_url() simply checks membership in the precomputed
set rather than iterating load_all_books() each call.
- Around line 130-134: The active-catalog discovery loop currently collects all
JSON files under DATA_DIR (using os.walk) including archived datasets; modify
the loop that builds filepaths so it skips files located in the archive
directory (e.g., any root path matching os.path.join(DATA_DIR, "archive_data")
or contains "/archive" segment) and still excludes "all_books.json" and
non-JSONs; update the condition in the for root, _, files in os.walk(DATA_DIR)
block (the code that appends to filepaths) to continue when the file's root or
full path indicates an archived dataset so archived JSONs are not added to
filepaths.
- Around line 354-358: The current Response uses
stream_with_context(upstream.iter_content(...)) but never closes the
requests.Response object "upstream", which can leak connections; wrap
iter_content in a generator that yields chunks and calls upstream.close() in a
finally block (or use contextlib.closing) and pass that generator to
stream_with_context so "upstream" is always closed on iterator exhaustion or
client disconnect; update the return to use
Response(stream_with_context(your_chunk_generator()), headers=headers,
status=upstream.status_code) and ensure the generator references
upstream.iter_content(chunk_size=64 * 1024) and closes upstream in its finally.

In `@data/Course Materials/pus_geography.json`:
- Line 27: The PUS JSON uses "educationLevel" but the API expects "level" (see
LIST_BOOK_FIELDS in api.py and openapi.json); update the PUS ingestion/merge or
add fallback logic so entries get a numeric "level" field: convert/normalize
data/Course Materials/pus_geography.json (and other PUS inputs) by mapping
educationLevel values (e.g., "Primary"/"Secondary"/"Tertiary"/etc.) to the
numeric level used by the API, and ensure the scraper/merge output emits "level"
(or have the API layer that builds LIST_BOOK_FIELDS check for educationLevel and
populate level before serialization). Reference the PUS scraper/merge step that
produces the catalog documents and the API builder that uses LIST_BOOK_FIELDS to
implement this mapping.

In `@data/Course Materials/pus_occupation-business-and-technology-education.json`:
- Around line 108-109: Normalize the publisher values in
pus_occupation-business-and-technology-education.json by removing embedded
newlines and indentation artifacts and collapsing multiple whitespace (including
non-breaking spaces) into a single regular space for the "publisher" fields;
locate the entries where "publisher" currently contains newline/indentation
(e.g. the value starting with "नेपाल सरकार,") and replace them with a
single-line, trimmed string like "नेपाल सरकार, पाठ्यक्रम विकास केन्द्र" so
exact-match filtering and rendering are not broken.
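A minimal sketch of the whitespace cleanup described above, assuming the values are plain strings. The helper name `normalize_text` is hypothetical. It relies on `str.split()` with no arguments, which splits on any Unicode whitespace, including newlines and non-breaking spaces.

```python
def normalize_text(value: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    return " ".join(value.split())


# Sample value with a newline, indentation, and a non-breaking space (U+00A0).
raw_publisher = "नेपाल सरकार,\n        \u00a0पाठ्यक्रम विकास केन्द्र  "
clean_publisher = normalize_text(raw_publisher)
```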

In `@data/Course Materials/pus_our-surroundings.json`:
- Line 257: The pageCount fields currently contain mixed-script numerals ("३0");
find the "pageCount" entries with the value "३0" and normalize them to use ASCII
digits (e.g., "30") or, preferably, a JSON number (pageCount: 30) to ensure
consistent parsing/sorting; update both occurrences so all pageCount values use
the same script/type.
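One possible way to normalize mixed-script values such as "३0" is to translate Devanagari digits to ASCII before converting to an integer. The helper name `parse_page_count` and the choice to return `None` for unparseable input are assumptions for this sketch.

```python
# Map Devanagari digits to their ASCII equivalents.
DEVANAGARI_TO_ASCII = str.maketrans("०१२३४५६७८९", "0123456789")


def parse_page_count(raw):
    """Return the page count as an int, or None if it cannot be parsed."""
    if raw is None:
        return None
    text = str(raw).strip().translate(DEVANAGARI_TO_ASCII)
    if text.isascii() and text.isdigit():
        return int(text)
    return None
```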

In `@data/Course Materials/pus_plant-science.json`:
- Around line 471-478: This record (id:
pus-79b624b9-a60e-48ee-aaeb-e6eda45d2d1d, title: "Operation and Maintenance of
Microhydro Plant and Photovoltaic System :Learning Resource Material - Grade
12") is missing a subject field so it won’t be returned by GET
/api/books?subject=...; open the JSON object for that record and add an
appropriate "subject" property (for example "subject": "Plant Science", or
"subject": "Renewable Energy / Microhydro" depending on taxonomy) and optionally
adjust "category"/"keywords" to match catalog filters so it appears in
subject-based queries.

In `@data/Course Materials/pus_rural-development.json`:
- Line 1: The file pus_rural-development.json currently contains an empty array
([]) which makes this category always return zero items; confirm whether this is
intentional and either (a) populate pus_rural-development.json with the expected
dataset entries, (b) remove/rename the file so the catalog bucket isn't dead
until data is ready, or (c) replace the empty array with a short placeholder
object/metadata indicating "data pending" so callers can handle the empty state
explicitly.

In `@data/Literature and Arts/pus_traditional-art.json`:
- Around line 31-51: The record uses a non-unique id
"pus-592843a1-7e1a-4769-b30c-cc2ffa030b53" which is duplicated elsewhere and
will cause ingest collisions; update the "id" field in this JSON object (and the
other duplicate record that reuses that same id) to a newly generated unique id
(e.g., a new UUID with the "pus-" prefix) so each record has a globally unique
"id" value and ensure any references to that id are updated consistently.

In `@data/Other Educational Materials/pus_computer.json`:
- Around line 906-919: The JSON record with id
"pus-6a28b603-c348-4799-a4a5-e3664bc12107" is incomplete (missing readUrl and
other core metadata), so update this object to include the same required fields
used across the dataset (at minimum add "readUrl" with a valid document URL and
any missing core metadata such as publisher/publishedDate/format/size/rights as
your schema requires) and run the same validation/normalization used elsewhere
before persisting to ensure consistency and prevent broken read/download
behavior.

In `@data/Other Educational Materials/pus_philosophy-and-religion.json`:
- Around line 74-103: The JSON records in pus_philosophy-and-religion.json reuse
existing UUID values in the Sanskrit course dataset causing id-based
deduplication collisions; update the "id" fields for the conflicting records
(e.g., the entry with "id": "pus-0430b451-b53e-4ac0-a868-c35b5269b703") to
unique UUIDs (or a namespaced id scheme) so each record across all categories is
globally unique, then re-run the loader to verify no id collisions remain.

In `@data/Other Educational Materials/pus_sports.json`:
- Around line 142-165: The record in this file uses the same id
"pus-94caee86-1d2c-474e-b0a4-d2be3a45ec28" that already exists in the other
catalog entry (referenced in pus_photo-essay.json), causing one to be dropped;
fix it by giving this entry a unique id (replace the "id" value in this object
with a new UUID or canonical unique identifier) or merge the two records if they
represent the same item, and then verify no other entries share that id.
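To verify that no ids are shared across files, a check along these lines could run before merging. It is a sketch that works on already-loaded data; the function name `find_duplicate_ids` and the placeholder ids in the sample are hypothetical.

```python
from collections import defaultdict


def find_duplicate_ids(datasets):
    """Map each id that appears more than once to the files that contain it.

    `datasets` maps a file name to its list of records.
    """
    owners = defaultdict(list)
    for filename, records in datasets.items():
        for record in records:
            if isinstance(record, dict) and "id" in record:
                owners[record["id"]].append(filename)
    return {book_id: files for book_id, files in owners.items() if len(files) > 1}


# Placeholder ids, for illustration only.
sample = {
    "pus_sports.json": [{"id": "pus-x"}],
    "pus_photo-essay.json": [{"id": "pus-x"}, {"id": "pus-y"}],
}
duplicates = find_duplicate_ids(sample)
```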

In `@data/Teaching Materials/pus_educational-theory-and-philosophy.json`:
- Line 625: The JSON records have "pageCount" fields containing years ("2008",
"2006") instead of numeric page totals; locate the offending "pageCount" keys
and correct them by replacing the year strings with the actual integer page
counts or null if unknown, or move the value to the correct field such as
"publicationYear" if that was intended; add/adjust a validation step (schema or
script) that enforces pageCount as an integer >0 and scan the file for any
4-digit year patterns in "pageCount" to fix or flag for manual review (refer to
the "pageCount" key and the specific values "2008" and "2006" to find the
records).
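The validation step suggested above might look like the sketch below. It assumes `pageCount` has already been normalized to ASCII digits, and it only flags year-like values for manual review, since a four-digit number could in principle be a genuine page total. The function name and the three status strings are hypothetical.

```python
import re

# Four-digit values starting with 19 or 20 look like publication years.
YEAR_LIKE = re.compile(r"^(19|20)\d{2}$")


def check_page_count(value):
    """Classify a pageCount value as 'ok', 'year-like', or 'invalid'."""
    if value is None:
        return "ok"  # an unknown page count is allowed
    text = str(value).strip()
    if not (text.isascii() and text.isdigit()) or int(text) <= 0:
        return "invalid"
    if YEAR_LIKE.match(text):
        return "year-like"
    return "ok"
```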

In `@data/Teaching Materials/pus_local-curriculum.json`:
- Line 73: The educationLevel field contains embedded newlines and extra
whitespace; normalize it by trimming whitespace and collapsing internal
newlines/extra spaces and store as a stable representation (preferably an array
of trimmed values like ["Primary","Middle"] or a comma-joined string
"Primary,Middle"); update the entries where educationLevel appears (the current
key "educationLevel" at the shown diff and the other occurrences called out) to
parse the raw string by splitting on commas/newlines, trimming each token,
filtering out empties, and then serialize the cleaned array or joined string
consistently.
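The split, trim, and filter steps above can be sketched as follows, using the array representation. The function name `normalize_education_level` is hypothetical, and the sample value only imitates the kind of embedded newline and indentation the comment describes.

```python
import re


def normalize_education_level(raw):
    """Split on commas/newlines, trim each token, and drop empty tokens."""
    if not raw:
        return []
    tokens = re.split(r"[,\n]", raw)
    cleaned = (" ".join(token.split()) for token in tokens)
    return [token for token in cleaned if token]


levels = normalize_education_level("Primary,\n            Middle\n")
```

Joining the result with `",".join(levels)` would give the comma-joined string form instead.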

In `@data/Teaching Materials/pus_quality-education-support-material.json`:
- Around line 212-229: This record (id
"pus-6a1b7c47-f833-41db-878e-360c359bc838", title "Study Report on Effectiveness
of Reimbursement System of Free Textbook Distribution") is missing the publisher
field; add a "publisher" key to this JSON object (either a string with the
publisher name or explicit null) so the schema matches other entries and
preserves dataset shape consistency.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a4240fdd-d216-4b11-a4a9-a19d59e40d3e

📥 Commits

Reviewing files that changed from the base of the PR and between 6780ffb and 60d3046.

📒 Files selected for processing (87)

README.md
api.py
data/Course Materials/pus_accounting.json
data/Course Materials/pus_animal-science.json
data/Course Materials/pus_civics.json
data/Course Materials/pus_civil-engineering.json
data/Course Materials/pus_computer-engineering.json
data/Course Materials/pus_e-paath.json
data/Course Materials/pus_economics.json
data/Course Materials/pus_education.json
data/Course Materials/pus_electrical-engineering.json
data/Course Materials/pus_english.json
data/Course Materials/pus_environmental-studies.json
data/Course Materials/pus_geography.json
data/Course Materials/pus_health-and-physical-education.json
data/Course Materials/pus_history-and-culture.json
data/Course Materials/pus_mathematics.json
data/Course Materials/pus_moral-education.json
data/Course Materials/pus_music.json
data/Course Materials/pus_nepali.json
data/Course Materials/pus_occupation-business-and-technology-education.json
data/Course Materials/pus_old-textbooks.json
data/Course Materials/pus_our-surroundings.json
data/Course Materials/pus_plant-science.json
data/Course Materials/pus_political-science-and-philosophy.json
data/Course Materials/pus_population.json
data/Course Materials/pus_rural-development.json
data/Course Materials/pus_sanskrit.json
data/Course Materials/pus_science.json
data/Course Materials/pus_social-studies.json
data/Course Materials/pus_sociology-and-anthropology.json
data/Course Materials/pus_technical-and-vocational.json
data/Course Materials/pus_textbook-chapters.json
data/Course Materials/pus_textbooks.json
data/Literature and Arts/pus_do-it-yourself.json
data/Literature and Arts/pus_english-children-s-literature.json
data/Literature and Arts/pus_english-literature.json
data/Literature and Arts/pus_inspirational-materials.json
data/Literature and Arts/pus_literature-in-other-nepali-languages.json
data/Literature and Arts/pus_nepali-children-s-literature.json
data/Literature and Arts/pus_nepali-literature.json
data/Literature and Arts/pus_traditional-art.json
data/Other Educational Materials/pus_agriculture-and-biodiversity.json
data/Other Educational Materials/pus_civics-related-materials.json
data/Other Educational Materials/pus_computer.json
data/Other Educational Materials/pus_cottage-and-small-industries.json
data/Other Educational Materials/pus_education-related-materials.json
data/Other Educational Materials/pus_environment-related-materials.json
data/Other Educational Materials/pus_health-and-security-related-materials.json
data/Other Educational Materials/pus_law-and-government.json
data/Other Educational Materials/pus_philosophy-and-religion.json
data/Other Educational Materials/pus_photo-essay.json
data/Other Educational Materials/pus_science-and-technology.json
data/Other Educational Materials/pus_sports.json
data/Other Educational Materials/pus_tourism.json
data/Reference Materials/pus_atlas.json
data/Reference Materials/pus_children-s-encyclopedia.json
data/Reference Materials/pus_dictionary.json
data/Teaching Materials/pus_additional-reading-material-for-teachers.json
data/Teaching Materials/pus_curriculum.json
data/Teaching Materials/pus_educational-theory-and-philosophy.json
data/Teaching Materials/pus_journals-magazines-newsletters-and-pamphlets.json
data/Teaching Materials/pus_literacy-resources.json
data/Teaching Materials/pus_local-curriculum.json
data/Teaching Materials/pus_professional-development.json
data/Teaching Materials/pus_quality-education-support-material.json
data/Teaching Materials/pus_teacher-training-material.json
data/Teaching Materials/pus_teachers-guides-old.json
data/Teaching Materials/pus_teachers-guides.json
data/Teaching Materials/pus_teaching-support-material.json
data/all_books.json
data/archive_data/archive_org.json
data/archive_data/cdc_nepal.json
data/archive_data/open_library.json
data/archive_data/pustakalaya.json
data/cehrd_audio.json
data/cehrd_nfe.json
data/cehrd_stories.json
index.html
openapi.json
playground.html
scripts/scrape_pustakalaya_course_materials.py
scripts/scrape_pustakalaya_literature.py
scripts/scrape_pustakalaya_literature_copy.py
scripts/scrape_pustakalaya_other_educational_materials.py
scripts/scrape_pustakalaya_teaching_materials.py
scripts/scraper.py

coderabbitai · 2026-05-24T13:41:02Z

    for book in load_all_books():
-        if url in {book.get("pdfUrl"), book.get("readUrl")}:
+        if url in {book.get(field) for field in fields}:
            return True


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid full catalog reload for each URL authorization check.

is_catalog_resource_url() re-parses all JSON files on every /api/pdf and /api/audio request via load_all_books(). That adds avoidable disk I/O and latency on a hot path. Cache the loaded catalog (or precomputed allowlists for pdfUrl/readUrl/audioUrl) and refresh only when files change.

Also applies to: 118-150
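A minimal sketch of the caching idea, assuming a single pre-merged catalog file whose modification time signals changes. `CatalogCache`, `load_json`, and the `loads` counter are hypothetical names introduced for illustration; the sketch is not thread-safe and does not watch the per-category JSON files.

```python
import json
import os


def load_json(path):
    """Read a JSON array of book records from `path`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


class CatalogCache:
    """Cache per-field URL allowlists and rebuild them only when the file changes."""

    def __init__(self, path, loader=load_json,
                 fields=("pdfUrl", "readUrl", "audioUrl")):
        self._path = path
        self._loader = loader
        self._fields = fields
        self._mtime_ns = None
        self._allowlists = {}
        self.loads = 0  # number of times the file was actually parsed

    def _refresh_if_stale(self):
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns == self._mtime_ns:
            return
        books = self._loader(self._path)
        self._allowlists = {
            field: {b[field] for b in books if isinstance(b.get(field), str)}
            for field in self._fields
        }
        self._mtime_ns = mtime_ns
        self.loads += 1

    def is_allowed(self, url, fields):
        """Set-membership check against the cached allowlists."""
        self._refresh_if_stale()
        return any(url in self._allowlists.get(field, ()) for field in fields)
```

Each request then costs one `os.stat` call and a set lookup instead of re-parsing every JSON file.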

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api.py` around lines 111 - 113, is_catalog_resource_url currently calls load_all_books() on every request causing repeated disk reads; change it to use a cached catalog and precomputed allowlists for the fields (e.g., pdfUrl/readUrl/audioUrl) stored at module-level (or in a simple CatalogCache class) and have load_all_books() populate/update that cache instead of re-parsing each time; implement cache invalidation by checking file mtimes (or a single catalog last-modified timestamp) and refresh the cached book list and derived sets only when files change so is_catalog_resource_url() simply checks membership in the precomputed set rather than iterating load_all_books() each call.

coderabbitai · 2026-05-24T13:41:02Z

+    for root, _, files in os.walk(DATA_DIR):
+        for filename in sorted(files):
+            if not filename.endswith(".json") or filename == "all_books.json":
+                continue
+            filepaths.append(os.path.join(root, filename))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Exclude archived datasets from active catalog discovery.

Lines 130-134 currently ingest all JSON files under data/, which includes data/archive_data/*.json and contradicts the “not part of active merged catalog” behavior documented in README.md (Line 27). This unexpectedly changes result quality and source composition.

Suggested fix

-    for root, _, files in os.walk(DATA_DIR):
+    for root, _, files in os.walk(DATA_DIR):
+        if os.path.basename(root) == "archive_data":
+            continue
         for filename in sorted(files):
             if not filename.endswith(".json") or filename == "all_books.json":
                 continue
             filepaths.append(os.path.join(root, filename))
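One variation worth considering: a `continue` on the directory name skips files directly inside `archive_data`, but `os.walk` would still descend into any subdirectories it contains. Pruning the `dirs` list in place avoids that. The sketch below is an alternative under assumptions, not the reviewed code; the function name `discover_catalog_files` and the `skip_dirs` parameter are hypothetical.

```python
import os


def discover_catalog_files(data_dir, skip_dirs=("archive_data",)):
    """Collect active catalog JSON files, never entering skipped directories."""
    filepaths = []
    for root, dirs, files in os.walk(data_dir):
        # In-place assignment tells os.walk (top-down mode) not to descend
        # into the removed directories; sorting keeps the order stable.
        dirs[:] = sorted(d for d in dirs if d not in skip_dirs)
        for filename in sorted(files):
            if not filename.endswith(".json") or filename == "all_books.json":
                continue
            filepaths.append(os.path.join(root, filename))
    return filepaths
```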


🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api.py` around lines 130 - 134, The active-catalog discovery loop currently collects all JSON files under DATA_DIR (using os.walk) including archived datasets; modify the loop that builds filepaths so it skips files located in the archive directory (e.g., any root path matching os.path.join(DATA_DIR, "archive_data") or contains "/archive" segment) and still excludes "all_books.json" and non-JSONs; update the condition in the for root, _, files in os.walk(DATA_DIR) block (the code that appends to filepaths) to continue when the file's root or full path indicates an archived dataset so archived JSONs are not added to filepaths.

coderabbitai · 2026-05-24T13:41:02Z

+    return Response(
+        stream_with_context(upstream.iter_content(chunk_size=64 * 1024)),
+        headers=headers,
+        status=upstream.status_code,
+    )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify other streamed requests responses are explicitly closed.
rg -n -C3 'iter_content\(|requests\.get\(' api.py

Repository: Shubhamnpk/yobook-api

Length of output: 949

Close the upstream requests response when streaming iter_content to avoid connection leaks.
In api.py (lines 354-358), the code streams upstream.iter_content(...) via Response(stream_with_context(...)) but does not explicitly close the requests.get(..., stream=True) response (upstream). Under early client disconnects, this can keep connections from being released back to the pool.

Suggested fix

-    return Response(
-        stream_with_context(upstream.iter_content(chunk_size=64 * 1024)),
-        headers=headers,
-        status=upstream.status_code,
-    )
+    def generate():
+        try:
+            for chunk in upstream.iter_content(chunk_size=64 * 1024):
+                if chunk:
+                    yield chunk
+        finally:
+            upstream.close()
+
+    return Response(
+        stream_with_context(generate()),
+        headers=headers,
+        status=upstream.status_code,
+    )
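The reason the `finally` block helps can be shown without Flask or a network call. In the sketch below, `FakeUpstream` is a hypothetical stand-in for a streamed `requests.Response`, and `stream_and_close` mirrors the suggested `generate()` function. Closing a generator early raises `GeneratorExit` at the paused `yield`, so the `finally` clause still runs; a WSGI server is expected to call `close()` on the response iterable when a client disconnects.

```python
class FakeUpstream:
    """Hypothetical stand-in for a streamed requests.Response."""

    def __init__(self, chunks):
        self._chunks = chunks
        self.closed = False

    def iter_content(self, chunk_size=64 * 1024):
        yield from self._chunks

    def close(self):
        self.closed = True


def stream_and_close(upstream, chunk_size=64 * 1024):
    """Yield non-empty chunks and always release the upstream connection."""
    try:
        for chunk in upstream.iter_content(chunk_size=chunk_size):
            if chunk:
                yield chunk
    finally:
        upstream.close()


# Simulate a client that disconnects after the first chunk.
partial = FakeUpstream([b"a", b"b"])
stream = stream_and_close(partial)
first_chunk = next(stream)
stream.close()  # triggers the finally block
```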

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api.py` around lines 354 - 358, The current Response uses stream_with_context(upstream.iter_content(...)) but never closes the requests.Response object "upstream", which can leak connections; wrap iter_content in a generator that yields chunks and calls upstream.close() in a finally block (or use contextlib.closing) and pass that generator to stream_with_context so "upstream" is always closed on iterator exhaustion or client disconnect; update the return to use Response(stream_with_context(your_chunk_generator()), headers=headers, status=upstream.status_code) and ensure the generator references upstream.iter_content(chunk_size=64 * 1024) and closes upstream in its finally.

coderabbitai · 2026-05-24T13:41:02Z

+    "fileSize": "4.11 MB",
+    "publisher": "पाठ्यक्रम विकास केन्द्र",
+    "pageCount": "२४१",
+    "educationLevel": "Secondary"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check which level-related keys API code expects
fd -i 'api.py' --exec rg -n --no-heading 'LIST_BOOK_FIELDS|educationLevel|\"level\"|\blevel\b' {}

# List JSON files that use educationLevel without level
python - <<'PY'
import json, pathlib

for p in pathlib.Path("data").rglob("*.json"):
    try:
        arr = json.loads(p.read_text(encoding="utf-8"))
    except Exception:
        continue
    if not isinstance(arr, list):
        continue
    keys = set()
    for item in arr:
        if isinstance(item, dict):
            keys.update(item.keys())
    if "educationLevel" in keys and "level" not in keys:
        print(p)
PY

Repository: Shubhamnpk/yobook-api

Length of output: 3086

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find code references to educationLevel and level
echo "=== educationLevel references ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' 'educationLevel' .

echo "=== level field references ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' '"level"' .

# Check for mapping/back-compat logic (common patterns)
echo "=== Possible mapping logic ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
  'map(ping)?|compat|transform|normalize|educationLevel.*level|level.*educationLevel' .

# Locate the likely request/response schema definitions
echo "=== Likely schema/field lists ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
  'LIST_BOOK_FIELDS|BOOK_FIELDS|education.*level|educationLevel|level catalog|catalog.*level' .

Repository: Shubhamnpk/yobook-api

Length of output: 50378

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find all relevant source files mentioning each key
echo "=== Files mentioning educationLevel (code only) ==="
rg -l --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
  'educationLevel' .

echo "=== Files mentioning \"level\" (code only) ==="
rg -l --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
  '"level"' .

echo "=== Files mentioning mapping/transform keywords around levels ==="
rg -l --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
  'educationLevel.*level|level.*educationLevel|normalize.*level|transform.*level|map.*level|compat' .

# Narrow to likely request/response/schema code
echo "=== Candidate API schema/field lists ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
  'LIST_BOOK_FIELDS|BOOK_FIELDS|field.*level|educationLevel.*field|level.*field' . || true

# Show a small window around any mapping we find
MAP_FILES=$(rg -l --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
  'educationLevel.*level|level.*educationLevel|normalize.*level|transform.*level|map.*level|compat' . || true)
if [ -n "${MAP_FILES}" ]; then
  echo "=== Context for mapping candidates ==="
  echo "${MAP_FILES}" | head -n 20 | while read -r f; do
    echo "--- ${f} ---"
    rg -n --no-heading \
      'educationLevel.*level|level.*educationLevel|normalize.*level|transform.*level|map.*level|compat' "$f" | head -n 5
  done
fi

Repository: Shubhamnpk/yobook-api

Length of output: 31446

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show LIST_BOOK_FIELDS and surrounding code
python - <<'PY'
from pathlib import Path
p = Path("api.py")
lines = p.read_text(encoding="utf-8").splitlines()
# print around LIST_BOOK_FIELDS
for i,l in enumerate(lines):
    if "LIST_BOOK_FIELDS" in l:
        start=max(0,i-20); end=min(len(lines), i+80)
        for j in range(start,end):
            print(f"{j+1:5d}:{lines[j]}")
        break
PY

# Search for where fields are read/filtered for level metadata
rg -n --no-heading 'educationLevel|\"level\"|level\b|LIST_BOOK_FIELDS|book\[[' api.py

# Also inspect scraper.py/openapi.json for schema expectation if present
echo "=== openapi.json (presence of level/educationLevel) ==="
rg -n --no-heading '"level"' openapi.json || true
rg -n --no-heading 'educationLevel' openapi.json || true

Repository: Shubhamnpk/yobook-api

Length of output: 3550

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== api.py: all educationLevel occurrences ==="
rg -n --no-heading 'educationLevel' api.py || true

echo "=== api.py: all occurrences of field 'level' ==="
rg -n --no-heading '\blevel\b' api.py

echo "=== api.py: openapi schema field 'level' / educationLevel ==="
rg -n --no-heading '"level"' openapi.json || true
rg -n --no-heading 'educationLevel' openapi.json || true

echo "=== data: JSON files that contain a top-level or item key \"level\" ==="
# crude but effective: find any file containing the literal string "level":
rg -l --no-heading --hidden --glob 'data/**/*.json' '"level"' .

echo "=== data: JSON files that contain \"educationLevel\" and do NOT contain \"level\" ==="
python3 - <<'PY'
import json, pathlib
paths_with_no_level=[]
for p in pathlib.Path("data").rglob("*.json"):
    try:
        txt=p.read_text(encoding="utf-8")
    except Exception:
        continue
    if '"educationLevel"' not in txt:
        continue
    if '"level"' in txt:
        continue
    # If file is huge still count by substring; better to also parse if it's a list of dicts
    try:
        arr=json.loads(txt)
    except Exception:
        continue
    keys=set()
    if isinstance(arr,list):
        for item in arr:
            if isinstance(item,dict):
                keys.update(item.keys())
    if "educationLevel" in keys and "level" not in keys:
        paths_with_no_level.append(str(p))
for x in sorted(paths_with_no_level):
    print(x)
PY

echo "=== Inspect pus_geography.json keys (first list item) ==="
python3 - <<'PY'
import json
from pathlib import Path
p=Path("data/Course Materials/pus_geography.json")
data=json.loads(p.read_text(encoding="utf-8"))
print("type:", type(data))
if isinstance(data,list) and data and isinstance(data[0],dict):
    d=data[0]
    print("first item keys:", sorted(d.keys()))
    print("educationLevel:", d.get("educationLevel"))
    print("level:", d.get("level"))
else:
    print("not a list of dicts; top-level type:", type(data))
PY

echo "=== Codebase: any educationLevel<->level mapping logic in scripts/ ==="
rg -n --no-heading --hidden --glob 'scripts/**/*' \
  'educationLevel.*level|level.*educationLevel|normalize.*level|transform.*level' .

Repository: Shubhamnpk/yobook-api

Length of output: 3809

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Find how the API uses `level` beyond just listing it in LIST_BOOK_FIELDS
rg -n --no-heading 'level' api.py

# 2) Find any conversion/mapping from educationLevel to level in scripts
rg -n --no-heading 'educationLevel' scripts
rg -n --no-heading '"level"' scripts
rg -n --no-heading 'educationLevel.*level|level.*educationLevel|map.*level|normalize.*level|transform.*level' scripts || true

# 3) Inspect openapi.json for educationLevel vs level usage
rg -n --no-heading 'educationLevel' openapi.json || true
rg -n --no-heading '"level"' openapi.json | head -n 50

# 4) Inspect api.py for where request params are handled for filtering/sorting (search common query patterns)
rg -n --no-heading 'request\.args\.get\("level"\)|args\.get\("level"\)|level.*request\.args|get.*level' api.py || true
rg -n --no-heading 'sort|filter' api.py

Repository: Shubhamnpk/yobook-api

Length of output: 1672

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate /api/books handler in api.py
rg -n --no-heading 'GET /api/books|def .*books|/api/books' api.py

# Print a focused window around the likely books search/filter logic
python3 - <<'PY'
from pathlib import Path
p=Path("api.py")
lines=p.read_text(encoding="utf-8").splitlines()
def dump_around(needle, radius=80):
    for i,l in enumerate(lines):
        if needle in l:
            start=max(0,i-radius); end=min(len(lines), i+radius)
            for j in range(start,end):
                print(f"{j+1:5d}:{lines[j]}")
            return True
    return False
# common needles
dump_around('def get_books')
dump_around('books')
dump_around('request.args')
dump_around('"level"')
PY

# Inspect scripts/scraper.py where it assigns "level"
python3 - <<'PY'
from pathlib import Path
p=Path("scripts/scraper.py")
lines=p.read_text(encoding="utf-8").splitlines()
for i,l in enumerate(lines):
    if '"level":' in l and 'level' in l:
        start=max(0,i-40); end=min(len(lines), i+60)
        for j in range(start,end):
            print(f"{j+1:5d}:{lines[j]}")
        break
PY

# Search for any mapping logic in code (not just scripts) around educationLevel->level
rg -n --no-heading 'educationLevel.*level|level.*educationLevel|educationLevel.*int|educationLevel.*catalog|catalog.*educationLevel|educationLevel.*map' . || true

Repository: Shubhamnpk/yobook-api

Length of output: 25082

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Any query-param usage of "level" in api.py
rg -n --no-heading 'request\.args\.get\("level"\)|args\.get\("level"\)|\blevel\b.*request\.args' api.py || true

# Any API code that reads book["level"] (besides listing it in LIST_BOOK_FIELDS / openapi)
rg -n --no-heading '\bget\("level"\)|\b\["level"\]|LIST_BOOK_FIELDS' api.py

# Is educationLevel referenced anywhere in non-data code (besides scripts)?
rg -n --no-heading --glob '!data/**' 'educationLevel' .

# openapi: is level used as a query param or just a response field?
rg -n --no-heading '"level"' openapi.json
rg -n --no-heading 'educationLevel' openapi.json || true

# Identify which data sources actually contain "level"
python3 - <<'PY'
import json, pathlib
with_level=[]
for p in pathlib.Path("data").rglob("*.json"):
    try:
        txt=p.read_text(encoding="utf-8")
    except Exception:
        continue
    if '"level"' not in txt:
        continue
    # quick parse to see if it's list-of-dicts and level exists in items
    try:
        arr=json.loads(txt)
    except Exception:
        with_level.append(str(p))
        continue
    ok=False
    if isinstance(arr,list):
        for item in arr:
            if isinstance(item,dict) and 'level' in item:
                ok=True
                break
    if ok:
        with_level.append(str(p))
print("\n".join(sorted(set(with_level))))
PY

Repository: Shubhamnpk/yobook-api

Length of output: 846

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== educationLevel usage outside data/** ==="
rg -n --no-heading --glob '!data/**' 'educationLevel' .

echo "=== scripts/code touching all_books / catalog merge ==="
rg -n --no-heading --glob 'scripts/**' 'all_books\.json|merge|normalize|transform' .

echo "=== any code setting level from educationLevel ==="
rg -n --no-heading --glob '!data/**' 'educationLevel.*level|level.*educationLevel' .

Repository: Shubhamnpk/yobook-api

Length of output: 12603

Map educationLevel data to the API/catalog level field (PUS JSON currently omits level)

data/Course Materials/pus_geography.json contains only:

"educationLevel": "Secondary"

But the API/catalog fields are built around level (api.py includes "level" in LIST_BOOK_FIELDS, and openapi.json exposes level), and there is no educationLevel → level conversion found in the codebase/scrapers. As a result, PUS entries like this will have level missing/null in API responses, breaking any level-based sorting/filtering that depends on the level field.

Update the PUS scrapers/merge step to emit the numeric level field (preferred), or add backward-compatible fallback logic to interpret educationLevel as level.
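The fallback option could look roughly like the sketch below. It assumes catalog records are plain dicts; `with_level_fallback` and the numeric codes in `LEVEL_BY_NAME` are hypothetical, since the actual scale behind `level` is not shown in this review.

```python
from typing import Any, Optional

# Hypothetical mapping: the real numeric scale behind `level` is not shown in this review.
LEVEL_BY_NAME = {"primary": 1, "middle": 2, "secondary": 3}


def with_level_fallback(book: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `book` whose `level` is filled from `educationLevel` when missing."""
    if book.get("level") is not None:
        return dict(book)
    raw = book.get("educationLevel")
    level: Optional[int] = None
    if isinstance(raw, str):
        # Multi-valued strings such as "Primary , Middle" resolve to the first known name.
        for token in raw.replace("\n", ",").split(","):
            level = LEVEL_BY_NAME.get(token.strip().lower())
            if level is not None:
                break
    return {**book, "level": level}
```

Unknown or absent names leave `level` as `None`, so an unmapped value stays visible instead of being guessed.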

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@data/Course Materials/pus_geography.json` at line 27, the PUS JSON uses "educationLevel" but the API expects "level" (see LIST_BOOK_FIELDS in api.py and openapi.json); update the PUS ingestion/merge or add fallback logic so entries get a numeric "level" field: convert/normalize data/Course Materials/pus_geography.json (and other PUS inputs) by mapping educationLevel values (e.g., "Primary"/"Secondary"/"Tertiary"/etc.) to the numeric level used by the API, and ensure the scraper/merge output emits "level" (or have the API layer that builds LIST_BOOK_FIELDS check for educationLevel and populate level before serialization). Reference the PUS scraper/merge step that produces the catalog documents and the API builder that uses LIST_BOOK_FIELDS to implement this mapping.

coderabbitai · 2026-05-24T13:41:02Z

+    "id": "pus-0430b451-b53e-4ac0-a868-c35b5269b703",
+    "title": "कर्मकाण्ड: कक्षा १०",
+    "author": "Unknown",
+    "language": "ne",
+    "country": "np",
+    "source": "pustakalaya-other-educational",
+    "sourceUrl": "https://pustakalaya.org/documents/detail/0430b451-b53e-4ac0-a868-c35b5269b703/",
+    "coverUrl": "https://pustakalaya.org/media/uploads/thumbnails/document/2023/09/11/करमकणड_ककष_१०.jpg",
+    "category": "Other Educational Materials",
+    "keywords": [
+      "Karmakanda",
+      "कक्षा १०",
+      "Textbook",
+      "Grade 10",
+      "गुरुकुल",
+      "Gurukul",
+      "कर्मकाण्ड",
+      "Sanskrit",
+      "New Textbook",
+      "Ritual",
+      "संस्कृतम्",
+      "पाठ्यपुस्तक"
+    ],
+    "scrapedAt": "2026-05-24T11:34:44.382540Z",
+    "readUrl": "https://pustakalaya.org/media/uploads/documents/2023/09/11/Karmakanda10_124cc050/1685512467.pdf",
+    "fileSize": "1.0 MB",
+    "publisher": "पाठ्यक्रम विकास केन्द्र",
+    "pageCount": "२४६",
+    "educationLevel": "Secondary"
+  },


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Cross-category ID reuse will overwrite/drop records at load time.

Starting at Line 74, this file reuses IDs already present in the Sanskrit course dataset (multiple occurrences in this file). With id-based deduplication, these entries will collide and one side will be lost, causing incorrect category/source exposure.
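One way to surface such collisions before load time is a scan over the per-category files. This is a sketch only: `find_duplicate_ids` is a hypothetical helper, and it assumes each `pus_*.json` file holds a JSON list of objects with an `id` key.

```python
import json
from collections import defaultdict
from pathlib import Path


def find_duplicate_ids(data_dir: str) -> dict[str, list[str]]:
    """Map each id that occurs more than once to the files that contain it."""
    seen: dict[str, list[str]] = defaultdict(list)
    # Only per-category files are scanned; a merged catalog would repeat every id.
    for path in sorted(Path(data_dir).rglob("pus_*.json")):
        records = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(records, list):
            continue
        for record in records:
            if isinstance(record, dict) and "id" in record:
                seen[record["id"]].append(str(path))
    return {book_id: files for book_id, files in seen.items() if len(files) > 1}
```

Running it against `data/` and failing the build on a non-empty result would catch reuse across categories as well as repeats inside a single file.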

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@data/Other Educational Materials/pus_philosophy-and-religion.json` around lines 74 - 103, the JSON records in pus_philosophy-and-religion.json reuse UUID values that already exist in the Sanskrit course dataset, causing id-based deduplication collisions; update the "id" fields for the conflicting records (e.g., the entry with "id": "pus-0430b451-b53e-4ac0-a868-c35b5269b703") to unique UUIDs (or a namespaced id scheme) so each record across all categories is globally unique, then re-run the loader to verify no id collisions remain.

coderabbitai · 2026-05-24T13:41:02Z

+    "id": "pus-94caee86-1d2c-474e-b0a4-d2be3a45ec28",
+    "title": "Motor Sports",
+    "author": "Prabhat Bhattarai",
+    "language": "en",
+    "country": "np",
+    "source": "pustakalaya-other-educational",
+    "sourceUrl": "https://pustakalaya.org/documents/detail/94caee86-1d2c-474e-b0a4-d2be3a45ec28/",
+    "coverUrl": "https://pustakalaya.org/media/uploads/thumbnails/document/2019/02/22/Motor_Sports.jpg",
+    "category": "Other Educational Materials",
+    "keywords": [
+      "Photo Essays",
+      "Prabhat Bhattarai",
+      "साझा शिक्षा ई-पाटी",
+      "OLE Nepal",
+      "प्रभात भट्टराई",
+      "Motor Sports"
+    ],
+    "scrapedAt": "2026-05-24T11:34:44.382540Z",
+    "readUrl": "https://pustakalaya.org/media/uploads/op/pdf/OLENepal2012_MotorSports.pdf/OLENepal2012_MotorSports.pdf",
+    "fileSize": "2.01 MB",
+    "publisher": "OLE Nepal",
+    "pageCount": "23",
+    "description": "This photo essay has been prepared by Prabhat Bhattarai. The text and photos for this photo essay were taken from www.wikipedia.com, and edited to suit the audience."
+  },


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Duplicate id will cause one catalog record to be dropped.

pus-94caee86-1d2c-474e-b0a4-d2be3a45ec28 is already present in data/Other Educational Materials/pus_photo-essay.json (Line 306). Since load_all_books deduplicates by id, one category entry will be silently discarded at load time.
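To make the discard visible rather than silent, an id-based dedup step can return what it dropped. The sketch below is not the project's `load_all_books`; `dedupe_by_id` is a hypothetical stand-in that keeps the first record per id.

```python
from typing import Any


def dedupe_by_id(
    books: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Keep the first record per id and return the later records that were dropped."""
    kept: dict[str, dict[str, Any]] = {}
    dropped: list[dict[str, Any]] = []
    for book in books:
        book_id = book.get("id")
        if book_id is None:
            continue  # records without an id are out of scope for this sketch
        if book_id in kept:
            dropped.append(book)
        else:
            kept[book_id] = book
    return list(kept.values()), dropped
```

Logging the `dropped` list at load time shows which category entry lost out for each colliding id.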

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@data/Other Educational Materials/pus_sports.json` around lines 142 - 165, the record in this file uses the same id "pus-94caee86-1d2c-474e-b0a4-d2be3a45ec28" that already exists in the other catalog entry (referenced in pus_photo-essay.json), causing one to be dropped; fix it by giving this entry a unique id (replace the "id" value in this object with a new UUID or canonical unique identifier) or merge the two records if they represent the same item, and then verify no other entries share that id.

coderabbitai · 2026-05-24T13:41:02Z

+    "readUrl": "https://pustakalaya.org/media/uploads/op/pdf/ToruAndVijay2008_Opening_up_education.pdf/ToruAndVijay2008_Opening_up_education.pdf",
+    "fileSize": "4.33 MB",
+    "publisher": "The MIT Press",
+    "pageCount": "2008",


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

pageCount appears to contain publication years instead of pages.

For these records, pageCount values ("2008", "2006") look like years, not page totals. This will skew page-based filtering/sorting and mislead consumers.

Also applies to: 1288-1288
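A validation pass for this could be sketched as below. Both helper names are hypothetical, and the 1900-2100 window is an assumed heuristic that only flags values for manual review, since a book could legitimately have that many pages.

```python
from typing import Any, Optional


def parse_page_count(value: Any) -> Optional[int]:
    """Convert a raw pageCount to a positive int, or None when it is unusable."""
    try:
        # int() accepts Unicode decimal digits, so Devanagari "२४६" parses as 246.
        pages = int(str(value).strip())
    except ValueError:
        return None
    return pages if pages > 0 else None


def looks_like_year(pages: Optional[int]) -> bool:
    """Heuristic: a value in 1900-2100 is more likely a year than a page total."""
    return pages is not None and 1900 <= pages <= 2100
```

Because `int()` already handles the Devanagari digits that appear in this catalog, the same check covers both Nepali and English records.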

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@data/Teaching Materials/pus_educational-theory-and-philosophy.json` at line 625, the JSON records have "pageCount" fields containing years ("2008", "2006") instead of numeric page totals; locate the offending "pageCount" keys and correct them by replacing the year strings with the actual integer page counts or null if unknown, or move the value to the correct field such as "publicationYear" if that was intended; add/adjust a validation step (schema or script) that enforces pageCount as an integer >0 and scan the file for any 4-digit year patterns in "pageCount" to fix or flag for manual review (refer to the "pageCount" key and the specific values "2008" and "2006" to find the records).

coderabbitai · 2026-05-24T13:41:02Z

+    "fileSize": "2.0 MB",
+    "publisher": "हरिहरपुरगढी गाउँपालिका",
+    "pageCount": "१५४",
+    "educationLevel": "Primary , \n                    \n                        Middle"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize educationLevel values to a stable format.

These values include embedded newlines/indentation, which can break exact-match filters and create inconsistent facets. Store a normalized value (e.g., "Primary,Middle" or an array like ["Primary","Middle"]) instead of whitespace-heavy text blobs.

Also applies to: 122-122, 217-217, 289-289
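The array form can be produced with a small normalizer along these lines; `normalize_education_level` is a hypothetical name, and the sketch assumes the raw value is always a string.

```python
import re


def normalize_education_level(raw: str) -> list[str]:
    """Split on commas or newlines, trim each token, and drop empty ones."""
    tokens = (token.strip() for token in re.split(r"[,\n]", raw))
    return [token for token in tokens if token]
```

A single value such as "Secondary" becomes a one-element list, so consumers can treat every record the same way.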

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@data/Teaching Materials/pus_local-curriculum.json` at line 73, the educationLevel field contains embedded newlines and extra whitespace; normalize it by trimming whitespace and collapsing internal newlines/extra spaces and store as a stable representation (preferably an array of trimmed values like ["Primary","Middle"] or a comma-joined string "Primary,Middle"); update the entries where educationLevel appears (the current key "educationLevel" at the shown diff and the other occurrences called out) to parse the raw string by splitting on commas/newlines, trimming each token, filtering out empties, and then serialize the cleaned array or joined string consistently.

Introduce catalog validation tooling, API health monitoring, and updated project documentation to support the expanded data sources.

- Add `scripts/validate_catalog.py` to ensure data integrity before commits
- Implement `/api/health` endpoint for monitoring catalog size and source counts
- Update `README.md` and `CONTRIBUTING.md` with new scraping workflows and validation steps
- Add `.env.example`, `CHANGELOG.md`, and GitHub configuration files
- Update OpenAPI specification to include the new health check endpoint
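The commit message describes the health endpoint as reporting catalog size and source counts. A framework-free sketch of such a payload follows; `build_health_payload` and the response keys `status`, `totalBooks`, and `sources` are assumptions, not the endpoint's confirmed shape.

```python
from collections import Counter
from typing import Any


def build_health_payload(books: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarize the loaded catalog: total record count plus a per-source tally."""
    counts = Counter(book.get("source", "unknown") for book in books)
    return {
        "status": "ok" if books else "empty",  # hypothetical status values
        "totalBooks": len(books),
        "sources": dict(counts),
    }
```

A route handler would only need to serialize this dict, which keeps the counting logic testable without a running server.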

Update the catalog with categorized Pustakalaya collections, clean up metadata formatting in the JSON database, and adjust the API source priority. Additionally, improve the frontend UI to handle audio-specific book displays.

- Update `api.py` to remove deprecated source priorities
- Clean up whitespace and newline artifacts in `data/all_books.json`
- Update `index.html` to hide the read button when an audio URL is present
- Synchronize multiple Pustakalaya category JSON files with new scraped data

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

README.md (1)

164-171: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add /api/health to the endpoint list for consistency.

The “Other Endpoints” block documents /api/audio but omits /api/health, which is now part of the API surface and used in CI smoke checks.
Suggested doc patch
 GET /api/books/<id>
 GET /api/pdf?url=<catalog-pdf-url>
 GET /api/audio?url=<catalog-audio-url>
+GET /api/health
 GET /api/sources
 GET /api/stats
 GET /docs
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 164 - 171, Update the "Other Endpoints" list to
include the /api/health endpoint for consistency with the API surface;
specifically add "/api/health" alongside the existing entries (e.g., the GET
/api/audio line) in the same code block so CI smoke checks and docs match the
implemented endpoint.

api.py (1)

142-143: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t silently ignore catalog parse/load failures.

Line 142 catches everything and Line 143 drops it, which can silently serve a partial catalog and break allowlist checks unpredictably.

Suggested minimal fix

-        except Exception:
-            pass
+        except (OSError, json.JSONDecodeError) as exc:
+            app.logger.warning("Skipping unreadable catalog file %s: %s", filepath, exc)
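For context, a loader built around the same idea might look like the sketch below. It is not the code in `api.py`: `load_catalog_files` and the `catalog` logger name are hypothetical, and returning the failed paths is one assumed way to let the caller decide whether to fail closed.

```python
import json
import logging
from pathlib import Path
from typing import Any

logger = logging.getLogger("catalog")  # hypothetical logger name


def load_catalog_files(data_dir: str) -> tuple[list[dict[str, Any]], list[str]]:
    """Load every JSON list under data_dir; return (books, paths that failed to load)."""
    books: list[dict[str, Any]] = []
    failed: list[str] = []
    for path in sorted(Path(data_dir).rglob("*.json")):
        try:
            payload = json.loads(path.read_text(encoding="utf-8"))
        # UnicodeDecodeError is listed too: it is a ValueError, not an OSError.
        except (OSError, UnicodeDecodeError, json.JSONDecodeError) as exc:
            logger.warning("Skipping unreadable catalog file %s: %s", path, exc)
            failed.append(str(path))
            continue
        if isinstance(payload, list):
            books.extend(item for item in payload if isinstance(item, dict))
    return books, failed
```

A caller that backs an allowlist with this catalog could refuse to start, or reject proxy requests, whenever `failed` is non-empty.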

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api.py` around lines 142 - 143, The except Exception: pass block that
swallows catalog parse/load failures must be removed and replaced with explicit
error handling: catch specific exceptions raised by the catalog parsing/loading
code (or Exception if unknown), log the full error (e.g., logger.error("Failed
to load/parse catalog", exc_info=True)) and either re-raise a wrapped exception
or set the catalog to a safe empty/closed state so allowlist checks fail-safe;
locate the try/except around the catalog parse/load call in api.py and update it
to log and propagate or safely fallback instead of silently passing.

♻️ Duplicate comments (2)

data/Course Materials/pus_occupation-business-and-technology-education.json (2)

266-266: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize spacing in publisher field.

The publisher field contains double spaces after the first comma, which should be collapsed to a single space for consistency.

🧹 Proposed fix

-    "publisher": "The World Bank,  The World Bank Group, Nepal",
+    "publisher": "The World Bank, The World Bank Group, Nepal",

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@data/Course Materials/pus_occupation-business-and-technology-education.json`
at line 266, The "publisher" field currently has double spaces after the first
comma in the string "The World Bank,  The World Bank Group, Nepal"; update that
value to collapse the double space into a single space so it reads "The World
Bank, The World Bank Group, Nepal" to normalize spacing and maintain
consistency.

108-108: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add missing space after comma in publisher field.

The publisher field is missing a space after the comma, which affects readability and consistency with other entries.

🧹 Proposed fix

-    "publisher": "नेपाल सरकार,पाठ्यक्रम विकास केन्द्र",
+    "publisher": "नेपाल सरकार, पाठ्यक्रम विकास केन्द्र",
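Both publisher fixes are instances of the same comma-spacing rule, which a scraper post-processing step could apply automatically. A minimal sketch, with `normalize_comma_spacing` as a hypothetical helper:

```python
import re


def normalize_comma_spacing(value: str) -> str:
    """Rewrite each comma and its surrounding whitespace as ', ' and trim the ends."""
    return re.sub(r"\s*,\s*", ", ", value).strip()
```

This suits free-text name lists such as `publisher` and `editor`; it should not be run on fields where commas are part of a number, because "1,000" would become "1, 000".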

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@data/Course Materials/pus_occupation-business-and-technology-education.json`
at line 108, Update the publisher JSON value for the "publisher" key by
inserting a space after the comma so the string reads "नेपाल सरकार, पाठ्यक्रम
विकास केन्द्र" instead of "नेपाल सरकार,पाठ्यक्रम विकास केन्द्र"; locate the
"publisher" field in the object and adjust the value accordingly to match the
spacing conventions used elsewhere.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci.yml:
- Around line 14-20: Update the two GitHub Actions uses to pinned commit SHAs
and harden checkout: replace actions/checkout@v4 with
actions/checkout@fd084cde189b7b76ec305d52e27be545a0172823 and add the with key
persist-credentials: false to the checkout step; replace actions/setup-python@v5
with actions/setup-python@e9d6f990972a57673cdb72ec29e19d42ba28880f so both
actions are pinned to specific SHAs rather than floating tags.

In `@data/Course Materials/pus_occupation-business-and-technology-education.json`:
- Line 188: The "editor" JSON field currently uses double spaces after commas;
update the value for the editor key ("editor") to use single spaces after each
comma and remove any leading/trailing extra spaces so it reads as "अग्नि प्रसाद
नेपाल, विवेक सापकोटा, तुलसी पौडेल, यज्ञ प्रसाद गिरी, ध्रुवराज भट्टराई, बालकृष्ण
जोशी"; ensure the normalized string replaces all instances of ",  " with ", "
and trim surrounding whitespace.

In `@data/Literature and Arts/pus_inspirational-materials.json`:
- Line 137: Update the JSON entry where the "publisher" field currently has the
misspelled value "unkown" and correct it to "unknown" (i.e., replace the string
value for the "publisher" key from "unkown" to "unknown") so the record uses the
proper spelling.

In `@data/Other Educational Materials/pus_tourism.json`:
- Line 64: The keywords list contains an inconsistent spelling: replace the
incorrect string "Tourist Accomodation" with the correct "Tourist Accommodation"
wherever it appears (e.g., the entry matching "Tourist Accomodation") so all
occurrences match the correctly spelled "Tourist Accommodation" and avoid
fragmenting keyword indexing.

In `@scripts/validate_catalog.py`:
- Around line 46-47: Narrow the broad except in the JSON file load block:
replace the bare "except Exception as exc" that appends to errors with specific
exception handlers for expected failure modes (e.g.,
FileNotFoundError/PermissionError (or more generally OSError/IOError) and
json.JSONDecodeError) so only read/parse errors are caught; for each handler
append the same "{relpath}: could not read JSON: {exc}" message (or slightly
different messages if you want to distinguish read vs parse) and let any other
unexpected exceptions propagate.

---

Outside diff comments:
In `@api.py`:
- Around line 142-143: The except Exception: pass block that swallows catalog
parse/load failures must be removed and replaced with explicit error handling:
catch specific exceptions raised by the catalog parsing/loading code (or
Exception if unknown), log the full error (e.g., logger.error("Failed to
load/parse catalog", exc_info=True)) and either re-raise a wrapped exception or
set the catalog to a safe empty/closed state so allowlist checks fail-safe;
locate the try/except around the catalog parse/load call in api.py and update it
to log and propagate or safely fallback instead of silently passing.

In `@README.md`:
- Around line 164-171: Update the "Other Endpoints" list to include the
/api/health endpoint for consistency with the API surface; specifically add
"/api/health" alongside the existing entries (e.g., the GET /api/audio line) in
the same code block so CI smoke checks and docs match the implemented endpoint.

---

Duplicate comments:
In `@data/Course Materials/pus_occupation-business-and-technology-education.json`:
- Line 266: The "publisher" field currently has double spaces after the first
comma in the string "The World Bank,  The World Bank Group, Nepal"; update that
value to collapse the double space into a single space so it reads "The World
Bank, The World Bank Group, Nepal" to normalize spacing and maintain
consistency.
- Line 108: Update the publisher JSON value for the "publisher" key by inserting
a space after the comma so the string reads "नेपाल सरकार, पाठ्यक्रम विकास
केन्द्र" instead of "नेपाल सरकार,पाठ्यक्रम विकास केन्द्र"; locate the
"publisher" field in the object and adjust the value accordingly to match the
spacing conventions used elsewhere.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d8662c4-91ae-4aa8-af9d-d0ae78b9fe2d

📥 Commits

Reviewing files that changed from the base of the PR and between 60d3046 and 082afc9.

📒 Files selected for processing (62)

.env.example
.github/ISSUE_TEMPLATE/bug_report.md
.github/ISSUE_TEMPLATE/source_request.md
.github/PULL_REQUEST_TEMPLATE.md
.github/workflows/ci.yml
CHANGELOG.md
CONTRIBUTING.md
README.md
api.py
data/Course Materials/pus_economics.json
data/Course Materials/pus_english.json
data/Course Materials/pus_environmental-studies.json
data/Course Materials/pus_health-and-physical-education.json
data/Course Materials/pus_history-and-culture.json
data/Course Materials/pus_mathematics.json
data/Course Materials/pus_nepali.json
data/Course Materials/pus_occupation-business-and-technology-education.json
data/Course Materials/pus_plant-science.json
data/Course Materials/pus_political-science-and-philosophy.json
data/Course Materials/pus_sanskrit.json
data/Course Materials/pus_science.json
data/Course Materials/pus_social-studies.json
data/Course Materials/pus_sociology-and-anthropology.json
data/Course Materials/pus_technical-and-vocational.json
data/Course Materials/pus_textbooks.json
data/Literature and Arts/pus_do-it-yourself.json
data/Literature and Arts/pus_english-children-s-literature.json
data/Literature and Arts/pus_english-literature.json
data/Literature and Arts/pus_inspirational-materials.json
data/Literature and Arts/pus_literature-in-other-nepali-languages.json
data/Literature and Arts/pus_nepali-children-s-literature.json
data/Literature and Arts/pus_nepali-literature.json
data/Literature and Arts/pus_traditional-art.json
data/Other Educational Materials/pus_agriculture-and-biodiversity.json
data/Other Educational Materials/pus_civics-related-materials.json
data/Other Educational Materials/pus_computer.json
data/Other Educational Materials/pus_education-related-materials.json
data/Other Educational Materials/pus_environment-related-materials.json
data/Other Educational Materials/pus_health-and-security-related-materials.json
data/Other Educational Materials/pus_law-and-government.json
data/Other Educational Materials/pus_philosophy-and-religion.json
data/Other Educational Materials/pus_science-and-technology.json
data/Other Educational Materials/pus_sports.json
data/Other Educational Materials/pus_tourism.json
data/Reference Materials/pus_atlas.json
data/Reference Materials/pus_children-s-encyclopedia.json
data/Reference Materials/pus_dictionary.json
data/Teaching Materials/pus_additional-reading-material-for-teachers.json
data/Teaching Materials/pus_educational-theory-and-philosophy.json
data/Teaching Materials/pus_journals-magazines-newsletters-and-pamphlets.json
data/Teaching Materials/pus_literacy-resources.json
data/Teaching Materials/pus_local-curriculum.json
data/Teaching Materials/pus_professional-development.json
data/Teaching Materials/pus_quality-education-support-material.json
data/Teaching Materials/pus_teacher-training-material.json
data/Teaching Materials/pus_teachers-guides-old.json
data/Teaching Materials/pus_teachers-guides.json
data/Teaching Materials/pus_teaching-support-material.json
data/all_books.json
index.html
openapi.json
scripts/validate_catalog.py

💤 Files with no reviewable changes (1)

data/Other Educational Materials/pus_computer.json

✅ Files skipped from review due to trivial changes (13)

.github/ISSUE_TEMPLATE/bug_report.md
.github/ISSUE_TEMPLATE/source_request.md
data/Literature and Arts/pus_do-it-yourself.json
data/Course Materials/pus_plant-science.json
data/Literature and Arts/pus_traditional-art.json
data/Reference Materials/pus_dictionary.json
data/Teaching Materials/pus_literacy-resources.json
data/Other Educational Materials/pus_sports.json
data/Teaching Materials/pus_local-curriculum.json
data/Teaching Materials/pus_quality-education-support-material.json
data/Reference Materials/pus_children-s-encyclopedia.json
data/Teaching Materials/pus_educational-theory-and-philosophy.json
data/Course Materials/pus_sanskrit.json

coderabbitai · 2026-05-24T15:16:22Z

+      - name: Check out repository
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

What are the currently recommended pinned commit SHAs for `actions/checkout@v4` and `actions/setup-python@v5`, and does GitHub Security guidance recommend `persist-credentials: false` for checkout?

💡 Result:

Currently recommended pinned commit SHAs (for the major versions you asked about)

1) actions/checkout@v4
- GitHub-hosted docs for the v4 release show the v4 tree pinned at commit fd084cde189b7b76ec305d52e27be545a0172823. [1]

2) actions/setup-python@v5
- GitHub-hosted docs for the actions/setup-python repo show v5 being introduced by commit e9d6f990972a57673cdb72ec29e19d42ba28880f (tagging v5 in the repo’s history). [2]

Does GitHub Security guidance recommend persist-credentials: false for checkout?

- I did not find a GitHub Security document in the retrieved sources that explicitly recommends setting persist-credentials: false for actions/checkout. The only “guidance” I saw directly addressing persist-credentials was from actions/checkout issue discussions and a compiler/security-related change request, not official GitHub Security guidance. [3][4][5]
- Therefore: based on the evidence retrieved here, I cannot confirm that GitHub’s Security guidance recommends persist-credentials: false for checkout. [3][4]

If you want, I can do one more targeted search specifically for GitHub’s official security guidance pages mentioning persist-credentials (e.g., the securitylab.github.com / documentation pages) to try to locate the exact recommendation.

Citations:

1: https://github.com/actions/checkout/tree/fd084cde189b7b76ec305d52e27be545a0172823

2: actions/setup-python@e9d6f99

3: Remove persist-credentials or change the default to false actions/checkout#485

4: Add persist-credentials: false to checkout actions for setup scripts github/gh-aw#7523

5: Validate actions/checkout persist-credentials in agent job steps github/gh-aw#17081

Harden CI by pinning GitHub Actions to commit SHAs (and consider disabling checkout credential persistence).

Replace floating tags:

actions/checkout@v4 → actions/checkout@fd084cde189b7b76ec305d52e27be545a0172823

actions/setup-python@v5 → actions/setup-python@e9d6f990972a57673cdb72ec29e19d42ba28880f

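Add persist-credentials: false under with: on the actions/checkout step as a hardening measure. The search above found no explicit official GitHub Security guidance recommending it, so treat this as optional; the suggestion rests on the zizmor artipacked warning listed under Tools.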

🧰 Tools

🪛 zizmor (1.25.2)

[warning] 14-15: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)

[error] 15-15: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

[error] 18-18: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci.yml around lines 14 - 20, Update the two GitHub Actions uses to pinned commit SHAs and harden checkout: replace actions/checkout@v4 with actions/checkout@fd084cde189b7b76ec305d52e27be545a0172823 and add the with key persist-credentials: false to the checkout step; replace actions/setup-python@v5 with actions/setup-python@e9d6f990972a57673cdb72ec29e19d42ba28880f so both actions are pinned to specific SHAs rather than floating tags.
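To check a workflow for the same issue before pushing, the unpinned-uses finding can be approximated locally. The following is a minimal sketch, not zizmor's implementation: `find_unpinned_actions` and `SAMPLE` are hypothetical names, the scan is a plain line-based regex rather than a YAML parse, and the SHA in the sample is the one quoted in this review, not independently verified.

```python
import re

# Matches `uses: owner/repo@ref` (optionally preceded by a list dash).
# Local actions (./path) and docker:// references contain no "@ref" and are skipped.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s@#]+)@([^\s#]+)")
# A full commit SHA is exactly 40 lowercase hexadecimal characters.
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def find_unpinned_actions(workflow_text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, action, ref) for each `uses:` not pinned to a full SHA."""
    findings = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if not match:
            continue
        action, ref = match.groups()
        if not SHA_RE.match(ref):
            findings.append((lineno, action, ref))
    return findings


# Hypothetical workflow fragment used only for illustration.
SAMPLE = """\
steps:
  - uses: actions/checkout@v4
  - name: Set up Python
    uses: actions/setup-python@v5
    with:
      python-version: "3.11"
  - uses: actions/checkout@fd084cde189b7b76ec305d52e27be545a0172823
"""

if __name__ == "__main__":
    for lineno, action, ref in find_unpinned_actions(SAMPLE):
        print(f"line {lineno}: {action}@{ref} is not pinned to a commit SHA")
```

A regex scan like this will miss `uses:` values written with unusual YAML quoting or spread across lines, so it is a quick pre-check rather than a replacement for the linter.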

coderabbitai · 2026-05-24T15:16:22Z

+    "scrapedAt": "2026-05-24T11:05:36.549054Z",
+    "readUrl": "https://pustakalaya.org/media/uploads/op/pdf/NARC2013_AgricultureTechnologyPart2.pdf/NARC2013_AgricultureTechnologyPart2.pdf",
+    "fileSize": "5.11 MB",
+    "editor": "अग्नि प्रसाद नेपाल,  विवेक सापकोटा,  तुलसी पौडेल,  यज्ञ प्रसाद गिरी,  ध्रुवराज भट्टराई,  बालकृष्ण जोशी",


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Normalize spacing in editor field.

The editor field contains double spaces after commas instead of single spaces, which is inconsistent with standard formatting conventions.

🧹 Proposed fix

- "editor": "अग्नि प्रसाद नेपाल,  विवेक सापकोटा,  तुलसी पौडेल,  यज्ञ प्रसाद गिरी,  ध्रुवराज भट्टराई,  बालकृष्ण जोशी",
+ "editor": "अग्नि प्रसाद नेपाल, विवेक सापकोटा, तुलसी पौडेल, यज्ञ प्रसाद गिरी, ध्रुवराज भट्टराई, बालकृष्ण जोशी",


🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@data/Course Materials/pus_occupation-business-and-technology-education.json` at line 188, the "editor" JSON field currently uses double spaces after commas; update its value to use a single space after each comma and remove any leading or trailing spaces so it reads "अग्नि प्रसाद नेपाल, विवेक सापकोटा, तुलसी पौडेल, यज्ञ प्रसाद गिरी, ध्रुवराज भट्टराई, बालकृष्ण जोशी". In other words, replace every comma followed by two spaces with a comma followed by one space, and trim surrounding whitespace.
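If the same spacing problem appears in other scraped entries, it may be simpler to normalize at import time than to patch records by hand. A minimal sketch follows, assuming the field is a comma-separated list of names; `normalize_name_list` is a hypothetical helper, not something in this repository.

```python
def normalize_name_list(value: str) -> str:
    """Return a comma-separated list with exactly one space after each comma."""
    # Split on commas, collapse any run of whitespace inside each name,
    # and drop empty pieces left behind by stray or trailing commas.
    parts = [" ".join(part.split()) for part in value.split(",")]
    return ", ".join(part for part in parts if part)


if __name__ == "__main__":
    raw = "अग्नि प्रसाद नेपाल,  विवेक सापकोटा,  तुलसी पौडेल "
    print(normalize_name_list(raw))
```

Note that this also collapses repeated spaces inside a single name, which goes slightly beyond the fix requested in the review.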

coderabbitai · 2026-05-24T15:16:22Z

+    "keywords": [
+      "Great Himalaya Trail",
+      "Climate Change",
+      "Tourist Accommodation",


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Inconsistent keyword spelling: "Accommodation" vs "Accomodation".

Line 64 uses the correct spelling "Tourist Accommodation" while Line 194 has "Tourist Accomodation" (missing the second 'm'). This inconsistency could fragment search results if keywords are indexed.

📝 Proposed fix

At Line 194, correct the spelling:

- "Tourist Accomodation",
+ "Tourist Accommodation",

Also applies to: 194-194

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@data/Other Educational Materials/pus_tourism.json` at line 64, the keywords list contains an inconsistent spelling: replace the misspelled string "Tourist Accomodation" with "Tourist Accommodation" wherever it appears, so that all occurrences match and keyword indexing is not fragmented.
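Spelling variants like this one are easy to miss in a large import. One possible way to surface them is to compare every pair of distinct keywords and flag those that are almost identical. The sketch below uses `difflib.SequenceMatcher` from the standard library; `near_duplicate_keywords` and the 0.9 threshold are assumptions chosen for illustration, and the pairwise comparison is quadratic, so it suits per-file keyword lists better than the full catalog.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_keywords(keywords, threshold=0.9):
    """Return pairs of distinct keywords whose similarity ratio meets the threshold."""
    unique = sorted(set(keywords))
    pairs = []
    for first, second in combinations(unique, 2):
        # Compare case-insensitively so "climate change" and "Climate Change" also match.
        ratio = SequenceMatcher(None, first.lower(), second.lower()).ratio()
        if ratio >= threshold:
            pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    sample = [
        "Great Himalaya Trail",
        "Climate Change",
        "Tourist Accommodation",
        "Tourist Accomodation",
    ]
    for first, second in near_duplicate_keywords(sample):
        print(f"possible variant: {first!r} vs {second!r}")
```

Flagged pairs still need a human decision, since two legitimately different keywords can also be close in spelling.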

coderabbitai · 2026-05-24T15:16:22Z

+        except Exception as exc:
+            errors.append(f"{relpath}: could not read JSON: {exc}")


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Narrow the broad exception catch during file load.

Line 46 catches all exceptions; this can hide unexpected failures. Restrict to expected read/JSON parse errors.

Suggested fix

-        except Exception as exc:
+        except (OSError, json.JSONDecodeError) as exc:
             errors.append(f"{relpath}: could not read JSON: {exc}")
             continue


🧰 Tools

🪛 Ruff (0.15.13)

[warning] 46-46: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/validate_catalog.py` around lines 46 - 47, narrow the broad except in the JSON file load block: replace the "except Exception as exc" that appends to errors with handlers for the expected failure modes only (OSError, which covers FileNotFoundError and PermissionError, and json.JSONDecodeError) so that only read and parse errors are caught. For each handler, append the same "{relpath}: could not read JSON: {exc}" message (or slightly different messages if you want to distinguish read from parse failures) and let any other unexpected exception propagate.
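A standalone sketch of the narrowed handler is shown here, written as a helper function rather than the inline loop in `scripts/validate_catalog.py`. The name `load_catalog_file` is hypothetical. The `UnicodeDecodeError` entry is an addition beyond the review's suggestion: reading a file that is not valid UTF-8 raises it, and it is not a subclass of `json.JSONDecodeError`.

```python
import json
from pathlib import Path


def load_catalog_file(path: Path, errors: list[str]):
    """Return the parsed JSON, or None after recording a read or parse error."""
    try:
        with path.open(encoding="utf-8") as handle:
            return json.load(handle)
    # OSError covers missing files and permission problems;
    # JSONDecodeError covers malformed JSON; UnicodeDecodeError covers bad encodings.
    except (OSError, json.JSONDecodeError, UnicodeDecodeError) as exc:
        errors.append(f"{path}: could not read JSON: {exc}")
        return None


if __name__ == "__main__":
    collected: list[str] = []
    load_catalog_file(Path("does-not-exist.json"), collected)
    print(collected)
```

Any other exception, such as a programming error inside the validator, now propagates with a full traceback instead of being reported as a data problem.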

fix(data): correct metadata typos and add missing file info

Clean up book metadata for consistency and completeness by removing an unintended line break in a description, fixing a `publisher` typo (`unkown` → `unknown`), and adding missing `readUrl` and `fileSize` fields for the Stage Fright entry in both category and aggregated datasets.

Shubhamnpk added 3 commits May 24, 2026 14:06

coderabbitai Bot reviewed May 24, 2026

View reviewed changes

vercel Bot deployed to Preview May 24, 2026 13:43 View deployment

vercel Bot deployed to Preview May 24, 2026 15:09 View deployment

coderabbitai Bot reviewed May 24, 2026

View reviewed changes

Shubhamnpk self-assigned this May 24, 2026

Shubhamnpk merged commit 6daccc5 into main May 24, 2026
2 of 5 checks passed

vercel Bot deployed to Preview May 24, 2026 16:43 View deployment