Added more book data across different genres #4
feat(api): add catalog-validated audio streaming support

Introduce a new `GET /api/audio?url=<audioUrl>` endpoint to proxy audio files through the same origin, matching how PDF proxying works for the reader/player UX.

- Generalize catalog URL validation to accept specific fields and enforce `audioUrl`/`pdfUrl` checks per endpoint
- Add `audioUrl` and `level` to list/book response fields
- Update API docs and README examples to document `/api/audio` and the `audioUrl` data shape
- Expand source priority ordering to include new CEHRD sources (`cehrd-stories`, `cehrd-nfe`, `cehrd-audio`)

This improves secure media access by only allowing catalog-backed audio URLs while extending metadata needed for new CEHRD content.
feat(scraper): expand Pustakalaya story collection sources

Add several Literature & Arts collections to `scrape_pustakalaya_stories` to broaden scraping coverage and capture more relevant books. Also remove the BOM character at the top of `scripts/scraper.py` for cleaner parsing and file consistency.
Expand the Pustakalaya scraper to support multiple specialized collections including Course Materials, Literature and Arts, Reference Materials, Teaching Materials, and Other Educational Materials.

- Add new scraping scripts for specific Pustakalaya categories
- Implement a hierarchical data directory structure for categorized JSON files
- Update API to support recursive data loading and new source priorities
- Update UI to display new Pustakalaya source names and handle PDF/readUrl
- Update OpenAPI documentation and playground to reflect new sources
The latest updates on your projects. Learn more about Vercel for GitHub.

| Status | Name | Latest Commit | Updated (UTC) |
|---|---|---|---|
| ❌ Deployment failed | yobook-api | 59b8ce7 | May 24 2026, 04:44 PM |
Caution: Review failed. The pull request is closed.
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
⛔ Files ignored due to path filters (3)
📒 Files selected for processing (3)
📝 Walkthrough
This PR introduces audio streaming support to the API catalog system and substantially expands the available educational content. The core changes generalize URL validation, add audio proxy functionality with HTTP Range support, and populate the catalog with 40+ JSON datasets across multiple educational categories.
Changes
Catalog audio support and data expansion
🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 15
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@api.py`:
- Around line 111-113: is_catalog_resource_url currently calls load_all_books()
on every request causing repeated disk reads; change it to use a cached catalog
and precomputed allowlists for the fields (e.g., pdfUrl/readUrl/audioUrl) stored
at module-level (or in a simple CatalogCache class) and have load_all_books()
populate/update that cache instead of re-parsing each time; implement cache
invalidation by checking file mtimes (or a single catalog last-modified
timestamp) and refresh the cached book list and derived sets only when files
change so is_catalog_resource_url() simply checks membership in the precomputed
set rather than iterating load_all_books() each call.
- Around line 130-134: The active-catalog discovery loop currently collects all
JSON files under DATA_DIR (using os.walk) including archived datasets; modify
the loop that builds filepaths so it skips files located in the archive
directory (e.g., any root path matching os.path.join(DATA_DIR, "archive_data")
or contains "/archive" segment) and still excludes "all_books.json" and
non-JSONs; update the condition in the for root, _, files in os.walk(DATA_DIR)
block (the code that appends to filepaths) to continue when the file's root or
full path indicates an archived dataset so archived JSONs are not added to
filepaths.
- Around line 354-358: The current Response uses
stream_with_context(upstream.iter_content(...)) but never closes the
requests.Response object "upstream", which can leak connections; wrap
iter_content in a generator that yields chunks and calls upstream.close() in a
finally block (or use contextlib.closing) and pass that generator to
stream_with_context so "upstream" is always closed on iterator exhaustion or
client disconnect; update the return to use
Response(stream_with_context(your_chunk_generator()), headers=headers,
status=upstream.status_code) and ensure the generator references
upstream.iter_content(chunk_size=64 * 1024) and closes upstream in its finally.
In `@data/Course Materials/pus_geography.json`:
- Line 27: The PUS JSON uses "educationLevel" but the API expects "level" (see
LIST_BOOK_FIELDS in api.py and openapi.json); update the PUS ingestion/merge or
add fallback logic so entries get a numeric "level" field: convert/normalize
data/Course Materials/pus_geography.json (and other PUS inputs) by mapping
educationLevel values (e.g., "Primary"/"Secondary"/"Tertiary"/etc.) to the
numeric level used by the API, and ensure the scraper/merge output emits "level"
(or have the API layer that builds LIST_BOOK_FIELDS check for educationLevel and
populate level before serialization). Reference the PUS scraper/merge step that
produces the catalog documents and the API builder that uses LIST_BOOK_FIELDS to
implement this mapping.
In `@data/Course Materials/pus_occupation-business-and-technology-education.json`:
- Around line 108-109: Normalize the publisher values in
pus_occupation-business-and-technology-education.json by removing embedded
newlines and indentation artifacts and collapsing multiple whitespace (including
non-breaking spaces) into a single regular space for the "publisher" fields;
locate the entries where "publisher" currently contains newline/indentation
(e.g. the value starting with "नेपाल सरकार,") and replace them with a
single-line, trimmed string like "नेपाल सरकार, पाठ्यक्रम विकास केन्द्र" so
exact-match filtering and rendering are not broken.
In `@data/Course Materials/pus_our-surroundings.json`:
- Line 257: The pageCount fields currently contain mixed-script numerals ("३0");
find the "pageCount" entries with the value "३0" and normalize them to use ASCII
digits (e.g., "30") or, preferably, a JSON number (pageCount: 30) to ensure
consistent parsing/sorting; update both occurrences so all pageCount values use
the same script/type.
In `@data/Course Materials/pus_plant-science.json`:
- Around line 471-478: This record (id:
pus-79b624b9-a60e-48ee-aaeb-e6eda45d2d1d, title: "Operation and Maintenance of
Microhydro Plant and Photovoltaic System :Learning Resource Material - Grade
12") is missing a subject field so it won’t be returned by GET
/api/books?subject=...; open the JSON object for that record and add an
appropriate "subject" property (for example "subject": "Plant Science", or
"subject": "Renewable Energy / Microhydro" depending on taxonomy) and optionally
adjust "category"/"keywords" to match catalog filters so it appears in
subject-based queries.
In `@data/Course Materials/pus_rural-development.json`:
- Line 1: The file pus_rural-development.json currently contains an empty array
([]) which makes this category always return zero items; confirm whether this is
intentional and either (a) populate pus_rural-development.json with the expected
dataset entries, (b) remove/rename the file so the catalog bucket isn't dead
until data is ready, or (c) replace the empty array with a short placeholder
object/metadata indicating "data pending" so callers can handle the empty state
explicitly.
In `@data/Literature and Arts/pus_traditional-art.json`:
- Around line 31-51: The record uses a non-unique id
"pus-592843a1-7e1a-4769-b30c-cc2ffa030b53" which is duplicated elsewhere and
will cause ingest collisions; update the "id" field in this JSON object (and the
other duplicate record that reuses that same id) to a newly generated unique id
(e.g., a new UUID with the "pus-" prefix) so each record has a globally unique
"id" value and ensure any references to that id are updated consistently.
In `@data/Other Educational Materials/pus_computer.json`:
- Around line 906-919: The JSON record with id
"pus-6a28b603-c348-4799-a4a5-e3664bc12107" is incomplete (missing readUrl and
other core metadata), so update this object to include the same required fields
used across the dataset (at minimum add "readUrl" with a valid document URL and
any missing core metadata such as publisher/publishedDate/format/size/rights as
your schema requires) and run the same validation/normalization used elsewhere
before persisting to ensure consistency and prevent broken read/download
behavior.
In `@data/Other Educational Materials/pus_philosophy-and-religion.json`:
- Around line 74-103: The JSON records in pus_philosophy-and-religion.json reuse
existing UUID values in the Sanskrit course dataset causing id-based
deduplication collisions; update the "id" fields for the conflicting records
(e.g., the entry with "id": "pus-0430b451-b53e-4ac0-a868-c35b5269b703") to
unique UUIDs (or a namespaced id scheme) so each record across all categories is
globally unique, then re-run the loader to verify no id collisions remain.
In `@data/Other Educational Materials/pus_sports.json`:
- Around line 142-165: The record in this file uses the same id
"pus-94caee86-1d2c-474e-b0a4-d2be3a45ec28" that already exists in the other
catalog entry (referenced in pus_photo-essay.json), causing one to be dropped;
fix it by giving this entry a unique id (replace the "id" value in this object
with a new UUID or canonical unique identifier) or merge the two records if they
represent the same item, and then verify no other entries share that id.
In `@data/Teaching Materials/pus_educational-theory-and-philosophy.json`:
- Line 625: The JSON records have "pageCount" fields containing years ("2008",
"2006") instead of numeric page totals; locate the offending "pageCount" keys
and correct them by replacing the year strings with the actual integer page
counts or null if unknown, or move the value to the correct field such as
"publicationYear" if that was intended; add/adjust a validation step (schema or
script) that enforces pageCount as an integer >0 and scan the file for any
4-digit year patterns in "pageCount" to fix or flag for manual review (refer to
the "pageCount" key and the specific values "2008" and "2006" to find the
records).
In `@data/Teaching Materials/pus_local-curriculum.json`:
- Line 73: The educationLevel field contains embedded newlines and extra
whitespace; normalize it by trimming whitespace and collapsing internal
newlines/extra spaces and store as a stable representation (preferably an array
of trimmed values like ["Primary","Middle"] or a comma-joined string
"Primary,Middle"); update the entries where educationLevel appears (the current
key "educationLevel" at the shown diff and the other occurrences called out) to
parse the raw string by splitting on commas/newlines, trimming each token,
filtering out empties, and then serialize the cleaned array or joined string
consistently.
In `@data/Teaching Materials/pus_quality-education-support-material.json`:
- Around line 212-229: This record (id
"pus-6a1b7c47-f833-41db-878e-360c359bc838", title "Study Report on Effectiveness
of Reimbursement System of Free Textbook Distribution") is missing the publisher
field; add a "publisher" key to this JSON object (either a string with the
publisher name or explicit null) so the schema matches other entries and
preserves dataset shape consistency.
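Several of the data findings above (whitespace in `publisher`, mixed-script `pageCount`, multi-value `educationLevel`, the missing `level`, and reused `id` values) could be handled by one small normalization pass. The following is only a sketch under stated assumptions: every function name is hypothetical and not part of the repository, and the numbers in `HYPOTHETICAL_LEVELS` are placeholders, because the review does not say which numeric values the API uses for `level`.

```python
import re
from collections import defaultdict

_DEVANAGARI_TO_ASCII = str.maketrans("०१२३४५६७८९", "0123456789")

# ASSUMPTION: placeholder numbers only; the real numeric scheme for `level`
# is not stated anywhere in the review.
HYPOTHETICAL_LEVELS = {"primary": 1, "middle": 2, "secondary": 3}


def clean_text(value):
    """Collapse newlines, indentation and non-breaking spaces into single spaces."""
    return " ".join(value.split())


def parse_page_count(value):
    """Return a positive int for ASCII or Devanagari digit strings, else None."""
    text = clean_text(str(value)).translate(_DEVANAGARI_TO_ASCII)
    if not re.fullmatch(r"[0-9]+", text):
        return None
    count = int(text)
    return count if count > 0 else None


def split_levels(raw):
    """Split an educationLevel string on commas/newlines into trimmed tokens."""
    tokens = (clean_text(part) for part in re.split(r"[,\n]+", raw))
    return [token for token in tokens if token]


def with_level(book, mapping=HYPOTHETICAL_LEVELS):
    """Return a copy of book with `level` derived from educationLevel if missing."""
    result = dict(book)
    if result.get("level") is None:
        names = split_levels(result.get("educationLevel") or "")
        known = [mapping[n.lower()] for n in names if n.lower() in mapping]
        # ASSUMPTION: when several levels are listed, take the lowest one.
        result["level"] = min(known) if known else None
    return result


def find_duplicate_ids(records_by_file):
    """Map each id found in more than one record to the files that contain it."""
    seen = defaultdict(list)
    for filename, records in records_by_file.items():
        for record in records:
            if isinstance(record, dict) and record.get("id"):
                seen[record["id"]].append(filename)
    return {book_id: files for book_id, files in seen.items() if len(files) > 1}
```

Note that `parse_page_count` cannot tell a year such as "2008" from a real page total, so year-like values would still need the manual review the comment asks for.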
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a4240fdd-d216-4b11-a4a9-a19d59e40d3e
📒 Files selected for processing (87)
- README.md
- api.py
- data/Course Materials/pus_accounting.json
- data/Course Materials/pus_animal-science.json
- data/Course Materials/pus_civics.json
- data/Course Materials/pus_civil-engineering.json
- data/Course Materials/pus_computer-engineering.json
- data/Course Materials/pus_e-paath.json
- data/Course Materials/pus_economics.json
- data/Course Materials/pus_education.json
- data/Course Materials/pus_electrical-engineering.json
- data/Course Materials/pus_english.json
- data/Course Materials/pus_environmental-studies.json
- data/Course Materials/pus_geography.json
- data/Course Materials/pus_health-and-physical-education.json
- data/Course Materials/pus_history-and-culture.json
- data/Course Materials/pus_mathematics.json
- data/Course Materials/pus_moral-education.json
- data/Course Materials/pus_music.json
- data/Course Materials/pus_nepali.json
- data/Course Materials/pus_occupation-business-and-technology-education.json
- data/Course Materials/pus_old-textbooks.json
- data/Course Materials/pus_our-surroundings.json
- data/Course Materials/pus_plant-science.json
- data/Course Materials/pus_political-science-and-philosophy.json
- data/Course Materials/pus_population.json
- data/Course Materials/pus_rural-development.json
- data/Course Materials/pus_sanskrit.json
- data/Course Materials/pus_science.json
- data/Course Materials/pus_social-studies.json
- data/Course Materials/pus_sociology-and-anthropology.json
- data/Course Materials/pus_technical-and-vocational.json
- data/Course Materials/pus_textbook-chapters.json
- data/Course Materials/pus_textbooks.json
- data/Literature and Arts/pus_do-it-yourself.json
- data/Literature and Arts/pus_english-children-s-literature.json
- data/Literature and Arts/pus_english-literature.json
- data/Literature and Arts/pus_inspirational-materials.json
- data/Literature and Arts/pus_literature-in-other-nepali-languages.json
- data/Literature and Arts/pus_nepali-children-s-literature.json
- data/Literature and Arts/pus_nepali-literature.json
- data/Literature and Arts/pus_traditional-art.json
- data/Other Educational Materials/pus_agriculture-and-biodiversity.json
- data/Other Educational Materials/pus_civics-related-materials.json
- data/Other Educational Materials/pus_computer.json
- data/Other Educational Materials/pus_cottage-and-small-industries.json
- data/Other Educational Materials/pus_education-related-materials.json
- data/Other Educational Materials/pus_environment-related-materials.json
- data/Other Educational Materials/pus_health-and-security-related-materials.json
- data/Other Educational Materials/pus_law-and-government.json
- data/Other Educational Materials/pus_philosophy-and-religion.json
- data/Other Educational Materials/pus_photo-essay.json
- data/Other Educational Materials/pus_science-and-technology.json
- data/Other Educational Materials/pus_sports.json
- data/Other Educational Materials/pus_tourism.json
- data/Reference Materials/pus_atlas.json
- data/Reference Materials/pus_children-s-encyclopedia.json
- data/Reference Materials/pus_dictionary.json
- data/Teaching Materials/pus_additional-reading-material-for-teachers.json
- data/Teaching Materials/pus_curriculum.json
- data/Teaching Materials/pus_educational-theory-and-philosophy.json
- data/Teaching Materials/pus_journals-magazines-newsletters-and-pamphlets.json
- data/Teaching Materials/pus_literacy-resources.json
- data/Teaching Materials/pus_local-curriculum.json
- data/Teaching Materials/pus_professional-development.json
- data/Teaching Materials/pus_quality-education-support-material.json
- data/Teaching Materials/pus_teacher-training-material.json
- data/Teaching Materials/pus_teachers-guides-old.json
- data/Teaching Materials/pus_teachers-guides.json
- data/Teaching Materials/pus_teaching-support-material.json
- data/all_books.json
- data/archive_data/archive_org.json
- data/archive_data/cdc_nepal.json
- data/archive_data/open_library.json
- data/archive_data/pustakalaya.json
- data/cehrd_audio.json
- data/cehrd_nfe.json
- data/cehrd_stories.json
- index.html
- openapi.json
- playground.html
- scripts/scrape_pustakalaya_course_materials.py
- scripts/scrape_pustakalaya_literature.py
- scripts/scrape_pustakalaya_literature_copy.py
- scripts/scrape_pustakalaya_other_educational_materials.py
- scripts/scrape_pustakalaya_teaching_materials.py
- scripts/scraper.py
     for book in load_all_books():
-        if url in {book.get("pdfUrl"), book.get("readUrl")}:
+        if url in {book.get(field) for field in fields}:
             return True
Avoid full catalog reload for each URL authorization check.
is_catalog_resource_url() re-parses all JSON files on every /api/pdf and /api/audio request via load_all_books(). That adds avoidable disk I/O and latency on a hot path. Cache the loaded catalog (or precomputed allowlists for pdfUrl/readUrl/audioUrl) and refresh only when files change.
Also applies to: 118-150
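One way the caching suggested here could look is sketched below. It assumes each catalog file is a JSON list of book objects; `CatalogCache` and its method names are hypothetical and do not exist in `api.py`, while the field names `pdfUrl`, `readUrl`, and `audioUrl` are taken from the PR.

```python
import json
import os


class CatalogCache:
    """Keeps per-field URL allowlists and rebuilds them only when files change."""

    def __init__(self, data_dir, fields=("pdfUrl", "readUrl", "audioUrl")):
        self.data_dir = data_dir
        self.fields = tuple(fields)
        self._signature = None  # last seen (path, mtime_ns, size) tuples
        self._allowlists = {field: set() for field in self.fields}

    def _current_signature(self):
        """Describe the catalog files on disk without parsing them."""
        entries = []
        for root, _, files in os.walk(self.data_dir):
            for name in sorted(files):
                if name.endswith(".json") and name != "all_books.json":
                    path = os.path.join(root, name)
                    stat = os.stat(path)
                    entries.append((path, stat.st_mtime_ns, stat.st_size))
        return tuple(sorted(entries))

    def _refresh_if_stale(self):
        signature = self._current_signature()
        if signature == self._signature:
            return  # nothing changed on disk, keep the cached sets
        allowlists = {field: set() for field in self.fields}
        for path, _, _ in signature:
            with open(path, encoding="utf-8") as handle:
                books = json.load(handle)
            if not isinstance(books, list):
                continue
            for book in books:
                if not isinstance(book, dict):
                    continue
                for field in self.fields:
                    value = book.get(field)
                    if value:
                        allowlists[field].add(value)
        self._allowlists = allowlists
        self._signature = signature

    def is_allowed(self, url, fields):
        """True if url appears in the catalog under any of the given fields."""
        self._refresh_if_stale()
        return any(url in self._allowlists.get(field, ()) for field in fields)
```

This still stats every catalog file per call, which is much cheaper than re-parsing the JSON; a time-based recheck interval would reduce that further if needed.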
    for root, _, files in os.walk(DATA_DIR):
        for filename in sorted(files):
            if not filename.endswith(".json") or filename == "all_books.json":
                continue
            filepaths.append(os.path.join(root, filename))
Exclude archived datasets from active catalog discovery.
Lines 130-134 currently ingest all JSON files under data/, including data/archive_data/*.json, which contradicts the “not part of active merged catalog” behavior documented in README.md (Line 27). This changes result quality and source composition unexpectedly.
Suggested fix
-    for root, _, files in os.walk(DATA_DIR):
+    for root, _, files in os.walk(DATA_DIR):
+        if os.path.basename(root) == "archive_data":
+            continue
         for filename in sorted(files):
             if not filename.endswith(".json") or filename == "all_books.json":
                 continue
             filepaths.append(os.path.join(root, filename))
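A variant of the exclusion above prunes the directory list in place, so `os.walk` never descends into `archive_data` or anything nested beneath it. This is a sketch only: `discover_catalog_files` and the `skip_dirs` parameter are hypothetical names, not code from `api.py`.

```python
import os


def discover_catalog_files(data_dir, skip_dirs=("archive_data",)):
    """Collect active catalog JSON files, skipping archived directories."""
    filepaths = []
    for root, dirs, files in os.walk(data_dir):
        # Mutating dirs in place tells os.walk (top-down mode) which
        # subdirectories to visit; sorting keeps the order deterministic.
        dirs[:] = sorted(d for d in dirs if d not in skip_dirs)
        for filename in sorted(files):
            if not filename.endswith(".json") or filename == "all_books.json":
                continue
            filepaths.append(os.path.join(root, filename))
    return filepaths
```

Compared with a `continue` on the archive root, pruning also covers subdirectories nested inside the archive, because their parent is never walked.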
    return Response(
        stream_with_context(upstream.iter_content(chunk_size=64 * 1024)),
        headers=headers,
        status=upstream.status_code,
    )
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify other streamed requests responses are explicitly closed.
rg -n -C3 'iter_content\(|requests\.get\(' api.py
Repository: Shubhamnpk/yobook-api
Length of output: 949
Close the upstream requests response when streaming iter_content to avoid connection leaks.
In api.py (lines 354-358), the code streams upstream.iter_content(...) via Response(stream_with_context(...)) but does not explicitly close the requests.get(..., stream=True) response (upstream). Under early client disconnects, this can keep connections from being released back to the pool.
Suggested fix
-    return Response(
-        stream_with_context(upstream.iter_content(chunk_size=64 * 1024)),
-        headers=headers,
-        status=upstream.status_code,
-    )
+    def generate():
+        try:
+            for chunk in upstream.iter_content(chunk_size=64 * 1024):
+                if chunk:
+                    yield chunk
+        finally:
+            upstream.close()
+
+    return Response(
+        stream_with_context(generate()),
+        headers=headers,
+        status=upstream.status_code,
+    )
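The `contextlib.closing` alternative mentioned in the review could be sketched as a standalone generator. `iter_and_close` is a hypothetical helper name; it only assumes that `upstream` offers `iter_content(chunk_size=...)` and `close()`, as a streamed `requests.Response` does.

```python
from contextlib import closing


def iter_and_close(upstream, chunk_size=64 * 1024):
    """Yield non-empty chunks from a streamed response and always close it."""
    # closing() calls upstream.close() when the block exits, whether the
    # iterator is exhausted, raises, or the generator is closed early.
    with closing(upstream):
        for chunk in upstream.iter_content(chunk_size=chunk_size):
            if chunk:
                yield chunk
```

The result would be passed to `stream_with_context(...)` in place of the raw `iter_content` iterator. One caveat applies to this version and to the `try`/`finally` form alike: a generator that is never started does not run its cleanup, so `upstream` is only closed once iteration has begun.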
    "fileSize": "4.11 MB",
    "publisher": "पाठ्यक्रम विकास केन्द्र",
    "pageCount": "२४१",
    "educationLevel": "Secondary"
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check which level-related keys API code expects
fd -i 'api.py' --exec rg -n --no-heading 'LIST_BOOK_FIELDS|educationLevel|\"level\"|\blevel\b' {}
# List JSON files that use educationLevel without level
python - <<'PY'
import json, pathlib
for p in pathlib.Path("data").rglob("*.json"):
    try:
        arr = json.loads(p.read_text(encoding="utf-8"))
    except Exception:
        continue
    if not isinstance(arr, list):
        continue
    keys = set()
    for item in arr:
        if isinstance(item, dict):
            keys.update(item.keys())
    if "educationLevel" in keys and "level" not in keys:
        print(p)
PY
Repository: Shubhamnpk/yobook-api
Length of output: 3086
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find code references to educationLevel and level
echo "=== educationLevel references ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' 'educationLevel' .
echo "=== level field references ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' '"level"' .
# Check for mapping/back-compat logic (common patterns)
echo "=== Possible mapping logic ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
'map(ping)?|compat|transform|normalize|educationLevel.*level|level.*educationLevel' .
# Locate the likely request/response schema definitions
echo "=== Likely schema/field lists ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
'LIST_BOOK_FIELDS|BOOK_FIELDS|education.*level|educationLevel|level catalog|catalog.*level' .
Repository: Shubhamnpk/yobook-api
Length of output: 50378
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find all relevant source files mentioning each key
echo "=== Files mentioning educationLevel (code only) ==="
rg -l --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
'educationLevel' .
echo "=== Files mentioning \"level\" (code only) ==="
rg -l --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
'"level"' .
echo "=== Files mentioning mapping/transform keywords around levels ==="
rg -l --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
'educationLevel.*level|level.*educationLevel|normalize.*level|transform.*level|map.*level|compat' .
# Narrow to likely request/response/schema code
echo "=== Candidate API schema/field lists ==="
rg -n --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
'LIST_BOOK_FIELDS|BOOK_FIELDS|field.*level|educationLevel.*field|level.*field' . || true
# Show a small window around any mapping we find
MAP_FILES=$(rg -l --no-heading --hidden --glob '!**/node_modules/**' --glob '!**/dist/**' \
'educationLevel.*level|level.*educationLevel|normalize.*level|transform.*level|map.*level|compat' . || true)
if [ -n "${MAP_FILES}" ]; then
  echo "=== Context for mapping candidates ==="
  echo "${MAP_FILES}" | head -n 20 | while read -r f; do
    echo "--- ${f} ---"
    rg -n --no-heading \
      'educationLevel.*level|level.*educationLevel|normalize.*level|transform.*level|map.*level|compat' "$f" | head -n 5
  done
fi
Repository: Shubhamnpk/yobook-api
Length of output: 31446
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show LIST_BOOK_FIELDS and surrounding code
python - <<'PY'
from pathlib import Path
p = Path("api.py")
lines = p.read_text(encoding="utf-8").splitlines()
# print around LIST_BOOK_FIELDS
for i,l in enumerate(lines):
    if "LIST_BOOK_FIELDS" in l:
        start=max(0,i-20); end=min(len(lines), i+80)
        for j in range(start,end):
            print(f"{j+1:5d}:{lines[j]}")
        break
PY
# Search for where fields are read/filtered for level metadata
rg -n --no-heading 'educationLevel|\"level\"|level\b|LIST_BOOK_FIELDS|book\[[' api.py
# Also inspect scraper.py/openapi.json for schema expectation if present
echo "=== openapi.json (presence of level/educationLevel) ==="
rg -n --no-heading '"level"' openapi.json || true
rg -n --no-heading 'educationLevel' openapi.json || true
Repository: Shubhamnpk/yobook-api
Length of output: 3550
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== api.py: all educationLevel occurrences ==="
rg -n --no-heading 'educationLevel' api.py || true
echo "=== api.py: all occurrences of field 'level' ==="
rg -n --no-heading '\blevel\b' api.py
echo "=== api.py: openapi schema field 'level' / educationLevel ==="
rg -n --no-heading '"level"' openapi.json || true
rg -n --no-heading 'educationLevel' openapi.json || true
echo "=== data: JSON files that contain a top-level or item key \"level\" ==="
# crude but effective: find any file containing the literal string "level":
rg -l --no-heading --hidden --glob 'data/**/*.json' '"level"' .
echo "=== data: JSON files that contain \"educationLevel\" and do NOT contain \"level\" ==="
python3 - <<'PY'
import json, pathlib
paths_with_no_level=[]
for p in pathlib.Path("data").rglob("*.json"):
    try:
        txt=p.read_text(encoding="utf-8")
    except Exception:
        continue
    if '"educationLevel"' not in txt:
        continue
    if '"level"' in txt:
        continue
    # If file is huge still count by substring; better to also parse if it's a list of dicts
    try:
        arr=json.loads(txt)
    except Exception:
        continue
    keys=set()
    if isinstance(arr,list):
        for item in arr:
            if isinstance(item,dict):
                keys.update(item.keys())
    if "educationLevel" in keys and "level" not in keys:
        paths_with_no_level.append(str(p))
for x in sorted(paths_with_no_level):
    print(x)
PY
echo "=== Inspect pus_geography.json keys (first list item) ==="
python3 - <<'PY'
import json
from pathlib import Path
p=Path("data/Course Materials/pus_geography.json")
data=json.loads(p.read_text(encoding="utf-8"))
print("type:", type(data))
if isinstance(data,list) and data and isinstance(data[0],dict):
    d=data[0]
    print("first item keys:", sorted(d.keys()))
    print("educationLevel:", d.get("educationLevel"))
    print("level:", d.get("level"))
else:
    print("not a list of dicts; top-level type:", type(data))
PY
echo "=== Codebase: any educationLevel<->level mapping logic in scripts/ ==="
rg -n --no-heading --hidden --glob 'scripts/**/*' 'educationLevel.*level|level.*educationLevel|normalize.*level|transform.*level' .
Repository: Shubhamnpk/yobook-api
Length of output: 3809
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Find how the API uses `level` beyond just listing it in LIST_BOOK_FIELDS
rg -n --no-heading 'level' api.py
# 2) Find any conversion/mapping from educationLevel to level in scripts
rg -n --no-heading 'educationLevel' scripts
rg -n --no-heading '"level"' scripts
rg -n --no-heading 'educationLevel.*level|level.*educationLevel|map.*level|normalize.*level|transform.*level' scripts || true
# 3) Inspect openapi.json for educationLevel vs level usage
rg -n --no-heading 'educationLevel' openapi.json || true
rg -n --no-heading '"level"' openapi.json | head -n 50
# 4) Inspect api.py for where request params are handled for filtering/sorting (search common query patterns)
rg -n --no-heading 'request\.args\.get\("level"\)|args\.get\("level"\)|level.*request\.args|get.*level' api.py || true
rg -n --no-heading 'sort|filter' api.py
Repository: Shubhamnpk/yobook-api
Length of output: 1672
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate /api/books handler in api.py
rg -n --no-heading 'GET /api/books|def .*books|/api/books' api.py
# Print a focused window around the likely books search/filter logic
python3 - <<'PY'
from pathlib import Path
p=Path("api.py")
lines=p.read_text(encoding="utf-8").splitlines()
def dump_around(needle, radius=80):
    for i,l in enumerate(lines):
        if needle in l:
            start=max(0,i-radius); end=min(len(lines), i+radius)
            for j in range(start,end):
                print(f"{j+1:5d}:{lines[j]}")
            return True
    return False
# common needles
dump_around('def get_books')
dump_around('books')
dump_around('request.args')
dump_around('"level"')
PY
# Inspect scripts/scraper.py where it assigns "level"
python3 - <<'PY'
from pathlib import Path
p=Path("scripts/scraper.py")
lines=p.read_text(encoding="utf-8").splitlines()
for i,l in enumerate(lines):
    if '"level":' in l and 'level' in l:
        start=max(0,i-40); end=min(len(lines), i+60)
        for j in range(start,end):
            print(f"{j+1:5d}:{lines[j]}")
        break
PY
# Search for any mapping logic in code (not just scripts) around educationLevel->level
rg -n --no-heading 'educationLevel.*level|level.*educationLevel|educationLevel.*int|educationLevel.*catalog|catalog.*educationLevel|educationLevel.*map' . || true
Repository: Shubhamnpk/yobook-api
Length of output: 25082
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Any query-param usage of "level" in api.py
rg -n --no-heading 'request\.args\.get\("level"\)|args\.get\("level"\)|\blevel\b.*request\.args' api.py || true
# Any API code that reads book["level"] (besides listing it in LIST_BOOK_FIELDS / openapi)
rg -n --no-heading '\bget\("level"\)|\b\["level"\]|LIST_BOOK_FIELDS' api.py
# Is educationLevel referenced anywhere in non-data code (besides scripts)?
rg -n --no-heading --glob '!data/**' 'educationLevel' .
# openapi: is level used as a query param or just a response field?
rg -n --no-heading '"level"' openapi.json
rg -n --no-heading 'educationLevel' openapi.json || true
# Identify which data sources actually contain "level"
python3 - <<'PY'
import json, pathlib
with_level=[]
for p in pathlib.Path("data").rglob("*.json"):
    try:
        txt=p.read_text(encoding="utf-8")
    except Exception:
        continue
    if '"level"' not in txt:
        continue
    # quick parse to see if it's list-of-dicts and level exists in items
    try:
        arr=json.loads(txt)
    except Exception:
        with_level.append(str(p))
        continue
    ok=False
    if isinstance(arr,list):
        for item in arr:
            if isinstance(item,dict) and 'level' in item:
                ok=True
                break
    if ok:
        with_level.append(str(p))
print("\n".join(sorted(set(with_level))))
PY
Repository: Shubhamnpk/yobook-api
Length of output: 846
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== educationLevel usage outside data/** ==="
rg -n --no-heading --glob '!data/**' 'educationLevel' .
echo "=== scripts/code touching all_books / catalog merge ==="
rg -n --no-heading --glob 'scripts/**' 'all_books\.json|merge|normalize|transform' .
echo "=== any code setting level from educationLevel ==="
rg -n --no-heading --glob '!data/**' 'educationLevel.*level|level.*educationLevel' .
Repository: Shubhamnpk/yobook-api
Length of output: 12603
Map educationLevel data to the API/catalog level field (PUS JSON currently omits level)
data/Course Materials/pus_geography.json contains only:
"educationLevel": "Secondary"
But the API/catalog fields are built around `level` (api.py includes "level" in LIST_BOOK_FIELDS, and openapi.json exposes `level`), and no `educationLevel` → `level` conversion was found in the codebase or scrapers. As a result, PUS entries like this will have `level` missing/null in API responses, which breaks any sorting or filtering that depends on the `level` field.
- Update the PUS scrapers/merge step to emit the numeric `level` field (preferred), or add backward-compatible fallback logic to interpret `educationLevel` as `level`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@data/Course` Materials/pus_geography.json at line 27, The PUS JSON uses
"educationLevel" but the API expects "level" (see LIST_BOOK_FIELDS in api.py and
openapi.json); update the PUS ingestion/merge or add fallback logic so entries
get a numeric "level" field: convert/normalize data/Course
Materials/pus_geography.json (and other PUS inputs) by mapping educationLevel
values (e.g., "Primary"/"Secondary"/"Tertiary"/etc.) to the numeric level used
by the API, and ensure the scraper/merge output emits "level" (or have the API
layer that builds LIST_BOOK_FIELDS check for educationLevel and populate level
before serialization). Reference the PUS scraper/merge step that produces the
catalog documents and the API builder that uses LIST_BOOK_FIELDS to implement
this mapping.
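A minimal sketch of the fallback option, written as a standalone helper. The numeric scale is an assumption: the review does not show which integers the catalog uses for `level`, so `EDUCATION_LEVEL_TO_LEVEL` and `with_level` are hypothetical names and values that would need to be aligned with the real data.

```python
# Hypothetical mapping: the actual numeric scale used by the catalog is not
# shown in this review, so these values are placeholders.
EDUCATION_LEVEL_TO_LEVEL = {"primary": 1, "middle": 2, "secondary": 3}


def with_level(book: dict) -> dict:
    """Return a copy of `book` with `level` filled from `educationLevel` when missing."""
    out = dict(book)
    if out.get("level") is not None:
        return out  # an explicit level always wins
    raw = out.get("educationLevel")
    if isinstance(raw, str):
        mapped = EDUCATION_LEVEL_TO_LEVEL.get(raw.strip().lower())
        if mapped is not None:
            out["level"] = mapped
    return out
```

Values that do not match a single known label (for example multi-valued strings) are left without `level` in this sketch, so they can be reviewed rather than guessed.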
    "id": "pus-0430b451-b53e-4ac0-a868-c35b5269b703",
    "title": "कर्मकाण्ड: कक्षा १०",
    "author": "Unknown",
    "language": "ne",
    "country": "np",
    "source": "pustakalaya-other-educational",
    "sourceUrl": "https://pustakalaya.org/documents/detail/0430b451-b53e-4ac0-a868-c35b5269b703/",
    "coverUrl": "https://pustakalaya.org/media/uploads/thumbnails/document/2023/09/11/करमकणड_ककष_१०.jpg",
    "category": "Other Educational Materials",
    "keywords": [
      "Karmakanda",
      "कक्षा १०",
      "Textbook",
      "Grade 10",
      "गुरुकुल",
      "Gurukul",
      "कर्मकाण्ड",
      "Sanskrit",
      "New Textbook",
      "Ritual",
      "संस्कृतम्",
      "पाठ्यपुस्तक"
    ],
    "scrapedAt": "2026-05-24T11:34:44.382540Z",
    "readUrl": "https://pustakalaya.org/media/uploads/documents/2023/09/11/Karmakanda10_124cc050/1685512467.pdf",
    "fileSize": "1.0 MB",
    "publisher": "पाठ्यक्रम विकास केन्द्र",
    "pageCount": "२४६",
    "educationLevel": "Secondary"
  },
Cross-category ID reuse will overwrite/drop records at load time.
Starting at Line 74, this file reuses IDs already present in the Sanskrit course dataset (multiple occurrences in this file). With id-based deduplication, these entries will collide and one side will be lost, causing incorrect category/source exposure.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@data/Other` Educational Materials/pus_philosophy-and-religion.json around
lines 74 - 103, The JSON records in pus_philosophy-and-religion.json reuse
existing UUID values in the Sanskrit course dataset causing id-based
deduplication collisions; update the "id" fields for the conflicting records
(e.g., the entry with "id": "pus-0430b451-b53e-4ac0-a868-c35b5269b703") to
unique UUIDs (or a namespaced id scheme) so each record across all categories is
globally unique, then re-run the loader to verify no id collisions remain.
    "id": "pus-94caee86-1d2c-474e-b0a4-d2be3a45ec28",
    "title": "Motor Sports",
    "author": "Prabhat Bhattarai",
    "language": "en",
    "country": "np",
    "source": "pustakalaya-other-educational",
    "sourceUrl": "https://pustakalaya.org/documents/detail/94caee86-1d2c-474e-b0a4-d2be3a45ec28/",
    "coverUrl": "https://pustakalaya.org/media/uploads/thumbnails/document/2019/02/22/Motor_Sports.jpg",
    "category": "Other Educational Materials",
    "keywords": [
      "Photo Essays",
      "Prabhat Bhattarai",
      "साझा शिक्षा ई-पाटी",
      "OLE Nepal",
      "प्रभात भट्टराई",
      "Motor Sports"
    ],
    "scrapedAt": "2026-05-24T11:34:44.382540Z",
    "readUrl": "https://pustakalaya.org/media/uploads/op/pdf/OLENepal2012_MotorSports.pdf/OLENepal2012_MotorSports.pdf",
    "fileSize": "2.01 MB",
    "publisher": "OLE Nepal",
    "pageCount": "23",
    "description": "This photo essay has been prepared by Prabhat Bhattarai. The text and photos for this photo essay were taken from www.wikipedia.com, and edited to suit the audience."
  },
Duplicate id will cause one catalog record to be dropped.
pus-94caee86-1d2c-474e-b0a4-d2be3a45ec28 is already present in data/Other Educational Materials/pus_photo-essay.json (Line 306). Since load_all_books deduplicates by id, one category entry will be silently discarded at load time.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@data/Other` Educational Materials/pus_sports.json around lines 142 - 165, The
record in this file uses the same id "pus-94caee86-1d2c-474e-b0a4-d2be3a45ec28"
that already exists in the other catalog entry (referenced in
pus_photo-essay.json), causing one to be dropped; fix it by giving this entry a
unique id (replace the "id" value in this object with a new UUID or canonical
unique identifier) or merge the two records if they represent the same item, and
then verify no other entries share that id.
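To check whether any other ids collide across category files, a small scan along these lines could be run before the loader. This is a sketch under assumptions: `find_duplicate_ids` is a hypothetical helper, it assumes each category file is a JSON list of objects, and it skips `all_books.json` on the assumption that the aggregated file intentionally repeats every record.

```python
import json
from collections import defaultdict
from pathlib import Path


def find_duplicate_ids(data_dir, skip=("all_books.json",)):
    """Map each id that appears in more than one place to the files containing it."""
    root = Path(data_dir)
    seen = defaultdict(list)
    for path in sorted(root.rglob("*.json")):
        if path.name in skip:
            continue  # assumption: the aggregate file duplicates ids by design
        try:
            records = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable files are a separate validation concern
        if not isinstance(records, list):
            continue
        for record in records:
            if isinstance(record, dict) and "id" in record:
                seen[record["id"]].append(path.relative_to(root).as_posix())
    return {book_id: paths for book_id, paths in seen.items() if len(paths) > 1}
```

An id repeated inside a single file is reported too, since the same path is then listed twice for that id.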
    "readUrl": "https://pustakalaya.org/media/uploads/op/pdf/ToruAndVijay2008_Opening_up_education.pdf/ToruAndVijay2008_Opening_up_education.pdf",
    "fileSize": "4.33 MB",
    "publisher": "The MIT Press",
    "pageCount": "2008",
pageCount appears to contain publication years instead of pages.
For these records, pageCount values ("2008", "2006") look like years, not page totals. This will skew page-based filtering/sorting and mislead consumers.
Also applies to: 1288-1288
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@data/Teaching` Materials/pus_educational-theory-and-philosophy.json at line
625, The JSON records have "pageCount" fields containing years ("2008", "2006")
instead of numeric page totals; locate the offending "pageCount" keys and
correct them by replacing the year strings with the actual integer page counts
or null if unknown, or move the value to the correct field such as
"publicationYear" if that was intended; add/adjust a validation step (schema or
script) that enforces pageCount as an integer >0 and scan the file for any
4-digit year patterns in "pageCount" to fix or flag for manual review (refer to
the "pageCount" key and the specific values "2008" and "2006" to find the
records).
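The validation step described above could look roughly like the following. It is only a heuristic sketch: `check_page_count` and its `year_range` default are hypothetical, and a genuine four-digit page count would be flagged for manual review rather than rejected outright.

```python
def check_page_count(value, year_range=(1900, 2100)):
    """Return None when `value` looks like a valid page count, else a short reason."""
    text = str(value).strip()
    try:
        # int() accepts Unicode decimal digits, so Devanagari values such as "२४६" parse too.
        pages = int(text)
    except ValueError:
        return f"not an integer: {value!r}"
    if pages <= 0:
        return f"must be > 0: {value!r}"
    if len(text) == 4 and year_range[0] <= pages <= year_range[1]:
        return f"looks like a year, review manually: {value!r}"
    return None
```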
    "fileSize": "2.0 MB",
    "publisher": "हरिहरपुरगढी गाउँपालिका",
    "pageCount": "१५४",
    "educationLevel": "Primary , \n \n Middle"
Normalize educationLevel values to a stable format.
These values include embedded newlines/indentation, which can break exact-match filters and create inconsistent facets. Store a normalized value (e.g., "Primary,Middle" or an array like ["Primary","Middle"]) instead of whitespace-heavy text blobs.
Also applies to: 122-122, 217-217, 289-289
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@data/Teaching` Materials/pus_local-curriculum.json at line 73, The
educationLevel field contains embedded newlines and extra whitespace; normalize
it by trimming whitespace and collapsing internal newlines/extra spaces and
store as a stable representation (preferably an array of trimmed values like
["Primary","Middle"] or a comma-joined string "Primary,Middle"); update the
entries where educationLevel appears (the current key "educationLevel" at the
shown diff and the other occurrences called out) to parse the raw string by
splitting on commas/newlines, trimming each token, filtering out empties, and
then serialize the cleaned array or joined string consistently.
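The parsing steps in the prompt above (split on commas and newlines, trim, drop empties) can be sketched as follows. `normalize_education_level` is a hypothetical helper name, and returning a list rather than a joined string is just one of the two representations the comment allows.

```python
import re


def normalize_education_level(raw):
    """Split a raw educationLevel string into a list of clean, non-empty labels."""
    if not raw:
        return []
    tokens = re.split(r"[,\n]", raw)
    # " ".join(t.split()) trims each token and collapses internal whitespace runs.
    return [" ".join(t.split()) for t in tokens if t.strip()]
```

A comma-joined form, if preferred, is `",".join(normalize_education_level(raw))`.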
Introduce catalog validation tooling, API health monitoring, and updated project documentation to support the expanded data sources.
- Add `scripts/validate_catalog.py` to ensure data integrity before commits
- Implement `/api/health` endpoint for monitoring catalog size and source counts
- Update `README.md` and `CONTRIBUTING.md` with new scraping workflows and validation steps
- Add `.env.example`, `CHANGELOG.md`, and GitHub configuration files
- Update OpenAPI specification to include the new health check endpoint
Update the catalog with categorized Pustakalaya collections, clean up metadata formatting in the JSON database, and adjust the API source priority. Additionally, improve the frontend UI to handle audio-specific book displays.
- Update `api.py` to remove deprecated source priorities
- Clean up whitespace and newline artifacts in `data/all_books.json`
- Update `index.html` to hide the read button when an audio URL is present
- Synchronize multiple Pustakalaya category JSON files with new scraped data
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
README.md (1)
164-171: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Add /api/health to the endpoint list for consistency.
The “Other Endpoints” block documents /api/audio but omits /api/health, which is now part of the API surface and used in CI smoke checks.
Suggested doc patch
 GET /api/books/<id>
 GET /api/pdf?url=<catalog-pdf-url>
 GET /api/audio?url=<catalog-audio-url>
+GET /api/health
 GET /api/sources
 GET /api/stats
 GET /docs
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 164 - 171, Update the "Other Endpoints" list to include the /api/health endpoint for consistency with the API surface; specifically add "/api/health" alongside the existing entries (e.g., the GET /api/audio line) in the same code block so CI smoke checks and docs match the implemented endpoint.
api.py (1)
142-143: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Don’t silently ignore catalog parse/load failures.
Line 142 catches everything and Line 143 drops it, which can silently serve a partial catalog and break allowlist checks unpredictably.
Suggested minimal fix
- except Exception:
-     pass
+ except (OSError, json.JSONDecodeError) as exc:
+     app.logger.warning("Skipping unreadable catalog file %s: %s", filepath, exc)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api.py` around lines 142 - 143, The except Exception: pass block that swallows catalog parse/load failures must be removed and replaced with explicit error handling: catch specific exceptions raised by the catalog parsing/loading code (or Exception if unknown), log the full error (e.g., logger.error("Failed to load/parse catalog", exc_info=True)) and either re-raise a wrapped exception or set the catalog to a safe empty/closed state so allowlist checks fail-safe; locate the try/except around the catalog parse/load call in api.py and update it to log and propagate or safely fallback instead of silently passing.
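For illustration, a loader that logs and records failures instead of swallowing them might be shaped like this. It is a sketch, not the code in api.py: `load_catalog`, the `catalog` logger name, and the returned `failed` list are assumptions, and `UnicodeDecodeError` is added here on top of the two exception types named in the suggested fix because `read_text` can raise it on badly encoded files.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("catalog")  # hypothetical logger name


def load_catalog(data_dir):
    """Load every JSON list under data_dir; return (books, failed_paths)."""
    books, failed = [], []
    for path in sorted(Path(data_dir).rglob("*.json")):
        try:
            records = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, UnicodeDecodeError, json.JSONDecodeError) as exc:
            # Log and remember the failure so callers can decide to fail closed.
            logger.warning("Skipping unreadable catalog file %s: %s", path, exc)
            failed.append(str(path))
            continue
        if isinstance(records, list):
            books.extend(r for r in records if isinstance(r, dict))
    return books, failed
```

Returning the failed paths lets the caller choose between serving a partial catalog with a warning and refusing allowlist checks until the files are fixed.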
♻️ Duplicate comments (2)
data/Course Materials/pus_occupation-business-and-technology-education.json (2)
266-266: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Normalize spacing in publisher field.
The publisher field contains double spaces after the first comma, which should be collapsed to a single space for consistency.
🧹 Proposed fix
- "publisher": "The World Bank,  The World Bank Group, Nepal",
+ "publisher": "The World Bank, The World Bank Group, Nepal",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@data/Course` Materials/pus_occupation-business-and-technology-education.json at line 266, The "publisher" field currently has double spaces after the first comma in the string "The World Bank,  The World Bank Group, Nepal"; update that value to collapse the double space into a single space so it reads "The World Bank, The World Bank Group, Nepal" to normalize spacing and maintain consistency.
108-108: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Add missing space after comma in publisher field.
The publisher field is missing a space after the comma, which affects readability and consistency with other entries.
🧹 Proposed fix
- "publisher": "नेपाल सरकार,पाठ्यक्रम विकास केन्द्र",
+ "publisher": "नेपाल सरकार, पाठ्यक्रम विकास केन्द्र",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@data/Course` Materials/pus_occupation-business-and-technology-education.json at line 108, Update the publisher JSON value for the "publisher" key by inserting a space after the comma so the string reads "नेपाल सरकार, पाठ्यक्रम विकास केन्द्र" instead of "नेपाल सरकार,पाठ्यक्रम विकास केन्द्र"; locate the "publisher" field in the object and adjust the value accordingly to match the spacing conventions used elsewhere.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/ci.yml:
- Around line 14-20: Update the two GitHub Actions uses to pinned commit SHAs
and harden checkout: replace actions/checkout@v4 with
actions/checkout@fd084cde189b7b76ec305d52e27be545a0172823 and add the with key
persist-credentials: false to the checkout step; replace actions/setup-python@v5
with actions/setup-python@e9d6f990972a57673cdb72ec29e19d42ba28880f so both
actions are pinned to specific SHAs rather than floating tags.
In `@data/Course` Materials/pus_occupation-business-and-technology-education.json:
- Line 188: The "editor" JSON field currently uses double spaces after commas;
update the value for the editor key ("editor") to use single spaces after each
comma and remove any leading/trailing extra spaces so it reads as "अग्नि प्रसाद
नेपाल, विवेक सापकोटा, तुलसी पौडेल, यज्ञ प्रसाद गिरी, ध्रुवराज भट्टराई, बालकृष्ण
जोशी"; ensure the normalized string replaces all instances of ",  " with ", "
and trim surrounding whitespace.
In `@data/Literature` and Arts/pus_inspirational-materials.json:
- Line 137: Update the JSON entry where the "publisher" field currently has the
misspelled value "unkown" and correct it to "unknown" (i.e., replace the string
value for the "publisher" key from "unkown" to "unknown") so the record uses the
proper spelling.
In `@data/Other` Educational Materials/pus_tourism.json:
- Line 64: The keywords list contains an inconsistent spelling: replace the
incorrect string "Tourist Accomodation" with the correct "Tourist Accommodation"
wherever it appears (e.g., the entry matching "Tourist Accomodation") so all
occurrences match the correctly spelled "Tourist Accommodation" and avoid
fragmenting keyword indexing.
In `@scripts/validate_catalog.py`:
- Around line 46-47: Narrow the broad except in the JSON file load block:
replace the bare "except Exception as exc" that appends to errors with specific
exception handlers for expected failure modes (e.g.,
FileNotFoundError/PermissionError (or more generally OSError/IOError) and
json.JSONDecodeError) so only read/parse errors are caught; for each handler
append the same "{relpath}: could not read JSON: {exc}" message (or slightly
different messages if you want to distinguish read vs parse) and let any other
unexpected exceptions propagate.
---
Outside diff comments:
In `@api.py`:
- Around line 142-143: The except Exception: pass block that swallows catalog
parse/load failures must be removed and replaced with explicit error handling:
catch specific exceptions raised by the catalog parsing/loading code (or
Exception if unknown), log the full error (e.g., logger.error("Failed to
load/parse catalog", exc_info=True)) and either re-raise a wrapped exception or
set the catalog to a safe empty/closed state so allowlist checks fail-safe;
locate the try/except around the catalog parse/load call in api.py and update it
to log and propagate or safely fallback instead of silently passing.
In `@README.md`:
- Around line 164-171: Update the "Other Endpoints" list to include the
/api/health endpoint for consistency with the API surface; specifically add
"/api/health" alongside the existing entries (e.g., the GET /api/audio line) in
the same code block so CI smoke checks and docs match the implemented endpoint.
---
Duplicate comments:
In `@data/Course` Materials/pus_occupation-business-and-technology-education.json:
- Line 266: The "publisher" field currently has double spaces after the first
comma in the string "The World Bank,  The World Bank Group, Nepal"; update that
value to collapse the double space into a single space so it reads "The World
Bank, The World Bank Group, Nepal" to normalize spacing and maintain
consistency.
- Line 108: Update the publisher JSON value for the "publisher" key by inserting
a space after the comma so the string reads "नेपाल सरकार, पाठ्यक्रम विकास
केन्द्र" instead of "नेपाल सरकार,पाठ्यक्रम विकास केन्द्र"; locate the
"publisher" field in the object and adjust the value accordingly to match the
spacing conventions used elsewhere.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8d8662c4-91ae-4aa8-af9d-d0ae78b9fe2d
📒 Files selected for processing (62)
- .env.example
- .github/ISSUE_TEMPLATE/bug_report.md
- .github/ISSUE_TEMPLATE/source_request.md
- .github/PULL_REQUEST_TEMPLATE.md
- .github/workflows/ci.yml
- CHANGELOG.md
- CONTRIBUTING.md
- README.md
- api.py
- data/Course Materials/pus_economics.json
- data/Course Materials/pus_english.json
- data/Course Materials/pus_environmental-studies.json
- data/Course Materials/pus_health-and-physical-education.json
- data/Course Materials/pus_history-and-culture.json
- data/Course Materials/pus_mathematics.json
- data/Course Materials/pus_nepali.json
- data/Course Materials/pus_occupation-business-and-technology-education.json
- data/Course Materials/pus_plant-science.json
- data/Course Materials/pus_political-science-and-philosophy.json
- data/Course Materials/pus_sanskrit.json
- data/Course Materials/pus_science.json
- data/Course Materials/pus_social-studies.json
- data/Course Materials/pus_sociology-and-anthropology.json
- data/Course Materials/pus_technical-and-vocational.json
- data/Course Materials/pus_textbooks.json
- data/Literature and Arts/pus_do-it-yourself.json
- data/Literature and Arts/pus_english-children-s-literature.json
- data/Literature and Arts/pus_english-literature.json
- data/Literature and Arts/pus_inspirational-materials.json
- data/Literature and Arts/pus_literature-in-other-nepali-languages.json
- data/Literature and Arts/pus_nepali-children-s-literature.json
- data/Literature and Arts/pus_nepali-literature.json
- data/Literature and Arts/pus_traditional-art.json
- data/Other Educational Materials/pus_agriculture-and-biodiversity.json
- data/Other Educational Materials/pus_civics-related-materials.json
- data/Other Educational Materials/pus_computer.json
- data/Other Educational Materials/pus_education-related-materials.json
- data/Other Educational Materials/pus_environment-related-materials.json
- data/Other Educational Materials/pus_health-and-security-related-materials.json
- data/Other Educational Materials/pus_law-and-government.json
- data/Other Educational Materials/pus_philosophy-and-religion.json
- data/Other Educational Materials/pus_science-and-technology.json
- data/Other Educational Materials/pus_sports.json
- data/Other Educational Materials/pus_tourism.json
- data/Reference Materials/pus_atlas.json
- data/Reference Materials/pus_children-s-encyclopedia.json
- data/Reference Materials/pus_dictionary.json
- data/Teaching Materials/pus_additional-reading-material-for-teachers.json
- data/Teaching Materials/pus_educational-theory-and-philosophy.json
- data/Teaching Materials/pus_journals-magazines-newsletters-and-pamphlets.json
- data/Teaching Materials/pus_literacy-resources.json
- data/Teaching Materials/pus_local-curriculum.json
- data/Teaching Materials/pus_professional-development.json
- data/Teaching Materials/pus_quality-education-support-material.json
- data/Teaching Materials/pus_teacher-training-material.json
- data/Teaching Materials/pus_teachers-guides-old.json
- data/Teaching Materials/pus_teachers-guides.json
- data/Teaching Materials/pus_teaching-support-material.json
- data/all_books.json
- index.html
- openapi.json
- scripts/validate_catalog.py
💤 Files with no reviewable changes (1)
- data/Other Educational Materials/pus_computer.json
✅ Files skipped from review due to trivial changes (13)
- .github/ISSUE_TEMPLATE/bug_report.md
- .github/ISSUE_TEMPLATE/source_request.md
- data/Literature and Arts/pus_do-it-yourself.json
- data/Course Materials/pus_plant-science.json
- data/Literature and Arts/pus_traditional-art.json
- data/Reference Materials/pus_dictionary.json
- data/Teaching Materials/pus_literacy-resources.json
- data/Other Educational Materials/pus_sports.json
- data/Teaching Materials/pus_local-curriculum.json
- data/Teaching Materials/pus_quality-education-support-material.json
- data/Reference Materials/pus_children-s-encyclopedia.json
- data/Teaching Materials/pus_educational-theory-and-philosophy.json
- data/Course Materials/pus_sanskrit.json
- name: Check out repository
  uses: actions/checkout@v4

- name: Set up Python
  uses: actions/setup-python@v5
  with:
    python-version: "3.11"
🧩 Analysis chain
🌐 Web query:
What are the currently recommended pinned commit SHAs for actions/checkout@v4 and actions/setup-python@v5, and does GitHub Security guidance recommend persist-credentials: false for checkout?
💡 Result:
Currently recommended pinned commit SHAs (for the major versions you asked about)
1) actions/checkout@v4 - GitHub-hosted docs for the v4 release show the v4 tree pinned at commit fd084cde189b7b76ec305d52e27be545a0172823. [1]
2) actions/setup-python@v5 - GitHub-hosted docs for the actions/setup-python repo show v5 being introduced by commit e9d6f990972a57673cdb72ec29e19d42ba28880f (tagging v5 in the repo’s history). [2]
Does GitHub Security guidance recommend persist-credentials: false for checkout?
- I did not find a GitHub Security document in the retrieved sources that explicitly recommends setting persist-credentials: false for actions/checkout. The only “guidance” I saw directly addressing persist-credentials was from actions/checkout issue discussions and a compiler/security-related change request, not official GitHub Security guidance. [3][4][5]
- Therefore: based on the evidence retrieved here, I cannot confirm that GitHub’s Security guidance recommends persist-credentials: false for checkout. [3][4]
If you want, I can do one more targeted search specifically for GitHub’s official security guidance pages mentioning persist-credentials (e.g., the securitylab.github.com / documentation pages) to try to locate the exact recommendation.
Citations:
- 1: https://github.com/actions/checkout/tree/fd084cde189b7b76ec305d52e27be545a0172823
- 2: actions/setup-python@e9d6f99
- 3: Remove persist-credentials or change the default to false actions/checkout#485
- 4: Add persist-credentials: false to checkout actions for setup scripts github/gh-aw#7523
- 5: Validate actions/checkout persist-credentials in agent job steps github/gh-aw#17081
Harden CI by pinning GitHub Actions to commit SHAs (and consider disabling checkout credential persistence).
- Replace floating tags:
  - actions/checkout@v4 → actions/checkout@fd084cde189b7b76ec305d52e27be545a0172823
  - actions/setup-python@v5 → actions/setup-python@e9d6f990972a57673cdb72ec29e19d42ba28880f
- Add with: persist-credentials: false to actions/checkout as a hardening step; I couldn’t find explicit official GitHub Security guidance recommending it.
🧰 Tools
🪛 zizmor (1.25.2)
[warning] 14-15: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false
(artipacked)
[error] 15-15: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
[error] 18-18: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/ci.yml around lines 14 - 20, Update the two GitHub Actions
uses to pinned commit SHAs and harden checkout: replace actions/checkout@v4 with
actions/checkout@fd084cde189b7b76ec305d52e27be545a0172823 and add the with key
persist-credentials: false to the checkout step; replace actions/setup-python@v5
with actions/setup-python@e9d6f990972a57673cdb72ec29e19d42ba28880f so both
actions are pinned to specific SHAs rather than floating tags.
    "scrapedAt": "2026-05-24T11:05:36.549054Z",
    "readUrl": "https://pustakalaya.org/media/uploads/op/pdf/NARC2013_AgricultureTechnologyPart2.pdf/NARC2013_AgricultureTechnologyPart2.pdf",
    "fileSize": "5.11 MB",
    "editor": "अग्नि प्रसाद नेपाल,  विवेक सापकोटा,  तुलसी पौडेल,  यज्ञ प्रसाद गिरी,  ध्रुवराज भट्टराई,  बालकृष्ण जोशी",
Normalize spacing in editor field.
The editor field contains double spaces after commas instead of single spaces, which is inconsistent with standard formatting conventions.
🧹 Proposed fix
- "editor": "अग्नि प्रसाद नेपाल,  विवेक सापकोटा,  तुलसी पौडेल,  यज्ञ प्रसाद गिरी,  ध्रुवराज भट्टराई,  बालकृष्ण जोशी",
+ "editor": "अग्नि प्रसाद नेपाल, विवेक सापकोटा, तुलसी पौडेल, यज्ञ प्रसाद गिरी, ध्रुवराज भट्टराई, बालकृष्ण जोशी",
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- "editor": "अग्नि प्रसाद नेपाल,  विवेक सापकोटा,  तुलसी पौडेल,  यज्ञ प्रसाद गिरी,  ध्रुवराज भट्टराई,  बालकृष्ण जोशी",
+ "editor": "अग्नि प्रसाद नेपाल, विवेक सापकोटा, तुलसी पौडेल, यज्ञ प्रसाद गिरी, ध्रुवराज भट्टराई, बालकृष्ण जोशी",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@data/Course` Materials/pus_occupation-business-and-technology-education.json
at line 188, The "editor" JSON field currently uses double spaces after commas;
update the value for the editor key ("editor") to use single spaces after each
comma and remove any leading/trailing extra spaces so it reads as "अग्नि प्रसाद
नेपाल, विवेक सापकोटा, तुलसी पौडेल, यज्ञ प्रसाद गिरी, ध्रुवराज भट्टराई, बालकृष्ण
जोशी"; ensure the normalized string replaces all instances of ",  " with ", "
and trim surrounding whitespace.
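The spacing fixes requested for the `editor` and `publisher` fields could also be applied mechanically in a cleanup pass. The helper below is a sketch with a hypothetical name, `normalize_comma_spacing`; it assumes commas in these fields are always list separators, so it would not be appropriate for values that contain numbers such as "1,000".

```python
import re


def normalize_comma_spacing(value: str) -> str:
    """Collapse whitespace runs and force exactly one space after each comma."""
    collapsed = " ".join(value.split())  # trims ends and collapses internal runs
    return re.sub(r"\s*,\s*", ", ", collapsed).strip()
```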
    "keywords": [
      "Great Himalaya Trail",
      "Climate Change",
      "Tourist Accommodation",
Inconsistent keyword spelling: "Accommodation" vs "Accomodation".
Line 64 uses the correct spelling "Tourist Accommodation" while Line 194 has "Tourist Accomodation" (missing the second 'm'). This inconsistency could fragment search results if keywords are indexed.
📝 Proposed fix
At Line 194, correct the spelling:
- "Tourist Accomodation",
+ "Tourist Accommodation",
Also applies to: 194-194
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@data/Other` Educational Materials/pus_tourism.json at line 64, The keywords
list contains an inconsistent spelling: replace the incorrect string "Tourist
Accomodation" with the correct "Tourist Accommodation" wherever it appears
(e.g., the entry matching "Tourist Accomodation") so all occurrences match the
correctly spelled "Tourist Accommodation" and avoid fragmenting keyword
indexing.
    except Exception as exc:
        errors.append(f"{relpath}: could not read JSON: {exc}")
Narrow the broad exception catch during file load.
Line 46 catches all exceptions; this can hide unexpected failures. Restrict to expected read/JSON parse errors.
Suggested fix
-    except Exception as exc:
+    except (OSError, json.JSONDecodeError) as exc:
         errors.append(f"{relpath}: could not read JSON: {exc}")
         continue
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-    except Exception as exc:
-        errors.append(f"{relpath}: could not read JSON: {exc}")
+    except (OSError, json.JSONDecodeError) as exc:
+        errors.append(f"{relpath}: could not read JSON: {exc}")
🧰 Tools
🪛 Ruff (0.15.13)
[warning] 46-46: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/validate_catalog.py` around lines 46 - 47, Narrow the broad except in
the JSON file load block: replace the bare "except Exception as exc" that
appends to errors with specific exception handlers for expected failure modes
(e.g., FileNotFoundError/PermissionError (or more generally OSError/IOError) and
json.JSONDecodeError) so only read/parse errors are caught; for each handler
append the same "{relpath}: could not read JSON: {exc}" message (or slightly
different messages if you want to distinguish read vs parse) and let any other
unexpected exceptions propagate.
fix(data): correct metadata typos and add missing file info
Clean up book metadata for consistency and completeness by removing an unintended line break in a description, fixing a `publisher` typo (`unkown` → `unknown`), and adding missing `readUrl` and `fileSize` fields for the Stage Fright entry in both category and aggregated datasets.
Summary by CodeRabbit
New Features
Documentation
Content Expansion
Chores