
Pydantic refactor #151

Closed
Cubix33 wants to merge 2 commits into fireform-core:main from Cubix33:pydantic-refactor

Cubix33 commented Mar 2, 2026

Closes #148

🚀 Description

This PR fundamentally overhauls the core extraction logic in src/llm.py by integrating Pydantic and Ollama's native JSON Structured Outputs.

Previously, main_loop iterated sequentially through every field, making N API calls to the LLM and relying on fragile string parsing (replace('"', "")) to format the output. This was slow, prone to hallucinations, and broke easily if the LLM wrapped its answer in unexpected formatting.
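To illustrate why quote-stripping is fragile, here is a minimal sketch. The reply strings and the `officer_name` key are made up for the example and are not taken from the repository; only the `replace('"', "")` call mirrors the parsing described above.

```python
import json

# Hypothetical raw reply: the model wrapped its answer in extra prose.
raw_reply = 'Sure! The officer\'s name is "J. Doe".'

# Old approach: strip the double quotes and use whatever is left.
old_value = raw_reply.replace('"', "")
# old_value still carries the surrounding prose, not just the name.

# Structured approach: the reply is a JSON object, so it either
# parses into the expected key or raises an error that can be handled.
structured_reply = '{"officer_name": "J. Doe"}'
new_value = json.loads(structured_reply)["officer_name"]

print(repr(old_value))
print(repr(new_value))
```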

This update introduces a single-pass, strictly typed architecture:

  1. Dynamically generates a Pydantic Model (create_model) based on the target PDF fields.
  2. Converts the Pydantic model into a JSON Schema.
  3. Passes the schema directly to Ollama via the format API payload parameter.

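This cuts the number of API calls from N (one per field) to one, which should shorten extraction time, and it constrains the model's reply to the generated schema so that the output parses as JSON with the expected keys.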
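The three steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in src/llm.py: the field names, the `ExtractionModel` name, the model name, and the prompt are all placeholders, and the payload is only built here, not sent.

```python
from typing import Optional

from pydantic import create_model

# Hypothetical target fields; the real ones come from the PDF being filled.
pdf_fields = ["incident_date", "officer_name", "location"]

# Step 1: build a model with one optional string field per PDF field.
field_definitions = {name: (Optional[str], None) for name in pdf_fields}
ExtractionModel = create_model("ExtractionModel", **field_definitions)

# Step 2: convert the model into a JSON Schema dict.
schema = ExtractionModel.model_json_schema()

# Step 3: place the schema in the "format" key of the Ollama request payload.
payload = {
    "model": "mistral",  # assumed model name
    "prompt": "Extract the form fields from the following report: ...",
    "format": schema,
    "stream": False,
}

# The reply text can then be validated against the same model.
# This sample reply is invented for the example.
sample_response = (
    '{"incident_date": "2026-03-02", "officer_name": "J. Doe", "location": null}'
)
record = ExtractionModel.model_validate_json(sample_response)
print(record.model_dump())
```

Validating the reply with the same model that produced the schema keeps the two in sync: if a field is added to the PDF, both the constraint sent to Ollama and the check on the way back change together.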

🛠️ Changes Made

  • Added pydantic>=2.0.0 to requirements.txt.
  • Refactored src/llm.py:
    • Removed outdated add_response_to_json and handle_plural_values string parsers.
    • Added build_schema() to map complex PDF field names to valid Python identifiers.
    • Added map_schema_to_json() to reconstruct the raw target fields for filler.py.
    • Updated main_loop() to execute a single, schema-enforced API call.
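One way the name mapping could work is sketched below. The helper names (`to_identifier`, `build_name_map`, `restore_field_names`) and the sample field names are hypothetical and deliberately differ from `build_schema()` and `map_schema_to_json()`, whose actual signatures are not shown in this PR description.

```python
import keyword
import re


def to_identifier(raw_name: str) -> str:
    """Turn an arbitrary PDF field name into a valid Python identifier."""
    # Replace every character that is not an ASCII letter, digit, or underscore.
    name = re.sub(r"\W", "_", raw_name, flags=re.ASCII)
    # Identifiers cannot start with a digit, and Pydantic treats a leading
    # underscore as a private attribute, so force a leading letter.
    if not name or not name[0].isalpha():
        name = "f_" + name
    # Avoid clashing with reserved words such as "class".
    if keyword.iskeyword(name):
        name += "_"
    return name


def build_name_map(raw_names: list[str]) -> dict[str, str]:
    """Map each safe identifier back to its original field name."""
    mapping: dict[str, str] = {}
    for raw in raw_names:
        base = to_identifier(raw)
        safe, n = base, 2
        # Two raw names can collapse to the same identifier; add a suffix.
        while safe in mapping:
            safe = f"{base}_{n}"
            n += 1
        mapping[safe] = raw
    return mapping


def restore_field_names(extracted: dict, mapping: dict[str, str]) -> dict:
    """Rebuild a dict keyed by the raw field names the PDF filler expects."""
    return {mapping[k]: v for k, v in extracted.items() if k in mapping}


raw_names = ["Employee's Name", "Date (MM/DD/YYYY)", "1. Job Title", "class"]
name_map = build_name_map(raw_names)
filled = restore_field_names({safe: "x" for safe in name_map}, name_map)
print(filled)
```

Keeping the reverse map alongside the generated model means the LLM only ever sees safe identifiers, while the filler still receives the original field names.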

🧪 How to Test

  1. Run pip install -r requirements.txt to install Pydantic.
  2. Ensure Docker and Ollama are running.
  3. Execute make exec.
  4. The console will report [LOG] Extracting all fields using Pydantic structured output... and complete the entire PDF mapping in a single, fast operation.

✅ Checklist

  • Tested locally within Docker.
  • Verified JSON outputs map perfectly to filler.py expectations.
  • Maintains compatibility with the Stateful Resumption logic.

Author

Cubix33 commented Apr 14, 2026

Closing for now to reduce load on maintainers. Will reopen after further discussion or during GSoC.

Development

Successfully merging this pull request may close these issues.

[FEAT]: Enforce Structured LLM Outputs via Pydantic & JSON Schema
