[Enhancement] Add JSON Schema Validation for LLM-Extracted Output to Ensure Data Integrity #40

@Vi-shub

Description

Currently, the textToJSON class in backend.py sends prompts to the Ollama LLM and stores the raw string the model returns directly in the JSON dictionary, with no validation. There is no schema definition, type checking, format enforcement, or constraint verification on the AI-extracted values before they are used to fill PDF forms.

This is a critical gap because LLMs are inherently non-deterministic and can return:

Hallucinated or nonsensical values (e.g., a phone number of "yes")
Incorrectly formatted data (e.g., date as "January 2" instead of "01/02/2025")
Extra conversational text wrapping the answer (e.g., "The name is John Doe." instead of "John Doe")
Empty or partial responses

Proposed Solution
Define a JSON Schema (or Pydantic model) for expected field types and constraints:
json
{
  "employee_name": {"type": "string", "minLength": 1},
  "phone_number": {"type": "string", "pattern": "^[0-9+() -]+$"},
  "date": {"type": "string", "format": "date"},
  "email": {"type": "string", "format": "email"}
}
After each LLM response in main_loop(), validate the extracted value against the schema.
Flag any values that fail validation with a warning and optionally prompt the user for correction.
Add a retry mechanism for failed extractions (re-prompt the LLM with more specific instructions).
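
The validate, flag, and retry steps above could look roughly like the sketch below. It is a standard-library stand-in, not the proposed implementation: FIELD_RULES, validate_field, extract_with_retry, and the ask_llm callable are hypothetical names, and ask_llm stands in for whatever backend.py uses to call Ollama. The date rule assumes the MM/DD/YYYY form from the example earlier in this issue; note that JSON Schema's "format": "date" means YYYY-MM-DD, so one of the two would need to change. The email pattern is a deliberately loose approximation.

```python
import re
from datetime import datetime
from typing import Callable, Optional, Tuple

# Hypothetical per-field rules mirroring the schema proposed in this issue.
FIELD_RULES = {
    "employee_name": {"min_length": 1},
    "phone_number": {"pattern": r"^[0-9+() -]+$"},
    "date": {"date_format": "%m/%d/%Y"},  # assumption: MM/DD/YYYY
    "email": {"pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},  # loose check only
}


def validate_field(field: str, value: object) -> Optional[str]:
    """Return None if the value passes, otherwise a short error message."""
    rules = FIELD_RULES.get(field, {})
    if not isinstance(value, str):
        return "expected a string"
    text = value.strip()
    # Empty responses are rejected for every field, even without a rule.
    if len(text) < rules.get("min_length", 1):
        return "value is empty or too short"
    pattern = rules.get("pattern")
    if pattern and not re.fullmatch(pattern, text):
        return "value does not match the expected pattern"
    date_format = rules.get("date_format")
    if date_format:
        try:
            datetime.strptime(text, date_format)
        except ValueError:
            return f"date is not in {date_format} form"
    return None


def extract_with_retry(
    field: str,
    ask_llm: Callable[[str, Optional[str]], str],
    max_attempts: int = 3,
) -> Tuple[Optional[str], Optional[str]]:
    """Ask for a field, re-prompting with a more specific hint on failure.

    ask_llm(field, hint) is a hypothetical wrapper around the Ollama call.
    Returns (value, None) on success or (None, last_error) when every
    attempt fails, so the caller can warn or ask the user for a correction.
    """
    hint = None
    last_error = None
    for _ in range(max_attempts):
        raw = ask_llm(field, hint)
        error = validate_field(field, raw)
        if error is None:
            return raw.strip(), None
        last_error = error
        hint = (
            f"Return only the {field} value with no extra words. "
            f"The previous answer was rejected: {error}."
        )
    return None, last_error
```

With these rules, a phone number of "yes" or a date of "January 2" is rejected, while a free-text field such as employee_name still accepts wrapped answers like "The name is John Doe.", so catching those would need a stricter rule or prompt.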

✅ Acceptance Criteria

How will we know this is finished?

  • Feature works in Docker container.
  • Documentation updated in docs/.
  • JSON output validates against the schema.

