Currently, the textToJSON class in backend.py sends prompts to the Ollama LLM and stores whatever raw string the model returns directly in the JSON dictionary, with no validation. There is no schema definition, type checking, format enforcement, or constraint verification on the AI-extracted values before they are used to fill PDF forms.
This is a critical gap because LLMs are inherently non-deterministic and can return:
Hallucinated or nonsensical values (e.g., a phone number of "yes")
Incorrectly formatted data (e.g., date as "January 2" instead of "01/02/2025")
Extra conversational text wrapping the answer (e.g., "The name is John Doe." instead of "John Doe")
Empty or partial responses
Proposed Solution
Define a JSON Schema (or Pydantic model) for expected field types and constraints:
```json
{
  "employee_name": {"type": "string", "minLength": 1},
  "phone_number": {"type": "string", "pattern": "^[0-9+() -]+$"},
  "date": {"type": "string", "format": "date"},
  "email": {"type": "string", "format": "email"}
}
```
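As a quick sanity check of these constraints, the sketch below mirrors them using only the Python standard library. The helper name `check_field` and the `CHECKS` table are hypothetical, and the email pattern is a deliberately loose approximation rather than full address validation. One point worth noting: in JSON Schema, the `date` format refers to an ISO 8601 date (`YYYY-MM-DD`), so a value such as "01/02/2025" would need a custom pattern or a normalization step before it passes.

```python
import re
from datetime import date

# Hypothetical stdlib-only mirror of the schema (not a full JSON Schema validator).
PHONE_RE = re.compile(r"^[0-9+() -]+$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # loose approximation


def _is_iso_date(value: str) -> bool:
    """True for ISO 8601 dates such as 2025-01-02."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


# One check per field, each taking the candidate string.
CHECKS = {
    "employee_name": lambda v: len(v) >= 1,
    "phone_number": lambda v: PHONE_RE.fullmatch(v) is not None,
    "date": _is_iso_date,
    "email": lambda v: EMAIL_RE.fullmatch(v) is not None,
}


def check_field(field: str, value) -> bool:
    """Return True if value is a string that satisfies the constraint for field."""
    return isinstance(value, str) and field in CHECKS and CHECKS[field](value)
```

With this sketch, `check_field("phone_number", "yes")` is rejected, which covers the hallucinated-value case described earlier. A Pydantic model or the `jsonschema` package could replace the hand-written checks if adding a dependency is acceptable.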
After each LLM response in main_loop(), validate the extracted value against the schema.
Flag any values that fail validation with a warning and optionally prompt the user for correction.
Add a retry mechanism for failed extractions (re-prompt the LLM with more specific instructions).
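The validate, flag, and retry steps could be combined roughly as follows. This is a minimal sketch, not the existing backend.py code: `extract_with_retry`, `ask_llm`, `validate`, and the follow-up prompt wording are hypothetical, and the real Ollama call is replaced by a plain callable so the flow can be exercised without a model.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def extract_with_retry(
    field: str,
    prompt: str,
    ask_llm: Callable[[str], str],
    validate: Callable[[str, str], bool],
    max_retries: int = 2,
) -> Optional[str]:
    """Ask the model for one field and re-prompt with stricter wording on failure.

    Returns the first value that passes validation, or None if every attempt
    fails, so the caller can flag the field and ask the user for a correction.
    """
    current_prompt = prompt
    for attempt in range(1, max_retries + 2):  # first try plus max_retries retries
        value = ask_llm(current_prompt).strip()
        if validate(field, value):
            return value
        logger.warning(
            "Field %r failed validation on attempt %d: %r", field, attempt, value
        )
        # Hypothetical, more specific follow-up prompt used for the retries.
        current_prompt = (
            f"{prompt}\nReturn only the value for '{field}', with no extra words."
        )
    return None


# Example with a fake model that answers conversationally first, then cleanly.
replies = iter(["The phone number is yes.", "555-123-4567"])
result = extract_with_retry(
    "phone_number",
    "What is the phone number?",
    ask_llm=lambda _prompt: next(replies),
    validate=lambda _field, v: v.replace("-", "").isdigit(),
)
```

Returning `None` instead of raising keeps the decision about user correction in the caller, which seems to fit the "optionally prompt the user" wording; raising an exception would be an equally reasonable design.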
✅ Acceptance Criteria
How will we know this is finished?
📌 Additional Context
Add any other screenshots, links to fire department forms, or research here.