[BUG]: PDF Filler Hallucinates Repeating Values #173

@Cubix33

⚡️ Describe the Bug

The PDF extraction and filling process produces inaccurate results: the same value (e.g., "John Doe") is repeated across multiple unrelated form fields. In addition, the core extraction loop crashes mid-execution with an AttributeError when it tries to iterate over the target fields.

👣 Steps to Reproduce

  1. In main.py, let reader.get_fields() overwrite the target fields with raw PDF widget names (e.g., textbox_0_0).
  2. Run the extraction process via controller.fill_form().
  3. In llm.py, main_loop() iterates with for field in self._target_fields.keys():, which is where the AttributeError is raised.
  4. If the crash is bypassed, observe that the LLM guesses the same value for every field, because prompts such as "textbox_0_0" carry no semantic context.
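
The crash in step 3 can be reproduced in isolation. The sketch below is a hypothetical stand-in, not the project's code: the class name LLMStub and the helper try_main_loop are invented for illustration, and only the loop shape is taken from the report. It assumes the AttributeError appears when the target fields arrive as a list, which is what the proposed fix suggests.

```python
class LLMStub:
    """Hypothetical stand-in for the class in llm.py (name is an assumption)."""

    def __init__(self, target_fields):
        self._target_fields = target_fields

    def main_loop(self):
        # Same loop shape as in the report: .keys() exists on dicts, not on lists.
        return [field for field in self._target_fields.keys()]


def try_main_loop(target_fields):
    """Run the loop and return either the visited fields or the error text."""
    try:
        return LLMStub(target_fields).main_loop()
    except AttributeError as exc:
        return f"AttributeError: {exc}"


# A dict of raw widget names iterates fine (but gives the LLM no context).
dict_result = try_main_loop({"textbox_0_0": None, "textbox_0_1": None})

# A list of human-readable labels fails at the .keys() call.
list_result = try_main_loop(["Employee Name", "Job Title"])

print(dict_result)
print(list_result)
```

If this assumption holds, the two symptoms are linked: the dict of widget names avoids the crash but produces the repeated guesses, while the descriptive list triggers the crash.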

📉 Expected Behavior

  • Context Preservation: The system should pass human-readable labels (e.g., "Employee Name") to the LLM so it can accurately extract distinct, contextually correct values.
  • Output: The final JSON and PDF should contain unique data mapped appropriately, with properly stripped whitespace for plural values.
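
The whitespace expectation for plural values could look roughly like the sketch below. It assumes that a plural value reaches the code as one string with items separated by ";". The report does not state the separator, and the helper name split_plural_value is hypothetical.

```python
def split_plural_value(raw, separator=";"):
    """Split a multi-item answer and trim each item.

    The ";" separator is an assumption; the report does not specify one.
    """
    # Strip surrounding whitespace from every part and drop empty parts.
    return [part.strip() for part in raw.split(separator) if part.strip()]


values = split_plural_value("John Doe ;  Jane Roe; ")
print(values)
```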

🖥️ Environment Information

  • OS: WSL / Ubuntu
  • Docker/Compose Version: N/A (Running locally)
  • Ollama Model used: mistral

📸 Screenshots/Logs

[LOG] Resulting JSON created from the input text:
{
  "textbox_0_0": "John Doe",
  "textbox_0_1": "John Doe",
  "textbox_0_2": "managing director",
  "textbox_0_3": "managing director",
  "textbox_0_4": "John Doe",
  "textbox_0_5": "John Doe",
  "textbox_0_6": "managing director"
}

🕵️ Possible Fix

  • In main.py: Stop overwriting descriptive_fields with reader.get_fields(). Pass the human-readable list to the controller.
  • In llm.py (main_loop()): Remove the .keys() call. The loop should be for field in self._target_fields:.
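
A minimal sketch of the loop change, written as a standalone function rather than the real main_loop(). The function name iter_target_fields and the sample labels are assumptions for illustration. Iterating directly works whether the target fields are a list of labels or a dict keyed by label, because iterating a dict yields its keys.

```python
def iter_target_fields(target_fields):
    """Yield field names from a list of labels or a dict keyed by label."""
    # No .keys() call: plain iteration covers both lists and dicts.
    for field in target_fields:
        yield field


labels = ["Employee Name", "Job Title"]

as_list = list(iter_target_fields(labels))
as_dict = list(iter_target_fields({"Employee Name": None, "Job Title": None}))

print(as_list)
print(as_dict)
```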