[FEAT]: Universal Support for Static (Non-Fillable) Scanned PDF Operations #432

@utkarshqz


📝 Description

Currently, FireForm's architecture relies heavily on AcroForm extraction and filling (via pdfrw). In practice, however, many emergency response departments still work with static, flat, or scanned PDFs that have no digital form fields at all.

A prime example is the CAL FIRE ICS-214 (Activity Log) form. In many deployment environments, these forms are printed, scanned, and distributed as flat page images inside a PDF wrapper. Because such files contain no AcroForm field definitions (and therefore no widget appearance streams such as /AP /N), our current pipeline cannot fill them. This shuts out the station administrators who only have scanned legacy documents.

💡 Rationale

To achieve true "Reporting Ubiquity", FireForm must be agnostic to the PDF's internal format. If a department uploads a flat image scan of an ICS-214, the platform should still be able to place the LLM-extracted data (from our Data Lake pipeline) at computed coordinates on the blank lines of the image.

🛠️ Proposed Solution

I propose building a deterministic static-PDF handler that does not depend on embedded digital fields:

  1. OCR Bounding-Box Detection (Tesseract/OpenCV):

    • Instead of trying to parse embedded text offsets (which scanned documents either lack entirely or carry only as an unreliable OCR layer), we render each page to an image and use Tesseract/OpenCV to identify logical field zones and empty lines visually.
    • We record the bounding box (X, Y, W, H) of each empty region.
  2. Semantic Text Overlay (PyMuPDF / fitz):

    • Once the zones are mapped, the LLM assigns each extracted value to its target zone.
    • We use PyMuPDF to programmatically "stamp" the extracted text strings at those coordinates.
    • PyMuPDF's Page.insert_textbox wraps words inside a given rectangle and returns a negative value (writing nothing) when the text does not fit, so the handler can shrink the font size and retry until the text stays within its bounding box and does not bleed into other rows.
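
As a rough illustration of the two steps above, the sketch below converts a pixel-space box (X, Y, W, H) from the OCR pass into PDF points and then stamps text with a shrink-to-fit loop. The helper names `px_box_to_pdf_rect` and `stamp_text`, the default font sizes, and the assumption that the page was rasterized at a known DPI without rotation or cropping are all illustrative and not part of FireForm. `page` is expected to be a PyMuPDF `Page`, and the only library call relied on is `Page.insert_textbox`.

```python
from typing import Optional, Tuple

# (x0, y0, x1, y1) in PDF points; PyMuPDF accepts such 4-tuples as rectangles.
Rect = Tuple[float, float, float, float]


def px_box_to_pdf_rect(x: float, y: float, w: float, h: float, dpi: float) -> Rect:
    """Convert an image-space box (pixels) to a PDF rectangle (points).

    Assumption: the page image was rendered at `dpi` with no rotation or
    cropping, and both spaces use a top-left origin.
    """
    # 72 PDF points per inch; multiply before dividing to limit rounding error.
    return (x * 72.0 / dpi, y * 72.0 / dpi,
            (x + w) * 72.0 / dpi, (y + h) * 72.0 / dpi)


def stamp_text(page, rect: Rect, text: str,
               max_size: float = 11.0, min_size: float = 5.0,
               step: float = 0.5) -> Optional[float]:
    """Write `text` into `rect`, shrinking the font until it fits.

    Page.insert_textbox returns a negative number, and writes nothing,
    when the text overflows the rectangle, so each retry starts clean.
    Returns the font size used, or None if even `min_size` overflows.
    """
    size = max_size
    while size >= min_size:
        unused_height = page.insert_textbox(rect, text, fontsize=size)
        if unused_height >= 0:
            return size
        size -= step
    return None
```

A caller would typically open the scan with `fitz.open(path)`, take `page = doc[0]`, and call `stamp_text(page, px_box_to_pdf_rect(x, y, w, h, dpi=300), value)` for each mapped zone. A `None` result signals that the value needs truncation or manual review rather than a silent overflow.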

✅ Acceptance Criteria

  • Pipeline detects whether an uploaded document is flat (0 fillable fields) or an AcroForm.
  • Ability to parse the visual structure of a static CAL FIRE ICS-214 form.
  • Backend stamps text strings into the detected blank regions using fitz, without overlapping neighboring rows.
  • Text strictly auto-wraps within its calculated geometric bounding box.
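
The first criterion could be checked along the following lines. This is a minimal sketch, assuming the document has been opened with PyMuPDF so that iterating it yields pages exposing `widgets()`, `get_text()`, and `get_images()`. The function names and the three category labels are hypothetical, and the "scanned" test is only a heuristic (no text layer plus at least one embedded image).

```python
def count_fillable_fields(doc) -> int:
    """Count form widgets across all pages of an opened document."""
    return sum(1 for page in doc for _ in page.widgets())


def looks_scanned(page) -> bool:
    """Heuristic: no extractable text layer, but at least one embedded image."""
    has_text = bool(page.get_text().strip())
    has_image = len(page.get_images()) > 0
    return (not has_text) and has_image


def classify_pdf(doc) -> str:
    """Return 'acroform', 'flat-scanned', or 'flat-digital' (labels are illustrative)."""
    if count_fillable_fields(doc) > 0:
        return "acroform"
    if all(looks_scanned(page) for page in doc):
        return "flat-scanned"
    # No widgets, but a real text layer: a flat PDF that was exported, not scanned.
    return "flat-digital"
```

Routing on the result keeps the existing pdfrw path for `acroform` inputs and sends only flat documents through the OCR and overlay handler. Separating `flat-digital` from `flat-scanned` is a design choice, since a flat PDF with a real text layer may not need OCR to locate its labels.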
