name: 🚀 Feature Request
about: Suggest an idea or a new capability for FireForm.
title: "[FEAT]: Universal Support for Static (Non-Fillable) Scanned PDF Operations"
📝 Description
Currently, FireForm's architecture relies heavily on AcroForm extraction and filling (via pdfrw). In practice, however, the majority of emergency response departments still rely on static, flat, or scanned PDFs that have no digital form fields at all.
A prime example is the CAL FIRE ICS-214 (Activity Log) form. In many deployment environments, these forms are printed, scanned, and distributed as flat images inside a PDF wrapper. Because they contain no AcroForm fields, AP.N appearance streams, or other digital input metadata, our current pipeline cannot fill them, which shuts out the many potential station administrators who only have scanned legacy documents.
💡 Rationale
To achieve true "Reporting Ubiquity", FireForm must be agnostic to the PDF's internal format. If a department uploads a flat image scan of an ICS-214, the platform should still be able to mathematically overlay the LLM-extracted unstructured data (from our Data Lake pipeline) precisely onto the blank lines of the image.
🛠️ Proposed Solution
I propose building a deterministic static-PDF handler that severs reliance on embedded digital fields:
- OCR Bounding-Box Detection (Tesseract/OpenCV):
  - Instead of trying to parse embedded text offsets (which is fragile on scanned documents), we run the flat PDF through Tesseract and OpenCV to locate logical field zones and empty lines visually.
  - We map the bounding coordinates (X, Y, W, H) of these empty regions dynamically.
- Semantic Hardware Overlay (PyMuPDF / fitz):
  - Once the zones are mapped, the LLM assigns each extracted value to its matching zone.
  - We use PyMuPDF to programmatically "stamp" the extracted text strings exactly at those coordinate locations.
  - PyMuPDF's insert_textbox wraps text inside a given rectangle and reports when the text does not fit, so the handler can reduce the font size until each value stays within its strict bounding box and does not overlap into other rows.
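The hand-off between the two steps depends on a unit conversion: OCR and OpenCV report boxes in image pixels, while PDF pages are measured in points (72 per inch). Below is a minimal sketch of that mapping. The names `PixelBox` and `pixel_box_to_pdf_rect` are hypothetical and not part of FireForm. It assumes the page was rasterized at a known DPI with no rotation or cropping, and that both spaces use a top-left origin, which is the convention PyMuPDF uses for page coordinates.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # PDF user space: 72 points per inch


@dataclass(frozen=True)
class PixelBox:
    """Hypothetical container for an (X, Y, W, H) box in image pixels,
    with the origin at the top-left corner of the rasterized page."""
    x: int
    y: int
    w: int
    h: int


def pixel_box_to_pdf_rect(box: PixelBox, dpi: float, padding_pt: float = 0.0):
    """Convert a pixel box to an (x0, y0, x1, y1) tuple in PDF points.

    Assumption: the page image was rendered at `dpi` without rotation or
    cropping. `padding_pt` shrinks the box on every side so stamped text
    does not touch the printed rules around a field.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")

    def to_points(pixels: float) -> float:
        return pixels * POINTS_PER_INCH / dpi

    x0 = to_points(box.x) + padding_pt
    y0 = to_points(box.y) + padding_pt
    x1 = to_points(box.x + box.w) - padding_pt
    y1 = to_points(box.y + box.h) - padding_pt
    if x1 <= x0 or y1 <= y0:
        raise ValueError("padding leaves no usable area in the box")
    return (x0, y0, x1, y1)


if __name__ == "__main__":
    # A field detected at 300 DPI, 300 px from the left and 600 px from the top.
    field = PixelBox(x=300, y=600, w=1200, h=150)
    print(pixel_box_to_pdf_rect(field, dpi=300))
```

The resulting tuple could then be used as the rectangle for PyMuPDF's `Page.insert_textbox`. Scans that are skewed or rotated would need a deskew step before this mapping applies.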
✅ Acceptance Criteria