Commit 902e5c9

example(dspy): extract patient intake forms by DSPy (#1365)
1 parent 7255d3d commit 902e5c9

10 files changed, +268 -0 lines changed
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
# Postgres database address for cocoindex
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex

GEMINI_API_KEY=
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
.env
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
# Extract structured data from patient intake forms with DSPy

[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
We appreciate a star ⭐ at [CocoIndex GitHub](https://github.com/cocoindex-io/cocoindex) if this is helpful.

This example shows how to use [DSPy](https://github.com/stanfordnlp/dspy) with Gemini 2.5 Flash (vision model) to extract structured data from patient intake PDFs. DSPy provides a programming model for building AI systems using language models as building blocks.

- **Pydantic Models** (`main.py`) - Defines the data structure using Pydantic for type safety
- **DSPy Module** (`main.py`) - Defines the extraction signature and module using DSPy's ChainOfThought with vision support
- **CocoIndex Flow** (`main.py`) - Wraps DSPy in a custom function and provides the flow to process files incrementally

## Key Features

- **PDF Input Without OCR**: Renders each PDF page to an image with PyMuPDF and passes the images directly to the vision model
- **DSPy Vision Integration**: Uses DSPy's `Image` type with `ChainOfThought` for visual document understanding
- **Structured Outputs**: Pydantic models ensure type-safe, validated extraction
- **No Text Extraction Required**: Processes the page images directly, with no intermediate Markdown conversion
- **Incremental Processing**: CocoIndex handles batching and caching automatically
- **PostgreSQL Storage**: Results are stored in a structured database table

## Prerequisites

1. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.

2. Install dependencies

   ```sh
   pip install -U cocoindex dspy-ai pydantic pymupdf
   ```

3. Create a `.env` file. You can copy it from `.env.example` first:

   ```sh
   cp .env.example .env
   ```

   Then edit the file to fill in your `GEMINI_API_KEY`.
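
Before running the flow, it can help to confirm that both settings are actually visible to the process. The following is a small sketch using only the standard library; `missing_settings` and `REQUIRED_VARS` are hypothetical names, and the variable names come from `.env.example`. It inspects the process environment only, so it assumes the values from `.env` have already been exported or loaded.

```python
import os

# Variable names taken from .env.example in this commit.
REQUIRED_VARS = ("COCOINDEX_DATABASE_URL", "GEMINI_API_KEY")


def missing_settings(env: dict[str, str]) -> list[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings(dict(os.environ))
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```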

## Run

Update index:

```sh
cocoindex update main
```

## How It Works

The example demonstrates DSPy vision integration with CocoIndex:

1. **Pydantic Models**: Define the structured schema (Patient, Contact, Address, etc.)
2. **DSPy Signature**: Declares input (`list[dspy.Image]`) and output (Patient model) fields
3. **DSPy Module**: Uses `ChainOfThought` with vision capabilities to reason about extraction from images
4. **Single-Step Extraction**:
   - The extractor receives PDF bytes directly
   - Internally converts PDF pages to DSPy Image objects using PyMuPDF
   - Processes images with vision model
   - Returns Pydantic model directly
5. **CocoIndex Flow**:
   - Loads PDFs from local directory as binary
   - Applies single transform: PDF bytes → Patient data
   - Stores results in PostgreSQL
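
The page-to-image step in `main.py` renders each page with `fitz.Matrix(2, 2)`, that is, at 2x zoom. As a rough worked example, assuming PyMuPDF's default of 72 dpi at zoom 1, the sketch below estimates the pixel size of a rendered page from its size in PDF points. `rendered_size` is a hypothetical helper, not part of PyMuPDF, and a real pixmap may differ by a pixel because of rounding.

```python
import math

BASE_DPI = 72  # one PDF point is 1/72 inch, so zoom 1 gives one pixel per point


def rendered_size(
    width_pt: float, height_pt: float, zoom: float = 2.0
) -> tuple[int, int, float]:
    """Approximate pixel width, pixel height and effective dpi at the given zoom."""
    return (
        math.ceil(width_pt * zoom),
        math.ceil(height_pt * zoom),
        BASE_DPI * zoom,
    )


# US Letter is 612 x 792 points (8.5 x 11 inches).
width_px, height_px, dpi = rendered_size(612, 792)
print(width_px, height_px, dpi)  # 1224 1584 144.0
```

A higher zoom generally gives the vision model sharper text, at the cost of larger images to encode and send.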

## CocoInsight

I used CocoInsight (free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline. It just connects to your local CocoIndex server, with zero pipeline data retention. Run the following command to start CocoInsight:

```sh
cocoindex server -ci main
```

Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
## Note:
Example files here are purely artificial and not real, for testing purposes only.
Please do not use these examples for any other purpose.
Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
import datetime

import dspy
from pydantic import BaseModel, Field
import fitz  # PyMuPDF

import cocoindex


# Pydantic models for DSPy structured outputs
class Contact(BaseModel):
    name: str
    phone: str
    relationship: str


class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str


class Pharmacy(BaseModel):
    name: str
    phone: str
    address: Address


class Insurance(BaseModel):
    provider: str
    policy_number: str
    group_number: str | None = None
    policyholder_name: str
    relationship_to_patient: str


class Condition(BaseModel):
    name: str
    diagnosed: bool


class Medication(BaseModel):
    name: str
    dosage: str


class Allergy(BaseModel):
    name: str


class Surgery(BaseModel):
    name: str
    date: str


class Patient(BaseModel):
    name: str
    dob: datetime.date
    gender: str
    address: Address
    phone: str
    email: str
    preferred_contact_method: str
    emergency_contact: Contact
    insurance: Insurance | None = None
    reason_for_visit: str
    symptoms_duration: str
    past_conditions: list[Condition] = Field(default_factory=list)
    current_medications: list[Medication] = Field(default_factory=list)
    allergies: list[Allergy] = Field(default_factory=list)
    surgeries: list[Surgery] = Field(default_factory=list)
    occupation: str | None = None
    pharmacy: Pharmacy | None = None
    consent_given: bool
    consent_date: str | None = None


# DSPy Signature for patient information extraction from images
class PatientExtractionSignature(dspy.Signature):
    """Extract structured patient information from a medical intake form image."""

    form_images: list[dspy.Image] = dspy.InputField(
        desc="Images of the patient intake form pages"
    )
    patient: Patient = dspy.OutputField(
        desc="Extracted patient information with all available fields filled"
    )


class PatientExtractor(dspy.Module):
    """DSPy module for extracting patient information from intake form images."""

    def __init__(self) -> None:
        super().__init__()
        self.extract = dspy.ChainOfThought(PatientExtractionSignature)

    def forward(self, form_images: list[dspy.Image]) -> Patient:
        """Extract patient information from form images and return as a Pydantic model."""
        result = self.extract(form_images=form_images)
        return result.patient  # type: ignore


@cocoindex.op.function(cache=True, behavior_version=1)
def extract_patient(pdf_content: bytes) -> Patient:
    """Extract patient information from PDF content."""

    # Convert PDF pages to DSPy Image objects
    pdf_doc = fitz.open(stream=pdf_content, filetype="pdf")

    form_images = []
    for page in pdf_doc:
        # Render page to pixmap (image) at 2x resolution for better quality
        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
        # Convert to PNG bytes
        img_bytes = pix.tobytes("png")
        # Create DSPy Image from bytes
        form_images.append(dspy.Image(img_bytes))

    pdf_doc.close()

    # Extract patient information using DSPy with vision
    extractor = PatientExtractor()
    patient = extractor(form_images=form_images)

    return patient  # type: ignore


@cocoindex.flow_def(name="PatientIntakeExtractionDSPy")
def patient_intake_extraction_dspy_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    """
    Define a flow that extracts patient information from intake forms using DSPy.

    This flow:
    1. Reads patient intake PDFs as binary
    2. Uses DSPy with vision models to extract structured patient information
       (PDF to image conversion happens automatically inside the extractor)
    3. Stores the results in a Postgres database
    """
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="data/patient_forms", binary=True)
    )

    patients_index = data_scope.add_collector()

    with data_scope["documents"].row() as doc:
        # Extract patient information directly from PDF using DSPy with vision
        # (PDF->Image conversion happens inside the extractor)
        doc["patient_info"] = doc["content"].transform(extract_patient)

        # Collect the extracted patient information
        patients_index.collect(
            filename=doc["filename"],
            patient_info=doc["patient_info"],
        )

    # Export to Postgres
    patients_index.export(
        "patients",
        cocoindex.storages.Postgres(table_name="patients_info_dspy"),
        primary_key_fields=["filename"],
    )


@cocoindex.settings
def cocoindex_settings() -> cocoindex.Settings:
    # Configure the model used in DSPy
    lm = dspy.LM("gemini/gemini-2.5-flash")
    dspy.configure(lm=lm)

    return cocoindex.Settings.from_env()
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
[project]
name = "patient-intake-extraction-dspy"
version = "0.1.0"
description = "Extract structured information from patient intake forms using DSPy."
requires-python = ">=3.10"
dependencies = [
    "cocoindex>=0.3.9",
    "dspy-ai>=3.0.4",
    "pydantic>=2.0.0",
    "pymupdf>=1.24.0",
]

[tool.setuptools]
packages = []
