Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions examples/patient_intake_extraction_dspy/.env.example
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# Postgres database address for cocoindex
COCOINDEX_DATABASE_URL=postgres://cocoindex:cocoindex@localhost/cocoindex

GEMINI_API_KEY=
1 change: 1 addition & 0 deletions examples/patient_intake_extraction_dspy/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
.env
72 changes: 72 additions & 0 deletions examples/patient_intake_extraction_dspy/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
# Extract structured data from patient intake forms with DSPy

[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
We appreciate a star ⭐ at [CocoIndex Github](https://github.com/cocoindex-io/cocoindex) if this is helpful.

This example shows how to use [DSPy](https://github.com/stanfordnlp/dspy) with Gemini 2.5 Flash (vision model) to extract structured data from patient intake PDFs. DSPy provides a programming model for building AI systems using language models as building blocks.

- **Pydantic Models** (`main.py`) - Defines the data structure using Pydantic for type safety
- **DSPy Module** (`main.py`) - Defines the extraction signature and module using DSPy's ChainOfThought with vision support
- **CocoIndex Flow** (`main.py`) - Wraps DSPy in a custom function, provides the flow to process files incrementally

## Key Features

- **Native PDF Support**: Converts PDFs to images and processes directly with vision models
- **DSPy Vision Integration**: Uses DSPy's `Image` type with `ChainOfThought` for visual document understanding
- **Structured Outputs**: Pydantic models ensure type-safe, validated extraction
- **No Text Extraction Required**: Directly processes PDF images without intermediate markdown conversion
- **Incremental Processing**: CocoIndex handles batching and caching automatically
- **PostgreSQL Storage**: Results stored in a structured database table

## Prerequisites

1. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.

2. Install dependencies

```sh
pip install -U cocoindex dspy-ai pydantic pymupdf
```

3. Create a `.env` file. You can copy it from `.env.example` first:

```sh
cp .env.example .env
```

Then edit the file to fill in your `GEMINI_API_KEY`.

## Run

Update index:

```sh
cocoindex update main
```

## How It Works

The example demonstrates DSPy vision integration with CocoIndex:

1. **Pydantic Models**: Define the structured schema (Patient, Contact, Address, etc.)
2. **DSPy Signature**: Declares input (`list[dspy.Image]`) and output (Patient model) fields
3. **DSPy Module**: Uses `ChainOfThought` with vision capabilities to reason about extraction from images
4. **Single-Step Extraction**:
- The extractor receives PDF bytes directly
- Internally converts PDF pages to DSPy Image objects using PyMuPDF
- Processes images with vision model
- Returns Pydantic model directly
5. **CocoIndex Flow**:
- Loads PDFs from local directory as binary
- Applies single transform: PDF bytes → Patient data
- Stores results in PostgreSQL

## CocoInsight

I used CocoInsight (Free beta now) to troubleshoot the index generation and understand the data lineage of the pipeline. It just connects to your local CocoIndex server, with zero pipeline data retention. Run following command to start CocoInsight:

```sh
cocoindex server -ci main
```

Then open the CocoInsight UI at [https://cocoindex.io/cocoinsight](https://cocoindex.io/cocoinsight).
4 changes: 4 additions & 0 deletions examples/patient_intake_extraction_dspy/data/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
## Note:
Example files here are purely artificial and not real, for testing purposes only.
Please do not use these examples for any other purpose.

Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
173 changes: 173 additions & 0 deletions examples/patient_intake_extraction_dspy/main.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,173 @@
import datetime

import dspy
from pydantic import BaseModel, Field
import fitz # PyMuPDF

import cocoindex


# Pydantic models for DSPy structured outputs
class Contact(BaseModel):
name: str
phone: str
relationship: str


class Address(BaseModel):
street: str
city: str
state: str
zip_code: str


class Pharmacy(BaseModel):
name: str
phone: str
address: Address


class Insurance(BaseModel):
provider: str
policy_number: str
group_number: str | None = None
policyholder_name: str
relationship_to_patient: str


class Condition(BaseModel):
name: str
diagnosed: bool


class Medication(BaseModel):
name: str
dosage: str


class Allergy(BaseModel):
name: str


class Surgery(BaseModel):
name: str
date: str


class Patient(BaseModel):
name: str
dob: datetime.date
gender: str
address: Address
phone: str
email: str
preferred_contact_method: str
emergency_contact: Contact
insurance: Insurance | None = None
reason_for_visit: str
symptoms_duration: str
past_conditions: list[Condition] = Field(default_factory=list)
current_medications: list[Medication] = Field(default_factory=list)
allergies: list[Allergy] = Field(default_factory=list)
surgeries: list[Surgery] = Field(default_factory=list)
occupation: str | None = None
pharmacy: Pharmacy | None = None
consent_given: bool
consent_date: str | None = None


# DSPy Signature for patient information extraction from images
class PatientExtractionSignature(dspy.Signature):
"""Extract structured patient information from a medical intake form image."""

form_images: list[dspy.Image] = dspy.InputField(
desc="Images of the patient intake form pages"
)
patient: Patient = dspy.OutputField(
desc="Extracted patient information with all available fields filled"
)


class PatientExtractor(dspy.Module):
"""DSPy module for extracting patient information from intake form images."""

def __init__(self) -> None:
super().__init__()
self.extract = dspy.ChainOfThought(PatientExtractionSignature)

def forward(self, form_images: list[dspy.Image]) -> Patient:
"""Extract patient information from form images and return as a Pydantic model."""
result = self.extract(form_images=form_images)
return result.patient # type: ignore


@cocoindex.op.function(cache=True, behavior_version=1)
def extract_patient(pdf_content: bytes) -> Patient:
"""Extract patient information from PDF content."""

# Convert PDF pages to DSPy Image objects
pdf_doc = fitz.open(stream=pdf_content, filetype="pdf")

form_images = []
for page in pdf_doc:
# Render page to pixmap (image) at 2x resolution for better quality
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is super cool! I've found that parsing images of higher quality results in fewer typos in the resulting output, at the cost of more tokens. But super useful approach to allow it to upscale the image!

pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
# Convert to PNG bytes
img_bytes = pix.tobytes("png")
# Create DSPy Image from bytes
form_images.append(dspy.Image(img_bytes))

pdf_doc.close()

# Extract patient information using DSPy with vision
extractor = PatientExtractor()
patient = extractor(form_images=form_images)

return patient # type: ignore


@cocoindex.flow_def(name="PatientIntakeExtractionDSPy")
def patient_intake_extraction_dspy_flow(
flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
"""
Define a flow that extracts patient information from intake forms using DSPy.

This flow:
1. Reads patient intake PDFs as binary
2. Uses DSPy with vision models to extract structured patient information
(PDF to image conversion happens automatically inside the extractor)
3. Stores the results in a Postgres database
"""
data_scope["documents"] = flow_builder.add_source(
cocoindex.sources.LocalFile(path="data/patient_forms", binary=True)
)

patients_index = data_scope.add_collector()

with data_scope["documents"].row() as doc:
# Extract patient information directly from PDF using DSPy with vision
# (PDF->Image conversion happens inside the extractor)
doc["patient_info"] = doc["content"].transform(extract_patient)

# Collect the extracted patient information
patients_index.collect(
filename=doc["filename"],
patient_info=doc["patient_info"],
)

# Export to Postgres
patients_index.export(
"patients",
cocoindex.storages.Postgres(table_name="patients_info_dspy"),
primary_key_fields=["filename"],
)


@cocoindex.settings
def cocoindex_settings() -> cocoindex.Settings:
# Configure the model used in DSPy
lm = dspy.LM("gemini/gemini-2.5-flash")
dspy.configure(lm=lm)

return cocoindex.Settings.from_env()
14 changes: 14 additions & 0 deletions examples/patient_intake_extraction_dspy/pyproject.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
[project]
name = "patient-intake-extraction-dspy"
version = "0.1.0"
description = "Extract structured information from patient intake forms using DSPy."
requires-python = ">=3.10"
dependencies = [
"cocoindex>=0.3.9",
"dspy-ai>=3.0.4",
"pydantic>=2.0.0",
"pymupdf>=1.24.0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@georgeh0 what is the need for pymupdf here if the LLM is directly extracting from the image bytes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this what fitz is? Oops, now I get it.

]

[tool.setuptools]
packages = []