Ingestion engine for newsence.app. Pulls content from RSS, Twitter, YouTube, HN, Bilibili, and Xiaohongshu, runs bilingual AI analysis on each item, and stores the results as searchable embeddings and an entity graph. Follows the LLM Wiki pattern: each source is read once and integrated into a persistent artifact (summaries, entities, embeddings, cross-refs) rather than retrieved RAG-style at query time.
| Platform | Type | Schedule | What it does |
|---|---|---|---|
| RSS Feeds | Monitor | Every 5 min | Fetches feeds, deduplicates by URL, detects HN links |
| Twitter/X | Monitor | Every 6 hours | Tracks users via Kaito API — tweets, threads, articles, media |
| YouTube | Monitor | Every 30 min | Atom feed → video metadata, transcripts, chapters, AI highlights |
| Bilibili | Monitor | Every 30 min | gRPC mobile API → user dynamics, video cards |
| Xiaohongshu | Monitor | Every 30 min | Profile scraping → user notes, covers |
| Hacker News | Processor | Via RSS | Detects HN links → fetches comments via Algolia → generates editorial notes |
| Web | Scraper | On demand | Full content extraction (Readability + Cheerio), OG metadata |
| User Submissions | Ingestion | Real-time | POST /submit — full crawl + workflow, sync response |
Every platform outputs the same unified ScrapedContent shape, so all sources feed the same AI pipeline.
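The RSS row in the table above mentions URL deduplication and HN link detection. The following is a minimal sketch of how those two checks could look, not the worker's actual code: the function names, the normalization rules, and the list of tracking parameters are all assumptions made for illustration.

```typescript
// Hypothetical helpers; names and normalization rules are illustrative assumptions.

/** Normalize a URL so trivially different forms map to one dedup key. */
function dedupKey(rawUrl: string): string {
  const url = new URL(rawUrl); // the parser already lowercases scheme and host
  url.hash = ""; // fragments never change the fetched document
  // Drop utm_* tracking parameters (an assumed, non-exhaustive rule).
  for (const name of Array.from(url.searchParams.keys())) {
    if (name.startsWith("utm_")) url.searchParams.delete(name);
  }
  // Treat "/path/" and "/path" as the same article.
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}

/** Return the Hacker News item id if the URL is an HN discussion link. */
function hackerNewsItemId(rawUrl: string): string | null {
  const url = new URL(rawUrl);
  if (url.hostname !== "news.ycombinator.com" || url.pathname !== "/item") return null;
  const id = url.searchParams.get("id");
  return id !== null && /^\d+$/.test(id) ? id : null;
}

/** Keep only the first occurrence of each normalized URL. */
function dedupeByUrl(urls: string[]): string[] {
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const u of urls) {
    const key = dedupKey(u);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(u);
    }
  }
  return unique;
}
```

In a real deployment the uniqueness check would more likely live in the database (for example a unique index on the stored URL), with in-memory normalization only producing the key.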
Each article goes through an automated workflow with independent retries:
URL arrives (RSS cron / Twitter cron / YouTube cron / /submit)
│
├─ 1. Fetch Article ──────── Load article row from database
├─ 2. AI Analysis ────────── Gemini Flash Lite → bilingual title, summary, tags, keywords, entities
├─ 3. Fetch OG Image ─────── Grab OG image if missing (first 32 KB of HTML)
├─ 4. Translate Content ──── Full article → Traditional Chinese
├─ 5. Save to DB ─────────── Write all AI results in a single UPDATE
├─ Sync Entities ──────── (conditional) Upsert entities, link to article
├─ 6. YouTube Highlights ─── (YouTube only) Transcript → AI highlight segments
└─ 7. Embed ─────────────── BGE-M3 → 1024-dim vector from title + summary + content + entities
Roughly 30 seconds per article. Each step retries independently with exponential backoff.
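To make "retries independently with exponential backoff" concrete, here is a minimal sketch in plain TypeScript. It is not the worker's implementation (which presumably relies on the retry policy of the Workflows runtime). The names runStep and backoffDelay, the option fields, and the injected sleep are assumptions for illustration.

```typescript
// Illustrative only: names, defaults, and the injected sleep are assumptions.
type Sleep = (ms: number) => Promise<void>;

interface RetryOptions {
  maxAttempts: number; // total tries, including the first
  baseDelayMs: number; // delay before the second try
  factor: number; // multiplier applied after each failure
  sleep?: Sleep; // injectable so tests need not wait
}

const realSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

/** Delay before retry number `retry` (1-based): base * factor^(retry - 1). */
function backoffDelay(retry: number, baseDelayMs: number, factor: number): number {
  return baseDelayMs * Math.pow(factor, retry - 1);
}

/** Run one step, retrying on failure without affecting other steps. */
async function runStep<T>(
  name: string,
  fn: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  const sleep = opts.sleep ?? realSleep;
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Only wait if another attempt will follow.
      if (attempt < opts.maxAttempts) {
        await sleep(backoffDelay(attempt, opts.baseDelayMs, opts.factor));
      }
    }
  }
  throw new Error(
    `step "${name}" failed after ${opts.maxAttempts} attempts: ${String(lastError)}`,
  );
}
```

Because each step is wrapped on its own, a transient failure in, say, translation re-runs only that step; results of earlier steps are not recomputed.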
| Stage | Model | What it does |
|---|---|---|
| Analysis | Gemini Flash Lite | Article → bilingual title, summary, tags, keywords, category |
| Entity Extraction | Gemini Flash Lite | Article → named entities (person, organization, product, technology, event) with EN + zh-TW names |
| Content Translation | Gemini Flash | Full article content → Traditional Chinese |
| Embedding | BGE-M3 (1024d) | Title + summary + content + entity names → dense vector (HNSW-indexed) |
Entity extraction happens in the same LLM call as analysis — zero extra API cost.
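The embedding row says the vector is built from title, summary, content, and entity names. A rough sketch of assembling that input text follows. The interface fields, the function name buildEmbeddingInput, the separator, and the content cap are assumptions, not the project's actual code.

```typescript
// Hypothetical shapes: field names and the content cap are assumptions.
interface EntityName {
  en: string;
  zhTW: string;
}

interface EmbeddableArticle {
  title: string;
  summary: string;
  content: string;
  entities: EntityName[];
}

/** Join title, summary, capped content, and entity names into one string. */
function buildEmbeddingInput(article: EmbeddableArticle, maxContentChars = 6000): string {
  // Include both language variants unless they are identical.
  const entityLine = article.entities
    .map((e) => (e.en === e.zhTW ? e.en : `${e.en} / ${e.zhTW}`))
    .join(", ");
  const parts = [
    article.title,
    article.summary,
    article.content.slice(0, maxContentChars), // cap only the long field
    entityLine,
  ];
  // Skip empty parts so missing fields do not leave blank separators.
  return parts
    .map((p) => p.trim())
    .filter((p) => p.length > 0)
    .join("\n\n");
}
```

Capping only the content keeps the short, high-signal fields (title, summary, entity names) intact even for very long articles.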
| Layer | Technology |
|---|---|
| Runtime | Cloudflare Workers (V8 isolates) |
| Orchestration | Cloudflare Queues + Workflows |
| Database | PostgreSQL + pgvector (via Cloudflare Hyperdrive) |
| LLM | OpenRouter → Gemini Flash / Flash Lite |
| Embeddings | Cloudflare Workers AI → BGE-M3 |
| Twitter Data | Kaito API (third-party) |
The one-click Deploy button above handles Worker + Queue + Workflow, but Hyperdrive, the database, and secrets need manual setup. Full walkthrough:
You need a PostgreSQL instance with pgvector. Tested with Supabase; any Postgres ≥ 15 with the vector extension works.
Required tables: articles, user_articles, RssList, youtube_transcripts, plus entity/citation tables. The canonical schema is defined in frontend/prisma/schema.prisma in the parent monorepo — a standalone schema.sql is on the roadmap. For now, inspect the Prisma models or reach out via Issues if you want to run just the worker.
Create a Hyperdrive that points to your database:
wrangler hyperdrive create newsence-db \
--connection-string="postgres://user:pass@host:5432/dbname"
Copy the returned ID into wrangler.jsonc under the hyperdrive[].id field.
Create the article-processing queue and its dead-letter queue (the Worker is already configured as both producer and consumer):
wrangler queues create article-processing-queue-core
wrangler queues create article-processing-dlq-core
Workflows are provisioned automatically on first deploy via the workflows binding in wrangler.jsonc.
Only OPENROUTER_API_KEY is strictly required. The others enable specific platforms:
wrangler secret put OPENROUTER_API_KEY # required: AI analysis
wrangler secret put KAITO_API_KEY # optional: Twitter monitoring
wrangler secret put YOUTUBE_API_KEY # optional: YouTube monitoring
wrangler secret put CORE_WORKER_INTERNAL_TOKEN # optional: auth for /submit
pnpm install
pnpm run deploy
Or run locally with pnpm dev (uses wrangler dev --test-scheduled, so you can curl /__scheduled?cron=*/5+*+*+*+* to trigger RSS manually).
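As an illustration of how one scheduled entrypoint might fan out to the monitors, here is a small sketch. Only the `*/5 * * * *` expression appears in this README; the other two cron strings are inferred from the schedule table and are assumptions, as are the names SCHEDULE, monitorsForCron, and scheduledTestPath.

```typescript
// Cron strings other than "*/5 * * * *" are assumptions derived from the
// schedule table; the real triggers in wrangler.jsonc may differ.
type Monitor = "rss" | "twitter" | "youtube" | "bilibili" | "xiaohongshu";

const SCHEDULE: Record<string, Monitor[]> = {
  "*/5 * * * *": ["rss"],
  "*/30 * * * *": ["youtube", "bilibili", "xiaohongshu"],
  "0 */6 * * *": ["twitter"],
};

/** Monitors to run for the cron expression that fired (empty if unknown). */
function monitorsForCron(cron: string): Monitor[] {
  return SCHEDULE[cron] ?? [];
}

/** Build the local test path exposed by `wrangler dev --test-scheduled`. */
function scheduledTestPath(cron: string): string {
  // Spaces become "+" in the query string, matching the curl example.
  return `/__scheduled?cron=${cron.split(" ").join("+")}`;
}
```

With this shape, triggering another monitor locally only requires swapping the cron expression passed to scheduledTestPath.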
# Health check
curl https://your-worker.workers.dev/health
# Submit a URL
curl -X POST https://your-worker.workers.dev/submit \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/article"}'
# Generate embeddings
curl -X POST https://your-worker.workers.dev/embed \
-H "Content-Type: application/json" \
-d '{"text": "search query"}'
Response example (POST /submit):
{
"success": true,
"results": [
{
"articleId": "550e8400-e29b-41d4-a716-446655440000",
"title": "Article Title",
"sourceType": "web",
"alreadyExists": false
}
]
}
Optional auth: X-Internal-Token header. Rate limiting: 20 req/60s per key (configurable via SUBMIT_RATE_LIMIT_MAX / SUBMIT_RATE_LIMIT_WINDOW_SEC).
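The "20 req/60s per key" limit can be pictured as a fixed-window counter. The sketch below is one possible reading, not the worker's implementation: the class name, the in-memory Map, and the fixed-window strategy are all assumptions (the real limiter may use a sliding window or external storage).

```typescript
// Sketch of a fixed-window limiter. The in-memory Map is an assumption and
// would not be shared across Worker isolates.
interface RateLimitConfig {
  max: number; // requests allowed per window
  windowSec: number; // window length in seconds
}

class FixedWindowLimiter {
  private readonly config: RateLimitConfig;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(config: RateLimitConfig) {
    this.config = config;
  }

  /** Returns true if the request is allowed, false if the key is over its limit. */
  allow(key: string, nowMs: number = Date.now()): boolean {
    const windowMs = this.config.windowSec * 1000;
    const current = this.windows.get(key);
    // Start a fresh window for a new key or once the old window has expired.
    if (!current || nowMs - current.start >= windowMs) {
      this.windows.set(key, { start: nowMs, count: 1 });
      return true;
    }
    if (current.count >= this.config.max) return false;
    current.count += 1;
    return true;
  }
}

// Defaults taken from the README: 20 requests per 60 seconds.
const submitLimiter = new FixedWindowLimiter({ max: 20, windowSec: 60 });
```

A handler would call submitLimiter.allow(key) with the token or client identifier as the key and return HTTP 429 when it yields false; that response code is also an assumption.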
Also available as a CLI and MCP server via the separate newsence npm package:
npx newsence search "AI agents" # search articles
npx newsence recent --hours 6 # recent articles
claude mcp add newsence -- npx newsence mcp # Claude Code
# Remote MCP: https://www.newsence.app/api/mcp
src/
├── index.ts # Cloudflare WorkerEntrypoint class only
├── entrypoints/ # HTTP, scheduled, queue, and RPC adapters
├── app/
│ ├── handlers/ # Thin HTTP route handlers (/submit, /preview, /embed, /health)
│ ├── use-cases/ # Application actions shared by HTTP + RPC
│ ├── monitors/ # Cross-platform scheduled maintenance
│ └── workflows/ # Queue consumer, Workflow class, and workflow steps
├── platforms/ # Each platform lives in its own folder
│ ├── registry.ts # URL detection dispatch → platform scraper
│ ├── twitter/ # monitor + scraper + processor + metadata
│ ├── youtube/ # monitor + scraper + highlights + metadata
│ ├── hackernews/ # scraper + processor + metadata (no monitor — fed by RSS)
│ ├── bilibili/ # monitor + scraper + metadata
│ ├── xiaohongshu/ # monitor + scraper + metadata
│ ├── rss/ # monitor + parser + feed-config
│ └── web/ # shared scraper (Readability + Cheerio + OG extraction)
├── domain/
│ ├── content/ # Shared content cleanup and editorial domain helpers
│ ├── processing/ # AI processor registry, DefaultProcessor, AI helpers
│ └── entities.ts # Entity sync to normalized tables
├── infra/
│ ├── db.ts # Hyperdrive client + insertArticle / dedup / transcript helpers
│ ├── fetch.ts # fetchWithTimeout
│ ├── log.ts # Structured JSON logging
│ └── openrouter.ts # OpenRouter + embedding wrappers
└── models/ # Types + PlatformMetadata discriminated union
Bindings (in wrangler.jsonc):
| Binding | Purpose |
|---|---|
| HYPERDRIVE | Hyperdrive connection to your Postgres |
| ARTICLE_QUEUE | Producer for article-processing-queue-core |
| MONITOR_WORKFLOW | NewsenceMonitorWorkflow instance creator |
| AI | Workers AI (BGE-M3 embeddings) |
| BROWSER | Cloudflare Browser Rendering (reserved) |
Secrets (via wrangler secret put):
| Variable | Required | Description |
|---|---|---|
| OPENROUTER_API_KEY | Yes | OpenRouter (Gemini) for AI analysis |
| CORE_WORKER_INTERNAL_TOKEN | No | Bearer token for /submit endpoint |
| KAITO_API_KEY | No | Enables Twitter monitoring |
| YOUTUBE_API_KEY | No | Enables YouTube channel monitoring |
| SUBMIT_RATE_LIMIT_MAX | No | Requests allowed per window (default 20) |
| SUBMIT_RATE_LIMIT_WINDOW_SEC | No | Window in seconds (default 60) |
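One way to read the table above is as a small configuration resolver: one required key, optional keys that switch platforms on, and numeric settings with defaults. The sketch below follows that reading. The variable names come from the table, but SecretsEnv, ResolvedConfig, and resolveConfig are hypothetical and not part of the codebase.

```typescript
// Hypothetical helper: the variable names mirror the table, but the function
// name and return shape are assumptions.
interface SecretsEnv {
  OPENROUTER_API_KEY?: string;
  CORE_WORKER_INTERNAL_TOKEN?: string;
  KAITO_API_KEY?: string;
  YOUTUBE_API_KEY?: string;
  SUBMIT_RATE_LIMIT_MAX?: string;
  SUBMIT_RATE_LIMIT_WINDOW_SEC?: string;
}

interface ResolvedConfig {
  twitterEnabled: boolean;
  youtubeEnabled: boolean;
  submitAuthRequired: boolean;
  rateLimit: { max: number; windowSec: number };
}

/** Parse a positive integer, falling back when the value is missing or invalid. */
function positiveIntOr(value: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(value ?? "", 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

function resolveConfig(env: SecretsEnv): ResolvedConfig {
  if (!env.OPENROUTER_API_KEY) {
    throw new Error("OPENROUTER_API_KEY is required for AI analysis");
  }
  return {
    twitterEnabled: Boolean(env.KAITO_API_KEY),
    youtubeEnabled: Boolean(env.YOUTUBE_API_KEY),
    submitAuthRequired: Boolean(env.CORE_WORKER_INTERNAL_TOKEN),
    rateLimit: {
      max: positiveIntOr(env.SUBMIT_RATE_LIMIT_MAX, 20), // table default
      windowSec: positiveIntOr(env.SUBMIT_RATE_LIMIT_WINDOW_SEC, 60), // table default
    },
  };
}
```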
Platforms today follow a loose convention rather than a formal interface — each platform folder contains some combination of monitor.ts (cron ingestion), scraper.ts (URL-triggered fetch), metadata.ts (typed platform metadata + builders), and optionally processor.ts (custom AI analysis). Not every platform has all four; pick the closest existing one and copy its shape.
Minimum to add a new source:
- Scraper (platforms/foo/scraper.ts): export a function that returns ScrapedContent.
- Metadata (platforms/foo/metadata.ts): define your FooMetadata shape and a buildFoo(...) constructor; register it in models/platform-metadata.ts.
- Detection + dispatch: add the URL pattern to models/scraped-content.ts:detectPlatformType and route it in platforms/registry.ts.
- Monitor (optional, platforms/foo/monitor.ts): if the source is pollable, mirror one of the existing cron handlers; wire it into entrypoints/scheduled.ts.
- Processor (optional, platforms/foo/processor.ts): only if you need AI behavior that differs from DefaultProcessor; register in domain/processing/processors.ts.
The new article goes through the same Queue → Workflow pipeline as every other platform — you don't touch the AI steps.
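The first three steps above can be sketched in one self-contained file. Everything here is hypothetical: the real ScrapedContent and PlatformMetadata types live in models/ and may have different fields, the foo.example URL pattern is invented, and the detectPlatformType signature is assumed. The sketch only illustrates the discriminated-union idea and the detect-then-dispatch flow.

```typescript
// Every shape below is hypothetical; the real types in models/ may differ.
type PlatformType = "web" | "foo";

interface FooMetadata {
  type: "foo";
  postId: string;
  author: string;
}
interface WebMetadata {
  type: "web";
}
type PlatformMetadata = FooMetadata | WebMetadata; // discriminated on "type"

interface ScrapedContent {
  url: string;
  title: string;
  content: string;
  metadata: PlatformMetadata;
}

/** metadata.ts: typed constructor for Foo metadata. */
function buildFoo(postId: string, author: string): FooMetadata {
  return { type: "foo", postId, author };
}

/** Detection: map a URL to a platform type, defaulting to the web scraper. */
function detectPlatformType(rawUrl: string): PlatformType {
  const { hostname, pathname } = new URL(rawUrl);
  if (hostname === "foo.example" && pathname.startsWith("/posts/")) return "foo";
  return "web";
}

/** scraper.ts: a stub that builds the unified shape from already-fetched HTML. */
function scrapeFoo(rawUrl: string, html: string): ScrapedContent {
  const postId = new URL(rawUrl).pathname.split("/")[2] ?? "";
  const titleMatch = html.match(/<title>([^<]*)<\/title>/i);
  return {
    url: rawUrl,
    title: titleMatch ? titleMatch[1].trim() : "",
    // Naive tag stripping, for the sketch only.
    content: html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim(),
    metadata: buildFoo(postId, "unknown"),
  };
}

/** registry.ts: dispatch a URL to the matching scraper. */
const scrapers: Record<PlatformType, (url: string, html: string) => ScrapedContent> = {
  foo: scrapeFoo,
  web: (url, html) => ({ url, title: "", content: html, metadata: { type: "web" } }),
};

function scrape(rawUrl: string, html: string): ScrapedContent {
  return scrapers[detectPlatformType(rawUrl)](rawUrl, html);
}
```

Because downstream steps only see ScrapedContent, code that needs platform-specific fields can narrow on metadata.type, and the compiler flags any platform it has not handled.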
MIT