Research-assistant chatbot for SEC filings. Ask natural-language questions about 10-Ks, 10-Qs, and other financial documents — get grounded answers with cited source passages.
Document Copilot turns hours of document reading into seconds of Q&A.
Financial professionals spend enormous time reading through SEC filings — hundreds of pages per filing, dozens of filings per quarter. The information is all public, all structured, but the sheer volume makes it impractical to search manually. Document Copilot ingests these filings and lets you ask questions in plain English, returning concise answers with citations back to the exact source passage.
The key design principle: every answer is grounded in retrieved evidence. Each citation in an answer is checked against the passages actually retrieved from the document corpus; if any citation can't be verified, the answer is rejected and the LLM is re-prompted.
| Who | What they can do |
|---|---|
| Equity Research Analyst | "Compare Apple's gross margin trajectory from 2022 to 2025 and explain the drivers." |
| Portfolio Manager | "Which of my covered companies mentioned AI as a risk factor in their most recent 10-K?" |
| Compliance Officer | "Show me all disclosure controls and procedures descriptions across our portfolio." |
| Investment Banker | "Find precedent transactions in the semiconductor space from 2023-2025 filings." |
| Data Scientist | "Extract all revenue figures by segment for FAANG companies over the last 3 years." |
| Journalist | "What did Microsoft say about OpenAI investment risks in their 2025 10-K?" |
| Law Student | "Walk me through the risk factor section of NVDA's latest filing compared to AMD's." |
┌─────────────────────────────────────────────────────────┐
│ YOUR QUESTION │
│ "What was Apple's Services revenue in 2025?" │
└────────────────────────┬────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────────┐
│ 1. INGESTION (done ahead of time) │
│ │
│ SEC EDGAR ──▶ download.py ──▶ chunking (500t) ──▶ embedding ──▶ pgvector DB │
│ │
│ Each filing is split into overlapping passages, each passage gets an embedding │
│ vector, and both the vector and full text are stored in Postgres for retrieval. │
└──────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────────┐
│ 2. RETRIEVAL (hybrid search) │
│ │
│ ┌─── Your question ───┐ │
│ │ │ │
│ │ Embed question │ │
│ │ with same model │ │
│ └────────┬────────────┘ │
│ │ │
│ ┌────────▼────────┐ ┌───────────────────┐ │
│ │ SEMANTIC SEARCH │ │ FULL-TEXT SEARCH │ │
│ │ (pgvector) │ │ (Postgres tsquery)│ │
│ │ "Services │ │ "services AND │ │
│ │ revenue 2025" │ │ revenue AND 2025"│ │
│ └────────┬────────┘ └──────────┬─────────┘ │
│ │ │ │
│ └──────────┬───────────────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ RRF FUSION │ │
│ │ (Reciprocal │ │
│ │ Rank Fusion) │ │
│ └───────┬────────┘ │
│ │ │
│ ▼ │
│ Top-10 ranked passages │
│ (with scores from both methods) │
│ │
│ Why hybrid? Semantic search catches synonyms and paraphrases ("Services │
│ segment performance"), full-text guarantees keyword precision ("Services │
│ revenue 2025"). RRF merges both rankings into one robust result set. │
└──────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────────┐
│ 3. GENERATION (LLM with tools) │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ SYSTEM: You are a research assistant. You have access to the │ │
│ │ search_filings tool. You MUST cite every claim with [1], [2] etc. │ │
│ │ Each citation must reference a real chunk_id from search results. │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
│ LLM (Claude / GPT / Gemini / Grok / ...) │
│ │ │
│ ├── 1. Calls search_filings("Services revenue 2025 Apple") │
│ │ ◀── Receives top passages with chunk_id + content │
│ │ │
│ ├── 2. Calls read_chunk(chunk_id) if it needs the full passage │
│ │ │
│ └── 3. Generates grounded answer with citations │
│ │
│ Example output: │
│ │
│ "Apple's Services revenue reached $25.1B in Q1 2025, up 14% YoY, driven │
│ by record App Store and Apple Music performance [1][2]. The Services │
│ segment now represents 24% of total revenue, up from 20% in 2023 [3]." │
│ │
└──────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────────┐
│ 4. GROUNDING VALIDATION │
│ │
│ Every [1], [2], [3] is checked: │
│ │
│ ✓ Citation [1] → chunk_id matches a retrieved passage │
│ ✓ Citation [2] → chunk_id matches a retrieved passage │
│ ✓ Citation [3] → chunk_id matches a retrieved passage │
│ │
│ If ANY citation doesn't match a real chunk → answer REJECTED │
│ If NO citations present at all → answer REJECTED │
│ If too few citations for answer length → answer REJECTED │
│ │
│ Rejected answers trigger a re-prompt: "You must cite the source passages." │
│ │
│ This guards against fabricated sources: every citation must point to a │
│ passage that was actually retrieved from the document corpus. │
└──────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────────┐
│ 5. STREAMING (to the frontend) │
│ │
│ ┌────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│ │ FastAPI │ ──SSE──▶│ React Frontend │ ──JWT──▶│ Supabase Auth │ │
│ │ Backend │ stream│ (Vite + TS) │ auth │ (email/pw) │ │
│ └────────────┘ └──────────────────┘ └──────────────────┘ │
│ │ │
│ ├── "0:" = text chunk (streaming incremental text) │
│ └── "2:" = citations JSON (final citations with chunk_id + excerpt) │
│ │
│ The frontend renders text as it arrives (low latency), then appends │
│ clickable citation badges at the end. Each badge links to the source doc. │
└──────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ YOU SEE ON SCREEN: │
│ │
│ Q: What was Apple's Services revenue │
│ in 2025? │
│ │
│ A: Apple's Services revenue reached │
│ $25.1B in Q1 2025, up 14% YoY, driven │
│ by record App Store and Apple Music │
│ performance [1][2]. The Services segment │
│ now represents 24% of total revenue [3]. │
│ │
│ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │ [1] │ │ [2] │ │ [3] │ ← clickable │
│ └──────┘ └──────┘ └──────┘ │
│ │
│ [1] AAPL 10-K 2025-10-31 — "Management │
│ Discussion" — "Services revenue grew │
│ 14% to $25.1B in the first quarter..." │
└─────────────────────────────────────────┘
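The fusion step in stage 2 can be illustrated with a minimal sketch of Reciprocal Rank Fusion. The function name `rrf_fuse`, the constant `k = 60` (a conventional default for RRF), and the chunk IDs are assumptions for illustration, not taken from this repository's retrieval code.

```python
from collections import defaultdict


def rrf_fuse(
    rankings: list[list[str]], k: int = 60, top_n: int = 10
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every chunk it contains.
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


# Hypothetical chunk IDs returned by the two searches, best match first.
semantic = ["c7", "c2", "c9"]
full_text = ["c2", "c4", "c7"]

fused = rrf_fuse([semantic, full_text])
print([chunk_id for chunk_id, _ in fused])  # ['c2', 'c7', 'c4', 'c9']
```

Chunks that appear in both rankings (`c2`, `c7`) rise to the top, which is the behaviour the "Why hybrid?" note describes. RRF only uses rank positions, so the raw similarity and text-search scores never need to be put on a common scale.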
LLM and embeddings are decoupled. You can use any combination:
# Powerful reasoning LLM + cheap/efficient embedding provider
LLM_PROVIDER=anthropic # Claude Sonnet 4
EMBEDDING_PROVIDER=openai # text-embedding-3-small

# Fully local (no API keys needed)
LLM_PROVIDER=ollama # Llama 3.2 via Ollama
EMBEDDING_PROVIDER=ollama # nomic-embed-text via Ollama

# Same provider for both
LLM_PROVIDER=gemini # Gemini 2.0 Flash
EMBEDDING_PROVIDER=gemini # gemini-embedding-001

# Open-source LLM via OpenRouter + Gemini embeddings
LLM_PROVIDER=openrouter # Any OpenRouter model
EMBEDDING_PROVIDER=gemini # gemini-embedding-001

# Free local LLM + cloud embeddings
LLM_PROVIDER=lm_studio # Local model via LM Studio
EMBEDDING_PROVIDER=cohere # embed-english-v3.0

- 📄 SEC Filing Ingestion — Download, chunk, embed, and index filings from SEC EDGAR
- 🔍 Hybrid Retrieval — Semantic (pgvector) + full-text (Postgres tsvector) search fused via Reciprocal Rank Fusion
- 💬 Grounded Q&A — LLM answers must cite retrieved source chunks; answers without citations are rejected
- 🔗 Source Citations — Every claim links back to the exact passage, with ticker, filing date, and document metadata
- 🔐 Supabase Auth — Email-based authentication with JWT bearer tokens
- 💾 Chat History — Threaded conversations persisted in Postgres
- ⚡ Streaming Responses — Server-Sent Events with AI SDK data parts for text + citations
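The ingestion feature splits each filing into overlapping passages of roughly 500 tokens before embedding. The sketch below shows one way to produce such windows. It is an illustration only: whitespace-separated words stand in for real tokens, and the 50-token overlap and the name `chunk_tokens` are assumptions, since the overlap size used by the ingestion scripts is not stated here.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 500, overlap: int = 50
) -> list[list[str]]:
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Stand-in for a tokenized filing (assumption: real code would use a tokenizer).
words = [f"w{i}" for i in range(1200)]
chunks = chunk_tokens(words)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one passage, at the cost of storing some text twice.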
┌──────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Frontend │────▶│ Backend │────▶│ Database │
│ React SPA │ │ FastAPI + │ │ Postgres + │
│ Vite + TS │ │ PydanticAI │ │ pgvector │
│ Tailwind v4 │◀────│ │◀────│ │
└──────────────┘ └──────────────────┘ └─────────────────┘
│ │ │
│ │ ├─ document_chunks
│ │ │ (vectors + full-text)
│ │ ├─ threads
│ │ └─ messages
│ │
│ ├── Assistant Agent (PydanticAI)
│ │ ├── search_filings tool
│ │ ├── read_chunk tool
│ │ └── GroundedAnswer output
│ │
│ ├── Retrieval Pipeline
│ │ ├── Semantic search (pgvector)
│ │ ├── Full-text search (Postgres)
│ │ └── RRF fusion
│ │
│ ├── Grounding Validator
│ └── Embedding Provider (15 options)
| Layer | Technology | Purpose |
|---|---|---|
| Backend | Python 3.12+, FastAPI, PydanticAI | API server, agent orchestration |
| Frontend | React 19, TypeScript, Vite, Tailwind v4 | SPA user interface |
| Database | PostgreSQL 17 + pgvector | Relational storage + vector search |
| Auth | Supabase Auth (email/password) | Authentication + JWT |
| LLM | 14 providers via PydanticAI | Answer generation |
| Embeddings | 15 providers | Document vectorization |
| Orchestration | uv (packaging), pnpm (frontend), Docker | Dev & deployment tooling |
| Provider | Env Name | Embedding Support | Type |
|---|---|---|---|
| OpenAI | openai | ✅ | Native |
| Gemini | gemini | ✅ | Native |
| Anthropic | anthropic | ❌ | Native |
| Groq | groq | ❌ | Native |
| Mistral | mistral | ✅ | Native |
| Cohere | cohere | ✅ | Native |
| xAI (Grok) | xai | ❌ | Native |
| Cerebras | cerebras | ❌ | Native |
| AWS Bedrock | bedrock | ✅ | Native |
| OpenRouter | openrouter | ✅ | OpenAI-compat |
| NVIDIA | nvidia | ✅ | OpenAI-compat |
| Ollama | ollama | ✅ | OpenAI-compat |
| LM Studio | lm_studio | ✅ | OpenAI-compat |
| HuggingFace | huggingface | ✅ | OpenAI-compat |
| Provider | Env Name | Dimensions | Approach |
|---|---|---|---|
| OpenAI | openai | 1536–3072 | Native SDK |
| Gemini | gemini | 768–3072 | Native SDK |
| Cohere | cohere | 1024+ | Native SDK |
| VoyageAI | voyageai | 768–1024 | Native SDK |
| Sentence-Transformers | sentence_transformers | 384+ | Local model |
| AWS Bedrock | bedrock | 1024+ | Boto3 |
| HuggingFace | huggingface | 384–4096 | OpenAI-compat |
| OpenRouter | openrouter | 1536–4096 | OpenAI-compat |
| NVIDIA | nvidia | 1024–4096 | OpenAI-compat |
| Ollama | ollama | 384–4096 | OpenAI-compat |
| LM Studio | lm_studio | varies | OpenAI-compat |
| Mistral | mistral | 1024 | Native SDK |
| Together AI | together | 768–1024 | OpenAI-compat |
| Fireworks AI | fireworks | 768–4096 | OpenAI-compat |
| Perplexity | perplexity | 1024–2560 | Base64-decode |
- Python 3.12+
- Node.js 22+
- pnpm 11.5+
- Docker Desktop (for local Postgres)
- A Supabase project (free tier)
git clone <repo-url> && cd document-copilot

# Copy environment and edit with your keys
cp .env.example .env

Required config in .env:

SUPABASE_URL=https://your-project-ref.supabase.co
SUPABASE_ANON_KEY=your-anon-key
SUPABASE_SERVICE_ROLE_KEY=your-service-role-key

Pick your LLM + embedding provider (e.g. Gemini with Gemini embeddings):
LLM_PROVIDER=gemini
GEMINI_API_KEY=your-gemini-api-key
EMBEDDING_PROVIDER=gemini
GEMINI_EMBEDDING_MODEL=gemini-embedding-001

Start the database:

docker compose up -d db

Start the backend:

cd backend
uv sync
uv run alembic upgrade head
uv run uvicorn app.main:app --reload --port 8000

The API is available at http://localhost:8000. Health check: GET /health.

Start the frontend:
cd frontend
pnpm install
pnpm dev

The app is available at http://localhost:5173.
Download SEC filings:
uv run python data/download.py

This downloads 10-Ks for AAPL, MSFT, NVDA, AMZN, GOOGL, and more into data/payloads/.
Ingest a filing:
uv run python backend/ingest/ingest.py --dir data/payloads/AAPL_2025-10-31

Run the tests:

cd backend
uv run pytest tests/ -v

Start all three services (db, backend, frontend):
docker compose up --build

- Postgres: localhost:5432
- Backend: http://localhost:8000
- Frontend: http://localhost:5173
# backend/Dockerfile
FROM ghcr.io/astral-sh/uv:python3.12-bookworm-slim
ENV UV_PROJECT_ENVIRONMENT=/opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY . .
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Required environment variables:
| Variable | Description |
|---|---|
| SUPABASE_URL | Supabase project URL |
| SUPABASE_ANON_KEY | Supabase anon key |
| SUPABASE_SERVICE_ROLE_KEY | Supabase service role key |
| DATABASE_URL | Postgres connection string |
| LLM_PROVIDER | LLM provider name |
| GEMINI_API_KEY (or your provider's key) | LLM API key |
| EMBEDDING_PROVIDER | Embedding provider (optional, defaults to LLM) |
| ALLOWED_ORIGINS | CORS origins (comma-separated) |
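Because EMBEDDING_PROVIDER is optional and falls back to the LLM provider, a configuration that names an LLM without embedding support (for example anthropic) needs an explicit embedding provider. The sketch below shows one way such a fallback could be resolved and checked. The function name `resolve_embedding_provider` and the choice to raise an error are assumptions for illustration; the actual behaviour lives in the project's config module and may differ.

```python
# Env names of the 15 embedding providers listed in this README.
EMBEDDING_PROVIDERS = {
    "openai", "gemini", "cohere", "voyageai", "sentence_transformers",
    "bedrock", "huggingface", "openrouter", "nvidia", "ollama",
    "lm_studio", "mistral", "together", "fireworks", "perplexity",
}


def resolve_embedding_provider(env: dict[str, str]) -> str:
    """Return the embedding provider, defaulting to LLM_PROVIDER when unset."""
    llm = env.get("LLM_PROVIDER", "").strip().lower()
    if not llm:
        raise ValueError("LLM_PROVIDER is required")
    # Hypothetical fallback rule: an empty or missing value means "same as the LLM".
    provider = env.get("EMBEDDING_PROVIDER", "").strip().lower() or llm
    if provider not in EMBEDDING_PROVIDERS:
        raise ValueError(
            f"{provider!r} has no embedding support; set EMBEDDING_PROVIDER explicitly"
        )
    return provider


print(resolve_embedding_provider({"LLM_PROVIDER": "gemini"}))  # gemini
print(resolve_embedding_provider(
    {"LLM_PROVIDER": "anthropic", "EMBEDDING_PROVIDER": "openai"}
))  # openai
```

Calling it with only `LLM_PROVIDER=anthropic` raises a `ValueError` in this sketch, which surfaces the misconfiguration at startup rather than at the first embedding request.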
# frontend/Dockerfile
FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json pnpm-lock.yaml .npmrc ./
RUN corepack enable && pnpm install --frozen-lockfile
COPY . .
RUN pnpm build
FROM nginx:alpine
COPY --from=builder /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/conf.d/default.conf
CMD ["nginx", "-g", "daemon off;"]

The nginx config proxies /api/ requests to the backend service:
location /api/ {
proxy_pass http://backend:8000;
}

Required environment variables:
| Variable | Description |
|---|---|
| VITE_SUPABASE_URL | Supabase project URL |
| VITE_SUPABASE_ANON_KEY | Supabase anon key |
| VITE_API_BASE_URL | Backend API URL |
- Create a Supabase project at supabase.com
- Enable email auth in Authentication → Settings
- Run the database migrations:
cd backend
uv run alembic upgrade head

Supabase is used only for authentication. All document and chat data lives in your own Postgres database.
document-copilot/
├── .env # Local configuration
├── .env.example # Documented config template
├── docker-compose.yml # Dev environment (db + backend + frontend)
├── data/
│ ├── download.py # SEC EDGAR filing downloader
│ └── payloads/ # Downloaded filings (gitignored)
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI entrypoint
│ │ ├── config.py # Pydantic settings (all env vars)
│ │ ├── embeddings.py # 15 embedding provider implementations
│ │ ├── api/chat.py # Chat REST endpoints
│ │ ├── assistant/
│ │ │ ├── agent.py # PydanticAI agent factory (14 LLMs)
│ │ │ ├── deps.py # Agent dependencies
│ │ │ ├── outputs.py # GroundedAnswer + Citation models
│ │ │ └── instructions.md # System prompt
│ │ ├── auth/ # Supabase JWT verification
│ │ ├── chat/orchestrator.py # Retrieval → LLM → grounding → stream
│ │ ├── database/ # SQLAlchemy models, connection, repos
│ │ ├── grounding/ # Citation validation
│ │ ├── retrieval/ # pgvector + full-text + RRF fusion
│ │ └── prompts/ # Prompt templates
│ ├── ingest/ # Ingestion scripts (Markdown → chunks → embed → DB)
│ ├── tests/ # 17 unit tests
│ ├── Dockerfile.dev
│ ├── pyproject.toml
│ └── AGENTS.md
├── frontend/
│ ├── src/
│ │ ├── App.tsx # Router
│ │ ├── main.tsx # Entry point
│ │ ├── components/ # React components
│ │ ├── pages/ # Route-level pages
│ │ ├── hooks/ # Custom hooks
│ │ └── lib/ # API client, auth, env config
│ ├── Dockerfile # Production (nginx)
│ ├── Dockerfile.dev
│ ├── nginx.conf
│ ├── vite.config.ts
│ └── AGENTS.md
└── AGENTS.md # Agent instructions for codegen tools
GET /health returns {"status": "ok"}.
Stream a chat turn. Requires a Supabase JWT in the Authorization: Bearer <token> header.
Request:
{
"threadId": "uuid",
"messages": [{"role": "user", "content": "What was Apple's revenue in 2025?"}]
}

Response: Server-Sent Events stream:

- 0:{text_chunk} (streaming text)
- 2:{citations_json} (final citations with chunk IDs and excerpts)
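A client can reassemble the answer by dispatching on the part prefix. The following is a minimal sketch under two assumptions that this README does not confirm: each payload after the prefix is JSON-encoded (a JSON string for `0:` parts, a JSON array for `2:` parts), and any SSE framing such as a `data:` field name has already been stripped. The helper names and the sample lines are hypothetical.

```python
import json

PART_KINDS = {"0": "text", "2": "citations"}


def parse_stream_line(line: str) -> tuple[str, object]:
    """Split one stream part into (kind, decoded payload)."""
    prefix, sep, payload = line.rstrip("\r\n").partition(":")
    if not sep:
        raise ValueError(f"not a stream part: {line!r}")
    if prefix not in PART_KINDS:
        return ("unknown", payload)  # leave unrecognized parts undecoded
    return (PART_KINDS[prefix], json.loads(payload))


def collect(lines: list[str]) -> tuple[str, list[dict]]:
    """Concatenate text parts and gather citations from a finished stream."""
    text_parts: list[str] = []
    citations: list[dict] = []
    for line in lines:
        if not line.strip():
            continue  # skip blank separator lines
        kind, data = parse_stream_line(line)
        if kind == "text":
            text_parts.append(data)
        elif kind == "citations":
            citations.extend(data)
    return "".join(text_parts), citations


# Hypothetical stream contents for illustration.
sample = [
    '0:"Services revenue grew "',
    '0:"14% [1]."',
    '2:[{"chunk_id": "c2", "excerpt": "Services revenue grew 14%..."}]',
]
answer, citations = collect(sample)
print(answer)  # Services revenue grew 14% [1].
```

In a real frontend the text parts would be rendered as they arrive rather than collected first; the sketch only shows the dispatch on prefixes.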
| Method | Path | Description |
|---|---|---|
| GET | /api/threads | List user's threads |
| POST | /api/threads | Create new thread |
| GET | /api/threads/{id} | Get thread details |
| PATCH | /api/threads/{id} | Update thread title |
| DELETE | /api/threads/{id} | Delete thread + messages |
| GET | /api/threads/{id}/messages | List messages in thread |
Every answer must cite the specific source passage it came from. The system:
- Retrieves relevant chunks via hybrid search (semantic + full-text + RRF)
- Generates an answer with citations using the LLM
- Validates every citation maps to a retrieved chunk
- Rejects ungrounded claims — the LLM is re-prompted if citations are missing or fabricated
Citations link to the exact chunk with ticker, filing type, filing date, and a direct excerpt.
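The validation rules above can be sketched as a small checker. This is an illustration under stated assumptions, not the project's grounding module: the `Citation` shape, the function name `validate_grounding`, and the length rule (at least one distinct citation per 400 characters of answer) are hypothetical, since the README does not say how "too few citations" is measured.

```python
import re
from dataclasses import dataclass

CITATION_RE = re.compile(r"\[(\d+)\]")


@dataclass
class Citation:
    index: int     # the n in "[n]"
    chunk_id: str  # ID of the passage the model claims to cite


def validate_grounding(
    answer: str,
    citations: list[Citation],
    retrieved_chunk_ids: set[str],
    chars_per_citation: int = 400,  # assumed threshold, not from the source
) -> list[str]:
    """Return a list of problems; an empty list means the answer is accepted."""
    used = {int(n) for n in CITATION_RE.findall(answer)}
    if not used:
        return ["no citations present"]
    problems: list[str] = []
    by_index = {c.index: c for c in citations}
    for n in sorted(used):
        citation = by_index.get(n)
        if citation is None:
            problems.append(f"[{n}] has no citation entry")
        elif citation.chunk_id not in retrieved_chunk_ids:
            problems.append(f"[{n}] cites unknown chunk {citation.chunk_id!r}")
    if len(used) < max(1, len(answer) // chars_per_citation):
        problems.append("too few citations for answer length")
    return problems


retrieved = {"c2", "c4", "c7"}  # hypothetical IDs returned by retrieval
ok = validate_grounding(
    "Services revenue grew 14% [1][2].",
    [Citation(1, "c2"), Citation(2, "c7")],
    retrieved,
)
bad = validate_grounding(
    "Services revenue grew 14% [1][2].",
    [Citation(1, "c2"), Citation(2, "c99")],
    retrieved,
)
print(ok)   # []
print(bad)  # ["[2] cites unknown chunk 'c99'"]
```

A non-empty result would trigger the re-prompt. Note that a check of this kind confirms that each cited chunk was really retrieved; it does not by itself confirm that the chunk's text supports the claim.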