acb-code/data-droid

Data Droid

FastAPI + Streamlit RAG demo for small consultancies. Upload PDFs and Office documents, chunk and embed them, store the vectors in Qdrant, and ask grounded questions with citations.

Features

  • Document Upload: Upload PDFs, Word docs, PowerPoint, and text files
  • RAG Chat: Ask questions grounded in your uploaded documents with source citations
  • Section Generator: Generate proposal sections (Executive Summary, Methodology, etc.) using context from past projects
  • Staff Resume Generator: Generate tailored experience summaries for project proposals
    • Manage a staff directory in Settings
    • Select team members for each project
    • Customize roles and generate 4-6 sentence experience paragraphs
  • Multilingual UI: Interface available in English, German, and French
  • Configurable Profiles: Industry-specific configurations for engineering, consulting, etc.
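
As an illustration of what "grounded in your uploaded documents with source citations" can look like at the prompt level, here is a minimal sketch. It is not taken from the repository: the `Snippet` shape, the `build_grounded_prompt` helper, the prompt wording, and the second file name are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    """A retrieved chunk plus the document it came from (hypothetical shape)."""
    source: str  # file name of the originating document
    text: str    # chunk text returned by retrieval


def build_grounded_prompt(question: str, snippets: list[Snippet]) -> str:
    """Number each snippet so the model can cite it as [1], [2], ..."""
    context = "\n".join(
        f"[{i}] ({s.source}) {s.text}" for i, s in enumerate(snippets, start=1)
    )
    return (
        "Answer using only the context below and cite sources as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Illustrative snippets; the texts are invented for the demo.
snippets = [
    Snippet("Project_X_Report_2023.pdf", "Soft clay required piled foundations."),
    Snippet("Site_Visit_Notes.docx", "Groundwater was observed at 2 m depth."),
]
prompt = build_grounded_prompt("What were the geotechnical risks?", snippets)
print(prompt)
```

Numbering the snippets in the prompt makes it straightforward to map a `[n]` in the answer back to a file name when rendering citations.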

Architecture

  • Backend: FastAPI (backend/app.py) for upload, chat, section generation, and resume generation.
  • LLM Provider: Supports OpenAI (default) or Google Gemini, configurable via LLM_PROVIDER env var. Provider abstraction in backend/llm/.
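  • Embeddings + retrieval: Provider-specific embeddings stored in Qdrant (backend/embeddings/embedder.py). The collection is recreated on every backend start, which clears previously indexed documents.
  • Vector store: Qdrant with a named volume qdrant_storage. The volume survives container restarts, but because the backend recreates the collection at startup, indexed data currently lasts only until the next backend start.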
  • Frontend: Streamlit (frontend/streamlit_app.py) with pages for Upload, Library, Search, Chat, Section Generator, Staff Resumes, and Settings.
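
The embed-then-retrieve flow described above can be sketched in a few lines. This is a toy illustration only: a word-count vector stands in for the provider embeddings, an in-memory list stands in for Qdrant, and the function names and sample chunks are assumptions, not the code in backend/embeddings/embedder.py.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real setup would call the provider API."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[tuple[str, str]], k: int = 2) -> list[tuple[float, str, str]]:
    """Return the k highest-scoring (score, source, text) triples for the query."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), source, text) for source, text in chunks]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:k]


# Invented (source, text) chunks standing in for indexed documents.
chunks = [
    ("report.pdf", "The retaining wall needs monitoring of settlement and tilt."),
    ("handbook.docx", "Change orders require approval from the project lead."),
    ("proposal.pptx", "Noise mitigation measures were planned for night work."),
]
results = top_k("Who approves change orders?", chunks, k=1)
print(results[0][1])
```

Only the top-k chunks are passed on to the LLM, which is what keeps each request small regardless of how many documents are indexed.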

Quickstart (Docker)

Prereqs: Docker + Docker Compose, and an API key for your chosen provider (OpenAI or Gemini).

  1. Create .env in repo root:
# LLM Provider: "openai" (default) or "gemini"
LLM_PROVIDER=openai

# OpenAI settings (required if LLM_PROVIDER=openai)
OPENAI_API_KEY=sk-...

# Gemini settings (required if LLM_PROVIDER=gemini)
GEMINI_API_KEY=...

# Override if running qdrant elsewhere
# QDRANT_URL=http://localhost:6333
  2. Build and run:
# Using OpenAI (default)
docker compose up --build

# Using Gemini
LLM_PROVIDER=gemini docker compose up --build
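
The README does not show how the backend reads these variables. As a hedged sketch of the rule stated in the .env comments (the API key that matches `LLM_PROVIDER` is required, and `openai` is the default), a startup check could look like the following. `load_llm_settings` is a hypothetical helper, not a function from the repository.

```python
import os
from typing import Mapping

# Which API key each provider requires, per the .env comments above.
KEY_NAMES = {"openai": "OPENAI_API_KEY", "gemini": "GEMINI_API_KEY"}


def load_llm_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Pick the provider (default: openai) and fail early if its key is missing."""
    provider = env.get("LLM_PROVIDER", "openai").strip().lower()
    if provider not in KEY_NAMES:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
    key_name = KEY_NAMES[provider]
    api_key = env.get(key_name, "")
    if not api_key:
        raise ValueError(f"{key_name} must be set when LLM_PROVIDER={provider}")
    return {"provider": provider, "api_key": api_key}


# Demo with an explicit mapping instead of the real environment.
settings = load_llm_settings({"LLM_PROVIDER": "gemini", "GEMINI_API_KEY": "demo-key"})
print(settings["provider"])
```

Failing at startup with a clear message is usually easier to debug than an authentication error on the first chat request.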

Example Configurations

Data Droid includes example configurations for different industries:

Default Configuration

docker compose up --build
# or explicitly:
FRONTEND_CONFIG=example_configs/default.json docker compose up --build
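
How the frontend consumes `FRONTEND_CONFIG` is not documented here. The sketch below only illustrates the fallback implied by the two commands above (no variable set means example_configs/default.json); the helper names and the placeholder JSON key are assumptions, and the real config schema is not shown in this README.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Mapping

DEFAULT_CONFIG = "example_configs/default.json"


def resolve_config_path(env: Mapping[str, str] = os.environ) -> Path:
    """Use FRONTEND_CONFIG when set, otherwise fall back to the default config."""
    return Path(env.get("FRONTEND_CONFIG") or DEFAULT_CONFIG)


def load_frontend_config(path: Path) -> dict[str, Any]:
    """Read a JSON config file and require a top-level object."""
    with path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    if not isinstance(config, dict):
        raise ValueError(f"{path} must contain a JSON object")
    return config


# Demo against a temporary file; "example_key" is a placeholder, not a real setting.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.json"
    sample.write_text(json.dumps({"example_key": "example_value"}), encoding="utf-8")
    loaded = load_frontend_config(resolve_config_path({"FRONTEND_CONFIG": str(sample)}))
print(loaded)
```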

Bridge Engineering (Ironspan Consultants)

# Use bridge engineering config with English/French support
FRONTEND_CONFIG=example_configs/bridge_eng.json docker compose up --build

# Seed example documents
uv run python examples/bridge_eng/seed_docs.py

AlpenBau (Civil Engineering - German/English)

# Use AlpenBau config with German/English support
FRONTEND_CONFIG=example_configs/alpenbau.json docker compose up --build

# Seed example documents
uv run python examples/alpenbau/seed_docs.py

See examples/*/README.md for demo questions and document descriptions.

  3. Stop containers (keep data):
docker compose down
  4. Stop and remove Qdrant data:
docker compose down -v
  • Qdrant data lives in the Docker volume qdrant_storage (e.g., /var/lib/docker/volumes/qdrant_storage/_data on Linux; Compose may prefix the volume name with the project name).
  • Note: backend/embeddings/embedder.py currently calls qdrant.recreate_collection(...) at startup, which clears the collection on each backend container start.
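
If keeping indexed documents across backend restarts is wanted, one possible alternative to recreating the collection is to create it only when it is missing. The sketch below is an assumption, not the repository's code: `ensure_collection` is a hypothetical helper, and `FakeClient` is an in-memory stand-in so the example runs without a Qdrant server. It mirrors the `collection_exists` and `create_collection` methods that recent qdrant-client releases provide; with the real client, `vectors_config` would be a `VectorParams` object rather than a dict.

```python
class FakeClient:
    """In-memory stand-in for a Qdrant client, for demonstration only."""

    def __init__(self):
        self.collections = {}

    def collection_exists(self, collection_name):
        return collection_name in self.collections

    def create_collection(self, collection_name, vectors_config):
        self.collections[collection_name] = {"config": vectors_config, "points": []}


def ensure_collection(client, collection_name, vectors_config) -> bool:
    """Create the collection only when it is missing. Returns True if it was created."""
    if client.collection_exists(collection_name):
        return False
    client.create_collection(collection_name=collection_name, vectors_config=vectors_config)
    return True


# The collection name and vector size are placeholders for the demo.
client = FakeClient()
config = {"size": 4, "distance": "Cosine"}
first = ensure_collection(client, "documents", config)
client.collections["documents"]["points"].append("chunk-1")
second = ensure_collection(client, "documents", config)  # simulates a backend restart
print(first, second, client.collections["documents"]["points"])
```

The trade-off is that a create-if-missing startup keeps stale vectors around when the embedding model or vector size changes, so a deliberate reset path (such as `docker compose down -v`) is still needed.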

Tech stack & references

Value for small consultancies (your pitch)

  • Private knowledge: answers come from their project docs/proposals/wiki; data stays in their environment vs. pasting into public ChatGPT/Gemini.
  • Grounded, auditable answers: every reply cites source docs (“Project_X_Report_2023.pdf, section 3.2”) so they can click/verify.
  • Firm-specific workflows: pre-built flows like “draft proposal,” “summarize site visit,” “answer client question based on project history” (1–2 clicks, no prompt engineering).
  • Consistent, up-to-date internal knowledge: indexes their latest file shares; avoids stale web data.
  • Multilingual/local context (Zurich/Paris): German/French/English in one workspace; uses their terminology and templates.

Demo concepts (Zurich)

  • A) Project history co-pilot: “For the Bahnhofstrasse upgrade project, what were the main geotechnical risks and mitigations?” → concise answer + 2–3 citations. You say: “ChatGPT can’t know this; it’s from your internal reports.”
  • B) Proposal drafting assistant: “Draft a 1-page project description in German… emphasizing sustainability and noise mitigation.” → finds similar past projects, drafts in German, cites sources; highlight tone reuse and multilingual capability.
  • C) Internal handbook Q&A: “What’s our standard approval process for change orders above CHF 100k?” → 3–5 steps with links to the handbook; stress instant, reliable answers with sources.

Demo concepts (Paris civil engineering)

  • D) Precedent finder: “Shallow foundations on soft clay near the Seine—show similar past projects and what solutions we used.” → returns 2–3 precedents with links; emphasize surfacing institutional precedent.
  • E) Norms navigator: “For retaining walls >4 m, what do our guidelines say about safety factors and monitoring?” → synthesizes internal guidelines + Eurocode snippets with citations; generic GPT can’t cite their adaptations.
  • F) Site report summarizer (FR → EN): paste a rough French site note/email → English client-ready summary with actions/risks; keeps technical meaning.

How to showcase vs. pure ChatGPT/Gemini

  • Side-by-side: ask the same question (“key open issues on project X”) in ChatGPT/Gemini and in your app; show citations and grounded answers.
  • 5-minute live script: current workflow (SharePoint hunting) → new workflow (ask, draft, clarify) with speed/trust/privacy emphasis.
  • Visible “RAG-ness”: show sources sidebar, clickable documents, filters (project/client/date) so it feels like an internal copilot.

RAG vs. “just upload PDFs to ChatGPT/Gemini”

  • ChatGPT/Gemini uploads: great for 1–20 small PDFs; limited by context window; no persistence or metadata filters; uploads are gone once the chat closes; privacy depends on provider.
  • This RAG: indexes once, searches many; persistent knowledge base; filters/metadata possible; enforced citations/grounding; only top-k snippets per query go to the LLM. Embeddings use OpenAI or Gemini APIs (document text leaves your infrastructure at embed time); you can swap to local embeddings if needed.
  • Scale guidance: tens of docs (fast pilots); hundreds (sweet spot for most firms); thousands (needs durable vector store + ingestion pipeline). Qdrant already covers persistence—no extra SQLite/Postgres needed for 1k–10k chunks; consider heavier DB only when you need multi-tenant corpora or >100k chunks.
  • Limitations: bad scans/images need OCR; chunking trade-offs (chunks too big → context overflow, too small → lost context and weak answers); bulk reindexing of 10k PDFs takes time; if content isn’t in the docs, the model can’t invent it (good for trust).
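
The chunk-size trade-off mentioned above is often softened by overlapping consecutive chunks, so that a sentence cut at a boundary still appears whole in one of them. The following is a minimal word-based sketch under that assumption. The repository's actual chunking strategy is not described in this README, and `chunk_words` with its default sizes is hypothetical.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Ten placeholder words, chunks of four words with one word of overlap.
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, size=4, overlap=1)
print(chunks)
```

Larger `size` values give the model more context per snippet but consume more of the prompt budget per retrieved chunk; larger `overlap` values reduce boundary losses at the cost of more vectors to store.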

About

Internal knowledge storage, search, and interface with RAG for consultancies and small businesses
