GitHub - acb-code/data-droid: Internal knowledge storage, search, and interface with RAG for consultancies and small businesses

Data Droid

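A FastAPI + Streamlit retrieval-augmented generation (RAG) demo for small consultancies. Upload PDFs and Office documents, split them into chunks, embed the chunks, store the vectors in Qdrant, and ask grounded questions that come back with source citations.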

Features

Document Upload: Upload PDFs, Word docs, PowerPoint, and text files
RAG Chat: Ask questions grounded in your uploaded documents with source citations
Section Generator: Generate proposal sections (Executive Summary, Methodology, etc.) using context from past projects
Staff Resume Generator: Generate tailored experience summaries for project proposals
- Manage a staff directory in Settings
- Select team members for each project
- Customize roles and generate 4-6 sentence experience paragraphs
Multilingual UI: Interface available in English, German, and French
Configurable Profiles: Industry-specific configurations for engineering, consulting, etc.

Architecture

Backend: FastAPI (backend/app.py) for upload, chat, section generation, and resume generation.
LLM Provider: Supports OpenAI (default) or Google Gemini, configurable via LLM_PROVIDER env var. Provider abstraction in backend/llm/.
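Embeddings + retrieval: Provider-specific embeddings stored in Qdrant (backend/embeddings/embedder.py). The collection is currently recreated on every backend start, so previously indexed documents have to be uploaded again after a restart.
Vector store: Qdrant backed by the named Docker volume qdrant_storage. The volume itself persists until it is removed, but because the backend recreates the collection at startup, indexed vectors do not currently survive a backend restart.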
Frontend: Streamlit (frontend/streamlit_app.py) with pages for Upload, Library, Search, Chat, Section Generator, Staff Resumes, and Settings.
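To make the "embeddings + retrieval" step concrete, here is a minimal in-memory sketch of top-k similarity search. In the app this work is done by Qdrant; the function names and the choice of cosine similarity are assumptions made for illustration, not a description of backend/embeddings/embedder.py.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=3):
    """Return the k best (score, chunk_text) pairs.

    indexed_chunks is a list of (chunk_text, vector) tuples; a hypothetical
    stand-in for what the vector store holds.
    """
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in indexed_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Only the chunks returned by a search like this are passed to the LLM, which is what keeps prompts small regardless of how many documents are indexed.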

Quickstart (Docker)

Prerequisites: Docker with Docker Compose, and an API key for your chosen provider (OpenAI or Gemini).

Create .env in repo root:

# LLM Provider: "openai" (default) or "gemini"
LLM_PROVIDER=openai

# OpenAI settings (required if LLM_PROVIDER=openai)
OPENAI_API_KEY=sk-...

# Gemini settings (required if LLM_PROVIDER=gemini)
GEMINI_API_KEY=...

# Override if running qdrant elsewhere
# QDRANT_URL=http://localhost:6333

Build and run:

# Using OpenAI (default)
docker compose up --build

# Using Gemini
LLM_PROVIDER=gemini docker compose up --build

Backend: http://localhost:8000
Frontend (Streamlit): http://localhost:8501
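The .env rules above (a provider name plus the matching API key) could be checked at startup along these lines. This is a sketch only: the Settings class, load_settings, and the fallback Qdrant URL are hypothetical and are not taken from the backend code.

```python
import os
from dataclasses import dataclass

# Which API key each provider needs, per the .env comments in this README.
REQUIRED_KEY = {"openai": "OPENAI_API_KEY", "gemini": "GEMINI_API_KEY"}

# Assumption: the real default is whatever the repo's compose setup provides.
DEFAULT_QDRANT_URL = "http://localhost:6333"


@dataclass(frozen=True)
class Settings:
    llm_provider: str
    api_key: str
    qdrant_url: str


def load_settings(env=None):
    """Read provider settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "openai").strip().lower()
    if provider not in REQUIRED_KEY:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
    key_name = REQUIRED_KEY[provider]
    api_key = env.get(key_name, "")
    if not api_key:
        raise ValueError(f"{key_name} must be set when LLM_PROVIDER={provider}")
    return Settings(provider, api_key, env.get("QDRANT_URL", DEFAULT_QDRANT_URL))
```

Failing fast on a missing key gives a clearer error than a rejected API call on the first chat request.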

Example Configurations

Data Droid includes example configurations for different industries:

Default Configuration

docker compose up --build
# or explicitly:
FRONTEND_CONFIG=example_configs/default.json docker compose up --build

Bridge Engineering (Ironspan Consultants)

# Use bridge engineering config with English/French support
FRONTEND_CONFIG=example_configs/bridge_eng.json docker compose up --build

# Seed example documents
uv run python examples/bridge_eng/seed_docs.py

AlpenBau (Civil Engineering - German/English)

# Use AlpenBau config with German/English support
FRONTEND_CONFIG=example_configs/alpenbau.json docker compose up --build

# Seed example documents
uv run python examples/alpenbau/seed_docs.py

See examples/*/README.md for demo questions and document descriptions.

Stop containers (keep the Qdrant volume):

docker compose down

Stop and remove Qdrant data:

docker compose down -v

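Qdrant data lives in the Docker volume qdrant_storage. On a default Linux install that is a directory such as /var/lib/docker/volumes/qdrant_storage/_data; unless the compose file sets an explicit volume name, Docker Compose prefixes it with the project name, so run docker volume ls to see the exact name.
Note: backend/embeddings/embedder.py currently calls qdrant.recreate_collection(...) at startup, which clears the collection on each backend container start. Uploaded documents therefore need to be indexed again after a restart, even though the volume persists.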
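If you want indexed documents to survive restarts, one option is to create the collection only when it is missing instead of recreating it. The helper below is a hedged sketch, not the repo's code: ensure_collection is a hypothetical name, and the collection name and vector size in the usage comment are placeholders. It relies on the qdrant-client methods collection_exists and create_collection.

```python
def ensure_collection(client, name, vectors_config):
    """Create the collection only if it is missing.

    Returns True if the collection was created, False if it already existed.
    `client` is expected to behave like qdrant_client.QdrantClient.
    """
    if client.collection_exists(name):
        return False
    client.create_collection(collection_name=name, vectors_config=vectors_config)
    return True


# Possible usage with the real client (not executed here; the collection
# name and vector size are placeholders):
#
#   from qdrant_client import QdrantClient
#   from qdrant_client.models import Distance, VectorParams
#
#   client = QdrantClient(url="http://localhost:6333")
#   ensure_collection(
#       client, "documents", VectorParams(size=1536, distance=Distance.COSINE)
#   )
```

Keep in mind that a persistent collection has to match the embedding dimension of the active provider, so switching LLM_PROVIDER may still require a rebuild.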

Tech stack & references

FastAPI: https://fastapi.tiangolo.com
Streamlit: https://docs.streamlit.io
Qdrant: https://qdrant.tech/documentation
OpenAI SDK: https://platform.openai.com/docs/api-reference
Google GenAI SDK: https://googleapis.github.io/python-genai/
Unstructured: https://unstructured.io
PyPDF: https://pypdf.readthedocs.io
Docker Compose: https://docs.docker.com/compose

Value for small consultancies (your pitch)

Private knowledge: answers come from their project docs/proposals/wiki; data stays in their environment vs. pasting into public ChatGPT/Gemini.
Grounded, auditable answers: every reply cites source docs (“Project_X_Report_2023.pdf, section 3.2”) so they can click/verify.
Firm-specific workflows: pre-built flows like “draft proposal,” “summarize site visit,” “answer client question based on project history” (1–2 clicks, no prompt engineering).
Consistent, up-to-date internal knowledge: indexes their latest file shares; avoids stale web data.
Multilingual/local context (Zurich/Paris): German/French/English in one workspace; uses their terminology and templates.
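The "grounded, auditable answers" point comes down to how retrieved snippets are handed to the LLM. A minimal sketch of a citation-aware prompt builder follows; RetrievedChunk, build_grounded_prompt, and the prompt wording are illustrative assumptions, not the prompts used in backend/.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    source: str   # file name of the originating document
    section: str  # section label inside that document
    text: str     # the retrieved snippet


def build_grounded_prompt(question, chunks):
    """Number each snippet so the model can cite it as [n] in its answer."""
    lines = [
        "Answer using only the numbered sources below.",
        "Cite sources as [n]. If the sources do not contain the answer, say so.",
        "",
    ]
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[{i}] {chunk.source}, section {chunk.section}: {chunk.text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```

Because each [n] maps back to a file name and section, the UI can turn citations in the answer into clickable links to the source document.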

Demo concepts (Zurich)

A) Project history co-pilot: “For the Bahnhofstrasse upgrade project, what were the main geotechnical risks and mitigations?” → concise answer + 2–3 citations. You say: “ChatGPT can’t know this; it’s from your internal reports.”
B) Proposal drafting assistant: “Draft a 1-page project description in German… emphasizing sustainability and noise mitigation.” → finds similar past projects, drafts in German, cites sources; highlight tone reuse and multilingual capability.
C) Internal handbook Q&A: “What’s our standard approval process for change orders above CHF 100k?” → 3–5 steps with links to the handbook; stress instant, reliable answers with sources.

Demo concepts (Paris civil engineering)

D) Precedent finder: “Shallow foundations on soft clay near the Seine—show similar past projects and what solutions we used.” → returns 2–3 precedents with links; emphasize surfacing institutional precedent.
E) Norms navigator: “For retaining walls >4 m, what do our guidelines say about safety factors and monitoring?” → synthesizes internal guidelines + Eurocode snippets with citations; generic GPT can’t cite their adaptations.
F) Site report summarizer (FR → EN): paste a rough French site note/email → English client-ready summary with actions/risks; keeps technical meaning.

How to showcase vs. pure ChatGPT/Gemini

Side-by-side: ask the same question (“key open issues on project X”) in plain ChatGPT/Gemini and in your app; point out the citations and grounded answers in the app's response.
5-minute live script: current workflow (SharePoint hunting) → new workflow (ask, draft, clarify) with speed/trust/privacy emphasis.
Visible “RAG-ness”: show sources sidebar, clickable documents, filters (project/client/date) so it feels like an internal copilot.
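The project/client/date filters mentioned above amount to narrowing the candidate documents by metadata before or during retrieval. A plain-Python sketch of that idea is shown here; the DocMeta fields and filter_docs are hypothetical, and in practice the same conditions would more likely be expressed as a filter on the vector store query.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DocMeta:
    filename: str
    project: str
    client: str
    doc_date: date


def filter_docs(docs, project=None, client=None, since=None):
    """Keep documents matching every filter that is set (None means 'any')."""
    kept = []
    for doc in docs:
        if project is not None and doc.project != project:
            continue
        if client is not None and doc.client != client:
            continue
        if since is not None and doc.doc_date < since:
            continue
        kept.append(doc)
    return kept
```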

RAG vs. “just upload PDFs to ChatGPT/Gemini”

ChatGPT/Gemini uploads: work well for roughly 1–20 small PDFs; limited by the context window; no persistent index or metadata filters; uploads are tied to a single chat and are no longer available once it closes; privacy depends on the provider.
This RAG: indexes once, searches many times; persistent knowledge base; filters/metadata possible; enforced citations/grounding; only the top-k snippets per query go to the LLM. Embeddings use the OpenAI or Gemini APIs, so document text leaves your infrastructure at embed time; you can swap to local embeddings if needed.
Scale guidance: tens of docs (fast pilots); hundreds (sweet spot for most firms); thousands (needs durable vector store + ingestion pipeline). Qdrant already covers persistence—no extra SQLite/Postgres needed for 1k–10k chunks; consider heavier DB only when you need multi-tenant corpora or >100k chunks.
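Limitations: poor scans and images need OCR; chunk size is a trade-off (too large → chunks overflow the context window, too small → chunks lose context and answers get weaker); bulk reindexing of 10k PDFs takes time; if content isn’t in the docs, there is nothing to retrieve, so a grounded answer should say so rather than invent one (good for trust).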
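The chunking trade-off is easier to reason about with a concrete splitter. Below is a minimal word-based sketch with overlap between neighbouring chunks; chunk_text and its default sizes are assumptions for illustration, and the repo's actual chunking strategy may differ.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Larger chunk_size values keep more context per snippet but leave room for fewer snippets in the prompt; a larger overlap reduces boundary losses at the cost of more vectors to store.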
