RAGuru is an advanced Retrieval-Augmented Generation (RAG) system designed to help UPSC aspirants stay updated with current affairs and government releases. It automates the end-to-end pipeline: scraping authoritative sources (like PIB), preprocessing and embedding content, storing it in a vector database, and enabling natural language Q&A via a modern Streamlit interface.
Staying current with official government releases and news is a major challenge for UPSC candidates. RAGuru addresses this by providing a structured, searchable, AI-powered platform that ingests daily updates from trusted sources, processes them, and enables precise, context-aware question answering, eliminating manual tracking and information overload.
Flow:
Scraping → Preprocessing → Embedding → Vector Store → Retrieval → RAG/QA (LLM) → Frontend
- Ingestion: src/ingestion/ scrapes and stores raw data.
- Preprocessing: src/preprocessing/ cleans and chunks text.
- Embedding: src/preprocessing/embed.py generates vector embeddings.
- Vector Store: src/retrieval/vector_store.py manages the FAISS index.
- Retrieval & RAG: src/generation/rag_pipeline.py and src/retrieval/langgraph_agent.py handle retrieval and answer generation.
- Frontend: src/ui/streamlit_app.py provides an interactive UI.
- PIB Scraper:
  Command: python -m src.ingestion.scraper_pib --year 2025 --month 6 --day 25
  - Uses Selenium to interact with the PIB calendar UI.
  - Extracts metadata and full articles for the specified date.
  - Saves structured JSON to data/raw/.
- The Hindu Scraper:
  Command: python -m src.ingestion.scraper_hindu --year 2025 --month 6 --day 25
  - Scrapes headlines for the given date.
  - Stores results in data/raw/.
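Both scraper commands take the same --year/--month/--day flags. The following is a minimal sketch of how such a command line could be parsed and validated, and how a per-day output path under data/raw/ might be derived. The function names and the `<source>_<date>.json` naming scheme are assumptions made for illustration; the repository's scrapers may do this differently.

```python
import argparse
from datetime import date
from pathlib import Path


def parse_scrape_date(argv=None):
    """Parse --year/--month/--day and return a validated datetime.date."""
    parser = argparse.ArgumentParser(description="Scrape releases for one day")
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True)
    parser.add_argument("--day", type=int, required=True)
    args = parser.parse_args(argv)
    try:
        # date() rejects impossible combinations such as 2025-02-30.
        return date(args.year, args.month, args.day)
    except ValueError as exc:
        parser.error(f"invalid date: {exc}")  # prints usage and exits


def output_path(source, day, root="data/raw"):
    """Build a per-source, per-day JSON path (hypothetical naming scheme)."""
    return Path(root) / f"{source}_{day.isoformat()}.json"
```

Validating the date before any browser automation starts means a typo in the flags fails immediately instead of after Selenium has launched Chrome.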
- Embedding Pipeline:
  Command: python -m src.preprocessing.embed
  - Cleans and chunks text using clean_text.py and chunk_text.py.
  - Embeds chunks using either Ollama (nomic-embed-text) or HuggingFace (all-MiniLM-L6-v2).
  - Stores vectors in a FAISS index at data/vector_store/faiss_index_nomic/.
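To make the cleaning and chunking step concrete, here is a small sketch of whitespace normalization followed by fixed-size, overlapping character windows. The function names mirror the module names clean_text.py and chunk_text.py, but the bodies, the character-based splitting, and the default sizes are assumptions, not the project's actual implementation.

```python
import re


def clean_text(raw):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, so it can still be retrieved as a whole.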
- Vector Store:
  - FAISS-based similarity search via vector_store.py.
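Similarity search means ranking stored chunk vectors by how close they are to the query vector. The sketch below does this by brute force in pure Python. It is a conceptual stand-in, not FAISS code: FAISS performs this kind of nearest-neighbour lookup in optimized native code. The cosine metric and the dictionary-based index are assumptions for illustration, since the source does not state which metric vector_store.py uses.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=3):
    """Return the k (score, chunk_id) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

For example, with an index of `{"a": [1, 0], "b": [0, 1], "c": [1, 1]}` and the query `[1, 0]`, chunk "a" ranks first and "c" second.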
- RAG Pipeline:
  - rag_pipeline.py (LangChain) and langgraph_agent.py (LangGraph) orchestrate retrieval and answer generation.
  - Hybrid agent routes queries: attempts a retrieval-augmented answer, falls back to the LLM if needed.
  - LLMs supported: Google Gemini (default), HuggingFace (local).
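The hybrid routing idea can be sketched without any framework. In this illustration the retriever and the LLM are passed in as plain callables, and the fallback is triggered when no retrieved chunk reaches a score threshold. The threshold rule, the prompt wording, and all names are assumptions; the source does not specify what makes the agent fall back, and this is not the LangGraph implementation.

```python
def answer(question, retrieve, llm, min_score=0.5):
    """Route a question: grounded answer if retrieval is strong, else plain LLM.

    retrieve(question) -> iterable of (score, text); llm(prompt) -> str.
    """
    hits = [(score, text) for score, text in retrieve(question) if score >= min_score]
    if hits:
        context = "\n\n".join(text for _, text in hits)
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        return {"route": "rag", "answer": llm(prompt), "sources": [t for _, t in hits]}
    # Nothing relevant enough was retrieved, so ask the model directly.
    return {"route": "llm_fallback", "answer": llm(question), "sources": []}
```

Returning the route and the sources alongside the answer lets the UI show citations for grounded answers and flag ungrounded ones.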
- Start the App:
  Command: streamlit run src/ui/streamlit_app.py
  - Interactive chat UI for Q&A.
  - Select LLM provider, ask questions, and get cited, context-grounded answers.
- LangChain (retrieval, chains)
- LangGraph (agent graph orchestration)
- Google Gemini / HuggingFace (LLMs)
- FAISS (vector database)
- Streamlit (frontend)
- Python 3.13
- Docker (deployment)
- Selenium, BeautifulSoup (scraping)
- dotenv (config management)
- Python 3.13+
- Chrome browser (for Selenium)
- Ollama (if using local embeddings)
- Docker (optional, for containerized deployment)
git clone https://github.com/kyash99252/RAGuru
cd RAGuru
python -m venv .venv
.venv\Scripts\activate # On Windows
source .venv/bin/activate # On macOS/Linux
pip install -r requirements.txt
Create a .env file in the root:
GOOGLE_API_KEY=your_google_api_key
HUGGING_FACE_HUB_TOKEN=your_huggingface_token
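As an illustration of how these two variables might be read at startup, the sketch below loads them with python-dotenv when it is installed and fails early if the Gemini key is missing. The `load_settings` helper is hypothetical, and treating the HuggingFace token as optional is an assumption based on Gemini being the default LLM; src/config.py may handle this differently.

```python
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
except ImportError:  # keep the sketch usable without the dependency
    load_dotenv = None


def load_settings(env=None):
    """Return the API keys, raising early if the Gemini key is missing."""
    if env is None:
        if load_dotenv is not None:
            load_dotenv()  # loads variables from a .env file into os.environ
        env = os.environ
    google_key = env.get("GOOGLE_API_KEY")
    if not google_key:
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to .env")
    return {
        "google_api_key": google_key,
        # Assumed optional here: only needed for the HuggingFace provider.
        "hf_token": env.get("HUGGING_FACE_HUB_TOKEN"),
    }
```

Passing a plain dictionary as `env` makes the helper easy to exercise in tests without touching the real environment.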
Build and run the app in a container:
docker build -t raguru .
docker run -it --env-file .env -p 8501:8501 raguru
Typical workflow (scrape, embed, then launch the app):
python -m src.ingestion.scraper_pib --year 2025 --month 6 --day 25
python -m src.preprocessing.embed
streamlit run src/ui/streamlit_app.py
- Open http://localhost:8501
- Enter a question (e.g., "Summarize the Digital Personal Data Protection Act, 2023")
- Get a cited, context-grounded answer
.
├── src/
│   ├── config.py            # Project-wide configuration
│   ├── ingestion/           # Web scrapers for PIB, The Hindu
│   ├── preprocessing/       # Text cleaning, chunking, embedding
│   ├── retrieval/           # Vector store and retrieval logic
│   ├── generation/          # RAG pipeline, LLM client
│   └── ui/                  # Streamlit frontend
├── data/
│   ├── raw/                 # Raw scraped JSON data
│   ├── processed/           # (Reserved for processed data)
│   └── vector_store/        # FAISS vector index
├── deployment/
│   └── Dockerfile           # Docker build instructions
├── requirements.txt         # Python dependencies
├── setup.py                 # Python package setup
├── .env                     # API keys and secrets
└── README.md                # Project documentation
- Add more news sources and government portals
- Integrate reranking and advanced retrieval (e.g., hybrid search)
- Enhance agent reasoning and fallback logic
- Add user authentication and history
- Deploy as a managed web service (FastAPI backend)
- Improve UI/UX and analytics
This project is licensed under the MIT License.
Contributions, bug reports, and feature requests are welcome; please open an issue or pull request!
RAGuru: Making UPSC prep smarter, faster, and more reliable.
