kyash99252/RAGuru

RAGuru: AI-Powered UPSC Study Companion ✨📚

Python LangChain LangGraph FAISS Streamlit Selenium Docker

RAGuru is a Retrieval-Augmented Generation (RAG) system that helps UPSC aspirants keep up with current affairs and government releases. It automates the end-to-end pipeline: scraping authoritative sources (such as PIB), preprocessing and embedding the content, storing it in a vector database, and answering natural-language questions through a Streamlit interface.


Demo 🚀

You can see RAGuru in action below:

RAGuru Demo Screenshot


Motivation 🎯

Staying current with official government releases and news is a major challenge for UPSC candidates. RAGuru addresses this with a structured, searchable, AI-powered platform that ingests daily updates from trusted sources, processes them, and enables precise, context-aware question answering, which removes the need for manual tracking and reduces information overload.


Architecture Overview 🏗️

Flow:
Scraping → Preprocessing → Embedding → Vector Store → Retrieval → RAG/QA (LLM) → Frontend

  • Ingestion: src/ingestion/ scrapes and stores raw data.
  • Preprocessing: src/preprocessing/ cleans and chunks text.
  • Embedding: src/preprocessing/embed.py generates vector embeddings.
  • Vector Store: src/retrieval/vector_store.py manages FAISS index.
  • Retrieval & RAG: src/generation/rag_pipeline.py and src/retrieval/langgraph_agent.py handle retrieval and answer generation.
  • Frontend: src/ui/streamlit_app.py provides an interactive UI.

How It Works (Step-by-Step Flow) 🛠️

1. Scraping (PIB & The Hindu) 📰

  • PIB Scraper:
    Command:

    python -m src.ingestion.scraper_pib --year 2025 --month 6 --day 25
    • Uses Selenium to interact with the PIB calendar UI.
    • Extracts metadata and full articles for the specified date.
    • Saves structured JSON to data/raw/.
  • The Hindu Scraper:
    Command:

    python -m src.ingestion.scraper_hindu --year 2025 --month 6 --day 25
    • Scrapes headlines for the given date.
    • Stores results in data/raw/.
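
As a rough illustration of the command-line contract shared by both scrapers, the sketch below parses the --year, --month, and --day flags and writes results as JSON under data/raw/. The function names, the file-naming scheme, and the JSON layout are assumptions made for this example; the repository's actual scrapers may organise this differently, and the Selenium scraping step itself is not shown.

```python
import argparse
import json
from datetime import date
from pathlib import Path


def parse_args(argv=None):
    """Parse the --year/--month/--day flags and return them as a date."""
    parser = argparse.ArgumentParser(description="Scrape releases for one date.")
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True)
    parser.add_argument("--day", type=int, required=True)
    args = parser.parse_args(argv)
    # date() raises ValueError for impossible dates such as month 13.
    return date(args.year, args.month, args.day)


def save_articles(articles, target_date, source, out_dir=Path("data/raw")):
    """Write articles to <out_dir>/<source>_<YYYY-MM-DD>.json (hypothetical naming)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{source}_{target_date.isoformat()}.json"
    payload = {
        "source": source,
        "date": target_date.isoformat(),
        "articles": articles,
    }
    out_path.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path
```

Validating the date before any browser is launched means a mistyped flag fails fast instead of after a slow page load.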

2. Preprocessing & Embedding 🧹➡️🔗

  • Embedding Pipeline:
    Command:
    python -m src.preprocessing.embed
    • Cleans and chunks text using clean_text.py and chunk_text.py.
    • Embeds chunks using either Ollama (nomic-embed-text) or HuggingFace (all-MiniLM-L6-v2).
    • Stores vectors in a FAISS index at data/vector_store/faiss_index_nomic/.

3. Retrieval & RAG Pipeline 🤖

  • Vector Store:
    • FAISS-based similarity search via vector_store.py.
  • RAG Pipeline:
    • rag_pipeline.py (LangChain) and langgraph_agent.py (LangGraph) orchestrate retrieval and answer generation.
    • A hybrid agent routes each query: it first attempts a retrieval-augmented answer and falls back to the LLM alone if needed.
    • LLMs supported: Google Gemini (default), HuggingFace (local).
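
The routing idea can be sketched in plain Python. This toy version uses brute-force cosine similarity in place of FAISS and treats the LLM as any callable that maps a prompt string to an answer string. The min_score threshold, the prompt layout, and all function names are assumptions made for illustration; the README does not say how the agent decides that a fallback is needed.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=3):
    """Return the k (score, chunk) pairs most similar to query_vec, best first.

    `index` is a list of (chunk_text, embedding) pairs; a real vector store
    such as FAISS would replace this linear scan.
    """
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in index]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def answer(question, query_vec, index, llm, min_score=0.5):
    """Route a query: use retrieved context when it is similar enough,
    otherwise fall back to asking the LLM directly."""
    hits = [(s, c) for s, c in retrieve(query_vec, index) if s >= min_score]
    if hits:
        context = "\n".join(chunk for _, chunk in hits)
        return "rag", llm(f"Context:\n{context}\n\nQuestion: {question}")
    return "llm", llm(question)
```

Returning the route label alongside the answer lets a UI show whether a response was grounded in retrieved documents or came from the model alone.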

4. Frontend (Streamlit) 💬

  • Start the App:
    Command:
    streamlit run src/ui/streamlit_app.py
    • Interactive chat UI for Q&A.
    • Select LLM provider, ask questions, and get cited, context-grounded answers.

Tech Stack 🧰

  • 🧠 LangChain (retrieval, chains)
  • 🕸️ LangGraph (agent graph orchestration)
  • 🤖 Google Gemini / HuggingFace (LLMs)
  • 🗂️ FAISS (vector database)
  • 🧪 Streamlit (frontend)
  • 🐍 Python 3.13
  • 🐳 Docker (deployment)
  • 🕷️ Selenium, BeautifulSoup (scraping)
  • 🗝️ dotenv (config management)

Setup Instructions ⚙️

1. Prerequisites

  • Python 3.13+
  • Chrome browser (for Selenium)
  • Ollama (if using local embeddings)
  • Docker (optional, for containerized deployment)

2. Installation

git clone https://github.com/kyash99252/RAGuru
cd RAGuru
python -m venv .venv
.venv\Scripts\activate  # On Windows
source .venv/bin/activate  # On Linux/macOS
pip install -r requirements.txt

3. Environment Variables

Create a .env file in the project root:

GOOGLE_API_KEY=your_google_api_key
HUGGING_FACE_HUB_TOKEN=your_huggingface_token
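
A small startup check can catch a missing key before the first API call fails. The sketch below only inspects a mapping of environment variables; the helper name is hypothetical, and treating both keys as mandatory is an assumption, since a Gemini-only setup may not need the HuggingFace token. In practice the values would first be loaded from .env, for example with python-dotenv's load_dotenv().

```python
import os

# Keys taken from the .env example above; requiring both is an assumption.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "HUGGING_FACE_HUB_TOKEN")


def missing_keys(env=None):
    """Return the required keys that are absent or empty in env.

    Defaults to the current process environment when env is None.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

Calling missing_keys() at startup and exiting with a clear message when the list is non-empty gives a more useful error than a failed request deep inside the pipeline.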

4. Docker Setup 🐳

Build and run the app in a container:

docker build -t raguru -f deployment/Dockerfile .
docker run -it --env-file .env -p 8501:8501 raguru

Usage Instructions 🏃‍♂️

Scrape New PIB Articles

python -m src.ingestion.scraper_pib --year 2025 --month 6 --day 25

Run Preprocessing & Embedding

python -m src.preprocessing.embed

Start the RAG App

streamlit run src/ui/streamlit_app.py

Example: Ask a Question

  • Open http://localhost:8501
  • Enter a question (e.g., "Summarize the Digital Personal Data Protection Act, 2023")
  • Get a cited, context-grounded answer

Folder Structure 🗂️

.
├── src/
│   ├── config.py              # Project-wide configuration
│   ├── ingestion/             # Web scrapers for PIB, The Hindu
│   ├── preprocessing/         # Text cleaning, chunking, embedding
│   ├── retrieval/             # Vector store and retrieval logic
│   ├── generation/            # RAG pipeline, LLM client
│   └── ui/                    # Streamlit frontend
├── data/
│   ├── raw/                   # Raw scraped JSON data
│   ├── processed/             # (Reserved for processed data)
│   └── vector_store/          # FAISS vector index
├── deployment/
│   └── Dockerfile             # Docker build instructions
├── requirements.txt           # Python dependencies
├── setup.py                   # Python package setup
├── .env                       # API keys and secrets
└── README.md                  # Project documentation

Future Improvements / Roadmap 🛤️

  • Add more news sources and government portals
  • Integrate reranking and advanced retrieval (e.g., hybrid search)
  • Enhance agent reasoning and fallback logic
  • Add user authentication and history
  • Deploy as a managed web service (FastAPI backend)
  • Improve UI/UX and analytics

License & Contributions 🤝

This project is licensed under the MIT License.
Contributions, bug reports, and feature requests are welcome. Please open an issue or pull request!


RAGuru: Making UPSC prep smarter, faster, and more reliable. 🚀

