RAGuru: AI-Powered UPSC Study Companion ✨📚

RAGuru is an advanced Retrieval-Augmented Generation (RAG) system designed to help UPSC aspirants stay updated with current affairs and government releases. It automates the end-to-end pipeline: scraping authoritative sources (like PIB), preprocessing and embedding content, storing it in a vector database, and enabling natural language Q&A via a modern Streamlit interface.

Demo 🚀

You can see RAGuru in action below:

Motivation 🎯

Staying current with official government releases and news is a major challenge for UPSC candidates. RAGuru addresses this with a structured, searchable platform that ingests daily updates from trusted sources, processes them, and answers questions with retrieved context, so candidates spend less time tracking releases manually and sifting through unrelated material.

Architecture Overview 🏗️

Flow:
Scraping → Preprocessing → Embedding → Vector Store → Retrieval → RAG/QA (LLM) → Frontend

- Ingestion: src/ingestion/ scrapes and stores raw data.
- Preprocessing: src/preprocessing/ cleans and chunks text.
- Embedding: src/preprocessing/embed.py generates vector embeddings.
- Vector Store: src/retrieval/vector_store.py manages the FAISS index.
- Retrieval & RAG: src/generation/rag_pipeline.py and src/retrieval/langgraph_agent.py handle retrieval and answer generation.
- Frontend: src/ui/streamlit_app.py provides an interactive UI.

How It Works (Step-by-Step Flow) 🛠️

1. Scraping (PIB & The Hindu) 📰

PIB Scraper:
Command:
```
python -m src.ingestion.scraper_pib --year 2025 --month 6 --day 25
```
- Uses Selenium to interact with the PIB calendar UI.
- Extracts metadata and full articles for the specified date.
- Saves structured JSON to data/raw/.

The Hindu Scraper:
Command:
```
python -m src.ingestion.scraper_hindu --year 2025 --month 6 --day 25
```
- Scrapes headlines for the given date.
- Stores results in data/raw/.
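
Both scrapers take the same `--year`, `--month`, and `--day` flags. The following is a minimal sketch of how such a command line could be parsed and validated with `argparse`; it is not the project's actual scraper code. The function names and the output file naming scheme are hypothetical, since the source only states that JSON is written to data/raw/.

```python
import argparse
from datetime import date


def parse_scrape_date(argv=None):
    """Parse --year/--month/--day and return a validated datetime.date."""
    parser = argparse.ArgumentParser(description="Scrape releases for one day.")
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True)
    parser.add_argument("--day", type=int, required=True)
    args = parser.parse_args(argv)
    try:
        # date() raises ValueError for impossible dates such as 2025-02-30.
        return date(args.year, args.month, args.day)
    except ValueError as exc:
        # parser.error() prints the message and exits with status 2.
        parser.error(f"invalid date: {exc}")


def output_path(source, day):
    """Hypothetical naming scheme for the JSON file under data/raw/."""
    return f"data/raw/{source}_{day.isoformat()}.json"
```

Calling `parse_scrape_date(["--year", "2025", "--month", "6", "--day", "25"])` returns the date object for 25 June 2025, which `output_path` then turns into a per-day file name.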

2. Preprocessing & Embedding 🧹➡️🔗

Embedding Pipeline:
Command:
```
python -m src.preprocessing.embed
```
- Cleans and chunks text using clean_text.py and chunk_text.py.
- Embeds chunks using either Ollama (nomic-embed-text) or HuggingFace (all-MiniLM-L6-v2).
- Stores vectors in a FAISS index at data/vector_store/faiss_index_nomic/.
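
To illustrate the cleaning and chunking step, here is a small sketch of whitespace normalization followed by overlapping word-window chunking. It is an assumption about what this kind of step typically does, not the contents of clean_text.py or chunk_text.py; the function names, chunk size, and overlap are hypothetical.

```python
import re


def clean_text(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into windows of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, which tends to help retrieval at the cost of some duplicated text in the index.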

3. Retrieval & RAG Pipeline 🤖

Vector Store:
- FAISS-based similarity search via vector_store.py.
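
Conceptually, similarity search ranks stored chunk vectors by how close they are to the query vector. The sketch below shows the idea in plain Python using cosine similarity over a small in-memory list. It is only an illustration: FAISS performs this search with optimized index structures, and the metric the project's index actually uses is not stated here. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """indexed is a list of (chunk_text, vector); return the k best (score, text) pairs."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```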

RAG Pipeline:
- rag_pipeline.py (LangChain) and langgraph_agent.py (LangGraph) orchestrate retrieval and answer generation.
- The hybrid agent routes each query: it first attempts a retrieval-augmented answer and falls back to the LLM alone if needed.
- LLMs supported: Google Gemini (default), HuggingFace (local).
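
The routing behaviour described above can be sketched as a single function that tries retrieval first and falls back to the bare LLM. This is a simplified illustration, not the logic in langgraph_agent.py: the function name, the `retrieve` and `generate` callables, the score threshold, and the prompt wording are all assumptions, and the criterion the real agent uses to decide on a fallback is not documented here.

```python
def answer_query(question, retrieve, generate, min_score=0.5):
    """Return a grounded answer when retrieval is good enough, else a plain LLM answer.

    retrieve(question) -> list of (score, text); generate(prompt) -> str.
    """
    # Keep only hits whose similarity score clears the (assumed) threshold.
    hits = [(score, text) for score, text in retrieve(question) if score >= min_score]
    if hits:
        context = "\n\n".join(text for _, text in hits)
        prompt = (
            f"Answer using only this context:\n{context}\n\n"
            f"Question: {question}"
        )
        return {
            "route": "rag",
            "answer": generate(prompt),
            "sources": [text for _, text in hits],
        }
    # No sufficiently relevant context: ask the LLM directly, with no sources.
    return {"route": "llm_fallback", "answer": generate(question), "sources": []}
```

Returning the route and the source chunks alongside the answer lets a frontend show citations for grounded answers and flag answers that came from the model alone.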

4. Frontend (Streamlit) 💬

Start the App:
Command:
```
streamlit run src/ui/streamlit_app.py
```
- Interactive chat UI for Q&A.
- Select LLM provider, ask questions, and get cited, context-grounded answers.

Tech Stack 🧰

- 🧠 LangChain (retrieval, chains)
- 🕸️ LangGraph (agent graph orchestration)
- 🤖 Google Gemini / HuggingFace (LLMs)
- 🗂️ FAISS (vector database)
- 🧪 Streamlit (frontend)
- 🐍 Python 3.13
- 🐳 Docker (deployment)
- 🕷️ Selenium, BeautifulSoup (scraping)
- 🗝️ dotenv (config management)

Setup Instructions ⚙️

1. Prerequisites

- Python 3.13+
- Chrome browser (for Selenium)
- Ollama (if using local embeddings)
- Docker (optional, for containerized deployment)

2. Installation

```
git clone https://github.com/kyash99252/RAGuru
cd RAGuru
python -m venv .venv
.venv\Scripts\activate        # On Windows
source .venv/bin/activate     # On macOS/Linux
pip install -r requirements.txt
```

3. Environment Variables

Create a .env file in the root:

```
GOOGLE_API_KEY=your_google_api_key
HUGGING_FACE_HUB_TOKEN=your_huggingface_token
```
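
A quick way to confirm the keys are visible to Python before starting the app is a check like the one below. This is a sketch rather than the project's config code: the helper name is hypothetical, and treating both keys as required is an assumption (only one may be needed, depending on which LLM provider you select). If python-dotenv is installed, `load_dotenv()` reads the .env file into the process environment.

```python
import os

try:
    # python-dotenv is optional for this check; when present it loads .env.
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

# Key names taken from the .env example; requiring both is an assumption.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "HUGGING_FACE_HUB_TOKEN")


def missing_keys(env=None):
    """Return the required keys that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

An empty list from `missing_keys()` means both variables are set in the current environment.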

4. Docker Setup 🐳

Build and run the app in a container:

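```
# The Dockerfile lives in deployment/ (see Folder Structure), so pass it with -f.
docker build -t raguru -f deployment/Dockerfile .
docker run -it --env-file .env -p 8501:8501 raguru
```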

Usage Instructions 🏃‍♂️

Scrape New PIB Articles

```
python -m src.ingestion.scraper_pib --year 2025 --month 6 --day 25
```

Run Preprocessing & Embedding

```
python -m src.preprocessing.embed
```

Start the RAG App

```
streamlit run src/ui/streamlit_app.py
```

Example: Ask a Question

- Open http://localhost:8501
- Enter a question (e.g., "Summarize the Digital Personal Data Protection Act, 2023")
- Get a cited, context-grounded answer

Folder Structure 🗂️

```
.
├── src/
│   ├── config.py              # Project-wide configuration
│   ├── ingestion/             # Web scrapers for PIB, The Hindu
│   ├── preprocessing/         # Text cleaning, chunking, embedding
│   ├── retrieval/             # Vector store and retrieval logic
│   ├── generation/            # RAG pipeline, LLM client
│   └── ui/                    # Streamlit frontend
├── data/
│   ├── raw/                   # Raw scraped JSON data
│   ├── processed/             # (Reserved for processed data)
│   └── vector_store/          # FAISS vector index
├── deployment/
│   └── Dockerfile             # Docker build instructions
├── requirements.txt           # Python dependencies
├── setup.py                   # Python package setup
├── .env                       # API keys and secrets
└── README.md                  # Project documentation
```

Future Improvements / Roadmap 🛤️

- Add more news sources and government portals
- Integrate reranking and advanced retrieval (e.g., hybrid search)
- Enhance agent reasoning and fallback logic
- Add user authentication and history
- Deploy as a managed web service (FastAPI backend)
- Improve UI/UX and analytics

License & Contributions 🤝

This project is licensed under the MIT License.
Contributions, bug reports, and feature requests are welcome. Please open an issue or pull request.

RAGuru: Making UPSC prep smarter, faster, and more reliable. 🚀
