Spaces:

IndraneelKumar
/

Search_Engine

Sleeping

File size: 3,178 Bytes

---
title: Articles Search Engine
emoji: 🔎
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "5.45.0"
app_file: frontend/app.py
python_version: "3.12"
pinned: false
---

# Articles Search Engine

A compact, production-style RAG pipeline. It ingests Substack, Medium and top publications RSS articles, stores them in Postgres (Supabase), creates dense/sparse embeddings in Qdrant, and exposes search and answer endpoints via FastAPI with a simple Gradio UI.

## How it works (brief)
- Ingest RSS → Supabase:
  - Prefect flow (`src/pipelines/flows/rss_ingestion_flow.py`) reads feeds from `src/configs/feeds_rss.yaml`, parses articles, and writes them to Postgres using SQLAlchemy models.
- Embed + index in Qdrant:
  - Content is chunked, embedded (e.g., BAAI bge models), and upserted to a Qdrant collection with payload indexes for filtering and hybrid search.
  - Collection and indexes are created via utilities in `src/infrastructure/qdrant/`.
- Search + generate:
  - FastAPI (`src/api/main.py`) exposes search endpoints (keyword, semantic, hybrid) and assembles answers with citations.
  - LLM providers are pluggable with fallback (OpenRouter, OpenAI, Hugging Face).
  - Opik is used for Evaluation
- UI + deploy:
  - Gradio app for quick local search (`frontend/app.py`).
  - Containerization with Docker and optional deploy to Google Cloud Run.

## Tech stack
- Python 3.12, FastAPI, Prefect, SQLAlchemy
- Supabase (Postgres) for articles
- Qdrant for vector search (dense + sparse/hybrid)
- OpenRouter / OpenAI / Hugging Face for LLM completion, Opik for LLM Evaluation
- Gradio UI, Docker, Google Cloud Run
- Config via Pydantic Settings, `uv` or `pip` for deps

## Run locally (minimal)
1) Configure environment (either `.env` or shell). Key variables (Pydantic nested with `__`):
   - Supabase: `SUPABASE_DB__HOST`, `SUPABASE_DB__PORT`, `SUPABASE_DB__NAME`, `SUPABASE_DB__USER`, `SUPABASE_DB__PASSWORD`
   - Qdrant: `QDRANT__URL`, `QDRANT__API_KEY`
   - LLM (choose one): `OPENROUTER__API_KEY` or `OPENAI__API_KEY` or `HUGGING_FACE__API_KEY`
   - Optional CORS: `ALLOWED_ORIGINS`

2) Install dependencies:
```bash
# with uv
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# or with pip
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

3) Initialize storage:
```bash
python src/infrastructure/supabase/create_db.py
python src/infrastructure/qdrant/create_collection.py
python src/infrastructure/qdrant/create_indexes.py
```

4) Ingest and embed:
```bash
python src/pipelines/flows/rss_ingestion_flow.py
python src/pipelines/flows/embeddings_ingestion_flow.py
```

5) Start services:
```bash
# REST API
uvicorn src.api.main:app --reload

# Gradio UI (optional)
python frontend/app.py
```

## Project structure (high-level)
- `src/api/` — FastAPI app, routes, middleware, exceptions
- `src/infrastructure/supabase/` — DB init and sessions
- `src/infrastructure/qdrant/` — Vector store and collection utilities
- `src/pipelines/` — Prefect flows and tasks for ingestion/embeddings
- `src/models/` — SQL and vector models
- `frontend/` — Gradio UI
- `configs/` — RSS feeds config