---
title: Articles Search Engine
emoji: π
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "5.45.0"
app_file: frontend/app.py
python_version: "3.12"
pinned: false
---
# Articles Search Engine
A compact, production-style RAG pipeline. It ingests articles from Substack, Medium, and other top publications' RSS feeds, stores them in Postgres (Supabase), indexes dense and sparse embeddings in Qdrant, and exposes search and answer endpoints via FastAPI, with a simple Gradio UI on top.
## How it works (brief)
- Ingest RSS → Supabase:
- Prefect flow (`src/pipelines/flows/rss_ingestion_flow.py`) reads feeds from `src/configs/feeds_rss.yaml`, parses articles, and writes them to Postgres using SQLAlchemy models.
- Embed + index in Qdrant:
- Content is chunked, embedded (e.g., BAAI bge models), and upserted to a Qdrant collection with payload indexes for filtering and hybrid search.
- Collection and indexes are created via utilities in `src/infrastructure/qdrant/`.
- Search + generate:
- FastAPI (`src/api/main.py`) exposes search endpoints (keyword, semantic, hybrid) and assembles answers with citations.
- LLM providers are pluggable with fallback (OpenRouter, OpenAI, Hugging Face).
  - Opik is used for LLM evaluation.
- UI + deploy:
- Gradio app for quick local search (`frontend/app.py`).
- Containerization with Docker and optional deploy to Google Cloud Run.
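Hybrid search combines a dense (semantic) result list with a sparse (keyword) one. A common way to merge the two is Reciprocal Rank Fusion (RRF); the sketch below is illustrative only, not the repo's actual implementation (Qdrant can also perform this fusion server-side):

```python
def rrf_fuse(rankings, k=60):
    """Merge ranked lists of doc ids via Reciprocal Rank Fusion.

    Each document scores the sum of 1 / (k + rank) over every list it
    appears in; the constant k dampens the dominance of top ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["a", "b", "c"]   # ids ranked by dense (semantic) similarity
sparse = ["b", "d", "a"]  # ids ranked by sparse (keyword) match
fused = rrf_fuse([dense, sparse])
# "b" wins: it ranks high in both lists -> ["b", "a", "d", "c"]
```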
## Tech stack
- Python 3.12, FastAPI, Prefect, SQLAlchemy
- Supabase (Postgres) for articles
- Qdrant for vector search (dense + sparse/hybrid)
- OpenRouter / OpenAI / Hugging Face for LLM completion, Opik for LLM Evaluation
- Gradio UI, Docker, Google Cloud Run
- Config via Pydantic Settings, `uv` or `pip` for deps
## Run locally (minimal)
1) Configure the environment (via `.env` or shell). Key variables (nested Pydantic settings, delimited with `__`):
- Supabase: `SUPABASE_DB__HOST`, `SUPABASE_DB__PORT`, `SUPABASE_DB__NAME`, `SUPABASE_DB__USER`, `SUPABASE_DB__PASSWORD`
- Qdrant: `QDRANT__URL`, `QDRANT__API_KEY`
- LLM (choose one): `OPENROUTER__API_KEY` or `OPENAI__API_KEY` or `HUGGING_FACE__API_KEY`
- Optional CORS: `ALLOWED_ORIGINS`
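The `__` delimiter maps flat environment variables onto nested settings groups; pydantic-settings does this natively via `env_nested_delimiter="__"`. A stdlib-only sketch of the convention, just to show how a name like `SUPABASE_DB__HOST` is interpreted (hypothetical helper, not the project's settings class):

```python
def nest_env(env, delimiter="__"):
    """Fold flat VAR__SUB keys into a nested dict, lowercasing keys."""
    out = {}
    for key, value in env.items():
        parts = [p.lower() for p in key.split(delimiter)]
        node = out
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # descend, creating groups
        node[parts[-1]] = value
    return out

env = {
    "SUPABASE_DB__HOST": "db.example.com",  # placeholder values
    "SUPABASE_DB__PORT": "5432",
    "QDRANT__URL": "http://localhost:6333",
}
settings = nest_env(env)
# settings["supabase_db"]["host"] -> "db.example.com"
```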
2) Install dependencies:
```bash
# with uv
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
# or with pip
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```
3) Initialize storage:
```bash
python src/infrastructure/supabase/create_db.py
python src/infrastructure/qdrant/create_collection.py
python src/infrastructure/qdrant/create_indexes.py
```
4) Ingest and embed:
```bash
python src/pipelines/flows/rss_ingestion_flow.py
python src/pipelines/flows/embeddings_ingestion_flow.py
```
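The embeddings flow in step 4 chunks article text before embedding. A minimal sliding-window chunker with overlap, shown only to illustrate the idea (chunk sizes and the repo's actual splitting strategy may differ):

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into fixed-size character windows that overlap,
    so sentences cut at a boundary still appear whole in one chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 1000, chunk_size=400, overlap=50)
# windows start at 0, 350, 700 -> 3 chunks, each sharing 50 chars
```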
5) Start services:
```bash
# REST API
uvicorn src.api.main:app --reload
# Gradio UI (optional)
python frontend/app.py
```
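With the API running, a quick smoke test from another terminal might look like the following. The endpoint path and payload are assumptions for illustration; check `src/api/main.py` and the routes under `src/api/` for the actual contract (FastAPI's auto-generated docs at `/docs` list them):

```shell
# Interactive OpenAPI docs (FastAPI default)
curl "http://localhost:8000/docs"

# Hypothetical search request — adjust path/fields to match src/api/
curl -X POST "http://localhost:8000/search" \
  -H "Content-Type: application/json" \
  -d '{"query": "vector databases", "mode": "hybrid", "limit": 5}'
```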
## Project structure (high-level)
- `src/api/` → FastAPI app, routes, middleware, exceptions
- `src/infrastructure/supabase/` → DB init and sessions
- `src/infrastructure/qdrant/` → Vector store and collection utilities
- `src/pipelines/` → Prefect flows and tasks for ingestion/embeddings
- `src/models/` → SQL and vector models
- `frontend/` → Gradio UI
- `src/configs/` → RSS feeds config