Spaces:
Sleeping
Sleeping
license: mit | |
title: Generate Knowledge Graphs | |
sdk: streamlit | |
emoji: π | |
colorFrom: indigo | |
colorTo: pink | |
short_description: Use LLM to generate a knowledge graph from your input data. | |
# πΈοΈ Knowledge Graph Extraction App | |
A complete knowledge graph extraction application using LLMs via OpenRouter, available in both Gradio and Streamlit versions. | |
## π Features | |
- **Multi-format Document Support**: PDF, TXT, DOCX, JSON files up to 10MB | |
- **LLM-powered Extraction**: Uses OpenRouter API with free models (Gemma-2-9B, Llama-3.1-8B) | |
- **Smart Entity Detection**: Automatically identifies people, organizations, locations, concepts, events, and objects | |
- **Importance Scoring**: LLM evaluates entity importance from 0.0 to 1.0 | |
- **Interactive Visualization**: Multiple graph layout algorithms with filtering options | |
- **Batch Processing**: Optional processing of multiple documents together | |
- **Export Capabilities**: JSON, GraphML, and GEXF formats | |
- **Real-time Statistics**: Graph metrics and centrality analysis | |
## π Project Structure | |
``` | |
knowledge-graphs/ | |
βββ app.py # Main Gradio application (legacy) | |
βββ app_streamlit.py # Main Streamlit application (recommended) | |
βββ run_streamlit.py # Simple launcher script | |
βββ requirements.txt # Python dependencies | |
βββ README.md # Project documentation | |
βββ .env.example # Environment variables template | |
βββ config/ | |
β βββ settings.py # Configuration management | |
βββ src/ | |
βββ document_processor.py # Document loading and chunking | |
βββ llm_extractor.py # LLM-based entity extraction | |
βββ graph_builder.py # NetworkX graph construction | |
βββ visualizer.py # Graph visualization and export | |
``` | |
## π§ Installation & Setup | |
### Option 1: Streamlit Version (Recommended) | |
The Streamlit version is more stable and has better file handling. | |
**Quick Start:** | |
```bash | |
python run_streamlit.py | |
``` | |
**Manual Setup:** | |
1. **Install dependencies**: | |
```bash | |
pip install -r requirements.txt | |
``` | |
2. **Run the Streamlit app**: | |
```bash | |
streamlit run app_streamlit.py --server.address 0.0.0.0 --server.port 8501 | |
``` | |
The app will be available at `http://localhost:8501` | |
### Option 2: Gradio Version (Legacy) | |
The Gradio version may have some file caching issues but is provided for compatibility. | |
1. **Install dependencies**: | |
```bash | |
pip install -r requirements.txt | |
``` | |
2. **Set up environment variables** (optional): | |
```bash | |
cp .env.example .env | |
# Edit .env and add your OpenRouter API key | |
``` | |
3. **Run the application**: | |
```bash | |
python app.py | |
``` | |
The app will be available at `http://localhost:7860` | |
### HuggingFace Spaces Deployment | |
For **Streamlit deployment**: | |
1. Create a new Space on [HuggingFace Spaces](https://huggingface.co/spaces) | |
2. Choose "Streamlit" as the SDK | |
3. Upload `app_streamlit.py` as `app.py` (HF Spaces expects this name) | |
4. Upload all other project files maintaining directory structure | |
For **Gradio deployment**: | |
1. Create a new Space with "Gradio" as the SDK | |
2. Upload `app.py` and all other files | |
3. Note: May experience file handling issues | |
## π API Configuration | |
### Getting OpenRouter API Key | |
1. Visit [OpenRouter.ai](https://openrouter.ai) | |
2. Sign up for a free account | |
3. Navigate to API Keys section | |
4. Generate a new API key | |
5. Copy the key and use it in the application | |
### Free Models Used | |
- **Primary**: `google/gemma-2-9b-it:free` | |
- **Backup**: `meta-llama/llama-3.1-8b-instruct:free` | |
These models are specifically chosen to minimize API costs while maintaining quality. | |
## π Usage Guide | |
### Basic Workflow | |
1. **Upload Documents**: | |
- Select one or more files (PDF, TXT, DOCX, JSON) | |
- Toggle batch mode for multiple document processing | |
2. **Configure API**: | |
- Enter your OpenRouter API key | |
- Key is stored temporarily for the session | |
3. **Customize Settings**: | |
- Choose graph layout algorithm | |
- Toggle label visibility options | |
- Set minimum importance threshold | |
- Select entity types to include | |
4. **Extract Knowledge Graph**: | |
- Click "Extract Knowledge Graph" button | |
- Monitor progress through the status updates | |
- View results in multiple tabs | |
5. **Explore Results**: | |
- **Graph Visualization**: Interactive graph with colored nodes by entity type | |
- **Statistics**: Detailed metrics about the graph structure | |
- **Entities**: Complete list of extracted entities with details | |
- **Central Nodes**: Most important entities based on centrality measures | |
6. **Export Data**: | |
- Choose export format (JSON, GraphML, GEXF) | |
- Download structured graph data | |
### Advanced Features | |
#### Entity Types | |
- **PERSON**: Individuals mentioned in the text | |
- **ORGANIZATION**: Companies, institutions, groups | |
- **LOCATION**: Places, addresses, geographical entities | |
- **CONCEPT**: Abstract ideas, theories, methodologies | |
- **EVENT**: Specific occurrences, meetings, incidents | |
- **OBJECT**: Physical items, products, artifacts | |
#### Relationship Types | |
- **works_at**: Employment relationships | |
- **located_in**: Geographical associations | |
- **part_of**: Hierarchical relationships | |
- **causes**: Causal relationships | |
- **related_to**: General associations | |
#### Filtering Options | |
- **Importance Threshold**: Show only entities above specified importance score | |
- **Entity Types**: Filter by specific entity categories | |
- **Layout Algorithms**: Spring, circular, shell, Kamada-Kawai, random | |
## π οΈ Technical Details | |
### Architecture Components | |
1. **Document Processing**: | |
- Multi-format file parsing | |
- Intelligent text chunking with overlap | |
- File size validation | |
2. **LLM Integration**: | |
- OpenRouter API integration | |
- Structured prompt engineering | |
- Error handling and fallback models | |
3. **Graph Processing**: | |
- NetworkX-based graph construction | |
- Entity deduplication and standardization | |
- Relationship validation | |
4. **Visualization**: | |
- Matplotlib-based static graphs | |
- Interactive HTML visualizations | |
- Multiple export formats | |
### Configuration Options | |
All settings can be modified in `config/settings.py`: | |
- **Chunk Size**: Default 2000 characters | |
- **Chunk Overlap**: Default 200 characters | |
- **Max File Size**: Default 10MB | |
- **Max Entities**: Default 100 per extraction | |
- **Max Relationships**: Default 200 per extraction | |
- **Importance Threshold**: Default 0.3 | |
### Differences Between Versions | |
**Streamlit Version Advantages:** | |
- More reliable file handling | |
- Better progress indicators | |
- Cleaner UI with sidebar configuration | |
- More stable caching system | |
- Built-in download functionality | |
**Gradio Version Advantages:** | |
- Simpler deployment to HF Spaces | |
- More compact interface | |
- Familiar for ML practitioners | |
## π Security & Privacy | |
- API keys are not stored permanently | |
- Files are processed temporarily and discarded | |
- No data is retained between sessions | |
- All processing happens server-side | |
## π Troubleshooting | |
### Common Issues | |
1. **"OpenRouter API key is required"**: | |
- Ensure you've entered a valid API key | |
- Check the key has sufficient credits | |
2. **"No entities extracted"**: | |
- Document may be too short or unstructured | |
- Try lowering the importance threshold | |
- Check if the document contains meaningful text | |
3. **File upload issues (Gradio version)**: | |
- Known issue with Gradio's file caching system | |
- Try the Streamlit version instead | |
- Ensure files are valid and not corrupted | |
4. **Segmentation fault (local development)**: | |
- Usually related to matplotlib backend | |
- Try setting `MPLBACKEND=Agg` environment variable | |
- Install GUI toolkit if running locally with display | |
5. **Module import errors**: | |
- Ensure all requirements are installed: `pip install -r requirements.txt` | |
- Check Python version compatibility (3.8+) | |
### Performance Tips | |
- Use batch mode for related documents | |
- Adjust chunk size for very long documents | |
- Lower importance threshold for sparse documents | |
- Use simpler layout algorithms for large graphs | |
## π€ Contributing | |
1. Fork the repository | |
2. Create a feature branch | |
3. Make your changes | |
4. Test with both Streamlit and Gradio versions if applicable | |
5. Add tests if applicable | |
6. Submit a pull request | |
## π License | |
This project is licensed under the MIT License - see the LICENSE file for details. | |
## π Acknowledgments | |
- [OpenRouter](https://openrouter.ai) for LLM API access | |
- [Streamlit](https://streamlit.io) for the modern web interface framework | |
- [Gradio](https://gradio.app) for the ML-focused web interface | |
- [NetworkX](https://networkx.org) for graph processing | |
- [HuggingFace Spaces](https://huggingface.co/spaces) for hosting |