CultriX/Generate-Knowledge-Graphs
Below is an example after feeding it the Wikipedia page about Elon Musk:
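If you're curious what a script like that does under the hood, here's a minimal sketch of the general idea (LLM-based triple extraction into a graph). This is my own reconstruction assuming an OpenAI-compatible local endpoint; the prompt, model name, and output format are my guesses, not the repo's actual code:

# Minimal sketch of LLM-based knowledge-graph extraction.
# My reconstruction, NOT the actual Generate-Knowledge-Graphs code;
# endpoint, model name, prompt, and output format are all assumptions.
import json
import networkx as nx
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:1234/v1", api_key="lm-studio")

PROMPT = (
    "Extract factual (subject, relation, object) triples from the text below. "
    "Reply with ONLY a JSON list of 3-element string lists.\n\nTEXT:\n{text}"
)

def extract_triples(text, model="google/gemma-3-4b"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        temperature=0.0,
    )
    # Will raise if the model wraps the JSON in prose; fine for a sketch.
    return json.loads(resp.choices[0].message.content)

def build_graph(triples):
    g = nx.DiGraph()
    for subj, rel, obj in triples:
        g.add_edge(subj, obj, relation=rel)  # one labeled edge per fact
    return g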
@sometimesanotion Maybe this is useful to you! :)
python generate-rag-qav4.py \
--input-dir ./rag-input/ \
--output-dir ./rag-output/ \
--output-filename finetuning_qa_dataset \
--gen-model google/gemma-3-4b \
--gen-api-base http://127.0.0.1:1234/v1 \
--judge-model google/gemma-3-4b \
--judge-api-base http://127.0.0.1:1234/v1 \
--min-chunk-len 200 \
--question-chars 20 \
--answer-chars 5 \
--lang en
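In case the flags aren't obvious, my understanding is that it chunks each input file, has the gen model write a question/answer pair per chunk, and has the judge model filter the pairs; the --*-chars flags look like minimum lengths. A rough sketch of that loop (my own reconstruction, not the actual generate-rag-qav4.py):

# Rough sketch of a chunk -> generate QA -> judge pipeline.
# My reconstruction, NOT the actual generate-rag-qav4.py; flag semantics are guesses.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:1234/v1", api_key="lm-studio")
MODEL = "google/gemma-3-4b"  # the command above uses the same model as gen and judge

def chunks(text, size=1000, min_len=200):
    # Fixed-size chunks, dropping anything shorter than min_len (--min-chunk-len).
    for i in range(0, len(text), size):
        piece = text[i:i + size]
        if len(piece) >= min_len:
            yield piece

def ask(prompt):
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}], temperature=0.2
    )
    return resp.choices[0].message.content.strip()

def qa_for_chunk(chunk, q_min=20, a_min=5):
    q = ask(f"Write one factual question answerable from this text:\n{chunk}")
    a = ask(f"Answer the question using only this text.\nText:\n{chunk}\nQuestion: {q}")
    if len(q) < q_min or len(a) < a_min:  # --question-chars / --answer-chars
        return None
    # Judge pass: keep only pairs the judge says are grounded in the chunk.
    verdict = ask(f"Is this answer supported by the text? Reply YES or NO.\n"
                  f"Text:\n{chunk}\nQ: {q}\nA: {a}")
    return {"question": q, "answer": a} if verdict.upper().startswith("YES") else None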
Oh, also definitely look into this! I don't know how I forgot to mention it in my first post; it's SUPER useful for RAG:
I know it's something very different from what you described, but have you read about AnythingLLM and their browser extension? I have been using it a lot and it works very well.
I have also been looking into MCP a lot lately (it seems very promising and imo is the next big thing happening right now), which could be used for this.
Finally, just because I found it super useful (although a bit unrelated), I also wanted to share this Python script that can turn pretty much any text data into an LLM dataset (even though it's technically not RAG-related; it's been a while since we talked haha): https://www.reddit.com/r/LocalLLaMA/comments/1ai2gby/comment/korunem/?share_id=DFUUUr1ZD2ZCKFGXwccvF
OK nevermind, I clicked that blog link and this is hella damn interesting; how come I never heard of this haha. It states some really promising things right there... :o
the model that calls itself "Qwenconceited-14B-v13-DeepSuffering". <-- That cracked me up, lol!
And yeah, very interesting, but I'm going to have to read it again another time to fully understand everything it's saying haha. Sounds like interesting stuff though!
Oh yeah for sure, I'll hit you up sometime! Just to be clear, I wasn't asking you to upload all the personal tweaks you've probably spent weeks improving haha. I was just curious about some of the things you said. For example, when you said "Extract a small LoRA from this" I was actually a little confused haha. As in: I have no idea how to do that, let alone how to apply it to smooth out other models in the merge.
I know about adapter models and that you can create those with LoRA fine-tuning, and that you can either load them on top during inference or merge them into the base model, but extracting a LoRA from an existing model is kinda confusing me haha (sorry!). It sounds interesting though! Do I understand correctly that this would let you "operate" on the model more precisely and with a lot less compute required (aka: more merges you can make and test in a given time window)?
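From what I've pieced together so far, "extracting a LoRA" seems to mean taking the weight delta between a fine-tuned model and its base and compressing it to low rank with an SVD (mergekit ships a mergekit-extract-lora tool for this, if I'm not mistaken). A bare-bones sketch of the core idea, illustrative only and definitely not mergekit's actual implementation:

# Core idea of LoRA extraction: a low-rank approximation of the weight delta.
# Illustrative sketch only, not mergekit's actual implementation.
import torch

def extract_lora(w_base, w_tuned, rank=16):
    # Return (A, B) such that B @ A approximates w_tuned - w_base.
    delta = (w_tuned - w_base).float()
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    root = torch.diag(s[:rank].sqrt())   # split singular values across A and B
    a = root @ vh[:rank]                 # (rank, in_features)
    b = u[:, :rank] @ root               # (out_features, rank)
    return a, b

# Applying it elsewhere is just adding the low-rank delta back, optionally scaled:
#   w_other + scale * (b @ a)

And since A and B are tiny compared to the full weight matrices, storing and applying these deltas is cheap, which I guess is where the "more merges you can make and test in a given time window" part would come from.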
Would you mind doing a writeup about your customized mergekit workflow, or do you prefer to keep some of the secret sauce to yourself? ;)
Or I guess the README, as nobody can read that lol: https://huggingface.co/spaces/CultriX/MultiAgent-CodeTask/blob/main/README.md