Performance question.

#14
by Stahan - opened

Have you measured the data exchange between the video cards during model inference? Is there any data available?

I have a theory (though I've already done something similar in practice) about how to run this model entirely on several GPUs with a nominal investment of $40,000. So, in order not to throw $40,000 away for nothing, I really hope someone will share the data.
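For reference, one way to measure this directly: a minimal sketch, assuming NVIDIA GPUs and the `nvidia-ml-py` package (`pip install nvidia-ml-py`). It samples each card's PCIe TX/RX counters while an inference server runs in another process. Note these counters see PCIe traffic only; NVLink traffic (if any) won't show up here.

```python
# Minimal sketch: sample per-GPU PCIe throughput while inference runs
# in another process. Assumes NVIDIA GPUs and the nvidia-ml-py package.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, h in enumerate(handles):
            # The driver reports these counters in KB/s, sampled over ~20 ms.
            tx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_TX_BYTES)
            rx = pynvml.nvmlDeviceGetPcieThroughput(h, pynvml.NVML_PCIE_UTIL_RX_BYTES)
            print(f"GPU{i}: TX {tx / 1024:.1f} MB/s  RX {rx / 1024:.1f} MB/s")
        print("-" * 40)
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```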

GPUs give you nothing here. This is a TEXTUAL model; it produces words letter by letter. It's NOT video or something like an airplane design simulation, where the amount of data is huge.
In textual models a GPU gives you only speed, and even WORSE quality, because the PHYSICAL VRAM limit forces you onto smaller versions of the model. With huge models, smaller size always means more hallucinations.
In simple words, the analogy is: a GPU is a Ferrari with a 1-liter fuel tank 🏎️ (worse quality engine) VS an old Ford truck with a 200-liter fuel tank 🚚 (better quality engine).
By the truck I mean any server motherboard with a CPU and 8-12 or even 24 RAM slots, which can hold more than a terabyte of RAM. Yes, it's slower, but you can run the model at its biggest, best-quality size. I've tested many models on my 12-RAM-slot motherboard from 2014, on an ANCIENT Xeon architecture, and the speed is OK; they really improved things there. The original big 400+ billion Llama from Meta was unusable, but these models from China are roughly 2-3x faster than that old Llama on CPU; maybe llama.cpp improved very seriously over the last 2 years.
So with a GPU you will be very disappointed, especially with the huge hole in your pocket or bank account. I would recommend using GPUs for something like video or 3D simulations.
I will soon try the best Q8 GGUF quality in the craziest tests, if I can fit the size on disk; at that quality you need a minimum of 700+ GB of RAM or VRAM, which I have on CPU. From my experience I wouldn't even waste time on sizes below Q8; these models hallucinate hard at smaller sizes. I've tested all the huge models, starting with the first Falcon; I don't remember exactly, it was something like 200 billion...
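For what it's worth, the "700+ GB" figure follows from simple arithmetic. A rough sketch; the parameter count below is an illustrative assumption, not this model's exact figure:

```python
# Back-of-envelope GGUF size, just to show where numbers like "700+ GB"
# come from. The 670B parameter count is illustrative, not this model's.
def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate file size: each weight costs bits_per_weight bits;
    K-quants carry scale/zero metadata, hence the fractional bits."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in [("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8)]:
    print(f"{name}: ~{quant_size_gb(670, bits):,.0f} GB")
# A hypothetical 670B-parameter model at Q8_0 lands around ~712 GB,
# which is why "700+ GB of RAM or VRAM" is the entry ticket.
```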

Perhaps you misunderstood. I have an idea for how to build 3 TB of video memory. Basically, I'm planning to take a few dozen Ferraris and bolt a truck body onto them.

So I'm still interested to see inter-GPU data transfer measurements in regular inference.
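In the absence of published measurements, here is a back-of-envelope estimate of what regular decode traffic should look like. Everything below is a loudly labeled assumption: a dense transformer (MoE routing changes the picture), batch size 1, fp16 activations, Megatron-style tensor parallelism with two ring all-reduces per layer, and illustrative layer/hidden sizes rather than this model's real config.

```python
# Hedged back-of-envelope: bytes moved between GPUs per generated token.
# Assumes a dense transformer decoded at batch size 1, fp16 activations,
# and two ring all-reduces per layer for tensor parallelism.
def tp_bytes_per_token(n_layers: int, hidden: int, tp: int,
                       bytes_per_act: int = 2) -> float:
    # A ring all-reduce over a tensor of s bytes moves ~2*s*(tp-1)/tp
    # bytes per GPU; two all-reduces per layer, each over `hidden` values.
    s = hidden * bytes_per_act
    return n_layers * 2 * (2 * s * (tp - 1) / tp)

def pp_bytes_per_token(n_stages: int, hidden: int,
                       bytes_per_act: int = 2) -> float:
    # Pipeline parallelism only ships the hidden state between stages.
    return (n_stages - 1) * hidden * bytes_per_act

layers, hidden, gpus = 61, 7168, 8   # illustrative shapes, not this model's
print(f"tensor parallel: ~{tp_bytes_per_token(layers, hidden, gpus) / 1e6:.1f} MB/token per GPU")
print(f"pipeline       : ~{pp_bytes_per_token(gpus, hidden) / 1e3:.1f} KB/token total")
```

The takeaway: pipeline parallelism ships kilobytes per token across stage boundaries, while tensor parallelism moves megabytes per token per GPU, which is where the interconnect starts to matter.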

The question itself shows everything; that's why almost no one asks it anywhere. PCI Express has a declared speed, minus many factors and simple physics: the data goes through the controller and the CPU, which heats up. That data-transfer bottleneck (I hit it once even with a bad NVMe drive) is resolved only in Nvidia enterprise solutions with optical links, but for $40K no one there would even talk to you; that's a market where many people earn a percentage of whole big-project deals. The speed will not be great; consumer boards have no CXL and no MCIO connectors.
I'm not sure what value you see in generated text; such an investment into a text printer is illogical. No one reads texts today (the book industry is collapsing, so even the "fan fiction" angle is out). Also, $40K is rather small to get even 700+ GB of VRAM (you can easily forget about original-size models, where you need 1 TB+). If you go to smaller quantizations, the quality is garbage, and forget about coding: quantization mostly destroys that ability in most models like DeepSeek or Kimi K2 (GLM 4.5 was the first that produced simple code without errors on the first try at Q8).
Again, I will repeat to anyone: for just making texts it's a super waste, and even AI will confirm that. For video in ComfyUI I recommend waiting for the release of the Super version of the RTX 5070; one is enough.
I'm not sponsored or affiliated, but people don't understand what a server motherboard is; it's like this - https://www.gigabyte.com/Enterprise/Server-Motherboard
Enterprise motherboards have magical tech such as CXL and other things that won't reach the consumer market for many years. What is CXL: https://youtu.be/zQGZFBrGmK4
Building 3 TB of VRAM on a server will be much more expensive; $40K is not even a luxury car, just a mediocre one.
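To put hedged numbers on the PCIe point: taking the ~3 MB/token tensor-parallel estimate from the sketch earlier in this thread, bandwidth alone sets a latency floor per token. The link speeds below are nominal peaks; sustained real-world throughput is lower, which is the "minus many factors" part.

```python
# Hedged latency floor from interconnect bandwidth alone, using the
# ~MB/token tensor-parallel estimate above. Bandwidths are nominal peaks.
traffic_mb_per_token = 3.1          # from the earlier illustrative sketch

links_gb_s = {
    "PCIe 4.0 x16": 32,             # per direction, nominal
    "PCIe 5.0 x16": 64,
    "NVLink 4 (per GPU)": 450,
}
for name, bw in links_gb_s.items():
    floor_ms = traffic_mb_per_token / 1e3 / bw * 1e3
    print(f"{name}: >= {floor_ms:.3f} ms/token just for communication")
```

Note the pure bandwidth floor looks small; what hurts in practice is the latency of 100+ small all-reduces per token crossing PCIe through the root complex and CPU (consumer boards usually lack peer-to-peer paths), which is consistent with the controller/CPU overhead point above.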

Believe me, I'm pretty good at what I do. If you don't know the answer to my question, please stop writing essays for me.

LoL 😆, if you're tired of text this size, why are you spending your last money on a text printer which, excuse my French, produces it like after a spoiled burrito? BTW, this was written by a real human.
Ungrateful people, as always. Basically, I've written free advice for which many would charge money.
Maybe before coming into a mostly research community with a public outcry, you need to be ready for a sober answer. No one asks such things here, and no one can answer without knowing your hardware. No one has the same hardware (it all differs), developers don't list all their hardware (usually they only ever mention an H100), and they don't run GPU mining farms, so they don't know the transfer rates.
This whole post is the "small investor dilemma": $40K is very small for anything. If I had it (I have everything and need nothing), even for a car it would only buy a very old, bad one plus a garage for constant repairs. For investing in gold or crypto it's too late and too small; the same for AI, you'd just receive a toy. A woman might invest this into body surgery; the same with AI: if you buy hardware, it will get outdated quite fast, and in the future some zoomer will just get an implant to beat you in job interviews, and such "walking Google guys" will be in far more demand than people without implants. That is all, I'm done.
P.S.: 3 terabytes of VRAM in server form is, I think, more realistic at $400K, not $40K, but it's a waste of time, because the big giants will get anything faster and earlier than anyone here. Big companies print patents every day using AI; that's proven by research papers. Google & co. search for cure drugs every second using billions of dollars of AI equipment, but they use other people's drug patents for it, so selling the resulting drugs would be illegal; these cures are only for the very rich: they finance them, they will outlive everyone.
Martin Armstrong, the famous stock trader, built his magical machine 30 years ahead of you, in the 1990s, when GPUs didn't exist, and still uses it. It can be called AI because it also predicts by a "secret formula". So, by his latest advice on YouTube, save your money: his magical machine says 2026 will be financially catastrophic for many, everywhere. Investing in a bubble at a late stage is a special "art", but gold is always high during a crisis.

Test results for Q5 GGUF-UD-XLARGE:
Simple ChucK music-code creation - Failed (GLM 4.5 is still the only model that has completed this, at Q8)
Repairing a logic error in ChucK music code - Failed
To get this result you need 718 GB of RAM/VRAM.
It's usable for role-playing or book writing; it produces very long philosophical texts, but who will read them today... That is all.
The original-size model may be better; this is the quantized result. From a total $1K PC without a GPU (server equipment works over VGA).
