I got an error. Not sure what it is.

#1 by AekDevDev

I downloaded the IQ2_M quant, tried it, and got this error: "tensor 'blk.47.ffn_gate_exps.weight' data is not within the file bounds"
Does this mean my 3090 with 24 GB of VRAM is not enough for the active layers?

main: loading model
srv load_model: loading model 'D:\llms-models\GLM46\GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 23306 MiB free
llama_model_load: error loading model: tensor 'blk.47.ffn_gate_exps.weight' data is not within the file bounds, model is corrupted or incomplete
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'D:\llms-models\GLM46\GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf', try reducing --n-gpu-layers if you're running out of VRAM
srv load_model: failed to load model, 'D:\llms-models\GLM46\GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf'
srv operator (): operator (): cleaning up before exit...
main: exiting due to model loading error

This error means that you forgot to concatenate the files after downloading the GGUF fragments. Either concatenate the downloaded files using cat GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part2of2.gguf > GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf (open Git Bash from Git for Windows in the same folder to enter this command), or download the already concatenated GGUF from https://hf.tst.eu/model#GLM-4.6-REAP-218B-A32B-GGUF
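
For reference, the same step written out as commands (file names taken from the log above; the cmd.exe variant is only an alternative for anyone without Git Bash):

$ cat GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf \
      GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part2of2.gguf \
      > GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf

or, in a plain Windows command prompt:

D:\llms-models\GLM46> copy /b GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf+GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part2of2.gguf GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf

Afterwards point llama-server's -m at the merged GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf instead of the .part1of2 file.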

Thanks for the explanation. Usually multiple GGUF files work in llama.cpp without needing to combine them, but you are correct about concatenating these files.
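
For context: the multi-part GGUFs that load without combining are the ones sharded with llama-gguf-split. Each shard carries split metadata, so llama.cpp only needs to be pointed at the first shard and opens the rest itself, roughly like this (the model name here is just a placeholder):

$ llama-server -m some-model-00001-of-00004.gguf

The .partXofY files in this repository are plain byte splits without that metadata, which is why they have to be concatenated before loading.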

Thanks for these quantizations. I'm guessing you're using split or a similar utility? Can I request splitting with the llama-gguf-split utility? This would greatly simplify download (and update) for llama.cpp and any forks or compatible engines. Here's an example:

$ time llama-gguf-split --split --split-max-size 40G GLM-4.6-REAP-218B-A32B.i1-Q5_K_M.gguf GLM-4.6-REAP-218B-A32B.i1-Q5_K_M
n_split: 4
split 00001: n_tensors = 474, total_size = 39916M
split 00002: n_tensors = 439, total_size = 39730M
split 00003: n_tensors = 440, total_size = 39732M
split 00004: n_tensors = 383, total_size = 35437M
Writing file GLM-4.6-REAP-218B-A32B.i1-Q5_K_M-00001-of-00004.gguf ... done
Writing file GLM-4.6-REAP-218B-A32B.i1-Q5_K_M-00002-of-00004.gguf ... done
Writing file GLM-4.6-REAP-218B-A32B.i1-Q5_K_M-00003-of-00004.gguf ... done
Writing file GLM-4.6-REAP-218B-A32B.i1-Q5_K_M-00004-of-00004.gguf ... done
gguf_split: 4 gguf split written with a total of 1736 tensors.

real    3m13.032s
user    0m12.020s
sys     1m28.468s

Thanks!
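
(Side note for anyone following along: shards produced by llama-gguf-split can be merged back into a single file with the same tool, e.g. for the split created above, something like:

$ llama-gguf-split --merge GLM-4.6-REAP-218B-A32B.i1-Q5_K_M-00001-of-00004.gguf GLM-4.6-REAP-218B-A32B.i1-Q5_K_M.gguf

although llama.cpp can load directly from the first shard, so merging is rarely necessary.)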

There are many reasons why we don't use the llama-gguf-split format. The most important is that it doesn't support zero-copy operation: with llama-gguf-split you have to copy all the data when splitting or merging the files. This is a massive waste of resources, both on our end and on our users' side (if they want to merge the shards). Many of our quantization servers use hard disks and are usually disk-bottlenecked, so splitting every quant with llama-gguf-split would almost halve our quant throughput.

In addition, using llama-gguf-split would break our download page, where users can already get the already concatenated file: it simply concatenates the download streams. Once HuggingFace lifts the 50 GB upload limit when they retire the legacy LFS download path, we will be in a much better position than quanters using the llama-gguf-split format, as we could work with HuggingFace to have them concatenate all our split quants server-side without having to reupload petabytes of files.

There is also no technical reason why you couldn't load the non-concatenated files. I have no idea why anyone would want to, since you can concatenate them with zero copying in a fraction of a second, but if you really want to, there are tools like concatfs that let you mount them as a virtually concatenated file.

It's also worth mentioning that back when mradermacher started, our way of splitting GGUFs was the standard used by TheBloke and everyone else active at the time, as llama-gguf-split did not even exist yet. Back then all users were used to our way of concatenating quants, and because we kept splitting that way, our users are still used to it, so switching now would cause a lot of confusion.
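
A rough sketch of the stream concatenation the download page performs (the URLs below are placeholders, not the real endpoints):

$ curl -L \
    https://example.com/GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part1of2.gguf \
    https://example.com/GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf.part2of2.gguf \
    > GLM-4.6-REAP-218B-A32B.i1-IQ2_M.gguf

curl writes the responses to stdout in the order the URLs are given, so the parts arrive already concatenated in the output file.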
