Newbie Training questions

#106
by rkapuaala - opened

I've been running inference with this model and others for a while. I've successfully created CLI and Flask web clients that serve these models on my local cloud.
Llama 3.2 1B Instruct is on my Ubuntu server: an AMD Ryzen 5 4600G with Radeon Graphics, 4 TB of SSD space, 64 GB of RAM, and two ASUS GeForce RTX 4060 Ti 16 GB GPUs.
I've reviewed several training methods using LoRA, Unsloth, and plain Transformers. Each time I try, I get a memory allocation error. I've set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True via os.environ in the scripts, in .bashrc, and on the CLI, just in case, but I keep getting the same errors. I've read that this model only requires 16 GB of GPU memory minimum, yet my error messages show torch trying to allocate 32 GB of GPU memory, which is impossible given that's all I have across both cards combined. I've set device_map to auto, cuda:0, cuda:1, and finally just cuda, and I get the same message every time. I'm willing to share the script if anyone feels like they can solve my problem; then again, maybe it's just not possible?
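
For reference, here's roughly the shape of the low-memory setup I've been attempting. The model ID, LoRA settings, and quantization choices below are illustrative placeholders, not my exact script:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    model_id = "meta-llama/Llama-3.2-1B-Instruct"  # assumed checkpoint

    # Quantize the frozen base weights to 4-bit so they fit comfortably in 16 GB.
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb,
        device_map={"": 0},  # pin everything to one GPU instead of sharding
    )
    model.gradient_checkpointing_enable()  # trade compute for activation memory

    # Train only small LoRA adapters instead of the full model.
    lora = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # common choice for Llama-style models
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()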

Oh, and I get this error too. I've tried reducing the number of workers, and the CUDA out-of-memory errors change only in how much they tried to allocate.
RuntimeError:

    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.
    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
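
From what I can tell, that traceback is asking for the standard main-module guard: anything that spawns worker processes (like a DataLoader with num_workers > 0) has to be reachable only from under it, not from module top level. A minimal sketch, with a placeholder train() body:

    import multiprocessing as mp

    def train():
        # build the model, dataset, and trainer here, then call trainer.train();
        # worker processes are only ever spawned from inside this function
        print("training runs inside the guard")

    if __name__ == "__main__":
        mp.freeze_support()  # harmless on Linux; only needed for frozen executables
        train()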

I've finally found a way to tune. I'm using SFTTrainer and getting iteration speeds of 5.7 s/it. It's only using one of my GPUs, at 96% utilization and 12 GB out of 16, with 1 GB out of 16 on the other. I've done three training sessions so far. The first used 3 rows of data and finished so fast I had no time to set up monitors. The second used 98 rows and was done in about 15 minutes. I'm currently running a session with 496 rows. It is a major relief to finally be able to train the model in my own office.
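
For anyone else stuck here, this is roughly the shape of the SFTTrainer run that's working for me. The dataset file, model ID, and hyperparameters are placeholders rather than my exact values:

    from datasets import load_dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    # Assumed local JSONL file with a "text" column; swap in your own data.
    dataset = load_dataset("json", data_files="train.jsonl", split="train")

    args = SFTConfig(
        output_dir="llama32-1b-sft",
        per_device_train_batch_size=1,   # small batch keeps one 16 GB card in budget
        gradient_accumulation_steps=8,   # recover a larger effective batch size
        num_train_epochs=3,
        logging_steps=10,
    )

    trainer = SFTTrainer(
        model="meta-llama/Llama-3.2-1B-Instruct",  # assumed checkpoint
        args=args,
        train_dataset=dataset,
        peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    )

    if __name__ == "__main__":  # the guard from the error above
        trainer.train()

Keeping everything on a single card this way also matches what I'm seeing: one GPU near full utilization while the second sits almost idle.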
