You can use TRL, a library that implements PPO on top of Hugging Face Accelerate, to fine-tune large language models with reinforcement learning on a single device or in a distributed setup.
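As a rough sketch of what a single PPO update looks like with TRL (modeled on the library's quickstart; the exact API differs across versions — this follows the pre-1.0 `PPOTrainer` interface — and the constant reward stands in for a real reward model):

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# policy with a value head, plus a frozen reference copy for the KL penalty
model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

config = PPOConfig(batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# sample a response from the current policy
query = tokenizer.encode("This morning I went to the", return_tensors="pt")
response = ppo_trainer.generate(query[0], return_prompt=False, max_new_tokens=20)

# in practice the reward comes from a reward model; a constant stands in here
reward = [torch.tensor(1.0)]

# one PPO optimization step on the (query, response, reward) triple
stats = ppo_trainer.step([query[0]], [response[0]], reward)
```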
You can also use LOMO, a new optimizer that fuses gradient computation and the parameter update into a single step, so full-model gradients never have to be held in memory all at once. It enables full-parameter fine-tuning of a 65B model on a single machine with 8 RTX 3090 GPUs (24 GB each).
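The following is a conceptual PyTorch sketch of that fusing idea, not LOMO's actual API: each parameter is updated the moment its gradient is produced during the backward pass, and the gradient is freed immediately instead of being kept for a separate `optimizer.step()` (this uses `register_post_accumulate_grad_hook`, available in PyTorch 2.1+):

```python
import torch
import torch.nn as nn

def fuse_sgd_into_backward(model: nn.Module, lr: float = 1e-3) -> None:
    """Apply a plain SGD update inside backward, freeing each gradient
    right away so full-size .grad buffers never accumulate."""
    @torch.no_grad()
    def hook(param: torch.Tensor) -> None:
        param.add_(param.grad, alpha=-lr)  # update as soon as grad is ready
        param.grad = None                  # free the gradient immediately

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
fuse_sgd_into_backward(model, lr=0.01)

x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()  # parameters are updated during this call; no optimizer.step()
```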
You can also try QLoRA, a method that backpropagates through a frozen, 4-bit-quantized base model into small low-rank adapters (LoRA), drastically reducing the memory footprint and computational cost of fine-tuning large language models. The paper's best model family (Guanaco) reaches 99.3% of the performance level of ChatGPT while requiring only 24 hours of fine-tuning on a single GPU.
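With Hugging Face `transformers`, `peft`, and `bitsandbytes`, this setup looks roughly as follows (the checkpoint, rank, and target modules below are illustrative choices, not the paper's exact configuration):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, as in the QLoRA paper
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# the frozen base model is loaded in 4-bit precision
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # small example checkpoint; the paper used LLaMA models
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# low-rank adapters are the only trainable parameters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Since only the adapter weights are trainable, the resulting model can then be fine-tuned with an ordinary training loop or the `transformers` `Trainer`.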