Punica introduces an approach to efficiently serving multiple finetuned Large Language Models (LLMs) using Low-Rank Adaptation (LoRA), a finetuning technique that adds only minimal storage and memory overhead on top of a pretrained model. Because each LoRA model consists of small low-rank matrices that adjust the weights of a shared pretrained model, Punica can run many LoRA-finetuned models at roughly the computational cost of running just one.
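To make the idea concrete, here is a minimal sketch (not Punica's code) of how a LoRA adapter works: the base weight stays frozen and shared, and each finetuned model only contributes two skinny matrices. The names and shapes (`W`, `A`, `B`, `rank`) are illustrative assumptions.

```python
import torch

d_out, d_in, rank = 4096, 4096, 16        # rank is much smaller than d_in/d_out

W = torch.randn(d_out, d_in)              # frozen pretrained weight, shared by all models
A = torch.randn(rank, d_in) * 0.01        # small per-model LoRA matrices
B = torch.zeros(d_out, rank)              # B starts at zero, so the initial delta is zero

x = torch.randn(d_in)

# The adapted output only adds B @ (A @ x) on top of the base model's output,
# so each extra finetuned model costs two low-rank matrices, not a full copy of W.
y_base = W @ x
y_lora = W @ x + B @ (A @ x)
```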
The key to Punica’s efficiency is its Segmented Gather Matrix-Vector multiplication (SGMV) CUDA kernel, which handles the extra computation the LoRA adapters add. SGMV preserves the strong batching effect: requests for different LoRA models can still be processed together in a single batch, keeping the GPU well utilized and the cost per request low.
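The following is a plain-PyTorch sketch of the computation SGMV performs, not the kernel itself; the tensor names and segment layout are assumptions for exposition. Requests in the batch are grouped into contiguous segments, one segment per LoRA model, and each segment's rows are multiplied by that model's low-rank matrix (shown here for the "shrink" half, `x @ A`; the "expand" half follows the same pattern).

```python
import torch

num_models, batch, d_in, rank = 3, 8, 4096, 16

x = torch.randn(batch, d_in)                  # batched hidden states from mixed requests
lora_A = torch.randn(num_models, d_in, rank)  # one low-rank adapter matrix per model
seg_starts = [0, 3, 5, 8]                     # segment boundaries: rows 0-2, 3-4, 5-7

def sgmv_reference(x, weights, seg_starts):
    y = torch.zeros(x.shape[0], weights.shape[-1])
    for m in range(len(seg_starts) - 1):
        lo, hi = seg_starts[m], seg_starts[m + 1]
        # Gather this segment's adapter and apply it to its slice of the batch.
        y[lo:hi] = x[lo:hi] @ weights[m]
    return y

y = sgmv_reference(x, lora_A, seg_starts)     # shape: (batch, rank)
```

In Punica itself this per-segment loop is fused into a single custom CUDA kernel, so gathering each segment's weights and performing the matrix-vector products happen in one pass over the batch rather than one launch per model.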
In benchmarks, Punica achieves up to 12 times the text-generation throughput of state-of-the-art serving systems, making it a highly scalable way to serve many different LoRA models simultaneously.
Punica can be installed either from a binary package for quick setup or built from source for customization. It also provides examples for serving multiple LoRA models, finetuning, converting model weights to Punica's format, and benchmarking text-generation performance. The project's paper gives a deeper look at the multi-tenant LoRA serving that Punica enables.
Read more at GitHub…