Harnessing the power of Supervised Fine-Tuning (SFT) is crucial for the evolution of Large Language Models (LLMs). A new fine-tuning method, Self-Play fIne-tuNing (SPIN), has been proposed to strengthen LLMs without requiring additional human-annotated data. SPIN uses a self-play mechanism: at each iteration the LLM generates synthetic responses with its previous checkpoint and then refines its policy by learning to distinguish those self-generated responses from the human-annotated SFT examples. This process incrementally improves the LLM, extracting more value from the existing human-annotated SFT data. A theoretical analysis shows that SPIN’s training objective reaches its global optimum only when the LLM’s policy matches the target data distribution. In empirical evaluations on the HuggingFace Open LLM Leaderboard and other benchmarks, SPIN not only improved the LLM’s performance but also surpassed models trained with direct preference optimization (DPO) supplemented with extra GPT-4 preference data. These findings highlight the potential of self-play to approach human-level LLM performance without the need for expert human input.
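To make the self-play idea concrete, the sketch below shows a minimal per-example loss in the spirit of SPIN, assuming the DPO-style logistic formulation described in the paper: the current model is rewarded for assigning higher relative likelihood (versus the previous checkpoint) to the human-annotated response than to the self-generated one. The function name `spin_loss`, its log-probability inputs, and the weight `lam` are illustrative placeholders, not the authors' implementation.

```python
import math

def spin_loss(logp_new_human, logp_old_human,
              logp_new_synth, logp_old_synth, lam=0.1):
    """Illustrative per-example self-play loss (assumed logistic form).

    ``new`` = current model being trained, ``old`` = previous checkpoint that
    generated the synthetic response. The loss falls when the current model
    raises its likelihood on the human response and lowers it on its own
    synthetic response, relative to the previous checkpoint.
    """
    human_margin = lam * (logp_new_human - logp_old_human)
    synth_margin = lam * (logp_new_synth - logp_old_synth)
    # Logistic loss on the margin difference, as in DPO-style objectives.
    return math.log(1.0 + math.exp(-(human_margin - synth_margin)))

# Hypothetical numbers: the current model matches the old checkpoint on the
# human response but has learned to down-weight its own synthetic response.
loss = spin_loss(logp_new_human=-12.0, logp_old_human=-12.0,
                 logp_new_synth=-20.0, logp_old_synth=-15.0)
print(f"per-example loss: {loss:.4f}")
```

In a full training loop, each iteration would regenerate synthetic responses with the newly trained model and repeat the step, which is what drives the incremental self-improvement described above.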
Read more…