A new proof-of-concept tool, Torchtitan, has been developed to demonstrate the capabilities of PyTorch's distributed training features for large language models (LLMs). Unlike existing frameworks such as Megatron and DeepSpeed, Torchtitan is designed as a complementary tool that showcases PyTorch's latest features in a minimal, modular codebase.
Despite being in pre-release status, Torchtitan has already been tested on 64 A100 GPUs and supports training Llama 3 and Llama 2 models. The tool offers a range of features, including selective per-layer activation checkpointing (sketched below), pre-configured datasets, and performance-metrics visualization with TensorBoard.
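To make the activation-checkpointing feature concrete, here is a minimal sketch of the general technique in plain PyTorch: every other transformer-style block is checkpointed, so its activations are recomputed during the backward pass instead of being stored. The `Block` and `Model` classes and the `ac_freq` parameter are illustrative assumptions, not Torchtitan's actual implementation.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Stand-in for a transformer block (hypothetical, for illustration only)."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(x)

class Model(nn.Module):
    def __init__(self, dim: int = 256, n_layers: int = 8, ac_freq: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(Block(dim) for _ in range(n_layers))
        self.ac_freq = ac_freq  # checkpoint every `ac_freq`-th layer (assumed knob)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, layer in enumerate(self.layers):
            if i % self.ac_freq == 0:
                # Recompute this layer's activations in the backward pass
                # instead of storing them, trading compute for memory.
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x

model = Model()
out = model(torch.randn(4, 16, 256))
out.sum().backward()
```

Checkpointing only a subset of layers is what makes the scheme "selective": it recovers much of the memory saving of full activation checkpointing while paying the recomputation cost on only a fraction of the layers.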
Torchtitan is designed to be user-friendly, letting users set up and train their models quickly with minimal changes to their code. The tool is available under the BSD 3-Clause license, making it easy for developers to adopt and contribute to the project.
Future updates to Torchtitan are planned, including asynchronous checkpointing, FP8 support, and scalable data-loading solutions. To get started, users clone the repository, install its dependencies, and install the latest PyTorch nightly build; detailed instructions cover launching training runs and visualizing metrics with TensorBoard, along the lines of the sketch below.
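Assuming the layout of the torchtitan GitHub repository at the time of writing, setup and a first training run look roughly like the following; the config-file path, launch script, and TensorBoard log directory are taken from the project's README and default configs and may differ in newer releases.

```bash
# Clone the repository and install its dependencies
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt

# Install the latest PyTorch nightly build (CUDA 12.1 wheel shown; adjust for your setup)
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121

# Launch a pre-configured Llama 3 training run (config path assumed from the repo's defaults)
CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh

# Visualize training metrics (log directory assumed from the default config)
tensorboard --logdir=./outputs/tb
```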