9th March 2024

BitNet Transformer: Scaling 1-bit Transformers for Large Language Models

BitNet Transformer is an architecture that scales 1-bit Transformers for large language models. It achieves competitive performance while substantially reducing memory footprint and energy consumption compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines.

Key Features:

  • BitLinear: A drop-in replacement for the nn.Linear layer in PyTorch, enabling the training of 1-bit weights from scratch (see the sketch after this list).
  • Scalable and Stable: Designed for stable training and efficient scaling to large language models.
  • Competitive Performance: Achieves competitive results in terms of perplexity and downstream task accuracy compared to baselines.
  • Significant Energy Savings: Provides substantial energy cost reductions, especially as the model size scales up.
  • Scaling Law: Exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models.
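
The following is a minimal, illustrative sketch of how a BitLinear-style layer could be written in PyTorch; it is not the reference implementation. It assumes on-the-fly weight binarization with a mean-absolute-value scale, absmax quantization of activations, and a straight-through estimator for gradients; the act_bits parameter and the usage example are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BitLinear(nn.Linear):
    """Illustrative 1-bit linear layer (simplified sketch, not the official code).

    Full-precision weights are kept for training and binarized to {-1, +1}
    on the fly; a straight-through estimator passes gradients through the
    non-differentiable sign()/round() operations.
    """

    def __init__(self, in_features, out_features, bias=True, act_bits=8):
        super().__init__(in_features, out_features, bias=bias)
        self.act_bits = act_bits  # assumed activation bit-width for this sketch

    def forward(self, x):
        # Binarize weights: sign of the zero-centered weights, scaled by the
        # mean absolute value so overall magnitudes are roughly preserved.
        w = self.weight
        beta = w.abs().mean()
        w_bin = torch.sign(w - w.mean())
        # Straight-through estimator: forward uses binary weights,
        # backward treats the binarization as identity.
        w_q = w + (w_bin * beta - w).detach()

        # Absmax quantization of activations to `act_bits` bits.
        q_max = 2 ** (self.act_bits - 1) - 1
        gamma = x.abs().max().clamp(min=1e-5)
        x_q = x + ((x * q_max / gamma).round().clamp(-q_max, q_max)
                   * gamma / q_max - x).detach()

        return F.linear(x_q, w_q, self.bias)


# Usage: swap nn.Linear for BitLinear inside a Transformer block.
layer = BitLinear(512, 512)
out = layer(torch.randn(4, 512))
print(out.shape)  # torch.Size([4, 512])
```

Because the layer subclasses nn.Linear, it can replace existing linear projections without changing the surrounding model code; only the forward pass differs.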

Availability:

  • GitHub: The code and implementation details are available on GitHub.
  • Blog Post: For a detailed overview and analysis of BitNet Transformer, please refer to our blog post.