GPT-2 From Scratch PyTorch
| Property | Value |
|---|---|
| Author | rasbt |
| Model Variants | Small (124M), Medium (355M), Large (774M), XL (1558M) |
| Framework | PyTorch |
| Repository | Hugging Face |
What is gpt2-from-scratch-pytorch?
This is a PyTorch implementation of OpenAI's GPT-2 model, with weights converted from the original TensorFlow checkpoints. The project provides a clean, from-scratch implementation and supports multiple model sizes, ranging from 124M to 1.5B parameters.
Implementation Details
The implementation is a complete PyTorch architecture with configurable embedding dimensions, attention head counts, and layer counts (see the configuration sketch after the list below). The model uses the tiktoken tokenizer and supports loading weights from both PyTorch state dicts and safetensors files.
- Vocabulary size: 50,257 tokens
- Maximum context length: 1,024 tokens
- Embedding dimensions: 768 to 1,600, depending on model size
- Attention heads: 12 to 25
- Transformer layers: 12 to 48
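The per-variant hyperparameters follow the published GPT-2 family. Below is a minimal configuration sketch; the dictionary key names are illustrative assumptions rather than the repository's exact API, but the numeric values match the original model sizes.

```python
# Illustrative configuration dictionaries; key names are assumptions,
# values correspond to the published GPT-2 model sizes.
BASE_CONFIG = {
    "vocab_size": 50257,     # BPE vocabulary used by tiktoken's "gpt2" encoding
    "context_length": 1024,  # maximum sequence length
    "drop_rate": 0.0,        # dropout disabled for inference
    "qkv_bias": True,        # original GPT-2 uses biases in the QKV projections
}

GPT2_SIZES = {
    "small (124M)":  {"emb_dim": 768,  "n_heads": 12, "n_layers": 12},
    "medium (355M)": {"emb_dim": 1024, "n_heads": 16, "n_layers": 24},
    "large (774M)":  {"emb_dim": 1280, "n_heads": 20, "n_layers": 36},
    "xl (1558M)":    {"emb_dim": 1600, "n_heads": 25, "n_layers": 48},
}
```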
Core Capabilities
- Text generation with a configurable maximum number of new tokens
- Support for multiple model sizes and configurations
- Compatible with both .pth and .safetensors weight formats (see the loading sketch after this list)
- Implements the original GPT-2 architecture with multi-head attention using query, key, and value projections
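As a rough illustration of how weight loading and generation could fit together, here is a minimal sketch. The `GPTModel` class name and checkpoint filenames are placeholders (the repository's actual class and file names may differ), the greedy decoding loop stands in for whatever generation helper the project provides, and the configuration dictionaries reuse the sketch above.

```python
import tiktoken
import torch
from safetensors.torch import load_file

# "GPTModel" is a placeholder for the repository's model class; filenames are illustrative.
cfg = {**BASE_CONFIG, **GPT2_SIZES["small (124M)"]}
model = GPTModel(cfg)
model.load_state_dict(load_file("model.safetensors"))                 # safetensors checkpoint
# model.load_state_dict(torch.load("model.pth", map_location="cpu"))  # or a .pth state dict
model.eval()

tokenizer = tiktoken.get_encoding("gpt2")
token_ids = torch.tensor([tokenizer.encode("Every effort moves you")])

# Greedy decoding with a configurable budget of new tokens
with torch.no_grad():
    for _ in range(25):
        logits = model(token_ids[:, -cfg["context_length"]:])  # truncate to context length
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_id], dim=1)

print(tokenizer.decode(token_ids.squeeze(0).tolist()))
```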
Frequently Asked Questions
Q: What makes this model unique?
This implementation provides a from-scratch PyTorch version of GPT-2, making it ideal for learning and understanding the architecture. It maintains compatibility with OpenAI's original weights while offering flexible loading options.
Q: What are the recommended use cases?
The model is particularly useful for education, for text generation experiments, and as a foundation for further research. It is also valuable for anyone who wants to study GPT-2's architecture or needs a PyTorch-native implementation.