bertweet-base

vinai

Pre-trained language model for English Tweets, based on the RoBERTa architecture and trained on 850M Tweets (16B word tokens). MIT licensed, with strong performance on Tweet NLP tasks.

Property     Value
License      MIT
Author       VINAI
Downloads    81,578
Framework    PyTorch, TensorFlow

What is bertweet-base?

BERTweet-base is the first public large-scale language model pre-trained specifically for English Tweets. It follows the RoBERTa pre-training procedure and was trained on a corpus of 850M English Tweets: 845M general Tweets streamed from 2012 to 2019, plus 5M Tweets related to the COVID-19 pandemic.

Implementation Details

The model is built on the RoBERTa architecture and has been trained on approximately 16B word tokens, equivalent to about 80GB of text data. It supports multiple deep learning frameworks including PyTorch and TensorFlow, making it versatile for different development environments.

  • Pre-trained on 850M English Tweets
  • Implements RoBERTa architecture
  • Supports multiple frameworks
  • Includes COVID-19 specific data
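Because the model follows the standard RoBERTa interface, it can be loaded through Hugging Face `transformers`. The sketch below, assuming `transformers` and `torch` are installed, extracts contextual features for a Tweet (the input uses the `@USER` and `HTTPURL` normalization conventions the model was trained with):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load BERTweet and its BPE tokenizer from the Hugging Face Hub.
# BERTweet ships a slow (Python) tokenizer only, hence use_fast=False.
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")

# Input should already be normalized: mentions as @USER, links as HTTPURL.
line = "SC has first two presumptive cases of coronavirus , DHEC confirms HTTPURL via @USER :cry:"

input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = bertweet(input_ids)

# last_hidden_state has shape (batch, sequence_length, hidden_size=768)
print(features.last_hidden_state.shape)
```

The resulting per-token vectors can feed a downstream classifier or tagger, as with any RoBERTa-style encoder.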

Core Capabilities

  • Part-of-Speech Tagging
  • Named Entity Recognition
  • Sentiment Analysis
  • Irony Detection
  • Fill-Mask Task Support
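The fill-mask capability can be exercised directly via the `transformers` pipeline API; a minimal sketch, assuming `transformers` and `torch` are installed (BERTweet uses `<mask>` as its mask token):

```python
from transformers import pipeline

# Build a fill-mask pipeline around BERTweet.
fill_mask = pipeline("fill-mask", model="vinai/bertweet-base")

# Ask the model to complete a masked Tweet-like sentence.
predictions = fill_mask("The weather today is <mask> .")

# Each prediction carries the candidate token and its probability.
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))
```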

Frequently Asked Questions

Q: What makes this model unique?

BERTweet is the first large-scale language model specifically designed for Twitter content, combining both general tweets and pandemic-related data for comprehensive coverage of social media language patterns.

Q: What are the recommended use cases?

The model excels in social media text analysis tasks including sentiment analysis, named entity recognition, part-of-speech tagging, and irony detection, making it ideal for Twitter-focused NLP applications.
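For any of these use cases, inputs should be normalized the way the training corpus was: per the BERTweet paper, user mentions become `@USER` and web links become `HTTPURL` (the official tokenizer can also translate emoji into text strings, a step omitted here). A rough regex-based sketch of that normalization, with the patterns being an approximation rather than the exact official implementation:

```python
import re

def normalize_tweet(text: str) -> str:
    """Approximate BERTweet's lexical normalization:
    user mentions -> @USER, URLs -> HTTPURL."""
    text = re.sub(r"@\w+", "@USER", text)          # mask user mentions
    text = re.sub(r"(?:https?://|www\.)\S+", "HTTPURL", text)  # mask links
    return text

print(normalize_tweet("Loving the new update @jack! https://t.co/abc123"))
```

Applying a normalizer like this before tokenization keeps inference inputs consistent with the pre-training data.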
