SEC-BERT-NUM
| Property | Value |
|---|---|
| Parameters | 110M |
| Architecture | 12-layer, 768-hidden, 12-heads BERT |
| Training Data | 260,773 SEC 10-K filings (1993-2019) |
| Paper | FiNER: Financial Numeric Entity Recognition for XBRL Tagging |
What is SEC-BERT-NUM?
SEC-BERT-NUM is a BERT model specialized for financial-domain natural language processing. Its distinguishing feature is the uniform handling of numerical expressions: every number token is replaced with a [NUM] pseudo-token, so numeric values are not fragmented into sub-word pieces. The model was pre-trained on 260,773 SEC 10-K filings from 1993-2019, making it particularly effective for financial text analysis tasks.
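A minimal sketch of the pre-processing idea follows; it assumes the model is published as `nlpaueb/sec-bert-num` on the Hugging Face Hub, and the regex used to detect numeric tokens is illustrative rather than the exact rule from the paper.

```python
import re
from transformers import AutoTokenizer

# The Hub model id "nlpaueb/sec-bert-num" is an assumption of this sketch.
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-num")

def to_num_tokens(sentence):
    """Tokenize a sentence and map every numeric token to the [NUM] pseudo-token."""
    processed = []
    for token in tokenizer.tokenize(sentence):
        # Illustrative rule: treat a token made of digits, commas and periods as numeric.
        if re.fullmatch(r"(\d+[\d,.]*)|([,.]\d+)", token):
            processed.append("[NUM]")
        else:
            processed.append(token)
    return processed

print(to_num_tokens("Total net sales decreased 2 % or $ 5.4 billion during fiscal 2019 ."))
# Every numeric token the tokenizer produces is mapped to the single [NUM]
# vocabulary entry, so quantities share one representation instead of
# fragmenting into arbitrary sub-word pieces.
```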
Implementation Details
The model builds on the BERT-BASE architecture with several domain-specific adaptations. It uses a custom 30k subword vocabulary trained from scratch on financial documents and follows the same pre-training setup as BERT-BASE, running for 1 million training steps.
- Trained using Google's official BERT repository
- Compatible with both PyTorch and TensorFlow 2 (see the loading sketch after this list)
- Handles numeric tokens by mapping them to [NUM] during pre-processing
- Trained on Google Cloud TPU v3-8
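As a rough sketch of how the published checkpoints can be loaded, again assuming the Hub id `nlpaueb/sec-bert-num` (and that TensorFlow usage goes through the standard `TFAutoModel` conversion path):

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "nlpaueb/sec-bert-num"  # assumed Hub id for SEC-BERT-NUM

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)  # PyTorch weights

# TensorFlow 2 usage would go through TFAutoModel instead, e.g.:
# from transformers import TFAutoModel
# tf_model = TFAutoModel.from_pretrained(MODEL_ID, from_pt=True)

# Numbers in real inputs should first be mapped to [NUM] as shown earlier;
# this example sentence simply contains no numerals.
inputs = tokenizer("The company reported strong growth in operating income .",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])
```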
Core Capabilities
- Stronger performance than generic BERT on financial text understanding tasks
- Consistent handling of numerical expressions
- Enhanced masked token prediction for financial contexts (see the fill-mask sketch after this list)
- Specialized vocabulary for financial domain
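The masked token prediction capability can be tried directly with the fill-mask pipeline; as before, the Hub id is an assumption of this sketch.

```python
from transformers import pipeline

# Masked-token prediction sketch; the Hub id "nlpaueb/sec-bert-num" is assumed.
fill_mask = pipeline("fill-mask", model="nlpaueb/sec-bert-num")

# A financial-report style sentence with one masked position.
for prediction in fill_mask("The company recorded a [MASK] in net revenue for the year."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```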
Frequently Asked Questions
Q: What makes this model unique?
SEC-BERT-NUM's distinctive feature is its uniform handling of numerical expressions through the [NUM] token, which helps maintain consistency in financial text processing and improves performance on financial NLP tasks.
Q: What are the recommended use cases?
The model is particularly well-suited for financial text analysis tasks, including financial numeric entity recognition, sentiment analysis of financial documents, and processing of SEC filings and other financial reports. Downstream use typically means fine-tuning the encoder on labeled task data; a minimal sketch follows.
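The setup below illustrates a token-classification starting point for a task such as numeric entity recognition; the Hub id and label count are placeholders, not values from this card.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "nlpaueb/sec-bert-num"   # assumed Hub id
NUM_LABELS = 5                      # placeholder: depends on your entity/tagging scheme

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=NUM_LABELS)

# From here: map numbers in the training text to [NUM] (as shown earlier),
# align word-level labels to sub-word tokens, and fine-tune with your
# preferred training loop or the transformers Trainer API.
```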