russian-sensitive-topics

Property	Value
License	CC BY-NC-SA 4.0
Paper	Research Paper
Language	Russian
Framework	PyTorch, Transformers

What is russian-sensitive-topics?

russian-sensitive-topics is a text classification model that detects 18 sensitive topics in Russian text. Developed by researchers at Skoltech NLP, it targets content that is potentially inappropriate and could harm a company's reputation. The model was trained on a mix of manually and semi-automatically labeled data.

Implementation Details

The model is built on the BERT architecture and implemented with PyTorch and the Transformers library. Its reported F1 scores are strongest for drugs (0.88), weapons (0.86), and religion (0.81), and precision and recall are balanced across most categories.
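For readers who want to relate the F1 figures to precision and recall, F1 is the harmonic mean of the two, so it is high only when both are high. The sketch below uses the standard definition; the input values are illustrative and are not taken from the model's evaluation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        # Convention: F1 is 0 when both precision and recall are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only, not reported results for this model.
balanced = f1_score(0.90, 0.86)
unbalanced = f1_score(0.99, 0.50)
print(round(balanced, 3), round(unbalanced, 3))
```

With equal precision and recall, F1 equals that shared value; a large gap between the two pulls F1 toward the lower one.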

Supports 18 distinct sensitive topics including offline/online crime, discrimination, and social issues
Trained on an extended dataset available on GitHub and Kaggle
Implements multi-label classification for comprehensive content analysis
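Multi-label classification means each topic is scored independently, so one text can receive several topics or none. A minimal sketch of that decoding step follows, assuming the model emits one raw logit per topic and that a sigmoid with a fixed threshold is applied. The label names, the 0.5 threshold, and the example logits are hypothetical placeholders, not the model's actual label map or output.

```python
import math

# Hypothetical subset of topic names, for illustration only;
# the real model distinguishes 18 topics.
LABELS = ["drugs", "weapons", "religion"]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_multilabel(logits, labels, threshold=0.5):
    """Return (label, probability) pairs whose probability meets the threshold.

    Each logit is squashed independently, so any number of labels can fire.
    """
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    probs = [sigmoid(z) for z in logits]
    return [(label, p) for label, p in zip(labels, probs) if p >= threshold]


# Made-up logits standing in for a classifier's output on one text.
example_logits = [2.0, -1.5, 0.3]
detected = decode_multilabel(example_logits, LABELS)
print([label for label, _ in detected])
```

Unlike a softmax, which forces the scores to compete for a single winner, independent sigmoids let a text about, say, both drugs and religion be flagged for both.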

Core Capabilities

Multi-label classification of sensitive topics in Russian text
Detection of potentially harmful content across various domains
Balanced performance across different sensitive categories
Specialized handling of nuanced content related to social issues

Frequently Asked Questions

Q: What makes this model unique?

The model performs fine-grained classification of inappropriate content rather than simple toxicity detection. It focuses on content that could harm a reputation and takes the sensitivity of the topic into account.

Q: What are the recommended use cases?

The model suits content moderation systems, corporate communication monitoring, and social media analysis, where identifying potentially sensitive or inappropriate content in Russian text is crucial. It is particularly valuable for protecting brand reputation and enforcing content guidelines.
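In a content moderation setting, per-topic scores are typically mapped to an action. The sketch below shows one possible routing rule, assuming scores arrive as a topic-to-probability mapping. The `ModerationPolicy` class, the `route` function, the topic names, and both thresholds are hypothetical choices for illustration, not part of the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModerationPolicy:
    """Hypothetical policy: which topics may trigger a block, and at what score."""
    block_topics: frozenset = frozenset({"drugs", "weapons"})
    block_threshold: float = 0.9
    review_threshold: float = 0.5


def route(scores: dict, policy: ModerationPolicy) -> str:
    """Map per-topic scores to 'block', 'review', or 'allow'."""
    # Block only when a high-risk topic is detected with high confidence.
    for topic, score in scores.items():
        if topic in policy.block_topics and score >= policy.block_threshold:
            return "block"
    # Any topic above the lower threshold goes to a human reviewer.
    if any(score >= policy.review_threshold for score in scores.values()):
        return "review"
    return "allow"


policy = ModerationPolicy()
print(route({"drugs": 0.95, "religion": 0.2}, policy))
print(route({"religion": 0.6}, policy))
print(route({"religion": 0.1}, policy))
```

Keeping the thresholds in a policy object rather than in the classifier makes it straightforward to tune them per product or per channel without touching the model.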