Extreme Classification

Introduction

Extreme classification is an emerging subfield of machine learning that focuses on multi-label classification problems with a very large number of classes, often in the thousands or even millions. Unlike traditional classification tasks, which may involve a handful of classes, extreme classification aims to assign each instance to one or more classes from an extremely large set. This poses distinct challenges for computational complexity, memory usage, and algorithmic design.

Importance

The significance of extreme classification is underscored by its applicability in various domains. For instance, it is crucial in information retrieval for tagging web pages with multiple relevant categories from a large taxonomy. In natural language processing, it can be used for fine-grained sentiment analysis or topic categorization. E-commerce platforms employ extreme classification for product categorization among a vast array of product types.

Challenges

Computational Complexity: Handling a large number of classes requires efficient algorithms that can scale.
Memory Usage: Storing and processing a large label set can be memory-intensive.
Imbalanced Data: Label frequencies are usually highly skewed, with a few classes covering many instances and many classes covering only a few.
Evaluation Metrics: Traditional metrics like accuracy can be misleading; specialized metrics like Precision@k are often more appropriate.
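
To make the Precision@k metric mentioned above concrete, here is a minimal sketch in plain Python. It assumes the common definition: the fraction of the k highest-scored labels that are actually relevant for an instance. The function name `precision_at_k` and the toy scores are illustrative and not taken from any particular library.

```python
from typing import Sequence, Set


def precision_at_k(scores: Sequence[float], relevant: Set[int], k: int) -> float:
    """Fraction of the k highest-scored labels that are truly relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank label indices by descending score; ties keep the lower index first.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    top_k = ranked[:k]
    hits = sum(1 for label in top_k if label in relevant)
    return hits / k


# Toy example (made-up numbers): 6 labels, the instance truly has labels 1 and 4.
scores = [0.10, 0.80, 0.05, 0.60, 0.70, 0.20]
relevant = {1, 4}

p1 = precision_at_k(scores, relevant, k=1)  # top-1 is label 1, which is relevant
p3 = precision_at_k(scores, relevant, k=3)  # top-3 is labels 1, 4, 3; two are relevant
print(p1, p3)
```

Because only the top of the ranking is scored, the metric stays meaningful when almost all of the labels are irrelevant for any given instance, which is the case where plain accuracy becomes uninformative.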

Our Approach

Our approach addresses a gap in multi-label document classification by incorporating the textual information of label names. We use a triplet transformer network to embed both documents and labels into a joint vector space, which turns the classification task into a document similarity problem. Inference is efficient, because classifying a document amounts to finding the labels closest to it in the vector space. We validated the approach on a real-world dataset, where it showed competitive performance. The methodology also holds promise for extreme classification scenarios, where the large number of classes is the main challenge.

Melsbach, J., Stahlmann, S., Hirschmeier, S., & Schoder, D. (2022, September). Triplet transformer network for multi-label document classification. In Proceedings of the 22nd ACM Symposium on Document Engineering (pp. 1-4).
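
The idea of classifying by similarity in a joint vector space can be sketched in a few lines of plain Python. This is an illustration only, not the implementation from the paper: the cosine distance, the margin of 0.5, the function names, and the hand-written 3-dimensional vectors (standing in for transformer embeddings of documents and label names) are all assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.5) -> float:
    """Hinge-style triplet loss on cosine distance (1 - cosine similarity).

    The loss is zero once the anchor document is closer to the correct label
    than to the wrong label by at least `margin` (an assumed value).
    """
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def nearest_labels(doc_vec: Sequence[float],
                   label_vecs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the k label names whose embeddings are most similar to doc_vec."""
    ranked = sorted(label_vecs, key=lambda name: -cosine(doc_vec, label_vecs[name]))
    return ranked[:k]


# Hypothetical embeddings; a real system would obtain these from the encoder.
label_vecs = {
    "sports": [1.0, 0.0, 0.0],
    "politics": [0.0, 1.0, 0.0],
    "finance": [0.0, 0.0, 1.0],
}
doc_vec = [0.9, 0.1, 0.4]

predicted = nearest_labels(doc_vec, label_vecs, k=2)
loss_ok = triplet_loss(doc_vec, label_vecs["sports"], label_vecs["politics"])
loss_bad = triplet_loss(doc_vec, label_vecs["politics"], label_vecs["sports"])
print(predicted, loss_ok, loss_bad)
```

The exhaustive scan in `nearest_labels` is linear in the number of labels. With millions of labels one would typically replace it with an approximate nearest-neighbor index, which is what makes the similarity formulation attractive for extreme classification.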

UNIVERSITÄT ZU KÖLN

Wirtschafts- und Sozialwissenschaftliche Fakultät

Seminar für Wirtschaftsinformatik und Informationsmanagement