Leveraging Pre-Trained Language Models for Document Classification
Holger Keibel of Karakun and Daniele Puccinelli of SUPSI gave the following presentation at the AI-SDV conference in October 2021.

The EXTRA classifier is a scalable solution based on recent advances in Natural Language Processing (NLP). The foundational concept of the EXTRA classifier is transfer learning, a machine learning process that enables the relatively low-cost specialization of a pre-trained language model to a specific task in a specific domain with far fewer training examples compared to standard machine learning solutions.

More specifically, the EXTRA classifier leverages BERT, a well-known pre-trained autoencoding language model that has revolutionized the NLP space in the past few years. BERT provides contextual embeddings, i.e., context-aware vector representations of words that capture semantics far more effectively than their context-free counterparts.
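The effect of contextual embeddings can be illustrated with a short sketch using the Hugging Face `transformers` library. The model name and the two example sentences below are illustrative assumptions, not details of the EXTRA system: the same surface word ("bank") receives different vectors depending on its sentence context, which is exactly what a context-free embedding cannot do.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Illustrative choice of checkpoint; EXTRA's actual model is not specified here.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sents = ["He sat by the river bank.", "She deposited cash at the bank."]
emb = []
for s in sents:
    inputs = tok(s, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Locate the token position of "bank" and keep its contextual vector.
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    emb.append(out.last_hidden_state[0, idx])

# The two "bank" vectors differ because their contexts differ.
cos = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(f"cosine similarity between the two 'bank' vectors: {cos:.3f}")
```

A static embedding table would return the identical vector for both occurrences; here the similarity is well below 1.0.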

The EXTRA classifier contains a pre-processing module to cope with the inevitable noise in the output of standard Optical Character Recognition systems. The pre-processed plain text from a source document is then fed into a BERT-based classifier, which is built by extending pre-trained BERT with an additional linear layer trained for classification through a process commonly known as fine-tuning.
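The fine-tuning step described above can be sketched as follows with the Hugging Face `transformers` library, which attaches a linear classification head on top of the pre-trained encoder. The checkpoint, label set, sample text, and hyperparameters are illustrative assumptions, not EXTRA's actual configuration:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Hypothetical document categories; the real EXTRA label set is not given here.
labels = ["patent", "contract", "invoice"]

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# This wrapper adds an untrained linear layer (hidden_size -> num_labels)
# on top of pre-trained BERT; fine-tuning trains it together with the encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

# One fine-tuning step on a toy example; real training loops over many batches
# of pre-processed OCR text.
batch = tok(["Claim 1: a method for classifying documents ..."],
            return_tensors="pt", truncation=True, max_length=512)
batch["labels"] = torch.tensor([0])  # gold label index for this example

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch).loss  # cross-entropy against the gold label
loss.backward()
optimizer.step()
print(f"loss after one step: {loss.item():.3f}")
```

Because only a thin head is added and the encoder starts from pre-trained weights, this setup typically needs far fewer labeled examples than training a classifier from scratch, which is the transfer-learning advantage the post describes.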

We will present preliminary results showing clear benefits over rule-based solutions in terms of both classification performance and system scalability.

A link to the slides from this presentation is available below.

