Fine-tuning transformers: Vocabulary transfer
Introduction
The transformer, first introduced in [1], is an architecture consisting of encoder and decoder stacks that combine self-attention with point-wise, fully connected layers. The transformer gave rise to models such as GPT [2], [3] and BERT [4]. These architectures have been shown to beat the state of the art on various Natural Language Processing (NLP) tasks. The performance of such models improves with their size, and training such architectures from scratch requires vast computational power and huge datasets. These obstacles hinder the broader adoption of these architectures and limit most successful applications to transfer learning: a huge pretrained model is fine-tuned on a smaller dataset collected for a specific downstream task. This stimulates a growing interest in transfer learning procedures and gives rise to various approaches and practices aimed at raising the effectiveness of the transfer. For a review of transfer learning methodology for transformers, see [5].
Typical tokenizations used for transformer pretraining include several thousand tokens. These vocabularies contain small chunks of words (down to the size of a single letter) as well as longer tokens that directly correspond to whole words. One can speculate that the model uses shorter tokens to capture grammatical information and to deal with long, rarely observed words, whereas longer tokens could be useful for semantically intensive problems. These longer, semantically charged tokens may vary significantly across downstream tasks. Therefore, adopting a new, downstream-specific tokenization might be beneficial for the performance of the resulting model. Indeed, various researchers have shown that corpus-specific tokenization can benefit an NLP task. For example, [6] show that the optimal vocabulary depends on the frequencies of the words in the target corpus. [7] show that the tokenization used for language model pretraining has a direct impact on the resulting performance, yet do not discuss the implications of this result for transfer learning. [8] introduce BPE-dropout, which stochastically corrupts the segmentation procedure of Byte Pair Encoding (BPE) [6], [9] and thus produces multiple segmentations within the same fixed BPE framework; using BPE-dropout during pretraining is shown to improve downstream performance. [10], [11] and [12] discuss tokenization in the setting of cross-language transfer. [13] demonstrate that replacing the embedding layers of a neural machine translation (NMT) model by projecting general word embeddings induced from monolingual data in a target domain onto a source-domain embedding space is beneficial for task performance. [14] show that extending the input and output embedding layers to account for new vocabulary items improves NMT performance. Outside the NMT setting, transformer-based models are routinely fine-tuned with the same tokenization they inherit from the initial corpus. However, many NLP tasks are not cross-lingual: for example, pretrained transformers are standardly used for text classification after fine-tuning on a task-specific corpus. Such an approach could be suboptimal, since the vocabulary and the word frequencies in the new corpus may differ significantly from those of the pretraining corpus. This paper investigates whether a new tokenization tailored to the fine-tuning corpus can improve the resulting performance of the model and speed up the transfer, and formalizes this problem as a new natural language processing task.
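To make the idea of corpus-specific tokenization concrete, the following sketch trains a new BPE vocabulary directly on a downstream corpus with the Hugging Face tokenizers library. The corpus file name, the vocabulary size, and the choice of special tokens are illustrative assumptions rather than values taken from this paper.

```python
# Sketch: learning a downstream-specific BPE vocabulary with the
# Hugging Face `tokenizers` library. File name, vocabulary size and
# special tokens below are illustrative assumptions.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Start from an empty BPE model with an unknown-token fallback.
new_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
new_tokenizer.pre_tokenizer = Whitespace()

# Learn merges and vocabulary from the fine-tuning corpus, keeping
# BERT-style special tokens so the model interface stays the same.
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
new_tokenizer.train(files=["downstream_corpus.txt"], trainer=trainer)
new_tokenizer.save("downstream_bpe.json")
```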
If one wants a new, corpus-specific vocabulary for fine-tuning the model, one can no longer use the embedding matrix obtained in the pretraining phase. One has to either learn it from scratch or come up with a fine-tuning procedure that partially preserves the information acquired by the model during pretraining. We suggest a new type of transfer learning task that we call vocabulary transfer. We define this task as finding an optimal tokenization for a specific downstream task and developing an information-preserving fine-tuning strategy for it. In this paper, we demonstrate that vocabulary transfer facilitates transfer learning both in terms of downstream task quality and in terms of the speed of the transfer. To our knowledge, this is the first work that addresses the adoption of data-specific tokenization in the context of transfer learning for transformers.
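One simple way to partially preserve pretrained information when the vocabulary changes is to initialize each new token's embedding from the old embedding matrix: copy the vector if the token already exists in the old vocabulary, and otherwise average the embeddings of the old sub-tokens that compose it. The sketch below illustrates this heuristic; it is only one possible initialization strategy, assumes Hugging Face transformers-style tokenizers, and the variable names are our own.

```python
import torch


def build_new_embeddings(old_tokenizer, new_tokenizer, old_embeddings):
    """Heuristic initialization of embeddings for a new vocabulary (sketch).

    old_embeddings: torch.Tensor of shape (old_vocab_size, hidden_dim).
    Tokens shared between the vocabularies keep their pretrained vectors;
    unseen tokens get the mean of their old sub-token embeddings.
    """
    hidden_dim = old_embeddings.size(1)
    new_vocab = new_tokenizer.get_vocab()  # token -> new id
    old_vocab = old_tokenizer.get_vocab()  # token -> old id

    new_embeddings = torch.empty(len(new_vocab), hidden_dim)
    for token, new_id in new_vocab.items():
        if token in old_vocab:
            # The token already existed: reuse its pretrained embedding.
            new_embeddings[new_id] = old_embeddings[old_vocab[token]]
        else:
            # Decompose the new token with the old tokenizer and average
            # the pretrained embeddings of the resulting sub-tokens.
            sub_ids = old_tokenizer.encode(token, add_special_tokens=False)
            if len(sub_ids) > 0:
                new_embeddings[new_id] = old_embeddings[sub_ids].mean(dim=0)
            else:
                # Fallback for tokens the old tokenizer cannot segment.
                new_embeddings[new_id] = torch.randn(hidden_dim) * 0.02
    return new_embeddings
```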
The contribution of this paper is threefold:
- we test several ways in which one can effectively leverage a model that was pretrained with a different vocabulary tokenization;
- we conduct a series of experiments showing that the adoption of a new vocabulary can indeed boost the performance of the model on downstream tasks;
- we thus build a case for broadening the scope of transfer learning to include the problem of fine-tuning a model on a new vocabulary tokenization; we call the task of effectively transferring information from an old vocabulary to a new one vocabulary transfer.
Related work
There are various attempts to facilitate transfer learning through some enhancement or preprocessing of the new training data. For example, [15] propose to inject phrasal paraphrase relations into BERT to generate suitable representations for semantic equivalence assessment instead of increasing the model's size. In this work, instead of enhancing the dataset with additional information, we try to find out whether it is possible to organize transfer learning when a new vocabulary tokenization is adopted.
Vocabulary transfer
This paper, to our knowledge, is the first to introduce the concept of vocabulary transfer. We do so as follows:
- in this section, we methodologically describe vocabulary transfer as a general problem that is open to future research;
- we propose an example of a possible solution to the problem of vocabulary transfer;
- we demonstrate that this solution improves the performance of the resulting model, but do not claim that it is optimal;
- we discuss the obtained results to stimulate further research.
Aspects of vocabulary transfer
As we have stated in Section 3, vocabulary transfer is conceptually the process of finding a dataset-specific tokenization, an initialization of the corresponding embeddings, and a fine-tuning procedure that together result in superior performance of a given NLP model. In this section, we test different popular tokenizations and investigate embedding initialization and fine-tuning procedures in more detail.
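As an illustration of how such an initialization can be combined with a pretrained checkpoint, the following sketch resizes a Hugging Face BERT model to a new vocabulary and overwrites its input embeddings before fine-tuning. The model name, the tokenizer path, and the hypothetical helper build_new_embeddings (sketched earlier) are assumptions, not the paper's exact setup.

```python
# Sketch: attaching a new vocabulary to a pretrained BERT checkpoint
# before fine-tuning. Model name, tokenizer path and the helper
# `build_new_embeddings` are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
new_tokenizer = AutoTokenizer.from_pretrained("./downstream_tokenizer")

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Reuse pretrained vectors where possible (see the helper sketched above).
old_embeddings = model.get_input_embeddings().weight.data.clone()
new_embeddings = build_new_embeddings(old_tokenizer, new_tokenizer, old_embeddings)

# Resize the embedding layer to the new vocabulary and load the
# transferred vectors; the rest of the network keeps its pretrained weights.
model.resize_token_embeddings(len(new_tokenizer))
model.get_input_embeddings().weight.data.copy_(new_embeddings)

# From here the model is fine-tuned on the downstream task as usual,
# e.g. with the standard Trainer API.
```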
Discussion
Though fine-tuning BERT over a new tokenization seems beneficial across various vocabulary sizes and speeds up the transfer for both downstream datasets, one has to address the discrepancy between the benefits of vocabulary transfer for the hyperpartisan news detection and Sentiment140 datasets on the one hand, and the Quora insincere questions dataset on the other. Indeed, Table 3 shows that vocabulary transfer is much more useful for hyperpartisan news than for Quora. Let us briefly discuss this difference, since it provides an illustrative example.
Conclusion
This paper studies the effect of dataset-specific tokenization on the fine-tuning of a transformer-based architecture. We carry out experiments that demonstrate that a dataset-specific vocabulary paired with procedures for the initialization and fine-tuning of the embeddings facilitates transfer learning. We call this phenomenon vocabulary transfer.
We discuss three aspects of vocabulary transfer: tokenization, initialization, and fine-tuning. We demonstrate that dataset-specific tokenization is beneficial for the downstream performance of the resulting model.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (21)
- Vaswani et al., Attention is all you need
- Radford et al., Improving language understanding by generative pre-training
- Radford et al., Language models are unsupervised multitask learners, OpenAI Blog (2019)
- Devlin et al., BERT: pre-training of deep bidirectional transformers for language understanding
- Raffel et al., Exploring the limits of transfer learning with a unified text-to-text transformer
- Sennrich et al., Neural machine translation of rare words with subword units
- Bostrom and Durrett, Byte pair encoding is suboptimal for language model pretraining
- Provilkov et al., BPE-dropout: simple and effective subword regularization
- Gage, A new algorithm for data compression, C Users J. (1994)
- Lakew et al., Controlling the output length of neural machine translation
The publication was supported by the grant for research centers in the field of AI provided by the Analytical Center for the Government of the Russian Federation (ACRF) in accordance with the agreement on the provision of subsidies (identifier of the agreement 000000D730321P5Q0002) and the agreement with HSE University No. 70-2021-00139. The article was prepared with the support of the Yandex DataSphere service from the Yandex Cloud platform. https://cloud.yandex.com/en/services/datasphere.