Fine-tuning transformers: Vocabulary transfer
Introduction
The transformer, first introduced in [1], is an architecture consisting of encoder and decoder stacks that combine self-attention with point-wise, fully connected layers. The transformer gave rise to models such as GPT [2], [3] and BERT [4]. These architectures have been shown to beat the state of the art on various Natural Language Processing (NLP) tasks. The performance of such models improves with their size, and training such architectures from scratch requires vast computational power and huge datasets. These obstacles hinder the broader adoption of these architectures and limit most successful applications to transfer learning: a huge pretrained model is fine-tuned on a smaller dataset collected for a specific downstream task. This stimulates a growing interest in transfer learning procedures and gives rise to various approaches and practices aimed at raising the effectiveness of the transfer. For a review of transfer learning methodology for transformers, see [5].
Typical tokenizations used for transformer pretraining include several thousand tokens. These vocabularies contain small chunks of words (down to the size of a single letter) as well as longer tokens that directly correspond to whole words. One can speculate that the model uses shorter tokens to capture grammatical information and to deal with long, rarely observed words, whereas longer tokens could be useful for semantically intensive problems. These longer, semantically charged tokens may vary significantly across downstream tasks. Therefore, adopting a new, downstream-specific tokenization might be beneficial for the performance of the resulting model. Indeed, various researchers have shown that corpus-specific tokenization can benefit an NLP task. For example, [6] show that the optimal vocabulary depends on the frequencies of the words in the target corpus. [7] show that the tokenization used for language model pretraining has a direct impact on the resulting performance, yet do not discuss the implications of this result for transfer learning. [8] introduce BPE-dropout, which stochastically corrupts the segmentation procedure of Byte Pair Encoding (BPE) [6], [9] and thus produces multiple segmentations within the same fixed BPE framework; using BPE-dropout during pretraining is shown to improve downstream performance. [10], [11] and [12] discuss tokenization in the setting of cross-language transfer. [13] demonstrate that replacing the embedding layers of a neural machine translation (NMT) model by projecting general word embeddings induced from monolingual data in a target domain onto a source-domain embedding space is beneficial for task performance. [14] show that extending the input and output embedding layers to account for new vocabulary items improves NMT performance. Outside the NMT setting, transformer-based models are routinely fine-tuned with the same tokenization they inherit from the initial corpus. However, many NLP tasks are not cross-lingual: for example, pretrained transformers are standardly used for text classification after fine-tuning on a task-specific corpus. Such an approach could be suboptimal, since the vocabulary and the word frequencies in the new corpus may differ significantly from those of the pretraining corpus. This paper investigates whether a new tokenization tailored to the fine-tuning corpus can improve the resulting performance of the model and speed up the transfer, and formalizes this problem as a new natural language processing task.
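To make the idea of corpus-specific tokenization concrete, the following sketch trains a new BPE vocabulary directly on a downstream corpus with the Hugging Face tokenizers library. The corpus file name, the vocabulary size, and the choice of special tokens are illustrative assumptions rather than values taken from this paper.

```python
# Sketch: learning a downstream-specific BPE vocabulary with the
# Hugging Face `tokenizers` library. File name, vocabulary size and
# special tokens below are illustrative assumptions.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Start from an empty BPE model with an unknown-token fallback.
new_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
new_tokenizer.pre_tokenizer = Whitespace()

# Learn merges and vocabulary from the fine-tuning corpus, keeping
# BERT-style special tokens so the model interface stays the same.
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
new_tokenizer.train(files=["downstream_corpus.txt"], trainer=trainer)
new_tokenizer.save("downstream_bpe.json")
```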
If one wants a new, corpus-specific vocabulary for fine-tuning the model, one can no longer use the embedding matrix obtained in the pretraining phase. One has to either learn it from scratch or come up with a fine-tuning procedure that partially preserves the information acquired by the model during pretraining. We suggest a new type of transfer learning task that we call vocabulary transfer. We define this task as finding an optimal tokenization for a specific downstream task and developing an information-preserving fine-tuning strategy for it. In this paper, we demonstrate that vocabulary transfer facilitates transfer learning both in terms of downstream task quality and in terms of the speed of the transfer. To our knowledge, this is the first work that addresses the adoption of data-specific tokenization in the context of transfer learning for transformers.
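One simple way to partially preserve pretrained information when the vocabulary changes is to initialize each new token's embedding from the old embedding matrix: copy the vector if the token already exists in the old vocabulary, and otherwise average the embeddings of the old sub-tokens that compose it. The sketch below illustrates this heuristic; it is only one possible initialization strategy, assumes Hugging Face transformers-style tokenizers, and the variable names are our own.

```python
import torch


def build_new_embeddings(old_tokenizer, new_tokenizer, old_embeddings):
    """Heuristic initialization of embeddings for a new vocabulary (sketch).

    old_embeddings: torch.Tensor of shape (old_vocab_size, hidden_dim).
    Tokens shared between the vocabularies keep their pretrained vectors;
    unseen tokens get the mean of their old sub-token embeddings.
    """
    hidden_dim = old_embeddings.size(1)
    new_vocab = new_tokenizer.get_vocab()  # token -> new id
    old_vocab = old_tokenizer.get_vocab()  # token -> old id

    new_embeddings = torch.empty(len(new_vocab), hidden_dim)
    for token, new_id in new_vocab.items():
        if token in old_vocab:
            # The token already existed: reuse its pretrained embedding.
            new_embeddings[new_id] = old_embeddings[old_vocab[token]]
        else:
            # Decompose the new token with the old tokenizer and average
            # the pretrained embeddings of the resulting sub-tokens.
            sub_ids = old_tokenizer.encode(token, add_special_tokens=False)
            if len(sub_ids) > 0:
                new_embeddings[new_id] = old_embeddings[sub_ids].mean(dim=0)
            else:
                # Fallback for tokens the old tokenizer cannot segment.
                new_embeddings[new_id] = torch.randn(hidden_dim) * 0.02
    return new_embeddings
```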
The contribution of this paper is threefold:
- we test several ways in which one can effectively leverage a model that was pretrained with a different vocabulary tokenization;
- we conduct a series of experiments showing that the adoption of a new vocabulary can indeed boost the performance of the model on downstream tasks;
- we thus build a case for broadening the scope of transfer learning to include the problem of fine-tuning a model on a new vocabulary tokenization; we call the task of effectively transferring information from an old vocabulary to a new one vocabulary transfer.
Related work
There are various attempts to facilitate transfer learning through some enhancement or preprocessing of the new training data. For example, [15] propose to inject phrasal paraphrase relations into BERT to generate suitable representations for semantic equivalence assessment instead of increasing the model's size. In this work, instead of enhancing the dataset with additional information, we try to find out whether it is possible to organize transfer learning when a new vocabulary tokenization is adopted.
Vocabulary transfer
This paper, to our knowledge, is the first to introduce the concept of vocabulary transfer. We do so as follows:
- in this section, we methodologically describe vocabulary transfer as a general problem that is open to future research;
- we propose an example of a possible solution to the problem of vocabulary transfer;
- we demonstrate that this solution improves the performance of the resulting model, but do not claim that it is optimal;
- we discuss the obtained results to stimulate further research.
Aspects of vocabulary transfer
As we have stated in Section 3, vocabulary transfer is conceptually the process of finding a dataset-specific tokenization, an initialization of the corresponding embeddings, and a fine-tuning procedure that together result in superior performance of a given NLP model. In this section, we test different popular tokenizations and investigate embedding initialization and fine-tuning procedures in more detail.
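As an illustration of how such an initialization can be combined with a pretrained checkpoint, the following sketch resizes a Hugging Face BERT model to a new vocabulary and overwrites its input embeddings before fine-tuning. The model name, the tokenizer path, and the hypothetical helper build_new_embeddings (sketched earlier) are assumptions, not the paper's exact setup.

```python
# Sketch: attaching a new vocabulary to a pretrained BERT checkpoint
# before fine-tuning. Model name, tokenizer path and the helper
# `build_new_embeddings` are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
new_tokenizer = AutoTokenizer.from_pretrained("./downstream_tokenizer")

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Reuse pretrained vectors where possible (see the helper sketched above).
old_embeddings = model.get_input_embeddings().weight.data.clone()
new_embeddings = build_new_embeddings(old_tokenizer, new_tokenizer, old_embeddings)

# Resize the embedding layer to the new vocabulary and load the
# transferred vectors; the rest of the network keeps its pretrained weights.
model.resize_token_embeddings(len(new_tokenizer))
model.get_input_embeddings().weight.data.copy_(new_embeddings)

# From here the model is fine-tuned on the downstream task as usual,
# e.g. with the standard Trainer API.
```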
Discussion
Though fine-tuning BERT over a new tokenization seems beneficial across various vocabulary sizes and speeds up the transfer for both downstream datasets, one has to address the discrepancy between the benefits of vocabulary transfer for the hyperpartisan news detection and Sentiment140 datasets on the one hand, and the Quora insincere questions dataset on the other. Indeed, Table 3 shows that vocabulary transfer is much more useful for hyperpartisan news than for Quora. Let us briefly discuss this difference, since it provides an illustrative example.
Conclusion
This paper studies the effect of dataset-specific tokenization on the fine-tuning of a transformer-based architecture. We carry out experiments that demonstrate that a dataset-specific vocabulary paired with procedures for the initialization and fine-tuning of the embeddings facilitates transfer learning. We call this phenomenon vocabulary transfer.
We discuss three aspects of vocabulary transfer: tokenization, initialization, and fine-tuning. We demonstrate that dataset-specific tokenization is beneficial for the downstream performance of the resulting model.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (21)
- Vaswani et al., Attention is all you need
- Radford et al., Improving language understanding by generative pre-training
- Radford et al., Language models are unsupervised multitask learners, OpenAI Blog (2019)
- Devlin et al., BERT: pre-training of deep bidirectional transformers for language understanding
- Raffel et al., Exploring the limits of transfer learning with a unified text-to-text transformer
- Sennrich et al., Neural machine translation of rare words with subword units
- Bostrom and Durrett, Byte pair encoding is suboptimal for language model pretraining
- Provilkov et al., BPE-dropout: simple and effective subword regularization
- Gage, A new algorithm for data compression, C Users J. (1994)
- Lakew et al., Controlling the output length of neural machine translation
The publication was supported by the grant for research centers in the field of AI provided by the Analytical Center for the Government of the Russian Federation (ACRF) in accordance with the agreement on the provision of subsidies (identifier of the agreement 000000D730321P5Q0002) and the agreement with HSE University No. 70-2021-00139. The article was prepared with the support of the Yandex DataSphere service from the Yandex Cloud platform. https://cloud.yandex.com/en/services/datasphere.