A taxonomy of Transformer based pre-trained language models (TPTLM)
We follow on from our two previous posts
In this post, we look at the taxonomy of TPTLM: Transformer-based pre-trained language models.
The post is based on a paper which covers this topic extensively:
Transformer-based pre-trained language models (TPTLM) are a complex and fast-growing area of AI, so I recommend this paper as a good way to understand and navigate the landscape.
We can classify TPTLM from four perspectives:
- Pretraining Corpus
- Model Architecture
- Type of SSL (self-supervised learning)
- Extensions
Pretraining Corpus-based models
General pretraining: Models like GPT-1, BERT, etc. are pretrained on a general corpus. For example, GPT-1 is pretrained on the Books corpus, while BERT and UniLM are pretrained on English Wikipedia and the Books corpus.
This form of pretraining is more general because it draws on multiple sources of information.
Social Media-based: Models can be pretrained on text from social media.
Language-based: Models can be pretrained on a single language (monolingual) or on several languages (multilingual).
TPTLM can also be classified based on their architecture. A T-PTLM can be pretrained using a stack of encoders, a stack of decoders, or both.
Hence, you could have architectures that are
- Encoder based
- Decoder based
- Encoder-Decoder based
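One practical difference between encoder and decoder stacks is which positions each token is allowed to attend to. The sketch below is my own illustration, not code from the paper: the function name `attention_mask` and the toy sequence length of 4 are assumptions. It builds the bidirectional mask typical of encoders and the causal mask typical of decoders.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len matrix of 0/1 flags.

    mask[i][j] == 1 means position i may attend to position j.
    `attention_mask` is a hypothetical helper written for this post.
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Encoder-style: every token can see every other token (bidirectional).
encoder_mask = attention_mask(4, causal=False)

# Decoder-style: a token can see only itself and earlier tokens (causal).
decoder_mask = attention_mask(4, causal=True)

print("encoder mask:")
for row in encoder_mask:
    print(row)

print("decoder mask:")
for row in decoder_mask:
    print(row)
```

An encoder-decoder model combines the two: its encoder attends bidirectionally over the input, while its decoder attends causally over the output generated so far and also attends to the encoder's output.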
Self-supervised learning (SSL) is one of the key ingredients in building T-PTLMs.
A T-PTLM can be developed by pretraining using Generative, Contrastive, Adversarial, or Hybrid SSL. Hence, based on the type of SSL, you could have
- Generative SSL
- Contrastive SSL
- Adversarial SSL
- Hybrid SSL
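To make the idea of self-supervision concrete, here is a minimal sketch of two objectives that are commonly grouped under generative SSL: masked language modeling and next-token prediction. In both cases the labels come from the text itself, so no human annotation is needed. The names `mask_tokens` and `next_token_pairs`, the `[MASK]` string, and the masking probability used in the demo are my own illustrative assumptions, and the masking is simplified (real recipes such as BERT's sometimes keep or randomly replace a selected token instead of always masking it).

```python
import random

MASK = "[MASK]"  # placeholder string chosen for this sketch


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Masked language modeling: hide some tokens and keep them as labels."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must reconstruct this token
        else:
            inputs.append(tok)
            labels.append(None)  # no prediction needed at this position
    return inputs, labels


def next_token_pairs(tokens):
    """Next-token prediction: each prefix is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "the cat sat on the mat".split()

inputs, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(inputs)
print(labels)

for context, target in next_token_pairs(tokens):
    print(context, "->", target)
```

Contrastive and adversarial SSL build their training signal differently (by comparing examples, or by telling original tokens from corrupted ones), so this sketch covers only the generative case.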
Based on extensions, you can classify TPTLMs according to the following categories
- Compact T-PTLMs: aim to reduce the size of the T-PTLMs and make them faster using a variety of model compression techniques like pruning, parameter sharing, knowledge distillation, and quantization.
- Character-based T-PTLMs: e.g. CharacterBERT, which uses a CharCNN+Highway layer to generate word representations from character embeddings and then applies transformer encoder layers. Another example is AlphaBERT.
- Green T-PTLMs: focus on environmentally friendly methods
- Sentence-based T-PTLMs: extend T-PTLMs like BERT to generate quality sentence embeddings.
- Tokenization-Free T-PTLMs: avoid the use of explicit tokenizers to split input sequences, to cater for languages such as Chinese or Thai that do not use white space or punctuation as word separators.
- Large Scale T-PTLMs: Performance of T-PTLMs is strongly related to the scale rather than the depth or width of the model. These models aim to increase the number of parameters of the model.
- Knowledge Enriched T-PTLMs: T-PTLMs are developed by pretraining over large volumes of text data. During pretraining, the model learns
- Long-Sequence T-PTLMs: self-attention variants like sparse self-attention and linearized self-attention are proposed to reduce the complexity of self-attention, which grows quadratically with sequence length, and hence extend T-PTLMs to long input sequences.
- Efficient T-PTLMs: e.g. DeBERTa, which improves on the BERT model using a disentangled attention mechanism and an enhanced mask decoder.
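As a rough illustration of the Long-Sequence item in the list above, the sketch below counts how many (query, key) pairs attention has to score. Full self-attention scores every pair, while a sliding window is one possible sparse pattern in which each token attends only to nearby tokens. The post does not say which sparse pattern a given model uses, so the sliding window, the function names, and the window size of 64 are all assumptions made for this example.

```python
def full_attention_pairs(seq_len):
    """Full self-attention: every token attends to every token."""
    return seq_len * seq_len


def sliding_window_pairs(seq_len, window):
    """Sparse attention: each token attends to keys at most `window` positions away."""
    total = 0
    for i in range(seq_len):
        lo = max(0, i - window)            # leftmost key this token can see
        hi = min(seq_len - 1, i + window)  # rightmost key this token can see
        total += hi - lo + 1
    return total


for n in (512, 4096):
    full = full_attention_pairs(n)
    sparse = sliding_window_pairs(n, window=64)
    print(f"seq_len={n}: full={full}, sliding window={sparse}")
```

With a fixed window, the number of scored pairs grows roughly linearly with the sequence length instead of quadratically, which is why patterns like this make longer inputs affordable.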
This is a complex area and I hope the taxonomy above is useful. The paper I referred to provides more detail and makes a great effort at explaining such a complex landscape.
Image source: the paper this post is based on.