A taxonomy of Transformer-based pre-trained language models (T-PTLMs)
We follow on from our two previous posts:
- Opportunities and Risks of foundation models
- Understanding self-supervised learning
In this post, we walk through a taxonomy of T-PTLMs: Transformer-based pre-trained language models.
The post is based on a paper which covers this topic extensively:
AMMUS: A Survey of Transformer-based Pretrained Models in Natural …
Katikapalli Subramanyam Kalyan, Ajit Rajasekharan, and Sivanesan Sa…
Transformer-based pre-trained language models (T-PTLMs) are a complex and fast-growing area of AI, so I recommend this paper as a good way to understand and navigate the landscape.
We can classify T-PTLMs from four perspectives:
- Pretraining corpus
- Model architecture
- Type of SSL (self-supervised learning)
- Extensions
Pretraining corpus
General pretraining: models like GPT-1 and BERT are pretrained on general corpora. For example, GPT-1 is pretrained on the BooksCorpus, while BERT and UniLM are pretrained on English Wikipedia plus the BooksCorpus. Pretraining on broad text drawn from multiple sources gives the model general-purpose language knowledge.
Social media-based: models can also be pretrained on social media text such as tweets (for example, BERTweet), which differs in style and vocabulary from general corpora.
Language-based: models can be pretrained monolingually (on a single language) or multilingually (on many languages, e.g. mBERT or XLM-R).
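As a small illustration (my own sketch, not from the survey), general-purpose corpora of this kind are readily available through the Hugging Face datasets library; WikiText-103, a Wikipedia-derived dataset, is just one convenient example of such a corpus:

```python
from datasets import load_dataset

# Load a general-purpose corpus of the kind used for general pretraining.
# WikiText-103 is a Wikipedia-derived dataset; any broad text corpus would do.
corpus = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

print(corpus)                    # number of rows and the "text" column
print(corpus[10]["text"][:200])  # a sample passage of raw text
```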
Architecture
T-PTLMs can also be classified by their architecture. A T-PTLM can be pretrained using a stack of encoders, a stack of decoders, or both. Hence, you could have architectures that are:
- Encoder-based (e.g. BERT)
- Decoder-based (e.g. GPT-1)
- Encoder-decoder based (e.g. T5, BART)
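To make the three families concrete, here is a minimal sketch (my own, not from the survey) that loads one well-known example of each family via the Hugging Face transformers library:

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Encoder-only: BERT produces contextual representations for every input token.
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Decoder-only: GPT-2 generates text left to right (GPT-1 weights are less commonly hosted).
decoder = AutoModelForCausalLM.from_pretrained("gpt2")

# Encoder-decoder: T5 maps an input sequence to an output sequence.
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

print(type(encoder).__name__, type(decoder).__name__, type(encoder_decoder).__name__)
```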
Type of SSL
Self-supervised learning (SSL) is one of the key ingredients in building T-PTLMs. A T-PTLM can be developed by pretraining with generative, contrastive, adversarial, or hybrid SSL objectives. Hence, based on the SSL used, you could have:
- Generative SSL (e.g. predicting masked or next tokens)
- Contrastive SSL (e.g. deciding whether two text segments belong together)
- Adversarial SSL (e.g. detecting replaced or shuffled tokens)
- Hybrid SSL (a combination of the above)
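As a concrete illustration of the generative flavour, here is a small sketch (my own, using the Hugging Face transformers library) of the masked language modeling objective behind BERT, where the training labels come from the text itself rather than from human annotation:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

# Generative SSL via masked language modeling: the model is trained to recover masked tokens.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("Transformers learn from unlabeled text.", return_tensors="pt")
batch = collator([{k: v[0] for k, v in encoding.items()}])  # randomly masks ~15% of tokens

outputs = model(**batch)  # labels are created by the collator, so no human labels are needed
print(outputs.loss)       # the self-supervised pretraining loss
```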
Extensions
Finally, based on extensions, you can classify T-PTLMs into the following categories:
- Compact T-PTLMs: aim to reduce the size of T-PTLMs and make them faster using model compression techniques such as pruning, parameter sharing, knowledge distillation, and quantization (see the short sketch after this list).
- Character-based T-PTLMs: build word representations from characters rather than subwords. For example, CharacterBERT uses a CharCNN + Highway layer to generate word representations from character embeddings before applying the transformer encoder layers; AlphaBERT is another example.
- Green T-PTLMs: focus on environmentally friendly methods that reduce the computational cost of pretraining and fine-tuning.
- Sentence-based T-PTLMs: extend T-PTLMs like BERT to generate high-quality sentence embeddings (e.g. Sentence-BERT).
- Tokenization-free T-PTLMs: avoid explicit tokenizers for splitting input sequences, which helps with languages such as Chinese or Thai that do not use white space or punctuation as word separators.
- Large-scale T-PTLMs: the performance of T-PTLMs is strongly related to their overall scale rather than to depth or width alone, so these models aim to increase the number of parameters (e.g. GPT-3).
- Knowledge-enriched T-PTLMs: standard T-PTLMs learn knowledge only implicitly during pretraining over large volumes of text; knowledge-enriched models additionally inject explicit knowledge, for example from knowledge graphs, into the model.
- Long-sequence T-PTLMs: standard self-attention scales quadratically with sequence length, so variants like sparse self-attention and linearized self-attention are used to reduce this complexity and extend T-PTLMs to long input sequences.
- Efficient T-PTLMs: improve on the base architecture or training recipe, e.g. DeBERTa, which improves the BERT model using a disentangled attention mechanism and an enhanced mask decoder.
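As a quick illustration of the compact category (my own sketch, not from the survey), you can compare the parameter counts of a full-size encoder and its distilled counterpart with the Hugging Face transformers library:

```python
from transformers import AutoModel

# Compare a full-size encoder with a compact variant produced by knowledge distillation.
for name in ["bert-base-uncased", "distilbert-base-uncased"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```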
This is a complex area, and I hope the taxonomy above is useful. The paper I referred to goes into much more detail and makes a great effort at explaining such a complex landscape.
The post is based on the paper referenced above, which is also the source of the images.