Model Design¶
Parameters¶
The input text will go through an embedding / encoding process, which will map the string into a matrix:
Model Name  | 
\(n_{parameter}\)  | 
\(n_{layer}\)  | 
\(d_{model}\)  | 
\(n_{head}\)  | 
\(d_{head}\)  | 
|---|---|---|---|---|---|
GPT-2 Small  | 
117M  | 
12  | 
768  | 
12  | 
64  | 
GPT-3[1]  | 
175B  | 
96  | 
12288  | 
96  | 
128  | 
\(n_{parameter}\) - total number of trainable parameters
\(n_{layer}\) - number of the decoder-only transformer layers
\(d_{model}\) - the original explain is “number of units in each bottleneck layer”. It first denotes the vector length of input embedding / encoding.
\(d_{ff}\) - number of unit in the hidden states of the feed-forward layer. It is designed to be four times of \(d_{model}\).
\(d_{head}\) - dimension of each attention head
Workflow Chart¶
diagram of the GPT-3 process
                                <---------- 50257 ---------->
                                ============= 1 =============
                                ============= 2 =============
                                ============ ... ============
                                ============ 2048 ===========
                               |||                         |||
                               |||                         |||
                              \|||/                       \|||/
                               \|/                         \|/
                         token embedding           positional encoding
                              (WTE)                       (WPE)
                         <--- 12288 --->             <--- 12288 --->
                         ====== 1 ======             ====== 1 ======
                         ====== 2 ======             ====== 2 ======
                         ===== ... =====             ===== ... =====
                         ===== 2048 ====             ===== 2048 ====
                                |                           |
                                |                           |
                                -----------------------------
                                             |||
                                             |||
                                            \|||/
                                             \|/
                                           plus (+)
                                       <--- 12288 --->
                                       ====== 1 ======
                                       ====== 2 ======
                                       ===== ... =====
                                       ===== 2048 ====
                                             |||
                                             |||
                                            \|||/
                                             \|/
                                   1st Decoder Transformer
                                       <--- 12288 --->
                                       ====== 1 ======
                                       ====== 2 ======
                                       ===== ... =====
                                       ===== 2048 ====
                                             |||
                                             |||
                                            \|||/
                                             \|/
                                             ...
                                             |||
                                             |||
                                            \|||/
                                             \|/
                                   96th Decoder Transformer
                                       <--- 12288 --->
                                       ====== 1 ======
                                       ====== 2 ======
                                       ===== ... =====
                                       ===== 2048 ====
                                             |||
                                             |||
                                            \|||/
                                             \|/
                                   inverse token embedding
                                     (WTE^(-1) + softmax)
                                <---------- 50257 ---------->
                                ============= 1 =============
                                ============= 2 =============
                                ============ ... ============
                                ============ 2048 ===========
Back to GPT.