Model Design

Parameters

The input text goes through an embedding / encoding process that maps the string into a matrix. The hyperparameters that set the sizes involved are:

Model Name	\(n_{parameter}\)	\(n_{layer}\)	\(d_{model}\)	\(n_{head}\)	\(d_{head}\)
GPT-2 Small	117M	12	768	12	64
GPT-3[1]	175B	96	12288	96	128

\(n_{parameter}\) - total number of trainable parameters
\(n_{layer}\) - number of decoder-only transformer layers
\(d_{model}\) - described in the original paper as the “number of units in each bottleneck layer”. It first appears as the vector length of each input embedding / encoding.
\(d_{ff}\) - number of units in the hidden layer of the feed-forward block. It is designed to be four times \(d_{model}\).
\(n_{head}\) - number of attention heads in each layer
\(d_{head}\) - dimension of each attention head
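
The relationships between these quantities can be checked with a small sketch. The `GPTConfig` class below is a hypothetical helper, not part of any library. It assumes \(d_{head} = d_{model} / n_{head}\), which matches both rows of the table, and takes \(d_{ff} = 4 \cdot d_{model}\) from the definition above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTConfig:
    """Hypothetical container for the hyperparameters listed in the table."""
    name: str
    n_layer: int
    d_model: int
    n_head: int

    @property
    def d_head(self) -> int:
        # Assumption: the model width is split evenly across the heads.
        return self.d_model // self.n_head

    @property
    def d_ff(self) -> int:
        # Feed-forward hidden size, four times d_model as stated above.
        return 4 * self.d_model


gpt2_small = GPTConfig("GPT-2 Small", n_layer=12, d_model=768, n_head=12)
gpt3 = GPTConfig("GPT-3", n_layer=96, d_model=12288, n_head=96)

for cfg in (gpt2_small, gpt3):
    print(cfg.name, "d_head =", cfg.d_head, "d_ff =", cfg.d_ff)
# GPT-2 Small d_head = 64 d_ff = 3072
# GPT-3 d_head = 128 d_ff = 49152
```

The derived \(d_{head}\) values agree with the table, so only \(n_{layer}\), \(d_{model}\) and \(n_{head}\) need to be stored explicitly.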

Workflow Chart

Diagram of the GPT-3 process:

                                <---------- 50257 ---------->

                                ============= 1 =============
                                ============= 2 =============
                                ============ ... ============
                                ============ 2048 ===========

                               |||                         |||
                               |||                         |||
                              \|||/                       \|||/
                               \|/                         \|/

                         token embedding           positional encoding
                              (WTE)                       (WPE)

                         <--- 12288 --->             <--- 12288 --->

                         ====== 1 ======             ====== 1 ======
                         ====== 2 ======             ====== 2 ======
                         ===== ... =====             ===== ... =====
                         ===== 2048 ====             ===== 2048 ====

                                |                           |
                                |                           |
                                -----------------------------
                                             |||
                                             |||
                                            \|||/
                                             \|/

                                           plus (+)

                                       <--- 12288 --->

                                       ====== 1 ======
                                       ====== 2 ======
                                       ===== ... =====
                                       ===== 2048 ====

                                             |||
                                             |||
                                            \|||/
                                             \|/

                                   1st Decoder Transformer

                                       <--- 12288 --->

                                       ====== 1 ======
                                       ====== 2 ======
                                       ===== ... =====
                                       ===== 2048 ====

                                             |||
                                             |||
                                            \|||/
                                             \|/

                                             ...

                                             |||
                                             |||
                                            \|||/
                                             \|/

                                   96th Decoder Transformer

                                       <--- 12288 --->

                                       ====== 1 ======
                                       ====== 2 ======
                                       ===== ... =====
                                       ===== 2048 ====

                                             |||
                                             |||
                                            \|||/
                                             \|/

                                   inverse token embedding
                                      (WTE^T + softmax)

                                <---------- 50257 ---------->

                                ============= 1 =============
                                ============= 2 =============
                                ============ ... ============
                                ============ 2048 ===========
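
The shape changes in the chart can be traced with a minimal NumPy sketch. It uses toy sizes so it runs instantly, and several parts are assumptions rather than the actual model: `decoder_block` is a hypothetical stand-in that only preserves the shape, the positional encoding is a plain lookup table indexed by position, and the final projection is taken to be the transpose of WTE (weight tying, as in GPT-2), since a non-square WTE has no true inverse.

```python
import numpy as np

# Toy sizes; GPT-3 uses n_vocab=50257, n_ctx=2048, d_model=12288, n_layer=96.
n_vocab, n_ctx, d_model, n_layer = 50, 8, 16, 3

rng = np.random.default_rng(0)
wte = rng.normal(size=(n_vocab, d_model))  # token embedding matrix (WTE)
wpe = rng.normal(size=(n_ctx, d_model))    # positional encoding matrix (WPE)


def decoder_block(x):
    """Hypothetical stand-in for one decoder layer: keeps the shape unchanged."""
    return x


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Input: one one-hot row per position, shape (n_ctx, n_vocab).
token_ids = rng.integers(0, n_vocab, size=n_ctx)
one_hot = np.eye(n_vocab)[token_ids]

# Token embedding plus positional encoding, shape (n_ctx, d_model).
x = one_hot @ wte + wpe

# Every decoder layer maps (n_ctx, d_model) to (n_ctx, d_model).
for _ in range(n_layer):
    x = decoder_block(x)

# Project back to the vocabulary and normalise, shape (n_ctx, n_vocab).
probs = softmax(x @ wte.T)

print(one_hot.shape, x.shape, probs.shape)
```

Each row of `probs` is a probability distribution over the vocabulary for the corresponding position, which mirrors the last matrix in the chart.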
