Deep Learning#

  1. Tokenizer

  2. Optimizer

  3. Normalisation

  4. Activation Functions

  5. Bigram Language Model

  6. Logits

  7. Transformers

back to top

Tokenizer#

Character Level Tokenizer#

  • converts each character to its integer equivalent

  • small vocabulary to work with, but a given text becomes a long sequence of tokens to encode and decode (see the sketch at the end of this section)

Word Level Tokenizer#

  • large vocabulary to work with, but a given text becomes a short sequence of tokens to encode and decode

Subword Level Tokenizer#

  • falls between the Character Level and Word Level tokenizers in both vocabulary size and sequence length
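
A minimal sketch of a character-level tokenizer; the toy corpus and the `stoi`/`itos` mapping names are assumptions for illustration.

```python
# Minimal character-level tokenizer: map each character to an integer and back.
text = "hello world"                           # assumed toy corpus
vocab = sorted(set(text))                      # small vocabulary of unique characters
stoi = {ch: i for i, ch in enumerate(vocab)}   # character -> integer
itos = {i: ch for ch, i in stoi.items()}       # integer -> character

def encode(s):
    """Convert a string into a list of integer token ids."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Convert a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)            # [3, 2, 4, 4, 5] for this corpus's vocabulary ordering
print(decode(ids))    # "hello"
```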

back to top

Optimizer#

Mean Squared Error#

  • common loss function used in regression problems

  • the goal is to predict a continuous output

  • measures the average squared difference between the predicted and the actual values

  • often used to train neural networks for regression tasks
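
A short sketch of mean squared error in NumPy; the prediction and target values are made up for illustration.

```python
import numpy as np

# Mean squared error: average of the squared differences between predictions and targets.
y_pred = np.array([2.5, 0.0, 2.1, 7.8])   # assumed model predictions
y_true = np.array([3.0, -0.5, 2.0, 7.5])  # assumed ground-truth values

mse = np.mean((y_pred - y_true) ** 2)
print(mse)  # ~0.15
```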

Gradient Descent#

  • used to minimize the loss function of a model

  • the loss function measures how well the model is able to predict the target based on input features

  • iteratively adjusts the model parameters in the direction of the steepest descent of the loss function
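
A minimal sketch of gradient descent minimising the toy loss f(w) = (w - 3)^2; the learning rate and starting point are assumed.

```python
# Gradient descent on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = 0.0    # initial parameter value (assumed)
lr = 0.1   # learning rate (assumed)

for step in range(100):
    grad = 2 * (w - 3)   # gradient of the loss at the current w
    w -= lr * grad       # move in the direction of steepest descent

print(w)  # ~3.0, the minimum of the loss
```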

Momentum#

  • extension of Gradient Descent that adds a momentum term to the parameter updates

  • the term helps smooth out the updates and allows the optimizer to continue moving in the right direction, even if the gradient changes direction or varies in magnitude

  • particularly useful for training deep neural networks
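
A sketch of the momentum update on the same toy loss as above; the momentum coefficient of 0.9 is a common but assumed choice.

```python
# Gradient descent with momentum on f(w) = (w - 3)^2.
w, v = 0.0, 0.0      # parameter and velocity (assumed initial values)
lr, beta = 0.1, 0.9  # learning rate and momentum coefficient (assumed)

for step in range(200):
    grad = 2 * (w - 3)
    v = beta * v + grad   # running direction built from past gradients
    w -= lr * v           # update uses the smoothed velocity, not the raw gradient

print(w)  # close to the minimum at w = 3
```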

RMSprop#

  • Root Mean Square Propagation; uses a moving average of the squared gradients to adapt the learning rate of each parameter

  • helps to avoid oscillations in the parameter updates and can improve convergence in some cases
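
A sketch of the RMSprop update rule on the same toy loss; the hyperparameters are assumed.

```python
# RMSprop on f(w) = (w - 3)^2: scale each step by a moving average of squared gradients.
w, s = 0.0, 0.0
lr, rho, eps = 0.01, 0.9, 1e-8   # assumed hyperparameters

for step in range(1000):
    grad = 2 * (w - 3)
    s = rho * s + (1 - rho) * grad ** 2   # moving average of squared gradients
    w -= lr * grad / (s ** 0.5 + eps)     # per-parameter adaptive step size

print(w)  # hovers near the minimum at w = 3
```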

Adam#

  • popular optimization algorithm, combines the ideas of momentum and RMSprop

  • uses moving averages of both the gradient and its squared values to adapt the learning rate of each parameter

  • often used as default optimizer for deep learning models
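
A sketch of the Adam update rule, combining the momentum and RMSprop ideas above with bias correction; the hyperparameters are assumed but close to common defaults.

```python
# Adam on f(w) = (w - 3)^2: momentum term (m) plus RMSprop-style scaling (v).
w, m, v = 0.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8   # assumed hyperparameters

for t in range(1, 1001):
    grad = 2 * (w - 3)
    m = beta1 * m + (1 - beta1) * grad        # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (v_hat ** 0.5 + eps)

print(w)  # close to the minimum at w = 3
```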

AdamW#

  • modification of the Adam optimizer that adds weight decay to the parameter updates

  • helps to regularise the model and can improve generalisation performance
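
A minimal usage sketch with PyTorch's `torch.optim.AdamW`; the toy model, random batch, and hyperparameters are assumed.

```python
import torch
import torch.nn.functional as F

# AdamW: Adam with decoupled weight decay applied directly to the parameters.
model = torch.nn.Linear(10, 1)   # assumed toy model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

x = torch.randn(32, 10)          # assumed random batch of inputs
y = torch.randn(32, 1)           # assumed random targets

loss = F.mse_loss(model(x), y)
loss.backward()                  # compute gradients
optimizer.step()                 # update includes the weight-decay term
optimizer.zero_grad()
```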

back to top

Normalisation#

Softmax#

  • a type of normalisation, but applied to a model's output scores rather than to the input data

  • $\text{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$
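
A minimal NumPy sketch of softmax; the example scores are assumed, and the maximum is subtracted only for numerical stability.

```python
import numpy as np

# Softmax: exponentiate and normalise so the outputs form a probability distribution.
def softmax(x):
    e = np.exp(x - np.max(x))   # subtracting the max does not change the result
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # assumed example scores
print(softmax(scores))               # ~[0.659, 0.242, 0.099], sums to 1
```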

back to top

Activation Functions#

  • introduce non-linearity into the model so it can learn complex patterns; applied to the output of each layer

  • Sigmoid

Sigmoid#

  • often used in the output layer for binary classification problems

  • $\text{Sigmoid}(x) = \frac{1}{1 + \exp(-x)}$, output range (0, 1)

  • smooth gradient and output values are bounded

  • can cause vanishing gradient problems and is computationally expensive
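
A minimal NumPy sketch of the sigmoid function with a few assumed example inputs.

```python
import numpy as np

# Sigmoid squashes any real number into the range (0, 1).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))                          # 0.5
print(sigmoid(np.array([-4.0, 0.0, 4.0])))   # ~[0.018, 0.5, 0.982]
```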

back to top

Bigram Language Model#

  • considers only the previous character when predicting the next one
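
A minimal sketch of a count-based character bigram model; the toy corpus and the `next_char` helper are assumptions.

```python
import random
from collections import defaultdict

# Bigram model: count how often each character follows another, then sample from those counts.
text = "hello world"   # assumed toy corpus
counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(text, text[1:]):
    counts[a][b] += 1  # how often b follows a

def next_char(ch):
    """Sample the next character using only the current character."""
    followers = counts[ch]
    chars, weights = list(followers), list(followers.values())
    return random.choices(chars, weights=weights)[0]

print(next_char("l"))  # "l", "o", or "d": each follows "l" once in this corpus
```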

back to top

Logits#

  • unnormalised final scores of a model

  • apply softmax to logits to get a probability distribution over classes
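
A short PyTorch sketch of turning logits into a probability distribution; the logit values are assumed.

```python
import torch

# Logits are raw, unnormalised scores; softmax turns them into class probabilities.
logits = torch.tensor([2.0, 1.0, 0.1])   # assumed final scores for 3 classes
probs = torch.softmax(logits, dim=-1)    # probability distribution over the classes
print(probs)        # tensor([0.6590, 0.2424, 0.0986])
print(probs.sum())  # tensor(1.)
```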

back to top

Transformers#

  • key components: Self-Attention, Positional Encoding, Encoder-Decoder Structure, Multi-Head Attention, Feed-Forward Neural Networks

  • neural network architecture that relies on self-attention mechanisms

  • discard the recurrent layers commonly used in sequence modeling tasks

  • pre-training: inputs are fed through the transformer to produce output probabilities, which are then used for generation

  • parallelisation makes the transformer significantly faster, especially for longer sequences

  • can scale well with increasing amounts of data and computational resources

  • suitable for large-scale tasks

  • outperforms traditional models like LSTMs and GRUs, particularly in machine translation

Self-Attention#

  • assigns a different score to each token in a sequence; a token can be at the character, sub-word, or word level

  • use self-attention to compute representations of input and output sequences

  • each word in a sequence is connected directly to every other word

  • allows more efficient parallelisation than recurrent models (see the sketch below)
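
A minimal sketch of scaled dot-product self-attention for a single sequence; the toy sizes and random input are assumed.

```python
import torch
import torch.nn.functional as F

# Self-attention: every token attends to every other token in the sequence.
seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)   # token embeddings (assumed random input)

# Learnable projections to queries, keys, and values.
Wq = torch.nn.Linear(d_model, d_model, bias=False)
Wk = torch.nn.Linear(d_model, d_model, bias=False)
Wv = torch.nn.Linear(d_model, d_model, bias=False)

q, k, v = Wq(x), Wk(x), Wv(x)
scores = q @ k.T / d_model ** 0.5    # similarity of every token with every other token
weights = F.softmax(scores, dim=-1)  # attention scores per token, each row sums to 1
out = weights @ v                    # weighted sum of value vectors

print(out.shape)  # torch.Size([4, 8])
```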

Positional Encoding#

  • unlike RNNs, transformers have no built-in notion of word order

  • added to the input embeddings to give the model information about the position of each word in the sequence
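
A sketch of the standard sinusoidal positional encoding from the original Transformer paper; the function name and toy sizes are assumptions, since the notes do not specify a particular encoding.

```python
import numpy as np

# Sinusoidal positional encoding: even dimensions use sin, odd dimensions use cos,
# each at a different frequency, so every position gets a unique pattern.
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]   # positions 0 .. seq_len - 1
    i = np.arange(d_model)[None, :]     # embedding dimensions 0 .. d_model - 1
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe                           # added element-wise to the input embeddings

print(positional_encoding(4, 8).shape)  # (4, 8)
```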

Encoder-Decoder Structure#

  • the encoder processes the input sequence

  • the decoder generates the output sequence

Multi-Head Attention#

  • multiple attention heads are used in parallel to capture different relationships between words
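
A usage sketch with PyTorch's `torch.nn.MultiheadAttention`; the embedding size, number of heads, and toy input are assumed.

```python
import torch

# Several attention heads run in parallel; their outputs are concatenated and projected.
mha = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.randn(1, 4, 8)          # (batch, sequence length, embedding dim), assumed toy sizes
out, attn_weights = mha(x, x, x)  # self-attention: queries, keys, and values all come from x
print(out.shape)                  # torch.Size([1, 4, 8])
print(attn_weights.shape)         # torch.Size([1, 4, 4]), averaged over the heads by default
```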

Feed-Forward Neural Networks#

  • after the attention layers, position-wise feed-forward neural networks further process the data
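
A minimal sketch of a position-wise feed-forward block; the hidden size and toy input are assumed.

```python
import torch

# Two linear layers with a non-linearity in between, applied to each position independently.
d_model, d_ff = 8, 32   # assumed toy sizes
ffn = torch.nn.Sequential(
    torch.nn.Linear(d_model, d_ff),
    torch.nn.ReLU(),
    torch.nn.Linear(d_ff, d_model),
)

x = torch.randn(1, 4, d_model)   # output of an attention layer (assumed random here)
print(ffn(x).shape)              # torch.Size([1, 4, 8])
```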

back to top