Demystifying AI: Transformers

Understanding the Transformer Networks Leading the Way for GenAI


AI has been around for some time, but recent developments have made it one of the hottest topics in tech. Although AI is becoming mainstream, the technology is still new to many, and many of the related concepts and terminology remain unclear. This Futurum Group and Signal65 insight looks to demystify the transformer models that are driving the current generative AI innovation.

In the previous article, Demystifying AI, ML, and Deep Learning, it was noted that the field of artificial intelligence (AI), including machine learning and deep learning, has been around for quite some time. Although AI as a field has been progressing for decades, at varying rates, the recent excitement over ChatGPT and other GenAI applications has led to a revitalized interest in the technology. This excitement isn’t unfounded: the capabilities of ChatGPT and other Large Language Models (LLMs) are truly impressive. They appear to be a significant step toward the ever-elusive goal of Artificial General Intelligence. For many, interacting with ChatGPT was their first experience with a machine acting truly “intelligent.” It can hold a conversation, it is knowledgeable on seemingly anything it is asked about, and it speaks in a way that appears human. Today’s LLMs, however, are still built on the fundamental aspects of deep learning, so what is it that has led to this breakthrough?

Several factors have contributed to the recent innovation in AI, including vast datasets for training and more powerful processing capabilities, but a significant factor is the neural-network architecture employed in these models, known as the transformer.

The Need for a New Architecture

To fully understand the transformer architecture and why it has had such a significant impact on AI innovation, it is helpful to first understand the problems it solves. Transformers are particularly useful for AI applications that handle a sequence of data, such as a string of text. Handling sequential data poses additional challenges to neural network architectures because earlier data in the sequence may affect the processing of data later in the sequence.

To handle the challenges of sequential data, typically a Recurrent Neural Network (RNN) has been used. RNNs differ from standard feedforward neural networks, in that they process data sequentially, utilizing information regarding the previous input. This gives RNNs a type of memory about past inputs, making them useful for handling sequential data such as text.

RNNs, however, are not without their challenges. Specifically, RNNs face two major issues that limit their usefulness for certain applications, such as natural language processing (NLP). First, RNNs are not very parallelizable, due to their requirement of processing data sequentially. This limits their efficiency and does not allow models to fully leverage highly parallelizable hardware, such as GPUs. Second, while RNNs are designed to have a memory, this memory fades quickly over time.

This second challenge can be especially limiting when handling text beyond a few words. Consider an NLP model predicting the next word in the following sentence:

“The grass is _”

In this example, it seems intuitive that the next word should be “green,” given the context provided by the earlier word “grass.” In this case, an RNN would likely predict correctly, since the word “grass” is in close proximity to the missing word.

Now consider a slightly longer example:

“Joe recently went on a trip to France. He had a great time visiting museums, trying new foods, and exploring. He enjoyed interacting with the locals, but his biggest challenge was that he doesn’t speak _.”

It should be intuitive that the missing word here is “French,” given the context of the word “France.” The challenge in this example, however, is that there is significant separation between the context and the prediction. Due to their limited memory capabilities, RNNs struggle to make predictions in which the dependent context is far away.
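
The fading-memory effect can be illustrated with a minimal sketch of a recurrent update. This is a toy one-unit RNN, not a trained model: the function names and the scalar weights w_h and w_x are hypothetical values chosen only for illustration. The sketch measures how much the first input still affects the final hidden state as the gap between them grows.

```python
import math

def rnn_final_state(inputs, w_h=0.5, w_x=1.0):
    """Run a one-unit RNN over the inputs and return the final hidden state."""
    h = 0.0
    for x in inputs:
        # Each step needs the previous hidden state, so the steps
        # cannot be computed in parallel.
        h = math.tanh(w_h * h + w_x * x)
    return h

def first_input_influence(length):
    """How much the final state changes when only the first input changes."""
    base = [0.0] * length
    changed = [1.0] + [0.0] * (length - 1)
    return abs(rnn_final_state(changed) - rnn_final_state(base))

# Influence of the first input after a short gap and after a long gap.
short_gap = first_input_influence(2)
long_gap = first_input_influence(20)
```

With these assumed weights, the influence of the first input shrinks at every step, so long_gap is far smaller than short_gap. This mirrors the difficulty of connecting “France” to “French” across several sentences.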

While state-of-the-art LLMs, such as ChatGPT, are capable of writing full paragraphs, stories, and essays, their job is essentially the same as the problem in the previous two examples: predict the most likely next word. Each response, no matter how lengthy, is actually generated one word at a time, based on the previous words. Development of RNNs has led to some workarounds for their memory issues, such as an architecture known as Long Short-Term Memory (LSTM); however, their memory is still quite limited. To achieve the results we see today with cutting-edge LLMs, a new architecture was required.

Attention Is All You Need

The transformer architecture was first introduced in a 2017 Google research paper titled “Attention is All You Need.” As the title strongly suggests, a key element to the transformer is the concept of attention.

Attention was first introduced in a 2014 research paper titled “Neural Machine Translation by Jointly Learning to Align and Translate.” The attention mechanism introduced in 2014 is known as additive attention and was used as an enhancement to RNN-based approaches. The core concept behind attention is calculating the relationship and relevance between data. Attention can be calculated both between entire sequences, such as a sentence in English and a sentence in French, and between specific words in a sentence. This latter usage, known as self-attention, is one of the key concepts of transformers, and helps calculate the relevance between words such as “grass” and “green” or “France” and “French” in the previous examples.

While the idea of attention had been previously used to enhance RNNs, the transformer model introduced in 2017 brought a new architecture based solely on attention mechanisms, without the use of an RNN. The original transformer was designed for language translation tasks, but has since been successfully applied to a variety of challenges. The transformer architecture provides two crucial advantages over RNNs: it is highly parallelizable, leading to faster training times, and its usage of self-attention allows it to overcome the memory challenges of RNNs.

The transformer model first unveiled in the “Attention is All You Need” paper looks like this:

Figure 1: Transformer Architecture (Source: “Attention is All You Need”)

Unless one is quite familiar with AI model architectures, a brief look at this architecture diagram alone may not necessarily give a fully intuitive understanding of what is going on. To further understand transformers, we can take a closer look, breaking down the architecture step by step.

Encoders and Decoders

Before jumping into any specific piece of the transformer, note that it follows an encoder-decoder architecture. The architecture diagram shows two connected portions: one taking in “Inputs” and one taking in “Outputs.” The first portion is known as an “Encoder,” while the second is known as a “Decoder.”

Figure 2: Encoder and Decoder

The encoder-decoder architecture is used for applications in which both the input and the output are sequences of data of variable length. This is known as a sequence-to-sequence model. In the case of the original transformer model, which was developed for English to French and English to German translation tasks, the input sequence would be English text, while the output sequence would be the corresponding translation in either French or German.

The encoder portion takes in a variable-length sequence as an input and provides the decoder with a representation that captures the context of the input, with one vector for each input token. The decoder portion then generates an output for the given task. For the task of English to French translation, the encoder’s job is to understand the context of the words in an English sentence, while the decoder uses this context to generate an appropriately translated French sentence. In general, encoders are well suited for classification, while decoders are used for generation of new content. Although the diagram shows a single encoder block and a single decoder block, in reality, most models include several encoder layers and several decoder layers (the original transformer stacked six of each).

It should also be noted that while the original transformer utilized both an encoder and a decoder, many of the transformer-based models that have since been developed utilize encoder-only or decoder-only architectures. Popular encoder-only models include BERT and RoBERTa, while popular decoder-only models include the GPT and LLaMA models.

In general, the encoder and decoder share a similar architecture, although the decoder contains a few unique differences. The following discusses each portion in an encoder-decoder architecture, with special mention of areas where the decoder differentiates.

Word Embeddings

The first step in the transformer model is word embeddings. The general concept behind this step is fairly straightforward – the goal is to convert input data, such as text, into a machine-compatible format. While transformers are useful for NLP, computers need to work with numbers, so input data is converted into an embedding matrix.

Figure 3: Word Embeddings

Text is first broken into chunks, called tokens. In general, tokens can be thought of as words; however, it should be noted that tokens may actually represent portions of words, such as prefixes or suffixes, as well as other characters such as punctuation. For simplicity, however, we will use “words” and “tokens” interchangeably for the remainder of this paper. Once text is tokenized, the tokens are converted into vector embeddings that numerically represent the semantic meaning of the token. Similar tokens are captured with similar numerical vector representations.

A popular example of word embeddings is the words “King,” “Queen,” “Boy,” and “Girl.” The embedding vectors for “King” and “Queen” would be numerically similar in some dimension which represents “Royalty.” Meanwhile, a different dimension of the vectors, representing gender information, would encode the words such that the word “King” is similar to “Boy” while “Queen” is similar to “Girl.” While this demonstrates a simplified example, in reality, words would be embedded in a high-dimensional representation, capable of encoding various attributes of each word.
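
The example above can be sketched with hand-made vectors. The two-dimensional embeddings below are hypothetical values invented for illustration, with one “royalty” dimension and one “gender” dimension. Real models learn embeddings with hundreds or thousands of dimensions that do not map to such clean labels.

```python
# Hypothetical 2-dimensional embeddings: [royalty, gender].
embeddings = {
    "King":  [1.0,  1.0],
    "Queen": [1.0, -1.0],
    "Boy":   [0.0,  1.0],
    "Girl":  [0.0, -1.0],
}
ROYALTY, GENDER = 0, 1

def same_along(word_a, word_b, dimension):
    """True when two words have the same value along one embedding dimension."""
    return embeddings[word_a][dimension] == embeddings[word_b][dimension]

def nearest_word(vector):
    """Return the vocabulary word whose embedding is closest to the vector."""
    def squared_distance(word):
        return sum((a - b) ** 2 for a, b in zip(embeddings[word], vector))
    return min(embeddings, key=squared_distance)

# Take the shift from "Boy" to "Girl" and apply it to "King".
shifted = [k - b + g for k, b, g in zip(embeddings["King"],
                                        embeddings["Boy"],
                                        embeddings["Girl"])]
analogy = nearest_word(shifted)
```

In this toy setup, “King” and “Queen” agree along the royalty dimension, “King” and “Boy” agree along the gender dimension, and shifting “King” by the difference between “Boy” and “Girl” lands on “Queen.”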

The decoder similarly takes in word embeddings; however, it does so with a different vocabulary, specific to the role of the decoder. For the task of English to French translation, the encoder would embed an English vocabulary, while the decoder would embed a corresponding French vocabulary.

Positional Encoding

Vector embeddings allow natural language words to be captured in a numerical format that can be used in a computation. The meaning of words alone, however, is not enough to fully understand the context of words in a sequence. The position of each word within the sentence provides important context that can drastically change its meaning. As an example, consider the following sentence:

“Sally ate a hamburger.”

Without a way to correctly order the words, the exact same input could be used to represent “A hamburger ate Sally,” which represents a vastly different meaning.

Figure 4: Positional Encoding

In an RNN, data is processed sequentially, which naturally captures its order, but doing so creates a significant bottleneck, since the processing of each word depends on the processing of the previous word. Transformers remove this bottleneck, allowing data to be processed in parallel, yet the order of words must still be captured. To achieve this, transformers add an additional layer of encoding to the vector embeddings, known as positional encoding.

Positional encoding uses calculations based on sine and cosine functions to represent the unique positions of each token, which are stored in position vectors. The previously computed word embeddings and the new positional embeddings can then be added together to create a new embedding matrix that contains both semantic and positional information. This new matrix can then be used as the input into the transformer.
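
A minimal sketch of this step follows, using the sinusoidal formulation from “Attention is All You Need,” in which even vector indices use a sine function and odd indices use a cosine function. The function names and the small example embeddings are assumptions made for illustration.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal position vectors: sine on even indices, cosine on odd indices."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each sine/cosine pair (2k, 2k + 1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(word_embeddings):
    """Add position vectors element-wise to a list of word embedding vectors."""
    d_model = len(word_embeddings[0])
    positions = positional_encoding(len(word_embeddings), d_model)
    return [[w + p for w, p in zip(word_row, pos_row)]
            for word_row, pos_row in zip(word_embeddings, positions)]

# Hypothetical 4-dimensional embeddings; the first and third tokens are the same word.
tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.1, 0.2, 0.3, 0.4]]
encoded = add_positions(tokens)
```

Although the first and third tokens start with identical embeddings, their encoded vectors differ because each receives a different position vector.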

Multi-head Attention

As previously noted, a key component of transformers is attention, which captures the relationship between data. Specifically, what is calculated here is self-attention, which models the relationship of each word in the input to every other word in the input.

Figure 5: Multi-head Attention

Before diving into how this self-attention works, we can look at a quick example for why it is so important. Consider these two sentences:

“Transformers are innovating the field of AI.”

“Transformers are robots in disguise.”

Both sentences discuss “Transformers,” yet they refer to very different things. Most people can fairly easily determine that “transformers” in the first sentence refers to the neural-network architecture, while in the second sentence it refers to the fictional robots made popular by animated cartoons, children’s toys, and a movie series. This is determined from the context of the other words, such as “AI” or “robots.” While humans can use their intuition to understand this context, transformer models use self-attention to calculate the relative importance between words and understand the context in which they are used.

To do this, transformer models create three matrices known as Key, Query, and Value matrices (K, Q, and V). These matrices are created from the input matrix and a series of weight values, which are initially randomized and further refined during training. The concept behind Key, Query, and Value stems from search engines, where the Query is what is searched for, the Key is the content being searched over, and the Value is what is determined as the best results. Self-attention follows a similar model in which the Query searches for relationships between words, the Key provides characteristics and features of the relationships, and the Value holds the semantic meaning that can be combined with the discovered relationships to return a result.

To achieve this, the Key and Query matrices undergo a series of transformations and matrix operations that calculate the similarity between the components. The result is a matrix of attention scores that depict the relationship between words. These attention scores are then utilized as weights given to the Value matrix as a way to filter the relationships between words. The specific formula for attention provided in “Attention is All You Need” is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Figure 6: Attention Formula (Source: “Attention is All You Need”)

Without diving too far into the math behind attention, the above formula can be broken down as follows. The similarity of values in the Q and K matrices is calculated by taking the dot product of Q with the transpose of K. The result is then scaled by dividing by the square root of the dimensionality of the key vectors, written as d_k in the above formula; this keeps large dot products from pushing the softmax function into regions where its gradients become very small. The softmax function is then applied to return a matrix of attention scores with values between 0 and 1, where a high attention score signifies a strong relationship between the words. A further description of the softmax function is provided later on in this paper. The calculated attention score matrix is then multiplied with the matrix V.
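
The steps just described can be sketched in a few lines of plain Python. The helper names and the small Q, K, and V matrices below are hypothetical and chosen for illustration; in a real model, these matrices are produced from the input and learned weights.

```python
import math

def softmax(row):
    """Convert a list of scores into weights between 0 and 1 that sum to 1."""
    peak = max(row)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]

def attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V and also return the attention weights."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]
    return matmul(weights, v), weights

# Hypothetical Q, K, and V for three tokens with d_k = 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(q, k, v)
```

Each row of weights sums to 1 and describes how strongly one token attends to every token in the sequence. Each row of output is the corresponding weighted mix of the rows of V.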

This process, with a single set of K, Q, and V matrices, represents single-head attention. Transformer models utilize multi-head attention, in which multiple K, Q, and V matrices are utilized, all of which undergo the attention process in parallel. These separate processes, known as “heads,” are ultimately combined to form a single output. This approach allows each distinct head to learn different relationships and aspects of the data, which combine to form a more complex and nuanced understanding of the attention relationships.
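
A rough sketch of multi-head attention is shown below. It assumes two heads, a model width of four, and randomly initialized weights, which stand in for values that would normally be learned during training. The final projection matrix w_o follows the original paper, where the concatenated head outputs pass through one more linear layer. All names are hypothetical, and the heads run in a simple loop here rather than in parallel.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Convert a list of scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Randomly initialized weights, standing in for learned values."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    """Run attention once per head, concatenate the results, then project."""
    head_outputs = []
    for w_q, w_k, w_v in heads:
        q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
        head_outputs.append(attention(q, k, v))
    # Concatenate the head outputs token by token along the feature dimension.
    concatenated = [sum((head[t] for head in head_outputs), [])
                    for t in range(len(x))]
    return matmul(concatenated, w_o)

rng = random.Random(0)
d_model, num_heads = 4, 2
d_head = d_model // num_heads
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
w_o = random_matrix(d_model, d_model, rng)
x = random_matrix(3, d_model, rng)  # three tokens, four features each
result = multi_head_attention(x, heads, w_o)
```

Because each head has its own Q, K, and V weights, each can attend to different relationships, and the output keeps the same shape as the input: one vector of width d_model per token.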

Decoder Attention

The main difference in the decoder architecture is within its attention layers. While the encoder includes a single multi-head attention layer, the decoder includes two, both of which vary slightly from the process in the encoder.

Figure 7: Decoder Attention Layers

The first attention layer in the decoder performs masked multi-head self-attention. This process is the same as in the encoder, but with one key difference – part of the information is masked or hidden. More specifically, for each word in a sentence, the words following it are hidden. This is because the role of the decoder is to predict an output, yet as previously discussed, transformers take in full sequences of words at once, rather than evaluating one word at a time. The masking process ensures that tokens are only being evaluated by their relationship with previous tokens, forcing the decoder to make predictions without already knowing the future words.
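
One common way to implement this masking, sketched below under that assumption, is to replace the scores for later positions with negative infinity before the softmax step, so that those positions receive an attention weight of exactly zero. The score values are hypothetical.

```python
import math

def softmax(row):
    """Convert a list of scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(scores):
    """Hide later positions: token i may only attend to tokens 0 through i."""
    return [[s if j <= i else float("-inf") for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

# Hypothetical raw attention scores for a four-token sequence.
scores = [[0.5, 1.0, 0.2, 0.1],
          [0.3, 0.8, 0.9, 0.4],
          [0.7, 0.1, 0.6, 0.2],
          [0.2, 0.4, 0.3, 0.9]]
masked_weights = [softmax(row) for row in causal_mask(scores)]
```

In the result, the first token can attend only to itself, while the last token can attend to the entire sequence.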

The second attention step in the decoder block differs in that instead of calculating self-attention only between the tokens in its own input, it calculates cross-attention, using both the output from the decoder’s masked multi-head attention layer and the output from the encoder. This allows the transformer to learn the relationships between the encoder’s data and the decoder’s data. In the case of English to French translation, this represents learning the relationship between the English sequence and the corresponding French sequence that the decoder is learning to predict.

Add and Norm

The add and norm block is utilized to enhance the accuracy and efficiency of models, especially as they grow deeper. This step is performed after every multi-head attention step and every feed-forward step in the model.

Figure 8: Add and Norm

As can be seen in the transformer diagram, the add and norm block is connected by a separate path which navigates around the previous step. This represents the add step, in which the input data of the previous step bypasses the transformation step and is added to its output. This is known as a residual connection and is used to combat issues in which networks forget information from previous layers as they grow increasingly deep.

The norm component of the step performs layer normalization to rescale the data. This ensures that values do not grow too large or shrink too small in a way that over- or under-represents their significance.
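
The two parts of this block can be sketched together as follows. This is a simplified version: layer normalization normally also includes a learned scale and shift, which are omitted here, and the input values are hypothetical.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale one token's vector to zero mean and roughly unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def add_and_norm(sublayer_input, sublayer_output):
    """Residual connection followed by layer normalization, token by token."""
    return [layer_norm([a + b for a, b in zip(inp, out)])
            for inp, out in zip(sublayer_input, sublayer_output)]

# Hypothetical values: two tokens with four features each.
# The second token has the same pattern as the first at ten times the scale.
x = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
sublayer_out = [[0.5, -0.5, 0.5, -0.5], [5.0, -5.0, 5.0, -5.0]]
normalized = add_and_norm(x, sublayer_out)
```

After normalization, both tokens end up with nearly identical values even though their inputs differ by a factor of ten, which shows how the norm step keeps values on a comparable scale.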

Feed Forward

The next layer, and the final layer of the encoder block, is a fully connected feed forward neural network. The feed forward network is a multi-layer perceptron network that contains layers of linear and nonlinear functions. The role of the feed forward network is to take in the computed attention vectors, learn more complex patterns about the data, and return a format that can be utilized by the next encoder or decoder block.

Figure 9: Feed Forward
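
A minimal sketch of the feed forward step is shown below, following the form used in the original paper: a linear layer, a ReLU nonlinearity, and a second linear layer, applied to each token independently. The weights and sizes are hypothetical values chosen so the arithmetic is easy to follow.

```python
def linear(vector, weights, bias):
    """Compute vector @ weights + bias for one token."""
    return [sum(x * w for x, w in zip(vector, column)) + b
            for column, b in zip(zip(*weights), bias)]

def relu(vector):
    """Nonlinear step: negative values become zero."""
    return [max(0.0, x) for x in vector]

def feed_forward(tokens, w1, b1, w2, b2):
    """Apply the same two-layer network to every token independently."""
    return [linear(relu(linear(token, w1, b1)), w2, b2) for token in tokens]

# Hypothetical weights: model width 2, hidden width 3.
w1 = [[1.0, -1.0, 0.5], [0.0, 1.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]
outputs = feed_forward([[1.0, 2.0], [-1.0, 0.5]], w1, b1, w2, b2)
```

The output has the same width as the input, which is what allows it to be passed on to the next encoder or decoder block.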

Linear and Softmax

The final step in the transformer architecture is for the decoder to create an output prediction. This is achieved with two additional steps seen in the architecture as “Linear” and “Softmax.”

Figure 10: Linear and Softmax

First, the data is sent through an additional linear layer, which categorizes the possible prediction outputs and assigns each a score based on how likely the model has determined that output to be correct. The scores assigned, however, can be any number, positive or negative, and cannot be interpreted directly as probabilities for predicting the most likely output. Because of this, transformers use one final step, known as a softmax function, to convert the scores into probabilities.

Softmax is an incredibly common and incredibly useful function in machine learning algorithms that is used to normalize values. As previously mentioned, the softmax function is also used in the attention calculation. The softmax function converts all inputs into values between 0 and 1, with all values summing to 1. This provides a useful way to represent output scores as a set of probabilities. With softmax, large numbers will be converted to larger probabilities, while very small values will converge to 0. Softmax, as the name suggests, can be thought of as a “softer” version of a traditional max function. Using a traditional max function to find probabilities would simply take the largest value and assign it a probability of 1, while assigning all smaller values to 0. Softmax, on the other hand, still assigns the largest value the highest probability, but accounts for other significant values to receive proportional, non-zero probabilities.
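
The contrast between the two functions can be seen in a short sketch. The scores below are hypothetical outputs of the linear layer for a four-word vocabulary, and hard_max is an illustrative name for the traditional max behavior described above.

```python
import math

def softmax(scores):
    """Convert arbitrary scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_max(scores):
    """Assign probability 1 to the largest score and 0 to everything else."""
    top = scores.index(max(scores))
    return [1.0 if i == top else 0.0 for i in range(len(scores))]

# Hypothetical scores, including a negative value.
scores = [2.0, 1.0, 0.1, -3.0]
soft = softmax(scores)
hard = hard_max(scores)
```

Both functions rank the first score highest, but only softmax leaves the other options with non-zero probabilities, and the very low score of -3.0 ends up close to 0.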

The softmax probabilities are used to predict a final output of the transformer. This method of using probabilities provides flexibility for models to generate different probabilistic outputs, rather than a single deterministic output.
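
The difference between a deterministic and a probabilistic output can be sketched as follows. The vocabulary, probabilities, and function names are hypothetical, and drawing a word at random in proportion to its probability is only one of several ways a model’s output can be selected.

```python
import random

def greedy_choice(vocabulary, probabilities):
    """Always return the single most likely word (deterministic)."""
    return vocabulary[probabilities.index(max(probabilities))]

def sampled_choice(vocabulary, probabilities, rng):
    """Draw a word at random in proportion to its probability."""
    return rng.choices(vocabulary, weights=probabilities, k=1)[0]

# Hypothetical next-word probabilities after "The grass is".
vocabulary = ["green", "tall", "wet", "purple"]
probabilities = [0.70, 0.15, 0.14, 0.01]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
greedy = greedy_choice(vocabulary, probabilities)
samples = [sampled_choice(vocabulary, probabilities, rng) for _ in range(200)]
```

The greedy choice is always “green,” while the sampled outputs are mostly “green” with the other words appearing occasionally. This is why the same prompt can produce different responses.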

Looking Ahead

The introduction of the transformer architecture has rapidly advanced the fields of machine learning and AI. While AI has been a field of study for decades, the transformer architecture has been seen by many as a major breakthrough, and is the basis for much of the current hype around AI. Despite the impressive, and seemingly “magical” results that transformers can achieve, an examination of the general architecture shows that transformers are built upon a series of relatively straightforward steps.

Although the original transformer was developed for language translation tasks, the capabilities of the transformer architecture have far exceeded this single task. Transformers have become the backbone of additional NLP tasks, such as generative chat applications, and they are increasingly being applied to other AI challenges such as computer vision and multi-modal applications.

Since the introduction of transformers in 2017, the field of AI has accelerated dramatically, and in the last few years it has received a renewed wave of interest and excitement. Looking ahead, the transformer architecture will continue to bring about innovative new AI models and capabilities that can both transform crucial business processes and integrate into the public’s everyday life. The state of transformer-based models is rapidly evolving, as researchers find new ways to optimize or improve the basic concept of the transformer. In addition to new models, the power of transformers will only increase as they are combined with increasingly powerful hardware, additional training datasets, and new techniques, such as retrieval-augmented generation (RAG), that augment the process to provide more powerful results.