Explore how FPGA-accelerated language models reshape generative AI, with faster inference, lower latency, and improved language understanding.
Introduction: Large Language Models
In recent years, large language models (LLMs) have revolutionized the field of natural language processing, enabling machines to generate human-like text and engage in meaningful conversations. These models, such as OpenAI's GPT, possess an astounding ability to comprehend and produce language. They can be used for a wide range of natural language processing tasks, including text generation, translation, summarization, sentiment analysis, and more.
Large language models are typically built using deep learning techniques, particularly using transformer architectures. Transformers are neural network models that excel at capturing long-range dependencies in sequences, making them well-suited for language understanding and generation tasks. Training a large language model involves exposing the model to massive amounts of text data, often from sources such as books, websites, and other textual resources. The model learns to predict the next word in a sentence or fill in missing words based on the context it has seen. Through this process, it gains knowledge about grammar, syntax, and even some level of world knowledge.
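To make that training objective concrete, here is a minimal sketch of next-token prediction, assuming PyTorch; the embedding-plus-projection "model" is only a stand-in for the stack of attention and feed-forward layers in a real transformer.

```python
# Minimal sketch of the next-token prediction objective (toy dimensions).
import torch
import torch.nn as nn

vocab_size, seq_len, d_model = 1000, 16, 64

# Stand-in for a transformer decoder: embedding -> output projection.
# A real LLM inserts many self-attention and feed-forward layers in between.
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, seq_len))    # one sequence of token IDs
hidden = embed(tokens)                                  # (1, seq_len, d_model)
logits = lm_head(hidden)                                # (1, seq_len, vocab_size)

# Every position is trained to predict the *next* token, so shift targets by one.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),             # predictions at positions 0..n-2
    tokens[:, 1:].reshape(-1),                           # targets are positions 1..n-1
)
loss.backward()                                          # gradients drive the parameter updates
```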
One of the primary challenges associated with large language models is their immense computational and memory requirements. These models consist of billions of parameters, necessitating powerful hardware and significant computational resources to train and deploy them effectively, as discussed in Nishant Thakur's March 2023 LinkedIn article, "The Mind-Boggling Processing Power and Cost Behind ChatGPT: What It Takes to Build the Ultimate AI Chatbot?". Organizations and researchers with limited resources often face hurdles in harnessing the full potential of these models because of the sheer amount of processing required, or the cloud budget needed to rent it. In addition, the extreme growth in the context lengths that must be stored to generate the appropriate tokens (words or sub-parts of words) in a response places even greater demands on memory and compute resources.
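To get a feel for those memory demands, the back-of-the-envelope estimate below adds the weight storage to the key/value cache that grows with context length and batch size; the 20B-class model dimensions are hypothetical, and real deployments will differ.

```python
# Rough serving-memory estimate: weights plus the key/value cache that grows
# with context length and batch size (illustrative assumptions only).
def serving_memory_gb(params_billions, n_layers, n_heads, head_dim,
                      context_len, batch, bytes_per_value=2):
    """Estimate weight + KV-cache memory in GB (FP16 = 2 bytes per value)."""
    weights = params_billions * 1e9 * bytes_per_value
    # Each layer caches one key and one value vector per token per sequence.
    kv_cache = 2 * n_layers * n_heads * head_dim * context_len * batch * bytes_per_value
    return (weights + kv_cache) / 1e9

# Hypothetical 20B-parameter configuration, 4K-token context, batch of 8:
print(f"{serving_memory_gb(20, 44, 64, 96, 4096, 8):.1f} GB")   # ~75 GB at FP16
```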
These compute challenges lead to higher latency, which makes LLMs that much harder to adopt: responses are no longer real-time and, therefore, feel less natural. In this blog, we will delve into the difficulties encountered with large language models and explore potential solutions that can pave the way for their enhanced usability and reliability.
Acceleration of Large Language Models
LLMs typically require a large-scale system to execute the model, and these systems have grown to the point where it is no longer cost, power or latency efficient to run them on CPUs alone. Accelerators, such as GPUs or FPGAs, can be used to significantly improve the compute-to-power ratio, drastically lower system latency and reach higher levels of compute at a much smaller scale. While GPUs are becoming the standard for acceleration, mainly due to their accessibility and ease of programming, FPGA architectures can deliver exceptional performance at much lower latency than GPUs.
Since GPUs are inherently warp-locked architectures, executing 32 SIMT threads in lockstep across multiple cores in parallel, they tend to require batching larger amounts of data to offset that warp-locking and keep the pipeline full. That batching equates to more latency and much more demand on system memory. The FPGA, by contrast, builds custom data paths that execute multiple different instructions on multiple blocks of data concurrently, which means it can operate efficiently down to a batch size of one, delivering real-time, much lower-latency operation while minimizing external memory requirements. As a result, an FPGA can achieve significantly higher utilization of its TOPS than competing architectures, and this performance gap only grows as the system is scaled up to a ChatGPT-sized deployment.
Achronix FPGAs are capable of outperforming GPUs running LLMs in both throughput and latency as the system is scaled up to eight or more devices (10,000 GPUs were used to train GPT-3). If the model can use INT8 precision, the Achronix FPGA has an even larger advantage, as shown in the tables below, which use GPT-20B as a reference. The use of FPGAs is also attractive because GPUs have long lead times (over a year for high-end GPUs), minimal user support, and are significantly more expensive than FPGAs (GPUs can cost well over $10,000 each).
GPT-20B Performance Comparison (Low Batch)
| No. of Devices | Leading GPU Latency (ms) | Leading GPU Throughput (tokens/s) | Speedster7t AC7t1500 FP16 Latency (ms) | Speedster7t AC7t1500 FP16 Throughput (tokens/s) | Speedster7t AC7t1500 INT8 Latency (ms) | Speedster7t AC7t1500 INT8 Throughput (tokens/s) |
|---|---|---|---|---|---|---|
| 1 | 28 | 35 | 82.8 | 12 | 41.4 | 24 |
| 2 | 18 | 55 | 41.4 | 24 | 20.7 | 48 |
| 4 | 13 | 78 | 20.7 | 48 | 10.3 | 96 |
| 8 | 11 | 92 | 10.3 | 96 | 5.2 | 193 |
| 16 | 9* | 109* | 5.1 | 192 | 2.6 | 386.5 |
| 32 | 8* | 128* | 2.5 | 384 | 1.2 | 773 |

Table Note: All results are at batch size 1; GPU results use FP16. * Estimated performance.
GPT-20B Performance Comparison (High Batch)
| No. of Devices | Leading GPU Latency (ms) | Leading GPU Throughput (tokens/s) | Speedster7t AC7t1500 FP16 Latency (ms) | Speedster7t AC7t1500 FP16 Throughput (tokens/s) | Speedster7t AC7t1500 INT8 Latency (ms) | Speedster7t AC7t1500 INT8 Throughput (tokens/s) |
|---|---|---|---|---|---|---|
| 1 | 37 | 216 | 84.8 | 94 | 42.4 | 188 |
| 2 | 24 | 328 | 42.4 | 188 | 21.2 | 377 |
| 4 | 17 | 480 | 21.2 | 377 | 10.6 | 754 |
| 8 | 12 | 647 | 11.2 | 713 | 5.3 | 1509 |
| 16 | 9* | 809* | 5.6 | 1426 | 2.6 | 3019 |
| 32 | 8* | 1011* | 2.5 | 2800 | 1.4 | 5652 |

Table Note: All results are at batch size 8; GPU results use FP16. * Estimated performance.
Mapping LLMs to Achronix FPGA Accelerators
The Achronix Speedster7t FPGA has a unique architecture that lends itself very well to these types of models. First, it has a hardware 2D network on chip (NoC) that handles the ingress and egress of data into, out of, and through the device. In addition, it uses machine learning processors (MLPs) with tightly coupled block RAM that enable efficient result reuse between computations. Finally, similar to GPUs but unlike other FPGAs, the Speedster7t FPGA has eight banks of highly efficient GDDR6 memory, which provides much higher bandwidth and is capable of loading parameters at 4 Tbps.
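As a rough sanity check on why that bandwidth matters, if batch-1 token generation is memory-bandwidth bound, the per-token latency on a single device is roughly the time needed to stream every parameter once from GDDR6. This is a simplified estimate under that assumption, not a benchmark.

```python
# If decode is memory-bandwidth bound, per-token latency on one device is
# roughly (model bytes) / (memory bandwidth). Illustrative estimate only.
GDDR6_TBPS = 4.0                           # aggregate GDDR6 bandwidth quoted above
bytes_per_second = GDDR6_TBPS * 1e12 / 8   # 4 Tbps -> 0.5 TB/s

def per_token_ms(n_params, bytes_per_param):
    return n_params * bytes_per_param / bytes_per_second * 1e3

print(f"20B parameters @ FP16: ~{per_token_ms(20e9, 2):.0f} ms/token")   # ~80 ms
print(f"20B parameters @ INT8: ~{per_token_ms(20e9, 1):.0f} ms/token")   # ~40 ms
```

These simplified figures land close to the single-device FP16 and INT8 latencies in the tables above, which is what you would expect when parameter loading dominates.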
Since these systems require scaling, FPGAs can implement a variety of standard interfaces to interconnect cards and move data between them seamlessly. The Achronix Speedster7t AC7t1500 device has 32 SerDes lanes running at 100 Gbps each, so scaling does not require proprietary and costly solutions such as NVLink.
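For a sense of scale, the sketch below totals that chip-to-chip bandwidth and compares it to the per-token activations exchanged in model-parallel inference; the hidden dimension is a hypothetical value for a 20B-class model, and real link efficiency will be lower.

```python
# Aggregate chip-to-chip bandwidth vs. the per-token activation traffic of
# model-parallel inference (assumed values; real link efficiency is lower).
lanes, gbps_per_lane = 32, 100
aggregate_gb_per_s = lanes * gbps_per_lane / 8        # 3.2 Tbps -> 400 GB/s

hidden_dim, bytes_per_value = 6144, 2                 # hypothetical 20B-class hidden size, FP16
activation_bytes = hidden_dim * bytes_per_value       # ~12 KB handed off per generated token

transfer_us = activation_bytes / (aggregate_gb_per_s * 1e9) * 1e6
print(f"{aggregate_gb_per_s:.0f} GB/s aggregate, ~{transfer_us:.3f} µs per token hand-off")
```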
The Future of Large Language Models: Scaling Up for Enhanced Language Understanding and Specialized Domains
Because these large language models require huge scale to perform training and inference with minimal latency impact, models will continue to grow in complexity, enabling ever-increasing language understanding, generation and even prediction capabilities at remarkable accuracy. While many of today's GPT-style models are general purpose, specialized models trained for specific domains such as medicine, law, engineering or finance are likely next. For a long time yet, these systems will assist human experts, handling more of the mundane tasks and offering suggested solutions or help with creative work.
Contact Achronix to discuss how we can help you accelerate these large language model systems.