Next-generation video applications such as surveillance, object detection and motion analysis now rely on 360° embedded vision. In these systems, multiple real-time camera streams (up to six) are processed together frame by frame. Each frame is corrected for distortion and other image artifacts and adjusted for exposure and white balance. The frames are then stitched dynamically into a single 360° panoramic view, output at 4K 60 fps and ultimately projected onto a spherical coordinate space.
Today’s high-resolution fish-eye camera lenses used in such applications typically have a wide-angle field of view (FOV). One of the biggest bottlenecks in surround-view camera systems is storing multiple-camera input data in external memory, reading it back in real time and then processing it as a single frame. The hardware must operate within one frame of latency between the incoming raw sensor data from the input cameras and the stitched output video.
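As a rough illustration of the memory-bandwidth and latency pressure described above, the following Python sketch computes the per-frame time budget and the raw data rates involved. Only the six streams and the 4K 60 fps output come from the text; the 1920 × 1080 input resolution, the 60 fps input rate and the bytes-per-pixel figures are assumptions chosen for illustration, and the function names are hypothetical.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps


def stream_bandwidth_bytes_per_s(width: int, height: int,
                                 bytes_per_pixel: int, fps: int,
                                 streams: int = 1) -> int:
    """Raw data rate for a number of identical uncompressed video streams."""
    return width * height * bytes_per_pixel * fps * streams


if __name__ == "__main__":
    # Assumed input side: six 1920x1080 raw sensors, 2 bytes/pixel, 60 fps.
    input_rate = stream_bandwidth_bytes_per_s(1920, 1080, 2, 60, streams=6)
    # Output side from the text: 4K at 60 fps; 3 bytes/pixel is an assumption.
    output_rate = stream_bandwidth_bytes_per_s(3840, 2160, 3, 60)

    print(f"frame budget : {frame_budget_ms(60):.2f} ms")
    print(f"camera input : {input_rate / 1e9:.2f} GB/s")
    print(f"4K output    : {output_rate / 1e9:.2f} GB/s")
```

Under these assumptions, each direction moves roughly 1.5 GB/s, and every write to external memory has to be read back at least once, all within a frame budget of about 16.7 ms.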
High-performance computing platforms have been moving toward using FPGAs in tandem with CPUs to provide specialized hardware acceleration for real-time image processing tasks. This configuration lets the CPUs focus on the particularly complex algorithms, where they can switch threads and contexts quickly, and delegate repetitive tasks to an FPGA that functions as a configurable hardware accelerator, coprocessor or offload engine. Even when FPGAs and CPUs are used as discrete devices, overall system efficiency increases because the two technologies do not clash, but rather fit together like a hand in a glove.
Fish-eye correction and stitching show where this arrangement reaches its limits. Images obtained from fish-eye lenses suffer from severe distortion, so correcting and stitching the output of multiple cameras is a highly compute-intensive task because it is a per-pixel operation. This stitching requires significant real-time image processing and a highly parallelized architecture. But this next-generation application outstrips the ability of standalone FPGAs to perform this role, primarily because of the delays in moving data on and off chip, which degrade the overall latency, throughput and performance of the system.
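To make the per-pixel nature of distortion correction concrete, here is a minimal Python sketch of a fish-eye-to-rectilinear remap. It assumes an equidistant fish-eye model (image radius r = f·θ), the same focal length f for the fish-eye and the corrected image, and nearest-neighbor sampling; none of these choices comes from the article, and a production pipeline would use calibrated lens parameters and interpolation. The function names are hypothetical.

```python
import math


def fisheye_source_coords(u, v, cx, cy, f):
    """Map a pixel (u, v) of the corrected image to fish-eye image coordinates.

    Assumes an equidistant fish-eye model (r = f * theta) and a pinhole
    (rectilinear) target image (r = f * tan(theta)) with the same f.
    """
    x, y = u - cx, v - cy
    r_rect = math.hypot(x, y)
    if r_rect == 0.0:
        return cx, cy                  # the optical center maps to itself
    theta = math.atan(r_rect / f)      # angle of the incoming ray
    scale = (f * theta) / r_rect       # fish-eye radius / rectilinear radius
    return cx + x * scale, cy + y * scale


def undistort(src, f):
    """Remap a fish-eye image (list of rows) using nearest-neighbor lookup."""
    h, w = len(src), len(src[0])
    cx, cy = (w - 1) / 2, (h - 1) / 2
    out = [[0] * w for _ in range(h)]
    for v in range(h):                 # every output pixel is computed
        for u in range(w):             # independently of all the others
            sx, sy = fisheye_source_coords(u, v, cx, cy, f)
            ix, iy = round(sx), round(sy)
            if 0 <= ix < w and 0 <= iy < h:
                out[v][u] = src[iy][ix]
    return out
```

Because each output pixel depends only on its own coordinates and one lookup into the source frame, the two loops can be unrolled into many parallel hardware pipelines, which is the property that makes this workload a natural fit for programmable logic.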
eFPGAs are the Answer
Enter eFPGA IP, which can be embedded along with a CPU in an SoC. An embedded FPGA fabric offers unique advantages over a standalone FPGA plus CPU solution, the primary one being higher performance. An eFPGA is connected directly (no I/O buffers) to the ASIC through a wide parallel interface, offering dramatically higher throughput, with latency counted in single-digit clock cycles. Low latency is key in complex real-time image processing, for example, when correcting fish-eye distortion.
Another advantage of eFPGAs is that they can be sized to fit the specific application. With Speedcore eFPGA IP, customers specify their logic, memory and DSP resource needs, then Achronix configures the IP to meet their individual requirements. Look-up tables (LUTs), RAM blocks and DSP64 blocks can be assembled like building blocks to create the optimal programmable fabric for any given application.
Along with LUTs, embedded memory and DSP blocks, customers can also define their own custom functions to be included in the Speedcore eFPGA fabric. These customized blocks are integrated into the logic fabric alongside the traditional building blocks, greatly increasing the capability of the eFPGA by adding functions optimized to decrease area and/or increase performance of targeted applications, especially for embedded vision and image processing algorithms.
A good example of how custom blocks enable high-performance image processing is the implementation of You Only Look Once (YOLO), a state-of-the-art, real-time object detection algorithm that uses neural networks and offers greatly increased performance over earlier methods. This algorithm relies on a large number of matrix multipliers. When implemented in an FPGA, these matrix multipliers are built using DSP and RAM blocks. The problem arises from the mismatch between the optimal configuration of the DSP and RAM blocks needed by YOLO and what is found in a typical FPGA fabric. For example, an FPGA fabric may offer DSP blocks with 18 × 27 multiplication/accumulation and 32 × 128 RAMs, whereas the optimal solution would be a fabric with 16 × 8 DSP blocks and 48 × 1024 RAMs. By creating custom blocks that implement the optimal DSP and RAM block configurations, the resulting Speedcore fabric uses 40% less die area to implement the same functionality and achieves a higher level of system performance.
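The multiply-accumulate (MAC) pattern behind those matrix multipliers can be sketched in a few lines of Python. This is an illustrative model only, not Achronix's implementation: it assumes the 16 × 8 figure means one signed 16-bit operand and one signed 8-bit operand, and the assignment of activations to the 16-bit side and weights to the 8-bit side is likewise an assumption. The function names are hypothetical.

```python
def check_signed(value: int, bits: int) -> int:
    """Raise if value does not fit in a signed two's-complement field."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{value} does not fit in {bits} signed bits")
    return value


def mac_dot(activations, weights, act_bits=16, weight_bits=8):
    """Dot product as a chain of multiply-accumulate steps (one DSP op each)."""
    acc = 0
    for a, w in zip(activations, weights):
        acc += check_signed(a, act_bits) * check_signed(w, weight_bits)
    return acc


def matvec(matrix, vector):
    """Matrix-vector product: one independent MAC chain per output row."""
    return [mac_dot(vector, row) for row in matrix]


def accumulator_bits(act_bits: int, weight_bits: int, n_terms: int) -> int:
    """Signed accumulator width that cannot overflow over n_terms products."""
    growth = (n_terms - 1).bit_length()   # ceil(log2(n_terms)) for n >= 1
    return act_bits + weight_bits + growth


if __name__ == "__main__":
    print(matvec([[1, 2], [3, -4]], [10, 20]))   # [50, -50]
    print(accumulator_bits(16, 8, 1024))         # 34
```

Under these assumptions, a 16 × 8 multiplier feeding a 34-bit accumulator covers a 1,024-term dot product without overflow. An 18 × 27 multiplier can perform the same operation, but much of its multiplier array goes unused, which is the kind of mismatch that custom blocks are meant to remove.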
System-Level Benefits of eFPGA IP
Embedding FPGA fabrics in SoCs provides two additional system-level benefits:
- Lower power – Programmable I/O circuitry accounts for half of the total power consumption of standalone FPGAs. An eFPGA has direct wire connections to other blocks within the host SoC, eliminating the need for large programmable I/O buffers altogether.
- Lower system cost – The die size of an eFPGA is much smaller than an equivalent standalone FPGA as the eFPGA can be sized for the specific target function. Additionally, functions in standalone FPGAs like programmable I/O buffers and interface logic are not needed in the eFPGA.
Ultra-low latency and real-time processing are driving the need for efficient implementation of 360° vision-based systems. Speedcore eFPGAs with custom blocks, working in tandem with a CPU in the same host SoC, are ideally suited to implement dedicated functionality such as object detection and image recognition, warping and distortion correction and, ultimately, the stitching together of the final images.