1. Background introduction
Because FPGAs combine programmability with high-performance computing, FPGA-based AI acceleration is now widely used in computer vision. One of the most representative deployment methods combines an FPGA with a CPU to form a heterogeneous computing system, installing a Linux operating system on the CPU to run services such as the AI inference framework and video/image processing. The key to such a heterogeneous system is how to coordinate the computation between the CPU and the FPGA, and that key technology is implemented by the driver system.
There are many modes for coordinating the computing relationship between the CPU and the FPGA, this pair of heterogeneous siblings, and they vary greatly across application scenarios. On the edge (terminal) side, the system generally runs a single, controllable application, so concurrency and virtualization are rarely a concern. EdgeBoard was created for exactly this situation.
2. Introduction to EdgeBoard
EdgeBoard is an FPGA-based embedded AI solution created by Baidu, together with a series of hardware products built on it. As an edge-side solution, it does not design dedicated memory for the FPGA on the PL side; instead, the PS and PL sides share DDR memory. Coordinating the CPU and FPGA in this heterogeneous system therefore comes down to memory management, i.e. the memory-management subsystem of the driver.
3. Overview of this article
This article focuses on the key techniques in the design of EdgeBoard's CPU/FPGA memory driver. The figure below shows EdgeBoard's overall software framework; the memory driver sits in the kernel. The allocation, release, and reclamation of memory are introduced step by step in the sections that follow.
4. Memory characteristics and FPGA memory requirements
On the PS side of the FPGA chip, the CPU accesses DDR through multi-level caches. Under Linux, memory is managed in pages through page tables; physical continuity of DDR is not required, since discontiguous pages are mapped to a contiguous virtual address space. The FPGA logic on the PL side, by contrast, generally uses no cache at all and accesses DDR directly during computation. Each read or write operates on a contiguous region of memory whose starting address must be aligned to a specific offset (0x10), and multiple reads and writes within one computation require the accessed DDR region to be consistent and contiguous.
Given these memory requirements of the CPU and the FPGA, the design of the Linux memory-driver subsystem must fully consider: 1) the effect of the cache; 2) the physical continuity of the memory used by the FPGA; 3) the alignment (offset) requirements of the memory blocks handed to the FPGA.
5. Reserving system memory
To meet these requirements, we adopt a partitioned physical memory design: a portion of memory is reserved from the overall system, and Linux manages only the remainder. The reserved part is allocated by the memory driver, and a block allocated from it is accessible to both the Linux system and the FPGA. When allocating, physical continuity and the starting-offset alignment must be guaranteed so that the block meets the FPGA's needs.
In EdgeBoard practice, we use Xilinx's ZynqMP series FPGA chips, compile the Linux kernel with the PetaLinux tool chain, and achieve the memory reservation with a reserved-memory node in the DeviceTree. For example, if the system has 2 GB of memory in total, 1 GB is reserved for the FPGA and 1 GB is left to the Linux operating system. The relevant DeviceTree nodes are set as follows:
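The original DeviceTree snippet is not included in this copy of the article; the following is a sketch of what such a reserved-memory node might look like (the base address 0x40000000, the node name, and the label are assumptions, not EdgeBoard's actual values):

```dts
/ {
    reserved-memory {
        #address-cells = <2>;
        #size-cells = <2>;
        ranges;

        /* Hypothetical layout: 1 GB reserved for the FPGA at 0x40000000. */
        fpga_reserved: buffer@40000000 {
            no-map;                                 /* keep Linux from using it */
            reg = <0x0 0x40000000 0x0 0x40000000>;  /* base, size = 1 GB */
        };
    };
};
```

The `no-map` property tells the kernel not to create its normal linear mapping for the region, leaving the driver free to map it itself.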
The following sections introduce the various aspects of the driver implementation.
6. Initializing the internal memory mapping
The overall FPGA device driver is written as a character-device platform driver. In the driver's probe stage, the reserved memory region is mapped into the kernel (with memremap), and the resulting parameters are saved in a suitable data structure for subsequent use. For example:
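The original listing is not included in this copy; the following kernel-code sketch shows what such a probe stage might look like (the function name is illustrative, and the snippet assumes the reserved region is referenced from the device node via a `memory-region` phandle):

```c
/* Sketch of the probe stage (kernel code; identifiers are illustrative). */
static int fpga_mem_probe(struct platform_device *pdev)
{
    struct device_node *np;
    struct resource res;
    void *base;
    int ret;

    /* Locate the reserved-memory region referenced by the device node. */
    np = of_parse_phandle(pdev->dev.of_node, "memory-region", 0);
    if (!np)
        return -ENODEV;

    ret = of_address_to_resource(np, 0, &res);
    of_node_put(np);
    if (ret)
        return ret;

    /* Map the reserved region into the kernel with write-back caching. */
    base = memremap(res.start, resource_size(&res), MEMREMAP_WB);
    if (!base)
        return -ENOMEM;

    /* Save res.start, res.end, and base into the driver state
     * (mem_start, mem_end, base_addr) for allocation and translation. */
    return 0;
}
```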
The structure members mem_start, mem_end, base_addr, and so on are defined as follows:
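The original definition is not reproduced in this copy; a plausible sketch of the driver state, using standard kernel types:

```c
/* Driver state saved at probe time (sketch; the struct name is illustrative). */
struct fpga_mem_drv {
    phys_addr_t  mem_start;  /* bus/physical start of the reserved region */
    phys_addr_t  mem_end;    /* bus/physical end of the reserved region   */
    void        *base_addr;  /* kernel virtual address from memremap()    */
};
```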
7. Memory allocation
Memory allocation uses the mmap call: during FPGA device initialization, a file_operations structure is passed when the character device is registered, and the mmap pointer of this structure is set to our memory-allocation function.
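Registering the handlers might look like this (a sketch; the handler names are illustrative, and the release and ioctl entries correspond to the reclamation and address-translation functions discussed later in the article):

```c
/* Sketch of the character device's file_operations (kernel code;
 * handler names are illustrative). */
static const struct file_operations fpga_mem_fops = {
    .owner          = THIS_MODULE,
    .mmap           = fpga_mem_mmap,    /* memory allocation entry point  */
    .release        = fpga_mem_release, /* batch reclamation on close     */
    .unlocked_ioctl = fpga_mem_ioctl,   /* address translation, cache ops */
};
```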
For allocation, we use the bitmap data structure provided by the kernel to manage the reserved region: each bit in the bitmap represents a 16 KB memory block, and a parallel array of the same length records per-allocation information such as the owning client (i.e. the file pointer) and the number of blocks allocated.
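The allocation logic can be modeled in ordinary userspace C as follows (a self-contained sketch: the real driver uses the kernel's bitmap_* helpers and keeps this state per device; all names and the pool size here are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace model of the bitmap allocator described above:
 * one bit per 16 KB block, plus owner and run-length bookkeeping. */
#define BLOCK_SIZE (16 * 1024)
#define NUM_BLOCKS 64              /* small pool for the example */

static uint8_t bitmap[NUM_BLOCKS]; /* 0 = free, 1 = allocated */
static void   *owner[NUM_BLOCKS];  /* owning "file pointer" per block */
static size_t  nblocks[NUM_BLOCKS];/* run length, stored at the first block */

/* Find a run of n free blocks, mark them allocated for filp, and
 * return the index of the first block (or -1 if no run is available). */
static long alloc_blocks(size_t n, void *filp)
{
    for (size_t i = 0; i + n <= NUM_BLOCKS; i++) {
        size_t j;
        for (j = 0; j < n && !bitmap[i + j]; j++)
            ;
        if (j == n) {              /* found a free run of length n */
            for (j = 0; j < n; j++) {
                bitmap[i + j] = 1;
                owner[i + j]  = filp;
            }
            nblocks[i] = n;
            return (long)i;
        }
        i += j;                    /* skip past the allocated block */
    }
    return -1;
}

/* Release a previous allocation starting at index first. */
static void free_blocks(size_t first)
{
    size_t n = nblocks[first];
    for (size_t j = 0; j < n; j++) {
        bitmap[first + j] = 0;
        owner[first + j]  = NULL;
    }
    nblocks[first] = 0;
}
```

A first-fit search over the bit array like this is adequate for an edge device with few concurrent allocations; the owner array is what later makes batch reclamation on device close possible.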
In addition, we register our own private data in the vma (virtual memory area) corresponding to the allocated memory, recording the information the block needs later: the bus (physical) address range and mapped address of the block, and its index in the bitmap array, used for address translation; plus flag information identifying the block as coming from the reserved memory.
At the same time, we register a close function in the vma's vm_ops, in preparation for reclaiming the block.
The code for memory allocation is fairly long, so it is not listed in full here.
8. Memory reclamation
All allocated memory blocks eventually need to be reclaimed. Two cases are most representative: the user explicitly releasing a block, and the cleanup of blocks left unreleased when the user closes the device.
When the user releases a memory block, the close function of the corresponding vma is executed automatically. We therefore register this function as the memory-release handler: it first checks private_data to avoid handling a foreign vma, then clears the corresponding bits in the bitmap along with the owner and block-count information, returning the reserved memory to the allocatable state.
When the user closes the device, the release function registered for the device is called. In it, we traverse the owner array and clean up every memory block whose owner matches the device's file pointer, achieving batch reclamation.
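The device-close path can be modeled in userspace like this (a self-contained sketch mirroring the per-block owner bookkeeping; names and sizes are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace model of batch reclamation at device close: every block
 * whose owner matches the closing file pointer is freed. */
#define NUM_BLOCKS 8

static uint8_t bitmap[NUM_BLOCKS]; /* 0 = free, 1 = allocated */
static void   *owner[NUM_BLOCKS];  /* owning "file pointer" per block */

/* Called from the driver's release handler: reclaim all blocks owned
 * by filp and return how many were freed. */
static size_t reclaim_all(void *filp)
{
    size_t freed = 0;
    for (size_t i = 0; i < NUM_BLOCKS; i++) {
        if (bitmap[i] && owner[i] == filp) {
            bitmap[i] = 0;
            owner[i]  = NULL;
            freed++;
        }
    }
    return freed;
}
```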
9. Address translation of memory
Address translation performs the bidirectional conversion between bus (physical) addresses and virtual addresses; it underpins the flush/invalidate operations on the memory cache and the processing required when a block is handed over to the FPGA. The driver also exposes the two conversions to user space through two IOCTLs.
In a translation operation, we first find the corresponding vma, then compute the offset of the address within it, and finally check the flag in the vma to confirm the mapping is a reserved memory block; if it is, the information saved in the vma's private_data, combined with the offset, completes the conversion.
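The arithmetic itself is simple. Modeled in userspace (a sketch: the struct stands in for what the driver keeps in the vma's private_data, and the field names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of the translation: a record pairing the mapped
 * virtual range with the bus/physical base of the reserved block. */
struct mem_region {
    uintptr_t virt_start;  /* start of the mapped virtual range */
    uintptr_t virt_end;    /* end of the range (exclusive)      */
    uint64_t  phys_start;  /* bus/physical base of the block    */
};

/* Virtual -> physical: offset within the vma plus the physical base. */
static uint64_t virt_to_phys_addr(const struct mem_region *r, uintptr_t va)
{
    assert(va >= r->virt_start && va < r->virt_end);
    return r->phys_start + (uint64_t)(va - r->virt_start);
}

/* Physical -> virtual: the inverse computation. */
static uintptr_t phys_to_virt_addr(const struct mem_region *r, uint64_t pa)
{
    return r->virt_start + (uintptr_t)(pa - r->phys_start);
}
```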
10. The flush and invalidate operations of the memory cache
When DDR memory is used to transfer data between the CPU and its heterogeneous sibling, the FPGA, flush or invalidate operations are needed to eliminate the effects of the CPU cache.
First, our driver code runs on the CPU. Before a block of memory is handed from the CPU to the FPGA, we must flush the cache for that block so that all changes held in the cache are written back to DDR; only then does the FPGA start working on it. Conversely, before a block is handed from the FPGA back to the CPU, the cache for that block must be invalidated, so that the next CPU read loads fresh data from DDR instead of stale cache contents. In this way the CPU and the FPGA get along well.
Flush and invalidate are CPU-architecture-specific. EdgeBoard uses Cortex-A53 cores (ARMv8-A, AArch64 execution state), and the cache flush and invalidate code is as follows:
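The original listing is not included in this copy; the following is a sketch of typical AArch64 data-cache maintenance by virtual address (the 64-byte cache-line size of the A53 and the function names are assumptions; this is kernel code for AArch64 targets only):

```c
/* Sketch of AArch64 data-cache maintenance by virtual address.
 * A 64-byte cache line is assumed for the Cortex-A53. */
#define CACHE_LINE 64

/* Flush: clean (and invalidate) each line so dirty data reaches DDR
 * before the FPGA reads the buffer. */
static void cache_flush_range(void *start, size_t size)
{
    uintptr_t p   = (uintptr_t)start & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)start + size;

    for (; p < end; p += CACHE_LINE)
        asm volatile("dc civac, %0" :: "r"(p) : "memory");
    asm volatile("dsb sy" ::: "memory");   /* wait for completion */
}

/* Invalidate: discard cached lines so the next CPU read fetches from
 * DDR after the FPGA has written the buffer. */
static void cache_invalidate_range(void *start, size_t size)
{
    uintptr_t p   = (uintptr_t)start & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)start + size;

    for (; p < end; p += CACHE_LINE)
        asm volatile("dc ivac, %0" :: "r"(p) : "memory");
    asm volatile("dsb sy" ::: "memory");
}
```

The `dsb sy` barriers after each loop ensure the maintenance operations have completed before ownership of the buffer is handed to the other side.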
11. Other functional designs and considerations in the memory driver
The reserved memory area described above uses a single block size of 16 KB. In practice, several block sizes could be supported, which would make the driver more convenient and versatile for users, at the cost of more bookkeeping data in the memory-management part.
In addition, a few extra IOCTLs can provide fast, simple mechanisms for sharing memory between processes; this, too, is worth considering in the design of a memory driver.