Coprocessor Architecture: An Embedded System Architecture for Rapid Prototyping

The coprocessor’s hardware architecture provides embedded designers with a high-performance platform that maintains their design flexibility throughout the development process and after product release. By first verifying the algorithm in C or C++, processes, data and signal paths, and critical functions can all be verified in a relatively short period of time. Then, by converting processor-intensive algorithms into coprocessor FPGAs, designers can enjoy the benefits of hardware acceleration and a more modular design.

By Noah Madinger, Colorado Electronic Product Design (CEPD)

Editor’s Note – Although the coprocessor architecture is known for its digital processing performance and throughput, it also offers embedded system designers the opportunity to implement project management strategies that reduce development costs and speed time-to-market. This article focuses on the combination of discrete microcontrollers (MCUs) and discrete field programmable gate arrays (FPGAs), showing how this architecture lends itself to an efficient and iterative design process. Using research materials, empirical findings, and case studies, the benefits of this architecture are explored and exemplary applications are provided. After reading this article, embedded system designers will have a better understanding of when and how to implement this versatile hardware architecture.


Embedded system designers often find themselves caught between design constraints, performance expectations, and schedule and budget concerns. Modern project management buzzwords and phrases such as “fail fast,” “agile,” “future-proof,” and “disruptive!” further highlight the precarious nature of this role. The maneuvering required to meet these expectations can be harrowing, yet the expectations have been propagated and continually reinforced throughout the market. What is needed is a design approach that supports an evolving, iterative process and, as with most embedded systems, begins with the hardware architecture.

Coprocessor architecture is a hardware architecture that combines the advantages of microcontroller unit (MCU) and field programmable gate array (FPGA) technologies. It gives embedded designers a process capable of meeting the most demanding requirements while providing the flexibility needed to address both known and unknown challenges. By providing hardware that can be adjusted iteratively, designers can demonstrate progress, reach key milestones, and take full advantage of the rapid prototyping process.

Along the way, there are a few key project milestones, each with its own value to the development effort. This article refers to them by the following names: the Digital Signal Processing with Microcontrollers milestone, the System Management with Microcontrollers milestone, and the Product Deployment milestone.

By the end of this article, we will demonstrate that a flexible hardware architecture can be better suited to modern embedded system design than a more rigid approach. In addition, this approach can improve both project cost and time-to-market. This position is supported by arguments, examples, and case studies. By examining the value that each milestone gains from the design flexibility of this architecture, it becomes clear that an adaptive hardware architecture is a powerful driver of embedded system design.

Exploring the Benefits of Coprocessor Architectures: Design Flexibility and High-Performance Processing

A common application for FPGA designs is to interface directly with high-speed analog-to-digital converters (ADCs). After the signal is digitized, it is read into the FPGA, where digital signal processing (DSP) algorithms are applied to it. Finally, the FPGA makes decisions based on the results.

This application is used as an example throughout this article. Figure 1 shows a general coprocessor architecture in which the MCU and FPGA are connected through the MCU’s external memory interface, so the FPGA is treated as an external static random access memory (SRAM). Signals returned from the FPGA to the MCU serve as hardware interrupt lines and status indications. This allows the FPGA to indicate critical status to the MCU, such as an ADC conversion being ready, a fault having occurred, or another notable event.
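
To make the memory-mapped arrangement concrete, the following is a minimal C sketch of how MCU firmware might read samples out of the FPGA window. The register layout, flag values, and function name are hypothetical and are not taken from the article; a real design would define its own map in the FPGA’s HDL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register map: offsets are in 16-bit words from the base
 * of the FPGA window on the MCU's external memory interface. */
enum {
    FPGA_REG_STATUS = 0, /* status flags, see below */
    FPGA_REG_COUNT  = 1, /* number of samples waiting in the buffer */
    FPGA_REG_DATA   = 2  /* first word of the sample buffer */
};

#define FPGA_STATUS_DATA_READY 0x0001u /* ADC results are available */
#define FPGA_STATUS_FAULT      0x0002u /* FPGA has detected a fault */

/* Copies pending samples out of the FPGA window into dst.
 * Returns the number of samples copied, 0 if no data is ready,
 * or -1 if the fault flag is set. */
int fpga_read_samples(volatile const uint16_t *fpga, int16_t *dst, size_t max)
{
    assert(fpga != NULL && dst != NULL);

    uint16_t status = fpga[FPGA_REG_STATUS];
    if (status & FPGA_STATUS_FAULT)
        return -1;
    if (!(status & FPGA_STATUS_DATA_READY))
        return 0;

    size_t n = fpga[FPGA_REG_COUNT];
    if (n > max)
        n = max;
    for (size_t i = 0; i < n; ++i)
        dst[i] = (int16_t)fpga[FPGA_REG_DATA + i];
    return (int)n;
}
```

On a target, the `fpga` pointer would be the base address that the MCU’s external memory controller assigns to the FPGA, and the function would typically be called from the handler of the interrupt line described above. Passing the base as a parameter also lets the same routine be exercised on a desktop against an ordinary array.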

Figure 1: Schematic diagram of a generic coprocessor (MCU + FPGA). (Image credit: CEPD)

The benefits of the coprocessor approach are probably best seen in the deliverables of each of the milestones named above. Value is measured not only by what a task or phase achieves, but also by what those achievements enable. The following questions help assess the overall value of a milestone’s deliverables.

· Can the progress of other team members now continue more rapidly as project dependencies and bottlenecks are removed?
· How do milestone achievements enable further parallel operations?

Digital Signal Processing with Microcontrollers Milestone

Figure 2: Architecture – Digital Signal Processing with a Microcontroller. (Image credit: CEPD)

The first development phase enabled by this hardware architecture puts the MCU at the center of the work. All else being equal, developing MCU firmware and executable software takes fewer resources and less time than developing FPGA logic in a hardware description language (HDL). Therefore, by starting product development with an MCU as the main processor, algorithms can be implemented, tested, and validated more quickly. This allows algorithmic and logic errors to be discovered early in the design process, and also allows substantial parts of the signal chain to be tested and verified.

In this initial milestone, the FPGA acts as a high-speed data collection interface. Its task is to reliably manage data from the high-speed ADC, alert the MCU that data is available, and provide this data on the MCU’s external memory interface. Although this role does not include implementing HDL-based DSP processes or other algorithms, it is still critical.

FPGA development at this stage lays the foundation for the ultimate success of the product, both during development and after release to the market. Focusing only on the low-level interface leaves enough time to test these basic operations thoroughly. The milestone can be considered complete only when the FPGA performs this interface role reliably.

Key deliverables from this initial milestone include the following benefits:

• The entire signal path – all amplification, attenuation and conversion – will be tested and verified.
• By initially implementing the algorithm in software (C/C++), the time and effort for project development will be reduced; this is of considerable value to management and other stakeholders, who must see the feasibility of the project before approving future design phases.
• The experience of implementing algorithms in C/C++ will be transferred directly to the HDL implementation through the use of “software to HDL” tools such as Xilinx Vivado HLS.

System Management with Microcontrollers Milestone

Figure 3: Architecture – System management with a microcontroller. (Image credit: CEPD)

The second stage of development offered by this coprocessor approach is defined by the transfer of DSP processes and algorithm implementations from the MCU to the FPGA. The FPGA is still responsible for the high-speed ADC interface, but by taking on these other roles, the speed and parallelism provided by the FPGA are fully exploited. Also, unlike on an MCU, multiple instances of DSP processes and algorithm channels can be implemented and run simultaneously.

The designer carries the experience and confidence gained from the MCU implementation into this next milestone. Tools such as the aforementioned Xilinx Vivado HLS provide functional translation from executable C/C++ code to synthesizable HDL. Timing constraints, process parameters, and other user preferences still have to be defined and implemented, but the core functionality is carried over and translated into the FPGA fabric.

For this milestone, the role of the MCU is that of a system manager. Status and control registers within the FPGA are monitored, updated, and reported by the MCU. In addition, the MCU manages the user interface (UI). This user interface can take the form of a web server accessed via an Ethernet or Wi-Fi connection, or it can be an industrial touchscreen that users access at the point of use. The key takeaway from the MCU’s new, more focused role is that, once it is freed from computationally intensive processing, both the MCU and the FPGA are used for the tasks to which they are best suited.

Key deliverables of this milestone include the following benefits:

• Fast, parallel execution of DSP processes and algorithm implementations provided by FPGAs. The MCU provides a responsive and streamlined UI and manages the execution of the product.
• Algorithmic risk is mitigated because the algorithms were first developed and verified on the MCU, and this mitigation carries over to the synthesizable HDL. Tools like Vivado HLS make this transition easier. Additionally, FPGA-specific risks can be mitigated through integrated simulation tools such as those in the Vivado Design Suite.
• Stakeholders are not exposed to significant risk by moving the process to the FPGA. Instead, they can see and enjoy the benefits of FPGA speed and parallelism. Having observed significant performance improvements, work can now focus on preparing the design for manufacturing.

Product Deployment Milestone

With the computationally intensive processing handled within the FPGA and the MCU comfortably managing its system management and user interface roles, the product is ready for deployment. To be clear, this article does not advocate bypassing Alpha and Beta releases; however, the focus of this milestone is on the capabilities that the coprocessor architecture provides for production deployment.

Both the MCU and the FPGA are field-updatable devices. Several advances have been made to make FPGA updates as easy as software updates. Additionally, because the FPGA is within the MCU’s addressable memory space, the MCU can serve as an access point for the entire system: receiving updates for itself and the FPGA at the same time. Updates can be conditionally scheduled, distributed, and customized on a per-end-user basis. Finally, user and use case log maintenance can be performed and associated with a specific build implementation. With these datasets, performance can continue to be refined and improved even after the product enters the field.

Perhaps this benefit of whole-system update capability is most fully realized in space-based applications. Once a product is launched, maintenance and updates must be done remotely. This can be as simple as changing a logical condition, or as complex as updating a communication modulation scheme. The programmability offered by FPGA technology, together with the coprocessor architecture, can meet the full range of these requirements, and radiation-hardened component options are available.

The final key takeaway from this milestone is progressive cost reduction. Cost reductions, bill of materials (BOM) changes, and other optimizations can also take place at this milestone. Once the product is deployed in the field, it may turn out that a less expensive MCU or a less capable FPGA is sufficient. With the coprocessor architecture, designers are not forced to use components whose capabilities exceed the requirements of their application. Additionally, if a component becomes unavailable, the architecture allows new components to be incorporated into the design. This is not the case with single-chip, system-on-chip (SoC) architectures, or with a high-performance DSP or MCU that attempts to handle all of the product’s processing. The coprocessor architecture is a good combination of power and flexibility, giving designers more choice and freedom during the development phase and when releasing to the market.

Supporting Research and Related Case Studies

Satellite Communication Example

In short, the value of a coprocessor is that it offloads the main processing unit and allows tasks to be performed in hardware, where the benefits of acceleration and streamlining can be exploited. This design choice yields a net increase in computing speed and capability and, as this article argues, a reduction in development cost and development time. Perhaps the most striking example of these benefits is in the area of space communication systems.

In their publication “FPGA-style hardware as a coprocessor,” G. Prasad and N. Vasantha detail how data processing in an FPGA can meet the computational demands of satellite communication systems without the high non-recurring engineering (NRE) costs of application-specific integrated circuits (ASICs) or the application-specific limitations of hard-architecture processors. As described for the Digital Signal Processing with Microcontrollers milestone, their design starts with the application processor executing the most computationally intensive algorithms. From this starting point, they identified the critical parts of the software that consumed most of the central processing unit (CPU) clock cycles and migrated these parts to an HDL implementation. Their graphical representation closely resembles what has been presented so far; however, they chose to represent the application as its own independent block, since it can be implemented in either the host (processor) or the FPGA-based hardware.

Figure 4: Application, host processor, and FPGA-based hardware – for a satellite communications example.

Peripheral performance is greatly improved by utilizing the Peripheral Component Interconnect (PCI) interface and direct memory access (DMA) from the host processor. The improvement is most evident in the de-randomization process. When this process ran in software on the host processor, it clearly bottlenecked the real-time response of the system. Once it was moved to the FPGA, the following benefits were observed:

• The de-randomization process is executed in real time without creating bottlenecks.
• The computational overhead of the host processor is greatly reduced, and it can now better perform the required recording role.
• The overall performance of the entire system is improved.

All of this is achieved without the overhead associated with ASICs, while retaining the flexibility of programmable logic [5]. Satellite communications pose considerable challenges, and this design approach reliably meets them while continuing to provide design flexibility.

Automotive Infotainment Example

In-car entertainment systems are a feature valued by discerning consumers. Unlike most automotive electronics, these devices are highly visible and expectations are high for excellent response times and performance. However, designers are often squeezed between current design needs and the flexibility needed for future functionality. In this example, we will use the implementation requirements of signal processing and wireless communication to highlight the advantages of the coprocessor hardware architecture.

One of the leading automotive entertainment system architectures was published by Delphi Delco Electronic Systems. The architecture uses an SH-4 MCU and a companion ASIC, Hitachi’s HD64404 Amanda peripheral. This architecture fulfills more than 75% of the basic entertainment functions of the automotive market; however, it lacks the ability to address video processing applications and wireless communications. Adding an FPGA to this existing architecture brings further flexibility and capability to the design.

Figure 5: Example 1 of an FPGA coprocessor architecture for an infotainment system.

The architecture of Figure 5 is suitable for both video processing and wireless communication management. By pushing the DSP functions to the FPGA, the Amanda processor can play a system management role and be freed up to implement the wireless communication stack. Since both the Amanda and the FPGA have access to external memory, data can be quickly exchanged between the system’s processors and components.

Figure 6: Example 2 of an FPGA coprocessor architecture for an infotainment system.

The second infotainment system in Figure 6 highlights the capabilities of the FPGA while addressing both the incoming high-speed analog data and the compression and encoding processing required for video applications. In fact, all of these functions can be pushed to the FPGA, and by using parallel processing, these can be processed in real time.

Adding an FPGA to an existing hardware architecture combines the proven performance of the existing hardware with added flexibility and future-proofing. Even in existing systems, the coprocessor architecture provides designers with options that would otherwise not be available [6].

Rapid Prototyping Advantages

At the heart of the rapid prototyping process is the need to cover a large number of product development areas, thus enabling parallel execution of tasks, rapid identification of “bugs” and design issues, and validation of data and signal paths, especially those within the project’s critical path. However, for this process to truly yield streamlined, efficient results, there must be sufficient expertise in the required project area.

Traditionally, this meant that there had to be a hardware engineer, an embedded software or DSP engineer, and an HDL engineer. Today, there are many interdisciplinary professionals who may be able to fulfill multiple roles; however, there is still a significant project overhead involved in coordinating these efforts.

In “An FPGA-Based Rapid Prototyping Platform for Wavelet Coprocessors,” the authors argue that a coprocessor architecture allows a single DSP engineer to fill all of these roles efficiently. For this research, the team began by designing and simulating the required DSP functions in MATLAB’s Simulink tool. This served two main purposes: 1) verifying the desired performance through simulation, and 2) providing a baseline against which future design choices could be compared.

After simulation, key functions are identified and divided into different cores – these are soft core components and processors that can be synthesized within the FPGA. The most important step in this work is to define the interfaces between these cores and components and compare the data exchange performance with the expected, simulated performance. This design process is closely integrated with Xilinx’s embedded system design process, which is summarized in Figure 7 below.

Figure 7: Implementation Design Flow

By dividing the system into integrable cores, DSP engineers can focus on the most critical aspects of the signal processing chain. She/he does not need to be an expert in hardware or HDL to modify, route or implement different soft processors or components within an FPGA. Therefore, as long as designers understand the interface and data format, they have complete control over the signal path and can refine the performance of the system.

Empirical Results – A Case Study of the Discrete Cosine Transform

The empirical results not only confirm the flexibility that the coprocessor architecture offers embedded system designers, but also demonstrate the performance-enhancing options of modern FPGA tools. Enhancements like the ones described below may not be available, or may have less impact, on other hardware architectures. The discrete cosine transform (DCT) was chosen as a computationally intensive algorithm, and its progression from a C-based to an HDL-based implementation is central to these results. The DCT was chosen because it can be used for pattern recognition and filtering in digital signal processing [8]. These empirical results are based on lab work completed by the author and colleagues as part of certification as a Xilinx Alliance Partner for 2020-2021.

In this work, the following tools and equipment were used:

・ Vivado HLS 2019 Edition
・ The device used for evaluation and simulation is xczu7ev-ffvc1156-2-e

Starting with the C-based implementation, the DCT algorithm accepts two arrays of 16-bit values; array “a” is the input to the DCT, and array “b” is its output. The data width (DW) is therefore defined as 16, and the number of elements (N) in each array is 1024/DW, or 64. Finally, the size of the DCT matrix (DCT_SIZE) is set to 8, which means that an 8×8 matrix is used.
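
For readers who want something concrete to start from, here is a minimal C sketch of a DCT with the same parameters (DW = 16, N = 64, DCT_SIZE = 8). It is not the code used in the lab: it assumes a plain row-then-column DCT-II with orthonormal scaling, floating-point arithmetic, and saturation of the 16-bit output, and the helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define DW       16           /* data width in bits */
#define N        (1024 / DW)  /* 64 elements per array */
#define DCT_SIZE 8            /* 8x8 transform */

/* cos(j*pi/16) for j = 0..8; other angles follow from symmetry. */
static const double COS16[9] = {
    1.0, 0.9807852804032304, 0.9238795325112867, 0.8314696123025452,
    0.7071067811865476, 0.5555702330196022, 0.3826834323650898,
    0.1950903220161283, 0.0
};

/* Returns cos(m*pi/16) for any non-negative integer m. */
static double cos_pi16(int m)
{
    m %= 32;                                    /* period of 2*pi */
    if (m > 16) m = 32 - m;                     /* cos(2*pi - x) = cos(x) */
    return (m > 8) ? -COS16[16 - m] : COS16[m]; /* cos(pi - x) = -cos(x) */
}

/* One-dimensional DCT-II of DCT_SIZE values, orthonormal scaling. */
static void dct_1d(const double in[DCT_SIZE], double out[DCT_SIZE])
{
    for (int k = 0; k < DCT_SIZE; ++k) {
        double acc = 0.0;
        for (int n = 0; n < DCT_SIZE; ++n)
            acc += in[n] * cos_pi16((2 * n + 1) * k);
        out[k] = acc * (k == 0 ? 0.3535533905932738 : 0.5);
    }
}

/* Rounds to nearest and clamps to the 16-bit output range. */
static int16_t round_sat16(double x)
{
    long v = (long)(x >= 0.0 ? x + 0.5 : x - 0.5);
    if (v > INT16_MAX) v = INT16_MAX;
    if (v < INT16_MIN) v = INT16_MIN;
    return (int16_t)v;
}

/* 2D DCT of the 8x8 block stored row-major in a; result goes to b. */
void dct(const int16_t a[N], int16_t b[N])
{
    double tmp[DCT_SIZE][DCT_SIZE], line[DCT_SIZE], res[DCT_SIZE];
    assert(N == DCT_SIZE * DCT_SIZE);

    for (int r = 0; r < DCT_SIZE; ++r) {        /* row pass */
        for (int c = 0; c < DCT_SIZE; ++c) line[c] = a[r * DCT_SIZE + c];
        dct_1d(line, res);
        for (int c = 0; c < DCT_SIZE; ++c) tmp[r][c] = res[c];
    }
    for (int c = 0; c < DCT_SIZE; ++c) {        /* column pass */
        for (int r = 0; r < DCT_SIZE; ++r) line[r] = tmp[r][c];
        dct_1d(line, res);
        for (int r = 0; r < DCT_SIZE; ++r)
            b[r * DCT_SIZE + c] = round_sat16(res[r]);
    }
}
```

A version intended for synthesis would more likely use a fixed-point coefficient table than `double` arithmetic; the floating-point form is used here only because it is the quickest way to check the algorithm’s behavior on an MCU or desktop.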

In keeping with the premise of this article, the C-based implementation allows designers to develop and verify the functionality of the algorithm quickly. Execution time matters, but at this stage validation weights functionality more heavily. This weighting is acceptable because the final implementation of the algorithm will be in an FPGA, where hardware acceleration, loop unrolling, and other techniques are readily available.

Figure 8: Xilinx Vivado HLS design flow.

Once the DCT code is created as a project in the Vivado HLS tool, the next step is to begin synthesizing the design for FPGA implementation. Some of the most impactful benefits of moving algorithm execution from the MCU to the FPGA become apparent in this step. For reference, this step corresponds to the System Management with Microcontrollers milestone discussed above.

Modern FPGA tools allow for a series of optimizations and enhancements that greatly improve the performance of complex algorithms. Before analyzing the results, there are some important terms to keep in mind.

• Latency – the number of clock cycles required to execute all iterations of the loop [10].
• Interval – the number of clock cycles before the next iteration of the loop starts processing data [11].
• BRAM – block random access memory
• DSP48E – DSP block for the UltraScale architecture
• FF – flip-flop
• LUT – look-up table
• URAM – unified random access memory (can be composed of a single transistor)
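
The relationship between the first two terms can be illustrated with a little arithmetic. The C sketch below uses the common first-order model of a pipelined loop, in which the first iteration takes a fixed number of cycles and each later iteration starts one interval after the previous one. The function names are illustrative, and synthesis reports typically add a few cycles of loop entry and exit overhead that this model ignores.

```c
#include <assert.h>

/* Cycles for a loop whose iterations run strictly one after another.
 * depth: cycles needed by one iteration; trips: number of iterations. */
unsigned long sequential_loop_cycles(unsigned long depth, unsigned long trips)
{
    assert(depth > 0 && trips > 0);
    return depth * trips;
}

/* Cycles for a pipelined loop: the first iteration takes `depth` cycles
 * and every later iteration starts `interval` cycles after the previous. */
unsigned long pipelined_loop_cycles(unsigned long depth,
                                    unsigned long interval,
                                    unsigned long trips)
{
    assert(depth > 0 && interval > 0 && trips > 0);
    return depth + interval * (trips - 1);
}
```

Under this model, a 64-iteration loop whose body takes 4 cycles needs 256 cycles when run sequentially, but only 67 cycles when pipelined with an interval of 1.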

Table 1: FPGA algorithm execution optimization results (latency and interval).

Table 2: FPGA algorithm execution optimization results (resource utilization).


Default

The default optimization settings reflect the unmodified result of converting the C-based algorithm to synthesizable HDL. No optimizations are enabled, so these results serve as a performance baseline for evaluating the other optimizations.

Pipeline Inner Loop

Applied to the inner loop, the PIPELINE directive instructs Vivado HLS to pipeline that loop so that new data can start being processed while existing data is still in the pipeline. New data therefore does not have to wait for existing data to complete before processing starts.

Pipeline Outer Loop

Here the PIPELINE directive is applied to the outer loop, so the operation of the outer loop is now pipelined, while the operations of the inner loop now execute concurrently. As a result of applying the directive directly to the outer loop, both the latency and the interval are cut in half.
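
The two placements can be sketched on a small example. The code below is an illustration written for this article rather than the lab’s DCT source: it multiplies an 8×8 coefficient matrix by a vector, and the function and label names are assumptions. The `#pragma HLS PIPELINE` line is the pragma form of the PIPELINE directive; an ordinary C compiler ignores it, so the function can still be tested in software.

```c
#include <assert.h>
#include <stdint.h>

#define DCT_SIZE 8

/* Multiplies an 8x8 coefficient matrix by an 8-element input vector. */
void matvec8(int16_t coeff[DCT_SIZE][DCT_SIZE],
             const int16_t in[DCT_SIZE],
             int32_t out[DCT_SIZE])
{
    assert(coeff && in && out);

row_loop:
    for (int k = 0; k < DCT_SIZE; ++k) {
        /* Outer-loop placement: row_loop is pipelined, and the tool
         * unrolls col_loop so its operations can run concurrently. */
#pragma HLS PIPELINE II=1
        int32_t acc = 0;

col_loop:
        for (int n = 0; n < DCT_SIZE; ++n) {
            /* Inner-loop placement would move the pragma to this point,
             * pipelining col_loop only. */
            acc += (int32_t)coeff[k][n] * in[n];
        }
        out[k] = acc;
    }
}
```

Moving the pragma between the two marked positions and re-running synthesis is a quick way to see the trade-off described above: the outer-loop placement generally lowers latency and costs more multipliers and registers.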

Array Partition

This directive maps the contents of the loops into arrays, flattening all memory accesses to individual elements in those arrays. Doing so consumes more RAM, but again, the execution time of the algorithm is cut in half.
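
As an illustration of where such a directive goes, the C sketch below applies the pragma form, `#pragma HLS ARRAY_PARTITION`, to a local 8×8 buffer. The function and buffer names are hypothetical and the example is deliberately simpler than the DCT. Without partitioning, a buffer like this is usually mapped to a memory with a limited number of ports, which restricts how many elements can be read per clock cycle.

```c
#include <assert.h>
#include <stdint.h>

#define DCT_SIZE 8

/* Sums each row of an 8x8 block that is first copied to a local buffer. */
void row_sums(const int16_t in[DCT_SIZE * DCT_SIZE], int32_t sums[DCT_SIZE])
{
    int16_t buf[DCT_SIZE][DCT_SIZE];
    /* Split the second dimension completely so that all eight elements
     * of a row can be accessed in the same clock cycle. */
#pragma HLS ARRAY_PARTITION variable=buf complete dim=2

    assert(in && sums);

    for (int r = 0; r < DCT_SIZE; ++r)
        for (int c = 0; c < DCT_SIZE; ++c)
            buf[r][c] = in[r * DCT_SIZE + c];

    for (int r = 0; r < DCT_SIZE; ++r) {
        int32_t acc = 0;
        for (int c = 0; c < DCT_SIZE; ++c)
            acc += buf[r][c];
        sums[r] = acc;
    }
}
```

Partitioning pays off mainly in combination with pipelining or unrolling, because those are the optimizations that create the demand for several array accesses in one cycle.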

Dataflow

This directive allows the designer to specify the target number of clock cycles between each input read. It is supported only for top-level functions, and only loops and functions exposed at that level benefit from it.


Inline

The INLINE directive flattens all loops, both inner and outer. Row and column processes can now execute concurrently. The number of required clock cycles is kept to a minimum, at the cost of more FPGA resources.


Conclusion

The coprocessor’s hardware architecture provides embedded designers with a high-performance platform that maintains their design flexibility throughout the development process and after product release. By first verifying the algorithm in C or C++, processes, data and signal paths, and critical functions can all be verified in a relatively short period of time. Then, by converting processor-intensive algorithms into coprocessor FPGAs, designers can enjoy the benefits of hardware acceleration and a more modular design.

If parts become obsolete or need optimization, the same architecture can accommodate these changes. New MCUs and new FPGAs can be fitted into the design while all interfaces remain relatively unchanged. In addition, since both the MCU and FPGA are field-updatable, user-specific changes and optimizations can be applied both in the field and remotely.

Ultimately, this architecture combines the development speed and usability of an MCU with the performance and scalability of an FPGA. With optimizations and performance improvements at every development step, coprocessor architectures can meet the most challenging needs—both in today’s designs and those in the future.
