HSP: The new hardware accelerator that transforms an ultra-low-power STM32U3 into an AI machine

Today, ST is launching its hardware signal processor (HSP), a new hardware unit that will appear in more and more of our upcoming STM32 microcontrollers, starting with the new STM32U3B5/C5 devices featuring 2 MB of flash. In a nutshell, the HSP accelerates various computations, such as certain Fourier transform algorithms. However, unlike a traditional DSP, it requires no additional external device, no setup or configuration, and can be used with just a few API calls thanks to ST’s hardware abstraction layer. Consequently, it quickly and easily enables low-power devices to deliver significant computational improvements. In some cases, the new HSP means the STM32U3 will be up to 13 times faster than a standard Cortex-M33 device.

Learn more about the new STM32U3 with 2 MB of flash

Why many ultra-low-power systems could use an external DSP but don’t

The current tradeoff between ultra-low power and advanced sensing applications

Too often, engineers accept a serious performance compromise for the sake of ultra-low power consumption or a cheaper bill of materials (BoM). That compromise, however, shuts the door on certain applications. For instance, industrial sensing applications that monitor vibrations or acceleration often benefit from time- or frequency-domain analysis, which requires substantial computation to run a real or complex fast Fourier transform. The cost of running such a computation on an ultra-low-power MCU may be prohibitive: it takes too long, runs too hot, or requires an additional DSP, which makes the whole proposition too expensive. Engineers must, therefore, give up certain features or pay for a very different BoM.

The very real challenges of using an external DSP

The other major challenge is that programming an external DSP is a hurdle in its own right. While there is a romantic notion that such a co-processor is open and can, as a result, adapt to a wide range of applications, programming it is also a complex and costly endeavor that requires advanced expertise. Engineers must iterate through many firmware versions before their code is mature, stable, and optimized. Moreover, the resulting code is often not portable. Engineers must either repeatedly choose the same external DSP, severely limiting their BoM, or start development from scratch, which is costly, time-consuming, and potentially demoralizing.

It’s easy to see how an external DSP adds quite a bit of complexity. For example, designers must create a PCB layout that includes more failure points and accounts for more traces and passive components. Working with an external DSP is also challenging because developers often have to work at a low level, with few tools or GUIs to support them. There are very few community-supported development environments that offer common utilities or interfaces for debugging code, configuring hardware, or monitoring memory or CPU usage. For example, it is common for a company to work on an audio DSP for years before releasing any product to market. However, for many small companies, this is simply untenable.

Why ST’s new HSP transforms more than a Fourier algorithm

13x better performance than a traditional Cortex-M33

According to internal benchmarks, the new ST HSP delivers up to 13 times the performance of a traditional Cortex-M33. The tests ran 32-bit fixed-point and floating-point complex fast Fourier transforms with 256 samples, and a 32-bit fixed-point real fast Fourier transform with the same number of samples. We chose these particular algorithms because they are common in sensing applications, and designers will often use an external DSP to accelerate them. For instance, years ago, we published an application note showing how they could be used in the development of high-pass digital filters.

9x better power efficiency than an STM32U5

If we specifically compare the new STM32U3 with HSP to other STM32 devices with a Cortex-M33, such as the STM32U5 or an STM32U3 without HSP, we see 9x and 3x gains, respectively. And while it is expected that adding dedicated multiply-accumulate units to our MCU offers significant gains, what stands out is that ST’s HSP enables a device like the STM32U3 to deliver these performance gains without compromising its ultra-low-power capabilities. Admittedly, using the HSP somewhat increases absolute power consumption, but because it completes computations significantly faster, the overall energy consumed goes down. In fact, certain edge AI applications double their power efficiency.
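The energy argument above can be written out as a short worked relation. The symbols and the sample numbers below are assumptions chosen purely for illustration; they are not ST measurements. Let a be the factor by which power draw rises while the HSP is active, and s the factor by which the computation speeds up.

```latex
E = P \cdot t
\qquad\Longrightarrow\qquad
\frac{E_{\mathrm{HSP}}}{E_{\mathrm{CPU}}}
  = \frac{P_{\mathrm{HSP}}}{P_{\mathrm{CPU}}} \cdot \frac{t_{\mathrm{HSP}}}{t_{\mathrm{CPU}}}
  = \frac{a}{s}

% Hypothetical example: a = 1.5 (50% more power), s = 3 (three times faster)
\frac{E_{\mathrm{HSP}}}{E_{\mathrm{CPU}}} = \frac{1.5}{3} = 0.5
```

In this hypothetical case, the task consumes half the energy even though instantaneous power is higher. Energy per task drops whenever s exceeds a, and the device can return to a low-power mode sooner.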

3x better performance than a Cortex-M55 with MVE

When comparing the HSP of the new STM32U3 to a more power-hungry core, such as a typical Cortex-M55 from the competition, our internal tests show we offer about 3x the performance on the same algorithms as in previous benchmarks. These results stand out because the Cortex-M55 comes with an M-Profile Vector Extension (MVE), known as Helium, which specifically accelerates DSP applications. Yet, the ST HSP offers a significant advantage. Concretely, it allows engineers to choose an STM32U3 over an MCU that’s far more costly or power-hungry, while still benefiting from a more powerful and energy-efficient platform.

9x better than a Cortex-M33 using TensorFlow Lite for MCU

ST engineers working on the new HSP prioritized digital signal processing applications. However, upon seeing the performance gains, they realized the new hardware unit could also significantly accelerate some neural network algorithms, especially on a device like the STM32U3. For instance, the same keyword spotting, image classification, or visual wake word algorithm running on the new HSP will see a 6x to 9x gain compared to a Cortex-M33 using TensorFlow Lite for MCU. Compared to an STM32 MCU with the same core using STM32Cube AI Studio, the boost is 3x. That’s why we updated STM32Cube AI Studio, which has transformed the creation of AI at the edge, to leverage the new hardware signal processor.

Ultra-low-power devices that would never have run an AI application at the edge are now viable candidates. On a near-sub-threshold device like the STM32U3, the HSP offers such a performance boost that it opens the door to entirely new types of applications. We don’t expect to see the same gains in AI on more powerful STM32 MCUs, since they can already run many algorithms very well and are helping democratize machine learning on embedded systems. However, we still expect engineers to enjoy the other benefits of the HSP, which is why we will add it to more MCU series over time.

Infinitely more flexible

The integration with STM32Cube AI Studio is also a testament to the HSP’s ease of use. Since the HSP is a hardware unit within an STM32 MCU, developers can automatically leverage it by simply using our hardware abstraction layer. ST even ensured that the solution was compatible with CMSIS-DSP API calls, allowing developers to reuse the same code on any of our future MCUs with an HSP, which makes the solution highly flexible. Put simply, for the throng of developers who want the power of a DSP without its complexity, the new HSP opens the door to impressive acceleration without the hurdles a DSP traditionally entails.

Battery-less demo at Embedded World 2026

To show exactly what the new HSP enables on a microcontroller like the STM32U3, ST prepared a demo for people attending Embedded World 2026. It builds on the work we did with Dracula Technologies and their printed organic photovoltaic modules that powered an STM32U0. In this demo, we are using dedicated organic modules and two STM32 boards. One is a Nucleo board that features the new STM32U3C5 and a VD55G4 camera, and runs a person detection algorithm using the new HSP. The other is the STM32U083C-DK, which only drives an LCD that shows whether the first board sensed someone’s presence. Put simply, it’s the first time we’ve shown a machine learning algorithm running on only ambient light as an energy source.

Download the HSP User Manual