
I have been using STM32 microcontrollers for quite a long time, from the F1, F2, F3, and F4 to the F7. In one application I moved from an F4 (100 MHz) to an F7 (200 MHz), but this seems to have been a mistake.

The application runs at around 15 kHz on the F4 but only at around 12 kHz on the F7, even though the F7 runs at double the clock speed. So it seems that the two processors have different FPU architectures; from what I have read, the F4's FPU has some parallelism while the F7's can only operate sequentially.

So is it true that, for an application with a heavy FPU load, an F4 outperforms an F7?

Edit:

So I made some measurements on real hardware to verify my thoughts:

Hardware: STM32F722RC vs STM32F412CE

Program: just some FPU operations as used in my application

  float x = 0.0f;   // declaration added; 'x' is used below but was not shown declared
  for(uint16_t i = 0; i < 2000; i++)
  {
      if(x > 6)
      {
          x = 0.1f;
      }
      else if(x < -6)
      {
          x = -0.1f;
      }
      x = x + 0.05f;
      x = x + sinf(x)*cosf(x);
  }

  cyclic_time[ptr] = htim6.Instance->CNT;  // elapsed timer ticks
  cyclic_time[ptr] /= 1e6f;                // 1 MHz tick -> seconds
  ptr++;
  if(ptr >= 20)
  {
      ptr = 0;
  }
  htim6.Instance->CNT = 0;                 // restart the measurement

Performance F4: [screenshot of the cyclic_time buffer]

At 100 MHz: [screenshot]

--> So an average cycle time of around 7.365 ms

Performance F7: [screenshot of the cyclic_time buffer]

At 200 MHz: [screenshot]

At 100 MHz: [screenshot]

--> So an average cycle time of around 9.954 ms @ 200 MHz, best case. (I verified that in both cases the timer runs at the correct clock speed, 100 MHz with a prescaler of 99, so the measurement is correct.)

So that is exactly what I observed in my real application: somehow the F4 outperforms the F7 when it comes to floating-point operations.

Edit2:

Compiler options F4: [screenshot]

Compiler options F7: [screenshot]

Edit3: To ensure that the optimizer is not the problem, I also measured the cycle time with the optimizer set to optimize for speed:

F4: [screenshot] --> around 6.44 ms

F7: [screenshot] --> around 8.344 ms

So this leads to the same problem.

Projects for F4 and F7: http://www.mediafire.com/file/lljhgsez4xk9vat/Test_Projects.rar/file

HansPeterLoft
  • hm, on a modern compiler, I'd assume the complete for loop would get compiled away and replaced with a constant assignment. Also, don't underestimate how complicated sinf/cosf are: depending on your math lib, this might mostly be control flow or soft floating point operations! – Marcus Müller Jul 04 '20 at 10:44
  • I turned off the optimization for the compiler and I use the standard generated settings from a CubeMX export for both uC – HansPeterLoft Jul 04 '20 at 10:48
  • Does that make sense, though? Instead of buying a more expensive microcontroller, you'd turn on optimizations first. Comparing unoptimized code is kind of unfair, especially on a RISC architecture. – Marcus Müller Jul 04 '20 at 10:53
  • I turn on optimized for speed normally, but this test is just to compare the overal speed for basic operations. When I turn on the optimizer for speed, both get way faster, but still the F4 outperforms the F7 – HansPeterLoft Jul 04 '20 at 11:02
  • 1
    yes, but if your compiler produces code that isn't optimal for a CPU, is it then the CPU's fault it's slow? – Marcus Müller Jul 04 '20 at 11:02
  • No, but I want to see if it is the compiler or some hardware options or maybe even the architecture – HansPeterLoft Jul 04 '20 at 11:05
  • Fair point! So, I'm playing around with the godbolt compiler explorer, here's a link for you to have fun: https://gcc.godbolt.org/z/Cb4saQ – Marcus Müller Jul 04 '20 at 11:06
  • also, I'm not sure which clock source TIM6 uses? – Marcus Müller Jul 04 '20 at 11:16
  • TIM6 uses APB1, I tested it also with APB2 on 100Mhz to verify, still the same. – HansPeterLoft Jul 04 '20 at 11:21
  • darn! You're so thorough, I'm starting to doubt my judgement! (which means you're clearly doing good work, thank you for asking such a great question! And I might simply be wrong.) so, are you as excited as me? Next thing I'd do is use objdump -D -S on the object (.o) files produced by your gcc -c calls and compare them (e.g. with diff) – Marcus Müller Jul 04 '20 at 11:27
  • I uploaded both projects with the compiled files here: http://www.mediafire.com/file/lljhgsez4xk9vat/Test_Projects.rar/file I need to go now and will do the disassembly later. I really don't know what could cause this, but it seems not to be the compiler. – HansPeterLoft Jul 04 '20 at 11:49
  • In case it helps: disassembly: https://gist.github.com/marcusmueller/a4e462aa326a073fae6039a254396a2d – Marcus Müller Jul 04 '20 at 12:05

1 Answer


Hm, until a bit of benchmarking shows that it's really the FPU, I'd heavily doubt this is about the Cortex-M7F FPU being slower (it really isn't; I've never seen that).

Generally, try to make sure you're not inadvertently doing something like soft floating-point math (-mfloat-abi=soft), and that you aren't using math libraries that have been optimized for the STM32F4 but not for the F7. Make sure you're compiling for ARMv7-M or, better, ARMv7E-M.
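For reference, a typical set of GCC flags that enables the hardware FPU on each part might look like this (assumption: an arm-none-eabi-gcc toolchain; the exact -mfpu value depends on whether your F7 part has the single- or double-precision FPU):

```shell
# Cortex-M4F (STM32F4): single-precision FPv4 FPU, hard-float ABI
arm-none-eabi-gcc -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 -mfloat-abi=hard -O2 ...

# Cortex-M7F (STM32F7): FPv5 FPU; fpv5-sp-d16 for single-precision parts,
# fpv5-d16 for parts with the double-precision FPU
arm-none-eabi-gcc -mcpu=cortex-m7 -mfpu=fpv5-sp-d16 -mfloat-abi=hard -O2 ...
```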

The fact that you're putting a processing rate on this: this sounds like a DSP workload. So make sure you're really using the DSP instructions: both the M4 and the M7 should have single-cycle multiply-accumulate, so a 200 MHz M7 should in any case be about twice as fast as a 100 MHz M4 if these are used. Your compiler should infer them, but sometimes a bit of hand-written assembly pays off.

So, either you're using a compiler that is too old or configured not to use the FPU and DSP instructions sensibly, or something else is going on here.


From a general DSP engineering perspective: there's often much to gain from fixing algorithmic or programming inefficiencies before specific properties of FPUs become relevant to application performance. Since even your 100 MHz Cortex-M4F is a pretty strong processor, 15 kS/s of throughput does sound like a pretty hefty DSP workload (about 6700 CPU cycles per sample!), and it might really make sense to ask a question on the DSP StackExchange sister site, describing the algorithm you're implementing and asking specifically how best to do it.

Marcus Müller
  • You should delete "-mfloat-abi=soft" part. It is not about the library, it is about the calling convention. – Ayhan Jul 04 '20 at 10:54
  • @Ayhan it's about both. If you're using soft, then gcc will produce software calls for every floating point operation (__aeabi_fadd and so on). – Marcus Müller Jul 04 '20 at 11:02
  • If you didn't force the library, it might choose the compatible one. That is about how the precompiled library was compiled beforehand. An additional version of the hard-FP library could be compiled with the soft-ABI option, making it more universal at the expense of some performance. My compiler environment has only one version, so when I try to compile while forcing the library, the linker complains like this: "libgcc.a uses VFP register arguments, main.elf does not". – Ayhan Jul 04 '20 at 11:39