shit_happen
Senior Member
In the latest development, the author clamchowder of chipsandcheese has discovered that AMD quietly disabled Zen 4's loop buffer. This feature exists only on Zen 4. It was enabled as of AGESA 1.0.0.6 and disabled as of AGESA 1.2.0.2a. Perhaps AMD decided the loop buffer was not worth it, which would explain why Zen 5 does not have one. There is also a theory that AMD is saving it for Zen 6 or later. An anonymous source claims the loop buffer is a security risk because it is exposed to the frontend (game devs and so on).
Results:
- The author assumes the loop buffer's main purpose is saving power, so he tested power draw on two BIOS versions, one with the feature disabled and one with it still enabled. The outcome was as predicted: power consumption on the new BIOS is clearly better than on the old BIOS, whether or not the old BIOS was using the loop buffer. In other words, AMD moved all of the loop buffer's work to the rest of the core (op cache, decoder, L1/L2/L3 cache),
and that approach turns out to be more power-efficient.
- And with 3D V-Cache, a loop buffer with only 144 entries is useless next to the enormous V-Cache.
- But:
Surprisingly, in the test without 3D V-Cache, turning off the loop buffer lowers IPC from 1.02 to 0.89 (about 13%), alongside a 5% FPS difference in Cyberpunk 2077. So the overlord is forcing you guys to buy 3D V-Cache, haha.
Why did I mention the op cache was enabled? Surely I wouldn’t disable that big beautiful op cache for some other experiment. Right?
This should let me focus on core power, I think.
Testing with 4B NOPs. The core averages 11-12 IPC when fetching from the op cache or loop buffer thanks to instruction fusion. It averages 4 IPC when using the decoders.
The only place AMD ever documented the loop buffer
Advice from Intel’s software optimization guide, suggesting developers take advantage of Ice Lake’s LSD (loop stream detector, or loop buffer). Note similar limitations to AMD’s loop buffer, like no CALL/RET
AMD Disables Zen 4’s Loop Buffer
November 30, 2024 clamchowder
A loop buffer sits at a CPU’s frontend, where it holds a small number of previously fetched instructions. Small loops can be contained within the loop buffer, after which they can be executed with some frontend stages shut off. That saves power, and can improve performance by bypassing any limitations present in prior frontend stages. It’s an old but popular technique that has seen use by Intel, Arm, and AMD cores.
Cyberpunk 2077
Cyberpunk 2077 is a game where you can sneak and hold tab while looking at enemies. It features a built-in benchmark, letting me conveniently check on whether disabling the loop buffer might impact gaming performance.
Because I expect negligible performance differences with the loop buffer disabled, I ran the benchmark with an unusual setup to maximize consistency. I disabled Core Performance Boost on the Ryzen 9 7950X3D by setting bit 25 of the Hardware Configuration register (MSR 0xC0010015). That limits all cores to 4.2 GHz. I also capped my RX 6900 XT to 2 GHz. For benchmark settings, I’m using the medium preset at 1080P with no upscaling.
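The bit manipulation described above can be sketched in a few lines. This is only an illustration, not the author's tooling: the function names are made up, and reading the register through the Linux `msr` driver (`/dev/cpu/N/msr`) is an assumption about the environment. Only the register address and the bit position come from the text.

```python
import os
import struct

HWCR_MSR = 0xC0010015   # Hardware Configuration register (address from the article)
CPB_DISABLE_BIT = 25    # setting this bit disables Core Performance Boost


def with_cpb_disabled(hwcr_value: int) -> int:
    """Return the HWCR value with bit 25 set, leaving all other bits untouched."""
    return hwcr_value | (1 << CPB_DISABLE_BIT)


def read_msr(cpu: int, msr: int) -> int:
    """Read a 64-bit MSR via the Linux msr driver.

    Assumption: Linux with the msr module loaded and root privileges.
    """
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)
```

On a suitable system, `with_cpb_disabled(read_msr(0, HWCR_MSR))` would give the value to write back. Writing is left out here on purpose, since a wrong MSR write can hang the machine.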

Disabling the loop buffer basically doesn’t affect performance with the game pinned to the VCache die. Strangely, the game sees a 5% performance loss with the loop buffer disabled when pinned to the non-VCache die. I have no explanation for this, and I’ve re-run the benchmark half a dozen times.

Cyberpunk 2077 is unexpectedly friendly to the loop buffer, which covers about 22% of the instruction stream on average. Disabling the loop buffer causes the op cache to deliver 82% of micro-ops, up from 62% before.

There’s a lot of action in Cyberpunk 2077, but most of it doesn’t happen at the CPU’s frontend.

Disabling the loop buffer of course doesn’t change that.

But because the loop buffer covers a significant minority of the instruction stream, turning it off does mean the op cache works harder.

Again, it’s not a big difference. With an average IPC of 0.89 with the loop buffer disabled, or 1.02 with the loop buffer enabled, Cyberpunk 2077 is not a high IPC workload. That means frontend bandwidth isn’t a big consideration. Perhaps the game is more backend bound, or bound by branch predictor delays.
Still, the Cyberpunk 2077 data bothers me. Performance counters also indicate higher average IPC with the loop buffer enabled when the game is running on the VCache die. Specifically, it averages 1.25 IPC with the loop buffer on, and 1.07 IPC with the loop buffer disabled. And, there is a tiny performance dip on the new BIOS. Perhaps I’m pushing closer to a GPU-side bottleneck at 155 FPS. But I’ve already spent enough free time on what I thought would be a quick article. Perhaps some more mainstream tech outlets will figure out AMD disabled the loop buffer at some point, and do testing that I personally lack the time and resources to carry out.
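To put the IPC figures quoted above on a common scale, here is a small sketch that turns them into relative changes. The helper name is hypothetical; the four IPC values are the ones given in the text.

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change going from `before` to `after`."""
    return (after - before) / before


# IPC with the loop buffer enabled -> disabled, as quoted in the article
non_vcache = relative_change(1.02, 0.89)
vcache = relative_change(1.25, 1.07)

print(f"non-VCache die: {non_vcache:+.1%}")  # -12.7%
print(f"VCache die:     {vcache:+.1%}")      # -14.4%
```

Both dies show an IPC drop in the low teens of percent, which is part of why the 5% FPS difference on only one die is hard to explain.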
Attempt at Checking Power Draw
I also tried to look at Zen 4’s core power counters to see whether running from the loop buffer improved power efficiency. To do this, I had to modify my instruction bandwidth benchmark to not use calls or returns in the test section. Apparently, calls or returns cause Zen 4 to not use the loop buffer.
I also pinned the test to one core and read the Core Energy Status MSR before and after jumping to my test array, letting me calculate average power draw over the test duration. For consistency, I disabled Core Performance Boost because power readings would vary wildly with boost active.
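The before/after measurement described above boils down to a counter difference divided by elapsed time. The sketch below shows that arithmetic under stated assumptions: the counter is treated as 32 bits wide, and the energy unit is taken as 2^-16 joules per count. Both are assumptions for illustration. On real hardware the unit should be read from the processor's power unit register rather than hard-coded.

```python
COUNTER_BITS = 32  # assumed width of the energy counter


def average_power_watts(raw_before: int, raw_after: int, seconds: float,
                        energy_unit_bits: int = 16) -> float:
    """Average power between two raw energy counter readings.

    The modulo handles a single wraparound of the counter during the test.
    energy_unit_bits=16 means one count is 2**-16 J (assumed, not from the article).
    """
    delta = (raw_after - raw_before) % (1 << COUNTER_BITS)
    joules = delta / (1 << energy_unit_bits)
    return joules / seconds
```

For example, a delta of `6 * 65536` counts over one second works out to 6 W under these assumptions. A test that runs long enough for the counter to wrap more than once would need a wider accumulator.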

Results make no sense. On the old BIOS, the Core Energy Status MSR tells me the core averaged 6W of power draw when fetching NOPs from the op cache, and much lower power when doing the same from the loop buffer. Next, I increased the test array size until performance counters showed op cache coverage dropping to under 1%. By that time, the array size (128 KB) had gone well into L2 capacity. But even though exercising the decoders and L2 fetch path should increase power draw, the Core Energy Status MSR showed just 1.5W of average core power.
Updating to the new BIOS gave 1.68W of average core power when testing the op cache, and nearly the same power when feeding the decoders mostly from L2. That means the core is achieving better efficiency when running code from the op cache, and makes sense. Of course, I can’t test the loop buffer on the new BIOS because it’s disabled.
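The efficiency claim above follows from dividing power by instruction throughput. Here is a rough sketch under several assumptions: the 4.2 GHz clock is the capped speed mentioned earlier, 11.5 IPC is the midpoint of the 11-12 IPC figure quoted for the op cache, 4 IPC is the decoder figure, and "nearly the same power" is approximated as exactly 1.68 W for both cases. The function name is hypothetical.

```python
def energy_per_instruction_pj(power_watts: float, ipc: float, clock_hz: float) -> float:
    """Energy per retired instruction in picojoules: P / (IPC * f)."""
    return power_watts / (ipc * clock_hz) * 1e12


CLOCK_HZ = 4.2e9  # assumed: cores capped at 4.2 GHz with boost disabled

op_cache = energy_per_instruction_pj(1.68, 11.5, CLOCK_HZ)
decoders = energy_per_instruction_pj(1.68, 4.0, CLOCK_HZ)

print(f"op cache: {op_cache:.1f} pJ per NOP")
print(f"decoders: {decoders:.1f} pJ per NOP")
```

With these inputs the op cache path comes out at roughly 35 pJ per NOP against about 100 pJ for the decoder path. If the reported power is modeled rather than measured, the absolute numbers mean little, but the ratio still illustrates the reasoning.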
To make things even more confusing, AMD’s power monitoring facilities may be modeling power draw instead of measuring it [1]. There’s a distinct possibility AMD modeled the power draw wrong, or changed the power modeling methodology between the two BIOS versions. I don’t have power measuring hardware to follow up on this. I feel like I don’t understand the power draw situation any more than when I started, and a few hours have gone to waste.
Final Words
I don’t know why AMD disabled Zen 4’s loop buffer. Sometimes CPU features get disabled because there’s a hardware bug. Intel’s Skylake saw its loop buffer (LSD) disabled due to a bug related to partial register access in short loops with both SMT threads active. Zen 4 is AMD’s first attempt at putting a loop buffer into a high performance CPU. Validation is always difficult, especially when implementing a feature for the first time. It’s not crazy to imagine that AMD internally discovered a bug that no one else hit, and decided to turn off the loop buffer out of an abundance of caution. I can’t think of any other reason AMD would mess with Zen 4’s frontend this far into the core’s lifecycle.
Turning off the loop buffer should have little to no impact on performance because the op cache has more than enough bandwidth to feed the subsequent rename/allocate stage. Impact on power consumption is an unknown factor, but I suspect it’s also minor, and may be very difficult to evaluate even when using expensive hardware to measure CPU power draw at the 12V EPS connector.

AMD’s move to disable Zen 4’s loop buffer is interesting, but should go largely unnoticed. AMD never advertised or documented the feature beyond dropping a line in the Processor Programming Reference. It’s a clear contrast to Intel, which often documents its loop buffer and encourages developers to optimize their code to take advantage of it.

Combine that with what looks like minimal impact on performance, and I doubt anyone will ever know that AMD turned the loop buffer off. It was a limited feature in the first place, with low capacity and restrictions like no function calls that prevent it from being as useful as an op cache.
Perhaps the best way of looking at Zen 4’s loop buffer is that it signals the company has engineering bandwidth to go try things. Maybe it didn’t go anywhere this time. But letting engineers experiment with a low risk, low impact feature is a great way to build confidence. I look forward to seeing more of that confidence in the future.
References
- Robert Schöne et al, Energy Efficiency Aspects of the AMD Zen 2 Architecture
Appendix
Since AMD never offered optimization advice related to the loop buffer, I’ll do it. On Zen 4 running an old BIOS version, consider sizing loops to have less than 144 micro-ops, or half that if threads share a physical core. Consider inlining a function called within a small loop to avoid CALL/RET instructions. Do this, and your reward will most likely be absolutely nothing. Have fun.
Restrictions like no CALL/RET could indicate Zen 4 shuts off certain parts of the branch predictor in addition to the op cache and decoder. That could add to power savings.
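The sizing advice in this appendix can be written as a simple check. This is a hypothetical helper that only encodes the rules stated here (144 entries, half of that when two threads share a core, no CALL/RET). Real hardware may impose further restrictions that AMD never documented.

```python
LOOP_BUFFER_ENTRIES = 144  # micro-op capacity quoted in the article


def may_fit_loop_buffer(loop_uops: int, smt_sibling_active: bool = False,
                        has_call_or_ret: bool = False) -> bool:
    """Rough check of whether a loop could run from Zen 4's loop buffer."""
    if has_call_or_ret:
        return False  # calls and returns keep the loop out of the buffer
    capacity = LOOP_BUFFER_ENTRIES // 2 if smt_sibling_active else LOOP_BUFFER_ENTRIES
    return loop_uops < capacity  # "less than" the capacity, per the advice above
```

A 100 micro-op loop would pass on a core running one thread and fail once the SMT sibling is active, since the budget drops to 72.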