Re: [Question] Interpreting the data from a small cache size measurement program - hsnuer1171 - PTT 批踢踢实业坊

Original poster: hsnuer1171 (阿赌) 2019-08-24 15:07:20

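※ Quoting johnjohnlin ():
: A comment above mentioned that the GPU black-box testing method in this paper
: can reveal the cache size, the number of ways, and so on (Figure 1)
: https://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf
: I read it quite a while ago and my memory of it was fuzzy,
: so I reread it and figured I would write a reply myself.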
Thanks for the reply. This paper was also very helpful.
My write-up:
https://tinyurl.com/yxrczcyv
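At first I thought the 4K step was the slowest only because of the Collision Miss
that J mentioned, but I ran some more experiments. When I reduced the array size
to 16K, the 4K step was still the slowest. In that case we only access 4 elements,
so in theory they should all fit in Set 0. Why is it still the slowest, then?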
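The arithmetic behind "only 4 elements, and they should fit in Set 0" can be sketched as follows. This is a minimal sketch, not the code from the experiment: it assumes a hypothetical L1 data cache with 64-byte lines, 64 sets and 8 ways (a common 32 KB layout, but not confirmed for the machine used here) and an array base aligned to 4096 bytes. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache geometry (hypothetical; check your own CPU). */
enum { LINE_BYTES = 64, NUM_SETS = 64, NUM_WAYS = 8 };

/* Number of elements touched when walking array_bytes in steps of step_bytes. */
size_t touched_count(size_t array_bytes, size_t step_bytes) {
    assert(step_bytes > 0);
    return (array_bytes + step_bytes - 1) / step_bytes;
}

/* Cache set a byte offset maps to, assuming the array base is 4096-byte aligned. */
size_t set_index(size_t byte_offset) {
    return (byte_offset / LINE_BYTES) % NUM_SETS;
}

/* Returns 1 if every touched offset maps to set 0 and the touched lines
 * do not exceed the number of ways, otherwise 0. */
int fits_in_set0(size_t array_bytes, size_t step_bytes) {
    assert(step_bytes > 0);
    for (size_t off = 0; off < array_bytes; off += step_bytes) {
        if (set_index(off) != 0) {
            return 0;
        }
    }
    return touched_count(array_bytes, step_bytes) <= NUM_WAYS;
}
```

Under these assumptions, touched_count(16 * 1024, 4096) gives 4 and fits_in_set0(16 * 1024, 4096) gives 1, so conflict misses alone should not explain the slowdown.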
After a lot of Googling I finally found the answer: it has to do with a mechanism
that Intel CPUs use to deal with data hazards.
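First, a quick review of what a Data Hazard is. In the example in the figure below,
add r1,r2,r3 is immediately followed by sub r4,r1,r3. This means the sub instruction
uses the value of r1 produced by the preceding add. Because of pipelining, however,
r1 has not been written back yet when sub enters the execute stage, so the result
would be wrong.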
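To solve this, CPUs have a Forwarding mechanism, shown as the red line in the figure.
The value computed by add is passed to sub right after the execute stage, so sub
no longer has to wait for the writeback.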
Image source:
https://webdocs.cs.ualberta.ca/~amaral/courses/429/webslides/Topic3-Pipelining/sld033.htm
Forwarding to avoid data hazard
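A toy model of the add/sub pair may make the hazard concrete. This is only a sketch under simplifying assumptions: it is not cycle-accurate, it models a pipeline that neither stalls nor interlocks, and the register values (including the stale 100 in r1) are made up for illustration.

```c
#include <stdbool.h>

/* Toy model of:  add r1,r2,r3 ; sub r4,r1,r3
 * It only captures that sub reaches execute before add has
 * written r1 back to the register file. Returns the final r4. */
int run_add_then_sub(bool forwarding) {
    /* r[1] = 100 is the stale value; r[2] = 7, r[3] = 3 (illustrative). */
    int r[5] = {0, 100, 7, 3, 0};

    /* add finishes execute: the result is 10, but it is not in r[1] yet. */
    int add_result = r[2] + r[3];

    /* sub enters execute: take r1 from the forwarding path if there is one,
     * otherwise read the stale value from the register file. */
    int r1_operand = forwarding ? add_result : r[1];
    r[4] = r1_operand - r[3];

    /* add's writeback only happens now, too late for sub. */
    r[1] = add_result;
    return r[4];
}
```

With forwarding, sub sees r1 = 10 and produces 7. Without it, sub reads the stale 100 and produces 97.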
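Textbook examples only cover forwarding between registers, but in practice x64
instructions can access memory directly, for example add BYTE PTR [rdi+rcx], 10.
In that case the CPU has to compute the actual address (rdi+rcx) before it knows
whether there is a hazard with an earlier instruction.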
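It so happens that on Intel CPUs the memory order buffer that tracks these addresses
only looks at the low 12 bits of the address at first, which covers exactly 4KB.
So when we access array[4096], the CPU initially assumes we are accessing array[0]
and tries to forward that value to the next add. Only once the address of array[4096]
has been fully resolved does the CPU realize the earlier forwarding was wrong. It then
has to load array[4096] again, which costs a 5-cycle delay.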
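The partial address comparison described above can be sketched as a pair of predicates. This is a minimal sketch with hypothetical function names: it only models the low 12 bit match and ignores access sizes and everything else the hardware actually checks.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOW12_MASK ((uintptr_t)0xFFF)

/* True if the two addresses agree in their low 12 bits. */
bool low12_match(uintptr_t a, uintptr_t b) {
    return ((a ^ b) & LOW12_MASK) == 0;
}

/* True if the partial comparison matches although the full addresses
 * differ, i.e. the mistaken-forwarding case described above. */
bool is_4k_false_alias(uintptr_t store_addr, uintptr_t load_addr) {
    return low12_match(store_addr, load_addr) && store_addr != load_addr;
}
```

For a byte array, the addresses of array[0] and array[4096] differ by exactly 4096, so is_4k_false_alias returns true for that pair, while array[0] and array[64] do not alias.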
So the 4K case keeps incurring this 5-cycle delay. That is still small compared
with the time of an L2 Cache fetch (CPU : L2 Cache = 1:14), which is why the time
only rises slightly.
Full text: https://software.intel.com/en-us/forums/intel-vtune-amplifier/topic/606846
When an earlier (in program order) load issued after a later (in program
order) store, a potential WAR (write-after-read) hazard exists. To detect
such hazards, the memory order buffer (MOB) compares the low-order 12 bits of
the load and store in every potential WAR hazard. If they match, the load is
reissued, penalizing performance. However, as only 12 bits are compared, a
WAR hazard may be detected falsely on loads and stores whose addresses are
separated by a multiple of 4096 (2^12). This metric estimates the performance
penalty of handling such falsely aliasing loads and stores.
This occurs when a load is issued after a store and their memory addresses
are offset by (4K). When this is processed in the pipeline, the issue of the
load will match the previous store (the full address is not used at this
point), so pipeline will try to forward the results of the store and avoid
doing the load (this is store forwarding). Later on when the address of the
load is fully resolved, it will not match the store, and so the load will
have to be re-issued from a later point in the pipe. This has a 5-cycle
penalty in the normal case, but could be worse in certain situations, like
with un-aligned loads that span 2 cache lines.
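As a side note on the last sentence of the quote, whether a load crosses a cache line boundary can be checked with a little arithmetic. This is a minimal sketch assuming 64-byte cache lines; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { CACHE_LINE_BYTES = 64 };  /* assumed line size */

/* True if an access of `size` bytes (size >= 1) starting at `addr`
 * touches more than one cache line. */
bool crosses_line_boundary(uintptr_t addr, size_t size) {
    uintptr_t first_line = addr / CACHE_LINE_BYTES;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE_BYTES;
    return first_line != last_line;
}
```

For example, an 8-byte load at offset 60 covers bytes 60 to 67 and crosses into the next line, while the same load at offset 56 stays within one line.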

Author: johnjohnlin (嗯?) 2019-08-24 16:34:00

I didn't expect a trap this subtle XD

Original poster: hsnuer1171 (阿赌) 2019-08-24 16:37:00

Seriously xD There is so much design packed in there

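Author: sarafciel (Cattuz) 2019-08-26 08:47:00

Upvote. Every time I read computer architecture stuff it feels just like playing a lateral thinking puzzle XD. It would be great if someday there were a tool that lets you step through the CPU's internal operation like a debugger... though I probably won't see it in my lifetime (dead eyes)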

Author: a58524andy (a58524andy) 2019-08-26 09:41:00

Maybe there will be one once RISCV takes off (? It seems unlikely that the proprietary ones will ever release something like that

Author: cc1plus (废柴联盟盟主) 2019-08-27 10:51:00

I have seen a tool for Cavium MIPS64 that lets you watch CPU cycles step by step, but ordinary people probably can't get hold of it

Author: Ryspon (Ry) 2019-08-29 03:18:00

Upvote for the experiments~ But in your article, with Step = 64 there should be a cache line miss on "every 1" access, right?

Original poster: hsnuer1171 (阿赌) 2019-08-29 12:28:00

Right!! Thanks, I'll fix that
