[CUDA] Setting the sync wait mode with cudaSetDeviceFlags

2024/11/8 17:42:45 Tags: cuda, sync logic

Contents

1. Setting the wait mode of CUDA synchronize
2. Setting function
3. Implementing stream sync wait logic with streamQuery
Reference

1. Setting the wait mode of CUDA synchronize

Reference material: https://docs.nvidia.com/cuda/pdf/CUDA_Runtime_API.pdf

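A CUDA synchronize call can wait in one of three modes: yield, busy waiting (spin), and blocking.

- busy waiting (spin): the thread occupies the CPU the whole time; this is a polling wait.
- yield: the thread gives up its time slice whenever the work is not yet finished, which can cause many switches in and out.
- blocking: the thread blocks and releases the CPU. When the GPU work on the stream finishes, the blocked CPU thread/process is woken up and resumes. This is a passive wake-up, so the blocked thread may resume late, which leaves idle gaps and increases the thread's overall execution time.

In the first two modes the CPU main thread reacts promptly once the GPU work completes and carries on. The third mode introduces a blocking gap. If the main thread is a real-time thread such as a FIFO-scheduled one, with high priority and the ability to preempt the CPU, and CPU resources are plentiful, the blocked thread resumes fairly quickly, but delays cannot be ruled out.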
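
The difference between the three modes can be mimicked in plain C++, without a GPU. The following is only an analogy under stated assumptions: the hypothetical Completion struct stands in for "the GPU work on the stream has finished", and a worker thread plays the GPU. It does not show how the CUDA runtime implements its waits internally.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for "GPU work finished": a flag set by a worker thread.
struct Completion {
    std::atomic<bool> done{false};
    std::mutex m;
    std::condition_variable cv;

    void signal() {
        {
            std::lock_guard<std::mutex> lock(m);
            done.store(true, std::memory_order_release);
        }
        cv.notify_all();  // wakes a thread sleeping in wait_blocking
    }
};

// Spin: poll without pause; the core stays busy the whole time.
inline void wait_spin(Completion& c) {
    while (!c.done.load(std::memory_order_acquire)) {
    }
}

// Yield: poll, but offer the time slice to other threads between checks.
inline void wait_yield(Completion& c) {
    while (!c.done.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
}

// Blocking: sleep on a synchronization primitive until someone wakes us up.
inline void wait_blocking(Completion& c) {
    std::unique_lock<std::mutex> lock(c.m);
    c.cv.wait(lock, [&] { return c.done.load(std::memory_order_acquire); });
}

// Runs one wait strategy against a worker that signals after `work`.
// Returns true if the flag was set by the time the wait returned.
template <typename WaitFn>
bool run_wait(WaitFn wait, std::chrono::milliseconds work) {
    Completion c;
    std::thread worker([&] {
        std::this_thread::sleep_for(work);
        c.signal();
    });
    wait(c);
    worker.join();
    return c.done.load();
}
```

All three functions return only after the flag is set. What differs is what the waiting core does in the meantime, and how quickly the waiter notices the flag: the blocking variant depends on the wake-up being delivered and scheduled.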

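After switching to blocking mode, Nsight shows the following:
- GPU context switches become more frequent, presumably because of the blocking.
- The blocked thread resumes late, which leaves some idle GPU time, as marked by the red box in the figure below.

You can choose whether the CPU is released or held while a CUDA stream synchronize is waiting. According to the official documentation, the default yields the time slice when the number of active CUDA contexts (roughly, GPUs in use) exceeds the number of logical CPUs, because the CPU is then the scarce resource. Usually there are more CPU cores than GPUs, so the default is to spin on the processor. Spin is a form of polling wait.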

2. Setting function

official doc

When cudaDeviceScheduleBlockingSync is set through this function, cudaDeviceMapHost may end up being set at the same time.

__host__ cudaError_t cudaSetDeviceFlags(unsigned int flags);
// flags:
- cudaDeviceScheduleAuto: chooses between cudaDeviceScheduleSpin and cudaDeviceScheduleYield based on the number of active CUDA contexts and logical processors
- cudaDeviceScheduleSpin: polling (spin) wait
- cudaDeviceScheduleYield: wait by yielding the time slice
- cudaDeviceScheduleBlockingSync: blocking wait
- cudaDeviceBlockingSync: deprecated
- cudaDeviceMapHost:
- cudaDeviceLmemResizeToMax: deprecated
- cudaDeviceSyncMemops:

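- cudaDeviceScheduleAuto: "If C > P, then CUDA will yield to other OS threads when waiting for the device, otherwise CUDA will not yield while waiting for results and actively spin on the processor." The set of contexts may change while the program is running, so C > P can become true at some point, and yield may then be used unpredictably.
- cudaDeviceScheduleBlockingSync: "Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work." Blocking sync also differs from spin: a blocked thread enters the blocked state, releases the CPU, and has to be woken up, whereas a spinning thread holds on to the CPU. Both make the program wait, but they use the hardware differently.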
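
A minimal sketch of how the flag might be set and read back is shown here. It assumes a single GPU, that cudaSetDeviceFlags is the first runtime call in the process, and that an asynchronous memset is an acceptable stand-in for real GPU work. The ok helper is hypothetical and exists only for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: report a failed runtime call.
static bool ok(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    // Request blocking waits for the current device. Assumption of this sketch:
    // this runs before any other runtime call touches the device.
    if (!ok(cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync), "cudaSetDeviceFlags")) return 1;

    // Read the flags back. Other bits (for example cudaDeviceMapHost) may be set
    // too, so isolate the scheduling bits before comparing.
    unsigned int flags = 0;
    if (!ok(cudaGetDeviceFlags(&flags), "cudaGetDeviceFlags")) return 1;
    bool blocking = (flags & cudaDeviceScheduleMask) == cudaDeviceScheduleBlockingSync;
    std::printf("blocking sync requested: %s\n", blocking ? "yes" : "no");

    // Queue some asynchronous work on a stream.
    cudaStream_t stream;
    void* buf = nullptr;
    const size_t bytes = 64u << 20;  // 64 MiB, arbitrary size for the sketch
    if (!ok(cudaStreamCreate(&stream), "cudaStreamCreate")) return 1;
    if (!ok(cudaMalloc(&buf, bytes), "cudaMalloc")) return 1;
    if (!ok(cudaMemsetAsync(buf, 0, bytes, stream), "cudaMemsetAsync")) return 1;

    // The wait policy chosen above applies to this call.
    if (!ok(cudaStreamSynchronize(stream), "cudaStreamSynchronize")) return 1;

    cudaFree(buf);
    cudaStreamDestroy(stream);
    return 0;
}
```

Swapping the flag for cudaDeviceScheduleSpin or cudaDeviceScheduleYield and comparing CPU usage and timelines in Nsight is one way to observe the trade-offs described above.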

3. Implementing stream sync wait logic with streamQuery

You can also write the wait logic yourself, for example with std::this_thread::yield or busy waiting, built on cudaStreamQuery:
"In my experience, you can’t make the CPU activity level lower, if the CPU has nothing else to do, and it is spinning at a CUDA sync point. If you really want to do something like that, my suggestion would be that instead of doing a CUDA device or stream sync, put your GPU work into a stream, and then in a loop you do cudaStreamQuery alternating with an OS command to put the thread to sleep. You decide what level of responsiveness you want/need based on how long you put the CPU thread to sleep."
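
The query-then-sleep loop from the quote could look roughly like the sketch below. To keep it compilable without a GPU, the completion check is passed in as a callable; wait_by_polling and its parameters are hypothetical names, not part of the CUDA API. The comment shows how the callable might wrap cudaStreamQuery, which returns cudaErrorNotReady while work is still pending.

```cpp
#include <chrono>
#include <thread>

// Polls `query` until it reports completion, sleeping `nap` between polls.
// `query` is a hypothetical callable that returns true once the work is done.
// With CUDA it could wrap cudaStreamQuery, for example:
//   [&] { return cudaStreamQuery(stream) != cudaErrorNotReady; }
// (the caller should then check the final status for real errors).
// Returns the number of polls that were needed.
template <typename Query>
int wait_by_polling(Query query, std::chrono::microseconds nap) {
    int polls = 1;
    while (!query()) {
        // Hand the core back to the OS. A longer nap lowers CPU usage
        // but adds latency before completion is noticed.
        std::this_thread::sleep_for(nap);
        ++polls;
    }
    return polls;
}
```

The nap length is the responsiveness knob the quote refers to. Replacing sleep_for with std::this_thread::yield gives a yield-style wait, and removing the pause altogether gives a busy wait.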

Reference

Wikipedia: "In computer science and software engineering, busy-waiting, busy-looping or spinning is a technique in which a process repeatedly checks to see if a condition is true, such as whether keyboard input or a lock is available."
https://en.wikipedia.org/wiki/Busy_waiting