|

How sched_min_granularity_ns, sched_latency_ns and sched_wakeup_granularity_ns in CFS affect the timeslice of processes

Abstract
Currently, the most famous process scheduling algorithm in Linux Kernel is Completely Fair Scheduling (CFS) algorithm. The core idea of CFS is to let each process share the same proportional CPU resources to run so that it is fair to each process. In this article, I will introduce how sched_min_granularity_ns and sched_latency_ns work internal CFS to affect the timeslice of processes. (This article discusses Linux Kernel 3.16.x; other versions may have some differences.)

Details

616 static u64 __sched_period(unsigned long nr_running)
617 {
618         u64 period = sysctl_sched_latency;                   /* get from "/proc/sys/kernel/sched_latency_ns" */
619         unsigned long nr_latency = sched_nr_latency;         /* get from "sched_latecy_ns divides sched_min_granularity" */
620 
621         if (unlikely(nr_running > nr_latency)) {          /* if the real processes on the run queue is bigger than nr_latency, that means sched_latency_ns cannot satisfy all the tasks's minimum granularity time requirements. */
622                 period = sysctl_sched_min_granularity;       /* get from "/proc/sys/kernel/sched_min_granularity_ns" */
623                 period *= nr_running;
624         }
625 
626         return period;
627 }
635 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
636 {
637         u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
638 
639         for_each_sched_entity(se) {
640                 struct load_weight *load;
641                 struct load_weight lw;
642 
643                 cfs_rq = cfs_rq_of(se);
644                 load = &cfs_rq->load;
645 
646                 if (unlikely(!se->on_rq)) {
647                         lw = cfs_rq->load;
648 
649                         update_load_add(&lw, se->load.weight);
650                         load = &lw;
651                 }
652                 slice = __calc_delta(slice, se->load.weight, load);
653         }
654         return slice;
655 }
2889 /*
2890  * Preempt the current task with a newly woken task if needed:
2891  */
2892 static void
2893 check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
2894 {
2895         unsigned long ideal_runtime, delta_exec;
2896         struct sched_entity *se;
2897         s64 delta;
2898 
2899         ideal_runtime = sched_slice(cfs_rq, curr);
2900         delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
2901         if (delta_exec > ideal_runtime) {
2902                 resched_task(rq_of(cfs_rq)->curr);
2903                 /*
2904                  * The current task ran long enough, ensure it doesn't get
2905                  * re-elected due to buddy favours.
2906                  */
2907                 clear_buddies(cfs_rq, curr);
2908                 return;
2909         }
2910 
2911         /*
2912          * Ensure that a task that missed wakeup preemption by a
2913          * narrow margin doesn't have to wait for a full slice.
2914          * This also mitigates buddy induced latencies under load.
2915          */
2916         if (delta_exec vruntime - se->vruntime;
2921 
2922         if (delta  ideal_runtime)
2926                 resched_task(rq_of(cfs_rq)->curr);
2927 }

In above source codes, check_preempt_tick function calls sched_slice function and sched_slice function calls __sched_slice function. Let’s analyse them from __sched_slice. In __sched_slice, it gets the time period which can run each task on CPU’s run queue once. In sched_slice, first get the time period which can run each task on the CPU’s run queue once and then, calculate this schedule entity (process)’s ideal time slice to run according to time period, the weight of the schedule entity (process) and the weight of the run queue. For example, assume the time period is 14 ms. If the run queue has two processes with default nice (or priority) value so they have the same weight value (Note that the weight value is calculated by nice value, the lower nice value the bigger weight value). The weight value of the run queue will be 1024 (the weight value of each process for this example) + 1024 (default nice is 0 and its weight value is 1024 correspondingly) so the ideal time slice will be the time period multiply 50% (14 ms * 0.5 = 7 ms).

In check_preempt_tick, it will first get the ideal time slice for current process and then calculate the delta_exec time which is the time that current process has exhausted during this timeslice. Then, current process’s reschedule flag will be set if delta_exec is bigger than ideal_time or sysctl_sched_min_granularity. If not, get the difference, delta, between current process’s vruntime (virtual run time, which is the time the process has been executed on CPU) and the process with minimum vruntime in the runqueue (red black tree); if delta is bigger than ideal_runtime, the current process’s reschedule flag will be set.

Check_preempt_tick is motivated by timer interrupt as follows.

tick_handle_periodic
 -tick_periodic
  --update_process_times
   ---scheduler_tick
    ----task_tick_fair
     -----entity_tick
      ------check_preempt_tick

sched_wakeup_granularity_ns will affect inter-processes preemption. In the following code snippets, you will see only when the difference between the virtual run time of current running process and the virtual run time of preempting process is bigger than the virtual run time of sched_wakeup_granularity_ns (here, transfer sched_wakeup_granularity_ns to a virtual run time with preempting process’s weight).

4604 /*
4605  * Should 'se' preempt 'curr'.
4606  *
4607  *             |s1
4608  *        |s2
4609  *   |s3
4610  *         g
4611  *      ||c
4612  *
4613  *  w(c, s1) = -1
4614  *  w(c, s2) =  0
4615  *  w(c, s3) =  1
4616  *
4617  */
4618 static int
4619 wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
4620 {
4621         s64 gran, vdiff = curr->vruntime - se->vruntime;
4622 
4623         if (vdiff  gran)
4628                 return 1;
4629 
4630         return 0;
4631 }
4657 /*
4658  * Preempt the current task with a newly woken task if needed:
4659  */
4660 static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
4661 {
...
4713         if (wakeup_preempt_entity(se, pse) == 1) {
4714                 /*
4715                  * Bias pick_next to pick the sched entity that is
4716                  * triggering this preemption.
4717                  */
4718                 if (!next_buddy_marked)
4719                         set_next_buddy(pse);
4720                 goto preempt;
4721         }
4722 
4723         return;
4724 
4725 preempt:
4726         resched_task(curr);
...
4741 }

There is a simple example for sched_wakeup_granularity_ns as follows. When I set sched_wakeup_granularity_ns to be 0, processes will be preempted each other very frequently and it shows that it is better for I/O intensive workload. I just dump the Kernel stack as follows when there is a preemption.

236207 Dec 10 19:15:25 mobius04 kernel: [ 3976.315960] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~.
236208 Dec 10 19:15:25 mobius04 kernel: [ 3976.315961] check_preempt_wakeup: current time stamp (microseconds) is 1481415325312148
236209 Dec 10 19:15:25 mobius04 kernel: [ 3976.315962] check_preempt_wakeup: thread 5688 will preempt current thread 5719
236210 Dec 10 19:15:25 mobius04 kernel: [ 3976.315962]
236211 Dec 10 19:15:25 mobius04 kernel: [ 3976.315963] CPU: 5 PID: 5684 Comm: qemu-system-x86 Not tainted 3.16.39 #21
236212 Dec 10 19:15:25 mobius04 kernel: [ 3976.315964] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.2.6 06/08/2015
236213 Dec 10 19:15:25 mobius04 kernel: [ 3976.315965]  0000000000000000 ffff8813bc74bbe8 ffffffff8176b2f0 ffff8813bd0d9400
236214 Dec 10 19:15:25 mobius04 kernel: [ 3976.315966]  ffff8813bc78aa00 ffff8813bc74bc48 ffffffff810b2054 ffff8813bc74bc18
236215 Dec 10 19:15:25 mobius04 kernel: [ 3976.315968]  00000000000ae45d 00000000584c9a9d 000000000004c354 ffff8814015b8a40
236216 Dec 10 19:15:25 mobius04 kernel: [ 3976.315970] Call Trace:
236217 Dec 10 19:15:25 mobius04 kernel: [ 3976.315971]  [] dump_stack+0x64/0x84
236218 Dec 10 19:15:25 mobius04 kernel: [ 3976.315973]  [] check_preempt_wakeup+0x284/0x2b0
236219 Dec 10 19:15:25 mobius04 kernel: [ 3976.315980]  [] check_preempt_curr+0x84/0xa0
236220 Dec 10 19:15:25 mobius04 kernel: [ 3976.315987]  [] ttwu_do_wakeup+0x1d/0xf0
236221 Dec 10 19:15:25 mobius04 kernel: [ 3976.315994]  [] T.2622+0x4a/0x60
236222 Dec 10 19:15:25 mobius04 kernel: [ 3976.316001]  [] try_to_wake_up+0x299/0x370
236223 Dec 10 19:15:25 mobius04 kernel: [ 3976.316008]  [] ? __pollwait+0xf0/0xf0
236224 Dec 10 19:15:25 mobius04 kernel: [ 3976.316015]  [] wake_up_state+0x10/0x20
236225 Dec 10 19:15:25 mobius04 kernel: [ 3976.316022]  [] wake_futex+0x76/0x90
236226 Dec 10 19:15:25 mobius04 kernel: [ 3976.316029]  [] futex_wake+0x12f/0x160
236227 Dec 10 19:15:25 mobius04 kernel: [ 3976.316035]  [] do_futex+0x119/0xd20
236228 Dec 10 19:15:25 mobius04 kernel: [ 3976.316043]  [] ? __dequeue_entity+0x30/0x50
236229 Dec 10 19:15:25 mobius04 kernel: [ 3976.316051]  [] ? eventfd_ctx_read+0x6e/0x2e0
236230 Dec 10 19:15:25 mobius04 kernel: [ 3976.316057]  [] SyS_futex+0x7d/0x170
236231 Dec 10 19:15:25 mobius04 kernel: [ 3976.316064]  [] ? SyS_read+0x9c/0xd0
236232 Dec 10 19:15:25 mobius04 kernel: [ 3976.316067]  [] system_call_fastpath+0x1a/0x1f
236233 Dec 10 19:15:25 mobius04 kernel: [ 3976.316068] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~.

Conclusion
1, If Linux Kernel uses CFS, each process will run at least “sched_min_granularity_ns” timeslice.
2, If “sched_latency_ns / sched_min_granularity_ns” is smaller than the real number of processes on CPU’s run queue, the time period (run each process on the run queue once) will be “the real process number on the run queue multiply sched_min_granularity_ns” rather than sched_latency_ns.
3, If the difference between the virtual run time of current running process and the virtual run time of preempting process is bigger than the virtual run time of sched_wakeup_granularity_ns (here, transfer sched_wakeup_granularity_ns to a virtual run time with preempting process’s weight), the preemption happens.

Editor’s note: This article has been updated to reflect current software versions as of 2026. Commands and package names have been revised accordingly.

Similar Posts

  • Chinese Charactor Configuration on Fedora 11

    最新的更新版本请看: Fedora 中文字体设置. 使用Linux时我个人倾向使用英文环境系统,而Fedora11在英文环境下中文字体有时会不太好看,经常遇到需要字体优化美化的问题。 以下是我的配置方案,经测试效果还算不错,解决了Fedora 11 中文字体难看的问题: 方案1:使用uming和ukai字体,即AR PL UMing CN等。 关键是使用的字体包如下: 首先要安装这两个字体: cjkuni-ukai-fonts cjkuni-uming-fonts 然后配置一下~/.fonts.conf文件. 使sans-serif serif monospace字体中文使用uming/ukai即可. 我的.fonts.conf文件可以从这里下载(两种选择, 我喜欢前者): https://github.com/zma/config_files 使用Liberation和uming/ukai字体: .fonts.cofn.liberation 使用dejavu和uming/ukai字体: .fonts.conf.dejavu 下载后放到自己的$HOME下改名为.fonts.conf就可以了。 使用uming字体效果如下(请放大后看效果): 方案2:安装文泉驿字体,这个非常简单,安装相应包即可了。 如果喜欢其它的字体选择性的安装上就可以了,只要注意只安装自己需要的就行了。有人使用微软雅黑字体,首先这是侵权的,其次开源的字体做得其实已经很不错了。 最后将字体平滑选项打开, KDE和gnome都有相关设置方法。 以上内容只是针对使用xft字体系统的设置。对于使用核心字体系统的X程序来说字体依然会出现很丑的情况。 下面是针对emacs的设置方法: 首先需要安装这个字体包: xorg-x11-fonts-misc 注意到在中文系统下emacs的中文显示非常好,而在英文环境中去非常差,我们可以利用这一点,在运行emacs前首先将系统环境设为中文即可。 在~/bin/下建立一文件ema 内容如下: #!/bin/bash rm -f ~/.emacs ln -s ~/.emacs.x ~/.emacs LANG=zh_CN.UTF-8 emacs –fullheight -r $* 然后加入执行权限即可: chmod +x…

  • |

    Understanding TPM 2.0 and Platform Configuration Registers (PCRs)

    The Trusted Platform Module (TPM) 2.0 is a hardware-based security feature designed to enhance the integrity and security of computing systems. It is an international standard defined by the Trusted Computing Group (TCG) and is implemented as a dedicated chip or firmware module on a computer’s motherboard. The TPM provides cryptographic functions, secure storage, and…

  • git: how to merge upstream with fork repo

    How to merge upstream into local fork repo with git? Merging an upstream repository into your fork: https://help.github.com/articles/merging-an-upstream-repository-into-your-fork/ Merging an upstream repository into your fork: https://help.github.com/articles/merging-an-upstream-repository-into-your-fork/ Read more: Git: setting a local branch’s upstream tracking branch How to merge a commit from another branch to my current branch in git? How to merge git branches…

Leave a Reply

Your email address will not be published. Required fields are marked *