How sched_min_granularity_ns, sched_latency_ns and sched_wakeup_granularity_ns in CFS affect the timeslice of processes
Abstract
The best-known process scheduling algorithm in the Linux kernel today is the Completely Fair Scheduler (CFS). The core idea of CFS is to give each process a proportional share of CPU time so that scheduling is fair to every process. In this article, I will explain how sched_min_granularity_ns, sched_latency_ns and sched_wakeup_granularity_ns work inside CFS to affect the timeslice of processes. (This article discusses Linux kernel 3.16.x; other versions may differ.)
Details
static u64 __sched_period(unsigned long nr_running)
{
    u64 period = sysctl_sched_latency; /* from "/proc/sys/kernel/sched_latency_ns" */
    unsigned long nr_latency = sched_nr_latency; /* sched_latency_ns divided by sched_min_granularity_ns */

    if (unlikely(nr_running > nr_latency)) { /* more runnable processes than nr_latency: sched_latency_ns cannot satisfy every task's minimum granularity requirement */
        period = sysctl_sched_min_granularity; /* from "/proc/sys/kernel/sched_min_granularity_ns" */
        period *= nr_running;
    }

    return period;
}
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);

    for_each_sched_entity(se) {
        struct load_weight *load;
        struct load_weight lw;

        cfs_rq = cfs_rq_of(se);
        load = &cfs_rq->load;

        if (unlikely(!se->on_rq)) {
            lw = cfs_rq->load;

            update_load_add(&lw, se->load.weight);
            load = &lw;
        }
        /* slice = slice * se->load.weight / load->weight */
        slice = __calc_delta(slice, se->load.weight, load);
    }
    return slice;
}
/*
 * Preempt the current task with a newly woken task if needed:
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime, delta_exec;
    struct sched_entity *se;
    s64 delta;

    ideal_runtime = sched_slice(cfs_rq, curr);
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
    if (delta_exec > ideal_runtime) {
        resched_task(rq_of(cfs_rq)->curr);
        /*
         * The current task ran long enough, ensure it doesn't get
         * re-elected due to buddy favours.
         */
        clear_buddies(cfs_rq, curr);
        return;
    }

    /*
     * Ensure that a task that missed wakeup preemption by a
     * narrow margin doesn't have to wait for a full slice.
     * This also mitigates buddy induced latencies under load.
     */
    if (delta_exec < sysctl_sched_min_granularity)
        return;

    se = __pick_first_entity(cfs_rq);
    delta = curr->vruntime - se->vruntime;

    if (delta < 0)
        return;

    if (delta > ideal_runtime)
        resched_task(rq_of(cfs_rq)->curr);
}
In the source code above, check_preempt_tick calls sched_slice, and sched_slice calls __sched_period. Let's analyse them starting from __sched_period. __sched_period returns the period in which every task on the CPU's run queue can run once. sched_slice first gets this period and then calculates the schedule entity's (process's) ideal timeslice from the period, the weight of the schedule entity and the weight of the run queue. For example, assume the period is 14 ms and the run queue holds two processes with the default nice (priority) value, so they have the same weight (note that the weight is derived from the nice value: the lower the nice value, the bigger the weight). The weight of the run queue is then 1024 + 1024 (the default nice value 0 corresponds to a weight of 1024), so the ideal timeslice of each process is the period multiplied by 50% (14 ms * 0.5 = 7 ms).
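The arithmetic can be reproduced in userspace. Below is a minimal sketch of the period and slice computation described above; the latency and minimum-granularity values are assumptions picked for illustration, not the defaults of any particular kernel.

#include <stdio.h>

typedef unsigned long long u64;

/* Mirrors __sched_period: the period stretches once the run queue holds
 * more tasks than sched_latency_ns / sched_min_granularity_ns. */
static u64 sched_period(unsigned long nr_running, u64 latency_ns, u64 min_gran_ns)
{
    unsigned long nr_latency = latency_ns / min_gran_ns;

    if (nr_running > nr_latency)
        return (u64)nr_running * min_gran_ns;
    return latency_ns;
}

int main(void)
{
    u64 latency = 14000000ULL;  /* assumed sched_latency_ns: 14 ms */
    u64 min_gran = 2000000ULL;  /* assumed sched_min_granularity_ns: 2 ms */

    /* Two nice-0 tasks, weight 1024 each: each gets half the period. */
    u64 period = sched_period(2, latency, min_gran);
    u64 slice = period * 1024 / (1024 + 1024);
    printf("2 tasks:  period = %llu ns, slice = %llu ns\n", period, slice);

    /* Ten nice-0 tasks: 10 > nr_latency (7), so the period stretches to
     * 10 * 2 ms = 20 ms and each slice is the 2 ms minimum. */
    period = sched_period(10, latency, min_gran);
    slice = period * 1024 / (10 * 1024);
    printf("10 tasks: period = %llu ns, slice = %llu ns\n", period, slice);
    return 0;
}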
check_preempt_tick first gets the ideal timeslice of the current process and then calculates delta_exec, the CPU time the current process has consumed during this timeslice. If delta_exec is bigger than ideal_runtime, the current process's reschedule flag is set. Otherwise, if delta_exec is still smaller than sysctl_sched_min_granularity, the function returns so that the process keeps at least its minimum slice. Otherwise, it computes delta, the difference between the current process's vruntime (virtual run time, i.e. the process's accumulated execution time weighted by its load) and the minimum vruntime in the run queue (a red-black tree); if delta is bigger than ideal_runtime, the current process's reschedule flag is set as well.
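To make the branching explicit, here is a small userspace sketch of the same decision logic; should_resched and all the numbers are hypothetical, used only to trace the three cases.

#include <stdio.h>

typedef long long s64;
typedef unsigned long long u64;

/* Returns 1 when the tick should mark the current task for rescheduling,
 * following the order of checks in check_preempt_tick. */
static int should_resched(u64 delta_exec, u64 ideal_runtime,
                          u64 min_granularity, s64 vruntime_delta)
{
    if (delta_exec > ideal_runtime)     /* the whole slice is used up */
        return 1;
    if (delta_exec < min_granularity)   /* protect the minimum slice */
        return 0;
    if (vruntime_delta < 0)             /* current task is not ahead */
        return 0;
    return vruntime_delta > (s64)ideal_runtime; /* far ahead of the leftmost task */
}

int main(void)
{
    /* Assumed: 7 ms ideal slice, 2 ms minimum granularity. */
    printf("%d\n", should_resched(8000000, 7000000, 2000000, 0));       /* 1: slice exhausted */
    printf("%d\n", should_resched(1000000, 7000000, 2000000, 9000000)); /* 0: below min granularity */
    printf("%d\n", should_resched(3000000, 7000000, 2000000, 9000000)); /* 1: far ahead in vruntime */
    return 0;
}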
check_preempt_tick is driven by the periodic timer interrupt through the following call chain:
tick_handle_periodic
-tick_periodic
--update_process_times
---scheduler_tick
----task_tick_fair
-----entity_tick
------check_preempt_tick
sched_wakeup_granularity_ns affects preemption between processes. In the following code snippets, you will see that preemption happens only when the difference between the virtual run time of the current running process and the virtual run time of the preempting process is bigger than sched_wakeup_granularity_ns converted to virtual run time (the conversion scales sched_wakeup_granularity_ns by the preempting process's weight).
/*
 * Should 'se' preempt 'curr'.
 *
 *             |s1
 *        |s2
 *   |s3
 *         g
 *      |<--->|c
 *
 *  w(c, s1) = -1
 *  w(c, s2) =  0
 *  w(c, s3) =  1
 *
 */
static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
    s64 gran, vdiff = curr->vruntime - se->vruntime;

    if (vdiff <= 0)
        return -1;

    gran = wakeup_gran(curr, se);
    if (vdiff > gran)
        return 1;

    return 0;
}
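wakeup_gran converts sysctl_sched_wakeup_granularity into the preempting entity's virtual time by scaling it with the entity's weight (via calc_delta_fair). Here is a minimal userspace sketch of that scaling, with NICE_0_LOAD = 1024 and hypothetical weights:

#include <stdio.h>

typedef unsigned long long u64;

#define NICE_0_LOAD 1024ULL

/* Roughly what calc_delta_fair does inside wakeup_gran:
 * gran_virtual = gran * NICE_0_LOAD / se_weight. A heavier (lower-nice)
 * waking task gets a smaller virtual granularity, so it preempts sooner. */
static u64 wakeup_gran_virtual(u64 gran_ns, u64 se_weight)
{
    return gran_ns * NICE_0_LOAD / se_weight;
}

int main(void)
{
    u64 gran = 4000000ULL; /* assumed sched_wakeup_granularity_ns: 4 ms */

    printf("nice 0  (weight 1024): %llu ns\n", wakeup_gran_virtual(gran, 1024)); /* 4 ms */
    printf("nice -5 (weight 3121): %llu ns\n", wakeup_gran_virtual(gran, 3121)); /* ~1.3 ms */
    return 0;
}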
/*
 * Preempt the current task with a newly woken task if needed:
 */
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
...
    if (wakeup_preempt_entity(se, pse) == 1) {
        /*
         * Bias pick_next to pick the sched entity that is
         * triggering this preemption.
         */
        if (!next_buddy_marked)
            set_next_buddy(pse);
        goto preempt;
    }

    return;

preempt:
    resched_task(curr);
...
}
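To experiment with sched_wakeup_granularity_ns, the tunable can be changed at runtime through /proc (root required). A minimal sketch, equivalent to echo 0 > /proc/sys/kernel/sched_wakeup_granularity_ns:

#include <stdio.h>

int main(void)
{
    /* Write 0 to make the wakeup granularity threshold as small as possible. */
    FILE *f = fopen("/proc/sys/kernel/sched_wakeup_granularity_ns", "w");

    if (!f) {
        perror("fopen");
        return 1;
    }
    fprintf(f, "0\n");
    fclose(f);
    return 0;
}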
Here is a simple example for sched_wakeup_granularity_ns. When I set sched_wakeup_granularity_ns to 0, processes preempted each other very frequently, which suggests that a small value is better for I/O-intensive workloads. I dumped the kernel stack as follows whenever a preemption happened:
Dec 10 19:15:25 mobius04 kernel: [ 3976.315960] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~.
Dec 10 19:15:25 mobius04 kernel: [ 3976.315961] check_preempt_wakeup: current time stamp (microseconds) is 1481415325312148
Dec 10 19:15:25 mobius04 kernel: [ 3976.315962] check_preempt_wakeup: thread 5688 will preempt current thread 5719
Dec 10 19:15:25 mobius04 kernel: [ 3976.315962]
Dec 10 19:15:25 mobius04 kernel: [ 3976.315963] CPU: 5 PID: 5684 Comm: qemu-system-x86 Not tainted 3.16.39 #21
Dec 10 19:15:25 mobius04 kernel: [ 3976.315964] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.2.6 06/08/2015
Dec 10 19:15:25 mobius04 kernel: [ 3976.315965] 0000000000000000 ffff8813bc74bbe8 ffffffff8176b2f0 ffff8813bd0d9400
Dec 10 19:15:25 mobius04 kernel: [ 3976.315966] ffff8813bc78aa00 ffff8813bc74bc48 ffffffff810b2054 ffff8813bc74bc18
Dec 10 19:15:25 mobius04 kernel: [ 3976.315968] 00000000000ae45d 00000000584c9a9d 000000000004c354 ffff8814015b8a40
Dec 10 19:15:25 mobius04 kernel: [ 3976.315970] Call Trace:
Dec 10 19:15:25 mobius04 kernel: [ 3976.315971] [] dump_stack+0x64/0x84
Dec 10 19:15:25 mobius04 kernel: [ 3976.315973] [] check_preempt_wakeup+0x284/0x2b0
Dec 10 19:15:25 mobius04 kernel: [ 3976.315980] [] check_preempt_curr+0x84/0xa0
Dec 10 19:15:25 mobius04 kernel: [ 3976.315987] [] ttwu_do_wakeup+0x1d/0xf0
Dec 10 19:15:25 mobius04 kernel: [ 3976.315994] [] T.2622+0x4a/0x60
Dec 10 19:15:25 mobius04 kernel: [ 3976.316001] [] try_to_wake_up+0x299/0x370
Dec 10 19:15:25 mobius04 kernel: [ 3976.316008] [] ? __pollwait+0xf0/0xf0
Dec 10 19:15:25 mobius04 kernel: [ 3976.316015] [] wake_up_state+0x10/0x20
Dec 10 19:15:25 mobius04 kernel: [ 3976.316022] [] wake_futex+0x76/0x90
Dec 10 19:15:25 mobius04 kernel: [ 3976.316029] [] futex_wake+0x12f/0x160
Dec 10 19:15:25 mobius04 kernel: [ 3976.316035] [] do_futex+0x119/0xd20
Dec 10 19:15:25 mobius04 kernel: [ 3976.316043] [] ? __dequeue_entity+0x30/0x50
Dec 10 19:15:25 mobius04 kernel: [ 3976.316051] [] ? eventfd_ctx_read+0x6e/0x2e0
Dec 10 19:15:25 mobius04 kernel: [ 3976.316057] [] SyS_futex+0x7d/0x170
Dec 10 19:15:25 mobius04 kernel: [ 3976.316064] [] ? SyS_read+0x9c/0xd0
Dec 10 19:15:25 mobius04 kernel: [ 3976.316067] [] system_call_fastpath+0x1a/0x1f
Dec 10 19:15:25 mobius04 kernel: [ 3976.316068] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~.
Conclusion
1. If the Linux kernel uses CFS, each process runs for a timeslice of at least sched_min_granularity_ns.
2. If sched_latency_ns / sched_min_granularity_ns is smaller than the number of runnable processes on the CPU's run queue, the scheduling period (in which every process on the run queue runs once) becomes "the number of runnable processes multiplied by sched_min_granularity_ns" rather than sched_latency_ns.
3. Preemption happens only if the difference between the virtual run time of the current running process and the virtual run time of the preempting process is bigger than sched_wakeup_granularity_ns converted to virtual run time (scaled by the preempting process's weight).