[PATCH] some 2.5 scheduler backporting to ck4
Hi, I've ported some parts of the latest 2.5.6x scheduler to 2.4.20-ck4.
I also included variable-hz again for 1000 Hz (to match 2.5) as well as
sched-tunables.
I'm not sure how correct it is, but it seems to work well. The patch is
against ck4-rmap15d with the rmap15e incremental patch applied, ignoring
the elevator.h unpatches in the 15e incremental. Contest benchmarks will
follow in another email.
diff -ruNp a/Documentation/Configure.help b/Documentation/Configure.help
--- a/Documentation/Configure.help 2003-04-03 21:31:34.000000000 -0800
+++ b/Documentation/Configure.help 2003-04-03 23:15:05.000000000 -0800
@@ -2439,6 +2439,18 @@ CONFIG_HEARTBEAT
behaviour is platform-dependent, but normally the flash frequency is
a hyperbolic function of the 5-minute load average.
+Timer frequency
+CONFIG_HZ
+ The frequency at which the system timer interrupt fires. Higher tick values
+ provide improved timer granularity, improved select() and poll() performance,
+ and lower scheduling latency. Higher values, however, increase interrupt
+ overhead and cause jiffies to wrap around sooner. For compatibility, the
+ tick count is always exported as if HZ=100.
+
+ The default value, which was the value for all of eternity, is 100. If
+ you are looking to provide better timer granularity or increased desktop
+ performance, try 500 or 1000. If unsure, go with the default of 100.
+
Networking support
CONFIG_NET
Unless you really know what you are doing, you should say Y here.
diff -ruNp a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt 2002-12-09 02:24:08.000000000 -0800
+++ b/Documentation/filesystems/proc.txt 2003-04-03 23:10:53.000000000 -0800
@@ -37,6 +37,7 @@ Table of Contents
2.8 /proc/sys/net/ipv4 - IPV4 settings
2.9 Appletalk
2.10 IPX
+ 2.11 /proc/sys/sched - scheduler tunables
------------------------------------------------------------------------------
Preface
@@ -1779,6 +1780,92 @@ The /proc/net/ipx_route table holds a
gives the destination network, the router node (or Directly) and the network
address of the router (or Connected) for internal networks.
+2.11 /proc/sys/sched - scheduler tunables
+-----------------------------------------
+
+Useful knobs for tuning the scheduler live in /proc/sys/sched.
+
+child_penalty
+-------------
+
+Percentage of the parent's sleep_avg that children inherit. sleep_avg is
+a running average of the time a process spends sleeping. Tasks with high
+sleep_avg values are considered interactive and given a higher dynamic
+priority and a larger timeslice. You typically want this set to a value just
+under 100.
+
+exit_weight
+-----------
+
+When a CPU hog task exits, its parent's sleep_avg is reduced by a factor of
+exit_weight against the exiting task's sleep_avg.
+
+interactive_delta
+-----------------
+
+If a task is "interactive" it is reinserted into the active array after it
+has expired its timeslice, instead of being inserted into the expired array.
+How "interactive" a task must be in order to be deemed interactive is a
+function of its nice value. This interactive limit is scaled linearly by nice
+value and is offset by the interactive_delta.
+
+max_sleep_avg
+-------------
+
+max_sleep_avg is the largest value (in ms) stored for a task's running sleep
+average. The larger this value, the longer a task needs to sleep to be
+considered interactive (maximum interactive bonus is a function of
+max_sleep_avg).
+
+max_timeslice
+-------------
+
+Maximum timeslice, in milliseconds. This is the value given to tasks of the
+highest dynamic priority.
+
+min_timeslice
+-------------
+
+Minimum timeslice, in milliseconds. This is the value given to tasks of the
+lowest dynamic priority. Every task gets at least this slice of the processor
+per array switch.
+
+parent_penalty
+--------------
+
+Percentage of the parent's sleep_avg that it retains across a fork().
+sleep_avg is a running average of the time a process spends sleeping. Tasks
+with high sleep_avg values are considered interactive and given a higher
+dynamic priority and a larger timeslice. Normally, this value is 100 and thus
+tasks retain their sleep_avg on fork. If you want to punish interactive
+tasks for forking, set this below 100.
+
+prio_bonus_ratio
+----------------
+
+Middle percentage of the priority range that tasks can receive as a dynamic
+priority. The default value of 25% ensures that nice values at the
+extremes are still enforced. For example, nice +19 interactive tasks will
+never be able to preempt a nice 0 CPU hog. Setting this higher will increase
+the size of the priority range the tasks can receive as a bonus. Setting
+this lower will decrease this range, making the interactivity bonus less
+apparent and user nice values more applicable.
+
+starvation_limit
+----------------
+
+Sufficiently interactive tasks are reinserted into the active array when they
+run out of timeslice. Normally, tasks are inserted into the expired array.
+Reinserting interactive tasks into the active array allows them to remain
+runnable, which is important to interactive performance. This could starve
+expired tasks, however, since the interactive task could prevent the array
+switch. To prevent starving the tasks on the expired array for too long, the
+starvation_limit is the longest (in ms) we will let the expired array starve
+at the expense of reinserting interactive tasks back into active. Higher
+values here give more preference to running interactive tasks, at the expense
+of expired tasks. Lower values provide more fair scheduling behavior, at the
+expense of interactivity. The units are in milliseconds.
+
------------------------------------------------------------------------------
Summary
------------------------------------------------------------------------------
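To make the prio_bonus_ratio description concrete, here is a small userspace
sketch of the bonus expression that effective_prio() in kernel/sched.c uses.
MAX_USER_PRIO = 40 is an assumption taken from the 2.5 scheduler and is not
defined in this patch; prio_bonus() is a hypothetical helper name, not a
kernel function.

```c
/* Assumption: 40 user priority levels, as in the 2.5 O(1) scheduler. */
#define MAX_USER_PRIO 40

/*
 * Same integer arithmetic as the bonus expression in effective_prio().
 * sleep_avg and max_sleep_avg are in jiffies; prio_bonus_ratio is a percent.
 */
static int prio_bonus(int sleep_avg, int max_sleep_avg, int prio_bonus_ratio)
{
	return MAX_USER_PRIO * prio_bonus_ratio * sleep_avg / max_sleep_avg / 100
	     - MAX_USER_PRIO * prio_bonus_ratio / 100 / 2;
}
```

With the patch default of 25 the bonus spans -5 to +5 priority levels, so it
can never bridge the gap between nice +19 and nice 0. With the old value of
15 it spans only -3 to +3.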
diff -ruNp a/arch/i386/config.in b/arch/i386/config.in
--- a/arch/i386/config.in 2003-04-03 21:33:54.000000000 -0800
+++ b/arch/i386/config.in 2003-04-03 23:15:05.000000000 -0800
@@ -240,6 +240,7 @@ endmenu
mainmenu_option next_comment
comment 'General setup'
+int 'Timer frequency (HZ) (100)' CONFIG_HZ 1000
bool 'Networking support' CONFIG_NET
# Visual Workstation support is utterly broken.
diff -ruNp a/fs/proc/array.c b/fs/proc/array.c
--- a/fs/proc/array.c 2003-04-03 21:33:54.000000000 -0800
+++ b/fs/proc/array.c 2003-04-03 23:15:05.000000000 -0800
@@ -360,15 +360,15 @@ int proc_pid_stat(struct task_struct *ta
task->cmin_flt,
task->maj_flt,
task->cmaj_flt,
- task->times.tms_utime,
- task->times.tms_stime,
- task->times.tms_cutime,
- task->times.tms_cstime,
+ jiffies_to_clock_t(task->times.tms_utime),
+ jiffies_to_clock_t(task->times.tms_stime),
+ jiffies_to_clock_t(task->times.tms_cutime),
+ jiffies_to_clock_t(task->times.tms_cstime),
priority,
nice,
0UL /* removed */,
- task->it_real_value,
- task->start_time,
+ jiffies_to_clock_t(task->it_real_value),
+ jiffies_to_clock_t(task->start_time),
vsize,
mm ? mm->rss : 0, /* you might want to shift this left 3 */
task->rlim[RLIMIT_RSS].rlim_cur,
@@ -687,14 +687,14 @@ int proc_pid_cpu(struct task_struct *tas
len = sprintf(buffer,
"cpu %lu %lu\n",
- task->times.tms_utime,
- task->times.tms_stime);
+ jiffies_to_clock_t(task->times.tms_utime),
+ jiffies_to_clock_t(task->times.tms_stime));
for (i = 0 ; i < smp_num_cpus; i++)
len += sprintf(buffer + len, "cpu%d %lu %lu\n",
i,
- task->per_cpu_utime[cpu_logical_map(i)],
- task->per_cpu_stime[cpu_logical_map(i)]);
+ jiffies_to_clock_t(task->per_cpu_utime[cpu_logical_map(i)]),
+ jiffies_to_clock_t(task->per_cpu_stime[cpu_logical_map(i)]));
return len;
}
diff -ruNp a/fs/proc/proc_misc.c b/fs/proc/proc_misc.c
--- a/fs/proc/proc_misc.c 2003-04-03 21:33:54.000000000 -0800
+++ b/fs/proc/proc_misc.c 2003-04-03 23:15:05.000000000 -0800
@@ -316,16 +316,16 @@ static int kstat_read_proc(char *page, c
{
int i, len = 0;
extern unsigned long total_forks;
- unsigned long jif = jiffies;
+ unsigned long jif = jiffies_to_clock_t(jiffies);
unsigned int sum = 0, user = 0, nice = 0, system = 0;
int major, disk;
for (i = 0 ; i < smp_num_cpus; i++) {
int cpu = cpu_logical_map(i), j;
- user += kstat.per_cpu_user[cpu];
- nice += kstat.per_cpu_nice[cpu];
- system += kstat.per_cpu_system[cpu];
+ user += jiffies_to_clock_t(kstat.per_cpu_user[cpu]);
+ nice += jiffies_to_clock_t(kstat.per_cpu_nice[cpu]);
+ system += jiffies_to_clock_t(kstat.per_cpu_system[cpu]);
#if !defined(CONFIG_ARCH_S390)
for (j = 0 ; j < NR_IRQS ; j++)
sum += kstat.irqs[cpu][j];
@@ -339,10 +339,10 @@ static int kstat_read_proc(char *page, c
proc_sprintf(page, &off, &len,
"cpu%d %u %u %u %lu\n",
i,
- kstat.per_cpu_user[cpu_logical_map(i)],
- kstat.per_cpu_nice[cpu_logical_map(i)],
- kstat.per_cpu_system[cpu_logical_map(i)],
- jif - ( kstat.per_cpu_user[cpu_logical_map(i)] \
+ jiffies_to_clock_t(kstat.per_cpu_user[cpu_logical_map(i)]),
+ jiffies_to_clock_t(kstat.per_cpu_nice[cpu_logical_map(i)]),
+ jiffies_to_clock_t(kstat.per_cpu_system[cpu_logical_map(i)]),
+ jif - jiffies_to_clock_t(kstat.per_cpu_user[cpu_logical_map(i)] \
+ kstat.per_cpu_nice[cpu_logical_map(i)] \
+ kstat.per_cpu_system[cpu_logical_map(i)]));
proc_sprintf(page, &off, &len,
diff -ruNp a/include/asm-i386/param.h b/include/asm-i386/param.h
--- a/include/asm-i386/param.h 2000-10-27 11:04:43.000000000 -0700
+++ b/include/asm-i386/param.h 2003-04-03 23:15:05.000000000 -0800
@@ -1,8 +1,17 @@
#ifndef _ASMi386_PARAM_H
#define _ASMi386_PARAM_H
+#include <linux/config.h>
+
+#ifdef __KERNEL__
+# define HZ CONFIG_HZ /* internal kernel timer frequency */
+# define USER_HZ 100 /* some user interfaces are in ticks */
+# define CLOCKS_PER_SEC (USER_HZ) /* like times() */
+# define jiffies_to_clock_t(x) ((x) / ((HZ) / (USER_HZ)))
+#endif
+
#ifndef HZ
-#define HZ 100
+#define HZ 100 /* if userspace cheats, give them 100 */
#endif
#define EXEC_PAGESIZE 4096
@@ -17,8 +26,4 @@
#define MAXHOSTNAMELEN 64 /* max length of hostname */
-#ifdef __KERNEL__
-# define CLOCKS_PER_SEC 100 /* frequency at which times() counts */
-#endif
-
#endif
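A quick userspace sketch of what the jiffies_to_clock_t() macro above does,
written as a function with HZ passed in so different CONFIG_HZ values can be
compared. jiffies_to_user_ticks() is a hypothetical name used only for this
illustration.

```c
#define USER_HZ 100	/* tick rate exported to userspace, as in the patch */

/*
 * Function form of jiffies_to_clock_t(): scale kernel ticks down to
 * USER_HZ ticks. Like the macro, it uses integer division, so it is only
 * exact when hz is a multiple of USER_HZ, and it divides by zero when
 * hz < USER_HZ.
 */
static unsigned long jiffies_to_user_ticks(unsigned long jiffies,
					   unsigned long hz)
{
	return jiffies / (hz / USER_HZ);
}
```

At HZ=1000, one second (1000 jiffies) is reported as 100 ticks, the same
value userspace saw at HZ=100. At HZ=250 the divisor truncates to 2, so one
second (250 jiffies) would be reported as 125 ticks; CONFIG_HZ values that
are not multiples of 100 are best avoided with this macro.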
diff -ruNp a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h 2003-04-03 21:33:54.000000000 -0800
+++ b/include/linux/sched.h 2003-04-03 23:10:53.000000000 -0800
@@ -356,7 +356,7 @@ struct task_struct {
prio_array_t *array;
unsigned long sleep_avg;
- unsigned long sleep_timestamp;
+ unsigned long last_run;
unsigned long policy;
unsigned long cpus_allowed;
@@ -387,6 +387,7 @@ struct task_struct {
* older sibling, respectively. (p->father can be replaced with
* p->p_pptr->pid)
*/
+ struct task_struct *parent;
task_t *p_opptr, *p_pptr, *p_cptr, *p_ysptr, *p_osptr;
struct list_head thread_group;
diff -ruNp a/include/linux/sysctl.h b/include/linux/sysctl.h
--- a/include/linux/sysctl.h 2003-04-03 21:33:54.000000000 -0800
+++ b/include/linux/sysctl.h 2003-04-03 23:10:53.000000000 -0800
@@ -63,7 +63,8 @@ enum
CTL_DEV=7, /* Devices */
CTL_BUS=8, /* Busses */
CTL_ABI=9, /* Binary emulation */
- CTL_CPU=10 /* CPU stuff (speed scaling, etc) */
+ CTL_CPU=10, /* CPU stuff (speed scaling, etc) */
+ CTL_SCHED=11, /* scheduler tunables */
};
/* CTL_BUS names: */
@@ -148,6 +149,19 @@ enum
VM_PAGEBUF=14, /* struct: Control pagebuf parameters */
};
+/* Tunable scheduler parameters in /proc/sys/sched/ */
+enum
+{
+ SCHED_MIN_TIMESLICE=1, /* minimum process timeslice */
+ SCHED_MAX_TIMESLICE=2, /* maximum process timeslice */
+ SCHED_CHILD_PENALTY=3, /* penalty on fork to child */
+ SCHED_PARENT_PENALTY=4, /* penalty on fork to parent */
+ SCHED_EXIT_WEIGHT=5, /* penalty to parent of CPU hog child */
+ SCHED_PRIO_BONUS_RATIO=6, /* percent of max prio given as bonus */
+ SCHED_INTERACTIVE_DELTA=7, /* delta used to scale interactivity */
+ SCHED_MAX_SLEEP_AVG=8, /* maximum sleep avg attainable */
+ SCHED_STARVATION_LIMIT=9, /* no re-active if expired is starved */
+};
/* CTL_NET names: */
enum
diff -ruNp a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c 2003-04-03 21:33:54.000000000 -0800
+++ b/kernel/fork.c 2003-04-03 23:10:53.000000000 -0800
@@ -727,7 +727,7 @@ int do_fork(unsigned long clone_flags, u
current->time_slice = 1;
scheduler_tick(0,0);
}
- p->sleep_timestamp = jiffies;
+ p->last_run = jiffies;
__sti();
/*
diff -ruNp a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c 2003-04-03 21:33:54.000000000 -0800
+++ b/kernel/sched.c 2003-04-03 23:10:53.000000000 -0800
@@ -52,15 +52,26 @@
* maximum timeslice is 300 msecs. Timeslices get refilled after
* they expire.
*/
-#define MIN_TIMESLICE ( 10 * HZ / 1000 )
-#define MAX_TIMESLICE ( 1000 * HZ / 1000 )
-#define CHILD_PENALTY 95
-#define PARENT_PENALTY 100
-#define EXIT_WEIGHT 3
-#define PRIO_BONUS_RATIO 15
-#define INTERACTIVE_DELTA 4
-#define MAX_SLEEP_AVG (2*HZ)
-#define STARVATION_LIMIT (3*HZ)
+int min_timeslice = ((5 * HZ) / 1000 ?: 1);
+int max_timeslice = (200 * HZ) / 1000;
+int child_penalty = 50;
+int parent_penalty = 100;
+int exit_weight = 3;
+int prio_bonus_ratio = 25;
+int interactive_delta = 2;
+int max_sleep_avg = 10 * HZ;
+int starvation_limit = 10 * HZ;
+
+#define MIN_TIMESLICE (min_timeslice)
+#define MAX_TIMESLICE (max_timeslice)
+#define CHILD_PENALTY (child_penalty)
+#define PARENT_PENALTY (parent_penalty)
+#define EXIT_WEIGHT (exit_weight)
+#define PRIO_BONUS_RATIO (prio_bonus_ratio)
+#define INTERACTIVE_DELTA (interactive_delta)
+#define MAX_SLEEP_AVG (max_sleep_avg)
+#define STARVATION_LIMIT (starvation_limit)
+#define TIMESLICE_GRANULARITY (HZ/20 ?: 1)
/*
* If a task is 'interactive' then we reinsert it in the active
@@ -115,14 +126,19 @@
* downside in using shorter timeslices.
*/
-static inline unsigned int task_timeslice(task_t *p)
+#define BASE_TIMESLICE(p) \
+ (MAX_TIMESLICE * (MAX_PRIO-(p)->static_prio)/MAX_USER_PRIO)
+
+static unsigned int task_timeslice(task_t *p)
{
- if (p->policy == SCHED_BATCH)
- return MAX_TIMESLICE;
- else
- return MIN_TIMESLICE;
-}
+ unsigned int time_slice = BASE_TIMESLICE(p);
+
+ if (time_slice < MIN_TIMESLICE)
+ time_slice = MIN_TIMESLICE;
+ return time_slice;
+}
+
/*
* These are the runqueue data structures:
*/
@@ -149,6 +165,7 @@ struct runqueue {
unsigned long nr_running, nr_switches, expired_timestamp,
nr_uninterruptible;
task_t *curr, *idle;
+ struct mm_struct *prev_mm;
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
@@ -191,6 +208,10 @@ static struct runqueue runqueues[NR_CPUS
# define task_running(rq, p) ((rq)->curr == (p))
#endif
+# define nr_running_init(rq) do { } while (0)
+# define nr_running_inc(rq) do { (rq)->nr_running++; } while (0)
+# define nr_running_dec(rq) do { (rq)->nr_running--; } while (0)
+
/*
* task_rq_lock - lock the runqueue a given task resides on and disable
* interrupts. Note the ordering: we can safely lookup the task_rq without
@@ -273,6 +294,9 @@ static inline int effective_prio(task_t
*
* Both properties are important to certain workloads.
*/
+ if (rt_task(p))
+ return p->prio;
+
bonus = MAX_USER_PRIO*PRIO_BONUS_RATIO*p->sleep_avg/MAX_SLEEP_AVG/100 -
MAX_USER_PRIO*PRIO_BONUS_RATIO/100/2;
@@ -284,27 +308,58 @@ static inline int effective_prio(task_t
return prio;
}
-static inline void activate_task(task_t *p, runqueue_t *rq)
+static inline void __activate_task(task_t *p, runqueue_t *rq)
{
- unsigned long sleep_time = jiffies - p->sleep_timestamp;
- prio_array_t *array = rq->active;
+ enqueue_task(p, rq->active);
+ nr_running_inc(rq);
+}
- if (!rt_task(p) && sleep_time) {
- /*
- * This code gives a bonus to interactive tasks. We update
- * an 'average sleep time' value here, based on
- * sleep_timestamp. The more time a task spends sleeping,
- * the higher the average gets - and the higher the priority
- * boost gets as well.
- */
- p->sleep_avg += sleep_time;
- if (p->sleep_avg > MAX_SLEEP_AVG)
- p->sleep_avg = MAX_SLEEP_AVG;
- p->prio = effective_prio(p);
+static inline int activate_task(task_t *p, runqueue_t *rq)
+{
+ long sleep_time = jiffies - p->last_run - 1;
+ int requeue_waker = 0;
+
+ if (sleep_time > 0) {
+ int sleep_avg;
+
+ /*
+ * This code gives a bonus to interactive tasks.
+ *
+ * The boost works by updating the 'average sleep time'
+ * value here, based on ->last_run. The more time a task
+ * spends sleeping, the higher the average gets - and the
+ * higher the priority boost gets as well.
+ */
+ sleep_avg = p->sleep_avg + sleep_time;
+
+ /*
+ * 'Overflow' bonus ticks go to the waker as well, so the
+ * ticks are not lost. This has the effect of further
+ * boosting tasks that are related to maximum-interactive
+ * tasks.
+ */
+ if (sleep_avg > MAX_SLEEP_AVG) {
+ if (!in_interrupt()) {
+ sleep_avg += current->sleep_avg - MAX_SLEEP_AVG;
+ if (sleep_avg > MAX_SLEEP_AVG)
+ sleep_avg = MAX_SLEEP_AVG;
+
+ if (current->sleep_avg != sleep_avg) {
+ current->sleep_avg = sleep_avg;
+ requeue_waker = 1;
+ }
+ }
+ sleep_avg = MAX_SLEEP_AVG;
+ }
+ if (p->sleep_avg != sleep_avg) {
+ p->sleep_avg = sleep_avg;
+ p->prio = effective_prio(p);
}
- enqueue_task(p, array);
- rq->nr_running++;
}
+ __activate_task(p, rq);
+
+ return requeue_waker;
+}
static inline void activate_batch_task(task_t *p, runqueue_t *rq)
{
@@ -316,7 +371,7 @@ static inline void activate_batch_task(t
static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
- rq->nr_running--;
+ nr_running_dec(rq);
if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
dequeue_task(p, p->array);
@@ -378,7 +433,7 @@ static inline void resched_task(task_t *
* ptrace() code.
*/
void wait_task_inactive(task_t * p)
- {
+{
unsigned long flags;
runqueue_t *rq;
@@ -419,23 +474,8 @@ repeat:
*/
void kick_if_running(task_t * p)
{
- if (task_running(task_rq(p), p) && (p->cpu != smp_processor_id()))
+ if (task_running(task_rq(p), p) && (task_cpu(p) != smp_processor_id()))
resched_task(p);
- /*
- * If batch processes get signals but are not running currently
- * then give them a chance to handle the signal. (the kernel
- * side signal handling code will run for sure, the userspace
- * part depends on system load and might be delayed indefinitely.)
- */
- if (p->policy == SCHED_BATCH) {
- unsigned long flags;
- runqueue_t *rq;
-
- rq = task_rq_lock(p, &flags);
- if (p->flags & PF_BATCH)
- activate_batch_task(p, rq);
- task_rq_unlock(rq, &flags);
- }
}
/*
@@ -449,70 +489,99 @@ void kick_if_running(task_t * p)
* returns failure only if the task is already active.
*/
-static int try_to_wake_up(task_t * p, int sync)
+static int try_to_wake_up(task_t * p, unsigned int state, int sync)
{
+ int success = 0, requeue_waker = 0;
unsigned long flags;
- int success = 0;
long old_state;
runqueue_t *rq;
repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
- if (!p->array) {
- /*
- * Fast-migrate the task if it's not running or runnable
- * currently. Do not violate hard affinity.
- */
- if (unlikely(sync && !task_running(rq, p) &&
- (task_cpu(p) != smp_processor_id()) &&
- (p->cpus_allowed & (1UL << smp_processor_id())))) {
-
- set_task_cpu(p, smp_processor_id());
+ if (old_state & state) {
+ if (!p->array) {
+ /*
+ * Fast-migrate the task if it's not running or runnable
+ * currently. Do not violate hard affinity.
+ */
+ if (unlikely(sync && !task_running(rq, p) &&
+ (task_cpu(p) != smp_processor_id()) &&
+ (p->cpus_allowed & (1UL << smp_processor_id())))) {
+
+ set_task_cpu(p, smp_processor_id());
+
+ task_rq_unlock(rq, &flags);
+ goto repeat_lock_task;
+ }
+ if (old_state == TASK_UNINTERRUPTIBLE)
+ rq->nr_uninterruptible--;
- task_rq_unlock(rq, &flags);
- goto repeat_lock_task;
+ if (sync)
+ __activate_task(p, rq);
+ else {
+ requeue_waker = activate_task(p, rq);
+ if (p->prio < rq->curr->prio)
+ resched_task(rq->curr);
+ }
+ success = 1;
}
- if (old_state == TASK_UNINTERRUPTIBLE)
- rq->nr_uninterruptible--;
- activate_task(p, rq);
-
- if (p->prio < rq->curr->prio || rq->curr->policy == SCHED_BATCH)
- resched_task(rq->curr);
- success = 1;
+ p->state = TASK_RUNNING;
}
- p->state = TASK_RUNNING;
task_rq_unlock(rq, &flags);
+ /*
+ * We have to do this outside the other spinlock, the two
+ * runqueues might be different:
+ */
+ if (requeue_waker) {
+ prio_array_t *array;
+
+ rq = task_rq_lock(current, &flags);
+ array = current->array;
+ dequeue_task(current, array);
+ current->prio = effective_prio(current);
+ enqueue_task(current, array);
+ task_rq_unlock(rq, &flags);
+ }
+
return success;
}
int wake_up_process(task_t * p)
{
- return try_to_wake_up(p, 0);
+ return try_to_wake_up(p, TASK_STOPPED | TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
}
void wake_up_forked_process(task_t * p)
{
- runqueue_t *rq ;
+ runqueue_t *rq;
+ unsigned long flags;
preempt_disable();
- rq = this_rq_lock();
+
+ rq = task_rq_lock(current, &flags);
p->state = TASK_RUNNING;
- if (!rt_task(p)) {
- /*
- * We decrease the sleep average of forking parents
- * and children as well, to keep max-interactive tasks
- * from forking tasks that are max-interactive.
- */
- current->sleep_avg = current->sleep_avg * PARENT_PENALTY / 100;
- p->sleep_avg = p->sleep_avg * CHILD_PENALTY / 100;
- p->prio = effective_prio(p);
-}
+ /*
+ * We decrease the sleep average of forking parents
+ * and children as well, to keep max-interactive tasks
+ * from forking tasks that are max-interactive.
+ */
+ current->sleep_avg = current->sleep_avg * PARENT_PENALTY / 100;
+ p->sleep_avg = p->sleep_avg * CHILD_PENALTY / 100;
+ p->prio = effective_prio(p);
set_task_cpu(p, smp_processor_id());
- activate_task(p, rq);
- rq_unlock(rq);
+ if (unlikely(!current->array))
+ __activate_task(p, rq);
+ else {
+ p->prio = current->prio;
+ list_add_tail(&p->run_list, &current->run_list);
+ p->array = current->array;
+ p->array->nr_active++;
+ nr_running_inc(rq);
+ }
+ task_rq_unlock(rq, &flags);
preempt_enable();
}
@@ -527,13 +596,15 @@ void wake_up_forked_process(task_t * p)
*/
void sched_exit(task_t * p)
{
- __cli();
+ unsigned long flags;
+
+ local_irq_save(flags);
if (p->first_time_slice) {
current->time_slice += p->time_slice;
if (unlikely(current->time_slice > MAX_TIMESLICE))
current->time_slice = MAX_TIMESLICE;
}
- __sti();
+ local_irq_restore(flags);
/*
* If the child was a (relative-) CPU hog then decrease
* the sleep_avg of the parent as well.
@@ -550,7 +621,7 @@ asmlinkage void schedule_tail(task_t *pr
}
#endif
-static inline task_t * context_switch(task_t *prev, task_t *next)
+static inline task_t * context_switch(runqueue_t *rq, task_t *prev, task_t *next)
{
struct mm_struct *mm = next->mm;
struct mm_struct *oldmm = prev->active_mm;
@@ -564,7 +635,7 @@ static inline task_t * context_switch(ta
if (unlikely(!prev->mm)) {
prev->active_mm = NULL;
- mmdrop(oldmm);
+ rq->prev_mm = oldmm;
}
/* Here we just switch the register state and the stack. */
@@ -824,9 +895,9 @@ static inline runqueue_t *find_busiest_q
static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
{
dequeue_task(p, src_array);
- src_rq->nr_running--;
+ nr_running_dec(src_rq);
set_task_cpu(p, this_cpu);
- this_rq->nr_running++;
+ nr_running_inc(this_rq);
enqueue_task(p, this_rq->active);
/*
* Note that idle threads have a prio of MAX_PRIO, for this test
@@ -834,6 +905,11 @@ static inline void pull_task(runqueue_t
*/
if (p->prio < this_rq->curr->prio)
set_need_resched();
+ else {
+ if (p->prio == this_rq->curr->prio &&
+ p->time_slice > this_rq->curr->time_slice)
+ set_need_resched();
+ }
}
/*
@@ -896,7 +972,7 @@ skip_queue:
*/
#define CAN_MIGRATE_TASK(p,rq,this_cpu) \
- ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \
+ ((idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \
!task_running(rq, p) && \
((p)->cpus_allowed & (1UL << (this_cpu))))
@@ -954,9 +1030,9 @@ static inline void idle_tick(runqueue_t
* increasing number of running tasks:
*/
#define EXPIRED_STARVING(rq) \
- ((rq)->expired_timestamp && \
+ (STARVATION_LIMIT && ((rq)->expired_timestamp && \
(jiffies - (rq)->expired_timestamp >= \
- STARVATION_LIMIT * ((rq)->nr_running) + 1))
+ STARVATION_LIMIT * ((rq)->nr_running) + 1 )))
/*
* This function gets called by the timer code, with HZ frequency.
@@ -985,7 +1061,7 @@ void scheduler_tick(int user_ticks, int
}
}
- if (p == rq->idle || p->policy == SCHED_BATCH)
+ if (p == rq->idle)
rq->idle_count++;
#endif
if (p == rq->idle) {
@@ -996,7 +1072,7 @@ void scheduler_tick(int user_ticks, int
#endif
return;
}
- if (TASK_NICE(p) > 0 || p->policy == SCHED_BATCH)
+ if (TASK_NICE(p) > 0)
kstat.per_cpu_nice[cpu] += user_ticks;
else
kstat.per_cpu_user[cpu] += user_ticks;
@@ -1008,6 +1084,17 @@ void scheduler_tick(int user_ticks, int
return;
}
spin_lock(&rq->lock);
+ /*
+ * The task was running during this tick - update the
+ * time slice counter and the sleep average. Note: we
+ * do not update a process's priority until it either
+ * goes to sleep or uses up its timeslice. This makes
+ * it possible for interactive tasks to use up their
+ * timeslices at their highest priority levels.
+ */
+ if (p->sleep_avg)
+ p->sleep_avg--;
+
if (unlikely(rt_task(p))) {
/*
* RR tasks need a special form of timeslice management.
@@ -1024,16 +1111,6 @@ void scheduler_tick(int user_ticks, int
}
goto out;
}
- /*
- * The task was running during this tick - update the
- * time slice counter and the sleep average. Note: we
- * do not update a process's priority until it either
- * goes to sleep or uses up its timeslice. This makes
- * it possible for interactive tasks to use up their
- * timeslices at their highest priority levels.
- */
- if (p->sleep_avg)
- p->sleep_avg--;
if (!--p->time_slice) {
dequeue_task(p, rq->active);
set_tsk_need_resched(p);
@@ -1047,6 +1124,28 @@ void scheduler_tick(int user_ticks, int
enqueue_task(p, rq->expired);
} else
enqueue_task(p, rq->active);
+ } else {
+ /*
+ * Prevent a too long timeslice allowing a task to monopolize
+ * the CPU. We do this by splitting up the timeslice into
+ * smaller pieces.
+ *
+ * Note: this does not mean the task's timeslices expire or
+ * get lost in any way, they just might be preempted by
+ * another task of equal priority. (one with higher
+ * priority would have preempted this task already.) We
+ * requeue this task to the end of the list on this priority
+ * level, which is in essence a round-robin of tasks with
+ * equal priority.
+ */
+ if (!(p->time_slice % TIMESLICE_GRANULARITY) &&
+ (p->array == rq->active)) {
+ dequeue_task(p, rq->active);
+ set_tsk_need_resched(p);
+ p->prio = effective_prio(p);
+ enqueue_task(p, rq->active);
+ }
+
}
out:
#if CONFIG_SMP
@@ -1107,7 +1206,7 @@ need_resched:
rq = this_rq();
release_kernel_lock(prev, smp_processor_id());
- prev->sleep_timestamp = jiffies;
+ prev->last_run = jiffies;
spin_lock_irq(&rq->lock);
/*
@@ -1173,7 +1272,7 @@ switch_tasks:
rq->curr = next;
prepare_arch_switch(rq, next);
- prev = context_switch(prev, next);
+ prev = context_switch(rq, prev, next);
barrier();
rq = this_rq();
finish_arch_switch(rq, prev);
@@ -1230,7 +1337,7 @@ static inline void __wake_up_common(wait
curr = list_entry(tmp, wait_queue_t, task_list);
p = curr->task;
state = p->state;
- if ((state & mode) && try_to_wake_up(p, sync) &&
+ if ((state & mode) && try_to_wake_up(p, state, sync) &&
((curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive))
break;
}
@@ -1443,7 +1550,7 @@ asmlinkage long sys_nice(int increment)
*/
int task_prio(task_t *p)
{
- return p->prio - MAX_USER_RT_PRIO;
+ return p->prio - MAX_RT_PRIO;
}
int task_nice(task_t *p)
@@ -1536,7 +1643,7 @@ static int setscheduler(pid_t pid, int p
else
p->prio = p->static_prio;
if (array)
- activate_task(p, task_rq(p));
+ __activate_task(p, task_rq(p));
out_unlock:
task_rq_unlock(rq, &flags);
@@ -2221,7 +2328,7 @@ void __init sched_init(void)
rq->curr = current;
rq->idle = current;
set_task_cpu(current, smp_processor_id());
- wake_up_process(current);
+ wake_up_forked_process(current);
init_timervecs();
init_bh(TIMER_BH, timer_bh);
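For reference, a userspace sketch of how the new task_timeslice() scales
the timeslice with the nice value. The priority constants (MAX_RT_PRIO = 100,
MAX_PRIO = 140, MAX_USER_PRIO = 40, NICE_TO_PRIO) are assumptions taken from
the 2.5 scheduler and do not appear in this patch; timeslice_for_nice() is a
hypothetical helper, not a kernel function.

```c
/* Assumed priority layout of the 2.5 O(1) scheduler. */
#define MAX_RT_PRIO		100
#define MAX_PRIO		(MAX_RT_PRIO + 40)
#define MAX_USER_PRIO		40
#define NICE_TO_PRIO(nice)	(MAX_RT_PRIO + (nice) + 20)

/*
 * Mirrors BASE_TIMESLICE() plus the MIN_TIMESLICE clamp in task_timeslice().
 * min_ts and max_ts are in jiffies, like min_timeslice and max_timeslice.
 */
static unsigned int timeslice_for_nice(int nice, int min_ts, int max_ts)
{
	int slice = max_ts * (MAX_PRIO - NICE_TO_PRIO(nice)) / MAX_USER_PRIO;

	if (slice < min_ts)
		slice = min_ts;
	return (unsigned int)slice;
}
```

With the defaults at HZ=1000 (min 5, max 200 jiffies) this gives 200 for
nice -20, 100 for nice 0 and 5 for nice +19. At HZ=100 (min 1, max 20) the
nice +19 result would round down to 0 and is clamped to 1.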
diff -ruNp a/kernel/signal.c b/kernel/signal.c
--- a/kernel/signal.c 2003-04-03 21:33:54.000000000 -0800
+++ b/kernel/signal.c 2003-04-03 23:15:05.000000000 -0800
@@ -13,7 +13,7 @@
#include <linux/smp_lock.h>
#include <linux/init.h>
#include <linux/sched.h>
-
+#include <asm/param.h>
#include <asm/uaccess.h>
/*
@@ -775,8 +775,8 @@ void do_notify_parent(struct task_struct
info.si_uid = tsk->uid;
/* FIXME: find out whether or not this is supposed to be c*time. */
- info.si_utime = tsk->times.tms_utime;
- info.si_stime = tsk->times.tms_stime;
+ info.si_utime = jiffies_to_clock_t(tsk->times.tms_utime);
+ info.si_stime = jiffies_to_clock_t(tsk->times.tms_stime);
status = tsk->exit_code & 0x7f;
why = SI_KERNEL; /* shouldn't happen */
diff -ruNp a/kernel/sys.c b/kernel/sys.c
--- a/kernel/sys.c 2003-04-03 21:33:54.000000000 -0800
+++ b/kernel/sys.c 2003-04-03 23:15:05.000000000 -0800
@@ -14,7 +14,7 @@
#include <linux/prctl.h>
#include <linux/init.h>
#include <linux/highuid.h>
-
+#include <asm/param.h>
#include <asm/uaccess.h>
#include <asm/io.h>
@@ -791,16 +791,23 @@ asmlinkage long sys_setfsgid(gid_t gid)
asmlinkage long sys_times(struct tms * tbuf)
{
+ struct tms temp;
+
/*
* In the SMP world we might just be unlucky and have one of
* the times increment as we use it. Since the value is an
* atomically safe type this is just fine. Conceptually its
* as if the syscall took an instant longer to occur.
*/
- if (tbuf)
- if (copy_to_user(tbuf, &current->times, sizeof(struct tms)))
+ if (tbuf) {
+ temp.tms_utime = jiffies_to_clock_t(current->times.tms_utime);
+ temp.tms_stime = jiffies_to_clock_t(current->times.tms_stime);
+ temp.tms_cutime = jiffies_to_clock_t(current->times.tms_cutime);
+ temp.tms_cstime = jiffies_to_clock_t(current->times.tms_cstime);
+ if (copy_to_user(tbuf, &temp, sizeof(struct tms)))
return -EFAULT;
- return jiffies;
+ }
+ return jiffies_to_clock_t(jiffies);
}
/*
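The reason sys_times() has to convert is that userspace interprets these
values in clock ticks, not jiffies. A minimal sketch of a typical consumer,
using only the standard times() and sysconf(_SC_CLK_TCK) calls;
cpu_seconds_used() is a hypothetical helper name.

```c
#include <sys/times.h>
#include <unistd.h>

/*
 * Return the CPU time (user + system) consumed by this process in seconds,
 * or -1.0 on error. The tick rate comes from sysconf(), which is what the
 * kernel's USER_HZ is expected to match.
 */
static double cpu_seconds_used(void)
{
	struct tms t;
	long ticks_per_sec = sysconf(_SC_CLK_TCK);

	if (ticks_per_sec <= 0 || times(&t) == (clock_t)-1)
		return -1.0;
	return (double)(t.tms_utime + t.tms_stime) / (double)ticks_per_sec;
}
```

If the kernel handed back raw jiffies at HZ=1000 while userspace still
divided by 100, a program like this would report ten times the real CPU
time, which is what the jiffies_to_clock_t() calls in this patch avoid.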
diff -ruNp a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c 2003-04-03 21:33:54.000000000 -0800
+++ b/kernel/sysctl.c 2003-04-03 23:10:53.000000000 -0800
@@ -53,7 +53,16 @@ extern int max_queued_signals;
extern int sysrq_enabled;
extern int core_uses_pid;
extern int cad_pid;
-
+extern int min_timeslice;
+extern int max_timeslice;
+extern int child_penalty;
+extern int parent_penalty;
+extern int exit_weight;
+extern int prio_bonus_ratio;
+extern int interactive_delta;
+extern int max_sleep_avg;
+extern int starvation_limit;
+
/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
static int maxolduid = 65535;
static int minolduid;
@@ -112,6 +121,7 @@ static struct ctl_table_header root_tabl
static ctl_table kern_table[];
static ctl_table vm_table[];
+static ctl_table sched_table[];
#ifdef CONFIG_NET
extern ctl_table net_table[];
#endif
@@ -156,6 +166,7 @@ static ctl_table root_table[] = {
{CTL_FS, "fs", NULL, 0, 0555, fs_table},
{CTL_DEBUG, "debug", NULL, 0, 0555, debug_table},
{CTL_DEV, "dev", NULL, 0, 0555, dev_table},
+ {CTL_SCHED, "sched", NULL, 0, 0555, sched_table},
{0}
};
@@ -329,8 +340,42 @@ static ctl_table debug_table[] = {
static ctl_table dev_table[] = {
{0}
-};
+};
+
+static int zero = 0;
+static int one = 1;
+static ctl_table sched_table[] = {
+ {SCHED_MAX_TIMESLICE, "max_timeslice", &max_timeslice,
+ sizeof(int), 0644, NULL, &proc_dointvec_minmax,
+ &sysctl_intvec, NULL, &one, NULL},
+ {SCHED_MIN_TIMESLICE, "min_timeslice", &min_timeslice,
+ sizeof(int), 0644, NULL, &proc_dointvec_minmax,
+ &sysctl_intvec, NULL, &one, NULL},
+ {SCHED_CHILD_PENALTY, "child_penalty", &child_penalty,
+ sizeof(int), 0644, NULL, &proc_dointvec_minmax,
+ &sysctl_intvec, NULL, &zero, NULL},
+ {SCHED_PARENT_PENALTY, "parent_penalty", &parent_penalty,
+ sizeof(int), 0644, NULL, &proc_dointvec_minmax,
+ &sysctl_intvec, NULL, &zero, NULL},
+ {SCHED_EXIT_WEIGHT, "exit_weight", &exit_weight,
+ sizeof(int), 0644, NULL, &proc_dointvec_minmax,
+ &sysctl_intvec, NULL, &zero, NULL},
+ {SCHED_PRIO_BONUS_RATIO, "prio_bonus_ratio", &prio_bonus_ratio,
+ sizeof(int), 0644, NULL, &proc_dointvec_minmax,
+ &sysctl_intvec, NULL, &zero, NULL},
+ {SCHED_INTERACTIVE_DELTA, "interactive_delta", &interactive_delta,
+ sizeof(int), 0644, NULL, &proc_dointvec_minmax,
+ &sysctl_intvec, NULL, &zero, NULL},
+ {SCHED_MAX_SLEEP_AVG, "max_sleep_avg", &max_sleep_avg,
+ sizeof(int), 0644, NULL, &proc_dointvec_minmax,
+ &sysctl_intvec, NULL, &one, NULL},
+ {SCHED_STARVATION_LIMIT, "starvation_limit", &starvation_limit,
+ sizeof(int), 0644, NULL, &proc_dointvec_minmax,
+ &sysctl_intvec, NULL, &zero, NULL},
+ {0}
+};
+
extern void init_irq_proc (void);
void __init sysctl_init(void)
--
Eric Wong
--
Kernelnewbies: Help each other learn about the Linux kernel.
Archive: http://mail.nl.linux.org/kernelnewbies/
FAQ: http://kernelnewbies.org/faq/