Discussion:
[ltt-dev] [patch 0/9] LTTng core kernel instrumentation
Mathieu Desnoyers
2009-03-24 15:56:25 UTC
Permalink
Hi,

This is the core of the Linux kernel LTTng instrumentation. I left out what I
considered would eventually be covered by a wider syscall instrumentation, and
I picked the most important pieces of instrumentation. The LTTng tree contains
more tracepoints, some of which (e.g. mm instrumentation) will be proposed
later, once we agree on the tracepoints proposed in this patchset.

Ingo, the irq and softirq instrumentation patches will probably clash with the
patches from Jason you currently have in your tree. Please see the individual
changelogs for details.

This patchset applies on top of 2.6.29, in the following order:

irq-handle-prepare-for-instrumentation.patch
lttng-instrumentation-irq.patch
lttng-instrumentation-tasklets.patch
lttng-instrumentation-softirq.patch
lttng-instrumentation-scheduler-fixup-task-migration.patch
lttng-instrumentation-timer.patch
lttng-instrumentation-kernel.patch
lttng-instrumentation-filemap.patch
lttng-instrumentation-swap.patch

Thanks,

Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Mathieu Desnoyers
2009-03-24 15:56:26 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: irq-handle-prepare-for-instrumentation.patch
URL: <http://lists.casi.polymtl.ca/pipermail/ltt-dev/attachments/20090324/1196b7a8/attachment.txt>
Mathieu Desnoyers
2009-03-24 15:56:30 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-scheduler-fixup-task-migration.patch
URL: <http://lists.casi.polymtl.ca/pipermail/ltt-dev/attachments/20090324/624d8e1e/attachment.asc>
Ingo Molnar
2009-03-24 17:53:17 UTC
Permalink
The scheduler migration only needs to pass the task struct pointer
and the destination CPU. The source CPU can be found by using
task_cpu(p) in the probe.
+++ linux-2.6-lttng/include/trace/sched.h 2009-01-30 10:43:19.000000000 -0500
@@ -30,8 +30,8 @@ DECLARE_TRACE(sched_switch,
TPARGS(rq, prev, next));
DECLARE_TRACE(sched_migrate_task,
- TPPROTO(struct task_struct *p, int orig_cpu, int dest_cpu),
- TPARGS(p, orig_cpu, dest_cpu));
+ TPPROTO(struct task_struct *p, int dest_cpu),
+ TPARGS(p, dest_cpu));
Makes sense. Since this tracepoint has been changed in the tracing
tree already, mind doing this change in that context?

Thanks,

Ingo
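For illustration, a probe attached to the reduced two-argument tracepoint can still report the source CPU by calling task_cpu() at probe time, as the changelog above notes. A minimal sketch (the probe name and the pr_debug() output are made up for the example, not part of the patchset):

/*
 * Sketch of a probe for the two-argument sched_migrate_task tracepoint.
 * task_cpu(p) returns the source CPU here, per the changelog above.
 */
static void probe_sched_migrate_task(struct task_struct *p, int dest_cpu)
{
	int orig_cpu = task_cpu(p);

	pr_debug("migrate: pid=%d cpu %d -> %d\n", p->pid, orig_cpu, dest_cpu);
}

	/* registration, e.g. from a tracer's init code: */
	ret = register_trace_sched_migrate_task(probe_sched_migrate_task);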
Mathieu Desnoyers
2009-03-24 15:56:32 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-kernel.patch
URL: <http://lists.casi.polymtl.ca/pipermail/ltt-dev/attachments/20090324/dc616bb6/attachment.txt>
Ingo Molnar
2009-03-24 18:33:13 UTC
Permalink
(Rusty Cc:-ed - for the module.c tracepoints below)
Instrument the core kernel: module load/free and printk events. It helps the
tracer to keep track of module-related events and to export valuable printk
information into the traces.
Those tracepoints are used by LTTng.
About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for performance result details.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers at polymtl.ca>
CC: 'Ingo Molnar' <mingo at elte.hu>
CC: Frederic Weisbecker <fweisbec at gmail.com>
CC: Andrew Morton <akpm at linux-foundation.org>
CC: Masami Hiramatsu <mhiramat at redhat.com>
CC: 'Peter Zijlstra' <peterz at infradead.org>
CC: "Frank Ch. Eigler" <fche at redhat.com>
CC: 'Hideo AOKI' <haoki at redhat.com>
CC: Takashi Nishiie <t-nishiie at np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt at goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu at linux360.ro>
---
include/trace/kernel.h | 19 +++++++++++++++++++
kernel/module.c | 8 ++++++++
kernel/printk.c | 7 +++++++
3 files changed, 34 insertions(+)
Index: linux-2.6-lttng/kernel/printk.c
===================================================================
--- linux-2.6-lttng.orig/kernel/printk.c 2009-03-24 09:09:52.000000000 -0400
+++ linux-2.6-lttng/kernel/printk.c 2009-03-24 09:31:53.000000000 -0400
@@ -32,6 +32,7 @@
#include <linux/security.h>
#include <linux/bootmem.h>
#include <linux/syscalls.h>
+#include <trace/kernel.h>
#include <asm/uaccess.h>
@@ -59,6 +60,7 @@ int console_printk[4] = {
MINIMUM_CONSOLE_LOGLEVEL, /* minimum_console_loglevel */
DEFAULT_CONSOLE_LOGLEVEL, /* default_console_loglevel */
};
+EXPORT_SYMBOL_GPL(console_printk);
/*
* Low level drivers may need that to know if they can schedule in
@@ -128,6 +130,9 @@ EXPORT_SYMBOL(console_set_on_cmdline);
/* Flag: console code may call schedule() */
static int console_may_schedule;
+DEFINE_TRACE(kernel_printk);
+DEFINE_TRACE(kernel_vprintk);
+
#ifdef CONFIG_PRINTK
static char __log_buf[__LOG_BUF_LEN];
@@ -560,6 +565,7 @@ asmlinkage int printk(const char *fmt, .
int r;
va_start(args, fmt);
+ trace_kernel_printk(_RET_IP_);
r = vprintk(fmt, args);
va_end(args);
@@ -667,6 +673,7 @@ asmlinkage int vprintk(const char *fmt,
printed_len += vscnprintf(printk_buf + printed_len,
sizeof(printk_buf) - printed_len, fmt, args);
+ trace_kernel_vprintk(_RET_IP_, printk_buf, printed_len);
So here we pass in the formatted output. What sense does it make to
have the above printk tracepoint? Little, I think.

Also, I'm not entirely convinced about the wisdom of instrumenting
an essential debug facility like printk(). Let's keep this one at the
tail portion of the patch queue, ok?
Index: linux-2.6-lttng/kernel/module.c
===================================================================
--- linux-2.6-lttng.orig/kernel/module.c 2009-03-24 09:09:59.000000000 -0400
+++ linux-2.6-lttng/kernel/module.c 2009-03-24 09:31:53.000000000 -0400
@@ -51,6 +51,7 @@
#include <linux/tracepoint.h>
#include <linux/ftrace.h>
#include <linux/async.h>
+#include <trace/kernel.h>
#if 0
#define DEBUGP printk
@@ -78,6 +79,9 @@ static BLOCKING_NOTIFIER_HEAD(module_not
/* Bounds of module allocation, for speeding __module_text_address */
static unsigned long module_addr_min = -1UL, module_addr_max = 0;
+DEFINE_TRACE(kernel_module_load);
+DEFINE_TRACE(kernel_module_free);
I believe that to have a complete picture of module usage, module
refcount get/put events should be included as well, beyond the basic
load/free events.

These both have performance impact (a module get/put in a fastpath
hurts scalability), and are informative in terms of establishing the
module dependency graph.

Another thing that is interesting and not covered here is module
request events and usermode helper callouts. These too have
instrumentation and performance analysis uses.

Plus, here too it would be desirable to add default probes as well,
via TRACE_EVENT().

Thanks,

Ingo
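For reference, a TRACE_EVENT() version of the module load event, as requested above, might look roughly like the sketch below. It is written against the TRACE_EVENT() macros of the tracing tree of that period; the field layout and names are illustrative, not the actual patch, and the definition would live in an include/trace/ header with the usual boilerplate around it.

/* Illustrative sketch only -- not the posted patch. */
TRACE_EVENT(kernel_module_load,

	TP_PROTO(struct module *mod),

	TP_ARGS(mod),

	TP_STRUCT__entry(
		__array(char, name, MODULE_NAME_LEN)
	),

	TP_fast_assign(
		strlcpy(__entry->name, mod->name, MODULE_NAME_LEN);
	),

	TP_printk("module=%s", __entry->name)
);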
Rusty Russell
2009-03-25 01:13:11 UTC
Permalink
Post by Ingo Molnar
(Rusty Cc:-ed - for the module.c tracepoints below)
Thanks, though they look fine and non-intrusive to me.
Post by Ingo Molnar
I believe that to have a complete picture of module usage, module
refcount get/put events should be included as well, beyond the basic
load/free events.
These both have performance impact (a module get/put in a fastpath
hurts scalability), and are informative in terms of establishing the
module dependency graph.
A module_get()/put() should not hurt scalability at all! I went to great and
horrible lengths to ensure that was the case since the rewrite in 2.4.

But a module dependency graph et al. would be kind of cool.

Thanks,
Rusty.
Ingo Molnar
2009-03-25 08:40:54 UTC
Permalink
Post by Rusty Russell
Post by Ingo Molnar
(Rusty Cc:-ed - for the module.c tracepoints below)
Thanks, tho they look fine and non-intrusive to me.
Thanks - I'll take this as an Acked-by :-)

( Mathieu, mind re-sending a variant that does them via
TRACE_EVENT(), against the tracing tree? That makes it useful not
just to LTTng but the default mainline kernel as well. Thanks! )
Post by Rusty Russell
Post by Ingo Molnar
I believe that to have a complete picture of module usage, module
refcount get/put events should be included as well, beyond the basic
load/free events.
These both have performance impact (a module get/put in a fastpath
hurts scalability), and are informative in terms of establishing the
module dependency graph.
A module_get()/put() should not hurt scalability at all! I went
to great and horrible lengths to ensure that was the case since
the rewrite in 2.4.
I know, it uses percpu refcounts :) Still it can be somewhat
non-trivial in a fastpath:

static inline int try_module_get(struct module *module)
{
int ret = 1;

if (module) {
unsigned int cpu = get_cpu();
if (likely(module_is_live(module)))
local_inc(__module_ref_addr(module, cpu));
else
ret = 0;
put_cpu();
}
return ret;
}

So we want to reduce excessive uses of it.

[ And please forget i mentioned scalability - that bit is fine and
you were offended rightfully :) ]

This reminds me. Couldn't we now:

#ifdef CONFIG_SMP
char *refptr;
#else
local_t ref;
#endif

... unify these bits to just standardize on a per-cpu refptr all the
time, with Tejun's ueber-cool percpu changes in place?
Post by Rusty Russell
But a module dependency graph et. al. would be kind of cool.
Yeah. Can be in a separate patch as well - load/unload events are
meaningful in isolation as well.

Ingo
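To make the refcount suggestion concrete, a hypothetical get-side tracepoint (trace_kernel_module_get() is a made-up name, not part of the posted series) could sit in the helper quoted above, roughly like this; the header/inlining question is ignored for the sake of the sketch:

static inline int try_module_get(struct module *module)
{
	int ret = 1;

	if (module) {
		unsigned int cpu = get_cpu();

		if (likely(module_is_live(module))) {
			local_inc(__module_ref_addr(module, cpu));
			/* hypothetical tracepoint: module + call site */
			trace_kernel_module_get(module, _RET_IP_);
		} else {
			ret = 0;
		}
		put_cpu();
	}
	return ret;
}

A matching trace_kernel_module_put() in module_put() would give the other half of the dependency picture.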
Frederic Weisbecker
2009-03-25 13:06:04 UTC
Permalink
Post by Ingo Molnar
(Rusty Cc:-ed - for the module.c tracepoints below)
Instrument the core kernel: module load/free and printk events. It helps the
tracer to keep track of module-related events and to export valuable printk
information into the traces.
Those tracepoints are used by LTTng.
About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for performance result details.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers at polymtl.ca>
CC: 'Ingo Molnar' <mingo at elte.hu>
CC: Frederic Weisbecker <fweisbec at gmail.com>
CC: Andrew Morton <akpm at linux-foundation.org>
CC: Masami Hiramatsu <mhiramat at redhat.com>
CC: 'Peter Zijlstra' <peterz at infradead.org>
CC: "Frank Ch. Eigler" <fche at redhat.com>
CC: 'Hideo AOKI' <haoki at redhat.com>
CC: Takashi Nishiie <t-nishiie at np.css.fujitsu.com>
CC: 'Steven Rostedt' <rostedt at goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu at linux360.ro>
---
include/trace/kernel.h | 19 +++++++++++++++++++
kernel/module.c | 8 ++++++++
kernel/printk.c | 7 +++++++
3 files changed, 34 insertions(+)
Index: linux-2.6-lttng/kernel/printk.c
===================================================================
--- linux-2.6-lttng.orig/kernel/printk.c 2009-03-24 09:09:52.000000000 -0400
+++ linux-2.6-lttng/kernel/printk.c 2009-03-24 09:31:53.000000000 -0400
@@ -32,6 +32,7 @@
#include <linux/security.h>
#include <linux/bootmem.h>
#include <linux/syscalls.h>
+#include <trace/kernel.h>
#include <asm/uaccess.h>
@@ -59,6 +60,7 @@ int console_printk[4] = {
MINIMUM_CONSOLE_LOGLEVEL, /* minimum_console_loglevel */
DEFAULT_CONSOLE_LOGLEVEL, /* default_console_loglevel */
};
+EXPORT_SYMBOL_GPL(console_printk);
/*
* Low level drivers may need that to know if they can schedule in
@@ -128,6 +130,9 @@ EXPORT_SYMBOL(console_set_on_cmdline);
/* Flag: console code may call schedule() */
static int console_may_schedule;
+DEFINE_TRACE(kernel_printk);
+DEFINE_TRACE(kernel_vprintk);
+
#ifdef CONFIG_PRINTK
static char __log_buf[__LOG_BUF_LEN];
@@ -560,6 +565,7 @@ asmlinkage int printk(const char *fmt, .
int r;
va_start(args, fmt);
+ trace_kernel_printk(_RET_IP_);
r = vprintk(fmt, args);
va_end(args);
@@ -667,6 +673,7 @@ asmlinkage int vprintk(const char *fmt,
printed_len += vscnprintf(printk_buf + printed_len,
sizeof(printk_buf) - printed_len, fmt, args);
+ trace_kernel_vprintk(_RET_IP_, printk_buf, printed_len);
So here we pass in the formatted output. What sense does it make to
have the above printk tracepoint? Little, I think.
Also, I'm not entirely convinced about the wisdom of instrumenting
an essential debug facility like printk(). Let's keep this one at the
tail portion of the patch queue, ok?
Especially the trace_kernel_printk hook, which only probes the printk callers.
I don't think a performance measurement of a printk call is that relevant.

Concerning trace_kernel_vprintk(), if the goal is to capture the printk messages,
I would rather see it done through the dynamic printk facility, or by setting up a
console which routes printk output to trace_printk(), if that is useful for someone.

Thanks,
Frederic.
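A rough sketch of the console idea mentioned above, purely for illustration (the names are invented and this is not part of the patchset): a trivial console whose write method forwards everything into the tracing buffer via trace_printk().

#include <linux/console.h>
#include <linux/kernel.h>

static void trace_console_write(struct console *con, const char *text,
				unsigned int len)
{
	trace_printk("%.*s", (int)len, text);
}

static struct console trace_console = {
	.name	= "trace",
	.write	= trace_console_write,
	.flags	= CON_ENABLED,
	.index	= -1,
};

static int __init trace_console_init(void)
{
	register_console(&trace_console);
	return 0;
}
late_initcall(trace_console_init);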
Mathieu Desnoyers
2009-03-24 15:56:33 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-filemap.patch
URL: <http://lists.casi.polymtl.ca/pipermail/ltt-dev/attachments/20090324/fb535fba/attachment.asc>
Ingo Molnar
2009-03-24 18:39:40 UTC
Permalink
Index: linux-2.6-lttng/mm/filemap.c
+DEFINE_TRACE(wait_on_page_start);
+DEFINE_TRACE(wait_on_page_end);
These are extremely incomplete - to the level of being useless.

To understand the lifetime of the pagecache, the following basic
events have to be observed and instrumented:

- create a new page
- fill in a new page
- dirty a page [when we know this]
- request writeout of a page
- clean a page / complete writeout
- free a page due to MM pressure
- free a page due to truncation/delete

The following additional events are useful as well:

- mmap a page to a user-space address
- copy a page to a user-space address (read)
- write to a page from a user-space address (write)
- unmap a page from a user-space address
- fault in a user-space mapped pagecache page

optional:
- shmem attach/detach events
- shmem map/unmap events
- hugetlb map/unmap events

I'm sure I haven't listed them all. Have a look at the function-graph
tracer output to see what kind of basic events can happen to a
pagecache page.

Ingo
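To make one of the events above concrete, declarations for the "create a new page" and "free a page" cases, in the DECLARE_TRACE style this patchset already uses, could look like the sketch below (names and arguments are illustrative guesses, not an actual proposal):

DECLARE_TRACE(pagecache_add_page,
	TPPROTO(struct address_space *mapping, struct page *page, pgoff_t offset),
	TPARGS(mapping, page, offset));
DECLARE_TRACE(pagecache_free_page,
	TPPROTO(struct address_space *mapping, struct page *page),
	TPARGS(mapping, page));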
Mathieu Desnoyers
2009-03-24 15:56:29 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-softirq.patch
URL: <http://lists.casi.polymtl.ca/pipermail/ltt-dev/attachments/20090324/cfb79406/attachment.txt>
Ingo Molnar
2009-03-24 18:01:23 UTC
Permalink
Compared to Jason's patch, this patch also instruments the softirq raise event.
Mind sending this on top of Jason's patch, which is already applied
to the tracing tree?

Note, your instrumentation misses one source of softirq raises:
@@ -300,6 +306,7 @@ void irq_exit(void)
*/
inline void raise_softirq_irqoff(unsigned int nr)
{
+ trace_irq_softirq_raise(nr);
__raise_softirq_irqoff(nr);
__raise_softirq_irqoff() is used in the networking tree. It makes
little sense to instrument raise_softirq_irqoff() while not
instrumenting __raise_softirq_irqoff().

Ingo
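Sketched out, the point is that the raise event has to sit at the lowest level so that direct callers of __raise_softirq_irqoff() (e.g. in networking) are covered as well. In 2.6.29 __raise_softirq_irqoff() is a macro around or_softirq_pending(); it is shown as an inline here purely for illustration:

static inline void __raise_softirq_irqoff(unsigned int nr)
{
	trace_irq_softirq_raise(nr);
	or_softirq_pending(1UL << nr);
}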
Mathieu Desnoyers
2009-03-24 15:56:28 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-tasklets.patch
URL: <http://lists.casi.polymtl.ca/pipermail/ltt-dev/attachments/20090324/b750aeb1/attachment.asc>
Ingo Molnar
2009-03-24 17:56:21 UTC
Permalink
tasklet entry and exit events.
+DEFINE_TRACE(irq_tasklet_high_entry);
+DEFINE_TRACE(irq_tasklet_high_exit);
+DEFINE_TRACE(irq_tasklet_low_entry);
+DEFINE_TRACE(irq_tasklet_low_exit);
Dunno - tasklets are a legacy mechanism, not sure we want to
instrument them. That some tasklet activity happens will be obvious
from the softirq tracepoints already.

Anyway ... I guess we could do it if it's done via the more complete
TRACE_EVENT() mechanism in the tracing tree.

Ingo
Peter Zijlstra
2009-03-25 14:17:58 UTC
Permalink
Post by Chetan.Loke
Quick question. I understand this is unrelated to this patch, so I
apologize in advance.
Ingo - you mentioned "tasklets are a legacy mechanism". Is there a
plan to phase them out? Let me draw a small picture as to what's
bothering me.
With the SR-IOV support, if there are 'N' virtual functions then there
will be 'N' driver instances (actually N+1, 1 for the PF). If that
driver drains the responses in the interrupt context, then all such
VF instances could virtually block everyone else (because SR-IOV guys
might also have MSI-X enabled).
So now all such drivers should alter their Rx path. Drivers can queue
tasklets and can also get the performance they want.
Any suggestions?
Threaded interrupts?
Chetan.Loke
2009-03-25 17:37:20 UTC
Permalink
Post by Peter Zijlstra
Threaded interrupts?
If we truly need to address performance and scalability for the SCSI subsystem, then we need something lightweight.
Steven Rostedt
2009-03-25 17:52:33 UTC
Permalink
Post by Chetan.Loke
If we truly need to address performance and scalability for the SCSI subsystem, then we need something lightweight.
Threaded interrupts are actually quite light.

-- Steve
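For what it's worth, the threaded-interrupt suggestion corresponds to request_threaded_irq(): the hard handler does the minimum and wakes a handler thread that drains the responses in process context. A sketch with hypothetical foo_* driver names, not taken from this thread:

static irqreturn_t foo_quick_check(int irq, void *dev_id)
{
	/* minimal work in hard-irq context */
	return IRQ_WAKE_THREAD;
}

static irqreturn_t foo_irq_thread(int irq, void *dev_id)
{
	struct foo_dev *dev = dev_id;

	foo_drain_responses(dev);	/* what the tasklet used to do */
	return IRQ_HANDLED;
}

	/* at probe time: */
	err = request_threaded_irq(irq, foo_quick_check, foo_irq_thread,
				   0, "foo", dev);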
Chetan.Loke
2009-03-25 13:52:32 UTC
Permalink
Post by Ingo Molnar
tasklet entry and exit events.
+DEFINE_TRACE(irq_tasklet_high_entry);
+DEFINE_TRACE(irq_tasklet_high_exit);
+DEFINE_TRACE(irq_tasklet_low_entry);
+DEFINE_TRACE(irq_tasklet_low_exit);
Dunno - tasklets are a legacy mechanism, not sure we want to
instrument them.
Quick question. I understand this is unrelated to this patch, so I apologize in advance.
Ingo - you mentioned "tasklets are a legacy mechanism". Is there a plan to phase them out? Let me draw a small picture as to what's bothering me.

With the SR-IOV support, if there are 'N' virtual functions then there will be 'N' driver instances (actually N+1, 1 for the PF). If that driver drains the responses in the interrupt context, then all such VF instances could virtually block everyone else (because SR-IOV guys might also have MSI-X enabled).
So now all such drivers should alter their Rx path. Drivers can queue tasklets and can also get the performance they want.

Any suggestions?

thanks
Chetan
Mathieu Desnoyers
2009-03-24 15:56:31 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-timer.patch
URL: <http://lists.casi.polymtl.ca/pipermail/ltt-dev/attachments/20090324/2beb18ca/attachment.txt>
Ingo Molnar
2009-03-24 18:21:22 UTC
Permalink
Instrument timer activity (timer set, expired, current time
updates) to keep information about the "real time" flow within the
kernel. It can be used by a trace analysis tool to synchronize
information coming from various sources, e.g. to merge traces with
system logs.
Those tracepoints are used by LTTng.
About the performance impact of tracepoints (which is comparable
to markers), even without immediate values optimizations, tests
done by Hideo Aoki on ia64 show no regression. His test case was
using hackbench on a kernel where scheduler instrumentation (about
5 events in core scheduler code) was added. See the "Tracepoints"
patch header for performance result details.
Note: do_setitimer instrumentation could potentially be done with
a more generic system call instrumentation.
CC: 'Steven Rostedt' <rostedt at goodmis.org>
CC: Eduard - Gabriel Munteanu <eduard.munteanu at linux360.ro>
---
include/trace/timer.h | 24 ++++++++++++++++++++++++
kernel/itimer.c | 8 ++++++++
kernel/timer.c | 12 +++++++++++-
3 files changed, 43 insertions(+), 1 deletion(-)
For complete timer instrumentation, hrtimers should be instrumented
as well.
return HRTIMER_NORESTART;
@@ -148,6 +154,8 @@ int do_setitimer(int which, struct itime
!timeval_valid(&value->it_interval))
return -EINVAL;
+ trace_timer_itimer_set(which, value);
For consistency, I'd name the timer tracepoints the following
way:

trace_timer_init()
trace_itimer_init()
trace_hrtimer_init()

trace_timer_start()
trace_itimer_start()
trace_hrtimer_start()

trace_timer_expire()
trace_itimer_expire()
trace_hrtimer_expire()

trace_timer_cancel()
trace_itimer_cancel()
trace_hrtimer_cancel()

the init methods are missing from your patch entirely - and the rest
is partially incomplete as well.

Instrumentation of the del_timer() variants is missing. For a
complete lifetime analysis of timers this cannot be left out.
+DEFINE_TRACE(timer_set);
+DEFINE_TRACE(timer_timeout);
these two should be 'timer_start' and 'timer_expire'.
+DEFINE_TRACE(timer_update_time);
This is a misnomer. This is in timer.c but is not a 'timer'
tracepoint - this is a time of day tracepoint - and this should be
reflected in its naming - so it should be: trace_gtod_update_xtime()
or so.
@@ -357,6 +363,7 @@ static void internal_add_timer(struct tv
i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
vec = base->tv5.vec + i;
}
+ trace_timer_set(timer);
/*
*/
this should be trace_timer_start().
@@ -1121,6 +1128,7 @@ void do_timer(unsigned long ticks)
{
jiffies_64 += ticks;
update_times(ticks);
+ trace_timer_update_time(&xtime, &wall_to_monotonic);
}
gtod_update_xtime().

The GTOD instrumentation should move into a separate patch, and
should probably be extended with NTP correction events as well.
#ifdef __ARCH_WANT_SYS_ALARM
@@ -1202,7 +1210,9 @@ SYSCALL_DEFINE0(getegid)
static void process_timeout(unsigned long __data)
{
- wake_up_process((struct task_struct *)__data);
+ struct task_struct *task = (struct task_struct *)__data;
+ trace_timer_timeout(task);
+ wake_up_process(task);
[ nit: missing newline after local variable declaration. ]

this should be trace_timer_expire().
Index: linux-2.6-lttng/include/trace/timer.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/include/trace/timer.h 2009-03-24 09:31:51.000000000 -0400
@@ -0,0 +1,24 @@
+#ifndef _TRACE_TIMER_H
+#define _TRACE_TIMER_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_TRACE(timer_itimer_expired,
+ TPPROTO(struct signal_struct *sig),
+ TPARGS(sig));
+DECLARE_TRACE(timer_itimer_set,
+ TPPROTO(int which, struct itimerval *value),
+ TPARGS(which, value));
+DECLARE_TRACE(timer_set,
+ TPPROTO(struct timer_list *timer),
+ TPARGS(timer));
+/*
+ * xtime_lock is taken when kernel_timer_update_time tracepoint is reached.
+ */
+DECLARE_TRACE(timer_update_time,
+ TPPROTO(struct timespec *_xtime, struct timespec *_wall_to_monotonic),
+ TPARGS(_xtime, _wall_to_monotonic));
+DECLARE_TRACE(timer_timeout,
+ TPPROTO(struct task_struct *p),
+ TPARGS(p));
+#endif
For completeness and consistency, trace_itimer_start() should be
split into two cases:

if (value)
trace_itimer_start(which, value);
else
trace_itimer_cancel(which);

That extra branch does not matter much, and the resulting
instrumentation is more consistent across all the timer types.

So this patch needs more work, but it's a good first step.

It would also be fantastic if you did them via the TRACE_EVENT()
mechanism in the tracing tree. It would still be just as useful to
LTTng - but it would also be useful to ftrace that way.

Thanks,

Ingo
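The naming scheme above, written out in the DECLARE_TRACE style the patchset uses (the prototypes are guesses for illustration only):

DECLARE_TRACE(timer_init,
	TPPROTO(struct timer_list *timer),
	TPARGS(timer));
DECLARE_TRACE(timer_start,
	TPPROTO(struct timer_list *timer, unsigned long expires),
	TPARGS(timer, expires));
DECLARE_TRACE(timer_expire,
	TPPROTO(struct timer_list *timer),
	TPARGS(timer));
DECLARE_TRACE(timer_cancel,
	TPPROTO(struct timer_list *timer),
	TPARGS(timer));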
Thomas Gleixner
2009-03-24 19:14:06 UTC
Permalink
Post by Ingo Molnar
the init methods are missing from your patch entirely - and the rest
is partially incomplete as well.
Instrumentation of the del_timer() variants is missing. For a
complete lifetime analysis of timers this cannot be left out.
Look at the debugobjects hooks. They cover the complete timer
lifetime already.

Piggybacking tracepoints on those would simplify maintenance and
reduce the number of extra lines in the code.

Thanks,

tglx
Ingo Molnar
2009-03-24 20:47:19 UTC
Permalink
Post by Thomas Gleixner
Post by Ingo Molnar
the init methods are missing from your patch entirely - and the rest
is partially incomplete as well.
Instrumentation of the del_timer() variants is missing. For a
complete lifetime analysis of timers this cannot be left out.
Look at the debugobjects hooks. They cover the complete timer
lifetime already.
Piggybacking tracepoints on those would simplify maintenance and
reduce the number of extra lines in the code.
Makes sense. They don't cover itimers or hrtimers, though.

Ingo
Thomas Gleixner
2009-03-27 22:05:46 UTC
Permalink
Post by Ingo Molnar
Post by Thomas Gleixner
Post by Ingo Molnar
the init methods are missing from your patch entirely - and the rest
is partially incomplete as well.
Instrumentation of the del_timer() variants is missing. For a
complete lifetime analysis of timers this cannot be left out.
Look at the debugobjects hooks. They cover the complete timer
lifetime already.
Piggybacking tracepoints on those would simplify maintenance and
reduce the number of extra lines in the code.
Makes sense. They don't cover itimers or hrtimers, though.
They cover all (hr)timers independent of the call site. And we really
do not need any freaking call site instrumented.

The CPU* timer stuff is a different playground and needs to be
instrumented separately.

Thanks,

tglx
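A sketch of the piggybacking idea in kernel/timer.c (illustrative only; in practice the activation hook would have to be compiled in even without CONFIG_DEBUG_OBJECTS_TIMERS, with just the debug_object_activate() call kept under the config option):

static inline void debug_timer_activate(struct timer_list *timer)
{
	trace_timer_start(timer, timer->expires);
#ifdef CONFIG_DEBUG_OBJECTS_TIMERS
	debug_object_activate(timer, &timer_debug_descr);
#endif
}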
Mathieu Desnoyers
2009-03-24 15:56:27 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-irq.patch
URL: <http://lists.casi.polymtl.ca/pipermail/ltt-dev/attachments/20090324/23f29076/attachment.asc>
Jason Baron
2009-03-24 17:33:54 UTC
Permalink
Instrumentation of IRQ related events : irq_entry, irq_exit and
irq_next_handler.
It allows tracers to perform latency analysis on those various types of
interrupts and to detect interrupts with max/min/avg duration. It helps
detecting driver or hardware problems which cause an ISR to take ages to
execute. It has been shown to be the case with bogus hardware causing an mmio
read to take a few milliseconds.
Those tracepoints are used by LTTng.
About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in the scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.
irq_entry and irq_exit not declared static because they appear in x86 arch code.
The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
exit events is to be able to recreate the kernel execution state at a given
point in time. Knowing which execution context is responsible for a given trace
event is _very_ valuable in trace data analysis.
The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
provides information about which handler is being called, but does not map
correctly to the fact that _multiple_ handlers are being called from within the
same interrupt handler. From an interrupt latency analysis POV, this is
incorrect.
Since we are passing back the irq number, and we can not be interrupted
by the same irq, I think it should be pretty clear we are in the same
handler. That said, the extra entry/exit tracepoints could make the
sequence of events simpler to decipher, which is important. The code
looks good, and provides at least as much information as the patch that
I proposed. So I'll be happy either way :)

thanks,

-Jason
Ingo Molnar
2009-03-24 17:50:49 UTC
Permalink
Post by Jason Baron
Instrumentation of IRQ related events : irq_entry, irq_exit and
irq_next_handler.
It allows tracers to perform latency analysis on those various types of
interrupts and to detect interrupts with max/min/avg duration. It helps
detecting driver or hardware problems which cause an ISR to take ages to
execute. It has been shown to be the case with bogus hardware causing an mmio
read to take a few milliseconds.
Those tracepoints are used by LTTng.
About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in the scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.
irq_entry and irq_exit not declared static because they appear in x86 arch code.
The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
exit events is to be able to recreate the kernel execution state at a given
point in time. Knowing which execution context is responsible for a given trace
event is _very_ valuable in trace data analysis.
The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
provides information about which handler is being called, but does not map
correctly to the fact that _multiple_ handlers are being called from within the
same interrupt handler. From an interrupt latency analysis POV, this is
incorrect.
Since we are passing back the irq number, and we can not be
interrupted by the same irq, I think it should be pretty clear we
are in the same handler. That said, the extra entry/exit
tracepoints could make the sequence of events simpler to decipher,
which is important. The code looks good, and provides at least as
much information as the patch that I proposed. So i'll be happy
either way :)
We already have your patch merged up in the tracing tree and it
gives entry+exit tracepoints.

Ingo
Jason Baron
2009-03-24 17:57:50 UTC
Permalink
Post by Ingo Molnar
Post by Jason Baron
Instrumentation of IRQ related events : irq_entry, irq_exit and
irq_next_handler.
It allows tracers to perform latency analysis on those various types of
interrupts and to detect interrupts with max/min/avg duration. It helps
detecting driver or hardware problems which cause an ISR to take ages to
execute. It has been shown to be the case with bogus hardware causing an mmio
read to take a few milliseconds.
Those tracepoints are used by LTTng.
About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in the scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.
irq_entry and irq_exit not declared static because they appear in x86 arch code.
The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
exit events is to be able to recreate the kernel execution state at a given
point in time. Knowing which execution context is responsible for a given trace
event is _very_ valuable in trace data analysis.
The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
provides information about which handler is being called, but does not map
correctly to the fact that _multiple_ handlers are being called from within the
same interrupt handler. From an interrupt latency analysis POV, this is
incorrect.
Since we are passing back the irq number, and we can not be
interrupted by the same irq, I think it should be pretty clear we
are in the same handler. That said, the extra entry/exit
tracepoints could make the sequence of events simpler to decipher,
which is important. The code looks good, and provides at least as
much information as the patch that I proposed. So i'll be happy
either way :)
We already have your patch merged up in the tracing tree and it
gives entry+exit tracepoints.
Ingo
Maybe I wasn't clear. Entry and exit as I proposed and as in the tracing
tree are for entry and exit into each handler per irq. Mathieu is
proposing an entry/exit tracepoint per irq, and a 3rd tracepoint to
tell us which handler is being called and its return code. Hope this is
clear.

thanks,

-Jason
Ingo Molnar
2009-03-24 19:12:52 UTC
Permalink
Post by Jason Baron
Post by Ingo Molnar
Post by Jason Baron
Instrumentation of IRQ related events : irq_entry, irq_exit and
irq_next_handler.
It allows tracers to perform latency analysis on those various types of
interrupts and to detect interrupts with max/min/avg duration. It helps
detecting driver or hardware problems which cause an ISR to take ages to
execute. It has been shown to be the case with bogus hardware causing an mmio
read to take a few milliseconds.
Those tracepoints are used by LTTng.
About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in the scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.
irq_entry and irq_exit not declared static because they appear in x86 arch code.
The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
exit events is to be able to recreate the kernel execution state at a given
point in time. Knowing which execution context is responsible for a given trace
event is _very_ valuable in trace data analysis.
The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
provides information about which handler is being called, but does not map
correctly to the fact that _multiple_ handlers are being called from within the
same interrupt handler. From an interrupt latency analysis POV, this is
incorrect.
Since we are passing back the irq number, and we can not be
interrupted by the same irq, I think it should be pretty clear we
are in the same handler. That said, the extra entry/exit
tracepoints could make the sequence of events simpler to decipher,
which is important. The code looks good, and provides at least as
much information as the patch that I proposed. So i'll be happy
either way :)
We already have your patch merged up in the tracing tree and it
gives entry+exit tracepoints.
Ingo
maybe i wasn't clear. Entry and exit as I proposed and as in the
tracing tree are for entry and exit into each handler per irq.
Mathieu is proposing an entry/exit tracepoint per irq, and a 3rd
tracepoint to tell us which handler is being called and its return
code. hope this is clear.
OK, I misunderstood that.

Mathieu's is slightly more compact, but yours is more logical.

I believe your pre/post IRQ handler callback is the right model - it
decouples device IRQ handling from any notion of 'IRQ'.

For example, we could correctly express "handler got executed by an
IRQ thread" via it - while via Mathieu's scheme it does not really
map to that.

So then I think there should be a third tracepoint in addition to
your two existing tracepoints: a 'raw vector' type of tracepoint.
It would be added both to the do_IRQ() entry point and to the various
common SMP IPI entry points: reschedule, TLB flush and the local timer
IRQ tick.

The best information there to pass to the probe is the raw vector
number, and the pt_regs structure.
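
[ A minimal sketch of such a declaration, reusing the DECLARE_TRACE()
form used elsewhere in this patchset - the name and prototype are
illustrative only: ]

DECLARE_TRACE(irq_vector_entry,
        TPPROTO(int vector, struct pt_regs *regs),
        TPARGS(vector, regs));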

Hm?

Ingo
Mathieu Desnoyers
2009-03-24 20:11:47 UTC
Permalink
Post by Ingo Molnar
Post by Jason Baron
Post by Ingo Molnar
Post by Jason Baron
Instrumentation of IRQ related events : irq_entry, irq_exit and
irq_next_handler.
It allows tracers to perform latency analysis on those various types of
interrupts and to detect interrupts with max/min/avg duration. It helps
detecting driver or hardware problems which cause an ISR to take ages to
execute. It has been shown to be the case with bogus hardware causing an mmio
read to take a few milliseconds.
Those tracepoints are used by LTTng.
About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in the scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.
irq_entry and irq_exit not declared static because they appear in x86 arch code.
The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
exit events is to be able to recreate the kernel execution state at a given
point in time. Knowing which execution context is responsible for a given trace
event is _very_ valuable in trace data analysis.
The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
provides information about which handler is being called, but does not map
correctly to the fact that _multiple_ handlers are being called from within the
same interrupt handler. From an interrupt latency analysis POV, this is
incorrect.
Since we are passing back the irq number, and we can not be
interrupted by the same irq, I think it should be pretty clear we
are in the same handler. That said, the extra entry/exit
tracepoints could make the sequence of events simpler to decipher,
which is important. The code looks good, and provides at least as
much information as the patch that I proposed. So i'll be happy
either way :)
We already have your patch merged up in the tracing tree and it
gives entry+exit tracepoints.
Ingo
maybe i wasn't clear. Entry and exit as I proposed and as in the
tracing tree are for entry and exit into each handler per irq.
Mathieu is proposing an entry/exit tracepoint per irq, and a 3rd
tracepoint to tell us which handler is being called and its return
code. hope this is clear.
Ok, i misunderstood that.
Mathieu's is slightly more compact, but yours is more logical.
I believe your pre/post IRQ handler callback is the right model - it
decouples device IRQ handling from any notion of 'IRQ'.
For example, we could correctly express "handler got executed by an
IRQ thread" via it - while via Mathieu's scheme it does not really
map to that.
So if then i think there should be a third tracepoint in addition to
your two existing tracepoints: a 'raw vector' type of tracepoint.
It's added both to do_IRQ() entry point, but also to the various
common SMP IPI entry points: reschedule, TLB flush and local timer
IRQ tick.
The best information there to pass to the probe is the raw vector
number, and the ptregs structure.
Hm?
This third type of tracepoint for IRQs you are talking about is actually
what I had in LTTng. I decided to add the irq_next_handler and to add an
action field to irq_entry to include the irq handler information needed
by Jason.

If we want to do this logically, without thinking about tracer
performance impact, we could/should do :

trace_irq_entry(irqno, pt_regs)
for_each_handler() {
	trace_irq_handler_entry(action)
	action->handler()
	trace_irq_handler_exit(ret)
}
trace_irq_exit(retval)

And add the irq_entry/irq_exit events to the arch-specific reschedule,
tlb flush, local timer irq, as I have in my lttng tree already.

But given that the trace_irq_handler_entry/trace_irq_handler_exit events
could be combined, and given that we can record action and ret in the
irq_entry/exit events, I decided to remove 2 tracepoints (out of 4) from
the single-handler fast path by adding this information to the irq
entry/exit events, and to combine the irq_handler entry/exit
into a single next_handler event, which records the previous ret value
and the next action to execute.

On an interrupt-driven workload, this will have a significant impact
(2 events vs 4).
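
[ A rough sketch of how that combined scheme lays out around the handler
loop, modelled on handle_IRQ_event() - names and exact placement are
illustrative, following the description above: ]

        irqreturn_t ret = IRQ_NONE;

        trace_irq_entry(irq, regs, action);     /* entry carries the first action */
        do {
                ret = action->handler(irq, action->dev_id);
                action = action->next;
                if (action)
                        trace_irq_next_handler(irq, action, ret);
        } while (action);
        trace_irq_exit(ret);    /* exit carries the last handler's return value */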

If we add interrupt threads to the kernel, then we can switch to the
following scheme :

* instrumentation of the real interrupt handler :

trace_irq_entry(irqno, pt_regs)

trace_irq_exit(ret)

* instrumentation of the irq threads :

trace_irq_thread_entry()

trace_irq_thread_exit()

I don't see why we should insist on making the tracepoints "logical",
especially if it hurts performance considerably. Doing these
implementation-specific versions of irq tracepoints would provide the
best performance we can get when tracing. It's up to the tracers to
specialize their analysis based on the underlying IRQ mechanism
(non-threaded vs threaded).

This is actually a very good example of what Christoph Hellwig, Peter
Zijlstra and I have been trying to warn you about concerning the TRACE_EVENT
macro : it exports the tracepoints to userspace, and thus makes them a
userspace-visible API, when those tracepoints should be tied to the
kernel code and nothing else. An adaptation layer should provide the
abstractions that makes the information presented to the user more
"logical".

Mathieu
Post by Ingo Molnar
Ingo
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Ingo Molnar
2009-03-24 20:51:20 UTC
Permalink
Post by Mathieu Desnoyers
If we want to do this logically, without thinking about tracer
trace_irq_entry(irqno, pt_regs)
for_each_handler() {
trace_irq_handler_entry(action)
action->handler()
trace_irq_handler_exit(ret)
}
trace_irq_exit(retval)
Not really.

As I said, the handler invocation should be thought of separately
from vectored IRQs. A device IRQ handler can be invoked in a number
of non-vectored ways: it can be called in an IRQ thread for example.
(there are other invocation modes possible too)

For IRQ vectors, the 'retval' makes little sense - so the exit event
can be left out.

Which leaves us with what i suggested: to add an IRQ vector entry
event on top of Jason's (already integrated) patches.

Hm?

Ingo
Ingo Molnar
2009-03-25 08:47:24 UTC
Permalink
Post by Ingo Molnar
Post by Mathieu Desnoyers
If we want to do this logically, without thinking about tracer
trace_irq_entry(irqno, pt_regs)
for_each_handler() {
trace_irq_handler_entry(action)
action->handler()
trace_irq_handler_exit(ret)
}
trace_irq_exit(retval)
Not really.
Put differently: we seem to agree on handler invocation entry/exit
events (and those are handled by Jason's patch already), and that's
good.

I don't think we need two events for physical IRQ events though. One
is enough: entry event. What meaning does 'retval' have in your
pseudo-code above? None typically.

Ingo
Mathieu Desnoyers
2009-03-25 18:30:24 UTC
Permalink
Post by Ingo Molnar
Post by Ingo Molnar
Post by Mathieu Desnoyers
If we want to do this logically, without thinking about tracer
trace_irq_entry(irqno, pt_regs)
for_each_handler() {
trace_irq_handler_entry(action)
action->handler()
trace_irq_handler_exit(ret)
}
trace_irq_exit(retval)
Not really.
Put differently: we seem to agree on handler invocation entry/exit
events (and those are handled by Jason's patch already), and that's
good.
Yes, we need to get this information into the trace stream somehow.
Post by Ingo Molnar
I dont think we need two events for physical IRQ events though. One
is enough: entry event. What meaning does 'retval' have in your
pseudo-code above? None typically.
Hrm, an event means more than the "payload" it has. Just saving "here is
a return from interrupt handler" without any return value is valuable
per se because it lets the trace analyzer know when the hardirq handler
ended.

e.g., if we instrument the x86 local apic irq handler only with an "irq
entry" event, we would not be able to detect if the following events are
nested on top of the IRQ handler or are executed in thread context.

Having those two hardirq entry/exit events is also especially useful to
figure out the maximum interrupt latency caused by long interrupt
handlers (or by many handlers hooked on the same interrupt
line).
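
[ As an illustration, a sketch of a probe pair that uses exactly those two
events to track the longest hardirq handler duration per CPU - the probe
prototypes follow the irq_entry(irqno, pt_regs)/irq_exit(retval) description
in this thread, and the probe itself is hypothetical; it would be hooked up
via the register_trace_irq_entry()/register_trace_irq_exit() functions that
DECLARE_TRACE() generates: ]

#include <linux/interrupt.h>
#include <linux/percpu.h>
#include <linux/ptrace.h>
#include <linux/sched.h>        /* sched_clock() */

static DEFINE_PER_CPU(u64, irq_entry_ts);
static DEFINE_PER_CPU(u64, irq_max_ns);

static void probe_irq_entry(unsigned int irq, struct pt_regs *regs)
{
        /* Timestamp the hardirq handler entry on this CPU. */
        __get_cpu_var(irq_entry_ts) = sched_clock();
}

static void probe_irq_exit(irqreturn_t ret)
{
        /* Duration of the handler that just finished. */
        u64 delta = sched_clock() - __get_cpu_var(irq_entry_ts);

        if (delta > __get_cpu_var(irq_max_ns))
                __get_cpu_var(irq_max_ns) = delta;
}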

Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Steven Rostedt
2009-03-25 02:00:49 UTC
Permalink
Post by Mathieu Desnoyers
This is actually a very good example of what Christoph Hellwig, Peter
Zijlstra and myself have been trying to warn you about the TRACE_EVENT
macro : it exports the tracepoints to userspace, and thus makes them a
userspace-visible API, when those tracepoints should be tied to the
kernel code and nothing else. An adaptation layer should provide the
abstractions that makes the information presented to the user more
"logical".
Let me correct you here. TRACE_EVENT does ***NOT*** export anything to
userspace. There is no code whatsoever in TRACE_EVENT that does so.

Now, ftrace does export information using TRACE_EVENT to userspace. But
that is the way ftrace wants to handle it. There's nothing needed to
export to userspace. What is exported, is exported ***BECAUSE*** it can
change. I'll only try to keep the format that is exported the same. But
nothing should rely on what the format represents staying the same.

If someone adds a TRACE_EVENT, you can use it to record your data any way
you like. Ftrace will use it to show how to read the binary data, which
is only needed if you want to do that. It uses the print format to print
to the console in case of failure. Or to the trace file, which by the way
can also change without notice.


-- Steve
Mathieu Desnoyers
2009-03-26 18:27:22 UTC
Permalink
Post by Steven Rostedt
Post by Mathieu Desnoyers
This is actually a very good example of what Christoph Hellwig, Peter
Zijlstra and myself have been trying to warn you about the TRACE_EVENT
macro : it exports the tracepoints to userspace, and thus makes them a
userspace-visible API, when those tracepoints should be tied to the
kernel code and nothing else. An adaptation layer should provide the
abstractions that makes the information presented to the user more
"logical".
Let me correct you here. TRACE_EVENT does ***NOT*** export anything to
userspace. There is no code whatsoever in TRACE_EVENT that does so.
Except the comment on top of TRACE_EVENT() in tracepoint.h maybe ?

* *
* * Fast binary tracing: define the trace record via
* * TP_STRUCT__entry(). You can think about it like a
* * regular C structure local variable definition.
* *
* * This is how the trace record is structured and will
* * be saved into the ring buffer. These are the fields
* * that will be exposed to user-space in
* * /debug/tracing/events/<*>/format.

You don't need to have code within the infrastructure to actually export
stuff to userspace. Stating in the API that some information will be
presented to userspace counts.
Post by Steven Rostedt
Now, ftrace does export information using TRACE_EVENT to userspace. But
that is the way ftrace wants to handle it. There's nothing needed to
export to userspace. What is exported, is exported ***BECAUSE*** it can
change. I'll only try to keep the format that is exported the same. But
nothing should rely on what the format represents staying the same.
If someone adds a TRACE_EVENT, you can use it to record your data any way
you like. Ftrace will use it to show how to read the binary data, which
is only needed if you want to do that. It uses the print format to print
to the console in case of failure. Or to the trace file, which by the way
can also change without notice.
Hrm, so you are planning to add, in
include/trace/sched_event_types.h, things like :

TRACE_EVENT(sched_kthread_stop,

	TP_PROTO(struct task_struct *t),

	TP_ARGS(t),

	TP_STRUCT__entry(
		__array( char,  comm,   TASK_COMM_LEN )
		__field( pid_t, pid )
	),

	TP_fast_assign(
		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
		__entry->pid = t->pid;
	),

	TP_printk("task %s:%d", __entry->comm, __entry->pid)
);

Actually specifying that the TP_STRUCT__entry, TP_fast_assign and
TP_printk are specific to ftrace and should not be considered as a
stable API, am I correct ?

Then only when ftrace is built in the kernel do we have
kernel/trace/events.c including the holy include/trace/trace_events.h,
which includes all the trace_events headers, and then including
kernel/trace/trace_events_stage_{1,2,3}.h to override the TRACE_EVENT
macro and create the callbacks for each TRACE_EVENT statement.

The big problem with this is that, whether you like it or not, you
*are* adding an API to the tracepoints, but just validating the types
when ftrace is being built. If you want to add an API to the
tracepoints, then the type verification should be done _even when ftrace
is disabled_.

Therefore, the TRACE_EVENT in tracepoint.h should map to macros that
would verify the argument types.
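
[ One way to get that - a sketch only, assuming a PARAMS()-style
pass-through helper: when no tracer overrides it, TRACE_EVENT() could
simply degenerate to a plain DECLARE_TRACE(), so the prototype and
arguments stay compiled and type-checked even with ftrace disabled: ]

#define TRACE_EVENT(name, proto, args, tstruct, assign, print)	\
	DECLARE_TRACE(name, PARAMS(proto), PARAMS(args))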

I think it's ok to specify that the arguments of a given TRACE_EVENT may
change without notice. We have to say it explicitly in the TRACE_EVENT
header though.

And while we are talking about this, I also wonder why we don't simply embed
the result of the TP_fast_assign and TP_printk in the tracepoint
unlikely branch ? This would skip a function call in the tracing fast
path, and would save the overhead of a whole function call when tracing
is active. It implies a tighter coupling between tracepoints and data
output, but given you are already declaring this in the same API, I
don't see the problem anymore.

About the struct ftrace_raw_##name created in stage 1, I think the
padding at the end of the structure is a complete waste of space. You
should probably have a look at the ltt-type-serializer.[ch] in LTTng.

I am still unsure that the approach you take with TRACE_EVENT, which I
would call "automatically generating code using header and macro
tricks", is in the end easier to review than the simple C callback
approach I have taken in LTTng, adding the "glue" in the
DEFINE_MARKER_TP() macros to connect a specific C callback to the actual
tracepoint.

The good thing about your approach is that everything about a trace event
can be declared within the same macro. In LTTng, I have to create probe
modules and write stuff like :

void probe_irq_softirq_exit(struct softirq_action *h,
		struct softirq_action *softirq_vec);

DEFINE_MARKER_TP(kernel, softirq_exit, irq_softirq_exit,
		probe_irq_softirq_exit, "softirq_id #1u%lu");

void probe_irq_softirq_exit(struct softirq_action *h,
		struct softirq_action *softirq_vec)
{
	struct marker *marker;
	unsigned char data;

	data = ((unsigned long)h - (unsigned long)softirq_vec) / sizeof(*h);

	marker = &GET_MARKER(kernel, softirq_exit);
	ltt_specialized_trace(marker, marker->single.probe_private,
			&data, sizeof(data), sizeof(data));
}

by hand, and then the kernel softirq_exit event is shown in
debugfs/ltt/markers/kernel/softirq_exit/. But I don't see the big win
you get by doing it in TRACE_EVENT, especially if it is not
type-verified when ftrace is disabled, compared to adding those
callbacks in standard kernel modules, which is a coding-style we have
been used to for years.

I haven't seen much automatically generated code around in the kernel
tree, maybe there is a good reason ? I can't wait to see the first
kernel JIT based on ftrace. ;-)

Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Steven Rostedt
2009-03-27 22:53:13 UTC
Permalink
Post by Mathieu Desnoyers
Post by Steven Rostedt
Post by Mathieu Desnoyers
This is actually a very good example of what Christoph Hellwig, Peter
Zijlstra and myself have been trying to warn you about the TRACE_EVENT
macro : it exports the tracepoints to userspace, and thus makes them a
userspace-visible API, when those tracepoints should be tied to the
kernel code and nothing else. An adaptation layer should provide the
abstractions that makes the information presented to the user more
"logical".
Let me correct you here. TRACE_EVENT does ***NOT*** export anything to
userspace. There is no code whatsoever in TRACE_EVENT that does so.
Except the comment on top of TRACE_EVENT() in tracepoint.h maybe ?
* *
* * Fast binary tracing: define the trace record via
* * TP_STRUCT__entry(). You can think about it like a
* * regular C structure local variable definition.
* *
* * This is how the trace record is structured and will
* * be saved into the ring buffer. These are the fields
* * that will be exposed to user-space in
* * /debug/tracing/events/<*>/format.
You don't need to have code within the infrastructure to actually export
stuff to userspace. Stating in the API that some information will be
presented to userspace counts.
That is done by ftrace, not TRACE_EVENT.
Post by Mathieu Desnoyers
Post by Steven Rostedt
Now, ftrace does export information using TRACE_EVENT to userspace. But
that is the way ftrace wants to handle it. There's nothing needed to
export to userspace. What is exported, is exported ***BECAUSE*** it can
change. I'll only try to keep the format that is exported the same. But
nothing should rely on what the format represents staying the same.
If someone adds a TRACE_EVENT, you can use it to record your data any way
you like. Ftrace will use it to show how to read the binary data, which
is only needed if you want to do that. It uses the print format to print
to the console in case of failure. Or to the trace file, which by the way
can also change without notice.
Hrm, so you are planning to add, in
TRACE_EVENT(sched_kthread_stop,
TP_PROTO(struct task_struct *t),
TP_ARGS(t),
TP_STRUCT__entry(
__array( char, comm, TASK_COMM_LEN )
__field( pid_t, pid )
),
TP_fast_assign(
memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
__entry->pid = t->pid;
),
TP_printk("task %s:%d", __entry->comm, __entry->pid)
);
Actually specifying that the TP_STRUCT__entry, TP_fast_assign and
TP_printk are specific to ftrace and should not be considered as a
stable API, am I correct ?
No, they are not specific to ftrace. I said they are used by ftrace.
They can be modified at will by the maintainer. The macros make the
updates automatic for ftrace.
Post by Mathieu Desnoyers
Then only when ftrace is built in the kernel do we have
kernel/trace/events.c including the holy include/trace/trace_events.h,
which includes all the trace_events headers, and then including
kernel/trace/trace_events_stage_{1,2,3}.h to override the TRACE_EVENT
macro and create the callbacks for each TRACE_EVENT statements.
Currently ftrace is the only user of include/trace/trace_events.h.
Others are welcome to do what they wish.
Post by Mathieu Desnoyers
The big problem with this is that, whether you like it or not, you
*are* adding an API to the tracepoints, but just validating the types
when ftrace is being built. If you want to add an API to the
tracepoints, then the type verification should be done _even when ftrace
is disabled_.
It's an internal API, not a user-space one. Internal APIs are fine, and can
change whenever we want.
Post by Mathieu Desnoyers
Therefore, the TRACE_EVENT in tracepoint.h should map to macros that
would verify the argument types.
We could do that too.
Post by Mathieu Desnoyers
I think it's ok to specify that the arguments of a given TRACE_EVENT may
change without notice. We have to say it explicitly in the TRACE_EVENT
header though.
Sure, I totally agree.
Post by Mathieu Desnoyers
And while we talk about this, I also wonder why we won't simply embed
the result of the TP_fast_assign and TP_printk in the tracepoint
unlikely branch ? This would skip a function call in the tracing fast
path, and would save the overhead of a whole function call when tracing
is active. But that implies more bounds between tracepoints and data
output, but given you are already declaring this in the same API, I
don't see the problem anymore.
I'm not exactly sure what you mean here. You mean to add it at the location in
the code? If that is the case, I disagree, since we don't want to clutter the
code with tracepoint data. It is nice to keep it out in separate headers.

This is what we also think about #ifdefs: it is better to have static
inlines in headers than to clutter the code with them.
Post by Mathieu Desnoyers
About the struct ftrace_raw_##name created in stage 1, I think the
padding at the end of the structure is a complete waste of space. You
should probably have a look at the ltt-type-serializer.[ch] in LTTng.
I don't know what padding you are talking about.
Post by Mathieu Desnoyers
I am still unsure that the approach you take with TRACE_EVENT, which I
would call "automatically generating code using header and macro
tricks", is in the end easier to review than the simple C callback
approach I have taken in LTTng, adding the "glue" in the
DEFINE_MARKER_TP() macros to connect a specific C callback to the actual
tracepoint.
The good thing about your approach is that everything about a trace event
can be declared within the same macro. In LTTng, I have to create probe
void probe_irq_softirq_exit(struct softirq_action *h,
struct softirq_action *softirq_vec);
DEFINE_MARKER_TP(kernel, softirq_exit, irq_softirq_exit,
probe_irq_softirq_exit, "softirq_id #1u%lu");
void probe_irq_softirq_exit(struct softirq_action *h,
struct softirq_action *softirq_vec)
{
struct marker *marker;
unsigned char data;
data = ((unsigned long)h - (unsigned long)softirq_vec) / sizeof(*h);
marker = &GET_MARKER(kernel, softirq_exit);
ltt_specialized_trace(marker, marker->single.probe_private,
&data, sizeof(data), sizeof(data));
}
by hand, and then the kernel softirq_exit event is shown in
debugfs/ltt/markers/kernel/softirq_exit/. But I don't see the big win
you get by doing it in TRACE_EVENT, especially if it is not
type-verified when ftrace is disabled, compared to adding those
callbacks in standard kernel modules, which is a coding-style we have
been used to for years.
Well, I originally had it like you did above. But I found that if I did
that, the number of events would be greatly limited. Once I added this
infrastructure, a bunch of events could be added quickly.

As for the type checking, that should be easy to add. Don't need ftrace to
do it.

-- Steve
Post by Mathieu Desnoyers
I haven't seen much automatically generated code around in the kernel
tree, maybe there is a good reason ? I can't wait to see the first
kernel JIT based on ftrace. ;-)
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Steven Rostedt
2009-03-25 02:09:07 UTC
Permalink
Post by Mathieu Desnoyers
This third type of tracepoint for IRQs you are talking about is actually
what I had in LTTng. I decided to add the irq_next_handler and to add a
action field to irq_entry to include the irq handler information needed
by Jason.
If we want to do this logically, without thinking about tracer
trace_irq_entry(irqno, pt_regs)
for_each_handler() {
trace_irq_handler_entry(action)
action->handler()
trace_irq_handler_exit(ret)
}
trace_irq_exit(retval)
And add the irq_entry/irq_exit events to the arch-specific reschedule,
tlb flush, local timer irq, as I have in my lttng tree already.
But given the trace_irq_handler_entry/trace_irq_handler_exit events
could be combined, given we can record action and ret in the
irq_entry/exit events, I decided to remove 2 tracepoints (out of 4) from
the single-handler fast path by adding this information to the irq
entry/exit events, and decided to combine the irq_handler entry/exit
into a single next_handler event, which records the previous ret value
and the next action to execute.
On an interrupt-driven workload, it will have a significant impact.
(2 events vs 4).
I thought tracepoints, while not active, are very low overhead.

If you only want to use 2 of the 4, would that be just as fast?
Post by Mathieu Desnoyers
If we add interrupt threads to the kernel, then we can switch to the
trace_irq_entry(irqno, pt_regs)
trace_irq_exit(ret)
trace_irq_thread_entry()
trace_irq_thread_exit()
I don't see why we should mind trying to make the tracepoints "logical",
especially if it hurts performance considerably. Doing these
implementation-specific versions of irq tracepoints would provide the
best performance we can get when tracing. It's up to the tracers to
specialize their analysis based on the underlying IRQ mechanism
(non-threaded vs threaded).
Perhaps we want to make them logical so that other things might hook into
these trace points besides a tracer. I do not agree that the code should
be modified just to make the trace points faster. The trace points are
just hooks into code, and should have no effect when disabled. Once the
code starts to change due to better placement of tracepoints for tracers,
that's when those trace points should be NACKed.

-- Steve
Mathieu Desnoyers
2009-03-26 18:28:59 UTC
Permalink
Post by Steven Rostedt
Post by Mathieu Desnoyers
This third type of tracepoint for IRQs you are talking about is actually
what I had in LTTng. I decided to add the irq_next_handler and to add a
action field to irq_entry to include the irq handler information needed
by Jason.
If we want to do this logically, without thinking about tracer
trace_irq_entry(irqno, pt_regs)
for_each_handler() {
trace_irq_handler_entry(action)
action->handler()
trace_irq_handler_exit(ret)
}
trace_irq_exit(retval)
And add the irq_entry/irq_exit events to the arch-specific reschedule,
tlb flush, local timer irq, as I have in my lttng tree already.
But given the trace_irq_handler_entry/trace_irq_handler_exit events
could be combined, given we can record action and ret in the
irq_entry/exit events, I decided to remove 2 tracepoints (out of 4) from
the single-handler fast path by adding this information to the irq
entry/exit events, and decided to combine the irq_handler entry/exit
into a single next_handler event, which records the previous ret value
and the next action to execute.
On an interrupt-driven workload, it will have a significant impact.
(2 events vs 4).
I thought tracepoints while not active are very low overhead.
If you only want to use 2 of the 4, would that be just as fast?
Probably, but I was talking about active tracing overhead here.
Post by Steven Rostedt
Post by Mathieu Desnoyers
If we add interrupt threads to the kernel, then we can switch to the
trace_irq_entry(irqno, pt_regs)
trace_irq_exit(ret)
trace_irq_thread_entry()
trace_irq_thread_exit()
I don't see why we should mind trying to make the tracepoints "logical",
especially if it hurts performance considerably. Doing these
implementation-specific versions of irq tracepoints would provide the
best performance we can get when tracing. It's up to the tracers to
specialize their analysis based on the underlying IRQ mechanism
(non-threaded vs threaded).
Perhaps we want to make them logical so that other things might hook into
these trace points besides a tracer. I do not agree that the code should
be modified just to make the trace points faster. The trace points are
just hooks into code, and should have no effect when disabled. Once the
code starts to change due to better placement of tracepoints for tracers,
that's when those trace points should be NACKed.
-- Steve
If it makes the code messy, then yes, I agree that those tracepoints
should not go in.

Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Steven Rostedt
2009-03-25 02:00:49 UTC
Permalink
Post by Mathieu Desnoyers
This is actually a very good example of what Christoph Hellwig, Peter
Zijlstra and myself have been trying to warn you about the TRACE_EVENT
macro : it exports the tracepoints to userspace, and thus makes them a
userspace-visible API, when those tracepoints should be tied to the
kernel code and nothing else. An adaptation layer should provide the
abstractions that makes the information presented to the user more
"logical".
Let me correct you here. TRACE_EVENT does ***NOT*** export anything to
userspace. There is no code what so ever in TRACE_EVENT that does so.

Now, ftrace does export information using TRACE_EVENT to userspace. But
that is the way ftrace wants to handle it. There's nothing needed to
export to userspace. What is exported, is exported ***BECAUSE*** it can
change. I'll only try to keep the format that is exported the same. But
nothing should rely on what the format represents staying the same.

If someone adds a TRACE_EVENT, you can uses it to record you data, anyway
you like. Ftrace will use it to show how to read the binary data, which
is only needed if you want to do that. It uses the print format to print
to the console in case of failure. Or to the trace file, which by the way
can also change without notice.
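
For reference, a TRACE_EVENT definition is just the tracepoint prototype
plus a record layout and a print format. A sketch of what one could look
like for an irq_handler_entry event, storing the handler pointer (this is
from memory, not the exact definition in the tracing tree; the macro
spelling also differed between trees at the time, and this patchset still
uses the older TPPROTO/TPARGS form):

	TRACE_EVENT(irq_handler_entry,

		TP_PROTO(int irq, struct irqaction *action),

		TP_ARGS(irq, action),

		/* layout of the binary record */
		TP_STRUCT__entry(
			__field(int,	irq)
			__field(void *,	handler)
		),

		/* how the record is filled in at the tracepoint */
		TP_fast_assign(
			__entry->irq = irq;
			__entry->handler = action->handler;
		),

		/* print format used for the text output */
		TP_printk("irq=%d handler=%p", __entry->irq, __entry->handler)
	);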


-- Steve
Steven Rostedt
2009-03-25 02:09:07 UTC
Permalink
Post by Mathieu Desnoyers
This third type of tracepoint for IRQs you are talking about is actually
what I had in LTTng. I decided to add the irq_next_handler and to add an
action field to irq_entry to include the irq handler information needed
by Jason.
If we want to do this logically, without thinking about tracer
trace_irq_entry(irqno, pt_regs)
for_each_handler() {
trace_irq_handler_entry(action)
action->handler()
trace_irq_handler_exit(ret)
}
trace_irq_exit(retval)
And add the irq_entry/irq_exit events to the arch-specific reschedule,
tlb flush, local timer irq, as I have in my lttng tree already.
But given the trace_irq_handler_entry/trace_irq_handler_exit events
could be combined, given we can record action and ret in the
irq_entry/exit events, I decided to remove 2 tracepoints (out of 4) from
the single-handler fast path by adding this information to the irq
entry/exit events, and decided to combine the irq_handler entry/exit
into a single next_handler event, which records the previous ret value
and the next action to execute.
On an interrupt-driven workload, it will have a significant impact.
(2 events vs 4).
I thought tracepoints while not active are very low overhead.

If you only want to use 2 of the 4, would that be just as fast?
Post by Mathieu Desnoyers
If we add interrupt threads to the kernel, then we can switch to the
trace_irq_entry(irqno, pt_regs)
trace_irq_exit(ret)
trace_irq_thread_entry()
trace_irq_thread_exit()
I don't see why we should mind trying to make the tracepoints "logical",
especially if it hurts performance considerably. Doing these
implementation-specific versions of irq tracepoints would provide the
best performance we can get when tracing. It's up to the tracers to
specialize their analysis based on the underlying IRQ mechanism
(non-threaded vs threaded).
Perhaps we want to make them logical so that other things might hook into
these trace points besides a tracer. I do not agree that the code should
be modified just to make the trace points faster. The trace points are
just hooks into code, and should have no effect when disabled. Once the
code starts to change due to better placement of tracepoints for tracers,
that's when those trace points should be NACKed.

-- Steve
Jason Baron
2009-03-27 19:18:35 UTC
Permalink
Post by Ingo Molnar
Post by Jason Baron
Post by Ingo Molnar
Post by Jason Baron
Instrumentation of IRQ related events : irq_entry, irq_exit and
irq_next_handler.
It allows tracers to perform latency analysis on those various types of
interrupts and to detect interrupts with max/min/avg duration. It helps
detecting driver or hardware problems which cause an ISR to take ages to
execute. It has been shown to be the case with bogus hardware causing an mmio
read to take a few milliseconds.
Those tracepoints are used by LTTng.
About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.
irq_entry and irq_exit not declared static because they appear in x86 arch code.
The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
exit events is to be able to recreate the kernel execution state at a given
point in time. Knowing which execution context is responsible for a given trace
event is _very_ valuable in trace data analysis.
The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
provides information about which handler is being called, but does not map
correctly to the fact that _multiple_ handlers are being called from within the
same interrupt handler. From an interrupt latency analysis POV, this is
incorrect.
Since we are passing back the irq number, and we can not be
interrupted by the same irq, I think it should be pretty clear we
are in the same handler. That said, the extra entry/exit
tracepoints could make the sequence of events simpler to decipher,
which is important. The code looks good, and provides at least as
much information as the patch that I proposed. So i'll be happy
either way :)
We already have your patch merged up in the tracing tree and it
gives entry+exit tracepoints.
Ingo
maybe i wasn't clear. Entry and exit as I proposed and as in the
tracing tree are for entry and exit into each handler per irq.
Mathieu is proposing an entry/exit tracepoint per irq, and a 3rd
tracepoint to tell us which handler is being called and its return
code. hope this is clear.
Ok, i misunderstood that.
Mathieu's is slightly more compact, but yours is more logical.
I believe your pre/post IRQ handler callback is the right model - it
decouples device IRQ handling from any notion of 'IRQ'.
For example, we could correctly express "handler got executed by an
IRQ thread" via it - while via Mathieu's scheme it does not really
map to that.
So if then i think there should be a third tracepoint in addition to
your two existing tracepoints: a 'raw vector' type of tracepoint.
It's added both to do_IRQ() entry point, but also to the various
common SMP IPI entry points: reschedule, TLB flush and local timer
IRQ tick.
The best information there to pass to the probe is the raw vector
number, and the ptregs structure.
Hm?
yes, this was my thinking too. The handler tracepoints are useful in and
of themselves...and I was planning to do a separate patch, instrumenting what
you are calling the 'raw vector'...

thanks,

-Jason
Mathieu Desnoyers
2009-03-24 20:11:47 UTC
Permalink
Post by Ingo Molnar
Post by Jason Baron
Post by Ingo Molnar
Post by Jason Baron
Instrumentation of IRQ related events : irq_entry, irq_exit and
irq_next_handler.
It allows tracers to perform latency analysis on those various types of
interrupts and to detect interrupts with max/min/avg duration. It helps
detecting driver or hardware problems which cause an ISR to take ages to
execute. It has been shown to be the case with bogus hardware causing an mmio
read to take a few milliseconds.
Those tracepoints are used by LTTng.
About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.
irq_entry and irq_exit not declared static because they appear in x86 arch code.
The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
exit events is to be able to recreate the kernel execution state at a given
point in time. Knowing which execution context is responsible for a given trace
event is _very_ valuable in trace data analysis.
The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
provides information about which handler is being called, but does not map
correctly to the fact that _multiple_ handlers are being called from within the
same interrupt handler. From an interrupt latency analysis POV, this is
incorrect.
Since we are passing back the irq number, and we can not be
interrupted by the same irq, I think it should be pretty clear we
are in the same handler. That said, the extra entry/exit
tracepoints could make the sequence of events simpler to decipher,
which is important. The code looks good, and provides at least as
much information as the patch that I proposed. So i'll be happy
either way :)
We already have your patch merged up in the tracing tree and it
gives entry+exit tracepoints.
Ingo
maybe i wasn't clear. Entry and exit as I proposed and as in the
tracing tree are for entry and exit into each handler per irq.
Mathieu is proposing an entry/exit tracepoint per irq, and a 3rd
tracepoint to tell us which handler is being called and its return
code. hope this is clear.
Ok, i misunderstood that.
Mathieu's is slightly more compact, but yours is more logical.
I believe your pre/post IRQ handler callback is the right model - it
decouples device IRQ handling from any notion of 'IRQ'.
For example, we could correctly express "handler got executed by an
IRQ thread" via it - while via Mathieu's scheme it does not really
map to that.
So if then i think there should be a third tracepoint in addition to
your two existing tracepoints: a 'raw vector' type of tracepoint.
It's added both to do_IRQ() entry point, but also to the various
common SMP IPI entry points: reschedule, TLB flush and local timer
IRQ tick.
The best information there to pass to the probe is the raw vector
number, and the ptregs structure.
Hm?
This third type of tracepoint for IRQs you are talking about is actually
what I had in LTTng. I decided to add the irq_next_handler and to add an
action field to irq_entry to include the irq handler information needed
by Jason.

If we want to do this logically, without thinking about tracer
performance impact, we could/should do :

trace_irq_entry(irqno, pt_regs)
for_each_handler() {
trace_irq_handler_entry(action)
action->handler()
trace_irq_handler_exit(ret)
}
trace_irq_exit(retval)

And add the irq_entry/irq_exit events to the arch-specific reschedule,
tlb flush, local timer irq, as I have in my lttng tree already.

But given the trace_irq_handler_entry/trace_irq_handler_exit events
could be combined, given we can record action and ret in the
irq_entry/exit events, I decided to remove 2 tracepoints (out of 4) from
the single-handler fast path by adding this information to the irq
entry/exit events, and decided to combine the irq_handler entry/exit
into a single next_handler event, which records the previous ret value
and the next action to execute.

On an interrupt-driven workload, it will have a significant impact.
(2 events vs 4).
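
Roughly, the resulting event sequence looks like this (the argument lists
below are a sketch, not the exact patch) :

single handler (the common fast path, 2 events) :

trace_irq_entry(irqno, action, pt_regs)
	action->handler()
trace_irq_exit(retval)

shared irq with multiple handlers :

trace_irq_entry(irqno, first action, pt_regs)
	first handler()
trace_irq_next_handler(irqno, next action, previous ret)
	next handler()
	...
trace_irq_exit(retval)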

If we add interrupt threads to the kernel, then we can switch to the
following scheme :

* instrumentation of the real interrupt handler :

trace_irq_entry(irqno, pt_regs)

trace_irq_exit(ret)

* instrumentation of the irq threads :

trace_irq_thread_entry()

trace_irq_thread_exit()

I don't see why we should mind trying to make the tracepoints "logical",
especially if it hurts performance considerably. Doing these
implementation-specific versions of irq tracepoints would provide the
best performance we can get when tracing. It's up to the tracers to
specialize their analysis based on the underlying IRQ mechanism
(non-threaded vs threaded).

This is actually a very good example of what Christoph Hellwig, Peter
Zijlstra and myself have been trying to warn you about the TRACE_EVENT
macro : it exports the tracepoints to userspace, and thus makes them a
userspace-visible API, when those tracepoints should be tied to the
kernel code and nothing else. An adaptation layer should provide the
abstractions that make the information presented to the user more
"logical".

Mathieu
Post by Ingo Molnar
Ingo
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Ingo Molnar
2009-03-24 19:12:52 UTC
Permalink
Post by Jason Baron
Post by Ingo Molnar
Post by Jason Baron
Instrumentation of IRQ related events : irq_entry, irq_exit and
irq_next_handler.
It allows tracers to perform latency analysis on those various types of
interrupts and to detect interrupts with max/min/avg duration. It helps
detecting driver or hardware problems which cause an ISR to take ages to
execute. It has been shown to be the case with bogus hardware causing an mmio
read to take a few milliseconds.
Those tracepoints are used by LTTng.
About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.
irq_entry and irq_exit not declared static because they appear in x86 arch code.
The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
exit events is to be able to recreate the kernel execution state at a given
point in time. Knowing which execution context is responsible for a given trace
event is _very_ valuable in trace data analysis.
The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
provides information about which handler is being called, but does not map
correctly to the fact that _multiple_ handlers are being called from within the
same interrupt handler. From an interrupt latency analysis POV, this is
incorrect.
Since we are passing back the irq number, and we can not be
interrupted by the same irq, I think it should be pretty clear we
are in the same handler. That said, the extra entry/exit
tracepoints could make the sequence of events simpler to decipher,
which is important. The code looks good, and provides at least as
much information as the patch that I proposed. So i'll be happy
either way :)
We already have your patch merged up in the tracing tree and it
gives entry+exit tracepoints.
Ingo
maybe i wasn't clear. Entry and exit as I proposed and as in the
tracing tree are for entry and exit into each handler per irq.
Mathieu is proposing an entry/exit tracepoint per irq, and a 3rd
tracepoint to tell us which handler is being called and its return
code. hope this is clear.
Ok, i misunderstood that.

Mathieu's is slightly more compact, but yours is more logical.

I believe your pre/post IRQ handler callback is the right model - it
decouples device IRQ handling from any notion of 'IRQ'.

For example, we could correctly express "handler got executed by an
IRQ thread" via it - while via Mathieu's scheme it does not really
map to that.

So if then i think there should be a third tracepoint in addition to
your two existing tracepoints: a 'raw vector' type of tracepoint.
It's added both to do_IRQ() entry point, but also to the various
common SMP IPI entry points: reschedule, TLB flush and local timer
IRQ tick.

The best information there to pass to the probe is the raw vector
number, and the ptregs structure.
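
In the DECLARE_TRACE style used by this patchset, such a tracepoint could
look roughly like the following (the name is only a placeholder):

	/* sketch - hypothetical 'raw vector' tracepoint, hooked into
	 * do_IRQ() and the reschedule/TLB-flush/local-timer IPI entries */
	DECLARE_TRACE(irq_vector_entry,
		TPPROTO(int vector, struct pt_regs *regs),
		TPARGS(vector, regs));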

Hm?

Ingo
Jason Baron
2009-03-24 17:57:50 UTC
Permalink
Post by Ingo Molnar
Post by Jason Baron
Instrumentation of IRQ related events : irq_entry, irq_exit and
irq_next_handler.
It allows tracers to perform latency analysis on those various types of
interrupts and to detect interrupts with max/min/avg duration. It helps
detecting driver or hardware problems which cause an ISR to take ages to
execute. It has been shown to be the case with bogus hardware causing an mmio
read to take a few milliseconds.
Those tracepoints are used by LTTng.
About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.
irq_entry and irq_exit not declared static because they appear in x86 arch code.
The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
exit events is to be able to recreate the kernel execution state at a given
point in time. Knowing which execution context is responsible for a given trace
event is _very_ valuable in trace data analysis.
The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
provides information about which handler is being called, but does not map
correctly to the fact that _multiple_ handlers are being called from within the
same interrupt handler. From an interrupt latency analysis POV, this is
incorrect.
Since we are passing back the irq number, and we can not be
interrupted by the same irq, I think it should be pretty clear we
are in the same handler. That said, the extra entry/exit
tracepoints could make the sequence of events simpler to decipher,
which is important. The code looks good, and provides at least as
much information as the patch that I proposed. So i'll be happy
either way :)
We already have your patch merged up in the tracing tree and it
gives entry+exit tracepoints.
Ingo
maybe i wasn't clear. Entry and exit as I proposed and as in the tracing
tree are for entry and exit into each handler per irq. Mathieu is
proposing an entry/exit tracepoint per irq, and a 3rd tracepoint to
tell us which handler is being called and its return code. hope this is
clear.

thanks,

-Jason
Ingo Molnar
2009-03-24 17:50:49 UTC
Permalink
Post by Jason Baron
Instrumentation of IRQ related events : irq_entry, irq_exit and
irq_next_handler.
It allows tracers to perform latency analysis on those various types of
interrupts and to detect interrupts with max/min/avg duration. It helps
detecting driver or hardware problems which cause an ISR to take ages to
execute. It has been shown to be the case with bogus hardware causing an mmio
read to take a few milliseconds.
Those tracepoints are used by LTTng.
About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.
irq_entry and irq_exit not declared static because they appear in x86 arch code.
The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
exit events is to be able to recreate the kernel execution state at a given
point in time. Knowing which execution context is responsible for a given trace
event is _very_ valuable in trace data analysis.
The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
provides information about which handler is being called, but does not map
correctly to the fact that _multiple_ handlers are being called from within the
same interrupt handler. From an interrupt latency analysis POV, this is
incorrect.
Since we are passing back the irq number, and we can not be
interrupted by the same irq, I think it should be pretty clear we
are in the same handler. That said, the extra entry/exit
tracepoints could make the sequence of events simpler to decipher,
which is important. The code looks good, and provides at least as
much information as the patch that I proposed. So i'll be happy
either way :)
We already have your patch merged up in the tracing tree and it
gives entry+exit tracepoints.

Ingo
Ingo Molnar
2009-03-24 19:14:49 UTC
Permalink
+static irqreturn_t __handle_irq_next_handler(unsigned int irq,
+ struct irqaction **action, irqreturn_t *retval, unsigned int *status)
+{
+ irqreturn_t ret;
+
+ ret = (*action)->handler(irq, (*action)->dev_id);
+ if (ret == IRQ_HANDLED)
+ *status |= (*action)->flags;
+ *retval |= ret;
+ *action = (*action)->next;
+ return ret;
+}
+
static irqreturn_t _handle_IRQ_event(unsigned int irq, struct irqaction *action)
{
irqreturn_t ret, retval = IRQ_NONE;
@@ -324,13 +345,12 @@ static irqreturn_t _handle_IRQ_event(uns
if (!(action->flags & IRQF_DISABLED))
local_irq_enable_in_hardirq();
- do {
- ret = action->handler(irq, action->dev_id);
- if (ret == IRQ_HANDLED)
- status |= action->flags;
- retval |= ret;
- action = action->next;
- } while (action);
+ ret = __handle_irq_next_handler(irq, &action, &retval, &status);
+
+ while (action) {
+ trace_irq_next_handler(irq, action, ret);
+ ret = __handle_irq_next_handler(irq, &action, &retval, &status);
+ }
Hm, this is rather unclean. Why open-code the first handler
execution?

This is a sign (and side effect) of the logical model being slightly
incorrect - see my previous mail to Jason.

Ingo
Thomas Gleixner
2009-03-27 22:12:41 UTC
Permalink
*/
struct lock_class_key irq_desc_lock_class;
@@ -316,6 +324,19 @@ irqreturn_t no_action(int cpl, void *dev
return IRQ_NONE;
}
+static irqreturn_t __handle_irq_next_handler(unsigned int irq,
+ struct irqaction **action, irqreturn_t *retval, unsigned int *status)
+{
+ irqreturn_t ret;
+
+ ret = (*action)->handler(irq, (*action)->dev_id);
+ if (ret == IRQ_HANDLED)
+ *status |= (*action)->flags;
+ *retval |= ret;
+ *action = (*action)->next;
+ return ret;
+}
+
static irqreturn_t _handle_IRQ_event(unsigned int irq, struct irqaction *action)
{
irqreturn_t ret, retval = IRQ_NONE;
@@ -324,13 +345,12 @@ static irqreturn_t _handle_IRQ_event(uns
if (!(action->flags & IRQF_DISABLED))
local_irq_enable_in_hardirq();
- do {
- ret = action->handler(irq, action->dev_id);
- if (ret == IRQ_HANDLED)
- status |= action->flags;
- retval |= ret;
- action = action->next;
- } while (action);
+ ret = __handle_irq_next_handler(irq, &action, &retval, &status);
+
+ while (action) {
+ trace_irq_next_handler(irq, action, ret);
+ ret = __handle_irq_next_handler(irq, &action, &retval, &status);
+ }
This one is the next candidate for the "ugly patch of the week contest".

It does not trace the return value of the first handler and just moves
code around for no good reason. Darn, why can't you simply do:

do {
ret = action->handler(irq, action->dev_id);
+ trace_action_handler(....);
}

and keep the rest of the code unchanged ?
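
Spelled out, that keeps the loop exactly as it is today, with one added
line (trace_action_handler and its argument list are placeholders, as in
the sketch above):

	do {
		ret = action->handler(irq, action->dev_id);
		trace_action_handler(irq, action, ret);	/* placeholder args */
		if (ret == IRQ_HANDLED)
			status |= action->flags;
		retval |= ret;
		action = action->next;
	} while (action);
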
#ifndef CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ
This code is going away. No tracepoint needed here.

Thanks,

tglx
Jason Baron
2009-03-24 17:33:54 UTC
Permalink
Instrumentation of IRQ related events : irq_entry, irq_exit and
irq_next_handler.
It allows tracers to perform latency analysis on those various types of
interrupts and to detect interrupts with max/min/avg duration. It helps
detecting driver or hardware problems which cause an ISR to take ages to
execute. It has been shown to be the case with bogus hardware causing an mmio
read to take a few milliseconds.
Those tracepoints are used by LTTng.
About the performance impact of tracepoints (which is comparable to markers),
even without immediate values optimizations, tests done by Hideo Aoki on ia64
show no regression. His test case was using hackbench on a kernel where
scheduler instrumentation (about 5 events in core scheduler code) was added.
See the "Tracepoints" patch header for performance result detail.
irq_entry and irq_exit not declared static because they appear in x86 arch code.
The idea behind logging irq/softirq/tasklet/(and eventually syscall) entry and
exit events is to be able to recreate the kernel execution state at a given
point in time. Knowing which execution context is responsible for a given trace
event is _very_ valuable in trace data analysis.
The IRQ instrumentation instruments the IRQ handler entry and exit. Jason
instrumented the irq notifier chain calls (irq_handler_entry/exit). His approach
provides information about which handler is being called, but does not map
correctly to the fact that _multiple_ handlers are being called from within the
same interrupt handler. From an interrupt latency analysis POV, this is
incorrect.
Since we are passing back the irq number, and we can not be interrupted
by the same irq, I think it should be pretty clear we are in the same
handler. That said, the extra entry/exit tracepoints could make the
sequence of events simpler to decipher, which is important. The code
looks good, and provides at least as much information as the patch that
I proposed. So i'll be happy either way :)

thanks,

-Jason
Mathieu Desnoyers
2009-03-24 15:56:34 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-swap.patch
URL: <http://lists.casi.polymtl.ca/pipermail/ltt-dev/attachments/20090324/b018dd99/attachment.txt>
Ingo Molnar
2009-03-24 18:51:28 UTC
Permalink
+DECLARE_TRACE(swap_in,
+ TPPROTO(struct page *page, swp_entry_t entry),
+ TPARGS(page, entry));
+DECLARE_TRACE(swap_out,
+ TPPROTO(struct page *page),
+ TPARGS(page));
+DECLARE_TRACE(swap_file_open,
+ TPPROTO(struct file *file, char *filename),
+ TPARGS(file, filename));
+DECLARE_TRACE(swap_file_close,
+ TPPROTO(struct file *file),
+ TPARGS(file));
These are more complete than the pagecache tracepoints, but still too
incomplete to give a comprehensive picture of swap activity.

Firstly, the swap_file_open/close events seem quite pointless. Most
systems enable swap during bootup and never close it. These
tracepoints just won't be exercised in practice.

Also, to _really_ help with debugging VM pressure problems, the
whole LRU state-machine should be instrumented, and linked up with
pagecache instrumentation via page frame numbers and (inode,offset)
[file] and (pgd,addr) [anon] pairs.
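
For file-backed pages those keys are reachable straight from the struct
page a probe receives - roughly like this (illustrative sketch only,
ignoring locking; the function name is made up and the anon (pgd,addr)
side needs the rmap, so it is not shown):

	#include <linux/mm.h>
	#include <linux/fs.h>

	static void identify_page(struct page *page)
	{
		unsigned long pfn = page_to_pfn(page);

		if (page->mapping && !PageAnon(page)) {
			/* pagecache page: (inode, offset) pair */
			struct inode *inode = page->mapping->host;

			printk(KERN_DEBUG "pfn=%lu ino=%lu offset=%lu\n",
			       pfn, inode->i_ino,
			       (unsigned long)page->index);
		} else {
			/* anon page: would need the rmap for (pgd, addr) */
			printk(KERN_DEBUG "pfn=%lu (anon or unmapped)\n", pfn);
		}
	}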

Not just the fact that something got swapped out is interesting, but
also the whole decision chain that leads up to it. The lifetime of a
page, and how it moves between the various stages of eviction and LRU
scores, matters just as much.
+DECLARE_TRACE(swap_file_open,
+ TPPROTO(struct file *file, char *filename),
+ TPARGS(file, filename));
there's no need to pass in the filename - it can be deduced in the
probe from struct file.
+DECLARE_TRACE(swap_in,
+ TPPROTO(struct page *page, swp_entry_t entry),
+ TPARGS(page, entry));
+DECLARE_TRACE(swap_out,
+ TPPROTO(struct page *page),
+ TPARGS(page));
you pass in swp_entry to trace_swap_in(), which encodes the offset -
but that parameter is not needed, the page already represents the
offset at that stage in do_swap_page(). (the actual data is not read
in yet from swap, but the page is already linked up in the
swap-cache and has the offset available - which a probe can
recover.)

So this suffices:

DECLARE_TRACE(swap_in,
TPPROTO(struct page *page),
TPARGS(page));

DECLARE_TRACE(swap_out,
TPPROTO(struct page *page),
TPARGS(page));
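
A probe on that simplified tracepoint can then recover the offset on its
own, roughly like this (sketch only - the probe name and the printk are
made up, and the registration call is whatever DECLARE_TRACE generates
for the tracepoint):

	#include <linux/mm.h>
	#include <linux/swap.h>
	#include <linux/swapops.h>

	static void probe_swap_in(struct page *page)
	{
		/* the page is already in the swap cache here, so its swap
		 * entry is stashed in page_private(page) */
		swp_entry_t entry = { .val = page_private(page) };

		printk(KERN_DEBUG "swap_in: pfn=%lu type=%u offset=%lu\n",
		       page_to_pfn(page), swp_type(entry),
		       (unsigned long)swp_offset(entry));
	}

	/* registered with register_trace_swap_in(probe_swap_in); */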

And here again i'd like to see actual meaningful probe contents via
a TRACE_EVENT() construct. That shows and proves that it's all part
of a comprehensive framework, and the data that is recovered is
understood and put into a coherent whole - upstream. That makes it
immediately useful to the built-in tracers, and will also cause
fewer surprises downstream.

Ingo
Mathieu Desnoyers
2009-03-24 15:56:26 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: irq-handle-prepare-for-instrumentation.patch
URL: <http://lists.casi.polymtl.ca/pipermail/lttng-dev/attachments/20090324/1196b7a8/attachment.asc>
Mathieu Desnoyers
2009-03-24 15:56:30 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-scheduler-fixup-task-migration.patch
URL: <http://lists.casi.polymtl.ca/pipermail/lttng-dev/attachments/20090324/624d8e1e/attachment.txt>
Mathieu Desnoyers
2009-03-24 15:56:32 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-kernel.patch
URL: <http://lists.casi.polymtl.ca/pipermail/lttng-dev/attachments/20090324/dc616bb6/attachment.asc>
Mathieu Desnoyers
2009-03-24 15:56:33 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-filemap.patch
URL: <http://lists.casi.polymtl.ca/pipermail/lttng-dev/attachments/20090324/fb535fba/attachment.txt>
Mathieu Desnoyers
2009-03-24 15:56:29 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-softirq.patch
URL: <http://lists.casi.polymtl.ca/pipermail/lttng-dev/attachments/20090324/cfb79406/attachment.asc>
Mathieu Desnoyers
2009-03-24 15:56:28 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-tasklets.patch
URL: <http://lists.casi.polymtl.ca/pipermail/lttng-dev/attachments/20090324/b750aeb1/attachment.txt>
Mathieu Desnoyers
2009-03-24 15:56:31 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-timer.patch
URL: <http://lists.casi.polymtl.ca/pipermail/lttng-dev/attachments/20090324/2beb18ca/attachment.asc>
Mathieu Desnoyers
2009-03-24 15:56:27 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-irq.patch
URL: <http://lists.casi.polymtl.ca/pipermail/lttng-dev/attachments/20090324/23f29076/attachment.txt>
Mathieu Desnoyers
2009-03-24 15:56:34 UTC
Permalink
An embedded and charset-unspecified text was scrubbed...
Name: lttng-instrumentation-swap.patch
URL: <http://lists.casi.polymtl.ca/pipermail/lttng-dev/attachments/20090324/b018dd99/attachment.asc>