Discussion: [lttng-dev] lost trace events
Alok Priyadarshi
2018-10-31 21:58:37 UTC
We are tracing a multi-process application that generates high-frequency
user-space trace events.

Environment:
- Ubuntu 16.04 LTS
- Lttng 2.10.4 - KeKriek

Session:
- Snapshot mode
- One channel with default config:
overwrite=1, subbuf_size=524288, num_subbuf=4

We have noticed that a few events are missing. I understand that in
overwrite mode we can lose entire sub-buffers but IIUC we would only lose
events at the beginning, not in-between events in a sub-buffer.

Is it not true that only the oldest sub-buffer is overwritten and when
overwrite happens, none of the latest events are discarded?

A few more things to note:
- Before lttng-snapshot-record, we stop tracing as recommended by the
documentation
- Packet context field events_discarded is always set to 0
- Event loss seems to happen around packet transition. I have noticed this
in at least two traces

The snapshot mode seems perfect for our use case. It is OK to lose events
from either end of the snapshot, but it is essential not to lose any
in-between events. I was under the impression that overwrite mode
guarantees that. If not, is there anything else we can try?
Alok Priyadarshi
2018-11-06 22:48:24 UTC
Any advice on how to debug lost events? Thanks in advance!
Jonathan Rajotte-Julien
2018-11-07 15:31:20 UTC
Hi,
Post by Alok Priyadarshi
We are tracing a multi-process application that generates high-frequency
user-space trace events.
- Ubuntu 16.04 LTS
- Lttng 2.10.4 - KeKriek
- Snapshot mode
overwrite=1, subbuf_size=524288, num_subbuf=4
We have noticed that a few events are missing. I understand that in
How did you notice those "missing" events?

Would you be able to share a small reproducer with us so that we can assess the
situation?
Post by Alok Priyadarshi
overwrite mode we can lose entire sub-buffers but IIUC we would only lose
events at the beginning, not in-between events in a sub-buffer.
Is it not true that only the oldest sub-buffer is overwritten and when
overwrite happens, none of the latest events are discarded?
This is how it should work, yes.
Post by Alok Priyadarshi
- Before lttng-snapshot-record, we stop tracing as recommended by the
documentation
Good.
Post by Alok Priyadarshi
- Packet context field events_discarded is always set to 0
- Event loss seems to happen around packet transition. I have noticed this
in at least two traces
Please tell us how you arrived at this conclusion and provide a small
reproducer if possible.
Post by Alok Priyadarshi
The snapshot mode seems perfect for our use case. It is OK to lose events
from either end of the snapshot, but it is essential not to lose any
in-between events. I was under the impression that overwrite mode
guarantees that.
It should.

Cheers
--
Jonathan Rajotte-Julien
EfficiOS
Alok Priyadarshi
2018-11-07 19:53:25 UTC
Hi Jonathan,

Thanks for your response.

We are tracing function scopes. Each scope emits two events - begin and
end. We noticed that some begin events did not have corresponding end
events in the trace.

I have the trace with this problem available. Would that provide any clue?
If not, I will try to extract a small reproducer.

-Alok
Jonathan Rajotte-Julien
2018-11-07 20:09:29 UTC
Hi Alok,
Post by Alok Priyadarshi
Hi Jonathan,
Thanks for your response.
We are tracing function scopes. Each scope emits two events - begin and
end. We noticed that some begin events did not have corresponding end
events in the trace.
Seems like solid ground, as long as the corresponding end event is not
inside any conditional path. I'm sure you already validated that, but we
never know.
Post by Alok Priyadarshi
I have the trace with this problem available. Would that provide any clue?
It could help, but I presume it might contain confidential or sensitive
information. Let's define the scenario first and we will go back to actual
trace data if needed.
Post by Alok Priyadarshi
If not, I will try to extract a small reproducer.
That would be best. In the meantime, could you provide the exact lttng commands
used to set up everything and describe the subsequent lttng commands used for
the snapshot, etc.?

You can omit event names or replace them with dummy names if necessary.

Cheers
--
Jonathan Rajotte-Julien
EfficiOS
Alok Priyadarshi
2018-11-08 00:12:30 UTC
A custom trace event class emits "begin" and "end" events in its
constructor and destructor respectively, so I do not think this is due to
a conditional path.

The sequence of commands to capture a snapshot is:
lttng create --snapshot
lttng enable-event --userspace --all
lttng add-context --userspace -t vpid -t vtid
lttng start
lttng stop
lttng snapshot record

We create a new session for each snapshot to make sure that the buffer does
not have any left-over events from the previous session. But I am curious
if lttng shares buffers between sessions or re-uses buffers from an old
session without flushing?

Note that I am using the default LTTNG_BUFFER_PER_UID mode. All processes
for our application run under one UID, and only one tracing session is
active at a time.

Jonathan Rajotte-Julien
2018-11-08 16:05:42 UTC
Post by Alok Priyadarshi
A custom trace event class emits "begin" and "end" events in its
constructor and destructor respectively, so I do not think this is due to
a conditional path.
If you can guarantee that the destructor is called, seems good to me.

Do you see a scenario where the "begin" event is missing?
Post by Alok Priyadarshi
lttng create --snapshot
lttng enable-event --userspace --all
lttng add-context --userspace -t vpid -t vtid
lttng start
lttng stop
lttng snapshot record
Nothing out of the ordinary here.
Post by Alok Priyadarshi
We create a new session for each snapshot to make sure that the buffer does
not have any left-over events from the previous session. But I am curious
if lttng shares buffers between sessions or re-uses buffers from an old
session without flushing?
No sharing is done between sessions AFAIK.
Post by Alok Priyadarshi
Note that I am using the default LTTNG_BUFFER_PER_UID mode. All processes
for our application run under one UID, and only one tracing session is
active at a time.
Again, nothing out of the ordinary.

Next step would be a small reproducer.

Is the system under load?
Is there any cpu affinity set?
What kind of system are we talking about (x86_64? arm? 24+ cores? etc.)

Cheers
--
Jonathan Rajotte-Julien
EfficiOS
Mathieu Desnoyers
2018-11-08 17:52:46 UTC
Hi Alok,

With a snapshot trace, you can end up with some per-cpu buffers that contain information
going further back in time compared to other per-cpu buffers. This depends on the level of
system activity and tracing throughput for each CPU.

The situation you experience can very well be triggered by having the begin event recorded on one CPU,
the thread being migrated, and the end event recorded on another CPU. If either the
source or destination CPU has been more active than the other, you may find that you
have only the begin or the end event in your trace.

This is why we implemented the "stream intersection" mode in babeltrace. It ensures that
babeltrace cuts away trace data belonging to time spans that only appear in some of
the streams. It is not, however, activated by default.

To try it out, you can do this:

babeltrace --stream-intersection path/to/trace

Does it help ?

Thanks,

Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Alok Priyadarshi
2018-11-08 23:31:12 UTC
Jonathan: The system is x86_64, 16 cores, no cpu affinity set. Not too sure
about load yet.

Mathieu: The explanation makes sense. Running babeltrace with the
--stream-intersection flag did filter out a lot of events. However, it is
hard to verify the filtered result because we do not yet know how to read
the text format. Does the babeltrace Python API support stream intersection
by any chance?


Jérémie Galarneau
2018-11-08 23:44:13 UTC
Hi Alok,

The TraceCollection constructor accepts an "intersect_mode" parameter
which will provide the same semantics as --stream-intersection.
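
For example, something along these lines should work (an untested sketch
against the babeltrace 1.x Python reader API; the trace path, the
"scope_begin"/"scope_end" event names, and the "vtid" field are
placeholders for whatever your application actually records):

import collections
import babeltrace

# intersect_mode=True gives the same behaviour as
# `babeltrace --stream-intersection`: only the time range covered by
# every stream is kept.
tc = babeltrace.TraceCollection(intersect_mode=True)
tc.add_traces_recursive('path/to/snapshot', 'ctf')

# Count begin/end events per thread to spot unmatched pairs
# (assumes the vtid context added with `lttng add-context` is
# visible as an event field).
counts = collections.defaultdict(lambda: [0, 0])
for event in tc.events:
    if event.name.endswith('scope_begin'):
        counts[event['vtid']][0] += 1
    elif event.name.endswith('scope_end'):
        counts[event['vtid']][1] += 1

for vtid, (begins, ends) in sorted(counts.items()):
    print(vtid, begins, ends)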

Regards,
Jérémie
--
Jérémie Galarneau
EfficiOS Inc.
http://www.efficios.com
Alok Priyadarshi
2018-11-09 00:53:49 UTC
Awesome, I tried the intersect_mode parameter and it worked. No more lost
events. Thanks for your support.
