author    Scott Rifenbark &lt;scott.m.rifenbark@intel.com&gt;   2013-01-16 12:49:45 -0800
committer Richard Purdie &lt;richard.purdie@linuxfoundation.org&gt;   2013-01-27 13:56:02 +0000
commit    86478ff9486516ed428e470afbe76f3cd857888c (patch)
tree      b883e2138166f3a9daf986931779a8f4058169bf /documentation/profile-manual
parent    487a76f7c7bfdb47ddec76659794834e84ed04eb (diff)
profile-manual: Updated indentation and sectioning
Had the sectioning off and it was preventing the manual from making.
I also added more indentation as needed throughout most of the manual
up to the error.

(From yocto-docs rev: 5de3544593a63e96b349babc177970d8605e0098)

Signed-off-by: Scott Rifenbark <scott.m.rifenbark@intel.com>
Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
Diffstat (limited to 'documentation/profile-manual')
-rw-r--r--  documentation/profile-manual/profile-manual-usage.xml | 1257 ++++++++++----------
1 files changed, 629 insertions(+), 628 deletions(-)
diff --git a/documentation/profile-manual/profile-manual-usage.xml b/documentation/profile-manual/profile-manual-usage.xml
index cfe916a..51c483b 100644
--- a/documentation/profile-manual/profile-manual-usage.xml
+++ b/documentation/profile-manual/profile-manual-usage.xml
@@ -108,28 +108,28 @@
top System profiling tool.
See 'perf help COMMAND' for more information on a specific command.
- </literallayout>
- </para>
+ </literallayout>
+ </para>
- <section id='using-perf-to-do-basic-profiling'>
- <title>Using perf to do Basic Profiling</title>
+ <section id='using-perf-to-do-basic-profiling'>
+ <title>Using perf to do Basic Profiling</title>
- <para>
- As a simple test case, we'll profile the 'wget' of a fairly large
- file, which is a minimally interesting case because it has both
- file and network I/O aspects, and at least in the case of standard
- Yocto images, it's implemented as part of busybox, so the methods
- we use to analyze it can be used in a very similar way to the whole
- host of supported busybox applets in Yocto.
- <literallayout class='monospaced'>
+ <para>
+ As a simple test case, we'll profile the 'wget' of a fairly large
+ file, which is a minimally interesting case because it has both
+ file and network I/O aspects, and at least in the case of standard
+ Yocto images, it's implemented as part of busybox, so the methods
+ we use to analyze it can be used in a very similar way to the whole
+ host of supported busybox applets in Yocto.
+ <literallayout class='monospaced'>
root@crownbay:~# rm linux-2.6.19.2.tar.bz2; \
wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
- </literallayout>
- The quickest and easiest way to get some basic overall data about
- what's going on for a particular workload is to profile it using
- 'perf stat'. 'perf stat' basically profiles using a few default
- counters and displays the summed counts at the end of the run:
- <literallayout class='monospaced'>
+ </literallayout>
+ The quickest and easiest way to get some basic overall data about
+ what's going on for a particular workload is to profile it using
+ 'perf stat'. 'perf stat' basically profiles using a few default
+ counters and displays the summed counts at the end of the run:
+ <literallayout class='monospaced'>
root@crownbay:~# perf stat wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
linux-2.6.19.2.tar.b 100% |***************************************************| 41727k 0:00:00 ETA
@@ -148,19 +148,19 @@
19550329 branch-misses # 11.82% of all branches
59.836627620 seconds time elapsed
- </literallayout>
- Many times such a simple-minded test doesn't yield much of
- interest, but sometimes it does (see Real-world Yocto bug
- (slow loop-mounted write speed)).
- </para>
+ </literallayout>
+ Many times such a simple-minded test doesn't yield much of
+ interest, but sometimes it does (see Real-world Yocto bug
+ (slow loop-mounted write speed)).
+ </para>
- <para>
- Also, note that 'perf stat' isn't restricted to a fixed set of
- counters - basically any event listed in the output of 'perf list'
- can be tallied by 'perf stat'. For example, suppose we wanted to
- see a summary of all the events related to kernel memory
- allocation/freeing along with cache hits and misses:
- <literallayout class='monospaced'>
+ <para>
+ Also, note that 'perf stat' isn't restricted to a fixed set of
+ counters - basically any event listed in the output of 'perf list'
+ can be tallied by 'perf stat'. For example, suppose we wanted to
+ see a summary of all the events related to kernel memory
+ allocation/freeing along with cache hits and misses:
+ <literallayout class='monospaced'>
root@crownbay:~# perf stat -e kmem:* -e cache-references -e cache-misses wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
linux-2.6.19.2.tar.b 100% |***************************************************| 41727k 0:00:00 ETA
@@ -183,28 +183,28 @@
2917740 cache-misses # 4.365 % of all cache refs
44.831023415 seconds time elapsed
- </literallayout>
- So 'perf stat' gives us a nice easy way to get a quick overview of
- what might be happening for a set of events, but normally we'd
- need a little more detail in order to understand what's going on
- in a way that we can act on.
- </para>
+ </literallayout>
+ So 'perf stat' gives us a nice easy way to get a quick overview of
+ what might be happening for a set of events, but normally we'd
+ need a little more detail in order to understand what's going on
+ in a way that we can act on.
+ </para>
- <para>
- To dive down into the next level of detail, we can use 'perf
- record'/'perf report', which will collect profiling data and
- present it to us using an interactive text-based UI (or
- simply as text if we specify --stdio to 'perf report').
- </para>
+ <para>
+ To dive down into the next level of detail, we can use 'perf
+ record'/'perf report', which will collect profiling data and
+ present it to us using an interactive text-based UI (or
+ simply as text if we specify --stdio to 'perf report').
+ </para>
- <para>
- As our first attempt at profiling this workload, we'll simply
- run 'perf record', handing it the workload we want to profile
- (everything after 'perf record' and any perf options we hand
- it - here none - will be executed in a new shell). perf collects
- samples until the process exits and records them in a file named
- 'perf.data' in the current working directory.
- <literallayout class='monospaced'>
+ <para>
+ As our first attempt at profiling this workload, we'll simply
+ run 'perf record', handing it the workload we want to profile
+ (everything after 'perf record' and any perf options we hand
+ it - here none - will be executed in a new shell). perf collects
+ samples until the process exits and records them in a file named
+ 'perf.data' in the current working directory.
+ <literallayout class='monospaced'>
root@crownbay:~# perf record wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
@@ -215,44 +215,44 @@
To see the results in a 'text-based UI' (tui), simply run
'perf report', which will read the perf.data file in the current
working directory and display the results in an interactive UI:
- <literallayout class='monospaced'>
+ <literallayout class='monospaced'>
root@crownbay:~# perf report
- </literallayout>
- </para>
+ </literallayout>
+ </para>
- <para>
- <imagedata fileref="figures/perf-wget-flat-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/perf-wget-flat-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
- <para>
- The above screenshot displays a 'flat' profile, one entry for
- each 'bucket' corresponding to the functions that were profiled
- during the profiling run, ordered from the most popular to the
- least (perf has options to sort in various orders and keys as
- well as display entries only above a certain threshold and so
- on - see the perf documentation for details). Note that this
- includes both userspace functions (entries containing a [.]) and
- kernel functions accounted to the process (entries containing
- a [k]). (perf has command-line modifiers that can be used to
- restrict the profiling to kernel or userspace, among others).
- </para>
+ <para>
+ The above screenshot displays a 'flat' profile, one entry for
+ each 'bucket' corresponding to the functions that were profiled
+ during the profiling run, ordered from the most popular to the
+ least (perf has options to sort in various orders and keys as
+ well as display entries only above a certain threshold and so
+ on - see the perf documentation for details). Note that this
+ includes both userspace functions (entries containing a [.]) and
+ kernel functions accounted to the process (entries containing
+ a [k]). (perf has command-line modifiers that can be used to
+ restrict the profiling to kernel or userspace, among others).
+ </para>
- <para>
- Notice also that the above report shows an entry for 'busybox',
- which is the executable that implements 'wget' in Yocto, but that
- instead of a useful function name in that entry, it displays
- a not-so-friendly hex value. The steps below will show
- how to fix that problem.
- </para>
+ <para>
+ Notice also that the above report shows an entry for 'busybox',
+ which is the executable that implements 'wget' in Yocto, but that
+ instead of a useful function name in that entry, it displays
+ a not-so-friendly hex value. The steps below will show
+ how to fix that problem.
+ </para>
- <para>
- Before we do that, however, let's try running a different profile,
- one which shows something a little more interesting. The only
- difference between the new profile and the previous one is that
- we'll add the -g option, which will record not just the address
- of a sampled function, but the entire callchain to the sampled
- function as well:
- <literallayout class='monospaced'>
+ <para>
+ Before we do that, however, let's try running a different profile,
+ one which shows something a little more interesting. The only
+ difference between the new profile and the previous one is that
+ we'll add the -g option, which will record not just the address
+ of a sampled function, but the entire callchain to the sampled
+ function as well:
+ <literallayout class='monospaced'>
root@crownbay:~# perf record -g wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
linux-2.6.19.2.tar.b 100% |************************************************| 41727k 0:00:00 ETA
@@ -261,217 +261,217 @@
root@crownbay:~# perf report
- </literallayout>
- </para>
+ </literallayout>
+ </para>
- <para>
- <imagedata fileref="figures/perf-wget-g-copy-to-user-expanded-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/perf-wget-g-copy-to-user-expanded-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
- <para>
- Using the callgraph view, we can actually see not only which
- functions took the most time, but we can also see a summary of
- how those functions were called and learn something about how the
- program interacts with the kernel in the process.
- </para>
+ <para>
+ Using the callgraph view, we can actually see not only which
+ functions took the most time, but we can also see a summary of
+ how those functions were called and learn something about how the
+ program interacts with the kernel in the process.
+ </para>
- <para>
- Notice that each entry in the above screenshot now contains a '+'
- on the left-hand side. This means that we can expand the entry and
- drill down into the callchains that feed into that entry.
- Pressing 'enter' on any one of them will expand the callchain
- (you can also press 'E' to expand them all at the same time or 'C'
- to collapse them all).
- </para>
+ <para>
+ Notice that each entry in the above screenshot now contains a '+'
+ on the left-hand side. This means that we can expand the entry and
+ drill down into the callchains that feed into that entry.
+ Pressing 'enter' on any one of them will expand the callchain
+ (you can also press 'E' to expand them all at the same time or 'C'
+ to collapse them all).
+ </para>
- <para>
- In the screenshot above, we've toggled the __copy_to_user_ll()
- entry and several subnodes all the way down. This lets us see
- which callchains contributed to the profiled __copy_to_user_ll()
- function, which contributed 1.77% to the total profile.
- </para>
+ <para>
+ In the screenshot above, we've toggled the __copy_to_user_ll()
+ entry and several subnodes all the way down. This lets us see
+ which callchains contributed to the profiled __copy_to_user_ll()
+ function, which contributed 1.77% to the total profile.
+ </para>
- <para>
- As a bit of background explanation for these callchains, think
- about what happens at a high level when you run wget to fetch a file
- over the network. Basically what happens is that the data comes
- into the kernel via the network connection (socket) and is passed
- to the userspace program 'wget' (which is actually a part of
- busybox, but that's not important for now), which takes the buffers
- the kernel passes to it and writes them to a disk file to save them.
- </para>
+ <para>
+ As a bit of background explanation for these callchains, think
+ about what happens at a high level when you run wget to fetch a file
+ over the network. Basically what happens is that the data comes
+ into the kernel via the network connection (socket) and is passed
+ to the userspace program 'wget' (which is actually a part of
+ busybox, but that's not important for now), which takes the buffers
+ the kernel passes to it and writes them to a disk file to save them.
+ </para>
- <para>
- The part of this process that we're looking at in the above call
- stacks is the part where the kernel passes the data it's read from
- the socket down to wget, i.e. a copy-to-user.
- </para>
+ <para>
+ The part of this process that we're looking at in the above call
+ stacks is the part where the kernel passes the data it's read from
+ the socket down to wget, i.e. a copy-to-user.
+ </para>
- <para>
- Notice also that there's a case here where a hex value
- is displayed in the callstack, in the expanded
- sys_clock_gettime() function. Later we'll see it resolve to a
- userspace function call in busybox.
- </para>
+ <para>
+ Notice also that there's a case here where a hex value
+ is displayed in the callstack, in the expanded
+ sys_clock_gettime() function. Later we'll see it resolve to a
+ userspace function call in busybox.
+ </para>
- <para>
- <imagedata fileref="figures/perf-wget-g-copy-from-user-expanded-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/perf-wget-g-copy-from-user-expanded-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
- <para>
- The above screenshot shows the other half of the journey for the
- data - from the wget program's userspace buffers to disk. To get
- the buffers to disk, the wget program issues a write(2), which
- does a copy-from-user to the kernel, which then takes care, via
- some circuitous path (probably also present somewhere in the
- profile data), of getting it safely to disk.
- </para>
+ <para>
+ The above screenshot shows the other half of the journey for the
+ data - from the wget program's userspace buffers to disk. To get
+ the buffers to disk, the wget program issues a write(2), which
+ does a copy-from-user to the kernel, which then takes care, via
+ some circuitous path (probably also present somewhere in the
+ profile data), of getting it safely to disk.
+ </para>
- <para>
- Now that we've seen the basic layout of the profile data and the
- basics of how to extract useful information out of it, let's get
- back to the task at hand and see if we can get some basic idea
- about where the time is spent in the program we're profiling,
- wget. Remember that wget is actually implemented as an applet
- in busybox, so while the process name is 'wget', the executable
- we're actually interested in is busybox. So let's expand the
- first entry containing busybox:
- </para>
+ <para>
+ Now that we've seen the basic layout of the profile data and the
+ basics of how to extract useful information out of it, let's get
+ back to the task at hand and see if we can get some basic idea
+ about where the time is spent in the program we're profiling,
+ wget. Remember that wget is actually implemented as an applet
+ in busybox, so while the process name is 'wget', the executable
+ we're actually interested in is busybox. So let's expand the
+ first entry containing busybox:
+ </para>
- <para>
- <imagedata fileref="figures/perf-wget-busybox-expanded-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/perf-wget-busybox-expanded-stripped.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
- <para>
- Again, before we expanded the entry we saw that the function was
- labeled with a hex value rather than a symbol, unlike most of the
- kernel entries. Expanding the busybox entry doesn't make it any better.
- </para>
+ <para>
+ Again, before we expanded the entry we saw that the function was
+ labeled with a hex value rather than a symbol, unlike most of the
+ kernel entries. Expanding the busybox entry doesn't make it any better.
+ </para>
- <para>
- The problem is that perf can't find the symbol information for the
- busybox binary, which is actually stripped out by the Yocto build
- system.
- </para>
+ <para>
+ The problem is that perf can't find the symbol information for the
+ busybox binary, which is actually stripped out by the Yocto build
+ system.
+ </para>
- <para>
- One way around that is to put the following in your local.conf
- when you build the image:
- <literallayout class='monospaced'>
+ <para>
+ One way around that is to put the following in your local.conf
+ when you build the image:
+ <literallayout class='monospaced'>
INHIBIT_PACKAGE_STRIP = "1"
- </literallayout>
- However, we already have an image with the binaries stripped,
- so what can we do to get perf to resolve the symbols? Basically
- we need to install the debuginfo for the busybox package.
- </para>
+ </literallayout>
+ However, we already have an image with the binaries stripped,
+ so what can we do to get perf to resolve the symbols? Basically
+ we need to install the debuginfo for the busybox package.
+ </para>
- <para>
- To generate the debug info for the packages in the image, we can
- add dbg-pkgs to EXTRA_IMAGE_FEATURES in local.conf. For example:
- <literallayout class='monospaced'>
+ <para>
+ To generate the debug info for the packages in the image, we can
+ add dbg-pkgs to EXTRA_IMAGE_FEATURES in local.conf. For example:
+ <literallayout class='monospaced'>
EXTRA_IMAGE_FEATURES = "debug-tweaks tools-profile dbg-pkgs"
- </literallayout>
- Additionally, in order to generate the type of debuginfo that
- perf understands, we also need to add the following to local.conf:
- <literallayout class='monospaced'>
+ </literallayout>
+ Additionally, in order to generate the type of debuginfo that
+ perf understands, we also need to add the following to local.conf:
+ <literallayout class='monospaced'>
PACKAGE_DEBUG_SPLIT_STYLE = 'debug-file-directory'
- </literallayout>
- Once we've done that, we can install the debuginfo for busybox.
- The debug packages once built can be found in
- build/tmp/deploy/rpm/* on the host system. Find the
- busybox-dbg-...rpm file and copy it to the target. For example:
- <literallayout class='monospaced'>
+ </literallayout>
+ Once we've done that, we can install the debuginfo for busybox.
+ The debug packages once built can be found in
+ build/tmp/deploy/rpm/* on the host system. Find the
+ busybox-dbg-...rpm file and copy it to the target. For example:
+ <literallayout class='monospaced'>
[trz@empanada core2]$ scp /home/trz/yocto/crownbay-tracing-dbg/build/tmp/deploy/rpm/core2/busybox-dbg-1.20.2-r2.core2.rpm root@192.168.1.31:
root@192.168.1.31's password:
busybox-dbg-1.20.2-r2.core2.rpm 100% 1826KB 1.8MB/s 00:01
- </literallayout>
- Now install the debug rpm on the target:
- <literallayout class='monospaced'>
+ </literallayout>
+ Now install the debug rpm on the target:
+ <literallayout class='monospaced'>
root@crownbay:~# rpm -i busybox-dbg-1.20.2-r2.core2.rpm
- </literallayout>
- Now that the debuginfo is installed, we see that the busybox
- entries display their functions symbolically:
- </para>
+ </literallayout>
+ Now that the debuginfo is installed, we see that the busybox
+ entries display their functions symbolically:
+ </para>
- <para>
- <imagedata fileref="figures/perf-wget-busybox-debuginfo.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/perf-wget-busybox-debuginfo.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
- <para>
- If we expand one of the entries and press 'enter' on a leaf node,
- we're presented with a menu of actions we can take to get more
- information related to that entry:
- </para>
+ <para>
+ If we expand one of the entries and press 'enter' on a leaf node,
+ we're presented with a menu of actions we can take to get more
+ information related to that entry:
+ </para>
- <para>
- <imagedata fileref="figures/perf-wget-busybox-dso-zoom-menu.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/perf-wget-busybox-dso-zoom-menu.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
- <para>
- One of these actions allows us to show a view that displays a
- busybox-centric view of the profiled functions (in this case we've
- also expanded all the nodes using the 'E' key):
- </para>
+ <para>
+ One of these actions allows us to show a view that displays a
+ busybox-centric view of the profiled functions (in this case we've
+ also expanded all the nodes using the 'E' key):
+ </para>
- <para>
- <imagedata fileref="figures/perf-wget-busybox-dso-zoom.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/perf-wget-busybox-dso-zoom.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
- <para>
- Finally, we can see that with the busybox debuginfo
- installed, the previously unresolved symbol in the
- sys_clock_gettime() entry is now resolved,
- and shows that the sys_clock_gettime system call that was the
- source of 6.75% of the copy-to-user overhead was initiated by
- the handle_input() busybox function:
- </para>
+ <para>
+ Finally, we can see that with the busybox debuginfo
+ installed, the previously unresolved symbol in the
+ sys_clock_gettime() entry is now resolved,
+ and shows that the sys_clock_gettime system call that was the
+ source of 6.75% of the copy-to-user overhead was initiated by
+ the handle_input() busybox function:
+ </para>
- <para>
- <imagedata fileref="figures/perf-wget-g-copy-to-user-expanded-debuginfo.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/perf-wget-g-copy-to-user-expanded-debuginfo.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
- <para>
- At the lowest level of detail, we can dive down to the assembly
- level and see which instructions caused the most overhead in a
- function. Pressing 'enter' on the 'udhcpc_main' function, we're
- again presented with a menu:
- </para>
+ <para>
+ At the lowest level of detail, we can dive down to the assembly
+ level and see which instructions caused the most overhead in a
+ function. Pressing 'enter' on the 'udhcpc_main' function, we're
+ again presented with a menu:
+ </para>
- <para>
- <imagedata fileref="figures/perf-wget-busybox-annotate-menu.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/perf-wget-busybox-annotate-menu.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
- <para>
- Selecting 'Annotate udhcpc_main', we get a detailed listing of
- percentages by instruction for the udhcpc_main function. From the
- display, we can see that over 50% of the time spent in this
- function is taken up by a couple of tests and the move of a
- constant (1) to a register:
- </para>
+ <para>
+ Selecting 'Annotate udhcpc_main', we get a detailed listing of
+ percentages by instruction for the udhcpc_main function. From the
+ display, we can see that over 50% of the time spent in this
+ function is taken up by a couple of tests and the move of a
+ constant (1) to a register:
+ </para>
- <para>
- <imagedata fileref="figures/perf-wget-busybox-annotate-udhcpc.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/perf-wget-busybox-annotate-udhcpc.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
- <para>
- As a segue into tracing, let's try another profile using a
- different counter, something other than the default 'cycles'.
- </para>
+ <para>
+ As a segue into tracing, let's try another profile using a
+ different counter, something other than the default 'cycles'.
+ </para>
- <para>
- The tracing and profiling infrastructure in Linux has become
- unified in a way that allows us to use the same tool with a
- completely different set of counters, not just the standard
- hardware counters that tools have traditionally had to restrict
- themselves to (of course the traditional tools can also make use
- of the expanded possibilities now available to them, and in some
- cases have, as mentioned previously).
- </para>
+ <para>
+ The tracing and profiling infrastructure in Linux has become
+ unified in a way that allows us to use the same tool with a
+ completely different set of counters, not just the standard
+ hardware counters that tools have traditionally had to restrict
+ themselves to (of course the traditional tools can also make use
+ of the expanded possibilities now available to them, and in some
+ cases have, as mentioned previously).
+ </para>
- <para>
- We can get a list of the available events that can be used to
- profile a workload via 'perf list':
- <literallayout class='monospaced'>
+ <para>
+ We can get a list of the available events that can be used to
+ profile a workload via 'perf list':
+ <literallayout class='monospaced'>
root@crownbay:~# perf list
List of pre-defined events (to be used in -e):
@@ -602,23 +602,23 @@
syscalls:sys_exit_unshare [Tracepoint event]
raw_syscalls:sys_enter [Tracepoint event]
raw_syscalls:sys_exit [Tracepoint event]
- </literallayout>
- </para>
+ </literallayout>
+ </para>
- <note>
- Tying It Together: These are exactly the same set of events defined
- by the trace event subsystem and exposed by
- ftrace/tracecmd/kernelshark as files in
- /sys/kernel/debug/tracing/events, by SystemTap as
- kernel.trace("tracepoint_name") and (partially) accessed by LTTng.
- </note>
+ <note>
+ Tying It Together: These are exactly the same set of events defined
+ by the trace event subsystem and exposed by
+ ftrace/tracecmd/kernelshark as files in
+ /sys/kernel/debug/tracing/events, by SystemTap as
+ kernel.trace("tracepoint_name") and (partially) accessed by LTTng.
+ </note>
- <para>
- Only a subset of these would be of interest to us when looking at
- this workload, so let's choose the most likely subsystems
- (identified by the string before the colon in the Tracepoint events)
- and do a 'perf stat' run using only those wildcarded subsystems:
- <literallayout class='monospaced'>
+ <para>
+ Only a subset of these would be of interest to us when looking at
+ this workload, so let's choose the most likely subsystems
+ (identified by the string before the colon in the Tracepoint events)
+ and do a 'perf stat' run using only those wildcarded subsystems:
+ <literallayout class='monospaced'>
root@crownbay:~# perf stat -e skb:* -e net:* -e napi:* -e sched:* -e workqueue:* -e irq:* -e syscalls:* wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
Performance counter stats for 'wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2':
@@ -675,26 +675,26 @@
.
58.029710972 seconds time elapsed
- </literallayout>
- Let's pick one of these tracepoints and tell perf to do a profile
- using it as the sampling event:
- <literallayout class='monospaced'>
+ </literallayout>
+ Let's pick one of these tracepoints and tell perf to do a profile
+ using it as the sampling event:
+ <literallayout class='monospaced'>
root@crownbay:~# perf record -g -e sched:sched_wakeup wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
- </literallayout>
- </para>
+ </literallayout>
+ </para>
- <para>
- <imagedata fileref="figures/sched-wakeup-profile.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/sched-wakeup-profile.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
- <para>
- The screenshot above shows the results of running a profile using
- the sched:sched_wakeup tracepoint, which shows the relative costs of
- various paths to sched_wakeup (note that sched_wakeup is the
- name of the tracepoint - it's actually defined just inside
- ttwu_do_wakeup(), which accounts for the function name actually
- displayed in the profile):
- <literallayout class='monospaced'>
+ <para>
+ The screenshot above shows the results of running a profile using
+ the sched:sched_wakeup tracepoint, which shows the relative costs of
+ various paths to sched_wakeup (note that sched_wakeup is the
+ name of the tracepoint - it's actually defined just inside
+ ttwu_do_wakeup(), which accounts for the function name actually
+ displayed in the profile):
+ <literallayout class='monospaced'>
/*
* Mark the task runnable and perform wakeup-preemption.
*/
@@ -706,55 +706,55 @@
.
.
}
- </literallayout>
- A couple of the more interesting callchains are expanded and
- displayed above, basically some network receive paths that
- presumably end up waking up wget (busybox) when network data is
- ready.
- </para>
+ </literallayout>
+ A couple of the more interesting callchains are expanded and
+ displayed above, basically some network receive paths that
+ presumably end up waking up wget (busybox) when network data is
+ ready.
+ </para>
- <para>
- Note that because tracepoints are normally used for tracing,
- the default sampling period for tracepoints is 1 i.e. for
- tracepoints perf will sample on every event occurrence (this
- can be changed using the -c option). This is in contrast to
- hardware counters such as the default 'cycles'
- hardware counter used for normal profiling, where sampling
- periods are much higher (in the thousands) because profiling should
- have as low an overhead as possible and sampling on every cycle w
- ould be prohibitively expensive.
- </para>
- </section>
+ <para>
+ Note that because tracepoints are normally used for tracing,
+ the default sampling period for tracepoints is 1 i.e. for
+ tracepoints perf will sample on every event occurrence (this
+ can be changed using the -c option). This is in contrast to
+ hardware counters such as the default 'cycles'
+ hardware counter used for normal profiling, where sampling
+ periods are much higher (in the thousands) because profiling should
+ have as low an overhead as possible and sampling on every cycle
+ would be prohibitively expensive.
+ </para>
+ </section>
- <section id='using-perf-to-do-basic-tracing'>
- <title>Using perf to do Basic Tracing</title>
+ <section id='using-perf-to-do-basic-tracing'>
+ <title>Using perf to do Basic Tracing</title>
- <para>
- Profiling is a great tool for solving many problems or for
- getting a high-level view of what's going on with a workload or
- across the system. It is however by definition an approximation,
- as suggested by the most prominent word associated with it,
- 'sampling'. On the one hand, it allows a representative picture of
- what's going on in the system to be cheaply taken, but on the other
- hand, that cheapness limits its utility when that data suggests a
- need to 'dive down' more deeply to discover what's really going
- on. In such cases, the only way to see what's really going on is
- to be able to look at (or summarize more intelligently) the
- individual steps that go into the higher-level behavior exposed
- by the coarse-grained profiling data.
- </para>
+ <para>
+ Profiling is a great tool for solving many problems or for
+ getting a high-level view of what's going on with a workload or
+ across the system. It is however by definition an approximation,
+ as suggested by the most prominent word associated with it,
+ 'sampling'. On the one hand, it allows a representative picture of
+ what's going on in the system to be cheaply taken, but on the other
+ hand, that cheapness limits its utility when that data suggests a
+ need to 'dive down' more deeply to discover what's really going
+ on. In such cases, the only way to see what's really going on is
+ to be able to look at (or summarize more intelligently) the
+ individual steps that go into the higher-level behavior exposed
+ by the coarse-grained profiling data.
+ </para>
- <para>
- As a concrete example, we can trace all the events we think might
- be applicable to our workload:
- <literallayout class='monospaced'>
+ <para>
+ As a concrete example, we can trace all the events we think might
+ be applicable to our workload:
+ <literallayout class='monospaced'>
root@crownbay:~# perf record -g -e skb:* -e net:* -e napi:* -e sched:sched_switch -e sched:sched_wakeup -e irq:*
-e syscalls:sys_enter_read -e syscalls:sys_exit_read -e syscalls:sys_enter_write -e syscalls:sys_exit_write
wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
- </literallayout>
- We can look at the raw trace output using 'perf script' with no
- arguments:
- <literallayout class='monospaced'>
+ </literallayout>
+ We can look at the raw trace output using 'perf script' with no
+ arguments:
+ <literallayout class='monospaced'>
root@crownbay:~# perf script
perf 1262 [000] 11624.857082: sys_exit_read: 0x0
@@ -779,90 +779,90 @@
wget 1262 [001] 11624.859888: sys_exit_read: 0x400
wget 1262 [001] 11624.859935: sys_enter_read: fd: 0x0003, buf: 0xb7720000, count: 0x0400
wget 1262 [001] 11624.859944: sys_exit_read: 0x400
- </literallayout>
- This gives us a detailed timestamped sequence of events that
- occurred within the workload with respect to those events.
- </para>
+ </literallayout>
+ This gives us a detailed timestamped sequence of events that
+ occurred within the workload with respect to those events.
+ </para>
- <para>
- In many ways, profiling can be viewed as a subset of tracing -
- theoretically, if you have a set of trace events that's sufficient
- to capture all the important aspects of a workload, you can derive
- any of the results or views that a profiling run can.
- </para>
+ <para>
+ In many ways, profiling can be viewed as a subset of tracing -
+ theoretically, if you have a set of trace events that's sufficient
+ to capture all the important aspects of a workload, you can derive
+ any of the results or views that a profiling run can.
+ </para>
- <para>
- Another aspect of traditional profiling is that while powerful in
- many ways, it's limited by the granularity of the underlying data.
- Profiling tools offer various ways of sorting and presenting the
- sample data, which make it much more useful and amenable to user
- experimentation, but in the end it can't be used in an open-ended
- way to extract data that simply isn't there - conceptually,
- most of it has been thrown away.
- </para>
+ <para>
+ Another aspect of traditional profiling is that while powerful in
+ many ways, it's limited by the granularity of the underlying data.
+ Profiling tools offer various ways of sorting and presenting the
+ sample data, which make it much more useful and amenable to user
+ experimentation, but in the end it can't be used in an open-ended
+ way to extract data that simply isn't there - conceptually,
+ most of it has been thrown away.
+ </para>
- <para>
- Full-blown detailed tracing data does however offer the opportunity
- to manipulate and present the information collected during a
- tracing run in an infinite variety of ways.
- </para>
+ <para>
+ Full-blown detailed tracing data does however offer the opportunity
+ to manipulate and present the information collected during a
+ tracing run in an infinite variety of ways.
+ </para>
- <para>
- Another way to look at it is that there are only so many ways that
- the 'primitive' counters can be used on their own to generate
- interesting output; to get anything more complicated than simple
- counts requires some amount of additional logic, which is typically
- very specific to the problem at hand. For example, if we wanted to
- make use of a 'counter' that maps to the value of the time
- difference between when a process was scheduled to run on a
- processor and the time it actually ran, we wouldn't expect such
- a counter to exist on its own, but we could derive one called say
- 'wakeup_latency' and use it to extract a useful view of that metric
- from trace data. Likewise, we really can't figure out from standard
- profiling tools how much data every process on the system reads and
- writes, along with how many of those reads and writes fail
- completely. If we have sufficient trace data, however, we could
- with the right tools easily extract and present that information,
- but we'd need something other than pre-canned profiling tools to
- do that.
- </para>
+ <para>
+ Another way to look at it is that there are only so many ways that
+ the 'primitive' counters can be used on their own to generate
+ interesting output; to get anything more complicated than simple
+ counts requires some amount of additional logic, which is typically
+ very specific to the problem at hand. For example, if we wanted to
+ make use of a 'counter' that maps to the value of the time
+ difference between when a process was scheduled to run on a
+ processor and the time it actually ran, we wouldn't expect such
+ a counter to exist on its own, but we could derive one called say
+ 'wakeup_latency' and use it to extract a useful view of that metric
+ from trace data. Likewise, we really can't figure out from standard
+ profiling tools how much data every process on the system reads and
+ writes, along with how many of those reads and writes fail
+ completely. If we have sufficient trace data, however, we could
+ with the right tools easily extract and present that information,
+ but we'd need something other than pre-canned profiling tools to
+ do that.
+ </para>
- <para>
- Luckily, there is a general-purpose way to handle such needs,
- called 'programming languages'. Making a programming language
- easily available to apply to such problems, given the specific
- format of the data, is called a 'programming language binding' for
- that data and language. Perf supports two programming language
- bindings, one for Python and one for Perl.
- </para>
+ <para>
+ Luckily, there is a general-purpose way to handle such needs,
+ called 'programming languages'. Making a programming language
+ easily available to apply to such problems, given the specific
+ format of the data, is called a 'programming language binding' for
+ that data and language. Perf supports two programming language
+ bindings, one for Python and one for Perl.
+ </para>
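<para>
    As an illustration of the 'wakeup_latency' idea mentioned earlier,
    here is a minimal sketch of such a derived metric written against
    the Python binding. The handler names and signatures follow the
    skeleton that 'perf script -g python' generates (described below);
    this is hypothetical example code, not part of the stock perf
    scripts, and assumes a perf.data recorded with both the
    sched:sched_wakeup and sched:sched_switch tracepoints:
    <literallayout class='monospaced'>
# Remember when each pid was woken, then subtract that timestamp from
# the time the pid is actually switched in.
wakeup_ts = {}     # pid -> timestamp (nsecs) of last sched_wakeup
latencies = {}     # pid -> list of wakeup latencies in nsecs

def sched__sched_wakeup(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        comm, pid, prio, success, target_cpu):
    wakeup_ts[pid] = common_secs * 1000000000 + common_nsecs

def sched__sched_switch(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        prev_comm, prev_pid, prev_prio, prev_state,
        next_comm, next_pid, next_prio):
    if next_pid in wakeup_ts:
        now = common_secs * 1000000000 + common_nsecs
        latencies.setdefault(next_pid, []).append(now - wakeup_ts.pop(next_pid))

def trace_end():
    for pid, vals in latencies.iteritems():
        print "pid %5d: avg wakeup_latency %8d nsecs" % \
            (pid, sum(vals) / len(vals))
    </literallayout>
</para>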
- <note>
- Tying It Together: Language bindings for manipulating and
- aggregating trace data are of course not a new
- idea. One of the first projects to do this was IBM's DProbes
- dpcc compiler, an ANSI C compiler which targeted a low-level
- assembly language running on an in-kernel interpreter on the
- target system. This is exactly analogous to what Sun's DTrace
- did, except that DTrace invented its own language for the purpose.
- Systemtap, heavily inspired by DTrace, also created its own
- one-off language, but rather than running the product on an
- in-kernel interpreter, created an elaborate compiler-based
- machinery to translate its language into kernel modules written
- in C.
- </note>
+ <note>
+ Tying It Together: Language bindings for manipulating and
+ aggregating trace data are of course not a new
+ idea. One of the first projects to do this was IBM's DProbes
+ dpcc compiler, an ANSI C compiler which targeted a low-level
+ assembly language running on an in-kernel interpreter on the
+ target system. This is exactly analogous to what Sun's DTrace
+ did, except that DTrace invented its own language for the purpose.
+ Systemtap, heavily inspired by DTrace, also created its own
+ one-off language, but rather than running the product on an
+ in-kernel interpreter, created an elaborate compiler-based
+ machinery to translate its language into kernel modules written
+ in C.
+ </note>
- <para>
- Now that we have the trace data in perf.data, we can use
- 'perf script -g' to generate a skeleton script with handlers
- for the read/write entry/exit events we recorded:
- <literallayout class='monospaced'>
+ <para>
+ Now that we have the trace data in perf.data, we can use
+ 'perf script -g' to generate a skeleton script with handlers
+ for the read/write entry/exit events we recorded:
+ <literallayout class='monospaced'>
root@crownbay:~# perf script -g python
generated Python script: perf-script.py
- </literallayout>
- The skeleton script simply creates a Python function for each
- event type in the perf.data file. The body of each function
- prints the event name along with its parameters. For example:
- <literallayout class='monospaced'>
+ </literallayout>
+ The skeleton script simply creates a Python function for each
+ event type in the perf.data file. The body of each function
+ prints the event name along with its parameters. For example:
+ <literallayout class='monospaced'>
def net__netif_rx(event_name, context, common_cpu,
common_secs, common_nsecs, common_pid, common_comm,
skbaddr, len, name):
@@ -870,10 +870,10 @@
common_pid, common_comm)
print "skbaddr=%u, len=%u, name=%s\n" % (skbaddr, len, name),
- </literallayout>
- We can run that script directly to print all of the events
- contained in the perf.data file:
- <literallayout class='monospaced'>
+ </literallayout>
+ We can run that script directly to print all of the events
+ contained in the perf.data file:
+ <literallayout class='monospaced'>
root@crownbay:~# perf script -s perf-script.py
in trace_begin
@@ -899,39 +899,39 @@
syscalls__sys_exit_read 1 11624.859888770 1262 wget nr=3, ret=1024
syscalls__sys_enter_read 1 11624.859935140 1262 wget nr=3, fd=3, buf=3077701632, count=1024
syscalls__sys_exit_read 1 11624.859944032 1262 wget nr=3, ret=1024
- </literallayout>
- That in itself isn't very useful; after all, we can accomplish
- pretty much the same thing by simply running 'perf script'
- without arguments in the same directory as the perf.data file.
- </para>
+ </literallayout>
+ That in itself isn't very useful; after all, we can accomplish
+ pretty much the same thing by simply running 'perf script'
+ without arguments in the same directory as the perf.data file.
+ </para>
- <para>
- We can however replace the print statements in the generated
- function bodies with whatever we want, and thereby make it
- infinitely more useful.
- </para>
+ <para>
+ We can however replace the print statements in the generated
+ function bodies with whatever we want, and thereby make it
+ infinitely more useful.
+ </para>
- <para>
- As a simple example, let's just replace the print statements in
- the function bodies with a simple function that does nothing but
- increment a per-event count. When the program is run against a
- perf.data file, each time a particular event is encountered,
- a tally is incremented for that event. For example:
- <literallayout class='monospaced'>
+ <para>
+ As a simple example, let's just replace the print statements in
+ the function bodies with a simple function that does nothing but
+ increment a per-event count. When the program is run against a
+ perf.data file, each time a particular event is encountered,
+ a tally is incremented for that event. For example:
+ <literallayout class='monospaced'>
def net__netif_rx(event_name, context, common_cpu,
common_secs, common_nsecs, common_pid, common_comm,
skbaddr, len, name):
inc_counts(event_name)
- </literallayout>
- Each event handler function in the generated code is modified
- to do this. For convenience, we define a common function called
- inc_counts() that each handler calls; inc_counts simply tallies
- a count for each event using the 'counts' hash, which is a
- specialized hash that supports Perl-like autovivification, a
- capability that's extremely useful for the kinds of multi-level
- aggregation commonly used in processing traces (see perf's
- documentation on the Python language binding for details):
- <literallayout class='monospaced'>
+ </literallayout>
+ Each event handler function in the generated code is modified
+ to do this. For convenience, we define a common function called
+ inc_counts() that each handler calls; inc_counts simply tallies
+ a count for each event using the 'counts' hash, which is a
+ specialized hash that supports Perl-like autovivification, a
+ capability that's extremely useful for the kinds of multi-level
+ aggregation commonly used in processing traces (see perf's
+ documentation on the Python language binding for details):
+ <literallayout class='monospaced'>
counts = autodict()
def inc_counts(event_name):
@@ -939,18 +939,18 @@
counts[event_name] += 1
except TypeError:
counts[event_name] = 1
- </literallayout>
- Finally, at the end of the trace processing run, we want to
- print the results of all the per-event tallies. For that, we
- use the special 'trace_end()' function:
- <literallayout class='monospaced'>
+ </literallayout>
+ Finally, at the end of the trace processing run, we want to
+ print the results of all the per-event tallies. For that, we
+ use the special 'trace_end()' function:
+ <literallayout class='monospaced'>
def trace_end():
for event_name, count in counts.iteritems():
print "%-40s %10s\n" % (event_name, count)
- </literallayout>
- The end result is a summary of all the events recorded in the
- trace:
- <literallayout>
+ </literallayout>
+ The end result is a summary of all the events recorded in the
+ trace:
+ <literallayout>
skb__skb_copy_datagram_iovec 13148
irq__softirq_entry 4796
irq__irq_handler_exit 3805
@@ -968,112 +968,112 @@
net__netif_rx 2
napi__napi_poll 1877
syscalls__sys_exit_write 8990
- </literallayout>
- Note that this is pretty much exactly the same information we get
- from 'perf stat', which goes a little way toward supporting the idea
- mentioned previously that given the right kind of trace data,
- higher-level profiling-type summaries can be derived from it.
- </para>
+ </literallayout>
+ Note that this is pretty much exactly the same information we get
+ from 'perf stat', which goes a little way toward supporting the idea
+ mentioned previously that given the right kind of trace data,
+ higher-level profiling-type summaries can be derived from it.
+ </para>
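<para>
    For reference, the autovivification that autodict() provides behaves
    roughly like the sketch below (the real implementation ships with
    perf's Python support code, so you normally just import it). The
    two-level inc_counts() variant shown here is a hypothetical
    extension of the example above, illustrating the kind of multi-level
    aggregation that autovivification makes easy:
    <literallayout class='monospaced'>
# A missing key springs into existence as another autodict, so
# multi-level paths can be assigned without pre-creating the
# intermediate levels.
class autodict(dict):
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

counts = autodict()

# Tally events per process name rather than globally.
def inc_counts(comm, event_name):
    try:
        counts[comm][event_name] += 1
    except TypeError:
        counts[comm][event_name] = 1
    </literallayout>
</para>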
- <para>
- For more information, see the documentation on using the
- <ulink url='http://linux.die.net/man/1/perf-script-python'>'perf script' Python binding</ulink>.
- </para>
- </section>
+ <para>
+ For more information, see the documentation on using the
+ <ulink url='http://linux.die.net/man/1/perf-script-python'>'perf script' Python binding</ulink>.
+ </para>
+ </section>
- <section id='system-wide-tracing-and-profiling'>
- <title>System-Wide Tracing and Profiling</title>
+ <section id='system-wide-tracing-and-profiling'>
+ <title>System-Wide Tracing and Profiling</title>
- <para>
- The examples so far have focused on tracing a particular program or
- workload - in other words, every profiling run has specified the
- program to profile on the command line, e.g. 'perf record wget ...'.
- </para>
+ <para>
+ The examples so far have focused on tracing a particular program or
+ workload - in other words, every profiling run has specified the
+ program to profile on the command line, e.g. 'perf record wget ...'.
+ </para>
- <para>
- It's also possible, and more interesting in many cases, to run a
- system-wide profile or trace while running the workload in a
- separate shell.
- </para>
+ <para>
+ It's also possible, and more interesting in many cases, to run a
+ system-wide profile or trace while running the workload in a
+ separate shell.
+ </para>
- <para>
- To do system-wide profiling or tracing, you typically use
- the -a flag to 'perf record'.
- </para>
+ <para>
+ To do system-wide profiling or tracing, you typically use
+ the -a flag to 'perf record'.
+ </para>
- <para>
- To demonstrate this, open up one window and start the profile
- using the -a flag (press Ctrl-C to stop tracing):
- <literallayout class='monospaced'>
+ <para>
+ To demonstrate this, open up one window and start the profile
+ using the -a flag (press Ctrl-C to stop tracing):
+ <literallayout class='monospaced'>
root@crownbay:~# perf record -g -a
^C[ perf record: Woken up 6 times to write data ]
[ perf record: Captured and wrote 1.400 MB perf.data (~61172 samples) ]
- </literallayout>
- In another window, run the wget test:
- <literallayout class='monospaced'>
+ </literallayout>
+ In another window, run the wget test:
+ <literallayout class='monospaced'>
root@crownbay:~# wget http://downloads.yoctoproject.org/mirror/sources/linux-2.6.19.2.tar.bz2
Connecting to downloads.yoctoproject.org (140.211.169.59:80)
linux-2.6.19.2.tar.b 100% |*******************************| 41727k 0:00:00 ETA
- </literallayout>
- Here we see entries not only for our wget load, but for other
- processes running on the system as well:
- </para>
+ </literallayout>
+ Here we see entries not only for our wget load, but for other
+ processes running on the system as well:
+ </para>
- <para>
- <imagedata fileref="figures/perf-systemwide.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/perf-systemwide.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
- <para>
- In the snapshot above, we can see callchains that originate in
- libc, and a callchain from Xorg that demonstrates that we're
- using a proprietary X driver in userspace (notice the presence
- of 'PVR' and some other unresolvable symbols in the expanded
- Xorg callchain).
- </para>
+ <para>
+ In the snapshot above, we can see callchains that originate in
+ libc, and a callchain from Xorg that demonstrates that we're
+ using a proprietary X driver in userspace (notice the presence
+ of 'PVR' and some other unresolvable symbols in the expanded
+ Xorg callchain).
+ </para>
- <para>
- Note also that we have both kernel and userspace entries in the
- above snapshot. We can also tell perf to focus on userspace by
- providing a modifier, in this case 'u', to the 'cycles' hardware
- counter when we record a profile:
- <literallayout class='monospaced'>
+ <para>
+ Note also that we have both kernel and userspace entries in the
+ above snapshot. We can also tell perf to focus on userspace by
+ providing a modifier, in this case 'u', to the 'cycles' hardware
+ counter when we record a profile:
+ <literallayout class='monospaced'>
root@crownbay:~# perf record -g -a -e cycles:u
^C[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.376 MB perf.data (~16443 samples) ]
- </literallayout>
- </para>
+ </literallayout>
+ </para>
- <para>
- <imagedata fileref="figures/perf-report-cycles-u.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/perf-report-cycles-u.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
- <para>
- Notice that in the screenshot above we see only userspace entries ([.]).
- </para>
+ <para>
+ Notice that in the screenshot above we see only userspace entries ([.]).
+ </para>
- <para>
- Finally, we can press 'enter' on a leaf node and select the 'Zoom
- into DSO' menu item to show only entries associated with a
- specific DSO. In the screenshot below, we've zoomed into the
- 'libc' DSO which shows all the entries associated with the
- libc-xxx.so DSO.
- </para>
+ <para>
+ Finally, we can press 'enter' on a leaf node and select the 'Zoom
+ into DSO' menu item to show only entries associated with a
+ specific DSO. In the screenshot below, we've zoomed into the
+ 'libc' DSO which shows all the entries associated with the
+ libc-xxx.so DSO.
+ </para>
- <para>
- <imagedata fileref="figures/perf-systemwide-libc.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/perf-systemwide-libc.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
- <para>
- We can also use the system-wide -a switch to do system-wide
- tracing. Here we'll trace a couple of scheduler events:
- <literallayout class='monospaced'>
+ <para>
+ We can also use the system-wide -a switch to do system-wide
+ tracing. Here we'll trace a couple of scheduler events:
+ <literallayout class='monospaced'>
root@crownbay:~# perf record -a -e sched:sched_switch -e sched:sched_wakeup
^C[ perf record: Woken up 38 times to write data ]
[ perf record: Captured and wrote 9.780 MB perf.data (~427299 samples) ]
- </literallayout>
- We can look at the raw output using 'perf script' with no
- arguments:
- <literallayout class='monospaced'>
+ </literallayout>
+ We can look at the raw output using 'perf script' with no
+ arguments:
+ <literallayout class='monospaced'>
root@crownbay:~# perf script
perf 1383 [001] 6171.460045: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
@@ -1086,21 +1086,21 @@
perf 1383 [001] 6171.470058: sched_switch: prev_comm=perf prev_pid=1383 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:1 next_pid=21 next_prio=120
kworker/1:1 21 [001] 6171.470082: sched_switch: prev_comm=kworker/1:1 prev_pid=21 prev_prio=120 prev_state=S ==> next_comm=perf next_pid=1383 next_prio=120
perf 1383 [001] 6171.480035: sched_wakeup: comm=kworker/1:1 pid=21 prio=120 success=1 target_cpu=001
- </literallayout>
- </para>
-
- <section id='perf-filtering'>
- <title>Filtering</title>
+ </literallayout>
+ </para>
- <para>
- Notice that there are a lot of events that don't really have
- anything to do with what we're interested in, namely events
- that schedule 'perf' itself in and out or that wake perf up.
- We can get rid of those by using the '--filter' option -
- for each event we specify using -e, we can add a --filter
- after that to filter out trace events that contain fields
- with specific values:
- <literallayout class='monospaced'>
+ <section id='perf-filtering'>
+ <title>Filtering</title>
+
+ <para>
+ Notice that there are a lot of events that don't really have
+ anything to do with what we're interested in, namely events
+ that schedule 'perf' itself in and out or that wake perf up.
+ We can get rid of those by using the '--filter' option -
+ for each event we specify using -e, we can add a --filter
+ after that to filter out trace events that contain fields
+ with specific values:
+ <literallayout class='monospaced'>
root@crownbay:~# perf record -a -e sched:sched_switch --filter 'next_comm != perf &amp;&amp; prev_comm != perf' -e sched:sched_wakeup --filter 'comm != perf'
^C[ perf record: Woken up 38 times to write data ]
[ perf record: Captured and wrote 9.688 MB perf.data (~423279 samples) ]
@@ -1122,59 +1122,59 @@
swapper 0 [000] 7932.326109: sched_wakeup: comm=kworker/0:3 pid=1209 prio=120 success=1 target_cpu=000
swapper 0 [000] 7932.326171: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/0:3 next_pid=1209 next_prio=120
kworker/0:3 1209 [000] 7932.326214: sched_switch: prev_comm=kworker/0:3 prev_pid=1209 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
- </literallayout>
- In this case, we've filtered out all events that have 'perf'
- in their 'comm', 'prev_comm', or 'next_comm' fields. Notice
- that there are still events recorded for perf, but that
- those events don't have values of 'perf' for the filtered
- fields. Completely filtering out anything from perf would
- require a bit more work, but for the purpose of demonstrating
- how to use filters, it's close enough.
- </para>
-
- <note>
- Tying It Together: These are exactly the same set of event
- filters defined by the trace event subsystem. See the
- ftrace/tracecmd/kernelshark section for more discussion about
- these event filters.
- </note>
-
- <note>
- Tying It Together: These event filters are implemented by a
- special-purpose pseudo-interpreter in the kernel and are an
- integral and indispensable part of the perf design as it
- relates to tracing. Kernel-based event filters provide a
- mechanism to precisely throttle the event stream that appears
- in user space, where it makes sense to provide bindings to real
- programming languages for postprocessing the event stream.
- This architecture allows for the intelligent and flexible
- partitioning of processing between the kernel and user space.
- Contrast this with other tools such as SystemTap, which does
- all of its processing in the kernel and as such requires a
- special project-defined language in order to accommodate that
- design, or LTTng, where everything is sent to userspace and
- as such requires a super-efficient kernel-to-userspace
- transport mechanism in order to function properly. While
- perf can certainly benefit from, for instance, advances in
- the design of the transport, it doesn't fundamentally depend
- on them. Basically, if you find that your perf tracing
- application is causing buffer I/O overruns, it probably
- means that you aren't taking enough advantage of the
- kernel filtering engine.
- </note>
+ </literallayout>
+ In this case, we've filtered out all events that have 'perf'
+ in their 'comm', 'prev_comm', or 'next_comm' fields. Notice
+ that there are still events recorded for perf, but that
+ those events don't have values of 'perf' for the filtered
+ fields. Completely filtering out anything from perf would
+ require a bit more work, but for the purpose of demonstrating
+ how to use filters, it's close enough.
+ </para>
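+ <para>
+ The filter syntax isn't limited to string matches - numeric
+ fields can be tested too. As a sketch using the 'target_cpu'
+ field that appears in the sched_wakeup output above, we could
+ record only the wakeups targeting cpu 0:
+ <literallayout class='monospaced'>
+ root@crownbay:~# perf record -a -e sched:sched_wakeup --filter 'target_cpu == 0'
+ </literallayout>
+ </para>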
+
+ <note>
+ Tying It Together: These are exactly the same set of event
+ filters defined by the trace event subsystem. See the
+ ftrace/tracecmd/kernelshark section for more discussion about
+ these event filters.
+ </note>
+
+ <note>
+ Tying It Together: These event filters are implemented by a
+ special-purpose pseudo-interpreter in the kernel and are an
+ integral and indispensable part of the perf design as it
+ relates to tracing. Kernel-based event filters provide a
+ mechanism to precisely throttle the event stream that appears
+ in user space, where it makes sense to provide bindings to real
+ programming languages for postprocessing the event stream.
+ This architecture allows for the intelligent and flexible
+ partitioning of processing between the kernel and user space.
+ Contrast this with other tools such as SystemTap, which does
+ all of its processing in the kernel and as such requires a
+ special project-defined language in order to accommodate that
+ design, or LTTng, where everything is sent to userspace and
+ as such requires a super-efficient kernel-to-userspace
+ transport mechanism in order to function properly. While
+ perf can certainly benefit from, for instance, advances in
+ the design of the transport, it doesn't fundamentally depend
+ on them. Basically, if you find that your perf tracing
+ application is causing buffer I/O overruns, it probably
+ means that you aren't taking enough advantage of the
+ kernel filtering engine.
+ </note>
+ </section>
</section>
- </section>
- <section id='using-dynamic-tracepoints'>
- <title>Using Dynamic Tracepoints</title>
+ <section id='using-dynamic-tracepoints'>
+ <title>Using Dynamic Tracepoints</title>
- <para>
- perf isn't restricted to the fixed set of static tracepoints
- listed by 'perf list'. Users can also add their own 'dynamic'
- tracepoints anywhere in the kernel. For instance, suppose we
- want to define our own tracepoint on do_fork(). We can do that
- using the 'perf probe' perf subcommand:
- <literallayout class='monospaced'>
+ <para>
+ perf isn't restricted to the fixed set of static tracepoints
+ listed by 'perf list'. Users can also add their own 'dynamic'
+ tracepoints anywhere in the kernel. For instance, suppose we
+ want to define our own tracepoint on do_fork(). We can do that
+ using the 'perf probe' perf subcommand:
+ <literallayout class='monospaced'>
root@crownbay:~# perf probe do_fork
Added new event:
probe:do_fork (on do_fork)
@@ -1182,13 +1182,13 @@
You can now use it in all perf tools, such as:
perf record -e probe:do_fork -aR sleep 1
- </literallayout>
- Adding a new tracepoint via 'perf probe' results in an event
- with all the expected files and format in
- /sys/kernel/debug/tracing/events, just the same as for static
- tracepoints (as discussed in more detail in the trace events
- subsystem section):
- <literallayout class='monospaced'>
+ </literallayout>
+ Adding a new tracepoint via 'perf probe' results in an event
+ with all the expected files and format in
+ /sys/kernel/debug/tracing/events, just the same as for static
+ tracepoints (as discussed in more detail in the trace events
+ subsystem section):
+ <literallayout class='monospaced'>
root@crownbay:/sys/kernel/debug/tracing/events/probe/do_fork# ls -al
drwxr-xr-x 2 root root 0 Oct 28 11:42 .
drwxr-xr-x 3 root root 0 Oct 28 11:42 ..
@@ -1210,23 +1210,23 @@
field:unsigned long __probe_ip; offset:12; size:4; signed:0;
print fmt: "(%lx)", REC->__probe_ip
- </literallayout>
- We can list all dynamic tracepoints currently in existence:
- <literallayout class='monospaced'>
+ </literallayout>
+ We can list all dynamic tracepoints currently in existence:
+ <literallayout class='monospaced'>
root@crownbay:~# perf probe -l
probe:do_fork (on do_fork)
probe:schedule (on schedule)
- </literallayout>
- Let's record system-wide ('sleep 30' is a trick for recording
- system-wide while basically doing nothing, then waking up
- after 30 seconds):
- <literallayout class='monospaced'>
+ </literallayout>
+ Let's record system-wide ('sleep 30' is a trick for recording
+ system-wide while basically doing nothing, then waking up
+ after 30 seconds):
+ <literallayout class='monospaced'>
root@crownbay:~# perf record -g -a -e probe:do_fork sleep 30
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.087 MB perf.data (~3812 samples) ]
- </literallayout>
- Using 'perf script' we can see each do_fork event that fired:
- <literallayout class='monospaced'>
+ </literallayout>
+ Using 'perf script' we can see each do_fork event that fired:
+ <literallayout class='monospaced'>
root@crownbay:~# perf script
# ========
@@ -1265,27 +1265,28 @@
matchbox-deskto 1197 [001] 34237.112138: do_fork: (c1028460)
matchbox-deskto 1311 [001] 34237.114106: do_fork: (c1028460)
gaku 1312 [000] 34237.202388: do_fork: (c1028460)
- </literallayout>
- And using 'perf report' on the same file, we can see the
- callgraphs from starting a few programs during those 30 seconds:
- </para>
+ </literallayout>
+ And using 'perf report' on the same file, we can see the
+ callgraphs from starting a few programs during those 30 seconds:
+ </para>
- <para>
- <imagedata fileref="figures/perf-probe-do_fork-profile.png" width="6in" depth="7in" align="center" scalefit="1" />
- </para>
+ <para>
+ <imagedata fileref="figures/perf-probe-do_fork-profile.png" width="6in" depth="7in" align="center" scalefit="1" />
+ </para>
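+ <para>
+ Since a dynamic tracepoint behaves like any other event, we
+ could just as easily count do_fork invocations over the same
+ interval with 'perf stat' (a sketch - the counts will of
+ course depend on what the system happens to be doing):
+ <literallayout class='monospaced'>
+ root@crownbay:~# perf stat -a -e probe:do_fork sleep 30
+ </literallayout>
+ </para>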
- <note>
- Tying It Together: The trace events subsystem accommodates static
- and dynamic tracepoints in exactly the same way - there's no
- difference as far as the infrastructure is concerned. See the
- ftrace section for more details on the trace event subsystem.
- </note>
-
- <note>
- Tying It Together: Dynamic tracepoints are implemented under the
- covers by kprobes and uprobes. kprobes and uprobes are also used
- by, and are in fact the main focus of, SystemTap.
- </note>
+ <note>
+ Tying It Together: The trace events subsystem accommodates static
+ and dynamic tracepoints in exactly the same way - there's no
+ difference as far as the infrastructure is concerned. See the
+ ftrace section for more details on the trace event subsystem.
+ </note>
+
+ <note>
+ Tying It Together: Dynamic tracepoints are implemented under the
+ covers by kprobes and uprobes. kprobes and uprobes are also used
+ by, and are in fact the main focus of, SystemTap.
+ </note>
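+ <para>
+ When you're done with a dynamic tracepoint, you can remove
+ it using the -d (--del) option of 'perf probe'; the output
+ should look something like this:
+ <literallayout class='monospaced'>
+ root@crownbay:~# perf probe -d probe:do_fork
+ Removed event: probe:do_fork
+ </literallayout>
+ </para>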
+ </section>
</section>
<section id='perf-documentation'>