\documentclass[a4paper,twocolumn]{article}

\usepackage{abstract}
\usepackage{xspace}
\usepackage{amssymb}
\usepackage{latexsym}
\usepackage{tabularx}
\usepackage[T1]{fontenc}
\usepackage{calc}
\usepackage{listings}
\usepackage{color}
\usepackage{url}

\title{Device trees everywhere}

\author{David Gibson \texttt{<{dwg}{@}{au1.ibm.com}>}\\
  Benjamin Herrenschmidt \texttt{<{benh}{@}{kernel.crashing.org}>}\\
  \emph{OzLabs, IBM Linux Technology Center}}

\newcommand{\R}{\textsuperscript{\textregistered}\xspace}
\newcommand{\tm}{\textsuperscript{\texttrademark}\xspace}
\newcommand{\tge}{$\geqslant$}
%\newcommand{\ditto}{\textquotedbl\xspace}
\newcommand{\fixme}[1]{$\bigstar$\emph{\textbf{\large #1}}$\bigstar$\xspace}
\newcommand{\ppc}{\mbox{PowerPC}\xspace}
\newcommand{\of}{Open Firmware\xspace}
\newcommand{\benh}{Ben Herrenschmidt\xspace}
\newcommand{\kexec}{\texttt{kexec()}\xspace}
\newcommand{\dtbeginnode}{\texttt{OF\_DT\_BEGIN\_NODE\xspace}}
\newcommand{\dtendnode}{\texttt{OF\_DT\_END\_NODE\xspace}}
\newcommand{\dtprop}{\texttt{OF\_DT\_PROP\xspace}}
\newcommand{\dtend}{\texttt{OF\_DT\_END\xspace}}
\newcommand{\dtc}{\texttt{dtc}\xspace}
\newcommand{\phandle}{\texttt{linux,phandle}\xspace}

\begin{document}

\maketitle

\begin{abstract}
We present a method for booting a \ppc{}\R Linux\R kernel on an embedded machine. To do this, we supply the kernel with a compact flattened-tree representation of the system's hardware based on the device tree supplied by Open Firmware on IBM\R servers and Apple\R Power Macintosh\R machines. The ``blob'' representing the device tree can be created using \dtc, the Device Tree Compiler, which turns a simple text representation of the tree into the compact representation used by the kernel. The compiler can produce either a binary ``blob'' or an assembler file ready to be built into a firmware or bootwrapper image.
This flattened-tree approach is now the only supported method of booting a \texttt{ppc64} kernel without Open Firmware, and we plan to make it the only supported method for all \texttt{powerpc} kernels in the future.
\end{abstract}

\section{Introduction}

\subsection{OF and the device tree}

Historically, ``everyday'' \ppc machines have booted with the help of \of (OF), a firmware environment defined by IEEE1275 \cite{IEEE1275}. Among other boot-time services, OF maintains a device tree that describes all of the system's hardware devices and how they're connected. During boot, before taking control of memory management, the Linux kernel uses OF calls to scan the device tree and transfer it to an internal representation that is used at run time to look up various device information.

The device tree consists of nodes representing devices or buses\footnote{Well, mostly. There are a few special exceptions.}. Each node contains \emph{properties}, name--value pairs that give information about the device. The values are arbitrary byte strings, and for some properties, they contain tables or other structured information.

\subsection{The bad old days}

Embedded systems, by contrast, usually have a minimal firmware that might supply a few vital system parameters (size of RAM and the like), but nothing as detailed or complete as the OF device tree. This has meant that the various 32-bit \ppc embedded ports have required a variety of hacks spread across the kernel to deal with the lack of a device tree. These vary from specialised boot wrappers to parse parameters (which are at least reasonably localised) to CONFIG-dependent hacks in drivers to override normal probe logic with hardcoded addresses for a particular board. As well as being ugly in themselves, such CONFIG-dependent hacks make it hard to build a single kernel image that supports multiple embedded machines.

Until relatively recently, the only 64-bit \ppc machines without OF were legacy (pre-POWER5\R) iSeries\R machines.
iSeries machines often only have virtual IO devices, which makes it quite simple to work around the lack of a device tree. Even so, the lack means the iSeries boot sequence must be quite different from the pSeries or Macintosh, which is not ideal.

The device tree also presents a problem for implementing \kexec. When the kernel boots, it takes over full control of the system from OF, even re-using OF's memory. So, when \kexec comes to boot another kernel, OF is no longer around for the second kernel to query.

\section{The Flattened Tree}

In May 2005 \benh implemented a new approach to handling the device tree that addresses all these problems. When booting on OF systems, the first thing the kernel runs is a small piece of code in \texttt{prom\_init.c}, which executes in the context of OF. This code walks the device tree using OF calls, and transcribes it into a compact, flattened format. The resulting device tree ``blob'' is then passed to the kernel proper, which eventually unflattens the tree into its runtime form. This blob is the only data communicated between the \texttt{prom\_init.c} bootstrap and the rest of the kernel.

When OF isn't available, either because the machine doesn't have it at all or because \kexec has been used, the kernel instead starts directly from the entry point taking a flattened device tree. The device tree blob must be passed in from outside, rather than generated by part of the kernel from OF. For \kexec, the userland \texttt{kexec} tools build the blob from the runtime device tree before invoking the new kernel. For embedded systems the blob can come either from the embedded bootloader, or from a specialised version of the \texttt{zImage} wrapper for the system in question.

\subsection{Properties of the flattened tree}

The flattened tree format should be easy to handle, both for the kernel that parses it and the bootloader that generates it.
In particular, the following properties are desirable:
\begin{itemize}
\item \emph{relocatable}: the bootloader or kernel should be able to move the blob around as a whole, without needing to parse or adjust its internals. In practice that means we must not use pointers within the blob.
\item \emph{insert and delete}: sometimes the bootloader might want to make tweaks to the flattened tree, such as deleting or inserting a node (or whole subtree). It should be possible to do this without having to effectively regenerate the whole flattened tree. In practice this means limiting the use of internal offsets in the blob that need recalculation if a section is inserted or removed with \texttt{memmove()}.
\item \emph{compact}: embedded systems are frequently short of resources, particularly RAM and flash memory space. Thus, the tree representation should be kept as small as conveniently possible.
\end{itemize}

\subsection{Format of the device tree blob}
\label{sec:format}

\begin{figure}[htb!]
  \centering
  \footnotesize
  \begin{tabular}{r|c|l}
    \multicolumn{1}{r}{\textbf{Offset}}& \multicolumn{1}{c}{\textbf{Contents}} \\\cline{2-2}
    \texttt{0x00} & \texttt{0xd00dfeed} & magic number \\\cline{2-2}
    \texttt{0x04} & \emph{totalsize} \\\cline{2-2}
    \texttt{0x08} & \emph{off\_struct} & \\\cline{2-2}
    \texttt{0x0C} & \emph{off\_strs} & \\\cline{2-2}
    \texttt{0x10} & \emph{off\_rsvmap} & \\\cline{2-2}
    \texttt{0x14} & \emph{version} \\\cline{2-2}
    \texttt{0x18} & \emph{last\_comp\_ver} & \\\cline{2-2}
    \texttt{0x1C} & \emph{boot\_cpu\_id} & \tge v2 only\\\cline{2-2}
    \texttt{0x20} & \emph{size\_strs} & \tge v3 only\\\cline{2-2}
    \multicolumn{1}{r}{\vdots} & \multicolumn{1}{c}{\vdots} & \\\cline{2-2}
    \emph{off\_rsvmap} & \emph{address0} & memory reserve \\
    + \texttt{0x04} & ...& table \\\cline{2-2}
    + \texttt{0x08} & \emph{len0} & \\
    + \texttt{0x0C} & ...& \\\cline{2-2}
    \vdots & \multicolumn{1}{c|}{\vdots} & \\\cline{2-2}
    & \texttt{0x00000000}- & end marker\\
    & \texttt{00000000} & \\\cline{2-2}
    & \texttt{0x00000000}- & \\
    & \texttt{00000000} & \\\cline{2-2}
    \multicolumn{1}{r}{\vdots} & \multicolumn{1}{c}{\vdots} & \\\cline{2-2}
    \emph{off\_strs} & \texttt{'n' 'a' 'm' 'e'} & strings block \\
    + \texttt{0x04} & \texttt{~0~ 'm' 'o' 'd'} & \\
    + \texttt{0x08} & \texttt{'e' 'l' ~0~ \makebox[\widthof{~~~}]{\textrm{...}}} & \\
    \vdots & \multicolumn{1}{c|}{\vdots} & \\\cline{2-2}
    \multicolumn{1}{r}{+ \emph{size\_strs}} \\
    \multicolumn{1}{r}{\vdots} & \multicolumn{1}{c}{\vdots} & \\\cline{2-2}
    \emph{off\_struct} & \dtbeginnode & structure block \\\cline{2-2}
    + \texttt{0x04} & \texttt{'/' ~0~ ~0~ ~0~} & root node\\\cline{2-2}
    + \texttt{0x08} & \dtprop & \\\cline{2-2}
    + \texttt{0x0C} & \texttt{0x00000005} & ``\texttt{model}''\\\cline{2-2}
    + \texttt{0x10} & \texttt{0x00000008} & \\\cline{2-2}
    + \texttt{0x14} & \texttt{'M' 'y' 'B' 'o'} & \\
    + \texttt{0x18} & \texttt{'a' 'r' 'd' ~0~} & \\\cline{2-2}
    \vdots & \multicolumn{1}{c|}{\vdots} & \\\cline{2-2}
    & \texttt{\dtendnode} \\\cline{2-2}
    & \texttt{\dtend} \\\cline{2-2}
    \multicolumn{1}{r}{\vdots} & \multicolumn{1}{c}{\vdots} & \\\cline{2-2}
    \multicolumn{1}{r}{\emph{totalsize}} \\
  \end{tabular}
  \caption{Device tree blob layout}
  \label{fig:blob-layout}
\end{figure}

The blob format we devised was first described on the \texttt{linuxppc64-dev} mailing list in \cite{noof1}. The format has since evolved through various revisions, and the current version is included as part of the \dtc (see \S\ref{sec:dtc}) git tree \cite{dtcgit}.

Figure \ref{fig:blob-layout} shows the layout of the blob of data containing the device tree. It has three sections of variable size: the \emph{memory reserve table}, the \emph{structure block} and the \emph{strings block}. A small header gives the blob's size and version and the locations of the three sections, plus a handful of vital parameters used during early boot.
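
As a rough illustration of the header just described, the following C sketch decodes the fixed fields and applies a few sanity checks. The names \texttt{dt\_header}, \texttt{dt\_be32} and \texttt{dt\_read\_header} are hypothetical and are not taken from the kernel or from \dtc. The sketch also assumes that header fields are stored big-endian (the usual \ppc byte order), which Figure \ref{fig:blob-layout} does not itself state.

\begin{lstlisting}[language=C,frame=single,basicstyle=\footnotesize\ttfamily,tabsize=3,breaklines=true]
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DT_MAGIC 0xd00dfeedu /* magic number from the figure */

/* Hypothetical decoded header; field names follow the figure. */
struct dt_header {
	uint32_t totalsize, off_struct, off_strs, off_rsvmap;
	uint32_t version, last_comp_ver;
	uint32_t boot_cpu_id; /* only present for version >= 2 */
	uint32_t size_strs;   /* only present for version >= 3 */
};

/* Read a 32-bit value, assuming big-endian storage. */
static uint32_t dt_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode the header of a blob of len bytes.
 * Returns 0 on success, -1 if the blob looks implausible. */
static int dt_read_header(const unsigned char *blob, size_t len,
                          struct dt_header *h)
{
	assert(blob != NULL && h != NULL);
	if (len < 0x1c || dt_be32(blob) != DT_MAGIC)
		return -1;
	h->totalsize = dt_be32(blob + 0x04);
	h->off_struct = dt_be32(blob + 0x08);
	h->off_strs = dt_be32(blob + 0x0c);
	h->off_rsvmap = dt_be32(blob + 0x10);
	h->version = dt_be32(blob + 0x14);
	h->last_comp_ver = dt_be32(blob + 0x18);
	h->boot_cpu_id = 0;
	h->size_strs = 0;
	if (h->version >= 2) {
		if (len < 0x20)
			return -1;
		h->boot_cpu_id = dt_be32(blob + 0x1c);
	}
	if (h->version >= 3) {
		if (len < 0x24)
			return -1;
		h->size_strs = dt_be32(blob + 0x20);
	}
	/* All three sections must start inside the blob. */
	if (h->totalsize > len || h->off_struct >= h->totalsize ||
	    h->off_strs >= h->totalsize || h->off_rsvmap >= h->totalsize)
		return -1;
	return 0;
}
\end{lstlisting}

Because the header holds offsets rather than pointers, values decoded this way remain valid after the blob has been moved as a whole, which is the relocatability property asked for above.
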
The memory reserve map section gives a list of regions of memory that the kernel must not use\footnote{Usually such ranges contain some data structure initialised by the firmware that must be preserved by the kernel.}. The list is represented as a simple array of (address, size) pairs of 64-bit values, terminated by a zero size entry.

The strings block is similarly simple, consisting of a number of null-terminated strings appended together, which are referenced from the structure block as described below.

The structure block contains the device tree proper. Each node is introduced with a 32-bit \dtbeginnode tag, followed by the node's name as a null-terminated string, padded to a 32-bit boundary. Then follow all of the properties of the node, each introduced with a \dtprop tag, then all of the node's subnodes, each introduced with its own \dtbeginnode tag. The node ends with an \dtendnode tag, and after the \dtendnode for the root node is an \dtend tag, indicating the end of the whole tree\footnote{This is redundant, but included for ease of parsing.}. The structure block starts with the \dtbeginnode introducing the description of the root node (named \texttt{/}).

Each property, after the \dtprop, has a 32-bit value giving an offset from the beginning of the strings block at which the property name is stored. Because it's common for many nodes to have properties with the same name, this approach can substantially reduce the total size of the blob. The name offset is followed by the length of the property value (as a 32-bit value) and then the data itself padded to a 32-bit boundary.

\subsection{Contents of the tree}
\label{sec:treecontents}

Having seen how to represent the device tree structure as a flattened blob, what actually goes into the tree? The short answer is ``the same as an OF tree''. On OF systems, the flattened tree is transcribed directly from the OF device tree, so for simplicity we also use OF conventions for the tree on other systems.
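
Returning briefly to the blob format of \S\ref{sec:format}, the two simpler sections can be walked in a few lines of C. The sketch below is illustrative only: the helper names \texttt{dt\_be64}, \texttt{dt\_reserved\_bytes} and \texttt{dt\_prop\_name} are hypothetical, and big-endian storage is an assumption rather than something stated in the format description.

\begin{lstlisting}[language=C,frame=single,basicstyle=\footnotesize\ttfamily,tabsize=3,breaklines=true]
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 64-bit value, assuming big-endian storage. */
static uint64_t dt_be64(const unsigned char *p)
{
	uint64_t v = 0;
	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Add up the sizes of all reserved regions. Each entry is an
 * (address, size) pair of 64-bit values; a zero size ends the list. */
static uint64_t dt_reserved_bytes(const unsigned char *rsvmap)
{
	uint64_t total = 0;
	assert(rsvmap != NULL);
	for (;; rsvmap += 16) {
		uint64_t size = dt_be64(rsvmap + 8);
		if (size == 0)
			break;
		total += size;
	}
	return total;
}

/* Resolve a property name offset against the strings block.
 * Returns NULL if the offset lies outside the block. */
static const char *dt_prop_name(const char *strs, uint32_t size_strs,
                                uint32_t nameoff)
{
	assert(strs != NULL);
	return nameoff < size_strs ? strs + nameoff : NULL;
}
\end{lstlisting}

With the strings block of Figure \ref{fig:blob-layout}, an offset of 5 resolves to \texttt{model}, and many nodes can share that one stored copy of the name.
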
In many cases a flat tree can be simpler than a typical OF-provided device tree. The flattened tree need only provide those nodes and properties that the kernel actually requires; the flattened tree generally need not include devices that the kernel can probe itself. For example, an OF device tree would normally include nodes for each PCI device on the system. A flattened tree need only include nodes for the PCI host bridges; the kernel will scan the buses thus described to find the subsidiary devices. The device tree can include nodes for devices where the kernel needs extra information, though: for example, for ISA devices on a subsidiary PCI/ISA bridge, or for devices with unusual interrupt routing.

Where they exist, we follow the IEEE1275 bindings that specify how to describe various buses in the device tree (for example, \cite{IEEE1275-pci} describes how to represent PCI devices). The standard has not been updated for a long time, however, and lacks bindings for many modern buses and devices. In particular, embedded-specific devices such as the various System-on-Chip buses are not covered. We intend to create new bindings for such buses, in keeping with the general conventions of IEEE1275 (a simple such binding for a System-on-Chip bus was included in \cite{noof5}, a revision of \cite{noof1}).

One complication arises for representing ``phandles'' in the flattened tree. In OF, each node in the tree has an associated phandle, a 32-bit integer that uniquely identifies the node\footnote{In practice usually implemented as a pointer or offset within OF memory.}. This handle is used by the various OF calls to query and traverse the tree. Sometimes phandles are also used within the tree to refer to other nodes in the tree. For example, devices that produce interrupts generally have an \texttt{interrupt-parent} property giving the phandle of the interrupt controller that handles interrupts from this device.
Parsing these and other interrupt-related properties allows the kernel to build a complete representation of the system's interrupt tree, which can be quite different from the tree of bus connections.

In the flattened tree, a node's phandle is represented by a special \phandle property. When the kernel generates a flattened tree from OF, it adds a \phandle property to each node, containing the phandle retrieved from OF. When the tree is generated without OF, however, only nodes that are actually referred to by phandle need to have this property.

Another complication arises because nodes in an OF tree have two names. First they have the ``unit name'', which is how the node is referred to in an OF path. The unit name generally consists of a device type followed by an \texttt{@} followed by a \emph{unit address}. For example, \texttt{/memory@0} is the full path of a memory node at address 0, and \texttt{/ht@0,f2000000/pci@1} is the path of a PCI bus node, which is under a HyperTransport\tm bus node. The form of the unit address is bus dependent, but is generally derived from the node's \texttt{reg} property. In addition, nodes have a property, \texttt{name}, whose value is usually equal to the first part of the unit name. For example, the nodes in the previous example would have \texttt{name} properties equal to \texttt{memory} and \texttt{pci}, respectively. To save space in the blob, the current version of the flattened tree format only requires the unit names to be present. When the kernel unflattens the tree, it automatically generates a \texttt{name} property from the node's path name.

\section{The Device Tree Compiler}
\label{sec:dtc}

\begin{figure}[htb!]
  \centering
  \begin{lstlisting}[frame=single,basicstyle=\footnotesize\ttfamily,
    tabsize=3,numbers=left,xleftmargin=2em]
/memreserve/ 0x20000000-0x21FFFFFF;

/ {
	model = "MyBoard";
	compatible = "MyBoardFamily";
	#address-cells = <2>;
	#size-cells = <2>;

	cpus {
		#address-cells = <1>;
		#size-cells = <0>;
		PowerPC,970@0 {
			device_type = "cpu";
			reg = <0>;
			clock-frequency = <5f5e1000>;
			timebase-frequency = <1FCA055>;
			linux,boot-cpu;
			i-cache-size = <10000>;
			d-cache-size = <8000>;
		};
	};

	memory@0 {
		device_type = "memory";
		memreg: reg = <00000000 00000000 00000000 20000000>;
	};

	mpic@0x3fffdd08400 {
		/* Interrupt controller */
		/* ... */
	};

	pci@40000000000000 {
		/* PCI host bridge */
		/* ... */
	};

	chosen {
		bootargs = "root=/dev/sda2";
		linux,platform = <00000600>;
		interrupt-controller = < &/mpic@0x3fffdd08400 >;
	};
};
\end{lstlisting}
  \caption{Example \dtc source}
  \label{fig:dts}
\end{figure}

As we've seen, the flattened device tree format provides a convenient way of communicating device tree information to the kernel. It's simple for the kernel to parse, and simple for bootloaders to manipulate. On OF systems, it's easy to generate the flattened tree by walking the OF maintained tree. However, for embedded systems, the flattened tree must be generated from scratch.

Embedded bootloaders are generally built for a particular board. So, it's usually possible to build the device tree blob at compile time and include it in the bootloader image. For minor revisions of the board, the bootloader can contain code to make the necessary tweaks to the tree before passing it to the booted kernel.

The device trees for embedded boards are usually quite simple, and it's possible to construct the necessary blob by hand, but doing so is tedious. The ``device tree compiler'', \dtc{}\footnote{\dtc can be obtained from \cite{dtcgit}.}, is designed to make creating device tree blobs easier by converting a text representation of the tree into the necessary blob.
\subsection{Input and output formats}

As well as the normal mode of compiling a device tree blob from text source, \dtc can convert a device tree between a number of representations. It can take its input in one of three different formats:
\begin{itemize}
\item source, the normal case. The device tree is given in the text form described in \S\ref{sec:dts}.
\item blob (\texttt{dtb}), the flattened tree format described in \S\ref{sec:format}. This mode is useful for checking a pre-existing device tree blob.
\item filesystem (\texttt{fs}), input is a directory tree in the layout of \texttt{/proc/device-tree} (roughly, a directory for each node in the device tree, a file for each property). This is useful for building a blob for the device tree in use by the currently running kernel.
\end{itemize}

In addition, \dtc can output the tree in one of three different formats:
\begin{itemize}
\item blob (\texttt{dtb}), as in \S\ref{sec:format}. The most straightforward use of \dtc is to compile from ``source'' to ``blob'' format.
\item source (\texttt{dts}), as in \S\ref{sec:dts}. If used with blob input, this allows \dtc to act as a ``decompiler''.
\item assembler source (\texttt{asm}). \dtc can produce an assembler file, which will assemble into a \texttt{.o} file containing the device tree blob, with symbols giving the beginning of the blob and its various subsections. This can then be linked directly into a bootloader or firmware image.
\end{itemize}

For maximum applicability, \dtc can both read and write any of the existing revisions of the blob format. When reading, \dtc takes the version from the blob header, and when writing it takes a command line option specifying the desired version. It automatically makes any adjustments to the tree that are necessary for the specified version. For example, formats before 0x10 require each node to have an explicit \texttt{name} property.
When \dtc creates such a blob, it will automatically generate \texttt{name} properties from the unit names.

\subsection{Source format}
\label{sec:dts}

The ``source'' format for \dtc is a text description of the device tree in a vaguely C-like form. Figure \ref{fig:dts} shows an example. The file starts with \texttt{/memreserve/} directives, which give address ranges to add to the output blob's memory reserve table, then the device tree proper is described.

Nodes of the tree are introduced with the node name, followed by a \texttt{\{} ... \texttt{\};} block containing the node's properties and subnodes. Properties are given as just {\emph{name} \texttt{=} \emph{value}\texttt{;}}. The property values can be given in any of three forms:
\begin{itemize}
\item \emph{string} (for example, \texttt{"MyBoard"}). The property value is the given string, including the terminating NUL. C-style escapes (\verb+\t+, \verb+\n+, \verb+\0+ and so forth) are allowed.
\item \emph{cells} (for example, \texttt{<0 8000 f0000000>}). The property value is made up of a list of 32-bit ``cells'', each given as a hex value.
\item \emph{bytestring} (for example, \texttt{[1234abcdef]}). The property value is given as a hex bytestring.
\end{itemize}

Cell properties can also contain \emph{references}. Instead of a hex number, the source can give an ampersand (\texttt{\&}) followed by the full path to some node in the tree. For example, in Figure \ref{fig:dts}, the \texttt{/chosen} node has an \texttt{interrupt-controller} property referring to the interrupt controller described by the node \texttt{/mpic@0x3fffdd08400}. In the output tree, the value of the referenced node's phandle is included in the property. If that node doesn't have an explicit phandle property, \dtc will automatically create a unique phandle for it. This approach makes it easy to create interrupt trees without having to explicitly assign and remember phandles for the various interrupt controller nodes.
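
To make the value forms above concrete, the following C sketch shows how a cells value and a string value might be laid out as bytes in the blob. It is not \dtc's code: the function names \texttt{dt\_encode\_cells} and \texttt{dt\_encode\_string} are hypothetical, and storing each cell big-endian is an assumption.

\begin{lstlisting}[language=C,frame=single,basicstyle=\footnotesize\ttfamily,tabsize=3,breaklines=true]
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store n 32-bit cells as bytes, assuming big-endian order.
 * out must have room for 4*n bytes. Returns the value length. */
static size_t dt_encode_cells(const uint32_t *cells, size_t n,
                              unsigned char *out)
{
	assert(cells != NULL && out != NULL);
	for (size_t i = 0; i < n; i++) {
		out[4 * i] = (unsigned char)(cells[i] >> 24);
		out[4 * i + 1] = (unsigned char)(cells[i] >> 16);
		out[4 * i + 2] = (unsigned char)(cells[i] >> 8);
		out[4 * i + 3] = (unsigned char)cells[i];
	}
	return 4 * n;
}

/* Copy a string value, including its terminating NUL.
 * out must have room for strlen(s)+1 bytes. Returns the value length. */
static size_t dt_encode_string(const char *s, unsigned char *out)
{
	size_t len;
	assert(s != NULL && out != NULL);
	len = strlen(s) + 1;
	memcpy(out, s, len);
	return len;
}
\end{lstlisting}

Under these assumptions \texttt{<0 8000 f0000000>} becomes a 12-byte value, and \texttt{"MyBoard"} an 8-byte value, which matches the length \texttt{0x00000008} shown for \texttt{model} in Figure \ref{fig:blob-layout}.
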
The \dtc source can also include ``labels'', which are placed on a particular node or property. For example, Figure \ref{fig:dts} has a label ``\texttt{memreg}'' on the \texttt{reg} property of the node \texttt{/memory@0}. When using assembler output, corresponding labels in the output are generated, which will assemble into symbols addressing the part of the blob with the node or property in question. This is useful for the common case where an embedded board has an essentially fixed device tree with a few variable properties, such as the size of memory. The bootloader for such a board can have a device tree linked in, including a symbol referring to the right place in the blob to update the parameter with the correct value determined at runtime.

\subsection{Tree checking}

Between reading in the device tree and writing it out in the new format, \dtc performs a number of checks on the tree:
\begin{itemize}
\item \emph{syntactic structure}: \dtc checks that node and property names contain only allowed characters and meet length restrictions. It checks that a node does not have multiple properties or subnodes with the same name.
\item \emph{semantic structure}: In some cases, \dtc checks that properties whose contents are defined by convention have appropriate values. For example, it checks that \texttt{reg} properties have a length that makes sense given the address forms specified by the \texttt{\#address-cells} and \texttt{\#size-cells} properties. It checks that properties such as \texttt{interrupt-parent} contain a valid phandle.
\item \emph{Linux requirements}: \dtc checks that the device tree contains those nodes and properties that are required by the Linux kernel to boot correctly.
\end{itemize}

These checks are useful to catch simple problems with the device tree, rather than having to debug the results on an embedded kernel. With the blob input mode, \dtc can also be used for diagnosing problems with an existing blob.
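
The two semantic checks mentioned above can be sketched in C as follows. This is only an illustration of the idea, not \dtc's implementation: the function names are hypothetical, and the sketch assumes that the cell counts passed in are those that apply to the node being checked.

\begin{lstlisting}[language=C,frame=single,basicstyle=\footnotesize\ttfamily,tabsize=3,breaklines=true]
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if a reg value of reg_len bytes holds a whole number of
 * (address, size) entries, given the applicable #address-cells
 * and #size-cells. Each cell is 4 bytes. */
static bool dt_reg_len_ok(size_t reg_len, uint32_t address_cells,
                          uint32_t size_cells)
{
	size_t entry = 4 * ((size_t)address_cells + size_cells);
	if (entry == 0)
		return reg_len == 0;
	return reg_len % entry == 0;
}

/* True if ph matches one of the n phandles collected from the tree. */
static bool dt_phandle_known(const uint32_t *known, size_t n,
                             uint32_t ph)
{
	assert(known != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (known[i] == ph)
			return true;
	return false;
}
\end{lstlisting}

In Figure \ref{fig:dts}, the \texttt{reg} property of \texttt{/memory@0} is 16 bytes long with two address cells and two size cells, so it passes the first check; a 12-byte value in the same position would not.
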
\section{Future Work}

\subsection{Board ports}

The flattened device tree has always been the only supported way to boot a \texttt{ppc64} kernel on an embedded system. With the merge of \texttt{ppc32} and \texttt{ppc64} code it has also become the only supported way to boot any merged \texttt{powerpc} kernel, 32-bit or 64-bit. In fact, the old \texttt{ppc} architecture exists mainly to support the old ppc32 embedded ports that have not been migrated to the flattened device tree approach. We plan to remove the \texttt{ppc} architecture eventually, which will mean porting all the various embedded boards to use the flattened device tree.

\subsection{\dtc features}

While it is already quite usable, there are a number of extra features that \dtc could include to make creating device trees more convenient:
\begin{itemize}
\item \emph{better tree checking}: Although \dtc already performs a number of checks on the device tree, they are rather haphazard. In many cases \dtc will give up after detecting a minor error early and won't pick up more interesting errors later on. There is a \texttt{-f} parameter that forces \dtc to generate an output tree even if there are errors. At present, this needs to be used more often than one might hope, because \dtc is bad at deciding which errors should really be fatal, and which rate mere warnings.
\item \emph{binary include}: Occasionally, it is useful for the device tree to incorporate as a property a block of binary data for some board-specific purpose. For example, many of Apple's device trees incorporate bytecode drivers for certain platform devices. \dtc's source format ought to allow this by letting a property's value be read directly from a binary file.
\item \emph{macros}: it might be useful for \dtc to implement some sort of macros so that a tree containing a number of similar devices (for example, multiple identical ethernet controllers or PCI buses) can be written more quickly.
At present, this can be accomplished in part by running the source file through CPP before compiling with \dtc. It's not clear whether ``native'' support for macros would be more useful.
\end{itemize}

\bibliographystyle{amsplain}
\bibliography{dtc-paper}

\section*{About the authors}

David Gibson has been a member of the IBM Linux Technology Center, working from Canberra, Australia, since 2001. Recently he has worked on Linux hugepage support and performance counter support for ppc64, as well as the device tree compiler. In the past, he has worked on bringup for various ppc and ppc64 embedded systems, the orinoco wireless driver, ramfs, and a userspace checkpointing system (\texttt{esky}).

Benjamin Herrenschmidt was a MacOS developer for about 10 years, but ultimately saw the light and installed Linux on his Apple PowerPC machine. After writing a bootloader, BootX, for it in 1998, he started contributing to the PowerPC Linux port in various areas, mostly around the support for Apple machines. He became official PowerMac maintainer in 2001. In 2003, he joined the IBM Linux Technology Center in Canberra, Australia, where he ported the 64-bit PowerPC kernel to Apple G5 machines and the Maple embedded board, among other things. He's a member of the ppc64 development ``team'' and one of his current goals is to make the integration of embedded platforms smoother and more maintainable than in the 32-bit PowerPC kernel.

\section*{Legal Statement}

This work represents the view of the author and does not necessarily represent the view of IBM.

IBM, \ppc, \ppc Architecture, POWER5, pSeries and iSeries are trademarks or registered trademarks of International Business Machines Corporation in the United States and/or other countries.

Apple and Power Macintosh are registered trademarks of Apple Computer Inc. in the United States, other countries, or both.

Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others. \end{document}