author | jraynard <jraynard@FreeBSD.org> | 1997-10-14 18:17:11 +0000 |
---|---|---|
committer | jraynard <jraynard@FreeBSD.org> | 1997-10-14 18:17:11 +0000 |
commit | a46c41193ff2573a4c910e19b570e9c253e714a1 | |
tree | d84200da2f7f2d8f1321c265bc6ddd7ce15633f8 /contrib/awk/doc | |
Virgin import of GNU awk 3.0.3
Diffstat (limited to 'contrib/awk/doc')
-rw-r--r-- | contrib/awk/doc/ChangeLog | 91 |
-rw-r--r-- | contrib/awk/doc/awk.1 | 2621 |
-rw-r--r-- | contrib/awk/doc/gawk.texi | 20820 |
3 files changed, 23532 insertions, 0 deletions
diff --git a/contrib/awk/doc/ChangeLog b/contrib/awk/doc/ChangeLog new file mode 100644 index 0000000..660436a --- /dev/null +++ b/contrib/awk/doc/ChangeLog @@ -0,0 +1,91 @@ +Thu May 15 12:49:08 1997 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * Release 3.0.3: Release tar file made. + +Fri Apr 18 07:55:47 1997 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * BETA Release 3.0.34: Release tar file made. + +Sun Apr 13 15:39:20 1997 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * Makefile.in ($(infodir)/gawk.info): exit 0 in case install-info + fails. + +Thu Jan 2 23:17:53 1997 Fred Fish <fnf@ninemoons.com> + + * Makefile.in (awkcard.tr): Use ':' chars to separate parts of + sed command, since $(srcdir) may expand to something with '/' + characters in it, which confuses sed terribly. + * gawk.texi (Amiga Installation): Note change of configuration + from "m68k-cbm-amigados" to "m68k-amigaos". Point ftp users + towards current ADE distribution and not obsolete Aminet + "gcc" distribution. Change "FreshFish" to "Geek Gadgets". + +Wed Dec 25 11:25:22 1996 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * Release 3.0.2: Release tar file made. + +Wed Dec 25 11:17:32 1996 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * Makefile.in ($(mandir)/igawk$(manext),$(mandir)/gawk$(manext)): + remove chmod command; let $(INSTALL_DATA) use -m. + +Tue Dec 17 22:38:28 1996 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * Makefile.in (gawk.info,gawk.dvi,postscript): run makeinfo, TeX, + and/or troff against files in $(srcdir). Thanks to Ulrich Drepper. + ($(infodir)/gawk.info): use --info-dir to install-info, not + --infodir. + +Tue Dec 10 23:09:26 1996 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * Release 3.0.1: Release tar file made. + +Mon Dec 9 12:48:54 1996 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * no.colors: new file from Michal for old troffs. + * Makefile.in [AWKCARD]: changes to parameterize old/new troff. + +Sun Dec 1 15:04:56 1996 Arnold D. 
Robbins <arnold@skeeve.atl.ga.us> + + * texinfo.tex: Updated to version 2.193, from Karl Berry. + +Tue Nov 26 22:57:15 1996 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * Makefile.in ($(infodir)/gawk.info): Change option in call + to `install-info' to `--info-dir' from `--infodir'. + +Mon Nov 4 13:30:39 1996 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * Makefile.in: updates for reference card. + (ad.block, awkcard.in, cardfonts, colors, macros, setter.outline): + new files for reference card. + +Wed Oct 16 12:43:02 1996 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * texinfo.tex: Updated to version 2.185, from texinfo-3.9 dist. + +Sun Aug 11 23:12:08 1996 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * Makefile.in ($(infodir)/gawk.info): correct use of + $(INSTALL_DATA) and remove chmod command. + +Thu Jul 11 22:06:50 1996 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * Makefile.in ($(mandir)/gawk.$(ext), $(mandir)/igawk.$(ext)): + made dependant on files in $(srcdir). + +Fri Mar 15 06:45:35 1996 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * Makefile.in (clean): add `*~' to list of files to be removed. + +Thu Jan 25 23:40:15 1996 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * Makefile.in (dvi): run texindex and tex an extra time. + This gets the cross references right. Sigh. + +Wed Jan 24 11:51:54 1996 Arnold D. Robbins <arnold@skeeve.atl.ga.us> + + * Makefile.in (maintainer-clean): + Depend on distclean, not the other way around. + Output warning message as per GNU standards. diff --git a/contrib/awk/doc/awk.1 b/contrib/awk/doc/awk.1 new file mode 100644 index 0000000..0568c16 --- /dev/null +++ b/contrib/awk/doc/awk.1 @@ -0,0 +1,2621 @@ +.ds PX \s-1POSIX\s+1 +.ds UX \s-1UNIX\s+1 +.ds AN \s-1ANSI\s+1 +.TH GAWK 1 "Dec 19 1996" "Free Software Foundation" "Utility Commands" +.SH NAME +gawk \- pattern scanning and processing language +.SH SYNOPSIS +.B gawk +[ POSIX or GNU style options ] +.B \-f +.I program-file +[ +.B \-\^\- +] file .\^.\^. 
+.br +.B gawk +[ POSIX or GNU style options ] +[ +.B \-\^\- +] +.I program-text +file .\^.\^. +.SH DESCRIPTION +.I Gawk +is the GNU Project's implementation of the AWK programming language. +It conforms to the definition of the language in +the \*(PX 1003.2 Command Language And Utilities Standard. +This version in turn is based on the description in +.IR "The AWK Programming Language" , +by Aho, Kernighan, and Weinberger, +with the additional features found in the System V Release 4 version +of \*(UX +.IR awk . +.I Gawk +also provides more recent Bell Labs +.I awk +extensions, and some GNU-specific extensions. +.PP +The command line consists of options to +.I gawk +itself, the AWK program text (if not supplied via the +.B \-f +or +.B \-\^\-file +options), and values to be made +available in the +.B ARGC +and +.B ARGV +pre-defined AWK variables. +.SH OPTION FORMAT +.PP +.I Gawk +options may be either the traditional \*(PX one letter options, +or the GNU style long options. \*(PX options start with a single ``\-'', +while long options start with ``\-\^\-''. +Long options are provided for both GNU-specific features and +for \*(PX mandated features. +.PP +Following the \*(PX standard, +.IR gawk -specific +options are supplied via arguments to the +.B \-W +option. Multiple +.B \-W +options may be supplied +Each +.B \-W +option has a corresponding long option, as detailed below. +Arguments to long options are either joined with the option +by an +.B = +sign, with no intervening spaces, or they may be provided in the +next command line argument. +Long options may be abbreviated, as long as the abbreviation +remains unique. +.SH OPTIONS +.PP +.I Gawk +accepts the following options. +.TP +.PD 0 +.BI \-F " fs" +.TP +.PD +.BI \-\^\-field-separator " fs" +Use +.I fs +for the input field separator (the value of the +.B FS +predefined +variable). 
+.TP +.PD 0 +\fB\-v\fI var\fB\^=\^\fIval\fR +.TP +.PD +\fB\-\^\-assign \fIvar\fB\^=\^\fIval\fR +Assign the value +.IR val , +to the variable +.IR var , +before execution of the program begins. +Such variable values are available to the +.B BEGIN +block of an AWK program. +.TP +.PD 0 +.BI \-f " program-file" +.TP +.PD +.BI \-\^\-file " program-file" +Read the AWK program source from the file +.IR program-file , +instead of from the first command line argument. +Multiple +.B \-f +(or +.BR \-\^\-file ) +options may be used. +.TP +.PD 0 +.BI \-mf " NNN" +.TP +.PD +.BI \-mr " NNN" +Set various memory limits to the value +.IR NNN . +The +.B f +flag sets the maximum number of fields, and the +.B r +flag sets the maximum record size. These two flags and the +.B \-m +option are from the Bell Labs research version of \*(UX +.IR awk . +They are ignored by +.IR gawk , +since +.I gawk +has no pre-defined limits. +.TP +.PD 0 +.B "\-W traditional" +.TP +.PD 0 +.B "\-W compat" +.TP +.PD 0 +.B \-\^\-traditional +.TP +.PD +.B \-\^\-compat +Run in +.I compatibility +mode. In compatibility mode, +.I gawk +behaves identically to \*(UX +.IR awk ; +none of the GNU-specific extensions are recognized. +The use of +.B \-\^\-traditional +is preferred over the other forms of this option. +See +.BR "GNU EXTENSIONS" , +below, for more information. +.TP +.PD 0 +.B "\-W copyleft" +.TP +.PD 0 +.B "\-W copyright" +.TP +.PD 0 +.B \-\^\-copyleft +.TP +.PD +.B \-\^\-copyright +Print the short version of the GNU copyright information message on +the standard output, and exits successfully. +.TP +.PD 0 +.B "\-W help" +.TP +.PD 0 +.B "\-W usage" +.TP +.PD 0 +.B \-\^\-help +.TP +.PD +.B \-\^\-usage +Print a relatively short summary of the available options on +the standard output. +(Per the +.IR "GNU Coding Standards" , +these options cause an immediate, successful exit.) 
+.TP +.PD 0 +.B "\-W lint" +.TP +.PD +.B \-\^\-lint +Provide warnings about constructs that are +dubious or non-portable to other AWK implementations. +.TP +.PD 0 +.B "\-W lint\-old" +.TP +.PD +.B \-\^\-lint\-old +Provide warnings about constructs that are +not portable to the original version of Unix +.IR awk . +.ig +.\" This option is left undocumented, on purpose. +.TP +.PD 0 +.B "\-W nostalgia" +.TP +.PD +.B \-\^\-nostalgia +Provide a moment of nostalgia for long time +.I awk +users. +.. +.TP +.PD 0 +.B "\-W posix" +.TP +.PD +.B \-\^\-posix +This turns on +.I compatibility +mode, with the following additional restrictions: +.RS +.TP \w'\(bu'u+1n +\(bu +.B \ex +escape sequences are not recognized. +.TP +\(bu +Only space and tab act as field separators when +.B FS +is set to a single space, newline does not. +.TP +\(bu +The synonym +.B func +for the keyword +.B function +is not recognized. +.TP +\(bu +The operators +.B ** +and +.B **= +cannot be used in place of +.B ^ +and +.BR ^= . +.TP +\(bu +The +.B fflush() +function is not available. +.RE +.TP +.PD 0 +.B "\-W re\-interval" +.TP +.PD +.B \-\^\-re\-interval +Enable the use of +.I "interval expressions" +in regular expression matching +(see +.BR "Regular Expressions" , +below). +Interval expressions were not traditionally available in the +AWK language. The POSIX standard added them, to make +.I awk +and +.I egrep +consistent with each other. +However, their use is likely +to break old AWK programs, so +.I gawk +only provides them if they are requested with this option, or when +.B \-\^\-posix +is specified. +.TP +.PD 0 +.BI "\-W source " program-text +.TP +.PD +.BI \-\^\-source " program-text" +Use +.I program-text +as AWK program source code. +This option allows the easy intermixing of library functions (used via the +.B \-f +and +.B \-\^\-file +options) with source code entered on the command line. +It is intended primarily for medium to large AWK programs used +in shell scripts. 
+.TP +.PD 0 +.B "\-W version" +.TP +.PD +.B \-\^\-version +Print version information for this particular copy of +.I gawk +on the standard output. +This is useful mainly for knowing if the current copy of +.I gawk +on your system +is up to date with respect to whatever the Free Software Foundation +is distributing. +This is also useful when reporting bugs. +(Per the +.IR "GNU Coding Standards" , +these options cause an immediate, successful exit.) +.TP +.B \-\^\- +Signal the end of options. This is useful to allow further arguments to the +AWK program itself to start with a ``\-''. +This is mainly for consistency with the argument parsing convention used +by most other \*(PX programs. +.PP +In compatibility mode, +any other options are flagged as illegal, but are otherwise ignored. +In normal operation, as long as program text has been supplied, unknown +options are passed on to the AWK program in the +.B ARGV +array for processing. This is particularly useful for running AWK +programs via the ``#!'' executable interpreter mechanism. +.SH AWK PROGRAM EXECUTION +.PP +An AWK program consists of a sequence of pattern-action statements +and optional function definitions. +.RS +.PP +\fIpattern\fB { \fIaction statements\fB }\fR +.br +\fBfunction \fIname\fB(\fIparameter list\fB) { \fIstatements\fB }\fR +.RE +.PP +.I Gawk +first reads the program source from the +.IR program-file (s) +if specified, +from arguments to +.BR \-\^\-source , +or from the first non-option argument on the command line. +The +.B \-f +and +.B \-\^\-source +options may be used multiple times on the command line. +.I Gawk +will read the program text as if all the +.IR program-file s +and command line source texts +had been concatenated together. This is useful for building libraries +of AWK functions, without having to include them in each new AWK +program that uses them. It also provides the ability to mix library +functions with command line programs. 
+.PP +The environment variable +.B AWKPATH +specifies a search path to use when finding source files named with +the +.B \-f +option. If this variable does not exist, the default path is +\fB".:/usr/local/share/awk"\fR. +(The actual directory may vary, depending upon how +.I gawk +was built and installed.) +If a file name given to the +.B \-f +option contains a ``/'' character, no path search is performed. +.PP +.I Gawk +executes AWK programs in the following order. +First, +all variable assignments specified via the +.B \-v +option are performed. +Next, +.I gawk +compiles the program into an internal form. +Then, +.I gawk +executes the code in the +.B BEGIN +block(s) (if any), +and then proceeds to read +each file named in the +.B ARGV +array. +If there are no files named on the command line, +.I gawk +reads the standard input. +.PP +If a filename on the command line has the form +.IB var = val +it is treated as a variable assignment. The variable +.I var +will be assigned the value +.IR val . +(This happens after any +.B BEGIN +block(s) have been run.) +Command line variable assignment +is most useful for dynamically assigning values to the variables +AWK uses to control how input is broken into fields and records. It +is also useful for controlling state if multiple passes are needed over +a single data file. +.PP +If the value of a particular element of +.B ARGV +is empty (\fB""\fR), +.I gawk +skips over it. +.PP +For each record in the input, +.I gawk +tests to see if it matches any +.I pattern +in the AWK program. +For each pattern that the record matches, the associated +.I action +is executed. +The patterns are tested in the order they occur in the program. +.PP +Finally, after all the input is exhausted, +.I gawk +executes the code in the +.B END +block(s) (if any). +.SH VARIABLES, RECORDS AND FIELDS +AWK variables are dynamic; they come into existence when they are +first used. 
Their values are either floating-point numbers or strings, +or both, +depending upon how they are used. AWK also has one dimensional +arrays; arrays with multiple dimensions may be simulated. +Several pre-defined variables are set as a program +runs; these will be described as needed and summarized below. +.SS Records +Normally, records are separated by newline characters. You can control how +records are separated by assigning values to the built-in variable +.BR RS . +If +.B RS +is any single character, that character separates records. +Otherwise, +.B RS +is a regular expression. Text in the input that matches this +regular expression will separate the record. +However, in compatibility mode, +only the first character of its string +value is used for separating records. +If +.B RS +is set to the null string, then records are separated by +blank lines. +When +.B RS +is set to the null string, the newline character always acts as +a field separator, in addition to whatever value +.B FS +may have. +.SS Fields +.PP +As each input record is read, +.I gawk +splits the record into +.IR fields , +using the value of the +.B FS +variable as the field separator. +If +.B FS +is a single character, fields are separated by that character. +If +.B FS +is the null string, then each individual character becomes a +separate field. +Otherwise, +.B FS +is expected to be a full regular expression. +In the special case that +.B FS +is a single space, fields are separated +by runs of spaces and/or tabs and/or newlines. +(But see the discussion of +.BR \-\-posix , +below). +Note that the value of +.B IGNORECASE +(see below) will also affect how fields are split when +.B FS +is a regular expression, and how records are separated when +.B RS +is a regular expression. +.PP +If the +.B FIELDWIDTHS +variable is set to a space separated list of numbers, each field is +expected to have fixed width, and +.I gawk +will split up the record using the specified widths. 
The value of +.B FS +is ignored. +Assigning a new value to +.B FS +overrides the use of +.BR FIELDWIDTHS , +and restores the default behavior. +.PP +Each field in the input record may be referenced by its position, +.BR $1 , +.BR $2 , +and so on. +.B $0 +is the whole record. The value of a field may be assigned to as well. +Fields need not be referenced by constants: +.RS +.PP +.ft B +n = 5 +.br +print $n +.ft R +.RE +.PP +prints the fifth field in the input record. +The variable +.B NF +is set to the total number of fields in the input record. +.PP +References to non-existent fields (i.e. fields after +.BR $NF ) +produce the null-string. However, assigning to a non-existent field +(e.g., +.BR "$(NF+2) = 5" ) +will increase the value of +.BR NF , +create any intervening fields with the null string as their value, and +cause the value of +.B $0 +to be recomputed, with the fields being separated by the value of +.BR OFS . +References to negative numbered fields cause a fatal error. +Decrementing +.B NF +causes the values of fields past the new value to be lost, and the value of +.B $0 +to be recomputed, with the fields being separated by the value of +.BR OFS . +.SS Built-in Variables +.PP +.IR Gawk 's +built-in variables are: +.PP +.TP \w'\fBFIELDWIDTHS\fR'u+1n +.B ARGC +The number of command line arguments (does not include options to +.IR gawk , +or the program source). +.TP +.B ARGIND +The index in +.B ARGV +of the current file being processed. +.TP +.B ARGV +Array of command line arguments. The array is indexed from +0 to +.B ARGC +\- 1. +Dynamically changing the contents of +.B ARGV +can control the files used for data. +.TP +.B CONVFMT +The conversion format for numbers, \fB"%.6g"\fR, by default. +.TP +.B ENVIRON +An array containing the values of the current environment. +The array is indexed by the environment variables, each element being +the value of that variable (e.g., \fBENVIRON["HOME"]\fP might be +.BR /home/arnold ). 
+Changing this array does not affect the environment seen by programs which +.I gawk +spawns via redirection or the +.B system() +function. +(This may change in a future version of +.IR gawk .) +.\" but don't hold your breath... +.TP +.B ERRNO +If a system error occurs either doing a redirection for +.BR getline , +during a read for +.BR getline , +or during a +.BR close() , +then +.B ERRNO +will contain +a string describing the error. +.TP +.B FIELDWIDTHS +A white-space separated list of fieldwidths. When set, +.I gawk +parses the input into fields of fixed width, instead of using the +value of the +.B FS +variable as the field separator. +The fixed field width facility is still experimental; the +semantics may change as +.I gawk +evolves over time. +.TP +.B FILENAME +The name of the current input file. +If no files are specified on the command line, the value of +.B FILENAME +is ``\-''. +However, +.B FILENAME +is undefined inside the +.B BEGIN +block. +.TP +.B FNR +The input record number in the current input file. +.TP +.B FS +The input field separator, a space by default. See +.BR Fields , +above. +.TP +.B IGNORECASE +Controls the case-sensitivity of all regular expression +and string operations. If +.B IGNORECASE +has a non-zero value, then string comparisons and +pattern matching in rules, +field splitting with +.BR FS , +record separating with +.BR RS , +regular expression +matching with +.B ~ +and +.BR !~ , +and the +.BR gensub() , +.BR gsub() , +.BR index() , +.BR match() , +.BR split() , +and +.B sub() +pre-defined functions will all ignore case when doing regular expression +operations. Thus, if +.B IGNORECASE +is not equal to zero, +.B /aB/ +matches all of the strings \fB"ab"\fP, \fB"aB"\fP, \fB"Ab"\fP, +and \fB"AB"\fP. +As with all AWK variables, the initial value of +.B IGNORECASE +is zero, so all regular expression and string +operations are normally case-sensitive. +Under Unix, the full ISO 8859-1 Latin-1 character set is used +when ignoring case. 
+.B NOTE: +In versions of +.I gawk +prior to 3.0, +.B IGNORECASE +only affected regular expression operations. It now affects string +comparisons as well. +.TP +.B NF +The number of fields in the current input record. +.TP +.B NR +The total number of input records seen so far. +.TP +.B OFMT +The output format for numbers, \fB"%.6g"\fR, by default. +.TP +.B OFS +The output field separator, a space by default. +.TP +.B ORS +The output record separator, by default a newline. +.TP +.B RS +The input record separator, by default a newline. +.TP +.B RT +The record terminator. +.I Gawk +sets +.B RT +to the input text that matched the character or regular expression +specified by +.BR RS . +.TP +.B RSTART +The index of the first character matched by +.BR match() ; +0 if no match. +.TP +.B RLENGTH +The length of the string matched by +.BR match() ; +\-1 if no match. +.TP +.B SUBSEP +The character used to separate multiple subscripts in array +elements, by default \fB"\e034"\fR. +.SS Arrays +.PP +Arrays are subscripted with an expression between square brackets +.RB ( [ " and " ] ). +If the expression is an expression list +.RI ( expr ", " expr " ...)" +then the array subscript is a string consisting of the +concatenation of the (string) value of each expression, +separated by the value of the +.B SUBSEP +variable. +This facility is used to simulate multiply dimensioned +arrays. For example: +.PP +.RS +.ft B +i = "A";\^ j = "B";\^ k = "C" +.br +x[i, j, k] = "hello, world\en" +.ft R +.RE +.PP +assigns the string \fB"hello, world\en"\fR to the element of the array +.B x +which is indexed by the string \fB"A\e034B\e034C"\fR. All arrays in AWK +are associative, i.e. indexed by string values. +.PP +The special operator +.B in +may be used in an +.B if +or +.B while +statement to see if an array has an index consisting of a particular +value. 
+.PP +.RS +.ft B +.nf +if (val in array) + print array[val] +.fi +.ft +.RE +.PP +If the array has multiple subscripts, use +.BR "(i, j) in array" . +.PP +The +.B in +construct may also be used in a +.B for +loop to iterate over all the elements of an array. +.PP +An element may be deleted from an array using the +.B delete +statement. +The +.B delete +statement may also be used to delete the entire contents of an array, +just by specifying the array name without a subscript. +.SS Variable Typing And Conversion +.PP +Variables and fields +may be (floating point) numbers, or strings, or both. How the +value of a variable is interpreted depends upon its context. If used in +a numeric expression, it will be treated as a number, if used as a string +it will be treated as a string. +.PP +To force a variable to be treated as a number, add 0 to it; to force it +to be treated as a string, concatenate it with the null string. +.PP +When a string must be converted to a number, the conversion is accomplished +using +.IR atof (3). +A number is converted to a string by using the value of +.B CONVFMT +as a format string for +.IR sprintf (3), +with the numeric value of the variable as the argument. +However, even though all numbers in AWK are floating-point, +integral values are +.I always +converted as integers. Thus, given +.PP +.RS +.ft B +.nf +CONVFMT = "%2.2f" +a = 12 +b = a "" +.fi +.ft R +.RE +.PP +the variable +.B b +has a string value of \fB"12"\fR and not \fB"12.00"\fR. +.PP +.I Gawk +performs comparisons as follows: +If two variables are numeric, they are compared numerically. +If one value is numeric and the other has a string value that is a +``numeric string,'' then comparisons are also done numerically. +Otherwise, the numeric value is converted to a string and a string +comparison is performed. +Two strings are compared, of course, as strings. +According to the \*(PX standard, even if two strings are +numeric strings, a numeric comparison is performed. 
However, this is +clearly incorrect, and +.I gawk +does not do this. +.PP +Note that string constants, such as \fB"57"\fP, are +.I not +numeric strings, they are string constants. The idea of ``numeric string'' +only applies to fields, +.B getline +input, +.BR FILENAME , +.B ARGV +elements, +.B ENVIRON +elements and the elements of an array created by +.B split() +that are numeric strings. +The basic idea is that +.IR "user input" , +and only user input, that looks numeric, +should be treated that way. +.PP +Uninitialized variables have the numeric value 0 and the string value "" +(the null, or empty, string). +.SH PATTERNS AND ACTIONS +AWK is a line oriented language. The pattern comes first, and then the +action. Action statements are enclosed in +.B { +and +.BR } . +Either the pattern may be missing, or the action may be missing, but, +of course, not both. If the pattern is missing, the action will be +executed for every single record of input. +A missing action is equivalent to +.RS +.PP +.B "{ print }" +.RE +.PP +which prints the entire record. +.PP +Comments begin with the ``#'' character, and continue until the +end of the line. +Blank lines may be used to separate statements. +Normally, a statement ends with a newline, however, this is not the +case for lines ending in +a ``,'', +.BR { , +.BR ? , +.BR : , +.BR && , +or +.BR || . +Lines ending in +.B do +or +.B else +also have their statements automatically continued on the following line. +In other cases, a line can be continued by ending it with a ``\e'', +in which case the newline will be ignored. +.PP +Multiple statements may +be put on one line by separating them with a ``;''. +This applies to both the statements within the action part of a +pattern-action pair (the usual case), +and to the pattern-action statements themselves. 
+.SS Patterns +AWK patterns may be one of the following: +.PP +.RS +.nf +.B BEGIN +.B END +.BI / "regular expression" / +.I "relational expression" +.IB pattern " && " pattern +.IB pattern " || " pattern +.IB pattern " ? " pattern " : " pattern +.BI ( pattern ) +.BI ! " pattern" +.IB pattern1 ", " pattern2 +.fi +.RE +.PP +.B BEGIN +and +.B END +are two special kinds of patterns which are not tested against +the input. +The action parts of all +.B BEGIN +patterns are merged as if all the statements had +been written in a single +.B BEGIN +block. They are executed before any +of the input is read. Similarly, all the +.B END +blocks are merged, +and executed when all the input is exhausted (or when an +.B exit +statement is executed). +.B BEGIN +and +.B END +patterns cannot be combined with other patterns in pattern expressions. +.B BEGIN +and +.B END +patterns cannot have missing action parts. +.PP +For +.BI / "regular expression" / +patterns, the associated statement is executed for each input record that matches +the regular expression. +Regular expressions are the same as those in +.IR egrep (1), +and are summarized below. +.PP +A +.I "relational expression" +may use any of the operators defined below in the section on actions. +These generally test whether certain fields match certain regular expressions. +.PP +The +.BR && , +.BR || , +and +.B ! +operators are logical AND, logical OR, and logical NOT, respectively, as in C. +They do short-circuit evaluation, also as in C, and are used for combining +more primitive pattern expressions. As in most languages, parentheses +may be used to change the order of evaluation. +.PP +The +.B ?\^: +operator is like the same operator in C. If the first pattern is true +then the pattern used for testing is the second pattern, otherwise it is +the third. Only one of the second and third patterns is evaluated. +.PP +The +.IB pattern1 ", " pattern2 +form of an expression is called a +.IR "range pattern" . 
+It matches all input records starting with a record that matches +.IR pattern1 , +and continuing until a record that matches +.IR pattern2 , +inclusive. It does not combine with any other sort of pattern expression. +.SS Regular Expressions +Regular expressions are the extended kind found in +.IR egrep . +They are composed of characters as follows: +.TP \w'\fB[^\fIabc...\fB]\fR'u+2n +.I c +matches the non-metacharacter +.IR c . +.TP +.I \ec +matches the literal character +.IR c . +.TP +.B . +matches any character +.I including +newline. +.TP +.B ^ +matches the beginning of a string. +.TP +.B $ +matches the end of a string. +.TP +.BI [ abc... ] +character list, matches any of the characters +.IR abc... . +.TP +.BI [^ abc... ] +negated character list, matches any character except +.IR abc... . +.TP +.IB r1 | r2 +alternation: matches either +.I r1 +or +.IR r2 . +.TP +.I r1r2 +concatenation: matches +.IR r1 , +and then +.IR r2 . +.TP +.IB r + +matches one or more +.IR r 's. +.TP +.IB r * +matches zero or more +.IR r 's. +.TP +.IB r ? +matches zero or one +.IR r 's. +.TP +.BI ( r ) +grouping: matches +.IR r . +.TP +.PD 0 +.IB r { n } +.TP +.PD 0 +.IB r { n ,} +.TP +.PD +.IB r { n , m } +One or two numbers inside braces denote an +.IR "interval expression" . +If there is one number in the braces, the preceding regexp +.I r +is repeated +.I n +times. If there are two numbers separated by a comma, +.I r +is repeated +.I n +to +.I m +times. +If there is one number followed by a comma, then +.I r +is repeated at least +.I n +times. +.sp .5 +Interval expressions are only available if either +.B \-\^\-posix +or +.B \-\^\-re\-interval +is specified on the command line. +.TP +.B \ey +matches the empty string at either the beginning or the +end of a word. +.TP +.B \eB +matches the empty string within a word. +.TP +.B \e< +matches the empty string at the beginning of a word. +.TP +.B \e> +matches the empty string at the end of a word. 
+.TP +.B \ew +matches any word-constituent character (letter, digit, or underscore). +.TP +.B \eW +matches any character that is not word-constituent. +.TP +.B \e` +matches the empty string at the beginning of a buffer (string). +.TP +.B \e' +matches the empty string at the end of a buffer. +.PP +The escape sequences that are valid in string constants (see below) +are also legal in regular expressions. +.PP +.I "Character classes" +are a new feature introduced in the POSIX standard. +A character class is a special notation for describing +lists of characters that have a specific attribute, but where the +actual characters themselves can vary from country to country and/or +from character set to character set. For example, the notion of what +is an alphabetic character differs in the USA and in France. +.PP +A character class is only valid in a regexp +.I inside +the brackets of a character list. Character classes consist of +.BR [: , +a keyword denoting the class, and +.BR :] . +Here are the character +classes defined by the POSIX standard. +.TP +.B [:alnum:] +Alphanumeric characters. +.TP +.B [:alpha:] +Alphabetic characters. +.TP +.B [:blank:] +Space or tab characters. +.TP +.B [:cntrl:] +Control characters. +.TP +.B [:digit:] +Numeric characters. +.TP +.B [:graph:] +Characters that are both printable and visible. +(A space is printable, but not visible, while an +.B a +is both.) +.TP +.B [:lower:] +Lower-case alphabetic characters. +.TP +.B [:print:] +Printable characters (characters that are not control characters.) +.TP +.B [:punct:] +Punctuation characters (characters that are not letter, digits, +control characters, or space characters). +.TP +.B [:space:] +Space characters (such as space, tab, and formfeed, to name a few). +.TP +.B [:upper:] +Upper-case alphabetic characters. +.TP +.B [:xdigit:] +Characters that are hexadecimal digits. 
+.PP +For example, before the POSIX standard, to match alphanumeric +characters, you would have had to write +.BR /[A\-Za\-z0\-9]/ . +If your character set had other alphabetic characters in it, this would not +match them. With the POSIX character classes, you can write +.BR /[[:alnum:]]/ , +and this will match +.I all +the alphabetic and numeric characters in your character set. +.PP +Two additional special sequences can appear in character lists. +These apply to non-ASCII character sets, which can have single symbols +(called +.IR "collating elements" ) +that are represented with more than one +character, as well as several characters that are equivalent for +.IR collating , +or sorting, purposes. (E.g., in French, a plain ``e'' +and a grave-accented e\` are equivalent.) +.TP +Collating Symbols +A collating symbols is a multi-character collating element enclosed in +.B [. +and +.BR .] . +For example, if +.B ch +is a collating element, then +.B [[.ch.]] +is a regexp that matches this collating element, while +.B [ch] +is a regexp that matches either +.B c +or +.BR h . +.TP +Equivalence Classes +An equivalence class is a locale-specific name for a list of +characters that are equivalent. The name is enclosed in +.B [= +and +.BR =] . +For example, the name +.B e +might be used to represent all of +``e,'' ``e\`,'' and ``e\`.'' +In this case, +.B [[=e]] +is a regexp +that matches any of + .BR e , + .BR e\' , +or + .BR e\` . +.PP +These features are very valuable in non-English speaking locales. +The library functions that +.I gawk +uses for regular expression matching +currently only recognize POSIX character classes; they do not recognize +collating symbols or equivalence classes. +.PP +The +.BR \ey , +.BR \eB , +.BR \e< , +.BR \e> , +.BR \ew , +.BR \eW , +.BR \e` , +and +.B \e' +operators are specific to +.IR gawk ; +they are extensions based on facilities in the GNU regexp libraries. 
+.PP +The various command line options +control how +.I gawk +interprets characters in regexps. +.TP +No options +In the default case, +.I gawk +provides all the facilities of +POSIX regexps and the GNU regexp operators described above. +However, interval expressions are not supported. +.TP +.B \-\^\-posix +Only POSIX regexps are supported; the GNU operators are not special. +(E.g., +.B \ew +matches a literal +.BR w ). +Interval expressions are allowed. +.TP +.B \-\^\-traditional +Traditional Unix +.I awk +regexps are matched. The GNU operators +are not special, interval expressions are not available, and neither +are the POSIX character classes +.RB ( [[:alnum:]] +and so on). +Characters described by octal and hexadecimal escape sequences are +treated literally, even if they represent regexp metacharacters. +.TP +.B \-\^\-re\-interval +Allow interval expressions in regexps, even if +.B \-\^\-traditional +has been provided. +.SS Actions +Action statements are enclosed in braces, +.B { +and +.BR } . +Action statements consist of the usual assignment, conditional, and looping +statements found in most languages. The operators, control statements, +and input/output statements +available are patterned after those in C. +.SS Operators +.PP +The operators in AWK, in order of decreasing precedence, are +.PP +.TP "\w'\fB*= /= %= ^=\fR'u+1n" +.BR ( \&... ) +Grouping +.TP +.B $ +Field reference. +.TP +.B "++ \-\^\-" +Increment and decrement, both prefix and postfix. +.TP +.B ^ +Exponentiation (\fB**\fR may also be used, and \fB**=\fR for +the assignment operator). +.TP +.B "+ \- !" +Unary plus, unary minus, and logical negation. +.TP +.B "* / %" +Multiplication, division, and modulus. +.TP +.B "+ \-" +Addition and subtraction. +.TP +.I space +String concatenation. +.TP +.PD 0 +.B "< >" +.TP +.PD 0 +.B "<= >=" +.TP +.PD +.B "!= ==" +The regular relational operators. +.TP +.B "~ !~" +Regular expression match, negated match. 
+.B NOTE: +Do not use a constant regular expression +.RB ( /foo/ ) +on the left-hand side of a +.B ~ +or +.BR !~ . +Only use one on the right-hand side. The expression +.BI "/foo/ ~ " exp +has the same meaning as \fB(($0 ~ /foo/) ~ \fIexp\fB)\fR. +This is usually +.I not +what was intended. +.TP +.B in +Array membership. +.TP +.B && +Logical AND. +.TP +.B || +Logical OR. +.TP +.B ?: +The C conditional expression. This has the form +.IB expr1 " ? " expr2 " : " expr3\c +\&. If +.I expr1 +is true, the value of the expression is +.IR expr2 , +otherwise it is +.IR expr3 . +Only one of +.I expr2 +and +.I expr3 +is evaluated. +.TP +.PD 0 +.B "= += \-=" +.TP +.PD +.B "*= /= %= ^=" +Assignment. Both absolute assignment +.BI ( var " = " value ) +and operator-assignment (the other forms) are supported. +.SS Control Statements +.PP +The control statements are +as follows: +.PP +.RS +.nf +\fBif (\fIcondition\fB) \fIstatement\fR [ \fBelse\fI statement \fR] +\fBwhile (\fIcondition\fB) \fIstatement \fR +\fBdo \fIstatement \fBwhile (\fIcondition\fB)\fR +\fBfor (\fIexpr1\fB; \fIexpr2\fB; \fIexpr3\fB) \fIstatement\fR +\fBfor (\fIvar \fBin\fI array\fB) \fIstatement\fR +\fBbreak\fR +\fBcontinue\fR +\fBdelete \fIarray\^\fB[\^\fIindex\^\fB]\fR +\fBdelete \fIarray\^\fR +\fBexit\fR [ \fIexpression\fR ] +\fB{ \fIstatements \fB} +.fi +.RE +.SS "I/O Statements" +.PP +The input/output statements are as follows: +.PP +.TP "\w'\fBprintf \fIfmt, expr-list\fR'u+1n" +.BI close( file ) +Close file (or pipe, see below). +.TP +.B getline +Set +.B $0 +from next input record; set +.BR NF , +.BR NR , +.BR FNR . +.TP +.BI "getline <" file +Set +.B $0 +from next record of +.IR file ; +set +.BR NF . +.TP +.BI getline " var" +Set +.I var +from next input record; set +.BR NR , +.BR FNR . +.TP +.BI getline " var" " <" file +Set +.I var +from next record of +.IR file . +.TP +.B next +Stop processing the current input record. 
The next input record +is read and processing starts over with the first pattern in the +AWK program. If the end of the input data is reached, the +.B END +block(s), if any, are executed. +.TP +.B "nextfile" +Stop processing the current input file. The next input record read +comes from the next input file. +.B FILENAME +and +.B ARGIND +are updated, +.B FNR +is reset to 1, and processing starts over with the first pattern in the +AWK program. If the end of the input data is reached, the +.B END +block(s), if any, are executed. +.B NOTE: +Earlier versions of gawk used +.BR "next file" , +as two words. While this usage is still recognized, it generates a +warning message and will eventually be removed. +.TP +.B print +Prints the current record. +The output record is terminated with the value of the +.B ORS +variable. +.TP +.BI print " expr-list" +Prints expressions. +Each expression is separated by the value of the +.B OFS +variable. +The output record is terminated with the value of the +.B ORS +variable. +.TP +.BI print " expr-list" " >" file +Prints expressions on +.IR file . +Each expression is separated by the value of the +.B OFS +variable. The output record is terminated with the value of the +.B ORS +variable. +.TP +.BI printf " fmt, expr-list" +Format and print. +.TP +.BI printf " fmt, expr-list" " >" file +Format and print on +.IR file . +.TP +.BI system( cmd-line ) +Execute the command +.IR cmd-line , +and return the exit status. +(This may not be available on non-\*(PX systems.) +.TP +\&\fBfflush(\fR[\fIfile\^\fR]\fB)\fR +Flush any buffers associated with the open output file or pipe +.IR file . +If +.I file +is missing, then standard output is flushed. +If +.I file +is the null string, +then all open output files and pipes +have their buffers flushed. +.PP +Other input/output redirections are also allowed. For +.B print +and +.BR printf , +.BI >> file +appends output to the +.IR file , +while +.BI | " command" +writes on a pipe. 
+In a similar fashion, +.IB command " | getline" +pipes into +.BR getline . +The +.BR getline +command will return 0 on end of file, and \-1 on an error. +.SS The \fIprintf\fP\^ Statement +.PP +The AWK versions of the +.B printf +statement and +.B sprintf() +function +(see below) +accept the following conversion specification formats: +.TP +.B %c +An \s-1ASCII\s+1 character. +If the argument used for +.B %c +is numeric, it is treated as a character and printed. +Otherwise, the argument is assumed to be a string, and only the first +character of that string is printed. +.TP +.PD 0 +.B %d +.TP +.PD +.B %i +A decimal number (the integer part). +.TP +.PD 0 +.B %e +.TP +.PD +.B %E +A floating point number of the form +.BR [\-]d.dddddde[+\^\-]dd . +The +.B %E +format uses +.B E +instead of +.BR e . +.TP +.B %f +A floating point number of the form +.BR [\-]ddd.dddddd . +.TP +.PD 0 +.B %g +.TP +.PD +.B %G +Use +.B %e +or +.B %f +conversion, whichever is shorter, with nonsignificant zeros suppressed. +The +.B %G +format uses +.B %E +instead of +.BR %e . +.TP +.B %o +An unsigned octal number (again, an integer). +.TP +.B %s +A character string. +.TP +.PD 0 +.B %x +.TP +.PD +.B %X +An unsigned hexadecimal number (an integer). +The +.B %X +format uses +.B ABCDEF +instead of +.BR abcdef . +.TP +.B %% +A single +.B % +character; no argument is converted. +.PP +There are optional, additional parameters that may lie between the +.B % +and the control letter: +.TP +.B \- +The expression should be left-justified within its field. +.TP +.I space +For numeric conversions, prefix positive values with a space, and +negative values with a minus sign. +.TP +.B + +The plus sign, used before the width modifier (see below), +says to always supply a sign for numeric conversions, even if the data +to be formatted is positive. The +.B + +overrides the space modifier. +.TP +.B # +Use an ``alternate form'' for certain control letters. +For +.BR %o , +supply a leading zero. 
+For +.BR %x , +and +.BR %X , +supply a leading +.BR 0x +or +.BR 0X +for +a nonzero result. +For +.BR %e , +.BR %E , +and +.BR %f , +the result will always contain a +decimal point. +For +.BR %g , +and +.BR %G , +trailing zeros are not removed from the result. +.TP +.B 0 +A leading +.B 0 +(zero) acts as a flag, that indicates output should be +padded with zeroes instead of spaces. +This applies even to non-numeric output formats. +This flag only has an effect when the field width is wider than the +value to be printed. +.TP +.I width +The field should be padded to this width. The field is normally padded +with spaces. If the +.B 0 +flag has been used, it is padded with zeroes. +.TP +.BI \&. prec +A number that specifies the precision to use when printing. +For the +.BR %e , +.BR %E , +and +.BR %f +formats, this specifies the +number of digits you want printed to the right of the decimal point. +For the +.BR %g , +and +.B %G +formats, it specifies the maximum number +of significant digits. For the +.BR %d , +.BR %o , +.BR %i , +.BR %u , +.BR %x , +and +.B %X +formats, it specifies the minimum number of +digits to print. For a string, it specifies the maximum number of +characters from the string that should be printed. +.PP +The dynamic +.I width +and +.I prec +capabilities of the \*(AN C +.B printf() +routines are supported. +A +.B * +in place of either the +.B width +or +.B prec +specifications will cause their values to be taken from +the argument list to +.B printf +or +.BR sprintf() . +.SS Special File Names +.PP +When doing I/O redirection from either +.B print +or +.B printf +into a file, +or via +.B getline +from a file, +.I gawk +recognizes certain special filenames internally. These filenames +allow access to open file descriptors inherited from +.IR gawk 's +parent process (usually the shell). +Other special filenames provide access to information about the running +.B gawk +process. 
+The filenames are: +.TP \w'\fB/dev/stdout\fR'u+1n +.B /dev/pid +Reading this file returns the process ID of the current process, +in decimal, terminated with a newline. +.TP +.B /dev/ppid +Reading this file returns the parent process ID of the current process, +in decimal, terminated with a newline. +.TP +.B /dev/pgrpid +Reading this file returns the process group ID of the current process, +in decimal, terminated with a newline. +.TP +.B /dev/user +Reading this file returns a single record terminated with a newline. +The fields are separated with spaces. +.B $1 +is the value of the +.IR getuid (2) +system call, +.B $2 +is the value of the +.IR geteuid (2) +system call, +.B $3 +is the value of the +.IR getgid (2) +system call, and +.B $4 +is the value of the +.IR getegid (2) +system call. +If there are any additional fields, they are the group IDs returned by +.IR getgroups (2). +Multiple groups may not be supported on all systems. +.TP +.B /dev/stdin +The standard input. +.TP +.B /dev/stdout +The standard output. +.TP +.B /dev/stderr +The standard error output. +.TP +.BI /dev/fd/\^ n +The file associated with the open file descriptor +.IR n . +.PP +These are particularly useful for error messages. For example: +.PP +.RS +.ft B +print "You blew it!" > "/dev/stderr" +.ft R +.RE +.PP +whereas you would otherwise have to use +.PP +.RS +.ft B +print "You blew it!" | "cat 1>&2" +.ft R +.RE +.PP +These file names may also be used on the command line to name data files. +.SS Numeric Functions +.PP +AWK has the following pre-defined arithmetic functions: +.PP +.TP \w'\fBsrand(\fR[\fIexpr\^\fR]\fB)\fR'u+1n +.BI atan2( y , " x" ) +returns the arctangent of +.I y/x +in radians. +.TP +.BI cos( expr ) +returns the cosine of +.IR expr , +which is in radians. +.TP +.BI exp( expr ) +the exponential function. +.TP +.BI int( expr ) +truncates to integer. +.TP +.BI log( expr ) +the natural logarithm function. +.TP +.B rand() +returns a random number between 0 and 1. 
+.TP +.BI sin( expr ) +returns the sine of +.IR expr , +which is in radians. +.TP +.BI sqrt( expr ) +the square root function. +.TP +\&\fBsrand(\fR[\fIexpr\^\fR]\fB)\fR +uses +.I expr +as a new seed for the random number generator. If no +.I expr +is provided, the time of day will be used. +The return value is the previous seed for the random +number generator. +.SS String Functions +.PP +.I Gawk +has the following pre-defined string functions: +.PP +.TP "\w'\fBsprintf(\^\fIfmt\fB\^, \fIexpr-list\^\fB)\fR'u+1n" +\fBgensub(\fIr\fB, \fIs\fB, \fIh \fR[\fB, \fIt\fR]\fB)\fR +search the target string +.I t +for matches of the regular expression +.IR r . +If +.I h +is a string beginning with +.B g +or +.BR G , +then replace all matches of +.I r +with +.IR s . +Otherwise, +.I h +is a number indicating which match of +.I r +to replace. +If no +.I t +is supplied, +.B $0 +is used instead. +Within the replacement text +.IR s , +the sequence +.BI \e n\fR, +where +.I n +is a digit from 1 to 9, may be used to indicate just the text that +matched the +.IR n 'th +parenthesized subexpression. The sequence +.B \e0 +represents the entire matched text, as does the character +.BR & . +Unlike +.B sub() +and +.BR gsub() , +the modified string is returned as the result of the function, +and the original target string is +.I not +changed. +.TP "\w'\fBsprintf(\^\fIfmt\fB\^, \fIexpr-list\^\fB)\fR'u+1n" +\fBgsub(\fIr\fB, \fIs \fR[\fB, \fIt\fR]\fB)\fR +for each substring matching the regular expression +.I r +in the string +.IR t , +substitute the string +.IR s , +and return the number of substitutions. +If +.I t +is not supplied, use +.BR $0 . +An +.B & +in the replacement text is replaced with the text that was actually matched. +Use +.B \e& +to get a literal +.BR & . +See +.I "AWK Language Programming" +for a fuller discussion of the rules for +.BR &'s +and backslashes in the replacement text of +.BR sub() , +.BR gsub() , +and +.BR gensub() . 
+.TP +.BI index( s , " t" ) +returns the index of the string +.I t +in the string +.IR s , +or 0 if +.I t +is not present. +.TP +\fBlength(\fR[\fIs\fR]\fB) +returns the length of the string +.IR s , +or the length of +.B $0 +if +.I s +is not supplied. +.TP +.BI match( s , " r" ) +returns the position in +.I s +where the regular expression +.I r +occurs, or 0 if +.I r +is not present, and sets the values of +.B RSTART +and +.BR RLENGTH . +.TP +\fBsplit(\fIs\fB, \fIa \fR[\fB, \fIr\fR]\fB)\fR +splits the string +.I s +into the array +.I a +on the regular expression +.IR r , +and returns the number of fields. If +.I r +is omitted, +.B FS +is used instead. +The array +.I a +is cleared first. +Splitting behaves identically to field splitting, described above. +.TP +.BI sprintf( fmt , " expr-list" ) +prints +.I expr-list +according to +.IR fmt , +and returns the resulting string. +.TP +\fBsub(\fIr\fB, \fIs \fR[\fB, \fIt\fR]\fB)\fR +just like +.BR gsub() , +but only the first matching substring is replaced. +.TP +\fBsubstr(\fIs\fB, \fIi \fR[\fB, \fIn\fR]\fB)\fR +returns the at most +.IR n -character +substring of +.I s +starting at +.IR i . +If +.I n +is omitted, the rest of +.I s +is used. +.TP +.BI tolower( str ) +returns a copy of the string +.IR str , +with all the upper-case characters in +.I str +translated to their corresponding lower-case counterparts. +Non-alphabetic characters are left unchanged. +.TP +.BI toupper( str ) +returns a copy of the string +.IR str , +with all the lower-case characters in +.I str +translated to their corresponding upper-case counterparts. +Non-alphabetic characters are left unchanged. +.SS Time Functions +.PP +Since one of the primary uses of AWK programs is processing log files +that contain time stamp information, +.I gawk +provides the following two functions for obtaining time stamps and +formatting them. 
+.PP +.TP "\w'\fBsystime()\fR'u+1n" +.B systime() +returns the current time of day as the number of seconds since the Epoch +(Midnight UTC, January 1, 1970 on \*(PX systems). +.TP +\fBstrftime(\fR[\fIformat \fR[\fB, \fItimestamp\fR]]\fB)\fR +formats +.I timestamp +according to the specification in +.IR format. +The +.I timestamp +should be of the same form as returned by +.BR systime() . +If +.I timestamp +is missing, the current time of day is used. +If +.I format +is missing, a default format equivalent to the output of +.IR date (1) +will be used. +See the specification for the +.B strftime() +function in \*(AN C for the format conversions that are +guaranteed to be available. +A public-domain version of +.IR strftime (3) +and a man page for it come with +.IR gawk ; +if that version was used to build +.IR gawk , +then all of the conversions described in that man page are available to +.IR gawk. +.SS String Constants +.PP +String constants in AWK are sequences of characters enclosed +between double quotes (\fB"\fR). Within strings, certain +.I "escape sequences" +are recognized, as in C. These are: +.PP +.TP \w'\fB\e\^\fIddd\fR'u+1n +.B \e\e +A literal backslash. +.TP +.B \ea +The ``alert'' character; usually the \s-1ASCII\s+1 \s-1BEL\s+1 character. +.TP +.B \eb +backspace. +.TP +.B \ef +form-feed. +.TP +.B \en +newline. +.TP +.B \er +carriage return. +.TP +.B \et +horizontal tab. +.TP +.B \ev +vertical tab. +.TP +.BI \ex "\^hex digits" +The character represented by the string of hexadecimal digits following +the +.BR \ex . +As in \*(AN C, all following hexadecimal digits are considered part of +the escape sequence. +(This feature should tell us something about language design by committee.) +E.g., \fB"\ex1B"\fR is the \s-1ASCII\s+1 \s-1ESC\s+1 (escape) character. +.TP +.BI \e ddd +The character represented by the 1-, 2-, or 3-digit sequence of octal +digits. E.g. \fB"\e033"\fR is the \s-1ASCII\s+1 \s-1ESC\s+1 (escape) character. 
+.TP +.BI \e c +The literal character +.IR c\^ . +.PP +The escape sequences may also be used inside constant regular expressions +(e.g., +.B "/[\ \et\ef\en\er\ev]/" +matches whitespace characters). +.PP +In compatibility mode, the characters represented by octal and +hexadecimal escape sequences are treated literally when used in +regexp constants. Thus, +.B /a\e52b/ +is equivalent to +.BR /a\e*b/ . +.SH FUNCTIONS +Functions in AWK are defined as follows: +.PP +.RS +\fBfunction \fIname\fB(\fIparameter list\fB) { \fIstatements \fB}\fR +.RE +.PP +Functions are executed when they are called from within expressions +in either patterns or actions. Actual parameters supplied in the function +call are used to instantiate the formal parameters declared in the function. +Arrays are passed by reference, other variables are passed by value. +.PP +Since functions were not originally part of the AWK language, the provision +for local variables is rather clumsy: They are declared as extra parameters +in the parameter list. The convention is to separate local variables from +real parameters by extra spaces in the parameter list. For example: +.PP +.RS +.ft B +.nf +function f(p, q, a, b) # a & b are local +{ + \&..... +} + +/abc/ { ... ; f(1, 2) ; ... } +.fi +.ft R +.RE +.PP +The left parenthesis in a function call is required +to immediately follow the function name, +without any intervening white space. +This is to avoid a syntactic ambiguity with the concatenation operator. +This restriction does not apply to the built-in functions listed above. +.PP +Functions may call each other and may be recursive. +Function parameters used as local variables are initialized +to the null string and the number zero upon function invocation. +.PP +If +.B \-\^\-lint +has been provided, +.I gawk +will warn about calls to undefined functions at parse time, +instead of at run time. +Calling an undefined function at run time is a fatal error. 
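The conventions above (local variables declared as extra parameters, no white space between the function name and the left parenthesis in a call, and recursion) can be sketched as follows. The names factorials, fact, and result are made up for illustration.

```shell
# Hypothetical wrapper: for each input line, print the number in the first
# field followed by its factorial.
factorials() {
    awk '
    # "result" is a local variable: it is declared as an extra parameter
    # and set off from the real parameter "n" by extra spaces.
    function fact(n,    result)
    {
        if (n <= 1)
            return 1
        # The "(" must immediately follow the function name in the call.
        result = n * fact(n - 1)
        return result
    }

    { print $1, fact($1) }
    '
}

printf '1\n5\n' | factorials
```

Each recursive call gets a fresh result, so the inner calls do not disturb the value held by the outer ones.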
+.PP +The word +.B func +may be used in place of +.BR function . +.SH EXAMPLES +.nf +Print and sort the login names of all users: + +.ft B + BEGIN { FS = ":" } + { print $1 | "sort" } + +.ft R +Count lines in a file: + +.ft B + { nlines++ } + END { print nlines } + +.ft R +Precede each line by its number in the file: + +.ft B + { print FNR, $0 } + +.ft R +Concatenate and line number (a variation on a theme): + +.ft B + { print NR, $0 } +.ft R +.fi +.SH SEE ALSO +.IR egrep (1), +.IR getpid (2), +.IR getppid (2), +.IR getpgrp (2), +.IR getuid (2), +.IR geteuid (2), +.IR getgid (2), +.IR getegid (2), +.IR getgroups (2) +.PP +.IR "The AWK Programming Language" , +Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger, +Addison-Wesley, 1988. ISBN 0-201-07981-X. +.PP +.IR "AWK Language Programming" , +Edition 1.0, published by the Free Software Foundation, 1995. +.SH POSIX COMPATIBILITY +A primary goal for +.I gawk +is compatibility with the \*(PX standard, as well as with the +latest version of \*(UX +.IR awk . +To this end, +.I gawk +incorporates the following user visible +features which are not described in the AWK book, +but are part of the Bell Labs version of +.IR awk , +and are in the \*(PX standard. +.PP +The +.B \-v +option for assigning variables before program execution starts is new. +The book indicates that command line variable assignment happens when +.I awk +would otherwise open the argument as a file, which is after the +.B BEGIN +block is executed. However, in earlier implementations, when such an +assignment appeared before any file names, the assignment would happen +.I before +the +.B BEGIN +block was run. Applications came to depend on this ``feature.'' +When +.I awk +was changed to match its documentation, this option was added to +accommodate applications that depended upon the old behavior. +(This feature was agreed upon by both the AT&T and GNU developers.) 
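The difference between the two kinds of assignment can be sketched as follows, assuming an awk that follows the documented behavior. The function names and the variable greeting are made up for illustration.

```shell
# Assignment with -v: the value is already set when the BEGIN block runs.
show_v_assignment() {
    awk -v greeting=hello 'BEGIN { print "BEGIN sees: [" greeting "]" }'
}

# Assignment as a command line argument: it is performed only when awk
# reaches that argument while looking for input files, after BEGIN has run.
# The trailing "-" names the standard input as the data file.
show_arg_assignment() {
    printf 'x\n' | awk '
        BEGIN { print "BEGIN sees: [" greeting "]" }
              { print "main sees: [" greeting "]" }
    ' greeting=hello -
}

show_v_assignment
show_arg_assignment
```

In the second function the BEGIN block prints an empty value between the brackets, because the assignment has not yet been processed at that point.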
+.PP +The +.B \-W +option for implementation specific features is from the \*(PX standard. +.PP +When processing arguments, +.I gawk +uses the special option ``\fB\-\^\-\fP'' to signal the end of +arguments. +In compatibility mode, it will warn about, but otherwise ignore, +undefined options. +In normal operation, such arguments are passed on to the AWK program for +it to process. +.PP +The AWK book does not define the return value of +.BR srand() . +The \*(PX standard +has it return the seed it was using, to allow keeping track +of random number sequences. Therefore +.B srand() +in +.I gawk +also returns its current seed. +.PP +Other new features are: +The use of multiple +.B \-f +options (from MKS +.IR awk ); +the +.B ENVIRON +array; the +.BR \ea , +and +.BR \ev +escape sequences (done originally in +.I gawk +and fed back into AT&T's); the +.B tolower() +and +.B toupper() +built-in functions (from AT&T); and the \*(AN C conversion specifications in +.B printf +(done first in AT&T's version). +.SH GNU EXTENSIONS +.I Gawk +has a number of extensions to \*(PX +.IR awk . +They are described in this section. All the extensions described here +can be disabled by +invoking +.I gawk +with the +.B \-\^\-traditional +option. +.PP +The following features of +.I gawk +are not available in +\*(PX +.IR awk . +.RS +.TP \w'\(bu'u+1n +\(bu +The +.B \ex +escape sequence. +(Disabled with +.BR \-\^\-posix .) +.TP \w'\(bu'u+1n +\(bu +The +.B fflush() +function. +(Disabled with +.BR \-\^\-posix .) +.TP +\(bu +The +.BR systime(), +.BR strftime(), +and +.B gensub() +functions. +.TP +\(bu +The special file names available for I/O redirection are not recognized. +.TP +\(bu +The +.BR ARGIND , +.BR ERRNO , +and +.B RT +variables are not special. +.TP +\(bu +The +.B IGNORECASE +variable and its side-effects are not available. +.TP +\(bu +The +.B FIELDWIDTHS +variable and fixed-width field splitting. +.TP +\(bu +The use of +.B RS +as a regular expression. 
+.TP +\(bu +The ability to split out individual characters using the null string +as the value of +.BR FS , +and as the third argument to +.BR split() . +.TP +\(bu +No path search is performed for files named via the +.B \-f +option. Therefore the +.B AWKPATH +environment variable is not special. +.TP +\(bu +The use of +.B "nextfile" +to abandon processing of the current input file. +.TP +\(bu +The use of +.BI delete " array" +to delete the entire contents of an array. +.RE +.PP +The AWK book does not define the return value of the +.B close() +function. +.IR Gawk\^ 's +.B close() +returns the value from +.IR fclose (3), +or +.IR pclose (3), +when closing a file or pipe, respectively. +.PP +When +.I gawk +is invoked with the +.B \-\^\-traditional +option, +if the +.I fs +argument to the +.B \-F +option is ``t'', then +.B FS +will be set to the tab character. +Note that typing +.B "gawk \-F\et \&..." +simply causes the shell to quote the ``t,'' and does not pass +``\et'' to the +.B \-F +option. +Since this is a rather ugly special case, it is not the default behavior. +This behavior also does not occur if +.B \-\^\-posix +has been specified. +To really get a tab character as the field separator, it is best to use +quotes: +.BR "gawk \-F'\et' \&..." . +.ig +.PP +If +.I gawk +was compiled for debugging, it will +accept the following additional options: +.TP +.PD 0 +.B \-Wparsedebug +.TP +.PD +.B \-\^\-parsedebug +Turn on +.IR yacc (1) +or +.IR bison (1) +debugging output during program parsing. +This option should only be of interest to the +.I gawk +maintainers, and may not even be compiled into +.IR gawk . +.. +.SH HISTORICAL FEATURES +There are two features of historical AWK implementations that +.I gawk +supports. +First, it is possible to call the +.B length() +built-in function not only with no argument, but even without parentheses! +Thus, +.RS +.PP +.ft B +a = length # Holy Algol 60, Batman! 
+.ft R +.RE +.PP +is the same as either of +.RS +.PP +.ft B +a = length() +.br +a = length($0) +.ft R +.RE +.PP +This feature is marked as ``deprecated'' in the \*(PX standard, and +.I gawk +will issue a warning about its use if +.B \-\^\-lint +is specified on the command line. +.PP +The other feature is the use of either the +.B continue +or the +.B break +statements outside the body of a +.BR while , +.BR for , +or +.B do +loop. Traditional AWK implementations have treated such usage as +equivalent to the +.B next +statement. +.I Gawk +will support this usage if +.B \-\^\-traditional +has been specified. +.SH ENVIRONMENT VARIABLES +If +.B POSIXLY_CORRECT +exists in the environment, then +.I gawk +behaves exactly as if +.B \-\^\-posix +had been specified on the command line. +If +.B \-\^\-lint +has been specified, +.I gawk +will issue a warning message to this effect. +.PP +The +.B AWKPATH +environment variable can be used to provide a list of directories that +.I gawk +will search when looking for files named via the +.B \-f +and +.B \-\^\-file +options. +.SH BUGS +The +.B \-F +option is not necessary given the command line variable assignment feature; +it remains only for backwards compatibility. +.PP +If your system actually has support for +.B /dev/fd +and the associated +.BR /dev/stdin , +.BR /dev/stdout , +and +.B /dev/stderr +files, you may get different output from +.I gawk +than you would get on a system without those files. When +.I gawk +interprets these files internally, it synchronizes output to the standard +output with output to +.BR /dev/stdout , +while on a system with those files, the output is actually to different +open files. +Caveat Emptor. +.PP +Syntactically invalid single character programs tend to overflow +the parse stack, generating a rather unhelpful message. Such programs +are surprisingly difficult to diagnose in the completely general case, +and the effort to do so really is not worth it. 
+.SH VERSION INFORMATION +This man page documents +.IR gawk , +version 3.0.2. +.SH AUTHORS +The original version of \*(UX +.I awk +was designed and implemented by Alfred Aho, +Peter Weinberger, and Brian Kernighan of AT&T Bell Labs. Brian Kernighan +continues to maintain and enhance it. +.PP +Paul Rubin and Jay Fenlason, +of the Free Software Foundation, wrote +.IR gawk , +to be compatible with the original version of +.I awk +distributed in Seventh Edition \*(UX. +John Woods contributed a number of bug fixes. +David Trueman, with contributions +from Arnold Robbins, made +.I gawk +compatible with the new version of \*(UX +.IR awk . +Arnold Robbins is the current maintainer. +.PP +The initial DOS port was done by Conrad Kwok and Scott Garfinkle. +Scott Deifik is the current DOS maintainer. Pat Rankin did the +port to VMS, and Michal Jaegermann did the port to the Atari ST. +The port to OS/2 was done by Kai Uwe Rommel, with contributions and +help from Darrel Hankerson. Fred Fish supplied support for the Amiga. +.SH BUG REPORTS +If you find a bug in +.IR gawk , +please send electronic mail to +.BR bug-gnu-utils@prep.ai.mit.edu , +.I with +a carbon copy to +.BR arnold@gnu.ai.mit.edu . +Please include your operating system and its revision, the version of +.IR gawk , +what C compiler you used to compile it, and a test program +and data that are as small as possible for reproducing the problem. +.PP +Before sending a bug report, please do two things. First, verify that +you have the latest version of +.IR gawk . +Many bugs (usually subtle ones) are fixed at each release, and if +yours is out of date, the problem may already have been solved. +Second, please read this man page and the reference manual carefully to +be sure that what you think is a bug really is, instead of just a quirk +in the language. +.PP +Whatever you do, do +.B NOT +post a bug report in +.BR comp.lang.awk . 
+While the +.I gawk +developers occasionally read this newsgroup, posting bug reports there +is an unreliable way to report bugs. Instead, please use the electronic mail +addresses given above. +.SH ACKNOWLEDGEMENTS +Brian Kernighan of Bell Labs +provided valuable assistance during testing and debugging. +We thank him. +.SH COPYING PERMISSIONS +Copyright \(co) 1996 Free Software Foundation, Inc. +.PP +Permission is granted to make and distribute verbatim copies of +this manual page provided the copyright notice and this permission +notice are preserved on all copies. +.ig +Permission is granted to process this file through troff and print the +results, provided the printed document carries copying permission +notice identical to this one except for the removal of this paragraph +(this paragraph not being relevant to the printed manual page). +.. +.PP +Permission is granted to copy and distribute modified versions of this +manual page under the conditions for verbatim copying, provided that +the entire resulting derived work is distributed under the terms of a +permission notice identical to this one. +.PP +Permission is granted to copy and distribute translations of this +manual page into another language, under the above conditions for +modified versions, except that this permission notice may be stated in +a translation approved by the Foundation. diff --git a/contrib/awk/doc/gawk.texi b/contrib/awk/doc/gawk.texi new file mode 100644 index 0000000..8c2aad2 --- /dev/null +++ b/contrib/awk/doc/gawk.texi @@ -0,0 +1,20820 @@ +\input texinfo @c -*-texinfo-*- +@c %**start of header (This is for running Texinfo on a region.) +@setfilename gawk.info +@settitle The GNU Awk User's Guide +@c %**end of header (This is for running Texinfo on a region.) + +@c inside ifinfo for older versions of texinfo.tex +@ifinfo +@c I hope this is the right category +@dircategory Programming Languages +@direntry +* Gawk: (gawk.info). A Text Scanning and Processing Language. 
+@end direntry +@end ifinfo + +@c @set xref-automatic-section-title +@c @set DRAFT + +@c The following information should be updated here only! +@c This sets the edition of the document, the version of gawk it +@c applies to, and when the document was updated. +@set TITLE Effective AWK Programming +@set SUBTITLE A User's Guide for GNU Awk +@set PATCHLEVEL 3 +@set EDITION 1.0.@value{PATCHLEVEL} +@set VERSION 3.0 +@set UPDATE-MONTH February 1997 +@iftex +@set DOCUMENT book +@end iftex +@ifinfo +@set DOCUMENT Info file +@end ifinfo + +@ignore +Some comments on the layout for TeX. +1. Use at least texinfo.tex 2.159. It contains fixes that + are needed to get the footings for draft mode to not appear. +2. I have done A LOT of work to make this look good. There are `@page' commands + and use of `@group ... @end group' in a number of places. If you muck + with anything, it's your responsibility not to break the layout. +@end ignore + +@c merge the function and variable indexes into the concept index +@ifinfo +@synindex fn cp +@synindex vr cp +@end ifinfo +@iftex +@syncodeindex fn cp +@syncodeindex vr cp +@end iftex + +@c If "finalout" is commented out, the printed output will show +@c black boxes that mark lines that are too long. Thus, it is +@c unwise to comment it out when running a master in case there are +@c overfulls which are deemed okay. + +@ifclear DRAFT +@iftex +@finalout +@end iftex +@end ifclear + +@smallbook +@iftex +@c @cropmarks +@end iftex + +@ifinfo +This file documents @code{awk}, a program that you can use to select +particular records in a file and perform operations upon them. + +This is Edition @value{EDITION} of @cite{@value{TITLE}}, +for the @value{VERSION}.@value{PATCHLEVEL} version of the GNU implementation of AWK. + +Copyright (C) 1989, 1991, 92, 93, 96, 97 Free Software Foundation, Inc. 
+ +Permission is granted to make and distribute verbatim copies of +this manual provided the copyright notice and this permission notice +are preserved on all copies. + +@ignore +Permission is granted to process this file through TeX and print the +results, provided the printed document carries copying permission +notice identical to this one except for the removal of this paragraph +(this paragraph not being relevant to the printed manual). + +@end ignore +Permission is granted to copy and distribute modified versions of this +manual under the conditions for verbatim copying, provided that the entire +resulting derived work is distributed under the terms of a permission +notice identical to this one. + +Permission is granted to copy and distribute translations of this manual +into another language, under the above conditions for modified versions, +except that this permission notice may be stated in a translation approved +by the Foundation. +@end ifinfo + +@setchapternewpage odd + +@titlepage +@title @value{TITLE} +@subtitle @value{SUBTITLE} +@subtitle Edition @value{EDITION} +@subtitle @value{UPDATE-MONTH} +@author Arnold D. Robbins +@ignore +@sp 1 +@author Based on @cite{The GAWK Manual}, +@author by Robbins, Close, Rubin, and Stallman +@end ignore + +@c Include the Distribution inside the titlepage environment so +@c that headings are turned off. Headings on and off do not work. + +@page +@vskip 0pt plus 1filll +@ifset LEGALJUNK +The programs and applications presented in this book have been +included for their instructional value. They have been tested with care, +but are not guaranteed for any particular purpose. The publisher does not +offer any warranties or representations, nor does it accept any +liabilities with respect to the programs or applications. +So there. +@sp 2 +UNIX is a registered trademark of X/Open, Ltd. 
@* +Microsoft, MS, and MS-DOS are registered trademarks, and Windows is a +trademark of Microsoft Corporation in the United States and other +countries. @* +Atari, 520ST, 1040ST, TT, STE, Mega, and Falcon are registered trademarks +or trademarks of Atari Corporation. @* +DEC, Digital, OpenVMS, ULTRIX, and VMS, are trademarks of Digital Equipment +Corporation. @* +@end ifset +``To boldly go where no man has gone before'' is a +Registered Trademark of Paramount Pictures Corporation. @* +@c sorry, i couldn't resist +@sp 3 +Copyright @copyright{} 1989, 1991, 92, 93, 96, 97 Free Software Foundation, Inc. +@sp 2 + +This is Edition @value{EDITION} of @cite{@value{TITLE}}, @* +for the @value{VERSION}.@value{PATCHLEVEL} (or later) version of the GNU implementation of AWK. + +@sp 2 +@center Published jointly by: + +@multitable {Specialized Systems Consultants, Inc. (SSC)} {Boston, MA 02111-1307 USA} +@item Specialized Systems Consultants, Inc. (SSC) @tab Free Software Foundation +@item PO Box 55549 @tab 59 Temple Place --- Suite 330 +@item Seattle, WA 98155 USA @tab Boston, MA 02111-1307 USA +@item Phone: +1-206-782-7733 @tab Phone: +1-617-542-5942 +@item Fax: +1-206-782-7191 @tab Fax: +1-617-542-2652 +@item E-mail: @code{sales@@ssc.com} @tab E-mail: @code{gnu@@prep.ai.mit.edu} +@item URL: @code{http://www.ssc.com/} @tab URL: @code{http://www.fsf.org/} +@end multitable + +@sp 1 +@c this ISBN can change! Check with SSC +@c This one is correct for gawk 3.0 and edition 1.0 from the FSF +@c ISBN 1-882114-26-4 @* +@c This one is correct for gawk 3.0.3 and edition 1.0.3 from SSC +ISBN 1-57831-000-8 @* + +Permission is granted to make and distribute verbatim copies of +this manual provided the copyright notice and this permission notice +are preserved on all copies. 
+ +Permission is granted to copy and distribute modified versions of this +manual under the conditions for verbatim copying, provided that the entire +resulting derived work is distributed under the terms of a permission +notice identical to this one. + +Permission is granted to copy and distribute translations of this manual +into another language, under the above conditions for modified versions, +except that this permission notice may be stated in a translation approved +by the Foundation. +@sp 2 +@c Cover art by Etienne Suvasa. +Cover art by Amy Wells Wood. +@end titlepage + +@c Thanks to Bob Chassell for directions on doing dedications. +@iftex +@headings off +@page +@w{ } +@sp 9 +@center @i{To Miriam, for making me complete.} +@sp 1 +@center @i{To Chana, for the joy you bring us.} +@sp 1 +@center @i{To Rivka, for the exponential increase.} +@sp 1 +@center @i{To Nachum, for the added dimension.} +@page +@w{ } +@page +@headings on +@end iftex + +@iftex +@headings off +@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @| +@oddheading @| @| @strong{@thischapter}@ @ @ @thispage +@ifset DRAFT +@evenfooting @today{} @| @emph{DRAFT!} @| Please Do Not Redistribute +@oddfooting Please Do Not Redistribute @| @emph{DRAFT!} @| @today{} +@end ifset +@end iftex + +@ifinfo +@node Top, Preface, (dir), (dir) +@top General Introduction +@c Preface or Licensing nodes should come right after the Top +@c node, in `unnumbered' sections, then the chapter, `What is gawk'. + +This file documents @code{awk}, a program that you can use to select +particular records in a file and perform operations upon them. + +This is Edition @value{EDITION} of @cite{@value{TITLE}}, @* +for the @value{VERSION}.@value{PATCHLEVEL} version of the GNU implementation @* +of AWK. + +@end ifinfo + +@menu +* Preface:: What this @value{DOCUMENT} is about; brief + history and acknowledgements. +* What Is Awk:: What is the @code{awk} language; using this + @value{DOCUMENT}. 
+* Getting Started:: A basic introduction to using @code{awk}. How + to run an @code{awk} program. Command line + syntax. +* One-liners:: Short, sample @code{awk} programs. +* Regexp:: All about matching things using regular + expressions. +* Reading Files:: How to read files and manipulate fields. +* Printing:: How to print using @code{awk}. Describes the + @code{print} and @code{printf} statements. + Also describes redirection of output. +* Expressions:: Expressions are the basic building blocks of + statements. +* Patterns and Actions:: Overviews of patterns and actions. +* Statements:: The various control statements are described + in detail. +* Built-in Variables:: Built-in Variables +* Arrays:: The description and use of arrays. Also + includes array-oriented control statements. +* Built-in:: The built-in functions are summarized here. +* User-defined:: User-defined functions are described in + detail. +* Invoking Gawk:: How to run @code{gawk}. +* Library Functions:: A Library of @code{awk} Functions. +* Sample Programs:: Many @code{awk} programs with complete + explanations. +* Language History:: The evolution of the @code{awk} language. +* Gawk Summary:: @code{gawk} Options and Language Summary. +* Installation:: Installing @code{gawk} under various operating + systems. +* Notes:: Something about the implementation of + @code{gawk}. +* Glossary:: An explanation of some unfamiliar terms. +* Copying:: Your right to copy and distribute @code{gawk}. +* Index:: Concept and Variable Index. + +* History:: The history of @code{gawk} and @code{awk}. +* Manual History:: Brief history of the GNU project and this + @value{DOCUMENT}. +* Acknowledgements:: Acknowledgements. +* This Manual:: Using this @value{DOCUMENT}. Includes sample + input files that you can use. +* Conventions:: Typographical Conventions. +* Sample Data Files:: Sample data files for use in the @code{awk} + programs illustrated in this @value{DOCUMENT}. +* Names:: What name to use to find @code{awk}. 
+* Running gawk:: How to run @code{gawk} programs; includes + command line syntax. +* One-shot:: Running a short throw-away @code{awk} program. +* Read Terminal:: Using no input files (input from terminal + instead). +* Long:: Putting permanent @code{awk} programs in + files. +* Executable Scripts:: Making self-contained @code{awk} programs. +* Comments:: Adding documentation to @code{gawk} programs. +* Very Simple:: A very simple example. +* Two Rules:: A less simple one-line example with two rules. +* More Complex:: A more complex example. +* Statements/Lines:: Subdividing or combining statements into + lines. +* Other Features:: Other Features of @code{awk}. +* When:: When to use @code{gawk} and when to use other + things. +* Regexp Usage:: How to Use Regular Expressions. +* Escape Sequences:: How to write non-printing characters. +* Regexp Operators:: Regular Expression Operators. +* GNU Regexp Operators:: Operators specific to GNU software. +* Case-sensitivity:: How to do case-insensitive matching. +* Leftmost Longest:: How much text matches. +* Computed Regexps:: Using Dynamic Regexps. +* Records:: Controlling how data is split into records. +* Fields:: An introduction to fields. +* Non-Constant Fields:: Non-constant Field Numbers. +* Changing Fields:: Changing the Contents of a Field. +* Field Separators:: The field separator and how to change it. +* Basic Field Splitting:: How fields are split with single characters or + simple strings. +* Regexp Field Splitting:: Using regexps as the field separator. +* Single Character Fields:: Making each character a separate field. +* Command Line Field Separator:: Setting @code{FS} from the command line. +* Field Splitting Summary:: Some final points and a summary table. +* Constant Size:: Reading constant width data. +* Multiple Line:: Reading multi-line records. +* Getline:: Reading files under explicit program control + using the @code{getline} function. 
+* Getline Intro:: Introduction to the @code{getline} function. +* Plain Getline:: Using @code{getline} with no arguments. +* Getline/Variable:: Using @code{getline} into a variable. +* Getline/File:: Using @code{getline} from a file. +* Getline/Variable/File:: Using @code{getline} into a variable from a + file. +* Getline/Pipe:: Using @code{getline} from a pipe. +* Getline/Variable/Pipe:: Using @code{getline} into a variable from a + pipe. +* Getline Summary:: Summary Of @code{getline} Variants. +* Print:: The @code{print} statement. +* Print Examples:: Simple examples of @code{print} statements. +* Output Separators:: The output separators and how to change them. +* OFMT:: Controlling Numeric Output With @code{print}. +* Printf:: The @code{printf} statement. +* Basic Printf:: Syntax of the @code{printf} statement. +* Control Letters:: Format-control letters. +* Format Modifiers:: Format-specification modifiers. +* Printf Examples:: Several examples. +* Redirection:: How to redirect output to multiple files and + pipes. +* Special Files:: File name interpretation in @code{gawk}. + @code{gawk} allows access to inherited file + descriptors. +* Close Files And Pipes:: Closing Input and Output Files and Pipes. +* Constants:: String, numeric, and regexp constants. +* Scalar Constants:: Numeric and string constants. +* Regexp Constants:: Regular Expression constants. +* Using Constant Regexps:: When and how to use a regexp constant. +* Variables:: Variables give names to values for later use. +* Using Variables:: Using variables in your programs. +* Assignment Options:: Setting variables on the command line and a + summary of command line syntax. This is an + advanced method of input. +* Conversion:: The conversion of strings to numbers and vice + versa. +* Arithmetic Ops:: Arithmetic operations (@samp{+}, @samp{-}, + etc.) +* Concatenation:: Concatenating strings. +* Assignment Ops:: Changing the value of a variable or a field. 
+* Increment Ops:: Incrementing the numeric value of a variable. +* Truth Values:: What is ``true'' and what is ``false''. +* Typing and Comparison:: How variables acquire types, and how this + affects comparison of numbers and strings with + @samp{<}, etc. +* Boolean Ops:: Combining comparison expressions using boolean + operators @samp{||} (``or''), @samp{&&} + (``and'') and @samp{!} (``not''). +* Conditional Exp:: Conditional expressions select between two + subexpressions under control of a third + subexpression. +* Function Calls:: A function call is an expression. +* Precedence:: How various operators nest. +* Pattern Overview:: What goes into a pattern. +* Kinds of Patterns:: A list of all kinds of patterns. +* Regexp Patterns:: Using regexps as patterns. +* Expression Patterns:: Any expression can be used as a pattern. +* Ranges:: Pairs of patterns specify record ranges. +* BEGIN/END:: Specifying initialization and cleanup rules. +* Using BEGIN/END:: How and why to use BEGIN/END rules. +* I/O And BEGIN/END:: I/O issues in BEGIN/END rules. +* Empty:: The empty pattern, which matches every record. +* Action Overview:: What goes into an action. +* If Statement:: Conditionally execute some @code{awk} + statements. +* While Statement:: Loop until some condition is satisfied. +* Do Statement:: Do specified action while looping until some + condition is satisfied. +* For Statement:: Another looping statement, that provides + initialization and increment clauses. +* Break Statement:: Immediately exit the innermost enclosing loop. +* Continue Statement:: Skip to the end of the innermost enclosing + loop. +* Next Statement:: Stop processing the current input record. +* Nextfile Statement:: Stop processing the current file. +* Exit Statement:: Stop execution of @code{awk}. +* User-modified:: Built-in variables that you change to control + @code{awk}. +* Auto-set:: Built-in variables where @code{awk} gives you + information. 
+* ARGC and ARGV:: Ways to use @code{ARGC} and @code{ARGV}. +* Array Intro:: Introduction to Arrays +* Reference to Elements:: How to examine one element of an array. +* Assigning Elements:: How to change an element of an array. +* Array Example:: Basic Example of an Array +* Scanning an Array:: A variation of the @code{for} statement. It + loops through the indices of an array's + existing elements. +* Delete:: The @code{delete} statement removes an element + from an array. +* Numeric Array Subscripts:: How to use numbers as subscripts in + @code{awk}. +* Uninitialized Subscripts:: Using Uninitialized variables as subscripts. +* Multi-dimensional:: Emulating multi-dimensional arrays in + @code{awk}. +* Multi-scanning:: Scanning multi-dimensional arrays. +* Calling Built-in:: How to call built-in functions. +* Numeric Functions:: Functions that work with numbers, including + @code{int}, @code{sin} and @code{rand}. +* String Functions:: Functions for string manipulation, such as + @code{split}, @code{match}, and + @code{sprintf}. +* I/O Functions:: Functions for files and shell commands. +* Time Functions:: Functions for dealing with time stamps. +* Definition Syntax:: How to write definitions and what they mean. +* Function Example:: An example function definition and what it + does. +* Function Caveats:: Things to watch out for. +* Return Statement:: Specifying the value a function returns. +* Options:: Command line options and their meanings. +* Other Arguments:: Input file names and variable assignments. +* AWKPATH Variable:: Searching directories for @code{awk} programs. +* Obsolete:: Obsolete Options and/or features. +* Undocumented:: Undocumented Options and Features. +* Known Bugs:: Known Bugs in @code{gawk}. +* Portability Notes:: What to do if you don't have @code{gawk}. +* Nextfile Function:: Two implementations of a @code{nextfile} + function. +* Assert Function:: A function for assertions in @code{awk} + programs. 
+* Round Function:: A function for rounding if @code{sprintf} does + not do it correctly. +* Ordinal Functions:: Functions for using characters as numbers and + vice versa. +* Join Function:: A function to join an array into a string. +* Mktime Function:: A function to turn a date into a timestamp. +* Gettimeofday Function:: A function to get formatted times. +* Filetrans Function:: A function for handling data file transitions. +* Getopt Function:: A function for processing command line + arguments. +* Passwd Functions:: Functions for getting user information. +* Group Functions:: Functions for getting group information. +* Library Names:: How to best name private global variables in + library functions. +* Clones:: Clones of common utilities. +* Cut Program:: The @code{cut} utility. +* Egrep Program:: The @code{egrep} utility. +* Id Program:: The @code{id} utility. +* Split Program:: The @code{split} utility. +* Tee Program:: The @code{tee} utility. +* Uniq Program:: The @code{uniq} utility. +* Wc Program:: The @code{wc} utility. +* Miscellaneous Programs:: Some interesting @code{awk} programs. +* Dupword Program:: Finding duplicated words in a document. +* Alarm Program:: An alarm clock. +* Translate Program:: A program similar to the @code{tr} utility. +* Labels Program:: Printing mailing labels. +* Word Sorting:: A program to produce a word usage count. +* History Sorting:: Eliminating duplicate entries from a history + file. +* Extract Program:: Pulling out programs from Texinfo source + files. +* Simple Sed:: A Simple Stream Editor. +* Igawk Program:: A wrapper for @code{awk} that includes files. +* V7/SVR3.1:: The major changes between V7 and System V + Release 3.1. +* SVR4:: Minor changes between System V Releases 3.1 + and 4. +* POSIX:: New features from the POSIX standard. +* BTL:: New features from the Bell Laboratories + version of @code{awk}. +* POSIX/GNU:: The extensions in @code{gawk} not in POSIX + @code{awk}. 
+* Command Line Summary:: Recapitulation of the command line. +* Language Summary:: A terse review of the language. +* Variables/Fields:: Variables, fields, and arrays. +* Fields Summary:: Input field splitting. +* Built-in Summary:: @code{awk}'s built-in variables. +* Arrays Summary:: Using arrays. +* Data Type Summary:: Values in @code{awk} are numbers or strings. +* Rules Summary:: Patterns and Actions, and their component + parts. +* Pattern Summary:: Quick overview of patterns. +* Regexp Summary:: Quick overview of regular expressions. +* Actions Summary:: Quick overview of actions. +* Operator Summary:: @code{awk} operators. +* Control Flow Summary:: The control statements. +* I/O Summary:: The I/O statements. +* Printf Summary:: A summary of @code{printf}. +* Special File Summary:: Special file names interpreted internally. +* Built-in Functions Summary:: Built-in numeric and string functions. +* Time Functions Summary:: Built-in time functions. +* String Constants Summary:: Escape sequences in strings. +* Functions Summary:: Defining and calling functions. +* Historical Features:: Some undocumented but supported ``features''. +* Gawk Distribution:: What is in the @code{gawk} distribution. +* Getting:: How to get the distribution. +* Extracting:: How to extract the distribution. +* Distribution contents:: What is in the distribution. +* Unix Installation:: Installing @code{gawk} under various versions + of Unix. +* Quick Installation:: Compiling @code{gawk} under Unix. +* Configuration Philosophy:: How it's all supposed to work. +* VMS Installation:: Installing @code{gawk} on VMS. +* VMS Compilation:: How to compile @code{gawk} under VMS. +* VMS Installation Details:: How to install @code{gawk} under VMS. +* VMS Running:: How to run @code{gawk} under VMS. +* VMS POSIX:: Alternate instructions for VMS POSIX. +* PC Installation:: Installing and Compiling @code{gawk} on MS-DOS + and OS/2 +* Atari Installation:: Installing @code{gawk} on the Atari ST. 
+* Atari Compiling:: Compiling @code{gawk} on Atari +* Atari Using:: Running @code{gawk} on Atari +* Amiga Installation:: Installing @code{gawk} on an Amiga. +* Bugs:: Reporting Problems and Bugs. +* Other Versions:: Other freely available @code{awk} + implementations. +* Compatibility Mode:: How to disable certain @code{gawk} extensions. +* Additions:: Making Additions To @code{gawk}. +* Adding Code:: Adding code to the main body of @code{gawk}. +* New Ports:: Porting @code{gawk} to a new operating system. +* Future Extensions:: New features that may be implemented one day. +* Improvements:: Suggestions for improvements by volunteers. + +@end menu + +@c dedication for Info file +@ifinfo +@center To Miriam, for making me complete. +@sp 1 +@center To Chana, for the joy you bring us. +@sp 1 +@center To Rivka, for the exponential increase. +@sp 1 +@center To Nachum, for the added dimension. +@end ifinfo + +@node Preface, What Is Awk, Top, Top +@unnumbered Preface + +@c I saw a comment somewhere that the preface should describe the book itself, +@c and the introduction should describe what the book covers. + +This @value{DOCUMENT} teaches you about the @code{awk} language and +how you can use it effectively. You should already be familiar with basic +system commands, such as @code{cat} and @code{ls},@footnote{These commands +are available on POSIX compliant systems, as well as on traditional Unix +based systems. If you are using some other operating system, you still need to +be familiar with the ideas of I/O redirection and pipes.} and basic shell +facilities, such as Input/Output (I/O) redirection and pipes. + +Implementations of the @code{awk} language are available for many different +computing environments. This @value{DOCUMENT}, while describing the @code{awk} language +in general, also describes a particular implementation of @code{awk} called +@code{gawk} (which stands for ``GNU Awk''). 
@code{gawk} runs on a broad range +of Unix systems, ranging from 80386 PC-based computers, up through large scale +systems, such as Crays. @code{gawk} has also been ported to MS-DOS and +OS/2 PC's, Atari and Amiga micro-computers, and VMS. + +@menu +* History:: The history of @code{gawk} and @code{awk}. +* Manual History:: Brief history of the GNU project and this + @value{DOCUMENT}. +* Acknowledgements:: Acknowledgements. +@end menu + +@node History, Manual History, Preface, Preface +@unnumberedsec History of @code{awk} and @code{gawk} + +@cindex acronym +@cindex history of @code{awk} +@cindex Aho, Alfred +@cindex Weinberger, Peter +@cindex Kernighan, Brian +@cindex old @code{awk} +@cindex new @code{awk} +The name @code{awk} comes from the initials of its designers: Alfred V.@: +Aho, Peter J.@: Weinberger, and Brian W.@: Kernighan. The original version of +@code{awk} was written in 1977 at AT&T Bell Laboratories. +In 1985 a new version made the programming +language more powerful, introducing user-defined functions, multiple input +streams, and computed regular expressions. +This new version became generally available with Unix System V Release 3.1. +The version in System V Release 4 added some new features and also cleaned +up the behavior in some of the ``dark corners'' of the language. +The specification for @code{awk} in the POSIX Command Language +and Utilities standard further clarified the language based on feedback +from both the @code{gawk} designers, and the original Bell Labs @code{awk} +designers. + +The GNU implementation, @code{gawk}, was written in 1986 by Paul Rubin +and Jay Fenlason, with advice from Richard Stallman. John Woods +contributed parts of the code as well. In 1988 and 1989, David Trueman, with +help from Arnold Robbins, thoroughly reworked @code{gawk} for compatibility +with the newer @code{awk}. Current development focuses on bug fixes, +performance improvements, standards compliance, and occasionally, new features. 
+ +@node Manual History, Acknowledgements, History, Preface +@unnumberedsec The GNU Project and This Book + +@cindex Free Software Foundation +@cindex Stallman, Richard +The Free Software Foundation (FSF) is a non-profit organization dedicated +to the production and distribution of freely distributable software. +It was founded by Richard M.@: Stallman, the author of the original +Emacs editor. GNU Emacs is the most widely used version of Emacs today. + +@cindex GNU Project +The GNU project is an on-going effort on the part of the Free Software +Foundation to create a complete, freely distributable, POSIX compliant +computing environment. (GNU stands for ``GNU's not Unix''.) +The FSF uses the ``GNU General Public License'' (or GPL) to ensure that +source code for their software is always available to the end user. A +copy of the GPL is included for your reference +(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}). +The GPL applies to the C language source code for @code{gawk}. + +A shell, an editor (Emacs), highly portable optimizing C, C++, and +Objective-C compilers, a symbolic debugger, and dozens of large and +small utilities (such as @code{gawk}), have all been completed and are +freely available. As of this writing (early 1997), the GNU operating +system kernel (the HURD) has been released, but is still in an early +stage of development. + +@cindex Linux +@cindex NetBSD +@cindex FreeBSD +Until the GNU operating system is more fully developed, you should +consider using Linux, a freely distributable, Unix-like operating +system for 80386, DEC Alpha, Sun SPARC and other systems. There are +many books on Linux. One freely available one is @cite{Linux +Installation and Getting Started}, by Matt Welsh. +Many Linux distributions are available, often in computer stores or +bundled on CD-ROM with books about Linux. +(There are three other freely available, Unix-like operating systems for +80386 and other systems, NetBSD, FreeBSD, and OpenBSD.
All are based on the +4.4-Lite Berkeley Software Distribution, and they use recent versions +of @code{gawk} for their versions of @code{awk}.) + +@iftex +This @value{DOCUMENT} you are reading now is actually free. The +information in it is freely available to anyone, the machine readable +source code for the @value{DOCUMENT} comes with @code{gawk}, and anyone +may take this @value{DOCUMENT} to a copying machine and make as many +copies of it as they like. (Take a moment to check the copying +permissions on the Copyright page.) + +If you paid money for this @value{DOCUMENT}, what you actually paid for +was the @value{DOCUMENT}'s nice printing and binding, and the +publisher's associated costs to produce it. We have made an effort to +keep these costs reasonable; most people would prefer a bound book to +over 330 pages of photo-copied text that would then have to be held in +a loose-leaf binder (not to mention the time and labor involved in +doing the copying). The same is true of producing this +@value{DOCUMENT} from the machine readable source; the retail price is +only slightly more than the cost per page of printing it +on a laser printer. +@end iftex + +This @value{DOCUMENT} itself has gone through several previous, +preliminary editions. I started working on a preliminary draft of +@cite{The GAWK Manual}, by Diane Close, Paul Rubin, and Richard +Stallman in the fall of 1988. +It was around 90 pages long, and barely described the original, ``old'' +version of @code{awk}. After substantial revision, the first version of +@cite{The GAWK Manual} to be released was Edition 0.11 Beta in +October of 1989. The manual then underwent more substantial revision +for Edition 0.13 of December 1991. +David Trueman, Pat Rankin, and Michal Jaegermann contributed sections +of the manual for Edition 0.13. +That edition was published by the +FSF as a bound book early in 1992.
Since then there have been several +minor revisions, notably Edition 0.14 of November 1992 that was published +by the FSF in January of 1993, and Edition 0.16 of August 1993. + +Edition 1.0 of @cite{@value{TITLE}} represents a significant re-working +of @cite{The GAWK Manual}, with much additional material. +The FSF and I agree that I am now the primary author. +I also felt that it needed a more descriptive title. + +@cite{@value{TITLE}} will undoubtedly continue to evolve. +An electronic version +comes with the @code{gawk} distribution from the FSF. +If you find an error in this @value{DOCUMENT}, please report it! +@xref{Bugs, ,Reporting Problems and Bugs}, for information on submitting +problem reports electronically, or write to me in care of the FSF. + +@node Acknowledgements, , Manual History, Preface +@unnumberedsec Acknowledgements + +@cindex Stallman, Richard +I would like to acknowledge Richard M.@: Stallman, for his vision of a +better world, and for his courage in founding the FSF and starting the +GNU project. + +The initial draft of @cite{The GAWK Manual} had the following acknowledgements: + +@quotation +Many people need to be thanked for their assistance in producing this +manual. Jay Fenlason contributed many ideas and sample programs. Richard +Mlynarik and Robert Chassell gave helpful comments on drafts of this +manual. The paper @cite{A Supplemental Document for @code{awk}} by John W.@: +Pierce of the Chemistry Department at UC San Diego, pinpointed several +issues relevant both to @code{awk} implementation and to this manual, that +would otherwise have escaped us. +@end quotation + +The following people provided many helpful comments on Edition 0.13 of +@cite{The GAWK Manual}: Rick Adams, Michael Brennan, Rich Burridge, Diane Close, +Christopher (``Topher'') Eliot, Michael Lijewski, Pat Rankin, Miriam Robbins, +and Michal Jaegermann. 
+ +The following people provided many helpful comments for Edition 1.0 of +@cite{@value{TITLE}}: Karl Berry, Michael Brennan, Darrel +Hankerson, Michal Jaegermann, Michael Lijewski, and Miriam Robbins. +Pat Rankin, Michal Jaegermann, Darrel Hankerson and Scott Deifik +updated their respective sections for Edition 1.0. + +Robert J.@: Chassell provided much valuable advice on +the use of Texinfo. He also deserves special thanks for +convincing me @emph{not} to title this @value{DOCUMENT} +@cite{How To Gawk Politely}. +Karl Berry helped significantly with the @TeX{} part of Texinfo. + +@cindex Trueman, David +David Trueman deserves special credit; he has done a yeoman job +of evolving @code{gawk} so that it performs well, and without bugs. +Although he is no longer involved with @code{gawk}, +working with him on this project was a significant pleasure. + +@cindex Deifik, Scott +@cindex Hankerson, Darrel +@cindex Rommel, Kai Uwe +@cindex Rankin, Pat +@cindex Jaegermann, Michal +Scott Deifik, Darrel Hankerson, Kai Uwe Rommel, Pat Rankin, and Michal +Jaegermann (in no particular order) are long time members of the +@code{gawk} ``crack portability team.'' Without their hard work and +help, @code{gawk} would not be nearly the fine program it is today. It +has been and continues to be a pleasure working with this team of fine +people. + +@cindex Friedl, Jeffrey +Jeffrey Friedl provided invaluable help in tracking down a number +of last minute problems with regular expressions in @code{gawk} 3.0. + +@cindex Kernighan, Brian +David and I would like to thank Brian Kernighan of Bell Labs for +invaluable assistance during the testing and debugging of @code{gawk}, and for +help in clarifying numerous points about the language. We could not have +done nearly as good a job on either @code{gawk} or its documentation without +his help. 
+ +@cindex Hughes, Phil +I would like to thank Marshall and Elaine Hartholz of Seattle, and Dr.@: +Bert and Rita Schreiber of Detroit for large amounts of quiet vacation +time in their homes, which allowed me to make significant progress on +this @value{DOCUMENT} and on @code{gawk} itself. Phil Hughes of SSC +contributed in a very important way by loaning me his laptop Linux +system, not once, but twice, allowing me to do a lot of work while +away from home. + +@cindex Robbins, Miriam +Finally, I must thank my wonderful wife, Miriam, for her patience through +the many versions of this project, for her proof-reading, +and for sharing me with the computer. +I would like to thank my parents for their love, and for the grace with +which they raised and educated me. +I also must acknowledge my gratitude to G-d, for the many opportunities +He has sent my way, as well as for the gifts He has given me with which to +take advantage of those opportunities. +@sp 2 +@noindent +Arnold Robbins @* +Atlanta, Georgia @* +February, 1997 + +@ignore +Stuff still not covered anywhere: +BASICS: + Integer vs. floating point + Hex vs. octal vs. decimal + Interpreter vs compiler + input/output +@end ignore + +@node What Is Awk, Getting Started, Preface, Top +@chapter Introduction + +If you are like many computer users, you would frequently like to make +changes in various text files wherever certain patterns appear, or +extract data from parts of certain lines while discarding the rest. To +write a program to do this in a language such as C or Pascal is a +time-consuming inconvenience that may take many lines of code. The job +may be easier with @code{awk}. + +The @code{awk} utility interprets a special-purpose programming language +that makes it possible to handle simple data-reformatting jobs +with just a few lines of code. + +The GNU implementation of @code{awk} is called @code{gawk}; it is fully +upward compatible with the System V Release 4 version of +@code{awk}. 
@code{gawk} is also upward compatible with the POSIX +specification of the @code{awk} language. This means that all +properly written @code{awk} programs should work with @code{gawk}. +Thus, we usually don't distinguish between @code{gawk} and other @code{awk} +implementations. + +@cindex uses of @code{awk} +Using @code{awk} you can: + +@itemize @bullet +@item +manage small, personal databases + +@item +generate reports + +@item +validate data + +@item +produce indexes, and perform other document preparation tasks + +@item +even experiment with algorithms that can be adapted later to other computer +languages +@end itemize + +@menu +* This Manual:: Using this @value{DOCUMENT}. Includes sample + input files that you can use. +* Conventions:: Typographical Conventions. +* Sample Data Files:: Sample data files for use in the @code{awk} + programs illustrated in this @value{DOCUMENT}. +@end menu + +@node This Manual, Conventions, What Is Awk, What Is Awk +@section Using This Book +@cindex book, using this +@cindex using this book +@cindex language, @code{awk} +@cindex program, @code{awk} +@ignore +@cindex @code{awk} language +@cindex @code{awk} program +@end ignore + +The term @code{awk} refers to a particular program, and to the language you +use to tell this program what to do. When we need to be careful, we call +the program ``the @code{awk} utility'' and the language ``the @code{awk} +language.'' The term @code{gawk} refers to a version of @code{awk} developed +as part of the GNU project. The purpose of this @value{DOCUMENT} is to explain +both the @code{awk} language and how to run the @code{awk} utility. + +The main purpose of the @value{DOCUMENT} is to explain the features +of @code{awk}, as defined in the POSIX standard. It does so in the context +of one particular implementation, @code{gawk}. While doing so, it will also +attempt to describe important differences between @code{gawk} and other +@code{awk} implementations.
Finally, any @code{gawk} features that +are not in the POSIX standard for @code{awk} will be noted. + +@iftex +This @value{DOCUMENT} has the difficult task of being both tutorial and reference. +If you are a novice, feel free to skip over details that seem too complex. +You should also ignore the many cross references; they are for the +expert user, and for the on-line Info version of the document. +@end iftex + +The term @dfn{@code{awk} program} refers to a program written by you in +the @code{awk} programming language. + +@xref{Getting Started, ,Getting Started with @code{awk}}, for the bare +essentials you need to know to start using @code{awk}. + +Some useful ``one-liners'' are included to give you a feel for the +@code{awk} language (@pxref{One-liners, ,Useful One Line Programs}). + +Many sample @code{awk} programs have been provided for you +(@pxref{Library Functions, ,A Library of @code{awk} Functions}; also +@pxref{Sample Programs, ,Practical @code{awk} Programs}). + +The entire @code{awk} language is summarized for quick reference in +@ref{Gawk Summary, ,@code{gawk} Summary}. Look there if you just need +to refresh your memory about a particular feature. + +If you find terms that you aren't familiar with, try looking them +up in the glossary (@pxref{Glossary}). + +Most of the time complete @code{awk} programs are used as examples, but in +some of the more advanced sections, only the part of the @code{awk} program +that illustrates the concept being described is shown. + +While this @value{DOCUMENT} is aimed principally at people who have not been +exposed +to @code{awk}, there is a lot of information here that even the @code{awk} +expert should find useful. In particular, the description of POSIX +@code{awk}, and the example programs in +@ref{Library Functions, ,A Library of @code{awk} Functions}, and +@ref{Sample Programs, ,Practical @code{awk} Programs}, +should be of interest. 
+ +@c fakenode --- for prepinfo +@unnumberedsubsec Dark Corners +@display +@i{Who opened that window shade?!?} +Count Dracula +@end display +@sp 1 + +@cindex d.c., see ``dark corner'' +@cindex dark corner +Until the POSIX standard (and @cite{The Gawk Manual}), +many features of @code{awk} were either poorly documented, or not +documented at all. Descriptions of such features +(often called ``dark corners'') are noted in this @value{DOCUMENT} with +``(d.c.)''. +They also appear in the index under the heading ``dark corner.'' + +@node Conventions, Sample Data Files, This Manual, What Is Awk +@section Typographical Conventions + +This @value{DOCUMENT} is written using Texinfo, the GNU documentation formatting language. +A single Texinfo source file is used to produce both the printed and on-line +versions of the documentation. +@iftex +Because of this, the typographical conventions +are slightly different than in other books you may have read. +@end iftex +@ifinfo +This section briefly documents the typographical conventions used in Texinfo. +@end ifinfo + +Examples you would type at the command line are preceded by the common +shell primary and secondary prompts, @samp{$} and @samp{>}. +Output from the command is preceded by the glyph ``@print{}''. +This typically represents the command's standard output. +Error messages, and other output on the command's standard error, are preceded +by the glyph ``@error{}''. For example: + +@example +@group +$ echo hi on stdout +@print{} hi on stdout +$ echo hello on stderr 1>&2 +@error{} hello on stderr +@end group +@end example + +@iftex +In the text, command names appear in @code{this font}, while code segments +appear in the same font and quoted, @samp{like this}. Some things will +be emphasized @emph{like this}, and if a point needs to be made +strongly, it will be done @strong{like this}. 
The first occurrence of +a new term is usually its @dfn{definition}, and appears in the same +font as the previous occurrence of ``definition'' in this sentence. +File names are indicated like this: @file{/path/to/ourfile}. +@end iftex + +Characters that you type at the keyboard look @kbd{like this}. In particular, +there are special characters called ``control characters.'' These are +characters that you type by holding down both the @kbd{CONTROL} key and +another key, at the same time. For example, a @kbd{Control-d} is typed +by first pressing and holding the @kbd{CONTROL} key, next +pressing the @kbd{d} key, and finally releasing both keys. + +@node Sample Data Files, , Conventions, What Is Awk +@section Data Files for the Examples + +@cindex input file, sample +@cindex sample input file +@cindex @file{BBS-list} file +Many of the examples in this @value{DOCUMENT} take their input from two sample +data files. The first, called @file{BBS-list}, represents a list of +computer bulletin board systems together with information about those systems. +The second data file, called @file{inventory-shipped}, contains +information about shipments on a monthly basis. In both files, +each line is considered to be one @dfn{record}. + +In the file @file{BBS-list}, each record contains the name of a computer +bulletin board, its phone number, the board's baud rate(s), and a code for +the number of hours it is operational. An @samp{A} in the last column +means the board operates 24 hours a day. A @samp{B} in the last +column means the board operates evening and weekend hours, only. A +@samp{C} means the board operates only on weekends. 
+ +@c 2e: Update the baud rates to reflect today's faster modems +@example +@c system mkdir eg +@c system mkdir eg/lib +@c system mkdir eg/data +@c system mkdir eg/prog +@c system mkdir eg/misc +@c file eg/data/BBS-list +aardvark 555-5553 1200/300 B +alpo-net 555-3412 2400/1200/300 A +barfly 555-7685 1200/300 A +bites 555-1675 2400/1200/300 A +camelot 555-0542 300 C +core 555-2912 1200/300 C +fooey 555-1234 2400/1200/300 B +foot 555-6699 1200/300 B +macfoo 555-6480 1200/300 A +sdace 555-3430 2400/1200/300 A +sabafoo 555-2127 1200/300 C +@c endfile +@end example + +@cindex @file{inventory-shipped} file +The second data file, called @file{inventory-shipped}, represents +information about shipments during the year. +Each record contains the month of the year, the number +of green crates shipped, the number of red boxes shipped, the number of +orange bags shipped, and the number of blue packages shipped, +respectively. There are 16 entries, covering the 12 months of one year +and four months of the next year. + +@example +@c file eg/data/inventory-shipped +Jan 13 25 15 115 +Feb 15 32 24 226 +Mar 15 24 34 228 +Apr 31 52 63 420 +May 16 34 29 208 +Jun 31 42 75 492 +Jul 24 34 67 436 +Aug 15 34 47 316 +Sep 13 55 37 277 +Oct 29 54 68 525 +Nov 20 87 82 577 +Dec 17 35 61 401 + +Jan 21 36 64 620 +Feb 26 58 80 652 +Mar 24 75 70 495 +Apr 21 70 74 514 +@c endfile +@end example + +@ifinfo +If you are reading this in GNU Emacs using Info, you can copy the regions +of text showing these sample files into your own test files. This way you +can try out the examples shown in the remainder of this document. You do +this by using the command @kbd{M-x write-region} to copy text from the Info +file into a file for use with @code{awk} +(@xref{Misc File Ops, , Miscellaneous File Operations, emacs, GNU Emacs Manual}, +for more information). Using this information, create your own +@file{BBS-list} and @file{inventory-shipped} files, and practice what you +learn in this @value{DOCUMENT}. 
+ +If you are using the stand-alone version of Info, +see @ref{Extract Program, ,Extracting Programs from Texinfo Source Files}, +for an @code{awk} program that will extract these data files from +@file{gawk.texi}, the Texinfo source file for this Info file. +@end ifinfo + +@node Getting Started, One-liners, What Is Awk, Top +@chapter Getting Started with @code{awk} +@cindex script, definition of +@cindex rule, definition of +@cindex program, definition of +@cindex basic function of @code{awk} + +The basic function of @code{awk} is to search files for lines (or other +units of text) that contain certain patterns. When a line matches one +of the patterns, @code{awk} performs specified actions on that line. +@code{awk} keeps processing input lines in this way until the end of the +input files is reached. + +@cindex data-driven languages +@cindex procedural languages +@cindex language, data-driven +@cindex language, procedural +Programs in @code{awk} are different from programs in most other languages, +because @code{awk} programs are @dfn{data-driven}; that is, you describe +the data you wish to work with, and then what to do when you find it. +Most other languages are @dfn{procedural}; you have to describe, in great +detail, every step the program is to take. When working with procedural +languages, it is usually much +harder to clearly describe the data your program will process. +For this reason, @code{awk} programs are often refreshingly easy to both +write and read. + +@cindex program, definition of +@cindex rule, definition of +When you run @code{awk}, you specify an @code{awk} @dfn{program} that +tells @code{awk} what to do. The program consists of a series of +@dfn{rules}. (It may also contain @dfn{function definitions}, +an advanced feature which we will ignore for now. +@xref{User-defined, ,User-defined Functions}.) Each rule specifies one +pattern to search for, and one action to perform when that pattern is found.
+ +Syntactically, a rule consists of a pattern followed by an action. The +action is enclosed in curly braces to separate it from the pattern. +Rules are usually separated by newlines. Therefore, an @code{awk} +program looks like this: + +@example +@var{pattern} @{ @var{action} @} +@var{pattern} @{ @var{action} @} +@dots{} +@end example + +@menu +* Names:: What name to use to find @code{awk}. +* Running gawk:: How to run @code{gawk} programs; includes + command line syntax. +* Very Simple:: A very simple example. +* Two Rules:: A less simple one-line example with two rules. +* More Complex:: A more complex example. +* Statements/Lines:: Subdividing or combining statements into + lines. +* Other Features:: Other Features of @code{awk}. +* When:: When to use @code{gawk} and when to use other + things. +@end menu + +@node Names, Running gawk , Getting Started, Getting Started +@section A Rose By Any Other Name + +@cindex old @code{awk} vs. new @code{awk} +@cindex new @code{awk} vs. old @code{awk} +The @code{awk} language has evolved over the years. Full details are +provided in @ref{Language History, ,The Evolution of the @code{awk} Language}. +The language described in this @value{DOCUMENT} +is often referred to as ``new @code{awk}.'' + +Because of this, many systems have multiple +versions of @code{awk}. +Some systems have an @code{awk} utility that implements the +original version of the @code{awk} language, and a @code{nawk} utility +for the new version. Others have an @code{oawk} for the ``old @code{awk}'' +language, and plain @code{awk} for the new one. Still others only +have one version, usually the new one.@footnote{Often, these systems +use @code{gawk} for their @code{awk} implementation!} + +All in all, this makes it difficult for you to know which version of +@code{awk} you should run when writing your programs. The best advice +we can give here is to check your local documentation. 
Look for @code{awk}, +@code{oawk}, and @code{nawk}, as well as for @code{gawk}. Chances are, you +will have some version of new @code{awk} on your system, and that is what +you should use when running your programs. (Of course, if you're reading +this @value{DOCUMENT}, chances are good that you have @code{gawk}!) + +Throughout this @value{DOCUMENT}, whenever we refer to a language feature +that should be available in any complete implementation of POSIX @code{awk}, +we simply use the term @code{awk}. When referring to a feature that is +specific to the GNU implementation, we use the term @code{gawk}. + +@node Running gawk, Very Simple, Names, Getting Started +@section How to Run @code{awk} Programs + +@cindex command line formats +@cindex running @code{awk} programs +There are several ways to run an @code{awk} program. If the program is +short, it is easiest to include it in the command that runs @code{awk}, +like this: + +@example +awk '@var{program}' @var{input-file1} @var{input-file2} @dots{} +@end example + +@noindent +where @var{program} consists of a series of patterns and actions, as +described earlier. +(The reason for the single quotes is described below, in +@ref{One-shot, ,One-shot Throw-away @code{awk} Programs}.) + +When the program is long, it is usually more convenient to put it in a file +and run it with a command like this: + +@example +awk -f @var{program-file} @var{input-file1} @var{input-file2} @dots{} +@end example + +@menu +* One-shot:: Running a short throw-away @code{awk} program. +* Read Terminal:: Using no input files (input from terminal + instead). +* Long:: Putting permanent @code{awk} programs in + files. +* Executable Scripts:: Making self-contained @code{awk} programs. +* Comments:: Adding documentation to @code{gawk} programs. 
+@end menu + +@node One-shot, Read Terminal, Running gawk, Running gawk +@subsection One-shot Throw-away @code{awk} Programs + +Once you are familiar with @code{awk}, you will often type in simple +programs the moment you want to use them. Then you can write the +program as the first argument of the @code{awk} command, like this: + +@example +awk '@var{program}' @var{input-file1} @var{input-file2} @dots{} +@end example + +@noindent +where @var{program} consists of a series of @var{patterns} and +@var{actions}, as described earlier. + +@cindex single quotes, why needed +This command format instructs the @dfn{shell}, or command interpreter, +to start @code{awk} and use the @var{program} to process records in the +input file(s). There are single quotes around @var{program} so that +the shell doesn't interpret any @code{awk} characters as special shell +characters. They also cause the shell to treat all of @var{program} as +a single argument for @code{awk} and allow @var{program} to be more +than one line long. + +This format is also useful for running short or medium-sized @code{awk} +programs from shell scripts, because it avoids the need for a separate +file for the @code{awk} program. A self-contained shell script is more +reliable since there are no other files to misplace. + +@ref{One-liners, , Useful One Line Programs}, presents several short, +self-contained programs. + +As an interesting side point, the command + +@example +awk '/foo/' @var{files} @dots{} +@end example + +@noindent +is essentially the same as + +@cindex @code{egrep} +@example +egrep foo @var{files} @dots{} +@end example + +@node Read Terminal, Long, One-shot, Running gawk +@subsection Running @code{awk} without Input Files + +@cindex standard input +@cindex input, standard +You can also run @code{awk} without any input files. 
If you type the +command line: + +@example +awk '@var{program}' +@end example + +@noindent +then @code{awk} applies the @var{program} to the @dfn{standard input}, +which usually means whatever you type on the terminal. This continues +until you indicate end-of-file by typing @kbd{Control-d}. +(On other operating systems, the end-of-file character may be different. +For example, on OS/2 and MS-DOS, it is @kbd{Control-z}.) + +For example, the following program prints a friendly piece of advice +(from Douglas Adams' @cite{The Hitchhiker's Guide to the Galaxy}), +to keep you from worrying about the complexities of computer programming +(@samp{BEGIN} is a feature we haven't discussed yet). + +@example +$ awk "BEGIN @{ print \"Don't Panic!\" @}" +@print{} Don't Panic! +@end example + +@cindex quoting, shell +@cindex shell quoting +This program does not read any input. The @samp{\} before each of the +inner double quotes is necessary because of the shell's quoting rules, +in particular because it mixes both single quotes and double quotes. + +This next simple @code{awk} program +emulates the @code{cat} utility; it copies whatever you type at the +keyboard to its standard output. (Why this works is explained shortly.) + +@example +$ awk '@{ print @}' +Now is the time for all good men +@print{} Now is the time for all good men +to come to the aid of their country. +@print{} to come to the aid of their country. +Four score and seven years ago, ... +@print{} Four score and seven years ago, ... +What, me worry? +@print{} What, me worry? +@kbd{Control-d} +@end example + +@node Long, Executable Scripts, Read Terminal, Running gawk +@subsection Running Long Programs + +@cindex running long programs +@cindex @code{-f} option +@cindex program file +@cindex file, @code{awk} program +Sometimes your @code{awk} programs can be very long. In this case it is +more convenient to put the program into a separate file. 
To tell +@code{awk} to use that file for its program, you type: + +@example +awk -f @var{source-file} @var{input-file1} @var{input-file2} @dots{} +@end example + +The @samp{-f} instructs the @code{awk} utility to get the @code{awk} program +from the file @var{source-file}. Any file name can be used for +@var{source-file}. For example, you could put the program: + +@example +BEGIN @{ print "Don't Panic!" @} +@end example + +@noindent +into the file @file{advice}. Then this command: + +@example +awk -f advice +@end example + +@noindent +does the same thing as this one: + +@example +awk "BEGIN @{ print \"Don't Panic!\" @}" +@end example + +@cindex quoting, shell +@cindex shell quoting +@noindent +which was explained earlier (@pxref{Read Terminal, ,Running @code{awk} without Input Files}). +Note that you don't usually need single quotes around the file name that you +specify with @samp{-f}, because most file names don't contain any of the shell's +special characters. Notice that in @file{advice}, the @code{awk} +program did not have single quotes around it. The quotes are only needed +for programs that are provided on the @code{awk} command line. + +If you want to identify your @code{awk} program files clearly as such, +you can add the extension @file{.awk} to the file name. This doesn't +affect the execution of the @code{awk} program, but it does make +``housekeeping'' easier. + +@node Executable Scripts, Comments, Long, Running gawk +@subsection Executable @code{awk} Programs +@cindex executable scripts +@cindex scripts, executable +@cindex self contained programs +@cindex program, self contained +@cindex @code{#!} (executable scripts) + +Once you have learned @code{awk}, you may want to write self-contained +@code{awk} scripts, using the @samp{#!} script mechanism. 
You can do +this on many Unix systems@footnote{The @samp{#!} mechanism works on +Linux systems, +Unix systems derived from Berkeley Unix, System V Release 4, and some System +V Release 3 systems.} (and someday on the GNU system). + +For example, you could update the file @file{advice} to look like this: + +@example +#! /bin/awk -f + +BEGIN @{ print "Don't Panic!" @} +@end example + +@noindent +After making this file executable (with the @code{chmod} utility), you +can simply type @samp{advice} +at the shell, and the system will arrange to run @code{awk}@footnote{The +line beginning with @samp{#!} lists the full file name of an interpreter +to be run, and an optional initial command line argument to pass to that +interpreter. The operating system then runs the interpreter with the given +argument and the full argument list of the executed program. The first argument +in the list is the full file name of the @code{awk} program. The rest of the +argument list will either be options to @code{awk}, or data files, +or both.} as if you had typed @samp{awk -f advice}. + +@example +@group +$ advice +@print{} Don't Panic! +@end group +@end example + +@noindent +Self-contained @code{awk} scripts are useful when you want to write a +program which users can invoke without their having to know that the program is +written in @code{awk}. + +@cindex shell scripts +@cindex scripts, shell +Some older systems do not support the @samp{#!} mechanism. You can get a +similar effect using a regular shell script. It would look something +like this: + +@example +: The colon ensures execution by the standard shell. +awk '@var{program}' "$@@" +@end example + +Using this technique, it is @emph{vital} to enclose the @var{program} in +single quotes to protect it from interpretation by the shell. If you +omit the quotes, only a shell wizard can predict the results. + +The @code{"$@@"} causes the shell to forward all the command line +arguments to the @code{awk} program, without interpretation. 
The first +line, which starts with a colon, is used so that this shell script will +work even if invoked by a user who uses the C shell. (Not all older systems +obey this convention, but many do.) +@c 2e: +@c Someday: (See @cite{The Bourne Again Shell}, by ??.) + +@node Comments, , Executable Scripts, Running gawk +@subsection Comments in @code{awk} Programs +@cindex @code{#} (comment) +@cindex comments +@cindex use of comments +@cindex documenting @code{awk} programs +@cindex programs, documenting + +A @dfn{comment} is some text that is included in a program for the sake +of human readers; it is not really part of the program. Comments +can explain what the program does, and how it works. Nearly all +programming languages have provisions for comments, because programs are +typically hard to understand without their extra help. + +In the @code{awk} language, a comment starts with the sharp sign +character, @samp{#}, and continues to the end of the line. +The @samp{#} does not have to be the first character on the line. The +@code{awk} language ignores the rest of a line following a sharp sign. +For example, we could have put the following into @file{advice}: + +@example +# This program prints a nice friendly message. It helps +# keep novice users from being afraid of the computer. +BEGIN @{ print "Don't Panic!" @} +@end example + +You can put comment lines into keyboard-composed throw-away @code{awk} +programs also, but this usually isn't very useful; the purpose of a +comment is to help you or another person understand the program at +a later time. + +@node Very Simple, Two Rules, Running gawk, Getting Started +@section A Very Simple Example + +The following command runs a simple @code{awk} program that searches the +input file @file{BBS-list} for the string of characters: @samp{foo}. (A +string of characters is usually called a @dfn{string}. 
+The term @dfn{string} is perhaps based on similar usage in English, such +as ``a string of pearls,'' or, ``a string of cars in a train.'') + +@example +awk '/foo/ @{ print $0 @}' BBS-list +@end example + +@noindent +When lines containing @samp{foo} are found, they are printed, because +@w{@samp{print $0}} means print the current line. (Just @samp{print} by +itself means the same thing, so we could have written that +instead.) + +You will notice that slashes, @samp{/}, surround the string @samp{foo} +in the @code{awk} program. The slashes indicate that @samp{foo} +is a pattern to search for. This type of pattern is called a +@dfn{regular expression}, and is covered in more detail later +(@pxref{Regexp, ,Regular Expressions}). +The pattern is allowed to match parts of words. +There are +single-quotes around the @code{awk} program so that the shell won't +interpret any of it as special shell characters. + +Here is what this program prints: + +@example +@group +$ awk '/foo/ @{ print $0 @}' BBS-list +@print{} fooey 555-1234 2400/1200/300 B +@print{} foot 555-6699 1200/300 B +@print{} macfoo 555-6480 1200/300 A +@print{} sabafoo 555-2127 1200/300 C +@end group +@end example + +@cindex action, default +@cindex pattern, default +@cindex default action +@cindex default pattern +In an @code{awk} rule, either the pattern or the action can be omitted, +but not both. If the pattern is omitted, then the action is performed +for @emph{every} input line. If the action is omitted, the default +action is to print all lines that match the pattern. + +@cindex empty action +@cindex action, empty +Thus, we could leave out the action (the @code{print} statement and the curly +braces) in the above example, and the result would be the same: all +lines matching the pattern @samp{foo} would be printed. By comparison, +omitting the @code{print} statement but retaining the curly braces makes an +empty action that does nothing; then no lines would be printed. 
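The three cases just described (pattern only, action only, and a pattern with an empty action) can be tried side by side. The following is a minimal sketch for a POSIX-compliant shell; the file name @file{bbs-sample} is a hypothetical name chosen for this illustration, and its three lines are copied from the @file{BBS-list} sample data.

```shell
# Build a three-line sample taken from the BBS-list data shown earlier.
# (bbs-sample is a made-up file name used only for this sketch.)
cat > bbs-sample <<'EOF'
aardvark 555-5553 1200/300 B
fooey 555-1234 2400/1200/300 B
foot 555-6699 1200/300 B
EOF

# Pattern only: the default action prints each matching line,
# so this prints the "fooey" and "foot" lines.
awk '/foo/' bbs-sample

# Action only: the action runs for every input line,
# so this prints aardvark, fooey and foot, one per line.
awk '{ print $1 }' bbs-sample

# Pattern with an empty action: the lines still match,
# but the action does nothing, so no output appears.
awk '/foo/ { }' bbs-sample
```

Only the first two commands produce output; the third reads the whole file silently.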
+ +@node Two Rules, More Complex, Very Simple, Getting Started +@section An Example with Two Rules +@cindex how @code{awk} works + +The @code{awk} utility reads the input files one line at a +time. For each line, @code{awk} tries the patterns of each of the rules. +If several patterns match then several actions are run, in the order in +which they appear in the @code{awk} program. If no patterns match, then +no actions are run. + +After processing all the rules (perhaps none) that match the line, +@code{awk} reads the next line (however, +@pxref{Next Statement, ,The @code{next} Statement}, +and also @pxref{Nextfile Statement, ,The @code{nextfile} Statement}). +This continues until the end of the file is reached. + +For example, the @code{awk} program: + +@example +/12/ @{ print $0 @} +/21/ @{ print $0 @} +@end example + +@noindent +contains two rules. The first rule has the string @samp{12} as the +pattern and @samp{print $0} as the action. The second rule has the +string @samp{21} as the pattern and also has @samp{print $0} as the +action. Each rule's action is enclosed in its own pair of braces. + +This @code{awk} program prints every line that contains the string +@samp{12} @emph{or} the string @samp{21}. If a line contains both +strings, it is printed twice, once by each rule. 
+ +This is what happens if we run this program on our two sample data files, +@file{BBS-list} and @file{inventory-shipped}, as shown here: + +@example +$ awk '/12/ @{ print $0 @} +> /21/ @{ print $0 @}' BBS-list inventory-shipped +@print{} aardvark 555-5553 1200/300 B +@print{} alpo-net 555-3412 2400/1200/300 A +@print{} barfly 555-7685 1200/300 A +@print{} bites 555-1675 2400/1200/300 A +@print{} core 555-2912 1200/300 C +@print{} fooey 555-1234 2400/1200/300 B +@print{} foot 555-6699 1200/300 B +@print{} macfoo 555-6480 1200/300 A +@print{} sdace 555-3430 2400/1200/300 A +@print{} sabafoo 555-2127 1200/300 C +@print{} sabafoo 555-2127 1200/300 C +@print{} Jan 21 36 64 620 +@print{} Apr 21 70 74 514 +@end example + +@noindent +Note how the line in @file{BBS-list} beginning with @samp{sabafoo} +was printed twice, once for each rule. + +@node More Complex, Statements/Lines, Two Rules, Getting Started +@section A More Complex Example + +@ignore +We have to use ls -lg here to get portable output across Unix systems. +The POSIX ls matches this behavior too. Sigh. +@end ignore +Here is an example to give you an idea of what typical @code{awk} +programs do. This example shows how @code{awk} can be used to +summarize, select, and rearrange the output of another utility. It uses +features that haven't been covered yet, so don't worry if you don't +understand all the details. + +@example +ls -lg | awk '$6 == "Nov" @{ sum += $5 @} + END @{ print sum @}' +@end example + +@cindex @code{csh}, backslash continuation +@cindex backslash continuation in @code{csh} +This command prints the total number of bytes in all the files in the +current directory that were last modified in November (of any year). +(In the C shell you would need to type a semicolon and then a backslash +at the end of the first line; in a POSIX-compliant shell, such as the +Bourne shell or Bash, the GNU Bourne-Again shell, you can type the example +as shown.) 
+@ignore +FIXME: how can users tell what shell they are running? Need a footnote +or something, but getting into this is a distraction. +@end ignore + +The @w{@samp{ls -lg}} part of this example is a system command that gives +you a listing of the files in a directory, including file size and the date +the file was last modified. Its output looks like this: + +@example +-rw-r--r-- 1 arnold user 1933 Nov 7 13:05 Makefile +-rw-r--r-- 1 arnold user 10809 Nov 7 13:03 gawk.h +-rw-r--r-- 1 arnold user 983 Apr 13 12:14 gawk.tab.h +-rw-r--r-- 1 arnold user 31869 Jun 15 12:20 gawk.y +-rw-r--r-- 1 arnold user 22414 Nov 7 13:03 gawk1.c +-rw-r--r-- 1 arnold user 37455 Nov 7 13:03 gawk2.c +-rw-r--r-- 1 arnold user 27511 Dec 9 13:07 gawk3.c +-rw-r--r-- 1 arnold user 7989 Nov 7 13:03 gawk4.c +@end example + +@noindent +The first field contains read-write permissions, the second field contains +the number of links to the file, and the third field identifies the owner of +the file. The fourth field identifies the group of the file. +The fifth field contains the size of the file in bytes. The +sixth, seventh and eighth fields contain the month, day, and time, +respectively, that the file was last modified. Finally, the ninth field +contains the name of the file. + +@cindex automatic initialization +@cindex initialization, automatic +The @samp{$6 == "Nov"} in our @code{awk} program is an expression that +tests whether the sixth field of the output from @w{@samp{ls -lg}} +matches the string @samp{Nov}. Each time a line has the string +@samp{Nov} for its sixth field, the action @samp{sum += $5} is +performed. This adds the fifth field (the file size) to the variable +@code{sum}. As a result, when @code{awk} has finished reading all the +input lines, @code{sum} is the sum of the sizes of files whose +lines matched the pattern. (This works because @code{awk} variables +are automatically initialized to zero.) 
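To experiment with this rule without depending on the contents of your own directory, you can save the sample listing in a file and run the same program on it. This is only a sketch: the file name @file{ls-sample} is hypothetical, and the second command is an addition of ours that shows what the automatic initialization means when no line matches.

```shell
# Save the sample listing from the text so the result is reproducible.
# (ls-sample is a made-up file name used only for this sketch.)
cat > ls-sample <<'EOF'
-rw-r--r-- 1 arnold user 1933 Nov 7 13:05 Makefile
-rw-r--r-- 1 arnold user 10809 Nov 7 13:03 gawk.h
-rw-r--r-- 1 arnold user 983 Apr 13 12:14 gawk.tab.h
-rw-r--r-- 1 arnold user 31869 Jun 15 12:20 gawk.y
-rw-r--r-- 1 arnold user 22414 Nov 7 13:03 gawk1.c
-rw-r--r-- 1 arnold user 37455 Nov 7 13:03 gawk2.c
-rw-r--r-- 1 arnold user 27511 Dec 9 13:07 gawk3.c
-rw-r--r-- 1 arnold user 7989 Nov 7 13:03 gawk4.c
EOF

# sum is never assigned before its first use; it starts out as zero,
# so the five November sizes simply accumulate.
awk '$6 == "Nov" { sum += $5 }
     END { print sum }' ls-sample

# No line has "Jan" in field six, so sum is never touched.
# Adding 0 forces a numeric result, and this prints 0.
awk '$6 == "Jan" { sum += $5 }
     END { print sum + 0 }' ls-sample
```

The second command adds zero in the @code{END} rule because a variable that was never assigned acts as zero in arithmetic but as the empty string when printed by itself.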
+ +After the last line of output from @code{ls} has been processed, the +@code{END} rule is executed, and the value of @code{sum} is +printed. In this example, the value of @code{sum} would be 80600. + +These more advanced @code{awk} techniques are covered in later sections +(@pxref{Action Overview, ,Overview of Actions}). Before you can move on to more +advanced @code{awk} programming, you have to know how @code{awk} interprets +your input and displays your output. By manipulating fields and using +@code{print} statements, you can produce some very useful and impressive +looking reports. + +@node Statements/Lines, Other Features, More Complex, Getting Started +@section @code{awk} Statements Versus Lines +@cindex line break +@cindex newline + +Most often, each line in an @code{awk} program is a separate statement or +separate rule, like this: + +@example +awk '/12/ @{ print $0 @} + /21/ @{ print $0 @}' BBS-list inventory-shipped +@end example + +However, @code{gawk} will ignore newlines after any of the following: + +@example +, @{ ? : || && do else +@end example + +@noindent +A newline at any other point is considered the end of the statement. +(Splitting lines after @samp{?} and @samp{:} is a minor @code{gawk} +extension. The @samp{?} and @samp{:} referred to here belong to the +three-operand conditional expression described in +@ref{Conditional Exp, ,Conditional Expressions}.) + +@cindex backslash continuation +@cindex continuation of lines +@cindex line continuation +If you would like to split a single statement into two lines at a point +where a newline would terminate it, you can @dfn{continue} it by ending the +first line with a backslash character, @samp{\}. The backslash must be +the final character on the line to be recognized as a continuation +character. This is allowed absolutely anywhere in the statement, even +in the middle of a string or regular expression.
For example: + +@example +awk '/This regular expression is too long, so continue it\ + on the next line/ @{ print $1 @}' +@end example + +@noindent +@cindex portability issues +We have generally not used backslash continuation in the sample programs +in this @value{DOCUMENT}. Since in @code{gawk} there is no limit on the +length of a line, it is never strictly necessary; it just makes programs +more readable. For this same reason, as well as for clarity, we have +kept most statements short in the sample programs presented throughout +the @value{DOCUMENT}. Backslash continuation is most useful when your +@code{awk} program is in a separate source file, instead of typed in on +the command line. You should also note that many @code{awk} +implementations are more particular about where you may use backslash +continuation. For example, they may not allow you to split a string +constant using backslash continuation. Thus, for maximal portability of +your @code{awk} programs, it is best not to split your lines in the +middle of a regular expression or a string. + +@cindex @code{csh}, backslash continuation +@cindex backslash continuation in @code{csh} +@strong{Caution: backslash continuation does not work as described above +with the C shell.} Continuation with backslash works for @code{awk} +programs in files, and also for one-shot programs @emph{provided} you +are using a POSIX-compliant shell, such as the Bourne shell or Bash, the +GNU Bourne-Again shell. But the C shell (@code{csh}) behaves +differently! There, you must use two backslashes in a row, followed by +a newline. Note also that when using the C shell, @emph{every} newline +in your awk program must be escaped with a backslash. To illustrate: + +@example +% awk 'BEGIN @{ \ +? print \\ +? "hello, world" \ +? @}' +@print{} hello, world +@end example + +@noindent +Here, the @samp{%} and @samp{?} are the C shell's primary and secondary +prompts, analogous to the standard shell's @samp{$} and @samp{>}. 
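For comparison, here is a sketch of the same one-shot program as it might be typed in a POSIX-compliant shell, where a single backslash before the newline is enough and the other newlines need no escaping. The shell variable @code{greeting} is only there so the result can be inspected; it is not part of the original example.

```shell
# Inside single quotes a POSIX shell passes the backslash and the newline
# through to awk unchanged; awk then treats the pair as a line continuation.
greeting=$(awk 'BEGIN { print \
    "hello, world" }')
echo "$greeting"
```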
+ +@code{awk} is a line-oriented language. Each rule's action has to +begin on the same line as the pattern. To have the pattern and action +on separate lines, you @emph{must} use backslash continuation---there +is no other way. + +@cindex backslash continuation and comments +@cindex comments and backslash continuation +Note that backslash continuation and comments do not mix. As soon +as @code{awk} sees the @samp{#} that starts a comment, it ignores +@emph{everything} on the rest of the line. For example: + +@example +@group +$ gawk 'BEGIN @{ print "dont panic" # a friendly \ +> BEGIN rule +> @}' +@error{} gawk: cmd. line:2: BEGIN rule +@error{} gawk: cmd. line:2: ^ parse error +@end group +@end example + +@noindent +Here, it looks like the backslash would continue the comment onto the +next line. However, the backslash-newline combination is never even +noticed, since it is ``hidden'' inside the comment. Thus, the +@samp{BEGIN} is noted as a syntax error. + +@cindex multiple statements on one line +When @code{awk} statements within one rule are short, you might want to put +more than one of them on a line. You do this by separating the statements +with a semicolon, @samp{;}. + +This also applies to the rules themselves. +Thus, the previous program could have been written: + +@example +/12/ @{ print $0 @} ; /21/ @{ print $0 @} +@end example + +@noindent +@strong{Note:} the requirement that rules on the same line must be +separated with a semicolon was not in the original @code{awk} +language; it was added for consistency with the treatment of statements +within an action. + +@node Other Features, When, Statements/Lines, Getting Started +@section Other Features of @code{awk} + +The @code{awk} language provides a number of predefined, or built-in variables, which +your programs can use to get information from @code{awk}. There are other +variables your program can set to control how @code{awk} processes your +data. 
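As a small preview, the sketch below uses one variable of each kind. It assumes the standard variables @code{NR} (the number of records read so far), @code{NF} (the number of fields in the current record), and @code{FS} (the input field separator), none of which has been introduced yet at this point; the sample input and the shell variable @code{report} are invented for the illustration.

```shell
# FS is set by the program: it tells awk to split each record on colons.
# NR and NF are set by awk: the record number and the field count.
report=$(printf 'root:x:0\ndaemon:x\n' |
    awk 'BEGIN { FS = ":" } { print NR, NF, $1 }')
echo "$report"
```

The first input line has three colon-separated fields and the second has two, so the second column of the output differs between the two records.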
+ +In addition, @code{awk} provides a number of built-in functions for doing +common computational and string related operations. + +As we develop our presentation of the @code{awk} language, we introduce +most of the variables and many of the functions. They are defined +systematically in @ref{Built-in Variables}, and +@ref{Built-in, ,Built-in Functions}. + +@node When, , Other Features, Getting Started +@section When to Use @code{awk} + +@cindex when to use @code{awk} +@cindex applications of @code{awk} +You might wonder how @code{awk} might be useful for you. Using +utility programs, advanced patterns, field separators, arithmetic +statements, and other selection criteria, you can produce much more +complex output. The @code{awk} language is very useful for producing +reports from large amounts of raw data, such as summarizing information +from the output of other utility programs like @code{ls}. +(@xref{More Complex, ,A More Complex Example}.) + +Programs written with @code{awk} are usually much smaller than they would +be in other languages. This makes @code{awk} programs easy to compose and +use. Often, @code{awk} programs can be quickly composed at your terminal, +used once, and thrown away. Since @code{awk} programs are interpreted, you +can avoid the (usually lengthy) compilation part of the typical +edit-compile-test-debug cycle of software development. + +Complex programs have been written in @code{awk}, including a complete +retargetable assembler for eight-bit microprocessors (@pxref{Glossary}, for +more information) and a microcode assembler for a special purpose Prolog +computer. However, @code{awk}'s capabilities are strained by tasks of +such complexity. + +If you find yourself writing @code{awk} scripts of more than, say, a few +hundred lines, you might consider using a different programming +language. Emacs Lisp is a good choice if you need sophisticated string +or pattern matching capabilities. 
The shell is also good at string and +pattern matching; in addition, it allows powerful use of the system +utilities. More conventional languages, such as C, C++, and Lisp, offer +better facilities for system programming and for managing the complexity +of large programs. Programs in these languages may require more lines +of source code than the equivalent @code{awk} programs, but they are +easier to maintain and usually run more efficiently. + +@node One-liners, Regexp, Getting Started, Top +@chapter Useful One Line Programs + +@cindex one-liners +Many useful @code{awk} programs are short, just a line or two. Here is a +collection of useful, short programs to get you started. Some of these +programs contain constructs that haven't been covered yet. The description +of the program will give you a good idea of what is going on, but please +read the rest of the @value{DOCUMENT} to become an @code{awk} expert! + +Most of the examples use a data file named @file{data}. This is just a +placeholder; if you were to use these programs yourself, you would substitute +your own file names for @file{data}. + +@ifinfo +Since you are reading this in Info, each line of the example code is +enclosed in quotes, to represent text that you would type literally. +The examples themselves represent shell commands that use single quotes +to keep the shell from interpreting the contents of the program. +When reading the examples, focus on the text between the open and close +quotes. +@end ifinfo + +@table @code +@item awk '@{ if (length($0) > max) max = length($0) @} +@itemx @ @ @ @ @ END @{ print max @}' data +This program prints the length of the longest input line. + +@item awk 'length($0) > 80' data +This program prints every line that is longer than 80 characters. The sole +rule has a relational expression as its pattern, and has no action (so the +default action, printing the record, is used). 
+ +@item expand@ data@ |@ awk@ '@{ if (x < length()) x = length() @} +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "maximum line length is " x @}' +This program prints the length of the longest line in @file{data}. The input +is processed by the @code{expand} program to change tabs into spaces, +so the widths compared are actually the right-margin columns. + +@item awk 'NF > 0' data +This program prints every line that has at least one field. This is an +easy way to delete blank lines from a file (or rather, to create a new +file similar to the old file but from which the blank lines have been +deleted). + +@c Karl Berry points out that new users probably don't want to see +@c multiple ways to do things, just the `best' way. He's probably +@c right. At some point it might be worth adding something about there +@c often being multiple ways to do things in awk, but for now we'll +@c just take this one out. +@ignore +@item awk '@{ if (NF > 0) print @}' data +This program also prints every line that has at least one field. Here we +allow the rule to match every line, and then decide in the action whether +to print. +@end ignore + +@item awk@ 'BEGIN@ @{@ for (i = 1; i <= 7; i++) +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ print int(101 * rand()) @}' +This program prints seven random numbers from zero to 100, inclusive. + +@item ls -lg @var{files} | awk '@{ x += $5 @} ; END @{ print "total bytes: " x @}' +This program prints the total number of bytes used by @var{files}. + +@item ls -lg @var{files} | awk '@{ x += $5 @} +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "total K-bytes: " (x + 1023)/1024 @}' +This program prints the total number of kilobytes used by @var{files}. + +@item awk -F: '@{ print $1 @}' /etc/passwd | sort +This program prints a sorted list of the login names of all users. + +@item awk 'END @{ print NR @}' data +This program counts lines in a file. 
+ +@item awk 'NR % 2 == 0' data +This program prints the even numbered lines in the data file. +If you were to use the expression @samp{NR % 2 == 1} instead, +it would print the odd numbered lines. +@end table + +@node Regexp, Reading Files, One-liners, Top +@chapter Regular Expressions +@cindex pattern, regular expressions +@cindex regexp +@cindex regular expression +@cindex regular expressions as patterns + +A @dfn{regular expression}, or @dfn{regexp}, is a way of describing a +set of strings. +Because regular expressions are such a fundamental part of @code{awk} +programming, their format and use deserve a separate chapter. + +A regular expression enclosed in slashes (@samp{/}) +is an @code{awk} pattern that matches every input record whose text +belongs to that set. + +The simplest regular expression is a sequence of letters, numbers, or +both. Such a regexp matches any string that contains that sequence. +Thus, the regexp @samp{foo} matches any string containing @samp{foo}. +Therefore, the pattern @code{/foo/} matches any input record containing +the three characters @samp{foo}, @emph{anywhere} in the record. Other +kinds of regexps let you specify more complicated classes of strings. + +@iftex +Initially, the examples will be simple. As we explain more about how +regular expressions work, we will present more complicated examples. +@end iftex + +@menu +* Regexp Usage:: How to Use Regular Expressions. +* Escape Sequences:: How to write non-printing characters. +* Regexp Operators:: Regular Expression Operators. +* GNU Regexp Operators:: Operators specific to GNU software. +* Case-sensitivity:: How to do case-insensitive matching. +* Leftmost Longest:: How much text matches. +* Computed Regexps:: Using Dynamic Regexps. +@end menu + +@node Regexp Usage, Escape Sequences, Regexp, Regexp +@section How to Use Regular Expressions + +A regular expression can be used as a pattern by enclosing it in +slashes. 
Then the regular expression is tested against the +entire text of each record. (Normally, it only needs +to match some part of the text in order to succeed.) For example, this +prints the second field of each record that contains the three +characters @samp{foo} anywhere in it: + +@example +@group +$ awk '/foo/ @{ print $2 @}' BBS-list +@print{} 555-1234 +@print{} 555-6699 +@print{} 555-6480 +@print{} 555-2127 +@end group +@end example + +@cindex regexp matching operators +@cindex string-matching operators +@cindex operators, string-matching +@cindex operators, regexp matching +@cindex regexp match/non-match operators +@cindex @code{~} operator +@cindex @code{!~} operator +Regular expressions can also be used in matching expressions. These +expressions allow you to specify the string to match against; it need +not be the entire current input record. The two operators, @samp{~} +and @samp{!~}, perform regular expression comparisons. Expressions +using these operators can be used as patterns or in @code{if}, +@code{while}, @code{for}, and @code{do} statements. +@ifinfo +@c adding this xref in TeX screws up the formatting too much +(@xref{Statements, ,Control Statements in Actions}.) +@end ifinfo + +@table @code +@item @var{exp} ~ /@var{regexp}/ +This is true if the expression @var{exp} (taken as a string) +is matched by @var{regexp}. The following example matches, or selects, +all input records with the upper-case letter @samp{J} somewhere in the +first field: + +@example +@group +$ awk '$1 ~ /J/' inventory-shipped +@print{} Jan 13 25 15 115 +@print{} Jun 31 42 75 492 +@print{} Jul 24 34 67 436 +@print{} Jan 21 36 64 620 +@end group +@end example + +So does this: + +@example +awk '@{ if ($1 ~ /J/) print @}' inventory-shipped +@end example + +@item @var{exp} !~ /@var{regexp}/ +This is true if the expression @var{exp} (taken as a character string) +is @emph{not} matched by @var{regexp}. 
The following example matches, +or selects, all input records whose first field @emph{does not} contain +the upper-case letter @samp{J}: + +@example +@group +$ awk '$1 !~ /J/' inventory-shipped +@print{} Feb 15 32 24 226 +@print{} Mar 15 24 34 228 +@print{} Apr 31 52 63 420 +@print{} May 16 34 29 208 +@dots{} +@end group +@end example +@end table + +@cindex regexp constant +When a regexp is written enclosed in slashes, like @code{/foo/}, we call it +a @dfn{regexp constant}, much like @code{5.27} is a numeric constant, and +@code{"foo"} is a string constant. + +@node Escape Sequences, Regexp Operators, Regexp Usage, Regexp +@section Escape Sequences + +@cindex escape sequence notation +Some characters cannot be included literally in string constants +(@code{"foo"}) or regexp constants (@code{/foo/}). You represent them +instead with @dfn{escape sequences}, which are character sequences +beginning with a backslash (@samp{\}). + +One use of an escape sequence is to include a double-quote character in +a string constant. Since a plain double-quote would end the string, you +must use @samp{\"} to represent an actual double-quote character as a +part of the string. For example: + +@example +$ awk 'BEGIN @{ print "He said \"hi!\" to her." @}' +@print{} He said "hi!" to her. +@end example + +The backslash character itself is another character that cannot be +included normally; you write @samp{\\} to put one backslash in the +string or regexp. Thus, the string whose contents are the two characters +@samp{"} and @samp{\} must be written @code{"\"\\"}. + +Another use of backslash is to represent unprintable characters +such as tab or newline. While there is nothing to stop you from entering most +unprintable characters directly in a string constant or regexp constant, +they may look ugly. + +Here is a table of all the escape sequences used in @code{awk}, and +what they represent. 
Unless noted otherwise, all of these escape +sequences apply to both string constants and regexp constants. + +@c @cartouche +@table @code +@item \\ +A literal backslash, @samp{\}. + +@cindex @code{awk} language, V.4 version +@item \a +The ``alert'' character, @kbd{Control-g}, ASCII code 7 (BEL). + +@item \b +Backspace, @kbd{Control-h}, ASCII code 8 (BS). + +@item \f +Formfeed, @kbd{Control-l}, ASCII code 12 (FF). + +@item \n +Newline, @kbd{Control-j}, ASCII code 10 (LF). + +@item \r +Carriage return, @kbd{Control-m}, ASCII code 13 (CR). + +@item \t +Horizontal tab, @kbd{Control-i}, ASCII code 9 (HT). + +@cindex @code{awk} language, V.4 version +@item \v +Vertical tab, @kbd{Control-k}, ASCII code 11 (VT). + +@item \@var{nnn} +The octal value @var{nnn}, where @var{nnn} are one to three digits +between @samp{0} and @samp{7}. For example, the code for the ASCII ESC +(escape) character is @samp{\033}. + +@cindex @code{awk} language, V.4 version +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +@item \x@var{hh}@dots{} +The hexadecimal value @var{hh}, where @var{hh} are hexadecimal +digits (@samp{0} through @samp{9} and either @samp{A} through @samp{F} or +@samp{a} through @samp{f}). Like the same construct in ANSI C, the escape +sequence continues until the first non-hexadecimal digit is seen. However, +using more than two hexadecimal digits produces undefined results. (The +@samp{\x} escape sequence is not allowed in POSIX @code{awk}.) + +@item \/ +A literal slash (necessary for regexp constants only). +You use this when you wish to write a regexp +constant that contains a slash. Since the regexp is delimited by +slashes, you need to escape the slash that is part of the pattern, +in order to tell @code{awk} to keep processing the rest of the regexp. + +@item \" +A literal double-quote (necessary for string constants only). +You use this when you wish to write a string +constant that contains a double-quote. 
Since the string is delimited by +double-quotes, you need to escape the quote that is part of the string, +in order to tell @code{awk} to keep processing the rest of the string. +@end table +@c @end cartouche + +In @code{gawk}, there are additional two-character sequences that begin +with backslash that have special meaning in regexps. +@xref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}. + +In a string constant, +what happens if you place a backslash before something that is not one of +the characters listed above? POSIX @code{awk} purposely leaves this case +undefined. There are two choices. + +@itemize @bullet +@item +Strip the backslash out. This is what Unix @code{awk} and @code{gawk} both do. +For example, @code{"a\qc"} is the same as @code{"aqc"}. + +@item +Leave the backslash alone. Some other @code{awk} implementations do this. +In such implementations, @code{"a\qc"} is the same as if you had typed +@code{"a\\qc"}. +@end itemize + +In a regexp, a backslash before any character that is not in the above table, +and not listed in +@ref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}, +means that the next character should be taken literally, even if it would +normally be a regexp operator. E.g., @code{/a\+b/} matches the three +characters @samp{a+b}. + +@cindex portability issues +For complete portability, do not use a backslash before any character not +listed in the table above. + +Another interesting question arises. Suppose you use an octal or hexadecimal +escape to represent a regexp metacharacter +(@pxref{Regexp Operators, , Regular Expression Operators}). +Does @code{awk} treat the character as a literal character, or as a regexp +operator? + +@cindex dark corner +It turns out that historically, such characters were taken literally (d.c.). +However, the POSIX standard indicates that they should be treated +as real metacharacters, and this is what @code{gawk} does. 
+However, in compatibility mode (@pxref{Options, ,Command Line Options}), +@code{gawk} treats the characters represented by octal and hexadecimal +escape sequences literally when used in regexp constants. Thus, +@code{/a\52b/} is equivalent to @code{/a\*b/}. + +To summarize: + +@enumerate 1 +@item +The escape sequences in the table above are always processed first, +for both string constants and regexp constants. This happens very early, +as soon as @code{awk} reads your program. + +@item +@code{gawk} processes both regexp constants and dynamic regexps +(@pxref{Computed Regexps, ,Using Dynamic Regexps}), +for the special operators listed in +@ref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}. + +@item +A backslash before any other character means to treat that character +literally. +@end enumerate + +@node Regexp Operators, GNU Regexp Operators, Escape Sequences, Regexp +@section Regular Expression Operators +@cindex metacharacters +@cindex regular expression metacharacters +@cindex regexp operators + +You can combine regular expressions with the following characters, +called @dfn{regular expression operators}, or @dfn{metacharacters}, to +increase the power and versatility of regular expressions. + +The escape sequences described +@iftex +above +@end iftex +in @ref{Escape Sequences}, +are valid inside a regexp. They are introduced by a @samp{\}. They +are recognized and converted into the corresponding real characters as +the very first step in processing regexps. + +Here is a table of metacharacters. All characters that are not escape +sequences and that are not listed in the table stand for themselves. + +@table @code +@item \ +This is used to suppress the special meaning of a character when +matching. For example: + +@example +\$ +@end example + +@noindent +matches the character @samp{$}. + +@c NEEDED +@page +@cindex anchors in regexps +@cindex regexp, anchors +@item ^ +This matches the beginning of a string. 
For example: + +@example +^@@chapter +@end example + +@noindent +matches the @samp{@@chapter} at the beginning of a string, and can be used +to identify chapter beginnings in Texinfo source files. +The @samp{^} is known as an @dfn{anchor}, since it anchors the pattern to +matching only at the beginning of the string. + +It is important to realize that @samp{^} does not match the beginning of +a line embedded in a string. In this example the condition is not true: + +@example +if ("line1\nLINE 2" ~ /^L/) @dots{} +@end example + +@item $ +This is similar to @samp{^}, but it matches only at the end of a string. +For example: + +@example +p$ +@end example + +@noindent +matches a record that ends with a @samp{p}. The @samp{$} is also an anchor, +and also does not match the end of a line embedded in a string. In this +example the condition is not true: + +@example +if ("line1\nLINE 2" ~ /1$/) @dots{} +@end example + +@item . +The period, or dot, matches any single character, +@emph{including} the newline character. For example: + +@example +.P +@end example + +@noindent +matches any single character followed by a @samp{P} in a string. Using +concatenation we can make a regular expression like @samp{U.A}, which +matches any three-character sequence that begins with @samp{U} and ends +with @samp{A}. + +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +In strict POSIX mode (@pxref{Options, ,Command Line Options}), +@samp{.} does not match the @sc{nul} +character, which is a character with all bits equal to zero. +Otherwise, @sc{nul} is just another character. Other versions of @code{awk} +may not be able to match the @sc{nul} character. + +@ignore +2e: Add stuff that character list is the POSIX terminology. In other + literature known as character set or character class. +@end ignore + +@cindex character list +@item [@dots{}] +This is called a @dfn{character list}. It matches any @emph{one} of the +characters that are enclosed in the square brackets. 
For example: + +@example +[MVX] +@end example + +@noindent +matches any one of the characters @samp{M}, @samp{V}, or @samp{X} in a +string. + +Ranges of characters are indicated by using a hyphen between the beginning +and ending characters, and enclosing the whole thing in brackets. For +example: + +@example +[0-9] +@end example + +@noindent +matches any digit. +Multiple ranges are allowed. E.g., the list @code{@w{[A-Za-z0-9]}} is a +common way to express the idea of ``all alphanumeric characters.'' + +To include one of the characters @samp{\}, @samp{]}, @samp{-} or @samp{^} in a +character list, put a @samp{\} in front of it. For example: + +@example +[d\]] +@end example + +@noindent +matches either @samp{d}, or @samp{]}. + +@cindex @code{egrep} +This treatment of @samp{\} in character lists +is compatible with other @code{awk} +implementations, and is also mandated by POSIX. +The regular expressions in @code{awk} are a superset +of the POSIX specification for Extended Regular Expressions (EREs). +POSIX EREs are based on the regular expressions accepted by the +traditional @code{egrep} utility. + +@cindex character classes +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +@dfn{Character classes} are a new feature introduced in the POSIX standard. +A character class is a special notation for describing +lists of characters that have a specific attribute, but where the +actual characters themselves can vary from country to country and/or +from character set to character set. For example, the notion of what +is an alphabetic character differs in the USA and in France. + +A character class is only valid in a regexp @emph{inside} the +brackets of a character list. Character classes consist of @samp{[:}, +a keyword denoting the class, and @samp{:]}. Here are the character +classes defined by the POSIX standard. + +@table @code +@item [:alnum:] +Alphanumeric characters. + +@item [:alpha:] +Alphabetic characters. 
+ +@item [:blank:] +Space and tab characters. + +@item [:cntrl:] +Control characters. + +@item [:digit:] +Numeric characters. + +@item [:graph:] +Characters that are printable and are also visible. +(A space is printable, but not visible, while an @samp{a} is both.) + +@item [:lower:] +Lower-case alphabetic characters. + +@item [:print:] +Printable characters (characters that are not control characters.) + +@item [:punct:] +Punctuation characters (characters that are not letters, digits, +control characters, or space characters). + +@item [:space:] +Space characters (such as space, tab, and formfeed, to name a few). + +@item [:upper:] +Upper-case alphabetic characters. + +@item [:xdigit:] +Characters that are hexadecimal digits. +@end table + +For example, before the POSIX standard, to match alphanumeric +characters, you had to write @code{/[A-Za-z0-9]/}. If your +character set had other alphabetic characters in it, this would not +match them. With the POSIX character classes, you can write +@code{/[[:alnum:]]/}, and this will match @emph{all} the alphabetic +and numeric characters in your character set. + +@cindex collating elements +Two additional special sequences can appear in character lists. +These apply to non-ASCII character sets, which can have single symbols +(called @dfn{collating elements}) that are represented with more than one +character, as well as several characters that are equivalent for +@dfn{collating}, or sorting, purposes. (E.g., in French, a plain ``e'' +and a grave-accented ``@`e'' are equivalent.) + +@table @asis +@cindex collating symbols +@item Collating Symbols +A @dfn{collating symbol} is a multi-character collating element enclosed in +@samp{[.} and @samp{.]}. For example, if @samp{ch} is a collating element, +then @code{[[.ch.]]} is a regexp that matches this collating element, while +@code{[ch]} is a regexp that matches either @samp{c} or @samp{h}. 
+ +@cindex equivalence classes +@item Equivalence Classes +An @dfn{equivalence class} is a locale-specific name for a list of +characters that are equivalent. The name is enclosed in +@samp{[=} and @samp{=]}. +For example, the name @samp{e} might be used to represent all of +``e,'' ``@`e,'' and ``@'e.'' In this case, @code{[[=e=]]} is a regexp +that matches any of @samp{e}, @samp{@'e}, or @samp{@`e}. +@end table + +These features are very valuable in non-English speaking locales. + +@strong{Caution:} The library functions that @code{gawk} uses for regular +expression matching currently only recognize POSIX character classes; +they do not recognize collating symbols or equivalence classes. +@c maybe one day ... + +@cindex complemented character list +@cindex character list, complemented +@item [^ @dots{}] +This is a @dfn{complemented character list}. The first character after +the @samp{[} @emph{must} be a @samp{^}. It matches any characters +@emph{except} those in the square brackets. For example: + +@example +[^0-9] +@end example + +@noindent +matches any character that is not a digit. + +@item | +This is the @dfn{alternation operator}, and it is used to specify +alternatives. For example: + +@example +^P|[0-9] +@end example + +@noindent +matches any string that matches either @samp{^P} or @samp{[0-9]}. This +means it matches any string that starts with @samp{P} or contains a digit. + +The alternation applies to the largest possible regexps on either side. +In other words, @samp{|} has the lowest precedence of all the regular +expression operators. + +@item (@dots{}) +Parentheses are used for grouping in regular expressions as in +arithmetic. They can be used to concatenate regular expressions +containing the alternation operator, @samp{|}. For example, +@samp{@@(samp|code)\@{[^@}]+\@}} matches both @samp{@@code@{foo@}} and +@samp{@@samp@{bar@}}. (These are Texinfo formatting control sequences.) 
+ +@item * +This symbol means that the preceding regular expression is to be +repeated as many times as necessary to find a match. For example: + +@example +ph* +@end example + +@noindent +applies the @samp{*} symbol to the preceding @samp{h} and looks for matches +of one @samp{p} followed by any number of @samp{h}s. This will also match +just @samp{p} if no @samp{h}s are present. + +The @samp{*} repeats the @emph{smallest} possible preceding expression. +(Use parentheses if you wish to repeat a larger expression.) It finds +as many repetitions as possible. For example: + +@example +awk '/\(c[ad][ad]*r x\)/ @{ print @}' sample +@end example + +@noindent +prints every record in @file{sample} containing a string of the form +@samp{(car x)}, @samp{(cdr x)}, @samp{(cadr x)}, and so on. +Notice the escaping of the parentheses by preceding them +with backslashes. + +@item + +This symbol is similar to @samp{*}, but the preceding expression must be +matched at least once. This means that: + +@example +wh+y +@end example + +@noindent +would match @samp{why} and @samp{whhy} but not @samp{wy}, whereas +@samp{wh*y} would match all three of these strings. This is a simpler +way of writing the last @samp{*} example: + +@example +awk '/\(c[ad]+r x\)/ @{ print @}' sample +@end example + +@item ? +This symbol is similar to @samp{*}, but the preceding expression can be +matched either once or not at all. For example: + +@example +fe?d +@end example + +@noindent +will match @samp{fed} and @samp{fd}, but nothing else. + +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +@cindex interval expressions +@item @{@var{n}@} +@itemx @{@var{n},@} +@itemx @{@var{n},@var{m}@} +One or two numbers inside braces denote an @dfn{interval expression}. +If there is one number in the braces, the preceding regexp is repeated +@var{n} times. +If there are two numbers separated by a comma, the preceding regexp is +repeated @var{n} to @var{m} times. 
+If there is one number followed by a comma, then the preceding regexp +is repeated at least @var{n} times. + +@table @code +@item wh@{3@}y +matches @samp{whhhy} but not @samp{why} or @samp{whhhhy}. + +@item wh@{3,5@}y +matches @samp{whhhy} or @samp{whhhhy} or @samp{whhhhhy}, only. + +@item wh@{2,@}y +matches @samp{whhy} or @samp{whhhy}, and so on. +@end table + +Interval expressions were not traditionally available in @code{awk}. +They were added as part of the POSIX standard, to make @code{awk} +and @code{egrep} consistent with each other. + +However, since old programs may use @samp{@{} and @samp{@}} in regexp +constants, by default @code{gawk} does @emph{not} match interval expressions +in regexps. If either @samp{--posix} or @samp{--re-interval} is specified +(@pxref{Options, , Command Line Options}), then interval expressions +are allowed in regexps. +@end table + +@cindex precedence, regexp operators +@cindex regexp operators, precedence of +In regular expressions, the @samp{*}, @samp{+}, and @samp{?} operators, +as well as the braces @samp{@{} and @samp{@}}, +have +the highest precedence, followed by concatenation, and finally by @samp{|}. +As in arithmetic, parentheses can change how operators are grouped. + +If @code{gawk} is in compatibility mode +(@pxref{Options, ,Command Line Options}), +character classes and interval expressions are not available in +regular expressions. + +The next +@ifinfo +node +@end ifinfo +@iftex +section +@end iftex +discusses the GNU-specific regexp operators, and provides +more detail concerning how command line options affect the way @code{gawk} +interprets the characters in regular expressions. + +@node GNU Regexp Operators, Case-sensitivity, Regexp Operators, Regexp +@section Additional Regexp Operators Only in @code{gawk} + +@c This section adapted from the regex-0.12 manual + +@cindex regexp operators, GNU specific +GNU software that deals with regular expressions provides a number of +additional regexp operators. 
These operators are described in this +section, and are specific to @code{gawk}; they are not available in other +@code{awk} implementations. + +@cindex word, regexp definition of +Most of the additional operators are for dealing with word matching. +For our purposes, a @dfn{word} is a sequence of one or more letters, digits, +or underscores (@samp{_}). + +@table @code +@cindex @code{\w} regexp operator +@item \w +This operator matches any word-constituent character, i.e.@: any +letter, digit, or underscore. Think of it as a short-hand for +@c @w{@code{[A-Za-z0-9_]}} or +@w{@code{[[:alnum:]_]}}. + +@cindex @code{\W} regexp operator +@item \W +This operator matches any character that is not word-constituent. +Think of it as a short-hand for +@c @w{@code{[^A-Za-z0-9_]}} or +@w{@code{[^[:alnum:]_]}}. + +@cindex @code{\<} regexp operator +@item \< +This operator matches the empty string at the beginning of a word. +For example, @code{/\<away/} matches @samp{away}, but not +@samp{stowaway}. + +@cindex @code{\>} regexp operator +@item \> +This operator matches the empty string at the end of a word. +For example, @code{/stow\>/} matches @samp{stow}, but not @samp{stowaway}. + +@cindex @code{\y} regexp operator +@cindex word boundaries, matching +@item \y +This operator matches the empty string at either the beginning or the +end of a word (the word boundar@strong{y}). For example, @samp{\yballs?\y} +matches either @samp{ball} or @samp{balls} as a separate word. + +@cindex @code{\B} regexp operator +@item \B +This operator matches the empty string within a word. In other words, +@samp{\B} matches the empty string that occurs between two +word-constituent characters. For example, +@code{/\Brat\B/} matches @samp{crate}, but it does not match @samp{dirty rat}. +@samp{\B} is essentially the opposite of @samp{\y}. +@end table + +There are two other operators that work on buffers. In Emacs, a +@dfn{buffer} is, naturally, an Emacs buffer. 
For other programs, the +regexp library routines that @code{gawk} uses consider the entire +string to be matched as the buffer. + +For @code{awk}, since @samp{^} and @samp{$} always work in terms +of the beginning and end of strings, these operators don't add any +new capabilities. They are provided for compatibility with other GNU +software. + +@cindex buffer matching operators +@table @code +@cindex @code{\`} regexp operator +@item \` +This operator matches the empty string at the +beginning of the buffer. + +@cindex @code{\'} regexp operator +@item \' +This operator matches the empty string at the +end of the buffer. +@end table + +In other GNU software, the word boundary operator is @samp{\b}. However, +that conflicts with the @code{awk} language's definition of @samp{\b} +as backspace, so @code{gawk} uses a different letter. + +An alternative method would have been to require two backslashes in the +GNU operators, but this was deemed to be too confusing, and the current +method of using @samp{\y} for the GNU @samp{\b} appears to be the +lesser of two evils. + +@c NOTE!!! Keep this in sync with the same table in the summary appendix! +@cindex regexp, effect of command line options +The various command line options +(@pxref{Options, ,Command Line Options}) +control how @code{gawk} interprets characters in regexps. + +@table @asis +@item No options +In the default case, @code{gawk} provides all the facilities of +POSIX regexps and the GNU regexp operators described +@iftex +above. +@end iftex +@ifinfo +in @ref{Regexp Operators, ,Regular Expression Operators}. +@end ifinfo +However, interval expressions are not supported. + +@item @code{--posix} +Only POSIX regexps are supported; the GNU operators are not special +(e.g., @samp{\w} matches a literal @samp{w}). Interval expressions +are allowed. + +@item @code{--traditional} +Traditional Unix @code{awk} regexps are matched.
The GNU operators +are not special, interval expressions are not available, and neither +are the POSIX character classes (@code{[[:alnum:]]} and so on). +Characters described by octal and hexadecimal escape sequences are +treated literally, even if they represent regexp metacharacters. + +@item @code{--re-interval} +Allow interval expressions in regexps, even if @samp{--traditional} +has been provided. +@end table + +@node Case-sensitivity, Leftmost Longest, GNU Regexp Operators, Regexp +@section Case-sensitivity in Matching + +@cindex case sensitivity +@cindex ignoring case +Case is normally significant in regular expressions, both when matching +ordinary characters (i.e.@: not metacharacters), and inside character +sets. Thus a @samp{w} in a regular expression matches only a lower-case +@samp{w} and not an upper-case @samp{W}. + +The simplest way to do a case-independent match is to use a character +list: @samp{[Ww]}. However, this can be cumbersome if you need to use it +often; and it can make the regular expressions harder to +read. There are two alternatives that you might prefer. + +One way to do a case-insensitive match at a particular point in the +program is to convert the data to a single case, using the +@code{tolower} or @code{toupper} built-in string functions (which we +haven't discussed yet; +@pxref{String Functions, ,Built-in Functions for String Manipulation}). +For example: + +@example +tolower($1) ~ /foo/ @{ @dots{} @} +@end example + +@noindent +converts the first field to lower-case before matching against it. +This will work in any POSIX-compliant implementation of @code{awk}. + +@cindex differences between @code{gawk} and @code{awk} +@cindex @code{~} operator +@cindex @code{!~} operator +@vindex IGNORECASE +Another method, specific to @code{gawk}, is to set the variable +@code{IGNORECASE} to a non-zero value (@pxref{Built-in Variables}). +When @code{IGNORECASE} is not zero, @emph{all} regexp and string +operations ignore case. 
Changing the value of +@code{IGNORECASE} dynamically controls the case sensitivity of your +program as it runs. Case is significant by default because +@code{IGNORECASE} (like most variables) is initialized to zero. + +@example +@group +x = "aB" +if (x ~ /ab/) @dots{} # this test will fail +@end group + +@group +IGNORECASE = 1 +if (x ~ /ab/) @dots{} # now it will succeed +@end group +@end example + +In general, you cannot use @code{IGNORECASE} to make certain rules +case-insensitive and other rules case-sensitive, because there is no way +to set @code{IGNORECASE} just for the pattern of a particular rule. +@ignore +This isn't quite true. Consider: + + IGNORECASE=1 && /foObAr/ { .... } + IGNORECASE=0 || /foobar/ { .... } + +But that's pretty bad style and I don't want to get into it at this +late date. +@end ignore +To do this, you must use character lists or @code{tolower}. However, one +thing you can do only with @code{IGNORECASE} is turn case-sensitivity on +or off dynamically for all the rules at once. + +@code{IGNORECASE} can be set on the command line, or in a @code{BEGIN} rule +(@pxref{Other Arguments, ,Other Command Line Arguments}; also +@pxref{Using BEGIN/END, ,Startup and Cleanup Actions}). +Setting @code{IGNORECASE} from the command line is a way to make +a program case-insensitive without having to edit it. + +Prior to version 3.0 of @code{gawk}, the value of @code{IGNORECASE} +only affected regexp operations. It did not affect string comparison +with @samp{==}, @samp{!=}, and so on. +Beginning with version 3.0, both regexp and string comparison +operations are affected by @code{IGNORECASE}. + +@cindex ISO 8859-1 +@cindex ISO Latin-1 +Beginning with version 3.0 of @code{gawk}, the equivalences between upper-case +and lower-case characters are based on the ISO-8859-1 (ISO Latin-1) +character set. This character set is a superset of the traditional 128 +ASCII characters, that also provides a number of characters suitable +for use with European languages. 
+@ignore +A pure ASCII character set can be used instead if @code{gawk} is compiled +with @samp{-DUSE_PURE_ASCII}. +@end ignore + +The value of @code{IGNORECASE} has no effect if @code{gawk} is in +compatibility mode (@pxref{Options, ,Command Line Options}). +Case is always significant in compatibility mode. + +@node Leftmost Longest, Computed Regexps, Case-sensitivity, Regexp +@section How Much Text Matches? + +@cindex leftmost longest match +@cindex matching, leftmost longest +Consider the following example: + +@example +echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}' +@end example + +This example uses the @code{sub} function (which we haven't discussed yet, +@pxref{String Functions, ,Built-in Functions for String Manipulation}) +to make a change to the input record. Here, the regexp @code{/a+/} +indicates ``one or more @samp{a} characters,'' and the replacement +text is @samp{<A>}. + +The input contains four @samp{a} characters. What will the output be? +In other words, how many is ``one or more''---will @code{awk} match two, +three, or all four @samp{a} characters? + +The answer is, @code{awk} (and POSIX) regular expressions always match +the leftmost, @emph{longest} sequence of input characters that can +match. Thus, in this example, all four @samp{a} characters are +replaced with @samp{<A>}. + +@example +$ echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}' +@print{} <A>bcd +@end example + +For simple match/no-match tests, this is not so important. But when doing +regexp-based field and record splitting, and +text matching and substitutions with the @code{match}, @code{sub}, @code{gsub}, +and @code{gensub} functions, it is very important. +@ifinfo +@xref{String Functions, ,Built-in Functions for String Manipulation}, +for more information on these functions. 
+@end ifinfo +Understanding this principle is also important for regexp-based record +and field splitting (@pxref{Records, ,How Input is Split into Records}, +and also @pxref{Field Separators, ,Specifying How Fields are Separated}). + +@node Computed Regexps, , Leftmost Longest, Regexp +@section Using Dynamic Regexps + +@cindex computed regular expressions +@cindex regular expressions, computed +@cindex dynamic regular expressions +@cindex regexp, dynamic +@cindex @code{~} operator +@cindex @code{!~} operator +The right-hand side of a @samp{~} or @samp{!~} operator need not be a +regexp constant (i.e.@: a string of characters between slashes). It may +be any expression. The expression is evaluated, and converted if +necessary to a string; the contents of the string are used as the +regexp. A regexp that is computed in this way is called a @dfn{dynamic +regexp}. For example: + +@example +BEGIN @{ identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*" @} +$0 ~ identifier_regexp @{ print @} +@end example + +@noindent +sets @code{identifier_regexp} to a regexp that describes @code{awk} +variable names, and tests if the input record matches this regexp. + +@strong{Caution:} When using the @samp{~} and @samp{!~} +operators, there is a difference between a regexp constant +enclosed in slashes, and a string constant enclosed in double quotes. +If you are going to use a string constant, you have to understand that +the string is in essence scanned @emph{twice}; the first time when +@code{awk} reads your program, and the second time when it goes to +match the string on the left-hand side of the operator with the pattern +on the right. This is true of any string-valued expression (such as +@code{identifier_regexp} above), not just string constants. + +@cindex regexp constants, difference between slashes and quotes +What difference does it make if the string is +scanned twice? The answer has to do with escape sequences, and particularly +with backslashes.
To get a backslash into a regular expression inside a +string, you have to type two backslashes. + +For example, @code{/\*/} is a regexp constant for a literal @samp{*}. +Only one backslash is needed. To do the same thing with a string, +you would have to type @code{"\\*"}. The first backslash escapes the +second one, so that the string actually contains the +two characters @samp{\} and @samp{*}. + +@cindex common mistakes +@cindex mistakes, common +@cindex errors, common +Given that you can use both regexp and string constants to describe +regular expressions, which should you use? The answer is ``regexp +constants,'' for several reasons. + +@enumerate 1 +@item +String constants are more complicated to write, and +more difficult to read. Using regexp constants makes your programs +less error-prone. Not understanding the difference between the two +kinds of constants is a common source of errors. + +@item +It is also more efficient to use regexp constants: @code{awk} can note +that you have supplied a regexp and store it internally in a form that +makes pattern matching more efficient. When using a string constant, +@code{awk} must first convert the string into this internal form, and +then perform the pattern matching. + +@item +Using regexp constants is better style; it shows clearly that you +intend a regexp match. +@end enumerate + +@node Reading Files, Printing, Regexp, Top +@chapter Reading Input Files + +@cindex reading files +@cindex input +@cindex standard input +@vindex FILENAME +In the typical @code{awk} program, all input is read either from the +standard input (by default the keyboard, but often a pipe from another +command) or from files whose names you specify on the @code{awk} command +line. If you specify input files, @code{awk} reads them in order, reading +all the data from one before going on to the next. The name of the current +input file can be found in the built-in variable @code{FILENAME} +(@pxref{Built-in Variables}). 
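As a small sketch of the file-at-a-time behavior just described, the following shell fragment builds two scratch files and has @code{awk} label every record with the value of @code{FILENAME}. The directory and the file names @file{a.txt} and @file{b.txt} are hypothetical, chosen only for this illustration; nothing beyond a POSIX shell, @code{mktemp}, and any POSIX @code{awk} is assumed.

```shell
# Hypothetical scratch files; the names a.txt and b.txt are illustration only.
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/a.txt"
printf 'three\n'    > "$dir/b.txt"

# Label each record with the name of the file it was read from.
# awk finishes a.txt completely before it starts on b.txt.
filename_out=$(cd "$dir" && awk '{ print FILENAME ": " $0 }' a.txt b.txt)
printf '%s\n' "$filename_out"

rm -rf "$dir"
```

Both records of @file{a.txt} appear before the single record of @file{b.txt}, in command-line order.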
+ +The input is read in units called @dfn{records}, and processed by the +rules of your program one record at a time. +By default, each record is one line. Each +record is automatically split into chunks called @dfn{fields}. +This makes it more convenient for programs to work on the parts of a record. + +On rare occasions you will need to use the @code{getline} command. +The @code{getline} command is valuable, both because it +can do explicit input from any number of files, and because the files +used with it do not have to be named on the @code{awk} command line +(@pxref{Getline, ,Explicit Input with @code{getline}}). + +@menu +* Records:: Controlling how data is split into records. +* Fields:: An introduction to fields. +* Non-Constant Fields:: Non-constant Field Numbers. +* Changing Fields:: Changing the Contents of a Field. +* Field Separators:: The field separator and how to change it. +* Constant Size:: Reading constant width data. +* Multiple Line:: Reading multi-line records. +* Getline:: Reading files under explicit program control + using the @code{getline} function. +@end menu + +@node Records, Fields, Reading Files, Reading Files +@section How Input is Split into Records + +@cindex record separator, @code{RS} +@cindex changing the record separator +@cindex record, definition of +@vindex RS +The @code{awk} utility divides the input for your @code{awk} +program into records and fields. +Records are separated by a character called the @dfn{record separator}. +By default, the record separator is the newline character. +This is why records are, by default, single lines. +You can use a different character for the record separator by +assigning the character to the built-in variable @code{RS}. + +You can change the value of @code{RS} in the @code{awk} program, +like any other variable, with the +assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}). 
+The new record-separator character should be enclosed in quotation marks, +which indicate +a string constant. Often the right time to do this is at the beginning +of execution, before any input has been processed, so that the very +first record will be read with the proper separator. To do this, use +the special @code{BEGIN} pattern +(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}). For +example: + +@example +awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list +@end example + +@noindent +changes the value of @code{RS} to @code{"/"}, before reading any input. +This is a string whose first character is a slash; as a result, records +are separated by slashes. Then the input file is read, and the second +rule in the @code{awk} program (the action with no pattern) prints each +record. Since each @code{print} statement adds a newline at the end of +its output, the effect of this @code{awk} program is to copy the input +with each slash changed to a newline. Here are the results of running +the program on @file{BBS-list}: + +@example +@group +$ awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list +@print{} aardvark 555-5553 1200 +@print{} 300 B +@print{} alpo-net 555-3412 2400 +@print{} 1200 +@print{} 300 A +@print{} barfly 555-7685 1200 +@print{} 300 A +@print{} bites 555-1675 2400 +@print{} 1200 +@print{} 300 A +@print{} camelot 555-0542 300 C +@print{} core 555-2912 1200 +@print{} 300 C +@print{} fooey 555-1234 2400 +@print{} 1200 +@print{} 300 B +@print{} foot 555-6699 1200 +@print{} 300 B +@print{} macfoo 555-6480 1200 +@print{} 300 A +@print{} sdace 555-3430 2400 +@print{} 1200 +@print{} 300 A +@print{} sabafoo 555-2127 1200 +@print{} 300 C +@print{} +@end group +@end example + +@noindent +Note that the entry for the @samp{camelot} BBS is not split. 
+In the original data file +(@pxref{Sample Data Files, , Data Files for the Examples}), +the line looks like this: + +@example +camelot 555-0542 300 C +@end example + +@noindent +It only has one baud rate; there are no slashes in the record. + +Another way to change the record separator is on the command line, +using the variable-assignment feature +(@pxref{Other Arguments, ,Other Command Line Arguments}). + +@example +awk '@{ print $0 @}' RS="/" BBS-list +@end example + +@noindent +This sets @code{RS} to @samp{/} before processing @file{BBS-list}. + +Using an unusual character such as @samp{/} for the record separator +produces correct behavior in the vast majority of cases. However, +the following (extreme) pipeline prints a surprising @samp{1}. There +is one field, consisting of a newline. The value of the built-in +variable @code{NF} is the number of fields in the current record. + +@example +$ echo | awk 'BEGIN @{ RS = "a" @} ; @{ print NF @}' +@print{} 1 +@end example + +@cindex dark corner +@noindent +Reaching the end of an input file terminates the current input record, +even if the last character in the file is not the character in @code{RS} +(d.c.). + +@cindex empty string +The empty string, @code{""} (a string of no characters), has a special meaning +as the value of @code{RS}: it means that records are separated +by one or more blank lines, and nothing else. +@xref{Multiple Line, ,Multiple-Line Records}, for more details. + +If you change the value of @code{RS} in the middle of an @code{awk} run, +the new value is used to delimit subsequent records, but the record +currently being processed (and records already processed) are not +affected. + +@vindex RT +@cindex record terminator, @code{RT} +@cindex terminator, record +@cindex differences between @code{gawk} and @code{awk} +After the end of the record has been determined, @code{gawk} +sets the variable @code{RT} to the text in the input that matched +@code{RS}. 
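Two of the points made above about @code{RS} can be checked with short pipelines that should behave the same in any POSIX @code{awk}: the end of the input terminates the last record even when no separator follows it, and the empty string makes blank lines the record separator. The input text and the shell variable names are made up for this sketch.

```shell
# End of input closes the final record, although no "/" follows the "c".
eof_out=$(printf 'a/b/c' | awk 'BEGIN { RS = "/" } { print NR ": " $0 }')
printf '%s\n' "$eof_out"

# With RS = "", records are separated by blank lines, so "x" and "y"
# belong to the first record and "z" is the second record.
para_out=$(printf 'x\ny\n\nz\n' | awk 'BEGIN { RS = "" } { print NR ": " $1 }')
printf '%s\n' "$para_out"
```

The first pipeline reports three records; the second reports two, even though its input has four lines.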
+ +@cindex regular expressions as record separators +The value of @code{RS} is in fact not limited to a one-character +string. It can be any regular expression +(@pxref{Regexp, ,Regular Expressions}). +In general, each record +ends at the next string that matches the regular expression; the next +record starts at the end of the matching string. This general rule is +actually at work in the usual case, where @code{RS} contains just a +newline: a record ends at the beginning of the next matching string (the +next newline in the input) and the following record starts just after +the end of this string (at the first character of the following line). +The newline, since it matches @code{RS}, is not part of either record. + +When @code{RS} is a single character, @code{RT} will +contain the same single character. However, when @code{RS} is a +regular expression, then @code{RT} becomes more useful; it contains +the actual input text that matched the regular expression. + +The following example illustrates both of these features. +It sets @code{RS} equal to a regular expression that +matches either a newline, or a series of one or more upper-case letters +with optional leading and/or trailing white space +(@pxref{Regexp, , Regular Expressions}). + +@example +$ echo record 1 AAAA record 2 BBBB record 3 | +> gawk 'BEGIN @{ RS = "\n|( *[[:upper:]]+ *)" @} +> @{ print "Record =", $0, "and RT =", RT @}' +@print{} Record = record 1 and RT = AAAA +@print{} Record = record 2 and RT = BBBB +@print{} Record = record 3 and RT = +@print{} +@end example + +@noindent +The final line of output has an extra blank line. This is because the +value of @code{RT} is a newline, and then the @code{print} statement +supplies its own terminating newline. + +@xref{Simple Sed, ,A Simple Stream Editor}, for a more useful example +of @code{RS} as a regexp and @code{RT}. 
+ +@cindex differences between @code{gawk} and @code{awk} +The use of @code{RS} as a regular expression and the @code{RT} +variable are @code{gawk} extensions; they are not available in +compatibility mode +(@pxref{Options, ,Command Line Options}). +In compatibility mode, only the first character of the value of +@code{RS} is used to determine the end of the record. + +@cindex number of records, @code{NR}, @code{FNR} +@vindex NR +@vindex FNR +The @code{awk} utility keeps track of the number of records that have +been read so far from the current input file. This value is stored in a +built-in variable called @code{FNR}. It is reset to zero when a new +file is started. Another built-in variable, @code{NR}, is the total +number of input records read so far from all data files. It starts at zero +but is never automatically reset to zero. + +@node Fields, Non-Constant Fields, Records, Reading Files +@section Examining Fields + +@cindex examining fields +@cindex fields +@cindex accessing fields +When @code{awk} reads an input record, the record is +automatically separated or @dfn{parsed} by the interpreter into chunks +called @dfn{fields}. By default, fields are separated by whitespace, +like words in a line. +Whitespace in @code{awk} means any string of one or more spaces, +tabs or newlines;@footnote{In POSIX @code{awk}, newlines are not +considered whitespace for separating fields.} other characters such as +formfeed, and so on, that are +considered whitespace by other languages are @emph{not} considered +whitespace by @code{awk}. + +The purpose of fields is to make it more convenient for you to refer to +these pieces of the record. You don't have to use them---you can +operate on the whole record if you wish---but fields are what make +simple @code{awk} programs so powerful. 
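The default splitting on whitespace can be tried with a one-line sketch. The input below is invented for the purpose: it starts with spaces and mixes spaces and a tab between words. With the default field separator, runs of spaces and tabs count as a single separator and leading whitespace does not produce an empty first field.

```shell
# Leading blanks are ignored and each run of spaces/tabs separates two fields.
ws_out=$(printf '  alpha \t beta   gamma\n' | awk '{ print NF; print $2 }')
printf '%s\n' "$ws_out"
```

The record has three fields, and the second one is @samp{beta} with no surrounding whitespace.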
+ +@cindex @code{$} (field operator) +@cindex field operator @code{$} +To refer to a field in an @code{awk} program, you use a dollar-sign, +@samp{$}, followed by the number of the field you want. Thus, @code{$1} +refers to the first field, @code{$2} to the second, and so on. For +example, suppose the following is a line of input: + +@example +This seems like a pretty nice example. +@end example + +@noindent +Here the first field, or @code{$1}, is @samp{This}; the second field, or +@code{$2}, is @samp{seems}; and so on. Note that the last field, +@code{$7}, is @samp{example.}. Because there is no space between the +@samp{e} and the @samp{.}, the period is considered part of the seventh +field. + +@vindex NF +@cindex number of fields, @code{NF} +@code{NF} is a built-in variable whose value +is the number of fields in the current record. +@code{awk} updates the value of @code{NF} automatically, each time +a record is read. + +No matter how many fields there are, the last field in a record can be +represented by @code{$NF}. So, in the example above, @code{$NF} would +be the same as @code{$7}, which is @samp{example.}. Why this works is +explained below (@pxref{Non-Constant Fields, ,Non-constant Field Numbers}). +If you try to reference a field beyond the last one, such as @code{$8} +when the record has only seven fields, you get the empty string. +@c the empty string acts like 0 in some contexts, but I don't want to +@c get into that here.... + +@code{$0}, which looks like a reference to the ``zeroth'' field, is +a special case: it represents the whole input record. @code{$0} is +used when you are not interested in fields. 
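The statements above about @code{NF}, @code{$NF}, and out-of-range fields can be checked against the same sample sentence. This is only a sketch; the square brackets are added so that the empty value of @code{$8} is visible in the output.

```shell
line='This seems like a pretty nice example.'

# NF is the field count, $NF is the last field, and $8 (past the end) is empty.
nf_out=$(printf '%s\n' "$line" | awk '{ print NF; print $NF; print "[" $8 "]" }')
printf '%s\n' "$nf_out"
```

The three output lines are the field count, the last field including its period, and an empty pair of brackets.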
+ +Here are some more examples: + +@example +@group +$ awk '$1 ~ /foo/ @{ print $0 @}' BBS-list +@print{} fooey 555-1234 2400/1200/300 B +@print{} foot 555-6699 1200/300 B +@print{} macfoo 555-6480 1200/300 A +@print{} sabafoo 555-2127 1200/300 C +@end group +@end example + +@noindent +This example prints each record in the file @file{BBS-list} whose first +field contains the string @samp{foo}. The operator @samp{~} is called a +@dfn{matching operator} +(@pxref{Regexp Usage, , How to Use Regular Expressions}); +it tests whether a string (here, the field @code{$1}) matches a given regular +expression. + +By contrast, the following example +looks for @samp{foo} in @emph{the entire record} and prints the first +field and the last field for each input record containing a +match. + +@example +@group +$ awk '/foo/ @{ print $1, $NF @}' BBS-list +@print{} fooey B +@print{} foot B +@print{} macfoo A +@print{} sabafoo C +@end group +@end example + +@node Non-Constant Fields, Changing Fields, Fields, Reading Files +@section Non-constant Field Numbers + +The number of a field does not need to be a constant. Any expression in +the @code{awk} language can be used after a @samp{$} to refer to a +field. The value of the expression specifies the field number. If the +value is a string, rather than a number, it is converted to a number. +Consider this example: + +@example +awk '@{ print $NR @}' +@end example + +@noindent +Recall that @code{NR} is the number of records read so far: one in the +first record, two in the second, etc. So this example prints the first +field of the first record, the second field of the second record, and so +on. For the twentieth record, field number 20 is printed; most likely, +the record has fewer than 20 fields, so this prints a blank line. 
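To see @code{$NR} at work on something smaller than twenty records, here is a sketch with three made-up records of three fields each. Because the field number follows the record number, the program walks down the diagonal of the input.

```shell
# Record 1 prints field 1, record 2 prints field 2, record 3 prints field 3.
diag_out=$(printf 'a b c\nd e f\ng h i\n' | awk '{ print $NR }')
printf '%s\n' "$diag_out"
```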
+ +Here is another example of using expressions as field numbers: + +@example +awk '@{ print $(2*2) @}' BBS-list +@end example + +@code{awk} must evaluate the expression @samp{(2*2)} and use +its value as the number of the field to print. The @samp{*} sign +represents multiplication, so the expression @samp{2*2} evaluates to four. +The parentheses are used so that the multiplication is done before the +@samp{$} operation; they are necessary whenever there is a binary +operator in the field-number expression. This example, then, prints the +hours of operation (the fourth field) for every line of the file +@file{BBS-list}. (All of the @code{awk} operators are listed, in +order of decreasing precedence, in +@ref{Precedence, , Operator Precedence (How Operators Nest)}.) + +If the field number you compute is zero, you get the entire record. +Thus, @code{$(2-2)} has the same value as @code{$0}. Negative field +numbers are not allowed; trying to reference one will usually terminate +your running @code{awk} program. (The POSIX standard does not define +what happens when you reference a negative field number. @code{gawk} +will notice this and terminate your program. Other @code{awk} +implementations may behave differently.) + +As mentioned in @ref{Fields, ,Examining Fields}, +the number of fields in the current record is stored in the built-in +variable @code{NF} (also @pxref{Built-in Variables}). The expression +@code{$NF} is not a special feature: it is the direct consequence of +evaluating @code{NF} and using its value as a field number. + +@node Changing Fields, Field Separators, Non-Constant Fields, Reading Files +@section Changing the Contents of a Field + +@cindex field, changing contents of +@cindex changing contents of a field +@cindex assignment to fields +You can change the contents of a field as seen by @code{awk} within an +@code{awk} program; this changes what @code{awk} perceives as the +current input record. 
(The actual input is untouched; @code{awk} @emph{never} +modifies the input file.) + +Consider this example and its output: + +@example +@group +$ awk '@{ $3 = $2 - 10; print $2, $3 @}' inventory-shipped +@print{} 13 3 +@print{} 15 5 +@print{} 15 5 +@dots{} +@end group +@end example + +@noindent +The @samp{-} sign represents subtraction, so this program reassigns +field three, @code{$3}, to be the value of field two minus ten, +@samp{$2 - 10}. (@xref{Arithmetic Ops, ,Arithmetic Operators}.) +Then field two, and the new value for field three, are printed. + +In order for this to work, the text in field @code{$2} must make sense +as a number; the string of characters must be converted to a number in +order for the computer to do arithmetic on it. The number resulting +from the subtraction is converted back to a string of characters which +then becomes field three. +@xref{Conversion, ,Conversion of Strings and Numbers}. + +When you change the value of a field (as perceived by @code{awk}), the +text of the input record is recalculated to contain the new field where +the old one was. Therefore, @code{$0} changes to reflect the altered +field. Thus, this program +prints a copy of the input file, with 10 subtracted from the second +field of each line. + +@example +@group +$ awk '@{ $2 = $2 - 10; print $0 @}' inventory-shipped +@print{} Jan 3 25 15 115 +@print{} Feb 5 32 24 226 +@print{} Mar 5 24 34 228 +@dots{} +@end group +@end example + +You can also assign contents to fields that are out of range. For +example: + +@example +$ awk '@{ $6 = ($5 + $4 + $3 + $2) +> print $6 @}' inventory-shipped +@print{} 168 +@print{} 297 +@print{} 301 +@dots{} +@end example + +@noindent +We've just created @code{$6}, whose value is the sum of fields +@code{$2}, @code{$3}, @code{$4}, and @code{$5}. The @samp{+} sign +represents addition. For the file @file{inventory-shipped}, @code{$6} +represents the total number of parcels shipped for a particular month. 
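The same assignment can be tried without the data file. The sketch below feeds @code{awk} a single record that is assumed to look like the first line of @file{inventory-shipped} (a month name followed by four counts) and also prints @code{NF} after the assignment.

```shell
# Assumed sample record: a month name followed by four shipment counts.
sum_out=$(echo 'Jan 13 25 15 115' |
  awk '{ $6 = $2 + $3 + $4 + $5; print $6; print NF }')
printf '%s\n' "$sum_out"
```

The record arrives with five fields; after the assignment to @code{$6} there are six.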
+ +Creating a new field changes @code{awk}'s internal copy of the current +input record---the value of @code{$0}. Thus, if you do @samp{print $0} +after adding a field, the record printed includes the new field, with +the appropriate number of field separators between it and the previously +existing fields. + +This recomputation affects and is affected by +@code{NF} (the number of fields; @pxref{Fields, ,Examining Fields}), +and by a feature that has not been discussed yet, +the @dfn{output field separator}, @code{OFS}, +which is used to separate the fields (@pxref{Output Separators}). +For example, the value of @code{NF} is set to the number of the highest +field you create. + +Note, however, that merely @emph{referencing} an out-of-range field +does @emph{not} change the value of either @code{$0} or @code{NF}. +Referencing an out-of-range field only produces an empty string. For +example: + +@example +if ($(NF+1) != "") + print "can't happen" +else + print "everything is normal" +@end example + +@noindent +should print @samp{everything is normal}, because @code{NF+1} is certain +to be out of range. (@xref{If Statement, ,The @code{if}-@code{else} Statement}, +for more information about @code{awk}'s @code{if-else} statements. +@xref{Typing and Comparison, ,Variable Typing and Comparison Expressions}, +for more information about the @samp{!=} operator.) + +It is important to note that making an assignment to an existing field +will change the +value of @code{$0}, but will not change the value of @code{NF}, +even when you assign the empty string to a field. For example: + +@example +@group +$ echo a b c d | awk '@{ OFS = ":"; $2 = "" +> print $0; print NF @}' +@print{} a::c:d +@print{} 4 +@end group +@end example + +@noindent +The field is still there; it just has an empty value. You can tell +because there are two colons in a row. + +This example shows what happens if you create a new field. 
+ +@example +$ echo a b c d | awk '@{ OFS = ":"; $2 = ""; $6 = "new" +> print $0; print NF @}' +@print{} a::c:d::new +@print{} 6 +@end example + +@noindent +The intervening field, @code{$5}, is created with an empty value +(indicated by the second pair of adjacent colons), +and @code{NF} is updated with the value six. + +Finally, decrementing @code{NF} discards the fields +after the new value of @code{NF}, and @code{$0} is recomputed. +Here is an example: + +@example +$ echo a b c d e f | gawk '@{ print "NF =", NF; +> NF = 3; print $0 @}' +@print{} NF = 6 +@print{} a b c +@end example + +@node Field Separators, Constant Size, Changing Fields, Reading Files +@section Specifying How Fields are Separated + +This section is rather long; it describes one of the most fundamental +operations in @code{awk}. + +@menu +* Basic Field Splitting:: How fields are split with single characters + or simple strings. +* Regexp Field Splitting:: Using regexps as the field separator. +* Single Character Fields:: Making each character a separate field. +* Command Line Field Separator:: Setting @code{FS} from the command line. +* Field Splitting Summary:: Some final points and a summary table. +@end menu + +@node Basic Field Splitting, Regexp Field Splitting, Field Separators, Field Separators +@subsection The Basics of Field Separating +@vindex FS +@cindex fields, separating +@cindex field separator, @code{FS} + +The @dfn{field separator}, which is either a single character or a regular +expression, controls the way @code{awk} splits an input record into fields. +@code{awk} scans the input record for character sequences that +match the separator; the fields themselves are the text between the matches. + +In the examples below, we use the bullet symbol ``@bullet{}'' to represent +spaces in the output. 
+ +If the field separator is @samp{oo}, then the following line: + +@example +moo goo gai pan +@end example + +@noindent +would be split into three fields: @samp{m}, @samp{@bullet{}g} and +@samp{@bullet{}gai@bullet{}pan}. +Note the leading spaces in the values of the second and third fields. + +@cindex common mistakes +@cindex mistakes, common +@cindex errors, common +The field separator is represented by the built-in variable @code{FS}. +Shell programmers take note! @code{awk} does @emph{not} use the name @code{IFS} +which is used by the POSIX compatible shells (such as the Bourne shell, +@code{sh}, or the GNU Bourne-Again Shell, Bash). + +You can change the value of @code{FS} in the @code{awk} program with the +assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}). +Often the right time to do this is at the beginning of execution, +before any input has been processed, so that the very first record +will be read with the proper separator. To do this, use the special +@code{BEGIN} pattern +(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}). +For example, here we set the value of @code{FS} to the string +@code{","}: + +@example +awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}' +@end example + +@noindent +Given the input line, + +@example +John Q. Smith, 29 Oak St., Walamazoo, MI 42139 +@end example + +@noindent +this @code{awk} program extracts and prints the string +@samp{@bullet{}29@bullet{}Oak@bullet{}St.}. + +@cindex field separator, choice of +@cindex regular expressions as field separators +Sometimes your input data will contain separator characters that don't +separate fields the way you thought they would. For instance, the +person's name in the example we just used might have a title or +suffix attached, such as @samp{John Q. Smith, LXIX}. From input +containing such a name: + +@example +John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139 +@end example + +@noindent +@c careful of an overfull hbox here! 
+the above program would extract @samp{@bullet{}LXIX}, instead of +@samp{@bullet{}29@bullet{}Oak@bullet{}St.}. +If you were expecting the program to print the +address, you would be surprised. The moral is: choose your data layout and +separator characters carefully to prevent such problems. + +@iftex +As you know, normally, +@end iftex +@ifinfo +Normally, +@end ifinfo +fields are separated by whitespace sequences +(spaces, tabs and newlines), not by single spaces: two spaces in a row do not +delimit an empty field. The default value of the field separator @code{FS} +is a string containing a single space, @w{@code{" "}}. If this value were +interpreted in the usual way, each space character would separate +fields, so two spaces in a row would make an empty field between them. +The reason this does not happen is that a single space as the value of +@code{FS} is a special case: it is taken to specify the default manner +of delimiting fields. + +If @code{FS} is any other single character, such as @code{","}, then +each occurrence of that character separates two fields. Two consecutive +occurrences delimit an empty field. If the character occurs at the +beginning or the end of the line, that too delimits an empty field. The +space character is the only single character which does not follow these +rules. + +@node Regexp Field Splitting, Single Character Fields, Basic Field Splitting, Field Separators +@subsection Using Regular Expressions to Separate Fields + +The previous +@iftex +subsection +@end iftex +@ifinfo +node +@end ifinfo +discussed the use of single characters or simple strings as the +value of @code{FS}. +More generally, the value of @code{FS} may be a string containing any +regular expression. In this case, each match in the record for the regular +expression separates fields. 
For example, the assignment: + +@example +FS = ", \t" +@end example + +@noindent +makes every area of an input line that consists of a comma followed by a +space and a tab, into a field separator. (@samp{\t} +is an @dfn{escape sequence} that stands for a tab; +@pxref{Escape Sequences}, +for the complete list of similar escape sequences.) + +For a less trivial example of a regular expression, suppose you want +single spaces to separate fields the way single commas were used above. +You can set @code{FS} to @w{@code{"[@ ]"}} (left bracket, space, right +bracket). This regular expression matches a single space and nothing else +(@pxref{Regexp, ,Regular Expressions}). + +There is an important difference between the two cases of @samp{FS = @w{" "}} +(a single space) and @samp{FS = @w{"[ \t\n]+"}} (left bracket, space, +backslash, ``t'', backslash, ``n'', right bracket, which is a regular +expression matching one or more spaces, tabs, or newlines). For both +values of @code{FS}, fields are separated by runs of spaces, tabs +and/or newlines. However, when the value of @code{FS} is @w{@code{" +"}}, @code{awk} will first strip leading and trailing whitespace from +the record, and then decide where the fields are. + +For example, the following pipeline prints @samp{b}: + +@example +$ echo ' a b c d ' | awk '@{ print $2 @}' +@print{} b +@end example + +@noindent +However, this pipeline prints @samp{a} (note the extra spaces around +each letter): + +@example +$ echo ' a b c d ' | awk 'BEGIN @{ FS = "[ \t]+" @} +> @{ print $2 @}' +@print{} a +@end example + +@noindent +@cindex null string +@cindex empty string +In this case, the first field is @dfn{null}, or empty. + +The stripping of leading and trailing whitespace also comes into +play whenever @code{$0} is recomputed. 
For instance, study this pipeline: + +@example +$ echo ' a b c d' | awk '@{ print; $2 = $2; print @}' +@print{}  a b c d +@print{} a b c d +@end example + +@noindent +The first @code{print} statement prints the record as it was read, +with leading whitespace intact. The assignment to @code{$2} rebuilds +@code{$0} by concatenating @code{$1} through @code{$NF} together, +separated by the value of @code{OFS}. Since the leading whitespace +was ignored when finding @code{$1}, it is not part of the new @code{$0}. +Finally, the last @code{print} statement prints the new @code{$0}. + +@node Single Character Fields, Command Line Field Separator, Regexp Field Splitting, Field Separators +@subsection Making Each Character a Separate Field + +@cindex differences between @code{gawk} and @code{awk} +@cindex single character fields +There are times when you may want to examine each character +of a record separately. In @code{gawk}, this is easy to do: you +simply assign the null string (@code{""}) to @code{FS}. In this case, +each individual character in the record becomes a separate field. +Here is an example: + +@example +@group +$ echo a b | gawk 'BEGIN @{ FS = "" @} +> @{ +> for (i = 1; i <= NF; i = i + 1) +> print "Field", i, "is", $i +> @}' +@print{} Field 1 is a +@print{} Field 2 is +@print{} Field 3 is b +@end group +@end example + +@cindex dark corner +Traditionally, the behavior for @code{FS} equal to @code{""} was not defined. +In this case, Unix @code{awk} would simply treat the entire record +as having only one field (d.c.). In compatibility mode +(@pxref{Options, ,Command Line Options}), +if @code{FS} is the null string, then @code{gawk} will also +behave this way. 
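As a small illustration of per-character fields, the following sketch counts how often one character occurs in its input. It is not taken from the original text: the function name @code{count_char} and the variable names are invented here, and it assumes that the @code{awk} on your system is @code{gawk}, or another implementation that splits each character into its own field when @code{FS} is the null string.

```shell
# count_char: hypothetical helper name, used only for this sketch.
# Usage: count_char CHARACTER < input
# Assumes an awk that makes every character a field when FS is "".
count_char() {
    awk -v ch="$1" '
    BEGIN { FS = "" }
    { for (i = 1; i <= NF; i++) if ($i == ch) n++ }
    END { print n + 0 }'
}

printf 'banana\n' | count_char a
```

With an implementation that instead treats the record as a single field, the count would be zero for any multi-character line, which is one quick way to tell the two behaviors apart.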
+ +@node Command Line Field Separator, Field Splitting Summary, Single Character Fields, Field Separators +@subsection Setting @code{FS} from the Command Line +@cindex @code{-F} option +@cindex field separator, on command line +@cindex command line, setting @code{FS} on + +@code{FS} can be set on the command line. You use the @samp{-F} option to +do so. For example: + +@example +awk -F, '@var{program}' @var{input-files} +@end example + +@noindent +sets @code{FS} to be the @samp{,} character. Notice that the option uses +a capital @samp{F}. Contrast this with @samp{-f}, which specifies a file +containing an @code{awk} program. Case is significant in command line options: +the @samp{-F} and @samp{-f} options have nothing to do with each other. +You can use both options at the same time to set the @code{FS} variable +@emph{and} get an @code{awk} program from a file. + +The value used for the argument to @samp{-F} is processed in exactly the +same way as assignments to the built-in variable @code{FS}. This means that +if the field separator contains special characters, they must be escaped +appropriately. For example, to use a @samp{\} as the field separator, you +would have to type: + +@example +# same as FS = "\\" +awk -F\\\\ '@dots{}' files @dots{} +@end example + +@noindent +Since @samp{\} is used for quoting in the shell, @code{awk} will see +@samp{-F\\}. Then @code{awk} processes the @samp{\\} for escape +characters (@pxref{Escape Sequences}), finally yielding +a single @samp{\} to be used for the field separator. + +@cindex historical features +As a special case, in compatibility mode +(@pxref{Options, ,Command Line Options}), if the +argument to @samp{-F} is @samp{t}, then @code{FS} is set to the tab +character. This is because if you type @samp{-F\t} at the shell, +without any quotes, the @samp{\} gets deleted, so @code{awk} figures that you +really want your fields to be separated with tabs, and not @samp{t}s. 
+Use @samp{-v FS="t"} on the command line if you really do want to separate +your fields with @samp{t}s +(@pxref{Options, ,Command Line Options}). + +For example, let's use an @code{awk} program file called @file{baud.awk} +that contains the pattern @code{/300/}, and the action @samp{print $1}. +Here is the program: + +@example +/300/ @{ print $1 @} +@end example + +Let's also set @code{FS} to be the @samp{-} character, and run the +program on the file @file{BBS-list}. The following command prints a +list of the names of the bulletin boards that operate at 300 baud and +the first three digits of their phone numbers: + +@c tweaked to make the tex output look better in @smallbook +@example +@group +$ awk -F- -f baud.awk BBS-list +@print{} aardvark 555 +@print{} alpo +@print{} barfly 555 +@dots{} +@end group +@ignore +@print{} bites 555 +@print{} camelot 555 +@print{} core 555 +@print{} fooey 555 +@print{} foot 555 +@print{} macfoo 555 +@print{} sdace 555 +@print{} sabafoo 555 +@end ignore +@end example + +@noindent +Note the second line of output. In the original file +(@pxref{Sample Data Files, ,Data Files for the Examples}), +the second line looked like this: + +@example +alpo-net 555-3412 2400/1200/300 A +@end example + +The @samp{-} as part of the system's name was used as the field +separator, instead of the @samp{-} in the phone number that was +originally intended. This demonstrates why you have to be careful in +choosing your field and record separators. + +On many Unix systems, each user has a separate entry in the system password +file, one line per user. The information in these lines is separated +by colons. The first field is the user's logon name, and the second is +the user's encrypted password. 
A password file entry might look like this: + +@example +arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh +@end example + +The following program searches the system password file, and prints +the entries for users who have no password: + +@example +awk -F: '$2 == ""' /etc/passwd +@end example + +@node Field Splitting Summary, , Command Line Field Separator, Field Separators +@subsection Field Splitting Summary + +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +According to the POSIX standard, @code{awk} is supposed to behave +as if each record is split into fields at the time that it is read. +In particular, this means that you can change the value of @code{FS} +after a record is read, and the value of the fields (i.e.@: how they were split) +should reflect the old value of @code{FS}, not the new one. + +@cindex dark corner +@cindex @code{sed} utility +@cindex stream editor +However, many implementations of @code{awk} do not work this way. Instead, +they defer splitting the fields until a field is actually +referenced. The fields will be split +using the @emph{current} value of @code{FS}! (d.c.) +This behavior can be difficult +to diagnose. The following example illustrates the difference +between the two methods. +(The @code{sed}@footnote{The @code{sed} utility is a ``stream editor.'' +Its behavior is also defined by the POSIX standard.} +command prints just the first line of @file{/etc/passwd}.) + +@example +sed 1q /etc/passwd | awk '@{ FS = ":" ; print $1 @}' +@end example + +@noindent +will usually print + +@example +root +@end example + +@noindent +on an incorrect implementation of @code{awk}, while @code{gawk} +will print something like + +@example +root:nSijPlPhZZwgE:0:0:Root:/: +@end example + +The following table summarizes how fields are split, based on the +value of @code{FS}. (@samp{==} means ``is equal to.'') + +@c @cartouche +@table @code +@item FS == " " +Fields are separated by runs of whitespace. 
Leading and trailing +whitespace are ignored. This is the default. + +@item FS == @var{any other single character} +Fields are separated by each occurrence of the character. Multiple +successive occurrences delimit empty fields, as do leading and +trailing occurrences. +The character can even be a regexp metacharacter; it does not need +to be escaped. + +@item FS == @var{regexp} +Fields are separated by occurrences of characters that match @var{regexp}. +Leading and trailing matches of @var{regexp} delimit empty fields. + +@item FS == "" +Each individual character in the record becomes a separate field. +@end table +@c @end cartouche + +@node Constant Size, Multiple Line, Field Separators, Reading Files +@section Reading Fixed-width Data + +(This section discusses an advanced, experimental feature. If you are +a novice @code{awk} user, you may wish to skip it on the first reading.) + +@code{gawk} version 2.13 introduced a new facility for dealing with +fixed-width fields with no distinctive field separator. Data of this +nature arises, for example, in the input for old FORTRAN programs where +numbers are run together; or in the output of programs that did not +anticipate the use of their output as input for other programs. + +An example of the latter is a table where all the columns are lined up by +the use of a variable number of spaces and @emph{empty fields are just +spaces}. Clearly, @code{awk}'s normal field splitting based on @code{FS} +will not work well in this case. Although a portable @code{awk} program +can use a series of @code{substr} calls on @code{$0} +(@pxref{String Functions, ,Built-in Functions for String Manipulation}), +this is awkward and inefficient for a large number of fields. + +The splitting of an input record into fixed-width fields is specified by +assigning a string containing space-separated numbers to the built-in +variable @code{FIELDWIDTHS}. Each number specifies the width of the field +@emph{including} columns between fields. 
If you want to ignore the columns +between fields, you can specify the width as a separate field that is +subsequently ignored. + +The following data is the output of the Unix @code{w} utility. It is useful +to illustrate the use of @code{FIELDWIDTHS}. + +@example +@group + 10:06pm up 21 days, 14:04, 23 users +User tty login@ idle JCPU PCPU what +hzuo ttyV0 8:58pm 9 5 vi p24.tex +hzang ttyV3 6:37pm 50 -csh +eklye ttyV5 9:53pm 7 1 em thes.tex +dportein ttyV6 8:17pm 1:47 -csh +gierd ttyD3 10:00pm 1 elm +dave ttyD4 9:47pm 4 4 w +brent ttyp0 26Jun91 4:46 26:46 4:41 bash +dave ttyq4 26Jun9115days 46 46 wnewmail +@end group +@end example + +The following program takes the above input, converts the idle time to +number of seconds and prints out the first two fields and the calculated +idle time. (This program uses a number of @code{awk} features that +haven't been introduced yet.) + +@example +@group +BEGIN @{ FIELDWIDTHS = "9 6 10 6 7 7 35" @} +NR > 2 @{ + idle = $4 + sub(/^ */, "", idle) # strip leading spaces + if (idle == "") + idle = 0 + if (idle ~ /:/) @{ + split(idle, t, ":") + idle = t[1] * 60 + t[2] + @} + if (idle ~ /days/) + idle *= 24 * 60 * 60 + + print $1, $2, idle +@} +@end group +@end example + +Here is the result of running the program on the data: + +@example +hzuo ttyV0 0 +hzang ttyV3 50 +eklye ttyV5 0 +dportein ttyV6 107 +gierd ttyD3 1 +dave ttyD4 0 +brent ttyp0 286 +dave ttyq4 1296000 +@end example + +Another (possibly more practical) example of fixed-width input data +would be the input from a deck of balloting cards. In some parts of +the United States, voters mark their choices by punching holes in computer +cards. These cards are then processed to count the votes for any particular +candidate or on any particular issue. Since a voter may choose not to +vote on some issue, any column on the card may be empty. An @code{awk} +program for processing such data could use the @code{FIELDWIDTHS} feature +to simplify reading the data. 
(Of course, getting @code{gawk} to run on +a system with card readers is another story!) + +@ignore +Exercise: Write a ballot card reading program +@end ignore + +Assigning a value to @code{FS} causes @code{gawk} to return to using +@code{FS} for field splitting. Use @samp{FS = FS} to make this happen, +without having to know the current value of @code{FS}. + +This feature is still experimental, and may evolve over time. +Note that in particular, @code{gawk} does not attempt to verify +the sanity of the values used in the value of @code{FIELDWIDTHS}. + +@node Multiple Line, Getline, Constant Size, Reading Files +@section Multiple-Line Records + +@cindex multiple line records +@cindex input, multiple line records +@cindex reading files, multiple line records +@cindex records, multiple line +In some data bases, a single line cannot conveniently hold all the +information in one entry. In such cases, you can use multi-line +records. + +The first step in doing this is to choose your data format: when records +are not defined as single lines, how do you want to define them? +What should separate records? + +One technique is to use an unusual character or string to separate +records. For example, you could use the formfeed character (written +@samp{\f} in @code{awk}, as in C) to separate them, making each record +a page of the file. To do this, just set the variable @code{RS} to +@code{"\f"} (a string containing the formfeed character). Any +other character could equally well be used, as long as it won't be part +of the data in a record. + +Another technique is to have blank lines separate records. By a special +dispensation, an empty string as the value of @code{RS} indicates that +records are separated by one or more blank lines. If you set @code{RS} +to the empty string, a record always ends at the first blank line +encountered. 
And the next record doesn't start until the first non-blank +line that follows---no matter how many blank lines appear in a row, they +are considered one record-separator. + +@cindex leftmost longest match +@cindex matching, leftmost longest +You can achieve the same effect as @samp{RS = ""} by assigning the +string @code{"\n\n+"} to @code{RS}. This regexp matches the newline +at the end of the record, and one or more blank lines after the record. +In addition, a regular expression always matches the longest possible +sequence when there is a choice +(@pxref{Leftmost Longest, ,How Much Text Matches?}). +So the next record doesn't start until +the first non-blank line that follows---no matter how many blank lines +appear in a row, they are considered one record-separator. + +@cindex dark corner +There is an important difference between @samp{RS = ""} and +@samp{RS = "\n\n+"}. In the first case, leading newlines in the input +data file are ignored, and if a file ends without extra blank lines +after the last record, the final newline is removed from the record. +In the second case, this special processing is not done (d.c.). + +Now that the input is separated into records, the second step is to +separate the fields in the record. One way to do this is to divide each +of the lines into fields in the normal manner. This happens by default +as the result of a special feature: when @code{RS} is set to the empty +string, the newline character @emph{always} acts as a field separator. +This is in addition to whatever field separations result from @code{FS}. + +The original motivation for this special exception was probably to provide +useful behavior in the default case (i.e.@: @code{FS} is equal +to @w{@code{" "}}). This feature can be a problem if you really don't +want the newline character to separate fields, since there is no way to +prevent it. 
However, you can work around this by using the @code{split} +function to break up the record manually +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). + +Another way to separate fields is to +put each field on a separate line: to do this, just set the +variable @code{FS} to the string @code{"\n"}. (This simple regular +expression matches a single newline.) + +A practical example of a data file organized this way might be a mailing +list, where each entry is separated by blank lines. If we have a mailing +list in a file named @file{addresses}, that looks like this: + +@example +Jane Doe +123 Main Street +Anywhere, SE 12345-6789 + +John Smith +456 Tree-lined Avenue +Smallville, MW 98765-4321 + +@dots{} +@end example + +@noindent +A simple program to process this file would look like this: + +@example +@group +# addrs.awk --- simple mailing list program + +# Records are separated by blank lines. +# Each line is one field. +BEGIN @{ RS = "" ; FS = "\n" @} + +@{ + print "Name is:", $1 + print "Address is:", $2 + print "City and State are:", $3 + print "" +@} +@end group +@end example + +Running the program produces the following output: + +@example +@group +$ awk -f addrs.awk addresses +@print{} Name is: Jane Doe +@print{} Address is: 123 Main Street +@print{} City and State are: Anywhere, SE 12345-6789 +@print{} +@end group +@group +@print{} Name is: John Smith +@print{} Address is: 456 Tree-lined Avenue +@print{} City and State are: Smallville, MW 98765-4321 +@print{} +@dots{} +@end group +@end example + +@xref{Labels Program, ,Printing Mailing Labels}, for a more realistic +program that deals with address lists. + +The following table summarizes how records are split, based on the +value of @code{RS}. (@samp{==} means ``is equal to.'') + +@c @cartouche +@table @code +@item RS == "\n" +Records are separated by the newline character (@samp{\n}). In effect, +every line in the data file is a separate record, including blank lines. 
+This is the default. + +@item RS == @var{any single character} +Records are separated by each occurrence of the character. Multiple +successive occurrences delimit empty records. + +@item RS == "" +Records are separated by runs of blank lines. The newline character +always serves as a field separator, in addition to whatever value +@code{FS} may have. Leading and trailing newlines in a file are ignored. + +@item RS == @var{regexp} +Records are separated by occurrences of characters that match @var{regexp}. +Leading and trailing matches of @var{regexp} delimit empty records. +@end table +@c @end cartouche + +@vindex RT +In all cases, @code{gawk} sets @code{RT} to the input text that matched the +value specified by @code{RS}. + +@node Getline, , Multiple Line, Reading Files +@section Explicit Input with @code{getline} + +@findex getline +@cindex input, explicit +@cindex explicit input +@cindex input, @code{getline} command +@cindex reading files, @code{getline} command +So far we have been getting our input data from @code{awk}'s main +input stream---either the standard input (usually your terminal, sometimes +the output from another program) or from the +files specified on the command line. The @code{awk} language has a +special built-in command called @code{getline} that +can be used to read input under your explicit control. + +@menu +* Getline Intro:: Introduction to the @code{getline} function. +* Plain Getline:: Using @code{getline} with no arguments. +* Getline/Variable:: Using @code{getline} into a variable. +* Getline/File:: Using @code{getline} from a file. +* Getline/Variable/File:: Using @code{getline} into a variable from a + file. +* Getline/Pipe:: Using @code{getline} from a pipe. +* Getline/Variable/Pipe:: Using @code{getline} into a variable from a + pipe. +* Getline Summary:: Summary Of @code{getline} Variants. 
+@end menu + +@node Getline Intro, Plain Getline, Getline, Getline +@subsection Introduction to @code{getline} + +This command is used in several different ways, and should @emph{not} be +used by beginners. It is covered here because this is the chapter on input. +The examples that follow the explanation of the @code{getline} command +include material that has not been covered yet. Therefore, come back +and study the @code{getline} command @emph{after} you have reviewed the +rest of this @value{DOCUMENT} and have a good knowledge of how @code{awk} works. + +@vindex ERRNO +@cindex differences between @code{gawk} and @code{awk} +@cindex @code{getline}, return values +@code{getline} returns one if it finds a record, and zero if the end of the +file is encountered. If there is some error in getting a record, such +as a file that cannot be opened, then @code{getline} returns @minus{}1. +In this case, @code{gawk} sets the variable @code{ERRNO} to a string +describing the error that occurred. + +In the following examples, @var{command} stands for a string value that +represents a shell command. + +@node Plain Getline, Getline/Variable, Getline Intro, Getline +@subsection Using @code{getline} with No Arguments + +The @code{getline} command can be used without arguments to read input +from the current input file. All it does in this case is read the next +input record and split it up into fields. This is useful if you've +finished processing the current record, but you want to do some special +processing @emph{right now} on the next record. 
Here's an +example: + +@example +@group +awk '@{ + if ((t = index($0, "/*")) != 0) @{ + # value will be "" if t is 1 + tmp = substr($0, 1, t - 1) + u = index(substr($0, t + 2), "*/") + while (u == 0) @{ + if (getline <= 0) @{ + m = "unexpected EOF or error" + m = (m ": " ERRNO) + print m > "/dev/stderr" + exit + @} + t = -1 + u = index($0, "*/") + @} +@end group +@group + # substr expression will be "" if */ + # occurred at end of line + $0 = tmp substr($0, t + u + 3) + @} + print $0 +@}' +@end group +@end example + +This @code{awk} program deletes all C-style comments, @samp{/* @dots{} +*/}, from the input. By replacing the @samp{print $0} with other +statements, you could perform more complicated processing on the +decommented input, like searching for matches of a regular +expression. This program has a subtle problem---it does not work if one +comment ends and another begins on the same line. + +@ignore +Exercise, +write a program that does handle multiple comments on the line. +@end ignore + +This form of the @code{getline} command sets @code{NF} (the number of +fields; @pxref{Fields, ,Examining Fields}), @code{NR} (the number of +records read so far; @pxref{Records, ,How Input is Split into Records}), +@code{FNR} (the number of records read from this input file), and the +value of @code{$0}. + +@cindex dark corner +@strong{Note:} the new value of @code{$0} is used in testing +the patterns of any subsequent rules. The original value +of @code{$0} that triggered the rule which executed @code{getline} +is lost (d.c.). +By contrast, the @code{next} statement reads a new record +but immediately begins processing it normally, starting with the first +rule in the program. @xref{Next Statement, ,The @code{next} Statement}. + +@node Getline/Variable, Getline/File, Plain Getline, Getline +@subsection Using @code{getline} Into a Variable + +You can use @samp{getline @var{var}} to read the next record from +@code{awk}'s input into the variable @var{var}. 
No other processing is +done. + +For example, suppose the next line is a comment, or a special string, +and you want to read it, without triggering +any rules. This form of @code{getline} allows you to read that line +and store it in a variable so that the main +read-a-line-and-check-each-rule loop of @code{awk} never sees it. + +The following example swaps every two lines of input. For example, given: + +@example +wan +tew +free +phore +@end example + +@noindent +it outputs: + +@example +tew +wan +phore +free +@end example + +@noindent +Here's the program: + +@example +@group +awk '@{ + if ((getline tmp) > 0) @{ + print tmp + print $0 + @} else + print $0 +@}' +@end group +@end example + +The @code{getline} command used in this way sets only the variables +@code{NR} and @code{FNR} (and of course, @var{var}). The record is not +split into fields, so the values of the fields (including @code{$0}) and +the value of @code{NF} do not change. + +@node Getline/File, Getline/Variable/File, Getline/Variable, Getline +@subsection Using @code{getline} from a File + +@cindex input redirection +@cindex redirection of input +Use @samp{getline < @var{file}} to read +the next record from the file +@var{file}. Here @var{file} is a string-valued expression that +specifies the file name. @samp{< @var{file}} is called a @dfn{redirection} +since it directs input to come from a different place. + +For example, the following +program reads its input record from the file @file{secondary.input} when it +encounters a first field with a value equal to 10 in the current input +file. + +@example +@group +awk '@{ + if ($1 == 10) @{ + getline < "secondary.input" + print + @} else + print +@}' +@end group +@end example + +Since the main input stream is not used, the values of @code{NR} and +@code{FNR} are not changed. But the record read is split into fields in +the normal manner, so the values of @code{$0} and other fields are +changed. So is the value of @code{NF}. 
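The effect on the built-in variables can be checked directly. The following is a minimal sketch under stated assumptions, not part of the original text: the function name @code{show_getline_file} is invented, and a scratch file created with @code{mktemp} stands in for the secondary input file. It prints @code{NR}, @code{NF} and @code{$0} before and after a @samp{getline < @var{file}}.

```shell
# show_getline_file: hypothetical helper name, used only for this sketch.
show_getline_file() {
    side=$(mktemp) || return 1     # scratch file standing in for a secondary input
    printf 'x y z\n' > "$side"
    printf 'one two\n' | awk -v side="$side" '{
        print "before", NR, NF, $0
        if ((getline < side) > 0)  # replaces $0, the fields, and NF
            print "after", NR, NF, $0
    }'
    rm -f "$side"
}

show_getline_file
```

The second line of output should show the same value of @code{NR} as the first, but a different @code{NF} and @code{$0}, since only the record and its fields are replaced.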
+ +@c Thanks to Paul Eggert for initial wording here +According to POSIX, @samp{getline < @var{expression}} is ambiguous if +@var{expression} contains unparenthesized operators other than +@samp{$}; for example, @samp{getline < dir "/" file} is ambiguous +because the concatenation operator is not parenthesized, and you should +write it as @samp{getline < (dir "/" file)} if you want your program +to be portable to other @code{awk} implementations. + +@node Getline/Variable/File, Getline/Pipe, Getline/File, Getline +@subsection Using @code{getline} Into a Variable from a File + +Use @samp{getline @var{var} < @var{file}} to read input from +the file +@var{file} and put it in the variable @var{var}. As above, @var{file} +is a string-valued expression that specifies the file from which to read. + +In this version of @code{getline}, none of the built-in variables are +changed, and the record is not split into fields. The only variable +changed is @var{var}. + +@ifinfo +@c Thanks to Paul Eggert for initial wording here +According to POSIX, @samp{getline @var{var} < @var{expression}} is ambiguous if +@var{expression} contains unparenthesized operators other than +@samp{$}; for example, @samp{getline < dir "/" file} is ambiguous +because the concatenation operator is not parenthesized, and you should +write it as @samp{getline < (dir "/" file)} if you want your program +to be portable to other @code{awk} implementations. +@end ifinfo + +For example, the following program copies all the input files to the +output, except for records that say @w{@samp{@@include @var{filename}}}. +Such a record is replaced by the contents of the file +@var{filename}. 
+ +@example +@group +awk '@{ + if (NF == 2 && $1 == "@@include") @{ + while ((getline line < $2) > 0) + print line + close($2) + @} else + print +@}' +@end group +@end example + +Note here how the name of the extra input file is not built into +the program; it is taken directly from the data, from the second field on +the @samp{@@include} line. + +The @code{close} function is called to ensure that if two identical +@samp{@@include} lines appear in the input, the entire specified file is +included twice. +@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}. + +One deficiency of this program is that it does not process nested +@samp{@@include} statements +(@samp{@@include} statements in included files) +the way a true macro preprocessor would. +@xref{Igawk Program, ,An Easy Way to Use Library Functions}, for a program +that does handle nested @samp{@@include} statements. + +@node Getline/Pipe, Getline/Variable/Pipe, Getline/Variable/File, Getline +@subsection Using @code{getline} from a Pipe + +@cindex input pipeline +@cindex pipeline, input +You can pipe the output of a command into @code{getline}, using +@samp{@var{command} | getline}. In +this case, the string @var{command} is run as a shell command and its output +is piped into @code{awk} to be used as input. This form of @code{getline} +reads one record at a time from the pipe. + +For example, the following program copies its input to its output, except for +lines that begin with @samp{@@execute}, which are replaced by the output +produced by running the rest of the line as a shell command: + +@example +@group +awk '@{ + if ($1 == "@@execute") @{ + tmp = substr($0, 10) + while ((tmp | getline) > 0) + print + close(tmp) + @} else + print +@}' +@end group +@end example + +@noindent +The @code{close} function is called to ensure that if two identical +@samp{@@execute} lines appear in the input, the command is run for +each one. 
+@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}. +@c Exercise!! +@c This example is unrealistic, since you could just use system + +@c NEEDED +@page +Given the input: + +@example +@group +foo +bar +baz +@@execute who +bletch +@end group +@end example + +@noindent +the program might produce: + +@example +@group +foo +bar +baz +arnold ttyv0 Jul 13 14:22 +miriam ttyp0 Jul 13 14:23 (murphy:0) +bill ttyp1 Jul 13 14:23 (murphy:0) +bletch +@end group +@end example + +@noindent +Notice that this program ran the command @code{who} and printed the result. +(If you try this program yourself, you will of course get different results, +showing you who is logged in on your system.) + +This variation of @code{getline} splits the record into fields, sets the +value of @code{NF} and recomputes the value of @code{$0}. The values of +@code{NR} and @code{FNR} are not changed. + +@c Thanks to Paul Eggert for initial wording here +According to POSIX, @samp{@var{expression} | getline} is ambiguous if +@var{expression} contains unparenthesized operators other than +@samp{$}; for example, @samp{"echo " "date" | getline} is ambiguous +because the concatenation operator is not parenthesized, and you should +write it as @samp{("echo " "date") | getline} if you want your program +to be portable to other @code{awk} implementations. + +@node Getline/Variable/Pipe, Getline Summary, Getline/Pipe, Getline +@subsection Using @code{getline} Into a Variable from a Pipe + +When you use @samp{@var{command} | getline @var{var}}, the +output of the command @var{command} is sent through a pipe to +@code{getline} and into the variable @var{var}. For example, the +following program reads the current date and time into the variable +@code{current_time}, using the @code{date} utility, and then +prints it. 
+ +@example +@group +awk 'BEGIN @{ + "date" | getline current_time + close("date") + print "Report printed on " current_time +@}' +@end group +@end example + +In this version of @code{getline}, none of the built-in variables are +changed, and the record is not split into fields. + +@ifinfo +@c Thanks to Paul Eggert for initial wording here +According to POSIX, @samp{@var{expression} | getline @var{var}} is ambiguous if +@var{expression} contains unparenthesized operators other than +@samp{$}; for example, @samp{"echo " "date" | getline @var{var}} is ambiguous +because the concatenation operator is not parenthesized, and you should +write it as @samp{("echo " "date") | getline @var{var}} if you want your +program to be portable to other @code{awk} implementations. +@end ifinfo + +@node Getline Summary, , Getline/Variable/Pipe, Getline +@subsection Summary of @code{getline} Variants + +With all the forms of @code{getline}, even though @code{$0} and @code{NF} +may be updated, the record is not tested against all the patterns +in the @code{awk} program, in the way that would happen if the record +were read normally by the main processing loop of @code{awk}. However, +the new record is tested against any subsequent rules. + +@cindex differences between @code{gawk} and @code{awk} +@cindex limitations +@cindex implementation limits +Many @code{awk} implementations limit the number of pipelines an @code{awk} +program may have open to just one! In @code{gawk}, there is no such limit. +You can open as many pipelines as the underlying operating system will +permit. + +@vindex FILENAME +@cindex dark corner +@cindex @code{getline}, setting @code{FILENAME} +@cindex @code{FILENAME}, being set by @code{getline} +An interesting side-effect occurs if you use @code{getline} (without a +redirection) inside a @code{BEGIN} rule.
Since an unredirected @code{getline} +reads from the command line data files, the first @code{getline} command +causes @code{awk} to set the value of @code{FILENAME}. Normally, +@code{FILENAME} does not have a value inside @code{BEGIN} rules, since you +have not yet started to process the command line data files (d.c.). +(@xref{BEGIN/END, , The @code{BEGIN} and @code{END} Special Patterns}, +also @pxref{Auto-set, , Built-in Variables that Convey Information}.) + +The following table summarizes the six variants of @code{getline}, +listing which built-in variables are set by each one. + +@c @cartouche +@table @code +@item getline +sets @code{$0}, @code{NF}, @code{FNR}, and @code{NR}. + +@item getline @var{var} +sets @var{var}, @code{FNR}, and @code{NR}. + +@item getline < @var{file} +sets @code{$0}, and @code{NF}. + +@item getline @var{var} < @var{file} +sets @var{var}. + +@item @var{command} | getline +sets @code{$0}, and @code{NF}. + +@item @var{command} | getline @var{var} +sets @var{var}. +@end table +@c @end cartouche + +@node Printing, Expressions, Reading Files, Top +@chapter Printing Output + +@cindex printing +@cindex output +One of the most common actions is to @dfn{print}, or output, +some or all of the input. You use the @code{print} statement +for simple output. You use the @code{printf} statement +for fancier formatting. Both are described in this chapter. + +@menu +* Print:: The @code{print} statement. +* Print Examples:: Simple examples of @code{print} statements. +* Output Separators:: The output separators and how to change them. +* OFMT:: Controlling Numeric Output With @code{print}. +* Printf:: The @code{printf} statement. +* Redirection:: How to redirect output to multiple files and + pipes. +* Special Files:: File name interpretation in @code{gawk}. + @code{gawk} allows access to inherited file + descriptors. +* Close Files And Pipes:: Closing Input and Output Files and Pipes. 
+@end menu + +@node Print, Print Examples, Printing, Printing +@section The @code{print} Statement +@cindex @code{print} statement + +The @code{print} statement does output with simple, standardized +formatting. You specify only the strings or numbers to be printed, in a +list separated by commas. They are output, separated by single spaces, +followed by a newline. The statement looks like this: + +@example +print @var{item1}, @var{item2}, @dots{} +@end example + +@noindent +The entire list of items may optionally be enclosed in parentheses. The +parentheses are necessary if any of the item expressions uses the @samp{>} +relational operator; otherwise it could be confused with a redirection +(@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}). + +The items to be printed can be constant strings or numbers, fields of the +current record (such as @code{$1}), variables, or any @code{awk} +expressions. +Numeric values are converted to strings, and then printed. + +The @code{print} statement is completely general for +computing @emph{what} values to print. However, with two exceptions, +you cannot specify @emph{how} to print them---how many +columns, whether to use exponential notation or not, and so on. +(For the exceptions, @pxref{Output Separators}, and +@ref{OFMT, ,Controlling Numeric Output with @code{print}}.) +For that, you need the @code{printf} statement +(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}). + +The simple statement @samp{print} with no items is equivalent to +@samp{print $0}: it prints the entire current record. To print a blank +line, use @samp{print ""}, where @code{""} is the empty string. + +To print a fixed piece of text, use a string constant such as +@w{@code{"Don't Panic"}} as one item. If you forget to use the +double-quote characters, your text will be taken as an @code{awk} +expression, and you will probably get an error. Keep in mind that a +space is printed between any two items. 
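The parenthesization rule described earlier in this section (a @samp{>} inside a @code{print} item list needs parentheses) can be seen in the following shell sketch. It is only an illustration under stated assumptions: the scratch directory and the shell variable names are hypothetical, and the numbers 3 and 2 were chosen arbitrarily.

```shell
# Hypothetical scratch directory, so the stray output file is easy to clean up.
workdir=$(mktemp -d)

# Parenthesized: ">" is a comparison, so the truth value 1 goes to standard output.
compared=$(awk 'BEGIN { print (3 > 2) }')

# Unparenthesized: ">" is a redirection, so "3" is written to a file named "2"
# and nothing reaches standard output.
redirected=$(cd "$workdir" && awk 'BEGIN { print 3 > 2 }')
file_contents=$(cat "$workdir/2")

echo "compared=$compared redirected=[$redirected] file=$file_contents"
rm -rf "$workdir"
```

The second command silently creates a file, which is why the parentheses matter whenever a relational @samp{>} appears among the items.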
+ +Each @code{print} statement makes at least one line of output. But it +isn't limited to one line. If an item value is a string that contains a +newline, the newline is output along with the rest of the string. A +single @code{print} can make any number of lines this way. + +@node Print Examples, Output Separators, Print, Printing +@section Examples of @code{print} Statements + +Here is an example of printing a string that contains embedded newlines +(the @samp{\n} is an escape sequence, used to represent the newline +character; see @ref{Escape Sequences}): + +@example +@group +$ awk 'BEGIN @{ print "line one\nline two\nline three" @}' +@print{} line one +@print{} line two +@print{} line three +@end group +@end example + +Here is an example that prints the first two fields of each input record, +with a space between them: + +@example +@group +$ awk '@{ print $1, $2 @}' inventory-shipped +@print{} Jan 13 +@print{} Feb 15 +@print{} Mar 15 +@dots{} +@end group +@end example + +@cindex common mistakes +@cindex mistakes, common +@cindex errors, common +A common mistake in using the @code{print} statement is to omit the comma +between two items. This often has the effect of making the items run +together in the output, with no space. The reason for this is that +juxtaposing two string expressions in @code{awk} means to concatenate +them. Here is the same program, without the comma: + +@example +@group +$ awk '@{ print $1 $2 @}' inventory-shipped +@print{} Jan13 +@print{} Feb15 +@print{} Mar15 +@dots{} +@end group +@end example + +To someone unfamiliar with the file @file{inventory-shipped}, neither +example's output makes much sense. A heading line at the beginning +would make it clearer. Let's add some headings to our table of months +(@code{$1}) and green crates shipped (@code{$2}). 
We do this using the +@code{BEGIN} pattern +(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}) +to force the headings to be printed only once: + +@example +awk 'BEGIN @{ print "Month Crates" + print "----- ------" @} + @{ print $1, $2 @}' inventory-shipped +@end example + +@noindent +Did you already guess what happens? When run, the program prints +the following: + +@example +@group +Month Crates +----- ------ +Jan 13 +Feb 15 +Mar 15 +@dots{} +@end group +@end example + +@noindent +The headings and the table data don't line up! We can fix this by printing +some spaces between the two fields: + +@example +awk 'BEGIN @{ print "Month Crates" + print "----- ------" @} + @{ print $1, " ", $2 @}' inventory-shipped +@end example + +You can imagine that this way of lining up columns can get pretty +complicated when you have many columns to fix. Counting spaces for two +or three columns can be simple, but more than this and you can get +lost quite easily. This is why the @code{printf} statement was +created (@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}); +one of its specialties is lining up columns of data. + +@cindex line continuation +As a side point, +you can continue either a @code{print} or @code{printf} statement simply +by putting a newline after any comma +(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}). + +@node Output Separators, OFMT, Print Examples, Printing +@section Output Separators + +@cindex output field separator, @code{OFS} +@cindex output record separator, @code{ORS} +@vindex OFS +@vindex ORS +As mentioned previously, a @code{print} statement contains a list +of items, separated by commas. In the output, the items are normally +separated by single spaces. This need not be the case; a +single space is only the default. You can specify any string of +characters to use as the @dfn{output field separator} by setting the +built-in variable @code{OFS}. 
The initial value of this variable +is the string @w{@code{" "}}, that is, a single space. + +The output from an entire @code{print} statement is called an +@dfn{output record}. Each @code{print} statement outputs one output +record and then outputs a string called the @dfn{output record separator}. +The built-in variable @code{ORS} specifies this string. The initial +value of @code{ORS} is the string @code{"\n"}, i.e.@: a newline +character; thus, normally each @code{print} statement makes a separate line. + +You can change how output fields and records are separated by assigning +new values to the variables @code{OFS} and/or @code{ORS}. The usual +place to do this is in the @code{BEGIN} rule +(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}), so +that it happens before any input is processed. You may also do this +with assignments on the command line, before the names of your input +files, or using the @samp{-v} command line option +(@pxref{Options, ,Command Line Options}). + +@ignore +Exercise, +Rewrite the +@example +awk 'BEGIN @{ print "Month Crates" + print "----- ------" @} + @{ print $1, " ", $2 @}' inventory-shipped +@end example +program by using a new value of @code{OFS}. +@end ignore + +The following example prints the first and second fields of each input +record separated by a semicolon, with a blank line added after each +line: + +@example +@group +$ awk 'BEGIN @{ OFS = ";"; ORS = "\n\n" @} +> @{ print $1, $2 @}' BBS-list +@print{} aardvark;555-5553 +@print{} +@print{} alpo-net;555-3412 +@print{} +@print{} barfly;555-7685 +@dots{} +@end group +@end example + +If the value of @code{ORS} does not contain a newline, all your output +will be run together on a single line, unless you output newlines some +other way. 
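The two points above, setting a separator with @samp{-v} and the effect of an @code{ORS} that lacks a newline, can be sketched from the shell as follows. This is a minimal sketch rather than text from the original manual; the two sample records are borrowed from the @file{BBS-list} output shown earlier, and the shell variable names are made up.

```shell
# -v assigns OFS before any input is read, much as a BEGIN rule would.
with_v=$(printf 'aardvark 555-5553\nalpo-net 555-3412\n' |
    awk -v OFS=';' '{ print $1, $2 }')

# ORS without a newline: every record lands on the same output line.
one_line=$(printf 'aardvark 555-5553\nalpo-net 555-3412\n' |
    awk 'BEGIN { ORS = " | " } { print $1 }')

printf '%s\n' "$with_v" "$one_line"
```

The first command prints one @samp{;}-separated pair per line. The second prints @samp{aardvark | alpo-net | } on a single line, because @code{ORS} is written after every record, including the last.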
+ +@node OFMT, Printf, Output Separators, Printing +@section Controlling Numeric Output with @code{print} +@vindex OFMT +@cindex numeric output format +@cindex format, numeric output +@cindex output format specifier, @code{OFMT} +When you use the @code{print} statement to print numeric values, +@code{awk} internally converts the number to a string of characters, +and prints that string. @code{awk} uses the @code{sprintf} function +to do this conversion +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). +For now, it suffices to say that the @code{sprintf} +function accepts a @dfn{format specification} that tells it how to format +numbers (or strings), and that there are a number of different ways in which +numbers can be formatted. The different format specifications are discussed +more fully in +@ref{Control Letters, , Format-Control Letters}. + +The built-in variable @code{OFMT} contains the default format specification +that @code{print} uses with @code{sprintf} when it wants to convert a +number to a string for printing. +The default value of @code{OFMT} is @code{"%.6g"}. +By supplying different format specifications +as the value of @code{OFMT}, you can change how @code{print} will print +your numbers. As a brief example: + +@example +@group +$ awk 'BEGIN @{ +> OFMT = "%.0f" # print numbers as integers (rounds) +> print 17.23 @}' +@print{} 17 +@end group +@end example + +@noindent +@cindex dark corner +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +According to the POSIX standard, @code{awk}'s behavior will be undefined +if @code{OFMT} contains anything but a floating point conversion specification +(d.c.). + +@node Printf, Redirection, OFMT, Printing +@section Using @code{printf} Statements for Fancier Printing +@cindex formatted output +@cindex output, formatted + +If you want more precise control over the output format than +@code{print} gives you, use @code{printf}. 
With @code{printf} you can +specify the width to use for each item, and you can specify various +formatting choices for numbers (such as what radix to use, whether to +print an exponent, whether to print a sign, and how many digits to print +after the decimal point). You do this by supplying a string, called +the @dfn{format string}, which controls how and where to print the other +arguments. + +@menu +* Basic Printf:: Syntax of the @code{printf} statement. +* Control Letters:: Format-control letters. +* Format Modifiers:: Format-specification modifiers. +* Printf Examples:: Several examples. +@end menu + +@node Basic Printf, Control Letters, Printf, Printf +@subsection Introduction to the @code{printf} Statement + +@cindex @code{printf} statement, syntax of +The @code{printf} statement looks like this: + +@example +printf @var{format}, @var{item1}, @var{item2}, @dots{} +@end example + +@noindent +The entire list of arguments may optionally be enclosed in parentheses. The +parentheses are necessary if any of the item expressions use the @samp{>} +relational operator; otherwise it could be confused with a redirection +(@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}). + +@cindex format string +The difference between @code{printf} and @code{print} is the @var{format} +argument. This is an expression whose value is taken as a string; it +specifies how to output each of the other arguments. It is called +the @dfn{format string}. + +The format string is very similar to that in the ANSI C library function +@code{printf}. Most of @var{format} is text to be output verbatim. +Scattered among this text are @dfn{format specifiers}, one per item. +Each format specifier says to output the next item in the argument list +at that place in the format. + +The @code{printf} statement does not automatically append a newline to its +output. It outputs only what the format string specifies. So if you want +a newline, you must include one in the format string. 
The output separator +variables @code{OFS} and @code{ORS} have no effect on @code{printf} +statements. For example: + +@example +@group +BEGIN @{ + ORS = "\nOUCH!\n"; OFS = "!" + msg = "Don't Panic!"; printf "%s\n", msg +@} +@end group +@end example + +This program still prints the familiar @samp{Don't Panic!} message. + +@node Control Letters, Format Modifiers, Basic Printf, Printf +@subsection Format-Control Letters +@cindex @code{printf}, format-control characters +@cindex format specifier + +A format specifier starts with the character @samp{%} and ends with a +@dfn{format-control letter}; it tells the @code{printf} statement how +to output one item. (If you actually want to output a @samp{%}, write +@samp{%%}.) The format-control letter specifies what kind of value to +print. The rest of the format specifier is made up of optional +@dfn{modifiers} which are parameters to use, such as the field width. + +Here is a list of the format-control letters: + +@table @code +@item c +This prints a number as an ASCII character. Thus, @samp{printf "%c", +65} outputs the letter @samp{A}. The output for a string value is +the first character of the string. + +@item d +@itemx i +These are equivalent. They both print a decimal integer. +The @samp{%i} specification is for compatibility with ANSI C. + +@item e +@itemx E +This prints a number in scientific (exponential) notation. +For example, + +@example +printf "%4.3e\n", 1950 +@end example + +@noindent +prints @samp{1.950e+03}, with a total of four significant figures of +which three follow the decimal point. The @samp{4.3} are modifiers, +discussed below. @samp{%E} uses @samp{E} instead of @samp{e} in the output. + +@item f +This prints a number in floating point notation. +For example, + +@example +printf "%4.3f", 1950 +@end example + +@noindent +prints @samp{1950.000}, with three digits following the decimal point. +The @samp{4.3} are modifiers (a minimum field width of four and a +precision of three), +discussed below.
+ +@item g +@itemx G +This prints a number in either scientific notation or floating point +notation, whichever uses fewer characters. If the result is printed in +scientific notation, @samp{%G} uses @samp{E} instead of @samp{e}. + +@item o +This prints an unsigned octal integer. +(In octal, or base-eight notation, the digits run from @samp{0} to @samp{7}; +the decimal number eight is represented as @samp{10} in octal.) + +@item s +This prints a string. + +@item x +@itemx X +This prints an unsigned hexadecimal integer. +(In hexadecimal, or base-16 notation, the digits are @samp{0} through @samp{9} +and @samp{a} through @samp{f}. The hexadecimal digit @samp{f} represents +the decimal number 15.) @samp{%X} uses the letters @samp{A} through @samp{F} +instead of @samp{a} through @samp{f}. + +@item % +This isn't really a format-control letter, but it does have a meaning +when used after a @samp{%}: the sequence @samp{%%} outputs one +@samp{%}. It does not consume an argument, and it ignores any +modifiers. +@end table + +@cindex dark corner +When using the integer format-control letters for values that are outside +the range of a C @code{long} integer, @code{gawk} will switch to the +@samp{%g} format specifier. Other versions of @code{awk} may print +invalid values, or do something else entirely (d.c.). + +@node Format Modifiers, Printf Examples, Control Letters, Printf +@subsection Modifiers for @code{printf} Formats + +@cindex @code{printf}, modifiers +@cindex modifiers (in format specifiers) +A format specification can also include @dfn{modifiers} that can control +how much of the item's value is printed and how much space it gets. The +modifiers come between the @samp{%} and the format-control letter. +In the examples below, we use the bullet symbol ``@bullet{}'' to represent +spaces in the output. 
Here are the possible modifiers, in the order in +which they may appear: + +@table @code +@item - +The minus sign, used before the width modifier (see below), +says to left-justify +the argument within its specified width. Normally the argument +is printed right-justified in the specified width. Thus, + +@example +printf "%-4s", "foo" +@end example + +@noindent +prints @samp{foo@bullet{}}. + +@item @var{space} +For numeric conversions, prefix positive values with a space, and +negative values with a minus sign. + +@item + +The plus sign, used before the width modifier (see below), +says to always supply a sign for numeric conversions, even if the data +to be formatted is positive. The @samp{+} overrides the space modifier. + +@item # +Use an ``alternate form'' for certain control letters. +For @samp{%o}, supply a leading zero. +For @samp{%x}, and @samp{%X}, supply a leading @samp{0x} or @samp{0X} for +a non-zero result. +For @samp{%e}, @samp{%E}, and @samp{%f}, the result will always contain a +decimal point. +For @samp{%g}, and @samp{%G}, trailing zeros are not removed from the result. + +@cindex dark corner +@item 0 +A leading @samp{0} (zero) acts as a flag, that indicates output should be +padded with zeros instead of spaces. +This applies even to non-numeric output formats (d.c.). +This flag only has an effect when the field width is wider than the +value to be printed. + +@item @var{width} +This is a number specifying the desired minimum width of a field. Inserting any +number between the @samp{%} sign and the format control character forces the +field to be expanded to this width. The default way to do this is to +pad with spaces on the left. For example, + +@example +printf "%4s", "foo" +@end example + +@noindent +prints @samp{@bullet{}foo}. + +The value of @var{width} is a minimum width, not a maximum. If the item +value requires more than @var{width} characters, it can be as wide as +necessary. 
Thus, + +@example +printf "%4s", "foobar" +@end example + +@noindent +prints @samp{foobar}. + +Preceding the @var{width} with a minus sign causes the output to be +padded with spaces on the right, instead of on the left. + +@item .@var{prec} +This is a number that specifies the precision to use when printing. +For the @samp{e}, @samp{E}, and @samp{f} formats, this specifies the +number of digits you want printed to the right of the decimal point. +For the @samp{g}, and @samp{G} formats, it specifies the maximum number +of significant digits. For the @samp{d}, @samp{o}, @samp{i}, @samp{u}, +@samp{x}, and @samp{X} formats, it specifies the minimum number of +digits to print. For a string, it specifies the maximum number of +characters from the string that should be printed. Thus, + +@example +printf "%.4s", "foobar" +@end example + +@noindent +prints @samp{foob}. +@end table + +The C library @code{printf}'s dynamic @var{width} and @var{prec} +capability (for example, @code{"%*.*s"}) is supported. Instead of +supplying explicit @var{width} and/or @var{prec} values in the format +string, you pass them in the argument list. For example: + +@example +w = 5 +p = 3 +s = "abcdefg" +printf "%*.*s\n", w, p, s +@end example + +@noindent +is exactly equivalent to + +@example +s = "abcdefg" +printf "%5.3s\n", s +@end example + +@noindent +Both programs output @samp{@w{@bullet{}@bullet{}abc}}. + +Earlier versions of @code{awk} did not support this capability. +If you must use such a version, you may simulate this feature by using +concatenation to build up the format string, like so: + +@example +w = 5 +p = 3 +s = "abcdefg" +printf "%" w "." p "s\n", s +@end example + +@noindent +This is not particularly easy to read, but it does work. + +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +C programmers may be used to supplying additional @samp{l} and @samp{h} +flags in @code{printf} format strings. These are not valid in @code{awk}. 
+Most @code{awk} implementations silently ignore these flags. +If @samp{--lint} is provided on the command line +(@pxref{Options, ,Command Line Options}), +@code{gawk} will warn about their use. If @samp{--posix} is supplied, +their use is a fatal error. + +@node Printf Examples, , Format Modifiers, Printf +@subsection Examples Using @code{printf} + +Here is how to use @code{printf} to make an aligned table: + +@example +awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list +@end example + +@noindent +prints the names of bulletin boards (@code{$1}) of the file +@file{BBS-list} as a string of 10 characters, left justified. It also +prints the phone numbers (@code{$2}) afterward on the line. This +produces an aligned two-column table of names and phone numbers: + +@example +@group +$ awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list +@print{} aardvark 555-5553 +@print{} alpo-net 555-3412 +@print{} barfly 555-7685 +@print{} bites 555-1675 +@print{} camelot 555-0542 +@print{} core 555-2912 +@print{} fooey 555-1234 +@print{} foot 555-6699 +@print{} macfoo 555-6480 +@print{} sdace 555-3430 +@print{} sabafoo 555-2127 +@end group +@end example + +Did you notice that we did not specify that the phone numbers be printed +as numbers? They had to be printed as strings because the numbers are +separated by a dash. +If we had tried to print the phone numbers as numbers, all we would have +gotten would have been the first three digits, @samp{555}. +This would have been pretty confusing. + +We did not specify a width for the phone numbers because they are the +last things on their lines. We don't need to put spaces after them. + +We could make our table look even nicer by adding headings to the tops +of the columns. 
To do this, we use the @code{BEGIN} pattern +(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}) +to force the header to be printed only once, at the beginning of +the @code{awk} program: + +@example +@group +awk 'BEGIN @{ print "Name Number" + print "---- ------" @} + @{ printf "%-10s %s\n", $1, $2 @}' BBS-list +@end group +@end example + +Did you notice that we mixed @code{print} and @code{printf} statements in +the above example? We could have used just @code{printf} statements to get +the same results: + +@example +@group +awk 'BEGIN @{ printf "%-10s %s\n", "Name", "Number" + printf "%-10s %s\n", "----", "------" @} + @{ printf "%-10s %s\n", $1, $2 @}' BBS-list +@end group +@end example + +@noindent +By printing each column heading with the same format specification +used for the elements of the column, we have made sure that the headings +are aligned just like the columns. + +The fact that the same format specification is used three times can be +emphasized by storing it in a variable, like this: + +@example +@group +awk 'BEGIN @{ format = "%-10s %s\n" + printf format, "Name", "Number" + printf format, "----", "------" @} + @{ printf format, $1, $2 @}' BBS-list +@end group +@end example + +@c !!! exercise +See if you can use the @code{printf} statement to line up the headings and +table data for our @file{inventory-shipped} example covered earlier in the +section on the @code{print} statement +(@pxref{Print, ,The @code{print} Statement}). + +@node Redirection, Special Files, Printf, Printing +@section Redirecting Output of @code{print} and @code{printf} + +@cindex output redirection +@cindex redirection of output +So far we have been dealing only with output that prints to the standard +output, usually your terminal. Both @code{print} and @code{printf} can +also send their output to other places. +This is called @dfn{redirection}. + +A redirection appears after the @code{print} or @code{printf} statement. 
+Redirections in @code{awk} are written just like redirections in shell +commands, except that they are written inside the @code{awk} program. + +There are three forms of output redirection: output to a file, +output appended to a file, and output through a pipe to another +command. +They are all shown for +the @code{print} statement, but they work identically for @code{printf} +also. + +@table @code +@item print @var{items} > @var{output-file} +This type of redirection prints the items into the output file +@var{output-file}. The file name @var{output-file} can be any +expression. Its value is changed to a string and then used as a +file name (@pxref{Expressions}). + +When this type of redirection is used, the @var{output-file} is erased +before the first output is written to it. Subsequent writes +to the same @var{output-file} do not +erase @var{output-file}, but append to it. If @var{output-file} does +not exist, then it is created. + +For example, here is how an @code{awk} program can write a list of +BBS names to a file @file{name-list} and a list of phone numbers to a +file @file{phone-list}. Each output file contains one name or number +per line. + +@example +@group +$ awk '@{ print $2 > "phone-list" +> print $1 > "name-list" @}' BBS-list +@end group +@group +$ cat phone-list +@print{} 555-5553 +@print{} 555-3412 +@dots{} +@end group +@group +$ cat name-list +@print{} aardvark +@print{} alpo-net +@dots{} +@end group +@end example + +@item print @var{items} >> @var{output-file} +This type of redirection prints the items into the pre-existing output file +@var{output-file}. The difference between this and the +single-@samp{>} redirection is that the old contents (if any) of +@var{output-file} are not erased. Instead, the @code{awk} output is +appended to the file. +If @var{output-file} does not exist, then it is created. 
+ +@cindex pipes for output +@cindex output, piping +@item print @var{items} | @var{command} +It is also possible to send output to another program through a pipe +instead of into a +file. This type of redirection opens a pipe to @var{command} and writes +the values of @var{items} through this pipe, to another process created +to execute @var{command}. + +The redirection argument @var{command} is actually an @code{awk} +expression. Its value is converted to a string, whose contents give the +shell command to be run. + +For example, this produces two files, one unsorted list of BBS names +and one list sorted in reverse alphabetical order: + +@example +awk '@{ print $1 > "names.unsorted" + command = "sort -r > names.sorted" + print $1 | command @}' BBS-list +@end example + +Here the unsorted list is written with an ordinary redirection while +the sorted list is written by piping through the @code{sort} utility. + +This example uses redirection to mail a message to a mailing +list @samp{bug-system}. This might be useful when trouble is encountered +in an @code{awk} script run periodically for system maintenance. + +@example +report = "mail bug-system" +print "Awk script failed:", $0 | report +m = ("at record number " FNR " of " FILENAME) +print m | report +close(report) +@end example + +The message is built using string concatenation and saved in the variable +@code{m}. It is then sent down the pipeline to the @code{mail} program. + +We call the @code{close} function here because it's a good idea to close +the pipe as soon as all the intended output has been sent to it. +@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}, +for more information +on this. This example also illustrates the use of a variable to represent +a @var{file} or @var{command}: it is not necessary to always +use a string constant. Using a variable is generally a good idea, +since @code{awk} requires you to spell the string value identically +every time. 
+@end table + +Redirecting output using @samp{>}, @samp{>>}, or @samp{|} asks the system +to open a file or pipe only if the particular @var{file} or @var{command} +you've specified has not already been written to by your program, or if +it has been closed since it was last written to. + +@cindex differences between @code{gawk} and @code{awk} +@cindex limitations +@cindex implementation limits +@iftex +As mentioned earlier +(@pxref{Getline Summary, , Summary of @code{getline} Variants}), +many +@end iftex +@ifinfo +Many +@end ifinfo +@code{awk} implementations limit the number of pipelines an @code{awk} +program may have open to just one! In @code{gawk}, there is no such limit. +You can open as many pipelines as the underlying operating system will +permit. + +@node Special Files, Close Files And Pipes , Redirection, Printing +@section Special File Names in @code{gawk} +@cindex standard input +@cindex standard output +@cindex standard error output +@cindex file descriptors + +Running programs conventionally have three input and output streams +already available to them for reading and writing. These are known as +the @dfn{standard input}, @dfn{standard output}, and @dfn{standard error +output}. These streams are, by default, connected to your terminal, but +they are often redirected with the shell, via the @samp{<}, @samp{<<}, +@samp{>}, @samp{>>}, @samp{>&} and @samp{|} operators. Standard error +is typically used for writing error messages; the reason we have two separate +streams, standard output and standard error, is so that they can be +redirected separately. + +@cindex differences between @code{gawk} and @code{awk} +In other implementations of @code{awk}, the only way to write an error +message to standard error in an @code{awk} program is as follows: + +@example +print "Serious error detected!" 
| "cat 1>&2" +@end example + +@noindent +This works by opening a pipeline to a shell command which can access the +standard error stream which it inherits from the @code{awk} process. +This is far from elegant, and is also inefficient, since it requires a +separate process. So people writing @code{awk} programs often +neglect to do this. Instead, they send the error messages to the +terminal, like this: + +@example +@group +print "Serious error detected!" > "/dev/tty" +@end group +@end example + +@noindent +This usually has the same effect, but not always: although the +standard error stream is usually the terminal, it can be redirected, and +when that happens, writing to the terminal is not correct. In fact, if +@code{awk} is run from a background job, it may not have a terminal at all. +Then opening @file{/dev/tty} will fail. + +@code{gawk} provides special file names for accessing the three standard +streams. When you redirect input or output in @code{gawk}, if the file name +matches one of these special names, then @code{gawk} directly uses the +stream it stands for. + +@cindex @file{/dev/stdin} +@cindex @file{/dev/stdout} +@cindex @file{/dev/stderr} +@cindex @file{/dev/fd} +@c @cartouche +@table @file +@item /dev/stdin +The standard input (file descriptor 0). + +@item /dev/stdout +The standard output (file descriptor 1). + +@item /dev/stderr +The standard error output (file descriptor 2). + +@item /dev/fd/@var{N} +The file associated with file descriptor @var{N}. Such a file must have +been opened by the program initiating the @code{awk} execution (typically +the shell). Unless you take special pains in the shell from which +you invoke @code{gawk}, only descriptors 0, 1 and 2 are available. +@end table +@c @end cartouche + +The file names @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr} +are aliases for @file{/dev/fd/0}, @file{/dev/fd/1}, and @file{/dev/fd/2}, +respectively, but they are more self-explanatory. 
+ +The proper way to write an error message in a @code{gawk} program +is to use @file{/dev/stderr}, like this: + +@example +print "Serious error detected!" > "/dev/stderr" +@end example + +@code{gawk} also provides special file names that give access to information +about the running @code{gawk} process. Each of these ``files'' provides +a single record of information. To read them more than once, you must +first close them with the @code{close} function +(@pxref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}). +The filenames are: + +@cindex process information +@cindex @file{/dev/pid} +@cindex @file{/dev/pgrpid} +@cindex @file{/dev/ppid} +@cindex @file{/dev/user} +@c @cartouche +@table @file +@item /dev/pid +Reading this file returns the process ID of the current process, +in decimal, terminated with a newline. + +@item /dev/ppid +Reading this file returns the parent process ID of the current process, +in decimal, terminated with a newline. + +@item /dev/pgrpid +Reading this file returns the process group ID of the current process, +in decimal, terminated with a newline. + +@item /dev/user +Reading this file returns a single record terminated with a newline. +The fields are separated with spaces. The fields represent the +following information: + +@table @code +@item $1 +The return value of the @code{getuid} system call +(the real user ID number). + +@item $2 +The return value of the @code{geteuid} system call +(the effective user ID number). + +@item $3 +The return value of the @code{getgid} system call +(the real group ID number). + +@item $4 +The return value of the @code{getegid} system call +(the effective group ID number). +@end table + +If there are any additional fields, they are the group IDs returned by +the @code{getgroups} system call. +(Multiple groups may not be supported on all systems.)
+@end table +@c @end cartouche + +These special file names may be used on the command line as data +files, as well as for I/O redirections within an @code{awk} program. +They may not be used as source files with the @samp{-f} option. + +Recognition of these special file names is disabled if @code{gawk} is in +compatibility mode (@pxref{Options, ,Command Line Options}). + +@strong{Caution}: Unless your system actually has a @file{/dev/fd} directory +(or any of the other above listed special files), +the interpretation of these file names is done by @code{gawk} itself. +For example, using @samp{/dev/fd/4} for output will actually write on +file descriptor 4, and not on a new file descriptor that was @code{dup}'ed +from file descriptor 4. Most of the time this does not matter; however, it +is important to @emph{not} close any of the files related to file descriptors +0, 1, and 2. If you do close one of these files, unpredictable behavior +will result. + +The special files that provide process-related information may disappear +in a future version of @code{gawk}. +@xref{Future Extensions, ,Probable Future Extensions}. + +@node Close Files And Pipes, , Special Files, Printing +@section Closing Input and Output Files and Pipes +@cindex closing input files and pipes +@cindex closing output files and pipes +@findex close + +If the same file name or the same shell command is used with +@code{getline} +(@pxref{Getline, ,Explicit Input with @code{getline}}) +more than once during the execution of an @code{awk} +program, the file is opened (or the command is executed) only the first time. +At that time, the first record of input is read from that file or command. +The next time the same file or command is used in @code{getline}, another +record is read from it, and so on. 
+ +Similarly, when a file or pipe is opened for output, the file name or command +associated with +it is remembered by @code{awk} and subsequent writes to the same file or +command are appended to the previous writes. The file or pipe stays +open until @code{awk} exits. + +This implies that if you want to start reading the same file again from +the beginning, or if you want to rerun a shell command (rather than +reading more output from the command), you must take special steps. +What you must do is use the @code{close} function, as follows: + +@example +close(@var{filename}) +@end example + +@noindent +or + +@example +close(@var{command}) +@end example + +The argument @var{filename} or @var{command} can be any expression. Its +value must @emph{exactly} match the string that was used to open the file or +start the command (spaces and other ``irrelevant'' characters +included). For example, if you open a pipe with this: + +@example +"sort -r names" | getline foo +@end example + +@noindent +then you must close it with this: + +@example +close("sort -r names") +@end example + +Once this function call is executed, the next @code{getline} from that +file or command, or the next @code{print} or @code{printf} to that +file or command, will reopen the file or rerun the command. + +Because the expression that you use to close a file or pipeline must +exactly match the expression used to open the file or run the command, +it is good practice to use a variable to store the file name or command. +The previous example would become + +@example +sortcom = "sort -r names" +sortcom | getline foo +@dots{} +close(sortcom) +@end example + +@noindent +This helps avoid hard-to-find typographical errors in your @code{awk} +programs. + +Here are some reasons why you might need to close an output file: + +@itemize @bullet +@item +To write a file and read it back later on in the same @code{awk} +program. 
Close the file when you are finished writing it; then +you can start reading it with @code{getline}. + +@item +To write numerous files, successively, in the same @code{awk} +program. If you don't close the files, eventually you may exceed a +system limit on the number of open files in one process. So close +each one when you are finished writing it. + +@item +To make a command finish. When you redirect output through a pipe, +the command reading the pipe normally continues to try to read input +as long as the pipe is open. Often this means the command cannot +really do its work until the pipe is closed. For example, if you +redirect output to the @code{mail} program, the message is not +actually sent until the pipe is closed. + +@item +To run the same program a second time, with the same arguments. +This is not the same thing as giving more input to the first run! + +For example, suppose you pipe output to the @code{mail} program. If you +output several lines redirected to this pipe without closing it, they make +a single message of several lines. By contrast, if you close the pipe +after each line of output, then each line makes a separate message. +@end itemize + +@vindex ERRNO +@cindex differences between @code{gawk} and @code{awk} +@code{close} returns a value of zero if the close succeeded. +Otherwise, the value will be non-zero. +In this case, @code{gawk} sets the variable @code{ERRNO} to a string +describing the error that occurred. + +@cindex differences between @code{gawk} and @code{awk} +@cindex portability issues +If you use more files than the system allows you to have open, +@code{gawk} will attempt to multiplex the available open files among +your data files. @code{gawk}'s ability to do this depends upon the +facilities of your operating system: it may not always work. It is +therefore both good practice and good portability advice to always +use @code{close} on your files when you are done with them. 
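The effect of @code{close} on both a command and a file can be seen in a short run. This is only a sketch: it assumes a POSIX-style @code{awk} and @code{mktemp} are available, and the command string and temporary file are hypothetical names chosen for illustration.

```shell
# Read two records from one command, close it, then read again.
out=$(awk 'BEGIN {
    cmd = "echo first; echo second"   # hypothetical command, for illustration only
    cmd | getline a                   # command starts here; reads record 1
    cmd | getline b                   # same pipe is still open; reads record 2
    close(cmd)                        # argument matches the string used to open the pipe
    cmd | getline c                   # command is rerun, so record 1 comes back
    print a, b, c
}')
echo "$out"                           # prints: first second first

# Write a file, close it, and read it back within the same program.
tmp=$(mktemp)                         # hypothetical scratch file
back=$(awk -v f="$tmp" 'BEGIN {
    print "hello" > f
    close(f)                          # finish writing before reading
    if ((getline line < f) > 0)
        print line
}')
rm -f "$tmp"
echo "$back"                          # prints: hello
```

Storing the command and the file name in variables (@code{cmd} and @code{f}) keeps the string passed to @code{close} identical to the one used to open the pipe or file.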
+ +@node Expressions, Patterns and Actions, Printing, Top +@chapter Expressions +@cindex expression + +Expressions are the basic building blocks of @code{awk} patterns +and actions. An expression evaluates to a value, which you can print, test, +store in a variable or pass to a function. Additionally, an expression +can assign a new value to a variable or a field, with an assignment operator. + +An expression can serve as a pattern or action statement on its own. +Most other kinds of +statements contain one or more expressions which specify data on which to +operate. As in other languages, expressions in @code{awk} include +variables, array references, constants, and function calls, as well as +combinations of these with various operators. + +@menu +* Constants:: String, numeric, and regexp constants. +* Using Constant Regexps:: When and how to use a regexp constant. +* Variables:: Variables give names to values for later use. +* Conversion:: The conversion of strings to numbers and vice + versa. +* Arithmetic Ops:: Arithmetic operations (@samp{+}, @samp{-}, + etc.) +* Concatenation:: Concatenating strings. +* Assignment Ops:: Changing the value of a variable or a field. +* Increment Ops:: Incrementing the numeric value of a variable. +* Truth Values:: What is ``true'' and what is ``false''. +* Typing and Comparison:: How variables acquire types, and how this + affects comparison of numbers and strings with + @samp{<}, etc. +* Boolean Ops:: Combining comparison expressions using boolean + operators @samp{||} (``or''), @samp{&&} + (``and'') and @samp{!} (``not''). +* Conditional Exp:: Conditional expressions select between two + subexpressions under control of a third + subexpression. +* Function Calls:: A function call is an expression. +* Precedence:: How various operators nest. 
+@end menu + +@node Constants, Using Constant Regexps, Expressions, Expressions +@section Constant Expressions +@cindex constants, types of +@cindex string constants + +The simplest type of expression is the @dfn{constant}, which always has +the same value. There are three types of constants: numeric constants, +string constants, and regular expression constants. + +@menu +* Scalar Constants:: Numeric and string constants. +* Regexp Constants:: Regular Expression constants. +@end menu + +@node Scalar Constants, Regexp Constants, Constants, Constants +@subsection Numeric and String Constants + +@cindex numeric constant +@cindex numeric value +A @dfn{numeric constant} stands for a number. This number can be an +integer, a decimal fraction, or a number in scientific (exponential) +notation.@footnote{The internal representation uses double-precision +floating point numbers. If you don't know what that means, then don't +worry about it.} Here are some examples of numeric constants, which all +have the same value: + +@example +105 +1.05e+2 +1050e-1 +@end example + +A string constant consists of a sequence of characters enclosed in +double-quote marks. For example: + +@example +"parrot" +@end example + +@noindent +@cindex differences between @code{gawk} and @code{awk} +represents the string whose contents are @samp{parrot}. Strings in +@code{gawk} can be of any length and they can contain any of the possible +eight-bit ASCII characters including ASCII NUL (character code zero). +Other @code{awk} +implementations may have difficulty with some character codes. + +@node Regexp Constants, , Scalar Constants, Constants +@subsection Regular Expression Constants + +@cindex @code{~} operator +@cindex @code{!~} operator +A regexp constant is a regular expression description enclosed in +slashes, such as @code{@w{/^beginning and end$/}}. 
Most regexps used in +@code{awk} programs are constant, but the @samp{~} and @samp{!~} +matching operators can also match computed or ``dynamic'' regexps +(which are just ordinary strings or variables that contain a regexp). + +@node Using Constant Regexps, Variables, Constants, Expressions +@section Using Regular Expression Constants + +When used on the right hand side of the @samp{~} or @samp{!~} +operators, a regexp constant merely stands for the regexp that is to be +matched. + +@cindex dark corner +Regexp constants (such as @code{/foo/}) may be used like simple expressions. +When a +regexp constant appears by itself, it has the same meaning as if it appeared +in a pattern, i.e.@: @samp{($0 ~ /foo/)} (d.c.) +(@pxref{Expression Patterns, ,Expressions as Patterns}). +This means that the two code segments, + +@example +if ($0 ~ /barfly/ || $0 ~ /camelot/) + print "found" +@end example + +@noindent +and + +@example +if (/barfly/ || /camelot/) + print "found" +@end example + +@noindent +are exactly equivalent. + +One rather bizarre consequence of this rule is that the following +boolean expression is valid, but does not do what the user probably +intended: + +@example +# note that /foo/ is on the left of the ~ +if (/foo/ ~ $1) print "found foo" +@end example + +@noindent +This code is ``obviously'' testing @code{$1} for a match against the regexp +@code{/foo/}. But in fact, the expression @samp{/foo/ ~ $1} actually means +@samp{($0 ~ /foo/) ~ $1}. In other words, first match the input record +against the regexp @code{/foo/}. The result will be either zero or one, +depending upon the success or failure of the match. Then match that result +against the first field in the record. + +Since it is unlikely that you would ever really wish to make this kind of +test, @code{gawk} will issue a warning when it sees this construct in +a program. 
+ +Another consequence of this rule is that the assignment statement + +@example +matches = /foo/ +@end example + +@noindent +will assign either zero or one to the variable @code{matches}, depending +upon the contents of the current input record. + +This feature of the language was never well documented until the +POSIX specification. + +@cindex differences between @code{gawk} and @code{awk} +@cindex dark corner +Constant regular expressions are also used as the first argument for +the @code{gensub}, @code{sub} and @code{gsub} functions, and as the +second argument of the @code{match} function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). +Modern implementations of @code{awk}, including @code{gawk}, allow +the third argument of @code{split} to be a regexp constant, while some +older implementations do not (d.c.). + +This can lead to confusion when attempting to use regexp constants +as arguments to user defined functions +(@pxref{User-defined, , User-defined Functions}). +For example: + +@example +@group +function mysub(pat, repl, str, global) +@{ + if (global) + gsub(pat, repl, str) + else + sub(pat, repl, str) + return str +@} +@end group + +@group +@{ + @dots{} + text = "hi! hi yourself!" + mysub(/hi/, "howdy", text, 1) + @dots{} +@} +@end group +@end example + +In this example, the programmer wishes to pass a regexp constant to the +user-defined function @code{mysub}, which will in turn pass it on to +either @code{sub} or @code{gsub}. However, what really happens is that +the @code{pat} parameter will be either one or zero, depending upon whether +or not @code{$0} matches @code{/hi/}. + +As it is unlikely that you would ever really wish to pass a truth value +in this way, @code{gawk} will issue a warning when it sees a regexp +constant used as a parameter to a user-defined function. 
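One way around this, sketched here rather than prescribed, is to pass the pattern as a string so that it acts as a dynamic regexp inside the function. The example assumes a POSIX-style @code{awk}; it reuses the @code{mysub} function from above and also prints the result of @samp{matches = /hi/} for comparison.

```shell
out=$(echo "hi! hi yourself!" | awk '
# pat is expected to be a string holding a regexp (a dynamic regexp).
function mysub(pat, repl, str, global)
{
    if (global)
        gsub(pat, repl, str)
    else
        sub(pat, repl, str)
    return str
}
{
    matches = /hi/                      # same as: matches = ($0 ~ /hi/)
    print matches                       # 1, because the record contains "hi"
    print mysub("hi", "howdy", $0, 1)   # the pattern travels as a string
}')
echo "$out"
```

Because @code{"hi"} is an ordinary string, it reaches @code{gsub} unchanged instead of being reduced to a truth value first.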
+ +@node Variables, Conversion, Using Constant Regexps, Expressions +@section Variables + +Variables are ways of storing values at one point in your program for +use later in another part of your program. You can manipulate them +entirely within your program text, and you can also assign values to +them on the @code{awk} command line. + +@menu +* Using Variables:: Using variables in your programs. +* Assignment Options:: Setting variables on the command line and a + summary of command line syntax. This is an + advanced method of input. +@end menu + +@node Using Variables, Assignment Options, Variables, Variables +@subsection Using Variables in a Program + +@cindex variables, user-defined +@cindex user-defined variables +Variables let you give names to values and refer to them later. You have +already seen variables in many of the examples. The name of a variable +must be a sequence of letters, digits and underscores, but it may not begin +with a digit. Case is significant in variable names; @code{a} and @code{A} +are distinct variables. + +A variable name is a valid expression by itself; it represents the +variable's current value. Variables are given new values with +@dfn{assignment operators}, @dfn{increment operators} and +@dfn{decrement operators}. +@xref{Assignment Ops, ,Assignment Expressions}. + +A few variables have special built-in meanings, such as @code{FS}, the +field separator, and @code{NF}, the number of fields in the current +input record. @xref{Built-in Variables}, for a list of them. These +built-in variables can be used and assigned just like all other +variables, but their values are also used or changed automatically by +@code{awk}. All built-in variable names are entirely upper-case. + +Variables in @code{awk} can be assigned either numeric or string +values. By default, variables are initialized to the empty string, which +is zero if converted to a number.
There is no need to +``initialize'' each variable explicitly in @code{awk}, +the way you would in C and in most other traditional languages. + +@node Assignment Options, , Using Variables, Variables +@subsection Assigning Variables on the Command Line + +You can set any @code{awk} variable by including a @dfn{variable assignment} +among the arguments on the command line when you invoke @code{awk} +(@pxref{Other Arguments, ,Other Command Line Arguments}). Such an assignment has +this form: + +@example +@var{variable}=@var{text} +@end example + +@noindent +With it, you can set a variable either at the beginning of the +@code{awk} run or in between input files. + +If you precede the assignment with the @samp{-v} option, like this: + +@example +-v @var{variable}=@var{text} +@end example + +@noindent +then the variable is set at the very beginning, before even the +@code{BEGIN} rules are run. The @samp{-v} option and its assignment +must precede all the file name arguments, as well as the program text. +(@xref{Options, ,Command Line Options}, for more information about +the @samp{-v} option.) + +Otherwise, the variable assignment is performed at a time determined by +its position among the input file arguments: after the processing of the +preceding input file argument. For example: + +@example +awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list +@end example + +@noindent +prints the value of field number @code{n} for all input records. Before +the first file is read, the command line sets the variable @code{n} +equal to four. This causes the fourth field to be printed in lines from +the file @file{inventory-shipped}. After the first file has finished, +but before the second file is started, @code{n} is set to two, so that the +second field is printed in lines from @file{BBS-list}. 
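The difference in timing between @samp{-v} and a plain command line assignment can be checked without the sample data files. The following is a rough sketch, assuming a POSIX-style @code{awk} and @code{mktemp}; the two input files @file{one} and @file{two} are hypothetical and are created only for this run.

```shell
dir=$(mktemp -d)                         # hypothetical scratch directory
printf 'a1 a2 a3\n' > "$dir/one"
printf 'b1 b2 b3\n' > "$dir/two"

# -v sets n before BEGIN runs; each later n=... takes effect
# just before the file argument that follows it is read.
out=$(awk -v n=1 'BEGIN { print "BEGIN sees n = " n }
                  { print $n }' n=2 "$dir/one" n=3 "$dir/two")
rm -r "$dir"
echo "$out"
```

The @code{BEGIN} rule sees the value given with @samp{-v}, the record from @file{one} is printed with @code{n} equal to two, and the record from @file{two} with @code{n} equal to three.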
+ +@example +@group +$ awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list +@print{} 15 +@print{} 24 +@dots{} +@print{} 555-5553 +@print{} 555-3412 +@dots{} +@end group +@end example + +Command line arguments are made available for explicit examination by +the @code{awk} program in an array named @code{ARGV} +(@pxref{ARGC and ARGV, ,Using @code{ARGC} and @code{ARGV}}). + +@cindex dark corner +@code{awk} processes the values of command line assignments for escape +sequences (d.c.) (@pxref{Escape Sequences}). + +@node Conversion, Arithmetic Ops, Variables, Expressions +@section Conversion of Strings and Numbers + +@cindex conversion of strings and numbers +Strings are converted to numbers, and numbers to strings, if the context +of the @code{awk} program demands it. For example, if the value of +either @code{foo} or @code{bar} in the expression @samp{foo + bar} +happens to be a string, it is converted to a number before the addition +is performed. If numeric values appear in string concatenation, they +are converted to strings. Consider this: + +@example +two = 2; three = 3 +print (two three) + 4 +@end example + +@noindent +This prints the (numeric) value 27. The numeric values of +the variables @code{two} and @code{three} are converted to strings and +concatenated together, and the resulting string is converted back to the +number 23, to which four is then added. + +@cindex null string +@cindex empty string +@cindex type conversion +If, for some reason, you need to force a number to be converted to a +string, concatenate the empty string, @code{""}, with that number. +To force a string to be converted to a number, add zero to that string. + +A string is converted to a number by interpreting any numeric prefix +of the string as numerals: +@code{"2.5"} converts to 2.5, @code{"1e3"} converts to 1000, and @code{"25fix"} +has a numeric value of 25. +Strings that can't be interpreted as valid numbers are converted to +zero. 
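These conversion rules can be tried directly. The sketch below assumes a POSIX-style @code{awk}; the variable names are the ones used above, and @code{"fix25"} is an extra, illustrative string with no numeric prefix.

```shell
out=$(awk 'BEGIN {
    two = 2; three = 3
    sum = (two three) + 4     # concatenation gives "23", then 23 + 4
    print sum                 # 27
    print "2.5" + 0           # 2.5
    print "1e3" + 0           # 1000
    print "25fix" + 0         # 25: only the numeric prefix is used
    print "fix25" + 0         # 0: no numeric prefix at all
}')
echo "$out"
```

Adding zero is the idiom described above for forcing a string to be treated as a number.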
+ +@vindex CONVFMT +The exact manner in which numbers are converted into strings is controlled +by the @code{awk} built-in variable @code{CONVFMT} (@pxref{Built-in Variables}). +Numbers are converted using the @code{sprintf} function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}) +with @code{CONVFMT} as the format +specifier. + +@code{CONVFMT}'s default value is @code{"%.6g"}, which prints a value with +at most six significant digits. For some applications you will want to +change it to specify more precision. Double precision on most modern +machines gives you 16 or 17 decimal digits of precision. + +Strange results can happen if you set @code{CONVFMT} to a string that doesn't +tell @code{sprintf} how to format floating point numbers in a useful way. +For example, if you forget the @samp{%} in the format, all numbers will be +converted to the same constant string. + +@cindex dark corner +As a special case, if a number is an integer, then the result of converting +it to a string is @emph{always} an integer, no matter what the value of +@code{CONVFMT} may be. Given the following code fragment: + +@example +CONVFMT = "%2.2f" +a = 12 +b = a "" +@end example + +@noindent +@code{b} has the value @code{"12"}, not @code{"12.00"} (d.c.). + +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +@vindex OFMT +Prior to the POSIX standard, @code{awk} specified that the value +of @code{OFMT} was used for converting numbers to strings. @code{OFMT} +specifies the output format to use when printing numbers with @code{print}. +@code{CONVFMT} was introduced in order to separate the semantics of +conversion from the semantics of printing. Both @code{CONVFMT} and +@code{OFMT} have the same default value: @code{"%.6g"}. In the vast majority +of cases, old @code{awk} programs will not change their behavior.
+However, this use of @code{OFMT} is something to keep in mind if you must +port your program to other implementations of @code{awk}; we recommend +that instead of changing your programs, you just port @code{gawk} itself! +@xref{Print, ,The @code{print} Statement}, +for more information on the @code{print} statement. + +@node Arithmetic Ops, Concatenation, Conversion, Expressions +@section Arithmetic Operators +@cindex arithmetic operators +@cindex operators, arithmetic +@cindex addition +@cindex subtraction +@cindex multiplication +@cindex division +@cindex remainder +@cindex quotient +@cindex exponentiation + +The @code{awk} language uses the common arithmetic operators when +evaluating expressions. All of these arithmetic operators follow normal +precedence rules, and work as you would expect them to. + +Here is a file @file{grades} containing a list of student names and +three test scores per student (it's a small class): + +@example +Pat 100 97 58 +Sandy 84 72 93 +Chris 72 92 89 +@end example + +@noindent +This program takes the file @file{grades}, and prints the average +of the scores. + +@example +$ awk '@{ sum = $2 + $3 + $4 ; avg = sum / 3 +> print $1, avg @}' grades +@print{} Pat 85 +@print{} Sandy 83 +@print{} Chris 84.3333 +@end example + +This table lists the arithmetic operators in @code{awk}, in order from +highest precedence to lowest: + +@c @cartouche +@table @code +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +@item @var{x} ^ @var{y} +@itemx @var{x} ** @var{y} +Exponentiation: @var{x} raised to the @var{y} power. @samp{2 ^ 3} has +the value eight. The character sequence @samp{**} is equivalent to +@samp{^}. (The POSIX standard only specifies the use of @samp{^} +for exponentiation.) + +@item - @var{x} +Negation. + +@item + @var{x} +Unary plus. The expression is converted to a number. + +@item @var{x} * @var{y} +Multiplication. + +@item @var{x} / @var{y} +Division.
Since all numbers in @code{awk} are +real numbers, the result is not rounded to an integer: @samp{3 / 4} +has the value 0.75. + +@item @var{x} % @var{y} +@cindex differences between @code{gawk} and @code{awk} +Remainder. The quotient is rounded toward zero to an integer, +multiplied by @var{y} and this result is subtracted from @var{x}. +This operation is sometimes known as ``trunc-mod.'' The following +relation always holds: + +@example +b * int(a / b) + (a % b) == a +@end example + +One possibly undesirable effect of this definition of remainder is that +@code{@var{x} % @var{y}} is negative if @var{x} is negative. Thus, + +@example +-17 % 8 = -1 +@end example + +In other @code{awk} implementations, the signedness of the remainder +may be machine dependent. +@c !!! what does posix say? + +@item @var{x} + @var{y} +Addition. + +@item @var{x} - @var{y} +Subtraction. +@end table +@c @end cartouche + +For maximum portability, do not use the @samp{**} operator. + +Unary plus and minus have the same precedence, +the multiplication operators all have the same precedence, and +addition and subtraction have the same precedence. + +@node Concatenation, Assignment Ops, Arithmetic Ops, Expressions +@section String Concatenation +@cindex Kernighan, Brian +@display +@i{It seemed like a good idea at the time.} +Brian Kernighan +@end display +@sp 1 + +@cindex string operators +@cindex operators, string +@cindex concatenation +There is only one string operation: concatenation. It does not have a +specific operator to represent it. Instead, concatenation is performed by +writing expressions next to one another, with no operator. For example: + +@example +@group +$ awk '@{ print "Field number one: " $1 @}' BBS-list +@print{} Field number one: aardvark +@print{} Field number one: alpo-net +@dots{} +@end group +@end example + +Without the space in the string constant after the @samp{:}, the line +would run together. 
For example: + +@example +@group +$ awk '@{ print "Field number one:" $1 @}' BBS-list +@print{} Field number one:aardvark +@print{} Field number one:alpo-net +@dots{} +@end group +@end example + +Since string concatenation does not have an explicit operator, it is +often necessary to ensure that it happens where you want it to by +using parentheses to enclose +the items to be concatenated. For example, the +following code fragment does not concatenate @code{file} and @code{name} +as you might expect: + +@example +@group +file = "file" +name = "name" +print "something meaningful" > file name +@end group +@end example + +@noindent +It is necessary to use the following: + +@example +print "something meaningful" > (file name) +@end example + +We recommend that you use parentheses around concatenation in all but the +most common contexts (such as on the right-hand side of @samp{=}). + +@node Assignment Ops, Increment Ops, Concatenation, Expressions +@section Assignment Expressions +@cindex assignment operators +@cindex operators, assignment +@cindex expression, assignment + +An @dfn{assignment} is an expression that stores a new value into a +variable. For example, let's assign the value one to the variable +@code{z}: + +@example +z = 1 +@end example + +After this expression is executed, the variable @code{z} has the value one. +Whatever old value @code{z} had before the assignment is forgotten. + +Assignments can store string values also. For example, this would store +the value @code{"this food is good"} in the variable @code{message}: + +@example +thing = "food" +predicate = "good" +message = "this " thing " is " predicate +@end example + +@noindent +(This also illustrates string concatenation.) + +The @samp{=} sign is called an @dfn{assignment operator}. It is the +simplest assignment operator because the value of the right-hand +operand is stored unchanged.
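The concatenation and assignment rules described so far can be tried from the shell. The following is only a minimal sketch: the shell function name @code{concat_demo} is made up for illustration, and the @code{BEGIN} rule is used so that no input file is needed.

```shell
# concat_demo is a hypothetical helper name, not part of awk itself.
concat_demo() {
    awk 'BEGIN {
        thing = "food"
        predicate = "good"
        # Operands written side by side are concatenated; "=" then
        # stores the resulting string unchanged in message.
        message = "this " thing " is " predicate
        print message
        # Parentheses make the intended grouping explicit when
        # arithmetic and concatenation are mixed.
        print "sum: " (1 + 2)
    }'
}

concat_demo
```

The parentheses around the sum are not strictly required here, but they follow the recommendation to parenthesize anything that is combined with concatenation.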
+ +@cindex side effect +Most operators (addition, concatenation, and so on) have no effect +except to compute a value. If you ignore the value, you might as well +not use the operator. An assignment operator is different; it does +produce a value, but even if you ignore the value, the assignment still +makes itself felt through the alteration of the variable. We call this +a @dfn{side effect}. + +@cindex lvalue +@cindex rvalue +The left-hand operand of an assignment need not be a variable +(@pxref{Variables}); it can also be a field +(@pxref{Changing Fields, ,Changing the Contents of a Field}) or +an array element (@pxref{Arrays, ,Arrays in @code{awk}}). +These are all called @dfn{lvalues}, +which means they can appear on the left-hand side of an assignment operator. +The right-hand operand may be any expression; it produces the new value +which the assignment stores in the specified variable, field or array +element. (Such values are called @dfn{rvalues}). + +@cindex types of variables +It is important to note that variables do @emph{not} have permanent types. +The type of a variable is simply the type of whatever value it happens +to hold at the moment. In the following program fragment, the variable +@code{foo} has a numeric value at first, and a string value later on: + +@example +@group +foo = 1 +print foo +foo = "bar" +print foo +@end group +@end example + +@noindent +When the second assignment gives @code{foo} a string value, the fact that +it previously had a numeric value is forgotten. + +String values that do not begin with a digit have a numeric value of +zero. After executing this code, the value of @code{foo} is five: + +@example +foo = "a string" +foo = foo + 5 +@end example + +@noindent +(Note that using a variable as a number and then later as a string can +be confusing and is poor programming style. The above examples illustrate how +@code{awk} works, @emph{not} how you should write your own programs!) 
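The two fragments above can be combined into one runnable sketch that shows a variable changing type as new values are assigned. The shell function name @code{type_demo} is an assumption made for this illustration only.

```shell
# type_demo is a hypothetical helper name used only for this sketch.
type_demo() {
    awk 'BEGIN {
        foo = 1
        print foo            # foo holds a numeric value
        foo = "bar"
        print foo            # foo now holds a string value
        foo = "a string"
        foo = foo + 5        # "a string" has the numeric value zero
        print foo            # so the sum is five
    }'
}

type_demo
```

As the text warns, this shows how @code{awk} behaves, not a style to imitate.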
+ +An assignment is an expression, so it has a value: the same value that +is assigned. Thus, @samp{z = 1} as an expression has the value one. +One consequence of this is that you can write multiple assignments together: + +@example +x = y = z = 0 +@end example + +@noindent +stores the value zero in all three variables. It does this because the +value of @samp{z = 0}, which is zero, is stored into @code{y}, and then +the value of @samp{y = z = 0}, which is zero, is stored into @code{x}. + +You can use an assignment anywhere an expression is called for. For +example, it is valid to write @samp{x != (y = 1)} to set @code{y} to one +and then test whether @code{x} equals one. But this style tends to make +programs hard to read; except in a one-shot program, you should +not use such nesting of assignments. + +Aside from @samp{=}, there are several other assignment operators that +do arithmetic with the old value of the variable. For example, the +operator @samp{+=} computes a new value by adding the right-hand value +to the old value of the variable. Thus, the following assignment adds +five to the value of @code{foo}: + +@example +foo += 5 +@end example + +@noindent +This is equivalent to the following: + +@example +foo = foo + 5 +@end example + +@noindent +Use whichever one makes the meaning of your program clearer. + +There are situations where using @samp{+=} (or any assignment operator) +is @emph{not} the same as simply repeating the left-hand operand in the +right-hand expression. For example: + +@cindex Rankin, Pat +@example +@group +# Thanks to Pat Rankin for this example +BEGIN @{ + foo[rand()] += 5 + for (x in foo) + print x, foo[x] + + bar[rand()] = bar[rand()] + 5 + for (x in bar) + print x, bar[x] +@} +@end group +@end example + +@noindent +The indices of @code{bar} are guaranteed to be different, because +@code{rand} will return different values each time it is called. +(Arrays and the @code{rand} function haven't been covered yet. 
+@xref{Arrays, ,Arrays in @code{awk}}, +and see @ref{Numeric Functions, ,Numeric Built-in Functions}, for more information). +This example illustrates an important fact about the assignment +operators: the left-hand expression is only evaluated @emph{once}. + +It is also up to the implementation as to which expression is evaluated +first, the left-hand one or the right-hand one. +Consider this example: + +@example +i = 1 +a[i += 2] = i + 1 +@end example + +@noindent +The value of @code{a[3]} could be either two or four. + +Here is a table of the arithmetic assignment operators. In each +case, the right-hand operand is an expression whose value is converted +to a number. + +@c @cartouche +@table @code +@item @var{lvalue} += @var{increment} +Adds @var{increment} to the value of @var{lvalue} to make the new value +of @var{lvalue}. + +@item @var{lvalue} -= @var{decrement} +Subtracts @var{decrement} from the value of @var{lvalue}. + +@item @var{lvalue} *= @var{coefficient} +Multiplies the value of @var{lvalue} by @var{coefficient}. + +@item @var{lvalue} /= @var{divisor} +Divides the value of @var{lvalue} by @var{divisor}. + +@item @var{lvalue} %= @var{modulus} +Sets @var{lvalue} to its remainder by @var{modulus}. + +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +@item @var{lvalue} ^= @var{power} +@itemx @var{lvalue} **= @var{power} +Raises @var{lvalue} to the power @var{power}. +(Only the @samp{^=} operator is specified by POSIX.) +@end table +@c @end cartouche + +For maximum portability, do not use the @samp{**=} operator. + +@node Increment Ops, Truth Values, Assignment Ops, Expressions +@section Increment and Decrement Operators + +@cindex increment operators +@cindex operators, increment +@dfn{Increment} and @dfn{decrement operators} increase or decrease the value of +a variable by one. 
You could do the same thing with an assignment operator, so +the increment operators add no power to the @code{awk} language; but they +are convenient abbreviations for very common operations. + +The operator to add one is written @samp{++}. It can be used to increment +a variable either before or after taking its value. + +To pre-increment a variable @var{v}, write @samp{++@var{v}}. This adds +one to the value of @var{v} and that new value is also the value of this +expression. The assignment expression @samp{@var{v} += 1} is completely +equivalent. + +Writing the @samp{++} after the variable specifies post-increment. This +increments the variable value just the same; the difference is that the +value of the increment expression itself is the variable's @emph{old} +value. Thus, if @code{foo} has the value four, then the expression @samp{foo++} +has the value four, but it changes the value of @code{foo} to five. + +The post-increment @samp{foo++} is nearly equivalent to writing @samp{(foo ++= 1) - 1}. It is not perfectly equivalent because all numbers in +@code{awk} are floating point: in floating point, @samp{foo + 1 - 1} does +not necessarily equal @code{foo}. But the difference is minute as +long as you stick to numbers that are fairly small (less than 10e12). + +Any lvalue can be incremented. Fields and array elements are incremented +just like variables. (Use @samp{$(i++)} when you wish to do a field reference +and a variable increment at the same time. The parentheses are necessary +because of the precedence of the field reference operator, @samp{$}.) + +@cindex decrement operators +@cindex operators, decrement +The decrement operator @samp{--} works just like @samp{++} except that +it subtracts one instead of adding. Like @samp{++}, it can be used before +the lvalue to pre-decrement or after it to post-decrement. + +Here is a summary of increment and decrement expressions. 
+ +@c @cartouche +@table @code +@item ++@var{lvalue} +This expression increments @var{lvalue} and the new value becomes the +value of the expression. + +@item @var{lvalue}++ +This expression increments @var{lvalue}, but +the value of the expression is the @emph{old} value of @var{lvalue}. + +@item --@var{lvalue} +Like @samp{++@var{lvalue}}, but instead of adding, it subtracts. It +decrements @var{lvalue} and delivers the value that results. + +@item @var{lvalue}-- +Like @samp{@var{lvalue}++}, but instead of adding, it subtracts. It +decrements @var{lvalue}. The value of the expression is the @emph{old} +value of @var{lvalue}. +@end table +@c @end cartouche + +@node Truth Values, Typing and Comparison, Increment Ops, Expressions +@section True and False in @code{awk} +@cindex truth values +@cindex logical true +@cindex logical false + +Many programming languages have a special representation for the concepts +of ``true'' and ``false.'' Such languages usually use the special +constants @code{true} and @code{false}, or perhaps their upper-case +equivalents. + +@cindex null string +@cindex empty string +@code{awk} is different. It borrows a very simple concept of true and +false from C. In @code{awk}, any non-zero numeric value, @emph{or} any +non-empty string value is true. Any other value (zero or the null +string, @code{""}) is false. The following program will print @samp{A strange +truth value} three times: + +@example +@group +BEGIN @{ + if (3.1415927) + print "A strange truth value" + if ("Four Score And Seven Years Ago") + print "A strange truth value" + if (j = 57) + print "A strange truth value" +@} +@end group +@end example + +@cindex dark corner +There is a surprising consequence of the ``non-zero or non-null'' rule: +The string constant @code{"0"} is actually true, since it is non-null (d.c.). 
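The truth-value rule, including the dark corner for the string constant @code{"0"}, can be checked with a short sketch such as the following. The shell function name @code{truth_demo} is hypothetical; the conditional operator @samp{?:} used here is described later in this chapter.

```shell
# truth_demo is a hypothetical helper name used only for this sketch.
truth_demo() {
    awk 'BEGIN {
        # The numeric constant zero is false.
        print "number 0 is " (0 ? "true" : "false")
        # The null string is false.
        print "null string is " ("" ? "true" : "false")
        # The string constant "0" is non-null, and therefore true.
        print "string \"0\" is " ("0" ? "true" : "false")
    }'
}

truth_demo
```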
+ +@node Typing and Comparison, Boolean Ops, Truth Values, Expressions +@section Variable Typing and Comparison Expressions +@cindex comparison expressions +@cindex expression, comparison +@cindex expression, matching +@cindex relational operators +@cindex operators, relational +@cindex regexp match/non-match operators +@cindex variable typing +@cindex types of variables +@c 2e: consider splitting this section into subsections +@display +@i{The Guide is definitive. Reality is frequently inaccurate.} +The Hitchhiker's Guide to the Galaxy +@end display +@sp 1 + +Unlike other programming languages, @code{awk} variables do not have a +fixed type. Instead, they can be either a number or a string, depending +upon the value that is assigned to them. + +@cindex numeric string +The 1992 POSIX standard introduced +the concept of a @dfn{numeric string}, which is simply a string that looks +like a number, for example, @code{@w{" +2"}}. This concept is used +for determining the type of a variable. + +The type of the variable is important, since the types of two variables +determine how they are compared. + +In @code{gawk}, variable typing follows these rules. + +@enumerate 1 +@item +A numeric literal or the result of a numeric operation has the @var{numeric} +attribute. + +@item +A string literal or the result of a string operation has the @var{string} +attribute. + +@item +Fields, @code{getline} input, @code{FILENAME}, @code{ARGV} elements, +@code{ENVIRON} elements and the +elements of an array created by @code{split} that are numeric strings +have the @var{strnum} attribute. Otherwise, they have the @var{string} +attribute. +Uninitialized variables also have the @var{strnum} attribute. + +@item +Attributes propagate across assignments, but are not changed by +any use. +@c (Although a use may cause the entity to acquire an additional +@c value such that it has both a numeric and string value -- this leaves the +@c attribute unchanged.) 
+@c This is important but not relevant +@end enumerate + +The last rule is particularly important. In the following program, +@code{a} has numeric type, even though it is later used in a string +operation. + +@example +BEGIN @{ + a = 12.345 + b = a " is a cute number" + print b +@} +@end example + +When two operands are compared, either string comparison or numeric comparison +may be used, depending on the attributes of the operands, according to the +following, symmetric, matrix: + +@c thanks to Karl Berry, kb@cs.umb.edu, for major help with TeX tables +@tex +\centerline{ +\vbox{\bigskip % space above the table (about 1 linespace) +% Because we have vertical rules, we can't let TeX insert interline space +% in its usual way. +\offinterlineskip +% +% Define the table template. & separates columns, and \cr ends the +% template (and each row). # is replaced by the text of that entry on +% each row. The template for the first column breaks down like this: +% \strut -- a way to make each line have the height and depth +% of a normal line of type, since we turned off interline spacing. +% \hfil -- infinite glue; has the effect of right-justifying in this case. +% # -- replaced by the text (for instance, `STRNUM', in the last row). +% \quad -- about the width of an `M'. Just separates the columns. +% +% The second column (\vrule#) is what generates the vertical rule that +% spans table rows. +% +% The doubled && before the next entry means `repeat the following +% template as many times as necessary on each line' -- in our case, twice. +% +% The template itself, \quad#\hfil, left-justifies with a little space before. +% +\halign{\strut\hfil#\quad&\vrule#&&\quad#\hfil\cr + &&STRING &NUMERIC &STRNUM\cr +% The \omit tells TeX to skip inserting the template for this column on +% this particular row. In this case, we only want a little extra space +% to separate the heading row from the rule below it. the depth 2pt -- +% `\vrule depth 2pt' is that little space. 
+\omit &depth 2pt\cr +% This is the horizontal rule below the heading. Since it has nothing to +% do with the columns of the table, we use \noalign to get it in there. +\noalign{\hrule} +% Like above, this time a little more space. +\omit &depth 4pt\cr +% The remaining rows have nothing special about them. +STRING &&string &string &string\cr +NUMERIC &&string &numeric &numeric\cr +STRNUM &&string &numeric &numeric\cr +}}} +@end tex +@ifinfo +@display + +---------------------------------------------- + | STRING NUMERIC STRNUM +--------+---------------------------------------------- + | +STRING | string string string + | +NUMERIC | string numeric numeric + | +STRNUM | string numeric numeric +--------+---------------------------------------------- +@end display +@end ifinfo + +The basic idea is that user input that looks numeric, and @emph{only} +user input, should be treated as numeric, even though it is actually +made of characters, and is therefore also a string. + +@dfn{Comparison expressions} compare strings or numbers for +relationships such as equality. They are written using @dfn{relational +operators}, which are a superset of those in C. Here is a table of +them: + +@cindex relational operators +@cindex operators, relational +@cindex @code{<} operator +@cindex @code{<=} operator +@cindex @code{>} operator +@cindex @code{>=} operator +@cindex @code{==} operator +@cindex @code{!=} operator +@cindex @code{~} operator +@cindex @code{!~} operator +@cindex @code{in} operator +@c @cartouche +@table @code +@item @var{x} < @var{y} +True if @var{x} is less than @var{y}. + +@item @var{x} <= @var{y} +True if @var{x} is less than or equal to @var{y}. + +@item @var{x} > @var{y} +True if @var{x} is greater than @var{y}. + +@item @var{x} >= @var{y} +True if @var{x} is greater than or equal to @var{y}. + +@item @var{x} == @var{y} +True if @var{x} is equal to @var{y}. + +@item @var{x} != @var{y} +True if @var{x} is not equal to @var{y}. 
+ +@item @var{x} ~ @var{y} +True if the string @var{x} matches the regexp denoted by @var{y}. + +@item @var{x} !~ @var{y} +True if the string @var{x} does not match the regexp denoted by @var{y}. + +@item @var{subscript} in @var{array} +True if the array @var{array} has an element with the subscript @var{subscript}. +@end table +@c @end cartouche + +Comparison expressions have the value one if true and zero if false. + +When comparing operands of mixed types, numeric operands are converted +to strings using the value of @code{CONVFMT} +(@pxref{Conversion, ,Conversion of Strings and Numbers}). + +Strings are compared +by comparing the first character of each, then the second character of each, +and so on. Thus @code{"10"} is less than @code{"9"}. If there are two +strings where one is a prefix of the other, the shorter string is less than +the longer one. Thus @code{"abc"} is less than @code{"abcd"}. + +@cindex common mistakes +@cindex mistakes, common +@cindex errors, common +It is very easy to accidentally mistype the @samp{==} operator, and +leave off one of the @samp{=}s. The result is still valid @code{awk} +code, but the program will not do what you mean: + +@example +if (a = b) # oops! should be a == b + @dots{} +else + @dots{} +@end example + +@noindent +Unless @code{b} happens to be zero or the null string, the @code{if} +part of the test will always succeed. Because the operators are +so similar, this kind of error is very difficult to spot when +scanning the source code. + +Here are some sample expressions, how @code{gawk} compares them, and what +the result of the comparison is. 
+ +@table @code +@item 1.5 <= 2.0 +numeric comparison (true) + +@item "abc" >= "xyz" +string comparison (false) + +@item 1.5 != " +2" +string comparison (true) + +@item "1e2" < "3" +string comparison (true) + +@item a = 2; b = "2" +@itemx a == b +string comparison (true) + +@item a = 2; b = " +2" +@itemx a == b +string comparison (false) +@end table + +In this example, + +@example +@group +$ echo 1e2 3 | awk '@{ print ($1 < $2) ? "true" : "false" @}' +@print{} false +@end group +@end example + +@noindent +the result is @samp{false} since both @code{$1} and @code{$2} are numeric +strings and thus both have the @var{strnum} attribute, +dictating a numeric comparison. + +The purpose of the comparison rules and the use of numeric strings is +to attempt to produce the behavior that is ``least surprising,'' while +still ``doing the right thing.'' + +@cindex comparisons, string vs. regexp +@cindex string comparison vs. regexp comparison +@cindex regexp comparison vs. string comparison +String comparisons and regular expression comparisons are very different. +For example, + +@example +x == "foo" +@end example + +@noindent +has the value of one, or is true, if the variable @code{x} +is precisely @samp{foo}. By contrast, + +@example +x ~ /foo/ +@end example + +@noindent +has the value one if @code{x} contains @samp{foo}, such as +@code{"Oh, what a fool am I!"}. + +The right hand operand of the @samp{~} and @samp{!~} operators may be +either a regexp constant (@code{/@dots{}/}), or an ordinary +expression, in which case the value of the expression as a string is used as a +dynamic regexp (@pxref{Regexp Usage, ,How to Use Regular Expressions}; also +@pxref{Computed Regexps, ,Using Dynamic Regexps}). + +@cindex regexp as expression +In recent implementations of @code{awk}, a constant regular +expression in slashes by itself is also an expression. 
The regexp +@code{/@var{regexp}/} is an abbreviation for this comparison expression: + +@example +$0 ~ /@var{regexp}/ +@end example + +One special place where @code{/foo/} is @emph{not} an abbreviation for +@samp{$0 ~ /foo/} is when it is the right-hand operand of @samp{~} or +@samp{!~}! +@xref{Using Constant Regexps, ,Using Regular Expression Constants}, +where this is discussed in more detail. + +@c This paragraph has been here since day 1, and has always bothered +@c me, especially since the expression doesn't really make a lot of +@c sense. So, just take it out. +@ignore +In some contexts it may be necessary to write parentheses around the +regexp to avoid confusing the @code{gawk} parser. For example, +@samp{(/x/ - /y/) > threshold} is not allowed, but @samp{((/x/) - (/y/)) +> threshold} parses properly. +@end ignore + +@node Boolean Ops, Conditional Exp, Typing and Comparison, Expressions +@section Boolean Expressions +@cindex expression, boolean +@cindex boolean expressions +@cindex operators, boolean +@cindex boolean operators +@cindex logical operations +@cindex operations, logical +@cindex short-circuit operators +@cindex operators, short-circuit +@cindex and operator +@cindex or operator +@cindex not operator +@cindex @code{&&} operator +@cindex @code{||} operator +@cindex @code{!} operator + +A @dfn{boolean expression} is a combination of comparison expressions or +matching expressions, using the boolean operators ``or'' +(@samp{||}), ``and'' (@samp{&&}), and ``not'' (@samp{!}), along with +parentheses to control nesting. The truth value of the boolean expression is +computed by combining the truth values of the component expressions. +Boolean expressions are also referred to as @dfn{logical expressions}. +The terms are equivalent. + +Boolean expressions can be used wherever comparison and matching +expressions can be used. 
They can be used in @code{if}, @code{while}, +@code{do} and @code{for} statements +(@pxref{Statements, ,Control Statements in Actions}). +They have numeric values (one if true, zero if false), which come into play +if the result of the boolean expression is stored in a variable, or +used in arithmetic. + +In addition, every boolean expression is also a valid pattern, so +you can use one as a pattern to control the execution of rules. + +Here are descriptions of the three boolean operators, with examples. + +@c @cartouche +@table @code +@item @var{boolean1} && @var{boolean2} +True if both @var{boolean1} and @var{boolean2} are true. For example, +the following statement prints the current input record if it contains +both @samp{2400} and @samp{foo}. + +@example +if ($0 ~ /2400/ && $0 ~ /foo/) print +@end example + +The subexpression @var{boolean2} is evaluated only if @var{boolean1} +is true. This can make a difference when @var{boolean2} contains +expressions that have side effects: in the case of @samp{$0 ~ /foo/ && +($2 == bar++)}, the variable @code{bar} is not incremented if there is +no @samp{foo} in the record. + +@item @var{boolean1} || @var{boolean2} +True if at least one of @var{boolean1} or @var{boolean2} is true. +For example, the following statement prints all records in the input +that contain @emph{either} @samp{2400} or +@samp{foo}, or both. + +@example +if ($0 ~ /2400/ || $0 ~ /foo/) print +@end example + +The subexpression @var{boolean2} is evaluated only if @var{boolean1} +is false. This can make a difference when @var{boolean2} contains +expressions that have side effects. + +@item ! @var{boolean} +True if @var{boolean} is false. For example, the following program prints +all records in the input file @file{BBS-list} that do @emph{not} contain the +string @samp{foo}. + +@c A better example would be `if (! (subscript in array)) ...' but we +@c haven't done anything with arrays or `in' yet. Sigh. +@example +awk '@{ if (! 
($0 ~ /foo/)) print @}' BBS-list +@end example +@end table +@c @end cartouche + +The @samp{&&} and @samp{||} operators are called @dfn{short-circuit} +operators because of the way they work. Evaluation of the full expression +is ``short-circuited'' if the result can be determined part way through +its evaluation. + +@cindex line continuation +You can continue a statement that uses @samp{&&} or @samp{||} simply +by putting a newline after them. But you cannot put a newline in front +of either of these operators without using backslash continuation +(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}). + +The actual value of an expression using the @samp{!} operator will be +either one or zero, depending upon the truth value of the expression it +is applied to. + +The @samp{!} operator is often useful for changing the sense of a flag +variable from false to true and back again. For example, the following +program is one way to print lines in between special bracketing lines: + +@example +$1 == "START" @{ interested = ! interested @} +interested == 1 @{ print @} +$1 == "END" @{ interested = ! interested @} +@end example + +@noindent +The variable @code{interested}, like all @code{awk} variables, starts +out initialized to zero, which is also false. When a line is seen whose +first field is @samp{START}, the value of @code{interested} is toggled +to true, using @samp{!}. The next rule prints lines as long as +@code{interested} is true. When a line is seen whose first field is +@samp{END}, @code{interested} is toggled back to false. +@ignore +We should discuss using `next' in the two rules that toggle the +variable, to avoid printing the bracketing lines, but that's more +distraction than really needed. 
+@end ignore + +@node Conditional Exp, Function Calls, Boolean Ops, Expressions +@section Conditional Expressions +@cindex conditional expression +@cindex expression, conditional + +A @dfn{conditional expression} is a special kind of expression with +three operands. It allows you to use one expression's value to select +one of two other expressions. + +The conditional expression is the same as in the C language: + +@example +@var{selector} ? @var{if-true-exp} : @var{if-false-exp} +@end example + +@noindent +There are three subexpressions. The first, @var{selector}, is always +computed first. If it is ``true'' (not zero and not null) then +@var{if-true-exp} is computed next and its value becomes the value of +the whole expression. Otherwise, @var{if-false-exp} is computed next +and its value becomes the value of the whole expression. + +For example, this expression produces the absolute value of @code{x}: + +@example +x > 0 ? x : -x +@end example + +Each time the conditional expression is computed, exactly one of +@var{if-true-exp} and @var{if-false-exp} is computed; the other is ignored. +This is important when the expressions contain side effects. For example, +this conditional expression examines element @code{i} of either array +@code{a} or array @code{b}, and increments @code{i}. + +@example +x == y ? a[i++] : b[i++] +@end example + +@noindent +This is guaranteed to increment @code{i} exactly once, because each time +only one of the two increment expressions is executed, +and the other is not. +@xref{Arrays, ,Arrays in @code{awk}}, +for more information about arrays. + +@cindex differences between @code{gawk} and @code{awk} +@cindex line continuation +As a minor @code{gawk} extension, +you can continue a statement that uses @samp{?:} simply +by putting a newline after either character. +However, you cannot put a newline in front +of either character without using backslash continuation +(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}). 
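The two conditional expressions discussed above can be exercised together. This is a minimal sketch under stated assumptions: the shell function name @code{cond_demo} and the array contents are invented for the illustration.

```shell
# cond_demo is a hypothetical helper name; the array values are made up.
cond_demo() {
    awk 'BEGIN {
        x = -7
        # Parenthesized so that ">" is a comparison, not a redirection.
        print (x > 0 ? x : -x)
        a[0] = "from a"
        b[0] = "from b"
        x = 3; y = 3; i = 0
        # Only the selected branch is evaluated, so i is
        # incremented exactly once.
        picked = (x == y ? a[i++] : b[i++])
        print picked, i
    }'
}

cond_demo
```

Wrapping the conditional in parentheses inside @code{print} avoids having @samp{>} taken as an output redirection.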
+ +@node Function Calls, Precedence, Conditional Exp, Expressions +@section Function Calls +@cindex function call +@cindex calling a function + +A @dfn{function} is a name for a particular calculation. Because it has +a name, you can ask for it by name at any point in the program. For +example, the function @code{sqrt} computes the square root of a number. + +A fixed set of functions are @dfn{built-in}, which means they are +available in every @code{awk} program. The @code{sqrt} function is one +of these. @xref{Built-in, ,Built-in Functions}, for a list of built-in +functions and their descriptions. In addition, you can define your own +functions for use in your program. +@xref{User-defined, ,User-defined Functions}, for how to do this. + +@cindex arguments in function call +The way to use a function is with a @dfn{function call} expression, +which consists of the function name followed immediately by a list of +@dfn{arguments} in parentheses. The arguments are expressions which +provide the raw materials for the function's calculations. +When there is more than one argument, they are separated by commas. If +there are no arguments, write just @samp{()} after the function name. +Here are some examples: + +@example +sqrt(x^2 + y^2) @i{one argument} +atan2(y, x) @i{two arguments} +rand() @i{no arguments} +@end example + +@strong{Do not put any space between the function name and the +open-parenthesis!} A user-defined function name looks just like the name of +a variable, and space would make the expression look like concatenation +of a variable with an expression inside parentheses. Space before the +parenthesis is harmless with built-in functions, but it is best not to get +into the habit of using space to avoid mistakes with user-defined +functions. + +Each function expects a particular number of arguments. 
For example, the +@code{sqrt} function must be called with a single argument, the number +to take the square root of: + +@example +sqrt(@var{argument}) +@end example + +Some of the built-in functions allow you to omit the final argument. +If you do so, they use a reasonable default. +@xref{Built-in, ,Built-in Functions}, for full details. If arguments +are omitted in calls to user-defined functions, then those arguments are +treated as local variables, initialized to the empty string +(@pxref{User-defined, ,User-defined Functions}). + +Like every other expression, the function call has a value, which is +computed by the function based on the arguments you give it. In this +example, the value of @samp{sqrt(@var{argument})} is the square root of +@var{argument}. A function can also have side effects, such as assigning +values to certain variables or doing I/O. + +Here is a command to read numbers, one number per line, and print the +square root of each one: + +@example +@group +$ awk '@{ print "The square root of", $1, "is", sqrt($1) @}' +1 +@print{} The square root of 1 is 1 +3 +@print{} The square root of 3 is 1.73205 +5 +@print{} The square root of 5 is 2.23607 +@kbd{Control-d} +@end group +@end example + +@node Precedence, , Function Calls, Expressions +@section Operator Precedence (How Operators Nest) +@cindex precedence +@cindex operator precedence + +@dfn{Operator precedence} determines how operators are grouped, when +different operators appear close by in one expression. For example, +@samp{*} has higher precedence than @samp{+}; thus, @samp{a + b * c} +means to multiply @code{b} and @code{c}, and then add @code{a} to the +product (i.e.@: @samp{a + (b * c)}). + +You can overrule the precedence of the operators by using parentheses. +You can think of the precedence rules as saying where the +parentheses are assumed to be if you do not write parentheses yourself. 
In +fact, it is wise to always use parentheses whenever you have an unusual +combination of operators, because other people who read the program may +not remember what the precedence is in this case. You might forget, +too; then you could make a mistake. Explicit parentheses will help prevent +any such mistake. + +When operators of equal precedence are used together, the leftmost +operator groups first, except for the assignment, conditional and +exponentiation operators, which group in the opposite order. +Thus, @samp{a - b + c} groups as @samp{(a - b) + c}, and +@samp{a = b = c} groups as @samp{a = (b = c)}. + +The precedence of prefix unary operators does not matter as long as only +unary operators are involved, because there is only one way to interpret +them---innermost first. Thus, @samp{$++i} means @samp{$(++i)} and +@samp{++$x} means @samp{++($x)}. However, when another operator follows +the operand, then the precedence of the unary operators can matter. +Thus, @samp{$x^2} means @samp{($x)^2}, but @samp{-x^2} means +@samp{-(x^2)}, because @samp{-} has lower precedence than @samp{^} +while @samp{$} has higher precedence. + +Here is a table of @code{awk}'s operators, in order from highest +precedence to lowest: + +@c use @code in the items, looks better in TeX w/o all the quotes +@table @code +@item (@dots{}) +Grouping. + +@item $ +Field. + +@item ++ -- +Increment, decrement. + +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +@item ^ ** +Exponentiation. These operators group right-to-left. +(The @samp{**} operator is not specified by POSIX.) + +@item + - ! +Unary plus, minus, logical ``not''. + +@item * / % +Multiplication, division, modulus. + +@item + - +Addition, subtraction. + +@item @r{Concatenation} +No special token is used to indicate concatenation. +The operands are simply written side by side. + +@item < <= == != +@itemx > >= >> | +Relational, and redirection. 
+The relational operators and the redirections have the same precedence +level. Characters such as @samp{>} serve both as relationals and as +redirections; the context distinguishes between the two meanings. + +Note that the I/O redirection operators in @code{print} and @code{printf} +statements belong to the statement level, not to expressions. The +redirection does not produce an expression which could be the operand of +another operator. As a result, it does not make sense to use a +redirection operator near another operator of lower precedence, without +parentheses. Such combinations, for example @samp{print foo > a ? b : c}, +result in syntax errors. +The correct way to write this statement is @samp{print foo > (a ? b : c)}. + +@item ~ !~ +Matching, non-matching. + +@item in +Array membership. + +@item && +Logical ``and''. + +@item || +Logical ``or''. + +@item ?: +Conditional. This operator groups right-to-left. + +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +@item = += -= *= +@itemx /= %= ^= **= +Assignment. These operators group right-to-left. +(The @samp{**=} operator is not specified by POSIX.) +@end table + +@node Patterns and Actions, Statements, Expressions, Top +@chapter Patterns and Actions +@cindex pattern, definition of + +As you have already seen, each @code{awk} statement consists of +a pattern with an associated action. This chapter describes how +you build patterns and actions. + +@menu +* Pattern Overview:: What goes into a pattern. +* Action Overview:: What goes into an action. +@end menu + +@node Pattern Overview, Action Overview, Patterns and Actions, Patterns and Actions +@section Pattern Elements + +Patterns in @code{awk} control the execution of rules: a rule is +executed when its pattern matches the current input record. This +section explains all about how to write patterns. + +@menu +* Kinds of Patterns:: A list of all kinds of patterns. +* Regexp Patterns:: Using regexps as patterns. 
+* Expression Patterns:: Any expression can be used as a pattern. +* Ranges:: Pairs of patterns specify record ranges. +* BEGIN/END:: Specifying initialization and cleanup rules. +* Empty:: The empty pattern, which matches every record. +@end menu + +@node Kinds of Patterns, Regexp Patterns, Pattern Overview, Pattern Overview +@subsection Kinds of Patterns +@cindex patterns, types of + +Here is a summary of the types of patterns supported in @code{awk}. + +@table @code +@item /@var{regular expression}/ +A regular expression as a pattern. It matches when the text of the +input record fits the regular expression. +(@xref{Regexp, ,Regular Expressions}.) + +@item @var{expression} +A single expression. It matches when its value +is non-zero (if a number) or non-null (if a string). +(@xref{Expression Patterns, ,Expressions as Patterns}.) + +@item @var{pat1}, @var{pat2} +A pair of patterns separated by a comma, specifying a range of records. +The range includes both the initial record that matches @var{pat1}, and +the final record that matches @var{pat2}. +(@xref{Ranges, ,Specifying Record Ranges with Patterns}.) + +@item BEGIN +@itemx END +Special patterns for you to supply start-up or clean-up actions for your +@code{awk} program. +(@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.) + +@item @var{empty} +The empty pattern matches every input record. +(@xref{Empty, ,The Empty Pattern}.) +@end table + +@node Regexp Patterns, Expression Patterns, Kinds of Patterns, Pattern Overview +@subsection Regular Expressions as Patterns + +We have been using regular expressions as patterns since our early examples. +This kind of pattern is simply a regexp constant in the pattern part of +a rule. Its meaning is @samp{$0 ~ /@var{pattern}/}. +The pattern matches when the input record matches the regexp. 
+For example: + +@example +/foo|bar|baz/ @{ buzzwords++ @} +END @{ print buzzwords, "buzzwords seen" @} +@end example + +@node Expression Patterns, Ranges, Regexp Patterns, Pattern Overview +@subsection Expressions as Patterns + +Any @code{awk} expression is valid as an @code{awk} pattern. +Then the pattern matches if the expression's value is non-zero (if a +number) or non-null (if a string). + +The expression is reevaluated each time the rule is tested against a new +input record. If the expression uses fields such as @code{$1}, the +value depends directly on the new input record's text; otherwise, it +depends only on what has happened so far in the execution of the +@code{awk} program, but that may still be useful. + +A very common kind of expression used as a pattern is the comparison +expression, using the comparison operators described in +@ref{Typing and Comparison, ,Variable Typing and Comparison Expressions}. + +Regexp matching and non-matching are also very common expressions. +The left operand of the @samp{~} and @samp{!~} operators is a string. +The right operand is either a constant regular expression enclosed in +slashes (@code{/@var{regexp}/}), or any expression, whose string value +is used as a dynamic regular expression +(@pxref{Computed Regexps, , Using Dynamic Regexps}). + +The following example prints the second field of each input record +whose first field is precisely @samp{foo}. + +@example +$ awk '$1 == "foo" @{ print $2 @}' BBS-list +@end example + +@noindent +(There is no output, since there is no BBS site named ``foo''.) +Contrast this with the following regular expression match, which would +accept any record with a first field that contains @samp{foo}: + +@example +@group +$ awk '$1 ~ /foo/ @{ print $2 @}' BBS-list +@print{} 555-1234 +@print{} 555-6699 +@print{} 555-6480 +@print{} 555-2127 +@end group +@end example + +Boolean expressions are also commonly used as patterns. 
+Whether the pattern +matches an input record depends on whether its subexpressions match. + +For example, the following command prints all records in +@file{BBS-list} that contain both @samp{2400} and @samp{foo}. + +@example +$ awk '/2400/ && /foo/' BBS-list +@print{} fooey 555-1234 2400/1200/300 B +@end example + +The following command prints all records in +@file{BBS-list} that contain @emph{either} @samp{2400} or @samp{foo}, or +both. + +@example +@group +$ awk '/2400/ || /foo/' BBS-list +@print{} alpo-net 555-3412 2400/1200/300 A +@print{} bites 555-1675 2400/1200/300 A +@print{} fooey 555-1234 2400/1200/300 B +@print{} foot 555-6699 1200/300 B +@print{} macfoo 555-6480 1200/300 A +@print{} sdace 555-3430 2400/1200/300 A +@print{} sabafoo 555-2127 1200/300 C +@end group +@end example + +The following command prints all records in +@file{BBS-list} that do @emph{not} contain the string @samp{foo}. + +@example +@group +$ awk '! /foo/' BBS-list +@print{} aardvark 555-5553 1200/300 B +@print{} alpo-net 555-3412 2400/1200/300 A +@print{} barfly 555-7685 1200/300 A +@print{} bites 555-1675 2400/1200/300 A +@print{} camelot 555-0542 300 C +@print{} core 555-2912 1200/300 C +@print{} sdace 555-3430 2400/1200/300 A +@end group +@end example + +The subexpressions of a boolean operator in a pattern can be constant regular +expressions, comparisons, or any other @code{awk} expressions. Range +patterns are not expressions, so they cannot appear inside boolean +patterns. Likewise, the special patterns @code{BEGIN} and @code{END}, +which never match any input record, are not expressions and cannot +appear inside boolean patterns. + +A regexp constant as a pattern is also a special case of an expression +pattern. @code{/foo/} as an expression has the value one if @samp{foo} +appears in the current input record; thus, as a pattern, @code{/foo/} +matches any record containing @samp{foo}. 
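As a small illustration of the boolean patterns described in this section, here is a sketch that mixes a regexp constant, a comparison, and a negated match in one pattern. It assumes a hypothetical three-record sample produced by a shell function named @code{sample}; neither the function nor the data comes from @file{BBS-list}. Any POSIX @code{awk} should behave the same way.

```shell
# Hypothetical sample data for illustration only (not the BBS-list file).
sample() {
    printf '%s\n' 'fooey 2400' 'foot 1200' 'bites 2400'
}

# Both subexpressions must be true: the record contains "foo"
# and its second field compares equal to 2400.
sample | awk '/foo/ && $2 == 2400'

# Either subexpression may be true: the record does not contain "foo",
# or its second field is less than 2400.
sample | awk '! /foo/ || $2 < 2400'
```

With this sample, the first command prints only the @samp{fooey} record, and the second prints the @samp{foot} and @samp{bites} records. Neither rule has an action, so the default action of printing the record applies.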
+ +@node Ranges, BEGIN/END, Expression Patterns, Pattern Overview +@subsection Specifying Record Ranges with Patterns + +@cindex range pattern +@cindex pattern, range +@cindex matching ranges of lines +A @dfn{range pattern} is made of two patterns separated by a comma, of +the form @samp{@var{begpat}, @var{endpat}}. It matches ranges of +consecutive input records. The first pattern, @var{begpat}, controls +where the range begins, and the second one, @var{endpat}, controls where +it ends. For example, + +@example +awk '$1 == "on", $1 == "off"' +@end example + +@noindent +prints every record between @samp{on}/@samp{off} pairs, inclusive. + +A range pattern starts out by matching @var{begpat} +against every input record; when a record matches @var{begpat}, the +range pattern becomes @dfn{turned on}. The range pattern matches this +record. As long as it stays turned on, it automatically matches every +input record read. It also matches @var{endpat} against +every input record; when that succeeds, the range pattern is turned +off again for the following record. Then it goes back to checking +@var{begpat} against each record. + +The record that turns on the range pattern and the one that turns it +off both match the range pattern. If you don't want to operate on +these records, you can write @code{if} statements in the rule's action +to distinguish them from the records you are interested in. + +It is possible for a pattern to be turned both on and off by the same +record, if the record satisfies both conditions. Then the action is +executed for just that record. + +For example, suppose you have text between two identical markers (say +the @samp{%} symbol) that you wish to ignore. 
You might try to +combine a range pattern that describes the delimited text with the +@code{next} statement +(not discussed yet, @pxref{Next Statement, , The @code{next} Statement}), +which causes @code{awk} to skip any further processing of the current +record and start over again with the next input record. Such a program +would look like this: + +@example +/^%$/,/^%$/ @{ next @} + @{ print @} +@end example + +@noindent +@cindex skipping lines between markers +This program fails because the range pattern is both turned on and turned off +by the first line with just a @samp{%} on it. To accomplish this task, you +must write the program this way, using a flag: + +@example +/^%$/ @{ skip = ! skip; next @} +skip == 1 @{ next @} # skip lines with `skip' set + @{ print @} # print all other lines +@end example + +Note that in a range pattern, the @samp{,} has the lowest precedence +(is evaluated last) of all the operators. Thus, for example, the +following program attempts to combine a range pattern with another, +simpler test. + +@example +echo Yes | awk '/1/,/2/ || /Yes/' +@end example + +The author of this program intended it to mean @samp{(/1/,/2/) || /Yes/}. +However, @code{awk} interprets this as @samp{/1/, (/2/ || /Yes/)}. +This cannot be changed or worked around; range patterns do not combine +with other patterns. + +@node BEGIN/END, Empty, Ranges, Pattern Overview +@subsection The @code{BEGIN} and @code{END} Special Patterns + +@cindex @code{BEGIN} special pattern +@cindex pattern, @code{BEGIN} +@cindex @code{END} special pattern +@cindex pattern, @code{END} +@code{BEGIN} and @code{END} are special patterns. They are not used to +match input records. Rather, they supply start-up or +clean-up actions for your @code{awk} script. + +@menu +* Using BEGIN/END:: How and why to use BEGIN/END rules. +* I/O And BEGIN/END:: I/O issues in BEGIN/END rules.
+@end menu + +@node Using BEGIN/END, I/O And BEGIN/END, BEGIN/END, BEGIN/END +@subsubsection Startup and Cleanup Actions + +A @code{BEGIN} rule is executed, once, before the first input record +has been read. An @code{END} rule is executed, once, after all the +input has been read. For example: + +@example +@group +$ awk ' +> BEGIN @{ print "Analysis of \"foo\"" @} +> /foo/ @{ ++n @} +> END @{ print "\"foo\" appears " n " times." @}' BBS-list +@print{} Analysis of "foo" +@print{} "foo" appears 4 times. +@end group +@end example + +This program finds the number of records in the input file @file{BBS-list} +that contain the string @samp{foo}. The @code{BEGIN} rule prints a title +for the report. There is no need to use the @code{BEGIN} rule to +initialize the counter @code{n} to zero, as @code{awk} does this +automatically (@pxref{Variables}). + +The second rule increments the variable @code{n} every time a +record containing the pattern @samp{foo} is read. The @code{END} rule +prints the value of @code{n} at the end of the run. + +The special patterns @code{BEGIN} and @code{END} cannot be used in ranges +or with boolean operators (indeed, they cannot be used with any operators). + +An @code{awk} program may have multiple @code{BEGIN} and/or @code{END} +rules. They are executed in the order they appear, all the @code{BEGIN} +rules at start-up and all the @code{END} rules at termination. +@code{BEGIN} and @code{END} rules may be intermixed with other rules. +This feature was added in the 1987 version of @code{awk}, and is included +in the POSIX standard. The original (1978) version of @code{awk} +required you to put the @code{BEGIN} rule at the beginning of the +program, and the @code{END} rule at the end, and only allowed one of +each. This is no longer required, but it is a good idea in terms of +program organization and readability. 
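The ordering of multiple @code{BEGIN} and @code{END} rules can be checked with a short sketch. The shell function name @code{rule_order} and the two-record input are made up for this illustration; the rules are deliberately intermixed to show that only their relative order within each kind matters.

```shell
# Hypothetical demonstration: two BEGIN rules and two END rules,
# intermixed with an ordinary rule that counts records.
rule_order() {
    printf '%s\n' a b | awk '
    BEGIN { print "begin 1" }
    END   { print "end 1" }
          { n++ }
    BEGIN { print "begin 2" }
    END   { print "end 2: " n " records" }'
}

rule_order
```

Both @code{BEGIN} rules run, in order of appearance, before any input is read; both @code{END} rules run, again in order of appearance, after the last record.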
+ +Multiple @code{BEGIN} and @code{END} rules are useful for writing +library functions, since each library file can have its own @code{BEGIN} and/or +@code{END} rule to do its own initialization and/or cleanup. Note that +the order in which library functions are named on the command line +controls the order in which their @code{BEGIN} and @code{END} rules are +executed. Therefore you have to be careful to write such rules in +library files so that the order in which they are executed doesn't matter. +@xref{Options, ,Command Line Options}, for more information on +using library functions. +@xref{Library Functions, ,A Library of @code{awk} Functions}, +for a number of useful library functions. + +@cindex dark corner +If an @code{awk} program only has a @code{BEGIN} rule, and no other +rules, then the program exits after the @code{BEGIN} rule has been run. +(The original version of @code{awk} used to keep reading and ignoring input +until end of file was seen.) However, if an @code{END} rule exists, +then the input will be read, even if there are no other rules in +the program. This is necessary in case the @code{END} rule checks the +@code{FNR} and @code{NR} variables (d.c.). + +@code{BEGIN} and @code{END} rules must have actions; there is no default +action for these rules since there is no current record when they run. + +@node I/O And BEGIN/END, , Using BEGIN/END, BEGIN/END +@subsubsection Input/Output from @code{BEGIN} and @code{END} Rules + +@cindex I/O from @code{BEGIN} and @code{END} +There are several (sometimes subtle) issues involved when doing I/O +from a @code{BEGIN} or @code{END} rule. + +The first has to do with the value of @code{$0} in a @code{BEGIN} +rule. Since @code{BEGIN} rules are executed before any input is read, +there simply is no input record, and therefore no fields, when +executing @code{BEGIN} rules. References to @code{$0} and the fields +yield a null string or zero, depending upon the context. 
One way +to give @code{$0} a real value is to execute a @code{getline} command +without a variable (@pxref{Getline, ,Explicit Input with @code{getline}}). +Another way is to simply assign a value to it. + +@cindex differences between @code{gawk} and @code{awk} +The second point is similar to the first, but from the other direction. +Inside an @code{END} rule, what is the value of @code{$0} and @code{NF}? +Traditionally, due largely to implementation issues, @code{$0} and +@code{NF} were @emph{undefined} inside an @code{END} rule. +The POSIX standard specified that @code{NF} was available in an @code{END} +rule, containing the number of fields from the last input record. +Due most probably to an oversight, the standard does not say that @code{$0} +is also preserved, although logically one would think that it should be. +In fact, @code{gawk} does preserve the value of @code{$0} for use in +@code{END} rules. Be aware, however, that Unix @code{awk}, and possibly +other implementations, do not. + +The third point follows from the first two. What is the meaning of +@samp{print} inside a @code{BEGIN} or @code{END} rule? The meaning is +the same as always, @samp{print $0}. If @code{$0} is the null string, +then this prints an empty line. Many long time @code{awk} programmers +use @samp{print} in @code{BEGIN} and @code{END} rules, to mean +@samp{@w{print ""}}, relying on @code{$0} being null. While you might +generally get away with this in @code{BEGIN} rules, in @code{gawk} at +least, it is a very bad idea in @code{END} rules. It is also poor +style, since if you want an empty line in the output, you +should say so explicitly in your program. + +@node Empty, , BEGIN/END, Pattern Overview +@subsection The Empty Pattern + +@cindex empty pattern +@cindex pattern, empty +An empty (i.e.@: non-existent) pattern is considered to match @emph{every} +input record. 
For example, the program: + +@example +awk '@{ print $1 @}' BBS-list +@end example + +@noindent +prints the first field of every record. + +@node Action Overview, , Pattern Overview, Patterns and Actions +@section Overview of Actions +@cindex action, definition of +@cindex curly braces +@cindex action, curly braces +@cindex action, separating statements + +An @code{awk} program or script consists of a series of +rules and function definitions, interspersed. (Functions are +described later. @xref{User-defined, ,User-defined Functions}.) + +A rule contains a pattern and an action, either of which (but not +both) may be +omitted. The purpose of the @dfn{action} is to tell @code{awk} what to do +once a match for the pattern is found. Thus, in outline, an @code{awk} +program generally looks like this: + +@example +@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]} +@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]} +@dots{} +function @var{name}(@var{args}) @{ @dots{} @} +@dots{} +@end example + +An action consists of one or more @code{awk} @dfn{statements}, enclosed +in curly braces (@samp{@{} and @samp{@}}). Each statement specifies one +thing to be done. The statements are separated by newlines or +semicolons. + +The curly braces around an action must be used even if the action +contains only one statement, or even if it contains no statements at +all. However, if you omit the action entirely, omit the curly braces as +well. An omitted action is equivalent to @samp{@{ print $0 @}}. + +@example +/foo/ @{ @} # match foo, do nothing - empty action +/foo/ # match foo, print the record - omitted action +@end example + +Here are the kinds of statements supported in @code{awk}: + +@itemize @bullet +@item +Expressions, which can call functions or assign values to variables +(@pxref{Expressions}). Executing +this kind of statement simply computes the value of the expression. 
+This is useful when the expression has side effects +(@pxref{Assignment Ops, ,Assignment Expressions}). + +@item +Control statements, which specify the control flow of @code{awk} +programs. The @code{awk} language gives you C-like constructs +(@code{if}, @code{for}, @code{while}, and @code{do}) as well as a few +special ones (@pxref{Statements, ,Control Statements in Actions}). + +@item +Compound statements, which consist of one or more statements enclosed in +curly braces. A compound statement is used in order to put several +statements together in the body of an @code{if}, @code{while}, @code{do} +or @code{for} statement. + +@item +Input statements, using the @code{getline} command +(@pxref{Getline, ,Explicit Input with @code{getline}}), the @code{next} +statement (@pxref{Next Statement, ,The @code{next} Statement}), +and the @code{nextfile} statement +(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}). + +@item +Output statements, @code{print} and @code{printf}. +@xref{Printing, ,Printing Output}. + +@item +Deletion statements, for deleting array elements. +@xref{Delete, ,The @code{delete} Statement}. +@end itemize + +@iftex +The next chapter covers control statements in detail. +@end iftex + +@node Statements, Built-in Variables, Patterns and Actions, Top +@chapter Control Statements in Actions +@cindex control statement + +@dfn{Control statements} such as @code{if}, @code{while}, and so on +control the flow of execution in @code{awk} programs. Most of the +control statements in @code{awk} are patterned on similar statements in +C. + +All the control statements start with special keywords such as @code{if} +and @code{while}, to distinguish them from simple expressions. + +@cindex compound statement +@cindex statement, compound +Many control statements contain other statements; for example, the +@code{if} statement contains another statement which may or may not be +executed. The contained statement is called the @dfn{body}. 
If you +want to include more than one statement in the body, group them into a +single @dfn{compound statement} with curly braces, separating them with +newlines or semicolons. + +@menu +* If Statement:: Conditionally execute some @code{awk} + statements. +* While Statement:: Loop until some condition is satisfied. +* Do Statement:: Do specified action while looping until some + condition is satisfied. +* For Statement:: Another looping statement, that provides + initialization and increment clauses. +* Break Statement:: Immediately exit the innermost enclosing loop. +* Continue Statement:: Skip to the end of the innermost enclosing + loop. +* Next Statement:: Stop processing the current input record. +* Nextfile Statement:: Stop processing the current file. +* Exit Statement:: Stop execution of @code{awk}. +@end menu + +@node If Statement, While Statement, Statements, Statements +@section The @code{if}-@code{else} Statement + +@cindex @code{if}-@code{else} statement +The @code{if}-@code{else} statement is @code{awk}'s decision-making +statement. It looks like this: + +@example +if (@var{condition}) @var{then-body} @r{[}else @var{else-body}@r{]} +@end example + +@noindent +The @var{condition} is an expression that controls what the rest of the +statement will do. If @var{condition} is true, @var{then-body} is +executed; otherwise, @var{else-body} is executed. +The @code{else} part of the statement is +optional. The condition is considered false if its value is zero or +the null string, and true otherwise. + +Here is an example: + +@example +if (x % 2 == 0) + print "x is even" +else + print "x is odd" +@end example + +In this example, if the expression @samp{x % 2 == 0} is true (that is, +the value of @code{x} is evenly divisible by two), then the first @code{print} +statement is executed, otherwise the second @code{print} statement is +executed. 
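The rule for deciding whether a condition is true can be tried directly. The following is only a sketch: the shell function @code{truth_demo} and the @code{awk} function @code{check} are hypothetical names introduced here, and the four values tested are arbitrary.

```shell
# Hypothetical demonstration of which values an if condition treats as true.
truth_demo() {
    awk '
    function check(label, v) {
        if (v)
            print label, "is true"
        else
            print label, "is false"
    }
    BEGIN {
        check("zero", 0)            # numeric zero is false
        check("null string", "")    # the null string is false
        check("one", 1)             # any non-zero number is true
        check("string foo", "foo")  # any non-null string is true
    }'
}

truth_demo
```

Zero and the null string take the @code{else} branch; the other two values take the @var{then-body}.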
+ +If the @code{else} appears on the same line as @var{then-body}, and +@var{then-body} is not a compound statement (i.e.@: not surrounded by +curly braces), then a semicolon must separate @var{then-body} from +@code{else}. To illustrate this, let's rewrite the previous example: + +@example +if (x % 2 == 0) print "x is even"; else + print "x is odd" +@end example + +@noindent +If you forget the @samp{;}, @code{awk} won't be able to interpret the +statement, and you will get a syntax error. + +We would not actually write this example this way, because a human +reader might fail to see the @code{else} if it were not the first thing +on its line. + +@node While Statement, Do Statement, If Statement, Statements +@section The @code{while} Statement +@cindex @code{while} statement +@cindex loop +@cindex body of a loop + +In programming, a @dfn{loop} means a part of a program that can +be executed two or more times in succession. + +The @code{while} statement is the simplest looping statement in +@code{awk}. It repeatedly executes a statement as long as a condition is +true. It looks like this: + +@example +while (@var{condition}) + @var{body} +@end example + +@noindent +Here @var{body} is a statement that we call the @dfn{body} of the loop, +and @var{condition} is an expression that controls how long the loop +keeps running. + +The first thing the @code{while} statement does is test @var{condition}. +If @var{condition} is true, it executes the statement @var{body}. +@ifinfo +(The @var{condition} is true when the value +is not zero and not a null string.) +@end ifinfo +After @var{body} has been executed, +@var{condition} is tested again, and if it is still true, @var{body} is +executed again. This process repeats until @var{condition} is no longer +true. If @var{condition} is initially false, the body of the loop is +never executed, and @code{awk} continues with the statement following +the loop. + +This example prints the first three fields of each record, one per line. 
+ +@example +awk '@{ i = 1 + while (i <= 3) @{ + print $i + i++ + @} +@}' inventory-shipped +@end example + +@noindent +Here the body of the loop is a compound statement enclosed in braces, +containing two statements. + +The loop works like this: first, the value of @code{i} is set to one. +Then, the @code{while} tests whether @code{i} is less than or equal to +three. This is true when @code{i} equals one, so the @code{i}-th +field is printed. Then the @samp{i++} increments the value of @code{i} +and the loop repeats. The loop terminates when @code{i} reaches four. + +As you can see, a newline is not required between the condition and the +body; but using one makes the program clearer unless the body is a +compound statement or is very simple. The newline after the open-brace +that begins the compound statement is not required either, but the +program would be harder to read without it. + +@node Do Statement, For Statement, While Statement, Statements +@section The @code{do}-@code{while} Statement + +The @code{do} loop is a variation of the @code{while} looping statement. +The @code{do} loop executes the @var{body} once, and then repeats @var{body} +as long as @var{condition} is true. It looks like this: + +@example +@group +do + @var{body} +while (@var{condition}) +@end group +@end example + +Even if @var{condition} is false at the start, @var{body} is executed at +least once (and only once, unless executing @var{body} makes +@var{condition} true). Contrast this with the corresponding +@code{while} statement: + +@example +while (@var{condition}) + @var{body} +@end example + +@noindent +This statement does not execute @var{body} even once if @var{condition} +is false to begin with. + +Here is an example of a @code{do} statement: + +@example +awk '@{ i = 1 + do @{ + print $0 + i++ + @} while (i <= 10) +@}' +@end example + +@noindent +This program prints each input record ten times. 
It isn't a very +realistic example, since in this case an ordinary @code{while} would do +just as well. But this reflects actual experience; there is only +occasionally a real use for a @code{do} statement. + +@node For Statement, Break Statement, Do Statement, Statements +@section The @code{for} Statement +@cindex @code{for} statement + +The @code{for} statement makes it more convenient to count iterations of a +loop. The general form of the @code{for} statement looks like this: + +@example +for (@var{initialization}; @var{condition}; @var{increment}) + @var{body} +@end example + +@noindent +The @var{initialization}, @var{condition} and @var{increment} parts are +arbitrary @code{awk} expressions, and @var{body} stands for any +@code{awk} statement. + +The @code{for} statement starts by executing @var{initialization}. +Then, as long +as @var{condition} is true, it repeatedly executes @var{body} and then +@var{increment}. Typically @var{initialization} sets a variable to +either zero or one, @var{increment} adds one to it, and @var{condition} +compares it against the desired number of iterations. + +Here is an example of a @code{for} statement: + +@example +@group +awk '@{ for (i = 1; i <= 3; i++) + print $i +@}' inventory-shipped +@end group +@end example + +@noindent +This prints the first three fields of each input record, one field per +line. + +You cannot set more than one variable in the +@var{initialization} part unless you use a multiple assignment statement +such as @samp{x = y = 0}, which is possible only if all the initial values +are equal. (But you can initialize additional variables by writing +their assignments as separate statements preceding the @code{for} loop.) + +The same is true of the @var{increment} part; to increment additional +variables, you must write separate statements at the end of the loop. +The C compound expression, using C's comma operator, would be useful in +this context, but it is not supported in @code{awk}. 
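Here is a sketch of the workaround just described, for a loop that needs two variables. It pairs fields from both ends of a record, working inward. The shell function name @code{pairs} and the sample record are assumptions made for this illustration.

```shell
# Hypothetical example: two loop variables without a comma operator.
pairs() {
    printf '%s\n' 'a b c d' | awk '{
        j = NF                  # second variable is set before the loop
        for (i = 1; i < j; i++) {
            print $i, $j
            j--                 # second update is done at the end of the body
        }
    }'
}

pairs
```

For the record shown, the loop prints the first and last fields together, then the second and third.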
+ +Most often, @var{increment} is an increment expression, as in the +example above. But this is not required; it can be any expression +whatever. For example, this statement prints all the powers of two +between one and 100: + +@example +for (i = 1; i <= 100; i *= 2) + print i +@end example + +Any of the three expressions in the parentheses following the @code{for} may +be omitted if there is nothing to be done there. Thus, @w{@samp{for (; x +> 0;)}} is equivalent to @w{@samp{while (x > 0)}}. If the +@var{condition} is omitted, it is treated as @var{true}, effectively +yielding an @dfn{infinite loop} (i.e.@: a loop that will never +terminate). + +In most cases, a @code{for} loop is an abbreviation for a @code{while} +loop, as shown here: + +@example +@var{initialization} +while (@var{condition}) @{ + @var{body} + @var{increment} +@} +@end example + +@noindent +The only exception is when the @code{continue} statement +(@pxref{Continue Statement, ,The @code{continue} Statement}) is used +inside the loop; changing a @code{for} statement to a @code{while} +statement in this way can change the effect of the @code{continue} +statement inside the loop. + +There is an alternate version of the @code{for} loop, for iterating over +all the indices of an array: + +@example +for (i in array) + @var{do something with} array[i] +@end example + +@noindent +@xref{Scanning an Array, ,Scanning All Elements of an Array}, +for more information on this version of the @code{for} loop. + +The @code{awk} language has a @code{for} statement in addition to a +@code{while} statement because often a @code{for} loop is both less work to +type and more natural to think of. Counting the number of iterations is +very common in loops. It can be easier to think of this counting as part +of looping rather than as something to do inside the loop. + +The next section has more complicated examples of @code{for} loops. 
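The caveat about @code{continue} when rewriting a @code{for} loop as a @code{while} loop can be made concrete with a sketch. Both hypothetical shell functions below print the odd numbers from one to five. In the @code{while} version, the increment has to be repeated before @code{continue}; if it were left only at the end of the body, @code{continue} would skip it and the loop would never terminate.

```shell
# Hypothetical comparison of continue inside for and while loops.
for_version() {
    awk 'BEGIN {
        for (i = 1; i <= 5; i++) {
            if (i % 2 == 0)
                continue        # control passes to i++, then to the test
            print i
        }
    }'
}

while_version() {
    awk 'BEGIN {
        i = 1
        while (i <= 5) {
            if (i % 2 == 0) {
                i++             # continue goes straight to the test, so increment first
                continue
            }
            print i
            i++
        }
    }'
}

for_version
while_version
```

Each function consists only of a @code{BEGIN} rule, so no input is read.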
+ +@node Break Statement, Continue Statement, For Statement, Statements +@section The @code{break} Statement +@cindex @code{break} statement +@cindex loops, exiting + +The @code{break} statement jumps out of the innermost @code{for}, +@code{while}, or @code{do} loop that encloses it. The +following example finds the smallest divisor of any integer, and also +identifies prime numbers: + +@example +awk '# find smallest divisor of num + @{ num = $1 + for (div = 2; div*div <= num; div++) + if (num % div == 0) + break + if (num % div == 0) + printf "Smallest divisor of %d is %d\n", num, div + else + printf "%d is prime\n", num + @}' +@end example + +When the remainder is zero in the first @code{if} statement, @code{awk} +immediately @dfn{breaks out} of the containing @code{for} loop. This means +that @code{awk} proceeds immediately to the statement following the loop +and continues processing. (This is very different from the @code{exit} +statement which stops the entire @code{awk} program. +@xref{Exit Statement, ,The @code{exit} Statement}.) + +Here is another program equivalent to the previous one. It illustrates how +the @var{condition} of a @code{for} or @code{while} could just as well be +replaced with a @code{break} inside an @code{if}: + +@example +@group +awk '# find smallest divisor of num + @{ num = $1 + for (div = 2; ; div++) @{ + if (num % div == 0) @{ + printf "Smallest divisor of %d is %d\n", num, div + break + @} + if (div*div > num) @{ + printf "%d is prime\n", num + break + @} + @} +@}' +@end group +@end example + +@cindex @code{break}, outside of loops +@cindex historical features +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +@cindex dark corner +As described above, the @code{break} statement has no meaning when +used outside the body of a loop. 
However, although it was never documented, +historical implementations of @code{awk} have treated the @code{break} +statement outside of a loop as if it were a @code{next} statement +(@pxref{Next Statement, ,The @code{next} Statement}). +Recent versions of Unix @code{awk} no longer allow this usage. +@code{gawk} will support this use of @code{break} only if @samp{--traditional} +has been specified on the command line +(@pxref{Options, ,Command Line Options}). +Otherwise, it will be treated as an error, since the POSIX standard +specifies that @code{break} should only be used inside the body of a +loop (d.c.). + +@node Continue Statement, Next Statement, Break Statement, Statements +@section The @code{continue} Statement + +@cindex @code{continue} statement +The @code{continue} statement, like @code{break}, is used only inside +@code{for}, @code{while}, and @code{do} loops. It skips +over the rest of the loop body, causing the next cycle around the loop +to begin immediately. Contrast this with @code{break}, which jumps out +of the loop altogether. + +@c The point of this program was to illustrate the use of continue with +@c a while loop. But Karl Berry points out that that is done adequately +@c below, and that this example is very un-awk-like. So for now, we'll +@c omit it. +@ignore +In Texinfo source files, text that the author wishes to ignore can be +enclosed between lines that start with @samp{@@ignore} and end with +@samp{@@end ignore}. Here is a program that strips out lines between +@samp{@@ignore} and @samp{@@end ignore} pairs. + +@example +BEGIN @{ + while (getline > 0) @{ + if (/^@@ignore/) + ignoring = 1 + else if (/^@@end[ \t]+ignore/) @{ + ignoring = 0 + continue + @} + if (ignoring) + continue + print + @} +@} +@end example + +When an @samp{@@ignore} is seen, the @code{ignoring} flag is set to one (true). +When @samp{@@end ignore} is seen, the flag is reset to zero (false). 
As long +as the flag is true, the input record is not printed, because the +@code{continue} restarts the @code{while} loop, skipping over the @code{print} +statement. + +@c Exercise!!! +@c How could this program be written to make better use of the awk language? +@end ignore + +The @code{continue} statement in a @code{for} loop directs @code{awk} to +skip the rest of the body of the loop, and resume execution with the +increment-expression of the @code{for} statement. The following program +illustrates this fact: + +@example +awk 'BEGIN @{ + for (x = 0; x <= 20; x++) @{ + if (x == 5) + continue + printf "%d ", x + @} + print "" +@}' +@end example + +@noindent +This program prints all the numbers from zero to 20, except for five, for +which the @code{printf} is skipped. Since the increment @samp{x++} +is not skipped, @code{x} does not remain stuck at five. Contrast the +@code{for} loop above with this @code{while} loop: + +@example +awk 'BEGIN @{ + x = 0 + while (x <= 20) @{ + if (x == 5) + continue + printf "%d ", x + x++ + @} + print "" +@}' +@end example + +@noindent +This program loops forever once @code{x} gets to five. + +@cindex @code{continue}, outside of loops +@cindex historical features +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +@cindex dark corner +As described above, the @code{continue} statement has no meaning when +used outside the body of a loop. However, although it was never documented, +historical implementations of @code{awk} have treated the @code{continue} +statement outside of a loop as if it were a @code{next} statement +(@pxref{Next Statement, ,The @code{next} Statement}). +Recent versions of Unix @code{awk} no longer allow this usage. +@code{gawk} will support this use of @code{continue} only if +@samp{--traditional} has been specified on the command line +(@pxref{Options, ,Command Line Options}). 
+Otherwise, it will be treated as an error, since the POSIX standard +specifies that @code{continue} should only be used inside the body of a +loop (d.c.). + +@node Next Statement, Nextfile Statement, Continue Statement, Statements +@section The @code{next} Statement +@cindex @code{next} statement + +The @code{next} statement forces @code{awk} to immediately stop processing +the current record and go on to the next record. This means that no +further rules are executed for the current record. The rest of the +current rule's action is not executed either. + +Contrast this with the effect of the @code{getline} function +(@pxref{Getline, ,Explicit Input with @code{getline}}). That too causes +@code{awk} to read the next record immediately, but it does not alter the +flow of control in any way. So the rest of the current action executes +with a new input record. + +At the highest level, @code{awk} program execution is a loop that reads +an input record and then tests each rule's pattern against it. If you +think of this loop as a @code{for} statement whose body contains the +rules, then the @code{next} statement is analogous to a @code{continue} +statement: it skips to the end of the body of this implicit loop, and +executes the increment (which reads another record). + +For example, if your @code{awk} program works only on records with four +fields, and you don't want it to fail when given bad input, you might +use this rule near the beginning of the program: + +@example +@group +NF != 4 @{ + err = sprintf("%s:%d: skipped: NF != 4\n", FILENAME, FNR) + print err > "/dev/stderr" + next +@} +@end group +@end example + +@noindent +so that the following rules will not see the bad record. The error +message is redirected to the standard error output stream, as error +messages should be. @xref{Special Files, ,Special File Names in @code{gawk}}. 
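
The effect of @code{next} on the rules that follow it can be sketched with a cut-down version of the rule above. This is a minimal illustration, assuming a POSIX shell; the function name @code{skip_bad_records} and the sample input are hypothetical, and the error message is left out to keep the sketch portable.

```shell
# Hypothetical helper: print the first field of well-formed records only.
skip_bad_records() {
    awk 'NF != 4 { next }      # bad record: no later rule sees it
                 { print $1 }  # reached only for four-field records'
}

printf 'a b c d\nx y\ne f g h\n' | skip_bad_records
```

The two-field record @samp{x y} never reaches the second rule, so only @samp{a} and @samp{e} are printed.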
+ +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +According to the POSIX standard, the behavior is undefined if +the @code{next} statement is used in a @code{BEGIN} or @code{END} rule. +@code{gawk} will treat it as a syntax error. +Although POSIX permits it, +some other @code{awk} implementations don't allow the @code{next} +statement inside function bodies +(@pxref{User-defined, ,User-defined Functions}). +Just like any other @code{next} statement, a @code{next} inside a +function body reads the next record and starts processing it with the +first rule in the program. + +If the @code{next} statement causes the end of the input to be reached, +then the code in any @code{END} rules will be executed. +@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}. + +@cindex @code{next}, inside a user-defined function +@strong{Caution:} Some @code{awk} implementations generate a run-time +error if you use the @code{next} statement inside a user-defined function +(@pxref{User-defined, , User-defined Functions}). +@code{gawk} does not have this problem. + +@node Nextfile Statement, Exit Statement, Next Statement, Statements +@section The @code{nextfile} Statement +@cindex @code{nextfile} statement +@cindex differences between @code{gawk} and @code{awk} + +@code{gawk} provides the @code{nextfile} statement, +which is similar to the @code{next} statement. +However, instead of abandoning processing of the current record, the +@code{nextfile} statement instructs @code{gawk} to stop processing the +current data file. + +Upon execution of the @code{nextfile} statement, @code{FILENAME} is +updated to the name of the next data file listed on the command line, +@code{FNR} is reset to one, @code{ARGIND} is incremented, and processing +starts over with the first rule in the program. @xref{Built-in Variables}. + +If the @code{nextfile} statement causes the end of the input to be reached, +then the code in any @code{END} rules will be executed.
+@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}. + +The @code{nextfile} statement is a @code{gawk} extension; it is not +(currently) available in any other @code{awk} implementation. +@xref{Nextfile Function, ,Implementing @code{nextfile} as a Function}, +for a user-defined function you can use to simulate the @code{nextfile} +statement. + +The @code{nextfile} statement would be useful if you have many data +files to process, and you expect that you +would not want to process every record in every file. +Normally, in order to move on to +the next data file, you would have to continue scanning the unwanted +records. The @code{nextfile} statement accomplishes this much more +efficiently. + +@cindex @code{next file} statement +@strong{Caution:} Versions of @code{gawk} prior to 3.0 used two +words (@samp{next file}) for the @code{nextfile} statement. This was +changed in 3.0 to one word, since the treatment of @samp{file} was +inconsistent. When it appeared after @code{next}, it was a keyword. +Otherwise, it was a regular identifier. The old usage is still +accepted. However, @code{gawk} will generate a warning message, and +support for @code{next file} will eventually be discontinued in a +future version of @code{gawk}. + +@node Exit Statement, , Nextfile Statement, Statements +@section The @code{exit} Statement + +@cindex @code{exit} statement +The @code{exit} statement causes @code{awk} to immediately stop +executing the current rule and to stop processing input; any remaining input +is ignored. It looks like this: + +@example +exit @r{[}@var{return code}@r{]} +@end example + +If an @code{exit} statement is executed from a @code{BEGIN} rule the +program stops processing everything immediately. No input records are +read. However, if an @code{END} rule is present, it is executed +(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}). 
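
The behavior of @code{exit} inside a @code{BEGIN} rule can be sketched as follows. This is a minimal illustration, assuming a POSIX shell; the function name @code{exit_in_begin} and the sample input are hypothetical.

```shell
# Hypothetical helper: exit from BEGIN, with a main rule and an END rule present.
exit_in_begin() {
    awk 'BEGIN { print "begin"; exit }   # stop before any input is read
               { print "record:", $0 }   # never runs
         END   { print "end" }           # still runs after the exit'
}

exit_in_begin <<'EOF'
this record is never read
EOF
```

The main rule never fires because no record is read, but the @code{END} rule is still executed.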
+ +If @code{exit} is used as part of an @code{END} rule, it causes +the program to stop immediately. + +An @code{exit} statement that is not part +of a @code{BEGIN} or @code{END} rule stops the execution of any further +automatic rules for the current record, skips reading any remaining input +records, and executes +the @code{END} rule if there is one. + +If you do not want the @code{END} rule to do its job in this case, you +can set a variable to non-zero before the @code{exit} statement, and check +that variable in the @code{END} rule. +@xref{Assert Function, ,Assertions}, +for an example that does this. + +@cindex dark corner +If an argument is supplied to @code{exit}, its value is used as the exit +status code for the @code{awk} process. If no argument is supplied, +@code{exit} returns status zero (success). In the case where an argument +is supplied to a first @code{exit} statement, and then @code{exit} is +called a second time with no argument, the previously supplied exit value +is used (d.c.). + +For example, let's say you've discovered an error condition you really +don't know how to handle. Conventionally, programs report this by +exiting with a non-zero status. Your @code{awk} program can do this +using an @code{exit} statement with a non-zero argument. Here is an +example: + +@example +@group +BEGIN @{ + if (("date" | getline date_now) <= 0) @{ + print "Can't get system date" > "/dev/stderr" + exit 1 + @} + print "current date is", date_now + close("date") +@} +@end group +@end example + +@node Built-in Variables, Arrays, Statements, Top +@chapter Built-in Variables +@cindex built-in variables + +Most @code{awk} variables are available for you to use for your own +purposes; they never change except when your program assigns values to +them, and never affect anything except when your program examines them. +However, a few variables in @code{awk} have special built-in meanings.
+Some of them @code{awk} examines automatically, so that they enable you +to tell @code{awk} how to do certain things. Others are set +automatically by @code{awk}, so that they carry information from the +internal workings of @code{awk} to your program. + +This chapter documents all the built-in variables of @code{gawk}. Most +of them are also documented in the chapters describing their areas of +activity. + +@menu +* User-modified:: Built-in variables that you change to control + @code{awk}. +* Auto-set:: Built-in variables where @code{awk} gives you + information. +* ARGC and ARGV:: Ways to use @code{ARGC} and @code{ARGV}. +@end menu + +@node User-modified, Auto-set, Built-in Variables, Built-in Variables +@section Built-in Variables that Control @code{awk} +@cindex built-in variables, user modifiable + +This is an alphabetical list of the variables which you can change to +control how @code{awk} does certain things. Those variables that are +specific to @code{gawk} are marked with an asterisk, @samp{*}. + +@table @code +@vindex CONVFMT +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +@item CONVFMT +This string controls conversion of numbers to +strings (@pxref{Conversion, ,Conversion of Strings and Numbers}). +It works by being passed, in effect, as the first argument to the +@code{sprintf} function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). +Its default value is @code{"%.6g"}. +@code{CONVFMT} was introduced by the POSIX standard. + +@vindex FIELDWIDTHS +@item FIELDWIDTHS * +This is a space separated list of columns that tells @code{gawk} +how to split input with fixed, columnar boundaries. It is an +experimental feature. Assigning to @code{FIELDWIDTHS} +overrides the use of @code{FS} for field splitting. +@xref{Constant Size, ,Reading Fixed-width Data}, for more information. 
+ +If @code{gawk} is in compatibility mode +(@pxref{Options, ,Command Line Options}), then @code{FIELDWIDTHS} +has no special meaning, and field splitting operations are done based +exclusively on the value of @code{FS}. + +@vindex FS +@item FS +@code{FS} is the input field separator +(@pxref{Field Separators, ,Specifying How Fields are Separated}). +The value is a single-character string or a multi-character regular +expression that matches the separations between fields in an input +record. If the value is the null string (@code{""}), then each +character in the record becomes a separate field. + +The default value is @w{@code{" "}}, a string consisting of a single +space. As a special exception, this value means that any +sequence of spaces, tabs, and/or newlines is a single separator.@footnote{In +POSIX @code{awk}, newline does not count as whitespace.} It also causes +spaces, tabs, and newlines at the beginning and end of a record to be ignored. + +You can set the value of @code{FS} on the command line using the +@samp{-F} option: + +@example +awk -F, '@var{program}' @var{input-files} +@end example + +If @code{gawk} is using @code{FIELDWIDTHS} for field-splitting, +assigning a value to @code{FS} will cause @code{gawk} to return to +the normal, @code{FS}-based, field splitting. An easy way to do this +is to simply say @samp{FS = FS}, perhaps with an explanatory comment. + +@vindex IGNORECASE +@item IGNORECASE * +If @code{IGNORECASE} is non-zero or non-null, then all string comparisons, +and all regular expression matching are case-independent. Thus, regexp +matching with @samp{~} and @samp{!~}, and the @code{gensub}, +@code{gsub}, @code{index}, @code{match}, @code{split} and @code{sub} +functions, record termination with @code{RS}, and field splitting with +@code{FS} all ignore case when doing their particular regexp operations. +The value of @code{IGNORECASE} does @emph{not} affect array subscripting. +@xref{Case-sensitivity, ,Case-sensitivity in Matching}. 
+ +If @code{gawk} is in compatibility mode +(@pxref{Options, ,Command Line Options}), +then @code{IGNORECASE} has no special meaning, and string +and regexp operations are always case-sensitive. + +@vindex OFMT +@item OFMT +This string controls conversion of numbers to +strings (@pxref{Conversion, ,Conversion of Strings and Numbers}) for +printing with the @code{print} statement. It works by being passed, in +effect, as the first argument to the @code{sprintf} function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). +Its default value is @code{"%.6g"}. Earlier versions of @code{awk} +also used @code{OFMT} to specify the format for converting numbers to +strings in general expressions; this is now done by @code{CONVFMT}. + +@vindex OFS +@item OFS +This is the output field separator (@pxref{Output Separators}). It is +output between the fields output by a @code{print} statement. Its +default value is @w{@code{" "}}, a string consisting of a single space. + +@vindex ORS +@item ORS +This is the output record separator. It is output at the end of every +@code{print} statement. Its default value is @code{"\n"}. +(@xref{Output Separators}.) + +@vindex RS +@item RS +This is @code{awk}'s input record separator. Its default value is a string +containing a single newline character, which means that an input record +consists of a single line of text. +It can also be the null string, in which case records are separated by +runs of blank lines, or a regexp, in which case records are separated by +matches of the regexp in the input text. +(@xref{Records, ,How Input is Split into Records}.) + +@vindex SUBSEP +@item SUBSEP +@code{SUBSEP} is the subscript separator. It has the default value of +@code{"\034"}, and is used to separate the parts of the indices of a +multi-dimensional array. Thus, the expression @code{@w{foo["A", "B"]}} +really accesses @code{foo["A\034B"]} +(@pxref{Multi-dimensional, ,Multi-dimensional Arrays}). 
+@end table + +@node Auto-set, ARGC and ARGV, User-modified, Built-in Variables +@section Built-in Variables that Convey Information +@cindex built-in variables, convey information + +This is an alphabetical list of the variables that are set +automatically by @code{awk} on certain occasions in order to provide +information to your program. Those variables that are specific to +@code{gawk} are marked with an asterisk, @samp{*}. + +@table @code +@vindex ARGC +@vindex ARGV +@item ARGC +@itemx ARGV +The command-line arguments available to @code{awk} programs are stored in +an array called @code{ARGV}. @code{ARGC} is the number of command-line +arguments present. @xref{Other Arguments, ,Other Command Line Arguments}. +Unlike most @code{awk} arrays, +@code{ARGV} is indexed from zero to @code{ARGC} @minus{} 1. For example: + +@example +@group +$ awk 'BEGIN @{ +> for (i = 0; i < ARGC; i++) +> print ARGV[i] +> @}' inventory-shipped BBS-list +@print{} awk +@print{} inventory-shipped +@print{} BBS-list +@end group +@end example + +@noindent +In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]} +contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains +@code{"BBS-list"}. The value of @code{ARGC} is three, one more than the +index of the last element in @code{ARGV}, since the elements are numbered +from zero. + +The names @code{ARGC} and @code{ARGV}, as well as the convention of indexing +the array from zero to @code{ARGC} @minus{} 1, are derived from the C language's +method of accessing command line arguments. +@xref{ARGC and ARGV, , Using @code{ARGC} and @code{ARGV}}, for information +about how @code{awk} uses these variables. + +@vindex ARGIND +@item ARGIND * +The index in @code{ARGV} of the current file being processed. +Every time @code{gawk} opens a new data file for processing, it sets +@code{ARGIND} to the index in @code{ARGV} of the file name. 
+When @code{gawk} is processing the input files, it is always +true that @samp{FILENAME == ARGV[ARGIND]}. + +This variable is useful in file processing; it allows you to tell how far +along you are in the list of data files, and to distinguish between +successive instances of the same filename on the command line. + +While you can change the value of @code{ARGIND} within your @code{awk} +program, @code{gawk} will automatically set it to a new value when the +next file is opened. + +This variable is a @code{gawk} extension. In other @code{awk} implementations, +or if @code{gawk} is in compatibility mode +(@pxref{Options, ,Command Line Options}), +it is not special. + +@vindex ENVIRON +@item ENVIRON +An associative array that contains the values of the environment. The array +indices are the environment variable names; the values are the values of +the particular environment variables. For example, +@code{ENVIRON["HOME"]} might be @file{/home/arnold}. Changing this array +does not affect the environment passed on to any programs that +@code{awk} may spawn via redirection or the @code{system} function. +(In a future version of @code{gawk}, it may do so.) + +Some operating systems may not have environment variables. +On such systems, the @code{ENVIRON} array is empty (except for +@w{@code{ENVIRON["AWKPATH"]}}). + +@vindex ERRNO +@item ERRNO * +If a system error occurs either doing a redirection for @code{getline}, +during a read for @code{getline}, or during a @code{close} operation, +then @code{ERRNO} will contain a string describing the error. + +This variable is a @code{gawk} extension. In other @code{awk} implementations, +or if @code{gawk} is in compatibility mode +(@pxref{Options, ,Command Line Options}), +it is not special. + +@cindex dark corner +@vindex FILENAME +@item FILENAME +This is the name of the file that @code{awk} is currently reading. 
+When no data files are listed on the command line, @code{awk} reads +from the standard input, and @code{FILENAME} is set to @code{"-"}. +@code{FILENAME} is changed each time a new file is read +(@pxref{Reading Files, ,Reading Input Files}). +Inside a @code{BEGIN} rule, the value of @code{FILENAME} is +@code{""}, since there are no input files being processed +yet.@footnote{Some early implementations of Unix @code{awk} initialized +@code{FILENAME} to @code{"-"}, even if there were data files to be +processed. This behavior was incorrect, and should not be relied +upon in your programs.} (d.c.) + +@vindex FNR +@item FNR +@code{FNR} is the current record number in the current file. @code{FNR} is +incremented each time a new record is read +(@pxref{Getline, ,Explicit Input with @code{getline}}). It is reinitialized +to zero each time a new input file is started. + +@vindex NF +@item NF +@code{NF} is the number of fields in the current input record. +@code{NF} is set each time a new record is read, when a new field is +created, or when @code{$0} changes (@pxref{Fields, ,Examining Fields}). + +@vindex NR +@item NR +This is the number of input records @code{awk} has processed since +the beginning of the program's execution +(@pxref{Records, ,How Input is Split into Records}). +@code{NR} is set each time a new record is read. + +@vindex RLENGTH +@item RLENGTH +@code{RLENGTH} is the length of the substring matched by the +@code{match} function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). +@code{RLENGTH} is set by invoking the @code{match} function. Its value +is the length of the matched string, or @minus{}1 if no match was found. + +@vindex RSTART +@item RSTART +@code{RSTART} is the start-index in characters of the substring matched by the +@code{match} function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). +@code{RSTART} is set by invoking the @code{match} function. 
Its value +is the position of the string where the matched substring starts, or zero +if no match was found. + +@vindex RT +@item RT * +@code{RT} is set each time a record is read. It contains the input text +that matched the text denoted by @code{RS}, the record separator. + +This variable is a @code{gawk} extension. In other @code{awk} implementations, +or if @code{gawk} is in compatibility mode +(@pxref{Options, ,Command Line Options}), +it is not special. +@end table + +@cindex dark corner +A side note about @code{NR} and @code{FNR}. +@code{awk} simply increments both of these variables +each time it reads a record, instead of setting them to the absolute +value of the number of records read. This means that your program can +change these variables, and their new values will be incremented for +each record (d.c.). For example: + +@example +@group +$ echo '1 +> 2 +> 3 +> 4' | awk 'NR == 2 @{ NR = 17 @} +> @{ print NR @}' +@print{} 1 +@print{} 17 +@print{} 18 +@print{} 19 +@end group +@end example + +@noindent +Before @code{FNR} was added to the @code{awk} language +(@pxref{V7/SVR3.1, ,Major Changes between V7 and SVR3.1}), +many @code{awk} programs used this feature to track the number of +records in a file by resetting @code{NR} to zero when @code{FILENAME} +changed. + +@node ARGC and ARGV, , Auto-set, Built-in Variables +@section Using @code{ARGC} and @code{ARGV} + +In @ref{Auto-set, , Built-in Variables that Convey Information}, +you saw this program describing the information contained in @code{ARGC} +and @code{ARGV}: + +@example +@group +$ awk 'BEGIN @{ +> for (i = 0; i < ARGC; i++) +> print ARGV[i] +> @}' inventory-shipped BBS-list +@print{} awk +@print{} inventory-shipped +@print{} BBS-list +@end group +@end example + +@noindent +In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]} +contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains +@code{"BBS-list"}. + +Notice that the @code{awk} program is not entered in @code{ARGV}. 
The +other special command line options, with their arguments, are also not +entered. But variable assignments on the command line @emph{are} +treated as arguments, and do show up in the @code{ARGV} array. + +Your program can alter @code{ARGC} and the elements of @code{ARGV}. +Each time @code{awk} reaches the end of an input file, it uses the next +element of @code{ARGV} as the name of the next input file. By storing a +different string there, your program can change which files are read. +You can use @code{"-"} to represent the standard input. By storing +additional elements and incrementing @code{ARGC} you can cause +additional files to be read. + +If you decrease the value of @code{ARGC}, that eliminates input files +from the end of the list. By recording the old value of @code{ARGC} +elsewhere, your program can treat the eliminated arguments as +something other than file names. + +To eliminate a file from the middle of the list, store the null string +(@code{""}) into @code{ARGV} in place of the file's name. As a +special feature, @code{awk} ignores file names that have been +replaced with the null string. +You may also use the @code{delete} statement to remove elements from +@code{ARGV} (@pxref{Delete, ,The @code{delete} Statement}). + +All of these actions are typically done from the @code{BEGIN} rule, +before actual processing of the input begins. +@xref{Split Program, ,Splitting a Large File Into Pieces}, and see +@ref{Tee Program, ,Duplicating Output Into Multiple Files}, for an example +of each way of removing elements from @code{ARGV}. + +The following fragment processes @code{ARGV} in order to examine, and +then remove, command line options. 
+ +@example +@group +BEGIN @{ + for (i = 1; i < ARGC; i++) @{ + if (ARGV[i] == "-v") + verbose = 1 + else if (ARGV[i] == "-d") + debug = 1 +@end group +@group + else if (ARGV[i] ~ /^-./) @{ + e = sprintf("%s: unrecognized option -- %c", + ARGV[0], substr(ARGV[i], 2, 1)) + print e > "/dev/stderr" + @} else + break + delete ARGV[i] + @} +@} +@end group +@end example + +To actually get the options into the @code{awk} program, you have to +end the @code{awk} options with @samp{--}, and then supply your options, +like so: + +@example +awk -f myprog -- -v -d file1 file2 @dots{} +@end example + +@cindex differences between @code{gawk} and @code{awk} +This is not necessary in @code{gawk}: Unless @samp{--posix} has been +specified, @code{gawk} silently puts any unrecognized options into +@code{ARGV} for the @code{awk} program to deal with. + +As soon as it +sees an unknown option, @code{gawk} stops looking for other options it might +otherwise recognize. The above example with @code{gawk} would be: + +@example +gawk -f myprog -d -v file1 file2 @dots{} +@end example + +@noindent +Since @samp{-d} is not a valid @code{gawk} option, the following @samp{-v} +is passed on to the @code{awk} program. + +@node Arrays, Built-in, Built-in Variables, Top +@chapter Arrays in @code{awk} + +An @dfn{array} is a table of values, called @dfn{elements}. The +elements of an array are distinguished by their indices. @dfn{Indices} +may be either numbers or strings. @code{awk} maintains a single set +of names that may be used for naming variables, arrays and functions +(@pxref{User-defined, ,User-defined Functions}). +Thus, you cannot have a variable and an array with the same name in the +same @code{awk} program. + +@menu +* Array Intro:: Introduction to Arrays +* Reference to Elements:: How to examine one element of an array. +* Assigning Elements:: How to change an element of an array. +* Array Example:: Basic Example of an Array +* Scanning an Array:: A variation of the @code{for} statement.
It + loops through the indices of an array's + existing elements. +* Delete:: The @code{delete} statement removes an element + from an array. +* Numeric Array Subscripts:: How to use numbers as subscripts in + @code{awk}. +* Uninitialized Subscripts:: Using uninitialized variables as subscripts. +* Multi-dimensional:: Emulating multi-dimensional arrays in + @code{awk}. +* Multi-scanning:: Scanning multi-dimensional arrays. +@end menu + +@node Array Intro, Reference to Elements, Arrays, Arrays +@section Introduction to Arrays + +@cindex arrays +The @code{awk} language provides one-dimensional @dfn{arrays} for storing groups +of related strings or numbers. + +Every @code{awk} array must have a name. Array names have the same +syntax as variable names; any valid variable name would also be a valid +array name. But you cannot use one name in both ways (as an array and +as a variable) in one @code{awk} program. + +Arrays in @code{awk} superficially resemble arrays in other programming +languages; but there are fundamental differences. In @code{awk}, you +don't need to specify the size of an array before you start to use it. +Additionally, any number or string in @code{awk} may be used as an +array index, not just consecutive integers. + +In most other languages, you have to @dfn{declare} an array and specify +how many elements or components it contains. In such languages, the +declaration causes a contiguous block of memory to be allocated for that +many elements. An index in the array usually must be a non-negative integer; for +example, the index zero specifies the first element in the array, which is +actually stored at the beginning of the block of memory. Index one +specifies the second element, which is stored in memory right after the +first element, and so on. It is impossible to add more elements to the +array, because it has room for only as many elements as you declared. +(Some languages allow arbitrary starting and ending indices, +e.g., @samp{15 ..
27}, but the size of the array is still fixed when +the array is declared.) + +A contiguous array of four elements might look like this, +conceptually, if the element values are eight, @code{"foo"}, +@code{""} and 30: + +@iftex +@c from Karl Berry, much thanks for the help. +@tex +\bigskip % space above the table (about 1 linespace) +\offinterlineskip +\newdimen\width \width = 1.5cm +\newdimen\hwidth \hwidth = 4\width \advance\hwidth by 2pt % 5 * 0.4pt +\centerline{\vbox{ +\halign{\strut\hfil\ignorespaces#&&\vrule#&\hbox to\width{\hfil#\unskip\hfil}\cr +\noalign{\hrule width\hwidth} + &&{\tt 8} &&{\tt "foo"} &&{\tt ""} &&{\tt 30} &&\quad value\cr +\noalign{\hrule width\hwidth} +\noalign{\smallskip} + &\omit&0&\omit &1 &\omit&2 &\omit&3 &\omit&\quad index\cr +} +}} +@end tex +@end iftex +@ifinfo +@example ++---------+---------+--------+---------+ +| 8 | "foo" | "" | 30 | @r{value} ++---------+---------+--------+---------+ + 0 1 2 3 @r{index} +@end example +@end ifinfo + +@noindent +Only the values are stored; the indices are implicit from the order of +the values. Eight is the value at index zero, because eight appears in the +position with zero elements before it. + +@cindex arrays, definition of +@cindex associative arrays +@cindex arrays, associative +Arrays in @code{awk} are different: they are @dfn{associative}. This means +that each array is a collection of pairs: an index, and its corresponding +array element value: + +@example +@r{Element} 4 @r{Value} 30 +@r{Element} 2 @r{Value} "foo" +@r{Element} 1 @r{Value} 8 +@r{Element} 3 @r{Value} "" +@end example + +@noindent +We have shown the pairs in jumbled order because their order is irrelevant. + +One advantage of associative arrays is that new pairs can be added +at any time. For example, suppose we add to the above array a tenth element +whose value is @w{@code{"number ten"}}. 
The result is this: + +@example +@r{Element} 10 @r{Value} "number ten" +@r{Element} 4 @r{Value} 30 +@r{Element} 2 @r{Value} "foo" +@r{Element} 1 @r{Value} 8 +@r{Element} 3 @r{Value} "" +@end example + +@noindent +@cindex sparse arrays +@cindex arrays, sparse +Now the array is @dfn{sparse}, which just means some indices are missing: +it has elements 1--4 and 10, but doesn't have elements 5, 6, 7, 8, or 9. +@c ok, I should spell out the above, but ... + +Another consequence of associative arrays is that the indices don't +have to be positive integers. Any number, or even a string, can be +an index. For example, here is an array which translates words from +English into French: + +@example +@r{Element} "dog" @r{Value} "chien" +@r{Element} "cat" @r{Value} "chat" +@r{Element} "one" @r{Value} "un" +@r{Element} 1 @r{Value} "un" +@end example + +@noindent +Here we decided to translate the number one in both spelled-out and +numeric form---thus illustrating that a single array can have both +numbers and strings as indices. +(In fact, array subscripts are always strings; this is discussed +in more detail in +@ref{Numeric Array Subscripts, ,Using Numbers to Subscript Arrays}.) + +@cindex Array subscripts and @code{IGNORECASE} +@cindex @code{IGNORECASE} and array subscripts +@vindex IGNORECASE +The value of @code{IGNORECASE} has no effect upon array subscripting. +You must use the exact same string value to retrieve an array element +as you used to store it. + +When @code{awk} creates an array for you, e.g., with the @code{split} +built-in function, +that array's indices are consecutive integers starting at one. +(@xref{String Functions, ,Built-in Functions for String Manipulation}.) + +@node Reference to Elements, Assigning Elements, Array Intro, Arrays +@section Referring to an Array Element +@cindex array reference +@cindex element of array +@cindex reference to array + +The principal way of using an array is to refer to one of its elements. 
+An array reference is an expression which looks like this: + +@example +@var{array}[@var{index}] +@end example + +@noindent +Here, @var{array} is the name of an array. The expression @var{index} is +the index of the element of the array that you want. + +The value of the array reference is the current value of that array +element. For example, @code{foo[4.3]} is an expression for the element +of array @code{foo} at index @samp{4.3}. + +If you refer to an array element that has no recorded value, the value +of the reference is @code{""}, the null string. This includes elements +to which you have not assigned any value, and elements that have been +deleted (@pxref{Delete, ,The @code{delete} Statement}). Such a reference +automatically creates that array element, with the null string as its value. +(In some cases, this is unfortunate, because it might waste memory inside +@code{awk}.) + +@cindex arrays, presence of elements +@cindex arrays, the @code{in} operator +You can find out if an element exists in an array at a certain index with +the expression: + +@example +@var{index} in @var{array} +@end example + +@noindent +This expression tests whether or not the particular index exists, +without the side effect of creating that element if it is not present. +The expression has the value one (true) if @code{@var{array}[@var{index}]} +exists, and zero (false) if it does not exist. + +For example, to test whether the array @code{frequencies} contains the +index @samp{2}, you could write this statement: + +@example +if (2 in frequencies) + print "Subscript 2 is present." +@end example + +Note that this is @emph{not} a test of whether or not the array +@code{frequencies} contains an element whose @emph{value} is two. +(There is no way to do that except to scan all the elements.) Also, this +@emph{does not} create @code{frequencies[2]}, while the following +(incorrect) alternative would do so: + +@example +if (frequencies[2] != "") + print "Subscript 2 is present." 
+@end example + +@node Assigning Elements, Array Example, Reference to Elements, Arrays +@section Assigning Array Elements +@cindex array assignment +@cindex element assignment + +Array elements are lvalues: they can be assigned values just like +@code{awk} variables: + +@example +@var{array}[@var{subscript}] = @var{value} +@end example + +@noindent +Here @var{array} is the name of your array. The expression +@var{subscript} is the index of the element of the array that you want +to assign a value. The expression @var{value} is the value you are +assigning to that element of the array. + +@node Array Example, Scanning an Array, Assigning Elements, Arrays +@section Basic Array Example + +The following program takes a list of lines, each beginning with a line +number, and prints them out in order of line number. The line numbers are +not in order, however, when they are first read: they are scrambled. This +program sorts the lines by making an array using the line numbers as +subscripts. It then prints out the lines in sorted order of their numbers. +It is a very simple program, and gets confused if it encounters repeated +numbers, gaps, or lines that don't begin with a number. + +@example +@c file eg/misc/arraymax.awk +@{ + if ($1 > max) + max = $1 + arr[$1] = $0 +@} + +END @{ + for (x = 1; x <= max; x++) + print arr[x] +@} +@c endfile +@end example + +The first rule keeps track of the largest line number seen so far; +it also stores each line into the array @code{arr}, at an index that +is the line's number. + +The second rule runs after all the input has been read, to print out +all the lines. + +When this program is run with the following input: + +@example +@group +@c file eg/misc/arraymax.data +5 I am the Five man +2 Who are you? The new number two! +4 . . . And four on the floor +1 Who is number one? +3 I three you. +@c endfile +@end group +@end example + +@noindent +its output is this: + +@example +1 Who is number one? +2 Who are you? The new number two! 
+3 I three you. +4 . . . And four on the floor +5 I am the Five man +@end example + +If a line number is repeated, the last line with a given number overrides +the others. + +Gaps in the line numbers can be handled with an easy improvement to the +program's @code{END} rule: + +@example +END @{ + for (x = 1; x <= max; x++) + if (x in arr) + print arr[x] +@} +@end example + +@node Scanning an Array, Delete, Array Example, Arrays +@section Scanning All Elements of an Array +@cindex @code{for (x in @dots{})} +@cindex arrays, special @code{for} statement +@cindex scanning an array + +In programs that use arrays, you often need a loop that executes +once for each element of an array. In other languages, where arrays are +contiguous and indices are limited to positive integers, this is +easy: you can +find all the valid indices by counting from the lowest index +up to the highest. This +technique won't do the job in @code{awk}, since any number or string +can be an array index. So @code{awk} has a special kind of @code{for} +statement for scanning an array: + +@example +for (@var{var} in @var{array}) + @var{body} +@end example + +@noindent +This loop executes @var{body} once for each index in @var{array} that your +program has previously used, with the +variable @var{var} set to that index. + +Here is a program that uses this form of the @code{for} statement. The +first rule scans the input records and notes which words appear (at +least once) in the input, by storing a one into the array @code{used} with +the word as index. The second rule scans the elements of @code{used} to +find all the distinct words that appear in the input. It prints each +word that is more than 10 characters long, and also prints the number of +such words. @xref{String Functions, ,Built-in Functions for String Manipulation}, for more information +on the built-in function @code{length}. + +@example +# Record a 1 for each word that is used at least once. 
+@{ + for (i = 1; i <= NF; i++) + used[$i] = 1 +@} + +# Find number of distinct words more than 10 characters long. +END @{ + for (x in used) + if (length(x) > 10) @{ + ++num_long_words + print x + @} + print num_long_words, "words longer than 10 characters" +@} +@end example + +@noindent +@xref{Word Sorting, ,Generating Word Usage Counts}, +for a more detailed example of this type. + +The order in which elements of the array are accessed by this statement +is determined by the internal arrangement of the array elements within +@code{awk} and cannot be controlled or changed. This can lead to +problems if new elements are added to @var{array} by statements in +the loop body; you cannot predict whether or not the @code{for} loop will +reach them. Similarly, changing @var{var} inside the loop may produce +strange results. It is best to avoid such things. + +@node Delete, Numeric Array Subscripts, Scanning an Array, Arrays +@section The @code{delete} Statement +@cindex @code{delete} statement +@cindex deleting elements of arrays +@cindex removing elements of arrays +@cindex arrays, deleting an element + +You can remove an individual element of an array using the @code{delete} +statement: + +@example +delete @var{array}[@var{index}] +@end example + +Once you have deleted an array element, you can no longer obtain any +value the element once had. It is as if you had never referred +to it and had never given it any value. + +Here is an example of deleting elements in an array: + +@example +for (i in frequencies) + delete frequencies[i] +@end example + +@noindent +This example removes all the elements from the array @code{frequencies}. 
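The loop above can be checked with a short, self-contained sketch. The sample data and the counter @code{n} are invented for illustration; only the @code{delete} loop itself comes from the text. The program is wrapped in a shell command so that it can be run directly:

```shell
out=$(awk 'BEGIN {
    # Hypothetical sample data, not taken from the manual.
    frequencies["a"] = 3
    frequencies["b"] = 1
    frequencies["c"] = 7

    # Remove every element, one index at a time.
    for (i in frequencies)
        delete frequencies[i]

    # Count whatever is left by scanning the array again.
    n = 0
    for (i in frequencies)
        n++
    print n, "elements left"
}')
printf '%s\n' "$out"
```

The second loop has nothing to visit once the array is empty, so the count it reports stays at zero.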
+ +If you delete an element, a subsequent @code{for} statement to scan the array +will not report that element, and the @code{in} operator to check for +the presence of that element will return zero (i.e.@: false): + +@example +delete foo[4] +if (4 in foo) + print "This will never be printed" +@end example + +It is important to note that deleting an element is @emph{not} the +same as assigning it a null value (the empty string, @code{""}). + +@example +foo[4] = "" +if (4 in foo) + print "This is printed, even though foo[4] is empty" +@end example + +It is not an error to delete an element that does not exist. + +@cindex arrays, deleting entire contents +@cindex deleting entire arrays +@cindex differences between @code{gawk} and @code{awk} +You can delete all the elements of an array with a single statement, +by leaving off the subscript in the @code{delete} statement. + +@example +delete @var{array} +@end example + +This ability is a @code{gawk} extension; it is not available in +compatibility mode (@pxref{Options, ,Command Line Options}). + +Using this version of the @code{delete} statement is about three times +more efficient than the equivalent loop that deletes each element one +at a time. + +@cindex portability issues +The following statement provides a portable, but non-obvious way to clear +out an array. + +@cindex Brennan, Michael +@example +@group +# thanks to Michael Brennan for pointing this out +split("", array) +@end group +@end example + +The @code{split} function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}) +clears out the target array first. This call asks it to split +apart the null string. Since there is no data to split out, the +function simply clears the array and then returns. + +@node Numeric Array Subscripts, Uninitialized Subscripts, Delete, Arrays +@section Using Numbers to Subscript Arrays + +An important aspect of arrays to remember is that @emph{array subscripts +are always strings}. 
If you use a numeric value as a subscript, +it will be converted to a string value before it is used for subscripting +(@pxref{Conversion, ,Conversion of Strings and Numbers}). + +@cindex conversions, during subscripting +@cindex numbers, used as subscripts +@vindex CONVFMT +This means that the value of the built-in variable @code{CONVFMT} can potentially +affect how your program accesses elements of an array. For example: + +@example +xyz = 12.153 +data[xyz] = 1 +CONVFMT = "%2.2f" +@group +if (xyz in data) + printf "%s is in data\n", xyz +else + printf "%s is not in data\n", xyz +@end group +@end example + +@noindent +This prints @samp{12.15 is not in data}. The first statement gives +@code{xyz} a numeric value. Assigning to +@code{data[xyz]} subscripts @code{data} with the string value @code{"12.153"} +(using the default conversion value of @code{CONVFMT}, @code{"%.6g"}), +and assigns one to @code{data["12.153"]}. The program then changes +the value of @code{CONVFMT}. The test @samp{(xyz in data)} generates a new +string value from @code{xyz}, this time @code{"12.15"}, since the value of +@code{CONVFMT} only allows two digits after the decimal point. This test fails, +since @code{"12.15"} is a different string from @code{"12.153"}. + +According to the rules for conversions +(@pxref{Conversion, ,Conversion of Strings and Numbers}), integer +values are always converted to strings as integers, no matter what the +value of @code{CONVFMT} may happen to be. So the usual case of: + +@example +for (i = 1; i <= maxsub; i++) + @i{do something with} array[i] +@end example + +@noindent +will work, no matter what the value of @code{CONVFMT}. + +Like many things in @code{awk}, the majority of the time things work +as you would expect them to work. But it is useful to have a precise +knowledge of the actual rules, since sometimes they can have a subtle +effect on your programs.
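The integer rule can be seen in a small sketch. The array name @code{data} and the messages are illustrative assumptions; the behavior shown is only what the conversion rules above describe. A whole-number subscript becomes the same string whatever @code{CONVFMT} holds, while two strings that merely look numerically equal remain distinct subscripts.

```shell
out=$(awk 'BEGIN {
    data[1] = "one"                 # the number 1 is stored under the string "1"

    if ("1" in data)                # same string, so the element is found
        print "string 1 matches"
    if (!("01" in data))            # "01" is a different string from "1"
        print "string 01 does not match"

    CONVFMT = "%2.2f"               # matters only for non-integer values
    if (1 in data)
        print "integer 1 still matches"
}')
printf '%s\n' "$out"
```

Keeping subscripts to whole numbers or explicit strings sidesteps any dependence on @code{CONVFMT}.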
+ +@node Uninitialized Subscripts, Multi-dimensional, Numeric Array Subscripts, Arrays +@section Using Uninitialized Variables as Subscripts + +@cindex uninitialized variables, as array subscripts +@cindex array subscripts, uninitialized variables +Suppose you want to print your input data in reverse order. +A reasonable attempt at a program to do so (with some test +data) might look like this: + +@example +@group +$ echo 'line 1 +> line 2 +> line 3' | awk '@{ l[lines] = $0; ++lines @} +> END @{ +> for (i = lines-1; i >= 0; --i) +> print l[i] +> @}' +@print{} line 3 +@print{} line 2 +@end group +@end example + +Unfortunately, the very first line of input data did not come out in the +output! + +At first glance, this program should have worked. The variable @code{lines} +is uninitialized, and uninitialized variables have the numeric value zero. +So, the value of @code{l[0]} should have been printed. + +The issue here is that subscripts for @code{awk} arrays are @strong{always} +strings. And uninitialized variables, when used as strings, have the +value @code{""}, not zero. Thus, @samp{line 1} ended up stored in +@code{l[""]}. + +The following version of the program works correctly: + +@example +@{ l[lines++] = $0 @} +END @{ + for (i = lines - 1; i >= 0; --i) + print l[i] +@} +@end example + +Here, the @samp{++} forces @code{lines} to be numeric, thus making +the ``old value'' numeric zero, which is then converted to @code{"0"} +as the array subscript. + +@cindex null string, as array subscript +@cindex dark corner +As we have just seen, even though it is somewhat unusual, the null string +(@code{""}) is a valid array subscript (d.c.). If @samp{--lint} is provided +on the command line (@pxref{Options, ,Command Line Options}), +@code{gawk} will warn about the use of the null string as a subscript. 
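The effect can also be observed directly with the @code{in} operator. In this sketch the names @code{l} and @code{n} are arbitrary, and @code{n} is deliberately never assigned a value:

```shell
out=$(awk 'BEGIN {
    l[n] = "first"                  # n is uninitialized, so the subscript is ""
    if ("" in l)
        print "stored under the null string"
    if (!(0 in l))
        print "nothing stored under 0"
}')
printf '%s\n' "$out"
```

Writing @code{l[n + 0]} or @code{l[n++]} instead would force a numeric value and hence the subscript @code{"0"}.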
+ +@node Multi-dimensional, Multi-scanning, Uninitialized Subscripts, Arrays +@section Multi-dimensional Arrays + +@cindex subscripts in arrays +@cindex arrays, multi-dimensional subscripts +@cindex multi-dimensional subscripts +A multi-dimensional array is an array in which an element is identified +by a sequence of indices, instead of a single index. For example, a +two-dimensional array requires two indices. The usual way (in most +languages, including @code{awk}) to refer to an element of a +two-dimensional array named @code{grid} is with +@code{grid[@var{x},@var{y}]}. + +@vindex SUBSEP +Multi-dimensional arrays are supported in @code{awk} through +concatenation of indices into one string. What happens is that +@code{awk} converts the indices into strings +(@pxref{Conversion, ,Conversion of Strings and Numbers}) and +concatenates them together, with a separator between them. This creates +a single string that describes the values of the separate indices. The +combined string is used as a single index into an ordinary, +one-dimensional array. The separator used is the value of the built-in +variable @code{SUBSEP}. + +For example, suppose we evaluate the expression @samp{foo[5,12] = "value"} +when the value of @code{SUBSEP} is @code{"@@"}. The numbers five and 12 are +converted to strings and +concatenated with an @samp{@@} between them, yielding @code{"5@@12"}; thus, +the array element @code{foo["5@@12"]} is set to @code{"value"}. + +Once the element's value is stored, @code{awk} has no record of whether +it was stored with a single index or a sequence of indices. The two +expressions @samp{foo[5,12]} and @w{@samp{foo[5 SUBSEP 12]}} are always +equivalent. + +The default value of @code{SUBSEP} is the string @code{"\034"}, +which contains a non-printing character that is unlikely to appear in an +@code{awk} program or in most input data. 
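As a minimal sketch of this equivalence, the following sets @code{SUBSEP} to @samp{@@} as in the example above and reads the same element back in two other ways. The array name and value are taken from the text; wrapping them in a @code{BEGIN} rule is just for demonstration.

```shell
out=$(awk 'BEGIN {
    SUBSEP = "@"
    foo[5, 12] = "value"            # stored under the single string "5@12"
    print foo["5@12"]               # direct use of the combined subscript
    print foo[5 SUBSEP 12]          # explicit concatenation with SUBSEP
}')
printf '%s\n' "$out"
```

Both @code{print} statements refer to the one element that the comma form created.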
+ +The usefulness of choosing an unlikely character comes from the fact +that index values that contain a string matching @code{SUBSEP} lead to +combined strings that are ambiguous. Suppose that @code{SUBSEP} were +@code{"@@"}; then @w{@samp{foo["a@@b", "c"]}} and @w{@samp{foo["a", +"b@@c"]}} would be indistinguishable because both would actually be +stored as @samp{foo["a@@b@@c"]}. + +You can test whether a particular index-sequence exists in a +``multi-dimensional'' array with the same operator @samp{in} used for single +dimensional arrays. Instead of a single index as the left-hand operand, +write the whole sequence of indices, separated by commas, in +parentheses: + +@example +(@var{subscript1}, @var{subscript2}, @dots{}) in @var{array} +@end example + +The following example treats its input as a two-dimensional array of +fields; it rotates this array 90 degrees clockwise and prints the +result. It assumes that all lines have the same number of +elements. + +@example +@group +awk '@{ + if (max_nf < NF) + max_nf = NF + max_nr = NR + for (x = 1; x <= NF; x++) + vector[x, NR] = $x +@} +@end group + +@group +END @{ + for (x = 1; x <= max_nf; x++) @{ + for (y = max_nr; y >= 1; --y) + printf("%s ", vector[x, y]) + printf("\n") + @} +@}' +@end group +@end example + +@noindent +When given the input: + +@example +@group +1 2 3 4 5 6 +2 3 4 5 6 1 +3 4 5 6 1 2 +4 5 6 1 2 3 +@end group +@end example + +@noindent +it produces: + +@example +@group +4 3 2 1 +5 4 3 2 +6 5 4 3 +1 6 5 4 +2 1 6 5 +3 2 1 6 +@end group +@end example + +@node Multi-scanning, , Multi-dimensional, Arrays +@section Scanning Multi-dimensional Arrays + +There is no special @code{for} statement for scanning a +``multi-dimensional'' array; there cannot be one, because in truth there +are no multi-dimensional arrays or elements; there is only a +multi-dimensional @emph{way of accessing} an array. 
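One way to see this is to store a single element with two subscripts and then scan the array with the ordinary @code{for} statement. In this sketch (the array name @code{grid}, the counter @code{n} and the flag @code{same} are illustrative) the loop visits exactly one index, and that index is the combined string:

```shell
out=$(awk 'BEGIN {
    grid[1, "foo"] = "x"
    for (k in grid) {
        n++                               # count the indices visited
        if (k == (1 SUBSEP "foo"))        # the index is the combined string
            same = "yes"
    }
    print n, same
}')
printf '%s\n' "$out"
```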
+ +However, if your program has an array that is always accessed as +multi-dimensional, you can get the effect of scanning it by combining +the scanning @code{for} statement +(@pxref{Scanning an Array, ,Scanning All Elements of an Array}) with the +@code{split} built-in function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). +It works like this: + +@example +for (combined in array) @{ + split(combined, separate, SUBSEP) + @dots{} +@} +@end example + +@noindent +This sets @code{combined} to +each concatenated, combined index in the array, and splits it +into the individual indices by breaking it apart where the value of +@code{SUBSEP} appears. The split-out indices become the elements of +the array @code{separate}. + +Thus, suppose you have previously stored a value in @code{array[1, "foo"]}; +then an element with index @code{"1\034foo"} exists in +@code{array}. (Recall that the default value of @code{SUBSEP} is +the character with code 034.) Sooner or later the @code{for} statement +will find that index and do an iteration with @code{combined} set to +@code{"1\034foo"}. Then the @code{split} function is called as +follows: + +@example +split("1\034foo", separate, "\034") +@end example + +@noindent +The result of this is to set @code{separate[1]} to @code{"1"} and +@code{separate[2]} to @code{"foo"}. Presto, the original sequence of +separate indices has been recovered. + +@node Built-in, User-defined, Arrays, Top +@chapter Built-in Functions + +@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!! +@cindex built-in functions +@dfn{Built-in} functions are functions that are always available for +your @code{awk} program to call. This chapter defines all the built-in +functions in @code{awk}; some of them are mentioned in other sections, +but they are summarized here for your convenience. (You can also define +new functions yourself. @xref{User-defined, ,User-defined Functions}.) 
+ +@menu +* Calling Built-in:: How to call built-in functions. +* Numeric Functions:: Functions that work with numbers, including + @code{int}, @code{sin} and @code{rand}. +* String Functions:: Functions for string manipulation, such as + @code{split}, @code{match}, and + @code{sprintf}. +* I/O Functions:: Functions for files and shell commands. +* Time Functions:: Functions for dealing with time stamps. +@end menu + +@node Calling Built-in, Numeric Functions, Built-in, Built-in +@section Calling Built-in Functions + +To call a built-in function, write the name of the function followed +by arguments in parentheses. For example, @samp{atan2(y + z, 1)} +is a call to the function @code{atan2}, with two arguments. + +Whitespace is ignored between the built-in function name and the +open-parenthesis, but we recommend that you avoid using whitespace +there. User-defined functions do not permit whitespace in this way, and +you will find it easier to avoid mistakes by following a simple +convention which always works: no whitespace after a function name. + +@cindex differences between @code{gawk} and @code{awk} +Each built-in function accepts a certain number of arguments. +In some cases, arguments can be omitted. The defaults for omitted +arguments vary from function to function and are described under the +individual functions. In some @code{awk} implementations, extra +arguments given to built-in functions are ignored. However, in @code{gawk}, +it is a fatal error to give extra arguments to a built-in function. + +When a function is called, expressions that create the function's actual +parameters are evaluated completely before the function call is performed. +For example, in the code fragment: + +@example +i = 4 +j = sqrt(i++) +@end example + +@noindent +the variable @code{i} is set to five before @code{sqrt} is called +with a value of four for its actual parameter. 
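Wrapping the fragment in a @code{BEGIN} rule and printing both variables makes the order of events visible. This is only a demonstration harness around the two statements from the text:

```shell
out=$(awk 'BEGIN {
    i = 4
    j = sqrt(i++)       # sqrt receives 4; i is then 5
    print i, j
}')
printf '%s\n' "$out"
```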
+ +@cindex evaluation, order of +@cindex order of evaluation +The order of evaluation of the expressions used for the function's +parameters is undefined. Thus, you should not write programs that +assume that parameters are evaluated from left to right or from +right to left. For example, + +@example +i = 5 +j = atan2(++i, i *= 2) +@end example + +If the order of evaluation is left to right, then @code{i} first becomes +six, and then 12, and @code{atan2} is called with the two arguments six +and 12. But if the order of evaluation is right to left, @code{i} +first becomes 10, and then 11, and @code{atan2} is called with the +two arguments 11 and 10. + +@node Numeric Functions, String Functions, Calling Built-in, Built-in +@section Numeric Built-in Functions + +Here is a full list of built-in functions that work with numbers. +Optional parameters are enclosed in square brackets (``['' and ``]''). + +@table @code +@item int(@var{x}) +@findex int +This produces the nearest integer to @var{x}, located between @var{x} and zero +and truncated toward zero. + +For example, @code{int(3)} is three, @code{int(3.9)} is three, @code{int(-3.9)} +is @minus{}3, and @code{int(-3)} is @minus{}3 as well. + +@item sqrt(@var{x}) +@findex sqrt +This gives you the positive square root of @var{x}. It reports an error +if @var{x} is negative. Thus, @code{sqrt(4)} is two. + +@item exp(@var{x}) +@findex exp +This gives you the exponential of @var{x} (@code{e ^ @var{x}}), or reports +an error if @var{x} is out of range. The range of values @var{x} can have +depends on your machine's floating point representation. + +@item log(@var{x}) +@findex log +This gives you the natural logarithm of @var{x}, if @var{x} is positive; +otherwise, it reports an error. + +@item sin(@var{x}) +@findex sin +This gives you the sine of @var{x}, with @var{x} in radians. + +@item cos(@var{x}) +@findex cos +This gives you the cosine of @var{x}, with @var{x} in radians.
+ +@item atan2(@var{y}, @var{x}) +@findex atan2 +This gives you the arctangent of @code{@var{y} / @var{x}} in radians. + +@item rand() +@findex rand +This gives you a random number. The values of @code{rand} are +uniformly-distributed between zero and one. +The value could be zero but is never one. + +Often you want random integers instead. Here is a user-defined function +you can use to obtain a random non-negative integer less than @var{n}: + +@example +function randint(n) @{ + return int(n * rand()) +@} +@end example + +@noindent +The multiplication produces a random real number greater than or equal to zero and less +than @code{n}. We then make it an integer (using @code{int}) between zero +and @code{n} @minus{} 1, inclusive. + +Here is an example where a similar function is used to produce +random integers between one and @var{n}. This program +prints a new random number for each input record. + +@example +@group +awk ' +# Function to roll a simulated die. +function roll(n) @{ return 1 + int(rand() * n) @} +@end group + +@group +# Roll 3 six-sided dice and +# print total number of points. +@{ + printf("%d points\n", + roll(6)+roll(6)+roll(6)) +@}' +@end group +@end example + +@cindex seed for random numbers +@cindex random numbers, seed of +@comment MAWK uses a different seed each time. +@strong{Caution:} In most @code{awk} implementations, including @code{gawk}, +@code{rand} starts generating numbers from the same +starting number, or @dfn{seed}, each time you run @code{awk}. Thus, +a program will generate the same results each time you run it. +The numbers are random within one @code{awk} run, but predictable +from run to run. This is convenient for debugging, but if you want +a program to do different things each time it is used, you must change +the seed to a value that will be different in each run. To do this, +use @code{srand}.
+ +@item srand(@r{[}@var{x}@r{]}) +@findex srand +The function @code{srand} sets the starting point, or seed, +for generating random numbers to the value @var{x}. + +Each seed value leads to a particular sequence of random +numbers.@footnote{Computer generated random numbers really are not truly +random. They are technically known as ``pseudo-random.'' This means +that while the numbers in a sequence appear to be random, you can in +fact generate the same sequence of random numbers over and over again.} +Thus, if you set the seed to the same value a second time, you will get +the same sequence of random numbers again. + +If you omit the argument @var{x}, as in @code{srand()}, then the current +date and time of day are used for a seed. This is the way to get random +numbers that are truly unpredictable. + +The return value of @code{srand} is the previous seed. This makes it +easy to keep track of the seeds for use in consistently reproducing +sequences of random numbers. +@end table + +@node String Functions, I/O Functions, Numeric Functions, Built-in +@section Built-in Functions for String Manipulation + +The functions in this section look at or change the text of one or more +strings. +Optional parameters are enclosed in square brackets (``['' and ``]''). + +@table @code +@item index(@var{in}, @var{find}) +@findex index +This searches the string @var{in} for the first occurrence of the string +@var{find}, and returns the position in characters where that occurrence +begins in the string @var{in}. For example: + +@example +$ awk 'BEGIN @{ print index("peanut", "an") @}' +@print{} 3 +@end example + +@noindent +If @var{find} is not found, @code{index} returns zero. +(Remember that string indices in @code{awk} start at one.) + +@item length(@r{[}@var{string}@r{]}) +@findex length +This gives you the number of characters in @var{string}. If +@var{string} is a number, the length of the digit string representing +that number is returned. 
For example, @code{length("abcde")} is five. By +contrast, @code{length(15 * 35)} works out to three. How? Well, 15 * 35 = +525, and 525 is then converted to the string @code{"525"}, which has +three characters. + +If no argument is supplied, @code{length} returns the length of @code{$0}. + +@cindex historical features +@cindex portability issues +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +In older versions of @code{awk}, you could call the @code{length} function +without any parentheses. Doing so is marked as ``deprecated'' in the +POSIX standard. This means that while you can do this in your +programs, it is a feature that can eventually be removed from a future +version of the standard. Therefore, for maximal portability of your +@code{awk} programs, you should always supply the parentheses. + +@item match(@var{string}, @var{regexp}) +@findex match +The @code{match} function searches the string, @var{string}, for the +longest, leftmost substring matched by the regular expression, +@var{regexp}. It returns the character position, or @dfn{index}, of +where that substring begins (one, if it starts at the beginning of +@var{string}). If no match is found, it returns zero. + +@vindex RSTART +@vindex RLENGTH +The @code{match} function sets the built-in variable @code{RSTART} to +the index. It also sets the built-in variable @code{RLENGTH} to the +length in characters of the matched substring. If no match is found, +@code{RSTART} is set to zero, and @code{RLENGTH} to @minus{}1. + +For example: + +@example +@group +@c file eg/misc/findpat.sh +awk '@{ + if ($1 == "FIND") + regex = $2 + else @{ + where = match($0, regex) + if (where != 0) + print "Match of", regex, "found at", \ + where, "in", $0 + @} +@}' +@c endfile +@end group +@end example + +@noindent +This program looks for lines that match the regular expression stored in +the variable @code{regex}. This regular expression can be changed. 
If the +first word on a line is @samp{FIND}, @code{regex} is changed to be the +second word on that line. Therefore, given: + +@example +@c file eg/misc/findpat.data +FIND ru+n +My program runs +but not very quickly +FIND Melvin +JF+KM +This line is property of Reality Engineering Co. +Melvin was here. +@c endfile +@end example + +@noindent +@code{awk} prints: + +@example +Match of ru+n found at 12 in My program runs +Match of Melvin found at 1 in Melvin was here. +@end example + +@item split(@var{string}, @var{array} @r{[}, @var{fieldsep}@r{]}) +@findex split +This divides @var{string} into pieces separated by @var{fieldsep}, +and stores the pieces in @var{array}. The first piece is stored in +@code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so +forth. The string value of the third argument, @var{fieldsep}, is +a regexp describing where to split @var{string} (much as @code{FS} can +be a regexp describing where to split input records). If +the @var{fieldsep} is omitted, the value of @code{FS} is used. +@code{split} returns the number of elements created. + +The @code{split} function splits strings into pieces in a +manner similar to the way input lines are split into fields. For example: + +@example +split("cul-de-sac", a, "-") +@end example + +@noindent +splits the string @samp{cul-de-sac} into three fields using @samp{-} as the +separator. It sets the contents of the array @code{a} as follows: + +@example +a[1] = "cul" +a[2] = "de" +a[3] = "sac" +@end example + +@noindent +The value returned by this call to @code{split} is three. + +As with input field-splitting, when the value of @var{fieldsep} is +@w{@code{" "}}, leading and trailing whitespace is ignored, and the elements +are separated by runs of whitespace. + +@cindex differences between @code{gawk} and @code{awk} +Also as with input field-splitting, if @var{fieldsep} is the null string, each +individual character in the string is split into its own array element. 
+(This is a @code{gawk}-specific extension.) + +@cindex dark corner +Recent implementations of @code{awk}, including @code{gawk}, allow +the third argument to be a regexp constant (@code{/abc/}), as well as a +string (d.c.). The POSIX standard allows this as well. + +Before splitting the string, @code{split} deletes any previously existing +elements in the array @var{array} (d.c.). + +@item sprintf(@var{format}, @var{expression1},@dots{}) +@findex sprintf +This returns (without printing) the string that @code{printf} would +have printed out with the same arguments +(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}). +For example: + +@example +sprintf("pi = %.2f (approx.)", 22/7) +@end example + +@noindent +returns the string @w{@code{"pi = 3.14 (approx.)"}}. + +@ignore +2e: For sub, gsub, and gensub, either here or in the "how much matches" + section, we need some explanation that it is possible to match the + null string when using closures like *. E.g., + + $ echo abc | awk '{ gsub(/m*/, "X"); print }' + @print{} XaXbXcX + + Although this makes a certain amount of sense, it can be very + surprising. +@end ignore + +@item sub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]}) +@findex sub +The @code{sub} function alters the value of @var{target}. +It searches this value, which is treated as a string, for the +leftmost longest substring matched by the regular expression, @var{regexp}, +extending this match as far as possible. Then the entire string is +changed by replacing the matched text with @var{replacement}. +The modified string becomes the new value of @var{target}. + +This function is peculiar because @var{target} is not simply +used to compute a value, and not just any expression will do: it +must be a variable, field or array element, so that @code{sub} can +store a modified value there. If this argument is omitted, then the +default is to use and alter @code{$0}. 
+ +For example: + +@example +str = "water, water, everywhere" +sub(/at/, "ith", str) +@end example + +@noindent +sets @code{str} to @w{@code{"wither, water, everywhere"}}, by replacing the +leftmost, longest occurrence of @samp{at} with @samp{ith}. + +The @code{sub} function returns the number of substitutions made (either +one or zero). + +If the special character @samp{&} appears in @var{replacement}, it +stands for the precise substring that was matched by @var{regexp}. (If +the regexp can match more than one string, then this precise substring +may vary.) For example: + +@example +awk '@{ sub(/candidate/, "& and his wife"); print @}' +@end example + +@noindent +changes the first occurrence of @samp{candidate} to @samp{candidate +and his wife} on each input line. + +Here is another example: + +@example +awk 'BEGIN @{ + str = "daabaaa" + sub(/a+/, "c&c", str) + print str +@}' +@print{} dcaacbaaa +@end example + +@noindent +This shows how @samp{&} can represent a non-constant string, and also +illustrates the ``leftmost, longest'' rule in regexp matching +(@pxref{Leftmost Longest, ,How Much Text Matches?}). + +The effect of this special character (@samp{&}) can be turned off by putting a +backslash before it in the string. As usual, to insert one backslash in +the string, you must write two backslashes. Therefore, write @samp{\\&} +in a string constant to include a literal @samp{&} in the replacement. +For example, here is how to replace the first @samp{|} on each line with +an @samp{&}: + +@example +awk '@{ sub(/\|/, "\\&"); print @}' +@end example + +@cindex @code{sub}, third argument of +@cindex @code{gsub}, third argument of +@strong{Note:} As mentioned above, the third argument to @code{sub} must +be a variable, field or array reference. +Some versions of @code{awk} allow the third argument to +be an expression which is not an lvalue. 
In such a case, @code{sub} +would still search for the pattern and return zero or one, but the result of +the substitution (if any) would be thrown away because there is no place +to put it. Such versions of @code{awk} accept expressions like +this: + +@example +sub(/USA/, "United States", "the USA and Canada") +@end example + +@noindent +For historical compatibility, @code{gawk} will accept erroneous code, +such as in the above example. However, using any other non-changeable +object as the third parameter will cause a fatal error, and your program +will not run. + +@item gsub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]}) +@findex gsub +This is similar to the @code{sub} function, except @code{gsub} replaces +@emph{all} of the longest, leftmost, @emph{non-overlapping} matching +substrings it can find. The @samp{g} in @code{gsub} stands for +``global,'' which means replace everywhere. For example: + +@example +awk '@{ gsub(/Britain/, "United Kingdom"); print @}' +@end example + +@noindent +replaces all occurrences of the string @samp{Britain} with @samp{United +Kingdom} for all input records. + +The @code{gsub} function returns the number of substitutions made. If +the variable to be searched and altered, @var{target}, is +omitted, then the entire input record, @code{$0}, is used. + +As in @code{sub}, the characters @samp{&} and @samp{\} are special, +and the third argument must be an lvalue. +@end table + +@table @code +@item gensub(@var{regexp}, @var{replacement}, @var{how} @r{[}, @var{target}@r{]}) +@findex gensub +@code{gensub} is a general substitution function. Like @code{sub} and +@code{gsub}, it searches the target string @var{target} for matches of +the regular expression @var{regexp}. Unlike @code{sub} and +@code{gsub}, the modified string is returned as the result of the +function, and the original target string is @emph{not} changed. 
If +@var{how} is a string beginning with @samp{g} or @samp{G}, then it +replaces all matches of @var{regexp} with @var{replacement}. +Otherwise, @var{how} is a number indicating which match of @var{regexp} +to replace. If no @var{target} is supplied, @code{$0} is used instead. + +@code{gensub} provides an additional feature that is not available +in @code{sub} or @code{gsub}: the ability to specify components of +a regexp in the replacement text. This is done by using parentheses +in the regexp to mark the components, and then specifying @samp{\@var{n}} +in the replacement text, where @var{n} is a digit from one to nine. +For example: + +@example +@group +$ gawk ' +> BEGIN @{ +> a = "abc def" +> b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a) +> print b +> @}' +@print{} def abc +@end group +@end example + +@noindent +As described above for @code{sub}, you must type two backslashes in order +to get one into the string. + +In the replacement text, the sequence @samp{\0} represents the entire +matched text, as does the character @samp{&}. + +This example shows how you can use the third argument to control +which match of the regexp should be changed. + +@example +$ echo a b c a b c | +> gawk '@{ print gensub(/a/, "AA", 2) @}' +@print{} a b c AA b c +@end example + +In this case, @code{$0} is used as the default target string. +@code{gensub} returns the new string as its result, which is +passed directly to @code{print} for printing. + +If the @var{how} argument is a string that does not begin with @samp{g} or +@samp{G}, or if it is a number that is less than zero, only one +substitution is performed. + +@cindex differences between @code{gawk} and @code{awk} +@code{gensub} is a @code{gawk} extension; it is not available +in compatibility mode (@pxref{Options, ,Command Line Options}). 
+ +@item substr(@var{string}, @var{start} @r{[}, @var{length}@r{]}) +@findex substr +This returns a @var{length}-character-long substring of @var{string}, +starting at character number @var{start}. The first character of a +string is character number one. For example, +@code{substr("washington", 5, 3)} returns @code{"ing"}. + +If @var{length} is not present, this function returns the whole suffix of +@var{string} that begins at character number @var{start}. For example, +@code{substr("washington", 5)} returns @code{"ington"}. The whole +suffix is also returned +if @var{length} is greater than the number of characters remaining +in the string, counting from character number @var{start}. + +@strong{Note:} The string returned by @code{substr} @emph{cannot} be +assigned to. Thus, it is a mistake to attempt to change a portion of +a string, like this: + +@example +string = "abcdef" +# try to get "abCDEf", won't work +substr(string, 3, 3) = "CDE" +@end example + +@noindent +or to use @code{substr} as the third argument of @code{sub} or @code{gsub}: + +@example +gsub(/xyz/, "pdq", substr($0, 5, 20)) # WRONG +@end example + +@cindex case conversion +@cindex conversion of case +@item tolower(@var{string}) +@findex tolower +This returns a copy of @var{string}, with each upper-case character +in the string replaced with its corresponding lower-case character. +Non-alphabetic characters are left unchanged. For example, +@code{tolower("MiXeD cAsE 123")} returns @code{"mixed case 123"}. + +@item toupper(@var{string}) +@findex toupper +This returns a copy of @var{string}, with each lower-case character +in the string replaced with its corresponding upper-case character. +Non-alphabetic characters are left unchanged. For example, +@code{toupper("MiXeD cAsE 123")} returns @code{"MIXED CASE 123"}. +@end table + +@c fakenode --- for prepinfo +@subheading More About @samp{\} and @samp{&} with @code{sub}, @code{gsub} and @code{gensub} + +@cindex escape processing, @code{sub} et al. 
+When using @code{sub}, @code{gsub} or @code{gensub}, and trying to get literal +backslashes and ampersands into the replacement text, you need to remember +that there are several levels of @dfn{escape processing} going on. + +First, there is the @dfn{lexical} level, which is when @code{awk} reads +your program, and builds an internal copy of your program that can +be executed. + +Then there is the run-time level, when @code{awk} actually scans the +replacement string to determine what to generate. + +At both levels, @code{awk} looks for a defined set of characters that +can come after a backslash. At the lexical level, it looks for the +escape sequences listed in @ref{Escape Sequences}. +Thus, for every @samp{\} that @code{awk} will process at the run-time +level, you type two @samp{\}s at the lexical level. +When a character that is not valid for an escape sequence follows the +@samp{\}, Unix @code{awk} and @code{gawk} both simply remove the initial +@samp{\}, and put the following character into the string. Thus, for +example, @code{"a\qb"} is treated as @code{"aqb"}. + +At the run-time level, the various functions handle sequences of +@samp{\} and @samp{&} differently. The situation is (sadly) somewhat complex. + +Historically, the @code{sub} and @code{gsub} functions treated the +two-character sequence @samp{\&} specially; this sequence was replaced in +the generated text with a single @samp{&}. Any other @samp{\} within +the @var{replacement} string that did not precede an @samp{&} was passed +through unchanged. To illustrate with a table: + +@c Thanks to Karl Berry for help with the TeX stuff. +@tex +\vbox{\bigskip +% This table has lots of &'s and \'s, so unspecialize them. +\catcode`\& = \other \catcode`\\ = \other +% But then we need character for escape and tab. +@catcode`! = 4 +@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr + You type!@code{sub} sees!@code{sub} generates@cr +@hrulefill!@hrulefill!@hrulefill@cr + @code{\&}! 
@code{&}!the matched text@cr + @code{\\&}! @code{\&}!a literal @samp{&}@cr + @code{\\\&}! @code{\&}!a literal @samp{&}@cr +@code{\\\\&}! @code{\\&}!a literal @samp{\&}@cr +@code{\\\\\&}! @code{\\&}!a literal @samp{\&}@cr +@code{\\\\\\&}! @code{\\\&}!a literal @samp{\\&}@cr + @code{\\q}! @code{\q}!a literal @samp{\q}@cr +} +@bigskip} +@end tex +@ifinfo +@display + You type @code{sub} sees @code{sub} generates + -------- ---------- --------------- + @code{\&} @code{&} the matched text + @code{\\&} @code{\&} a literal @samp{&} + @code{\\\&} @code{\&} a literal @samp{&} + @code{\\\\&} @code{\\&} a literal @samp{\&} + @code{\\\\\&} @code{\\&} a literal @samp{\&} +@code{\\\\\\&} @code{\\\&} a literal @samp{\\&} + @code{\\q} @code{\q} a literal @samp{\q} +@end display +@end ifinfo + +@noindent +This table shows both the lexical level processing, where +an odd number of backslashes becomes an even number at the run time level, +and the run-time processing done by @code{sub}. +(For the sake of simplicity, the rest of the tables below only show the +case of even numbers of @samp{\}s entered at the lexical level.) + +The problem with the historical approach is that there is no way to get +a literal @samp{\} followed by the matched text. + +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +The 1992 POSIX standard attempted to fix this problem. The standard +says that @code{sub} and @code{gsub} look for either a @samp{\} or an @samp{&} +after the @samp{\}. If either one follows a @samp{\}, that character is +output literally. The interpretation of @samp{\} and @samp{&} then becomes +like this: + +@c thanks to Karl Berry for formatting this table +@tex +\vbox{\bigskip +% This table has lots of &'s and \'s, so unspecialize them. +\catcode`\& = \other \catcode`\\ = \other +% But then we need character for escape and tab. +@catcode`! 
= 4 +@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr + You type!@code{sub} sees!@code{sub} generates@cr +@hrulefill!@hrulefill!@hrulefill@cr + @code{&}! @code{&}!the matched text@cr + @code{\\&}! @code{\&}!a literal @samp{&}@cr +@code{\\\\&}! @code{\\&}!a literal @samp{\}, then the matched text@cr +@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr +} +@bigskip} +@end tex +@ifinfo +@display + You type @code{sub} sees @code{sub} generates + -------- ---------- --------------- + @code{&} @code{&} the matched text + @code{\\&} @code{\&} a literal @samp{&} + @code{\\\\&} @code{\\&} a literal @samp{\}, then the matched text +@code{\\\\\\&} @code{\\\&} a literal @samp{\&} +@end display +@end ifinfo + +@noindent +This would appear to solve the problem. +Unfortunately, the phrasing of the standard is unusual. It +says, in effect, that @samp{\} turns off the special meaning of any +following character, but that for anything other than @samp{\} and @samp{&}, +such special meaning is undefined. This wording leads to two problems. + +@enumerate +@item +Backslashes must now be doubled in the @var{replacement} string, breaking +historical @code{awk} programs. + +@item +To make sure that an @code{awk} program is portable, @emph{every} character +in the @var{replacement} string must be preceded with a +backslash.@footnote{This consequence was certainly unintended.} +@c I can say that, 'cause I was involved in making this change +@end enumerate + +The POSIX standard is under revision.@footnote{As of @value{UPDATE-MONTH}, +with final approval and publication hopefully sometime in 1997.} +Because of the above problems, proposed text for the revised standard +reverts to rules that correspond more closely to the original existing +practice. The proposed rules have special cases that make it possible +to produce a @samp{\} preceding the matched text. + +@tex +\vbox{\bigskip +% This table has lots of &'s and \'s, so unspecialize them. 
+\catcode`\& = \other \catcode`\\ = \other +% But then we need character for escape and tab. +@catcode`! = 4 +@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr + You type!@code{sub} sees!@code{sub} generates@cr +@hrulefill!@hrulefill!@hrulefill@cr +@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr +@code{\\\\&}! @code{\\&}!a literal @samp{\}, followed by the matched text@cr + @code{\\&}! @code{\&}!a literal @samp{&}@cr + @code{\\q}! @code{\q}!a literal @samp{\q}@cr +} +@bigskip} +@end tex +@ifinfo +@display + You type @code{sub} sees @code{sub} generates + -------- ---------- --------------- +@code{\\\\\\&} @code{\\\&} a literal @samp{\&} + @code{\\\\&} @code{\\&} a literal @samp{\}, followed by the matched text + @code{\\&} @code{\&} a literal @samp{&} + @code{\\q} @code{\q} a literal @samp{\q} +@end display +@end ifinfo + +In a nutshell, at the run-time level, there are now three special sequences +of characters, @samp{\\\&}, @samp{\\&} and @samp{\&}, whereas historically, +there was only one. However, as in the historical case, any @samp{\} that +is not part of one of these three sequences is not special, and appears +in the output literally. + +@code{gawk} 3.0 follows these proposed POSIX rules for @code{sub} and +@code{gsub}. +@c As much as we think it's a lousy idea. You win some, you lose some. Sigh. +Whether these proposed rules will actually become codified into the +standard is unknown at this point. Subsequent @code{gawk} releases will +track the standard and implement whatever the final version specifies; +this @value{DOCUMENT} will be updated as well. + +The rules for @code{gensub} are considerably simpler. At the run-time +level, whenever @code{gawk} sees a @samp{\}, if the following character +is a digit, then the text that matched the corresponding parenthesized +subexpression is placed in the generated output. Otherwise, +no matter what the character after the @samp{\} is, that character will +appear in the generated text, and the @samp{\} will not. 
+ +@tex +\vbox{\bigskip +% This table has lots of &'s and \'s, so unspecialize them. +\catcode`\& = \other \catcode`\\ = \other +% But then we need character for escape and tab. +@catcode`! = 4 +@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr + You type!@code{gensub} sees!@code{gensub} generates@cr +@hrulefill!@hrulefill!@hrulefill@cr + @code{&}! @code{&}!the matched text@cr + @code{\\&}! @code{\&}!a literal @samp{&}@cr + @code{\\\\}! @code{\\}!a literal @samp{\}@cr + @code{\\\\&}! @code{\\&}!a literal @samp{\}, then the matched text@cr +@code{\\\\\\&}! @code{\\\&}!a literal @samp{\&}@cr + @code{\\q}! @code{\q}!a literal @samp{q}@cr +} +@bigskip} +@end tex +@ifinfo +@display + You type @code{gensub} sees @code{gensub} generates + -------- ------------- ------------------ + @code{&} @code{&} the matched text + @code{\\&} @code{\&} a literal @samp{&} + @code{\\\\} @code{\\} a literal @samp{\} + @code{\\\\&} @code{\\&} a literal @samp{\}, then the matched text +@code{\\\\\\&} @code{\\\&} a literal @samp{\&} + @code{\\q} @code{\q} a literal @samp{q} +@end display +@end ifinfo + +Because of the complexity of the lexical and run-time level processing, +and the special cases for @code{sub} and @code{gsub}, +we recommend the use of @code{gawk} and @code{gensub} for when you have +to do substitutions. + +@node I/O Functions, Time Functions, String Functions, Built-in +@section Built-in Functions for Input/Output + +The following functions are related to Input/Output (I/O). +Optional parameters are enclosed in square brackets (``['' and ``]''). + +@table @code +@item close(@var{filename}) +@findex close +Close the file @var{filename}, for input or output. The argument may +alternatively be a shell command that was used for redirecting to or +from a pipe; then the pipe is closed. +@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}, +for more information. 
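For instance, a program that sends output to a command through a pipe can call @code{close} to finish that command before going on. The following is a minimal sketch; the use of @code{sort} and the sample data are only for illustration.

```shell
# Send two lines to sort through a pipe, then close the pipe.
# close() must be given the same string that was used to open the pipe.
sorted=$(awk 'BEGIN {
    cmd = "sort"
    print "pear"  | cmd
    print "apple" | cmd
    close(cmd)        # sort sees end-of-file and writes its output
    print "done"
}')
echo "$sorted"
```

The sorted lines come out before @samp{done}, because @code{close} does not return until the command has finished.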
+ +@item fflush(@r{[}@var{filename}@r{]}) +@findex fflush +@cindex portability issues +@cindex flushing buffers +@cindex buffers, flushing +@cindex buffering output +@cindex output, buffering +Flush any buffered output associated with @var{filename}, which is either a +file opened for writing, or a shell command for redirecting output to +a pipe. + +Many utility programs will @dfn{buffer} their output; they save information +to be written to a disk file or terminal in memory, until there is enough +for it to be worthwhile to send the data to the output device. +This is often more efficient than writing +every little bit of information as soon as it is ready. However, sometimes +it is necessary to force a program to @dfn{flush} its buffers; that is, +write the information to its destination, even if a buffer is not full. +This is the purpose of the @code{fflush} function; @code{gawk} too +buffers its output, and the @code{fflush} function can be used to force +@code{gawk} to flush its buffers. + +@code{fflush} is a recent (1994) addition to the Bell Labs research +version of @code{awk}; it is not part of the POSIX standard, and will +not be available if @samp{--posix} has been specified on the command +line (@pxref{Options, ,Command Line Options}). + +@code{gawk} extends the @code{fflush} function in two ways. The first +is to allow no argument at all. In this case, the buffer for the +standard output is flushed. The second way is to allow the null string +(@w{@code{""}}) as the argument. In this case, the buffers for +@emph{all} open output files and pipes are flushed. + +@code{fflush} returns zero if the buffer was successfully flushed, +and nonzero otherwise. + +@item system(@var{command}) +@findex system +@cindex interaction, @code{awk} and other programs +The @code{system} function allows the user to execute operating system commands +and then return to the @code{awk} program. It +executes the command given by the string @var{command}. 
It returns, as +its value, the status returned by the command that was executed. + +For example, if the following fragment of code is put in your @code{awk} +program: + +@example +END @{ + system("date | mail -s 'awk run done' root") +@} +@end example + +@noindent +the system administrator will be sent mail when the @code{awk} program +finishes processing input and begins its end-of-input processing. + +Note that redirecting @code{print} or @code{printf} into a pipe is often +enough to accomplish your task. However, if your @code{awk} +program is interactive, @code{system} is useful for cranking up large +self-contained programs, such as a shell or an editor. + +Some operating systems cannot implement the @code{system} function. +@code{system} causes a fatal error if it is not supported. +@end table + +@c fakenode --- for prepinfo +@subheading Interactive vs. Non-Interactive Buffering +@cindex buffering, interactive vs. non-interactive +@cindex buffering, non-interactive vs. interactive +@cindex interactive buffering vs. non-interactive +@cindex non-interactive buffering vs. interactive + +As a side point, buffering issues can be even more confusing depending +upon whether or not your program is @dfn{interactive}, i.e., communicating +with a user sitting at a keyboard.@footnote{A program is interactive +if the standard output is connected +to a terminal device.} + +Interactive programs generally @dfn{line buffer} their output; they +write out every line. Non-interactive programs wait until they have +a full buffer, which may be many lines of output. + +@c Thanks to Walter.Mecky@dresdnerbank.de for this example, and for +@c motivating me to write this section. +Here is an example of the difference. + +@example +$ awk '@{ print $1 + $2 @}' +1 1 +@print{} 2 +2 3 +@print{} 5 +@kbd{Control-d} +@end example + +@noindent +Each line of output is printed immediately. Compare that behavior +with this example. 
+ +@example +$ awk '@{ print $1 + $2 @}' | cat +1 1 +2 3 +@kbd{Control-d} +@print{} 2 +@print{} 5 +@end example + +@noindent +Here, no output is printed until after the @kbd{Control-d} is typed, since +it is all buffered, and sent down the pipe to @code{cat} in one shot. + +@c fakenode --- for prepinfo +@subheading Controlling Output Buffering with @code{system} +@cindex flushing buffers +@cindex buffers, flushing +@cindex buffering output +@cindex output, buffering + +The @code{fflush} function provides explicit control over output buffering for +individual files and pipes. However, its use is not portable to many other +@code{awk} implementations. An alternative method to flush output +buffers is by calling @code{system} with a null string as its argument: + +@example +system("") # flush output +@end example + +@noindent +@code{gawk} treats this use of the @code{system} function as a special +case, and is smart enough not to run a shell (or other command +interpreter) with the empty command. Therefore, with @code{gawk}, this +idiom is not only useful, it is efficient. While this method should work +with other @code{awk} implementations, it will not necessarily avoid +starting an unnecessary shell. (Other implementations may only +flush the buffer associated with the standard output, and not necessarily +all buffered output.) + +If you think about what a programmer expects, it makes sense that +@code{system} should flush any pending output. The following program: + +@example +BEGIN @{ + print "first print" + system("echo system echo") + print "second print" +@} +@end example + +@noindent +must print + +@example +first print +system echo +second print +@end example + +@noindent +and not + +@example +system echo +first print +second print +@end example + +If @code{awk} did not flush its buffers before calling @code{system}, the +latter (undesirable) output is what you would see. 
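To tie these points together, here is a small sketch of the @samp{system("")} idiom applied to the earlier pipeline example. The shell function name @code{sum_unbuffered} is made up for this illustration. Flushing after each record means that each sum should go down the pipe as soon as it is computed, rather than when the buffer fills or the program exits.

```shell
# sum_unbuffered is a hypothetical helper name, not part of awk.
# system("") flushes pending output after every record; with gawk,
# fflush() could be used instead, but it is less portable.
sum_unbuffered() {
    awk '{ print $1 + $2; system("") }'
}

# The results are the same as before; only their timing changes.
printf '1 1\n2 3\n' | sum_unbuffered | cat
```

With @code{gawk} no shell is started for the empty command, so the extra call is cheap; other implementations may start a shell for every record, which can be slow on large inputs.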
+ +@node Time Functions, , I/O Functions, Built-in +@section Functions for Dealing with Time Stamps + +@cindex timestamps +@cindex time of day +A common use for @code{awk} programs is the processing of log files +containing time stamp information, indicating when a +particular log record was written. Many programs log their time stamp +in the form returned by the @code{time} system call, which is the +number of seconds since a particular epoch. On POSIX systems, +it is the number of seconds since Midnight, January 1, 1970, UTC. + +In order to make it easier to process such log files, and to produce +useful reports, @code{gawk} provides two functions for working with time +stamps. Both of these are @code{gawk} extensions; they are not specified +in the POSIX standard, nor are they in any other known version +of @code{awk}. + +Optional parameters are enclosed in square brackets (``['' and ``]''). + +@table @code +@item systime() +@findex systime +This function returns the current time as the number of seconds since +the system epoch. On POSIX systems, this is the number of seconds +since Midnight, January 1, 1970, UTC. It may be a different number on +other systems. + +@item strftime(@r{[}@var{format} @r{[}, @var{timestamp}@r{]]}) +@findex strftime +This function returns a string. It is similar to the function of the +same name in ANSI C. The time specified by @var{timestamp} is used to +produce a string, based on the contents of the @var{format} string. +The @var{timestamp} is in the same format as the value returned by the +@code{systime} function. If no @var{timestamp} argument is supplied, +@code{gawk} will use the current time of day as the time stamp. +If no @var{format} argument is supplied, @code{strftime} uses +@code{@w{"%a %b %d %H:%M:%S %Z %Y"}}. This format string produces +output (almost) equivalent to that of the @code{date} utility. +(Versions of @code{gawk} prior to 3.0 require the @var{format} argument.) 
+@end table + +The @code{systime} function allows you to compare a time stamp from a +log file with the current time of day. In particular, it is easy to +determine how long ago a particular record was logged. It also allows +you to produce log records using the ``seconds since the epoch'' format. + +The @code{strftime} function allows you to easily turn a time stamp +into human-readable information. It is similar in nature to the @code{sprintf} +function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}), +in that it copies non-format specification characters verbatim to the +returned string, while substituting date and time values for format +specifications in the @var{format} string. + +@code{strftime} is guaranteed by the ANSI C standard to support +the following date format specifications: + +@table @code +@item %a +The locale's abbreviated weekday name. + +@item %A +The locale's full weekday name. + +@item %b +The locale's abbreviated month name. + +@item %B +The locale's full month name. + +@item %c +The locale's ``appropriate'' date and time representation. + +@item %d +The day of the month as a decimal number (01--31). + +@item %H +The hour (24-hour clock) as a decimal number (00--23). + +@item %I +The hour (12-hour clock) as a decimal number (01--12). + +@item %j +The day of the year as a decimal number (001--366). + +@item %m +The month as a decimal number (01--12). + +@item %M +The minute as a decimal number (00--59). + +@item %p +The locale's equivalent of the AM/PM designations associated +with a 12-hour clock. + +@item %S +The second as a decimal number (00--60).@footnote{Occasionally there are +minutes in a year with a leap second, which is why the +seconds can go up to 60.} + +@item %U +The week number of the year (the first Sunday as the first day of week one) +as a decimal number (00--53). + +@item %w +The weekday as a decimal number (0--6). Sunday is day zero. 
+ +@item %W +The week number of the year (the first Monday as the first day of week one) +as a decimal number (00--53). + +@item %x +The locale's ``appropriate'' date representation. + +@item %X +The locale's ``appropriate'' time representation. + +@item %y +The year without century as a decimal number (00--99). + +@item %Y +The year with century as a decimal number (e.g., 1995). + +@item %Z +The time zone name or abbreviation, or no characters if +no time zone is determinable. + +@item %% +A literal @samp{%}. +@end table + +If a conversion specifier is not one of the above, the behavior is +undefined.@footnote{This is because ANSI C leaves the +behavior of the C version of @code{strftime} undefined, and @code{gawk} +will use the system's version of @code{strftime} if it's there. +Typically, the conversion specifier will either not appear in the +returned string, or it will appear literally.} + +@cindex locale, definition of +Informally, a @dfn{locale} is the geographic place in which a program +is meant to run. For example, a common way to abbreviate the date +September 4, 1991 in the United States would be ``9/4/91''. +In many countries in Europe, however, it would be abbreviated ``4.9.91''. +Thus, the @samp{%x} specification in a @code{"US"} locale might produce +@samp{9/4/91}, while in a @code{"EUROPE"} locale, it might produce +@samp{4.9.91}. The ANSI C standard defines a default @code{"C"} +locale, which is an environment that is typical of what most C programmers +are used to. + +A public-domain C version of @code{strftime} is supplied with @code{gawk} +for systems that are not yet fully ANSI-compliant. If that version is +used to compile @code{gawk} (@pxref{Installation, ,Installing @code{gawk}}), +then the following additional format specifications are available: + +@table @code +@item %D +Equivalent to specifying @samp{%m/%d/%y}. + +@item %e +The day of the month, padded with a space if it is only one digit. + +@item %h +Equivalent to @samp{%b}, above. 
+ +@item %n +A newline character (ASCII LF). + +@item %r +Equivalent to specifying @samp{%I:%M:%S %p}. + +@item %R +Equivalent to specifying @samp{%H:%M}. + +@item %T +Equivalent to specifying @samp{%H:%M:%S}. + +@item %t +A tab character. + +@item %k +The hour (24-hour clock) as a decimal number (0-23). +Single digit numbers are padded with a space. + +@item %l +The hour (12-hour clock) as a decimal number (1-12). +Single digit numbers are padded with a space. + +@item %C +The century, as a number between 00 and 99. + +@item %u +The weekday as a decimal number +[1 (Monday)--7]. + +@cindex ISO 8601 +@item %V +The week number of the year (the first Monday as the first +day of week one) as a decimal number (01--53). +The method for determining the week number is as specified by ISO 8601 +(to wit: if the week containing January 1 has four or more days in the +new year, then it is week one, otherwise it is week 53 of the previous year +and the next week is week one). + +@item %G +The year with century of the ISO week number, as a decimal number. + +For example, January 1, 1993, is in week 53 of 1992. Thus, the year +of its ISO week number is 1992, even though its year is 1993. +Similarly, December 31, 1973, is in week 1 of 1974. Thus, the year +of its ISO week number is 1974, even though its year is 1973. + +@item %g +The year without century of the ISO week number, as a decimal number (00--99). + +@item %Ec %EC %Ex %Ey %EY %Od %Oe %OH %OI +@itemx %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy +These are ``alternate representations'' for the specifications +that use only the second letter (@samp{%c}, @samp{%C}, and so on). +They are recognized, but their normal representations are +used.@footnote{If you don't understand any of this, don't worry about +it; these facilities are meant to make it easier to ``internationalize'' +programs.} +(These facilitate compliance with the POSIX @code{date} utility.) + +@item %v +The date in VMS format (e.g., 20-JUN-1991). 
+ +@cindex RFC-822 +@cindex RFC-1036 +@item %z +The timezone offset in a +HHMM format (e.g., the format necessary to +produce RFC-822/RFC-1036 date headers). +@end table + +This example is an @code{awk} implementation of the POSIX +@code{date} utility. Normally, the @code{date} utility prints the +current date and time of day in a well known format. However, if you +provide an argument to it that begins with a @samp{+}, @code{date} +will copy non-format specifier characters to the standard output, and +will interpret the current time according to the format specifiers in +the string. For example: + +@example +$ date '+Today is %A, %B %d, %Y.' +@print{} Today is Thursday, July 11, 1991. +@end example + +Here is the @code{gawk} version of the @code{date} utility. +It has a shell ``wrapper'', to handle the @samp{-u} option, +which requires that @code{date} run as if the time zone +was set to UTC. + +@example +@group +#! /bin/sh +# +# date --- approximate the P1003.2 'date' command + +case $1 in +-u) TZ=GMT0 # use UTC + export TZ + shift ;; +esac +@end group + +@group +gawk 'BEGIN @{ + format = "%a %b %d %H:%M:%S %Z %Y" + exitval = 0 +@end group + +@group + if (ARGC > 2) + exitval = 1 + else if (ARGC == 2) @{ + format = ARGV[1] + if (format ~ /^\+/) + format = substr(format, 2) # remove leading + + @} + print strftime(format) + exit exitval +@}' "$@@" +@end group +@end example + +@node User-defined, Invoking Gawk, Built-in, Top +@chapter User-defined Functions + +@cindex user-defined functions +@cindex functions, user-defined +Complicated @code{awk} programs can often be simplified by defining +your own functions. User-defined functions can be called just like +built-in ones (@pxref{Function Calls}), but it is up to you to define +them---to tell @code{awk} what they should do. + +@menu +* Definition Syntax:: How to write definitions and what they mean. +* Function Example:: An example function definition and what it + does. 
+* Function Caveats:: Things to watch out for. +* Return Statement:: Specifying the value a function returns. +@end menu + +@node Definition Syntax, Function Example, User-defined, User-defined +@section Function Definition Syntax +@cindex defining functions +@cindex function definition + +Definitions of functions can appear anywhere between the rules of an +@code{awk} program. Thus, the general form of an @code{awk} program is +extended to include sequences of rules @emph{and} user-defined function +definitions. +There is no need in @code{awk} to put the definition of a function +before all uses of the function. This is because @code{awk} reads the +entire program before starting to execute any of it. + +The definition of a function named @var{name} looks like this: + +@example +function @var{name}(@var{parameter-list}) +@{ + @var{body-of-function} +@} +@end example + +@cindex names, use of +@cindex namespaces +@noindent +@var{name} is the name of the function to be defined. A valid function +name is like a valid variable name: a sequence of letters, digits and +underscores, not starting with a digit. +Within a single @code{awk} program, any particular name can only be +used as a variable, array or function. + +@var{parameter-list} is a list of the function's arguments and local +variable names, separated by commas. When the function is called, +the argument names are used to hold the argument values given in +the call. The local variables are initialized to the empty string. +A function cannot have two parameters with the same name. + +The @var{body-of-function} consists of @code{awk} statements. It is the +most important part of the definition, because it says what the function +should actually @emph{do}. The argument names exist to give the body a +way to talk about the arguments; local variables, to give the body +places to keep temporary values. 
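As a minimal sketch of the points above, the following shell command runs a small @code{awk} program in which a function is called before its definition appears. The function name @code{double} and its local variable @code{tmp} are made up for illustration; they are not part of @code{awk} itself.

```shell
# A minimal sketch; the function name "double" is made up for illustration.
result=$(echo 21 | awk '
# The rule below calls double() before its definition appears.
{ print double($1) }

# n is the argument; tmp is a local variable, which starts out empty.
function double(n,    tmp)
{
    tmp = n * 2
    return tmp
}')
echo "$result"    # prints 42
```

Because @code{awk} reads the whole program before running any of it, the order of the rule and the function definition does not matter here.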
+ +Argument names are not distinguished syntactically from local variable +names; instead, the number of arguments supplied when the function is +called determines how many argument variables there are. Thus, if three +argument values are given, the first three names in @var{parameter-list} +are arguments, and the rest are local variables. + +It follows that if the number of arguments is not the same in all calls +to the function, some of the names in @var{parameter-list} may be +arguments on some occasions and local variables on others. Another +way to think of this is that omitted arguments default to the +null string. + +Usually when you write a function you know how many names you intend to +use for arguments and how many you intend to use as local variables. It is +conventional to place some extra space between the arguments and +the local variables, to document how your function is supposed to be used. + +@cindex variable shadowing +During execution of the function body, the arguments and local variable +values hide or @dfn{shadow} any variables of the same names used in the +rest of the program. The shadowed variables are not accessible in the +function definition, because there is no way to name them while their +names have been taken away for the local variables. All other variables +used in the @code{awk} program can be referenced or set normally in the +function's body. + +The arguments and local variables last only as long as the function body +is executing. Once the body finishes, you can once again access the +variables that were shadowed while the function was running. + +@cindex recursive function +@cindex function, recursive +The function body can contain expressions which call functions. They +can even call this function, either directly or by way of another +function. When this happens, we say the function is @dfn{recursive}. 
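The following sketch illustrates omitted arguments, shadowing, and recursion together. The function names @code{show} and @code{fact} are hypothetical and chosen only for this illustration.

```shell
# A minimal sketch; show() and fact() are made-up names.
out=$(awk '
# x is the argument, i is a local variable (note the extra spaces).
function show(x,    i)
{
    i = "local"               # changes only the local i
    return "[" x "]"          # x is empty when the caller omits it
}

# fact() calls itself, so it is a recursive function.
function fact(n)
{
    return (n <= 1) ? 1 : n * fact(n - 1)
}

BEGIN {
    i = "global"
    print show()              # the omitted argument defaults to the null string
    print i                   # the global i was shadowed, not changed
    print fact(5)
}')
echo "$out"
```

Run this way, the program should print @samp{[]}, then @samp{global}, then @samp{120}: the global @code{i} keeps its value because the assignment inside @code{show} touches only the local variable of the same name.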
+ +@cindex @code{awk} language, POSIX version +@cindex POSIX @code{awk} +In many @code{awk} implementations, including @code{gawk}, +the keyword @code{function} may be +abbreviated @code{func}. However, POSIX only specifies the use of +the keyword @code{function}. This actually has some practical implications. +If @code{gawk} is in POSIX-compatibility mode +(@pxref{Options, ,Command Line Options}), then the following +statement will @emph{not} define a function: + +@example +func foo() @{ a = sqrt($1) ; print a @} +@end example + +@noindent +Instead it defines a rule that, for each record, concatenates the value +of the variable @samp{func} with the return value of the function @samp{foo}. +If the resulting string is non-null, the action is executed. +This is probably not what was desired. (@code{awk} accepts this input as +syntactically valid, since functions may be used before they are defined +in @code{awk} programs.) + +@cindex portability issues +To ensure that your @code{awk} programs are portable, always use the +keyword @code{function} when defining a function. + +@node Function Example, Function Caveats, Definition Syntax, User-defined +@section Function Definition Examples + +Here is an example of a user-defined function, called @code{myprint}, that +takes a number and prints it in a specific format. + +@example +function myprint(num) +@{ + printf "%6.3g\n", num +@} +@end example + +@noindent +To illustrate, here is an @code{awk} rule which uses our @code{myprint} +function: + +@example +$3 > 0 @{ myprint($3) @} +@end example + +@noindent +This program prints, in our special format, all the third fields that +contain a positive number in our input. Therefore, when given: + +@example +@group + 1.2 3.4 5.6 7.8 + 9.10 11.12 -13.14 15.16 +17.18 19.20 21.22 23.24 +@end group +@end example + +@noindent +this program, using our function to format the results, prints: + +@example + 5.6 + 21.2 +@end example + +This function deletes all the elements in an array. 
+ +@example +function delarray(a, i) +@{ + for (i in a) + delete a[i] +@} +@end example + +When working with arrays, it is often necessary to delete all the elements +in an array and start over with a new list of elements +(@pxref{Delete, ,The @code{delete} Statement}). +Instead of having +to repeat this loop everywhere in your program that you need to clear out +an array, your program can just call @code{delarray}. + +Here is an example of a recursive function. It takes a string +as an input parameter, and returns the string in backwards order. + +@example +function rev(str, start) +@{ + if (start == 0) + return "" + + return (substr(str, start, 1) rev(str, start - 1)) +@} +@end example + +If this function is in a file named @file{rev.awk}, we can test it +this way: + +@example +$ echo "Don't Panic!" | +> gawk --source '@{ print rev($0, length($0)) @}' -f rev.awk +@print{} !cinaP t'noD +@end example + +Here is an example that uses the built-in function @code{strftime}. +(@xref{Time Functions, ,Functions for Dealing with Time Stamps}, +for more information on @code{strftime}.) +The C @code{ctime} function takes a timestamp and returns it in a string, +formatted in a well known fashion. Here is an @code{awk} version: + +@example +@c file eg/lib/ctime.awk +@group +# ctime.awk +# +# awk version of C ctime(3) function + +function ctime(ts, format) +@{ + format = "%a %b %d %H:%M:%S %Z %Y" + if (ts == 0) + ts = systime() # use current time as default + return strftime(format, ts) +@} +@c endfile +@end group +@end example + +@node Function Caveats, Return Statement, Function Example, User-defined +@section Calling User-defined Functions + +@cindex call by value +@cindex call by reference +@cindex calling a function +@cindex function call +@dfn{Calling a function} means causing the function to run and do its job. +A function call is an expression, and its value is the value returned by +the function. 
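Since a call is an expression, its value can be assigned, concatenated, or compared like any other value. The sketch below reuses the @code{rev} function shown earlier; the variable names @code{backwards} and @code{status} are introduced here only for illustration.

```shell
# A minimal sketch reusing rev() from the earlier example.
out=$(echo "Panic" | awk '
function rev(str, start)
{
    if (start == 0)
        return ""
    return (substr(str, start, 1) rev(str, start - 1))
}

{
    # The value of the call is stored, then used in a concatenation.
    backwards = rev($0, length($0))
    print $0 " reversed is " backwards

    # The value of a second call is used directly in a comparison.
    status = (rev(backwards, length(backwards)) == $0) ? "round trip ok" : "mismatch"
    print status
}')
echo "$out"
```

Reversing the reversed string gives back the original record, so the second line of output reports a successful round trip.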
+ +A function call consists of the function name followed by the arguments +in parentheses. What you write in the call for the arguments are +@code{awk} expressions; each time the call is executed, these +expressions are evaluated, and the values are the actual arguments. For +example, here is a call to @code{foo} with three arguments (the first +being a string concatenation): + +@example +foo(x y, "lose", 4 * z) +@end example + +@strong{Caution:} whitespace characters (spaces and tabs) are not allowed +between the function name and the open-parenthesis of the argument list. +If you write whitespace by mistake, @code{awk} might think that you mean +to concatenate a variable with an expression in parentheses. However, it +notices that you used a function name and not a variable name, and reports +an error. + +@cindex call by value +When a function is called, it is given a @emph{copy} of the values of +its arguments. This is known as @dfn{call by value}. The caller may use +a variable as the expression for the argument, but the called function +does not know this: it only knows what value the argument had. For +example, if you write this code: + +@example +foo = "bar" +z = myfunc(foo) +@end example + +@noindent +then you should not think of the argument to @code{myfunc} as being +``the variable @code{foo}.'' Instead, think of the argument as the +string value, @code{"bar"}. + +If the function @code{myfunc} alters the values of its local variables, +this has no effect on any other variables. Thus, if @code{myfunc} +does this: + +@example +@group +function myfunc(str) +@{ + print str + str = "zzz" + print str +@} +@end group +@end example + +@noindent +to change its first argument variable @code{str}, this @emph{does not} +change the value of @code{foo} in the caller. The role of @code{foo} in +calling @code{myfunc} ended when its value, @code{"bar"}, was computed. 
+If @code{str} also exists outside of @code{myfunc}, the function body +cannot alter this outer value, because it is shadowed during the +execution of @code{myfunc} and cannot be seen or changed from there. + +@cindex call by reference +However, when arrays are the parameters to functions, they are @emph{not} +copied. Instead, the array itself is made available for direct manipulation +by the function. This is usually called @dfn{call by reference}. +Changes made to an array parameter inside the body of a function @emph{are} +visible outside that function. +@ifinfo +This can be @strong{very} dangerous if you do not watch what you are +doing. For example: +@end ifinfo +@iftex +@emph{This can be very dangerous if you do not watch what you are +doing.} For example: +@end iftex + +@example +function changeit(array, ind, nvalue) +@{ + array[ind] = nvalue +@} + +BEGIN @{ + a[1] = 1; a[2] = 2; a[3] = 3 + changeit(a, 2, "two") + printf "a[1] = %s, a[2] = %s, a[3] = %s\n", + a[1], a[2], a[3] +@} +@end example + +@noindent +This program prints @samp{a[1] = 1, a[2] = two, a[3] = 3}, because +@code{changeit} stores @code{"two"} in the second element of @code{a}. + +@cindex undefined functions +@cindex functions, undefined +Some @code{awk} implementations allow you to call a function that +has not been defined, and only report a problem at run-time when the +program actually tries to call the function. For example: + +@example +@group +BEGIN @{ + if (0) + foo() + else + bar() +@} +function bar() @{ @dots{} @} +# note that `foo' is not defined +@end group +@end example + +@noindent +Since the condition of the @samp{if} statement is never true, it is not really a +problem that @code{foo} has not been defined. Usually though, it is a +problem if a program calls an undefined function. + +@ignore +At one point, I had gawk dying on this, but later decided that this might +break old programs and/or test suites.
+@end ignore + +If @samp{--lint} has been specified +(@pxref{Options, ,Command Line Options}), +@code{gawk} will report about calls to undefined functions. + +Some @code{awk} implementations generate a run-time +error if you use the @code{next} statement +(@pxref{Next Statement, , The @code{next} Statement}) +inside a user-defined function. +@code{gawk} does not have this problem. + +@node Return Statement, , Function Caveats, User-defined +@section The @code{return} Statement +@cindex @code{return} statement + +The body of a user-defined function can contain a @code{return} statement. +This statement returns control to the rest of the @code{awk} program. It +can also be used to return a value for use in the rest of the @code{awk} +program. It looks like this: + +@example +return @r{[}@var{expression}@r{]} +@end example + +The @var{expression} part is optional. If it is omitted, then the returned +value is undefined and, therefore, unpredictable. + +A @code{return} statement with no value expression is assumed at the end of +every function definition. So if control reaches the end of the function +body, then the function returns an unpredictable value. @code{awk} +will @emph{not} warn you if you use the return value of such a function. + +Sometimes, you want to write a function for what it does, not for +what it returns. Such a function corresponds to a @code{void} function +in C or to a @code{procedure} in Pascal. Thus, it may be appropriate to not +return any value; you should simply bear in mind that if you use the return +value of such a function, you do so at your own risk. + +Here is an example of a user-defined function that returns a value +for the largest number among the elements of an array: + +@example +@group +function maxelt(vec, i, ret) +@{ + for (i in vec) @{ + if (ret == "" || vec[i] > ret) + ret = vec[i] + @} + return ret +@} +@end group +@end example + +@noindent +You call @code{maxelt} with one argument, which is an array name. 
The local +variables @code{i} and @code{ret} are not intended to be arguments; +while there is nothing to stop you from passing two or three arguments +to @code{maxelt}, the results would be strange. The extra space before +@code{i} in the function parameter list indicates that @code{i} and +@code{ret} are not supposed to be arguments. This is a convention that +you should follow when you define functions. + +Here is a program that uses our @code{maxelt} function. It loads an +array, calls @code{maxelt}, and then reports the maximum number in that +array: + +@example +@group +awk ' +function maxelt(vec, i, ret) +@{ + for (i in vec) @{ + if (ret == "" || vec[i] > ret) + ret = vec[i] + @} + return ret +@} +@end group + +@group +# Load all fields of each record into nums. +@{ + for(i = 1; i <= NF; i++) + nums[NR, i] = $i +@} + +END @{ + print maxelt(nums) +@}' +@end group +@end example + +Given the following input: + +@example +@group + 1 5 23 8 16 +44 3 5 2 8 26 +256 291 1396 2962 100 +-6 467 998 1101 +99385 11 0 225 +@end group +@end example + +@noindent +our program tells us (predictably) that @code{99385} is the largest number +in our array. + +@node Invoking Gawk, Library Functions, User-defined, Top +@chapter Running @code{awk} +@cindex command line +@cindex invocation of @code{gawk} +@cindex arguments, command line +@cindex options, command line +@cindex long options +@cindex options, long + +There are two ways to run @code{awk}: with an explicit program, or with +one or more program files. Here are templates for both of them; items +enclosed in @samp{@r{[}@dots{}@r{]}} in these templates are optional. + +Besides traditional one-letter POSIX-style options, @code{gawk} also +supports GNU long options. 
+ +@example +awk @r{[@var{options}]} -f progfile @r{[@code{--}]} @var{file} @dots{} +awk @r{[@var{options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{} +@end example + +@cindex empty program +@cindex dark corner +It is possible to invoke @code{awk} with an empty program: + +@example +$ awk '' datafile1 datafile2 +@end example + +@noindent +Doing so makes little sense though; @code{awk} will simply exit +silently when given an empty program (d.c.). If @samp{--lint} has +been specified on the command line, @code{gawk} will issue a +warning that the program is empty. + +@menu +* Options:: Command line options and their meanings. +* Other Arguments:: Input file names and variable assignments. +* AWKPATH Variable:: Searching directories for @code{awk} programs. +* Obsolete:: Obsolete Options and/or features. +* Undocumented:: Undocumented Options and Features. +* Known Bugs:: Known Bugs in @code{gawk}. +@end menu + +@node Options, Other Arguments, Invoking Gawk, Invoking Gawk +@section Command Line Options + +Options begin with a dash, and consist of a single character. +GNU style long options consist of two dashes and a keyword. +The keyword can be abbreviated, as long as the abbreviation allows the option +to be uniquely identified. If the option takes an argument, then the +keyword is either immediately followed by an equals sign (@samp{=}) and the +argument's value, or the keyword and the argument's value are separated +by whitespace. For brevity, the discussion below only refers to the +traditional short options; however the long and short options are +interchangeable in all contexts. + +Each long option for @code{gawk} has a corresponding +POSIX-style option. The options and their meanings are as follows: + +@table @code +@item -F @var{fs} +@itemx --field-separator @var{fs} +@cindex @code{-F} option +@cindex @code{--field-separator} option +Sets the @code{FS} variable to @var{fs} +(@pxref{Field Separators, ,Specifying How Fields are Separated}).
+ +@item -f @var{source-file} +@itemx --file @var{source-file} +@cindex @code{-f} option +@cindex @code{--file} option +Indicates that the @code{awk} program is to be found in @var{source-file} +instead of in the first non-option argument. + +@item -v @var{var}=@var{val} +@itemx --assign @var{var}=@var{val} +@cindex @code{-v} option +@cindex @code{--assign} option +Sets the variable @var{var} to the value @var{val} @strong{before} +execution of the program begins. Such variable values are available +inside the @code{BEGIN} rule +(@pxref{Other Arguments, ,Other Command Line Arguments}). + +The @samp{-v} option can only set one variable, but you can use +it more than once, setting another variable each time, like this: +@samp{awk @w{-v foo=1} @w{-v bar=2} @dots{}}. + +@item -mf @var{NNN} +@itemx -mr @var{NNN} +Set various memory limits to the value @var{NNN}. The @samp{f} flag sets +the maximum number of fields, and the @samp{r} flag sets the maximum +record size. These two flags and the @samp{-m} option are from the +Bell Labs research version of Unix @code{awk}. They are provided +for compatibility, but otherwise ignored by +@code{gawk}, since @code{gawk} has no predefined limits. + +@item -W @var{gawk-opt} +@cindex @code{-W} option +Following the POSIX standard, options that are implementation +specific are supplied as arguments to the @samp{-W} option. These options +also have corresponding GNU style long options. +See below. + +@item -- +Signals the end of the command line options. The following arguments +are not treated as options even if they begin with @samp{-}. This +interpretation of @samp{--} follows the POSIX argument parsing +conventions. + +This is useful if you have file names that start with @samp{-}, +or in shell scripts, if you have file names that will be specified +by the user which could start with @samp{-}. 
+@end table + +The following @code{gawk}-specific options are available: + +@table @code +@item -W traditional +@itemx -W compat +@itemx --traditional +@itemx --compat +@cindex @code{--compat} option +@cindex @code{--traditional} option +@cindex compatibility mode +Specifies @dfn{compatibility mode}, in which the GNU extensions to +the @code{awk} language are disabled, so that @code{gawk} behaves just +like the Bell Labs research version of Unix @code{awk}. +@samp{--traditional} is the preferred form of this option. +@xref{POSIX/GNU, ,Extensions in @code{gawk} Not in POSIX @code{awk}}, +which summarizes the extensions. Also see +@ref{Compatibility Mode, ,Downward Compatibility and Debugging}. + +@item -W copyleft +@itemx -W copyright +@itemx --copyleft +@itemx --copyright +@cindex @code{--copyleft} option +@cindex @code{--copyright} option +Print the short version of the General Public License, and then exit. +This option may disappear in a future version of @code{gawk}. + +@item -W help +@itemx -W usage +@itemx --help +@itemx --usage +@cindex @code{--help} option +@cindex @code{--usage} option +Print a ``usage'' message summarizing the short and long style options +that @code{gawk} accepts, and then exit. + +@item -W lint +@itemx --lint +@cindex @code{--lint} option +Warn about constructs that are dubious or non-portable to +other @code{awk} implementations. +Some warnings are issued when @code{gawk} first reads your program. Others +are issued at run-time, as your program executes. + +@item -W lint-old +@itemx --lint-old +@cindex @code{--lint-old} option +Warn about constructs that are not available in +the original Version 7 Unix version of @code{awk} +(@pxref{V7/SVR3.1, , Major Changes between V7 and SVR3.1}). + +@item -W posix +@itemx --posix +@cindex @code{--posix} option +@cindex POSIX mode +Operate in strict POSIX mode. 
This disables all @code{gawk} +extensions (just like @samp{--traditional}), and adds the following additional +restrictions: + +@c IMPORTANT! Keep this list in sync with the one in node POSIX + +@itemize @bullet +@item +@code{\x} escape sequences are not recognized +(@pxref{Escape Sequences}). + +@item +Newlines do not act as whitespace to separate fields when @code{FS} is +equal to a single space. + +@item +The synonym @code{func} for the keyword @code{function} is not +recognized (@pxref{Definition Syntax, ,Function Definition Syntax}). + +@item +The operators @samp{**} and @samp{**=} cannot be used in +place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators}, +and also @pxref{Assignment Ops, ,Assignment Expressions}). + +@item +Specifying @samp{-Ft} on the command line does not set the value +of @code{FS} to be a single tab character +(@pxref{Field Separators, ,Specifying How Fields are Separated}). + +@item +The @code{fflush} built-in function is not supported +(@pxref{I/O Functions, , Built-in Functions for Input/Output}). +@end itemize + +If you supply both @samp{--traditional} and @samp{--posix} on the +command line, @samp{--posix} will take precedence. @code{gawk} +will also issue a warning if both options are supplied. + +@item -W re-interval +@itemx --re-interval +Allow interval expressions +(@pxref{Regexp Operators, , Regular Expression Operators}), +in regexps. +Because interval expressions were traditionally not available in @code{awk}, +@code{gawk} does not provide them by default. This prevents old @code{awk} +programs from breaking. + +@item -W source @var{program-text} +@itemx --source @var{program-text} +@cindex @code{--source} option +Program source code is taken from the @var{program-text}. This option +allows you to mix source code in files with source +code that you enter on the command line. 
This is particularly useful +when you have library functions that you wish to use from your command line +programs (@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}). + +@item -W version +@itemx --version +@cindex @code{--version} option +Prints version information for this particular copy of @code{gawk}. +This allows you to determine if your copy of @code{gawk} is up to date +with respect to whatever the Free Software Foundation is currently +distributing. +It is also useful for bug reports +(@pxref{Bugs, , Reporting Problems and Bugs}). +@end table + +Any other options are flagged as invalid with a warning message, but +are otherwise ignored. + +In compatibility mode, as a special case, if the value of @var{fs} supplied +to the @samp{-F} option is @samp{t}, then @code{FS} is set to the tab +character (@code{"\t"}). This is only true for @samp{--traditional}, and not +for @samp{--posix} +(@pxref{Field Separators, ,Specifying How Fields are Separated}). + +The @samp{-f} option may be used more than once on the command line. +If it is, @code{awk} reads its program source from all of the named files, as +if they had been concatenated together into one big file. This is +useful for creating libraries of @code{awk} functions. Useful functions +can be written once, and then retrieved from a standard place, instead +of having to be included into each individual program. + +You can type in a program at the terminal and still use library functions, +by specifying @samp{-f /dev/tty}. @code{awk} will read a file from the terminal +to use as part of the @code{awk} program. After typing your program, +type @kbd{Control-d} (the end-of-file character) to terminate it. +(You may also use @samp{-f -} to read program source from the standard +input, but then you will not be able to also use the standard input as a +source of data.) 
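The use of several @samp{-f} options can be sketched as follows. The temporary directory, the file names @file{lib.awk} and @file{main.awk}, and the @code{max} helper are all assumptions made for this illustration.

```shell
# A minimal sketch; the file names and the max() helper are made up.
dir=$(mktemp -d)

# A tiny "library" file holding a reusable function.
printf '%s\n' 'function max(a, b) { return (a + 0 > b + 0) ? a : b }' > "$dir/lib.awk"

# The main program, kept in its own file.
printf '%s\n' '{ print max($1, $2) }' > "$dir/main.awk"

# awk reads both files as if they had been concatenated into one program.
out=$(echo "3 7" | awk -f "$dir/lib.awk" -f "$dir/main.awk")
echo "$out"    # prints 7

rm -r "$dir"
```

Keeping the helper in its own file means that other programs can name the same library file with another @samp{-f} option instead of copying the function.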
+ +Because it is clumsy using the standard @code{awk} mechanisms to mix source +file and command line @code{awk} programs, @code{gawk} provides the +@samp{--source} option. This does not require you to pre-empt the standard +input for your source code, and allows you to easily mix command line +and library source code +(@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}). + +If no @samp{-f} or @samp{--source} option is specified, then @code{gawk} +will use the first non-option command line argument as the text of the +program source code. + +@cindex @code{POSIXLY_CORRECT} environment variable +@cindex environment variable, @code{POSIXLY_CORRECT} +If the environment variable @code{POSIXLY_CORRECT} exists, +then @code{gawk} will behave in strict POSIX mode, exactly as if +you had supplied the @samp{--posix} command line option. +Many GNU programs look for this environment variable to turn on +strict POSIX mode. If you supply @samp{--lint} on the command line, +and @code{gawk} turns on POSIX mode because of @code{POSIXLY_CORRECT}, +then it will print a warning message indicating that POSIX +mode is in effect. + +You would typically set this variable in your shell's startup file. +For a Bourne compatible shell (such as Bash), you would add these +lines to the @file{.profile} file in your home directory. + +@example +@group +POSIXLY_CORRECT=true +export POSIXLY_CORRECT +@end group +@end example + +For a @code{csh} compatible shell,@footnote{Not recommended.} +you would add this line to the @file{.login} file in your home directory. + +@example +setenv POSIXLY_CORRECT true +@end example + +@node Other Arguments, AWKPATH Variable, Options, Invoking Gawk +@section Other Command Line Arguments + +Any additional arguments on the command line are normally treated as +input files to be processed in the order specified. 
However, an +argument that has the form @code{@var{var}=@var{value}}, assigns +the value @var{value} to the variable @var{var}---it does not specify a +file at all. + +@vindex ARGIND +@vindex ARGV +All these arguments are made available to your @code{awk} program in the +@code{ARGV} array (@pxref{Built-in Variables}). Command line options +and the program text (if present) are omitted from @code{ARGV}. +All other arguments, including variable assignments, are +included. As each element of @code{ARGV} is processed, @code{gawk} +sets the variable @code{ARGIND} to the index in @code{ARGV} of the +current element. + +The distinction between file name arguments and variable-assignment +arguments is made when @code{awk} is about to open the next input file. +At that point in execution, it checks the ``file name'' to see whether +it is really a variable assignment; if so, @code{awk} sets the variable +instead of reading a file. + +Therefore, the variables actually receive the given values after all +previously specified files have been read. In particular, the values of +variables assigned in this fashion are @emph{not} available inside a +@code{BEGIN} rule +(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}), +since such rules are run before @code{awk} begins scanning the argument list. + +@cindex dark corner +The variable values given on the command line are processed for escape +sequences (d.c.) (@pxref{Escape Sequences}). + +In some earlier implementations of @code{awk}, when a variable assignment +occurred before any file names, the assignment would happen @emph{before} +the @code{BEGIN} rule was executed. @code{awk}'s behavior was thus +inconsistent; some command line assignments were available inside the +@code{BEGIN} rule, while others were not. 
However, +some applications came to depend +upon this ``feature.'' When @code{awk} was changed to be more consistent, +the @samp{-v} option was added to accommodate applications that depended +upon the old behavior. + +The variable assignment feature is most useful for assigning to variables +such as @code{RS}, @code{OFS}, and @code{ORS}, which control input and +output formats, before scanning the data files. It is also useful for +controlling state if multiple passes are needed over a data file. For +example: + +@cindex multiple passes over data +@cindex passes, multiple +@example +awk 'pass == 1 @{ @var{pass 1 stuff} @} + pass == 2 @{ @var{pass 2 stuff} @}' pass=1 mydata pass=2 mydata +@end example + +Given the variable assignment feature, the @samp{-F} option for setting +the value of @code{FS} is not +strictly necessary. It remains for historical compatibility. + +@node AWKPATH Variable, Obsolete, Other Arguments, Invoking Gawk +@section The @code{AWKPATH} Environment Variable +@cindex @code{AWKPATH} environment variable +@cindex environment variable, @code{AWKPATH} +@cindex search path +@cindex directory search +@cindex path, search +@cindex differences between @code{gawk} and @code{awk} + +The previous section described how @code{awk} program files can be named +on the command line with the @samp{-f} option. In most @code{awk} +implementations, you must supply a precise path name for each program +file, unless the file is in the current directory. + +@cindex search path, for source files +But in @code{gawk}, if the file name supplied to the @samp{-f} option +does not contain a @samp{/}, then @code{gawk} searches a list of +directories (called the @dfn{search path}), one by one, looking for a +file with the specified name. + +The search path is a string consisting of directory names +separated by colons. @code{gawk} gets its search path from the +@code{AWKPATH} environment variable. 
If that variable does not exist, +@code{gawk} uses a default path, which is +@samp{.:/usr/local/share/awk}.@footnote{Your version of @code{gawk} +may use a directory that is different than @file{/usr/local/share/awk}; it +will depend upon how @code{gawk} was built and installed. The actual +directory will be the value of @samp{$(datadir)} generated when +@code{gawk} was configured. You probably don't need to worry about this +though.} (Programs written for use by +system administrators should use an @code{AWKPATH} variable that +does not include the current directory, @file{.}.) + +The search path feature is particularly useful for building up libraries +of useful @code{awk} functions. The library files can be placed in a +standard directory that is in the default path, and then specified on +the command line with a short file name. Otherwise, the full file name +would have to be typed for each file. + +By using both the @samp{--source} and @samp{-f} options, your command line +@code{awk} programs can use facilities in @code{awk} library files. +@xref{Library Functions, , A Library of @code{awk} Functions}. + +Path searching is not done if @code{gawk} is in compatibility mode. +This is true for both @samp{--traditional} and @samp{--posix}. +@xref{Options, ,Command Line Options}. + +@strong{Note:} if you want files in the current directory to be found, +you must include the current directory in the path, either by including +@file{.} explicitly in the path, or by writing a null entry in the +path. (A null entry is indicated by starting or ending the path with a +colon, or by placing two colons next to each other (@samp{::}).) If the +current directory is not included in the path, then files cannot be +found in the current directory. This path search mechanism is identical +to the shell's. +@c someday, @cite{The Bourne Again Shell}.... 
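+
+As a rough illustration of the null-entry rule, any one of the following
+Bourne shell assignments should cause the current directory to be searched.
+The directory @file{/usr/local/share/awk} is only an assumed library
+location; substitute the one used on your system.
+
+@example
+# choose one of the three assignments
+AWKPATH=.:/usr/local/share/awk    # explicit current directory
+AWKPATH=:/usr/local/share/awk     # leading null entry
+AWKPATH=/usr/local/share/awk:     # trailing null entry
+export AWKPATH
+@end example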
+ +Starting with version 3.0, if @code{AWKPATH} is not defined in the +environment, @code{gawk} will place its default search path into +@code{ENVIRON["AWKPATH"]}. This makes it easy to determine +the actual search path @code{gawk} will use. + +@node Obsolete, Undocumented, AWKPATH Variable, Invoking Gawk +@section Obsolete Options and/or Features + +@cindex deprecated options +@cindex obsolete options +@cindex deprecated features +@cindex obsolete features +This section describes features and/or command line options from +previous releases of @code{gawk} that are either not available in the +current version, or that are still supported but deprecated (meaning that +they will @emph{not} be in the next release). + +@c update this section for each release! + +For version @value{VERSION}.@value{PATCHLEVEL} of @code{gawk}, there are no +command line options +or other deprecated features from the previous version of @code{gawk}. +@iftex +This section +@end iftex +@ifinfo +This node +@end ifinfo +is thus essentially a place holder, +in case some option becomes obsolete in a future version of @code{gawk}. + +@ignore +@c This is pretty old news... +The public-domain version of @code{strftime} that is distributed with +@code{gawk} changed for the 2.14 release. The @samp{%V} conversion specifier +that used to generate the date in VMS format was changed to @samp{%v}. +This is because the POSIX standard for the @code{date} utility now +specifies a @samp{%V} conversion specifier. +@xref{Time Functions, ,Functions for Dealing with Time Stamps}, for details. +@end ignore + +@node Undocumented, Known Bugs, Obsolete, Invoking Gawk +@section Undocumented Options and Features +@cindex undocumented features +@display +@i{Use the Source, Luke!} +Obi-Wan +@end display +@sp 1 + +This section intentionally left blank. + +@c Read The Source, Luke! + +@ignore +@c If these came out in the Info file or TeX document, then they wouldn't +@c be undocumented, would they? 
+ +@code{gawk} has one undocumented option: + +@table @code +@item -W nostalgia +@itemx --nostalgia +Print the message @code{"awk: bailing out near line 1"} and dump core. +This option was inspired by the common behavior of very early versions of +Unix @code{awk}, and by a t--shirt. +@end table + +Early versions of @code{awk} used to not require any separator (either +a newline or @samp{;}) between the rules in @code{awk} programs. Thus, +it was common to see one-line programs like: + +@example +awk '@{ sum += $1 @} END @{ print sum @}' +@end example + +@code{gawk} actually supports this, but it is purposely undocumented +since it is considered bad style. The correct way to write such a program +is either + +@example +awk '@{ sum += $1 @} ; END @{ print sum @}' +@end example + +@noindent +or + +@example +awk '@{ sum += $1 @} + END @{ print sum @}' data +@end example + +@noindent +@xref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a fuller +explanation. + +@end ignore + +@node Known Bugs, , Undocumented, Invoking Gawk +@section Known Bugs in @code{gawk} +@cindex bugs, known in @code{gawk} +@cindex known bugs + +@itemize @bullet +@item +The @samp{-F} option for changing the value of @code{FS} +(@pxref{Options, ,Command Line Options}) +is not necessary given the command line variable +assignment feature; it remains only for backwards compatibility. + +@item +If your system actually has support for @file{/dev/fd} and the +associated @file{/dev/stdin}, @file{/dev/stdout}, and +@file{/dev/stderr} files, you may get different output from @code{gawk} +than you would get on a system without those files. When @code{gawk} +interprets these files internally, it synchronizes output to the +standard output with output to @file{/dev/stdout}, while on a system +with those files, the output is actually to different open files +(@pxref{Special Files, ,Special File Names in @code{gawk}}). 
+ +@item +Syntactically invalid single character programs tend to overflow +the parse stack, generating a rather unhelpful message. Such programs +are surprisingly difficult to diagnose in the completely general case, +and the effort to do so really is not worth it. +@end itemize + +@node Library Functions, Sample Programs, Invoking Gawk, Top +@chapter A Library of @code{awk} Functions + +@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!! +This chapter presents a library of useful @code{awk} functions. The +sample programs presented later +(@pxref{Sample Programs, ,Practical @code{awk} Programs}) +use these functions. +The functions are presented here in a progression from simple to complex. + +@ref{Extract Program, ,Extracting Programs from Texinfo Source Files}, +presents a program that you can use to extract the source code for +these example library functions and programs from the Texinfo source +for this @value{DOCUMENT}. +(This has already been done as part of the @code{gawk} distribution.) + +If you have written one or more useful, general purpose @code{awk} functions, +and would like to contribute them for a subsequent edition of this @value{DOCUMENT}, +please contact the author. @xref{Bugs, ,Reporting Problems and Bugs}, +for information on doing this. Don't just send code, as you will be +required to either place your code in the public domain, +publish it under the GPL (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}), +or assign the copyright in it to the Free Software Foundation. + +@menu +* Portability Notes:: What to do if you don't have @code{gawk}. +* Nextfile Function:: Two implementations of a @code{nextfile} + function. +* Assert Function:: A function for assertions in @code{awk} + programs. +* Round Function:: A function for rounding if @code{sprintf} does + not do it correctly. +* Ordinal Functions:: Functions for using characters as numbers and + vice versa. +* Join Function:: A function to join an array into a string. 
+* Mktime Function:: A function to turn a date into a timestamp. +* Gettimeofday Function:: A function to get formatted times. +* Filetrans Function:: A function for handling data file transitions. +* Getopt Function:: A function for processing command line + arguments. +* Passwd Functions:: Functions for getting user information. +* Group Functions:: Functions for getting group information. +* Library Names:: How to best name private global variables in + library functions. +@end menu + +@node Portability Notes, Nextfile Function, Library Functions, Library Functions +@section Simulating @code{gawk}-specific Features +@cindex portability issues + +The programs in this chapter and in +@ref{Sample Programs, ,Practical @code{awk} Programs}, +freely use features that are specific to @code{gawk}. +This section briefly discusses how you can rewrite these programs for +different implementations of @code{awk}. + +Diagnostic error messages are sent to @file{/dev/stderr}. +Use @samp{| "cat 1>&2"} instead of @samp{> "/dev/stderr"}, if your system +does not have a @file{/dev/stderr}, or if you cannot use @code{gawk}. + +A number of programs use @code{nextfile} +(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}), +to skip any remaining input in the input file. +@ref{Nextfile Function, ,Implementing @code{nextfile} as a Function}, +shows you how to write a function that will do the same thing. + +Finally, some of the programs choose to ignore upper-case and lower-case +distinctions in their input. They do this by assigning one to @code{IGNORECASE}. +You can achieve the same effect by adding the following rule to the +beginning of the program: + +@example +# ignore case +@{ $0 = tolower($0) @} +@end example + +@noindent +Also, verify that all regexp and string constants used in +comparisons only use lower-case letters. 
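+
+As a small sketch of this substitution, the following hypothetical
+program counts the records that contain @samp{error} in any mixture of
+upper and lower case, without relying on @code{IGNORECASE}. Note that
+the first rule changes @code{$0} itself, so any record printed later
+appears in lower case.
+
+@example
+# ignore case
+@{ $0 = tolower($0) @}
+
+# the regexp constant uses only lower-case letters
+/error/ @{ count++ @}
+
+END @{ print count + 0 @}    # "+ 0" prints 0 instead of an empty line
+@end example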
+ +@node Nextfile Function, Assert Function, Portability Notes, Library Functions +@section Implementing @code{nextfile} as a Function + +@cindex skipping input files +@cindex input files, skipping +The @code{nextfile} statement presented in +@ref{Nextfile Statement, ,The @code{nextfile} Statement}, +is a @code{gawk}-specific extension. It is not available in other +implementations of @code{awk}. This section shows two versions of a +@code{nextfile} function that you can use to simulate @code{gawk}'s +@code{nextfile} statement if you cannot use @code{gawk}. + +Here is a first attempt at writing a @code{nextfile} function. + +@example +@group +# nextfile --- skip remaining records in current file + +# this should be read in before the "main" awk program + +function nextfile() @{ _abandon_ = FILENAME; next @} + +_abandon_ == FILENAME @{ next @} +@end group +@end example + +This file should be included before the main program, because it supplies +a rule that must be executed first. This rule compares the current data +file's name (which is always in the @code{FILENAME} variable) to a private +variable named @code{_abandon_}. If the file name matches, then the action +part of the rule executes a @code{next} statement, to go on to the next +record. (The use of @samp{_} in the variable name is a convention. +It is discussed more fully in +@ref{Library Names, , Naming Library Function Global Variables}.) + +The use of the @code{next} statement effectively creates a loop that reads +all the records from the current data file. +Eventually, the end of the file is reached, and +a new data file is opened, changing the value of @code{FILENAME}. +Once this happens, the comparison of @code{_abandon_} to @code{FILENAME} +fails, and execution continues with the first rule of the ``real'' program. 
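+
+As an illustration, assuming the code above is saved in a file named
+@file{nextfile.awk} (the file names here are hypothetical), a program
+that prints only the first record of each data file might look like this:
+
+@example
+# first.awk: print the first record of each input file
+@{ print FILENAME ": " $0; nextfile() @}
+@end example
+
+@noindent
+It would be run with the library file named first, using an @code{awk}
+that has no built-in @code{nextfile} statement:
+
+@example
+awk -f nextfile.awk -f first.awk file1 file2
+@end example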
+ +The @code{nextfile} function itself simply sets the value of @code{_abandon_} +and then executes a @code{next} statement to start the loop +going.@footnote{Some implementations of @code{awk} do not allow you to +execute @code{next} from within a function body. Some other work-around +will be necessary if you use such a version.} +@c mawk is what we're talking about. + +This initial version has a subtle problem. What happens if the same data +file is listed @emph{twice} on the command line, one right after the other, +or even with just a variable assignment between the two occurrences of +the file name? + +@c @findex nextfile +@c do it this way, since all the indices are merged +@cindex @code{nextfile} function +In such a case, +this code will skip right through the file, a second time, even though +it should stop when it gets to the end of the first occurrence. +Here is a second version of @code{nextfile} that remedies this problem. + +@example +@group +@c file eg/lib/nextfile.awk +# nextfile --- skip remaining records in current file +# correctly handle successive occurrences of the same file +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May, 1993 + +# this should be read in before the "main" awk program + +function nextfile() @{ _abandon_ = FILENAME; next @} + +_abandon_ == FILENAME @{ + if (FNR == 1) + _abandon_ = "" + else + next +@} +@c endfile +@end group +@end example + +The @code{nextfile} function has not changed. It sets @code{_abandon_} +equal to the current file name and then executes a @code{next} statement. +The @code{next} statement reads the next record and increments @code{FNR}, +so @code{FNR} is guaranteed to have a value of at least two. +However, if @code{nextfile} is called for the last record in the file, +then @code{awk} will close the current data file and move on to the next +one. Upon doing so, @code{FILENAME} will be set to the name of the new file, +and @code{FNR} will be reset to one.
If this next file is the same as +the previous one, @code{_abandon_} will still be equal to @code{FILENAME}. +However, @code{FNR} will be equal to one, telling us that this is a new +occurrence of the file, and not the one we were reading when the +@code{nextfile} function was executed. In that case, @code{_abandon_} +is reset to the empty string, so that further executions of this rule +will fail (until the next time that @code{nextfile} is called). + +If @code{FNR} is not one, then we are still in the original data file, +and the program executes a @code{next} statement to skip through it. + +An important question to ask at this point is: ``Given that the +functionality of @code{nextfile} can be provided with a library file, +why is it built into @code{gawk}?'' This is an important question. Adding +features for little reason leads to larger, slower programs that are +harder to maintain. + +The answer is that building @code{nextfile} into @code{gawk} provides +significant gains in efficiency. If the @code{nextfile} function is executed +at the beginning of a large data file, @code{awk} still has to scan the entire +file, splitting it up into records, just to skip over it. The built-in +@code{nextfile} can simply close the file immediately and proceed to the +next one, saving a lot of time. This is particularly important in +@code{awk}, since @code{awk} programs are generally I/O bound (i.e.@: +they spend most of their time doing input and output, instead of performing +computations). + +@node Assert Function, Round Function, Nextfile Function, Library Functions +@section Assertions + +@cindex assertions +@cindex @code{assert}, C version +When writing large programs, it is often useful to be able to know +that a condition or set of conditions is true. Before proceeding with a +particular computation, you make a statement about what you believe to be +the case. 
Such a statement is known as an +``assertion.'' The C language provides an @code{<assert.h>} header file +and corresponding @code{assert} macro that the programmer can use to make +assertions. If an assertion fails, the @code{assert} macro arranges to +print a diagnostic message describing the condition that should have +been true but was not, and then it kills the program. In C, using +@code{assert} looks like this: + +@example +#include <assert.h> + +int myfunc(int a, double b) +@{ + assert(a <= 5 && b >= 17); + @dots{} +@} +@end example + +If the assertion failed, the program would print a message similar to +this: + +@example +prog.c:5: assertion failed: a <= 5 && b >= 17 +@end example + +@findex assert +The ANSI C language makes it possible to turn the condition into a string for use +in printing the diagnostic message. This is not possible in @code{awk}, so +this @code{assert} function also requires a string version of the condition +that is being tested. + +@example +@c @group +@c file eg/lib/assert.awk +# assert --- assert that a condition is true. Otherwise exit. +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May, 1993 + +function assert(condition, string) +@{ + if (! condition) @{ + printf("%s:%d: assertion failed: %s\n", + FILENAME, FNR, string) > "/dev/stderr" + _assert_exit = 1 + exit 1 + @} +@} + +END @{ + if (_assert_exit) + exit 1 +@} +@c endfile +@c @end group +@end example + +The @code{assert} function tests the @code{condition} parameter. If it +is false, it prints a message to standard error, using the @code{string} +parameter to describe the failed condition. It then sets the variable +@code{_assert_exit} to one, and executes the @code{exit} statement. +The @code{exit} statement jumps to the @code{END} rule. If the @code{END} +rule finds @code{_assert_exit} to be true, then it exits immediately. + +The purpose of the @code{END} rule with its test is to +keep any other @code{END} rules from running.
When an assertion fails, the +program should exit immediately. +If no assertions fail, then @code{_assert_exit} will still be +false when the @code{END} rule is run normally, and the rest of the +program's @code{END} rules will execute. +For all of this to work correctly, @file{assert.awk} must be the +first source file read by @code{awk}. + +You would use this function in your programs this way: + +@example +function myfunc(a, b) +@{ + assert(a <= 5 && b >= 17, "a <= 5 && b >= 17") + @dots{} +@} +@end example + +@noindent +If the assertion failed, you would see a message like this: + +@example +mydata:1357: assertion failed: a <= 5 && b >= 17 +@end example + +There is a problem with this version of @code{assert}, that it may not +be possible to work around. An @code{END} rule is automatically added +to the program calling @code{assert}. Normally, if a program consists +of just a @code{BEGIN} rule, the input files and/or standard input are +not read. However, now that the program has an @code{END} rule, @code{awk} +will attempt to read the input data files, or standard input +(@pxref{Using BEGIN/END, , Startup and Cleanup Actions}), +most likely causing the program to hang, waiting for input. + +@node Round Function, Ordinal Functions, Assert Function, Library Functions +@section Rounding Numbers + +@cindex rounding +The way @code{printf} and @code{sprintf} +(@pxref{Printf, , Using @code{printf} Statements for Fancier Printing}) +do rounding will often depend +upon the system's C @code{sprintf} subroutine. +On many machines, +@code{sprintf} rounding is ``unbiased,'' which means it doesn't always +round a trailing @samp{.5} up, contrary to naive expectations. In unbiased +rounding, @samp{.5} rounds to even, rather than always up, so 1.5 rounds to +2 but 4.5 rounds to 4. +The result is that if you are using a format that does +rounding (e.g., @code{"%.0f"}) you should check what your system does. 
+The following function does traditional rounding; +it might be useful if your awk's @code{printf} does unbiased rounding. + +@findex round +@example +@c file eg/lib/round.awk +# round --- do normal rounding +# +# Arnold Robbins, arnold@@gnu.ai.mit.edu, August, 1996 +# Public Domain + +function round(x, ival, aval, fraction) +@{ + ival = int(x) # integer part, int() truncates + + # see if fractional part + if (ival == x) # no fraction + return x + + if (x < 0) @{ + aval = -x # absolute value + ival = int(aval) + fraction = aval - ival + if (fraction >= .5) + return int(x) - 1 # -2.5 --> -3 + else + return int(x) # -2.3 --> -2 + @} else @{ + fraction = x - ival + if (fraction >= .5) + return ival + 1 + else + return ival + @} +@} + +# test harness +@{ print $0, round($0) @} +@c endfile +@end example + +@node Ordinal Functions, Join Function, Round Function, Library Functions +@section Translating Between Characters and Numbers + +@cindex numeric character values +@cindex values of characters as numbers +One commercial implementation of @code{awk} supplies a built-in function, +@code{ord}, which takes a character and returns the numeric value for that +character in the machine's character set. If the string passed to +@code{ord} has more than one character, only the first one is used. + +The inverse of this function is @code{chr} (from the function of the same +name in Pascal), which takes a number and returns the corresponding character. + +Both functions can be written very nicely in @code{awk}; there is no real +reason to build them into the @code{awk} interpreter. 
+ +@findex ord +@findex chr +@example +@group +@c file eg/lib/ord.awk +# ord.awk --- do ord and chr +# +# Global identifiers: +# _ord_: numerical values indexed by characters +# _ord_init: function to initialize _ord_ +# +# Arnold Robbins +# arnold@@gnu.ai.mit.edu +# Public Domain +# 16 January, 1992 +# 20 July, 1992, revised + +BEGIN @{ _ord_init() @} +@c endfile +@end group + +@c @group +@c file eg/lib/ord.awk +function _ord_init( low, high, i, t) +@{ + low = sprintf("%c", 7) # BEL is ascii 7 + if (low == "\a") @{ # regular ascii + low = 0 + high = 127 + @} else if (sprintf("%c", 128 + 7) == "\a") @{ + # ascii, mark parity + low = 128 + high = 255 + @} else @{ # ebcdic(!) + low = 0 + high = 255 + @} + + for (i = low; i <= high; i++) @{ + t = sprintf("%c", i) + _ord_[t] = i + @} +@} +@c endfile +@c @end group +@end example + +@cindex character sets +@cindex character encodings +@cindex ASCII +@cindex EBCDIC +@cindex mark parity +Some explanation of the numbers used by @code{chr} is worthwhile. +The most prominent character set in use today is ASCII. Although an +eight-bit byte can hold 256 distinct values (from zero to 255), ASCII only +defines characters that use the values from zero to 127.@footnote{ASCII +has been extended in many countries to use the values from 128 to 255 +for country-specific characters. If your system uses these extensions, +you can simplify @code{_ord_init} to simply loop from zero to 255.} +At least one computer manufacturer that we know of +@c Pr1me, blech +uses ASCII, but with mark parity, meaning that the leftmost bit in the byte +is always one. What this means is that on those systems, characters +have numeric values from 128 to 255. +Finally, large mainframe systems use the EBCDIC character set, which +uses all 256 values. +While there are other character sets in use on some older systems, +they are not really worth worrying about. 
+ +@example +@group +@c file eg/lib/ord.awk +function ord(str, c) +@{ + # only first character is of interest + c = substr(str, 1, 1) + return _ord_[c] +@} +@c endfile +@end group + +@group +@c file eg/lib/ord.awk +function chr(c) +@{ + # force c to be numeric by adding 0 + return sprintf("%c", c + 0) +@} +@c endfile +@end group + +@c @group +@c file eg/lib/ord.awk +#### test code #### +# BEGIN \ +# @{ +# for (;;) @{ +# printf("enter a character: ") +# if (getline var <= 0) +# break +# printf("ord(%s) = %d\n", var, ord(var)) +# @} +# @} +@c endfile +@c @end group +@end example + +An obvious improvement to these functions would be to move the code for the +@code{@w{_ord_init}} function into the body of the @code{BEGIN} rule. It was +written this way initially for ease of development. + +There is a ``test program'' in a @code{BEGIN} rule, for testing the +function. It is commented out for production use. + +@node Join Function, Mktime Function, Ordinal Functions, Library Functions +@section Merging an Array Into a String + +@cindex merging strings +When doing string processing, it is often useful to be able to join +all the strings in an array into one long string. The following function, +@code{join}, accomplishes this task. It is used later in several of +the application programs +(@pxref{Sample Programs, ,Practical @code{awk} Programs}). + +Good function design is important; this function needs to be general, but it +should also have a reasonable default behavior. It is called with an array +and the beginning and ending indices of the elements in the array to be +merged. This assumes that the array indices are numeric---a reasonable +assumption since the array was likely created with @code{split} +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). 
+ +@findex join +@example +@group +@c file eg/lib/join.awk +# join.awk --- join an array into a string +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May 1993 + +function join(array, start, end, sep, result, i) +@{ + if (sep == "") + sep = " " + else if (sep == SUBSEP) # magic value + sep = "" + result = array[start] + for (i = start + 1; i <= end; i++) + result = result sep array[i] + return result +@} +@c endfile +@end group +@end example + +An optional additional argument is the separator to use when joining the +strings back together. If the caller supplies a non-empty value, +@code{join} uses it. If it is not supplied, it will have a null +value. In this case, @code{join} uses a single blank as a default +separator for the strings. If the value is equal to @code{SUBSEP}, +then @code{join} joins the strings with no separator between them. +@code{SUBSEP} serves as a ``magic'' value to indicate that there should +be no separation between the component strings. + +It would be nice if @code{awk} had an assignment operator for concatenation. +The lack of an explicit operator for concatenation makes string operations +more difficult than they really need to be. + +@node Mktime Function, Gettimeofday Function, Join Function, Library Functions +@section Turning Dates Into Timestamps + +The @code{systime} function built in to @code{gawk} +returns the current time of day as +a timestamp in ``seconds since the Epoch.'' This timestamp +can be converted into a printable date of almost infinitely variable +format using the built-in @code{strftime} function. +(For more information on @code{systime} and @code{strftime}, +@pxref{Time Functions, ,Functions for Dealing with Time Stamps}.) + +@cindex converting dates to timestamps +@cindex dates, converting to timestamps +@cindex timestamps, converting from dates +An interesting but difficult problem is to convert a readable representation +of a date back into a timestamp. 
The ANSI C library provides a @code{mktime} +function that does the basic job, converting a canonical representation of a +date into a timestamp. + +It would appear at first glance that @code{gawk} would have to supply a +@code{mktime} built-in function that was simply a ``hook'' to the C language +version. In fact though, @code{mktime} can be implemented entirely in +@code{awk}. + +Here is a version of @code{mktime} for @code{awk}. It takes a simple +representation of the date and time, and converts it into a timestamp. + +The code is presented here intermixed with explanatory prose. In +@ref{Extract Program, ,Extracting Programs from Texinfo Source Files}, +you will see how the Texinfo source file for this @value{DOCUMENT} +can be processed to extract the code into a single source file. + +The program begins with a descriptive comment and a @code{BEGIN} rule +that initializes a table @code{_tm_months}. This table is a two-dimensional +array that has the lengths of the months. The first index is zero for +regular years, and one for leap years. The values are the same for all the +months in both kinds of years, except for February; thus the use of multiple +assignment. 
+ +@example +@c @group +@c file eg/lib/mktime.awk +# mktime.awk --- convert a canonical date representation +# into a timestamp +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May 1993 + +BEGIN \ +@{ + # Initialize table of month lengths + _tm_months[0,1] = _tm_months[1,1] = 31 + _tm_months[0,2] = 28; _tm_months[1,2] = 29 + _tm_months[0,3] = _tm_months[1,3] = 31 + _tm_months[0,4] = _tm_months[1,4] = 30 + _tm_months[0,5] = _tm_months[1,5] = 31 + _tm_months[0,6] = _tm_months[1,6] = 30 + _tm_months[0,7] = _tm_months[1,7] = 31 + _tm_months[0,8] = _tm_months[1,8] = 31 + _tm_months[0,9] = _tm_months[1,9] = 30 + _tm_months[0,10] = _tm_months[1,10] = 31 + _tm_months[0,11] = _tm_months[1,11] = 30 + _tm_months[0,12] = _tm_months[1,12] = 31 +@} +@c endfile +@c @end group +@end example + +The benefit of merging multiple @code{BEGIN} rules +(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}) +is particularly clear when writing library files. Functions in library +files can cleanly initialize their own private data and also provide clean-up +actions in private @code{END} rules. + +The next function is a simple one that computes whether a given year is or +is not a leap year. If a year is evenly divisible by four, but not evenly +divisible by 100, or if it is evenly divisible by 400, then it is a leap +year. Thus, 1904 was a leap year, 1900 was not, but 2000 will be. +@c Change this after the year 2000 to ``2000 was'' (:-) + +@findex _tm_isleap +@example +@group +@c file eg/lib/mktime.awk +# decide if a year is a leap year +function _tm_isleap(year, ret) +@{ + ret = (year % 4 == 0 && year % 100 != 0) || + (year % 400 == 0) + + return ret +@} +@c endfile +@end group +@end example + +This function is only used a few times in this file, and its computation +could have been written @dfn{in-line} (at the point where it's used). 
+Making it a separate function made the original development easier, and also +avoids the possibility of typing errors when duplicating the code in +multiple places. + +The next function is more interesting. It does most of the work of +generating a timestamp, which is converting a date and time into some number +of seconds since the Epoch. The caller passes an array (rather +imaginatively named @code{a}) containing six +values: the year including century, the month as a number between one and 12, +the day of the month, the hour as a number between zero and 23, the minute in +the hour, and the seconds within the minute. + +The function uses several local variables to precompute the number of +seconds in an hour, seconds in a day, and seconds in a year. Often, +similar C code simply writes out the expression in-line, expecting the +compiler to do @dfn{constant folding}. E.g., most C compilers would +turn @samp{60 * 60} into @samp{3600} at compile time, instead of recomputing +it every time at run time. Precomputing these values makes the +function more efficient. + +@findex _tm_addup +@example +@c @group +@c file eg/lib/mktime.awk +# convert a date into seconds +function _tm_addup(a, total, yearsecs, daysecs, + hoursecs, i, j) +@{ + hoursecs = 60 * 60 + daysecs = 24 * hoursecs + yearsecs = 365 * daysecs + + total = (a[1] - 1970) * yearsecs + +@group + # extra day for leap years + for (i = 1970; i < a[1]; i++) + if (_tm_isleap(i)) + total += daysecs +@end group + +@group + j = _tm_isleap(a[1]) + for (i = 1; i < a[2]; i++) + total += _tm_months[j, i] * daysecs +@end group + + total += (a[3] - 1) * daysecs + total += a[4] * hoursecs + total += a[5] * 60 + total += a[6] + + return total +@} +@c endfile +@c @end group +@end example + +The function starts with a first approximation of all the seconds between +Midnight, January 1, 1970,@footnote{This is the Epoch on POSIX systems. +It may be different on other systems.} and the beginning of the current +year. 
It then goes through all those years, and for every leap year, +adds an additional day's worth of seconds. + +The variable @code{j} holds either one or zero, if the current year is or is not +a leap year. +For every month in the current year prior to the current month, it adds +the number of seconds in the month, using the appropriate entry in the +@code{_tm_months} array. + +Finally, it adds in the seconds for the number of days prior to the current +day, and the number of hours, minutes, and seconds in the current day. + +The result is a count of seconds since January 1, 1970. This value is not +yet what is needed though. The reason why is described shortly. + +The main @code{mktime} function takes a single character string argument. +This string is a representation of a date and time in a ``canonical'' +(fixed) form. This string should be +@code{"@var{year} @var{month} @var{day} @var{hour} @var{minute} @var{second}"}. + +@findex mktime +@example +@c @group +@c file eg/lib/mktime.awk +# mktime --- convert a date into seconds, +# compensate for time zone + +function mktime(str, res1, res2, a, b, i, j, t, diff) +@{ + i = split(str, a, " ") # don't rely on FS + + if (i != 6) + return -1 + + # force numeric + for (j in a) + a[j] += 0 + +@group + # validate + if (a[1] < 1970 || + a[2] < 1 || a[2] > 12 || + a[3] < 1 || a[3] > 31 || + a[4] < 0 || a[4] > 23 || + a[5] < 0 || a[5] > 59 || + a[6] < 0 || a[6] > 60 ) + return -1 +@end group + + res1 = _tm_addup(a) + t = strftime("%Y %m %d %H %M %S", res1) + + if (_tm_debug) + printf("(%s) -> (%s)\n", str, t) > "/dev/stderr" + + split(t, b, " ") + res2 = _tm_addup(b) + + diff = res1 - res2 + + if (_tm_debug) + printf("diff = %d seconds\n", diff) > "/dev/stderr" + + res1 += diff + + return res1 +@} +@c endfile +@c @end group +@end example + +The function first splits the string into an array, using spaces and tabs as +separators. If there are not six elements in the array, it returns an +error, signaled as the value @minus{}1. 
+Next, it forces each element of the array to be numeric, by adding zero to it. +The following @samp{if} statement then makes sure that each element is +within an allowable range. (This checking could be extended further, e.g., +to make sure that the day of the month is within the correct range for the +particular month supplied.) All of this is essentially preliminary set-up +and error checking. + +Recall that @code{_tm_addup} generated a value in seconds since Midnight, +January 1, 1970. This value is not directly usable as the result we want, +@emph{since the calculation does not account for the local timezone}. In other +words, the value represents the count in seconds since the Epoch, but only +for UTC (Coordinated Universal Time). If the local timezone is east or west +of UTC, then some number of hours should be either added to, or subtracted from +the resulting timestamp. + +For example, Atlanta, Georgia (USA), is normally five hours west +of (behind) UTC. It is only four hours behind UTC if daylight saving +time is in effect. +If you are calling @code{mktime} in Atlanta, with the argument +@code{@w{"1993 5 23 18 23 12"}}, the result from @code{_tm_addup} will be +for 6:23 p.m. UTC, which is only 2:23 p.m. in Atlanta. It is necessary to +add another four hours' worth of seconds to the result. + +How can @code{mktime} determine how far away it is from UTC? This is +surprisingly easy. The timestamp returned by @code{_tm_addup} represents the time passed to +@code{mktime} @emph{as UTC}. This timestamp can be fed back to +@code{strftime}, which will format it as a @emph{local} time; i.e.@: as +if it already had the UTC difference added in to it. This is done by +giving @code{@w{"%Y %m %d %H %M %S"}} to @code{strftime} as the format +argument. It returns the computed timestamp in the original string +format. The result represents a time that accounts for the UTC +difference.
When the new time is converted back to a timestamp, the +difference between the two timestamps is the difference (in seconds) +between the local timezone and UTC. This difference is then added back +to the original result. An example demonstrating this is presented below. + +Finally, there is a ``main'' program for testing the function. + +@example +@c @group +@c file eg/lib/mktime.awk +BEGIN @{ + if (_tm_test) @{ + printf "Enter date as yyyy mm dd hh mm ss: " + getline _tm_test_date + + t = mktime(_tm_test_date) + r = strftime("%Y %m %d %H %M %S", t) + printf "Got back (%s)\n", r + @} +@} +@c endfile +@c @end group +@end example + +The entire program uses two variables that can be set on the command +line to control debugging output and to enable the test in the final +@code{BEGIN} rule. Here is the result of a test run. (Note that debugging +output is to standard error, and test output is to standard output.) + +@example +@c @group +$ gawk -f mktime.awk -v _tm_test=1 -v _tm_debug=1 +@print{} Enter date as yyyy mm dd hh mm ss: 1993 5 23 15 35 10 +@error{} (1993 5 23 15 35 10) -> (1993 05 23 11 35 10) +@error{} diff = 14400 seconds +@print{} Got back (1993 05 23 15 35 10) +@c @end group +@end example + +The time entered was 3:35 p.m. (15:35 on a 24-hour clock), on May 23, 1993. +The first line +of debugging output shows the entered time, interpreted as UTC, next to the +same instant formatted as local time; UTC is four hours ahead of +the local time zone. The second line shows that the difference is 14400 +seconds, which is four hours. (The difference is only four hours, since +daylight saving time is in effect during May.) +The final line of test output shows that the timezone compensation +algorithm works; the returned time is the same as the entered time. + +This program does not solve the general problem of turning an arbitrary date +representation into a timestamp. That problem is very involved. However, +the @code{mktime} function provides a foundation upon which to build.
Other +software can convert month names into numeric months, and AM/PM times into +24-hour clocks, to generate the ``canonical'' format that @code{mktime} +requires. + +@node Gettimeofday Function, Filetrans Function, Mktime Function, Library Functions +@section Managing the Time of Day + +@cindex formatted timestamps +@cindex timestamps, formatted +The @code{systime} and @code{strftime} functions described in +@ref{Time Functions, ,Functions for Dealing with Time Stamps}, +provide the minimum functionality necessary for dealing with the time of day +in human-readable form. While @code{strftime} is extensive, the control +formats are not necessarily easy to remember or intuitively obvious when +reading a program. + +The following function, @code{gettimeofday}, populates a user-supplied array +with pre-formatted time information. It returns a string with the current +time formatted in the same way as the @code{date} utility. + +@findex gettimeofday +@example +@c @group +@c file eg/lib/gettime.awk +# gettimeofday --- get the time of day in a usable format +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain, May 1993 +# +# Returns a string in the format of output of date(1) +# Populates the array argument time with individual values: +# time["second"] -- seconds (0 - 59) +# time["minute"] -- minutes (0 - 59) +# time["hour"] -- hours (0 - 23) +# time["althour"] -- hours (1 - 12) +# time["monthday"] -- day of month (1 - 31) +# time["month"] -- month of year (1 - 12) +# time["monthname"] -- name of the month +# time["shortmonth"] -- short name of the month +# time["year"] -- year within century (0 - 99) +# time["fullyear"] -- year with century (19xx or 20xx) +# time["weekday"] -- day of week (Sunday = 0) +# time["altweekday"] -- day of week (Monday = 1) +# time["weeknum"] -- week number, Sunday first day +# time["altweeknum"] -- week number, Monday first day +# time["dayname"] -- name of weekday +# time["shortdayname"] -- short name of weekday +# time["yearday"] --
day of year (1 - 366) +# time["timezone"] -- abbreviation of timezone name +# time["ampm"] -- AM or PM designation + +@group +function gettimeofday(time, ret, now, i) +@{ + # get time once, avoids unnecessary system calls + now = systime() + + # return date(1)-style output + ret = strftime("%a %b %d %H:%M:%S %Z %Y", now) + + # clear out target array + for (i in time) + delete time[i] +@end group + +@group + # fill in values, force numeric values to be + # numeric by adding 0 + time["second"] = strftime("%S", now) + 0 + time["minute"] = strftime("%M", now) + 0 + time["hour"] = strftime("%H", now) + 0 + time["althour"] = strftime("%I", now) + 0 + time["monthday"] = strftime("%d", now) + 0 + time["month"] = strftime("%m", now) + 0 + time["monthname"] = strftime("%B", now) + time["shortmonth"] = strftime("%b", now) + time["year"] = strftime("%y", now) + 0 + time["fullyear"] = strftime("%Y", now) + 0 + time["weekday"] = strftime("%w", now) + 0 + time["altweekday"] = strftime("%u", now) + 0 + time["dayname"] = strftime("%A", now) + time["shortdayname"] = strftime("%a", now) + time["yearday"] = strftime("%j", now) + 0 + time["timezone"] = strftime("%Z", now) + time["ampm"] = strftime("%p", now) + time["weeknum"] = strftime("%U", now) + 0 + time["altweeknum"] = strftime("%W", now) + 0 + + return ret +@} +@end group +@c endfile +@end example + +The string indices are easier to use and read than the various formats +required by @code{strftime}. The @code{alarm} program presented in +@ref{Alarm Program, ,An Alarm Clock Program}, +uses this function. + +@c exercise!!! +The @code{gettimeofday} function is presented above as it was written. A +more general design for this function would have allowed the user to supply +an optional timestamp value that would have been used instead of the current +time.
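A minimal sketch of that more general design might look like the following. It is not part of the library: the function name @code{gettime_at} and the parameter name @code{when} are hypothetical, and only a few array elements are filled in to keep the sketch short. It assumes @code{gawk}, since @code{systime} and @code{strftime} are @code{gawk} extensions.

```awk
# gettime_at --- hypothetical variant of gettimeofday that accepts
# an optional timestamp (requires gawk's systime and strftime)
function gettime_at(time, when,    ret, i)
{
    # an omitted parameter compares equal to the null string,
    # so fall back to the current time in that case
    if (when == "")
        when = systime()

    # return date(1)-style output
    ret = strftime("%a %b %d %H:%M:%S %Z %Y", when)

    # clear out target array
    for (i in time)
        delete time[i]

    # fill in a few values; the full list is omitted for brevity
    time["fullyear"] = strftime("%Y", when) + 0
    time["month"]    = strftime("%m", when) + 0
    time["monthday"] = strftime("%d", when) + 0
    time["hour"]     = strftime("%H", when) + 0
    time["minute"]   = strftime("%M", when) + 0
    time["second"]   = strftime("%S", when) + 0

    return ret
}

BEGIN {
    # timestamp 0 is the Epoch; the fields come back in local time
    gettime_at(t, 0)
    printf "%d-%02d-%02d %02d:%02d\n", t["fullyear"], t["month"],
           t["monthday"], t["hour"], t["minute"]
}
```

Calling @code{gettime_at(t)} with no second argument behaves like the original function, so existing callers would not need to change.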
+ +@node Filetrans Function, Getopt Function, Gettimeofday Function, Library Functions +@section Noting Data File Boundaries + +@cindex per file initialization and clean-up +The @code{BEGIN} and @code{END} rules are each executed exactly once, at +the beginning and end respectively of your @code{awk} program +(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}). +We (the @code{gawk} authors) once had a user who mistakenly thought that the +@code{BEGIN} rule was executed at the beginning of each data file and the +@code{END} rule was executed at the end of each data file. When informed +that this was not the case, the user requested that we add new special +patterns to @code{gawk}, named @code{BEGIN_FILE} and @code{END_FILE}, that +would have the desired behavior. He even supplied us the code to do so. + +However, after a little thought, I came up with the following library program. +It arranges to call two user-supplied functions, @code{beginfile} and +@code{endfile}, at the beginning and end of each data file. +Besides solving the problem in only nine(!) lines of code, it does so +@emph{portably}; this will work with any implementation of @code{awk}. + +@example +@c @group +# transfile.awk +# +# Give the user a hook for filename transitions +# +# The user must supply functions beginfile() and endfile() +# that each take the name of the file being started or +# finished, respectively. +# +# Arnold Robbins, arnold@@gnu.ai.mit.edu, January 1992 +# Public Domain + +FILENAME != _oldfilename \ +@{ + if (_oldfilename != "") + endfile(_oldfilename) + _oldfilename = FILENAME + beginfile(FILENAME) +@} + +END @{ endfile(FILENAME) @} +@c @end group +@end example + +This file must be loaded before the user's ``main'' program, so that the +rule it supplies will be executed first. + +This rule relies on @code{awk}'s @code{FILENAME} variable that +automatically changes for each new data file. 
The current file name is +saved in a private variable, @code{_oldfilename}. If @code{FILENAME} does +not equal @code{_oldfilename}, then a new data file is being processed, and +it is necessary to call @code{endfile} for the old file. Since +@code{endfile} should only be called if a file has been processed, the +program first checks to make sure that @code{_oldfilename} is not the null +string. The program then assigns the current file name to +@code{_oldfilename}, and calls @code{beginfile} for the file. +Since, like all @code{awk} variables, @code{_oldfilename} will be +initialized to the null string, this rule executes correctly even for the +first data file. + +The program also supplies an @code{END} rule, to do the final processing for +the last file. Since this @code{END} rule comes before any @code{END} rules +supplied in the ``main'' program, @code{endfile} will be called first. Once +again the value of multiple @code{BEGIN} and @code{END} rules should be clear. + +@findex beginfile +@findex endfile +This version has the same problem as the first version of @code{nextfile} +(@pxref{Nextfile Function, ,Implementing @code{nextfile} as a Function}). +If the same data file occurs twice in a row on the command line, then +@code{endfile} and @code{beginfile} will not be executed at the end of the +first pass and at the beginning of the second pass. +The following version solves the problem by testing @code{FNR}, which is +reset to one at the start of every data file, instead of comparing file names. + +@example +@c @group +@c file eg/lib/ftrans.awk +# ftrans.awk --- handle data file transitions +# +# user supplies beginfile() and endfile() functions +# +# Arnold Robbins, arnold@@gnu.ai.mit.edu. November 1992 +# Public Domain + +FNR == 1 @{ + if (_filename_ != "") + endfile(_filename_) + _filename_ = FILENAME + beginfile(FILENAME) +@} + +END @{ endfile(_filename_) @} +@c endfile +@c @end group +@end example + +In @ref{Wc Program, ,Counting Things}, +you will see how this library function can be used, and +how it simplifies writing the main program.
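As a smaller illustration, here is a sketch of a main program that counts the lines in each data file. The two hook functions and the @code{_nlines} variable are made up for this example; the first two rules are simply copied from @file{ftrans.awk} so that the sketch runs on its own.

```awk
# Rules copied from ftrans.awk so that this sketch is self-contained;
# normally they would be loaded with a separate -f ftrans.awk
FNR == 1 {
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
}

END { endfile(_filename_) }

# Hypothetical user-supplied hooks: count the lines in each file
function beginfile(name)
{
    _nlines = 0     # reset the per-file counter
}

function endfile(name)
{
    printf "%s: %d lines\n", name, _nlines
}

# Main program: runs for every record, after the FNR == 1 rule
{ _nlines++ }
```

Saved under a name such as @file{perfile.awk} (also made up), it could be run as @samp{awk -f perfile.awk file1 file2}, and should print one summary line per data file.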
+ +@node Getopt Function, Passwd Functions, Filetrans Function, Library Functions +@section Processing Command Line Options + +@cindex @code{getopt}, C version +@cindex processing arguments +@cindex argument processing +Most utilities on POSIX compatible systems take options or ``switches'' on +the command line that can be used to change the way a program behaves. +@code{awk} is an example of such a program +(@pxref{Options, ,Command Line Options}). +Often, options take @dfn{arguments}, data that the program needs to +correctly obey the command line option. For example, @code{awk}'s +@samp{-F} option requires a string to use as the field separator. +The first occurrence on the command line of either @samp{--} or a +string that does not begin with @samp{-} ends the options. + +Most Unix systems provide a C function named @code{getopt} for processing +command line arguments. The programmer provides a string describing the one +letter options. If an option requires an argument, it is followed in the +string with a colon. @code{getopt} is also passed the +count and values of the command line arguments, and is called in a loop. +@code{getopt} processes the command line arguments for option letters. +Each time around the loop, it returns a single character representing the +next option letter that it found, or @samp{?} if it found an invalid option. +When it returns @minus{}1, there are no options left on the command line. + +When using @code{getopt}, options that do not take arguments can be +grouped together. Furthermore, options that take arguments require that the +argument be present. The argument can immediately follow the option letter, +or it can be a separate command line argument. 
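The options-string convention described above is easy to express in @code{awk}. The following sketch uses a hypothetical helper, @code{needs_arg}, that is not part of any library; it only shows how a colon after a letter marks an option that takes an argument.

```awk
# needs_arg --- hypothetical helper: does the option letter take an
# argument, according to a getopt-style options string?
# Returns 1 if it does, 0 if it does not, -1 for an invalid letter
function needs_arg(options, letter,    i)
{
    i = index(options, letter)
    if (i == 0)
        return -1
    # a colon right after the letter means "takes an argument"
    return (substr(options, i + 1, 1) == ":")
}

BEGIN {
    print needs_arg("ab:c", "a")    # prints 0
    print needs_arg("ab:c", "b")    # prints 1
    print needs_arg("ab:c", "x")    # prints -1
}
```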
+ +Given a hypothetical program that takes +three command line options, @samp{-a}, @samp{-b}, and @samp{-c}, and +@samp{-b} requires an argument, all of the following are valid ways of +invoking the program: + +@example +@c @group +prog -a -b foo -c data1 data2 data3 +prog -ac -bfoo -- data1 data2 data3 +prog -acbfoo data1 data2 data3 +@c @end group +@end example + +Notice that when the argument is grouped with its option, the rest of +the command line argument is considered to be the option's argument. +In the above example, @samp{-acbfoo} indicates that all of the +@samp{-a}, @samp{-b}, and @samp{-c} options were supplied, +and that @samp{foo} is the argument to the @samp{-b} option. + +@code{getopt} provides four external variables that the programmer can use. + +@table @code +@item optind +The index in the argument value array (@code{argv}) where the first +non-option command line argument can be found. + +@item optarg +The string value of the argument to an option. + +@item opterr +Usually @code{getopt} prints an error message when it finds an invalid +option. Setting @code{opterr} to zero disables this feature. (An +application might wish to print its own error message.) + +@item optopt +The letter representing the command line option. +While not usually documented, most versions supply this variable. +@end table + +The following C fragment shows how @code{getopt} might process command line +arguments for @code{awk}. 
+ +@example +@group +int +main(int argc, char *argv[]) +@{ + @dots{} + /* print our own message */ + opterr = 0; +@end group +@group + while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) @{ + switch (c) @{ + case 'f': /* file */ + @dots{} + break; + case 'F': /* field separator */ + @dots{} + break; + case 'v': /* variable assignment */ + @dots{} + break; + case 'W': /* extension */ + @dots{} + break; + case '?': + default: + usage(); + break; + @} + @} + @dots{} +@} +@end group +@end example + +As a side point, @code{gawk} actually uses the GNU @code{getopt_long} +function to process both normal and GNU-style long options +(@pxref{Options, ,Command Line Options}). + +The abstraction provided by @code{getopt} is very useful, and would be quite +handy in @code{awk} programs as well. Here is an @code{awk} version of +@code{getopt}. This function highlights one of the greatest weaknesses in +@code{awk}, which is that it is very poor at manipulating single characters. +Repeated calls to @code{substr} are necessary for accessing individual +characters (@pxref{String Functions, ,Built-in Functions for String Manipulation}). + +The discussion walks through the code a bit at a time. + +@example +@c @group +@c file eg/lib/getopt.awk +# getopt --- do C library getopt(3) function in awk +# +# arnold@@gnu.ai.mit.edu +# Public domain +# +# Initial version: March, 1991 +# Revised: May, 1993 + +@group +# External variables: +# Optind -- index of ARGV for first non-option argument +# Optarg -- string value of argument to current option +# Opterr -- if non-zero, print our own diagnostic +# Optopt -- current option letter +@end group + +# Returns +# -1 at end of options +# ? 
for unrecognized option +# <c> a character representing the current option + +# Private Data +# _opti index in multi-flag option, e.g., -abc +@c endfile +@c @end group +@end example + +The function starts out with some documentation: who wrote the code, +and when it was revised, followed by a list of the global variables it uses, +what the return values are and what they mean, and any global variables that +are ``private'' to this library function. Such documentation is essential +for any program, and particularly for library functions. + +@findex getopt +@example +@c @group +@c file eg/lib/getopt.awk +function getopt(argc, argv, options, optl, thisopt, i) +@{ + optl = length(options) + if (optl == 0) # no options given + return -1 + + if (argv[Optind] == "--") @{ # all done + Optind++ + _opti = 0 + return -1 + @} else if (argv[Optind] !~ /^-[^: \t\n\f\r\v\b]/) @{ + _opti = 0 + return -1 + @} +@c endfile +@c @end group +@end example + +The function first checks that it was indeed called with a string of options +(the @code{options} parameter). If @code{options} has a zero length, +@code{getopt} immediately returns @minus{}1. + +The next thing to check for is the end of the options. A @samp{--} ends the +command line options, as does any command line argument that does not begin +with a @samp{-}. @code{Optind} is used to step through the array of command +line arguments; it retains its value across calls to @code{getopt}, since it +is a global variable. + +The regexp used, @code{@w{/^-[^: \t\n\f\r\v\b]/}}, is +perhaps a bit of overkill; it checks for a @samp{-} followed by anything +that is not whitespace and not a colon. +If the current command line argument does not match this pattern, +it is not an option, and it ends option processing. 
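To see how these two tests sort command line arguments, consider the following sketch. The @code{classify} function is a hypothetical helper written only for this illustration; it applies the same comparison and the same regexp, in the same order, to one argument at a time.

```awk
# classify --- hypothetical helper that applies getopt's two
# end-of-options tests to a single command line argument
function classify(arg)
{
    if (arg == "--")
        return "end of options"
    if (arg !~ /^-[^: \t\n\f\r\v\b]/)
        return "not an option"
    return "option"
}

BEGIN {
    n = split("-a,-cbARG,--,-,bax,-:", sample, ",")
    for (i = 1; i <= n; i++)
        printf "%-8s %s\n", sample[i], classify(sample[i])
}
```

Here @samp{-a} and @samp{-cbARG} are reported as options and @samp{--} as the end of the options, while a lone @samp{-}, @samp{bax}, and @samp{-:} are all reported as not being options.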
+ +@example +@group +@c file eg/lib/getopt.awk + if (_opti == 0) + _opti = 2 + thisopt = substr(argv[Optind], _opti, 1) + Optopt = thisopt + i = index(options, thisopt) + if (i == 0) @{ + if (Opterr) + printf("%c -- invalid option\n", + thisopt) > "/dev/stderr" + if (_opti >= length(argv[Optind])) @{ + Optind++ + _opti = 0 + @} else + _opti++ + return "?" + @} +@c endfile +@end group +@end example + +The @code{_opti} variable tracks the position in the current command line +argument (@code{argv[Optind]}). In the case that multiple options were +grouped together with one @samp{-} (e.g., @samp{-abx}), it is necessary +to return them to the user one at a time. + +If @code{_opti} is equal to zero, it is set to two, the index in the string +of the next character to look at (we skip the @samp{-}, which is at position +one). The variable @code{thisopt} holds the character, obtained with +@code{substr}. It is saved in @code{Optopt} for the main program to use. + +If @code{thisopt} is not in the @code{options} string, then it is an +invalid option. If @code{Opterr} is non-zero, @code{getopt} prints an error +message on the standard error that is similar to the message from the C +version of @code{getopt}. + +Since the option is invalid, it is necessary to skip it and move on to the +next option character. If @code{_opti} is greater than or equal to the +length of the current command line argument, then it is necessary to move on +to the next one, so @code{Optind} is incremented and @code{_opti} is reset +to zero. Otherwise, @code{Optind} is left alone and @code{_opti} is merely +incremented. + +In any case, since the option was invalid, @code{getopt} returns @samp{?}. +The main program can examine @code{Optopt} if it needs to know what the +invalid option letter actually was. 
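The way @code{_opti} steps through grouped options can be seen in isolation in the following sketch. The @code{optletters} function is hypothetical and much simpler than @code{getopt}: it ignores option arguments and validity checks, and merely collects the letters of one argument, starting at position two.

```awk
# optletters --- hypothetical helper: collect the option letters
# grouped in one argument such as "-abx"; returns how many there are
function optletters(arg, letters,    pos, n)
{
    n = 0
    # position one holds the "-", so start at position two
    for (pos = 2; pos <= length(arg); pos++)
        letters[++n] = substr(arg, pos, 1)
    return n
}

BEGIN {
    count = optletters("-abx", found)
    for (i = 1; i <= count; i++)
        printf "letter %d is %s\n", i, found[i]
}
```

Where this sketch gathers all the letters in one call, @code{getopt} hands back one letter per call and remembers its position in @code{_opti} between calls.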
+ +@example +@group +@c file eg/lib/getopt.awk + if (substr(options, i + 1, 1) == ":") @{ + # get option argument + if (length(substr(argv[Optind], _opti + 1)) > 0) + Optarg = substr(argv[Optind], _opti + 1) + else + Optarg = argv[++Optind] + _opti = 0 + @} else + Optarg = "" +@c endfile +@end group +@end example + +If the option requires an argument, the option letter is followed by a colon +in the @code{options} string. If there are remaining characters in the +current command line argument (@code{argv[Optind]}), then the rest of that +string is assigned to @code{Optarg}. Otherwise, the next command line +argument is used (@samp{-xFOO} vs. @samp{@w{-x FOO}}). In either case, +@code{_opti} is reset to zero, since there are no more characters left to +examine in the current command line argument. + +@example +@c @group +@c file eg/lib/getopt.awk + if (_opti == 0 || _opti >= length(argv[Optind])) @{ + Optind++ + _opti = 0 + @} else + _opti++ + return thisopt +@} +@c endfile +@c @end group +@end example + +Finally, if @code{_opti} is either zero or greater than or equal to the length of the +current command line argument, it means this element in @code{argv} is +through being processed, so @code{Optind} is incremented to point to the +next element in @code{argv}. If neither condition is true, then only +@code{_opti} is incremented, so that the next option letter can be processed +on the next call to @code{getopt}. + +@example +@c @group +@c file eg/lib/getopt.awk +BEGIN @{ + Opterr = 1 # default is to diagnose + Optind = 1 # skip ARGV[0] + + # test program + if (_getopt_test) @{ + while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1) + printf("c = <%c>, optarg = <%s>\n", + _go_c, Optarg) + printf("non-option arguments:\n") + for (; Optind < ARGC; Optind++) + printf("\tARGV[%d] = <%s>\n", + Optind, ARGV[Optind]) + @} +@} +@c endfile +@c @end group +@end example + +The @code{BEGIN} rule initializes both @code{Opterr} and @code{Optind} to one.
+@code{Opterr} is set to one, since the default behavior is for @code{getopt} +to print a diagnostic message upon seeing an invalid option. @code{Optind} +is set to one, since there's no reason to look at the program name, which is +in @code{ARGV[0]}. + +The rest of the @code{BEGIN} rule is a simple test program. Here is the +result of two sample runs of the test program. + +@example +@group +$ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x +@print{} c = <a>, optarg = <> +@print{} c = <c>, optarg = <> +@print{} c = <b>, optarg = <ARG> +@print{} non-option arguments: +@print{} ARGV[3] = <bax> +@print{} ARGV[4] = <-x> +@end group + +@group +$ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc +@print{} c = <a>, optarg = <> +@error{} x -- invalid option +@print{} c = <?>, optarg = <> +@print{} non-option arguments: +@print{} ARGV[4] = <xyz> +@print{} ARGV[5] = <abc> +@end group +@end example + +The first @samp{--} terminates the arguments to @code{awk}, so that it does +not try to interpret the @samp{-a} etc. as its own options. + +Several of the sample programs presented in +@ref{Sample Programs, ,Practical @code{awk} Programs}, +use @code{getopt} to process their arguments. + +@node Passwd Functions, Group Functions, Getopt Function, Library Functions +@section Reading the User Database + +@cindex @file{/dev/user} +The @file{/dev/user} special file +(@pxref{Special Files, ,Special File Names in @code{gawk}}) +provides access to the current user's real and effective user and group id +numbers, and if available, the user's supplementary group set. +However, since these are numbers, they do not provide very useful +information to the average user. There needs to be some way to find the +user information associated with the user and group numbers. This +section presents a suite of functions for retrieving information from the +user database. 
@xref{Group Functions, ,Reading the Group Database}, +for a similar suite that retrieves information from the group database. + +@cindex @code{getpwent}, C version +@cindex user information +@cindex login information +@cindex account information +@cindex password file +The POSIX standard does not define the file where user information is +kept. Instead, it provides the @code{<pwd.h>} header file +and several C language subroutines for obtaining user information. +The primary function is @code{getpwent}, for ``get password entry.'' +The ``password'' comes from the original user database file, +@file{/etc/passwd}, which kept user information, along with the +encrypted passwords (hence the name). + +While an @code{awk} program could simply read @file{/etc/passwd} directly +(the format is well known), because of the way password +files are handled on networked systems, +this file may not contain complete information about the system's set of users. + +@cindex @code{pwcat} program +To be sure of being +able to produce a readable, complete version of the user database, it is +necessary to write a small C program that calls @code{getpwent}. +@code{getpwent} is defined to return a pointer to a @code{struct passwd}. +Each time it is called, it returns the next entry in the database. +When there are no more entries, it returns @code{NULL}, the null pointer. +When this happens, the C program should call @code{endpwent} to close the +database. +Here is @code{pwcat}, a C program that ``cats'' the password database. 
+ +@findex pwcat.c +@example +@c @group +@c file eg/lib/pwcat.c +/* + * pwcat.c + * + * Generate a printable version of the password database + * + * Arnold Robbins + * arnold@@gnu.ai.mit.edu + * May 1993 + * Public Domain + */ + +#include <stdio.h> +#include <stdlib.h> +#include <pwd.h> + +int +main(argc, argv) +int argc; +char **argv; +@{ + struct passwd *p; + + while ((p = getpwent()) != NULL) + printf("%s:%s:%d:%d:%s:%s:%s\n", + p->pw_name, p->pw_passwd, p->pw_uid, + p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell); + + endpwent(); + exit(0); +@} +@c endfile +@c @end group +@end example + +If you don't understand C, don't worry about it. +The output from @code{pwcat} is the user database, in the traditional +@file{/etc/passwd} format of colon-separated fields. The fields are: + +@table @asis +@item Login name +The user's login name. + +@item Encrypted password +The user's encrypted password. This may not be available on some systems. + +@item User-ID +The user's numeric user-id number. + +@item Group-ID +The user's numeric group-id number. + +@item Full name +The user's full name, and perhaps other information associated with the +user. + +@item Home directory +The user's login, or ``home,'' directory (familiar to shell programmers as +@code{$HOME}). + +@item Login shell +The program that will be run when the user logs in. This is usually a +shell, such as Bash (the GNU Bourne-Again shell). +@end table + +Here are a few lines representative of @code{pwcat}'s output. + +@example +@c @group +$ pwcat +@print{} root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh +@print{} nobody:*:65534:65534::/: +@print{} daemon:*:1:1::/: +@print{} sys:*:2:2::/:/bin/csh +@print{} bin:*:3:3::/bin: +@print{} arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh +@print{} miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh +@print{} andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh +@dots{} +@c @end group +@end example + +With that introduction, here is a group of functions for getting user +information.
There are several functions here, corresponding to the C +functions of the same name. + +@findex _pw_init +@example +@c file eg/lib/passwdawk.in +@group +# passwd.awk --- access password file information +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May 1993 + +BEGIN @{ + # tailor this to suit your system + _pw_awklib = "/usr/local/libexec/awk/" +@} +@end group + +@group +function _pw_init( oldfs, oldrs, olddol0, pwcat) +@{ + if (_pw_inited) + return + oldfs = FS + oldrs = RS + olddol0 = $0 + FS = ":" + RS = "\n" + pwcat = _pw_awklib "pwcat" + while ((pwcat | getline) > 0) @{ + _pw_byname[$1] = $0 + _pw_byuid[$3] = $0 + _pw_bycount[++_pw_total] = $0 + @} + close(pwcat) + _pw_count = 0 + _pw_inited = 1 + FS = oldfs + RS = oldrs + $0 = olddol0 +@} +@c endfile +@end group +@end example + +The @code{BEGIN} rule sets a private variable to the directory where +@code{pwcat} is stored. Since it is used to help out an @code{awk} library +routine, we have chosen to put it in @file{/usr/local/libexec/awk}. +You might want it to be in a different directory on your system. + +The function @code{_pw_init} keeps three copies of the user information +in three associative arrays. The arrays are indexed by user name +(@code{_pw_byname}), by user-id number (@code{_pw_byuid}), and by order of +occurrence (@code{_pw_bycount}). + +The variable @code{_pw_inited} is used for efficiency; @code{_pw_init} only +needs to be called once. + +Since this function uses @code{getline} to read information from +@code{pwcat}, it first saves the values of @code{FS}, @code{RS}, and +@code{$0}. Doing so is necessary, since these functions could be called +from anywhere within a user's program, and the user may have his or her +own values for @code{FS} and @code{RS}. +@ignore +Problem, what if FIELDWIDTHS is in use? Sigh. +@end ignore + +The main part of the function uses a loop to read database lines, split +the line into fields, and then store the line into each array as necessary. 
+When the loop is done, @code{@w{_pw_init}} cleans up by closing the pipeline, +setting @code{@w{_pw_inited}} to one, and restoring @code{FS}, @code{RS}, and +@code{$0}. The use of @code{@w{_pw_count}} will be explained below. + +@findex getpwnam +@example +@group +@c file eg/lib/passwdawk.in +function getpwnam(name) +@{ + _pw_init() + if (name in _pw_byname) + return _pw_byname[name] + return "" +@} +@c endfile +@end group +@end example + +The @code{getpwnam} function takes a user name as a string argument. If that +user is in the database, it returns the appropriate line. Otherwise it +returns the null string. + +@findex getpwuid +@example +@group +@c file eg/lib/passwdawk.in +function getpwuid(uid) +@{ + _pw_init() + if (uid in _pw_byuid) + return _pw_byuid[uid] + return "" +@} +@c endfile +@end group +@end example + +Similarly, +the @code{getpwuid} function takes a user-id number argument. If that +user number is in the database, it returns the appropriate line. Otherwise it +returns the null string. + +@findex getpwent +@example +@c @group +@c file eg/lib/passwdawk.in +function getpwent() +@{ + _pw_init() + if (_pw_count < _pw_total) + return _pw_bycount[++_pw_count] + return "" +@} +@c endfile +@c @end group +@end example + +The @code{getpwent} function simply steps through the database, one entry at +a time. It uses @code{_pw_count} to track its current position in the +@code{_pw_bycount} array. + +@findex endpwent +@example +@c @group +@c file eg/lib/passwdawk.in +function endpwent() +@{ + _pw_count = 0 +@} +@c endfile +@c @end group +@end example + +The @code{@w{endpwent}} function resets @code{@w{_pw_count}} to zero, so that +subsequent calls to @code{getpwent} will start over again. + +A conscious design decision in this suite is that each subroutine calls +@code{@w{_pw_init}} to initialize the database arrays. 
The overhead of running +a separate process to generate the user database, and the I/O to scan it, +will only be incurred if the user's main program actually calls one of these +functions. If this library file is loaded along with a user's program, but +none of the routines are ever called, then there is no extra run-time overhead. +(The alternative would be to move the body of @code{@w{_pw_init}} into a +@code{BEGIN} rule, which would always run @code{pwcat}. This simplifies the +code but runs an extra process that may never be needed.) + +In turn, calling @code{_pw_init} is not too expensive, since the +@code{_pw_inited} variable keeps the program from reading the data more than +once. If you are worried about squeezing every last cycle out of your +@code{awk} program, the check of @code{_pw_inited} could be moved out of +@code{_pw_init} and duplicated in all the other functions. In practice, +this is not necessary, since most @code{awk} programs are I/O bound, and it +would clutter up the code. + +The @code{id} program in @ref{Id Program, ,Printing Out User Information}, +uses these functions. + +@node Group Functions, Library Names, Passwd Functions, Library Functions +@section Reading the Group Database + +@cindex @code{getgrent}, C version +@cindex group information +@cindex account information +@cindex group file +Much of the discussion presented in +@ref{Passwd Functions, ,Reading the User Database}, +applies to the group database as well. Although there has traditionally +been a well known file, @file{/etc/group}, in a well known format, the POSIX +standard only provides a set of C library routines +(@code{<grp.h>} and @code{getgrent}) +for accessing the information. +Even though this file may exist, it likely does not have +complete information. Therefore, as with the user database, it is necessary +to have a small C program that generates the group database as its output. 
+ +@cindex @code{grcat} program +Here is @code{grcat}, a C program that ``cats'' the group database. + +@findex grcat.c +@example +@c @group +@c file eg/lib/grcat.c +/* + * grcat.c + * + * Generate a printable version of the group database + * + * Arnold Robbins, arnold@@gnu.ai.mit.edu + * May 1993 + * Public Domain + */ + +#include <stdio.h> +#include <stdlib.h> /* for exit() */ +#include <grp.h> + +@group +int +main(argc, argv) +int argc; +char **argv; +@{ + struct group *g; + int i; +@end group + + while ((g = getgrent()) != NULL) @{ + printf("%s:%s:%d:", g->gr_name, g->gr_passwd, + g->gr_gid); + for (i = 0; g->gr_mem[i] != NULL; i++) @{ + printf("%s", g->gr_mem[i]); + if (g->gr_mem[i+1] != NULL) + putchar(','); + @} + putchar('\n'); + @} + endgrent(); + exit(0); +@} +@c endfile +@c @end group +@end example + +Each line in the group database represents one group. The fields are +separated with colons, and represent the following information. + +@table @asis +@item Group Name +The name of the group. + +@item Group Password +The encrypted group password. In practice, this field is never used. It is +usually empty, or set to @samp{*}. + +@item Group ID Number +The numeric group-id number. This number should be unique within the file. + +@item Group Member List +A comma-separated list of user names. These users are members of the group. +Most Unix systems allow users to be members of several groups +simultaneously. If your system does, then reading @file{/dev/user} will +return those group-id numbers in @code{$5} through @code{$NF}. +(Note that @file{/dev/user} is a @code{gawk} extension; +@pxref{Special Files, ,Special File Names in @code{gawk}}.) +@end table + +Here is what running @code{grcat} might produce: + +@example +@group +$ grcat +@print{} wheel:*:0:arnold +@print{} nogroup:*:65534: +@print{} daemon:*:1: +@print{} kmem:*:2: +@print{} staff:*:10:arnold,miriam,andy +@print{} other:*:20: +@dots{} +@end group +@end example + +Here are the functions for obtaining information from the group database.
+There are several, modeled after the C library functions of the same names. + +@findex _gr_init +@example +@group +@c file eg/lib/groupawk.in +# group.awk --- functions for dealing with the group file +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May 1993 + +BEGIN \ +@{ + # Change to suit your system + _gr_awklib = "/usr/local/libexec/awk/" +@} +@c endfile +@end group + +@group +@c file eg/lib/groupawk.in +function _gr_init( oldfs, oldrs, olddol0, grcat, n, a, i) +@{ + if (_gr_inited) + return +@end group + +@group + oldfs = FS + oldrs = RS + olddol0 = $0 + FS = ":" + RS = "\n" +@end group + +@group + grcat = _gr_awklib "grcat" + while ((grcat | getline) > 0) @{ + if ($1 in _gr_byname) + _gr_byname[$1] = _gr_byname[$1] "," $4 + else + _gr_byname[$1] = $0 + if ($3 in _gr_bygid) + _gr_bygid[$3] = _gr_bygid[$3] "," $4 + else + _gr_bygid[$3] = $0 + + n = split($4, a, "[ \t]*,[ \t]*") +@end group +@group + for (i = 1; i <= n; i++) + if (a[i] in _gr_groupsbyuser) + _gr_groupsbyuser[a[i]] = \ + _gr_groupsbyuser[a[i]] " " $1 + else + _gr_groupsbyuser[a[i]] = $1 +@end group + +@group + _gr_bycount[++_gr_count] = $0 + @} +@end group +@group + close(grcat) + _gr_count = 0 + _gr_inited++ + FS = oldfs + RS = oldrs + $0 = olddol0 +@} +@c endfile +@end group +@end example + +The @code{BEGIN} rule sets a private variable to the directory where +@code{grcat} is stored. Since it is used to help out an @code{awk} library +routine, we have chosen to put it in @file{/usr/local/libexec/awk}. You might +want it to be in a different directory on your system. + +These routines follow the same general outline as the user database routines +(@pxref{Passwd Functions, ,Reading the User Database}). +The @code{@w{_gr_inited}} variable is used to +ensure that the database is scanned no more than once. +The @code{@w{_gr_init}} function first saves @code{FS}, @code{RS}, and +@code{$0}, and then sets @code{FS} and @code{RS} to the correct values for +scanning the group information. 
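The save-and-restore step described above can be tried on its own. What follows is a minimal sketch rather than part of the library: the function name @code{first_fields}, the array @code{names}, and the use of the shell's @code{printf} as a stand-in for @code{grcat} are all assumptions made so that the example is self-contained.

```awk
# Sketch only: first_fields() is a hypothetical helper, not part of
# the library.  It reads colon-separated lines from a command and
# leaves the caller's FS, RS and $0 as it found them.
function first_fields(cmd, arr,    oldfs, oldrs, olddol0, n)
{
    oldfs = FS              # save the caller's settings
    oldrs = RS
    olddol0 = $0
    FS = ":"
    RS = "\n"
    n = 0
    while ((cmd | getline) > 0)
        arr[++n] = $1       # each getline replaces $0 and the fields
    close(cmd)
    FS = oldfs              # restore the separators first ...
    RS = oldrs
    $0 = olddol0            # ... so that $0 is re-split with them
    return n
}

BEGIN {
    FS = ","
    $0 = "x,y,z"
    # printf stands in for grcat here (an assumption for the demo)
    n = first_fields("printf 'root:0\\ndaemon:1\\n'", names)
    print n, names[1], names[2]     # what was read from the command
    print FS, $2                    # the caller's state after the call
}
```

Because @code{FS} is restored before @code{$0} is reassigned, the caller's record is split again with the caller's own separator, which is why the order of the last three assignments matters.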
+ +The group information is stored in several associative arrays. +The arrays are indexed by group name (@code{@w{_gr_byname}}), by group-id number +(@code{@w{_gr_bygid}}), and by position in the database (@code{@w{_gr_bycount}}). +There is an additional array indexed by user name (@code{@w{_gr_groupsbyuser}}), +whose values are space-separated lists of the groups that each user belongs to. + +Unlike the user database, it is possible to have multiple records in the +database for the same group. This is common when a group has a large number +of members. Such a pair of entries might look like: + +@example +tvpeople:*:101:johny,jay,arsenio +tvpeople:*:101:david,conan,tom,joan +@end example + +For this reason, @code{_gr_init} looks to see if a group name or +group-id number has already been seen. If it has, then the user names are +simply concatenated onto the previous list of users. (There is actually a +subtle problem with the code presented above. Suppose that +the first record for a group lists no names. The code then adds the later +names with a leading comma. It also doesn't check that there is a @code{$4}.) + +Finally, @code{_gr_init} closes the pipeline to @code{grcat}, restores +@code{FS}, @code{RS}, and @code{$0}, initializes @code{_gr_count} to zero +(it is used later), and makes @code{_gr_inited} non-zero. + +@findex getgrnam +@example +@c @group +@c file eg/lib/groupawk.in +function getgrnam(group) +@{ + _gr_init() + if (group in _gr_byname) + return _gr_byname[group] + return "" +@} +@c endfile +@c @end group +@end example + +The @code{getgrnam} function takes a group name as its argument, and if that +group exists, its record is returned. Otherwise, @code{getgrnam} returns the null +string.
+ +@findex getgrgid +@example +@c @group +@c file eg/lib/groupawk.in +function getgrgid(gid) +@{ + _gr_init() + if (gid in _gr_bygid) + return _gr_bygid[gid] + return "" +@} +@c endfile +@c @end group +@end example + +The @code{getgrgid} function is similar; it takes a numeric group-id, and +looks up the information associated with that group-id. + +@findex getgruser +@example +@group +@c file eg/lib/groupawk.in +function getgruser(user) +@{ + _gr_init() + if (user in _gr_groupsbyuser) + return _gr_groupsbyuser[user] + return "" +@} +@c endfile +@end group +@end example + +The @code{getgruser} function does not have a C counterpart. It takes a +user name, and returns the list of groups that have the user as a member. + +@findex getgrent +@example +@c @group +@c file eg/lib/groupawk.in +function getgrent() +@{ + _gr_init() + if (++_gr_count in _gr_bycount) + return _gr_bycount[_gr_count] + return "" +@} +@c endfile +@c @end group +@end example + +The @code{getgrent} function steps through the database one entry at a time. +It uses @code{_gr_count} to track its position in the list. + +@findex endgrent +@example +@group +@c file eg/lib/groupawk.in +function endgrent() +@{ + _gr_count = 0 +@} +@c endfile +@end group +@end example + +@code{endgrent} resets @code{_gr_count} to zero so that @code{getgrent} can +start over again. + +As with the user database routines, each function calls @code{_gr_init} to +initialize the arrays. Doing so only incurs the extra overhead of running +@code{grcat} if these functions are used (as opposed to moving the body of +@code{_gr_init} into a @code{BEGIN} rule). + +Most of the work is in scanning the database and building the various +associative arrays. The functions that the user calls are themselves very +simple, relying on @code{awk}'s associative arrays to do the work. + +The @code{id} program in @ref{Id Program, ,Printing Out User Information}, +uses these functions.
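The leading-comma problem noted in the discussion of @code{_gr_init} could be avoided along the following lines. This is only a sketch under stated assumptions, not the library's code: the helper name @code{add_members}, the array @code{byname}, and the sample records in the @code{BEGIN} rule are hypothetical.

```awk
# Sketch only: add_members() is a hypothetical helper.
# Append the member list `members' to the stored record `line',
# inserting a comma only when both sides are non-empty.
function add_members(line, members)
{
    if (members == "")          # nothing to add (also covers a missing $4)
        return line
    if (line ~ /:$/)            # stored record has no members yet
        return line members
    return line "," members
}

BEGIN {
    FS = ":"
    # two records for the same group; the first one lists no members
    n = split("tvpeople:*:101:;tvpeople:*:101:david,conan", recs, ";")
    for (i = 1; i <= n; i++) {
        $0 = recs[i]
        if ($1 in byname)
            byname[$1] = add_members(byname[$1], $4)
        else
            byname[$1] = $0
    }
    print byname["tvpeople"]    # prints tvpeople:*:101:david,conan
}
```

The same helper could be applied to the array indexed by group-id number; whether that is worth the extra function call is a design choice left to the reader.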
+ +@node Library Names, , Group Functions, Library Functions +@section Naming Library Function Global Variables + +@cindex namespace issues in @code{awk} +@cindex documenting @code{awk} programs +@cindex programs, documenting +Due to the way the @code{awk} language evolved, variables are either +@dfn{global} (usable by the entire program), or @dfn{local} (usable just by +a specific function). There is no intermediate state analogous to +@code{static} variables in C. + +Library functions often need to have global variables that they can use to +preserve state information between calls to the function. For example, +@code{getopt}'s variable @code{_opti} +(@pxref{Getopt Function, ,Processing Command Line Options}), +and the @code{_tm_months} array used by @code{mktime} +(@pxref{Mktime Function, ,Turning Dates Into Timestamps}). +Such variables are called @dfn{private}, since the only functions that need to +use them are the ones in the library. + +When writing a library function, you should try to choose names for your +private variables so that they will not conflict with any variables used by +either another library function or a user's main program. For example, a +name like @samp{i} or @samp{j} is not a good choice, since user programs +often use variable names like these for their own purposes. + +The example programs shown in this chapter all start the names of their +private variables with an underscore (@samp{_}). Users generally don't use +leading underscores in their variable names, so this convention immediately +decreases the chances that the variable name will be accidentally shared +with the user's program. + +In addition, several of the library functions use a prefix that helps +indicate what function or set of functions uses the variables. 
Examples include +@code{_tm_months} in @code{mktime} +(@pxref{Mktime Function, ,Turning Dates Into Timestamps}), and +@code{_pw_byname} in the user database routines +(@pxref{Passwd Functions, ,Reading the User Database}). +This convention is recommended, since it even further decreases the chance +of inadvertent conflict among variable names. +Note that this convention can be used equally well for variable names +and for private function names. + +While I could have re-written all the library routines to use this +convention, I did not do so, in order to show how my own @code{awk} +programming style has evolved, and to provide some basis for this +discussion. + +As a final note on variable naming, if a function makes global variables +available for use by a main program, it is a good convention to start those +variables' names with a capital letter. +Examples are @code{getopt}'s @code{Opterr} and @code{Optind} variables +(@pxref{Getopt Function, ,Processing Command Line Options}). +The leading capital letter indicates that the variable is global, while the fact that +the variable name is not all capital letters indicates that the variable is +not one of @code{awk}'s built-in variables, like @code{FS}. + +It is also important that @emph{all} variables in library functions +that do not need to save state are in fact declared local. If this is +not done, the variable could accidentally be used in the user's program, +leading to bugs that are very difficult to track down. + +@example +function lib_func(x, y, l1, l2) +@{ + @dots{} + @var{use variable} some_var # some_var could be local + @dots{} # but is not by oversight +@} +@end example + +@cindex Tcl +A different convention, common in the Tcl community, is to use a single +associative array to hold the values needed by the library function(s), or +``package.'' This significantly decreases the number of actual global names +in use.
For example, the functions described in +@ref{Passwd Functions, , Reading the User Database}, +might have used @code{@w{PW_data["inited"]}}, @code{@w{PW_data["total"]}}, +@code{@w{PW_data["count"]}} and @code{@w{PW_data["awklib"]}}, instead of +@code{@w{_pw_inited}}, @code{@w{_pw_total}}, @code{@w{_pw_count}}, +and @code{@w{_pw_awklib}}. + +The conventions presented in this section are exactly that: conventions. You +are not required to write your programs this way; we merely recommend that +you do so. + +@node Sample Programs, Language History, Library Functions, Top +@chapter Practical @code{awk} Programs + +This chapter presents a potpourri of @code{awk} programs for your reading +enjoyment. +@iftex +There are two sections. The first presents @code{awk} +versions of several common POSIX utilities. +The second is a grab-bag of interesting programs. +@end iftex + +Many of these programs use the library functions presented in +@ref{Library Functions, ,A Library of @code{awk} Functions}. + +@menu +* Clones:: Clones of common utilities. +* Miscellaneous Programs:: Some interesting @code{awk} programs. +@end menu + +@node Clones, Miscellaneous Programs, Sample Programs, Sample Programs +@section Re-inventing Wheels for Fun and Profit + +This section presents a number of POSIX utilities that are implemented in +@code{awk}. Re-inventing these programs in @code{awk} is often enjoyable, +since the algorithms can be very clearly expressed, and usually the code is +very concise and simple. This is true because @code{awk} does so much for you. + +These programs are not necessarily intended to +replace the installed versions on your system. Instead, their +purpose is to illustrate @code{awk} language programming for ``real world'' +tasks. + +The programs are presented in alphabetical order. + +@menu +* Cut Program:: The @code{cut} utility. +* Egrep Program:: The @code{egrep} utility. +* Id Program:: The @code{id} utility.
+* Split Program:: The @code{split} utility. +* Tee Program:: The @code{tee} utility. +* Uniq Program:: The @code{uniq} utility. +* Wc Program:: The @code{wc} utility. +@end menu + +@node Cut Program, Egrep Program, Clones, Clones +@subsection Cutting Out Fields and Columns + +@cindex @code{cut} utility +The @code{cut} utility selects, or ``cuts,'' either characters or fields +from its standard +input and sends them to its standard output. @code{cut} can cut out either +a list of characters, or a list of fields. By default, fields are separated +by tabs, but you may supply a command line option to change the field +@dfn{delimiter}, i.e.@: the field separator character. @code{cut}'s definition +of fields is less general than @code{awk}'s. + +A common use of @code{cut} might be to pull out just the login name of +logged-on users from the output of @code{who}. For example, the following +pipeline generates a sorted, unique list of the logged on users: + +@example +who | cut -c1-8 | sort | uniq +@end example + +The options for @code{cut} are: + +@table @code +@item -c @var{list} +Use @var{list} as the list of characters to cut out. Items within the list +may be separated by commas, and ranges of characters can be separated with +dashes. The list @samp{1-8,15,22-35} specifies characters one through +eight, 15, and 22 through 35. + +@item -f @var{list} +Use @var{list} as the list of fields to cut out. + +@item -d @var{delim} +Use @var{delim} as the field separator character instead of the tab +character. + +@item -s +Suppress printing of lines that do not contain the field delimiter. +@end table + +The @code{awk} implementation of @code{cut} uses the @code{getopt} library +function (@pxref{Getopt Function, ,Processing Command Line Options}), +and the @code{join} library function +(@pxref{Join Function, ,Merging an Array Into a String}). + +The program begins with a comment describing the options and a @code{usage} +function which prints out a usage message and exits. 
@code{usage} is called +if invalid arguments are supplied. + +@findex cut.awk +@example +@c @group +@c file eg/prog/cut.awk +# cut.awk --- implement cut in awk +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May 1993 + +# Options: +# -f list Cut fields +# -d c Field delimiter character +# -c list Cut characters +# +# -s Suppress lines without the delimiter character + +function usage( e1, e2) +@{ + e1 = "usage: cut [-f list] [-d c] [-s] [files...]" + e2 = "usage: cut [-c list] [files...]" + print e1 > "/dev/stderr" + print e2 > "/dev/stderr" + exit 1 +@} +@c endfile +@c @end group +@end example + +@noindent +The variables @code{e1} and @code{e2} are used so that the function +fits nicely on the +@iftex +page. +@end iftex +@ifinfo +screen. +@end ifinfo + +Next comes a @code{BEGIN} rule that parses the command line options. +It sets @code{FS} to a single tab character, since that is @code{cut}'s +default field separator. The output field separator is also set to be the +same as the input field separator. Then @code{getopt} is used to step +through the command line options. One or the other of the variables +@code{by_fields} or @code{by_chars} is set to true, to indicate that +processing should be done by fields or by characters respectively. +When cutting by characters, the output field separator is set to the null +string. 
+ +@example +@c @group +@c file eg/prog/cut.awk +BEGIN \ +@{ + FS = "\t" # default + OFS = FS + while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) @{ + if (c == "f") @{ + by_fields = 1 + fieldlist = Optarg +@group + @} else if (c == "c") @{ + by_chars = 1 + fieldlist = Optarg + OFS = "" + @} else if (c == "d") @{ + if (length(Optarg) > 1) @{ + printf("Using first character of %s" \ + " for delimiter\n", Optarg) > "/dev/stderr" + Optarg = substr(Optarg, 1, 1) + @} + FS = Optarg + OFS = FS + if (FS == " ") # defeat awk semantics + FS = "[ ]" + @} else if (c == "s") + suppress++ + else + usage() + @} +@end group + + for (i = 1; i < Optind; i++) + ARGV[i] = "" +@c endfile +@c @end group +@end example + +Special care is taken when the field delimiter is a space. Using +@code{@w{" "}} (a single space) for the value of @code{FS} is +incorrect---@code{awk} would +separate fields with runs of spaces, tabs and/or newlines, and we want them to be +separated with individual spaces. Also, note that after @code{getopt} is +through, we have to clear out all the elements of @code{ARGV} from one to +@code{Optind}, so that @code{awk} will not try to process the command line +options as file names. + +After dealing with the command line options, the program verifies that the +options make sense. Only one or the other of @samp{-c} and @samp{-f} should +be used, and both require a field list. Then either @code{set_fieldlist} or +@code{set_charlist} is called to pull apart the list of fields or +characters. + +@example +@c @group +@c file eg/prog/cut.awk + if (by_fields && by_chars) + usage() + + if (by_fields == 0 && by_chars == 0) + by_fields = 1 # default + + if (fieldlist == "") @{ + print "cut: needs list for -c or -f" > "/dev/stderr" + exit 1 + @} + +@group + if (by_fields) + set_fieldlist() + else + set_charlist() +@} +@c endfile +@end group +@end example + +Here is @code{set_fieldlist}. It first splits the field list apart +at the commas, into an array. 
Then, for each element of the array, it +looks to see if it is actually a range, and if so splits it apart. The range +is verified to make sure the first number is smaller than the second. +Each number in the list is added to the @code{flist} array, which simply +lists the fields that will be printed. +Normal field splitting is used. +The program lets @code{awk} +handle the job of doing the field splitting. + +@example +@c @group +@c file eg/prog/cut.awk +function set_fieldlist( n, m, i, j, k, f, g) +@{ + n = split(fieldlist, f, ",") + j = 1 # index in flist + for (i = 1; i <= n; i++) @{ + if (index(f[i], "-") != 0) @{ # a range + m = split(f[i], g, "-") + if (m != 2 || g[1] >= g[2]) @{ + printf("bad field list: %s\n", + f[i]) > "/dev/stderr" + exit 1 + @} + for (k = g[1]; k <= g[2]; k++) + flist[j++] = k + @} else + flist[j++] = f[i] + @} + nfields = j - 1 +@} +@c endfile +@c @end group +@end example + +The @code{set_charlist} function is more complicated than @code{set_fieldlist}. +The idea here is to use @code{gawk}'s @code{FIELDWIDTHS} variable +(@pxref{Constant Size, ,Reading Fixed-width Data}), +which describes constant width input. When using a character list, that is +exactly what we have. + +Setting up @code{FIELDWIDTHS} is more complicated than simply listing the +fields that need to be printed. We have to keep track of the fields to be +printed, and also the intervening characters that have to be skipped. +For example, suppose you wanted characters one through eight, 15, and +22 through 35. You would use @samp{-c 1-8,15,22-35}. The necessary value +for @code{FIELDWIDTHS} would be @code{@w{"8 6 1 6 14"}}. This gives us five +fields, and what should be printed are @code{$1}, @code{$3}, and @code{$5}. +The intermediate fields are ``filler,'' stuff in between the desired data. + +@code{flist} lists the fields to be printed, and @code{t} tracks the +complete field list, including filler fields. 
+ +@example +@c @group +@c file eg/prog/cut.awk +function set_charlist( field, i, j, f, g, n, m, t, + filler, last, len) +@{ + field = 1 # count total fields + n = split(fieldlist, f, ",") + j = 1 # index in flist + for (i = 1; i <= n; i++) @{ + if (index(f[i], "-") != 0) @{ # range + m = split(f[i], g, "-") + if (m != 2 || g[1] >= g[2]) @{ + printf("bad character list: %s\n", + f[i]) > "/dev/stderr" + exit 1 + @} + len = g[2] - g[1] + 1 + if (g[1] > 1) # compute length of filler + filler = g[1] - last - 1 + else + filler = 0 + if (filler) + t[field++] = filler + t[field++] = len # length of field + last = g[2] + flist[j++] = field - 1 + @} else @{ + if (f[i] > 1) + filler = f[i] - last - 1 + else + filler = 0 + if (filler) + t[field++] = filler + t[field++] = 1 + last = f[i] + flist[j++] = field - 1 + @} + @} +@group + FIELDWIDTHS = join(t, 1, field - 1) + nfields = j - 1 +@} +@end group +@c endfile +@end example + +Here is the rule that actually processes the data. If the @samp{-s} option +was given, then @code{suppress} will be true. The first @code{if} statement +makes sure that the input record does have the field separator. If +@code{cut} is processing fields, @code{suppress} is true, and the field +separator character is not in the record, then the record is skipped. + +If the record is valid, then at this point, @code{gawk} has split the data +into fields, either using the character in @code{FS} or using fixed-length +fields and @code{FIELDWIDTHS}. The loop goes through the list of fields +that should be printed. If the corresponding field has data in it, it is +printed. If the next field also has data, then the separator character is +written out in between the fields.
+ +@c 2e: Could use `index($0, FS) != 0' instead of `$0 !~ FS', below + +@example +@c @group +@c file eg/prog/cut.awk +@{ + if (by_fields && suppress && $0 !~ FS) + next + + for (i = 1; i <= nfields; i++) @{ + if ($flist[i] != "") @{ + printf "%s", $flist[i] + if (i < nfields && $flist[i+1] != "") + printf "%s", OFS + @} + @} + print "" +@} +@c endfile +@c @end group +@end example + +This version of @code{cut} relies on @code{gawk}'s @code{FIELDWIDTHS} +variable to do the character-based cutting. While it would be possible in +other @code{awk} implementations to use @code{substr} +(@pxref{String Functions, ,Built-in Functions for String Manipulation}), +it would also be extremely painful to do so. +The @code{FIELDWIDTHS} variable supplies an elegant solution to the problem +of picking the input line apart by characters. + +@node Egrep Program, Id Program, Cut Program, Clones +@subsection Searching for Regular Expressions in Files + +@cindex @code{egrep} utility +The @code{egrep} utility searches files for patterns. It uses regular +expressions that are almost identical to those available in @code{awk} +(@pxref{Regexp Constants, ,Regular Expression Constants}). It is used this way: + +@example +egrep @r{[} @var{options} @r{]} '@var{pattern}' @var{files} @dots{} +@end example + +The @var{pattern} is a regexp. +In typical usage, the regexp is quoted to prevent the shell from expanding +any of the special characters as file name wildcards. +Normally, @code{egrep} prints the +lines that matched. If multiple file names are provided on the command +line, each output line is preceded by the name of the file and a colon. + +@c NEEDED +@page +The options are: + +@table @code +@item -c +Print out a count of the lines that matched the pattern, instead of the +lines themselves. + +@item -s +Be silent. No output is produced, and the exit value indicates whether +or not the pattern was matched. + +@item -v +Invert the sense of the test. 
@code{egrep} prints the lines that do +@emph{not} match the pattern, and exits successfully if the pattern was not +matched. + +@item -i +Ignore case distinctions in both the pattern and the input data. + +@item -l +Only print the names of the files that matched, not the lines that matched. + +@item -e @var{pattern} +Use @var{pattern} as the regexp to match. The purpose of the @samp{-e} +option is to allow patterns that start with a @samp{-}. +@end table + +This version uses the @code{getopt} library function +(@pxref{Getopt Function, ,Processing Command Line Options}), +and the file transition library program +(@pxref{Filetrans Function, ,Noting Data File Boundaries}). + +The program begins with a descriptive comment, and then a @code{BEGIN} rule +that processes the command line arguments with @code{getopt}. The @samp{-i} +(ignore case) option is particularly easy with @code{gawk}; we just use the +@code{IGNORECASE} built in variable +(@pxref{Built-in Variables}). + +@findex egrep.awk +@example +@c @group +@c file eg/prog/egrep.awk +# egrep.awk --- simulate egrep in awk +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May 1993 + +# Options: +# -c count of lines +# -s silent - use exit value +# -v invert test, success if no match +# -i ignore case +# -l print filenames only +# -e argument is pattern + +BEGIN @{ + while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) @{ + if (c == "c") + count_only++ + else if (c == "s") + no_print++ + else if (c == "v") + invert++ + else if (c == "i") + IGNORECASE = 1 + else if (c == "l") + filenames_only++ + else if (c == "e") + pattern = Optarg + else + usage() + @} +@c endfile +@c @end group +@end example + +Next comes the code that handles the @code{egrep} specific behavior. If no +pattern was supplied with @samp{-e}, the first non-option on the command +line is used. The @code{awk} command line arguments up to @code{ARGV[Optind]} +are cleared, so that @code{awk} won't try to process them as files. 
If no +files were specified, the standard input is used, and if multiple files were +specified, we make sure to note this so that the file names can precede the +matched lines in the output. + +The last two lines are commented out, since they are not needed in +@code{gawk}. They should be uncommented if you have to use another version +of @code{awk}. + +@example +@c @group +@c file eg/prog/egrep.awk + if (pattern == "") + pattern = ARGV[Optind++] + + for (i = 1; i < Optind; i++) + ARGV[i] = "" + if (Optind >= ARGC) @{ + ARGV[1] = "-" + ARGC = 2 + @} else if (ARGC - Optind > 1) + do_filenames++ + +# if (IGNORECASE) +# pattern = tolower(pattern) +@} +@c endfile +@c @end group +@end example + +The next set of lines should be uncommented if you are not using +@code{gawk}. This rule translates all the characters in the input line +into lower-case if the @samp{-i} option was specified. The rule is +commented out since it is not necessary with @code{gawk}. +@c bug: if a match happens, we output the translated line, not the original + +@example +@c @group +@c file eg/prog/egrep.awk +#@{ +# if (IGNORECASE) +# $0 = tolower($0) +#@} +@c endfile +@c @end group +@end example + +The @code{beginfile} function is called by the rule in @file{ftrans.awk} +when each new file is processed. In this case, it is very simple; all it +does is initialize a variable @code{fcount} to zero. @code{fcount} tracks +how many lines in the current file matched the pattern. + +@example +@group +@c file eg/prog/egrep.awk +function beginfile(junk) +@{ + fcount = 0 +@} +@c endfile +@end group +@end example + +The @code{endfile} function is called after each file has been processed. +It is used only when the user wants a count of the number of lines that +matched. @code{no_print} will be true only if the exit status is desired. +@code{count_only} will be true if line counts are desired. @code{egrep} +will therefore only print line counts if printing and counting are enabled. 
+The output format must be adjusted depending upon the number of files to be +processed. Finally, @code{fcount} is added to @code{total}, so that we +know how many lines altogether matched the pattern. + +@example +@c @group +@c file eg/prog/egrep.awk +function endfile(file) +@{ + if (! no_print && count_only) + if (do_filenames) + print file ":" fcount + else + print fcount + + total += fcount +@} +@c endfile +@c @end group +@end example + +This rule does most of the work of matching lines. The variable +@code{matches} will be true if the line matched the pattern. If the user +wants lines that did not match, the sense of the @code{matches} is inverted +using the @samp{!} operator. @code{fcount} is incremented with the value of +@code{matches}, which will be either one or zero, depending upon a +successful or unsuccessful match. If the line did not match, the +@code{next} statement just moves on to the next record. + +There are several optimizations for performance in the following few lines +of code. If the user only wants exit status (@code{no_print} is true), and +we don't have to count lines, then it is enough to know that one line in +this file matched, and we can skip on to the next file with @code{nextfile}. +Along similar lines, if we are only printing file names, and we +don't need to count lines, we can print the file name, and then skip to the +next file with @code{nextfile}. + +Finally, each line is printed, with a leading filename and colon if +necessary. + +@ignore +2e: note, probably better to recode the last few lines as + if (! count_only) @{ + if (no_print) + nextfile + + if (filenames_only) @{ + print FILENAME + nextfile + @} + + if (do_filenames) + print FILENAME ":" $0 + else + print + @} +@end ignore + +@example +@c @group +@c file eg/prog/egrep.awk +@{ + matches = ($0 ~ pattern) + if (invert) + matches = ! matches + + fcount += matches # 1 or 0 + +@group + if (! matches) + next +@end group + + if (no_print && ! 
count_only) + nextfile + + if (filenames_only && ! count_only) @{ + print FILENAME + nextfile + @} + + if (do_filenames && ! count_only) + print FILENAME ":" $0 + else if (! count_only) + print +@} +@c endfile +@c @end group +@end example + +@c @strong{Exercise}: rearrange the code inside @samp{if (! count_only)}. + +The @code{END} rule takes care of producing the correct exit status. If +there were no matches, the exit status is one, otherwise it is zero. + +@example +@c @group +@c file eg/prog/egrep.awk +END \ +@{ + if (total == 0) + exit 1 + exit 0 +@} +@c endfile +@c @end group +@end example + +The @code{usage} function prints a usage message in case of invalid options +and then exits. + +@example +@c @group +@c file eg/prog/egrep.awk +function usage( e) +@{ + e = "Usage: egrep [-csvil] [-e pat] [files ...]" + print e > "/dev/stderr" + exit 1 +@} +@c endfile +@c @end group +@end example + +The variable @code{e} is used so that the function fits nicely +on the printed page. + +@cindex backslash continuation +Just a note on programming style. You may have noticed that the @code{END} +rule uses backslash continuation, with the open brace on a line by +itself. This is so that it more closely resembles the way functions +are written. Many of the examples +@iftex +in this chapter +@end iftex +use this style. You can decide for yourself if you like writing +your @code{BEGIN} and @code{END} rules this way, +or not. + +@node Id Program, Split Program, Egrep Program, Clones +@subsection Printing Out User Information + +@cindex @code{id} utility +The @code{id} utility lists a user's real and effective user-id numbers, +real and effective group-id numbers, and the user's group set, if any. +@code{id} will only print the effective user-id and group-id if they are +different from the real ones. If possible, @code{id} will also supply the +corresponding user and group names. 
The output might look like this: + +@example +$ id +@print{} uid=2076(arnold) gid=10(staff) groups=10(staff),4(tty) +@end example + +This information is exactly what is provided by @code{gawk}'s +@file{/dev/user} special file (@pxref{Special Files, ,Special File Names in @code{gawk}}). +However, the @code{id} utility provides a more palatable output than just a +string of numbers. + +Here is a simple version of @code{id} written in @code{awk}. +It uses the user database library functions +(@pxref{Passwd Functions, ,Reading the User Database}), +and the group database library functions +(@pxref{Group Functions, ,Reading the Group Database}). + +The program is fairly straightforward. All the work is done in the +@code{BEGIN} rule. The user and group id numbers are obtained from +@file{/dev/user}. If there is no support for @file{/dev/user}, the program +gives up. + +The code is repetitive. The entry in the user database for the real user-id +number is split into parts at the @samp{:}. The name is the first field. +Similar code is used for the effective user-id number, and the group +numbers. 
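The splitting step described above can be tried in isolation. This is only a minimal sketch: the entry below is a made-up, passwd-style string (not read from any real user database), and the wrapper name @code{pw_name} is hypothetical. It shows @code{split} breaking the entry at each @samp{:} and the name being taken from the first element.

```sh
# Hypothetical wrapper; the entry string is invented for illustration only.
pw_name() {
    awk 'BEGIN {
        pw = "arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh"
        n = split(pw, a, ":")               # break the entry at each colon
        printf("uid=%d(%s)\n", a[3], a[1])  # a[1] is the name, a[3] the uid
    }'
}

pw_name
```

The real program gets the entry from the library function instead of a literal string, and repeats the same two steps for each id it prints.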
+ +@findex id.awk +@example +@c @group +@c file eg/prog/id.awk +# id.awk --- implement id in awk +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May 1993 + +# output is: +# uid=12(foo) euid=34(bar) gid=3(baz) \ +# egid=5(blat) groups=9(nine),2(two),1(one) + +BEGIN \ +@{ + if ((getline < "/dev/user") < 0) @{ + err = "id: no /dev/user support - cannot run" + print err > "/dev/stderr" + exit 1 + @} + close("/dev/user") + + uid = $1 + euid = $2 + gid = $3 + egid = $4 + + printf("uid=%d", uid) + pw = getpwuid(uid) +@group + if (pw != "") @{ + split(pw, a, ":") + printf("(%s)", a[1]) + @} +@end group + + if (euid != uid) @{ + printf(" euid=%d", euid) + pw = getpwuid(euid) + if (pw != "") @{ + split(pw, a, ":") + printf("(%s)", a[1]) + @} + @} + + printf(" gid=%d", gid) + pw = getgrgid(gid) + if (pw != "") @{ + split(pw, a, ":") + printf("(%s)", a[1]) + @} + + if (egid != gid) @{ + printf(" egid=%d", egid) + pw = getgrgid(egid) + if (pw != "") @{ + split(pw, a, ":") + printf("(%s)", a[1]) + @} + @} + + if (NF > 4) @{ + printf(" groups="); + for (i = 5; i <= NF; i++) @{ + printf("%d", $i) + pw = getgrgid($i) + if (pw != "") @{ + split(pw, a, ":") + printf("(%s)", a[1]) + @} +@group + if (i < NF) + printf(",") +@end group + @} + @} + print "" +@} +@c endfile +@c @end group +@end example + +@c exercise!!! +@ignore +The POSIX version of @code{id} takes arguments that control which +information is printed. Modify this version to accept the same +arguments and perform in the same way. +@end ignore + +@node Split Program, Tee Program, Id Program, Clones +@subsection Splitting a Large File Into Pieces + +@cindex @code{split} utility +The @code{split} program splits large text files into smaller pieces. By default, +the output files are named @file{xaa}, @file{xab}, and so on. Each file has +1000 lines in it, with the likely exception of the last file. 
To change the +number of lines in each file, you supply a number on the command line +preceded with a minus, e.g., @samp{-500} for files with 500 lines in them +instead of 1000. To change the name of the output files to something like +@file{myfileaa}, @file{myfileab}, and so on, you supply an additional +argument that specifies the filename. + +Here is a version of @code{split} in @code{awk}. It uses the @code{ord} and +@code{chr} functions presented in +@ref{Ordinal Functions, ,Translating Between Characters and Numbers}. + +The program first sets its defaults, and then tests to make sure there are +not too many arguments. It then looks at each argument in turn. The +first argument could be a minus followed by a number. If it is, this happens +to look like a negative number, so it is made positive, and that is the +count of lines. The data file name is skipped over, and the final argument +is used as the prefix for the output file names. + +@findex split.awk +@example +@c @group +@c file eg/prog/split.awk +# split.awk --- do split in awk +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May 1993 + +# usage: split [-num] [file] [outname] + +BEGIN @{ + outfile = "x" # default + count = 1000 + if (ARGC > 4) + usage() + + i = 1 + if (ARGV[i] ~ /^-[0-9]+$/) @{ + count = -ARGV[i] + ARGV[i] = "" + i++ + @} + # test argv in case reading from stdin instead of file + if (i in ARGV) + i++ # skip data file name + if (i in ARGV) @{ + outfile = ARGV[i] + ARGV[i] = "" + @} + + s1 = s2 = "a" + out = (outfile s1 s2) +@} +@c endfile +@c @end group +@end example + +The next rule does most of the work. @code{tcount} (temporary count) tracks +how many lines have been printed to the output file so far. If it is greater +than @code{count}, it is time to close the current file and start a new one. +@code{s1} and @code{s2} track the current suffixes for the file name. If +they are both @samp{z}, the file is just too big. 
Otherwise, @code{s1} +moves to the next letter in the alphabet and @code{s2} starts over again at +@samp{a}. + +@example +@c @group +@c file eg/prog/split.awk +@{ + if (++tcount > count) @{ + close(out) + if (s2 == "z") @{ + if (s1 == "z") @{ + printf("split: %s is too large to split\n", \ + FILENAME) > "/dev/stderr" + exit 1 + @} + s1 = chr(ord(s1) + 1) + s2 = "a" + @} else + s2 = chr(ord(s2) + 1) + out = (outfile s1 s2) + tcount = 1 + @} + print > out +@} +@c endfile +@c @end group +@end example + +The @code{usage} function simply prints an error message and exits. + +@example +@c @group +@c file eg/prog/split.awk +function usage( e) +@{ + e = "usage: split [-num] [file] [outname]" + print e > "/dev/stderr" + exit 1 +@} +@c endfile +@c @end group +@end example + +@noindent +The variable @code{e} is used so that the function +fits nicely on the +@iftex +page. +@end iftex +@ifinfo +screen. +@end ifinfo + +This program is a bit sloppy; it relies on @code{awk} to close the last file +for it automatically, instead of doing it in an @code{END} rule. + +@node Tee Program, Uniq Program, Split Program, Clones +@subsection Duplicating Output Into Multiple Files + +@cindex @code{tee} utility +The @code{tee} program is known as a ``pipe fitting.'' @code{tee} copies +its standard input to its standard output, and also duplicates it to the +files named on the command line. Its usage is: + +@example +tee @r{[}-a@r{]} file @dots{} +@end example + +The @samp{-a} option tells @code{tee} to append to the named files, instead of +truncating them and starting over. + +The @code{BEGIN} rule first makes a copy of all the command line arguments, +into an array named @code{copy}. +@code{ARGV[0]} is not copied, since it is not needed. +@code{tee} cannot use @code{ARGV} directly, since @code{awk} will attempt to +process each file named in @code{ARGV} as input data. 
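The argument handling described above can be sketched on its own. This is a hedged illustration, not the @code{tee} program itself: the wrapper name @code{tee_args} is hypothetical, no files are written, and the output line format is invented. It copies the arguments into @code{copy}, then resets @code{ARGV} and @code{ARGC} so that @code{awk} reads the standard input rather than trying to open the named files as data.

```sh
# Hypothetical wrapper: reports what each named file would receive.
tee_args() {
    awk 'BEGIN {
        for (i = 1; i < ARGC; i++)
            copy[i] = ARGV[i]     # remember each output file name
        n = ARGC - 1
        ARGV[1] = "-"             # read standard input instead
        ARGC = 2                  # and ignore the remaining arguments
    }
    { lines++ }
    END {
        for (i = 1; i <= n; i++)
            print copy[i] ": " (lines + 0) " line(s)"
    }' "$@"
}

printf 'a\nb\n' | tee_args out1 out2
```

Without the reset of @code{ARGV} and @code{ARGC}, @code{awk} would attempt to open @file{out1} and @file{out2} as input files.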
+ +If the first argument is @samp{-a}, then the flag variable +@code{append} is set to true, and both @code{ARGV[1]} and +@code{copy[1]} are deleted. If @code{ARGC} is less than two, then no file +names were supplied, and @code{tee} prints a usage message and exits. +Finally, @code{awk} is forced to read the standard input by setting +@code{ARGV[1]} to @code{"-"}, and @code{ARGC} to two. + +@c 2e: the `ARGC--' in the `if (ARGV[1] == "-a")' isn't needed. + +@findex tee.awk +@example +@group +@c file eg/prog/tee.awk +# tee.awk --- tee in awk +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May 1993 +# Revised December 1995 +@end group + +@group +BEGIN \ +@{ + for (i = 1; i < ARGC; i++) + copy[i] = ARGV[i] +@end group + +@group + if (ARGV[1] == "-a") @{ + append = 1 + delete ARGV[1] + delete copy[1] + ARGC-- + @} +@end group +@group + if (ARGC < 2) @{ + print "usage: tee [-a] file ..." > "/dev/stderr" + exit 1 + @} +@end group +@group + ARGV[1] = "-" + ARGC = 2 +@} +@c endfile +@end group +@end example + +The single rule does all the work. Since there is no pattern, it is +executed for each line of input. The body of the rule simply prints the +line into each file on the command line, and then to the standard output. + +@example +@group +@c file eg/prog/tee.awk +@{ + # moving the if outside the loop makes it run faster + if (append) + for (i in copy) + print >> copy[i] + else + for (i in copy) + print > copy[i] + print +@} +@c endfile +@end group +@end example + +It would have been possible to code the loop this way: + +@example +for (i in copy) + if (append) + print >> copy[i] + else + print > copy[i] +@end example + +@noindent +This is more concise, but it is also less efficient. The @samp{if} is +tested for each record and for each output file. By duplicating the loop +body, the @samp{if} is only tested once for each input record. 
If there are +@var{N} input records and @var{M} output files, the first method only +executes @var{N} @samp{if} statements, while the second would execute +@var{N}@code{*}@var{M} @samp{if} statements. + +Finally, the @code{END} rule cleans up, by closing all the output files. + +@example +@c @group +@c file eg/prog/tee.awk +END \ +@{ + for (i in copy) + close(copy[i]) +@} +@c endfile +@c @end group +@end example + +@node Uniq Program, Wc Program, Tee Program, Clones +@subsection Printing Non-duplicated Lines of Text + +@cindex @code{uniq} utility +The @code{uniq} utility reads sorted lines of data on its standard input, +and (by default) removes duplicate lines. In other words, only unique lines +are printed, hence the name. @code{uniq} has a number of options. The usage is: + +@example +uniq @r{[}-udc @r{[}-@var{n}@r{]]} @r{[}+@var{n}@r{]} @r{[} @var{input file} @r{[} @var{output file} @r{]]} +@end example + +The option meanings are: + +@table @code +@item -d +Only print repeated lines. + +@item -u +Only print non-repeated lines. + +@item -c +Count lines. This option overrides @samp{-d} and @samp{-u}. Both repeated +and non-repeated lines are counted. + +@item -@var{n} +Skip @var{n} fields before comparing lines. The definition of fields +is similar to @code{awk}'s default: non-whitespace characters separated +by runs of spaces and/or tabs. + +@item +@var{n} +Skip @var{n} characters before comparing lines. Any fields specified with +@samp{-@var{n}} are skipped first. + +@item @var{input file} +Data is read from the input file named on the command line, instead of from +the standard input. + +@item @var{output file} +The generated output is sent to the named output file, instead of to the +standard output. +@end table + +Normally @code{uniq} behaves as if both the @samp{-d} and @samp{-u} options +had been provided. + +Here is an @code{awk} implementation of @code{uniq}.
It uses the +@code{getopt} library function +(@pxref{Getopt Function, ,Processing Command Line Options}), +and the @code{join} library function +(@pxref{Join Function, ,Merging an Array Into a String}). + +The program begins with a @code{usage} function and then a brief outline of +the options and their meanings in a comment. + +The @code{BEGIN} rule deals with the command line arguments and options. It +uses a trick to get @code{getopt} to handle options of the form @samp{-25}, +treating such an option as the option letter @samp{2} with an argument of +@samp{5}. If indeed two or more digits were supplied (@code{Optarg} looks +like a number), @code{Optarg} is +concatenated with the option digit, and the result is added to zero to make +it into a number. If there is only one digit in the option, then +@code{Optarg} is not needed, and @code{Optind} must be decremented so that +@code{getopt} will process it next time. This code is admittedly a bit +tricky. + +If no options were supplied, then the default is taken, to print both +repeated and non-repeated lines. The output file, if provided, is assigned +to @code{outputfile}. Earlier, @code{outputfile} was initialized to the +standard output, @file{/dev/stdout}. + +@findex uniq.awk +@example +@c @group +@c file eg/prog/uniq.awk +# uniq.awk --- do uniq in awk +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May 1993 + +@group +function usage( e) +@{ + e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]" + print e > "/dev/stderr" + exit 1 +@} +@end group + +@group +# -c count lines.
overrides -d and -u +# -d only repeated lines +# -u only non-repeated lines +# -n skip n fields +# +n skip n characters, skip fields first +@end group + +BEGIN \ +@{ + count = 1 + outputfile = "/dev/stdout" + opts = "udc0:1:2:3:4:5:6:7:8:9:" + while ((c = getopt(ARGC, ARGV, opts)) != -1) @{ + if (c == "u") + non_repeated_only++ + else if (c == "d") + repeated_only++ + else if (c == "c") + do_count++ + else if (index("0123456789", c) != 0) @{ + # getopt requires args to options + # this messes us up for things like -5 + if (Optarg ~ /^[0-9]+$/) + fcount = (c Optarg) + 0 + else @{ + fcount = c + 0 + Optind-- + @} + @} else + usage() + @} + + if (ARGV[Optind] ~ /^\+[0-9]+$/) @{ + charcount = substr(ARGV[Optind], 2) + 0 + Optind++ + @} + + for (i = 1; i < Optind; i++) + ARGV[i] = "" + + if (repeated_only == 0 && non_repeated_only == 0) + repeated_only = non_repeated_only = 1 + +@group + if (ARGC - Optind == 2) @{ + outputfile = ARGV[ARGC - 1] + ARGV[ARGC - 1] = "" + @} +@} +@c endfile +@end group +@end example + +The following function, @code{are_equal}, compares the current line, +@code{$0}, to the +previous line, @code{last}. It handles skipping fields and characters. + +If no field count and no character count were specified, @code{are_equal} +simply returns one or zero depending upon the result of a simple string +comparison of @code{last} and @code{$0}. Otherwise, things get more +complicated. + +If fields have to be skipped, each line is broken into an array using +@code{split} +(@pxref{String Functions, ,Built-in Functions for String Manipulation}), +and then the desired fields are joined back into a line using @code{join}. +The joined lines are stored in @code{clast} and @code{cline}. +If no fields are skipped, @code{clast} and @code{cline} are set to +@code{last} and @code{$0} respectively. + +Finally, if characters are skipped, @code{substr} is used to strip off the +leading @code{charcount} characters in @code{clast} and @code{cline}. 
The +two strings are then compared, and @code{are_equal} returns the result. + +@example +@c @group +@c file eg/prog/uniq.awk +function are_equal( n, m, clast, cline, alast, aline) +@{ + if (fcount == 0 && charcount == 0) + return (last == $0) + + if (fcount > 0) @{ + n = split(last, alast) + m = split($0, aline) + clast = join(alast, fcount+1, n) + cline = join(aline, fcount+1, m) + @} else @{ + clast = last + cline = $0 + @} + if (charcount) @{ + clast = substr(clast, charcount + 1) + cline = substr(cline, charcount + 1) + @} + + return (clast == cline) +@} +@c endfile +@c @end group +@end example + +The following two rules are the body of the program. The first one is +executed only for the very first line of data. It sets @code{last} equal to +@code{$0}, so that subsequent lines of text have something to be compared to. + +The second rule does the work. The variable @code{equal} will be one or zero +depending upon the results of @code{are_equal}'s comparison. If @code{uniq} +is counting lines (@samp{-c}), then the @code{count} variable is incremented if +the lines are equal. Otherwise the previous line is printed along with its count, +and @code{count} is reset, since the two lines are not equal. + +If @code{uniq} is not counting, @code{count} is incremented if the lines are +equal. Otherwise, if @code{uniq} is printing repeated lines, and more than +one copy of the line has been seen, or if @code{uniq} is printing non-repeated lines, +and only one copy has been seen, then the previous line is printed, and @code{count} +is reset. + +Finally, similar logic is used in the @code{END} rule to print the final +line of input data.
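The counting logic described above can be sketched in isolation. This is a simplified illustration rather than the program itself: it assumes no fields or characters are skipped, so a plain comparison stands in for @code{are_equal}, it always behaves as if @samp{-c} were given, and the wrapper name @code{count_runs} is hypothetical.

```sh
# Hypothetical wrapper: count runs of identical adjacent lines.
count_runs() {
    awk '
    NR == 1 { last = $0; count = 1; next }   # first line: nothing to compare yet
    {
        if ($0 == last)
            count++                          # the current run continues
        else {
            printf("%4d %s\n", count, last)  # the run ended: report it
            last = $0
            count = 1
        }
    }
    END {
        if (NR > 0)                          # flush the final run
            printf("%4d %s\n", count, last)
    }'
}

printf 'a\na\nb\n' | count_runs
```

The @code{END} rule is needed for the same reason as in the full program: the last run is never followed by a differing line, so nothing else would print it.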
+ +@example +@c @group +@c file eg/prog/uniq.awk +@group +NR == 1 @{ + last = $0 + next +@} +@end group + +@{ + equal = are_equal() + + if (do_count) @{ # overrides -d and -u + if (equal) + count++ + else @{ + printf("%4d %s\n", count, last) > outputfile + last = $0 + count = 1 # reset + @} + next + @} + + if (equal) + count++ + else @{ + if ((repeated_only && count > 1) || + (non_repeated_only && count == 1)) + print last > outputfile + last = $0 + count = 1 + @} +@} + +@group +END @{ + if (do_count) + printf("%4d %s\n", count, last) > outputfile + else if ((repeated_only && count > 1) || + (non_repeated_only && count == 1)) + print last > outputfile +@} +@end group +@c endfile +@c @end group +@end example + +@node Wc Program, , Uniq Program, Clones +@subsection Counting Things + +@cindex @code{wc} utility +The @code{wc} (word count) utility counts lines, words, and characters in +one or more input files. Its usage is: + +@example +wc @r{[}-lwc@r{]} @r{[} @var{files} @dots{} @r{]} +@end example + +If no files are specified on the command line, @code{wc} reads its standard +input. If there are multiple files, it will also print total counts for all +the files. The options and their meanings are: + +@table @code +@item -l +Only count lines. + +@item -w +Only count words. +A ``word'' is a contiguous sequence of non-whitespace characters, separated +by spaces and/or tabs. Happily, this is the normal way @code{awk} separates +fields in its input data. + +@item -c +Only count characters. +@end table + +Implementing @code{wc} in @code{awk} is particularly elegant, since +@code{awk} does a lot of the work for us; it splits lines into words (i.e.@: +fields) and counts them, it counts lines (i.e.@: records) for us, and it can +easily tell us how long a line is. + +This version uses the @code{getopt} library function +(@pxref{Getopt Function, ,Processing Command Line Options}), +and the file transition functions +(@pxref{Filetrans Function, ,Noting Data File Boundaries}). 
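The remark that @code{awk} does much of the counting for us can be seen in a stripped-down sketch. It is not the program presented in this section: it reads only the standard input, takes no options, and the wrapper name @code{wc_stdin} and its output format are assumptions made for illustration.

```sh
# Hypothetical wrapper: count lines, words, and characters on standard input.
wc_stdin() {
    awk '
    {
        chars += length($0) + 1   # add one for the newline that ended the record
        lines++                   # one record per line
        words += NF               # awk has already split the line into fields
    }
    END { printf("%d %d %d\n", lines, words, chars) }'
}

printf 'one two\nthree\n' | wc_stdin
```

For the two input lines shown, this prints the counts 2, 3, and 14 in that order.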
+ +This version has one major difference from traditional versions of @code{wc}. +Our version always prints the counts in the order lines, words, +and characters. Traditional versions note the order of the @samp{-l}, +@samp{-w}, and @samp{-c} options on the command line, and print the counts +in that order. + +The @code{BEGIN} rule does the argument processing. +The variable @code{print_total} will +be true if more than one file was named on the command line. + +@findex wc.awk +@example +@c @group +@c file eg/prog/wc.awk +# wc.awk --- count lines, words, characters +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May 1993 + +# Options: +# -l only count lines +# -w only count words +# -c only count characters +# +# Default is to count lines, words, characters + +BEGIN @{ + # let getopt print a message about + # invalid options. we ignore them + while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{ + if (c == "l") + do_lines = 1 + else if (c == "w") + do_words = 1 + else if (c == "c") + do_chars = 1 + @} + for (i = 1; i < Optind; i++) + ARGV[i] = "" + + # if no options, do all + if (! do_lines && ! do_words && ! do_chars) + do_lines = do_words = do_chars = 1 + + print_total = (ARGC - i > 1) +@} +@c endfile +@c @end group +@end example + +The @code{beginfile} function is simple; it just resets the counts of lines, +words, and characters to zero, and saves the current file name in +@code{fname}. + +The @code{endfile} function adds the current file's numbers to the running +totals of lines, words, and characters. It then prints out those numbers +for the file that was just read. It relies on @code{beginfile} to reset the +numbers for the following data file.
+ +@example +@c @group +@c file eg/prog/wc.awk +function beginfile(file) +@{ + chars = lines = words = 0 + fname = FILENAME +@} + +function endfile(file) +@{ + tchars += chars + tlines += lines + twords += words +@group + if (do_lines) + printf "\t%d", lines +@end group + if (do_words) + printf "\t%d", words + if (do_chars) + printf "\t%d", chars + printf "\t%s\n", fname +@} +@c endfile +@c @end group +@end example + +There is one rule that is executed for each line. It adds the length of the +record to @code{chars}. It has to add one, since the newline character +separating records (the value of @code{RS}) is not part of the record +itself. @code{lines} is incremented for each line read, and @code{words} is +incremented by the value of @code{NF}, the number of ``words'' on this +line.@footnote{Examine the code in +@ref{Filetrans Function, ,Noting Data File Boundaries}. +Why must @code{wc} use a separate @code{lines} variable, instead of using +the value of @code{FNR} in @code{endfile}?} + +Finally, the @code{END} rule simply prints the totals for all the files. + +@example +@c @group +@c file eg/prog/wc.awk +# do per line +@{ + chars += length($0) + 1 # get newline + lines++ + words += NF +@} + +END @{ + if (print_total) @{ + if (do_lines) + printf "\t%d", tlines + if (do_words) + printf "\t%d", twords + if (do_chars) + printf "\t%d", tchars + print "\ttotal" + @} +@} +@c endfile +@c @end group +@end example + +@node Miscellaneous Programs, , Clones, Sample Programs +@section A Grab Bag of @code{awk} Programs + +This section is a large ``grab bag'' of miscellaneous programs. +We hope you find them both interesting and enjoyable. + +@menu +* Dupword Program:: Finding duplicated words in a document. +* Alarm Program:: An alarm clock. +* Translate Program:: A program similar to the @code{tr} utility. +* Labels Program:: Printing mailing labels. +* Word Sorting:: A program to produce a word usage count. 
+* History Sorting:: Eliminating duplicate entries from a history + file. +* Extract Program:: Pulling out programs from Texinfo source + files. +* Simple Sed:: A Simple Stream Editor. +* Igawk Program:: A wrapper for @code{awk} that includes files. +@end menu + +@node Dupword Program, Alarm Program, Miscellaneous Programs, Miscellaneous Programs +@subsection Finding Duplicated Words in a Document + +A common error when writing large amounts of prose is to accidentally +duplicate words. Often you will see this in text as something like ``the +the program does the following @dots{}.'' When the text is on-line, often +the duplicated words occur at the end of one line and the beginning of +another, making them very difficult to spot. +@c as here! + +This program, @file{dupword.awk}, scans through a file one line at a time, +and looks for adjacent occurrences of the same word. It also saves the last +word on a line (in the variable @code{prev}) for comparison with the first +word on the next line. + +The first statement makes sure that the line is all lower-case, so that, +for example, +``The'' and ``the'' compare equal to each other. The second statement +removes all non-alphanumeric and non-whitespace characters from the line, so +that punctuation does not affect the comparison either. This sometimes +leads to reports of duplicated words that really are different, but this is +unusual.
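The effect of the two normalizing statements can be seen in a small sketch. It is a reduced illustration, not the program that follows: it only compares words within a single line (there is no @code{prev} carried across lines), and the wrapper name @code{dup_check} and the message text are assumptions.

```sh
# Hypothetical wrapper: report adjacent duplicate words within each line.
dup_check() {
    awk '{
        $0 = tolower($0)              # "The" and "the" now compare equal
        gsub(/[^a-z0-9 \t]/, "")      # drop punctuation so it cannot interfere
        for (i = 2; i <= NF; i++)
            if ($i == $(i-1))
                print "duplicate: " $i
    }'
}

printf 'The the program, does.\n' | dup_check
```

Because both statements assign to @code{$0}, @code{awk} re-splits the record into fields each time, so the loop sees the cleaned-up words.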
+ +@c FIXME: add check for $i != "" +@findex dupword.awk +@example +@group +@c file eg/prog/dupword.awk +# dupword --- find duplicate words in text +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# December 1991 + +@{ + $0 = tolower($0) + gsub(/[^A-Za-z0-9 \t]/, ""); + if ($1 == prev) + printf("%s:%d: duplicate %s\n", + FILENAME, FNR, $1) + for (i = 2; i <= NF; i++) + if ($i == $(i-1)) + printf("%s:%d: duplicate %s\n", + FILENAME, FNR, $i) + prev = $NF +@} +@c endfile +@end group +@end example + +@node Alarm Program, Translate Program, Dupword Program, Miscellaneous Programs +@subsection An Alarm Clock Program + +The following program is a simple ``alarm clock'' program. +You give it a time of day, and an optional message. At the given time, +it prints the message on the standard output. In addition, you can give it +the number of times to repeat the message, and also a delay between +repetitions. + +This program uses the @code{gettimeofday} function from +@ref{Gettimeofday Function, ,Managing the Time of Day}. + +All the work is done in the @code{BEGIN} rule. The first part is argument +checking and setting of defaults; the delay, the count, and the message to +print. If the user supplied a message, but it does not contain the ASCII BEL +character (known as the ``alert'' character, @samp{\a}), then it is added to +the message. (On many systems, printing the ASCII BEL generates some sort +of audible alert. Thus, when the alarm goes off, the system calls attention +to itself, in case the user is not looking at their computer or terminal.) 
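The handling of the alert character described above can be tried separately. This is a minimal sketch under stated assumptions: the wrapper name @code{add_bel} is hypothetical, and the message is passed in with @samp{-v} rather than taken from @code{ARGV} as the real program does.

```sh
# Hypothetical wrapper: surround a message with BEL unless it already has one.
add_bel() {
    awk -v message="$1" 'BEGIN {
        if (index(message, "\a") == 0)      # no alert character yet
            message = "\a" message "\a"     # so add one at each end
        printf("%s", message)
    }'
}

add_bel "wake up"
```

A message that already contains the alert character is left unchanged, which mirrors the test on @code{index} in the program below.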
+ +@findex alarm.awk +@example +@c @group +@c file eg/prog/alarm.awk +# alarm --- set an alarm +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May 1993 + +# usage: alarm time [ "message" [ count [ delay ] ] ] + +BEGIN \ +@{ + # Initial argument sanity checking + usage1 = "usage: alarm time ['message' [count [delay]]]" + usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1]) + + if (ARGC < 2) @{ + print usage1 > "/dev/stderr" + exit 1 + @} else if (ARGC == 5) @{ + delay = ARGV[4] + 0 + count = ARGV[3] + 0 + message = ARGV[2] + @} else if (ARGC == 4) @{ + count = ARGV[3] + 0 + message = ARGV[2] + @} else if (ARGC == 3) @{ + message = ARGV[2] + @} else if (ARGV[1] !~ /[0-9]?[0-9]:[0-9][0-9]/) @{ + print usage1 > "/dev/stderr" + print usage2 > "/dev/stderr" + exit 1 + @} + + # set defaults for once we reach the desired time + if (delay == 0) + delay = 180 # 3 minutes + if (count == 0) + count = 5 +@group + if (message == "") + message = sprintf("\aIt is now %s!\a", ARGV[1]) + else if (index(message, "\a") == 0) + message = "\a" message "\a" +@end group +@c endfile +@end example + +The next section of code turns the alarm time into hours and minutes, +and converts it if necessary to a 24-hour clock. Then it turns that +time into a count of the seconds since midnight. Next it turns the current +time into a count of seconds since midnight. The difference between the two +is how long to wait before setting off the alarm. + +@example +@c @group +@c file eg/prog/alarm.awk + # split up dest time + split(ARGV[1], atime, ":") + hour = atime[1] + 0 # force numeric + minute = atime[2] + 0 # force numeric + + # get current broken down time + gettimeofday(now) + + # if time given is 12-hour hours and it's after that + # hour, e.g., `alarm 5:30' at 9 a.m.
means 5:30 p.m., + # then add 12 to real hour + if (hour < 12 && now["hour"] > hour) + hour += 12 + + # set target time in seconds since midnight + target = (hour * 60 * 60) + (minute * 60) + + # get current time in seconds since midnight + current = (now["hour"] * 60 * 60) + \ + (now["minute"] * 60) + now["second"] + + # how long to sleep for + naptime = target - current + if (naptime <= 0) @{ + print "time is in the past!" > "/dev/stderr" + exit 1 + @} +@c endfile +@c @end group +@end example + +Finally, the program uses the @code{system} function +(@pxref{I/O Functions, ,Built-in Functions for Input/Output}) +to call the @code{sleep} utility. The @code{sleep} utility simply pauses +for the given number of seconds. If the exit status is not zero, +the program assumes that @code{sleep} was interrupted, and exits. If +@code{sleep} exited with an OK status (zero), then the program prints the +message in a loop, again using @code{sleep} to delay for however many +seconds are necessary. + +@example +@c @group +@c file eg/prog/alarm.awk + # zzzzzz..... go away if interrupted + if (system(sprintf("sleep %d", naptime)) != 0) + exit 1 + + # time to notify! + command = sprintf("sleep %d", delay) + for (i = 1; i <= count; i++) @{ + print message + # if sleep command interrupted, go away + if (system(command) != 0) + break + @} + + exit 0 +@} +@c endfile +@c @end group +@end example + +@node Translate Program, Labels Program, Alarm Program, Miscellaneous Programs +@subsection Transliterating Characters + +The system @code{tr} utility transliterates characters. For example, it is +often used to map upper-case letters into lower-case, for further +processing. + +@example +@var{generate data} | tr '[A-Z]' '[a-z]' | @var{process data} @dots{} +@end example + +You give @code{tr} two lists of characters enclosed in square brackets. 
+Usually, the lists are quoted to keep the shell from attempting to do a +filename expansion.@footnote{On older, non-POSIX systems, @code{tr} often +does not require that the lists be enclosed in square brackets and quoted. +This is a feature.} When processing the input, the +first character in the first list is replaced with the first character in the +second list, the second character in the first list is replaced with the +second character in the second list, and so on. +If there are more characters in the ``from'' list than in the ``to'' list, +the last character of the ``to'' list is used for the remaining characters +in the ``from'' list. + +Some time ago, +@c early or mid-1989! +a user proposed to us that we add a transliteration function to @code{gawk}. +Being opposed to ``creeping featurism,'' I wrote the following program to +prove that character transliteration could be done with a user-level +function. This program is not as complete as the system @code{tr} utility, +but it will do most of the job. + +The @code{translate} program demonstrates one of the few weaknesses of +standard +@code{awk}: dealing with individual characters is very painful, requiring +repeated use of the @code{substr}, @code{index}, and @code{gsub} built-in +functions +(@pxref{String Functions, ,Built-in Functions for String Manipulation}).@footnote{This +program was written before @code{gawk} acquired the ability to +split each character in a string into separate array elements. +How might this ability simplify the program?} + +There are two functions. The first, @code{stranslate}, takes three +arguments. + +@table @code +@item from +A list of characters to translate from. + +@item to +A list of characters to translate to. + +@item target +The string to do the translation on. +@end table + +Associative arrays make the translation part fairly easy. @code{t_ar} holds +the ``to'' characters, indexed by the ``from'' characters. 
Then a simple +loop goes through @code{from}, one character at a time. For each character +in @code{from}, if the character appears in @code{target}, @code{gsub} +is used to change it to the corresponding @code{to} character. + +The @code{translate} function simply calls @code{stranslate} using @code{$0} +as the target. The main program sets two global variables, @code{FROM} and +@code{TO}, from the command line, and then changes @code{ARGV} so that +@code{awk} will read from the standard input. + +Finally, the processing rule simply calls @code{translate} for each record. + +@findex translate.awk +@example +@c @group +@c file eg/prog/translate.awk +# translate --- do tr like stuff +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# August 1989 + +# bugs: does not handle things like: tr A-Z a-z, it has +# to be spelled out. However, if `to' is shorter than `from', +# the last character in `to' is used for the rest of `from'. + +function stranslate(from, to, target, lf, lt, t_ar, i, c) +@{ + lf = length(from) + lt = length(to) + for (i = 1; i <= lt; i++) + t_ar[substr(from, i, 1)] = substr(to, i, 1) + if (lt < lf) + for (; i <= lf; i++) + t_ar[substr(from, i, 1)] = substr(to, lt, 1) + for (i = 1; i <= lf; i++) @{ + c = substr(from, i, 1) + if (index(target, c) > 0) + gsub(c, t_ar[c], target) + @} + return target +@} + +@group +function translate(from, to) +@{ + return $0 = stranslate(from, to, $0) +@} +@end group + +# main program +BEGIN @{ + if (ARGC < 3) @{ + print "usage: translate from to" > "/dev/stderr" + exit + @} + FROM = ARGV[1] + TO = ARGV[2] + ARGC = 2 + ARGV[1] = "-" +@} + +@{ + translate(FROM, TO) + print +@} +@c endfile +@c @end group +@end example + +While it is possible to do character transliteration in a user-level +function, it is not necessarily efficient, and we started to consider adding +a built-in function. 
However, shortly after writing this program, we learned +that the System V Release 4 @code{awk} had added the @code{toupper} and +@code{tolower} functions. These functions handle the vast majority of the +cases where character transliteration is necessary, and so we chose to +simply add those functions to @code{gawk} as well, and then leave well +enough alone. + +An obvious improvement to this program would be to set up the +@code{t_ar} array only once, in a @code{BEGIN} rule. However, this +assumes that the ``from'' and ``to'' lists +will never change throughout the lifetime of the program. + +@node Labels Program, Word Sorting, Translate Program, Miscellaneous Programs +@subsection Printing Mailing Labels + +Here is a ``real world''@footnote{``Real world'' is defined as +``a program actually used to get something done.''} +program. This script reads lists of names and +addresses, and generates mailing labels. Each page of labels has 20 labels +on it, two across and ten down. The addresses are guaranteed to be no more +than five lines of data. Each address is separated from the next by a blank +line. + +The basic idea is to read 20 labels' worth of data. Each line of each label +is stored in the @code{line} array. The single rule takes care of filling +the @code{line} array and printing the page when 20 labels have been read. + +The @code{BEGIN} rule simply sets @code{RS} to the empty string, so that +@code{awk} will split records at blank lines +(@pxref{Records, ,How Input is Split into Records}). +It sets @code{MAXLINES} to 100, since that is the maximum number +of lines on the page (20 * 5 = 100). + +Most of the work is done in the @code{printpage} function. +The label lines are stored sequentially in the @code{line} array. But they +have to be printed horizontally; @code{line[1]} next to @code{line[6]}, +@code{line[2]} next to @code{line[7]}, and so on. Two loops are used to +accomplish this. 
The outer loop, controlled by @code{i}, steps through +every 10 lines of data; this is each row of labels. The inner loop, +controlled by @code{j}, goes through the lines within the row. +As @code{j} goes from zero to four, @samp{i+j} is the @code{j}'th line in +the row, and @samp{i+j+5} is the entry next to it. The output ends up +looking something like this: + +@example +line 1 line 6 +line 2 line 7 +line 3 line 8 +line 4 line 9 +line 5 line 10 +@end example + +As a final note, at lines 21 and 61, an extra blank line is printed, to keep +the output lined up on the labels. This is dependent on the particular +brand of labels in use when the program was written. You will also note +that there are two blank lines at the top and two blank lines at the bottom. + +The @code{END} rule arranges to flush the final page of labels; there may +not have been an even multiple of 20 labels in the data. + +@findex labels.awk +@example +@c @group +@c file eg/prog/labels.awk +# labels.awk +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# June 1992 + +# Program to print labels. Each label is 5 lines of data +# that may have blank lines. The label sheets have 2 +# blank lines at the top and 2 at the bottom. 
+ +BEGIN @{ RS = "" ; MAXLINES = 100 @} + +function printpage( i, j) +@{ + if (Nlines <= 0) + return + + printf "\n\n" # header + + for (i = 1; i <= Nlines; i += 10) @{ + if (i == 21 || i == 61) + print "" + for (j = 0; j < 5; j++) @{ + if (i + j > MAXLINES) + break + printf " %-41s %s\n", line[i+j], line[i+j+5] + @} + print "" + @} + + printf "\n\n" # footer + + for (i in line) + line[i] = "" +@} + +# main rule +@{ + if (Count >= 20) @{ + printpage() + Count = 0 + Nlines = 0 + @} + n = split($0, a, "\n") + for (i = 1; i <= n; i++) + line[++Nlines] = a[i] + for (; i <= 5; i++) + line[++Nlines] = "" + Count++ +@} + +END \ +@{ + printpage() +@} +@c endfile +@c @end group +@end example + +@node Word Sorting, History Sorting, Labels Program, Miscellaneous Programs +@subsection Generating Word Usage Counts + +The following @code{awk} program prints +the number of occurrences of each word in its input. It illustrates the +associative nature of @code{awk} arrays by using strings as subscripts. It +also demonstrates the @samp{for @var{x} in @var{array}} construction. +Finally, it shows how @code{awk} can be used in conjunction with other +utility programs to do a useful task of some complexity with a minimum of +effort. Some explanations follow the program listing. + +@example +awk ' +# Print list of word frequencies +@{ + for (i = 1; i <= NF; i++) + freq[$i]++ +@} + +END @{ + for (word in freq) + printf "%s\t%d\n", word, freq[word] +@}' +@end example + +The first thing to notice about this program is that it has two rules. The +first rule, because it has an empty pattern, is executed on every line of +the input. It uses @code{awk}'s field-accessing mechanism +(@pxref{Fields, ,Examining Fields}) to pick out the individual words from +the line, and the built-in variable @code{NF} (@pxref{Built-in Variables}) +to know how many fields are available. 
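The field mechanism described above can be tried on its own. The following is a minimal sketch, not part of the manual's programs; the function name @code{show_fields} is made up for illustration. For each field it prints @code{NF}, the field number, and the field text, which also makes it easy to see that punctuation stays attached to the word it follows.

```shell
# show_fields: hypothetical helper, not one of the manual's programs.
# For every field on every input line, print NF, the field index,
# and the text of the field.
show_fields() {
    awk '{ for (i = 1; i <= NF; i++) print NF, i, $i }'
}

# Runs of blanks separate fields; the trailing period stays in field 4.
echo "The quick  brown fox." | show_fields
```

Because the default field separator treats any run of blanks as one separator, the doubled space in the sample line does not produce an empty field.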
+ +For each input word, an element of the array @code{freq} is incremented to +reflect that the word has been seen an additional time. + +The second rule, because it has the pattern @code{END}, is not executed +until the input has been exhausted. It prints out the contents of the +@code{freq} table that has been built up inside the first action. + +This program has several problems that would prevent it from being +useful by itself on real text files: + +@itemize @bullet +@item +Words are detected using the @code{awk} convention that fields are +separated by whitespace and that other characters in the input (except +newlines) don't have any special meaning to @code{awk}. This means that +punctuation characters count as part of words. + +@item +The @code{awk} language considers upper- and lower-case characters to be +distinct. Therefore, @samp{bartender} and @samp{Bartender} are not treated +as the same word. This is undesirable since, in normal text, words +are capitalized if they begin sentences, and a frequency analyzer should not +be sensitive to capitalization. + +@item +The output does not come out in any useful order. You're more likely to be +interested in which words occur most frequently, or having an alphabetized +table of how frequently each word occurs. +@end itemize + +The way to solve these problems is to use some of the more advanced +features of the @code{awk} language. First, we use @code{tolower} to remove +case distinctions. Next, we use @code{gsub} to remove punctuation +characters. Finally, we use the system @code{sort} utility to process the +output of the @code{awk} script. 
Here is the new version of +the program: + +@findex wordfreq.awk +@example +@c file eg/prog/wordfreq.awk +# Print list of word frequencies +@{ + $0 = tolower($0) # remove case distinctions + gsub(/[^a-z0-9_ \t]/, "", $0) # remove punctuation + for (i = 1; i <= NF; i++) + freq[$i]++ +@} +@c endfile + +END @{ + for (word in freq) + printf "%s\t%d\n", word, freq[word] +@} +@end example + +Assuming we have saved this program in a file named @file{wordfreq.awk}, +and that the data is in @file{file1}, the following pipeline + +@example +awk -f wordfreq.awk file1 | sort +1 -nr +@end example + +@noindent +produces a table of the words appearing in @file{file1} in order of +decreasing frequency. + +The @code{awk} program suitably massages the data and produces a word +frequency table, which is not ordered. + +The @code{awk} script's output is then sorted by the @code{sort} utility and +printed on the terminal. The options given to @code{sort} in this example +direct it to sort using the second field of each input line (skipping one field), +to treat the sort keys as numeric quantities (otherwise +@samp{15} would come before @samp{5}), and to sort +in descending (reverse) order. + +We could have even done the @code{sort} from within the program, by +changing the @code{END} action to: + +@example +@c file eg/prog/wordfreq.awk +END @{ + sort = "sort +1 -nr" + for (word in freq) + printf "%s\t%d\n", word, freq[word] | sort + close(sort) +@} +@c endfile +@end example + +You would have to use this way of sorting on systems that do not +have true pipes. + +See the general operating system documentation for more information on how +to use the @code{sort} program. + +@node History Sorting, Extract Program, Word Sorting, Miscellaneous Programs +@subsection Removing Duplicates from Unsorted Text + +The @code{uniq} program +(@pxref{Uniq Program, ,Printing Non-duplicated Lines of Text}) +removes duplicate lines from @emph{sorted} data. 
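To see why the input has to be sorted, here is a rough sketch of adjacent-duplicate removal. It is a simplification written for this illustration, not the @code{uniq} program from the manual, and the name @code{adjacent_uniq} is hypothetical. A line is printed only when it differs from the line just before it, so repeats that are not next to each other survive.

```shell
# adjacent_uniq: hypothetical, simplified stand-in for uniq-style
# filtering. Prints a line only if it differs from the previous line.
adjacent_uniq() {
    awk 'NR == 1 || $0 != prev { print } { prev = $0 }'
}

# Unsorted input: the second run of "ls" is kept, because a "cd"
# line separates it from the first "ls".
printf 'ls\ncd\nls\nls\n' | adjacent_uniq
```

On sorted input all copies of a line are adjacent, which is the only case in which this kind of comparison removes every duplicate.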
+ +Suppose, however, that you need to remove duplicate lines from a data file, but +you wish to preserve the order the lines are in. A good example of +this might be a shell history file. The history file keeps a copy of all +the commands you have entered, and it is not unusual to repeat a command +several times in a row. Occasionally you might wish to compact the history +by removing duplicate entries. Yet it is desirable to maintain the order +of the original commands. + +This simple program does the job. It uses two arrays. The @code{data} +array is indexed by the text of each line. +For each line, @code{data[$0]} is incremented. + +If a particular line has not +been seen before, then @code{data[$0]} will be zero. +In that case, the text of the line is stored in @code{lines[count]}. +Each element of @code{lines} is a unique command, and the indices of +@code{lines} indicate the order in which those lines were encountered. +The @code{END} rule simply prints out the lines, in order. + +@cindex Rakitzis, Byron +@findex histsort.awk +@example +@group +@c file eg/prog/histsort.awk +# histsort.awk --- compact a shell history file +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May 1993 + +# Thanks to Byron Rakitzis for the general idea +@{ + if (data[$0]++ == 0) + lines[++count] = $0 +@} + +END @{ + for (i = 1; i <= count; i++) + print lines[i] +@} +@c endfile +@end group +@end example + +This program also provides a foundation for generating other useful +information. For example, using the following @code{print} statement in the +@code{END} rule would indicate how often a particular command was used. + +@example +print data[lines[i]], lines[i] +@end example + +This works because @code{data[$0]} was incremented each time a line was +seen. 
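As a quick check of the counting variant, the sketch below wraps the same two rules in a shell function and feeds it a few sample commands. The function name @code{histsort_counts} and the sample input are made up for this illustration; the @code{awk} logic is the one shown above, with the modified @code{print} statement in the @code{END} rule.

```shell
# histsort_counts: hypothetical wrapper around the histsort logic,
# using the counting print statement in the END rule.
histsort_counts() {
    awk '{
        if (data[$0]++ == 0)        # first time this line is seen
            lines[++count] = $0     # remember it, in arrival order
    }
    END {
        for (i = 1; i <= count; i++)
            print data[lines[i]], lines[i]
    }'
}

# Each distinct command appears once, in first-seen order,
# preceded by the number of times it occurred.
printf 'ls\ncd /tmp\nls\nmake\nls\n' | histsort_counts
```

The order of the output comes from the numeric indices of @code{lines}, not from the order in which @code{awk} happens to store the @code{data} array.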
+ +@node Extract Program, Simple Sed, History Sorting, Miscellaneous Programs +@subsection Extracting Programs from Texinfo Source Files + +@iftex +Both this chapter and the previous chapter +(@ref{Library Functions, ,A Library of @code{awk} Functions}), +present a large number of @code{awk} programs. +@end iftex +@ifinfo +The nodes +@ref{Library Functions, ,A Library of @code{awk} Functions}, +and @ref{Sample Programs, ,Practical @code{awk} Programs}, +are the top level nodes for a large number of @code{awk} programs. +@end ifinfo +If you wish to experiment with these programs, it is tedious to have to type +them in by hand. Here we present a program that can extract parts of a +Texinfo input file into separate files. + +This @value{DOCUMENT} is written in Texinfo, the GNU project's document +formatting language. A single Texinfo source file can be used to produce both +printed and on-line documentation. +@iftex +Texinfo is fully documented in @cite{Texinfo---The GNU Documentation Format}, +available from the Free Software Foundation. +@end iftex +@ifinfo +The Texinfo language is described fully, starting with +@ref{Top, , Introduction, texi, Texinfo---The GNU Documentation Format}. +@end ifinfo + +For our purposes, it is enough to know three things about Texinfo input +files. + +@itemize @bullet +@item +The ``at'' symbol, @samp{@@}, is special in Texinfo, much like @samp{\} in C +or @code{awk}. Literal @samp{@@} symbols are represented in Texinfo source +files as @samp{@@@@}. + +@item +Comments start with either @samp{@@c} or @samp{@@comment}. +The file extraction program will work by using special comments that start +at the beginning of a line. + +@item +Example text that should not be split across a page boundary is bracketed +between lines containing @samp{@@group} and @samp{@@end group} commands. +@end itemize + +The following program, @file{extract.awk}, reads through a Texinfo source +file, and does two things, based on the special comments. 
+Upon seeing @samp{@w{@@c system @dots{}}}, +it runs a command, by extracting the command text from the +control line and passing it on to the @code{system} function +(@pxref{I/O Functions, ,Built-in Functions for Input/Output}). +Upon seeing @samp{@@c file @var{filename}}, each subsequent line is sent to +the file @var{filename}, until @samp{@@c endfile} is encountered. +The rules in @file{extract.awk} will match either @samp{@@c} or +@samp{@@comment} by letting the @samp{omment} part be optional. +Lines containing @samp{@@group} and @samp{@@end group} are simply removed. +@file{extract.awk} uses the @code{join} library function +(@pxref{Join Function, ,Merging an Array Into a String}). + +The example programs in the on-line Texinfo source for @cite{@value{TITLE}} +(@file{gawk.texi}) have all been bracketed inside @samp{file}, +and @samp{endfile} lines. The @code{gawk} distribution uses a copy of +@file{extract.awk} to extract the sample +programs and install many of them in a standard directory, where +@code{gawk} can find them. + +@file{extract.awk} begins by setting @code{IGNORECASE} to one, so that +mixed upper-case and lower-case letters in the directives won't matter. + +The first rule handles calling @code{system}, checking that a command was +given (@code{NF} is at least three), and also checking that the command +exited with a zero exit status, signifying OK. 
+ +@findex extract.awk +@example +@c @group +@c file eg/prog/extract.awk +# extract.awk --- extract files and run programs +# from texinfo files +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# May 1993 + +BEGIN @{ IGNORECASE = 1 @} + +@group +/^@@c(omment)?[ \t]+system/ \ +@{ + if (NF < 3) @{ + e = (FILENAME ":" FNR) + e = (e ": badly formed `system' line") + print e > "/dev/stderr" + next + @} + $1 = "" + $2 = "" + stat = system($0) + if (stat != 0) @{ + e = (FILENAME ":" FNR) + e = (e ": warning: system returned " stat) + print e > "/dev/stderr" + @} +@} +@end group +@c endfile +@end example + +@noindent +The variable @code{e} is used so that the function +fits nicely on the +@iftex +page. +@end iftex +@ifinfo +screen. +@end ifinfo + +The second rule handles moving data into files. It verifies that a file +name was given in the directive. If the file named is not the current file, +then the current file is closed. This means that an @samp{@@c endfile} was +not given for that file. (We should probably print a diagnostic in this +case, although at the moment we do not.) + +The @samp{for} loop does the work. It reads lines using @code{getline} +(@pxref{Getline, ,Explicit Input with @code{getline}}). +For an unexpected end of file, it calls the @code{@w{unexpected_eof}} +function. If the line is an ``endfile'' line, then it breaks out of +the loop. +If the line is an @samp{@@group} or @samp{@@end group} line, then it +ignores it, and goes on to the next line. + +Most of the work is in the following few lines. If the line has no @samp{@@} +symbols, it can be printed directly. Otherwise, each leading @samp{@@} must be +stripped off. + +To remove the @samp{@@} symbols, the line is split into separate elements of +the array @code{a}, using the @code{split} function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). +Each element of @code{a} that is empty indicates two successive @samp{@@} +symbols in the original line. 
For each two empty elements (@samp{@@@@} in +the original file), we have to add back in a single @samp{@@} symbol. + +When the processing of the array is finished, @code{join} is called with the +value of @code{SUBSEP}, to rejoin the pieces back into a single +line. That line is then printed to the output file. + +@example +@c @group +@c file eg/prog/extract.awk +@group +/^@@c(omment)?[ \t]+file/ \ +@{ + if (NF != 3) @{ + e = (FILENAME ":" FNR ": badly formed `file' line") + print e > "/dev/stderr" + next + @} +@end group + if ($3 != curfile) @{ + if (curfile != "") + close(curfile) + curfile = $3 + @} + + for (;;) @{ + if ((getline line) <= 0) + unexpected_eof() + if (line ~ /^@@c(omment)?[ \t]+endfile/) + break + else if (line ~ /^@@(end[ \t]+)?group/) + continue + if (index(line, "@@") == 0) @{ + print line > curfile + continue + @} + n = split(line, a, "@@") +@group + # if a[1] == "", means leading @@, + # don't add one back in. +@end group + for (i = 2; i <= n; i++) @{ + if (a[i] == "") @{ # was an @@@@ + a[i] = "@@" + if (a[i+1] == "") + i++ + @} + @} + print join(a, 1, n, SUBSEP) > curfile + @} +@} +@c endfile +@c @end group +@end example + +An important thing to note is the use of the @samp{>} redirection. +Output done with @samp{>} only opens the file once; it stays open and +subsequent output is appended to the file +(@pxref{Redirection, , Redirecting Output of @code{print} and @code{printf}}). +This allows us to easily mix program text and explanatory prose for the same +sample source file (as has been done here!) without any hassle. The file is +only closed when a new data file name is encountered, or at the end of the +input file. + +Finally, the function @code{@w{unexpected_eof}} prints an appropriate +error message and then exits. + +The @code{END} rule handles the final cleanup, closing the open file. 
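The behavior of @samp{>} noted above is easy to confirm in isolation. This is a small sketch under stated assumptions: the function name @code{redirect_demo} and the temporary file name are invented for the example, and the file is placed under @code{TMPDIR} (or @file{/tmp} if that is unset).

```shell
# redirect_demo: hypothetical helper showing that, inside one awk
# run, repeated "print > file" statements write to the same open file.
redirect_demo() {
    out="${TMPDIR:-/tmp}/redirect-demo.$$"
    awk -v out="$out" 'BEGIN {
        print "first"  > out    # opens (and truncates) the file
        print "second" > out    # same open file: appended, not truncated
        close(out)
    }'
    cat "$out"
    rm -f "$out"
}

redirect_demo
```

Both lines end up in the file, which is what lets the extraction program keep writing to one output file across many separate @code{print} statements.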
+ +@example +@c file eg/prog/extract.awk +@group +function unexpected_eof() +@{ + printf("%s:%d: unexpected EOF or error\n", \ + FILENAME, FNR) > "/dev/stderr" + exit 1 +@} +@end group + +END @{ + if (curfile) + close(curfile) +@} +@c endfile +@end example + +@node Simple Sed, Igawk Program, Extract Program, Miscellaneous Programs +@subsection A Simple Stream Editor + +@cindex @code{sed} utility +The @code{sed} utility is a ``stream editor,'' a program that reads a +stream of data, makes changes to it, and passes the modified data on. +It is often used to make global changes to a large file, or to a stream +of data generated by a pipeline of commands. + +While @code{sed} is a complicated program in its own right, its most common +use is to perform global substitutions in the middle of a pipeline: + +@example +command1 < orig.data | sed 's/old/new/g' | command2 > result +@end example + +Here, the @samp{s/old/new/g} tells @code{sed} to look for the regexp +@samp{old} on each input line, and replace it with the text @samp{new}, +globally (i.e.@: all the occurrences on a line). This is similar to +@code{awk}'s @code{gsub} function +(@pxref{String Functions, , Built-in Functions for String Manipulation}). + +The following program, @file{awksed.awk}, accepts at least two command line +arguments; the pattern to look for and the text to replace it with. Any +additional arguments are treated as data file names to process. If none +are provided, the standard input is used. 
+ +@cindex Brennan, Michael +@cindex @code{awksed} +@cindex simple stream editor +@cindex stream editor, simple +@example +@c @group +@c file eg/prog/awksed.awk +# awksed.awk --- do s/foo/bar/g using just print +# Thanks to Michael Brennan for the idea + +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# August 1995 + +@group +function usage() +@{ + print "usage: awksed pat repl [files...]" > "/dev/stderr" + exit 1 +@} +@end group + +BEGIN @{ + # validate arguments + if (ARGC < 3) + usage() + + RS = ARGV[1] + ORS = ARGV[2] + + # don't use arguments as files + ARGV[1] = ARGV[2] = "" +@} + +# look ma, no hands! +@{ + if (RT == "") + printf "%s", $0 + else + print +@} +@c endfile +@c @end group +@end example + +The program relies on @code{gawk}'s ability to have @code{RS} be a regexp +and on the setting of @code{RT} to the actual text that terminated the +record (@pxref{Records, ,How Input is Split into Records}). + +The idea is to have @code{RS} be the pattern to look for. @code{gawk} +will automatically set @code{$0} to the text between matches of the pattern. +This is text that we wish to keep, unmodified. Then, by setting @code{ORS} +to the replacement text, a simple @code{print} statement will output the +text we wish to keep, followed by the replacement text. + +There is one wrinkle to this scheme, which is what to do if the last record +doesn't end with text that matches @code{RS}? Using a @code{print} +statement unconditionally prints the replacement text, which is not correct. + +However, if the file did not end in text that matches @code{RS}, @code{RT} +will be set to the null string. In this case, we can print @code{$0} using +@code{printf} +(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}). + +The @code{BEGIN} rule handles the setup, checking for the right number +of arguments, and calling @code{usage} if there is a problem. 
Then it sets +@code{RS} and @code{ORS} from the command line arguments, and sets +@code{ARGV[1]} and @code{ARGV[2]} to the null string, so that they will +not be treated as file names +(@pxref{ARGC and ARGV, , Using @code{ARGC} and @code{ARGV}}). + +The @code{usage} function prints an error message and exits. + +Finally, the single rule handles the printing scheme outlined above, +using @code{print} or @code{printf} as appropriate, depending upon the +value of @code{RT}. + +@ignore +Exercise, compare the performance of this version with the more +straightforward: + +BEGIN { + pat = ARGV[1] + repl = ARGV[2] + ARGV[1] = ARGV[2] = "" +} + +{ gsub(pat, repl); print } + +Exercise: what are the advantages and disadvantages of this version vs. sed? + Advantage: egrep regexps + speed (?) + Disadvantage: no & in replacement text + +Others? +@end ignore + +@node Igawk Program, , Simple Sed, Miscellaneous Programs +@subsection An Easy Way to Use Library Functions + +Using library functions in @code{awk} can be very beneficial. It +encourages code re-use and the writing of general functions. Programs are +smaller, and therefore clearer. +However, using library functions is only easy when writing @code{awk} +programs; it is painful when running them, requiring multiple @samp{-f} +options. If @code{gawk} is unavailable, then so too is the @code{AWKPATH} +environment variable and the ability to put @code{awk} functions into a +library directory (@pxref{Options, ,Command Line Options}). + +It would be nice to be able to write programs like so: + +@example +# library functions +@@include getopt.awk +@@include join.awk +@dots{} + +# main program +BEGIN @{ + while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1) + @dots{} + @dots{} +@} +@end example + +The following program, @file{igawk.sh}, provides this service. 
+It simulates @code{gawk}'s searching of the @code{AWKPATH} variable, +and also allows @dfn{nested} includes; i.e.@: a file that has been included +with @samp{@@include} can contain further @samp{@@include} statements. +@code{igawk} will make an effort to only include files once, so that nested +includes don't accidentally include a library function twice. + +@code{igawk} should behave externally just like @code{gawk}. This means it +should accept all of @code{gawk}'s command line arguments, including the +ability to have multiple source files specified via @samp{-f}, and the +ability to mix command line and library source files. + +The program is written using the POSIX Shell (@code{sh}) command language. +The way the program works is as follows: + +@enumerate +@item +Loop through the arguments, saving anything that doesn't represent +@code{awk} source code for later, when the expanded program is run. + +@item +For any arguments that do represent @code{awk} text, put the arguments into +a temporary file that will be expanded. There are two cases. + +@enumerate a +@item +Literal text, provided with @samp{--source} or @samp{--source=}. This +text is just echoed directly. The @code{echo} program will automatically +supply a trailing newline. + +@item +File names provided with @samp{-f}. We use a neat trick, and echo +@samp{@@include @var{filename}} into the temporary file. Since the file +inclusion program will work the way @code{gawk} does, this will get the text +of the file included into the program at the correct point. +@end enumerate + +@item +Run an @code{awk} program (naturally) over the temporary file to expand +@samp{@@include} statements. The expanded program is placed in a second +temporary file. + +@item +Run the expanded program with @code{gawk} and any other original command line +arguments that the user supplied (such as the data file names). 
+@end enumerate + +The initial part of the program turns on shell tracing if the first +argument was @samp{debug}. Otherwise, a shell @code{trap} statement +arranges to clean up any temporary files on program exit or upon an +interrupt. + +@c 2e: For the temp file handling, go with Darrel's ig=${TMP:-/tmp}/igs.$$ +@c 2e: or something as similar as possible. + +The next part loops through all the command line arguments. +There are several cases of interest. + +@table @code +@item -- +This ends the arguments to @code{igawk}. Anything else should be passed on +to the user's @code{awk} program without being evaluated. + +@item -W +This indicates that the next option is specific to @code{gawk}. To make +argument processing easier, the @samp{-W} is prepended to the +remaining arguments and the loop continues. (This is an @code{sh} +programming trick. Don't worry about it if you are not familiar with +@code{sh}.) + +@item -v +@itemx -F +These are saved and passed on to @code{gawk}. + +@item -f +@itemx --file +@itemx --file= +@itemx -Wfile= +The file name is saved to the temporary file @file{/tmp/ig.s.$$} with an +@samp{@@include} statement. +The @code{sed} utility is used to remove the leading option part of the +argument (e.g., @samp{--file=}). + +@item --source +@itemx --source= +@itemx -Wsource= +The source text is echoed into @file{/tmp/ig.s.$$}. + +@item --version +@itemx -Wversion +@code{igawk} prints its version number, and runs @samp{gawk --version} +to get the @code{gawk} version information, and then exits. +@end table + +If none of @samp{-f}, @samp{--file}, @samp{-Wfile}, @samp{--source}, +or @samp{-Wsource} were supplied, then the first non-option argument +should be the @code{awk} program. If there are no command line +arguments left, @code{igawk} prints an error message and exits. +Otherwise, the first argument is echoed into @file{/tmp/ig.s.$$}. 
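Two of the shell idioms mentioned in the table can be sketched separately. The function names @code{strip_file_opt} and @code{rejoin_W} are hypothetical and exist only for this illustration; the @code{sed} expression and the @code{set} command are the ones the description refers to.

```shell
# strip_file_opt: hypothetical helper. Removes a leading "--file="
# or "-Wfile=" from its argument, leaving only the file name.
strip_file_opt() {
    echo "$1" | sed 's/-.file=//'
}

# rejoin_W: hypothetical helper. Given "-W" followed by more
# arguments, glues "-W" onto the front of the next argument and
# prints the resulting argument list, one argument per line.
rejoin_W() {
    shift              # drop the bare -W
    set -- -W"$@"      # the prefix attaches to the first remaining argument
    printf '%s\n' "$@"
}

strip_file_opt --file=getopt.awk
rejoin_W -W source=x data.txt
```

After the rejoining step, @samp{-W source=x} looks exactly like @samp{-Wsource=x}, so a single @code{case} pattern can handle both spellings.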
+ +In any case, after the arguments have been processed, +@file{/tmp/ig.s.$$} contains the complete text of the original @code{awk} +program. + +The @samp{$$} in @code{sh} represents the current process ID number. +It is often used in shell programs to generate unique temporary file +names. This allows multiple users to run @code{igawk} without worrying +that the temporary file names will clash. + +@cindex @code{sed} utility +Here's the program: + +@findex igawk.sh +@example +@c @group +@c file eg/prog/igawk.sh +#! /bin/sh + +# igawk --- like gawk but do @@include processing +# Arnold Robbins, arnold@@gnu.ai.mit.edu, Public Domain +# July 1993 + +if [ "$1" = debug ] +then + set -x + shift +else + # cleanup on exit, hangup, interrupt, quit, termination + trap 'rm -f /tmp/ig.[se].$$' 0 1 2 3 15 +fi + +while [ $# -ne 0 ] # loop over arguments +do + case $1 in + --) shift; break;; + + -W) shift + set -- -W"$@@" + continue;; + + -[vF]) opts="$opts $1 '$2'" + shift;; + + -[vF]*) opts="$opts '$1'" ;; + + -f) echo @@include "$2" >> /tmp/ig.s.$$ + shift;; + +@group + -f*) f=`echo "$1" | sed 's/-f//'` + echo @@include "$f" >> /tmp/ig.s.$$ ;; +@end group + + -?file=*) # -Wfile or --file + f=`echo "$1" | sed 's/-.file=//'` + echo @@include "$f" >> /tmp/ig.s.$$ ;; + + -?file) # get arg, $2 + echo @@include "$2" >> /tmp/ig.s.$$ + shift;; + + -?source=*) # -Wsource or --source + t=`echo "$1" | sed 's/-.source=//'` + echo "$t" >> /tmp/ig.s.$$ ;; + + -?source) # get arg, $2 + echo "$2" >> /tmp/ig.s.$$ + shift;; + + -?version) + echo igawk: version 1.0 1>&2 + gawk --version + exit 0 ;; + + -[W-]*) opts="$opts '$1'" ;; + + *) break;; + esac + shift +done + +if [ ! -s /tmp/ig.s.$$ ] +then + if [ -z "$1" ] + then + echo igawk: no program! 
1>&2 + exit 1 + else + echo "$1" > /tmp/ig.s.$$ + shift + fi +fi + +# at this point, /tmp/ig.s.$$ has the program +@c endfile +@c @end group +@end example + +The @code{awk} program to process @samp{@@include} directives reads through +the program, one line at a time using @code{getline} +(@pxref{Getline, ,Explicit Input with @code{getline}}). +The input file names and @samp{@@include} statements are managed using a +stack. As each @samp{@@include} is encountered, the current file name is +``pushed'' onto the stack, and the file named in the @samp{@@include} +directive becomes +the current file name. As each file is finished, the stack is ``popped,'' +and the previous input file becomes the current input file again. +The process is started by making the original file the first one on the +stack. + +The @code{pathto} function does the work of finding the full path to a +file. It simulates @code{gawk}'s behavior when searching the @code{AWKPATH} +environment variable +(@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}). +If a file name has a @samp{/} in it, no path search +is done. Otherwise, the file name is concatenated with the name of each +directory in the path, and an attempt is made to open the generated file +name. The only way in @code{awk} to test if a file can be read is to go +ahead and try to read it with @code{getline}; that is what @code{pathto} +does.@footnote{On some very old versions of @code{awk}, the test +@samp{getline junk < t} can loop forever if the file exists but is empty. +Caveat Emptor.} +If the file can be read, it is closed, and the file name is +returned. +@ignore +An alternative way to test for the file's existence would be to call +@samp{system("test -r " t)}, which uses the @code{test} utility to +see if the file exists and is readable. The disadvantage to this method +is that it requires creating an extra process, and can thus be slightly +slower. 
+@end ignore + +@example +@c @group +@c file eg/prog/igawk.sh +gawk -- ' +# process @@include directives + +function pathto(file, i, t, junk) +@{ + if (index(file, "/") != 0) + return file + + for (i = 1; i <= ndirs; i++) @{ + t = (pathlist[i] "/" file) + if ((getline junk < t) > 0) @{ + # found it + close(t) + return t + @} + @} + return "" +@} +@c endfile +@c @end group +@end example + +The main program is contained inside one @code{BEGIN} rule. The first thing it +does is set up the @code{pathlist} array that @code{pathto} uses. After +splitting the path on @samp{:}, null elements are replaced with @code{"."}, +which represents the current directory. + +@example +@group +@c file eg/prog/igawk.sh +BEGIN @{ + path = ENVIRON["AWKPATH"] + ndirs = split(path, pathlist, ":") + for (i = 1; i <= ndirs; i++) @{ + if (pathlist[i] == "") + pathlist[i] = "." + @} +@c endfile +@end group +@end example + +The stack is initialized with @code{ARGV[1]}, which will be @file{/tmp/ig.s.$$}. +The main loop comes next. Input lines are read in succession. Lines that +do not start with @samp{@@include} are printed verbatim. + +If the line does start with @samp{@@include}, the file name is in @code{$2}. +@code{pathto} is called to generate the full path. If it could not, then we +print an error message and continue. + +The next thing to check is if the file has been included already. The +@code{processed} array is indexed by the full file name of each included +file, and it tracks this information for us. If the file has been +seen, a warning message is printed. Otherwise, the new file name is +pushed onto the stack and processing continues. + +Finally, when @code{getline} encounters the end of the input file, the file +is closed and the stack is popped. When @code{stackptr} is less than zero, +the program is done. 
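The @code{pathlist} setup described earlier can also be tried on its own. The sketch below is an illustration only: the function name @code{awkpath_dirs} is hypothetical, and the path is passed in with @samp{-v} instead of being read from @code{ENVIRON["AWKPATH"]}, so that the example does not depend on the caller's environment.

```shell
# awkpath_dirs: hypothetical helper. Splits a colon-separated search
# path the way the BEGIN rule does, replacing null elements with "."
# (the current directory), and prints one directory per line.
awkpath_dirs() {
    awk -v path="$1" 'BEGIN {
        ndirs = split(path, pathlist, ":")
        for (i = 1; i <= ndirs; i++) {
            if (pathlist[i] == "")
                pathlist[i] = "."
            print pathlist[i]
        }
    }'
}

# A leading and a trailing colon each yield a null element.
awkpath_dirs ":/usr/local/lib/awk:"
```

A leading, trailing, or doubled colon in the path therefore stands for the current directory.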
+ +@example +@c @group +@c file eg/prog/igawk.sh + stackptr = 0 + input[stackptr] = ARGV[1] # ARGV[1] is first file + + for (; stackptr >= 0; stackptr--) @{ + while ((getline < input[stackptr]) > 0) @{ + if (tolower($1) != "@@include") @{ + print + continue + @} + fpath = pathto($2) + if (fpath == "") @{ + printf("igawk:%s:%d: cannot find %s\n", \ + input[stackptr], FNR, $2) > "/dev/stderr" + continue + @} +@group + if (! (fpath in processed)) @{ + processed[fpath] = input[stackptr] + input[++stackptr] = fpath + @} else + print $2, "included in", input[stackptr], \ + "already included in", \ + processed[fpath] > "/dev/stderr" + @} +@end group +@group + close(input[stackptr]) + @} +@}' /tmp/ig.s.$$ > /tmp/ig.e.$$ +@end group +@c endfile +@c @end group +@end example + +The last step is to call @code{gawk} with the expanded program and the original +options and command line arguments that the user supplied. @code{gawk}'s +exit status is passed back on to @code{igawk}'s calling program. + +@c this causes more problems than it solves, so leave it out. +@ignore +The special file @file{/dev/null} is passed as a data file to @code{gawk} +to handle an interesting case. Suppose that the user's program only has +a @code{BEGIN} rule, and there are no data files to read. The program should exit without reading any data +files. However, suppose that an included library file defines an @code{END} +rule of its own. In this case, @code{gawk} will hang, reading standard +input. In order to avoid this, @file{/dev/null} is explicitly added to the +command line. Reading from @file{/dev/null} always returns an immediate +end of file indication. + +@c Hmm. Add /dev/null if $# is 0? Still messes up ARGV. Sigh. +@end ignore + +@example +@c @group +@c file eg/prog/igawk.sh +eval gawk -f /tmp/ig.e.$$ $opts -- "$@@" + +exit $? +@c endfile +@c @end group +@end example + +This version of @code{igawk} represents my third attempt at this program. 
+There are three key simplifications that made the program work better. + +@enumerate +@item +Using @samp{@@include} even for the files named with @samp{-f} makes building +the initial collected @code{awk} program much simpler; all the +@samp{@@include} processing can be done once. + +@item +The @code{pathto} function doesn't try to save the line read with +@code{getline} when testing for the file's accessibility. Trying to save +this line for use with the main program complicates things considerably. +@c what problem does this engender though - exercise +@c answer, reading from "-" or /dev/stdin + +@item +Using a @code{getline} loop in the @code{BEGIN} rule does it all in one +place. It is not necessary to call out to a separate loop for processing +nested @samp{@@include} statements. +@end enumerate + +Also, this program illustrates that it is often worthwhile to combine +@code{sh} and @code{awk} programming together. You can usually accomplish +quite a lot, without having to resort to low-level programming in C or C++, and it +is frequently easier to do certain kinds of string and argument manipulation +using the shell than it is in @code{awk}. + +Finally, @code{igawk} shows that it is not always necessary to add new +features to a program; they can often be layered on top. With @code{igawk}, +there is no real reason to build @samp{@@include} processing into +@code{gawk} itself. + +As an additional example of this, consider the idea of having two +files in a directory in the search path. + +@table @file +@item default.awk +This file would contain a set of default library functions, such +as @code{getopt} and @code{assert}. + +@item site.awk +This file would contain library functions that are specific to a site or +installation, i.e.@: locally developed functions. +Having a separate file allows @file{default.awk} to change with +new @code{gawk} releases, without requiring the system administrator to +update it each time by adding the local functions. 
+@end table + +One user +@c Karl Berry, karl@ileaf.com, 10/95 +suggested that @code{gawk} be modified to automatically read these files +upon startup. Instead, it would be very simple to modify @code{igawk} +to do this. Since @code{igawk} can process nested @samp{@@include} +directives, @file{default.awk} could simply contain @samp{@@include} +statements for the desired library functions. + +@c Exercise: make this change + +@node Language History, Gawk Summary, Sample Programs, Top +@chapter The Evolution of the @code{awk} Language + +This @value{DOCUMENT} describes the GNU implementation of @code{awk}, which follows +the POSIX specification. Many @code{awk} users are only familiar +with the original @code{awk} implementation in Version 7 Unix. +(This implementation was the basis for @code{awk} in Berkeley Unix, +through 4.3--Reno. The 4.4 release of Berkeley Unix uses @code{gawk} 2.15.2 +for its version of @code{awk}.) This chapter briefly describes the +evolution of the @code{awk} language, with cross references to other parts +of the @value{DOCUMENT} where you can find more information. + +@menu +* V7/SVR3.1:: The major changes between V7 and System V + Release 3.1. +* SVR4:: Minor changes between System V Releases 3.1 + and 4. +* POSIX:: New features from the POSIX standard. +* BTL:: New features from the Bell Laboratories + version of @code{awk}. +* POSIX/GNU:: The extensions in @code{gawk} not in POSIX + @code{awk}. +@end menu + +@node V7/SVR3.1, SVR4, Language History, Language History +@section Major Changes between V7 and SVR3.1 + +The @code{awk} language evolved considerably between the release of +Version 7 Unix (1978) and the new version first made generally available in +System V Release 3.1 (1987). This section summarizes the changes, with +cross-references to further details. + +@itemize @bullet +@item +The requirement for @samp{;} to separate rules on a line +(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}). 
+ +@item +User-defined functions, and the @code{return} statement +(@pxref{User-defined, ,User-defined Functions}). + +@item +The @code{delete} statement (@pxref{Delete, ,The @code{delete} Statement}). + +@item +The @code{do}-@code{while} statement +(@pxref{Do Statement, ,The @code{do}-@code{while} Statement}). + +@item +The built-in functions @code{atan2}, @code{cos}, @code{sin}, @code{rand} and +@code{srand} (@pxref{Numeric Functions, ,Numeric Built-in Functions}). + +@item +The built-in functions @code{gsub}, @code{sub}, and @code{match} +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). + +@item +The built-in functions @code{close}, and @code{system} +(@pxref{I/O Functions, ,Built-in Functions for Input/Output}). + +@item +The @code{ARGC}, @code{ARGV}, @code{FNR}, @code{RLENGTH}, @code{RSTART}, +and @code{SUBSEP} built-in variables (@pxref{Built-in Variables}). + +@item +The conditional expression using the ternary operator @samp{?:} +(@pxref{Conditional Exp, ,Conditional Expressions}). + +@item +The exponentiation operator @samp{^} +(@pxref{Arithmetic Ops, ,Arithmetic Operators}) and its assignment operator +form @samp{^=} (@pxref{Assignment Ops, ,Assignment Expressions}). + +@item +C-compatible operator precedence, which breaks some old @code{awk} +programs (@pxref{Precedence, ,Operator Precedence (How Operators Nest)}). + +@item +Regexps as the value of @code{FS} +(@pxref{Field Separators, ,Specifying How Fields are Separated}), and as the +third argument to the @code{split} function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). + +@item +Dynamic regexps as operands of the @samp{~} and @samp{!~} operators +(@pxref{Regexp Usage, ,How to Use Regular Expressions}). + +@item +The escape sequences @samp{\b}, @samp{\f}, and @samp{\r} +(@pxref{Escape Sequences}). 
+(Some vendors have updated their old versions of @code{awk} to +recognize @samp{\r}, @samp{\b}, and @samp{\f}, but this is not +something you can rely on.) + +@item +Redirection of input for the @code{getline} function +(@pxref{Getline, ,Explicit Input with @code{getline}}). + +@item +Multiple @code{BEGIN} and @code{END} rules +(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}). + +@item +Multi-dimensional arrays +(@pxref{Multi-dimensional, ,Multi-dimensional Arrays}). +@end itemize + +@node SVR4, POSIX, V7/SVR3.1, Language History +@section Changes between SVR3.1 and SVR4 + +@cindex @code{awk} language, V.4 version +The System V Release 4 version of Unix @code{awk} added these features +(some of which originated in @code{gawk}): + +@itemize @bullet +@item +The @code{ENVIRON} variable (@pxref{Built-in Variables}). + +@item +Multiple @samp{-f} options on the command line +(@pxref{Options, ,Command Line Options}). + +@item +The @samp{-v} option for assigning variables before program execution begins +(@pxref{Options, ,Command Line Options}). + +@item +The @samp{--} option for terminating command line options. + +@item +The @samp{\a}, @samp{\v}, and @samp{\x} escape sequences +(@pxref{Escape Sequences}). + +@item +A defined return value for the @code{srand} built-in function +(@pxref{Numeric Functions, ,Numeric Built-in Functions}). + +@item +The @code{toupper} and @code{tolower} built-in string functions +for case translation +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). + +@item +A cleaner specification for the @samp{%c} format-control letter in the +@code{printf} function +(@pxref{Control Letters, ,Format-Control Letters}). + +@item +The ability to dynamically pass the field width and precision (@code{"%*.*d"}) +in the argument list of the @code{printf} function +(@pxref{Control Letters, ,Format-Control Letters}). 
+ +@item +The use of regexp constants such as @code{/foo/} as expressions, where +they are equivalent to using the matching operator, as in @samp{$0 ~ /foo/} +(@pxref{Using Constant Regexps, ,Using Regular Expression Constants}). +@end itemize + +@node POSIX, BTL, SVR4, Language History +@section Changes between SVR4 and POSIX @code{awk} + +The POSIX Command Language and Utilities standard for @code{awk} +introduced the following changes into the language: + +@itemize @bullet +@item +The use of @samp{-W} for implementation-specific options. + +@item +The use of @code{CONVFMT} for controlling the conversion of numbers +to strings (@pxref{Conversion, ,Conversion of Strings and Numbers}). + +@item +The concept of a numeric string, and tighter comparison rules to go +with it (@pxref{Typing and Comparison, ,Variable Typing and Comparison Expressions}). + +@item +More complete documentation of many of the previously undocumented +features of the language. +@end itemize + +The following common extensions are not permitted by the POSIX +standard: + +@c IMPORTANT! Keep this list in sync with the one in node Options + +@itemize @bullet +@item +@code{\x} escape sequences are not recognized +(@pxref{Escape Sequences}). + +@item +Newlines do not act as whitespace to separate fields when @code{FS} is +equal to a single space. + +@item +The synonym @code{func} for the keyword @code{function} is not +recognized (@pxref{Definition Syntax, ,Function Definition Syntax}). + +@item +The operators @samp{**} and @samp{**=} cannot be used in +place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators}, +and also @pxref{Assignment Ops, ,Assignment Expressions}). + +@item +Specifying @samp{-Ft} on the command line does not set the value +of @code{FS} to be a single tab character +(@pxref{Field Separators, ,Specifying How Fields are Separated}). + +@item +The @code{fflush} built-in function is not supported +(@pxref{I/O Functions, , Built-in Functions for Input/Output}). 
+@end itemize + +@node BTL, POSIX/GNU, POSIX, Language History +@section Extensions in the Bell Laboratories @code{awk} + +@cindex Kernighan, Brian +Brian Kernighan, one of the original designers of Unix @code{awk}, +has made his version available via anonymous @code{ftp} +(@pxref{Other Versions, ,Other Freely Available @code{awk} Implementations}). +This section describes extensions in his version of @code{awk} that are +not in POSIX @code{awk}. + +@itemize @bullet +@item +The @samp{-mf @var{NNN}} and @samp{-mr @var{NNN}} command line options +to set the maximum number of fields, and the maximum +record size, respectively +(@pxref{Options, ,Command Line Options}). + +@item +The @code{fflush} built-in function for flushing buffered output +(@pxref{I/O Functions, ,Built-in Functions for Input/Output}). + +@ignore +@item +The @code{SYMTAB} array, that allows access to the internal symbol +table of @code{awk}. This feature is not documented, largely because +it is somewhat shakily implemented. For instance, you cannot access arrays +or array elements through it. +@end ignore +@end itemize + +@node POSIX/GNU, , BTL, Language History +@section Extensions in @code{gawk} Not in POSIX @code{awk} + +@cindex compatibility mode +The GNU implementation, @code{gawk}, adds a number of features. +This section lists them in the order they were added to @code{gawk}. +They can all be disabled with either the @samp{--traditional} or +@samp{--posix} options +(@pxref{Options, ,Command Line Options}). + +Version 2.10 of @code{gawk} introduced these features: + +@itemize @bullet +@item +The @code{AWKPATH} environment variable for specifying a path search for +the @samp{-f} command line option +(@pxref{Options, ,Command Line Options}). + +@item +The @code{IGNORECASE} variable and its effects +(@pxref{Case-sensitivity, ,Case-sensitivity in Matching}). 
+ +@item +The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr}, and +@file{/dev/fd/@var{n}} file name interpretation +(@pxref{Special Files, ,Special File Names in @code{gawk}}). +@end itemize + +Version 2.13 of @code{gawk} introduced these features: + +@itemize @bullet +@item +The @code{FIELDWIDTHS} variable and its effects +(@pxref{Constant Size, ,Reading Fixed-width Data}). + +@item +The @code{systime} and @code{strftime} built-in functions for obtaining +and printing time stamps +(@pxref{Time Functions, ,Functions for Dealing with Time Stamps}). + +@item +The @samp{-W lint} option to provide source code and run time error +and portability checking +(@pxref{Options, ,Command Line Options}). + +@item +The @samp{-W compat} option to turn off these extensions +(@pxref{Options, ,Command Line Options}). + +@item +The @samp{-W posix} option for full POSIX compliance +(@pxref{Options, ,Command Line Options}). +@end itemize + +Version 2.14 of @code{gawk} introduced these features: + +@itemize @bullet +@item +The @code{next file} statement for skipping to the next data file +(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}). +@end itemize + +Version 2.15 of @code{gawk} introduced these features: + +@itemize @bullet +@item +The @code{ARGIND} variable, that tracks the movement of @code{FILENAME} +through @code{ARGV} (@pxref{Built-in Variables}). + +@item +The @code{ERRNO} variable, that contains the system error message when +@code{getline} returns @minus{}1, or when @code{close} fails +(@pxref{Built-in Variables}). + +@item +The ability to use GNU-style long named options that start with @samp{--} +(@pxref{Options, ,Command Line Options}). + +@item +The @samp{--source} option for mixing command line and library +file source code +(@pxref{Options, ,Command Line Options}). + +@item +The @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}, and +@file{/dev/user} file name interpretation +(@pxref{Special Files, ,Special File Names in @code{gawk}}). 
+@end itemize + +Version 3.0 of @code{gawk} introduced these features: + +@itemize @bullet +@item +The @code{next file} statement became @code{nextfile} +(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}). + +@item +The @samp{--lint-old} option to +warn about constructs that are not available in +the original Version 7 Unix version of @code{awk} +(@pxref{V7/SVR3.1, , Major Changes between V7 and SVR3.1}). + +@item +The @samp{--traditional} option was added as a better name for +@samp{--compat} (@pxref{Options, ,Command Line Options}). + +@item +The ability for @code{FS} to be a null string, and for the third +argument to @code{split} to be the null string +(@pxref{Single Character Fields, , Making Each Character a Separate Field}). + +@item +The ability for @code{RS} to be a regexp +(@pxref{Records, , How Input is Split into Records}). + +@item +The @code{RT} variable +(@pxref{Records, , How Input is Split into Records}). + +@item +The @code{gensub} function for more powerful text manipulation +(@pxref{String Functions, , Built-in Functions for String Manipulation}). + +@item +The @code{strftime} function acquired a default time format, +allowing it to be called with no arguments +(@pxref{Time Functions, , Functions for Dealing with Time Stamps}). + +@item +Full support for both POSIX and GNU regexps +(@pxref{Regexp, , Regular Expressions}). + +@item +The @samp{--re-interval} option to provide interval expressions in regexps +(@pxref{Regexp Operators, , Regular Expression Operators}). + +@item +@code{IGNORECASE} changed, now applying to string comparison as well +as regexp operations +(@pxref{Case-sensitivity, ,Case-sensitivity in Matching}). + +@item +The @samp{-m} option and the @code{fflush} function from the +Bell Labs research version of @code{awk} +(@pxref{Options, ,Command Line Options}; also +@pxref{I/O Functions, ,Built-in Functions for Input/Output}). 
+ +@item +The use of GNU Autoconf to control the configuration process +(@pxref{Quick Installation, , Compiling @code{gawk} for Unix}). + +@item +Amiga support +(@pxref{Amiga Installation, ,Installing @code{gawk} on an Amiga}). + +@c XXX ADD MORE STUFF HERE + +@end itemize + +@node Gawk Summary, Installation, Language History, Top +@appendix @code{gawk} Summary + +This appendix provides a brief summary of the @code{gawk} command line and the +@code{awk} language. It is designed to serve as a ``quick reference.'' It is +therefore terse, but complete. + +@menu +* Command Line Summary:: Recapitulation of the command line. +* Language Summary:: A terse review of the language. +* Variables/Fields:: Variables, fields, and arrays. +* Rules Summary:: Patterns and Actions, and their component + parts. +* Actions Summary:: Quick overview of actions. +* Functions Summary:: Defining and calling functions. +* Historical Features:: Some undocumented but supported ``features''. +@end menu + +@node Command Line Summary, Language Summary, Gawk Summary, Gawk Summary +@appendixsec Command Line Options Summary + +The command line consists of options to @code{gawk} itself, the +@code{awk} program text (if not supplied via the @samp{-f} option), and +values to be made available in the @code{ARGC} and @code{ARGV} +predefined @code{awk} variables: + +@example +gawk @r{[@var{POSIX or GNU style options}]} -f @var{source-file} @r{[@code{--}]} @var{file} @dots{} +gawk @r{[@var{POSIX or GNU style options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{} +@end example + +The options that @code{gawk} accepts are: + +@table @code +@item -F @var{fs} +@itemx --field-separator @var{fs} +Use @var{fs} for the input field separator (the value of the @code{FS} +predefined variable). + +@item -f @var{program-file} +@itemx --file @var{program-file} +Read the @code{awk} program source from the file @var{program-file}, instead +of from the first command line argument. 
+ +@item -mf @var{NNN} +@itemx -mr @var{NNN} +The @samp{f} flag sets +the maximum number of fields, and the @samp{r} flag sets the maximum +record size. These options are ignored by @code{gawk}, since @code{gawk} +has no predefined limits; they are only for compatibility with the +Bell Labs research version of Unix @code{awk}. + +@item -v @var{var}=@var{val} +@itemx --assign @var{var}=@var{val} +Assign the variable @var{var} the value @var{val} before program execution +begins. + +@item -W traditional +@itemx -W compat +@itemx --traditional +@itemx --compat +Use compatibility mode, in which @code{gawk} extensions are turned +off. + +@item -W copyleft +@itemx -W copyright +@itemx --copyleft +@itemx --copyright +Print the short version of the General Public License on the standard +output, and exit. This option may disappear in a future version of @code{gawk}. + +@item -W help +@itemx -W usage +@itemx --help +@itemx --usage +Print a relatively short summary of the available options on the standard +output, and exit. + +@item -W lint +@itemx --lint +Give warnings about dubious or non-portable @code{awk} constructs. + +@item -W lint-old +@itemx --lint-old +Warn about constructs that are not available in +the original Version 7 Unix version of @code{awk}. + +@item -W posix +@itemx --posix +Use POSIX compatibility mode, in which @code{gawk} extensions +are turned off and additional restrictions apply. + +@item -W re-interval +@itemx --re-interval +Allow interval expressions +(@pxref{Regexp Operators, , Regular Expression Operators}), +in regexps. + +@item -W source=@var{program-text} +@itemx --source @var{program-text} +Use @var{program-text} as @code{awk} program source code. This option allows +mixing command line source code with source code from files, and is +particularly useful for mixing command line programs with library functions. + +@item -W version +@itemx --version +Print version information for this particular copy of @code{gawk} on the error +output. 
+ +@item -- +Signal the end of options. This is useful to allow further arguments to the +@code{awk} program itself to start with a @samp{-}. This is mainly for +consistency with POSIX argument parsing conventions. +@end table + +Any other options are flagged as invalid, but are otherwise ignored. +@xref{Options, ,Command Line Options}, for more details. + +@node Language Summary, Variables/Fields, Command Line Summary, Gawk Summary +@appendixsec Language Summary + +An @code{awk} program consists of a sequence of zero or more pattern-action +statements and optional function definitions. One or the other of the +pattern and action may be omitted. + +@example +@var{pattern} @{ @var{action statements} @} +@var{pattern} + @{ @var{action statements} @} + +function @var{name}(@var{parameter list}) @{ @var{action statements} @} +@end example + +@code{gawk} first reads the program source from the +@var{program-file}(s), if specified, or from the first non-option +argument on the command line. The @samp{-f} option may be used multiple +times on the command line. @code{gawk} reads the program text from all +the @var{program-file} files, effectively concatenating them in the +order they are specified. This is useful for building libraries of +@code{awk} functions, without having to include them in each new +@code{awk} program that uses them. To use a library function in a file +from a program typed in on the command line, specify +@samp{--source '@var{program}'}, and type your program in between the single +quotes. +@xref{Options, ,Command Line Options}. + +The environment variable @code{AWKPATH} specifies a search path to use +when finding source files named with the @samp{-f} option. The default +path, which is +@samp{.:/usr/local/share/awk}@footnote{The path may use a directory +other than @file{/usr/local/share/awk}, depending upon how @code{gawk} +was built and installed.} is used if @code{AWKPATH} is not set. 
+If a file name given to the @samp{-f} option contains a @samp{/} character, +no path search is performed. +@xref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}. + +@code{gawk} compiles the program into an internal form, and then proceeds to +read each file named in the @code{ARGV} array. +The initial values of @code{ARGV} come from the command line arguments. +If there are no files named +on the command line, @code{gawk} reads the standard input. + +If a ``file'' named on the command line has the form +@samp{@var{var}=@var{val}}, it is treated as a variable assignment: the +variable @var{var} is assigned the value @var{val}. +If any of the files have a value that is the null string, that +element in the list is skipped. + +For each record in the input, @code{gawk} tests to see if it matches any +@var{pattern} in the @code{awk} program. For each pattern that the record +matches, the associated @var{action} is executed. + +@node Variables/Fields, Rules Summary, Language Summary, Gawk Summary +@appendixsec Variables and Fields + +@code{awk} variables are not declared; they come into existence when they are +first used. Their values are either floating-point numbers or strings. +@code{awk} also has one-dimensional arrays; multiple-dimensional arrays +may be simulated. There are several predefined variables that +@code{awk} sets as a program runs; these are summarized below. + +@menu +* Fields Summary:: Input field splitting. +* Built-in Summary:: @code{awk}'s built-in variables. +* Arrays Summary:: Using arrays. +* Data Type Summary:: Values in @code{awk} are numbers or strings. +@end menu + +@node Fields Summary, Built-in Summary, Variables/Fields, Variables/Fields +@appendixsubsec Fields + +As each input line is read, @code{gawk} splits the line into +@var{fields}, using the value of the @code{FS} variable as the field +separator. If @code{FS} is a single character, fields are separated by +that character. 
Otherwise, @code{FS} is expected to be a full regular +expression. In the special case that @code{FS} is a single space, +fields are separated by runs of spaces, tabs and/or newlines.@footnote{In +POSIX @code{awk}, newline does not separate fields.} +If @code{FS} is the null string (@code{""}), then each individual +character in the record becomes a separate field. +Note that the value +of @code{IGNORECASE} (@pxref{Case-sensitivity, ,Case-sensitivity in Matching}) +also affects how fields are split when @code{FS} is a regular expression. + +Each field in the input line may be referenced by its position, @code{$1}, +@code{$2}, and so on. @code{$0} is the whole line. The value of a field may +be assigned to as well. Field numbers need not be constants: + +@example +n = 5 +print $n +@end example + +@noindent +prints the fifth field in the input line. The variable @code{NF} is set to +the total number of fields in the input line. + +References to non-existent fields (i.e.@: fields after @code{$NF}) return +the null string. However, assigning to a non-existent field (e.g., +@code{$(NF+2) = 5}) increases the value of @code{NF}, creates any +intervening fields with the null string as their value, and causes the +value of @code{$0} to be recomputed, with the fields being separated by +the value of @code{OFS}. +Decrementing @code{NF} causes the values of fields past the new value to +be lost, and the value of @code{$0} to be recomputed, with the fields being +separated by the value of @code{OFS}. +@xref{Reading Files, ,Reading Input Files}. + +@node Built-in Summary, Arrays Summary, Fields Summary, Variables/Fields +@appendixsubsec Built-in Variables + +@code{gawk}'s built-in variables are: + +@table @code +@item ARGC +The number of elements in @code{ARGV}. See below for what is actually +included in @code{ARGV}. + +@item ARGIND +The index in @code{ARGV} of the current file being processed. 
+When @code{gawk} is processing the input data files, +it is always true that @samp{FILENAME == ARGV[ARGIND]}. + +@item ARGV +The array of command line arguments. The array is indexed from zero to +@code{ARGC} @minus{} 1. Dynamically changing @code{ARGC} and +the contents of @code{ARGV} +can control the files used for data. A null-valued element in +@code{ARGV} is ignored. @code{ARGV} does not include the options to +@code{awk} or the text of the @code{awk} program itself. + +@item CONVFMT +The conversion format to use when converting numbers to strings. + +@item FIELDWIDTHS +A space separated list of numbers describing the fixed-width input data. + +@item ENVIRON +An array of environment variable values. The array +is indexed by variable name, each element being the value of that +variable. Thus, the environment variable @code{HOME} is +@code{ENVIRON["HOME"]}. One possible value might be @file{/home/arnold}. + +Changing this array does not affect the environment seen by programs +which @code{gawk} spawns via redirection or the @code{system} function. +(This may change in a future version of @code{gawk}.) + +Some operating systems do not have environment variables. +The @code{ENVIRON} array is empty when running on these systems. + +@item ERRNO +The system error message when an error occurs using @code{getline} +or @code{close}. + +@item FILENAME +The name of the current input file. If no files are specified on the command +line, the value of @code{FILENAME} is the null string. + +@item FNR +The input record number in the current input file. + +@item FS +The input field separator, a space by default. + +@item IGNORECASE +The case-sensitivity flag for string comparisons and regular expression +operations. 
If @code{IGNORECASE} has a non-zero value, then pattern +matching in rules, record separating with @code{RS}, field splitting +with @code{FS}, regular expression matching with @samp{~} and +@samp{!~}, and the @code{gensub}, @code{gsub}, @code{index}, +@code{match}, @code{split} and @code{sub} built-in functions all +ignore case when doing regular expression operations, and all string +comparisons are done ignoring case. +The value of @code{IGNORECASE} does @emph{not} affect array subscripting. + +@item NF +The number of fields in the current input record. + +@item NR +The total number of input records seen so far. + +@item OFMT +The output format for numbers for the @code{print} statement, +@code{"%.6g"} by default. + +@item OFS +The output field separator, a space by default. + +@item ORS +The output record separator, by default a newline. + +@item RS +The input record separator, by default a newline. +If @code{RS} is set to the null string, then records are separated by +blank lines. When @code{RS} is set to the null string, then the newline +character always acts as a field separator, in addition to whatever value +@code{FS} may have. If @code{RS} is set to a multi-character +string, it denotes a regexp; input text matching the regexp +separates records. + +@item RT +The input text that matched the text denoted by @code{RS}, +the record separator. + +@item RSTART +The index of the first character last matched by @code{match}; zero if no match. + +@item RLENGTH +The length of the string last matched by @code{match}; @minus{}1 if no match. + +@item SUBSEP +The string used to separate multiple subscripts in array elements, by +default @code{"\034"}. +@end table + +@xref{Built-in Variables}, for more information. + +@node Arrays Summary, Data Type Summary, Built-in Summary, Variables/Fields +@appendixsubsec Arrays + +Arrays are subscripted with an expression between square brackets +(@samp{[} and @samp{]}). 
Array subscripts are @emph{always} strings; +numbers are converted to strings as necessary, following the standard +conversion rules +(@pxref{Conversion, ,Conversion of Strings and Numbers}). + +If you use multiple expressions separated by commas inside the square +brackets, then the array subscript is a string consisting of the +concatenation of the individual subscript values, converted to strings, +separated by the subscript separator (the value of @code{SUBSEP}). + +The special operator @code{in} may be used in a conditional context +to see if an array has an index consisting of a particular value. + +@example +if (val in array) + print array[val] +@end example + +If the array has multiple subscripts, use @samp{(i, j, @dots{}) in @var{array}} +to test for existence of an element. + +The @code{in} construct may also be used in a @code{for} loop to iterate +over all the elements of an array. +@xref{Scanning an Array, ,Scanning All Elements of an Array}. + +You can remove an element from an array using the @code{delete} statement. + +You can clear an entire array using @samp{delete @var{array}}. + +@xref{Arrays, ,Arrays in @code{awk}}. + +@node Data Type Summary, , Arrays Summary, Variables/Fields +@appendixsubsec Data Types + +The value of an @code{awk} expression is always either a number +or a string. + +Some contexts (such as arithmetic operators) require numeric +values. They convert strings to numbers by interpreting the text +of the string as a number. If the string does not look like a +number, it converts to zero. + +Other contexts (such as concatenation) require string values. +They convert numbers to strings by effectively printing them +with @code{sprintf}. +@xref{Conversion, ,Conversion of Strings and Numbers}, for the details. + +To force conversion of a string value to a number, simply add zero +to it. If the value you start with is already a number, this +does not change it. 
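As a small illustration of the add-zero idiom just described, here is a minimal sketch. The shell function name @code{force_number} and the sample strings are assumptions made up for this example; they are not part of @code{gawk} or of the original text. Only a @code{BEGIN} rule is used, so no input is read.

```shell
# Hypothetical helper (not part of gawk): print the numeric value that
# awk obtains from the string "$1" when zero is added to it.
force_number() {
    # The program has only a BEGIN rule, so awk never tries to open
    # ARGV[1] as a data file; it is used purely as a string value.
    awk 'BEGIN { print ARGV[1] + 0 }' "$1"
}

force_number "3.50"   # prints 3.5 (text interpreted as a number)
force_number "abc"    # prints 0   (does not look like a number)
force_number "42"     # prints 42  (value is unchanged)
```

The helper should behave the same way in any POSIX-conforming @code{awk}, since it relies on no @code{gawk} extensions.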
+ +To force conversion of a numeric value to a string, concatenate it with +the null string. + +Comparisons are done numerically if both operands are numeric, or if +one is numeric and the other is a numeric string. Otherwise one or +both operands are converted to strings and a string comparison is +performed. Fields, @code{getline} input, @code{FILENAME}, @code{ARGV} +elements, @code{ENVIRON} elements and the elements of an array created +by @code{split} are the only items that can be numeric strings. String +constants, such as @code{"3.1415927"} are not numeric strings, they are +string constants. The full rules for comparisons are described in +@ref{Typing and Comparison, ,Variable Typing and Comparison Expressions}. + +Uninitialized variables have the string value @code{""} (the null, or +empty, string). In contexts where a number is required, this is +equivalent to zero. + +@xref{Variables}, for more information on variable naming and initialization; +@pxref{Conversion, ,Conversion of Strings and Numbers}, for more information +on how variable values are interpreted. + +@node Rules Summary, Actions Summary, Variables/Fields, Gawk Summary +@appendixsec Patterns + +@menu +* Pattern Summary:: Quick overview of patterns. +* Regexp Summary:: Quick overview of regular expressions. +@end menu + +An @code{awk} program is mostly composed of rules, each consisting of a +pattern followed by an action. The action is enclosed in @samp{@{} and +@samp{@}}. Either the pattern may be missing, or the action may be +missing, but not both. If the pattern is missing, the +action is executed for every input record. A missing action is +equivalent to @samp{@w{@{ print @}}}, which prints the entire line. + +@c These paragraphs repeated for both patterns and actions. I don't +@c like this, but I also don't see any way around it. Update both copies +@c if they need fixing. +Comments begin with the @samp{#} character, and continue until the end of the +line. 
Blank lines may be used to separate statements. Statements normally +end with a newline; however, this is not the case for lines ending in a +@samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}. Lines +ending in @code{do} or @code{else} also have their statements automatically +continued on the following line. In other cases, a line can be continued by +ending it with a @samp{\}, in which case the newline is ignored. + +Multiple statements may be put on one line by separating each one with +a @samp{;}. +This applies to both the statements within the action part of a rule (the +usual case), and to the rule statements. + +@xref{Comments, ,Comments in @code{awk} Programs}, for information on +@code{awk}'s commenting convention; +@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a +description of the line continuation mechanism in @code{awk}. + +@node Pattern Summary, Regexp Summary, Rules Summary, Rules Summary +@appendixsubsec Pattern Summary + +@code{awk} patterns may be one of the following: + +@example +/@var{regular expression}/ +@var{relational expression} +@var{pattern} && @var{pattern} +@var{pattern} || @var{pattern} +@var{pattern} ? @var{pattern} : @var{pattern} +(@var{pattern}) +! @var{pattern} +@var{pattern1}, @var{pattern2} +BEGIN +END +@end example + +@code{BEGIN} and @code{END} are two special kinds of patterns that are not +tested against the input. The action parts of all @code{BEGIN} rules are +concatenated as if all the statements had been written in a single @code{BEGIN} +rule. They are executed before any of the input is read. Similarly, all the +@code{END} rules are concatenated, and executed when all the input is exhausted (or +when an @code{exit} statement is executed). @code{BEGIN} and @code{END} +patterns cannot be combined with other patterns in pattern expressions. +@code{BEGIN} and @code{END} rules cannot have missing action parts. 
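The ordering of @code{BEGIN}, ordinary, and @code{END} rules described above can be sketched as follows. This is only an illustration, assuming a POSIX @code{awk} on the search path; the function name @code{begin_end_demo} and the three-line input are made up for the example:

```shell
# Hypothetical helper; assumes a POSIX awk is installed as "awk".
begin_end_demo() {
    printf 'a\nb\nc\n' | awk '
        BEGIN { print "start" }          # runs before any input is read
        BEGIN { n = 0 }                  # second BEGIN rule, run after the first
        { n++ }                          # no pattern: runs for every record
        END   { print n " records" }     # runs once the input is exhausted
        END   { print "done" }           # second END rule, run after the first
    '
}
begin_end_demo
```

The two @code{BEGIN} rules behave as if written as one, as do the two @code{END} rules, so the output should be @samp{start}, then @samp{3 records}, then @samp{done}.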
+ +For @code{/@var{regular-expression}/} patterns, the associated statement is +executed for each input record that matches the regular expression. Regular +expressions are summarized below. + +A @var{relational expression} may use any of the operators defined below in +the section on actions. These generally test whether certain fields match +certain regular expressions. + +The @samp{&&}, @samp{||}, and @samp{!} operators are logical ``and,'' +logical ``or,'' and logical ``not,'' respectively, as in C. They do +short-circuit evaluation, also as in C, and are used for combining more +primitive pattern expressions. As in most languages, parentheses may be +used to change the order of evaluation. + +The @samp{?:} operator is like the same operator in C. If the first +pattern matches, then the second pattern is matched against the input +record; otherwise, the third is matched. Only one of the second and +third patterns is matched. + +The @samp{@var{pattern1}, @var{pattern2}} form of a pattern is called a +range pattern. It matches all input lines starting with a line that +matches @var{pattern1}, and continuing until a line that matches +@var{pattern2}, inclusive. A range pattern cannot be used as an operand +of any of the pattern operators. + +@xref{Pattern Overview, ,Pattern Elements}. + +@node Regexp Summary, , Pattern Summary, Rules Summary +@appendixsubsec Regular Expressions + +Regular expressions are based on POSIX EREs (extended regular expressions). +The escape sequences allowed in string constants are also valid in +regular expressions (@pxref{Escape Sequences}). +Regexps are composed of characters as follows: + +@table @code +@item @var{c} +matches the character @var{c} (assuming @var{c} is none of the characters +listed below). + +@item \@var{c} +matches the literal character @var{c}. + +@item . +matches any character, @emph{including} newline. 
+In strict POSIX mode, @samp{.} does not match the @sc{nul} +character, which is a character with all bits equal to zero. + +@item ^ +matches the beginning of a string. + +@item $ +matches the end of a string. + +@item [@var{abc}@dots{}] +matches any of the characters @var{abc}@dots{} (character list). + +@item [[:@var{class}:]] +matches any character in the character class @var{class}. Allowable classes +are @code{alnum}, @code{alpha}, @code{blank}, @code{cntrl}, +@code{digit}, @code{graph}, @code{lower}, @code{print}, @code{punct}, +@code{space}, @code{upper}, and @code{xdigit}. + +@item [[.@var{symbol}.]] +matches the multi-character collating symbol @var{symbol}. +@code{gawk} does not currently support collating symbols. + +@item [[=@var{classname}=]] +matches any of the equivalent characters in the current locale named by the +equivalence class @var{classname}. +@code{gawk} does not currently support equivalence classes. + +@item [^@var{abc}@dots{}] +matches any character except @var{abc}@dots{} (negated +character list). + +@item @var{r1}|@var{r2} +matches either @var{r1} or @var{r2} (alternation). + +@item @var{r1r2} +matches @var{r1}, and then @var{r2} (concatenation). + +@item @var{r}+ +matches one or more @var{r}'s. + +@item @var{r}* +matches zero or more @var{r}'s. + +@item @var{r}? +matches zero or one @var{r}'s. + +@item (@var{r}) +matches @var{r} (grouping). + +@item @var{r}@{@var{n}@} +@itemx @var{r}@{@var{n},@} +@itemx @var{r}@{@var{n},@var{m}@} +matches exactly @var{n}, @var{n} or more, or @var{n} to @var{m} +occurrences of @var{r}, respectively (interval expressions). + +@item \y +matches the empty string at either the beginning or the +end of a word. + +@item \B +matches the empty string within a word. + +@item \< +matches the empty string at the beginning of a word. + +@item \> +matches the empty string at the end of a word. + +@item \w +matches any word-constituent character (alphanumeric characters and +the underscore).
+ +@item \W +matches any character that is not word-constituent. + +@item \` +matches the empty string at the beginning of a buffer (same as a string +in @code{gawk}). + +@item \' +matches the empty string at the end of a buffer. +@end table + +The various command line options +control how @code{gawk} interprets characters in regexps. + +@c NOTE!!! Keep this in sync with the same table in the regexp chapter! +@table @asis +@item No options +In the default case, @code{gawk} provides all the facilities of +POSIX regexps and the GNU regexp operators described above. +However, interval expressions are not supported. + +@item @code{--posix} +Only POSIX regexps are supported; the GNU operators are not special +(e.g., @samp{\w} matches a literal @samp{w}). Interval expressions +are allowed. + +@item @code{--traditional} +Traditional Unix @code{awk} regexps are matched. The GNU operators +are not special, interval expressions are not available, and neither +are the POSIX character classes (@code{[[:alnum:]]} and so on). +Characters described by octal and hexadecimal escape sequences are +treated literally, even if they represent regexp metacharacters. + +@item @code{--re-interval} +Allow interval expressions in regexps, even if @samp{--traditional} +has been provided. +@end table + +@xref{Regexp, ,Regular Expressions}. + +@node Actions Summary, Functions Summary, Rules Summary, Gawk Summary +@appendixsec Actions + +Action statements are enclosed in braces, @samp{@{} and @samp{@}}. +A missing action statement is equivalent to @samp{@w{@{ print @}}}. + +Action statements consist of the usual assignment, conditional, and looping +statements found in most languages. The operators, control statements, +and Input/Output statements available are similar to those in C. + +@c These paragraphs repeated for both patterns and actions. I don't +@c like this, but I also don't see any way around it. Update both copies +@c if they need fixing.
+Comments begin with the @samp{#} character, and continue until the end of the +line. Blank lines may be used to separate statements. Statements normally +end with a newline; however, this is not the case for lines ending in a +@samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}. Lines +ending in @code{do} or @code{else} also have their statements automatically +continued on the following line. In other cases, a line can be continued by +ending it with a @samp{\}, in which case the newline is ignored. + +Multiple statements may be put on one line by separating each one with +a @samp{;}. +This applies to both the statements within the action part of a rule (the +usual case), and to the rule statements. + +@xref{Comments, ,Comments in @code{awk} Programs}, for information on +@code{awk}'s commenting convention; +@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a +description of the line continuation mechanism in @code{awk}. + +@menu +* Operator Summary:: @code{awk} operators. +* Control Flow Summary:: The control statements. +* I/O Summary:: The I/O statements. +* Printf Summary:: A summary of @code{printf}. +* Special File Summary:: Special file names interpreted internally. +* Built-in Functions Summary:: Built-in numeric and string functions. +* Time Functions Summary:: Built-in time functions. +* String Constants Summary:: Escape sequences in strings. +@end menu + +@node Operator Summary, Control Flow Summary, Actions Summary, Actions Summary +@appendixsubsec Operators + +The operators in @code{awk}, in order of decreasing precedence, are: + +@table @code +@item (@dots{}) +Grouping. + +@item $ +Field reference. + +@item ++ -- +Increment and decrement, both prefix and postfix. + +@item ^ +Exponentiation (@samp{**} may also be used, and @samp{**=} for the assignment +operator, but they are not specified in the POSIX standard). + +@item + - ! +Unary plus, unary minus, and logical negation. 
+ +@item * / % +Multiplication, division, and modulus. + +@item + - +Addition and subtraction. + +@item @var{space} +String concatenation. + +@item < <= > >= != == +The usual relational operators. + +@item ~ !~ +Regular expression match, negated match. + +@item in +Array membership. + +@item && +Logical ``and''. + +@item || +Logical ``or''. + +@item ?: +A conditional expression. This has the form @samp{@var{expr1} ? +@var{expr2} : @var{expr3}}. If @var{expr1} is true, the value of the +expression is @var{expr2}; otherwise it is @var{expr3}. Only one of +@var{expr2} and @var{expr3} is evaluated. + +@item = += -= *= /= %= ^= +Assignment. Both absolute assignment (@code{@var{var}=@var{value}}) +and operator assignment (the other forms) are supported. +@end table + +@xref{Expressions}. + +@node Control Flow Summary, I/O Summary, Operator Summary, Actions Summary +@appendixsubsec Control Statements + +The control statements are as follows: + +@example +if (@var{condition}) @var{statement} @r{[} else @var{statement} @r{]} +while (@var{condition}) @var{statement} +do @var{statement} while (@var{condition}) +for (@var{expr1}; @var{expr2}; @var{expr3}) @var{statement} +for (@var{var} in @var{array}) @var{statement} +break +continue +delete @var{array}[@var{index}] +delete @var{array} +exit @r{[} @var{expression} @r{]} +@{ @var{statements} @} +@end example + +@xref{Statements, ,Control Statements in Actions}. + +@node I/O Summary, Printf Summary, Control Flow Summary, Actions Summary +@appendixsubsec I/O Statements + +The Input/Output statements are as follows: + +@table @code +@item getline +Set @code{$0} from next input record; set @code{NF}, @code{NR}, @code{FNR}. +@xref{Getline, ,Explicit Input with @code{getline}}. + +@item getline <@var{file} +Set @code{$0} from next record of @var{file}; set @code{NF}. + +@item getline @var{var} +Set @var{var} from next input record; set @code{NR}, @code{FNR}. 
+ +@item getline @var{var} <@var{file} +Set @var{var} from next record of @var{file}. + +@item @var{command} | getline +Run @var{command}, piping its output into @code{getline}; sets @code{$0}, +@code{NF}, @code{NR}. + +@item @var{command} | getline @var{var} +Run @var{command}, piping its output into @code{getline}; sets @var{var}. + +@item next +Stop processing the current input record. The next input record is read and +processing starts over with the first pattern in the @code{awk} program. +If the end of the input data is reached, the @code{END} rule(s), if any, +are executed. +@xref{Next Statement, ,The @code{next} Statement}. + +@item nextfile +Stop processing the current input file. The next input record read comes +from the next input file. @code{FILENAME} is updated, @code{FNR} is set to one, +@code{ARGIND} is incremented, +and processing starts over with the first pattern in the @code{awk} program. +If the end of the input data is reached, the @code{END} rule(s), if any, +are executed. +Earlier versions of @code{gawk} used @samp{next file}; this usage is still +supported, but is deprecated. +@xref{Nextfile Statement, ,The @code{nextfile} Statement}. + +@item print +Prints the current record. +@xref{Printing, ,Printing Output}. + +@item print @var{expr-list} +Prints expressions. + +@item print @var{expr-list} > @var{file} +Prints expressions to @var{file}. If @var{file} does not exist, it is +created. If it does exist, its contents are deleted the first time the +@code{print} is executed. + +@item print @var{expr-list} >> @var{file} +Prints expressions to @var{file}. The previous contents of @var{file} +are retained, and the output of @code{print} is appended to the file. + +@item print @var{expr-list} | @var{command} +Prints expressions, sending the output down a pipe to @var{command}. +The pipeline to the command stays open until the @code{close} function +is called. + +@item printf @var{fmt, expr-list} +Format and print.
+ +@item printf @var{fmt, expr-list} > @var{file} +Format and print to @var{file}. If @var{file} does not exist, it is +created. If it does exist, its contents are deleted the first time the +@code{printf} is executed. + +@item printf @var{fmt, expr-list} >> @var{file} +Format and print to @var{file}. The previous contents of @var{file} +are retained, and the output of @code{printf} is appended to the file. + +@item printf @var{fmt, expr-list} | @var{command} +Format and print, sending the output down a pipe to @var{command}. +The pipeline to the command stays open until the @code{close} function +is called. +@end table + +@code{getline} returns one if it reads a record, zero on end of file, and @minus{}1 on an error. +In the event of an error, @code{getline} will set @code{ERRNO} to +the value of a system-dependent string that describes the error. + +@node Printf Summary, Special File Summary, I/O Summary, Actions Summary +@appendixsubsec @code{printf} Summary + +Conversion specifications have the form +@code{%}[@var{flag}][@var{width}][@code{.}@var{prec}]@var{format}. +@c whew! +Items in brackets are optional. + +The @code{awk} @code{printf} statement and @code{sprintf} function +accept the following conversion specification formats: + +@table @code +@item %c +An ASCII character. If the argument used for @samp{%c} is numeric, it is +treated as a character and printed. Otherwise, the argument is assumed to +be a string, and only the first character of that string is printed. + +@item %d +@itemx %i +A decimal number (the integer part). + +@item %e +@itemx %E +A floating point number of the form +@samp{@r{[}-@r{]}d.dddddde@r{[}+-@r{]}dd}. +The @samp{%E} format uses @samp{E} instead of @samp{e}. + +@item %f +A floating point number of the form +@r{[}@code{-}@r{]}@code{ddd.dddddd}. + +@item %g +@itemx %G +Use either the @samp{%e} or @samp{%f} format, whichever produces a shorter +string, with non-significant zeros suppressed. +@samp{%G} will use @samp{%E} instead of @samp{%e}.
+ +@item %o +An unsigned octal number (again, an integer). + +@item %s +A character string. + +@item %x +@itemx %X +An unsigned hexadecimal number (an integer). +The @samp{%X} format uses @samp{A} through @samp{F} instead of +@samp{a} through @samp{f} for decimal 10 through 15. + +@item %% +A single @samp{%} character; no argument is converted. +@end table + +There are optional, additional parameters that may lie between the @samp{%} +and the control letter: + +@table @code +@item - +The expression should be left-justified within its field. + +@item @var{space} +For numeric conversions, prefix positive values with a space, and +negative values with a minus sign. + +@item + +The plus sign, used before the width modifier (see below), +says to always supply a sign for numeric conversions, even if the data +to be formatted is positive. The @samp{+} overrides the space modifier. + +@item # +Use an ``alternate form'' for certain control letters. +For @samp{o}, supply a leading zero. +For @samp{x}, and @samp{X}, supply a leading @samp{0x} or @samp{0X} for +a non-zero result. +For @samp{e}, @samp{E}, and @samp{f}, the result will always contain a +decimal point. +For @samp{g}, and @samp{G}, trailing zeros are not removed from the result. + +@item 0 +A leading @samp{0} (zero) acts as a flag, that indicates output should be +padded with zeros instead of spaces. +This applies even to non-numeric output formats. +This flag only has an effect when the field width is wider than the +value to be printed. + +@item @var{width} +The field should be padded to this width. The field is normally padded +with spaces. If the @samp{0} flag has been used, it is padded with zeros. + +@item .@var{prec} +A number that specifies the precision to use when printing. +For the @samp{e}, @samp{E}, and @samp{f} formats, this specifies the +number of digits you want printed to the right of the decimal point. 
+For the @samp{g} and @samp{G} formats, it specifies the maximum number +of significant digits. For the @samp{d}, @samp{o}, @samp{i}, @samp{u}, +@samp{x}, and @samp{X} formats, it specifies the minimum number of +digits to print. For the @samp{s} format, it specifies the maximum number of +characters from the string that should be printed. +@end table + +Either or both of the @var{width} and @var{prec} values may be specified +as @samp{*}. In that case, the particular value is taken from the argument +list. + +@xref{Printf, ,Using @code{printf} Statements for Fancier Printing}. + +@node Special File Summary, Built-in Functions Summary, Printf Summary, Actions Summary +@appendixsubsec Special File Names + +When doing I/O redirection from either @code{print} or @code{printf} into a +file, or via @code{getline} from a file, @code{gawk} recognizes certain special +file names internally. These file names allow access to open file descriptors +inherited from @code{gawk}'s parent process (usually the shell). The +file names are: + +@table @file +@item /dev/stdin +The standard input. + +@item /dev/stdout +The standard output. + +@item /dev/stderr +The standard error output. + +@item /dev/fd/@var{n} +The file denoted by the open file descriptor @var{n}. +@end table + +In addition, reading the following files provides process-related information +about the running @code{gawk} program. All returned records are terminated +with a newline. + +@table @file +@item /dev/pid +Returns the process ID of the current process. + +@item /dev/ppid +Returns the parent process ID of the current process. + +@item /dev/pgrpid +Returns the process group ID of the current process. + +@item /dev/user +At least four space-separated fields, containing the return values of +the @code{getuid}, @code{geteuid}, @code{getgid}, and @code{getegid} +system calls. +If there are any additional fields, they are the group IDs returned by the +@code{getgroups} system call.
+(Multiple groups may not be supported on all systems.) +@end table + +@noindent +These file names may also be used on the command line to name data files. +These file names are only recognized internally if you do not +actually have files with these names on your system. + +@xref{Special Files, ,Special File Names in @code{gawk}}, for a longer description that +provides the motivation for this feature. + +@node Built-in Functions Summary, Time Functions Summary, Special File Summary, Actions Summary +@appendixsubsec Built-in Functions + +@code{awk} provides a number of built-in functions for performing +numeric operations, string related operations, and I/O related operations. + +The built-in arithmetic functions are: + +@table @code +@item atan2(@var{y}, @var{x}) +the arctangent of @var{y/x} in radians. + +@item cos(@var{expr}) +the cosine of @var{expr}, which is in radians. + +@item exp(@var{expr}) +the exponential function (@code{e ^ @var{expr}}). + +@item int(@var{expr}) +truncates to integer. + +@item log(@var{expr}) +the natural logarithm of @code{expr}. + +@item rand() +a random number between zero and one. + +@item sin(@var{expr}) +the sine of @var{expr}, which is in radians. + +@item sqrt(@var{expr}) +the square root function. + +@item srand(@r{[}@var{expr}@r{]}) +use @var{expr} as a new seed for the random number generator. If no @var{expr} +is provided, the time of day is used. The return value is the previous +seed for the random number generator. +@end table + +@code{awk} has the following built-in string functions: + +@table @code +@item gensub(@var{regex}, @var{subst}, @var{how} @r{[}, @var{target}@r{]}) +If @var{how} is a string beginning with @samp{g} or @samp{G}, then +replace each match of @var{regex} in @var{target} with @var{subst}. +Otherwise, replace the @var{how}'th occurrence. If @var{target} is not +supplied, use @code{$0}. The return value is the changed string; the +original @var{target} is not modified. 
Within @var{subst}, +@samp{\@var{n}}, where @var{n} is a digit from one to nine, can be used to +indicate the text that matched the @var{n}'th parenthesized +subexpression. +This function is @code{gawk}-specific. + +@item gsub(@var{regex}, @var{subst} @r{[}, @var{target}@r{]}) +for each substring matching the regular expression @var{regex} in the string +@var{target}, substitute the string @var{subst}, and return the number of +substitutions. If @var{target} is not supplied, use @code{$0}. + +@item index(@var{str}, @var{search}) +returns the index of the string @var{search} in the string @var{str}, or +zero if +@var{search} is not present. + +@item length(@r{[}@var{str}@r{]}) +returns the length of the string @var{str}. The length of @code{$0} +is returned if no argument is supplied. + +@item match(@var{str}, @var{regex}) +returns the position in @var{str} where the regular expression @var{regex} +occurs, or zero if @var{regex} is not present, and sets the values of +@code{RSTART} and @code{RLENGTH}. + +@item split(@var{str}, @var{arr} @r{[}, @var{regex}@r{]}) +splits the string @var{str} into the array @var{arr} on the regular expression +@var{regex}, and returns the number of elements. If @var{regex} is omitted, +@code{FS} is used instead. @var{regex} can be the null string, causing +each character to be placed into its own array element. +The array @var{arr} is cleared first. + +@item sprintf(@var{fmt}, @var{expr-list}) +prints @var{expr-list} according to @var{fmt}, and returns the resulting string. + +@item sub(@var{regex}, @var{subst} @r{[}, @var{target}@r{]}) +just like @code{gsub}, but only the first matching substring is replaced. + +@item substr(@var{str}, @var{index} @r{[}, @var{len}@r{]}) +returns the @var{len}-character substring of @var{str} starting at @var{index}. +If @var{len} is omitted, the rest of @var{str} is used. 
+ +@item tolower(@var{str}) +returns a copy of the string @var{str}, with all the upper-case characters in +@var{str} translated to their corresponding lower-case counterparts. +Non-alphabetic characters are left unchanged. + +@item toupper(@var{str}) +returns a copy of the string @var{str}, with all the lower-case characters in +@var{str} translated to their corresponding upper-case counterparts. +Non-alphabetic characters are left unchanged. +@end table + +The I/O related functions are: + +@table @code +@item close(@var{expr}) +Close the open file or pipe denoted by @var{expr}. + +@item fflush(@r{[}@var{expr}@r{]}) +Flush any buffered output for the output file or pipe denoted by @var{expr}. +If @var{expr} is omitted, standard output is flushed. +If @var{expr} is the null string (@code{""}), all output buffers are flushed. + +@item system(@var{cmd-line}) +Execute the command @var{cmd-line}, and return the exit status. +If your operating system does not support @code{system}, calling it will +generate a fatal error. + +@samp{system("")} can be used to force @code{awk} to flush any pending +output. This is more portable, but less obvious, than calling @code{fflush}. +@end table + +@node Time Functions Summary, String Constants Summary, Built-in Functions Summary, Actions Summary +@appendixsubsec Time Functions + +The following two functions are available for getting the current +time of day, and for formatting time stamps. +They are specific to @code{gawk}. + +@table @code +@item systime() +returns the current time of day as the number of seconds since a particular +epoch (Midnight, January 1, 1970 UTC, on POSIX systems). + +@item strftime(@r{[}@var{format}@r{[}, @var{timestamp}@r{]]}) +formats @var{timestamp} according to the specification in @var{format}. +The current time of day is used if no @var{timestamp} is supplied. +A default format equivalent to the output of the @code{date} utility is used if +no @var{format} is supplied. 
+@xref{Time Functions, ,Functions for Dealing with Time Stamps}, for the +details on the conversion specifiers that @code{strftime} accepts. +@end table + +@iftex +@xref{Built-in, ,Built-in Functions}, for a description of all of +@code{awk}'s built-in functions. +@end iftex + +@node String Constants Summary, , Time Functions Summary, Actions Summary +@appendixsubsec String Constants + +String constants in @code{awk} are sequences of characters enclosed +in double quotes (@code{"}). Within strings, certain @dfn{escape sequences} +are recognized, as in C. These are: + +@table @code +@item \\ +A literal backslash. + +@item \a +The ``alert'' character; usually the ASCII BEL character. + +@item \b +Backspace. + +@item \f +Formfeed. + +@item \n +Newline. + +@item \r +Carriage return. + +@item \t +Horizontal tab. + +@item \v +Vertical tab. + +@item \x@var{hex digits} +The character represented by the string of hexadecimal digits following +the @samp{\x}. As in ANSI C, all following hexadecimal digits are +considered part of the escape sequence. E.g., @code{"\x1B"} is a +string containing the ASCII ESC (escape) character. (The @samp{\x} +escape sequence is not in POSIX @code{awk}.) + +@item \@var{ddd} +The character represented by the one, two, or three digit sequence of octal +digits. Thus, @code{"\033"} is also a string containing the ASCII ESC +(escape) character. + +@item \@var{c} +The literal character @var{c}, if @var{c} is not one of the above. +@end table + +The escape sequences may also be used inside constant regular expressions +(e.g., the regexp @code{@w{/[@ \t\f\n\r\v]/}} matches whitespace +characters). + +@xref{Escape Sequences}. 
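To make the escape sequences above concrete, here is a small sketch that uses them both in string constants and in a constant regexp. It is an illustration only, assuming a POSIX @code{awk} on the search path; the function name @code{escape_demo} is invented, and only the portable octal form is used because @samp{\x} is not in POSIX @code{awk}:

```shell
# Hypothetical helper; assumes a POSIX awk is installed as "awk".
escape_demo() {
    awk 'BEGIN {
        s = "a\tb"                      # \t is a single horizontal tab
        print length(s)                 # three characters: a, tab, b
        n = split(s, parts, /[ \t]/)    # the same escape inside a regexp constant
        print n                         # two pieces: "a" and "b"
        print length("\033")            # octal escape: one character (ASCII ESC)
    }'
}
escape_demo
```

Each escape sequence stands for exactly one character, so the three lines printed should be 3, 2, and 1.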
+ +@node Functions Summary, Historical Features, Actions Summary, Gawk Summary +@appendixsec User-defined Functions + +Functions in @code{awk} are defined as follows: + +@example +function @var{name}(@var{parameter list}) @{ @var{statements} @} +@end example + +Actual parameters supplied in the function call are used to instantiate +the formal parameters declared in the function. Arrays are passed by +reference, other variables are passed by value. + +If there are fewer arguments passed than there are names in @var{parameter-list}, +the extra names are given the null string as their value. Extra names have the +effect of local variables. + +The open-parenthesis in a function call of a user-defined function must +immediately follow the function name, without any intervening white space. +This is to avoid a syntactic ambiguity with the concatenation operator. + +The word @code{func} may be used in place of @code{function} (but not in +POSIX @code{awk}). + +Use the @code{return} statement to return a value from a function. + +@xref{User-defined, ,User-defined Functions}. + +@node Historical Features, , Functions Summary, Gawk Summary +@appendixsec Historical Features + +@cindex historical features +There are two features of historical @code{awk} implementations that +@code{gawk} supports. + +First, it is possible to call the @code{length} built-in function not only +with no arguments, but even without parentheses! + +@example +a = length +@end example + +@noindent +is the same as either of + +@example +a = length() +a = length($0) +@end example + +@noindent +For example: + +@example +$ echo abcdef | awk '@{ print length @}' +@print{} 6 +@end example + +@noindent +This feature is marked as ``deprecated'' in the POSIX standard, and +@code{gawk} will issue a warning about its use if @samp{--lint} is +specified on the command line. +(The ability to use @code{length} this way was actually an accident of the +original Unix @code{awk} implementation. 
If any built-in function used +@code{$0} as its default argument, it was possible to call that function +without the parentheses. In particular, it was common practice to use +the @code{length} function in this fashion, and this usage was documented +in the @code{awk} manual page.) + +The other historical feature is the use of either the @code{break} statement, +or the @code{continue} statement +outside the body of a @code{while}, @code{for}, or @code{do} loop. Traditional +@code{awk} implementations have treated such usage as equivalent to the +@code{next} statement. More recent versions of Unix @code{awk} do not allow +it. @code{gawk} supports this usage if @samp{--traditional} has been +specified. + +@xref{Options, ,Command Line Options}, for more information about the +@samp{--posix} and @samp{--lint} options. + +@node Installation, Notes, Gawk Summary, Top +@appendix Installing @code{gawk} + +This appendix provides instructions for installing @code{gawk} on the +various platforms that are supported by the developers. The primary +developers support Unix (and one day, GNU), while the other ports were +contributed. The file @file{ACKNOWLEDGMENT} in the @code{gawk} +distribution lists the electronic mail addresses of the people who did +the respective ports, and they are also provided in +@ref{Bugs, , Reporting Problems and Bugs}. + +@menu +* Gawk Distribution:: What is in the @code{gawk} distribution. +* Unix Installation:: Installing @code{gawk} under various versions + of Unix. +* VMS Installation:: Installing @code{gawk} on VMS. +* PC Installation:: Installing and Compiling @code{gawk} on MS-DOS + and OS/2 +* Atari Installation:: Installing @code{gawk} on the Atari ST. +* Amiga Installation:: Installing @code{gawk} on an Amiga. +* Bugs:: Reporting Problems and Bugs. +* Other Versions:: Other freely available @code{awk} + implementations. 
+@end menu + +@node Gawk Distribution, Unix Installation, Installation, Installation +@appendixsec The @code{gawk} Distribution + +This section first describes how to get the @code{gawk} +distribution, how to extract it, and then what is in the various files and +subdirectories. + +@menu +* Getting:: How to get the distribution. +* Extracting:: How to extract the distribution. +* Distribution contents:: What is in the distribution. +@end menu + +@node Getting, Extracting, Gawk Distribution, Gawk Distribution +@appendixsubsec Getting the @code{gawk} Distribution +@cindex getting @code{gawk} +@cindex anonymous @code{ftp} +@cindex @code{ftp}, anonymous +@cindex Free Software Foundation +There are three ways you can get GNU software. + +@enumerate +@item +You can copy it from someone else who already has it. + +@cindex Free Software Foundation +@item +You can order @code{gawk} directly from the Free Software Foundation. +Software distributions are available for Unix, MS-DOS, and VMS, on +tape and CD-ROM. The address is: + +@quotation +Free Software Foundation @* +59 Temple Place---Suite 330 @* +Boston, MA 02111-1307 USA @* +Phone: +1-617-542-5942 @* +Fax (including Japan): +1-617-542-2652 @* +E-mail: @code{gnu@@prep.ai.mit.edu} @* +@end quotation + +@noindent +Ordering from the FSF directly contributes to the support of the foundation +and to the production of more free software. + +@item +You can get @code{gawk} by using anonymous @code{ftp} to the Internet host +@code{ftp.gnu.ai.mit.edu}, in the directory @file{/pub/gnu}. + +Here is a list of alternate @code{ftp} sites from which you can obtain GNU +software. When a site is listed as ``@var{site}@code{:}@var{directory}'' the +@var{directory} indicates the directory where GNU software is kept. +You should use a site that is geographically close to you. 
+ +@table @asis +@item Asia: +@table @code +@item cair-archive.kaist.ac.kr:/pub/gnu +@itemx ftp.cs.titech.ac.jp +@itemx ftp.nectec.or.th:/pub/mirrors/gnu +@itemx utsun.s.u-tokyo.ac.jp:/ftpsync/prep +@end table + +@item Australia: +@table @code +@item archie.au:/gnu +(@code{archie.oz} or @code{archie.oz.au} for ACSnet) +@end table + +@item Africa: +@table @code +@item ftp.sun.ac.za:/pub/gnu +@end table + +@item Middle East: +@table @code +@item ftp.technion.ac.il:/pub/unsupported/gnu +@end table + +@item Europe: +@table @code +@item archive.eu.net +@itemx ftp.denet.dk +@itemx ftp.eunet.ch +@itemx ftp.funet.fi:/pub/gnu +@itemx ftp.ieunet.ie:pub/gnu +@itemx ftp.informatik.rwth-aachen.de:/pub/gnu +@itemx ftp.informatik.tu-muenchen.de +@itemx ftp.luth.se:/pub/unix/gnu +@itemx ftp.mcc.ac.uk +@itemx ftp.stacken.kth.se +@itemx ftp.sunet.se:/pub/gnu +@itemx ftp.univ-lyon1.fr:pub/gnu +@itemx ftp.win.tue.nl:/pub/gnu +@itemx irisa.irisa.fr:/pub/gnu +@itemx isy.liu.se +@itemx nic.switch.ch:/mirror/gnu +@itemx src.doc.ic.ac.uk:/gnu +@itemx unix.hensa.ac.uk:/pub/uunet/systems/gnu +@end table + +@item South America: +@table @code +@item ftp.inf.utfsm.cl:/pub/gnu +@itemx ftp.unicamp.br:/pub/gnu +@end table + +@item Western Canada: +@table @code +@item ftp.cs.ubc.ca:/mirror2/gnu +@end table + +@item USA: +@table @code +@item col.hp.com:/mirrors/gnu +@itemx f.ms.uky.edu:/pub3/gnu +@itemx ftp.cc.gatech.edu:/pub/gnu +@itemx ftp.cs.columbia.edu:/archives/gnu/prep +@itemx ftp.digex.net:/pub/gnu +@itemx ftp.hawaii.edu:/mirrors/gnu +@itemx ftp.kpc.com:/pub/mirror/gnu +@end table + +@c NEEDED +@page +@item USA (continued): +@table @code +@item ftp.uu.net:/systems/gnu +@itemx gatekeeper.dec.com:/pub/GNU +@itemx jaguar.utah.edu:/gnustuff +@itemx labrea.stanford.edu +@itemx mrcnext.cso.uiuc.edu:/pub/gnu +@itemx vixen.cso.uiuc.edu:/gnu +@itemx wuarchive.wustl.edu:/systems/gnu +@end table +@end table +@end enumerate + +@node Extracting, Distribution contents, Getting, Gawk Distribution 
+@appendixsubsec Extracting the Distribution +@code{gawk} is distributed as a @code{tar} file compressed with the +GNU Zip program, @code{gzip}. + +Once you have the distribution (for example, +@file{gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz}), first use @code{gzip} to expand the +file, and then use @code{tar} to extract it. You can use the following +pipeline to produce the @code{gawk} distribution: + +@example +# Under System V, add 'o' to the tar flags +gzip -d -c gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz | tar -xvpf - +@end example + +@noindent +This will create a directory named @file{gawk-@value{VERSION}.@value{PATCHLEVEL}} in the current +directory. + +The distribution file name is of the form +@file{gawk-@var{V}.@var{R}.@var{n}.tar.gz}. +The @var{V} represents the major version of @code{gawk}, +the @var{R} represents the current release of version @var{V}, and +the @var{n} represents a @dfn{patch level}, meaning that minor bugs have +been fixed in the release. The current patch level is @value{PATCHLEVEL}, +but when +retrieving distributions, you should get the version with the highest +version, release, and patch level. (Note that release levels greater than +or equal to 90 denote ``beta,'' or non-production software; you may not wish +to retrieve such a version unless you don't mind experimenting.) + +If you are not on a Unix system, you will need to make other arrangements +for getting and extracting the @code{gawk} distribution. You should consult +a local expert. + +@node Distribution contents, , Extracting, Gawk Distribution +@appendixsubsec Contents of the @code{gawk} Distribution + +The @code{gawk} distribution has a number of C source files, +documentation files, +subdirectories and files related to the configuration process +(@pxref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}), +and several subdirectories related to different, non-Unix, +operating systems. 
+ +@table @asis +@item various @samp{.c}, @samp{.y}, and @samp{.h} files +These files are the actual @code{gawk} source code. +@end table + +@table @file +@item README +@itemx README_d/README.* +Descriptive files: @file{README} for @code{gawk} under Unix, and the +rest for the various hardware and software combinations. + +@item INSTALL +A file providing an overview of the configuration and installation process. + +@item PORTS +A list of systems to which @code{gawk} has been ported, and which +have successfully run the test suite. + +@item ACKNOWLEDGMENT +A list of the people who contributed major parts of the code or documentation. + +@item ChangeLog +A detailed list of source code changes as bugs are fixed or improvements made. + +@item NEWS +A list of changes to @code{gawk} since the last release or patch. + +@item COPYING +The GNU General Public License. + +@item FUTURES +A brief list of features and/or changes being contemplated for future +releases, with some indication of the time frame for the feature, based +on its difficulty. + +@item LIMITATIONS +A list of those factors that limit @code{gawk}'s performance. +Most of these depend on the hardware or operating system software, and +are not limits in @code{gawk} itself. + +@item POSIX.STD +A description of one area where the POSIX standard for @code{awk} is +incorrect, and how @code{gawk} handles the problem. + +@item PROBLEMS +A file describing known problems with the current release. + +@cindex artificial intelligence, using @code{gawk} +@cindex AI programming, using @code{gawk} +@item doc/awkforai.txt +A short article describing why @code{gawk} is a good language for +AI (Artificial Intelligence) programming. + +@item doc/README.card +@itemx doc/ad.block +@itemx doc/awkcard.in +@itemx doc/cardfonts +@itemx doc/colors +@itemx doc/macros +@itemx doc/no.colors +@itemx doc/setter.outline +The @code{troff} source for a five-color @code{awk} reference card. 
+A modern version of @code{troff}, such as GNU Troff (@code{groff}) is +needed to produce the color version. See the file @file{README.card} +for instructions if you have an older @code{troff}. + +@item doc/gawk.1 +The @code{troff} source for a manual page describing @code{gawk}. +This is distributed for the convenience of Unix users. + +@item doc/gawk.texi +The Texinfo source file for this @value{DOCUMENT}. +It should be processed with @TeX{} to produce a printed document, and +with @code{makeinfo} to produce an Info file. + +@item doc/gawk.info +The generated Info file for this @value{DOCUMENT}. + +@item doc/igawk.1 +The @code{troff} source for a manual page describing the @code{igawk} +program presented in +@ref{Igawk Program, ,An Easy Way to Use Library Functions}. + +@item doc/Makefile.in +The input file used during the configuration process to generate the +actual @file{Makefile} for creating the documentation. + +@item Makefile.in +@itemx acconfig.h +@itemx aclocal.m4 +@itemx configh.in +@itemx configure.in +@itemx configure +@itemx custom.h +@itemx missing/* +These files and subdirectory are used when configuring @code{gawk} +for various Unix systems. They are explained in detail in +@ref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}. + +@item awklib/extract.awk +@itemx awklib/Makefile.in +The @file{awklib} directory contains a copy of @file{extract.awk} +(@pxref{Extract Program, ,Extracting Programs from Texinfo Source Files}), +which can be used to extract the sample programs from the Texinfo +source file for this @value{DOCUMENT}, and a @file{Makefile.in} file, which +@code{configure} uses to generate a @file{Makefile}. +As part of the process of building @code{gawk}, the library functions from +@ref{Library Functions, , A Library of @code{awk} Functions}, +and the @code{igawk} program from +@ref{Igawk Program, , An Easy Way to Use Library Functions}, +are extracted into ready to use files. 
+They are installed as part of the installation process. + +@item atari/* +Files needed for building @code{gawk} on an Atari ST. +@xref{Atari Installation, ,Installing @code{gawk} on the Atari ST}, for details. + +@item pc/* +Files needed for building @code{gawk} under MS-DOS and OS/2. +@xref{PC Installation, ,MS-DOS and OS/2 Installation and Compilation}, for details. + +@item vms/* +Files needed for building @code{gawk} under VMS. +@xref{VMS Installation, ,How to Compile and Install @code{gawk} on VMS}, for details. + +@item test/* +A test suite for +@code{gawk}. You can use @samp{make check} from the top level @code{gawk} +directory to run your version of @code{gawk} against the test suite. +If @code{gawk} successfully passes @samp{make check} then you can +be confident of a successful port. +@end table + +@node Unix Installation, VMS Installation, Gawk Distribution, Installation +@appendixsec Compiling and Installing @code{gawk} on Unix + +Usually, you can compile and install @code{gawk} by typing only two +commands. However, if you do use an unusual system, you may need +to configure @code{gawk} for your system yourself. + +@menu +* Quick Installation:: Compiling @code{gawk} under Unix. +* Configuration Philosophy:: How it's all supposed to work. +@end menu + +@node Quick Installation, Configuration Philosophy, Unix Installation, Unix Installation +@appendixsubsec Compiling @code{gawk} for Unix + +@cindex installation, unix +After you have extracted the @code{gawk} distribution, @code{cd} +to @file{gawk-@value{VERSION}.@value{PATCHLEVEL}}. Like most GNU software, +@code{gawk} is configured +automatically for your Unix system by running the @code{configure} program. +This program is a Bourne shell script that was generated automatically using +GNU @code{autoconf}. +@iftex +(The @code{autoconf} software is +described fully in +@cite{Autoconf---Generating Automatic Configuration Scripts}, +which is available from the Free Software Foundation.) 
+@end iftex +@ifinfo +(The @code{autoconf} software is described fully starting with +@ref{Top, , Introduction, autoconf, Autoconf---Generating Automatic Configuration Scripts}.) +@end ifinfo + +To configure @code{gawk}, simply run @code{configure}: + +@example +sh ./configure +@end example + +This produces a @file{Makefile} and @file{config.h} tailored to your system. +The @file{config.h} file describes various facts about your system. +You may wish to edit the @file{Makefile} to +change the @code{CFLAGS} variable, which controls +the command line options that are passed to the C compiler (such as +optimization levels, or compiling for debugging). + +Alternatively, you can add your own values for most @code{make} +variables, such as @code{CC} and @code{CFLAGS}, on the command line when +running @code{configure}: + +@example +CC=cc CFLAGS=-g sh ./configure +@end example + +@noindent +See the file @file{INSTALL} in the @code{gawk} distribution for +all the details. + +After you have run @code{configure}, and possibly edited the @file{Makefile}, +type: + +@example +make +@end example + +@noindent +and shortly thereafter, you should have an executable version of @code{gawk}. +That's all there is to it! +(If these steps do not work, please send in a bug report; +@pxref{Bugs, ,Reporting Problems and Bugs}.) + +@node Configuration Philosophy, , Quick Installation, Unix Installation +@appendixsubsec The Configuration Process + +@cindex configuring @code{gawk} +(This section is of interest only if you know something about using the +C language and the Unix operating system.) + +The source code for @code{gawk} generally attempts to adhere to formal +standards wherever possible. This means that @code{gawk} uses library +routines that are specified by the ANSI C standard and by the POSIX +operating system interface standard. When using an ANSI C compiler, +function prototypes are used to help improve the compile-time checking. 
+ +Many Unix systems do not support all of either the ANSI or the +POSIX standards. The @file{missing} subdirectory in the @code{gawk} +distribution contains replacement versions of those subroutines that are +most likely to be missing. + +The @file{config.h} file that is created by the @code{configure} program +contains definitions that describe features of the particular operating +system where you are attempting to compile @code{gawk}. The three things +described by this file are what header files are available, so that +they can be correctly included, +what (supposedly) standard functions are actually available in your C +libraries, and +other miscellaneous facts about your +variant of Unix. For example, there may not be an @code{st_blksize} +element in the @code{stat} structure. In this case @samp{HAVE_ST_BLKSIZE} +would be undefined. + +@cindex @code{custom.h} configuration file +It is possible for your C compiler to lie to @code{configure}. It may +do so by not exiting with an error when a library function is not +available. To get around this, you can edit the file @file{custom.h}. +Use an @samp{#ifdef} that is appropriate for your system, and either +@code{#define} any constants that @code{configure} should have defined but +didn't, or @code{#undef} any constants that @code{configure} defined and +should not have. @file{custom.h} is automatically included by +@file{config.h}. + +It is also possible that the @code{configure} program generated by +@code{autoconf} +will not work on your system in some other fashion. If you do have a problem, +the file +@file{configure.in} is the input for @code{autoconf}. You may be able to +change this file, and generate a new version of @code{configure} that will +work on your system. @xref{Bugs, ,Reporting Problems and Bugs}, for +information on how to report problems in configuring @code{gawk}. The same +mechanism may be used to send in updates to @file{configure.in} and/or +@file{custom.h}. 
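Before overriding anything in @file{custom.h}, it can help to confirm what @code{configure} actually recorded in @file{config.h}. The following is only a sketch in POSIX shell, not something shipped with @code{gawk}: the helper names @code{has_macro} and @code{report_macro} are invented here, and the sketch assumes that detected features appear in @file{config.h} as ordinary @samp{#define} lines.

```shell
#!/bin/sh
# Sketch: report whether a feature macro is defined in a generated config.h.
# Assumption: detected features appear as "#define NAME ..." lines, while
# undetected ones are absent or left inside "/* #undef NAME */" comments.

# has_macro NAME FILE: succeed if FILE contains an active #define of NAME.
has_macro() {
    grep -Eq "^[[:space:]]*#[[:space:]]*define[[:space:]]+$1([[:space:]]|\$)" "$2"
}

# report_macro NAME FILE: print a one-line summary for NAME.
report_macro() {
    if has_macro "$1" "$2"; then
        echo "$1: defined"
    else
        echo "$1: not defined"
    fi
}
```

For instance, @samp{report_macro HAVE_ST_BLKSIZE config.h} reports whether that macro is defined. A result that contradicts what you know about your system is a hint that an override in @file{custom.h} may be needed.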
+ +@node VMS Installation, PC Installation, Unix Installation, Installation +@appendixsec How to Compile and Install @code{gawk} on VMS + +@c based on material from Pat Rankin <rankin@eql.caltech.edu> + +@cindex installation, vms +This section describes how to compile and install @code{gawk} under VMS. + +@menu +* VMS Compilation:: How to compile @code{gawk} under VMS. +* VMS Installation Details:: How to install @code{gawk} under VMS. +* VMS Running:: How to run @code{gawk} under VMS. +* VMS POSIX:: Alternate instructions for VMS POSIX. +@end menu + +@node VMS Compilation, VMS Installation Details, VMS Installation, VMS Installation +@appendixsubsec Compiling @code{gawk} on VMS + +To compile @code{gawk} under VMS, there is a @code{DCL} command procedure that +will issue all the necessary @code{CC} and @code{LINK} commands, and there is +also a @file{Makefile} for use with the @code{MMS} utility. From the source +directory, use either + +@example +$ @@[.VMS]VMSBUILD.COM +@end example + +@noindent +or + +@example +$ MMS/DESCRIPTION=[.VMS]DESCRIP.MMS GAWK +@end example + +Depending upon which C compiler you are using, follow one of the sets +of instructions in this table: + +@table @asis +@item VAX C V3.x +Use either @file{vmsbuild.com} or @file{descrip.mms} as is. These use +@code{CC/OPTIMIZE=NOLINE}, which is essential for Version 3.0. + +@item VAX C V2.x +You must have Version 2.3 or 2.4; older ones won't work. Edit either +@file{vmsbuild.com} or @file{descrip.mms} according to the comments in them. +For @file{vmsbuild.com}, this just entails removing two @samp{!} delimiters. +Also edit @file{config.h} (which is a copy of file @file{[.config]vms-conf.h}) +and comment out or delete the two lines @samp{#define __STDC__ 0} and +@samp{#define VAXC_BUILTINS} near the end. + +@item GNU C +Edit @file{vmsbuild.com} or @file{descrip.mms}; the changes are different +from those for VAX C V2.x, but equally straightforward. No changes to +@file{config.h} should be needed. 
+ +@item DEC C +Edit @file{vmsbuild.com} or @file{descrip.mms} according to their comments. +No changes to @file{config.h} should be needed. +@end table + +@code{gawk} has been tested under VAX/VMS 5.5-1 using VAX C V3.2, +GNU C 1.40 and 2.3. It should work without modifications for VMS V4.6 and up. + +@node VMS Installation Details, VMS Running, VMS Compilation, VMS Installation +@appendixsubsec Installing @code{gawk} on VMS + +To install @code{gawk}, all you need is a ``foreign'' command, which is +a @code{DCL} symbol whose value begins with a dollar sign. For example: + +@example +$ GAWK :== $disk1:[gnubin]GAWK +@end example + +@noindent +(Substitute the actual location of @code{gawk.exe} for +@samp{$disk1:[gnubin]}.) The symbol should be placed in the +@file{login.com} of any user who wishes to run @code{gawk}, +so that it will be defined every time the user logs on. +Alternatively, the symbol may be placed in the system-wide +@file{sylogin.com} procedure, which will allow all users +to run @code{gawk}. + +Optionally, the help entry can be loaded into a VMS help library: + +@example +$ LIBRARY/HELP SYS$HELP:HELPLIB [.VMS]GAWK.HLP +@end example + +@noindent +(You may want to substitute a site-specific help library rather than +the standard VMS library @samp{HELPLIB}.) After loading the help text, + +@example +$ HELP GAWK +@end example + +@noindent +will provide information about both the @code{gawk} implementation and the +@code{awk} programming language. + +The logical name @samp{AWK_LIBRARY} can designate a default location +for @code{awk} program files. For the @samp{-f} option, if the specified +filename has no device or directory path information in it, @code{gawk} +will look in the current directory first, then in the directory specified +by the translation of @samp{AWK_LIBRARY} if the file was not found. 
+If after searching in both directories, the file still is not found, +then @code{gawk} appends the suffix @samp{.awk} to the filename and the +file search will be re-tried. If @samp{AWK_LIBRARY} is not defined, that +portion of the file search will fail benignly. + +@node VMS Running, VMS POSIX, VMS Installation Details, VMS Installation +@appendixsubsec Running @code{gawk} on VMS + +Command line parsing and quoting conventions are significantly different +on VMS, so examples in this @value{DOCUMENT} or from other sources often need minor +changes. They @emph{are} minor though, and all @code{awk} programs +should run correctly. + +Here are a couple of trivial tests: + +@example +$ gawk -- "BEGIN @{print ""Hello, World!""@}" +$ gawk -"W" version +! could also be -"W version" or "-W version" +@end example + +@noindent +Note that upper-case and mixed-case text must be quoted. + +The VMS port of @code{gawk} includes a @code{DCL}-style interface in addition +to the original shell-style interface (see the help entry for details). +One side-effect of dual command line parsing is that if there is only a +single parameter (as in the quoted string program above), the command +becomes ambiguous. To work around this, the normally optional @samp{--} +flag is required to force Unix style rather than @code{DCL} parsing. If any +other dash-type options (or multiple parameters such as data files to be +processed) are present, there is no ambiguity and @samp{--} can be omitted. + +The default search path when looking for @code{awk} program files specified +by the @samp{-f} option is @code{"SYS$DISK:[],AWK_LIBRARY:"}. The logical +name @samp{AWKPATH} can be used to override this default. The format +of @samp{AWKPATH} is a comma-separated list of directory specifications. +When defining it, the value should be quoted so that it retains a single +translation, and not a multi-translation @code{RMS} searchlist. 
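The lookup order described in this section (each directory in turn, then a second pass with @samp{.awk} appended) can be sketched as follows. This is an illustration in POSIX shell, not VMS @code{DCL} and not @code{gawk}'s actual code: Unix-style directory names stand in for VMS directory specifications, and the function name @code{find_awk_program} is hypothetical.

```shell
#!/bin/sh
# Sketch of the lookup order only; Unix-style directories stand in for
# VMS directory specifications.
#
# find_awk_program NAME PATHLIST
#   PATHLIST is a comma-separated list of directories (a stand-in for the
#   value of AWKPATH). Prints the first match and returns 0, or returns 1
#   if the file is not found anywhere.
find_awk_program() {
    name=$1
    pathlist=$2
    # First pass uses the name as given; second pass retries with ".awk".
    for candidate in "$name" "$name.awk"; do
        old_ifs=$IFS
        IFS=,
        for dir in $pathlist; do
            # Skip empty path elements rather than searching "/".
            [ -n "$dir" ] || continue
            if [ -f "$dir/$candidate" ]; then
                IFS=$old_ifs
                echo "$dir/$candidate"
                return 0
            fi
        done
        IFS=$old_ifs
    done
    return 1
}
```

Note that the whole directory list is searched with the unmodified name before the suffix is tried, so a file named exactly as requested in a later directory takes precedence over a @samp{.awk} file in an earlier one.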
+ +@node VMS POSIX, , VMS Running, VMS Installation +@appendixsubsec Building and Using @code{gawk} on VMS POSIX + +Ignore the instructions above, although @file{vms/gawk.hlp} should still +be made available in a help library. The source tree should be unpacked +into a container file subsystem rather than into the ordinary VMS file +system. Make sure that the two scripts, @file{configure} and +@file{vms/posix-cc.sh}, are executable; use @samp{chmod +x} on them if +necessary. Then execute the following two commands: + +@example +@group +psx> CC=vms/posix-cc.sh configure +psx> make CC=c89 gawk +@end group +@end example + +@noindent +The first command will construct files @file{config.h} and @file{Makefile} out +of templates, using a script to make the C compiler fit @code{configure}'s +expectations. The second command will compile and link @code{gawk} using +the C compiler directly; ignore any warnings from @code{make} about being +unable to redefine @code{CC}. @code{configure} will take a very long +time to execute, but at least it provides incremental feedback as it +runs. + +This has been tested with VAX/VMS V6.2, VMS POSIX V2.0, and DEC C V5.2. + +Once built, @code{gawk} will work like any other shell utility. Unlike +the normal VMS port of @code{gawk}, no special command line manipulation is +needed in the VMS POSIX environment. + +@c Rewritten by Scott Deifik <scottd@amgen.com> +@c and Darrel Hankerson <hankedr@mail.auburn.edu> +@node PC Installation, Atari Installation, VMS Installation, Installation +@appendixsec MS-DOS and OS/2 Installation and Compilation + +@cindex installation, MS-DOS and OS/2 +If you have received a binary distribution prepared by the DOS +maintainers, then @code{gawk} and the necessary support files will appear +under the @file{gnu} directory, with executables in @file{gnu/bin}, +libraries in @file{gnu/lib/awk}, and manual pages under @file{gnu/man}. 
+This is designed for easy installation to a @file{/gnu} directory on your +drive, but the files can be installed anywhere provided @code{AWKPATH} is +set properly. Regardless of the installation directory, the first line of +@file{igawk.cmd} and @file{igawk.bat} (in @file{gnu/bin}) may need to be +edited. + +The binary distribution will contain a separate file describing the +contents. In particular, it may include more than one version of the +@code{gawk} executable. OS/2 binary distributions may have a +different arrangement, but installation is similar. + +The OS/2 and MS-DOS versions of @code{gawk} search for program files as +described in @ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}. +However, semicolons (rather than colons) separate elements +in the @code{AWKPATH} variable. If @code{AWKPATH} is not set or is empty, +then the default search path is @code{@w{".;c:/lib/awk;c:/gnu/lib/awk"}}. + +An @code{sh}-like shell (as opposed to @code{command.com} under MS-DOS +or @code{cmd.exe} under OS/2) may be useful for @code{awk} programming. +Ian Stewartson has written an excellent shell for MS-DOS and OS/2, and a +@code{ksh} clone and GNU Bash are available for OS/2. The file +@file{README_d/README.pc} in the @code{gawk} distribution contains +information on these shells. Users of Stewartson's shell on DOS should +examine its documentation on handling of command-lines. In particular, +the setting for @code{gawk} in the shell configuration may need to be +changed, and the @code{ignoretype} option may also be of interest. + +@code{gawk} can be compiled for MS-DOS and OS/2 using the GNU development tools +from DJ Delorie (DJGPP, MS-DOS-only) or Eberhard Mattes (EMX, MS-DOS and OS/2). +Microsoft C can be used to build 16-bit versions for MS-DOS and OS/2. The file +@file{README_d/README.pc} in the @code{gawk} distribution contains additional +notes, and @file{pc/Makefile} contains important notes on compilation options. 
+ +To build @code{gawk}, copy the files in the @file{pc} directory (@emph{except} +for @file{ChangeLog}) to the +directory with the rest of the @code{gawk} sources. The @file{Makefile} +contains a configuration section with comments, and may need to be +edited in order to work with your @code{make} utility. + +The @file{Makefile} contains a number of targets for building various MS-DOS +and OS/2 versions. A list of targets will be printed if the @code{make} +command is given without a target. As an example, to build @code{gawk} +using the DJGPP tools, enter @samp{make djgpp}. + +Using @code{make} to run the standard tests and to install @code{gawk} +requires additional Unix-like tools, including @code{sh}, @code{sed}, and +@code{cp}. In order to run the tests, the @file{test/*.ok} files may need to +be converted so that they have the usual DOS-style end-of-line markers. Most +of the tests will work properly with Stewartson's shell along with the +companion utilities or appropriate GNU utilities. However, some editing of +@file{test/Makefile} is required. It is recommended that the file +@file{pc/Makefile.tst} be copied to @file{test/Makefile} as a +replacement. Details can be found in @file{README_d/README.pc}. + +@node Atari Installation, Amiga Installation, PC Installation, Installation +@appendixsec Installing @code{gawk} on the Atari ST + +@c based on material from Michal Jaegermann <michal@gortel.phys.ualberta.ca> + +@cindex atari +@cindex installation, atari +There are no substantial differences when installing @code{gawk} on +various Atari models. Compiled @code{gawk} executables do not require +a large amount of memory with most @code{awk} programs and should run on all +Motorola processor based models (called further ST, even if that is not +exactly right). + +In order to use @code{gawk}, you need to have a shell, either text or +graphics, that does not map all the characters of a command line to +upper-case. 
Maintaining case distinction in option flags is very +important (@pxref{Options, ,Command Line Options}). +These days this is the default, and it may only be a problem for some +very old machines. If your system does not preserve the case of option +flags, you will need to upgrade your tools. Support for I/O +redirection is necessary to make it easy to import @code{awk} programs +from other environments. Pipes are nice to have, but not vital. + +@menu +* Atari Compiling:: Compiling @code{gawk} on Atari +* Atari Using:: Running @code{gawk} on Atari +@end menu + +@node Atari Compiling, Atari Using, Atari Installation, Atari Installation +@appendixsubsec Compiling @code{gawk} on the Atari ST + +A proper compilation of @code{gawk} sources when @code{sizeof(int)} +differs from @code{sizeof(void *)} requires an ANSI C compiler. An initial +port was done with @code{gcc}. You may actually prefer executables +where @code{int}s are four bytes wide, but the other variant works as well. + +You may need quite a bit of memory when trying to recompile the @code{gawk} +sources, as some source files (@file{regex.c} in particular) are quite +big. If you run out of memory compiling such a file, try reducing the +optimization level for this particular file; this may help. + +@cindex Linux +With a reasonable shell (Bash will do), and in particular if you run +Linux, MiNT or a similar operating system, you have a pretty good +chance that the @code{configure} utility will succeed. Otherwise +sample versions of @file{config.h} and @file{Makefile.st} are given in the +@file{atari} subdirectory and can be edited and copied to the +corresponding files in the main source directory. Even if +@code{configure} produced something, it might be advisable to compare +its results with the sample versions and possibly make adjustments. + +Some @code{gawk} source code fragments depend on a preprocessor define +@samp{atarist}. This basically assumes the TOS environment with @code{gcc}. 
+Modify these sections as appropriate if they are not right for your +environment. Also see the remarks about @code{AWKPATH} and @code{envsep} in +@ref{Atari Using, ,Running @code{gawk} on the Atari ST}. + +As shipped, the sample @file{config.h} claims that the @code{system} +function is missing from the libraries, which is not true, and an +alternative implementation of this function is provided in +@file{atari/system.c}. Depending upon your particular combination of +shell and operating system, you may wish to change the file to indicate +that @code{system} is available. + +@node Atari Using, , Atari Compiling, Atari Installation +@appendixsubsec Running @code{gawk} on the Atari ST + +An executable version of @code{gawk} should be placed, as usual, +anywhere in your @code{PATH} where your shell can find it. + +While executing, @code{gawk} creates a number of temporary files. When +using @code{gcc} libraries for TOS, @code{gawk} looks for either of +the environment variables @code{TEMP} or @code{TMPDIR}, in that order. +If either one is found, its value is assumed to be a directory for +temporary files. This directory must exist, and if you can spare the +memory, it is a good idea to put it on a RAM drive. If neither +@code{TEMP} nor @code{TMPDIR} is found, then @code{gawk} uses the +current directory for its temporary files. + +The ST version of @code{gawk} searches for its program files as described in +@ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}. +The default value for the @code{AWKPATH} variable is taken from +@code{DEFPATH} defined in @file{Makefile}. The sample @code{gcc}/TOS +@file{Makefile} for the ST in the distribution sets @code{DEFPATH} to +@code{@w{".,c:\lib\awk,c:\gnu\lib\awk"}}. The search path can be +modified by explicitly setting @code{AWKPATH} to whatever you wish. +Note that colons cannot be used on the ST to separate elements in the +@code{AWKPATH} variable, since they have another, reserved, meaning. 
+Instead, you must use a comma to separate elements in the path. When +recompiling, the separating character can be modified by initializing +the @code{envsep} variable in @file{atari/gawkmisc.atr} to another +value. + +Although @code{awk} allows great flexibility in doing I/O redirections +from within a program, this facility should be used with care on the ST +running under TOS. In some circumstances the OS routines for file +handle pool processing lose track of certain events, causing the +computer to crash, and requiring a reboot. Often a warm reboot is +sufficient. Fortunately, this happens infrequently, and in rather +esoteric situations. In particular, avoid having one part of an +@code{awk} program using @code{print} statements explicitly redirected +to @code{"/dev/stdout"}, while other @code{print} statements use the +default standard output, and a calling shell has redirected standard +output to a file. + +When @code{gawk} is compiled with the ST version of @code{gcc} and its +usual libraries, it will accept both @samp{/} and @samp{\} as path separators. +While this is convenient, it should be remembered that this removes one, +technically valid, character (@samp{/}) from your file names, and that +it may create problems for external programs, called via the @code{system} +function, which may not support this convention. Whenever it is possible +that a file created by @code{gawk} will be used by some other program, +use only backslashes. Also remember that in @code{awk}, backslashes in +strings have to be doubled in order to get literal backslashes +(@pxref{Escape Sequences}). + +@node Amiga Installation, Bugs, Atari Installation, Installation +@appendixsec Installing @code{gawk} on an Amiga + +@cindex amiga +@cindex installation, amiga +You can install @code{gawk} on an Amiga system using a Unix emulation +environment available via anonymous @code{ftp} from +@code{ftp.ninemoons.com} in the directory @file{pub/ade/current}. 
+This includes a shell based on @code{pdksh}. The primary component of +this environment is a Unix emulation library, @file{ixemul.lib}. +@c could really use more background here, who wrote this, etc. + +A more complete distribution for the Amiga is available on +the Geek Gadgets CD-ROM from: + +@quotation +CRONUS @* +1840 E. Warner Road #105-265 @* +Tempe, AZ 85284 USA @* +US Toll Free: (800) 804-0833 @* +Phone: +1-602-491-0442 @* +FAX: +1-602-491-0048 @* +Email: @code{info@@ninemoons.com} @* +WWW: @code{http://www.ninemoons.com} @* +Anonymous @code{ftp} site: @code{ftp.ninemoons.com} @* +@end quotation + +Once you have the distribution, you can configure @code{gawk} simply by +running @code{configure}: + +@example +configure -v m68k-amigaos +@end example + +Then run @code{make}, and you should be all set! +(If these steps do not work, please send in a bug report; +@pxref{Bugs, ,Reporting Problems and Bugs}.) + +@node Bugs, Other Versions, Amiga Installation, Installation +@appendixsec Reporting Problems and Bugs +@display +@i{There is nothing more dangerous than a bored archeologist.} +The Hitchhiker's Guide to the Galaxy +@c the radio show, not the book. :-) +@end display +@sp 1 + +If you have problems with @code{gawk} or think that you have found a bug, +please report it to the developers; we cannot promise to do anything +but we might well want to fix it. + +Before reporting a bug, make sure you have actually found a real bug. +Carefully reread the documentation and see if it really says you can do +what you're trying to do. If it's not clear whether you should be able +to do something or not, report that too; it's a bug in the documentation! + +Before reporting a bug or trying to fix it yourself, try to isolate it +to the smallest possible @code{awk} program and input data file that +reproduces the problem. Then send us the program and data file, +some idea of what kind of Unix system you're using, and the exact results +@code{gawk} gave you. 
Also say what you expected to occur; this will help +us decide whether the problem was really in the documentation. + +Once you have a precise problem, there are two e-mail addresses you +can send mail to. + +@table @asis +@item Internet: +@samp{bug-gnu-utils@@prep.ai.mit.edu} + +@item UUCP: +@samp{uunet!prep.ai.mit.edu!bug-gnu-utils} +@end table + +Please include the +version number of @code{gawk} you are using. You can get this information +with the command @samp{gawk --version}. +You should send a carbon copy of your mail to Arnold Robbins, who can +be reached at @samp{arnold@@gnu.ai.mit.edu}. + +@cindex @code{comp.lang.awk} +@strong{Important!} Do @emph{not} try to report bugs in @code{gawk} by +posting to the Usenet/Internet newsgroup @code{comp.lang.awk}. +While the @code{gawk} developers do occasionally read this newsgroup, +there is no guarantee that we will see your posting. The steps described +above are the official, recognized ways for reporting bugs. + +Non-bug suggestions are always welcome as well. If you have questions +about things that are unclear in the documentation or are just obscure +features, ask Arnold Robbins; he will try to help you out, although he +may not have the time to fix the problem. You can send him electronic +mail at the Internet address above. + +If you find bugs in one of the non-Unix ports of @code{gawk}, please send +an electronic mail message to the person who maintains that port. They +are listed below, and also in the @file{README} file in the @code{gawk} +distribution. Information in the @file{README} file should be considered +authoritative if it conflicts with this @value{DOCUMENT}. 
+ +@c NEEDED for looks +@page +The people maintaining the non-Unix ports of @code{gawk} are: + +@cindex Deifik, Scott +@cindex Fish, Fred +@cindex Hankerson, Darrel +@cindex Jaegermann, Michal +@cindex Rankin, Pat +@cindex Rommel, Kai Uwe +@table @asis +@item MS-DOS +Scott Deifik, @samp{scottd@@amgen.com}, and +Darrel Hankerson, @samp{hankedr@@mail.auburn.edu}. + +@item OS/2 +Kai Uwe Rommel, @samp{rommel@@ars.de}. + +@item VMS +Pat Rankin, @samp{rankin@@eql.caltech.edu}. + +@item Atari ST +Michal Jaegermann, @samp{michal@@gortel.phys.ualberta.ca}. + +@item Amiga +Fred Fish, @samp{fnf@@ninemoons.com}. +@end table + +If your bug is also reproducible under Unix, please send copies of your +report to the general GNU bug list, as well as to Arnold Robbins, at the +addresses listed above. + +@node Other Versions, , Bugs, Installation +@appendixsec Other Freely Available @code{awk} Implementations +@cindex Brennan, Michael +@ignore +From: emory!amc.com!brennan (Michael Brennan) +Subject: C++ comments in awk programs +To: arnold@gnu.ai.mit.edu (Arnold Robbins) +Date: Wed, 4 Sep 1996 08:11:48 -0700 (PDT) + +@end ignore +@display +@i{It's kind of fun to put comments like this in your awk code.} + @code{// Do C++ comments work? answer: yes! of course} +Michael Brennan +@end display +@sp 1 + +There are two other freely available @code{awk} implementations. +This section briefly describes where to get them. + +@table @asis +@cindex Kernighan, Brian +@cindex anonymous @code{ftp} +@cindex @code{ftp}, anonymous +@item Unix @code{awk} +Brian Kernighan has been able to make his implementation of +@code{awk} freely available. You can get it via anonymous @code{ftp} +to the host @code{@w{netlib.att.com}}. Change directory to +@file{/netlib/research}. Use ``binary'' or ``image'' mode, and +retrieve @file{awk.bundle.Z}. + +This is a shell archive that has been compressed with the @code{compress} +utility. 
It can be uncompressed with either @code{uncompress} or the +GNU @code{gunzip} utility. + +This version requires an ANSI C compiler; GCC (the GNU C compiler) +works quite nicely. + +@cindex Brennan, Michael +@cindex @code{mawk} +@item @code{mawk} +Michael Brennan has written an independent implementation of @code{awk}, +called @code{mawk}. It is available under the GPL +(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}), +just as @code{gawk} is. + +You can get it via anonymous @code{ftp} to the host +@code{@w{ftp.whidbey.net}}. Change directory to @file{/pub/brennan}. +Use ``binary'' or ``image'' mode, and retrieve @file{mawk1.3.3.tar.gz} +(or the latest version that is there). + +@code{gunzip} may be used to decompress this file. Installation +is similar to @code{gawk}'s +(@pxref{Unix Installation, , Compiling and Installing @code{gawk} on Unix}). +@end table + +@node Notes, Glossary, Installation, Top +@appendix Implementation Notes + +This appendix contains information mainly of interest to implementors and +maintainers of @code{gawk}. Everything in it applies specifically to +@code{gawk}, and not to other implementations. + +@menu +* Compatibility Mode:: How to disable certain @code{gawk} extensions. +* Additions:: Making Additions To @code{gawk}. +* Future Extensions:: New features that may be implemented one day. +* Improvements:: Suggestions for improvements by volunteers. +@end menu + +@node Compatibility Mode, Additions, Notes, Notes +@appendixsec Downward Compatibility and Debugging + +@xref{POSIX/GNU, ,Extensions in @code{gawk} Not in POSIX @code{awk}}, +for a summary of the GNU extensions to the @code{awk} language and program. +All of these features can be turned off by invoking @code{gawk} with the +@samp{--traditional} option, or with the @samp{--posix} option. 
+ +If @code{gawk} is compiled for debugging with @samp{-DDEBUG}, then there +is one more option available on the command line: + +@table @code +@item -W parsedebug +@itemx --parsedebug +Print out the parse stack information as the program is being parsed. +@end table + +This option is intended only for serious @code{gawk} developers, +and not for the casual user. It probably has not even been compiled into +your version of @code{gawk}, since it slows down execution. + +@node Additions, Future Extensions, Compatibility Mode, Notes +@appendixsec Making Additions to @code{gawk} + +If you should find that you wish to enhance @code{gawk} in a significant +fashion, you are perfectly free to do so. That is the point of having +free software; the source code is available, and you are free to change +it as you wish (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}). + +This section discusses the ways you might wish to change @code{gawk}, +and any considerations you should bear in mind. + +@menu +* Adding Code:: Adding code to the main body of @code{gawk}. +* New Ports:: Porting @code{gawk} to a new operating system. +@end menu + +@node Adding Code, New Ports, Additions, Additions +@appendixsubsec Adding New Features + +@cindex adding new features +@cindex features, adding +You are free to add any new features you like to @code{gawk}. +However, if you want your changes to be incorporated into the @code{gawk} +distribution, there are several steps that you need to take in order to +make it possible for me to include your changes. + +@enumerate 1 +@item +Get the latest version. +It is much easier for me to integrate changes if they are relative to +the most recent distributed version of @code{gawk}. If your version of +@code{gawk} is very old, I may not be able to integrate them at all. +@xref{Getting, ,Getting the @code{gawk} Distribution}, +for information on getting the latest version of @code{gawk}. + +@item +@iftex +Follow the @cite{GNU Coding Standards}. 
+@end iftex +@ifinfo +See @inforef{Top, , Version, standards, GNU Coding Standards}. +@end ifinfo +This document describes how GNU software should be written. If you haven't +read it, please do so, preferably @emph{before} starting to modify @code{gawk}. +(The @cite{GNU Coding Standards} are available as part of the Autoconf +distribution, from the FSF.) + +@cindex @code{gawk} coding style +@cindex coding style used in @code{gawk} +@item +Use the @code{gawk} coding style. +The C code for @code{gawk} follows the instructions in the +@cite{GNU Coding Standards}, with minor exceptions. The code is formatted +using the traditional ``K&R'' style, particularly as regards the placement +of braces and the use of tabs. In brief, the coding rules for @code{gawk} +are: + +@itemize @bullet +@item +Use old style (non-prototype) function headers when defining functions. + +@item +Put the name of the function at the beginning of its own line. + +@item +Put the return type of the function, even if it is @code{int}, on the +line above the line with the name and arguments of the function. + +@item +The declarations for the function arguments should not be indented. + +@item +Put spaces around parentheses used in control structures +(@code{if}, @code{while}, @code{for}, @code{do}, @code{switch} +and @code{return}). + +@item +Do not put spaces in front of parentheses used in function calls. + +@item +Put spaces around all C operators, and after commas in function calls. + +@item +Do not use the comma operator to produce multiple side-effects, except +in @code{for} loop initialization and increment parts, and in macro bodies. + +@item +Use real tabs for indenting, not spaces. + +@item +Use the ``K&R'' brace layout style. + +@item +Use comparisons against @code{NULL} and @code{'\0'} in the conditions of +@code{if}, @code{while} and @code{for} statements, and in the @code{case}s +of @code{switch} statements, instead of just the +plain pointer or character value. 
+ +@item +Use the @code{TRUE}, @code{FALSE}, and @code{NULL} symbolic constants, +and the character constant @code{'\0'} where appropriate, instead of @code{1} +and @code{0}. + +@item +Provide one-line descriptive comments for each function. + +@item +Do not use @samp{#elif}. Many older Unix C compilers cannot handle it. + +@item +Do not use the @code{alloca} function for allocating memory off the stack. +Its use causes more portability trouble than the minor benefit of not having +to free the storage. Instead, use @code{malloc} and @code{free}. +@end itemize + +If I have to reformat your code to follow the coding style used in +@code{gawk}, I may not bother. + +@item +Be prepared to sign the appropriate paperwork. +In order for the FSF to distribute your changes, you must either place +those changes in the public domain, and submit a signed statement to that +effect, or assign the copyright in your changes to the FSF. +Both of these actions are easy to do, and @emph{many} people have done so +already. If you have questions, please contact me +(@pxref{Bugs, , Reporting Problems and Bugs}), +or @code{gnu@@prep.ai.mit.edu}. + +@item +Update the documentation. +Along with your new code, please supply new sections and/or chapters +for this @value{DOCUMENT}. If at all possible, please use real +Texinfo, instead of just supplying unformatted ASCII text (although +even that is better than no documentation at all). +Conventions to be followed in @cite{@value{TITLE}} are provided +after the @samp{@@bye} at the end of the Texinfo source file. +If possible, please update the man page as well. + +You will also have to sign paperwork for your documentation changes. + +@item +Submit changes as context diffs or unified diffs. +Use @samp{diff -c -r -N} or @samp{diff -u -r -N} to compare +the original @code{gawk} source tree with your version. +(I find context diffs to be more readable, but unified diffs are +more compact.) +I recommend using the GNU version of @code{diff}. 
+Send the output produced by either run of @code{diff} to me when you +submit your changes. +@xref{Bugs, , Reporting Problems and Bugs}, for the electronic mail +information. + +Using this format makes it easy for me to apply your changes to the +master version of the @code{gawk} source code (using @code{patch}). +If I have to apply the changes manually, using a text editor, I may +not do so, particularly if there are lots of changes. +@end enumerate + +Although this sounds like a lot of work, please remember that while you +may write the new code, I have to maintain it and support it, and if it +isn't possible for me to do that with a minimum of extra work, then I +probably will not. + +@node New Ports, , Adding Code, Additions +@appendixsubsec Porting @code{gawk} to a New Operating System + +@cindex porting @code{gawk} +If you wish to port @code{gawk} to a new operating system, there are +several steps to follow. + +@enumerate 1 +@item +Follow the guidelines in +@ref{Adding Code, ,Adding New Features}, +concerning coding style, submission of diffs, and so on. + +@item +When doing a port, bear in mind that your code must co-exist peacefully +with the rest of @code{gawk}, and the other ports. Avoid gratuitous +changes to the system-independent parts of the code. If at all possible, +avoid sprinkling @samp{#ifdef}s just for your port throughout the +code. + +If the changes needed for a particular system affect too much of the +code, I probably will not accept them. In such a case, you will, of course, +be able to distribute your changes on your own, as long as you comply +with the GPL +(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}). + +@item +A number of the files that come with @code{gawk} are maintained by other +people at the Free Software Foundation. Thus, you should not change them +unless it is for a very good reason. I.e.@: changes are not out of the +question, but changes to these files will be scrutinized extra carefully. 
+The files are @file{alloca.c}, @file{getopt.h}, @file{getopt.c}, +@file{getopt1.c}, @file{regex.h}, @file{regex.c}, @file{dfa.h}, +@file{dfa.c}, @file{install-sh}, and @file{mkinstalldirs}. + +@item +Be willing to continue to maintain the port. +Non-Unix operating systems are supported by volunteers who maintain +the code needed to compile and run @code{gawk} on their systems. If no-one +volunteers to maintain a port, that port becomes unsupported, and it may +be necessary to remove it from the distribution. + +@item +Supply an appropriate @file{gawkmisc.???} file. +Each port has its own @file{gawkmisc.???} that implements certain +operating system specific functions. This is cleaner than a plethora of +@samp{#ifdef}s scattered throughout the code. The @file{gawkmisc.c} in +the main source directory includes the appropriate +@file{gawkmisc.???} file from each subdirectory. +Be sure to update it as well. + +Each port's @file{gawkmisc.???} file has a suffix reminiscent of the machine +or operating system for the port. For example, @file{pc/gawkmisc.pc} and +@file{vms/gawkmisc.vms}. The use of separate suffixes, instead of plain +@file{gawkmisc.c}, makes it possible to move files from a port's subdirectory +into the main subdirectory, without accidentally destroying the real +@file{gawkmisc.c} file. (Currently, this is only an issue for the MS-DOS +and OS/2 ports.) + +@item +Supply a @file{Makefile} and any other C source and header files that are +necessary for your operating system. All your code should be in a +separate subdirectory, with a name that is the same as, or reminiscent +of, either your operating system or the computer system. If possible, +try to structure things so that it is not necessary to move files out +of the subdirectory into the main source directory. If that is not +possible, then be sure to avoid using names for your files that +duplicate the names of files in the main source directory. + +@item +Update the documentation. 
+Please write a section (or sections) for this @value{DOCUMENT} describing the +installation and compilation steps needed to install and/or compile +@code{gawk} for your system. + +@item +Be prepared to sign the appropriate paperwork. +In order for the FSF to distribute your code, you must either place +your code in the public domain, and submit a signed statement to that +effect, or assign the copyright in your code to the FSF. +@ifinfo +Both of these actions are easy to do, and @emph{many} people have done so +already. If you have questions, please contact me, or +@code{gnu@@prep.ai.mit.edu}. +@end ifinfo +@end enumerate + +Following these steps will make it much easier to integrate your changes +into @code{gawk}, and have them co-exist happily with the code for other +operating systems that is already there. + +In the code that you supply, and that you maintain, feel free to use a +coding style and brace layout that suits your taste. + +@node Future Extensions, Improvements, Additions, Notes +@appendixsec Probable Future Extensions +@ignore +From emory!scalpel.netlabs.com!lwall Tue Oct 31 12:43:17 1995 +Return-Path: <emory!scalpel.netlabs.com!lwall> +Message-Id: <9510311732.AA28472@scalpel.netlabs.com> +To: arnold@skeeve.atl.ga.us (Arnold D. Robbins) +Subject: Re: May I quote you? +In-Reply-To: Your message of "Tue, 31 Oct 95 09:11:00 EST." + <m0tAHPQ-00014MC@skeeve.atl.ga.us> +Date: Tue, 31 Oct 95 09:32:46 -0800 +From: Larry Wall <emory!scalpel.netlabs.com!lwall> + +: Greetings. I am working on the release of gawk 3.0. Part of it will be a +: thoroughly updated manual. One of the sections deals with planned future +: extensions and enhancements. 
I have the following at the beginning +: of it: +: +: @cindex PERL +: @cindex Wall, Larry +: @display +: @i{AWK is a language similar to PERL, only considerably more elegant.} @* +: Arnold Robbins +: @sp 1 +: @i{Hey!} @* +: Larry Wall +: @end display +: +: Before I actually release this for publication, I wanted to get your +: permission to quote you. (Hopefully, in the spirit of much of GNU, the +: implied humor is visible... :-) + +I think that would be fine. + +Larry +@end ignore +@cindex PERL +@cindex Wall, Larry +@display +@i{AWK is a language similar to PERL, only considerably more elegant.} +Arnold Robbins + +@i{Hey!} +Larry Wall +@end display +@sp 1 + +This section briefly lists extensions and possible improvements +that indicate the directions we are +currently considering for @code{gawk}. The file @file{FUTURES} in the +@code{gawk} distribution lists these extensions as well. + +This is a list of probable future changes that will be usable by the +@code{awk} language programmer. + +@c these are ordered by likelihood +@table @asis +@item Localization +The GNU project is starting to support multiple languages. +It will at least be possible to make @code{gawk} print its warnings and +error messages in languages other than English. +It may be possible for @code{awk} programs to also use the multiple +language facilities, separate from @code{gawk} itself. + +@item Databases +It may be possible to map a GDBM/NDBM/SDBM file into an @code{awk} array. + +@item A @code{PROCINFO} Array +The special files that provide process-related information +(@pxref{Special Files, ,Special File Names in @code{gawk}}) +may be superseded by a @code{PROCINFO} array that would provide the same +information, in an easier to access fashion. + +@item More @code{lint} warnings +There are more things that could be checked for portability. 
+ +@item Control of subprocess environment +Changes made in @code{gawk} to the array @code{ENVIRON} may be +propagated to subprocesses run by @code{gawk}. + +@ignore +@item @code{RECLEN} variable for fixed length records +Along with @code{FIELDWIDTHS}, this would speed up the processing of +fixed-length records. + +@item A @code{restart} keyword +After modifying @code{$0}, @code{restart} would restart the pattern +matching loop, without reading a new record from the input. + +@item A @samp{|&} redirection +The @samp{|&} redirection, in place of @samp{|}, would open a two-way +pipeline for communication with a sub-process (via @code{getline} and +@code{print} and @code{printf}). + +@item Function valued variables +It would be possible to assign the name of a user-defined or built-in +function to a regular @code{awk} variable, and then call the function +indirectly, by using the regular variable. This would make it possible +to write general purpose sorting and comparing routines, for example, +by simply passing the name of one function into another. + +@item A built-in @code{stat} function +The @code{stat} function would provide an easy-to-use hook to the +@code{stat} system call so that @code{awk} programs could determine information +about files. + +@item A built-in @code{ftw} function +Combined with function valued variables and the @code{stat} function, +@code{ftw} (file tree walk) would make it easy for an @code{awk} program +to walk an entire file tree. +@end ignore +@end table + +This is a list of probable improvements that will make @code{gawk} +perform better. + +@table @asis +@item An Improved Version of @code{dfa} +The @code{dfa} pattern matcher from GNU @code{grep} has some +problems. Either a new version or a fixed one will deal with some +important regexp matching issues. + +@item Use of GNU @code{malloc} +The GNU version of @code{malloc} could potentially speed up @code{gawk}, +since it relies heavily on the use of dynamic memory allocation. 
+ +@item Use of the @code{rx} regexp library +The @code{rx} regular expression library could potentially speed up +all regexp operations that require knowing the exact location of matches. +This includes record termination, field and array splitting, +and the @code{sub}, @code{gsub}, @code{gensub} and @code{match} functions. +@end table + +@node Improvements, , Future Extensions, Notes +@appendixsec Suggestions for Improvements + +Here are some projects that would-be @code{gawk} hackers might like to take +on. They vary in size from a few days to a few weeks of programming, +depending on which one you choose and how fast a programmer you are. Please +send any improvements you write to the maintainers at the GNU project. +@xref{Adding Code, , Adding New Features}, +for guidelines to follow when adding new features to @code{gawk}. +@xref{Bugs, ,Reporting Problems and Bugs}, for information on +contacting the maintainers. + +@enumerate +@item +Compilation of @code{awk} programs: @code{gawk} uses a Bison (YACC-like) +parser to convert the script given it into a syntax tree; the syntax +tree is then executed by a simple recursive evaluator. This method incurs +a lot of overhead, since the recursive evaluator performs many procedure +calls to do even the simplest things. + +It should be possible for @code{gawk} to convert the script's parse tree +into a C program which the user would then compile, using the normal +C compiler and a special @code{gawk} library to provide all the needed +functions (regexps, fields, associative arrays, type coercion, and so +on). + +An easier possibility might be for an intermediate phase of @code{awk} to +convert the parse tree into a linear byte code form like the one used +in GNU Emacs Lisp. The recursive evaluator would then be replaced by +a straight line byte code interpreter that would be intermediate in speed +between running a compiled program and doing what @code{gawk} does +now. 
+ +@item +The programs in the test suite could use documenting in this @value{DOCUMENT}. + +@item +See the @file{FUTURES} file for more ideas. Contact us if you would +seriously like to tackle any of the items listed there. +@end enumerate + +@node Glossary, Copying, Notes, Top +@appendix Glossary + +@table @asis +@item Action +A series of @code{awk} statements attached to a rule. If the rule's +pattern matches an input record, @code{awk} executes the +rule's action. Actions are always enclosed in curly braces. +@xref{Action Overview, ,Overview of Actions}. + +@item Amazing @code{awk} Assembler +Henry Spencer at the University of Toronto wrote a retargetable assembler +completely as @code{awk} scripts. It is thousands of lines long, including +machine descriptions for several eight-bit microcomputers. +It is a good example of a +program that would have been better written in another language. + +@item Amazingly Workable Formatter (@code{awf}) +Henry Spencer at the University of Toronto wrote a formatter that accepts +a large subset of the @samp{nroff -ms} and @samp{nroff -man} formatting +commands, using @code{awk} and @code{sh}. + +@item ANSI +The American National Standards Institute. This organization produces +many standards, among them the standards for the C and C++ programming +languages. + +@item Assignment +An @code{awk} expression that changes the value of some @code{awk} +variable or data object. An object that you can assign to is called an +@dfn{lvalue}. The assigned values are called @dfn{rvalues}. +@xref{Assignment Ops, ,Assignment Expressions}. + +@item @code{awk} Language +The language in which @code{awk} programs are written. + +@item @code{awk} Program +An @code{awk} program consists of a series of @dfn{patterns} and +@dfn{actions}, collectively known as @dfn{rules}. For each input record +given to the program, the program's rules are all processed in turn. +@code{awk} programs may also contain function definitions. 
+ +@item @code{awk} Script +Another name for an @code{awk} program. + +@item Bash +The GNU version of the standard shell (the Bourne-Again shell). +See ``Bourne Shell.'' + +@item BBS +See ``Bulletin Board System.'' + +@item Boolean Expression +Named after the English mathematician Boole. See ``Logical Expression.'' + +@item Bourne Shell +The standard shell (@file{/bin/sh}) on Unix and Unix-like systems, +originally written by Stephen R.@: Bourne. +Many shells (Bash, @code{ksh}, @code{pdksh}, @code{zsh}) are +generally upwardly compatible with the Bourne shell. + +@item Built-in Function +The @code{awk} language provides built-in functions that perform various +numerical, time stamp related, and string computations. Examples are +@code{sqrt} (for the square root of a number) and @code{substr} (for a +substring of a string). @xref{Built-in, ,Built-in Functions}. + +@item Built-in Variable +@code{ARGC}, @code{ARGIND}, @code{ARGV}, @code{CONVFMT}, @code{ENVIRON}, +@code{ERRNO}, @code{FIELDWIDTHS}, @code{FILENAME}, @code{FNR}, @code{FS}, +@code{IGNORECASE}, @code{NF}, @code{NR}, @code{OFMT}, @code{OFS}, @code{ORS}, +@code{RLENGTH}, @code{RSTART}, @code{RS}, @code{RT}, and @code{SUBSEP}, +are the variables that have special meaning to @code{awk}. +Changing some of them affects @code{awk}'s running environment. +Several of these variables are specific to @code{gawk}. +@xref{Built-in Variables}. + +@item Braces +See ``Curly Braces.'' + +@item Bulletin Board System +A computer system allowing users to log in and read and/or leave messages +for other users of the system, much like leaving paper notes on a bulletin +board. + +@item C +The system programming language that most GNU software is written in. The +@code{awk} programming language has C-like syntax, and this @value{DOCUMENT} +points out similarities between @code{awk} and C when appropriate. 
+ +@cindex ISO 8859-1 +@cindex ISO Latin-1 +@item Character Set +The set of numeric codes used by a computer system to represent the +characters (letters, numbers, punctuation, etc.) of a particular country +or place. The most common character set in use today is ASCII (American +Standard Code for Information Interchange). Many European +countries use an extension of ASCII known as ISO-8859-1 (ISO Latin-1). + +@item CHEM +A preprocessor for @code{pic} that reads descriptions of molecules +and produces @code{pic} input for drawing them. It was written in @code{awk} +by Brian Kernighan and Jon Bentley, and is available from +@code{@w{netlib@@research.att.com}}. + +@item Compound Statement +A series of @code{awk} statements, enclosed in curly braces. Compound +statements may be nested. +@xref{Statements, ,Control Statements in Actions}. + +@item Concatenation +Concatenating two strings means sticking them together, one after another, +giving a new string. For example, the string @samp{foo} concatenated with +the string @samp{bar} gives the string @samp{foobar}. +@xref{Concatenation, ,String Concatenation}. + +@item Conditional Expression +An expression using the @samp{?:} ternary operator, such as +@samp{@var{expr1} ? @var{expr2} : @var{expr3}}. The expression +@var{expr1} is evaluated; if the result is true, the value of the whole +expression is the value of @var{expr2}, otherwise the value is +@var{expr3}. In either case, only one of @var{expr2} and @var{expr3} +is evaluated. @xref{Conditional Exp, ,Conditional Expressions}. + +@item Comparison Expression +A relation that is either true or false, such as @samp{(a < b)}. +Comparison expressions are used in @code{if}, @code{while}, @code{do}, +and @code{for} +statements, and in patterns to select which input records to process. +@xref{Typing and Comparison, ,Variable Typing and Comparison Expressions}. + +@item Curly Braces +The characters @samp{@{} and @samp{@}}. 
Curly braces are used in +@code{awk} for delimiting actions, compound statements, and function +bodies. + +@item Dark Corner +An area in the language where specifications often were (or still +are) not clear, leading to unexpected or undesirable behavior. +Such areas are marked in this @value{DOCUMENT} with ``(d.c.)'' in the +text, and are indexed under the heading ``dark corner.'' + +@item Data Objects +These are numbers and strings of characters. Numbers are converted into +strings and vice versa, as needed. +@xref{Conversion, ,Conversion of Strings and Numbers}. + +@item Double Precision +An internal representation of numbers that can have fractional parts. +Double precision numbers keep track of more digits than do single precision +numbers, but operations on them are more expensive. This is the way +@code{awk} stores numeric values. It is the C type @code{double}. + +@item Dynamic Regular Expression +A dynamic regular expression is a regular expression written as an +ordinary expression. It could be a string constant, such as +@code{"foo"}, but it may also be an expression whose value can vary. +@xref{Computed Regexps, , Using Dynamic Regexps}. + +@item Environment +A collection of strings, of the form @var{name@code{=}val}, that each +program has available to it. Users generally place values into the +environment in order to provide information to various programs. Typical +examples are the environment variables @code{HOME} and @code{PATH}. + +@item Empty String +See ``Null String.'' + +@item Escape Sequences +A special sequence of characters used for describing non-printing +characters, such as @samp{\n} for newline, or @samp{\033} for the ASCII +ESC (escape) character. @xref{Escape Sequences}. + +@item Field +When @code{awk} reads an input record, it splits the record into pieces +separated by whitespace (or by a separator regexp which you can +change by setting the built-in variable @code{FS}). Such pieces are +called fields. 
If the pieces are of fixed length, you can use the built-in +variable @code{FIELDWIDTHS} to describe their lengths. +@xref{Field Separators, ,Specifying How Fields are Separated}, +and also see +@ref{Constant Size, , Reading Fixed-width Data}. + +@item Floating Point Number +Often referred to in mathematical terms as a ``rational'' number, this is +just a number that can have a fractional part. +See ``Double Precision'' and ``Single Precision.'' + +@item Format +Format strings are used to control the appearance of output in the +@code{printf} statement. Also, data conversions from numbers to strings +are controlled by the format string contained in the built-in variable +@code{CONVFMT}. @xref{Control Letters, ,Format-Control Letters}. + +@item Function +A specialized group of statements used to encapsulate general +or program-specific tasks. @code{awk} has a number of built-in +functions, and also allows you to define your own. +@xref{Built-in, ,Built-in Functions}, +and @ref{User-defined, ,User-defined Functions}. + +@item FSF +See ``Free Software Foundation.'' + +@item Free Software Foundation +A non-profit organization dedicated +to the production and distribution of freely distributable software. +It was founded by Richard M.@: Stallman, the author of the original +Emacs editor. GNU Emacs is the most widely used version of Emacs today. + +@item @code{gawk} +The GNU implementation of @code{awk}. + +@item General Public License +This document describes the terms under which @code{gawk} and its source +code may be distributed. (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}) + +@item GNU +``GNU's not Unix''. An on-going project of the Free Software Foundation +to create a complete, freely distributable, POSIX-compliant computing +environment. 
+ +@item GPL +See ``General Public License.'' + +@item Hexadecimal +Base 16 notation, where the digits are @code{0}-@code{9} and +@code{A}-@code{F}, with @samp{A} +representing 10, @samp{B} representing 11, and so on up to @samp{F} for 15. +Hexadecimal numbers are written in C using a leading @samp{0x}, +to indicate their base. Thus, @code{0x12} is 18 (one times 16 plus 2). + +@item I/O +Abbreviation for ``Input/Output,'' the act of moving data into and/or +out of a running program. + +@item Input Record +A single chunk of data read in by @code{awk}. Usually, an @code{awk} input +record consists of one line of text. +@xref{Records, ,How Input is Split into Records}. + +@item Integer +A whole number, i.e.@: a number that does not have a fractional part. + +@item Keyword +In the @code{awk} language, a keyword is a word that has special +meaning. Keywords are reserved and may not be used as variable names. + +@code{gawk}'s keywords are: +@code{BEGIN}, +@code{END}, +@code{if}, +@code{else}, +@code{while}, +@code{do@dots{}while}, +@code{for}, +@code{for@dots{}in}, +@code{break}, +@code{continue}, +@code{delete}, +@code{next}, +@code{nextfile}, +@code{function}, +@code{func}, +and @code{exit}. + +@item Logical Expression +An expression using the operators for logic, AND, OR, and NOT, written +@samp{&&}, @samp{||}, and @samp{!} in @code{awk}. Often called Boolean +expressions, after the mathematician who pioneered this kind of +mathematical logic. + +@item Lvalue +An expression that can appear on the left side of an assignment +operator. In most languages, lvalues can be variables or array +elements. In @code{awk}, a field designator can also be used as an +lvalue. + +@item Null String +A string with no characters in it. It is represented explicitly in +@code{awk} programs by placing two double-quote characters next to +each other (@code{""}). It can appear in input data by having two successive +occurrences of the field separator appear next to each other. 
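The second case in the Null String entry above, a null field produced by two successive field separators, can be sketched as follows. This is only an illustration, assuming a POSIX-style @code{awk} is available on the search path; the shell function name @code{count_null_fields} is made up for the example.

```shell
# Count the null (empty) fields in each comma-separated input record.
# The function name count_null_fields is made up for this illustration.
count_null_fields() {
    awk -F, '{
        n = 0
        for (i = 1; i <= NF; i++)
            if ($i == "")      # "" is the null string
                n++
        print n
    }'
}

# Two successive commas leave a null second field, so this prints 1.
printf 'a,,c\n' | count_null_fields
```

Using a single comma as the separator keeps the example away from the default whitespace splitting, where runs of spaces and tabs count as one separator and so never yield a null field.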
+ +@item Number +A numeric valued data object. The @code{gawk} implementation uses double +precision floating point to represent numbers. +Very old @code{awk} implementations use single precision floating +point. + +@item Octal +Base-eight notation, where the digits are @code{0}-@code{7}. +Octal numbers are written in C using a leading @samp{0}, +to indicate their base. Thus, @code{013} is 11 (one times 8 plus 3). + +@item Pattern +Patterns tell @code{awk} which input records are interesting to which +rules. + +A pattern is an arbitrary conditional expression against which input is +tested. If the condition is satisfied, the pattern is said to @dfn{match} +the input record. A typical pattern might compare the input record against +a regular expression. @xref{Pattern Overview, ,Pattern Elements}. + +@item POSIX +The name for a series of standards being developed by the IEEE +that specify a Portable Operating System interface. The ``IX'' denotes +the Unix heritage of these standards. The main standard of interest for +@code{awk} users is +@cite{IEEE Standard for Information Technology, Standard 1003.2-1992, +Portable Operating System Interface (POSIX) Part 2: Shell and Utilities}. +Informally, this standard is often referred to as simply ``P1003.2.'' + +@item Private +Variables and/or functions that are meant for use exclusively by library +functions, and not for the main @code{awk} program. Special care must be +taken when naming such variables and functions. +@xref{Library Names, , Naming Library Function Global Variables}. + +@item Range (of input lines) +A sequence of consecutive lines from the input file. A pattern +can specify ranges of input lines for @code{awk} to process, or it can +specify single lines. @xref{Pattern Overview, ,Pattern Elements}. + +@item Recursion +When a function calls itself, either directly or indirectly. 
+If this isn't clear, refer to the entry for ``recursion.'' + +@item Redirection +Redirection means performing input from other than the standard input +stream, or output to other than the standard output stream. + +You can redirect the output of the @code{print} and @code{printf} statements +to a file or a system command, using the @samp{>}, @samp{>>}, and @samp{|} +operators. You can redirect input to the @code{getline} statement using +the @samp{<} and @samp{|} operators. +@xref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}, +and @ref{Getline, ,Explicit Input with @code{getline}}. + +@item Regexp +Short for @dfn{regular expression}. A regexp is a pattern that denotes a +set of strings, possibly an infinite set. For example, the regexp +@samp{R.*xp} matches any string starting with the letter @samp{R} +and ending with the letters @samp{xp}. In @code{awk}, regexps are +used in patterns and in conditional expressions. Regexps may contain +escape sequences. @xref{Regexp, ,Regular Expressions}. + +@item Regular Expression +See ``regexp.'' + +@item Regular Expression Constant +A regular expression constant is a regular expression written within +slashes, such as @code{/foo/}. This regular expression is chosen +when you write the @code{awk} program, and cannot be changed during +its execution. @xref{Regexp Usage, ,How to Use Regular Expressions}. + +@item Rule +A segment of an @code{awk} program that specifies how to process single +input records. A rule consists of a @dfn{pattern} and an @dfn{action}. +@code{awk} reads an input record; then, for each rule, if the input record +satisfies the rule's pattern, @code{awk} executes the rule's action. +Otherwise, the rule does nothing for that input record. + +@item Rvalue +A value that can appear on the right side of an assignment operator. +In @code{awk}, essentially every expression has a value. These values +are rvalues. 
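The Regexp and Rule entries above can be illustrated together. The sketch below, which assumes a POSIX-style @code{awk} on the search path, uses a single rule whose pattern is the regexp constant @code{/R.*xp/} and whose action prints the record. The shell function name @code{match_regexp_records} is an assumption made for the example.

```shell
# Print only the input records that match the regexp R.*xp.
# The function name match_regexp_records is an illustrative assumption.
match_regexp_records() {
    awk '/R.*xp/ { print }'
}

# The first and third records match; the second is skipped.
printf 'Regexp\nplain text\nRxp\n' | match_regexp_records
```

A regexp used as a pattern matches when a suitable string occurs anywhere in the record, so a record such as @samp{a Regexp here} would also be printed.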
+ +@item @code{sed} +See ``Stream Editor.'' + +@item Short-Circuit +The nature of the @code{awk} logical operators @samp{&&} and @samp{||}. +If the value of the entire expression can be deduced from evaluating just +the left-hand side of these operators, the right-hand side will not +be evaluated +(@pxref{Boolean Ops, ,Boolean Expressions}). + +@item Side Effect +A side effect occurs when an expression has an effect aside from merely +producing a value. Assignment expressions, increment and decrement +expressions and function calls have side effects. +@xref{Assignment Ops, ,Assignment Expressions}. + +@item Single Precision +An internal representation of numbers that can have fractional parts. +Single precision numbers keep track of fewer digits than do double precision +numbers, but operations on them are less expensive in terms of CPU time. +This is the type used by some very old versions of @code{awk} to store +numeric values. It is the C type @code{float}. + +@item Space +The character generated by hitting the space bar on the keyboard. + +@item Special File +A file name interpreted internally by @code{gawk}, instead of being handed +directly to the underlying operating system. For example, @file{/dev/stderr}. +@xref{Special Files, ,Special File Names in @code{gawk}}. + +@item Stream Editor +A program that reads records from an input stream and processes them one +or more at a time. This is in contrast with batch programs, which may +expect to read their input files in entirety before starting to do +anything, and with interactive programs, which require input from the +user. + +@item String +A datum consisting of a sequence of characters, such as @samp{I am a +string}. Constant strings are written with double-quotes in the +@code{awk} language, and may contain escape sequences. +@xref{Escape Sequences}. + +@item Tab +The character generated by hitting the @kbd{TAB} key on the keyboard. +It usually expands to up to eight spaces upon output. 
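Returning to the Short-Circuit and Side Effect entries above, the following sketch makes the skipped evaluation visible by placing an increment, which has a side effect, on the right-hand side of each operator. It assumes a POSIX-style @code{awk} on the search path; the names @code{short_circuit_demo} and @code{calls} are made up for the example.

```shell
# Show that the right-hand side of && and || is skipped when the
# left-hand side already decides the result. The names
# short_circuit_demo and calls are illustrative assumptions.
short_circuit_demo() {
    awk 'BEGIN {
        calls = 0
        r1 = (0 && ++calls)    # left side false: ++calls is not evaluated
        r2 = (1 || ++calls)    # left side true: ++calls is not evaluated
        r3 = (1 && ++calls)    # left side true: ++calls is evaluated
        print calls
    }'
}

# Only the third expression increments the counter, so this prints 1.
short_circuit_demo
```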
+ +@item Unix +A computer operating system originally developed in the early 1970's at +AT&T Bell Laboratories. It initially became popular in universities around +the world, and later moved into commercial environments as a software +development system and network server system. There are many commercial +versions of Unix, as well as several work-alike systems whose source code +is freely available (such as Linux, NetBSD, and FreeBSD). + +@item Whitespace +A sequence of space, tab, or newline characters occurring inside an input +record or a string. +@end table + +@node Copying, Index, Glossary, Top +@unnumbered GNU GENERAL PUBLIC LICENSE +@center Version 2, June 1991 + +@display +Copyright @copyright{} 1989, 1991 Free Software Foundation, Inc. +59 Temple Place --- Suite 330, Boston, MA 02111-1307, USA + +Everyone is permitted to copy and distribute verbatim copies +of this license document, but changing it is not allowed. +@end display + +@c fakenode --- for prepinfo +@unnumberedsec Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software---to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Library General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. 
+ + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + +@iftex +@c fakenode --- for prepinfo +@unnumberedsec TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION +@end iftex +@ifinfo +@center TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION +@end ifinfo + +@enumerate 0 +@item +This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. 
The ``Program'', below, +refers to any such program or work, and a ``work based on the Program'' +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term ``modification''.) Each licensee is addressed as ``you''. + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + +@item +You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + +@item +You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + +@enumerate a +@item +You must cause the modified files to carry prominent notices +stating that you changed the files and the date of any change. 
+ +@item +You must cause any work that you distribute or publish, that in +whole or in part contains or is derived from the Program or any +part thereof, to be licensed as a whole at no charge to all third +parties under the terms of this License. + +@item +If the modified program normally reads commands interactively +when run, you must cause it, when started running for such +interactive use in the most ordinary way, to print or display an +announcement including an appropriate copyright notice and a +notice that there is no warranty (or else, saying that you provide +a warranty) and that users may redistribute the program under +these conditions, and telling the user how to view a copy of this +License. (Exception: if the Program itself is interactive but +does not normally print such an announcement, your work based on +the Program is not required to print an announcement.) +@end enumerate + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. 
+ +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + +@item +You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + +@enumerate a +@item +Accompany it with the complete corresponding machine-readable +source code, which must be distributed under the terms of Sections +1 and 2 above on a medium customarily used for software interchange; or, + +@item +Accompany it with a written offer, valid for at least three +years, to give any third party, for a charge no more than your +cost of physically performing source distribution, a complete +machine-readable copy of the corresponding source code, to be +distributed under the terms of Sections 1 and 2 above on a medium +customarily used for software interchange; or, + +@item +Accompany it with the information you received as to the offer +to distribute corresponding source code. (This alternative is +allowed only for non-commercial distribution and only if you +received the program in object code or executable form with such +an offer, in accord with Subsection b above.) +@end enumerate + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. 
However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + +@item +You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + +@item +You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + +@item +Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. 
+You are not responsible for enforcing compliance by third parties to +this License. + +@item +If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. 
+ +@item +If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + +@item +The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and ``any +later version'', you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + +@item +If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + +@iftex +@c fakenode --- for prepinfo +@heading NO WARRANTY +@end iftex +@ifinfo +@center NO WARRANTY +@end ifinfo + +@item +BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW@. 
EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM ``AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE@. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU@. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + +@item +IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. +@end enumerate + +@iftex +@c fakenode --- for prepinfo +@heading END OF TERMS AND CONDITIONS +@end iftex +@ifinfo +@center END OF TERMS AND CONDITIONS +@end ifinfo + +@page +@c fakenode --- for prepinfo +@unnumberedsec How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the ``copyright'' line and a pointer to where the full notice is found. 
+ +@smallexample +@var{one line to give the program's name and an idea of what it does.} +Copyright (C) 19@var{yy} @var{name of author} + +This program is free software; you can redistribute it and/or +modify it under the terms of the GNU General Public License +as published by the Free Software Foundation; either version 2 +of the License, or (at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE@. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with this program; if not, write to the Free Software +Foundation, Inc., 59 Temple Place --- Suite 330, Boston, MA 02111-1307, USA. +@end smallexample + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + +@smallexample +Gnomovision version 69, Copyright (C) 19@var{yy} @var{name of author} +Gnomovision comes with ABSOLUTELY NO WARRANTY; for details +type `show w'. This is free software, and you are welcome +to redistribute it under certain conditions; type `show c' +for details. +@end smallexample + +The hypothetical commands @samp{show w} and @samp{show c} should show +the appropriate parts of the General Public License. Of course, the +commands you use may be called something other than @samp{show w} and +@samp{show c}; they could even be mouse-clicks or menu items---whatever +suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a ``copyright disclaimer'' for the program, if +necessary. 
Here is a sample; alter the names: + +@smallexample +@group +Yoyodyne, Inc., hereby disclaims all copyright +interest in the program `Gnomovision' +(which makes passes at compilers) written +by James Hacker. + +@var{signature of Ty Coon}, 1 April 1989 +Ty Coon, President of Vice +@end group +@end smallexample + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Library General +Public License instead of this License. + +@node Index, , Copying, Top +@unnumbered Index +@printindex cp + +@summarycontents +@contents +@bye + +Unresolved Issues: +------------------ +1. From ADR. + + Robert J. Chassell points out that awk programs should have some indication + of how to use them. It would be useful to perhaps have a "programming + style" section of the manual that would include this and other tips. + +2. The default AWKPATH search path should be configurable via `configure' + The default and how this changes needs to be documented. + +Consistency issues: + /.../ regexps are in @code, not @samp + ".." strings are in @code, not @samp + no @print before @dots + values of expressions in the text (@code{x} has the value 15), + should be in roman, not @code + Use tab and not TAB + Use ESC and not ESCAPE + Use space and not blank to describe the space bar's character + The term "blank" is thus basically reserved for "blank lines" etc. + The `(d.c.)' should appear inside the closing `.' 
of a sentence + It should come before (pxref{...}) + " " should have an @w{} around it + Use "non-" everywhere + Use @code{ftp} when talking about anonymous ftp + Use upper-case and lower-case, not "upper case" and "lower case" + Use alphanumeric, not alpha-numeric + Use --foo, not -Wfoo when describing long options + Use findex for all programs and functions in the example chapters + Use "Bell Laboratories", but not "Bell Labs". + Use "behavior" instead of "behaviour". + Use "zeros" instead of "zeroes". + Use "Input/Output", not "input/output". Also "I/O", not "i/o". + Use @code{do}, and not @code{do}-@code{while}, except where + actually discussing the do-while. + The words "a", "and", "as", "between", "for", "from", "in", "of", + "on", "that", "the", "to", "with", and "without", + should not be capitalized in @chapter, @section etc. + "Into" and "How" should. + Search for @dfn; make sure important items are also indexed. + "e.g." should always be followed by a comma. + "i.e." should never be followed by a comma, and should be followed + by `@:'. + The numbers zero through ten should be spelled out, except when + talking about file descriptor numbers. > 10 and < 0, it's + ok to use numbers. + In tables, put command line options in @code, while in the text, + put them in @samp. + When using @strong, use "Note:" or "Caution:" with colons and + not exclamation points. Do not surround the paragraphs + with @quotation ... @end quotation. + +Date: Wed, 13 Apr 94 15:20:52 -0400 +From: rsm@gnu.ai.mit.edu (Richard Stallman) +To: gnu-prog@gnu.ai.mit.edu +Subject: A reminder: no pathnames in GNU + +It's a GNU convention to use the term "file name" for the name of a +file, never "pathname". We use the term "path" for search paths, +which are lists of file names. Using it for a single file name as +well is potentially confusing to users. + +So please check any documentation you maintain, if you think you might +have used "pathname". 
+ +Note that "file name" should be two words when it appears as ordinary +text. It's ok as one word when it's a metasyntactic variable, though. + +Suggestions: +------------ +Enhance FIELDWIDTHS with some way to indicate "the rest of the record". +E.g., a length of 0 or -1 or something. Maybe "n"? + +Make FIELDWIDTHS be an array? + +What if FIELDWIDTHS has invalid values in it?