author | ache <ache@FreeBSD.org> | 1995-01-12 01:30:34 +0000 |
---|---|---|
committer | ache <ache@FreeBSD.org> | 1995-01-12 01:30:34 +0000 |
commit | 1331e08127103e2c0ac04115309d9ce3e3e82029 (patch) | |
tree | e123aef43f837b795ee7048c7aefeea9b1a52919 /gnu/usr.bin/ptx | |
parent | a2eddc566d68b6a2143436a2f21176a1b1250c6d (diff) | |
Use -lgnuregex properly
Install infopages
Diffstat (limited to 'gnu/usr.bin/ptx')
-rw-r--r-- | gnu/usr.bin/ptx/Makefile | 2 |
-rw-r--r-- | gnu/usr.bin/ptx/doc/Makefile | 3 |
-rw-r--r-- | gnu/usr.bin/ptx/doc/ptx.texinfo | 554 |
-rw-r--r-- | gnu/usr.bin/ptx/ptx.c | 2 |
4 files changed, 559 insertions, 2 deletions
diff --git a/gnu/usr.bin/ptx/Makefile b/gnu/usr.bin/ptx/Makefile
index 340c09a..5a2bd7f 100644
--- a/gnu/usr.bin/ptx/Makefile
+++ b/gnu/usr.bin/ptx/Makefile
@@ -2,7 +2,7 @@
 PROG= ptx
 SRCS= argmatch.c diacrit.c error.c getopt.c getopt1.c ptx.c xmalloc.c
 LDADD+= -lgnuregex
-DPADD+= ${GNUREGEX}
+DPADD+= ${LIBGNUREGEX}
 CFLAGS+= -DHAVE_CONFIG_H -DDEFAULT_IGNORE_FILE=\"/usr/share/dict/eign\"
 NOMAN=
diff --git a/gnu/usr.bin/ptx/doc/Makefile b/gnu/usr.bin/ptx/doc/Makefile
new file mode 100644
index 0000000..ff514e2
--- /dev/null
+++ b/gnu/usr.bin/ptx/doc/Makefile
@@ -0,0 +1,3 @@
+INFO = ptx
+
+.include <bsd.info.mk>
diff --git a/gnu/usr.bin/ptx/doc/ptx.texinfo b/gnu/usr.bin/ptx/doc/ptx.texinfo
new file mode 100644
index 0000000..e690c55
--- /dev/null
+++ b/gnu/usr.bin/ptx/doc/ptx.texinfo
@@ -0,0 +1,554 @@
+\input texinfo @c -*-texinfo-*-
+@c %**start of header
+@setfilename ptx.info
+@settitle GNU @code{ptx} reference manual
+@finalout
+@c %**end of header
+
+@ifinfo
+This file documents the @code{ptx} command, which has the purpose of
+generated permuted indices for group of files.
+
+Copyright (C) 1990, 1991, 1993 by the Free Software Foundation, Inc.
+
+Permission is granted to make and distribute verbatim copies of
+this manual provided the copyright notice and this permission notice
+are preserved on all copies.
+
+@ignore
+Permission is granted to process this file through TeX and print the
+results, provided the printed document carries copying permission
+notice identical to this one except for the removal of this paragraph
+(this paragraph not being relevant to the printed manual).
+
+@end ignore
+Permission is granted to copy and distribute modified versions of this
+manual under the conditions for verbatim copying, provided that the entire
+resulting derived work is distributed under the terms of a permission
+notice identical to this one.
+
+Permission is granted to copy and distribute translations of this manual
+into another language, under the above conditions for modified versions,
+except that this permission notice may be stated in a translation approved
+by the Foundation.
+@end ifinfo
+
+@titlepage
+@title ptx
+@subtitle The GNU permuted indexer
+@subtitle Edition 0.3, for ptx version 0.3
+@subtitle November 1993
+@author by Francois Pinard
+
+@page
+@vskip 0pt plus 1filll
+Copyright @copyright{} 1990, 1991, 1993 Free Software Foundation, Inc.
+
+Permission is granted to make and distribute verbatim copies of
+this manual provided the copyright notice and this permission notice
+are preserved on all copies.
+
+Permission is granted to copy and distribute modified versions of this
+manual under the conditions for verbatim copying, provided that the entire
+resulting derived work is distributed under the terms of a permission
+notice identical to this one.
+
+Permission is granted to copy and distribute translations of this manual
+into another language, under the above conditions for modified versions,
+except that this permission notice may be stated in a translation approved
+by the Foundation.
+@end titlepage
+
+@node Top, Invoking ptx, (dir), (dir)
+@chapter Introduction
+
+This is the 0.3 beta release of @code{ptx}, the GNU version of a
+permuted index generator. This software has the main goal of providing
+a replacement for the traditional @code{ptx} as found on System V
+machines, able to handle small files quickly, while providing a platform
+for more development.
+
+This version reimplements and extends traditional @code{ptx}. Among
+other things, it can produce a readable @dfn{KWIC} (keywords in their
+context) without the need of @code{nroff}, there is also an option to
+produce @TeX{} compatible output. This version does not handle huge
+input files, that is, those files which do not fit in memory all at
+once.
+
+@emph{Please note} that an overall renaming of all options is
+foreseeable. In fact, GNU ptx specifications are not frozen yet.
+
+@menu
+* Invoking ptx:: How to use this program
+* Compatibility:: The GNU extensions to @code{ptx}
+
+ --- The Detailed Node Listing ---
+
+How to use this program
+
+* General options:: Options which affect general program behaviour.
+* Charset selection:: Underlying character set considerations.
+* Input processing:: Input fields, contexts, and keyword selection.
+* Output formatting:: Types of output format, and sizing the fields.
+@end menu
+
+@node Invoking ptx, Compatibility, Top, Top
+@chapter How to use this program
+
+This tool reads a text file and essentially produces a permuted index, with
+each keyword in its context. The calling sketch is one of:
+
+@example
+ptx [@var{option} @dots{}] [@var{file} @dots{}]
+@end example
+
+or:
+
+@example
+ptx -G [@var{option} @dots{}] [@var{input} [@var{output}]]
+@end example
+
+The @samp{-G} (or its equivalent: @samp{--traditional}) option disables
+all GNU extensions and revert to traditional mode, thus introducing some
+limitations, and changes several of the program's default option values.
+When @samp{-G} is not specified, GNU extensions are always enabled. GNU
+extensions to @code{ptx} are documented wherever appropriate in this
+document. See @xref{Compatibility} for an explicit list of them.
+
+Individual options are explained later in this document.
+
+When GNU extensions are enabled, there may be zero, one or several
+@var{file} after the options. If there is no @var{file}, the program
+reads the standard input. If there is one or several @var{file}, they
+give the name of input files which are all read in turn, as if all the
+input files were concatenated. However, there is a full contextual
+break between each file and, when automatic referencing is requested,
+file names and line numbers refer to individual text input files. In
+all cases, the program produces the permuted index onto the standard
+output.
+
+When GNU extensions are @emph{not} enabled, that is, when the program
+operates in traditional mode, there may be zero, one or two parameters
+besides the options. If there is no parameters, the program reads the
+standard input and produces the permuted index onto the standard output.
+If there is only one parameter, it names the text @var{input} to be read
+instead of the standard input. If two parameters are given, they give
+respectively the name of the @var{input} file to read and the name of
+the @var{output} file to produce. @emph{Be very careful} to note that,
+in this case, the contents of file given by the second parameter is
+destroyed. This behaviour is dictated only by System V @code{ptx}
+compatibility, because GNU Standards discourage output parameters not
+introduced by an option.
+
+Note that for @emph{any} file named as the value of an option or as an
+input text file, a single dash @kbd{-} may be used, in which case
+standard input is assumed. However, it would not make sense to use this
+convention more than once per program invocation.
+
+@menu
+* General options:: Options which affect general program behaviour.
+* Charset selection:: Underlying character set considerations.
+* Input processing:: Input fields, contexts, and keyword selection.
+* Output formatting:: Types of output format, and sizing the fields.
+@end menu
+
+@node General options, Charset selection, Invoking ptx, Invoking ptx
+@section General options
+
+@table @code
+
+@item -C
+@itemx --copyright
+Prints a short note about the Copyright and copying conditions, then
+exit without further processing.
+
+@item -G
+@itemx --traditional
+As already explained, this option disables all GNU extensions to
+@code{ptx} and switch to traditional mode.
+
+@item --help
+Prints a short help on standard output, then exit without further
+processing.
+
+@item --version
+Prints the program verison on standard output, then exit without further
+processing.
+
+@end table
+
+@node Charset selection, Input processing, General options, Invoking ptx
+@section Charset selection
+
+As it is setup now, the program assumes that the input file is coded
+using 8-bit ISO 8859-1 code, also known as Latin-1 character set,
+@emph{unless} if it is compiled for MS-DOS, in which case it uses the
+character set of the IBM-PC. (GNU @code{ptx} is not known to work on
+smaller MS-DOS machines anymore.) Compared to 7-bit ASCII, the set of
+characters which are letters is then different, this fact alters the
+behaviour of regular expression matching. Thus, the default regular
+expression for a keyword allows foreign or diacriticized letters.
+Keyword sorting, however, is still crude; it obeys the underlying
+character set ordering quite blindly.
+
+@table @code
+
+@item -f
+@itemx --ignore-case
+Fold lower case letters to upper case for sorting.
+
+@end table
+
+@node Input processing, Output formatting, Charset selection, Invoking ptx
+@section Word selection
+
+@table @code
+
+@item -b @var{file}
+@item --break-file=@var{file}
+
+This option is an alternative way to option @code{-W} for describing
+which characters make up words. This option introduces the name of a
+file which contains a list of characters which can@emph{not} be part of
+one word, this file is called the @dfn{Break file}. Any character which
+is not part of the Break file is a word constituent. If both options
+@code{-b} and @code{-W} are specified, then @code{-W} has precedence and
+@code{-b} is ignored.
+
+When GNU extensions are enabled, the only way to avoid newline as a
+break character is to write all the break characters in the file with no
+newline at all, not even at the end of the file. When GNU extensions
+are disabled, spaces, tabs and newlines are always considered as break
+characters even if not included in the Break file.
+
+@item -i @var{file}
+@itemx --ignore-file=@var{file}
+
+The file associated with this option contains a list of words which will
+never be taken as keywords in concordance output. It is called the
+@dfn{Ignore file}. The file contains exactly one word in each line; the
+end of line separation of words is not subject to the value of the
+@code{-S} option.
+
+There is a default Ignore file used by @code{ptx} when this option is
+not specified, usually found in @file{/usr/local/lib/eign} if this has
+not been changed at installation time. If you want to deactivate the
+default Ignore file, specify @code{/dev/null} instead.
+
+@item -o @var{file}
+@itemx --only-file=@var{file}
+
+The file associated with this option contains a list of words which will
+be retained in concordance output, any word not mentioned in this file
+is ignored. The file is called the @dfn{Only file}. The file contains
+exactly one word in each line; the end of line separation of words is
+not subject to the value of the @code{-S} option.
+
+There is no default for the Only file. In the case there are both an
+Only file and an Ignore file, a word will be subject to be a keyword
+only if it is given in the Only file and not given in the Ignore file.
+
+@item -r
+@itemx --references
+
+On each input line, the leading sequence of non white characters will be
+taken to be a reference that has the purpose of identifying this input
+line on the produced permuted index. See @xref{Output formatting} for
+more information about reference production. Using this option change
+the default value for option @code{-S}.
+
+Using this option, the program does not try very hard to remove
+references from contexts in output, but it succeeds in doing so
+@emph{when} the context ends exactly at the newline. If option
+@code{-r} is used with @code{-S} default value, or when GNU extensions
+are disabled, this condition is always met and references are completely
+excluded from the output contexts.
+
+@item -S @var{regexp}
+@itemx --sentence-regexp=@var{regexp}
+
+This option selects which regular expression will describe the end of a
+line or the end of a sentence. In fact, there is other distinction
+between end of lines or end of sentences than the effect of this regular
+expression, and input line boundaries have no special significance
+outside this option. By default, when GNU extensions are enabled and if
+@code{-r} option is not used, end of sentences are used. In this
+case, the precise @var{regex} is imported from GNU emacs:
+
+@example
+[.?!][]\"')@}]*\\($\\|\t\\| \\)[ \t\n]*
+@end example
+
+Whenever GNU extensions are disabled or if @code{-r} option is used, end
+of lines are used; in this case, the default @var{regexp} is just:
+
+@example
+\n
+@end example
+
+Using an empty REGEXP is equivalent to completely disabling end of line or end
+of sentence recognition. In this case, the whole file is considered to
+be a single big line or sentence. The user might want to disallow all
+truncation flag generation as well, through option @code{-F ""}.
+@xref{Regexps, , Syntax of Regular Expressions, emacs, The GNU Emacs
+Manual}.
+
+When the keywords happen to be near the beginning of the input line or
+sentence, this often creates an unused area at the beginning of the
+output context line; when the keywords happen to be near the end of the
+input line or sentence, this often creates an unused area at the end of
+the output context line. The program tries to fill those unused areas
+by wrapping around context in them; the tail of the input line or
+sentence is used to fill the unused area on the left of the output line;
+the head of the input line or sentence is used to fill the unused area
+on the right of the output line.
+
+As a matter of convenience to the user, many usual backslashed escape
+sequences, as found in the C language, are recognized and converted to
+the corresponding characters by @code{ptx} itself.
+
+@item -W @var{regexp}
+@itemx --word-regexp=@var{regexp}
+
+This option selects which regular expression will describe each keyword.
+By default, if GNU extensions are enabled, a word is a sequence of
+letters; the @var{regexp} used is @code{\w+}. When GNU extensions are
+disabled, a word is by default anything which ends with a space, a tab
+or a newline; the @var{regexp} used is @code{[^ \t\n]+}.
+
+An empty REGEXP is equivalent to not using this option, letting the
+default dive in. @xref{Regexps, , Syntax of Regular Expressions, emacs,
+The GNU Emacs Manual}.
+
+As a matter of convenience to the user, many usual backslashed escape
+sequences, as found in the C language, are recognized and converted to
+the corresponding characters by @code{ptx} itself.
+
+@end table
+
+@node Output formatting, , Input processing, Invoking ptx
+@section Output formatting
+
+Output format is mainly controlled by @code{-O} and @code{-T} options,
+described in the table below. When neither @code{-O} nor @code{-T} is
+selected, and if GNU extensions are enabled, the program choose an
+output format suited for a dumb terminal. Each keyword occurrence is
+output to the center of one line, surrounded by its left and right
+contexts. Each field is properly justified, so the concordance output
+could readily be observed. As a special feature, if automatic
+references are selected by option @code{-A} and are output before the
+left context, that is, if option @code{-R} is @emph{not} selected, then
+a colon is added after the reference; this nicely interfaces with GNU
+Emacs @code{next-error} processing. In this default output format, each
+white space character, like newline and tab, is merely changed to
+exactly one space, with no special attempt to compress consecutive
+spaces. This might change in the future. Except for those white space
+characters, every other character of the underlying set of 256
+characters is transmitted verbatim.
+
+Output format is further controlled by the following options.
+
+@table @code
+
+@item -g @var{number}
+@itemx --gap-size=@var{number}
+
+Select the size of the minimum white gap between the fields on the output
+line.
+
+@item -w @var{number}
+@itemx --width=@var{number}
+
+Select the output maximum width of each final line. If references are
+used, they are included or excluded from the output maximum width
+depending on the value of option @code{-R}. If this option is not
+selected, that is, when references are output before the left context,
+the output maximum width takes into account the maximum length of all
+references. If this options is selected, that is, when references are
+output after the right context, the output maximum width does not take
+into account the space taken by references, nor the gap that precedes
+them.
+
+@item -A
+@itemx --auto-reference
+
+Select automatic references. Each input line will have an automatic
+reference made up of the file name and the line ordinal, with a single
+colon between them. However, the file name will be empty when standard
+input is being read. If both @code{-A} and @code{-r} are selected, then
+the input reference is still read and skipped, but the automatic
+reference is used at output time, overriding the input reference.
+
+@item -R
+@itemx --right-side-refs
+
+In default output format, when option @code{-R} is not used, any
+reference produced by the effect of options @code{-r} or @code{-A} are
+given to the far right of output lines, after the right context. In
+default output format, when option @code{-R} is specified, references
+are rather given to the beginning of each output line, before the left
+context. For any other output format, option @code{-R} is almost
+ignored, except for the fact that the width of references is @emph{not}
+taken into account in total output width given by @code{-w} whenever
+@code{-R} is selected.
+
+This option is automatically selected whenever GNU extensions are
+disabled.
+
+@item -F @var{string}
+@itemx --flac-truncation=@var{string}
+
+This option will request that any truncation in the output be reported
+using the string @var{string}. Most output fields theoretically extend
+towards the beginning or the end of the current line, or current
+sentence, as selected with option @code{-S}. But there is a maximum
+allowed output line width, changeable through option @code{-w}, which is
+further divided into space for various output fields. When a field has
+to be truncated because cannot extend until the beginning or the end of
+the current line to fit in the, then a truncation occurs. By default,
+the string used is a single slash, as in @code{-F /}.
+
+@var{string} may have more than one character, as in @code{-F ...}.
+Also, in the particular case @var{string} is empty (@code{-F ""}),
+truncation flagging is disabled, and no truncation marks are appended in
+this case.
+
+As a matter of convenience to the user, many usual backslashed escape
+sequences, as found in the C language, are recognized and converted to
+the corresponding characters by @code{ptx} itself.
+
+@item -M @var{string}
+@itemx --macro-name=@var{string}
+
+Select another @var{string} to be used instead of @samp{xx}, while
+generating output suitable for @code{nroff}, @code{troff} or @TeX{}.
+
+@item -O
+@itemx --format=roff
+
+Choose an output format suitable for @code{nroff} or @code{troff}
+processing. Each output line will look like:
+
+@example
+.xx "@var{tail}" "@var{before}" "@var{keyword_and_after}" "@var{head}" "@var{ref}"
+@end example
+
+so it will be possible to write an @samp{.xx} roff macro to take care of
+the output typesetting. This is the default output format when GNU
+extensions are disabled. Option @samp{-M} might be used to change
+@samp{xx} to another macro name.
+
+In this output format, each non-graphical character, like newline and
+tab, is merely changed to exactly one space, with no special attempt to
+compress consecutive spaces. Each quote character: @kbd{"} is doubled
+so it will be correctly processed by @code{nroff} or @code{troff}.
+
+@item -T
+@itemx --format=tex
+
+Choose an output format suitable for @TeX{} processing. Each output
+line will look like:
+
+@example
+\xx @{@var{tail}@}@{@var{before}@}@{@var{keyword}@}@{@var{after}@}@{@var{head}@}@{@var{ref}@}
+@end example
+
+@noindent
+so it will be possible to write write a @code{\xx} definition to take
+care of the output typesetting. Note that when references are not being
+produced, that is, neither option @code{-A} nor option @code{-r} is
+selected, the last parameter of each @code{\xx} call is inhibited.
+Option @samp{-M} might be used to change @samp{xx} to another macro
+name.
+
+In this output format, some special characters, like @kbd{$}, @kbd{%},
+@kbd{&}, @kbd{#} and @kbd{_} are automatically protected with a
+backslash. Curly brackets @kbd{@{}, @kbd{@}} are also protected with a
+backslash, but also enclosed in a pair of dollar signs to force
+mathematical mode. The backslash itself produces the sequence
+@code{\backslash@{@}}. Circumflex and tilde diacritics produce the
+sequence @code{^\@{ @}} and @code{~\@{ @}} respectively. Other
+diacriticized characters of the underlying character set produce an
+appropriate @TeX{} sequence as far as possible. The other non-graphical
+characters, like newline and tab, and all others characters which are
+not part of ASCII, are merely changed to exactly one space, with no
+special attempt to compress consecutive spaces. Let me know how to
+improve this special character processing for @TeX{}.
+
+@end table
+
+@node Compatibility, , Invoking ptx, Top
+@chapter The GNU extensions to @code{ptx}
+
+This version of @code{ptx} contains a few features which do not exist in
+System V @code{ptx}. These extra features are suppressed by using the
+@samp{-G} command line option, unless overridden by other command line
+options. Some GNU extensions cannot be recovered by overriding, so the
+simple rule is to avoid @samp{-G} if you care about GNU extensions.
+Here are the differences between this program and System V @code{ptx}.
+
+@itemize @bullet
+
+@item
+This program can read many input files at once, it always writes the
+resulting concordance on standard output. On the other end, System V
+@code{ptx} reads only one file and produce the result on standard output
+or, if a second @var{file} parameter is given on the command, to that
+@var{file}.
+
+Having output parameters not introduced by options is a quite dangerous
+practice which GNU avoids as far as possible. So, for using @code{ptx}
+portably between GNU and System V, you should pay attention to always
+use it with a single input file, and always expect the result on
+standard output. You might also want to automatically configure in a
+@samp{-G} option to @code{ptx} calls in products using @code{ptx}, if
+the configurator finds that the installed @code{ptx} accepts @samp{-G}.
+
+@item
+The only options available in System V @code{ptx} are options @samp{-b},
+@samp{-f}, @samp{-g}, @samp{-i}, @samp{-o}, @samp{-r}, @samp{-t} and
+@samp{-w}. All other options are GNU extensions and are not repeated in
+this enumeration. Moreover, some options have a slightly different
+meaning when GNU extensions are enabled, as explained below.
+
+@item
+By default, concordance output is not formatted for @code{troff} or
+@code{nroff}. It is rather formatted for a dumb terminal. @code{troff}
+or @code{nroff} output may still be selected through option @code{-O}.
+
+@item
+Unless @code{-R} option is used, the maximum reference width is
+subtracted from the total output line width. With GNU extensions
+disabled, width of references is not taken into account in the output
+line width computations.
+
+@item
+All 256 characters, even @kbd{NUL}s, are always read and processed from
+input file with no adverse effect, even if GNU extensions are disabled.
+However, System V @code{ptx} does not accept 8-bit characters, a few
+control characters are rejected, and the tilda @kbd{~} is condemned.
+
+@item
+Input line length is only limited by available memory, even if GNU
+extensions are disabled. However, System V @code{ptx} processes only
+the first 200 characters in each line.
+
+@item
+The break (non-word) characters default to be every character except all
+letters of the underlying character set, diacriticized or not. When GNU
+extensions are disabled, the break characters default to space, tab and
+newline only.
+
+@item
+The program makes better use of output line width. If GNU extensions
+are disabled, the program rather tries to imitate System V @code{ptx},
+but still, there are some slight disposition glitches this program does
+not completely reproduce.
+
+@item
+The user can specify both an Ignore file and an Only file. This is not
+allowed with System V @code{ptx}.
+
+@end itemize
+
+@bye
diff --git a/gnu/usr.bin/ptx/ptx.c b/gnu/usr.bin/ptx/ptx.c
index 2dc306e..3ef3665 100644
--- a/gnu/usr.bin/ptx/ptx.c
+++ b/gnu/usr.bin/ptx/ptx.c
@@ -101,7 +101,7 @@ extern int errno;
 #include "bumpalloc.h"
 #include "diacrit.h"
-#include "regex.h"
+#include "gnuregex.h"
 #ifndef __STDC__
 void *xmalloc ();