From e9243aa42348dfb93efdf470f5feedf6d53fab58 Mon Sep 17 00:00:00 2001 From: peter Date: Sat, 30 Dec 1995 19:02:48 +0000 Subject: recording cvs-1.6 file death --- usr.bin/lex/flex.1 | 1001 ---------------- usr.bin/lex/flexdoc.1 | 3045 ------------------------------------------------- 2 files changed, 4046 deletions(-) delete mode 100644 usr.bin/lex/flex.1 delete mode 100644 usr.bin/lex/flexdoc.1 diff --git a/usr.bin/lex/flex.1 b/usr.bin/lex/flex.1 deleted file mode 100644 index 6aba4d6..0000000 --- a/usr.bin/lex/flex.1 +++ /dev/null @@ -1,1001 +0,0 @@ -.TH FLEX 1 "November 1993" "Version 2.4" -.SH NAME -flex \- fast lexical analyzer generator -.SH SYNOPSIS -.B flex -.B [\-bcdfhilnpstvwBFILTV78+ \-C[aefFmr] \-Pprefix \-Sskeleton] -.I [filename ...] -.SH DESCRIPTION -.I flex -is a tool for generating -.I scanners: -programs which recognize lexical patterns in text. -.I flex -reads -the given input files, or its standard input if no file names are given, -for a description of a scanner to generate. The description is in -the form of pairs -of regular expressions and C code, called -.I rules. flex -generates as output a C source file, -.B lex.yy.c, -which defines a routine -.B yylex(). -This file is compiled and linked with the -.B \-lfl -library to produce an executable. When the executable is run, -it analyzes its input for occurrences -of the regular expressions. Whenever it finds one, it executes -the corresponding C code. -.PP -For full documentation, see -.B flexdoc(1). -This manual entry is intended for use as a quick reference. -.SH OPTIONS -.I flex -has the following options: -.TP -.B \-b -generate backing-up information to -.I lex.backup. -This is a list of scanner states which require backing up and the input -characters on which they do so. By adding rules one can remove -backing-up states. If all backing-up states are eliminated and -.B \-Cf -or -.B \-CF -is used, the generated scanner will run faster.
-.TP -.B \-c -is a do-nothing, deprecated option included for POSIX compliance. -.IP -.B NOTE: -in previous releases of -.I flex -.B \-c -specified table-compression options. This functionality is -now given by the -.B \-C -flag. To ease the impact of this change, when -.I flex -encounters -.B \-c, -it currently issues a warning message and assumes that -.B \-C -was desired instead. In the future this "promotion" of -.B \-c -to -.B \-C -will go away in the name of full POSIX compliance (unless -the POSIX meaning is removed first). -.TP -.B \-d -makes the generated scanner run in -.I debug -mode. Whenever a pattern is recognized and the global -.B yy_flex_debug -is non-zero (which is the default), the scanner will -write to -.I stderr -a line of the form: -.nf - - --accepting rule at line 53 ("the matched text") - -.fi -The line number refers to the location of the rule in the file -defining the scanner (i.e., the file that was fed to flex). Messages -are also generated when the scanner backs up, accepts the -default rule, reaches the end of its input buffer (or encounters -a NUL; the two look the same as far as the scanner's concerned), -or reaches an end-of-file. -.TP -.B \-f -specifies -.I fast scanner. -No table compression is done and stdio is bypassed. -The result is large but fast. This option is equivalent to -.B \-Cfr -(see below). -.TP -.B \-h -generates a "help" summary of -.I flex's -options to -.I stderr -and then exits. -.TP -.B \-i -instructs -.I flex -to generate a -.I case-insensitive -scanner. The case of letters given in the -.I flex -input patterns will -be ignored, and tokens in the input will be matched regardless of case. The -matched text given in -.I yytext -will have the preserved case (i.e., it will not be folded). -.TP -.B \-l -turns on maximum compatibility with the original AT&T lex implementation, -at a considerable performance cost. This option is incompatible with -.B \-+, \-f, \-F, \-Cf, -or -.B \-CF.
-See -.I flexdoc(1) -for details. -.TP -.B \-n -is another do-nothing, deprecated option included only for -POSIX compliance. -.TP -.B \-p -generates a performance report to stderr. The report -consists of comments regarding features of the -.I flex -input file which will cause a loss of performance in the resulting scanner. -If you give the flag twice, you will also get comments regarding -features that lead to minor performance losses. -.TP -.B \-s -causes the -.I default rule -(that unmatched scanner input is echoed to -.I stdout) -to be suppressed. If the scanner encounters input that does not -match any of its rules, it aborts with an error. -.TP -.B \-t -instructs -.I flex -to write the scanner it generates to standard output instead -of -.B lex.yy.c. -.TP -.B \-v -specifies that -.I flex -should write to -.I stderr -a summary of statistics regarding the scanner it generates. -.TP -.B \-w -suppresses warning messages. -.TP -.B \-B -instructs -.I flex -to generate a -.I batch -scanner instead of an -.I interactive -scanner (see -.B \-I -below). See -.I flexdoc(1) -for details. Scanners using -.B \-Cf -or -.B \-CF -compression options automatically specify this option, too. -.TP -.B \-F -specifies that the -.ul -fast -scanner table representation should be used (and stdio bypassed). -This representation is about as fast as the full table representation -.B (-f), -and for some sets of patterns will be considerably smaller (and for -others, larger). It cannot be used with the -.B \-+ -option. See -.B flexdoc(1) -for more details. -.IP -This option is equivalent to -.B \-CFr -(see below). -.TP -.B \-I -instructs -.I flex -to generate an -.I interactive -scanner, that is, a scanner which stops immediately rather than -looking ahead if it knows -that the currently scanned text cannot be part of a longer rule's match. -This is the opposite of -.I batch -scanners (see -.B \-B -above). See -.B flexdoc(1) -for details. 
-.IP -Note, -.B \-I -cannot be used in conjunction with -.I full -or -.I fast tables, -i.e., the -.B \-f, \-F, \-Cf, -or -.B \-CF -flags. For other table compression options, -.B \-I -is the default. -.TP -.B \-L -instructs -.I flex -not to generate -.B #line -directives in -.B lex.yy.c. -The default is to generate such directives so error -messages in the actions will be correctly -located with respect to the original -.I flex -input file, and not to -the fairly meaningless line numbers of -.B lex.yy.c. -.TP -.B \-T -makes -.I flex -run in -.I trace -mode. It will generate a lot of messages to -.I stderr -concerning -the form of the input and the resultant non-deterministic and deterministic -finite automata. This option is mostly for use in maintaining -.I flex. -.TP -.B \-V -prints the version number to -.I stderr -and exits. -.TP -.B \-7 -instructs -.I flex -to generate a 7-bit scanner, which can save considerable table space, -especially when using -.B \-Cf -or -.B \-CF -(and, at most sites, -.B \-7 -is on by default for these options. To see if this is the case, use the -.B -v -verbose flag and check the flag summary it reports). -.TP -.B \-8 -instructs -.I flex -to generate an 8-bit scanner. This is the default except for the -.B \-Cf -and -.B \-CF -compression options, for which the default is site-dependent, and -can be checked by inspecting the flag summary generated by the -.B \-v -option. -.TP -.B \-+ -specifies that you want flex to generate a C++ -scanner class. See the section on Generating C++ Scanners in -.I flexdoc(1) -for details. -.TP -.B \-C[aefFmr] -controls the degree of table compression and scanner optimization. -.IP -.B \-Ca -trade off larger tables in the generated scanner for faster performance -because the elements of the tables are better aligned for memory access -and computation. This option can double the size of the tables used by -your scanner. 
-.IP -.B \-Ce -directs -.I flex -to construct -.I equivalence classes, -i.e., sets of characters -which have identical lexical properties. -Equivalence classes usually give -dramatic reductions in the final table/object file sizes (typically -a factor of 2-5) and are pretty cheap performance-wise (one array -look-up per character scanned). -.IP -.B \-Cf -specifies that the -.I full -scanner tables should be generated - -.I flex -should not compress the -tables by taking advantage of similar transition functions for -different states. -.IP -.B \-CF -specifies that the alternate fast scanner representation (described in -.B flexdoc(1)) -should be used. This option cannot be used with -.B \-+. -.IP -.B \-Cm -directs -.I flex -to construct -.I meta-equivalence classes, -which are sets of equivalence classes (or characters, if equivalence -classes are not being used) that are commonly used together. Meta-equivalence -classes are often a big win when using compressed tables, but they -have a moderate performance impact (one or two "if" tests and one -array look-up per character scanned). -.IP -.B \-Cr -causes the generated scanner to -.I bypass -using stdio for input. In general this option results in a minor -performance gain only worthwhile if used in conjunction with -.B \-Cf -or -.B \-CF. -It can cause surprising behavior if you use stdio yourself to -read from -.I yyin -prior to calling the scanner. -.IP -A lone -.B \-C -specifies that the scanner tables should be compressed but neither -equivalence classes nor meta-equivalence classes should be used. -.IP -The options -.B \-Cf -or -.B \-CF -and -.B \-Cm -do not make sense together - there is no opportunity for meta-equivalence -classes if the table is not being compressed. Otherwise the options -may be freely mixed. -.IP -The default setting is -.B \-Cem, -which specifies that -.I flex -should generate equivalence classes -and meta-equivalence classes.
This setting provides the highest -degree of table compression. You can trade off -faster-executing scanners at the cost of larger tables with -the following generally being true: -.nf - - slowest & smallest - -Cem - -Cm - -Ce - -C - -C{f,F}e - -C{f,F} - -C{f,F}a - fastest & largest - -.fi -.IP -.B \-C -options are cumulative. -.TP -.B \-Pprefix -changes the default -.I "yy" -prefix used by -.I flex -to be -.I prefix -instead. See -.I flexdoc(1) -for a description of all the global variables and file names that -this affects. -.TP -.B \-Sskeleton_file -overrides the default skeleton file from which -.I flex -constructs its scanners. You'll never need this option unless you are doing -.I flex -maintenance or development. -.SH SUMMARY OF FLEX REGULAR EXPRESSIONS -The patterns in the input are written using an extended set of regular -expressions. These are: -.nf - - x match the character 'x' - . any character except newline - [xyz] a "character class"; in this case, the pattern - matches either an 'x', a 'y', or a 'z' - [abj-oZ] a "character class" with a range in it; matches - an 'a', a 'b', any letter from 'j' through 'o', - or a 'Z' - [^A-Z] a "negated character class", i.e., any character - but those in the class. In this case, any - character EXCEPT an uppercase letter. - [^A-Z\\n] any character EXCEPT an uppercase letter or - a newline - r* zero or more r's, where r is any regular expression - r+ one or more r's - r? zero or one r's (that is, "an optional r") - r{2,5} anywhere from two to five r's - r{2,} two or more r's - r{4} exactly 4 r's - {name} the expansion of the "name" definition - (see above) - "[xyz]\\"foo" - the literal string: [xyz]"foo - \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', - then the ANSI-C interpretation of \\x. 
- Otherwise, a literal 'X' (used to escape - operators such as '*') - \\123 the character with octal value 123 - \\x2a the character with hexadecimal value 2a - (r) match an r; parentheses are used to override - precedence (see below) - - - rs the regular expression r followed by the - regular expression s; called "concatenation" - - - r|s either an r or an s - - - r/s an r but only if it is followed by an s. The - s is not part of the matched text. This type - of pattern is called "trailing context". - ^r an r, but only at the beginning of a line - r$ an r, but only at the end of a line. Equivalent - to "r/\\n". - - - <s>r an r, but only in start condition s (see - below for discussion of start conditions) - <s1,s2,s3>r - same, but in any of start conditions s1, - s2, or s3 - <*>r an r in any start condition, even an exclusive one. - - - <<EOF>> an end-of-file - <s1,s2><<EOF>> - an end-of-file when in start condition s1 or s2 - -.fi -The regular expressions listed above are grouped according to -precedence, from highest precedence at the top to lowest at the bottom. -Those grouped together have equal precedence. -.PP -Some notes on patterns: -.IP - -Negated character classes -.I match newlines -unless "\\n" (or an equivalent escape sequence) is one of the -characters explicitly present in the negated character class -(e.g., "[^A-Z\\n]"). -.IP - -A rule can have at most one instance of trailing context (the '/' operator -or the '$' operator). The start condition, '^', and "<<EOF>>" patterns -can only occur at the beginning of a pattern, and, as well as with '/' and '$', -cannot be grouped inside parentheses. The following are all illegal: -.nf - - foo/bar$ - foo|(bar$) - foo|^bar - foo<sc1>bar - -.fi -.SH SUMMARY OF SPECIAL ACTIONS -In addition to arbitrary C code, the following can appear in actions: -.IP - -.B ECHO -copies yytext to the scanner's output. -.IP - -.B BEGIN -followed by the name of a start condition places the scanner in the -corresponding start condition.
-.IP - -.B REJECT -directs the scanner to proceed on to the "second best" rule which matched the -input (or a prefix of the input). -.B yytext -and -.B yyleng -are set up appropriately. Note that -.B REJECT -is a particularly expensive feature in terms of scanner performance; -if it is used in -.I any -of the scanner's actions it will slow down -.I all -of the scanner's matching. Furthermore, -.B REJECT -cannot be used with the -.B \-f -or -.B \-F -options. -.IP -Note also that unlike the other special actions, -.B REJECT -is a -.I branch; -code immediately following it in the action will -.I not -be executed. -.IP - -.B yymore() -tells the scanner that the next time it matches a rule, the corresponding -token should be -.I appended -onto the current value of -.B yytext -rather than replacing it. -.IP - -.B yyless(n) -returns all but the first -.I n -characters of the current token back to the input stream, where they -will be rescanned when the scanner looks for the next match. -.B yytext -and -.B yyleng -are adjusted appropriately (e.g., -.B yyleng -will now be equal to -.I n -). -.IP - -.B unput(c) -puts the character -.I c -back onto the input stream. It will be the next character scanned. -.IP - -.B input() -reads the next character from the input stream (this routine is called -.B yyinput() -if the scanner is compiled using -.B C++). -.IP - -.B yyterminate() -can be used in lieu of a return statement in an action. It terminates -the scanner and returns a 0 to the scanner's caller, indicating "all done". -.IP -By default, -.B yyterminate() -is also called when an end-of-file is encountered. It is a macro and -may be redefined. -.IP - -.B YY_NEW_FILE -is an action available only in <<EOF>> rules. It means "Okay, I've -set up a new input file, continue scanning". It is no longer required; -you can just assign -.I yyin -to point to a new file in the <<EOF>> action. -.IP - -.B yy_create_buffer( file, size ) -takes a -.I FILE -pointer and an integer -.I size.
-It returns a YY_BUFFER_STATE -handle to a new input buffer large enough to accommodate -.I size -characters and associated with the given file. When in doubt, use -.B YY_BUF_SIZE -for the size. -.IP - -.B yy_switch_to_buffer( new_buffer ) -switches the scanner's processing to scan for tokens from -the given buffer, which must be a YY_BUFFER_STATE. -.IP - -.B yy_delete_buffer( buffer ) -deletes the given buffer. -.SH VALUES AVAILABLE TO THE USER -.IP - -.B char *yytext -holds the text of the current token. It may be modified but not lengthened -(you cannot append characters to the end). Modifying the last character -may affect the activity of rules anchored using '^' during the next scan; -see -.B flexdoc(1) -for details. -.IP -If the special directive -.B %array -appears in the first section of the scanner description, then -.B yytext -is instead declared -.B char yytext[YYLMAX], -where -.B YYLMAX -is a macro definition that you can redefine in the first section -if you don't like the default value (generally 8KB). Using -.B %array -results in somewhat slower scanners, but the value of -.B yytext -becomes immune to calls to -.I input() -and -.I unput(), -which potentially destroy its value when -.B yytext -is a character pointer. The opposite of -.B %array -is -.B %pointer, -which is the default. -.IP -You cannot use -.B %array -when generating C++ scanner classes -(the -.B \-+ -flag). -.IP - -.B int yyleng -holds the length of the current token. -.IP - -.B FILE *yyin -is the file which by default -.I flex -reads from. It may be redefined but doing so only makes sense before -scanning begins or after an EOF has been encountered. Changing it in -the midst of scanning will have unexpected results since -.I flex -buffers its input; use -.B yyrestart() -instead. -Once scanning terminates because an end-of-file -has been seen, -.B -you can assign -.I yyin -at the new input file and then call the scanner again to continue scanning.
-.IP - -.B void yyrestart( FILE *new_file ) -may be called to point -.I yyin -at the new input file. The switch-over to the new file is immediate -(any previously buffered-up input is lost). Note that calling -.B yyrestart() -with -.I yyin -as an argument thus throws away the current input buffer and continues -scanning the same input file. -.IP - -.B FILE *yyout -is the file to which -.B ECHO -actions are done. It can be reassigned by the user. -.IP - -.B YY_CURRENT_BUFFER -returns a -.B YY_BUFFER_STATE -handle to the current buffer. -.IP - -.B YY_START -returns an integer value corresponding to the current start -condition. You can subsequently use this value with -.B BEGIN -to return to that start condition. -.SH MACROS AND FUNCTIONS YOU CAN REDEFINE -.IP - -.B YY_DECL -controls how the scanning routine is declared. -By default, it is "int yylex()", or, if prototypes are being -used, "int yylex(void)". This definition may be changed by redefining -the "YY_DECL" macro. Note that -if you give arguments to the scanning routine using a -K&R-style/non-prototyped function declaration, you must terminate -the definition with a semi-colon (;). -.IP - -The nature of how the scanner -gets its input can be controlled by redefining the -.B YY_INPUT -macro. -YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its -action is to place up to -.I max_size -characters in the character array -.I buf -and return in the integer variable -.I result -either the -number of characters read or the constant YY_NULL (0 on Unix systems) -to indicate EOF. The default YY_INPUT reads from the -global file-pointer "yyin". -A sample redefinition of YY_INPUT (in the definitions -section of the input file): -.nf - - %{ - #undef YY_INPUT - #define YY_INPUT(buf,result,max_size) \\ - { \\ - int c = getchar(); \\ - result = (c == EOF) ? 
YY_NULL : (buf[0] = c, 1); \\ - } - %} - -.fi -.IP - -When the scanner receives an end-of-file indication from YY_INPUT, -it then checks the -.B yywrap() -function. If -.B yywrap() -returns false (zero), then it is assumed that the -function has gone ahead and set up -.I yyin -to point to another input file, and scanning continues. If it returns -true (non-zero), then the scanner terminates, returning 0 to its -caller. -.IP -The default -.B yywrap() -always returns 1. -.IP - -YY_USER_ACTION -can be redefined to provide an action -which is always executed prior to the matched rule's action. -.IP - -The macro -.B YY_USER_INIT -may be redefined to provide an action which is always executed before -the first scan. -.IP - -In the generated scanner, the actions are all gathered in one large -switch statement and separated using -.B YY_BREAK, -which may be redefined. By default, it is simply a "break", to separate -each rule's action from the following rule's. -.SH FILES -.TP -.B \-lfl -library with which to link scanners to obtain the default versions -of -.I yywrap() -and/or -.I main(). -.TP -.I lex.yy.c -generated scanner (called -.I lexyy.c -on some systems). -.TP -.I lex.yy.cc -generated C++ scanner class, when using -.B -+. -.TP -.I <FlexLexer.h> -header file defining the C++ scanner base class, -.B FlexLexer, -and its derived class, -.B yyFlexLexer. -.TP -.I flex.skl -skeleton scanner. This file is only used when building flex, not when -flex executes. -.TP -.I lex.backup -backing-up information for -.B \-b -flag (called -.I lex.bck -on some systems). -.SH "SEE ALSO" -.PP -flexdoc(1), lex(1), yacc(1), sed(1), awk(1). -.PP -M. E. Lesk and E. Schmidt, -.I LEX \- Lexical Analyzer Generator -.SH DIAGNOSTICS -.PP -.I reject_used_but_not_detected undefined -or -.PP -.I yymore_used_but_not_detected undefined - -These errors can occur at compile time.
They indicate that the -scanner uses -.B REJECT -or -.B yymore() -but that -.I flex -failed to notice the fact, meaning that -.I flex -scanned the first two sections looking for occurrences of these actions -and failed to find any, but somehow you snuck some in (via a #include -file, for example). Make an explicit reference to the action in your -.I flex -input file. (Note that previously -.I flex -supported a -.B %used/%unused -mechanism for dealing with this problem; this feature is still supported -but now deprecated, and will go away soon unless the author hears from -people who can argue compellingly that they need it.) -.PP -.I flex scanner jammed - -a scanner compiled with -.B \-s -has encountered an input string which wasn't matched by -any of its rules. -.PP -.I warning, rule cannot be matched -indicates that the given rule -cannot be matched because it follows other rules that will -always match the same text as it. See -.I flexdoc(1) -for an example. -.PP -.I warning, -.B \-s -.I -option given but default rule can be matched -means that it is possible (perhaps only in a particular start condition) -that the default rule (match any single character) is the only one -that will match a particular input. Since -.PP -.I scanner input buffer overflowed - -a scanner rule matched more text than the available dynamic memory. -.PP -.I token too large, exceeds YYLMAX - -your scanner uses -.B %array -and one of its rules matched a string longer than the -.B YYLMAX -constant (8K bytes by default). You can increase the value by -#define'ing -.B YYLMAX -in the definitions section of your -.I flex -input. -.PP -.I scanner requires \-8 flag to -.I use the character 'x' - -Your scanner specification includes recognizing the 8-bit character -.I 'x' -and you did not specify the \-8 flag, and your scanner defaulted to 7-bit -because you used the -.B \-Cf -or -.B \-CF -table compression options. 
-.PP -.I flex scanner push-back overflow - -you used -.B unput() -to push back so much text that the scanner's buffer could not hold -both the pushed-back text and the current token in -.B yytext. -Ideally the scanner should dynamically resize the buffer in this case, but at -present it does not. -.PP -.I -input buffer overflow, can't enlarge buffer because scanner uses REJECT - -the scanner was working on matching an extremely large token and needed -to expand the input buffer. This doesn't work with scanners that use -.B -REJECT. -.PP -.I -fatal flex scanner internal error--end of buffer missed - -This can occur in a scanner which is reentered after a long-jump -has jumped out (or over) the scanner's activation frame. Before -reentering the scanner, use: -.nf - - yyrestart( yyin ); - -.fi -or use C++ scanner classes (the -.B \-+ -option), which are fully reentrant. -.SH AUTHOR -Vern Paxson, with the help of many ideas and much inspiration from -Van Jacobson. Original version by Jef Poskanzer. -.PP -See flexdoc(1) for additional credits and the address to send comments to. -.SH DEFICIENCIES / BUGS -.PP -Some trailing context -patterns cannot be properly matched and generate -warning messages ("dangerous trailing context"). These are -patterns where the ending of the -first part of the rule matches the beginning of the second -part, such as "zx*/xy*", where the 'x*' matches the 'x' at -the beginning of the trailing context. (Note that the POSIX draft -states that the text matched by such patterns is undefined.) -.PP -For some trailing context rules, parts which are actually fixed-length are -not recognized as such, leading to the abovementioned performance loss. -In particular, parts using '|' or {n} (such as "foo{3}") are always -considered variable-length. -.PP -Combining trailing context with the special '|' action can result in -.I fixed -trailing context being turned into the more expensive -.I variable -trailing context.
For example, in the following: -.nf - - %% - abc | - xyz/def - -.fi -.PP -Use of -.B unput() -or -.B input() -invalidates yytext and yyleng, unless the -.B %array -directive -or the -.B \-l -option has been used. -.PP -Use of unput() to push back more text than was matched can -result in the pushed-back text matching a beginning-of-line ('^') -rule even though it didn't come at the beginning of the line -(though this is rare!). -.PP -Pattern-matching of NUL's is substantially slower than matching other -characters. -.PP -Dynamic resizing of the input buffer is slow, as it entails rescanning -all the text matched so far by the current (generally huge) token. -.PP -.I flex -does not generate correct #line directives for code internal -to the scanner; thus, bugs in -.I flex.skl -yield bogus line numbers. -.PP -Due to both buffering of input and read-ahead, you cannot intermix -calls to routines, such as, for example, -.B getchar(), -with -.I flex -rules and expect it to work. Call -.B input() -instead. -.PP -The total table entries listed by the -.B \-v -flag excludes the number of table entries needed to determine -what rule has been matched. The number of entries is equal -to the number of DFA states if the scanner does not use -.B REJECT, -and somewhat greater than the number of states if it does. -.PP -.B REJECT -cannot be used with the -.B \-f -or -.B \-F -options. -.PP -The -.I flex -internal algorithms need documentation. diff --git a/usr.bin/lex/flexdoc.1 b/usr.bin/lex/flexdoc.1 deleted file mode 100644 index b80d569..0000000 --- a/usr.bin/lex/flexdoc.1 +++ /dev/null @@ -1,3045 +0,0 @@ -.TH FLEXDOC 1 "November 1993" "Version 2.4" -.SH NAME -flexdoc \- documentation for flex, fast lexical analyzer generator -.SH SYNOPSIS -.B flex -.B [\-bcdfhilnpstvwBFILTV78+ \-C[aefFmr] \-Pprefix \-Sskeleton] -.I [filename ...] -.SH DESCRIPTION -.I flex -is a tool for generating -.I scanners: -programs which recognize lexical patterns in text.
-.I flex -reads -the given input files, or its standard input if no file names are given, -for a description of a scanner to generate. The description is in -the form of pairs -of regular expressions and C code, called -.I rules. flex -generates as output a C source file, -.B lex.yy.c, -which defines a routine -.B yylex(). -This file is compiled and linked with the -.B \-lfl -library to produce an executable. When the executable is run, -it analyzes its input for occurrences -of the regular expressions. Whenever it finds one, it executes -the corresponding C code. -.SH SOME SIMPLE EXAMPLES -.PP -First some simple examples to get the flavor of how one uses -.I flex. -The following -.I flex -input specifies a scanner which whenever it encounters the string -"username" will replace it with the user's login name: -.nf - - %% - username printf( "%s", getlogin() ); - -.fi -By default, any text not matched by a -.I flex -scanner -is copied to the output, so the net effect of this scanner is -to copy its input file to its output with each occurrence -of "username" expanded. -In this input, there is just one rule. "username" is the -.I pattern -and the "printf" is the -.I action. -The "%%" marks the beginning of the rules. -.PP -Here's another simple example: -.nf - - int num_lines = 0, num_chars = 0; - - %% - \\n ++num_lines; ++num_chars; - . ++num_chars; - - %% - main() - { - yylex(); - printf( "# of lines = %d, # of chars = %d\\n", - num_lines, num_chars ); - } - -.fi -This scanner counts the number of characters and the number -of lines in its input (it produces no output other than the -final report on the counts). The first line -declares two globals, "num_lines" and "num_chars", which are accessible -both inside -.B yylex() -and in the -.B main() -routine declared after the second "%%". 
There are two rules, one -which matches a newline ("\\n") and increments both the line count and -the character count, and one which matches any character other than -a newline (indicated by the "." regular expression). -.PP -A somewhat more complicated example: -.nf - - /* scanner for a toy Pascal-like language */ - - %{ - /* need this for the call to atof() below */ - #include <math.h> - %} - - DIGIT [0-9] - ID [a-z][a-z0-9]* - - %% - - {DIGIT}+ { - printf( "An integer: %s (%d)\\n", yytext, - atoi( yytext ) ); - } - - {DIGIT}+"."{DIGIT}* { - printf( "A float: %s (%g)\\n", yytext, - atof( yytext ) ); - } - - if|then|begin|end|procedure|function { - printf( "A keyword: %s\\n", yytext ); - } - - {ID} printf( "An identifier: %s\\n", yytext ); - - "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext ); - - "{"[^}\\n]*"}" /* eat up one-line comments */ - - [ \\t\\n]+ /* eat up whitespace */ - - . printf( "Unrecognized character: %s\\n", yytext ); - - %% - - main( argc, argv ) - int argc; - char **argv; - { - ++argv, --argc; /* skip over program name */ - if ( argc > 0 ) - yyin = fopen( argv[0], "r" ); - else - yyin = stdin; - - yylex(); - } - -.fi -This is the beginnings of a simple scanner for a language like -Pascal. It identifies different types of -.I tokens -and reports on what it has seen. -.PP -The details of this example will be explained in the following -sections. -.SH FORMAT OF THE INPUT FILE -The -.I flex -input file consists of three sections, separated by a line with just -.B %% -in it: -.nf - - definitions - %% - rules - %% - user code - -.fi -The -.I definitions -section contains declarations of simple -.I name -definitions to simplify the scanner specification, and declarations of -.I start conditions, -which are explained in a later section. -.PP -Name definitions have the form: -.nf - - name definition - -.fi -The "name" is a word beginning with a letter or an underscore ('_') -followed by zero or more letters, digits, '_', or '-' (dash).
-The definition is taken to begin at the first non-white-space character -following the name and continuing to the end of the line. -The definition can subsequently be referred to using "{name}", which -will expand to "(definition)". For example, -.nf - - DIGIT [0-9] - ID [a-z][a-z0-9]* - -.fi -defines "DIGIT" to be a regular expression which matches a -single digit, and -"ID" to be a regular expression which matches a letter -followed by zero-or-more letters-or-digits. -A subsequent reference to -.nf - - {DIGIT}+"."{DIGIT}* - -.fi -is identical to -.nf - - ([0-9])+"."([0-9])* - -.fi -and matches one-or-more digits followed by a '.' followed -by zero-or-more digits. -.PP -The -.I rules -section of the -.I flex -input contains a series of rules of the form: -.nf - - pattern action - -.fi -where the pattern must be unindented and the action must begin -on the same line. -.PP -See below for a further description of patterns and actions. -.PP -Finally, the user code section is simply copied to -.B lex.yy.c -verbatim. -It is used for companion routines which call or are called -by the scanner. The presence of this section is optional; -if it is missing, the second -.B %% -in the input file may be skipped, too. -.PP -In the definitions and rules sections, any -.I indented -text or text enclosed in -.B %{ -and -.B %} -is copied verbatim to the output (with the %{}'s removed). -The %{}'s must appear unindented on lines by themselves. -.PP -In the rules section, -any indented or %{} text appearing before the -first rule may be used to declare variables -which are local to the scanning routine and (after the declarations) -code which is to be executed whenever the scanning routine is entered. -Other indented or %{} text in the rule section is still copied to the output, -but its meaning is not well-defined and it may well cause compile-time -errors (this feature is present for -.I POSIX -compliance; see below for other such features). 
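The "{name}" expansion described earlier in this section can be illustrated with a small C sketch. This is not how flex itself is implemented; it is a hypothetical helper (expand_names, lookup, and the defs table are names made up here) that only shows the textual rule that "{name}" becomes "(definition)". It assumes the two definitions from the text, DIGIT and ID, and it deliberately ignores quoting and character classes, so it should be read as an approximation of the rule rather than a parser for flex input.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical table of name definitions, mirroring the DIGIT and ID
 * examples from the text. */
struct name_def {
    const char *name;
    const char *definition;
};

static const struct name_def defs[] = {
    { "DIGIT", "[0-9]" },
    { "ID",    "[a-z][a-z0-9]*" },
};

/* Return the definition for name[0..len), or NULL if none is known. */
static const char *lookup(const char *name, size_t len)
{
    for (size_t i = 0; i < sizeof defs / sizeof defs[0]; i++)
        if (strlen(defs[i].name) == len &&
            strncmp(defs[i].name, name, len) == 0)
            return defs[i].definition;
    return NULL;
}

/* Copy pattern into out, replacing each known "{name}" with
 * "(definition)".  Braces that do not enclose a known name (such as
 * the repeat count in "r{2,5}") are copied unchanged.
 * Returns 0 on success, -1 if out is too small. */
int expand_names(const char *pattern, char *out, size_t outsize)
{
    size_t used = 0;

    while (*pattern) {
        const char *close = (*pattern == '{') ? strchr(pattern, '}') : NULL;
        const char *def = close
            ? lookup(pattern + 1, (size_t)(close - pattern - 1))
            : NULL;

        if (def) {
            /* Parenthesize the definition, as the text describes. */
            int n = snprintf(out + used, outsize - used, "(%s)", def);
            if (n < 0 || (size_t)n >= outsize - used)
                return -1;
            used += (size_t)n;
            pattern = close + 1;
        } else {
            if (used + 1 >= outsize)
                return -1;
            out[used++] = *pattern++;
        }
    }
    if (used >= outsize)
        return -1;
    out[used] = '\0';
    return 0;
}
```

Called with the pattern {DIGIT}+"."{DIGIT}* and a sufficiently large buffer, the helper produces ([0-9])+"."([0-9])*, which is the expansion given in the text. The parentheses matter: without them, an operator following "{name}" would bind only to the last element of the definition.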
-.PP -In the definitions section (but not in the rules section), -an unindented comment (i.e., a line -beginning with "/*") is also copied verbatim to the output up -to the next "*/". -.SH PATTERNS -The patterns in the input are written using an extended set of regular -expressions. These are: -.nf - - x match the character 'x' - . any character except newline - [xyz] a "character class"; in this case, the pattern - matches either an 'x', a 'y', or a 'z' - [abj-oZ] a "character class" with a range in it; matches - an 'a', a 'b', any letter from 'j' through 'o', - or a 'Z' - [^A-Z] a "negated character class", i.e., any character - but those in the class. In this case, any - character EXCEPT an uppercase letter. - [^A-Z\\n] any character EXCEPT an uppercase letter or - a newline - r* zero or more r's, where r is any regular expression - r+ one or more r's - r? zero or one r's (that is, "an optional r") - r{2,5} anywhere from two to five r's - r{2,} two or more r's - r{4} exactly 4 r's - {name} the expansion of the "name" definition - (see above) - "[xyz]\\"foo" - the literal string: [xyz]"foo - \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', - then the ANSI-C interpretation of \\x. - Otherwise, a literal 'X' (used to escape - operators such as '*') - \\123 the character with octal value 123 - \\x2a the character with hexadecimal value 2a - (r) match an r; parentheses are used to override - precedence (see below) - - - rs the regular expression r followed by the - regular expression s; called "concatenation" - - - r|s either an r or an s - - - r/s an r but only if it is followed by an s. The - s is not part of the matched text. This type - of pattern is called as "trailing context". - ^r an r, but only at the beginning of a line - r$ an r, but only at the end of a line. Equivalent - to "r/\\n". 
- - - <s>r an r, but only in start condition s (see - below for discussion of start conditions) - <s1,s2,s3>r - same, but in any of start conditions s1, - s2, or s3 - <*>r an r in any start condition, even an exclusive one. - - - <<EOF>> an end-of-file - <s1,s2><<EOF>> - an end-of-file when in start condition s1 or s2 - -.fi -Note that inside of a character class, all regular expression operators -lose their special meaning except escape ('\\') and the character class -operators, '-', ']', and, at the beginning of the class, '^'. -.PP -The regular expressions listed above are grouped according to -precedence, from highest precedence at the top to lowest at the bottom. -Those grouped together have equal precedence. For example, -.nf - - foo|bar* - -.fi -is the same as -.nf - - (foo)|(ba(r*)) - -.fi -since the '*' operator has higher precedence than concatenation, -and concatenation higher than alternation ('|'). This pattern -therefore matches -.I either -the string "foo" -.I or -the string "ba" followed by zero-or-more r's. -To match "foo" or zero-or-more "bar"'s, use: -.nf - - foo|(bar)* - -.fi -and to match zero-or-more "foo"'s-or-"bar"'s: -.nf - - (foo|bar)* - -.fi -.PP -Some notes on patterns: -.IP - -A negated character class such as the example "[^A-Z]" -above -.I will match a newline -unless "\\n" (or an equivalent escape sequence) is one of the -characters explicitly present in the negated character class -(e.g., "[^A-Z\\n]"). This is unlike how many other regular -expression tools treat negated character classes, but unfortunately -the inconsistency is historically entrenched. -Matching newlines means that a pattern like [^"]* can match the entire -input unless there's another quote in the input. -.IP - -A rule can have at most one instance of trailing context (the '/' operator -or the '$' operator). The start condition, '^', and "<<EOF>>" patterns -can only occur at the beginning of a pattern, and, as well as with '/' and '$', -cannot be grouped inside parentheses.
A '^' which does not occur at -the beginning of a rule or a '$' which does not occur at the end of -a rule loses its special properties and is treated as a normal character. -.IP -The following are illegal: -.nf - - foo/bar$ - <sc1>foo<sc2>bar - -.fi -Note that the first of these can be written "foo/bar\\n". -.IP -The following will result in '$' or '^' being treated as a normal character: -.nf - - foo|(bar$) - foo|^bar - -.fi -If what's wanted is a "foo" or a bar-followed-by-a-newline, the following -could be used (the special '|' action is explained below): -.nf - - foo | - bar$ /* action goes here */ - -.fi -A similar trick will work for matching a foo or a -bar-at-the-beginning-of-a-line. -.SH HOW THE INPUT IS MATCHED -When the generated scanner is run, it analyzes its input looking -for strings which match any of its patterns. If it finds more than -one match, it takes the one matching the most text (for trailing -context rules, this includes the length of the trailing part, even -though it will then be returned to the input). If it finds two -or more matches of the same length, the -rule listed first in the -.I flex -input file is chosen. -.PP -Once the match is determined, the text corresponding to the match -(called the -.I token) -is made available in the global character pointer -.B yytext, -and its length in the global integer -.B yyleng. -The -.I action -corresponding to the matched pattern is then executed (a more -detailed description of actions follows), and then the remaining -input is scanned for another match. -.PP -If no match is found, then the -.I default rule -is executed: the next character in the input is considered matched and -copied to the standard output. Thus, the simplest legal -.I flex -input is: -.nf - - %% - -.fi -which generates a scanner that simply copies its input (one character -at a time) to its output. -.PP -Note that -.B yytext -can be defined in two different ways: either as a character -.I pointer -or as a character -.I array.
-You can control which definition -.I flex -uses by including one of the special directives -.B %pointer -or -.B %array -in the first (definitions) section of your flex input. The default is -.B %pointer, -unless you use the -.B -l -lex compatibility option, in which case -.B yytext -will be an array. -The advantage of using -.B %pointer -is substantially faster scanning and no buffer overflow when matching -very large tokens (unless you run out of dynamic memory). The disadvantage -is that you are restricted in how your actions can modify -.B yytext -(see the next section), and calls to the -.B input() -and -.B unput() -functions destroy the present contents of -.B yytext, -which can be a considerable porting headache when moving between different -.I lex -versions. -.PP -The advantage of -.B %array -is that you can then modify -.B yytext -to your heart's content, and calls to -.B input() -and -.B unput() -do not destroy -.B yytext -(see below). Furthermore, existing -.I lex -programs sometimes access -.B yytext -externally using declarations of the form: -.nf - extern char yytext[]; -.fi -This definition is erroneous when used with -.B %pointer, -but correct for -.B %array. -.PP -.B %array -defines -.B yytext -to be an array of -.B YYLMAX -characters, which defaults to a fairly large value. You can change -the size by simply #define'ing -.B YYLMAX -to a different value in the first section of your -.I flex -input. As mentioned above, with -.B %pointer -yytext grows dynamically to accommodate large tokens. While this means your -.B %pointer -scanner can accommodate very large tokens (such as matching entire blocks -of comments), bear in mind that each time the scanner must resize -.B yytext -it also must rescan the entire token from the beginning, so matching such -tokens can prove slow. -.B yytext -presently does -.I not -dynamically grow if a call to -.B unput() -results in too much text being pushed back; instead, a run-time error results.
-.PP -Also note that you cannot use -.B %array -with C++ scanner classes -(the -.B \-+ -option; see below). -.SH ACTIONS -Each pattern in a rule has a corresponding action, which can be any -arbitrary C statement. The pattern ends at the first non-escaped -whitespace character; the remainder of the line is its action. If the -action is empty, then when the pattern is matched the input token -is simply discarded. For example, here is the specification for a program -which deletes all occurrences of "zap me" from its input: -.nf - - %% - "zap me" - -.fi -(It will copy all other characters in the input to the output since -they will be matched by the default rule.) -.PP -Here is a program which compresses multiple blanks and tabs down to -a single blank, and throws away whitespace found at the end of a line: -.nf - - %% - [ \\t]+ putchar( ' ' ); - [ \\t]+$ /* ignore this token */ - -.fi -.PP -If the action contains a '{', then the action spans till the balancing '}' -is found, and the action may cross multiple lines. -.I flex -knows about C strings and comments and won't be fooled by braces found -within them, but also allows actions to begin with -.B %{ -and will consider the action to be all the text up to the next -.B %} -(regardless of ordinary braces inside the action). -.PP -An action consisting solely of a vertical bar ('|') means "same as -the action for the next rule." See below for an illustration. -.PP -Actions can include arbitrary C code, including -.B return -statements to return a value to whatever routine called -.B yylex(). -Each time -.B yylex() -is called it continues processing tokens from where it last left -off until it either reaches -the end of the file or executes a return. -.PP -Actions are free to modify -.B yytext -except for lengthening it (adding -characters to its end--these will overwrite later characters in the -input stream). 
Modifying the final character of yytext may alter -whether when scanning resumes rules anchored with '^' are active. -Specifically, changing the final character of yytext to a newline will -activate such rules on the next scan, and changing it to anything else -will deactivate the rules. Users should not rely on this behavior being -present in future releases. Finally, note that none of this paragraph -applies when using -.B %array -(see above). -.PP -Actions are free to modify -.B yyleng -except they should not do so if the action also includes use of -.B yymore() -(see below). -.PP -There are a number of special directives which can be included within -an action: -.IP - -.B ECHO -copies yytext to the scanner's output. -.IP - -.B BEGIN -followed by the name of a start condition places the scanner in the -corresponding start condition (see below). -.IP - -.B REJECT -directs the scanner to proceed on to the "second best" rule which matched the -input (or a prefix of the input). The rule is chosen as described -above in "How the Input is Matched", and -.B yytext -and -.B yyleng -set up appropriately. -It may either be one which matched as much text -as the originally chosen rule but came later in the -.I flex -input file, or one which matched less text. -For example, the following will both count the -words in the input and call the routine special() whenever "frob" is seen: -.nf - - int word_count = 0; - %% - - frob special(); REJECT; - [^ \\t\\n]+ ++word_count; - -.fi -Without the -.B REJECT, -any "frob"'s in the input would not be counted as words, since the -scanner normally executes only one action per token. -Multiple -.B REJECT's -are allowed, each one finding the next best choice to the currently -active rule. 
For example, when the following scanner scans the token -"abcd", it will write "abcdabcaba" to the output: -.nf - - %% - a | - ab | - abc | - abcd ECHO; REJECT; - .|\\n /* eat up any unmatched character */ - -.fi -(The first three rules share the fourth's action since they use -the special '|' action.) -.B REJECT -is a particularly expensive feature in terms of scanner performance; -if it is used in -.I any -of the scanner's actions it will slow down -.I all -of the scanner's matching. Furthermore, -.B REJECT -cannot be used with the -.I -Cf -or -.I -CF -options (see below). -.IP -Note also that unlike the other special actions, -.B REJECT -is a -.I branch; -code immediately following it in the action will -.I not -be executed. -.IP - -.B yymore() -tells the scanner that the next time it matches a rule, the corresponding -token should be -.I appended -onto the current value of -.B yytext -rather than replacing it. For example, given the input "mega-kludge" -the following will write "mega-mega-kludge" to the output: -.nf - - %% - mega- ECHO; yymore(); - kludge ECHO; - -.fi -First "mega-" is matched and echoed to the output. Then "kludge" -is matched, but the previous "mega-" is still hanging around at the -beginning of -.B yytext -so the -.B ECHO -for the "kludge" rule will actually write "mega-kludge". -The presence of -.B yymore() -in the scanner's action entails a minor performance penalty in the -scanner's matching speed. -.IP - -.B yyless(n) -returns all but the first -.I n -characters of the current token back to the input stream, where they -will be rescanned when the scanner looks for the next match. -.B yytext -and -.B yyleng -are adjusted appropriately (e.g., -.B yyleng -will now be equal to -.I n -). For example, on the input "foobar" the following will write out -"foobarbar": -.nf - - %% - foobar ECHO; yyless(3); - [a-z]+ ECHO; - -.fi -An argument of 0 to -.B yyless -will cause the entire current input string to be scanned again.
Unless you've -changed how the scanner will subsequently process its input (using -.B BEGIN, -for example), this will result in an endless loop. -.PP -Note that -.B yyless -is a macro and can only be used in the flex input file, not from -other source files. -.IP - -.B unput(c) -puts the character -.I c -back onto the input stream. It will be the next character scanned. -The following action will take the current token and cause it -to be rescanned enclosed in parentheses. -.nf - - { - int i; - unput( ')' ); - for ( i = yyleng - 1; i >= 0; --i ) - unput( yytext[i] ); - unput( '(' ); - } - -.fi -Note that since each -.B unput() -puts the given character back at the -.I beginning -of the input stream, pushing back strings must be done back-to-front. -Also note that you cannot put back -.B EOF -to attempt to mark the input stream with an end-of-file. -.IP - -.B input() -reads the next character from the input stream. For example, -the following is one way to eat up C comments: -.nf - - %% - "/*" { - register int c; - - for ( ; ; ) - { - while ( (c = input()) != '*' && - c != EOF ) - ; /* eat up text of comment */ - - if ( c == '*' ) - { - while ( (c = input()) == '*' ) - ; - if ( c == '/' ) - break; /* found the end */ - } - - if ( c == EOF ) - { - error( "EOF in comment" ); - break; - } - } - } - -.fi -(Note that if the scanner is compiled using -.B C++, -then -.B input() -is instead referred to as -.B yyinput(), -in order to avoid a name clash with the -.B C++ -stream by the name of -.I input.) -.IP - -.B yyterminate() -can be used in lieu of a return statement in an action. It terminates -the scanner and returns a 0 to the scanner's caller, indicating "all done". -By default, -.B yyterminate() -is also called when an end-of-file is encountered. It is a macro and -may be redefined. 
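The comment-eating loop shown above for input() can be tried outside a generated scanner. The following is a minimal sketch in plain C, not flex output: next_char() and skip_comment() are hypothetical names, and next_char() merely stands in for input() by reading from a NUL-terminated string instead of the scanner's buffer.

```c
#include <stdio.h>

/* Hypothetical stand-in for flex's input(): returns the next character
 * of a NUL-terminated string, or EOF once the end has been reached. */
static int next_char(const char **p)
{
    if (**p == '\0')
        return EOF;
    return (unsigned char) *(*p)++;
}

/* Skips the body of a C comment whose opening delimiter has already
 * been consumed.  Returns 1 if the closing delimiter was found, or 0
 * if the input ended first (the "EOF in comment" case). */
static int skip_comment(const char **p)
{
    int c;

    for ( ; ; ) {
        while ((c = next_char(p)) != '*' && c != EOF)
            ;   /* eat up text of comment */

        if (c == '*') {
            while ((c = next_char(p)) == '*')
                ;   /* eat up a run of stars */
            if (c == '/')
                return 1;   /* found the end */
        }

        if (c == EOF)
            return 0;       /* input ended inside the comment */
    }
}
```

On success the cursor is left just past the closing delimiter, which corresponds to where the scanner would resume matching after the action returns.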
-.SH THE GENERATED SCANNER -The output of -.I flex -is the file -.B lex.yy.c, -which contains the scanning routine -.B yylex(), -a number of tables used by it for matching tokens, and a number -of auxiliary routines and macros. By default, -.B yylex() -is declared as follows: -.nf - - int yylex() - { - ... various definitions and the actions in here ... - } - -.fi -(If your environment supports function prototypes, then it will -be "int yylex( void )".) This definition may be changed by defining -the "YY_DECL" macro. For example, you could use: -.nf - - #define YY_DECL float lexscan( a, b ) float a, b; - -.fi -to give the scanning routine the name -.I lexscan, -returning a float, and taking two floats as arguments. Note that -if you give arguments to the scanning routine using a -K&R-style/non-prototyped function declaration, you must terminate -the definition with a semi-colon (;). -.PP -Whenever -.B yylex() -is called, it scans tokens from the global input file -.I yyin -(which defaults to stdin). It continues until it either reaches -an end-of-file (at which point it returns the value 0) or -one of its actions executes a -.I return -statement. -.PP -If the scanner reaches an end-of-file, subsequent calls are undefined -unless either -.I yyin -is pointed at a new input file (in which case scanning continues from -that file), or -.B yyrestart() -is called. -.B yyrestart() -takes one argument, a -.B FILE * -pointer, and initializes -.I yyin -for scanning from that file. Essentially there is no difference between -just assigning -.I yyin -to a new input file or using -.B yyrestart() -to do so; the latter is available for compatibility with previous versions -of -.I flex, -and because it can be used to switch input files in the middle of scanning. -It can also be used to throw away the current input buffer, by calling -it with an argument of -.I yyin. 
-.PP -If -.B yylex() -stops scanning due to executing a -.I return -statement in one of the actions, the scanner may then be called again and it -will resume scanning where it left off. -.PP -By default (and for purposes of efficiency), the scanner uses -block-reads rather than simple -.I getc() -calls to read characters from -.I yyin. -The nature of how it gets its input can be controlled by defining the -.B YY_INPUT -macro. -YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its -action is to place up to -.I max_size -characters in the character array -.I buf -and return in the integer variable -.I result -either the -number of characters read or the constant YY_NULL (0 on Unix systems) -to indicate EOF. The default YY_INPUT reads from the -global file-pointer "yyin". -.PP -A sample definition of YY_INPUT (in the definitions -section of the input file): -.nf - - %{ - #define YY_INPUT(buf,result,max_size) \\ - { \\ - int c = getchar(); \\ - result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\ - } - %} - -.fi -This definition will change the input processing to occur -one character at a time. -.PP -You also can add in things like keeping track of the -input line number this way; but don't expect your scanner to -go very fast. -.PP -When the scanner receives an end-of-file indication from YY_INPUT, -it then checks the -.B yywrap() -function. If -.B yywrap() -returns false (zero), then it is assumed that the -function has gone ahead and set up -.I yyin -to point to another input file, and scanning continues. If it returns -true (non-zero), then the scanner terminates, returning 0 to its -caller. -.PP -The default -.B yywrap() -always returns 1. -.PP -The scanner writes its -.B ECHO -output to the -.I yyout -global (default, stdout), which may be redefined by the user simply -by assigning it to some other -.B FILE -pointer. -.SH START CONDITIONS -.I flex -provides a mechanism for conditionally activating rules. 
Any rule -whose pattern is prefixed with "<sc>" will only be active when -the scanner is in the start condition named "sc". For example, -.nf - - <STRING>[^"]* { /* eat up the string body ... */ - ... - } - -.fi -will be active only when the scanner is in the "STRING" start -condition, and -.nf - - <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */ - ... - } - -.fi -will be active only when the current start condition is -either "INITIAL", "STRING", or "QUOTE". -.PP -Start conditions -are declared in the definitions (first) section of the input -using unindented lines beginning with either -.B %s -or -.B %x -followed by a list of names. -The former declares -.I inclusive -start conditions, the latter -.I exclusive -start conditions. A start condition is activated using the -.B BEGIN -action. Until the next -.B BEGIN -action is executed, rules with the given start -condition will be active and -rules with other start conditions will be inactive. -If the start condition is -.I inclusive, -then rules with no start conditions at all will also be active. -If it is -.I exclusive, -then -.I only -rules qualified with the start condition will be active. -A set of rules contingent on the same exclusive start condition -describe a scanner which is independent of any of the other rules in the -.I flex -input. Because of this, -exclusive start conditions make it easy to specify "mini-scanners" -which scan portions of the input that are syntactically different -from the rest (e.g., comments). -.PP -If the distinction between inclusive and exclusive start conditions -is still a little vague, here's a simple example illustrating the -connection between the two. The set of rules: -.nf - - %s example - %% - foo /* do something */ - -.fi -is equivalent to -.nf - - %x example - %% - <INITIAL,example>foo /* do something */ - -.fi -.PP -Also note that the special start-condition specifier -.B <*> -matches every start condition.
Thus, the above example could also -have been written: -.nf - - %x example - %% - <*>foo /* do something */ - -.fi -.PP -The default rule (to -.B ECHO -any unmatched character) remains active in start conditions. -.PP -.B BEGIN(0) -returns to the original state where only the rules with -no start conditions are active. This state can also be -referred to as the start-condition "INITIAL", so -.B BEGIN(INITIAL) -is equivalent to -.B BEGIN(0). -(The parentheses around the start condition name are not required but -are considered good style.) -.PP -.B BEGIN -actions can also be given as indented code at the beginning -of the rules section. For example, the following will cause -the scanner to enter the "SPECIAL" start condition whenever -.I yylex() -is called and the global variable -.I enter_special -is true: -.nf - - int enter_special; - - %x SPECIAL - %% - if ( enter_special ) - BEGIN(SPECIAL); - - <SPECIAL>blahblahblah - ...more rules follow... - -.fi -.PP -To illustrate the uses of start conditions, -here is a scanner which provides two different interpretations -of a string like "123.456". By default it will treat it as -three tokens, the integer "123", a dot ('.'), and the integer "456". -But if the string is preceded earlier in the line by the string -"expect-floats" -it will treat it as a single token, the floating-point number -123.456: -.nf - - %{ - #include <math.h> - %} - %s expect - - %% - expect-floats BEGIN(expect); - - <expect>[0-9]+"."[0-9]+ { - printf( "found a float, = %f\\n", - atof( yytext ) ); - } - <expect>\\n { - /* that's the end of the line, so - * we need another "expect-number" - * before we'll recognize any more - * numbers - */ - BEGIN(INITIAL); - } - - [0-9]+ { - printf( "found an integer, = %d\\n", - atoi( yytext ) ); - } - - "." printf( "found a dot\\n" ); - -.fi -Here is a scanner which recognizes (and discards) C comments while -maintaining a count of the current input line.
-.nf - - %x comment - %% - int line_num = 1; - - "/*" BEGIN(comment); - - <comment>[^*\\n]* /* eat anything that's not a '*' */ - <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */ - <comment>\\n ++line_num; - <comment>"*"+"/" BEGIN(INITIAL); - -.fi -This scanner goes to a bit of trouble to match as much -text as possible with each rule. In general, when attempting to write -a high-speed scanner try to match as much as possible in each rule, as -it's a big win. -.PP -Note that start-condition names are really integer values and -can be stored as such. Thus, the above could be extended in the -following fashion: -.nf - - %x comment foo - %% - int line_num = 1; - int comment_caller; - - "/*" { - comment_caller = INITIAL; - BEGIN(comment); - } - - ... - - <foo>"/*" { - comment_caller = foo; - BEGIN(comment); - } - - <comment>[^*\\n]* /* eat anything that's not a '*' */ - <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */ - <comment>\\n ++line_num; - <comment>"*"+"/" BEGIN(comment_caller); - -.fi -Furthermore, you can access the current start condition using -the integer-valued -.B YY_START -macro. For example, the above assignments to -.I comment_caller -could instead be written -.nf - - comment_caller = YY_START; -.fi -.PP -Note that start conditions do not have their own name-space; %s's and %x's -declare names in the same fashion as #define's.
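To make the "start conditions are integers" point concrete, here is a hand-written C sketch of roughly what the comment scanner above does, with the current condition held in an ordinary int. It is an illustration only and not what flex generates: SC_INITIAL, SC_COMMENT and count_lines() are hypothetical names, and unlike the rules above this sketch counts every newline, whether inside or outside a comment.

```c
#include <stddef.h>

/* Hypothetical names mirroring the INITIAL and "comment" start
 * conditions as plain integers; not flex's generated values. */
enum { SC_INITIAL = 0, SC_COMMENT = 1 };

/* Walks text, switching state at comment delimiters and counting
 * newlines.  Returns the final line number (starting from 1) and
 * stores the state the input ended in through state_out. */
static int count_lines(const char *text, int *state_out)
{
    int state = SC_INITIAL;
    int line_num = 1;
    size_t i = 0;

    while (text[i] != '\0') {
        if (state == SC_INITIAL && text[i] == '/' && text[i + 1] == '*') {
            state = SC_COMMENT;     /* plays the role of BEGIN(comment) */
            i += 2;
        } else if (state == SC_COMMENT && text[i] == '*' && text[i + 1] == '/') {
            state = SC_INITIAL;     /* plays the role of BEGIN(INITIAL) */
            i += 2;
        } else {
            if (text[i] == '\n')
                ++line_num;
            ++i;
        }
    }

    *state_out = state;
    return line_num;
}
```

If the input ends while the state is still SC_COMMENT, the comment was never closed; that is the situation an end-of-file rule qualified with the comment start condition would catch in a real scanner.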
-.PP -Finally, here's an example of how to match C-style quoted strings using -exclusive start conditions, including expanded escape sequences (but -not including checking for a string that's too long): -.nf - - %x str - - %% - char string_buf[MAX_STR_CONST]; - char *string_buf_ptr; - - - \\" string_buf_ptr = string_buf; BEGIN(str); - - <str>\\" { /* saw closing quote - all done */ - BEGIN(INITIAL); - *string_buf_ptr = '\\0'; - /* return string constant token type and - * value to parser - */ - } - - <str>\\n { - /* error - unterminated string constant */ - /* generate error message */ - } - - <str>\\\\[0-7]{1,3} { - /* octal escape sequence */ - int result; - - (void) sscanf( yytext + 1, "%o", &result ); - - if ( result > 0xff ) - /* error, constant is out-of-bounds */ - - *string_buf_ptr++ = result; - } - - <str>\\\\[0-9]+ { - /* generate error - bad escape sequence; something - * like '\\48' or '\\0777777' - */ - } - - <str>\\\\n *string_buf_ptr++ = '\\n'; - <str>\\\\t *string_buf_ptr++ = '\\t'; - <str>\\\\r *string_buf_ptr++ = '\\r'; - <str>\\\\b *string_buf_ptr++ = '\\b'; - <str>\\\\f *string_buf_ptr++ = '\\f'; - - <str>\\\\(.|\\n) *string_buf_ptr++ = yytext[1]; - - <str>[^\\\\\\n\\"]+ { - char *yytext_ptr = yytext; - - while ( *yytext_ptr ) - *string_buf_ptr++ = *yytext_ptr++; - } - -.fi -.SH MULTIPLE INPUT BUFFERS -Some scanners (such as those which support "include" files) -require reading from several input streams. As -.I flex -scanners do a large amount of buffering, one cannot control -where the next input will be read from by simply writing a -.B YY_INPUT -which is sensitive to the scanning context. -.B YY_INPUT -is only called when the scanner reaches the end of its buffer, which -may be a long time after scanning a statement such as an "include" -which requires switching the input source. -.PP -To negotiate these sorts of problems, -.I flex -provides a mechanism for creating and switching between multiple -input buffers.
An input buffer is created by using: -.nf - - YY_BUFFER_STATE yy_create_buffer( FILE *file, int size ) - -.fi -which takes a -.I FILE -pointer and a size and creates a buffer associated with the given -file and large enough to hold -.I size -characters (when in doubt, use -.B YY_BUF_SIZE -for the size). It returns a -.B YY_BUFFER_STATE -handle, which may then be passed to other routines: -.nf - - void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer ) - -.fi -switches the scanner's input buffer so subsequent tokens will -come from -.I new_buffer. -Note that -.B yy_switch_to_buffer() -may be used by yywrap() to set things up for continued scanning, instead -of opening a new file and pointing -.I yyin -at it. -.nf - - void yy_delete_buffer( YY_BUFFER_STATE buffer ) - -.fi -is used to reclaim the storage associated with a buffer. -.PP -.B yy_new_buffer() -is an alias for -.B yy_create_buffer(), -provided for compatibility with the C++ use of -.I new -and -.I delete -for creating and destroying dynamic objects. -.PP -Finally, the -.B YY_CURRENT_BUFFER -macro returns a -.B YY_BUFFER_STATE -handle to the current buffer. -.PP -Here is an example of using these features for writing a scanner -which expands include files (the -.B <<EOF>> -feature is discussed below): -.nf - - /* the "incl" state is used for picking up the name - * of an include file - */ - %x incl - - %{ - #define MAX_INCLUDE_DEPTH 10 - YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH]; - int include_stack_ptr = 0; - %} - - %% - include BEGIN(incl); - - [a-z]+ ECHO; - [^a-z\\n]*\\n? ECHO; - - <incl>[ \\t]* /* eat the whitespace */ - <incl>[^ \\t\\n]+ { /* got the include file name */ - if ( include_stack_ptr >= MAX_INCLUDE_DEPTH ) - { - fprintf( stderr, "Includes nested too deeply" ); - exit( 1 ); - } - - include_stack[include_stack_ptr++] = - YY_CURRENT_BUFFER; - - yyin = fopen( yytext, "r" ); - - if ( ! yyin ) - error( ...
); - - yy_switch_to_buffer( - yy_create_buffer( yyin, YY_BUF_SIZE ) ); - - BEGIN(INITIAL); - } - - <<EOF>> { - if ( --include_stack_ptr < 0 ) - { - yyterminate(); - } - - else - { - yy_delete_buffer( YY_CURRENT_BUFFER ); - yy_switch_to_buffer( - include_stack[include_stack_ptr] ); - } - } - -.fi -.SH END-OF-FILE RULES -The special rule "<<EOF>>" indicates -actions which are to be taken when an end-of-file is -encountered and yywrap() returns non-zero (i.e., indicates -no further files to process). The action must finish -by doing one of four things: -.IP - -assigning -.I yyin -to a new input file (in previous versions of flex, after doing the -assignment you had to call the special action -.B YY_NEW_FILE; -this is no longer necessary); -.IP - -executing a -.I return -statement; -.IP - -executing the special -.B yyterminate() -action; -.IP - -or, switching to a new buffer using -.B yy_switch_to_buffer() -as shown in the example above. -.PP -<<EOF>> rules may not be used with other -patterns; they may only be qualified with a list of start -conditions. If an unqualified <<EOF>> rule is given, it -applies to -.I all -start conditions which do not already have <<EOF>> actions. To -specify an <<EOF>> rule for only the initial start condition, use -.nf - - <INITIAL><<EOF>> - -.fi -.PP -These rules are useful for catching things like unclosed comments. -An example: -.nf - - %x quote - %% - - ...other rules for dealing with quotes... - - <quote><<EOF>> { - error( "unterminated quote" ); - yyterminate(); - } - <<EOF>> { - if ( *++filelist ) - yyin = fopen( *filelist, "r" ); - else - yyterminate(); - } - -.fi -.SH MISCELLANEOUS MACROS -The macro -.bd -YY_USER_ACTION -can be defined to provide an action -which is always executed prior to the matched rule's action. For example, -it could be #define'd to call a routine to convert yytext to lower-case. -.PP -The macro -.B YY_USER_INIT -may be defined to provide an action which is always executed before -the first scan (and before the scanner's internal initializations are done).
-For example, it could be used to call a routine to read -in a data table or open a logging file. -.PP -In the generated scanner, the actions are all gathered in one large -switch statement and separated using -.B YY_BREAK, -which may be redefined. By default, it is simply a "break", to separate -each rule's action from the following rule's. -Redefining -.B YY_BREAK -allows, for example, C++ users to -#define YY_BREAK to do nothing (while being very careful that every -rule ends with a "break" or a "return"!) to avoid suffering from -unreachable statement warnings where because a rule's action ends with -"return", the -.B YY_BREAK -is inaccessible. -.SH INTERFACING WITH YACC -One of the main uses of -.I flex -is as a companion to the -.I yacc -parser-generator. -.I yacc -parsers expect to call a routine named -.B yylex() -to find the next input token. The routine is supposed to -return the type of the next token as well as putting any associated -value in the global -.B yylval. -To use -.I flex -with -.I yacc, -one specifies the -.B \-d -option to -.I yacc -to instruct it to generate the file -.B y.tab.h -containing definitions of all the -.B %tokens -appearing in the -.I yacc -input. This file is then included in the -.I flex -scanner. For example, if one of the tokens is "TOK_NUMBER", -part of the scanner might look like: -.nf - - %{ - #include "y.tab.h" - %} - - %% - - [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; - -.fi -.SH OPTIONS -.I flex -has the following options: -.TP -.B \-b -Generate backing-up information to -.I lex.backup. -This is a list of scanner states which require backing up -and the input characters on which they do so. By adding rules one -can remove backing-up states. If all backing-up states -are eliminated and -.B \-Cf -or -.B \-CF -is used, the generated scanner will run faster (see the -.B \-p -flag). Only users who wish to squeeze every last cycle out of their -scanners need worry about this option. 
(See the section on Performance -Considerations below.) -.TP -.B \-c -is a do-nothing, deprecated option included for POSIX compliance. -.IP -.B NOTE: -in previous releases of -.I flex -.B \-c -specified table-compression options. This functionality is -now given by the -.B \-C -flag. To ease the impact of this change, when -.I flex -encounters -.B \-c, -it currently issues a warning message and assumes that -.B \-C -was desired instead. In the future this "promotion" of -.B \-c -to -.B \-C -will go away in the name of full POSIX compliance (unless -the POSIX meaning is removed first). -.TP -.B \-d -makes the generated scanner run in -.I debug -mode. Whenever a pattern is recognized and the global -.B yy_flex_debug -is non-zero (which is the default), -the scanner will write to -.I stderr -a line of the form: -.nf - - --accepting rule at line 53 ("the matched text") - -.fi -The line number refers to the location of the rule in the file -defining the scanner (i.e., the file that was fed to flex). Messages -are also generated when the scanner backs up, accepts the -default rule, reaches the end of its input buffer (or encounters -a NUL; at this point, the two look the same as far as the scanner's concerned), -or reaches an end-of-file. -.TP -.B \-f -specifies -.I fast scanner. -No table compression is done and stdio is bypassed. -The result is large but fast. This option is equivalent to -.B \-Cfr -(see below). -.TP -.B \-h -generates a "help" summary of -.I flex's -options to -.I stderr -and then exits. -.TP -.B \-i -instructs -.I flex -to generate a -.I case-insensitive -scanner. The case of letters given in the -.I flex -input patterns will -be ignored, and tokens in the input will be matched regardless of case. The -matched text given in -.I yytext -will have the preserved case (i.e., it will not be folded). -.TP -.B \-l -turns on maximum compatibility with the original AT&T -.I lex -implementation. Note that this does not mean -.I full -compatibility.
Use of this option costs a considerable amount of -performance, and it cannot be used with the -.B \-+, -f, -F, -Cf, -or -.B -CF -options. For details on the compatibilities it provides, see the section -"Incompatibilities With Lex And POSIX" below. -.TP -.B \-n -is another do-nothing, deprecated option included only for -POSIX compliance. -.TP -.B \-p -generates a performance report to stderr. The report -consists of comments regarding features of the -.I flex -input file which will cause a serious loss of performance in the resulting -scanner. If you give the flag twice, you will also get comments regarding -features that lead to minor performance losses. -.IP -Note that the use of -.B REJECT -and variable trailing context (see the Bugs section in flex(1)) -entails a substantial performance penalty; use of -.I yymore(), -the -.B ^ -operator, -and the -.B \-I -flag entail minor performance penalties. -.TP -.B \-s -causes the -.I default rule -(that unmatched scanner input is echoed to -.I stdout) -to be suppressed. If the scanner encounters input that does not -match any of its rules, it aborts with an error. This option is -useful for finding holes in a scanner's rule set. -.TP -.B \-t -instructs -.I flex -to write the scanner it generates to standard output instead -of -.B lex.yy.c. -.TP -.B \-v -specifies that -.I flex -should write to -.I stderr -a summary of statistics regarding the scanner it generates. -Most of the statistics are meaningless to the casual -.I flex -user, but the first line identifies the version of -.I flex -(same as reported by -.B \-V), -and the next line the flags used when generating the scanner, including -those that are on by default. -.TP -.B \-w -suppresses warning messages. -.TP -.B \-B -instructs -.I flex -to generate a -.I batch -scanner, the opposite of -.I interactive -scanners generated by -.B \-I -(see below). 
In general, you use -.B \-B -when you are -.I certain -that your scanner will never be used interactively, and you want to -squeeze a -.I little -more performance out of it. If your goal is instead to squeeze out a -.I lot -more performance, you should be using the -.B \-Cf -or -.B \-CF -options (discussed below), which turn on -.B \-B -automatically anyway. -.TP -.B \-F -specifies that the -.ul -fast -scanner table representation should be used (and stdio -bypassed). This representation is -about as fast as the full table representation -.B (-f), -and for some sets of patterns will be considerably smaller (and for -others, larger). In general, if the pattern set contains both "keywords" -and a catch-all, "identifier" rule, such as in the set: -.nf - - "case" return TOK_CASE; - "switch" return TOK_SWITCH; - ... - "default" return TOK_DEFAULT; - [a-z]+ return TOK_ID; - -.fi -then you're better off using the full table representation. If only -the "identifier" rule is present and you then use a hash table or some such -to detect the keywords, you're better off using -.B -F. -.IP -This option is equivalent to -.B \-CFr -(see below). It cannot be used with -.B \-+. -.TP -.B \-I -instructs -.I flex -to generate an -.I interactive -scanner. An interactive scanner is one that only looks ahead to decide -what token has been matched if it absolutely must. It turns out that -always looking one extra character ahead, even if the scanner has already -seen enough text to disambiguate the current token, is a bit faster than -only looking ahead when necessary. But scanners that always look ahead -give dreadful interactive performance; for example, when a user types -a newline, it is not recognized as a newline token until they enter -.I another -token, which often means typing in another whole line. -.IP -.I Flex -scanners default to -.I interactive -unless you use the -.B \-Cf -or -.B \-CF -table-compression options (see below). 
That's because if you're looking -for high-performance you should be using one of these options, so if you -didn't, -.I flex -assumes you'd rather trade off a bit of run-time performance for intuitive -interactive behavior. Note also that you -.I cannot -use -.B \-I -in conjunction with -.B \-Cf -or -.B \-CF. -Thus, this option is not really needed; it is on by default for all those -cases in which it is allowed. -.IP -You can force a scanner to -.I not -be interactive by using -.B \-B -(see above). -.TP -.B \-L -instructs -.I flex -not to generate -.B #line -directives. Without this option, -.I flex -peppers the generated scanner -with #line directives so error messages in the actions will be correctly -located with respect to the original -.I flex -input file, and not to -the fairly meaningless line numbers of -.B lex.yy.c. -(Unfortunately -.I flex -does not presently generate the necessary directives -to "retarget" the line numbers for those parts of -.B lex.yy.c -which it generated. So if there is an error in the generated code, -a meaningless line number is reported.) -.TP -.B \-T -makes -.I flex -run in -.I trace -mode. It will generate a lot of messages to -.I stderr -concerning -the form of the input and the resultant non-deterministic and deterministic -finite automata. This option is mostly for use in maintaining -.I flex. -.TP -.B \-V -prints the version number to -.I stderr -and exits. -.TP -.B \-7 -instructs -.I flex -to generate a 7-bit scanner, i.e., one which can only recognize 7-bit -characters in its input. The advantage of using -.B \-7 -is that the scanner's tables can be up to half the size of those generated -using the -.B \-8 -option (see below). The disadvantage is that such scanners often hang -or crash if their input contains an 8-bit character.
-.IP -Note, however, that unless you generate your scanner using the -.B \-Cf -or -.B \-CF -table compression options, use of -.B \-7 -will save only a small amount of table space, and make your scanner -considerably less portable. -.I Flex's -default behavior is to generate an 8-bit scanner unless you use the -.B \-Cf -or -.B \-CF, -in which case -.I flex -defaults to generating 7-bit scanners unless your site was always -configured to generate 8-bit scanners (as will often be the case -with non-USA sites). You can tell whether flex generated a 7-bit -or an 8-bit scanner by inspecting the flag summary in the -.B \-v -output as described above. -.IP -Note that if you use -.B \-Cfe -or -.B \-CFe -(those table compression options, but also using equivalence classes as -discussed below), flex still defaults to generating an 8-bit -scanner, since usually with these compression options full 8-bit tables -are not much more expensive than 7-bit tables. -.TP -.B \-8 -instructs -.I flex -to generate an 8-bit scanner, i.e., one which can recognize 8-bit -characters. This flag is only needed for scanners generated using -.B \-Cf -or -.B \-CF, -as otherwise flex defaults to generating an 8-bit scanner anyway. -.IP -See the discussion of -.B \-7 -above for flex's default behavior and the tradeoffs between 7-bit -and 8-bit scanners. -.TP -.B \-+ -specifies that you want flex to generate a C++ -scanner class. See the section on Generating C++ Scanners below for -details. -.TP -.B \-C[aefFmr] -controls the degree of table compression and, more generally, trade-offs -between small scanners and fast scanners. -.IP -.B \-Ca -("align") instructs flex to trade off larger tables in the -generated scanner for faster performance because the elements of -the tables are better aligned for memory access and computation. On some -RISC architectures, fetching and manipulating longwords is more efficient -than with smaller-sized datums such as shortwords.
This option can -double the size of the tables used by your scanner. -.IP -.B \-Ce -directs -.I flex -to construct -.I equivalence classes, -i.e., sets of characters -which have identical lexical properties (for example, if the only -appearance of digits in the -.I flex -input is in the character class -"[0-9]" then the digits '0', '1', ..., '9' will all be put -in the same equivalence class). Equivalence classes usually give -dramatic reductions in the final table/object file sizes (typically -a factor of 2-5) and are pretty cheap performance-wise (one array -look-up per character scanned). -.IP -.B \-Cf -specifies that the -.I full -scanner tables should be generated - -.I flex -should not compress the -tables by taking advantage of similar transition functions for -different states. -.IP -.B \-CF -specifies that the alternate fast scanner representation (described -above under the -.B \-F -flag) -should be used. This option cannot be used with -.B \-+. -.IP -.B \-Cm -directs -.I flex -to construct -.I meta-equivalence classes, -which are sets of equivalence classes (or characters, if equivalence -classes are not being used) that are commonly used together. Meta-equivalence -classes are often a big win when using compressed tables, but they -have a moderate performance impact (one or two "if" tests and one -array look-up per character scanned). -.IP -.B \-Cr -causes the generated scanner to -.I bypass -use of the standard I/O library (stdio) for input. Instead of calling -.B fread() -or -.B getc(), -the scanner will use the -.B read() -system call, resulting in a performance gain which varies from system -to system, but in general is probably negligible unless you are also using -.B \-Cf -or -.B \-CF. -Using -.B \-Cr -can cause strange behavior if, for example, you read from -.I yyin -using stdio prior to calling the scanner (because the scanner will miss -whatever text your previous reads left in the stdio input buffer).
-.IP -.B \-Cr -has no effect if you define -.B YY_INPUT -(see The Generated Scanner above). -.IP -A lone -.B \-C -specifies that the scanner tables should be compressed but neither -equivalence classes nor meta-equivalence classes should be used. -.IP -The options -.B \-Cf -or -.B \-CF -and -.B \-Cm -do not make sense together - there is no opportunity for meta-equivalence -classes if the table is not being compressed. Otherwise the options -may be freely mixed, and are cumulative. -.IP -The default setting is -.B \-Cem, -which specifies that -.I flex -should generate equivalence classes -and meta-equivalence classes. This setting provides the highest -degree of table compression. You can trade off -faster-executing scanners at the cost of larger tables with -the following generally being true: -.nf - - slowest & smallest - -Cem - -Cm - -Ce - -C - -C{f,F}e - -C{f,F} - -C{f,F}a - fastest & largest - -.fi -Note that scanners with the smallest tables are usually generated and -compiled the quickest, so -during development you will usually want to use the default, maximal -compression. -.IP -.B \-Cfe -is often a good compromise between speed and size for production -scanners. -.TP -.B \-Pprefix -changes the default -.I "yy" -prefix used by -.I flex -for all globally-visible variable and function names to instead be -.I prefix. -For example, -.B \-Pfoo -changes the name of -.B yytext -to -.B footext. -It also changes the name of the default output file from -.B lex.yy.c -to -.B lex.foo.c. -Here are all of the names affected: -.nf - - yyFlexLexer - yy_create_buffer - yy_delete_buffer - yy_flex_debug - yy_init_buffer - yy_load_buffer_state - yy_switch_to_buffer - yyin - yyleng - yylex - yyout - yyrestart - yytext - yywrap - -.fi -Within your scanner itself, you can still refer to the global variables -and functions using either version of their name; but externally, they -have the modified name.
-.IP -This option lets you easily link together multiple -.I flex -programs into the same executable. Note, though, that using this -option also renames -.B yywrap(), -so you now -.I must -provide your own (appropriately-named) version of the routine for your -scanner, as linking with -.B \-lfl -no longer provides one for you by default. -.TP -.B \-Sskeleton_file -overrides the default skeleton file from which -.I flex -constructs its scanners. You'll never need this option unless you are doing -.I flex -maintenance or development. -.SH PERFORMANCE CONSIDERATIONS -The main design goal of -.I flex -is that it generate high-performance scanners. It has been optimized -for dealing well with large sets of rules. Aside from the effects on -scanner speed of the table compression -.B \-C -options outlined above, -there are a number of options/actions which degrade performance. These -are, from most expensive to least: -.nf - - REJECT - - pattern sets that require backing up - arbitrary trailing context - - yymore() - '^' beginning-of-line operator - -.fi -with the first three all being quite expensive and the last two -being quite cheap. Note also that -.B unput() -is implemented as a routine call that potentially does quite a bit of -work, while -.B yyless() -is a quite-cheap macro; so if just putting back some excess text you -scanned, use -.B yyless(). -.PP -.B REJECT -should be avoided at all costs when performance is important. -It is a particularly expensive option. -.PP -Getting rid of backing up is messy and often may be an enormous -amount of work for a complicated scanner. In principle, one begins -by using the -.B \-b -flag to generate a -.I lex.backup -file.
For example, on the input -.nf - - %% - foo return TOK_KEYWORD; - foobar return TOK_KEYWORD; - -.fi -the file looks like: -.nf - - State #6 is non-accepting - - associated rule line numbers: - 2 3 - out-transitions: [ o ] - jam-transitions: EOF [ \\001-n p-\\177 ] - - State #8 is non-accepting - - associated rule line numbers: - 3 - out-transitions: [ a ] - jam-transitions: EOF [ \\001-` b-\\177 ] - - State #9 is non-accepting - - associated rule line numbers: - 3 - out-transitions: [ r ] - jam-transitions: EOF [ \\001-q s-\\177 ] - - Compressed tables always back up. - -.fi -The first few lines tell us that there's a scanner state in -which it can make a transition on an 'o' but not on any other -character, and that in that state the currently scanned text does not match -any rule. The state occurs when trying to match the rules found -at lines 2 and 3 in the input file. -If the scanner is in that state and then reads -something other than an 'o', it will have to back up to find -a rule which is matched. With -a bit of headscratching one can see that this must be the -state it's in when it has seen "fo". When this has happened, -if anything other than another 'o' is seen, the scanner will -have to back up to simply match the 'f' (by the default rule). -.PP -The comment regarding State #8 indicates there's a problem -when "foob" has been scanned. Indeed, on any character other -than an 'a', the scanner will have to back up to accept "foo". -Similarly, the comment for State #9 concerns when "fooba" has -been scanned and an 'r' does not follow. -.PP -The final comment reminds us that there's no point going to -all the trouble of removing backing up from the rules unless -we're using -.B \-Cf -or -.B \-CF, -since there's no performance gain doing so with compressed scanners. 
-.PP -The way to remove the backing up is to add "error" rules: -.nf - - %% - foo return TOK_KEYWORD; - foobar return TOK_KEYWORD; - - fooba | - foob | - fo { - /* false alarm, not really a keyword */ - return TOK_ID; - } - -.fi -.PP -Eliminating backing up among a list of keywords can also be -done using a "catch-all" rule: -.nf - - %% - foo return TOK_KEYWORD; - foobar return TOK_KEYWORD; - - [a-z]+ return TOK_ID; - -.fi -This is usually the best solution when appropriate. -.PP -Backing up messages tend to cascade. -With a complicated set of rules it's not uncommon to get hundreds -of messages. If one can decipher them, though, it often -only takes a dozen or so rules to eliminate the backing up (though -it's easy to make a mistake and have an error rule accidentally match -a valid token. A possible future -.I flex -feature will be to automatically add rules to eliminate backing up). -.PP -.I Variable -trailing context (where both the leading and trailing parts do not have -a fixed length) entails almost the same performance loss as -.B REJECT -(i.e., substantial). So when possible a rule like: -.nf - - %% - mouse|rat/(cat|dog) run(); - -.fi -is better written: -.nf - - %% - mouse/cat|dog run(); - rat/cat|dog run(); - -.fi -or as -.nf - - %% - mouse|rat/cat run(); - mouse|rat/dog run(); - -.fi -Note that here the special '|' action does -.I not -provide any savings, and can even make things worse (see -.PP -A final note regarding performance: as mentioned above in the section -How the Input is Matched, dynamically resizing -.B yytext -to accommodate huge tokens is a slow process because it presently requires that -the (huge) token be rescanned from the beginning. Thus if performance is -vital, you should attempt to match "large" quantities of text but not -"huge" quantities, where the cutoff between the two is at about 8K -characters/token.
-.PP -Another area where the user can increase a scanner's performance -(and one that's easier to implement) arises from the fact that -the longer the tokens matched, the faster the scanner will run. -This is because with long tokens the processing of most input -characters takes place in the (short) inner scanning loop, and -does not often have to go through the additional work of setting up -the scanning environment (e.g., -.B yytext) -for the action. Recall the scanner for C comments: -.nf - - %x comment - %% - int line_num = 1; - - "/*" BEGIN(comment); - - [^*\\n]* - "*"+[^*/\\n]* - \\n ++line_num; - "*"+"/" BEGIN(INITIAL); - -.fi -This could be sped up by writing it as: -.nf - - %x comment - %% - int line_num = 1; - - "/*" BEGIN(comment); - - [^*\\n]* - [^*\\n]*\\n ++line_num; - "*"+[^*/\\n]* - "*"+[^*/\\n]*\\n ++line_num; - "*"+"/" BEGIN(INITIAL); - -.fi -Now instead of each newline requiring the processing of another -action, recognizing the newlines is "distributed" over the other rules -to keep the matched text as long as possible. Note that -.I adding -rules does -.I not -slow down the scanner! The speed of the scanner is independent -of the number of rules or (modulo the considerations given at the -beginning of this section) how complicated the rules are with -regard to operators such as '*' and '|'. -.PP -A final example in speeding up a scanner: suppose you want to scan -through a file containing identifiers and keywords, one per line -and with no other extraneous characters, and recognize all the -keywords. A natural first approach is: -.nf - - %% - asm | - auto | - break | - ... etc ... - volatile | - while /* it's a keyword */ - - .|\\n /* it's not a keyword */ - -.fi -To eliminate the back-tracking, introduce a catch-all rule: -.nf - - %% - asm | - auto | - break | - ... etc ... 
- volatile | - while /* it's a keyword */ - - [a-z]+ | - .|\\n /* it's not a keyword */ - -.fi -Now, if it's guaranteed that there's exactly one word per line, -then we can reduce the total number of matches by a half by -merging in the recognition of newlines with that of the other -tokens: -.nf - - %% - asm\\n | - auto\\n | - break\\n | - ... etc ... - volatile\\n | - while\\n /* it's a keyword */ - - [a-z]+\\n | - .|\\n /* it's not a keyword */ - -.fi -One has to be careful here, as we have now reintroduced backing up -into the scanner. In particular, while -.I we -know that there will never be any characters in the input stream -other than letters or newlines, -.I flex -can't figure this out, and it will plan for possibly needing to back up -when it has scanned a token like "auto" and then the next character -is something other than a newline or a letter. Previously it would -then just match the "auto" rule and be done, but now it has no "auto" -rule, only an "auto\\n" rule. To eliminate the possibility of backing up, -we could either duplicate all rules but without final newlines, or, -since we never expect to encounter such an input and therefore don't -care how it's classified, we can introduce one more catch-all rule, this -one which doesn't include a newline: -.nf - - %% - asm\\n | - auto\\n | - break\\n | - ... etc ... - volatile\\n | - while\\n /* it's a keyword */ - - [a-z]+\\n | - [a-z]+ | - .|\\n /* it's not a keyword */ - -.fi -Compiled with -.B \-Cf, -this is about as fast as one can get a -.I flex -scanner to go for this particular problem. -.PP -A final note: -.I flex -is slow when matching NUL's, particularly when a token contains -multiple NUL's. -It's best to write rules which match -.I short -amounts of text if it's anticipated that the text will often include NUL's. -.SH GENERATING C++ SCANNERS -.I flex -provides two different ways to generate scanners for use with C++.
The -first way is to simply compile a scanner generated by -.I flex -using a C++ compiler instead of a C compiler. You should not encounter -any compilation errors (please report any you find to the email address -given in the Author section below). You can then use C++ code in your -rule actions instead of C code. Note that the default input source for -your scanner remains -.I yyin, -and default echoing is still done to -.I yyout. -Both of these remain -.I FILE * -variables and not C++ -.I streams. -.PP -You can also use -.I flex -to generate a C++ scanner class, using the -.B \-+ -option, which is automatically specified if the name of the flex -executable ends in a '+', such as -.I flex++. -When using this option, flex defaults to generating the scanner to the file -.B lex.yy.cc -instead of -.B lex.yy.c. -The generated scanner includes the header file -.I FlexLexer.h, -which defines the interface to two C++ classes. -.PP -The first class, -.B FlexLexer, -provides an abstract base class defining the general scanner class -interface. It provides the following member functions: -.TP -.B const char* YYText() -returns the text of the most recently matched token, the equivalent of -.B yytext. -.TP -.B int YYLeng() -returns the length of the most recently matched token, the equivalent of -.B yyleng. -.PP -Also provided are member functions equivalent to -.B yy_switch_to_buffer(), -.B yy_create_buffer() -(though the first argument is an -.B istream* -object pointer and not a -.B FILE*), -.B yy_delete_buffer(), -and -.B yyrestart() -(again, the first argument is an -.B istream* -object pointer). -.PP -The second class defined in -.I FlexLexer.h -is -.B yyFlexLexer, -which is derived from -.B FlexLexer. -It defines the following additional member functions: -.TP -.B -yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 ) -constructs a -.B yyFlexLexer -object using the given streams for input and output.
If not specified, -the streams default to -.B cin -and -.B cout, -respectively. -.TP -.B virtual int yylex() -performs the same role as -.B yylex() -does for ordinary flex scanners: it scans the input stream, consuming -tokens, until a rule's action returns a value. -.PP -In addition, -.B yyFlexLexer -defines the following protected virtual functions which you can redefine -in derived classes to tailor the scanner: -.TP -.B -virtual int LexerInput( char* buf, int max_size ) -reads up to -.B max_size -characters into -.B buf -and returns the number of characters read. To indicate end-of-input, -return 0 characters. Note that "interactive" scanners (see the -.B \-B -and -.B \-I -flags) define the macro -.B YY_INTERACTIVE. -If you redefine -.B LexerInput() -and need to take different actions depending on whether or not -the scanner might be scanning an interactive input source, you can -test for the presence of this name via -.B #ifdef. -.TP -.B -virtual void LexerOutput( const char* buf, int size ) -writes out -.B size -characters from the buffer -.B buf, -which, while NUL-terminated, may also contain "internal" NUL's if -the scanner's rules can match text with NUL's in them. -.TP -.B -virtual void LexerError( const char* msg ) -reports a fatal error message. The default version of this function -writes the message to the stream -.B cerr -and exits. -.PP -Note that a -.B yyFlexLexer -object contains its -.I entire -scanning state. Thus you can use such objects to create reentrant -scanners. You can instantiate multiple instances of the same -.B yyFlexLexer -class, and you can also combine multiple C++ scanner classes together -in the same program using the -.B \-P -option discussed above. -.PP -Finally, note that the -.B %array -feature is not available to C++ scanner classes; you must use -.B %pointer -(the default). -.PP -Here is an example of a simple C++ scanner: -.nf - - // An example of using the flex C++ scanner class.
- - %{ - int mylineno = 0; - %} - - string \\"[^\\n"]+\\" - - ws [ \\t]+ - - alpha [A-Za-z] - dig [0-9] - name ({alpha}|{dig}|\\$)({alpha}|{dig}|[_.\\-/$])* - num1 [-+]?{dig}+\\.?([eE][-+]?{dig}+)? - num2 [-+]?{dig}*\\.{dig}+([eE][-+]?{dig}+)? - number {num1}|{num2} - - %% - - {ws} /* skip blanks and tabs */ - - "/*" { - int c; - - while((c = yyinput()) != 0) - { - if(c == '\\n') - ++mylineno; - - else if(c == '*') - { - if((c = yyinput()) == '/') - break; - else - unput(c); - } - } - } - - {number} cout << "number " << YYText() << '\\n'; - - \\n mylineno++; - - {name} cout << "name " << YYText() << '\\n'; - - {string} cout << "string " << YYText() << '\\n'; - - %% - - int main( int /* argc */, char** /* argv */ ) - { - FlexLexer* lexer = new yyFlexLexer; - while(lexer->yylex() != 0) - ; - return 0; - } -.fi -IMPORTANT: the present form of the scanning class is -.I experimental -and may change considerably between major releases. -.SH INCOMPATIBILITIES WITH LEX AND POSIX -.I flex -is a rewrite of the AT&T Unix -.I lex -tool (the two implementations do not share any code, though), -with some extensions and incompatibilities, both of which -are of concern to those who wish to write scanners acceptable -to either implementation. The POSIX -.I lex -specification is closer to -.I flex's -behavior than that of the original -.I lex -implementation, but there also remain some incompatibilities between -.I flex -and POSIX. The intent is that ultimately -.I flex -will be fully POSIX-conformant. In this section we discuss all of -the known areas of incompatibility. -.PP -.I flex's -.B \-l -option turns on maximum compatibility with the original AT&T -.I lex -implementation, at the cost of a major loss in the generated scanner's -performance. We note below which incompatibilities can be overcome -using the -.B \-l -option. 
-.PP -.I flex -is fully compatible with -.I lex -with the following exceptions: -.IP - -The undocumented -.I lex -scanner internal variable -.B yylineno -is not supported unless -.B \-l -is used. -.IP -yylineno is not part of the POSIX specification. -.IP - -The -.B input() -routine is not redefinable, though it may be called to read characters -following whatever has been matched by a rule. If -.B input() -encounters an end-of-file the normal -.B yywrap() -processing is done. A ``real'' end-of-file is returned by -.B input() -as -.I EOF. -.IP -Input is instead controlled by defining the -.B YY_INPUT -macro. -.IP -The -.I flex -restriction that -.B input() -cannot be redefined is in accordance with the POSIX specification, -which simply does not specify any way of controlling the -scanner's input other than by making an initial assignment to -.I yyin. -.IP - -.I flex -scanners are not as reentrant as -.I lex -scanners. In particular, if you have an interactive scanner and -an interrupt handler which long-jumps out of the scanner, and -the scanner is subsequently called again, you may get the following -message: -.nf - - fatal flex scanner internal error--end of buffer missed - -.fi -To reenter the scanner, first use -.nf - - yyrestart( yyin ); - -.fi -Note that this call will throw away any buffered input; usually this -isn't a problem with an interactive scanner. -.IP -Also note that flex C++ scanner classes -.I are -reentrant, so if using C++ is an option for you, you should use -them instead. See "Generating C++ Scanners" above for details. -.IP - -.B output() -is not supported. -Output from the -.B ECHO -macro is done to the file-pointer -.I yyout -(default -.I stdout). -.IP -.B output() -is not part of the POSIX specification. -.IP - -.I lex -does not support exclusive start conditions (%x), though they -are in the POSIX specification. -.IP - -When definitions are expanded, -.I flex -encloses them in parentheses. 
-With lex, the following: -.nf - - NAME [A-Z][A-Z0-9]* - %% - foo{NAME}? printf( "Found it\\n" ); - %% - -.fi -will not match the string "foo" because when the macro -is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?" -and the precedence is such that the '?' is associated with -"[A-Z0-9]*". With -.I flex, -the rule will be expanded to -"foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match. -.IP -Note that if the definition begins with -.B ^ -or ends with -.B $ -then it is -.I not -expanded with parentheses, to allow these operators to appear in -definitions without losing their special meanings. But the -.B <s>, /, -and -.B <<EOF>> -operators cannot be used in a -.I flex -definition. -.IP -Using -.B \-l -results in the -.I lex -behavior of no parentheses around the definition. -.IP -The POSIX specification is that the definition be enclosed in parentheses. -.IP - -The -.I lex -.B %r -(generate a Ratfor scanner) option is not supported. It is not part -of the POSIX specification. -.IP - -After a call to -.B unput(), -.I yytext -and -.I yyleng -are undefined until the next token is matched, unless the scanner -was built using -.B %array. -This is not the case with -.I lex -or the POSIX specification. The -.B \-l -option does away with this incompatibility. -.IP - -The precedence of the -.B {} -(numeric range) operator is different. -.I lex -interprets "abc{1,3}" as "match one, two, or -three occurrences of 'abc'", whereas -.I flex -interprets it as "match 'ab' -followed by one, two, or three occurrences of 'c'". The latter is -in agreement with the POSIX specification. -.IP - -The precedence of the -.B ^ -operator is different. -.I lex -interprets "^foo|bar" as "match either 'foo' at the beginning of a line, -or 'bar' anywhere", whereas -.I flex -interprets it as "match either 'foo' or 'bar' if they come at the beginning -of a line". The latter is in agreement with the POSIX specification.
-.IP - -.I yyin -is -.I initialized -by -.I lex -to be -.I stdin; -.I flex, -on the other hand, -initializes -.I yyin -to NULL -and then -.I assigns -it to -.I stdin -the first time the scanner is called, providing -.I yyin -has not already been assigned to a non-NULL value. The difference is -subtle, but the net effect is that with -.I flex -scanners, -.I yyin -does not have a valid value until the scanner has been called. -.IP -The -.B \-l -option does away with this incompatibility. -.IP - -The special table-size declarations such as -.B %a -supported by -.I lex -are not required by -.I flex -scanners; -.I flex -ignores them. -.IP - -The name -.bd -FLEX_SCANNER -is #define'd so scanners may be written for use with either -.I flex -or -.I lex. -.PP -The following -.I flex -features are not included in -.I lex -or the POSIX specification: -.nf - - yyterminate() - <<EOF>> - <*> - YY_DECL - YY_START - YY_USER_ACTION - #line directives - %{}'s around actions - multiple actions on a line - -.fi -plus almost all of the flex flags. -The last feature in the list refers to the fact that with -.I flex -you can put multiple actions on the same line, separated with -semi-colons, while with -.I lex, -the following -.nf - - foo handle_foo(); ++num_foos_seen; - -.fi -is (rather surprisingly) truncated to -.nf - - foo handle_foo(); - -.fi -.I flex -does not truncate the action. Actions that are not enclosed in -braces are simply terminated at the end of the line. -.SH DIAGNOSTICS -.PP -.I warning, rule cannot be matched -indicates that the given rule -cannot be matched because it follows other rules that will -always match the same text as it. For -example, in the following "foo" cannot be matched because it comes after -an identifier "catch-all" rule: -.nf - - [a-z]+ got_identifier(); - foo got_foo(); - -.fi -Using -.B REJECT -in a scanner suppresses this warning.
-.PP -.I warning, -.B \-s -.I -option given but default rule can be matched -means that it is possible (perhaps only in a particular start condition) -that the default rule (match any single character) is the only one -that will match a particular input. Since -.B \-s -was given, presumably this is not intended. -.PP -.I reject_used_but_not_detected undefined -or -.I yymore_used_but_not_detected undefined - -These errors can occur at compile time. They indicate that the -scanner uses -.B REJECT -or -.B yymore() -but that -.I flex -failed to notice the fact, meaning that -.I flex -scanned the first two sections looking for occurrences of these actions -and failed to find any, but somehow you snuck some in (via a #include -file, for example). Make an explicit reference to the action in your -.I flex -input file. (Note that previously -.I flex -supported a -.B %used/%unused -mechanism for dealing with this problem; this feature is still supported -but now deprecated, and will go away soon unless the author hears from -people who can argue compellingly that they need it.) -.PP -.I flex scanner jammed - -a scanner compiled with -.B \-s -has encountered an input string which wasn't matched by -any of its rules. This error can also occur due to internal problems. -.PP -.I token too large, exceeds YYLMAX - -your scanner uses -.B %array -and one of its rules matched a string longer than the -.B YYLMAX -constant (8K bytes by default). You can increase the value by -#define'ing -.B YYLMAX -in the definitions section of your -.I flex -input. -.PP -.I scanner requires \-8 flag to -.I use the character 'x' - -Your scanner specification includes recognizing the 8-bit character -.I 'x' -and you did not specify the \-8 flag, and your scanner defaulted to 7-bit -because you used the -.B \-Cf -or -.B \-CF -table compression options. See the discussion of the -.B \-7 -flag for details. 
-.PP -.I flex scanner push-back overflow - -you used -.B unput() -to push back so much text that the scanner's buffer could not hold -both the pushed-back text and the current token in -.B yytext. -Ideally the scanner should dynamically resize the buffer in this case, but at -present it does not. -.PP -.I -input buffer overflow, can't enlarge buffer because scanner uses REJECT - -the scanner was working on matching an extremely large token and needed -to expand the input buffer. This doesn't work with scanners that use -.B -REJECT. -.PP -.I -fatal flex scanner internal error--end of buffer missed - -This can occur in a scanner which is reentered after a long-jump -has jumped out (or over) the scanner's activation frame. Before -reentering the scanner, use: -.nf - - yyrestart( yyin ); - -.fi -or, as noted above, switch to using the C++ scanner class. -.PP -.I too many start conditions in <> construct! - -you listed more start conditions in a <> construct than exist (so -you must have listed at least one of them twice). -.SH FILES -See flex(1). -.SH DEFICIENCIES / BUGS -Again, see flex(1). -.SH "SEE ALSO" -.PP -flex(1), lex(1), yacc(1), sed(1), awk(1). -.PP -M. E. Lesk and E. Schmidt, -.I LEX \- Lexical Analyzer Generator -.SH AUTHOR -Vern Paxson, with the help of many ideas and much inspiration from -Van Jacobson. Original version by Jef Poskanzer. The fast table -representation is a partial implementation of a design done by Van -Jacobson. The implementation was done by Kevin Gong and Vern Paxson. -.PP -Thanks to the many -.I flex -beta-testers, feedbackers, and contributors, especially Francois Pinard, -Casey Leedom, -Nelson H.F. Beebe, benson@odi.com, Peter A. Bigot, Keith Bostic, Frederic -Brehm, Nick Christopher, Jason Coughlin, Bill Cox, Dave Curtis, Scott David -Daniels, Chris G. Demetriou, Mike Donahue, Chuck Doucette, Tom Epperly, Leo -Eskin, Chris Faylor, Jon Forrest, Kaveh R.
Ghazi, -Eric Goldman, Ulrich Grepel, Jan Hajic, -Jarkko Hietaniemi, Eric Hughes, John Interrante, -Ceriel Jacobs, Jeffrey R. Jones, Henry -Juengst, Amir Katz, ken@ken.hilco.com, Kevin B. Kenny, Marq Kole, Ronald -Lamprecht, Greg Lee, Craig Leres, John Levine, Steve Liddle, -Mohamed el Lozy, Brian Madsen, Chris -Metcalf, Luke Mewburn, Jim Meyering, G.T. Nicol, Landon Noll, Marc Nozell, -Richard Ohnemus, Sven Panne, Roland Pesch, Walter Pelissero, Gaumond -Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Frederic Raimbault, -Rick Richardson, -Kevin Rodgers, Jim Roskind, -Doug Schmidt, Philippe Schnoebelen, Andreas Schwab, -Alex Siegel, Mike Stump, Paul Stuart, Dave Tallman, Chris Thewalt, -Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken -Yap, Nathan Zelle, David Zuhn, and those whose names have slipped my marginal -mail-archiving skills but whose contributions are appreciated all the -same. -.PP -Thanks to Keith Bostic, Jon Forrest, Noah Friedman, -John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T. -Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various -distribution headaches. -.PP -Thanks to Esmond Pitt and Earle Horton for 8-bit character support; to -Benson Margulies and Fred Burke for C++ support; to Kent Williams and Tom -Epperly for C++ class support; to Ove Ewerlid for support of NUL's; and to -Eric Hughes for support of multiple buffers. -.PP -This work was primarily done when I was with the Real Time Systems Group -at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there -for the support I received. -.PP -Send comments to: -.nf - - Vern Paxson - Systems Engineering - Bldg. 46A, Room 1123 - Lawrence Berkeley Laboratory - University of California - Berkeley, CA 94720 - - vern@ee.lbl.gov - -.fi -- cgit v1.1