summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authornate <nate@FreeBSD.org>1993-07-06 18:45:26 +0000
committernate <nate@FreeBSD.org>1993-07-06 18:45:26 +0000
commita166e91d63467bf059c2b62af47de4924a2d0fef (patch)
treeca8c8ea4e51223ded295999ec6a3751a752e05a2
parent28938b81988b68e0d5eae82228889fc1258b4034 (diff)
downloadFreeBSD-src-a166e91d63467bf059c2b62af47de4924a2d0fef.zip
FreeBSD-src-a166e91d63467bf059c2b62af47de4924a2d0fef.tar.gz
Update GNU grep to version 2.0
-rw-r--r--gnu/usr.bin/grep/AUTHORS29
-rw-r--r--gnu/usr.bin/grep/Makefile14
-rw-r--r--gnu/usr.bin/grep/NEWS35
-rw-r--r--gnu/usr.bin/grep/PROJECTS15
-rw-r--r--gnu/usr.bin/grep/README68
-rw-r--r--gnu/usr.bin/grep/dfa.c2491
-rw-r--r--gnu/usr.bin/grep/dfa.h308
-rw-r--r--gnu/usr.bin/grep/getopt.c137
-rw-r--r--gnu/usr.bin/grep/getopt.h40
-rw-r--r--gnu/usr.bin/grep/getpagesize.h42
-rw-r--r--gnu/usr.bin/grep/grep.1507
-rw-r--r--gnu/usr.bin/grep/grep.c1505
-rw-r--r--gnu/usr.bin/grep/grep.h53
-rw-r--r--gnu/usr.bin/grep/kwset.c805
-rw-r--r--gnu/usr.bin/grep/kwset.h69
-rw-r--r--gnu/usr.bin/grep/obstack.c454
-rw-r--r--gnu/usr.bin/grep/obstack.h484
-rw-r--r--gnu/usr.bin/grep/regex.c5811
-rw-r--r--gnu/usr.bin/grep/regex.h625
-rw-r--r--gnu/usr.bin/grep/search.c481
-rw-r--r--gnu/usr.bin/grep/tests/check.sh24
-rw-r--r--gnu/usr.bin/grep/tests/scriptgen.awk4
-rw-r--r--gnu/usr.bin/grep/tests/spencer.tests2
23 files changed, 10085 insertions, 3918 deletions
diff --git a/gnu/usr.bin/grep/AUTHORS b/gnu/usr.bin/grep/AUTHORS
new file mode 100644
index 0000000..e3e033b
--- /dev/null
+++ b/gnu/usr.bin/grep/AUTHORS
@@ -0,0 +1,29 @@
+Mike Haertel wrote the main program and the dfa and kwset matchers.
+
+Arthur David Olson contributed the heuristics for finding fixed substrings
+at the end of dfa.c.
+
+Richard Stallman and Karl Berry wrote the regex backtracking matcher.
+
+Henry Spencer wrote the original test suite from which grep's was derived.
+
+Scott Anderson invented the Khadafy test.
+
+David MacKenzie wrote the automatic configuration software use to
+produce the configure script.
+
+Authors of the replacements for standard library routines are identified
+in the corresponding source files.
+
+The idea of using Boyer-Moore type algorithms to quickly filter out
+non-matching text before calling the regexp matcher was originally due
+to James Woods. He also contributed some code to early versions of
+GNU grep.
+
+Finally, I would like to thank Andrew Hume for many fascinating discussions
+of string searching issues over the years. Hume & Sunday's excellent
+paper on fast string searching (AT&T Bell Laboratories CSTR #156)
+describes some of the history of the subject, as well as providing
+exhaustive performance analysis of various implementation alternatives.
+The inner loop of GNU grep is similar to Hume & Sunday's recommended
+"Tuned Boyer Moore" inner loop.
diff --git a/gnu/usr.bin/grep/Makefile b/gnu/usr.bin/grep/Makefile
index bc6fbf4..f062a64 100644
--- a/gnu/usr.bin/grep/Makefile
+++ b/gnu/usr.bin/grep/Makefile
@@ -1,7 +1,13 @@
-
PROG= grep
-SRCS= dfa.c regex.c grep.c
-CFLAGS+=-DSTDC_HEADERS=1 -DHAVE_UNISTD_H=1
-MLINKS= grep.1 egrep.1
+SRCS= dfa.c grep.c getopt.c kwset.c obstack.c regex.c search.c
+
+CFLAGS+=-DGREP -DHAVE_STRING_H=1 -DHAVE_SYS_PARAM_H=1 -DHAVE_UNISTD_H=1 \
+ -DHAVE_GETPAGESIZE=1 -DHAVE_MEMCHR=1 -DHAVE_STRERROR=1 \
+ -DHAVE_VALLOC=1
+
+
+#check: ${.CURDIR}/grep
+check: all
+ awk sh ${.CURDIR}/tests/check.sh ${.CURDIR}/tests
.include <bsd.prog.mk>
diff --git a/gnu/usr.bin/grep/NEWS b/gnu/usr.bin/grep/NEWS
new file mode 100644
index 0000000..eb0b513
--- /dev/null
+++ b/gnu/usr.bin/grep/NEWS
@@ -0,0 +1,35 @@
+Version 2.0:
+
+The most important user visible change is that egrep and fgrep have
+disappeared as separate programs into the single grep program mandated
+by POSIX 1003.2. New options -G, -E, and -F have been added,
+selecting grep, egrep, and fgrep behavior respectively. For
+compatibility with historical practice, hard links named egrep and
+fgrep are also provided. See the manual page for details.
+
+In addition, the regular expression facilities described in Posix
+draft 11.2 are now supported, except for internationalization features
+related to locale-dependent collating sequence information.
+
+There is a new option, -L, which is like -l except it lists
+files which don't contain matches. The reason this option was
+added is because '-l -v' doesn't do what you expect.
+
+Performance has been improved; the amount of improvement is platform
+dependent, but (for example) grep 2.0 typically runs at least 30% faster
+than grep 1.6 on a DECstation using the MIPS compiler. Where possible,
+grep now uses mmap() for file input; on a Sun 4 running SunOS 4.1 this
+may cut system time by as much as half, for a total reduction in running
+time by nearly 50%. On machines that don't use mmap(), the buffering
+code has been rewritten to choose more favorable alignments and buffer
+sizes for read().
+
+Portability has been substantially cleaned up, and an automatic
+configure script is now provided.
+
+The internals have changed in ways too numerous to mention.
+People brave enough to reuse the DFA matcher in other programs
+will now have their bravery amply "rewarded", for the interface
+to that file has been completely changed. Some changes were
+necessary to track the evolution of the regex package, and since
+I was changing it anyway I decided to do a general cleanup.
diff --git a/gnu/usr.bin/grep/PROJECTS b/gnu/usr.bin/grep/PROJECTS
new file mode 100644
index 0000000..67e9a2a
--- /dev/null
+++ b/gnu/usr.bin/grep/PROJECTS
@@ -0,0 +1,15 @@
+Write Texinfo documentation for grep. The manual page would be a good
+place to start, but Info documents are also supposed to contain a
+tutorial and examples.
+
+Fix the DFA matcher to never use exponential space. (Fortunately, these
+cases are rare.)
+
+Improve the performance of the regex backtracking matcher. This matcher
+is agonizingly slow, and is responsible for grep sometimes being slower
+than Unix grep when backreferences are used.
+
+Provide support for the Posix [= =] and [. .] constructs. This is
+difficult because it requires locale-dependent details of the character
+set and collating sequence, but Posix does not standardize any method
+for accessing this information!
diff --git a/gnu/usr.bin/grep/README b/gnu/usr.bin/grep/README
index 27f5bae..bc34a85 100644
--- a/gnu/usr.bin/grep/README
+++ b/gnu/usr.bin/grep/README
@@ -1,70 +1,28 @@
-This README documents GNU e?grep version 1.6. All bugs reported for
-previous versions have been fixed.
+This is GNU grep 2.0, the "fastest grep in the west" (we hope). All
+bugs reported in previous releases have been fixed. Many exciting new
+bugs have probably been introduced in this major revision.
-See the file INSTALL for compilation and installation instructions.
-
-Send bug reports to bug-gnu-utils@prep.ai.mit.edu.
-
-GNU e?grep is provided "as is" with no warranty. The exact terms
+GNU grep is provided "as is" with no warranty. The exact terms
under which you may use and (re)distribute this program are detailed
in the GNU General Public License, in the file COPYING.
-GNU e?grep is based on a fast lazy-state deterministic matcher (about
+GNU grep is based on a fast lazy-state deterministic matcher (about
twice as fast as stock Unix egrep) hybridized with a Boyer-Moore-Gosper
search for a fixed string that eliminates impossible text from being
considered by the full regexp matcher without necessarily having to
look at every character. The result is typically many times faster
than Unix grep or egrep. (Regular expressions containing backreferencing
-may run more slowly, however.)
-
-GNU e?grep is brought to you by the efforts of several people:
-
- Mike Haertel wrote the deterministic regexp code and the bulk
- of the program.
-
- James A. Woods is responsible for the hybridized search strategy
- of using Boyer-Moore-Gosper fixed-string search as a filter
- before calling the general regexp matcher.
+will run more slowly, however.)
- Arthur David Olson contributed code that finds fixed strings for
- the aforementioned BMG search for a large class of regexps.
+See the file AUTHORS for a list of authors and other contributors.
- Richard Stallman wrote the backtracking regexp matcher that is
- used for \<digit> backreferences, as well as the getopt that
- is provided for 4.2BSD sites. The backtracking matcher was
- originally written for GNU Emacs.
-
- D. A. Gwyn wrote the C alloca emulation that is provided so
- System V machines can run this program. (Alloca is used only
- by RMS' backtracking matcher, and then only rarely, so there
- is no loss if your machine doesn't have a "real" alloca.)
-
- Scott Anderson and Henry Spencer designed the regression tests
- used in the "regress" script.
-
- Paul Placeway wrote the manual page, based on this README.
-
-If you are interested in improving this program, you may wish to try
-any of the following:
+See the file INSTALL for compilation and installation instructions.
-1. Replace the fast search loop with a faster search loop.
- There are several things that could be improved, the most notable
- of which would be to calculate a minimal delta2 to use.
+See the file MANIFEST for a list of files in this distribution.
-2. Make backreferencing \<digit> faster. Right now, backreferencing is
- handled by calling the Emacs backtracking matcher to verify the partial
- match. This is slow; if the DFA routines could handle backreferencing
- themselves a speedup on the order of three to four times might occur
- in those cases where the backtracking matcher is called to verify nearly
- every line. Also, some portability problems due to the inclusion of the
- emacs matcher would be solved because it could then be eliminated.
- Note that expressions with backreferencing are not true regular
- expressions, and thus are not equivalent to any DFA. So this is hard.
+See the file NEWS for a description of major changes in this release.
-3. Handle POSIX style regexps. I'm not sure if this could be called an
- improvement; some of the things on regexps in the POSIX draft I have
- seen are pretty sickening. But it would be useful in the interests of
- conforming to the standard.
+See the file PROJECTS if you want to be mentioned in AUTHORS.
-4. Replace the main driver program grep.c with the much cleaner main driver
- program used in GNU fgrep.
+Send bug reports to bug-gnu-utils@prep.ai.mit.edu. Be sure to
+include the word "grep" in your Subject: header field.
diff --git a/gnu/usr.bin/grep/dfa.c b/gnu/usr.bin/grep/dfa.c
index 08b383d..fc649af 100644
--- a/gnu/usr.bin/grep/dfa.c
+++ b/gnu/usr.bin/grep/dfa.c
@@ -1,4 +1,4 @@
-/* dfa.c - determinisitic extended regexp routines for GNU
+/* dfa.c - deterministic extended regexp routines for GNU
Copyright (C) 1988 Free Software Foundation, Inc.
This program is free software; you can redistribute it and/or modify
@@ -17,20 +17,59 @@
/* Written June, 1988 by Mike Haertel
Modified July, 1988 by Arthur David Olson to assist BMG speedups */
-
-#include <stdio.h>
+
#include <assert.h>
+#include <ctype.h>
+#include <stdio.h>
+
+#ifdef STDC_HEADERS
+#include <stdlib.h>
+#else
+#include <sys/types.h>
+extern char *calloc(), *malloc(), *realloc();
+extern void free();
+#endif
-#if defined(USG) || defined(STDC_HEADERS)
+#if defined(HAVE_STRING_H) || defined(STDC_HEADERS)
#include <string.h>
-#ifndef index
+#undef index
#define index strchr
-#endif
#else
#include <strings.h>
#endif
+#ifndef isgraph
+#define isgraph(C) (isprint(C) && !isspace(C))
+#endif
+
+#ifdef isascii
+#define ISALPHA(C) (isascii(C) && isalpha(C))
+#define ISUPPER(C) (isascii(C) && isupper(C))
+#define ISLOWER(C) (isascii(C) && islower(C))
+#define ISDIGIT(C) (isascii(C) && isdigit(C))
+#define ISXDIGIT(C) (isascii(C) && isxdigit(C))
+#define ISSPACE(C) (isascii(C) && isspace(C))
+#define ISPUNCT(C) (isascii(C) && ispunct(C))
+#define ISALNUM(C) (isascii(C) && isalnum(C))
+#define ISPRINT(C) (isascii(C) && isprint(C))
+#define ISGRAPH(C) (isascii(C) && isgraph(C))
+#define ISCNTRL(C) (isascii(C) && iscntrl(C))
+#else
+#define ISALPHA(C) isalpha(C)
+#define ISUPPER(C) isupper(C)
+#define ISLOWER(C) islower(C)
+#define ISDIGIT(C) isdigit(C)
+#define ISXDIGIT(C) isxdigit(C)
+#define ISSPACE(C) isspace(C)
+#define ISPUNCT(C) ispunct(C)
+#define ISALNUM(C) isalnum(C)
+#define ISPRINT(C) isprint(C)
+#define ISGRAPH(C) isgraph(C)
+#define ISCNTRL(C) iscntrl(C)
+#endif
+
#include "dfa.h"
+#include "regex.h"
#if __STDC__
typedef void *ptr_t;
@@ -38,7 +77,7 @@ typedef void *ptr_t;
typedef char *ptr_t;
#endif
-static void regmust();
+static void dfamust();
static ptr_t
xcalloc(n, s)
@@ -48,11 +87,11 @@ xcalloc(n, s)
ptr_t r = calloc(n, s);
if (!r)
- regerror("Memory exhausted");
+ dfaerror("Memory exhausted");
return r;
}
-ptr_t /* Not static, so alloca.o can use it. */
+static ptr_t
xmalloc(n)
size_t n;
{
@@ -60,7 +99,7 @@ xmalloc(n)
assert(n != 0);
if (!r)
- regerror("Memory exhausted");
+ dfaerror("Memory exhausted");
return r;
}
@@ -73,7 +112,7 @@ xrealloc(p, n)
assert(n != 0);
if (!r)
- regerror("Memory exhausted");
+ dfaerror("Memory exhausted");
return r;
}
@@ -89,54 +128,52 @@ xrealloc(p, n)
(nalloc) *= 2; \
REALLOC(p, t, nalloc); \
}
-
+
#ifdef DEBUG
-#include <stdio.h>
static void
prtok(t)
- _token t;
+ token t;
{
char *s;
if (t < 0)
fprintf(stderr, "END");
- else if (t < _NOTCHAR)
+ else if (t < NOTCHAR)
fprintf(stderr, "%c", t);
else
{
switch (t)
{
- case _EMPTY: s = "EMPTY"; break;
- case _BACKREF: s = "BACKREF"; break;
- case _BEGLINE: s = "BEGLINE"; break;
- case _ALLBEGLINE: s = "ALLBEGLINE"; break;
- case _ENDLINE: s = "ENDLINE"; break;
- case _ALLENDLINE: s = "ALLENDLINE"; break;
- case _BEGWORD: s = "BEGWORD"; break;
- case _ENDWORD: s = "ENDWORD"; break;
- case _LIMWORD: s = "LIMWORD"; break;
- case _NOTLIMWORD: s = "NOTLIMWORD"; break;
- case _QMARK: s = "QMARK"; break;
- case _STAR: s = "STAR"; break;
- case _PLUS: s = "PLUS"; break;
- case _CAT: s = "CAT"; break;
- case _OR: s = "OR"; break;
- case _LPAREN: s = "LPAREN"; break;
- case _RPAREN: s = "RPAREN"; break;
- default: s = "SET"; break;
+ case EMPTY: s = "EMPTY"; break;
+ case BACKREF: s = "BACKREF"; break;
+ case BEGLINE: s = "BEGLINE"; break;
+ case ENDLINE: s = "ENDLINE"; break;
+ case BEGWORD: s = "BEGWORD"; break;
+ case ENDWORD: s = "ENDWORD"; break;
+ case LIMWORD: s = "LIMWORD"; break;
+ case NOTLIMWORD: s = "NOTLIMWORD"; break;
+ case QMARK: s = "QMARK"; break;
+ case STAR: s = "STAR"; break;
+ case PLUS: s = "PLUS"; break;
+ case CAT: s = "CAT"; break;
+ case OR: s = "OR"; break;
+ case ORTOP: s = "ORTOP"; break;
+ case LPAREN: s = "LPAREN"; break;
+ case RPAREN: s = "RPAREN"; break;
+ default: s = "CSET"; break;
}
fprintf(stderr, "%s", s);
}
}
#endif /* DEBUG */
-/* Stuff pertaining to charsets. */
+/* Stuff pertaining to charclasses. */
static int
tstbit(b, c)
int b;
- _charset c;
+ charclass c;
{
return c[b / INTBITS] & 1 << b % INTBITS;
}
@@ -144,7 +181,7 @@ tstbit(b, c)
static void
setbit(b, c)
int b;
- _charset c;
+ charclass c;
{
c[b / INTBITS] |= 1 << b % INTBITS;
}
@@ -152,83 +189,83 @@ setbit(b, c)
static void
clrbit(b, c)
int b;
- _charset c;
+ charclass c;
{
c[b / INTBITS] &= ~(1 << b % INTBITS);
}
static void
copyset(src, dst)
- const _charset src;
- _charset dst;
+ charclass src;
+ charclass dst;
{
int i;
- for (i = 0; i < _CHARSET_INTS; ++i)
+ for (i = 0; i < CHARCLASS_INTS; ++i)
dst[i] = src[i];
}
static void
zeroset(s)
- _charset s;
+ charclass s;
{
int i;
- for (i = 0; i < _CHARSET_INTS; ++i)
+ for (i = 0; i < CHARCLASS_INTS; ++i)
s[i] = 0;
}
static void
notset(s)
- _charset s;
+ charclass s;
{
int i;
- for (i = 0; i < _CHARSET_INTS; ++i)
+ for (i = 0; i < CHARCLASS_INTS; ++i)
s[i] = ~s[i];
}
static int
equal(s1, s2)
- const _charset s1;
- const _charset s2;
+ charclass s1;
+ charclass s2;
{
int i;
- for (i = 0; i < _CHARSET_INTS; ++i)
+ for (i = 0; i < CHARCLASS_INTS; ++i)
if (s1[i] != s2[i])
return 0;
return 1;
}
-
-/* A pointer to the current regexp is kept here during parsing. */
-static struct regexp *reg;
-/* Find the index of charset s in reg->charsets, or allocate a new charset. */
+/* A pointer to the current dfa is kept here during parsing. */
+static struct dfa *dfa;
+
+/* Find the index of charclass s in dfa->charclasses, or allocate a new charclass. */
static int
-charset_index(s)
- const _charset s;
+charclass_index(s)
+ charclass s;
{
int i;
- for (i = 0; i < reg->cindex; ++i)
- if (equal(s, reg->charsets[i]))
+ for (i = 0; i < dfa->cindex; ++i)
+ if (equal(s, dfa->charclasses[i]))
return i;
- REALLOC_IF_NECESSARY(reg->charsets, _charset, reg->calloc, reg->cindex);
- ++reg->cindex;
- copyset(s, reg->charsets[i]);
+ REALLOC_IF_NECESSARY(dfa->charclasses, charclass, dfa->calloc, dfa->cindex);
+ ++dfa->cindex;
+ copyset(s, dfa->charclasses[i]);
return i;
}
/* Syntax bits controlling the behavior of the lexical analyzer. */
-static syntax_bits, syntax_bits_set;
+static int syntax_bits, syntax_bits_set;
/* Flag for case-folding letters into sets. */
-static case_fold;
+static int case_fold;
/* Entry point to set syntax options. */
void
-regsyntax(bits, fold)
+dfasyntax(bits, fold)
int bits;
int fold;
{
@@ -237,63 +274,132 @@ regsyntax(bits, fold)
case_fold = fold;
}
-/* Lexical analyzer. */
-static const char *lexstart; /* Pointer to beginning of input string. */
-static const char *lexptr; /* Pointer to next input character. */
+/* Lexical analyzer. All the dross that deals with the obnoxious
+ GNU Regex syntax bits is located here. The poor, suffering
+ reader is referred to the GNU Regex documentation for the
+ meaning of the @#%!@#%^!@ syntax bits. */
+
+static char *lexstart; /* Pointer to beginning of input string. */
+static char *lexptr; /* Pointer to next input character. */
static lexleft; /* Number of characters remaining. */
-static caret_allowed; /* True if backward context allows ^
- (meaningful only if RE_CONTEXT_INDEP_OPS
- is turned off). */
-static closure_allowed; /* True if backward context allows closures
- (meaningful only if RE_CONTEXT_INDEP_OPS
- is turned off). */
+static token lasttok; /* Previous token returned; initially END. */
+static int laststart; /* True if we're separated from beginning or (, |
+ only by zero-width characters. */
+static int parens; /* Count of outstanding left parens. */
+static int minrep, maxrep; /* Repeat counts for {m,n}. */
/* Note that characters become unsigned here. */
#define FETCH(c, eoferr) \
{ \
if (! lexleft) \
- if (eoferr) \
- regerror(eoferr); \
+ if (eoferr != 0) \
+ dfaerror(eoferr); \
else \
- return _END; \
+ return END; \
(c) = (unsigned char) *lexptr++; \
--lexleft; \
}
-static _token
+#define FUNC(F, P) static int F(c) int c; { return P(c); }
+
+FUNC(is_alpha, ISALPHA)
+FUNC(is_upper, ISUPPER)
+FUNC(is_lower, ISLOWER)
+FUNC(is_digit, ISDIGIT)
+FUNC(is_xdigit, ISXDIGIT)
+FUNC(is_space, ISSPACE)
+FUNC(is_punct, ISPUNCT)
+FUNC(is_alnum, ISALNUM)
+FUNC(is_print, ISPRINT)
+FUNC(is_graph, ISGRAPH)
+FUNC(is_cntrl, ISCNTRL)
+
+/* The following list maps the names of the Posix named character classes
+ to predicate functions that determine whether a given character is in
+ the class. The leading [ has already been eaten by the lexical analyzer. */
+static struct {
+ char *name;
+ int (*pred)();
+} prednames[] = {
+ ":alpha:]", is_alpha,
+ ":upper:]", is_upper,
+ ":lower:]", is_lower,
+ ":digit:]", is_digit,
+ ":xdigit:]", is_xdigit,
+ ":space:]", is_space,
+ ":punct:]", is_punct,
+ ":alnum:]", is_alnum,
+ ":print:]", is_print,
+ ":graph:]", is_graph,
+ ":cntrl:]", is_cntrl,
+ 0
+};
+
+static int
+looking_at(s)
+ char *s;
+{
+ int len;
+
+ len = strlen(s);
+ if (lexleft < len)
+ return 0;
+ return strncmp(s, lexptr, len) == 0;
+}
+
+static token
lex()
{
- _token c, c2;
- int invert;
- _charset cset;
+ token c, c1, c2;
+ int backslash = 0, invert;
+ charclass ccl;
+ int i;
- FETCH(c, (char *) 0);
- switch (c)
+ /* Basic plan: We fetch a character. If it's a backslash,
+ we set the backslash flag and go through the loop again.
+ On the plus side, this avoids having a duplicate of the
+ main switch inside the backslash case. On the minus side,
+ it means that just about every case begins with
+ "if (backslash) ...". */
+ for (i = 0; i < 2; ++i)
{
- case '^':
- if (! (syntax_bits & RE_CONTEXT_INDEP_OPS)
- && (!caret_allowed ||
- (syntax_bits & RE_TIGHT_VBAR) && lexptr - 1 != lexstart))
- goto normal_char;
- caret_allowed = 0;
- return syntax_bits & RE_TIGHT_VBAR ? _ALLBEGLINE : _BEGLINE;
-
- case '$':
- if (syntax_bits & RE_CONTEXT_INDEP_OPS || !lexleft
- || (! (syntax_bits & RE_TIGHT_VBAR)
- && ((syntax_bits & RE_NO_BK_PARENS
- ? lexleft > 0 && *lexptr == ')'
- : lexleft > 1 && *lexptr == '\\' && lexptr[1] == ')')
- || (syntax_bits & RE_NO_BK_VBAR
- ? lexleft > 0 && *lexptr == '|'
- : lexleft > 1 && *lexptr == '\\' && lexptr[1] == '|'))))
- return syntax_bits & RE_TIGHT_VBAR ? _ALLENDLINE : _ENDLINE;
- goto normal_char;
-
- case '\\':
- FETCH(c, "Unfinished \\ quote");
+ FETCH(c, 0);
switch (c)
{
+ case '\\':
+ if (backslash)
+ goto normal_char;
+ if (lexleft == 0)
+ dfaerror("Unfinished \\ escape");
+ backslash = 1;
+ break;
+
+ case '^':
+ if (backslash)
+ goto normal_char;
+ if (syntax_bits & RE_CONTEXT_INDEP_ANCHORS
+ || lasttok == END
+ || lasttok == LPAREN
+ || lasttok == OR)
+ return lasttok = BEGLINE;
+ goto normal_char;
+
+ case '$':
+ if (backslash)
+ goto normal_char;
+ if (syntax_bits & RE_CONTEXT_INDEP_ANCHORS
+ || lexleft == 0
+ || (syntax_bits & RE_NO_BK_PARENS
+ ? lexleft > 0 && *lexptr == ')'
+ : lexleft > 1 && lexptr[0] == '\\' && lexptr[1] == ')')
+ || (syntax_bits & RE_NO_BK_VBAR
+ ? lexleft > 0 && *lexptr == '|'
+ : lexleft > 1 && lexptr[0] == '\\' && lexptr[1] == '|')
+ || ((syntax_bits & RE_NEWLINE_ALT)
+ && lexleft > 0 && *lexptr == '\n'))
+ return lasttok = ENDLINE;
+ goto normal_char;
+
case '1':
case '2':
case '3':
@@ -303,238 +409,307 @@ lex()
case '7':
case '8':
case '9':
- caret_allowed = 0;
- closure_allowed = 1;
- return _BACKREF;
+ if (backslash && !(syntax_bits & RE_NO_BK_REFS))
+ {
+ laststart = 0;
+ return lasttok = BACKREF;
+ }
+ goto normal_char;
case '<':
- caret_allowed = 0;
- return _BEGWORD;
+ if (backslash)
+ return lasttok = BEGWORD;
+ goto normal_char;
case '>':
- caret_allowed = 0;
- return _ENDWORD;
+ if (backslash)
+ return lasttok = ENDWORD;
+ goto normal_char;
case 'b':
- caret_allowed = 0;
- return _LIMWORD;
+ if (backslash)
+ return lasttok = LIMWORD;
+ goto normal_char;
case 'B':
- caret_allowed = 0;
- return _NOTLIMWORD;
-
- case 'w':
- case 'W':
- zeroset(cset);
- for (c2 = 0; c2 < _NOTCHAR; ++c2)
- if (ISALNUM(c2))
- setbit(c2, cset);
- if (c == 'W')
- notset(cset);
- caret_allowed = 0;
- closure_allowed = 1;
- return _SET + charset_index(cset);
+ if (backslash)
+ return lasttok = NOTLIMWORD;
+ goto normal_char;
case '?':
- if (syntax_bits & RE_BK_PLUS_QM)
- goto qmark;
- goto normal_char;
+ if (syntax_bits & RE_LIMITED_OPS)
+ goto normal_char;
+ if (backslash != ((syntax_bits & RE_BK_PLUS_QM) != 0))
+ goto normal_char;
+ if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && laststart)
+ goto normal_char;
+ return lasttok = QMARK;
+
+ case '*':
+ if (backslash)
+ goto normal_char;
+ if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && laststart)
+ goto normal_char;
+ return lasttok = STAR;
case '+':
- if (syntax_bits & RE_BK_PLUS_QM)
- goto plus;
- goto normal_char;
+ if (syntax_bits & RE_LIMITED_OPS)
+ goto normal_char;
+ if (backslash != ((syntax_bits & RE_BK_PLUS_QM) != 0))
+ goto normal_char;
+ if (!(syntax_bits & RE_CONTEXT_INDEP_OPS) && laststart)
+ goto normal_char;
+ return lasttok = PLUS;
+
+ case '{':
+ if (!(syntax_bits & RE_INTERVALS))
+ goto normal_char;
+ if (backslash != ((syntax_bits & RE_NO_BK_BRACES) == 0))
+ goto normal_char;
+ minrep = maxrep = 0;
+ /* Cases:
+ {M} - exact count
+ {M,} - minimum count, maximum is infinity
+ {,M} - 0 through M
+ {M,N} - M through N */
+ FETCH(c, "unfinished repeat count");
+ if (ISDIGIT(c))
+ {
+ minrep = c - '0';
+ for (;;)
+ {
+ FETCH(c, "unfinished repeat count");
+ if (!ISDIGIT(c))
+ break;
+ minrep = 10 * minrep + c - '0';
+ }
+ }
+ else if (c != ',')
+ dfaerror("malformed repeat count");
+ if (c == ',')
+ for (;;)
+ {
+ FETCH(c, "unfinished repeat count");
+ if (!ISDIGIT(c))
+ break;
+ maxrep = 10 * maxrep + c - '0';
+ }
+ else
+ maxrep = minrep;
+ if (!(syntax_bits & RE_NO_BK_BRACES))
+ {
+ if (c != '\\')
+ dfaerror("malformed repeat count");
+ FETCH(c, "unfinished repeat count");
+ }
+ if (c != '}')
+ dfaerror("malformed repeat count");
+ laststart = 0;
+ return lasttok = REPMN;
case '|':
- if (! (syntax_bits & RE_NO_BK_VBAR))
- goto or;
- goto normal_char;
+ if (syntax_bits & RE_LIMITED_OPS)
+ goto normal_char;
+ if (backslash != ((syntax_bits & RE_NO_BK_VBAR) == 0))
+ goto normal_char;
+ laststart = 1;
+ return lasttok = OR;
+
+ case '\n':
+ if (syntax_bits & RE_LIMITED_OPS
+ || backslash
+ || !(syntax_bits & RE_NEWLINE_ALT))
+ goto normal_char;
+ laststart = 1;
+ return lasttok = OR;
case '(':
- if (! (syntax_bits & RE_NO_BK_PARENS))
- goto lparen;
- goto normal_char;
+ if (backslash != ((syntax_bits & RE_NO_BK_PARENS) == 0))
+ goto normal_char;
+ ++parens;
+ laststart = 1;
+ return lasttok = LPAREN;
case ')':
- if (! (syntax_bits & RE_NO_BK_PARENS))
- goto rparen;
- goto normal_char;
-
- default:
- goto normal_char;
- }
+ if (backslash != ((syntax_bits & RE_NO_BK_PARENS) == 0))
+ goto normal_char;
+ if (parens == 0 && syntax_bits & RE_UNMATCHED_RIGHT_PAREN_ORD)
+ goto normal_char;
+ --parens;
+ laststart = 0;
+ return lasttok = RPAREN;
+
+ case '.':
+ if (backslash)
+ goto normal_char;
+ zeroset(ccl);
+ notset(ccl);
+ if (!(syntax_bits & RE_DOT_NEWLINE))
+ clrbit('\n', ccl);
+ if (syntax_bits & RE_DOT_NOT_NULL)
+ clrbit('\0', ccl);
+ laststart = 0;
+ return lasttok = CSET + charclass_index(ccl);
- case '?':
- if (syntax_bits & RE_BK_PLUS_QM)
- goto normal_char;
- qmark:
- if (! (syntax_bits & RE_CONTEXT_INDEP_OPS) && !closure_allowed)
- goto normal_char;
- return _QMARK;
-
- case '*':
- if (! (syntax_bits & RE_CONTEXT_INDEP_OPS) && !closure_allowed)
- goto normal_char;
- return _STAR;
-
- case '+':
- if (syntax_bits & RE_BK_PLUS_QM)
- goto normal_char;
- plus:
- if (! (syntax_bits & RE_CONTEXT_INDEP_OPS) && !closure_allowed)
- goto normal_char;
- return _PLUS;
-
- case '|':
- if (! (syntax_bits & RE_NO_BK_VBAR))
- goto normal_char;
- or:
- caret_allowed = 1;
- closure_allowed = 0;
- return _OR;
-
- case '\n':
- if (! (syntax_bits & RE_NEWLINE_OR))
- goto normal_char;
- goto or;
-
- case '(':
- if (! (syntax_bits & RE_NO_BK_PARENS))
- goto normal_char;
- lparen:
- caret_allowed = 1;
- closure_allowed = 0;
- return _LPAREN;
-
- case ')':
- if (! (syntax_bits & RE_NO_BK_PARENS))
- goto normal_char;
- rparen:
- caret_allowed = 0;
- closure_allowed = 1;
- return _RPAREN;
-
- case '.':
- zeroset(cset);
- notset(cset);
- clrbit('\n', cset);
- caret_allowed = 0;
- closure_allowed = 1;
- return _SET + charset_index(cset);
-
- case '[':
- zeroset(cset);
- FETCH(c, "Unbalanced [");
- if (c == '^')
- {
+ case 'w':
+ case 'W':
+ if (!backslash)
+ goto normal_char;
+ zeroset(ccl);
+ for (c2 = 0; c2 < NOTCHAR; ++c2)
+ if (ISALNUM(c2))
+ setbit(c2, ccl);
+ if (c == 'W')
+ notset(ccl);
+ laststart = 0;
+ return lasttok = CSET + charclass_index(ccl);
+
+ case '[':
+ if (backslash)
+ goto normal_char;
+ zeroset(ccl);
FETCH(c, "Unbalanced [");
- invert = 1;
- }
- else
- invert = 0;
- do
- {
- FETCH(c2, "Unbalanced [");
- if (c2 == '-')
+ if (c == '^')
{
- FETCH(c2, "Unbalanced [");
+ FETCH(c, "Unbalanced [");
+ invert = 1;
+ }
+ else
+ invert = 0;
+ do
+ {
+ /* Nobody ever said this had to be fast. :-)
+ Note that if we're looking at some other [:...:]
+ construct, we just treat it as a bunch of ordinary
+ characters. We can do this because we assume
+ regex has checked for syntax errors before
+ dfa is ever called. */
+ if (c == '[' && (syntax_bits & RE_CHAR_CLASSES))
+ for (c1 = 0; prednames[c1].name; ++c1)
+ if (looking_at(prednames[c1].name))
+ {
+ for (c2 = 0; c2 < NOTCHAR; ++c2)
+ if ((*prednames[c1].pred)(c2))
+ setbit(c2, ccl);
+ lexptr += strlen(prednames[c1].name);
+ lexleft -= strlen(prednames[c1].name);
+ FETCH(c1, "Unbalanced [");
+ goto skip;
+ }
+ if (c == '\\' && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
+ FETCH(c, "Unbalanced [");
+ FETCH(c1, "Unbalanced [");
+ if (c1 == '-')
+ {
+ FETCH(c2, "Unbalanced [");
+ if (c2 == ']')
+ {
+ /* In the case [x-], the - is an ordinary hyphen,
+ which is left in c1, the lookahead character. */
+ --lexptr;
+ ++lexleft;
+ c2 = c;
+ }
+ else
+ {
+ if (c2 == '\\'
+ && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
+ FETCH(c2, "Unbalanced [");
+ FETCH(c1, "Unbalanced [");
+ }
+ }
+ else
+ c2 = c;
while (c <= c2)
{
- setbit(c, cset);
+ setbit(c, ccl);
if (case_fold)
if (ISUPPER(c))
- setbit(tolower(c), cset);
+ setbit(tolower(c), ccl);
else if (ISLOWER(c))
- setbit(toupper(c), cset);
+ setbit(toupper(c), ccl);
++c;
}
- FETCH(c, "Unbalanced [");
+ skip:
+ ;
}
- else
+ while ((c = c1) != ']');
+ if (invert)
{
- setbit(c, cset);
- if (case_fold)
- if (ISUPPER(c))
- setbit(tolower(c), cset);
- else if (ISLOWER(c))
- setbit(toupper(c), cset);
- c = c2;
+ notset(ccl);
+ if (syntax_bits & RE_HAT_LISTS_NOT_NEWLINE)
+ clrbit('\n', ccl);
}
- }
- while (c != ']');
- if (invert)
- notset(cset);
- caret_allowed = 0;
- closure_allowed = 1;
- return _SET + charset_index(cset);
+ laststart = 0;
+ return lasttok = CSET + charclass_index(ccl);
- default:
- normal_char:
- caret_allowed = 0;
- closure_allowed = 1;
- if (case_fold && ISALPHA(c))
- {
- zeroset(cset);
- if (isupper(c))
- c = tolower(c);
- setbit(c, cset);
- setbit(toupper(c), cset);
- return _SET + charset_index(cset);
+ default:
+ normal_char:
+ laststart = 0;
+ if (case_fold && ISALPHA(c))
+ {
+ zeroset(ccl);
+ setbit(c, ccl);
+ if (isupper(c))
+ setbit(tolower(c), ccl);
+ else
+ setbit(toupper(c), ccl);
+ return lasttok = CSET + charclass_index(ccl);
+ }
+ return c;
}
- return c;
}
+
+ /* The above loop should consume at most a backslash
+ and some other character. */
+ abort();
}
-
+
/* Recursive descent parser for regular expressions. */
-static _token tok; /* Lookahead token. */
+static token tok; /* Lookahead token. */
static depth; /* Current depth of a hypothetical stack
holding deferred productions. This is
used to determine the depth that will be
required of the real stack later on in
- reganalyze(). */
+ dfaanalyze(). */
/* Add the given token to the parse tree, maintaining the depth count and
updating the maximum depth if necessary. */
static void
addtok(t)
- _token t;
+ token t;
{
- REALLOC_IF_NECESSARY(reg->tokens, _token, reg->talloc, reg->tindex);
- reg->tokens[reg->tindex++] = t;
+ REALLOC_IF_NECESSARY(dfa->tokens, token, dfa->talloc, dfa->tindex);
+ dfa->tokens[dfa->tindex++] = t;
switch (t)
{
- case _QMARK:
- case _STAR:
- case _PLUS:
+ case QMARK:
+ case STAR:
+ case PLUS:
break;
- case _CAT:
- case _OR:
+ case CAT:
+ case OR:
+ case ORTOP:
--depth;
break;
default:
- ++reg->nleaves;
- case _EMPTY:
+ ++dfa->nleaves;
+ case EMPTY:
++depth;
break;
}
- if (depth > reg->depth)
- reg->depth = depth;
+ if (depth > dfa->depth)
+ dfa->depth = depth;
}
/* The grammar understood by the parser is as follows.
- start:
- regexp
- _ALLBEGLINE regexp
- regexp _ALLENDLINE
- _ALLBEGLINE regexp _ALLENDLINE
-
regexp:
- regexp _OR branch
+ regexp OR branch
branch
branch:
@@ -542,27 +717,27 @@ addtok(t)
closure
closure:
- closure _QMARK
- closure _STAR
- closure _PLUS
+ closure QMARK
+ closure STAR
+ closure PLUS
atom
atom:
<normal character>
- _SET
- _BACKREF
- _BEGLINE
- _ENDLINE
- _BEGWORD
- _ENDWORD
- _LIMWORD
- _NOTLIMWORD
+ CSET
+ BACKREF
+ BEGLINE
+ ENDLINE
+ BEGWORD
+ ENDWORD
+ LIMWORD
+ NOTLIMWORD
<empty>
The parser builds a parse tree in postfix form in an array of tokens. */
#if __STDC__
-static void regexp(void);
+static void regexp(int);
#else
static void regexp();
#endif
@@ -570,116 +745,164 @@ static void regexp();
static void
atom()
{
- if (tok >= 0 && tok < _NOTCHAR || tok >= _SET || tok == _BACKREF
- || tok == _BEGLINE || tok == _ENDLINE || tok == _BEGWORD
- || tok == _ENDWORD || tok == _LIMWORD || tok == _NOTLIMWORD)
+ if ((tok >= 0 && tok < NOTCHAR) || tok >= CSET || tok == BACKREF
+ || tok == BEGLINE || tok == ENDLINE || tok == BEGWORD
+ || tok == ENDWORD || tok == LIMWORD || tok == NOTLIMWORD)
{
addtok(tok);
tok = lex();
}
- else if (tok == _LPAREN)
+ else if (tok == LPAREN)
{
tok = lex();
- regexp();
- if (tok != _RPAREN)
- regerror("Unbalanced (");
+ regexp(0);
+ if (tok != RPAREN)
+ dfaerror("Unbalanced (");
tok = lex();
}
else
- addtok(_EMPTY);
+ addtok(EMPTY);
+}
+
+/* Return the number of tokens in the given subexpression. */
+static int
+nsubtoks(tindex)
+{
+ int ntoks1;
+
+ switch (dfa->tokens[tindex - 1])
+ {
+ default:
+ return 1;
+ case QMARK:
+ case STAR:
+ case PLUS:
+ return 1 + nsubtoks(tindex - 1);
+ case CAT:
+ case OR:
+ case ORTOP:
+ ntoks1 = nsubtoks(tindex - 1);
+ return 1 + ntoks1 + nsubtoks(tindex - 1 - ntoks1);
+ }
+}
+
+/* Copy the given subexpression to the top of the tree. */
+static void
+copytoks(tindex, ntokens)
+ int tindex, ntokens;
+{
+ int i;
+
+ for (i = 0; i < ntokens; ++i)
+ addtok(dfa->tokens[tindex + i]);
}
static void
closure()
{
+ int tindex, ntokens, i;
+
atom();
- while (tok == _QMARK || tok == _STAR || tok == _PLUS)
- {
- addtok(tok);
- tok = lex();
- }
+ while (tok == QMARK || tok == STAR || tok == PLUS || tok == REPMN)
+ if (tok == REPMN)
+ {
+ ntokens = nsubtoks(dfa->tindex);
+ tindex = dfa->tindex - ntokens;
+ if (maxrep == 0)
+ addtok(PLUS);
+ if (minrep == 0)
+ addtok(QMARK);
+ for (i = 1; i < minrep; ++i)
+ {
+ copytoks(tindex, ntokens);
+ addtok(CAT);
+ }
+ for (; i < maxrep; ++i)
+ {
+ copytoks(tindex, ntokens);
+ addtok(QMARK);
+ addtok(CAT);
+ }
+ tok = lex();
+ }
+ else
+ {
+ addtok(tok);
+ tok = lex();
+ }
}
static void
branch()
{
closure();
- while (tok != _RPAREN && tok != _OR && tok != _ALLENDLINE && tok >= 0)
+ while (tok != RPAREN && tok != OR && tok >= 0)
{
closure();
- addtok(_CAT);
+ addtok(CAT);
}
}
static void
-regexp()
+regexp(toplevel)
+ int toplevel;
{
branch();
- while (tok == _OR)
+ while (tok == OR)
{
tok = lex();
branch();
- addtok(_OR);
+ if (toplevel)
+ addtok(ORTOP);
+ else
+ addtok(OR);
}
}
/* Main entry point for the parser. S is a string to be parsed, len is the
- length of the string, so s can include NUL characters. R is a pointer to
- the struct regexp to parse into. */
+ length of the string, so s can include NUL characters. D is a pointer to
+ the struct dfa to parse into. */
void
-regparse(s, len, r)
- const char *s;
+dfaparse(s, len, d)
+ char *s;
size_t len;
- struct regexp *r;
+ struct dfa *d;
+
{
- reg = r;
+ dfa = d;
lexstart = lexptr = s;
lexleft = len;
- caret_allowed = 1;
- closure_allowed = 0;
+ lasttok = END;
+ laststart = 1;
+ parens = 0;
if (! syntax_bits_set)
- regerror("No syntax specified");
+ dfaerror("No syntax specified");
tok = lex();
- depth = r->depth;
-
- if (tok == _ALLBEGLINE)
- {
- addtok(_BEGLINE);
- tok = lex();
- regexp();
- addtok(_CAT);
- }
- else
- regexp();
+ depth = d->depth;
- if (tok == _ALLENDLINE)
- {
- addtok(_ENDLINE);
- addtok(_CAT);
- tok = lex();
- }
+ regexp(1);
- if (tok != _END)
- regerror("Unbalanced )");
+ if (tok != END)
+ dfaerror("Unbalanced )");
- addtok(_END - r->nregexps);
- addtok(_CAT);
+ addtok(END - d->nregexps);
+ addtok(CAT);
- if (r->nregexps)
- addtok(_OR);
+ if (d->nregexps)
+ addtok(ORTOP);
- ++r->nregexps;
+ ++d->nregexps;
}
-
+
/* Some primitives for operating on sets of positions. */
/* Copy one set to another; the destination must be large enough. */
static void
copy(src, dst)
- const _position_set *src;
- _position_set *dst;
+ position_set *src;
+ position_set *dst;
{
int i;
@@ -694,11 +917,11 @@ copy(src, dst)
S->elems must point to an array large enough to hold the resulting set. */
static void
insert(p, s)
- _position p;
- _position_set *s;
+ position p;
+ position_set *s;
{
int i;
- _position t1, t2;
+ position t1, t2;
for (i = 0; i < s->nelem && p.index < s->elems[i].index; ++i)
;
@@ -721,9 +944,9 @@ insert(p, s)
the positions of both sets were inserted into an initially empty set. */
static void
merge(s1, s2, m)
- _position_set *s1;
- _position_set *s2;
- _position_set *m;
+ position_set *s1;
+ position_set *s2;
+ position_set *m;
{
int i = 0, j = 0;
@@ -747,8 +970,8 @@ merge(s1, s2, m)
/* Delete a position from a set. */
static void
delete(p, s)
- _position p;
- _position_set *s;
+ position p;
+ position_set *s;
{
int i;
@@ -759,15 +982,15 @@ delete(p, s)
for (--s->nelem; i < s->nelem; ++i)
s->elems[i] = s->elems[i + 1];
}
-
+
/* Find the index of the state corresponding to the given position set with
the given preceding context, or create a new state if there is no such
state. Newline and letter tell whether we got here on a newline or
letter, respectively. */
static int
-state_index(r, s, newline, letter)
- struct regexp *r;
- _position_set *s;
+state_index(d, s, newline, letter)
+ struct dfa *d;
+ position_set *s;
int newline;
int letter;
{
@@ -782,75 +1005,75 @@ state_index(r, s, newline, letter)
hash ^= s->elems[i].index + s->elems[i].constraint;
/* Try to find a state that exactly matches the proposed one. */
- for (i = 0; i < r->sindex; ++i)
+ for (i = 0; i < d->sindex; ++i)
{
- if (hash != r->states[i].hash || s->nelem != r->states[i].elems.nelem
- || newline != r->states[i].newline || letter != r->states[i].letter)
+ if (hash != d->states[i].hash || s->nelem != d->states[i].elems.nelem
+ || newline != d->states[i].newline || letter != d->states[i].letter)
continue;
for (j = 0; j < s->nelem; ++j)
if (s->elems[j].constraint
- != r->states[i].elems.elems[j].constraint
- || s->elems[j].index != r->states[i].elems.elems[j].index)
+ != d->states[i].elems.elems[j].constraint
+ || s->elems[j].index != d->states[i].elems.elems[j].index)
break;
if (j == s->nelem)
return i;
}
/* We'll have to create a new state. */
- REALLOC_IF_NECESSARY(r->states, _dfa_state, r->salloc, r->sindex);
- r->states[i].hash = hash;
- MALLOC(r->states[i].elems.elems, _position, s->nelem);
- copy(s, &r->states[i].elems);
- r->states[i].newline = newline;
- r->states[i].letter = letter;
- r->states[i].backref = 0;
- r->states[i].constraint = 0;
- r->states[i].first_end = 0;
+ REALLOC_IF_NECESSARY(d->states, dfa_state, d->salloc, d->sindex);
+ d->states[i].hash = hash;
+ MALLOC(d->states[i].elems.elems, position, s->nelem);
+ copy(s, &d->states[i].elems);
+ d->states[i].newline = newline;
+ d->states[i].letter = letter;
+ d->states[i].backref = 0;
+ d->states[i].constraint = 0;
+ d->states[i].first_end = 0;
for (j = 0; j < s->nelem; ++j)
- if (r->tokens[s->elems[j].index] < 0)
+ if (d->tokens[s->elems[j].index] < 0)
{
constraint = s->elems[j].constraint;
- if (_SUCCEEDS_IN_CONTEXT(constraint, newline, 0, letter, 0)
- || _SUCCEEDS_IN_CONTEXT(constraint, newline, 0, letter, 1)
- || _SUCCEEDS_IN_CONTEXT(constraint, newline, 1, letter, 0)
- || _SUCCEEDS_IN_CONTEXT(constraint, newline, 1, letter, 1))
- r->states[i].constraint |= constraint;
- if (! r->states[i].first_end)
- r->states[i].first_end = r->tokens[s->elems[j].index];
+ if (SUCCEEDS_IN_CONTEXT(constraint, newline, 0, letter, 0)
+ || SUCCEEDS_IN_CONTEXT(constraint, newline, 0, letter, 1)
+ || SUCCEEDS_IN_CONTEXT(constraint, newline, 1, letter, 0)
+ || SUCCEEDS_IN_CONTEXT(constraint, newline, 1, letter, 1))
+ d->states[i].constraint |= constraint;
+ if (! d->states[i].first_end)
+ d->states[i].first_end = d->tokens[s->elems[j].index];
}
- else if (r->tokens[s->elems[j].index] == _BACKREF)
+ else if (d->tokens[s->elems[j].index] == BACKREF)
{
- r->states[i].constraint = _NO_CONSTRAINT;
- r->states[i].backref = 1;
+ d->states[i].constraint = NO_CONSTRAINT;
+ d->states[i].backref = 1;
}
- ++r->sindex;
+ ++d->sindex;
return i;
}
-
+
/* Find the epsilon closure of a set of positions. If any position of the set
contains a symbol that matches the empty string in some context, replace
that position with the elements of its follow labeled with an appropriate
constraint. Repeat exhaustively until no funny positions are left.
S->elems must be large enough to hold the result. */
void
-epsclosure(s, r)
- _position_set *s;
- struct regexp *r;
+epsclosure(s, d)
+ position_set *s;
+ struct dfa *d;
{
int i, j;
int *visited;
- _position p, old;
+ position p, old;
- MALLOC(visited, int, r->tindex);
- for (i = 0; i < r->tindex; ++i)
+ MALLOC(visited, int, d->tindex);
+ for (i = 0; i < d->tindex; ++i)
visited[i] = 0;
for (i = 0; i < s->nelem; ++i)
- if (r->tokens[s->elems[i].index] >= _NOTCHAR
- && r->tokens[s->elems[i].index] != _BACKREF
- && r->tokens[s->elems[i].index] < _SET)
+ if (d->tokens[s->elems[i].index] >= NOTCHAR
+ && d->tokens[s->elems[i].index] != BACKREF
+ && d->tokens[s->elems[i].index] < CSET)
{
old = s->elems[i];
p.constraint = old.constraint;
@@ -861,32 +1084,32 @@ epsclosure(s, r)
continue;
}
visited[old.index] = 1;
- switch (r->tokens[old.index])
+ switch (d->tokens[old.index])
{
- case _BEGLINE:
- p.constraint &= _BEGLINE_CONSTRAINT;
+ case BEGLINE:
+ p.constraint &= BEGLINE_CONSTRAINT;
break;
- case _ENDLINE:
- p.constraint &= _ENDLINE_CONSTRAINT;
+ case ENDLINE:
+ p.constraint &= ENDLINE_CONSTRAINT;
break;
- case _BEGWORD:
- p.constraint &= _BEGWORD_CONSTRAINT;
+ case BEGWORD:
+ p.constraint &= BEGWORD_CONSTRAINT;
break;
- case _ENDWORD:
- p.constraint &= _ENDWORD_CONSTRAINT;
+ case ENDWORD:
+ p.constraint &= ENDWORD_CONSTRAINT;
break;
- case _LIMWORD:
- p.constraint &= _LIMWORD_CONSTRAINT;
+ case LIMWORD:
+ p.constraint &= LIMWORD_CONSTRAINT;
break;
- case _NOTLIMWORD:
- p.constraint &= _NOTLIMWORD_CONSTRAINT;
+ case NOTLIMWORD:
+ p.constraint &= NOTLIMWORD_CONSTRAINT;
break;
default:
break;
}
- for (j = 0; j < r->follows[old.index].nelem; ++j)
+ for (j = 0; j < d->follows[old.index].nelem; ++j)
{
- p.index = r->follows[old.index].elems[j].index;
+ p.index = d->follows[old.index].elems[j].index;
insert(p, s);
}
/* Force rescan to start at the beginning. */
@@ -895,41 +1118,41 @@ epsclosure(s, r)
free(visited);
}
-
+
/* Perform bottom-up analysis on the parse tree, computing various functions.
Note that at this point, we're pretending constructs like \< are real
characters rather than constraints on what can follow them.
Nullable: A node is nullable if it is at the root of a regexp that can
match the empty string.
- * _EMPTY leaves are nullable.
+ * EMPTY leaves are nullable.
* No other leaf is nullable.
- * A _QMARK or _STAR node is nullable.
- * A _PLUS node is nullable if its argument is nullable.
- * A _CAT node is nullable if both its arguments are nullable.
- * An _OR node is nullable if either argument is nullable.
+ * A QMARK or STAR node is nullable.
+ * A PLUS node is nullable if its argument is nullable.
+ * A CAT node is nullable if both its arguments are nullable.
+ * An OR node is nullable if either argument is nullable.
Firstpos: The firstpos of a node is the set of positions (nonempty leaves)
that could correspond to the first character of a string matching the
regexp rooted at the given node.
- * _EMPTY leaves have empty firstpos.
+ * EMPTY leaves have empty firstpos.
* The firstpos of a nonempty leaf is that leaf itself.
- * The firstpos of a _QMARK, _STAR, or _PLUS node is the firstpos of its
+ * The firstpos of a QMARK, STAR, or PLUS node is the firstpos of its
argument.
- * The firstpos of a _CAT node is the firstpos of the left argument, union
+ * The firstpos of a CAT node is the firstpos of the left argument, union
the firstpos of the right if the left argument is nullable.
- * The firstpos of an _OR node is the union of firstpos of each argument.
+ * The firstpos of an OR node is the union of firstpos of each argument.
Lastpos: The lastpos of a node is the set of positions that could
correspond to the last character of a string matching the regexp at
the given node.
- * _EMPTY leaves have empty lastpos.
+ * EMPTY leaves have empty lastpos.
* The lastpos of a nonempty leaf is that leaf itself.
- * The lastpos of a _QMARK, _STAR, or _PLUS node is the lastpos of its
+ * The lastpos of a QMARK, STAR, or PLUS node is the lastpos of its
argument.
- * The lastpos of a _CAT node is the lastpos of its right argument, union
+ * The lastpos of a CAT node is the lastpos of its right argument, union
the lastpos of the left if the right argument is nullable.
- * The lastpos of an _OR node is the union of the lastpos of each argument.
+ * The lastpos of an OR node is the union of the lastpos of each argument.
Follow: The follow of a position is the set of positions that could
correspond to the character following a character matching the node in
@@ -938,9 +1161,9 @@ epsclosure(s, r)
Later, if we find that a special symbol is in a follow set, we will
replace it with the elements of its follow, labeled with an appropriate
constraint.
- * Every node in the firstpos of the argument of a _STAR or _PLUS node is in
+ * Every node in the firstpos of the argument of a STAR or PLUS node is in
the follow of every node in the lastpos.
- * Every node in the firstpos of the second argument of a _CAT node is in
+ * Every node in the firstpos of the second argument of a CAT node is in
the follow of every node in the lastpos of the first argument.
Because of the postfix representation of the parse tree, the depth-first
@@ -949,61 +1172,61 @@ epsclosure(s, r)
scheme; the number of elements in each set deeper in the stack can be
used to determine the address of a particular set's array. */
void
-reganalyze(r, searchflag)
- struct regexp *r;
+dfaanalyze(d, searchflag)
+ struct dfa *d;
int searchflag;
{
int *nullable; /* Nullable stack. */
int *nfirstpos; /* Element count stack for firstpos sets. */
- _position *firstpos; /* Array where firstpos elements are stored. */
+ position *firstpos; /* Array where firstpos elements are stored. */
int *nlastpos; /* Element count stack for lastpos sets. */
- _position *lastpos; /* Array where lastpos elements are stored. */
+ position *lastpos; /* Array where lastpos elements are stored. */
int *nalloc; /* Sizes of arrays allocated to follow sets. */
- _position_set tmp; /* Temporary set for merging sets. */
- _position_set merged; /* Result of merging sets. */
+ position_set tmp; /* Temporary set for merging sets. */
+ position_set merged; /* Result of merging sets. */
int wants_newline; /* True if some position wants newline info. */
int *o_nullable;
int *o_nfirst, *o_nlast;
- _position *o_firstpos, *o_lastpos;
+ position *o_firstpos, *o_lastpos;
int i, j;
- _position *pos;
+ position *pos;
#ifdef DEBUG
- fprintf(stderr, "reganalyze:\n");
- for (i = 0; i < r->tindex; ++i)
+ fprintf(stderr, "dfaanalyze:\n");
+ for (i = 0; i < d->tindex; ++i)
{
fprintf(stderr, " %d:", i);
- prtok(r->tokens[i]);
+ prtok(d->tokens[i]);
}
putc('\n', stderr);
#endif
- r->searchflag = searchflag;
+ d->searchflag = searchflag;
- MALLOC(nullable, int, r->depth);
+ MALLOC(nullable, int, d->depth);
o_nullable = nullable;
- MALLOC(nfirstpos, int, r->depth);
+ MALLOC(nfirstpos, int, d->depth);
o_nfirst = nfirstpos;
- MALLOC(firstpos, _position, r->nleaves);
- o_firstpos = firstpos, firstpos += r->nleaves;
- MALLOC(nlastpos, int, r->depth);
+ MALLOC(firstpos, position, d->nleaves);
+ o_firstpos = firstpos, firstpos += d->nleaves;
+ MALLOC(nlastpos, int, d->depth);
o_nlast = nlastpos;
- MALLOC(lastpos, _position, r->nleaves);
- o_lastpos = lastpos, lastpos += r->nleaves;
- MALLOC(nalloc, int, r->tindex);
- for (i = 0; i < r->tindex; ++i)
+ MALLOC(lastpos, position, d->nleaves);
+ o_lastpos = lastpos, lastpos += d->nleaves;
+ MALLOC(nalloc, int, d->tindex);
+ for (i = 0; i < d->tindex; ++i)
nalloc[i] = 0;
- MALLOC(merged.elems, _position, r->nleaves);
+ MALLOC(merged.elems, position, d->nleaves);
- CALLOC(r->follows, _position_set, r->tindex);
+ CALLOC(d->follows, position_set, d->tindex);
- for (i = 0; i < r->tindex; ++i)
+ for (i = 0; i < d->tindex; ++i)
#ifdef DEBUG
{ /* Nonsyntactic #ifdef goo... */
#endif
- switch (r->tokens[i])
+ switch (d->tokens[i])
{
- case _EMPTY:
+ case EMPTY:
/* The empty set is nullable. */
*nullable++ = 1;
@@ -1011,8 +1234,8 @@ reganalyze(r, searchflag)
*nfirstpos++ = *nlastpos++ = 0;
break;
- case _STAR:
- case _PLUS:
+ case STAR:
+ case PLUS:
/* Every element in the firstpos of the argument is in the follow
of every element in the lastpos. */
tmp.nelem = nfirstpos[-1];
@@ -1020,19 +1243,19 @@ reganalyze(r, searchflag)
pos = lastpos;
for (j = 0; j < nlastpos[-1]; ++j)
{
- merge(&tmp, &r->follows[pos[j].index], &merged);
- REALLOC_IF_NECESSARY(r->follows[pos[j].index].elems, _position,
+ merge(&tmp, &d->follows[pos[j].index], &merged);
+ REALLOC_IF_NECESSARY(d->follows[pos[j].index].elems, position,
nalloc[pos[j].index], merged.nelem - 1);
- copy(&merged, &r->follows[pos[j].index]);
+ copy(&merged, &d->follows[pos[j].index]);
}
- case _QMARK:
- /* A _QMARK or _STAR node is automatically nullable. */
- if (r->tokens[i] != _PLUS)
+ case QMARK:
+ /* A QMARK or STAR node is automatically nullable. */
+ if (d->tokens[i] != PLUS)
nullable[-1] = 1;
break;
- case _CAT:
+ case CAT:
/* Every element in the firstpos of the second argument is in the
follow of every element in the lastpos of the first argument. */
tmp.nelem = nfirstpos[-1];
@@ -1040,13 +1263,13 @@ reganalyze(r, searchflag)
pos = lastpos + nlastpos[-1];
for (j = 0; j < nlastpos[-2]; ++j)
{
- merge(&tmp, &r->follows[pos[j].index], &merged);
- REALLOC_IF_NECESSARY(r->follows[pos[j].index].elems, _position,
+ merge(&tmp, &d->follows[pos[j].index], &merged);
+ REALLOC_IF_NECESSARY(d->follows[pos[j].index].elems, position,
nalloc[pos[j].index], merged.nelem - 1);
- copy(&merged, &r->follows[pos[j].index]);
+ copy(&merged, &d->follows[pos[j].index]);
}
- /* The firstpos of a _CAT node is the firstpos of the first argument,
+ /* The firstpos of a CAT node is the firstpos of the first argument,
union that of the second argument if the first is nullable. */
if (nullable[-2])
nfirstpos[-2] += nfirstpos[-1];
@@ -1054,7 +1277,7 @@ reganalyze(r, searchflag)
firstpos += nfirstpos[-1];
--nfirstpos;
- /* The lastpos of a _CAT node is the lastpos of the second argument,
+ /* The lastpos of a CAT node is the lastpos of the second argument,
union that of the first argument if the second is nullable. */
if (nullable[-1])
nlastpos[-2] += nlastpos[-1];
@@ -1068,12 +1291,13 @@ reganalyze(r, searchflag)
}
--nlastpos;
- /* A _CAT node is nullable if both arguments are nullable. */
+ /* A CAT node is nullable if both arguments are nullable. */
nullable[-2] = nullable[-1] && nullable[-2];
--nullable;
break;
- case _OR:
+ case OR:
+ case ORTOP:
/* The firstpos is the union of the firstpos of each argument. */
nfirstpos[-2] += nfirstpos[-1];
--nfirstpos;
@@ -1082,7 +1306,7 @@ reganalyze(r, searchflag)
nlastpos[-2] += nlastpos[-1];
--nlastpos;
- /* An _OR node is nullable if either argument is nullable. */
+ /* An OR node is nullable if either argument is nullable. */
nullable[-2] = nullable[-1] || nullable[-2];
--nullable;
break;
@@ -1093,36 +1317,36 @@ reganalyze(r, searchflag)
an "epsilon closure" effectively makes them nullable later.
Backreferences have to get a real position so we can detect
transitions on them later. But they are nullable. */
- *nullable++ = r->tokens[i] == _BACKREF;
+ *nullable++ = d->tokens[i] == BACKREF;
/* This position is in its own firstpos and lastpos. */
*nfirstpos++ = *nlastpos++ = 1;
--firstpos, --lastpos;
firstpos->index = lastpos->index = i;
- firstpos->constraint = lastpos->constraint = _NO_CONSTRAINT;
+ firstpos->constraint = lastpos->constraint = NO_CONSTRAINT;
/* Allocate the follow set for this position. */
nalloc[i] = 1;
- MALLOC(r->follows[i].elems, _position, nalloc[i]);
+ MALLOC(d->follows[i].elems, position, nalloc[i]);
break;
}
#ifdef DEBUG
/* ... balance the above nonsyntactic #ifdef goo... */
fprintf(stderr, "node %d:", i);
- prtok(r->tokens[i]);
+ prtok(d->tokens[i]);
putc('\n', stderr);
fprintf(stderr, nullable[-1] ? " nullable: yes\n" : " nullable: no\n");
fprintf(stderr, " firstpos:");
for (j = nfirstpos[-1] - 1; j >= 0; --j)
{
fprintf(stderr, " %d:", firstpos[j].index);
- prtok(r->tokens[firstpos[j].index]);
+ prtok(d->tokens[firstpos[j].index]);
}
fprintf(stderr, "\n lastpos:");
for (j = nlastpos[-1] - 1; j >= 0; --j)
{
fprintf(stderr, " %d:", lastpos[j].index);
- prtok(r->tokens[lastpos[j].index]);
+ prtok(d->tokens[lastpos[j].index]);
}
putc('\n', stderr);
}
@@ -1130,26 +1354,26 @@ reganalyze(r, searchflag)
/* For each follow set that is the follow set of a real position, replace
it with its epsilon closure. */
- for (i = 0; i < r->tindex; ++i)
- if (r->tokens[i] < _NOTCHAR || r->tokens[i] == _BACKREF
- || r->tokens[i] >= _SET)
+ for (i = 0; i < d->tindex; ++i)
+ if (d->tokens[i] < NOTCHAR || d->tokens[i] == BACKREF
+ || d->tokens[i] >= CSET)
{
#ifdef DEBUG
fprintf(stderr, "follows(%d:", i);
- prtok(r->tokens[i]);
+ prtok(d->tokens[i]);
fprintf(stderr, "):");
- for (j = r->follows[i].nelem - 1; j >= 0; --j)
+ for (j = d->follows[i].nelem - 1; j >= 0; --j)
{
- fprintf(stderr, " %d:", r->follows[i].elems[j].index);
- prtok(r->tokens[r->follows[i].elems[j].index]);
+ fprintf(stderr, " %d:", d->follows[i].elems[j].index);
+ prtok(d->tokens[d->follows[i].elems[j].index]);
}
putc('\n', stderr);
#endif
- copy(&r->follows[i], &merged);
- epsclosure(&merged, r);
- if (r->follows[i].nelem < merged.nelem)
- REALLOC(r->follows[i].elems, _position, merged.nelem);
- copy(&merged, &r->follows[i]);
+ copy(&d->follows[i], &merged);
+ epsclosure(&merged, d);
+ if (d->follows[i].nelem < merged.nelem)
+ REALLOC(d->follows[i].elems, position, merged.nelem);
+ copy(&merged, &d->follows[i]);
}
/* Get the epsilon closure of the firstpos of the regexp. The result will
@@ -1157,19 +1381,19 @@ reganalyze(r, searchflag)
merged.nelem = 0;
for (i = 0; i < nfirstpos[-1]; ++i)
insert(firstpos[i], &merged);
- epsclosure(&merged, r);
+ epsclosure(&merged, d);
/* Check if any of the positions of state 0 will want newline context. */
wants_newline = 0;
for (i = 0; i < merged.nelem; ++i)
- if (_PREV_NEWLINE_DEPENDENT(merged.elems[i].constraint))
+ if (PREV_NEWLINE_DEPENDENT(merged.elems[i].constraint))
wants_newline = 1;
/* Build the initial state. */
- r->salloc = 1;
- r->sindex = 0;
- MALLOC(r->states, _dfa_state, r->salloc);
- state_index(r, &merged, wants_newline, 0);
+ d->salloc = 1;
+ d->sindex = 0;
+ MALLOC(d->states, dfa_state, d->salloc);
+ state_index(d, &merged, wants_newline, 0);
free(o_nullable);
free(o_nfirst);
@@ -1179,8 +1403,8 @@ reganalyze(r, searchflag)
free(nalloc);
free(merged.elems);
}
-
-/* Find, for each character, the transition out of state s of r, and store
+
+/* Find, for each character, the transition out of state s of d, and store
it in the appropriate slot of trans.
We divide the positions of s into groups (positions can appear in more
@@ -1211,25 +1435,25 @@ reganalyze(r, searchflag)
create a new group labeled with the characters of C and insert this
position in that group. */
void
-regstate(s, r, trans)
+dfastate(s, d, trans)
int s;
- struct regexp *r;
+ struct dfa *d;
int trans[];
{
- _position_set grps[_NOTCHAR]; /* As many as will ever be needed. */
- _charset labels[_NOTCHAR]; /* Labels corresponding to the groups. */
+ position_set grps[NOTCHAR]; /* As many as will ever be needed. */
+ charclass labels[NOTCHAR]; /* Labels corresponding to the groups. */
int ngrps = 0; /* Number of groups actually used. */
- _position pos; /* Current position being considered. */
- _charset matches; /* Set of matching characters. */
+ position pos; /* Current position being considered. */
+ charclass matches; /* Set of matching characters. */
int matchesf; /* True if matches is nonempty. */
- _charset intersect; /* Intersection with some label set. */
+ charclass intersect; /* Intersection with some label set. */
int intersectf; /* True if intersect is nonempty. */
- _charset leftovers; /* Stuff in the label that didn't match. */
+ charclass leftovers; /* Stuff in the label that didn't match. */
int leftoversf; /* True if leftovers is nonempty. */
- static _charset letters; /* Set of characters considered letters. */
- static _charset newline; /* Set of characters that aren't newline. */
- _position_set follows; /* Union of the follows of some group. */
- _position_set tmp; /* Temporary space for merging sets. */
+ static charclass letters; /* Set of characters considered letters. */
+ static charclass newline; /* Set of characters that aren't newline. */
+ position_set follows; /* Union of the follows of some group. */
+ position_set tmp; /* Temporary space for merging sets. */
int state; /* New state. */
int wants_newline; /* New state wants to know newline context. */
int state_newline; /* New state on a newline transition. */
@@ -1242,7 +1466,7 @@ regstate(s, r, trans)
if (! initialized)
{
initialized = 1;
- for (i = 0; i < _NOTCHAR; ++i)
+ for (i = 0; i < NOTCHAR; ++i)
if (ISALNUM(i))
setbit(i, letters);
setbit('\n', newline);
@@ -1250,13 +1474,13 @@ regstate(s, r, trans)
zeroset(matches);
- for (i = 0; i < r->states[s].elems.nelem; ++i)
+ for (i = 0; i < d->states[s].elems.nelem; ++i)
{
- pos = r->states[s].elems.elems[i];
- if (r->tokens[pos.index] >= 0 && r->tokens[pos.index] < _NOTCHAR)
- setbit(r->tokens[pos.index], matches);
- else if (r->tokens[pos.index] >= _SET)
- copyset(r->charsets[r->tokens[pos.index] - _SET], matches);
+ pos = d->states[s].elems.elems[i];
+ if (d->tokens[pos.index] >= 0 && d->tokens[pos.index] < NOTCHAR)
+ setbit(d->tokens[pos.index], matches);
+ else if (d->tokens[pos.index] >= CSET)
+ copyset(d->charclasses[d->tokens[pos.index] - CSET], matches);
else
continue;
@@ -1264,26 +1488,26 @@ regstate(s, r, trans)
they fail in the current context. */
if (pos.constraint != 0xFF)
{
- if (! _MATCHES_NEWLINE_CONTEXT(pos.constraint,
- r->states[s].newline, 1))
+ if (! MATCHES_NEWLINE_CONTEXT(pos.constraint,
+ d->states[s].newline, 1))
clrbit('\n', matches);
- if (! _MATCHES_NEWLINE_CONTEXT(pos.constraint,
- r->states[s].newline, 0))
- for (j = 0; j < _CHARSET_INTS; ++j)
+ if (! MATCHES_NEWLINE_CONTEXT(pos.constraint,
+ d->states[s].newline, 0))
+ for (j = 0; j < CHARCLASS_INTS; ++j)
matches[j] &= newline[j];
- if (! _MATCHES_LETTER_CONTEXT(pos.constraint,
- r->states[s].letter, 1))
- for (j = 0; j < _CHARSET_INTS; ++j)
+ if (! MATCHES_LETTER_CONTEXT(pos.constraint,
+ d->states[s].letter, 1))
+ for (j = 0; j < CHARCLASS_INTS; ++j)
matches[j] &= ~letters[j];
- if (! _MATCHES_LETTER_CONTEXT(pos.constraint,
- r->states[s].letter, 0))
- for (j = 0; j < _CHARSET_INTS; ++j)
+ if (! MATCHES_LETTER_CONTEXT(pos.constraint,
+ d->states[s].letter, 0))
+ for (j = 0; j < CHARCLASS_INTS; ++j)
matches[j] &= letters[j];
/* If there are no characters left, there's no point in going on. */
- for (j = 0; j < _CHARSET_INTS && !matches[j]; ++j)
+ for (j = 0; j < CHARCLASS_INTS && !matches[j]; ++j)
;
- if (j == _CHARSET_INTS)
+ if (j == CHARCLASS_INTS)
continue;
}
@@ -1292,21 +1516,21 @@ regstate(s, r, trans)
/* If matches contains a single character only, and the current
group's label doesn't contain that character, go on to the
next group. */
- if (r->tokens[pos.index] >= 0 && r->tokens[pos.index] < _NOTCHAR
- && !tstbit(r->tokens[pos.index], labels[j]))
+ if (d->tokens[pos.index] >= 0 && d->tokens[pos.index] < NOTCHAR
+ && !tstbit(d->tokens[pos.index], labels[j]))
continue;
/* Check if this group's label has a nonempty intersection with
matches. */
intersectf = 0;
- for (k = 0; k < _CHARSET_INTS; ++k)
+ for (k = 0; k < CHARCLASS_INTS; ++k)
(intersect[k] = matches[k] & labels[j][k]) ? intersectf = 1 : 0;
if (! intersectf)
continue;
/* It does; now find the set differences both ways. */
leftoversf = matchesf = 0;
- for (k = 0; k < _CHARSET_INTS; ++k)
+ for (k = 0; k < CHARCLASS_INTS; ++k)
{
/* Even an optimizing compiler can't know this for sure. */
int match = matches[k], label = labels[j][k];
@@ -1320,7 +1544,7 @@ regstate(s, r, trans)
{
copyset(leftovers, labels[ngrps]);
copyset(intersect, labels[j]);
- MALLOC(grps[ngrps].elems, _position, r->nleaves);
+ MALLOC(grps[ngrps].elems, position, d->nleaves);
copy(&grps[j], &grps[ngrps]);
++ngrps;
}
@@ -1341,41 +1565,41 @@ regstate(s, r, trans)
{
copyset(matches, labels[ngrps]);
zeroset(matches);
- MALLOC(grps[ngrps].elems, _position, r->nleaves);
+ MALLOC(grps[ngrps].elems, position, d->nleaves);
grps[ngrps].nelem = 1;
grps[ngrps].elems[0] = pos;
++ngrps;
}
}
- MALLOC(follows.elems, _position, r->nleaves);
- MALLOC(tmp.elems, _position, r->nleaves);
+ MALLOC(follows.elems, position, d->nleaves);
+ MALLOC(tmp.elems, position, d->nleaves);
/* If we are a searching matcher, the default transition is to a state
containing the positions of state 0, otherwise the default transition
is to fail miserably. */
- if (r->searchflag)
+ if (d->searchflag)
{
wants_newline = 0;
wants_letter = 0;
- for (i = 0; i < r->states[0].elems.nelem; ++i)
+ for (i = 0; i < d->states[0].elems.nelem; ++i)
{
- if (_PREV_NEWLINE_DEPENDENT(r->states[0].elems.elems[i].constraint))
+ if (PREV_NEWLINE_DEPENDENT(d->states[0].elems.elems[i].constraint))
wants_newline = 1;
- if (_PREV_LETTER_DEPENDENT(r->states[0].elems.elems[i].constraint))
+ if (PREV_LETTER_DEPENDENT(d->states[0].elems.elems[i].constraint))
wants_letter = 1;
}
- copy(&r->states[0].elems, &follows);
- state = state_index(r, &follows, 0, 0);
+ copy(&d->states[0].elems, &follows);
+ state = state_index(d, &follows, 0, 0);
if (wants_newline)
- state_newline = state_index(r, &follows, 1, 0);
+ state_newline = state_index(d, &follows, 1, 0);
else
state_newline = state;
if (wants_letter)
- state_letter = state_index(r, &follows, 0, 1);
+ state_letter = state_index(d, &follows, 0, 1);
else
state_letter = state;
- for (i = 0; i < _NOTCHAR; ++i)
+ for (i = 0; i < NOTCHAR; ++i)
if (i == '\n')
trans[i] = state_newline;
else if (ISALNUM(i))
@@ -1384,7 +1608,7 @@ regstate(s, r, trans)
trans[i] = state;
}
else
- for (i = 0; i < _NOTCHAR; ++i)
+ for (i = 0; i < NOTCHAR; ++i)
trans[i] = -1;
for (i = 0; i < ngrps; ++i)
@@ -1394,44 +1618,44 @@ regstate(s, r, trans)
/* Find the union of the follows of the positions of the group.
This is a hideously inefficient loop. Fix it someday. */
for (j = 0; j < grps[i].nelem; ++j)
- for (k = 0; k < r->follows[grps[i].elems[j].index].nelem; ++k)
- insert(r->follows[grps[i].elems[j].index].elems[k], &follows);
+ for (k = 0; k < d->follows[grps[i].elems[j].index].nelem; ++k)
+ insert(d->follows[grps[i].elems[j].index].elems[k], &follows);
/* If we are building a searching matcher, throw in the positions
of state 0 as well. */
- if (r->searchflag)
- for (j = 0; j < r->states[0].elems.nelem; ++j)
- insert(r->states[0].elems.elems[j], &follows);
+ if (d->searchflag)
+ for (j = 0; j < d->states[0].elems.nelem; ++j)
+ insert(d->states[0].elems.elems[j], &follows);
/* Find out if the new state will want any context information. */
wants_newline = 0;
if (tstbit('\n', labels[i]))
for (j = 0; j < follows.nelem; ++j)
- if (_PREV_NEWLINE_DEPENDENT(follows.elems[j].constraint))
+ if (PREV_NEWLINE_DEPENDENT(follows.elems[j].constraint))
wants_newline = 1;
wants_letter = 0;
- for (j = 0; j < _CHARSET_INTS; ++j)
+ for (j = 0; j < CHARCLASS_INTS; ++j)
if (labels[i][j] & letters[j])
break;
- if (j < _CHARSET_INTS)
+ if (j < CHARCLASS_INTS)
for (j = 0; j < follows.nelem; ++j)
- if (_PREV_LETTER_DEPENDENT(follows.elems[j].constraint))
+ if (PREV_LETTER_DEPENDENT(follows.elems[j].constraint))
wants_letter = 1;
/* Find the state(s) corresponding to the union of the follows. */
- state = state_index(r, &follows, 0, 0);
+ state = state_index(d, &follows, 0, 0);
if (wants_newline)
- state_newline = state_index(r, &follows, 1, 0);
+ state_newline = state_index(d, &follows, 1, 0);
else
state_newline = state;
if (wants_letter)
- state_letter = state_index(r, &follows, 0, 1);
+ state_letter = state_index(d, &follows, 0, 1);
else
state_letter = state;
/* Set the transitions for each character in the current label. */
- for (j = 0; j < _CHARSET_INTS; ++j)
+ for (j = 0; j < CHARCLASS_INTS; ++j)
for (k = 0; k < INTBITS; ++k)
if (labels[i][j] & 1 << k)
{
@@ -1441,7 +1665,7 @@ regstate(s, r, trans)
trans[c] = state_newline;
else if (ISALNUM(c))
trans[c] = state_letter;
- else if (c < _NOTCHAR)
+ else if (c < NOTCHAR)
trans[c] = state;
}
}
@@ -1451,18 +1675,18 @@ regstate(s, r, trans)
free(follows.elems);
free(tmp.elems);
}
-
-/* Some routines for manipulating a compiled regexp's transition tables.
+
+/* Some routines for manipulating a compiled dfa's transition tables.
Each state may or may not have a transition table; if it does, and it
- is a non-accepting state, then r->trans[state] points to its table.
- If it is an accepting state then r->fails[state] points to its table.
- If it has no table at all, then r->trans[state] is NULL.
+ is a non-accepting state, then d->trans[state] points to its table.
+ If it is an accepting state then d->fails[state] points to its table.
+ If it has no table at all, then d->trans[state] is NULL.
TODO: Improve this comment, get rid of the unnecessary redundancy. */
static void
-build_state(s, r)
+build_state(s, d)
int s;
- struct regexp *r;
+ struct dfa *d;
{
int *trans; /* The new transition table. */
int i;
@@ -1471,87 +1695,87 @@ build_state(s, r)
exist at once. 1024 is arbitrary. The idea is that the frequently
used transition tables will be quickly rebuilt, whereas the ones that
were only needed once or twice will be cleared away. */
- if (r->trcount >= 1024)
+ if (d->trcount >= 1024)
{
- for (i = 0; i < r->tralloc; ++i)
- if (r->trans[i])
+ for (i = 0; i < d->tralloc; ++i)
+ if (d->trans[i])
{
- free((ptr_t) r->trans[i]);
- r->trans[i] = NULL;
+ free((ptr_t) d->trans[i]);
+ d->trans[i] = NULL;
}
- else if (r->fails[i])
+ else if (d->fails[i])
{
- free((ptr_t) r->fails[i]);
- r->fails[i] = NULL;
+ free((ptr_t) d->fails[i]);
+ d->fails[i] = NULL;
}
- r->trcount = 0;
+ d->trcount = 0;
}
- ++r->trcount;
+ ++d->trcount;
/* Set up the success bits for this state. */
- r->success[s] = 0;
- if (ACCEPTS_IN_CONTEXT(r->states[s].newline, 1, r->states[s].letter, 0,
- s, *r))
- r->success[s] |= 4;
- if (ACCEPTS_IN_CONTEXT(r->states[s].newline, 0, r->states[s].letter, 1,
- s, *r))
- r->success[s] |= 2;
- if (ACCEPTS_IN_CONTEXT(r->states[s].newline, 0, r->states[s].letter, 0,
- s, *r))
- r->success[s] |= 1;
-
- MALLOC(trans, int, _NOTCHAR);
- regstate(s, r, trans);
+ d->success[s] = 0;
+ if (ACCEPTS_IN_CONTEXT(d->states[s].newline, 1, d->states[s].letter, 0,
+ s, *d))
+ d->success[s] |= 4;
+ if (ACCEPTS_IN_CONTEXT(d->states[s].newline, 0, d->states[s].letter, 1,
+ s, *d))
+ d->success[s] |= 2;
+ if (ACCEPTS_IN_CONTEXT(d->states[s].newline, 0, d->states[s].letter, 0,
+ s, *d))
+ d->success[s] |= 1;
+
+ MALLOC(trans, int, NOTCHAR);
+ dfastate(s, d, trans);
/* Now go through the new transition table, and make sure that the trans
and fail arrays are allocated large enough to hold a pointer for the
largest state mentioned in the table. */
- for (i = 0; i < _NOTCHAR; ++i)
- if (trans[i] >= r->tralloc)
+ for (i = 0; i < NOTCHAR; ++i)
+ if (trans[i] >= d->tralloc)
{
- int oldalloc = r->tralloc;
-
- while (trans[i] >= r->tralloc)
- r->tralloc *= 2;
- REALLOC(r->realtrans, int *, r->tralloc + 1);
- r->trans = r->realtrans + 1;
- REALLOC(r->fails, int *, r->tralloc);
- REALLOC(r->success, int, r->tralloc);
- REALLOC(r->newlines, int, r->tralloc);
- while (oldalloc < r->tralloc)
+ int oldalloc = d->tralloc;
+
+ while (trans[i] >= d->tralloc)
+ d->tralloc *= 2;
+ REALLOC(d->realtrans, int *, d->tralloc + 1);
+ d->trans = d->realtrans + 1;
+ REALLOC(d->fails, int *, d->tralloc);
+ REALLOC(d->success, int, d->tralloc);
+ REALLOC(d->newlines, int, d->tralloc);
+ while (oldalloc < d->tralloc)
{
- r->trans[oldalloc] = NULL;
- r->fails[oldalloc++] = NULL;
+ d->trans[oldalloc] = NULL;
+ d->fails[oldalloc++] = NULL;
}
}
/* Keep the newline transition in a special place so we can use it as
a sentinel. */
- r->newlines[s] = trans['\n'];
+ d->newlines[s] = trans['\n'];
trans['\n'] = -1;
- if (ACCEPTING(s, *r))
- r->fails[s] = trans;
+ if (ACCEPTING(s, *d))
+ d->fails[s] = trans;
else
- r->trans[s] = trans;
+ d->trans[s] = trans;
}
static void
-build_state_zero(r)
- struct regexp *r;
+build_state_zero(d)
+ struct dfa *d;
{
- r->tralloc = 1;
- r->trcount = 0;
- CALLOC(r->realtrans, int *, r->tralloc + 1);
- r->trans = r->realtrans + 1;
- CALLOC(r->fails, int *, r->tralloc);
- MALLOC(r->success, int, r->tralloc);
- MALLOC(r->newlines, int, r->tralloc);
- build_state(0, r);
+ d->tralloc = 1;
+ d->trcount = 0;
+ CALLOC(d->realtrans, int *, d->tralloc + 1);
+ d->trans = d->realtrans + 1;
+ CALLOC(d->fails, int *, d->tralloc);
+ MALLOC(d->success, int, d->tralloc);
+ MALLOC(d->newlines, int, d->tralloc);
+ build_state(0, d);
}
-
-/* Search through a buffer looking for a match to the given struct regexp.
+
+/* Search through a buffer looking for a match to the given struct dfa.
Find the first occurrence of a string matching the regexp in the buffer,
and the shortest possible version thereof. Return a pointer to the first
character after the match, or NULL if none is found. Begin points to
@@ -1565,8 +1789,8 @@ build_state_zero(r)
match needs to be verified by a backtracking matcher. Otherwise
we store a 0 in *backref. */
char *
-regexecute(r, begin, end, newline, count, backref)
- struct regexp *r;
+dfaexec(d, begin, end, newline, count, backref)
+ struct dfa *d;
char *begin;
char *end;
int newline;
@@ -1575,9 +1799,9 @@ regexecute(r, begin, end, newline, count, backref)
{
register s, s1, tmp; /* Current state. */
register unsigned char *p; /* Current input character. */
- register **trans, *t; /* Copy of r->trans so it can be optimized
+ register **trans, *t; /* Copy of d->trans so it can be optimized
into a register. */
- static sbit[_NOTCHAR]; /* Table for anding with r->success. */
+ static sbit[NOTCHAR]; /* Table for anding with d->success. */
static sbit_init;
if (! sbit_init)
@@ -1585,7 +1809,7 @@ regexecute(r, begin, end, newline, count, backref)
int i;
sbit_init = 1;
- for (i = 0; i < _NOTCHAR; ++i)
+ for (i = 0; i < NOTCHAR; ++i)
if (i == '\n')
sbit[i] = 4;
else if (ISALNUM(i))
@@ -1594,18 +1818,18 @@ regexecute(r, begin, end, newline, count, backref)
sbit[i] = 1;
}
- if (! r->tralloc)
- build_state_zero(r);
+ if (! d->tralloc)
+ build_state_zero(d);
- s = 0;
+ s = s1 = 0;
p = (unsigned char *) begin;
- trans = r->trans;
+ trans = d->trans;
*end = '\n';
for (;;)
{
/* The dreaded inner loop. */
- if (t = trans[s])
+ if ((t = trans[s]) != 0)
do
{
s1 = t[*p++];
@@ -1613,18 +1837,18 @@ regexecute(r, begin, end, newline, count, backref)
goto last_was_s;
s = t[*p++];
}
- while (t = trans[s]);
+ while ((t = trans[s]) != 0);
goto last_was_s1;
last_was_s:
tmp = s, s = s1, s1 = tmp;
last_was_s1:
- if (s >= 0 && p <= (unsigned char *) end && r->fails[s])
+ if (s >= 0 && p <= (unsigned char *) end && d->fails[s])
{
- if (r->success[s] & sbit[*p])
+ if (d->success[s] & sbit[*p])
{
if (backref)
- if (r->states[s].backref)
+ if (d->states[s].backref)
*backref = 1;
else
*backref = 0;
@@ -1632,7 +1856,7 @@ regexecute(r, begin, end, newline, count, backref)
}
s1 = s;
- s = r->fails[s][*p++];
+ s = d->fails[s][*p++];
continue;
}
@@ -1641,63 +1865,64 @@ regexecute(r, begin, end, newline, count, backref)
++*count;
/* Check if we've run off the end of the buffer. */
- if ((char *) p >= end)
+ if ((char *) p > end)
return NULL;
if (s >= 0)
{
- build_state(s, r);
- trans = r->trans;
+ build_state(s, d);
+ trans = d->trans;
continue;
}
if (p[-1] == '\n' && newline)
{
- s = r->newlines[s1];
+ s = d->newlines[s1];
continue;
}
s = 0;
}
}
-
-/* Initialize the components of a regexp that the other routines don't
+
+/* Initialize the components of a dfa that the other routines don't
initialize for themselves. */
void
-reginit(r)
- struct regexp *r;
+dfainit(d)
+ struct dfa *d;
{
- r->calloc = 1;
- MALLOC(r->charsets, _charset, r->calloc);
- r->cindex = 0;
+ d->calloc = 1;
+ MALLOC(d->charclasses, charclass, d->calloc);
+ d->cindex = 0;
- r->talloc = 1;
- MALLOC(r->tokens, _token, r->talloc);
- r->tindex = r->depth = r->nleaves = r->nregexps = 0;
+ d->talloc = 1;
+ MALLOC(d->tokens, token, d->talloc);
+ d->tindex = d->depth = d->nleaves = d->nregexps = 0;
- r->searchflag = 0;
- r->tralloc = 0;
+ d->searchflag = 0;
+ d->tralloc = 0;
+
+ d->musts = 0;
}
/* Parse and analyze a single string of the given length. */
void
-regcompile(s, len, r, searchflag)
- const char *s;
+dfacomp(s, len, d, searchflag)
+ char *s;
size_t len;
- struct regexp *r;
+ struct dfa *d;
int searchflag;
{
- if (case_fold) /* dummy folding in service of regmust() */
+ if (case_fold) /* dummy folding in service of dfamust() */
{
char *copy;
int i;
copy = malloc(len);
if (!copy)
- regerror("out of memory");
+ dfaerror("out of memory");
- /* This is a complete kludge and could potentially break
- \<letter> escapes . . . */
+ /* This is a kludge. */
case_fold = 0;
for (i = 0; i < len; ++i)
if (ISUPPER(s[i]))
@@ -1705,572 +1930,596 @@ regcompile(s, len, r, searchflag)
else
copy[i] = s[i];
- reginit(r);
- r->mustn = 0;
- r->must[0] = '\0';
- regparse(copy, len, r);
+ dfainit(d);
+ dfaparse(copy, len, d);
free(copy);
- regmust(r);
- reganalyze(r, searchflag);
+ dfamust(d);
+ d->cindex = d->tindex = d->depth = d->nleaves = d->nregexps = 0;
case_fold = 1;
- reginit(r);
- regparse(s, len, r);
- reganalyze(r, searchflag);
+ dfaparse(s, len, d);
+ dfaanalyze(d, searchflag);
}
else
{
- reginit(r);
- regparse(s, len, r);
- regmust(r);
- reganalyze(r, searchflag);
+ dfainit(d);
+ dfaparse(s, len, d);
+ dfamust(d);
+ dfaanalyze(d, searchflag);
}
}
-/* Free the storage held by the components of a regexp. */
+/* Free the storage held by the components of a dfa. */
void
-regfree(r)
- struct regexp *r;
+dfafree(d)
+ struct dfa *d;
{
int i;
-
- free((ptr_t) r->charsets);
- free((ptr_t) r->tokens);
- for (i = 0; i < r->sindex; ++i)
- free((ptr_t) r->states[i].elems.elems);
- free((ptr_t) r->states);
- for (i = 0; i < r->tindex; ++i)
- if (r->follows[i].elems)
- free((ptr_t) r->follows[i].elems);
- free((ptr_t) r->follows);
- for (i = 0; i < r->tralloc; ++i)
- if (r->trans[i])
- free((ptr_t) r->trans[i]);
- else if (r->fails[i])
- free((ptr_t) r->fails[i]);
- free((ptr_t) r->realtrans);
- free((ptr_t) r->fails);
- free((ptr_t) r->newlines);
+ struct dfamust *dm, *ndm;
+
+ free((ptr_t) d->charclasses);
+ free((ptr_t) d->tokens);
+ for (i = 0; i < d->sindex; ++i)
+ free((ptr_t) d->states[i].elems.elems);
+ free((ptr_t) d->states);
+ for (i = 0; i < d->tindex; ++i)
+ if (d->follows[i].elems)
+ free((ptr_t) d->follows[i].elems);
+ free((ptr_t) d->follows);
+ for (i = 0; i < d->tralloc; ++i)
+ if (d->trans[i])
+ free((ptr_t) d->trans[i]);
+ else if (d->fails[i])
+ free((ptr_t) d->fails[i]);
+ free((ptr_t) d->realtrans);
+ free((ptr_t) d->fails);
+ free((ptr_t) d->newlines);
+ for (dm = d->musts; dm; dm = ndm)
+ {
+ ndm = dm->next;
+ free(dm->must);
+ free((ptr_t) dm);
+ }
}
-/*
-Having found the postfix representation of the regular expression,
-try to find a long sequence of characters that must appear in any line
-containing the r.e.
-Finding a "longest" sequence is beyond the scope here;
-we take an easy way out and hope for the best.
-(Take "(ab|a)b"--please.)
-
-We do a bottom-up calculation of sequences of characters that must appear
-in matches of r.e.'s represented by trees rooted at the nodes of the postfix
-representation:
+/* Having found the postfix representation of the regular expression,
+ try to find a long sequence of characters that must appear in any line
+ containing the r.e.
+ Finding a "longest" sequence is beyond the scope here;
+ we take an easy way out and hope for the best.
+ (Take "(ab|a)b"--please.)
+
+ We do a bottom-up calculation of sequences of characters that must appear
+ in matches of r.e.'s represented by trees rooted at the nodes of the postfix
+ representation:
sequences that must appear at the left of the match ("left")
sequences that must appear at the right of the match ("right")
lists of sequences that must appear somewhere in the match ("in")
sequences that must constitute the match ("is")
-When we get to the root of the tree, we use one of the longest of its
-calculated "in" sequences as our answer. The sequence we find is returned in
-r->must (where "r" is the single argument passed to "regmust");
-the length of the sequence is returned in r->mustn.
-
-The sequences calculated for the various types of node (in pseudo ANSI c)
-are shown below. "p" is the operand of unary operators (and the left-hand
-operand of binary operators); "q" is the right-hand operand of binary operators
-.
-"ZERO" means "a zero-length sequence" below.
-
-Type left right is in
----- ---- ----- -- --
-char c # c # c # c # c
-
-SET ZERO ZERO ZERO ZERO
-
-STAR ZERO ZERO ZERO ZERO
-
-QMARK ZERO ZERO ZERO ZERO
-
-PLUS p->left p->right ZERO p->in
-
-CAT (p->is==ZERO)? (q->is==ZERO)? (p->is!=ZERO && p->in plus
- p->left : q->right : q->is!=ZERO) ? q->in plus
- p->is##q->left p->right##q->is p->is##q->is : p->right##q->left
- ZERO
-
-OR longest common longest common (do p->is and substrings common to
- leading trailing q->is have same p->in and q->in
- (sub)sequence (sub)sequence length and
- of p->left of p->right content) ?
- and q->left and q->right p->is : NULL
-
-If there's anything else we recognize in the tree, all four sequences get set
-to zero-length sequences. If there's something we don't recognize in the tree,
-we just return a zero-length sequence.
-Break ties in favor of infrequent letters (choosing 'zzz' in preference to
-'aaa')?
-
-And. . .is it here or someplace that we might ponder "optimizations" such as
+ When we get to the root of the tree, we use one of the longest of its
+ calculated "in" sequences as our answer. The sequence we find is returned in
+ d->must (where "d" is the single argument passed to "dfamust");
+ the length of the sequence is returned in d->mustn.
+
+ The sequences calculated for the various types of node (in pseudo ANSI c)
+ are shown below. "p" is the operand of unary operators (and the left-hand
+ operand of binary operators); "q" is the right-hand operand of binary
+ operators.
+
+ "ZERO" means "a zero-length sequence" below.
+
+ Type left right is in
+ ---- ---- ----- -- --
+ char c # c # c # c # c
+
+ CSET ZERO ZERO ZERO ZERO
+
+ STAR ZERO ZERO ZERO ZERO
+
+ QMARK ZERO ZERO ZERO ZERO
+
+ PLUS p->left p->right ZERO p->in
+
+ CAT (p->is==ZERO)? (q->is==ZERO)? (p->is!=ZERO && p->in plus
+ p->left : q->right : q->is!=ZERO) ? q->in plus
+ p->is##q->left p->right##q->is p->is##q->is : p->right##q->left
+ ZERO
+
+ OR longest common longest common (do p->is and substrings common to
+ leading trailing q->is have same p->in and q->in
+ (sub)sequence (sub)sequence length and
+ of p->left of p->right content) ?
+ and q->left and q->right p->is : NULL
+
+ If there's anything else we recognize in the tree, all four sequences get set
+ to zero-length sequences. If there's something we don't recognize in the tree,
+ we just return a zero-length sequence.
+
+ Break ties in favor of infrequent letters (choosing 'zzz' in preference to
+ 'aaa')?
+
+ And. . .is it here or someplace that we might ponder "optimizations" such as
egrep 'psi|epsilon' -> egrep 'psi'
egrep 'pepsi|epsilon' -> egrep 'epsi'
(Yes, we now find "epsi" as a "string
that must occur", but we might also
- simplify the *entire* r.e. being sought
-)
+ simplify the *entire* r.e. being sought)
grep '[c]' -> grep 'c'
grep '(ab|a)b' -> grep 'ab'
grep 'ab*' -> grep 'a'
grep 'a*b' -> grep 'b'
-There are several issues:
- Is optimization easy (enough)?
- Does optimization actually accomplish anything,
- or is the automaton you get from "psi|epsilon" (for example)
- the same as the one you get from "psi" (for example)?
+ There are several issues:
+
+ Is optimization easy (enough)?
- Are optimizable r.e.'s likely to be used in real-life situations
- (something like 'ab*' is probably unlikely; something like is
- 'psi|epsilon' is likelier)?
-*/
+ Does optimization actually accomplish anything,
+ or is the automaton you get from "psi|epsilon" (for example)
+ the same as the one you get from "psi" (for example)?
+
+ Are optimizable r.e.'s likely to be used in real-life situations
+ (something like 'ab*' is probably unlikely; something like is
+ 'psi|epsilon' is likelier)? */
static char *
icatalloc(old, new)
-char * old;
-char * new;
+ char *old;
+ char *new;
{
- register char * result;
- register int oldsize, newsize;
-
- newsize = (new == NULL) ? 0 : strlen(new);
- if (old == NULL)
- oldsize = 0;
- else if (newsize == 0)
- return old;
- else oldsize = strlen(old);
- if (old == NULL)
- result = (char *) malloc(newsize + 1);
- else result = (char *) realloc((void *) old, oldsize + newsize + 1);
- if (result != NULL && new != NULL)
- (void) strcpy(result + oldsize, new);
- return result;
+ char *result;
+ int oldsize, newsize;
+
+ newsize = (new == NULL) ? 0 : strlen(new);
+ if (old == NULL)
+ oldsize = 0;
+ else if (newsize == 0)
+ return old;
+ else oldsize = strlen(old);
+ if (old == NULL)
+ result = (char *) malloc(newsize + 1);
+ else
+ result = (char *) realloc((void *) old, oldsize + newsize + 1);
+ if (result != NULL && new != NULL)
+ (void) strcpy(result + oldsize, new);
+ return result;
}
static char *
icpyalloc(string)
-const char * string;
+ char *string;
{
- return icatalloc((char *) NULL, string);
+ return icatalloc((char *) NULL, string);
}
static char *
istrstr(lookin, lookfor)
-char * lookin;
-register char * lookfor;
+ char *lookin;
+ char *lookfor;
{
- register char * cp;
- register int len;
-
- len = strlen(lookfor);
- for (cp = lookin; *cp != '\0'; ++cp)
- if (strncmp(cp, lookfor, len) == 0)
- return cp;
- return NULL;
+ char *cp;
+ int len;
+
+ len = strlen(lookfor);
+ for (cp = lookin; *cp != '\0'; ++cp)
+ if (strncmp(cp, lookfor, len) == 0)
+ return cp;
+ return NULL;
}
static void
ifree(cp)
-char * cp;
+ char *cp;
{
- if (cp != NULL)
- free(cp);
+ if (cp != NULL)
+ free(cp);
}
static void
freelist(cpp)
-register char ** cpp;
+ char **cpp;
{
- register int i;
+ int i;
- if (cpp == NULL)
- return;
- for (i = 0; cpp[i] != NULL; ++i) {
- free(cpp[i]);
- cpp[i] = NULL;
- }
+ if (cpp == NULL)
+ return;
+ for (i = 0; cpp[i] != NULL; ++i)
+ {
+ free(cpp[i]);
+ cpp[i] = NULL;
+ }
}
static char **
enlist(cpp, new, len)
-register char ** cpp;
-register char * new;
-int len;
+ char **cpp;
+ char *new;
+ int len;
{
- register int i, j;
+ int i, j;
- if (cpp == NULL)
- return NULL;
- if ((new = icpyalloc(new)) == NULL) {
- freelist(cpp);
- return NULL;
- }
- new[len] = '\0';
- /*
- ** Is there already something in the list that's new (or longer)?
- */
- for (i = 0; cpp[i] != NULL; ++i)
- if (istrstr(cpp[i], new) != NULL) {
- free(new);
- return cpp;
- }
- /*
- ** Eliminate any obsoleted strings.
- */
- j = 0;
- while (cpp[j] != NULL)
- if (istrstr(new, cpp[j]) == NULL)
- ++j;
- else {
- free(cpp[j]);
- if (--i == j)
- break;
- cpp[j] = cpp[i];
- cpp[i] = 0;
- }
- /*
- ** Add the new string.
- */
- cpp = (char **) realloc((char *) cpp, (i + 2) * sizeof *cpp);
- if (cpp == NULL)
- return NULL;
- cpp[i] = new;
- cpp[i + 1] = NULL;
+ if (cpp == NULL)
+ return NULL;
+ if ((new = icpyalloc(new)) == NULL)
+ {
+ freelist(cpp);
+ return NULL;
+ }
+ new[len] = '\0';
+ /* Is there already something in the list that's new (or longer)? */
+ for (i = 0; cpp[i] != NULL; ++i)
+ if (istrstr(cpp[i], new) != NULL)
+ {
+ free(new);
return cpp;
+ }
+ /* Eliminate any obsoleted strings. */
+ j = 0;
+ while (cpp[j] != NULL)
+ if (istrstr(new, cpp[j]) == NULL)
+ ++j;
+ else
+ {
+ free(cpp[j]);
+ if (--i == j)
+ break;
+ cpp[j] = cpp[i];
+ cpp[i] = NULL;
+ }
+ /* Add the new string. */
+ cpp = (char **) realloc((char *) cpp, (i + 2) * sizeof *cpp);
+ if (cpp == NULL)
+ return NULL;
+ cpp[i] = new;
+ cpp[i + 1] = NULL;
+ return cpp;
}
-/*
-** Given pointers to two strings,
-** return a pointer to an allocated list of their distinct common substrings.
-** Return NULL if something seems wild.
-*/
-
+/* Given pointers to two strings, return a pointer to an allocated
+ list of their distinct common substrings. Return NULL if something
+ seems wild. */
static char **
comsubs(left, right)
-char * left;
-char * right;
+ char *left;
+ char *right;
{
- register char ** cpp;
- register char * lcp;
- register char * rcp;
- register int i, len;
-
- if (left == NULL || right == NULL)
- return NULL;
- cpp = (char **) malloc(sizeof *cpp);
- if (cpp == NULL)
- return NULL;
- cpp[0] = NULL;
- for (lcp = left; *lcp != '\0'; ++lcp) {
- len = 0;
- rcp = index(right, *lcp);
- while (rcp != NULL) {
- for (i = 1; lcp[i] != '\0' && lcp[i] == rcp[i]; ++i)
- ;
- if (i > len)
- len = i;
- rcp = index(rcp + 1, *lcp);
- }
- if (len == 0)
- continue;
- if ((cpp = enlist(cpp, lcp, len)) == NULL)
- break;
+ char **cpp;
+ char *lcp;
+ char *rcp;
+ int i, len;
+
+ if (left == NULL || right == NULL)
+ return NULL;
+ cpp = (char **) malloc(sizeof *cpp);
+ if (cpp == NULL)
+ return NULL;
+ cpp[0] = NULL;
+ for (lcp = left; *lcp != '\0'; ++lcp)
+ {
+ len = 0;
+ rcp = index(right, *lcp);
+ while (rcp != NULL)
+ {
+ for (i = 1; lcp[i] != '\0' && lcp[i] == rcp[i]; ++i)
+ ;
+ if (i > len)
+ len = i;
+ rcp = index(rcp + 1, *lcp);
}
- return cpp;
+ if (len == 0)
+ continue;
+ if ((cpp = enlist(cpp, lcp, len)) == NULL)
+ break;
+ }
+ return cpp;
}
static char **
addlists(old, new)
-char ** old;
-char ** new;
+char **old;
+char **new;
{
- register int i;
-
- if (old == NULL || new == NULL)
- return NULL;
- for (i = 0; new[i] != NULL; ++i) {
- old = enlist(old, new[i], strlen(new[i]));
- if (old == NULL)
- break;
- }
- return old;
-}
+ int i;
-/*
-** Given two lists of substrings,
-** return a new list giving substrings common to both.
-*/
+ if (old == NULL || new == NULL)
+ return NULL;
+ for (i = 0; new[i] != NULL; ++i)
+ {
+ old = enlist(old, new[i], strlen(new[i]));
+ if (old == NULL)
+ break;
+ }
+ return old;
+}
+/* Given two lists of substrings, return a new list giving substrings
+ common to both. */
static char **
inboth(left, right)
-char ** left;
-char ** right;
+ char **left;
+ char **right;
{
- register char ** both;
- register char ** temp;
- register int lnum, rnum;
-
- if (left == NULL || right == NULL)
- return NULL;
- both = (char **) malloc(sizeof *both);
- if (both == NULL)
- return NULL;
- both[0] = NULL;
- for (lnum = 0; left[lnum] != NULL; ++lnum) {
- for (rnum = 0; right[rnum] != NULL; ++rnum) {
- temp = comsubs(left[lnum], right[rnum]);
- if (temp == NULL) {
- freelist(both);
- return NULL;
- }
- both = addlists(both, temp);
- freelist(temp);
- if (both == NULL)
- return NULL;
- }
+ char **both;
+ char **temp;
+ int lnum, rnum;
+
+ if (left == NULL || right == NULL)
+ return NULL;
+ both = (char **) malloc(sizeof *both);
+ if (both == NULL)
+ return NULL;
+ both[0] = NULL;
+ for (lnum = 0; left[lnum] != NULL; ++lnum)
+ {
+ for (rnum = 0; right[rnum] != NULL; ++rnum)
+ {
+ temp = comsubs(left[lnum], right[rnum]);
+ if (temp == NULL)
+ {
+ freelist(both);
+ return NULL;
+ }
+ both = addlists(both, temp);
+ freelist(temp);
+ if (both == NULL)
+ return NULL;
}
- return both;
+ }
+ return both;
}
-typedef struct {
- char ** in;
- char * left;
- char * right;
- char * is;
+typedef struct
+{
+ char **in;
+ char *left;
+ char *right;
+ char *is;
} must;
static void
resetmust(mp)
-register must * mp;
+must *mp;
{
- mp->left[0] = mp->right[0] = mp->is[0] = '\0';
- freelist(mp->in);
+ mp->left[0] = mp->right[0] = mp->is[0] = '\0';
+ freelist(mp->in);
}
static void
-regmust(reg)
-register struct regexp * reg;
+dfamust(dfa)
+struct dfa *dfa;
{
- register must * musts;
- register must * mp;
- register char * result;
- register int ri;
- register int i;
- register _token t;
- static must must0;
-
- reg->mustn = 0;
- reg->must[0] = '\0';
- musts = (must *) malloc((reg->tindex + 1) * sizeof *musts);
- if (musts == NULL)
- return;
- mp = musts;
- for (i = 0; i <= reg->tindex; ++i)
- mp[i] = must0;
- for (i = 0; i <= reg->tindex; ++i) {
- mp[i].in = (char **) malloc(sizeof *mp[i].in);
- mp[i].left = malloc(2);
- mp[i].right = malloc(2);
- mp[i].is = malloc(2);
- if (mp[i].in == NULL || mp[i].left == NULL ||
- mp[i].right == NULL || mp[i].is == NULL)
- goto done;
- mp[i].left[0] = mp[i].right[0] = mp[i].is[0] = '\0';
- mp[i].in[0] = NULL;
- }
- result = "";
+ must *musts;
+ must *mp;
+ char *result;
+ int ri;
+ int i;
+ int exact;
+ token t;
+ static must must0;
+ struct dfamust *dm;
+
+ result = "";
+ exact = 0;
+ musts = (must *) malloc((dfa->tindex + 1) * sizeof *musts);
+ if (musts == NULL)
+ return;
+ mp = musts;
+ for (i = 0; i <= dfa->tindex; ++i)
+ mp[i] = must0;
+ for (i = 0; i <= dfa->tindex; ++i)
+ {
+ mp[i].in = (char **) malloc(sizeof *mp[i].in);
+ mp[i].left = malloc(2);
+ mp[i].right = malloc(2);
+ mp[i].is = malloc(2);
+ if (mp[i].in == NULL || mp[i].left == NULL ||
+ mp[i].right == NULL || mp[i].is == NULL)
+ goto done;
+ mp[i].left[0] = mp[i].right[0] = mp[i].is[0] = '\0';
+ mp[i].in[0] = NULL;
+ }
#ifdef DEBUG
- fprintf(stderr, "regmust:\n");
- for (i = 0; i < reg->tindex; ++i) {
- fprintf(stderr, " %d:", i);
- prtok(reg->tokens[i]);
- }
- putc('\n', stderr);
+ fprintf(stderr, "dfamust:\n");
+ for (i = 0; i < dfa->tindex; ++i)
+ {
+ fprintf(stderr, " %d:", i);
+ prtok(dfa->tokens[i]);
+ }
+ putc('\n', stderr);
#endif
- for (ri = 0; ri < reg->tindex; ++ri) {
- switch (t = reg->tokens[ri]) {
- case _ALLBEGLINE:
- case _ALLENDLINE:
- case _LPAREN:
- case _RPAREN:
- goto done; /* "cannot happen" */
- case _EMPTY:
- case _BEGLINE:
- case _ENDLINE:
- case _BEGWORD:
- case _ENDWORD:
- case _LIMWORD:
- case _NOTLIMWORD:
- case _BACKREF:
- resetmust(mp);
- break;
- case _STAR:
- case _QMARK:
- if (mp <= musts)
- goto done; /* "cannot happen" */
- --mp;
- resetmust(mp);
- break;
- case _OR:
- if (mp < &musts[2])
- goto done; /* "cannot happen" */
- {
- register char ** new;
- register must * lmp;
- register must * rmp;
- register int j, ln, rn, n;
-
- rmp = --mp;
- lmp = --mp;
- /* Guaranteed to be. Unlikely, but. . . */
- if (strcmp(lmp->is, rmp->is) != 0)
- lmp->is[0] = '\0';
- /* Left side--easy */
- i = 0;
- while (lmp->left[i] != '\0' &&
- lmp->left[i] == rmp->left[i])
- ++i;
- lmp->left[i] = '\0';
- /* Right side */
- ln = strlen(lmp->right);
- rn = strlen(rmp->right);
- n = ln;
- if (n > rn)
- n = rn;
- for (i = 0; i < n; ++i)
- if (lmp->right[ln - i - 1] !=
- rmp->right[rn - i - 1])
- break;
- for (j = 0; j < i; ++j)
- lmp->right[j] =
- lmp->right[(ln - i) + j];
- lmp->right[j] = '\0';
- new = inboth(lmp->in, rmp->in);
- if (new == NULL)
- goto done;
- freelist(lmp->in);
- free((char *) lmp->in);
- lmp->in = new;
- }
- break;
- case _PLUS:
- if (mp <= musts)
- goto done; /* "cannot happen" */
- --mp;
- mp->is[0] = '\0';
- break;
- case _END:
- if (mp != &musts[1])
- goto done; /* "cannot happen" */
- for (i = 0; musts[0].in[i] != NULL; ++i)
- if (strlen(musts[0].in[i]) > strlen(result))
- result = musts[0].in[i];
- goto done;
- case _CAT:
- if (mp < &musts[2])
- goto done; /* "cannot happen" */
- {
- register must * lmp;
- register must * rmp;
-
- rmp = --mp;
- lmp = --mp;
- /*
- ** In. Everything in left, plus everything in
- ** right, plus catenation of
- ** left's right and right's left.
- */
- lmp->in = addlists(lmp->in, rmp->in);
- if (lmp->in == NULL)
- goto done;
- if (lmp->right[0] != '\0' &&
- rmp->left[0] != '\0') {
- register char * tp;
-
- tp = icpyalloc(lmp->right);
- if (tp == NULL)
- goto done;
- tp = icatalloc(tp, rmp->left);
- if (tp == NULL)
- goto done;
- lmp->in = enlist(lmp->in, tp,
- strlen(tp));
- free(tp);
- if (lmp->in == NULL)
- goto done;
- }
- /* Left-hand */
- if (lmp->is[0] != '\0') {
- lmp->left = icatalloc(lmp->left,
- rmp->left);
- if (lmp->left == NULL)
- goto done;
- }
- /* Right-hand */
- if (rmp->is[0] == '\0')
- lmp->right[0] = '\0';
- lmp->right = icatalloc(lmp->right, rmp->right);
- if (lmp->right == NULL)
- goto done;
- /* Guaranteed to be */
- if (lmp->is[0] != '\0' && rmp->is[0] != '\0') {
- lmp->is = icatalloc(lmp->is, rmp->is);
- if (lmp->is == NULL)
- goto done;
- } else
- lmp->is[0] = '\0';
- }
- break;
- default:
- if (t < _END) {
- /* "cannot happen" */
- goto done;
- } else if (t == '\0') {
- /* not on *my* shift */
- goto done;
- } else if (t >= _SET) {
- /* easy enough */
- resetmust(mp);
- } else {
- /* plain character */
- resetmust(mp);
- mp->is[0] = mp->left[0] = mp->right[0] = t;
- mp->is[1] = mp->left[1] = mp->right[1] = '\0';
- mp->in = enlist(mp->in, mp->is, 1);
- if (mp->in == NULL)
- goto done;
- }
- break;
- }
+ for (ri = 0; ri < dfa->tindex; ++ri)
+ {
+ switch (t = dfa->tokens[ri])
+ {
+ case LPAREN:
+ case RPAREN:
+ goto done; /* "cannot happen" */
+ case EMPTY:
+ case BEGLINE:
+ case ENDLINE:
+ case BEGWORD:
+ case ENDWORD:
+ case LIMWORD:
+ case NOTLIMWORD:
+ case BACKREF:
+ resetmust(mp);
+ break;
+ case STAR:
+ case QMARK:
+ if (mp <= musts)
+ goto done; /* "cannot happen" */
+ --mp;
+ resetmust(mp);
+ break;
+ case OR:
+ case ORTOP:
+ if (mp < &musts[2])
+ goto done; /* "cannot happen" */
+ {
+ char **new;
+ must *lmp;
+ must *rmp;
+ int j, ln, rn, n;
+
+ rmp = --mp;
+ lmp = --mp;
+ /* Guaranteed to be. Unlikely, but. . . */
+ if (strcmp(lmp->is, rmp->is) != 0)
+ lmp->is[0] = '\0';
+ /* Left side--easy */
+ i = 0;
+ while (lmp->left[i] != '\0' && lmp->left[i] == rmp->left[i])
+ ++i;
+ lmp->left[i] = '\0';
+ /* Right side */
+ ln = strlen(lmp->right);
+ rn = strlen(rmp->right);
+ n = ln;
+ if (n > rn)
+ n = rn;
+ for (i = 0; i < n; ++i)
+ if (lmp->right[ln - i - 1] != rmp->right[rn - i - 1])
+ break;
+ for (j = 0; j < i; ++j)
+ lmp->right[j] = lmp->right[(ln - i) + j];
+ lmp->right[j] = '\0';
+ new = inboth(lmp->in, rmp->in);
+ if (new == NULL)
+ goto done;
+ freelist(lmp->in);
+ free((char *) lmp->in);
+ lmp->in = new;
+ }
+ break;
+ case PLUS:
+ if (mp <= musts)
+ goto done; /* "cannot happen" */
+ --mp;
+ mp->is[0] = '\0';
+ break;
+ case END:
+ if (mp != &musts[1])
+ goto done; /* "cannot happen" */
+ for (i = 0; musts[0].in[i] != NULL; ++i)
+ if (strlen(musts[0].in[i]) > strlen(result))
+ result = musts[0].in[i];
+ if (strcmp(result, musts[0].is) == 0)
+ exact = 1;
+ goto done;
+ case CAT:
+ if (mp < &musts[2])
+ goto done; /* "cannot happen" */
+ {
+ must *lmp;
+ must *rmp;
+
+ rmp = --mp;
+ lmp = --mp;
+ /* In. Everything in left, plus everything in
+ right, plus catenation of
+ left's right and right's left. */
+ lmp->in = addlists(lmp->in, rmp->in);
+ if (lmp->in == NULL)
+ goto done;
+ if (lmp->right[0] != '\0' &&
+ rmp->left[0] != '\0')
+ {
+ char *tp;
+
+ tp = icpyalloc(lmp->right);
+ if (tp == NULL)
+ goto done;
+ tp = icatalloc(tp, rmp->left);
+ if (tp == NULL)
+ goto done;
+ lmp->in = enlist(lmp->in, tp,
+ strlen(tp));
+ free(tp);
+ if (lmp->in == NULL)
+ goto done;
+ }
+ /* Left-hand */
+ if (lmp->is[0] != '\0')
+ {
+ lmp->left = icatalloc(lmp->left,
+ rmp->left);
+ if (lmp->left == NULL)
+ goto done;
+ }
+ /* Right-hand */
+ if (rmp->is[0] == '\0')
+ lmp->right[0] = '\0';
+ lmp->right = icatalloc(lmp->right, rmp->right);
+ if (lmp->right == NULL)
+ goto done;
+ /* Guaranteed to be */
+ if (lmp->is[0] != '\0' && rmp->is[0] != '\0')
+ {
+ lmp->is = icatalloc(lmp->is, rmp->is);
+ if (lmp->is == NULL)
+ goto done;
+ }
+ else
+ lmp->is[0] = '\0';
+ }
+ break;
+ default:
+ if (t < END)
+ {
+ /* "cannot happen" */
+ goto done;
+ }
+ else if (t == '\0')
+ {
+ /* not on *my* shift */
+ goto done;
+ }
+ else if (t >= CSET)
+ {
+ /* easy enough */
+ resetmust(mp);
+ }
+ else
+ {
+ /* plain character */
+ resetmust(mp);
+ mp->is[0] = mp->left[0] = mp->right[0] = t;
+ mp->is[1] = mp->left[1] = mp->right[1] = '\0';
+ mp->in = enlist(mp->in, mp->is, 1);
+ if (mp->in == NULL)
+ goto done;
+ }
+ break;
+ }
#ifdef DEBUG
- fprintf(stderr, " node: %d:", ri);
- prtok(reg->tokens[ri]);
- fprintf(stderr, "\n in:");
- for (i = 0; mp->in[i]; ++i)
- fprintf(stderr, " \"%s\"", mp->in[i]);
- fprintf(stderr, "\n is: \"%s\"\n", mp->is);
- fprintf(stderr, " left: \"%s\"\n", mp->left);
- fprintf(stderr, " right: \"%s\"\n", mp->right);
+ fprintf(stderr, " node: %d:", ri);
+ prtok(dfa->tokens[ri]);
+ fprintf(stderr, "\n in:");
+ for (i = 0; mp->in[i]; ++i)
+ fprintf(stderr, " \"%s\"", mp->in[i]);
+ fprintf(stderr, "\n is: \"%s\"\n", mp->is);
+ fprintf(stderr, " left: \"%s\"\n", mp->left);
+ fprintf(stderr, " right: \"%s\"\n", mp->right);
#endif
- ++mp;
- }
-done:
- (void) strncpy(reg->must, result, MUST_MAX - 1);
- reg->must[MUST_MAX - 1] = '\0';
- reg->mustn = strlen(reg->must);
- mp = musts;
- for (i = 0; i <= reg->tindex; ++i) {
- freelist(mp[i].in);
- ifree((char *) mp[i].in);
- ifree(mp[i].left);
- ifree(mp[i].right);
- ifree(mp[i].is);
- }
- free((char *) mp);
+ ++mp;
+ }
+ done:
+ if (strlen(result))
+ {
+ dm = (struct dfamust *) malloc(sizeof (struct dfamust));
+ dm->exact = exact;
+ dm->must = malloc(strlen(result) + 1);
+ strcpy(dm->must, result);
+ dm->next = dfa->musts;
+ dfa->musts = dm;
+ }
+ mp = musts;
+ for (i = 0; i <= dfa->tindex; ++i)
+ {
+ freelist(mp[i].in);
+ ifree((char *) mp[i].in);
+ ifree(mp[i].left);
+ ifree(mp[i].right);
+ ifree(mp[i].is);
+ }
+ free((char *) mp);
}
diff --git a/gnu/usr.bin/grep/dfa.h b/gnu/usr.bin/grep/dfa.h
index 33e4bf1..32e05fc 100644
--- a/gnu/usr.bin/grep/dfa.h
+++ b/gnu/usr.bin/grep/dfa.h
@@ -16,210 +16,115 @@
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
/* Written June, 1988 by Mike Haertel */
-
-#ifdef STDC_HEADERS
-
-#include <stddef.h>
-#include <stdlib.h>
-
-#else /* !STDC_HEADERS */
-
-#define const
-#include <sys/types.h> /* For size_t. */
-extern char *calloc(), *malloc(), *realloc();
-extern void free();
-
-#ifndef NULL
-#define NULL 0
-#endif
-
-#endif /* ! STDC_HEADERS */
-
-#include <ctype.h>
-#ifndef isascii
-#define ISALNUM(c) isalnum(c)
-#define ISALPHA(c) isalpha(c)
-#define ISUPPER(c) isupper(c)
-#define ISLOWER(c) islower(c)
-#else
-#define ISALNUM(c) (isascii(c) && isalnum(c))
-#define ISALPHA(c) (isascii(c) && isalpha(c))
-#define ISUPPER(c) (isascii(c) && isupper(c))
-#define ISLOWER(c) (isascii(c) && islower(c))
-#endif
-
-/* 1 means plain parentheses serve as grouping, and backslash
- parentheses are needed for literal searching.
- 0 means backslash-parentheses are grouping, and plain parentheses
- are for literal searching. */
-#define RE_NO_BK_PARENS 1
-
-/* 1 means plain | serves as the "or"-operator, and \| is a literal.
- 0 means \| serves as the "or"-operator, and | is a literal. */
-#define RE_NO_BK_VBAR 2
-
-/* 0 means plain + or ? serves as an operator, and \+, \? are literals.
- 1 means \+, \? are operators and plain +, ? are literals. */
-#define RE_BK_PLUS_QM 4
-
-/* 1 means | binds tighter than ^ or $.
- 0 means the contrary. */
-#define RE_TIGHT_VBAR 8
-
-/* 1 means treat \n as an _OR operator
- 0 means treat it as a normal character */
-#define RE_NEWLINE_OR 16
-
-/* 0 means that a special characters (such as *, ^, and $) always have
- their special meaning regardless of the surrounding context.
- 1 means that special characters may act as normal characters in some
- contexts. Specifically, this applies to:
- ^ - only special at the beginning, or after ( or |
- $ - only special at the end, or before ) or |
- *, +, ? - only special when not after the beginning, (, or | */
-#define RE_CONTEXT_INDEP_OPS 32
-
-/* Now define combinations of bits for the standard possibilities. */
-#define RE_SYNTAX_AWK (RE_NO_BK_PARENS | RE_NO_BK_VBAR | RE_CONTEXT_INDEP_OPS)
-#define RE_SYNTAX_EGREP (RE_SYNTAX_AWK | RE_NEWLINE_OR)
-#define RE_SYNTAX_GREP (RE_BK_PLUS_QM | RE_NEWLINE_OR)
-#define RE_SYNTAX_EMACS 0
+
+/* FIXME:
+ 2. We should not export so much of the DFA internals.
+ In addition to clobbering modularity, we eat up valuable
+ name space. */
/* Number of bits in an unsigned char. */
#define CHARBITS 8
/* First integer value that is greater than any character code. */
-#define _NOTCHAR (1 << CHARBITS)
+#define NOTCHAR (1 << CHARBITS)
/* INTBITS need not be exact, just a lower bound. */
#define INTBITS (CHARBITS * sizeof (int))
/* Number of ints required to hold a bit for every character. */
-#define _CHARSET_INTS ((_NOTCHAR + INTBITS - 1) / INTBITS)
+#define CHARCLASS_INTS ((NOTCHAR + INTBITS - 1) / INTBITS)
/* Sets of unsigned characters are stored as bit vectors in arrays of ints. */
-typedef int _charset[_CHARSET_INTS];
+typedef int charclass[CHARCLASS_INTS];
/* The regexp is parsed into an array of tokens in postfix form. Some tokens
are operators and others are terminal symbols. Most (but not all) of these
codes are returned by the lexical analyzer. */
-#if __STDC__
typedef enum
{
- _END = -1, /* _END is a terminal symbol that matches the
- end of input; any value of _END or less in
+ END = -1, /* END is a terminal symbol that matches the
+ end of input; any value of END or less in
the parse tree is such a symbol. Accepting
states of the DFA are those that would have
- a transition on _END. */
+ a transition on END. */
/* Ordinary character values are terminal symbols that match themselves. */
- _EMPTY = _NOTCHAR, /* _EMPTY is a terminal symbol that matches
+ EMPTY = NOTCHAR, /* EMPTY is a terminal symbol that matches
the empty string. */
- _BACKREF, /* _BACKREF is generated by \<digit>; it
+ BACKREF, /* BACKREF is generated by \<digit>; it
it not completely handled. If the scanner
detects a transition on backref, it returns
a kind of "semi-success" indicating that
the match will have to be verified with
a backtracking matcher. */
- _BEGLINE, /* _BEGLINE is a terminal symbol that matches
+ BEGLINE, /* BEGLINE is a terminal symbol that matches
the empty string if it is at the beginning
of a line. */
- _ALLBEGLINE, /* _ALLBEGLINE is a terminal symbol that
- matches the empty string if it is at the
- beginning of a line; _ALLBEGLINE applies
- to the entire regexp and can only occur
- as the first token thereof. _ALLBEGLINE
- never appears in the parse tree; a _BEGLINE
- is prepended with _CAT to the entire
- regexp instead. */
-
- _ENDLINE, /* _ENDLINE is a terminal symbol that matches
+ ENDLINE, /* ENDLINE is a terminal symbol that matches
the empty string if it is at the end of
a line. */
- _ALLENDLINE, /* _ALLENDLINE is to _ENDLINE as _ALLBEGLINE
- is to _BEGLINE. */
-
- _BEGWORD, /* _BEGWORD is a terminal symbol that matches
+ BEGWORD, /* BEGWORD is a terminal symbol that matches
the empty string if it is at the beginning
of a word. */
- _ENDWORD, /* _ENDWORD is a terminal symbol that matches
+ ENDWORD, /* ENDWORD is a terminal symbol that matches
the empty string if it is at the end of
a word. */
- _LIMWORD, /* _LIMWORD is a terminal symbol that matches
+ LIMWORD, /* LIMWORD is a terminal symbol that matches
the empty string if it is at the beginning
or the end of a word. */
- _NOTLIMWORD, /* _NOTLIMWORD is a terminal symbol that
+ NOTLIMWORD, /* NOTLIMWORD is a terminal symbol that
matches the empty string if it is not at
the beginning or end of a word. */
- _QMARK, /* _QMARK is an operator of one argument that
+ QMARK, /* QMARK is an operator of one argument that
matches zero or one occurences of its
argument. */
- _STAR, /* _STAR is an operator of one argument that
+ STAR, /* STAR is an operator of one argument that
matches the Kleene closure (zero or more
occurrences) of its argument. */
- _PLUS, /* _PLUS is an operator of one argument that
+ PLUS, /* PLUS is an operator of one argument that
matches the positive closure (one or more
occurrences) of its argument. */
- _CAT, /* _CAT is an operator of two arguments that
+ REPMN, /* REPMN is a lexical token corresponding
+ to the {m,n} construct. REPMN never
+ appears in the compiled token vector. */
+
+ CAT, /* CAT is an operator of two arguments that
matches the concatenation of its
- arguments. _CAT is never returned by the
+ arguments. CAT is never returned by the
lexical analyzer. */
- _OR, /* _OR is an operator of two arguments that
+ OR, /* OR is an operator of two arguments that
matches either of its arguments. */
- _LPAREN, /* _LPAREN never appears in the parse tree,
+ ORTOP, /* OR at the toplevel in the parse tree.
+ This is used for a boyer-moore heuristic. */
+
+ LPAREN, /* LPAREN never appears in the parse tree,
it is only a lexeme. */
- _RPAREN, /* _RPAREN never appears in the parse tree. */
+ RPAREN, /* RPAREN never appears in the parse tree. */
- _SET /* _SET and (and any value greater) is a
+ CSET /* CSET and (and any value greater) is a
terminal symbol that matches any of a
class of characters. */
-} _token;
-
-#else /* ! __STDC__ */
-
-typedef short _token;
-
-#define _END -1
-#define _EMPTY _NOTCHAR
-#define _BACKREF (_EMPTY + 1)
-#define _BEGLINE (_EMPTY + 2)
-#define _ALLBEGLINE (_EMPTY + 3)
-#define _ENDLINE (_EMPTY + 4)
-#define _ALLENDLINE (_EMPTY + 5)
-#define _BEGWORD (_EMPTY + 6)
-#define _ENDWORD (_EMPTY + 7)
-#define _LIMWORD (_EMPTY + 8)
-#define _NOTLIMWORD (_EMPTY + 9)
-#define _QMARK (_EMPTY + 10)
-#define _STAR (_EMPTY + 11)
-#define _PLUS (_EMPTY + 12)
-#define _CAT (_EMPTY + 13)
-#define _OR (_EMPTY + 14)
-#define _LPAREN (_EMPTY + 15)
-#define _RPAREN (_EMPTY + 16)
-#define _SET (_EMPTY + 17)
-
-#endif /* ! __STDC__ */
+} token;
-/* Sets are stored in an array in the compiled regexp; the index of the
- array corresponding to a given set token is given by _SET_INDEX(t). */
-#define _SET_INDEX(t) ((t) - _SET)
+/* Sets are stored in an array in the compiled dfa; the index of the
+ array corresponding to a given set token is given by SET_INDEX(t). */
+#define SET_INDEX(t) ((t) - CSET)
/* Sometimes characters can only be matched depending on the surrounding
context. Such context decisions depend on what the previous character
@@ -239,36 +144,36 @@ typedef short _token;
Word-constituent characters are those that satisfy isalnum().
- The macro _SUCCEEDS_IN_CONTEXT determines whether a a given constraint
+ The macro SUCCEEDS_IN_CONTEXT determines whether a a given constraint
succeeds in a particular context. Prevn is true if the previous character
was a newline, currn is true if the lookahead character is a newline.
Prevl and currl similarly depend upon whether the previous and current
characters are word-constituent letters. */
-#define _MATCHES_NEWLINE_CONTEXT(constraint, prevn, currn) \
- ((constraint) & 1 << ((prevn) ? 2 : 0) + ((currn) ? 1 : 0) + 4)
-#define _MATCHES_LETTER_CONTEXT(constraint, prevl, currl) \
- ((constraint) & 1 << ((prevl) ? 2 : 0) + ((currl) ? 1 : 0))
-#define _SUCCEEDS_IN_CONTEXT(constraint, prevn, currn, prevl, currl) \
- (_MATCHES_NEWLINE_CONTEXT(constraint, prevn, currn) \
- && _MATCHES_LETTER_CONTEXT(constraint, prevl, currl))
+#define MATCHES_NEWLINE_CONTEXT(constraint, prevn, currn) \
+ ((constraint) & 1 << (((prevn) ? 2 : 0) + ((currn) ? 1 : 0) + 4))
+#define MATCHES_LETTER_CONTEXT(constraint, prevl, currl) \
+ ((constraint) & 1 << (((prevl) ? 2 : 0) + ((currl) ? 1 : 0)))
+#define SUCCEEDS_IN_CONTEXT(constraint, prevn, currn, prevl, currl) \
+ (MATCHES_NEWLINE_CONTEXT(constraint, prevn, currn) \
+ && MATCHES_LETTER_CONTEXT(constraint, prevl, currl))
/* The following macros give information about what a constraint depends on. */
-#define _PREV_NEWLINE_DEPENDENT(constraint) \
+#define PREV_NEWLINE_DEPENDENT(constraint) \
(((constraint) & 0xc0) >> 2 != ((constraint) & 0x30))
-#define _PREV_LETTER_DEPENDENT(constraint) \
+#define PREV_LETTER_DEPENDENT(constraint) \
(((constraint) & 0x0c) >> 2 != ((constraint) & 0x03))
/* Tokens that match the empty string subject to some constraint actually
work by applying that constraint to determine what may follow them,
taking into account what has gone before. The following values are
the constraints corresponding to the special tokens previously defined. */
-#define _NO_CONSTRAINT 0xff
-#define _BEGLINE_CONSTRAINT 0xcf
-#define _ENDLINE_CONSTRAINT 0xaf
-#define _BEGWORD_CONSTRAINT 0xf2
-#define _ENDWORD_CONSTRAINT 0xf4
-#define _LIMWORD_CONSTRAINT 0xf6
-#define _NOTLIMWORD_CONSTRAINT 0xf9
+#define NO_CONSTRAINT 0xff
+#define BEGLINE_CONSTRAINT 0xcf
+#define ENDLINE_CONSTRAINT 0xaf
+#define BEGWORD_CONSTRAINT 0xf2
+#define ENDWORD_CONSTRAINT 0xf4
+#define LIMWORD_CONSTRAINT 0xf6
+#define NOTLIMWORD_CONSTRAINT 0xf9
/* States of the recognizer correspond to sets of positions in the parse
tree, together with the constraints under which they may be matched.
@@ -278,44 +183,48 @@ typedef struct
{
unsigned index; /* Index into the parse array. */
unsigned constraint; /* Constraint for matching this position. */
-} _position;
+} position;
/* Sets of positions are stored as arrays. */
typedef struct
{
- _position *elems; /* Elements of this position set. */
+ position *elems; /* Elements of this position set. */
int nelem; /* Number of elements in this set. */
-} _position_set;
+} position_set;
-/* A state of the regexp consists of a set of positions, some flags,
+/* A state of the dfa consists of a set of positions, some flags,
and the token value of the lowest-numbered position of the state that
- contains an _END token. */
+ contains an END token. */
typedef struct
{
int hash; /* Hash of the positions of this state. */
- _position_set elems; /* Positions this state could match. */
+ position_set elems; /* Positions this state could match. */
char newline; /* True if previous state matched newline. */
char letter; /* True if previous state matched a letter. */
char backref; /* True if this state matches a \<digit>. */
unsigned char constraint; /* Constraint for this state to accept. */
- int first_end; /* Token value of the first _END in elems. */
-} _dfa_state;
+ int first_end; /* Token value of the first END in elems. */
+} dfa_state;
-/* If an r.e. is at most MUST_MAX characters long, we look for a string which
- must appear in it; whatever's found is dropped into the struct reg. */
-
-#define MUST_MAX 50
+/* Element of a list of strings, at least one of which is known to
+ appear in any R.E. matching the DFA. */
+struct dfamust
+{
+ int exact;
+ char *must;
+ struct dfamust *next;
+};
/* A compiled regular expression. */
-struct regexp
+struct dfa
{
/* Stuff built by the scanner. */
- _charset *charsets; /* Array of character sets for _SET tokens. */
- int cindex; /* Index for adding new charsets. */
- int calloc; /* Number of charsets currently allocated. */
+ charclass *charclasses; /* Array of character sets for CSET tokens. */
+ int cindex; /* Index for adding new charclasses. */
+ int calloc; /* Number of charclasses currently allocated. */
/* Stuff built by the parser. */
- _token *tokens; /* Postfix parse array. */
+ token *tokens; /* Postfix parse array. */
int tindex; /* Index for adding new tokens. */
int talloc; /* Number of tokens currently allocated. */
int depth; /* Depth required of an evaluation stack
@@ -323,15 +232,15 @@ struct regexp
parse tree. */
int nleaves; /* Number of leaves on the parse tree. */
int nregexps; /* Count of parallel regexps being built
- with regparse(). */
+ with dfaparse(). */
/* Stuff owned by the state builder. */
- _dfa_state *states; /* States of the regexp. */
+ dfa_state *states; /* States of the dfa. */
int sindex; /* Index for adding new states. */
int salloc; /* Number of states currently allocated. */
/* Stuff built by the structure analyzer. */
- _position_set *follows; /* Array of follow sets, indexed by position
+ position_set *follows; /* Array of follow sets, indexed by position
index. The follow of a position is the set
of positions containing characters that
could conceivably follow a character
@@ -361,7 +270,7 @@ struct regexp
int **fails; /* Transition tables after failing to accept
on a state that potentially could do so. */
int *success; /* Table of acceptance conditions used in
- regexecute and computed in build_state. */
+ dfaexec and computed in build_state. */
int *newlines; /* Transitions on newlines. The entry for a
newline in any transition table is always
-1 so we can count lines without wasting
@@ -369,40 +278,41 @@ struct regexp
newline is stored separately and handled
as a special case. Newline is also used
as a sentinel at the end of the buffer. */
- char must[MUST_MAX];
- int mustn;
+ struct dfamust *musts; /* List of strings, at least one of which
+ is known to appear in any r.e. matching
+ the dfa. */
};
-/* Some macros for user access to regexp internals. */
+/* Some macros for user access to dfa internals. */
/* ACCEPTING returns true if s could possibly be an accepting state of r. */
#define ACCEPTING(s, r) ((r).states[s].constraint)
/* ACCEPTS_IN_CONTEXT returns true if the given state accepts in the
specified context. */
-#define ACCEPTS_IN_CONTEXT(prevn, currn, prevl, currl, state, reg) \
- _SUCCEEDS_IN_CONTEXT((reg).states[state].constraint, \
+#define ACCEPTS_IN_CONTEXT(prevn, currn, prevl, currl, state, dfa) \
+ SUCCEEDS_IN_CONTEXT((dfa).states[state].constraint, \
prevn, currn, prevl, currl)
/* FIRST_MATCHING_REGEXP returns the index number of the first of parallel
regexps that a given state could accept. Parallel regexps are numbered
starting at 1. */
-#define FIRST_MATCHING_REGEXP(state, reg) (-(reg).states[state].first_end)
+#define FIRST_MATCHING_REGEXP(state, dfa) (-(dfa).states[state].first_end)
/* Entry points. */
#if __STDC__
-/* Regsyntax() takes two arguments; the first sets the syntax bits described
+/* dfasyntax() takes two arguments; the first sets the syntax bits described
earlier in this file, and the second sets the case-folding flag. */
-extern void regsyntax(int, int);
+extern void dfasyntax(int, int);
-/* Compile the given string of the given length into the given struct regexp.
+/* Compile the given string of the given length into the given struct dfa.
Final argument is a flag specifying whether to build a searching or an
exact matcher. */
-extern void regcompile(const char *, size_t, struct regexp *, int);
+extern void dfacomp(char *, size_t, struct dfa *, int);
-/* Execute the given struct regexp on the buffer of characters. The
+/* Execute the given struct dfa on the buffer of characters. The
first char * points to the beginning, and the second points to the
first character after the end of the buffer, which must be a writable
place so a sentinel end-of-buffer marker can be stored there. The
@@ -414,37 +324,37 @@ extern void regcompile(const char *, size_t, struct regexp *, int);
order to verify backreferencing; otherwise the flag will be cleared.
Returns NULL if no match is found, or a pointer to the first
character after the first & shortest matching string in the buffer. */
-extern char *regexecute(struct regexp *, char *, char *, int, int *, int *);
+extern char *dfaexec(struct dfa *, char *, char *, int, int *, int *);
-/* Free the storage held by the components of a struct regexp. */
-extern void regfree(struct regexp *);
+/* Free the storage held by the components of a struct dfa. */
+extern void dfafree(struct dfa *);
/* Entry points for people who know what they're doing. */
-/* Initialize the components of a struct regexp. */
-extern void reginit(struct regexp *);
+/* Initialize the components of a struct dfa. */
+extern void dfainit(struct dfa *);
-/* Incrementally parse a string of given length into a struct regexp. */
-extern void regparse(const char *, size_t, struct regexp *);
+/* Incrementally parse a string of given length into a struct dfa. */
+extern void dfaparse(char *, size_t, struct dfa *);
/* Analyze a parsed regexp; second argument tells whether to build a searching
or an exact matcher. */
-extern void reganalyze(struct regexp *, int);
+extern void dfaanalyze(struct dfa *, int);
/* Compute, for each possible character, the transitions out of a given
state, storing them in an array of integers. */
-extern void regstate(int, struct regexp *, int []);
+extern void dfastate(int, struct dfa *, int []);
/* Error handling. */
-/* Regerror() is called by the regexp routines whenever an error occurs. It
+/* dfaerror() is called by the regexp routines whenever an error occurs. It
takes a single argument, a NUL-terminated string describing the error.
- The default regerror() prints the error message to stderr and exits.
- The user can provide a different regfree() if so desired. */
-extern void regerror(const char *);
+ The default dfaerror() prints the error message to stderr and exits.
+ The user can provide a different dfafree() if so desired. */
+extern void dfaerror(char *);
#else /* ! __STDC__ */
-extern void regsyntax(), regcompile(), regfree(), reginit(), regparse();
-extern void reganalyze(), regstate(), regerror();
-extern char *regexecute();
+extern void dfasyntax(), dfacomp(), dfafree(), dfainit(), dfaparse();
+extern void dfaanalyze(), dfastate(), dfaerror();
+extern char *dfaexec();
#endif /* ! __STDC__ */
diff --git a/gnu/usr.bin/grep/getopt.c b/gnu/usr.bin/grep/getopt.c
index 0661bdf..a59a013 100644
--- a/gnu/usr.bin/grep/getopt.c
+++ b/gnu/usr.bin/grep/getopt.c
@@ -3,12 +3,13 @@
"Keep this file name-space clean" means, talk to roland@gnu.ai.mit.edu
before changing it!
- Copyright (C) 1987, 88, 89, 90, 91, 1992 Free Software Foundation, Inc.
+ Copyright (C) 1987, 88, 89, 90, 91, 92, 1993
+ Free Software Foundation, Inc.
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation; either version 2, or (at your option)
- any later version.
+ This program is free software; you can redistribute it and/or modify it
+ under the terms of the GNU General Public License as published by the
+ Free Software Foundation; either version 2, or (at your option) any
+ later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
@@ -17,49 +18,67 @@
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
- Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
+ Foundation, 675 Mass Ave, Cambridge, MA 02139, USA. */
-/* AIX requires this to be the first thing in the file. */
+/* NOTE!!! AIX requires this to be the first thing in the file.
+ Do not put ANYTHING before it! */
+#if !defined (__GNUC__) && defined (_AIX)
+ #pragma alloca
+#endif
+
+#ifdef HAVE_CONFIG_H
+#include "config.h"
+#endif
+
#ifdef __GNUC__
#define alloca __builtin_alloca
#else /* not __GNUC__ */
-#if defined(sparc) && !defined(USG) && !defined(SVR4) && !defined(__svr4__)
+#if defined (HAVE_ALLOCA_H) || (defined(sparc) && (defined(sun) || (!defined(USG) && !defined(SVR4) && !defined(__svr4__))))
#include <alloca.h>
#else
-#ifdef _AIX
- #pragma alloca
-#else
+#ifndef _AIX
char *alloca ();
#endif
-#endif /* sparc */
+#endif /* alloca.h */
#endif /* not __GNUC__ */
-#ifdef LIBC
-/* For when compiled as part of the GNU C library. */
-#include <ansidecl.h>
+#if !__STDC__ && !defined(const) && IN_GCC
+#define const
+#endif
+
+/* This tells Alpha OSF/1 not to define a getopt prototype in <stdio.h>. */
+#ifndef _NO_PROTO
+#define _NO_PROTO
#endif
#include <stdio.h>
+/* Comment out all this code if we are using the GNU C Library, and are not
+ actually compiling the library itself. This code is part of the GNU C
+ Library, but also included in many other GNU distributions. Compiling
+ and linking in this code is a waste when using the GNU C library
+ (especially if it is a shared library). Rather than having every GNU
+ program understand `configure --with-gnu-libc' and omit the object files,
+ it is simpler to just do this in the source for each such file. */
+
+#if defined (_LIBC) || !defined (__GNU_LIBRARY__)
+
+
/* This needs to come after some library #include
to get __GNU_LIBRARY__ defined. */
#ifdef __GNU_LIBRARY__
#undef alloca
+/* Don't include stdlib.h for non-GNU C libraries because some of them
+ contain conflicting prototypes for getopt. */
#include <stdlib.h>
-#include <string.h>
#else /* Not GNU C library. */
#define __alloca alloca
#endif /* GNU C library. */
-
-#ifndef __STDC__
-#define const
-#endif
-
/* If GETOPT_COMPAT is defined, `+' as well as `--' can introduce a
long-named option. Because this is not POSIX.2 compliant, it is
- being phased out. */
-#define GETOPT_COMPAT
+ being phased out. */
+/* #define GETOPT_COMPAT */
/* This version of `getopt' appears to the caller like standard Unix `getopt'
but it behaves differently for the user, since it allows the user
@@ -97,6 +116,7 @@ char *optarg = 0;
Otherwise, `optind' communicates from one call to the next
how much of ARGV has been scanned so far. */
+/* XXX 1003.2 says this must be 1 before any call. */
int optind = 0;
/* The next char to be scanned in the option-element
@@ -113,6 +133,12 @@ static char *nextchar;
int opterr = 1;
+/* Set to an option character which was unrecognized.
+ This must be initialized on some systems to avoid linking in the
+ system's own getopt implementation. */
+
+int optopt = '?';
+
/* Describe how to deal with options that follow non-option ARGV-elements.
If the caller did not specify anything,
@@ -148,6 +174,10 @@ static enum
} ordering;
#ifdef __GNU_LIBRARY__
+/* We want to avoid inclusion of string.h with non-GNU libraries
+ because there are many ways it can cause trouble.
+ On some systems, it contains special magic macros that don't work
+ in GCC. */
#include <string.h>
#define my_index strchr
#define my_bcopy(src, dst, n) memcpy ((dst), (src), (n))
@@ -159,22 +189,23 @@ static enum
char *getenv ();
static char *
-my_index (string, chr)
- char *string;
+my_index (str, chr)
+ const char *str;
int chr;
{
- while (*string)
+ while (*str)
{
- if (*string == chr)
- return string;
- string++;
+ if (*str == chr)
+ return (char *) str;
+ str++;
}
return 0;
}
static void
my_bcopy (from, to, size)
- char *from, *to;
+ const char *from;
+ char *to;
int size;
{
int i;
@@ -210,10 +241,12 @@ exchange (argv)
/* Interchange the two blocks of data in ARGV. */
- my_bcopy (&argv[first_nonopt], temp, nonopts_size);
- my_bcopy (&argv[last_nonopt], &argv[first_nonopt],
+ my_bcopy ((char *) &argv[first_nonopt], (char *) temp, nonopts_size);
+ my_bcopy ((char *) &argv[last_nonopt], (char *) &argv[first_nonopt],
(optind - last_nonopt) * sizeof (char *));
- my_bcopy (temp, &argv[first_nonopt + optind - last_nonopt], nonopts_size);
+ my_bcopy ((char *) temp,
+ (char *) &argv[first_nonopt + optind - last_nonopt],
+ nonopts_size);
/* Update records for the slots the non-options now occupy. */
@@ -457,7 +490,7 @@ _getopt_internal (argc, argv, optstring, longopts, longind, long_only)
if (*s)
{
/* Don't test has_arg with >, because some C compilers don't
- allow it to be used on enums. */
+ allow it to be used on enums. */
if (pfound->has_arg)
optarg = s + 1;
else
@@ -489,7 +522,7 @@ _getopt_internal (argc, argv, optstring, longopts, longind, long_only)
fprintf (stderr, "%s: option `%s' requires an argument\n",
argv[0], argv[optind - 1]);
nextchar += strlen (nextchar);
- return '?';
+ return optstring[0] == ':' ? ':' : '?';
}
}
nextchar += strlen (nextchar);
@@ -505,7 +538,7 @@ _getopt_internal (argc, argv, optstring, longopts, longind, long_only)
/* Can't find it as a long option. If this is not getopt_long_only,
or the option starts with '--' or is not a valid short
option, then it's an error.
- Otherwise interpret it as a short option. */
+ Otherwise interpret it as a short option. */
if (!long_only || argv[optind][1] == '-'
#ifdef GETOPT_COMPAT
|| argv[optind][0] == '+'
@@ -523,7 +556,7 @@ _getopt_internal (argc, argv, optstring, longopts, longind, long_only)
fprintf (stderr, "%s: unrecognized option `%c%s'\n",
argv[0], argv[optind][0], nextchar);
}
- nextchar += strlen (nextchar);
+ nextchar = (char *) "";
optind++;
return '?';
}
@@ -537,18 +570,24 @@ _getopt_internal (argc, argv, optstring, longopts, longind, long_only)
/* Increment `optind' when we start to process its last character. */
if (*nextchar == '\0')
- optind++;
+ ++optind;
if (temp == NULL || c == ':')
{
if (opterr)
{
+#if 0
if (c < 040 || c >= 0177)
fprintf (stderr, "%s: unrecognized option, character code 0%o\n",
argv[0], c);
else
fprintf (stderr, "%s: unrecognized option `-%c'\n", argv[0], c);
+#else
+ /* 1003.2 specifies the format of this message. */
+ fprintf (stderr, "%s: illegal option -- %c\n", argv[0], c);
+#endif
}
+ optopt = c;
return '?';
}
if (temp[1] == ':')
@@ -568,7 +607,7 @@ _getopt_internal (argc, argv, optstring, longopts, longind, long_only)
else
{
/* This is an option that requires an argument. */
- if (*nextchar != 0)
+ if (*nextchar != '\0')
{
optarg = nextchar;
/* If we end this ARGV-element by taking the rest as an arg,
@@ -578,9 +617,21 @@ _getopt_internal (argc, argv, optstring, longopts, longind, long_only)
else if (optind == argc)
{
if (opterr)
- fprintf (stderr, "%s: option `-%c' requires an argument\n",
- argv[0], c);
- c = '?';
+ {
+#if 0
+ fprintf (stderr, "%s: option `-%c' requires an argument\n",
+ argv[0], c);
+#else
+ /* 1003.2 specifies the format of this message. */
+ fprintf (stderr, "%s: option requires an argument -- %c\n",
+ argv[0], c);
+#endif
+ }
+ optopt = c;
+ if (optstring[0] == ':')
+ c = ':';
+ else
+ c = '?';
}
else
/* We already incremented `optind' once;
@@ -604,6 +655,8 @@ getopt (argc, argv, optstring)
(int *) 0,
0);
}
+
+#endif /* _LIBC or not __GNU_LIBRARY__. */
#ifdef TEST
diff --git a/gnu/usr.bin/grep/getopt.h b/gnu/usr.bin/grep/getopt.h
index f64de31..45541f5 100644
--- a/gnu/usr.bin/grep/getopt.h
+++ b/gnu/usr.bin/grep/getopt.h
@@ -1,10 +1,10 @@
/* Declarations for getopt.
- Copyright (C) 1989, 1990, 1991, 1992 Free Software Foundation, Inc.
+ Copyright (C) 1989, 1990, 1991, 1992, 1993 Free Software Foundation, Inc.
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation; either version 2, or (at your option)
- any later version.
+ This program is free software; you can redistribute it and/or modify it
+ under the terms of the GNU General Public License as published by the
+ Free Software Foundation; either version 2, or (at your option) any
+ later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
@@ -13,11 +13,15 @@
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
- Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
+ Foundation, 675 Mass Ave, Cambridge, MA 02139, USA. */
#ifndef _GETOPT_H
#define _GETOPT_H 1
+#ifdef __cplusplus
+extern "C" {
+#endif
+
/* For communication from `getopt' to the caller.
When `getopt' finds an option that takes an argument,
the argument value is returned here.
@@ -45,6 +49,10 @@ extern int optind;
extern int opterr;
+/* Set to an option character which was unrecognized. */
+
+extern int optopt;
+
/* Describe the long-named options requested by the application.
The LONG_OPTIONS argument to getopt_long or getopt_long_only is a vector
of `struct option' terminated by an element containing a name which is
@@ -82,15 +90,19 @@ struct option
/* Names for the values of the `has_arg' field of `struct option'. */
-enum _argtype
-{
- no_argument,
- required_argument,
- optional_argument
-};
+#define no_argument 0
+#define required_argument 1
+#define optional_argument 2
#if __STDC__
+#if defined(__GNU_LIBRARY__)
+/* Many other libraries have conflicting prototypes for getopt, with
+ differences in the consts, in stdlib.h. To avoid compilation
+ errors, only prototype getopt for the GNU C library. */
extern int getopt (int argc, char *const *argv, const char *shortopts);
+#else /* not __GNU_LIBRARY__ */
+extern int getopt ();
+#endif /* not __GNU_LIBRARY__ */
extern int getopt_long (int argc, char *const *argv, const char *shortopts,
const struct option *longopts, int *longind);
extern int getopt_long_only (int argc, char *const *argv,
@@ -110,4 +122,8 @@ extern int getopt_long_only ();
extern int _getopt_internal ();
#endif /* not __STDC__ */
+#ifdef __cplusplus
+}
+#endif
+
#endif /* _GETOPT_H */
diff --git a/gnu/usr.bin/grep/getpagesize.h b/gnu/usr.bin/grep/getpagesize.h
new file mode 100644
index 0000000..e6bd561
--- /dev/null
+++ b/gnu/usr.bin/grep/getpagesize.h
@@ -0,0 +1,42 @@
+#ifdef BSD
+#ifndef BSD4_1
+#define HAVE_GETPAGESIZE
+#endif
+#endif
+
+#ifndef HAVE_GETPAGESIZE
+
+#ifdef VMS
+#define getpagesize() 512
+#endif
+
+#ifdef HAVE_UNISTD_H
+#include <unistd.h>
+#endif
+
+#ifdef _SC_PAGESIZE
+#define getpagesize() sysconf(_SC_PAGESIZE)
+#else
+
+#ifdef HAVE_SYS_PARAM_H
+#include <sys/param.h>
+
+#ifdef EXEC_PAGESIZE
+#define getpagesize() EXEC_PAGESIZE
+#else
+#ifdef NBPG
+#define getpagesize() NBPG * CLSIZE
+#ifndef CLSIZE
+#define CLSIZE 1
+#endif /* no CLSIZE */
+#else /* no NBPG */
+#define getpagesize() NBPC
+#endif /* no NBPG */
+#endif /* no EXEC_PAGESIZE */
+#else /* !HAVE_SYS_PARAM_H */
+#define getpagesize() 8192 /* punt totally */
+#endif /* !HAVE_SYS_PARAM_H */
+#endif /* no _SC_PAGESIZE */
+
+#endif /* not HAVE_GETPAGESIZE */
+
diff --git a/gnu/usr.bin/grep/grep.1 b/gnu/usr.bin/grep/grep.1
index 57093b6..27c6b0e 100644
--- a/gnu/usr.bin/grep/grep.1
+++ b/gnu/usr.bin/grep/grep.1
@@ -1,234 +1,375 @@
-.TH GREP 1 "1988 December 13" "GNU Project" \" -*- nroff -*-
-.UC 4
+.TH GREP 1 "1992 September 10" "GNU Project"
.SH NAME
-grep, egrep \- print lines matching a regular expression
-.SH SYNOPSIS
+grep, egrep, fgrep \- print lines matching a pattern
+.SH SYNOPOSIS
.B grep
[
-.B \-CVbchilnsvwx
-] [
-.BI \- num
-] [
-.B \-AB
-.I num
-] [ [
+.BR \- [[ AB "] ]\c"
+.I "num"
+]
+[
+.BR \- [ CEFGVBchilnsvwx ]
+]
+[
.B \-e
]
-.I expr
+.I pattern
|
-.B \-f
-.I file
+.BI \-f file
] [
-.I "files ..."
+.I files...
]
.SH DESCRIPTION
-.I Grep
-searches the files listed in the arguments (or standard
-input if no files are given) for all lines that contain a match for
-the given
-.IR expr .
-If any lines match, they are printed.
-.PP
-Also, if any matches were found,
-.I grep
-exits with a status of 0, but if no matches were found it exits
-with a status of 1. This is useful for building shell scripts that
-use
-.I grep
-as a condition for, for example, the
-.I if
-statement.
-.PP
-When invoked as
-.I egrep
-the syntax of the
-.I expr
-is slightly different; See below.
-.br
-.SH "REGULAR EXPRESSIONS"
-.RS 2.5i
-.ta 1i 2i
-.sp
-.ti -2.0i
-(grep) (egrep) (explanation)
-.sp
-.ti -2.0i
-\fIc\fP \fIc\fP a single (non-meta) character matches itself.
-.sp
-.ti -2.0i
-\&. . matches any single character except newline.
-.sp
-.ti -2.0i
-\\? ? postfix operator; preceeding item is optional.
-.sp
-.ti -2.0i
-\(** \(** postfix operator; preceeding item 0 or
-more times.
-.sp
-.ti -2.0i
-\\+ + postfix operator; preceeding item 1 or
-more times.
-.sp
-.ti -2.0i
-\\| | infix operator; matches either
-argument.
-.sp
-.ti -2.0i
-^ ^ matches the empty string at the beginning of a line.
-.sp
-.ti -2.0i
-$ $ matches the empty string at the end of a line.
-.sp
-.ti -2.0i
-\\< \\< matches the empty string at the beginning of a word.
-.sp
-.ti -2.0i
-\\> \\> matches the empty string at the end of a word.
-.sp
-.ti -2.0i
-[\fIchars\fP] [\fIchars\fP] match any character in the given class; if the
-first character after [ is ^, match any character
-not in the given class; a range of characters may
-be specified by \fIfirst\-last\fP; for example, \\W
-(below) is equivalent to the class [^A\-Za\-z0\-9]
-.sp
-.ti -2.0i
-\\( \\) ( ) parentheses are used to override operator precedence.
-.sp
-.ti -2.0i
-\\\fIdigit\fP \\\fIdigit\fP \\\fIn\fP matches a repeat of the text
-matched earlier in the regexp by the subexpression inside the nth
-opening parenthesis.
-.sp
-.ti -2.0i
-\\ \\ any special character may be preceded
-by a backslash to match it literally.
-.sp
-.ti -2.0i
-(the following are for compatibility with GNU Emacs)
-.sp
-.ti -2.0i
-\\b \\b matches the empty string at the edge of a word.
-.sp
-.ti -2.0i
-\\B \\B matches the empty string if not at the edge of a word.
-.sp
-.ti -2.0i
-\\w \\w matches word-constituent characters (letters & digits).
-.sp
-.ti -2.0i
-\\W \\W matches characters that are not word-constituent.
-.RE
-.PP
-Operator precedence is (highest to lowest) ?, \(**, and +, concatenation,
-and finally |. All other constructs are syntactically identical to
-normal characters. For the truly interested, the file dfa.c describes
-(and implements) the exact grammar understood by the parser.
-.SH OPTIONS
+.PP
+.B Grep
+searches the named input
+.I files
+(or standard input if no files are named, or
+the file name
+.B \-
+is given)
+for lines containing a match to the given
+.IR pattern .
+By default,
+.B grep
+prints the matching lines.
+.PP
+There are three major variants of
+.BR grep ,
+controlled by the following options.
+.PD 0
+.TP
+.B \-G
+Interpret
+.I pattern
+as a basic regular expression (see below). This is the default.
+.TP
+.B \-E
+Interpret
+.I pattern
+as an extended regular expression (see below).
+.TP
+.B \-F
+Interpret
+.I pattern
+as a list of fixed strings, separated by newlines,
+any of which is to be matched.
+.LP
+In addition, two variant programs
+.B egrep
+and
+.B fgrep
+are available.
+.B Egrep
+is similiar (but not identical) to
+.BR "grep\ \-E" ,
+and is compatible with the historical Unix
+.BR egrep .
+.B Fgrep
+is the same as
+.BR "grep\ \-F" .
+.PD
+.LP
+All variants of
+.B grep
+understand the following options:
+.PD 0
+.TP
+.BI \- num
+Matches will be printed with
+.I num
+lines of leading and trailing context. However,
+.B grep
+will never print any given line more than once.
.TP
.BI \-A " num"
-print <num> lines of context after every matching line
+Print
+.I num
+lines of trailing context after matching lines.
.TP
.BI \-B " num"
-print
+Print
.I num
-lines of context before every matching line
+lines of leading context before matching lines.
.TP
.B \-C
-print 2 lines of context on each side of every match
-.TP
-.BI \- num
-print
-.I num
-lines of context on each side of every match
+Equivalent to
+.BR \-2 .
.TP
.B \-V
-print the version number on the diagnostic output
+Print the version number of
+.B grep
+to standard error. This version number should
+be included in all bug reports (see below).
.TP
.B \-b
-print every match preceded by its byte offset
+Print the byte offset within the input file before
+each line of output.
.TP
.B \-c
-print a total count of matching lines only
+Suppress normal output; instead print a count of
+matching lines for each input file.
+With the
+.B \-v
+option (see below), count non-matching lines.
.TP
-.BI \-e " expr"
-search for
-.IR expr ;
-useful if
-.I expr
-begins with \-
+.BI \-e " pattern"
+Use
+.I pattern
+as the pattern; useful to protect patterns beginning with
+.BR \- .
.TP
.BI \-f " file"
-search for the expression contained in
-.I file
+Obtain the pattern from
+.IR file .
.TP
.B \-h
-don't display filenames on matches
+Suppress the prefixing of filenames on output
+when multiple files are searched.
.TP
.B \-i
-ignore case difference when comparing strings
+Ignore case distinctions in both the
+.I pattern
+and the input files.
+.TP
+.B \-L
+Suppress normal output; instead print the name
+of each input file from which no output would
+normally have been printed.
.TP
.B \-l
-list files containing matches only
+Suppress normal output; instead print
+the name of each input file from which output
+would normally have been printed.
.TP
.B \-n
-print each match preceded by its line number
+Prefix each line of output with the line number
+within its input file.
+.TP
+.B \-q
+Quiet; suppress normal output.
.TP
.B \-s
-run silently producing no output except error messages
+Suppress error messages about nonexistent or unreadable files.
.TP
.B \-v
-print only lines that contain no matches for the <expr>
+Invert the sense of matching, to select non-matching lines.
.TP
.B \-w
-print only lines where the match is a complete word
+Select only those lines containing matches that form whole words.
+The test is that the matching substring must either be at the
+beginning of the line, or preceded by a non-word constituent
+character. Similarly, it must be either at the end of the line
+or followed by a non-word constituent character. Word-constituent
+characters are letters, digits, and the underscore.
.TP
.B \-x
-print only lines where the match is a whole line
-.SH "SEE ALSO"
-emacs(1), ed(1), sh(1),
-.I "GNU Emacs Manual"
-.SH INCOMPATIBILITIES
-The following incompatibilities with UNIX
-.I grep
-exist:
-.PP
-.RS 0.5i
-The context-dependent meaning of \(** is not quite the same (grep only).
+Select only those matches that exactly match the whole line.
+.PD
+.SH "REGULAR EXPRESSIONS"
.PP
-.B \-b
-prints a byte offset instead of a block offset.
+A regular expression is a pattern that describes a set of strings.
+Regular expressions are constructed analagously to arithmetic
+expressions, by using various operators to combine smaller expressions.
.PP
-The {\fIm,n\fP} construct of System V grep is not implemented.
+.B Grep
+understands two different versions of regular expression syntax:
+``basic'' and ``extended.'' In
+.RB "GNU\ " grep ,
+there is no difference in available functionality using either syntax.
+In other implementations, basic regular expressions are less powerful.
+The following description applies to extended regular expressions;
+differences for basic regular expressions are summarized afterwards.
.PP
-.SH BUGS
-GNU \fIe?grep\fP has been thoroughly debugged and tested over a period
-of several years; we think it's a reliable beast or we wouldn't
-distribute it. If by some fluke of the universe you discover a bug,
-send a detailed description (including options, regular expressions,
-and a copy of an input file that can reproduce it) to mike@ai.mit.edu.
+The fundamental building blocks are the regular expressions that match
+a single character. Most characters, including all letters and digits,
+are regular expressions that match themselves. Any metacharacter with
+special meaning may be quoted by preceding it with a backslash.
+.PP
+A list of characters enclosed by
+.B [
+and
+.B ]
+matches any single
+character in that list; if the first character of the list
+is the caret
+.B ^
+then it matches any character
+.I not
+in the list.
+For example, the regular expression
+.B [0123456789]
+matches any single digit. A range of ASCII characters
+may be specified by giving the first and last characters, separated
+by a hyphen.
+Finally, certain named classes of characters are predefined.
+Their names are self explanatory, and they are
+.BR [:alnum:] ,
+.BR [:alpha:] ,
+.BR [:cntrl:] ,
+.BR [:digit:] ,
+.BR [:graph:] ,
+.BR [:lower:] ,
+.BR [:print:] ,
+.BR [:punct:] ,
+.BR [:space:] ,
+.BR [:upper:] ,
+and
+.BR [:xdigit:].
+For example,
+.B [[:alnum:]]
+means
+.BR [0-9A-Za-z] ,
+except the latter form is dependent upon the ASCII character encoding,
+whereas the former is portable.
+(Note that the brackets in these class names are part of the symbolic
+names, and must be included in addition to the brackets delimiting
+the bracket list.) Most metacharacters lose their special meaning
+inside lists. To include a literal
+.B ]
+place it first in the list. Similarly, to include a literal
+.B ^
+place it anywhere but first. Finally, to include a literal
+.B \-
+place it last.
+.PP
+The period
+.B .
+matches any single character.
+The symbol
+.B \ew
+is a synonym for
+.B [[:alnum:]]
+and
+.B \eW
+is a synonym for
+.BR [^[:alnum]] .
+.PP
+The caret
+.B ^
+and the dollar sign
+.B $
+are metacharacters that respectively match the empty string at the
+beginning and end of a line.
+The symbols
+.B \e<
+and
+.B \e>
+respectively match the empty string at the beginning and end of a word.
+The symbol
+.B \eb
+matches the empty string at the edge of a word,
+and
+.B \eB
+matches the empty string provided it's
+.I not
+at the edge of a word.
+.PP
+A regular expression matching a single character may be followed
+by one of several repetition operators:
+.PD 0
+.TP
+.B ?
+The preceding item is optional and matched at most once.
+.TP
+.B *
+The preceding item will be matched zero or more times.
+.TP
+.B +
+The preceding item will be matched one or more times.
+.TP
+.BI { n }
+The preceding item is matched exactly
+.I n
+times.
+.TP
+.BI { n ,}
+The preceding item is matched
+.I n
+or more times.
+.TP
+.BI {, m }
+The preceding item is optional and is matched at most
+.I m
+times.
+.TP
+.BI { n , m }
+The preceding item is matched at least
+.I n
+times, but not more than
+.I m
+times.
+.PD
.PP
-.SH AUTHORS
-Mike Haertel wrote the deterministic regexp code and the bulk
-of the program.
+Two regular expressions may be concatenated; the resulting
+regular expression matches any string formed by concatenating
+two substrings that respectively match the concatenated
+subexpressions.
.PP
-James A. Woods is responsible for the hybridized search strategy
-of using Boyer-Moore-Gosper fixed-string search as a filter
-before calling the general regexp matcher.
+Two regular expressions may be joined by the infix operator
+.BR | ;
+the resulting regular expression matches any string matching
+either subexpression.
.PP
-Arthur David Olson contributed code that finds fixed strings for
-the aforementioned BMG search for a large class of regexps.
+Repetition takes precedence over concatenation, which in turn
+takes precedence over alternation. A whole subexpression may be
+enclosed in parentheses to override these precedence rules.
.PP
-Richard Stallman wrote the backtracking regexp matcher that is used
-for \\\fIdigit\fP backreferences, as well as the GNU getopt. The
-backtracking matcher was originally written for GNU Emacs.
+The backreference
+.BI \e n\c
+\&, where
+.I n
+is a single digit, matches the substring
+previously matched by the
+.IR n th
+parenthesized subexpression of the regular expression.
.PP
-D. A. Gwyn wrote the C alloca emulation that is provided so
-System V machines can run this program. (Alloca is used only
-by RMS' backtracking matcher, and then only rarely, so there
-is no loss if your machine doesn't have a "real" alloca.)
+In basic regular expressions the metacharacters
+.BR ? ,
+.BR + ,
+.BR { ,
+.BR | ,
+.BR ( ,
+and
+.BR )
+lose their special meaning; instead use the backslashed
+versions
+.BR \e? ,
+.BR \e+ ,
+.BR \e{ ,
+.BR \e| ,
+.BR \e( ,
+and
+.BR \e) .
.PP
-Scott Anderson and Henry Spencer designed the regression tests
-used in the "regress" script.
+In
+.B egrep
+the metacharacter
+.B {
+loses its special meaning; instead use
+.BR \e{ .
+.SH DIAGNOSTICS
+.PP
+Normally, exit status is 0 if matches were found,
+and 1 if no matches were found. (The
+.B \-v
+option inverts the sense of the exit status.)
+Exit status is 2 if there were syntax errors
+in the pattern, inaccessible input files, or
+other system errors.
+.SH BUGS
+.PP
+Email bug reports to
+.BR bug-gnu-utils@prep.ai.mit.edu .
+Be sure to include the word ``grep'' somewhere in the ``Subject:'' field.
+.PP
+Large repetition counts in the
+.BI { m , n }
+construct may cause grep to use lots of memory.
+In addition,
+certain other obscure regular expressions require exponential time
+and space, and may cause
+.B grep
+to run out of memory.
.PP
-Paul Placeway wrote the original version of this manual page.
+Backreferences are very slow, and may require exponential time.
diff --git a/gnu/usr.bin/grep/grep.c b/gnu/usr.bin/grep/grep.c
index 1c45c45..07872a1 100644
--- a/gnu/usr.bin/grep/grep.c
+++ b/gnu/usr.bin/grep/grep.c
@@ -1,5 +1,5 @@
-/* grep - print lines matching an extended regular expression
- Copyright (C) 1988 Free Software Foundation, Inc.
+/* grep.c - main driver file for grep.
+ Copyright (C) 1992 Free Software Foundation, Inc.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -13,569 +13,616 @@
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
- Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
+ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
-/* Written June, 1988 by Mike Haertel
- BMG speedups added July, 1988 by James A. Woods and Arthur David Olson */
-
+ Written July 1992 by Mike Haertel. */
+
+#include <errno.h>
#include <stdio.h>
-#if defined(USG) || defined(STDC_HEADERS)
-#include <string.h>
-#ifndef bcopy
-#define bcopy(s,d,n) memcpy((d),(s),(n))
+#ifndef errno
+extern int errno;
+#endif
+
+#ifdef STDC_HEADERS
+#include <stdlib.h>
+#else
+#include <sys/types.h>
+extern char *malloc(), *realloc();
+extern void free();
#endif
-#ifndef index
-#define index strchr
+
+#if defined(STDC_HEADERS) || defined(HAVE_STRING_H)
+#include <string.h>
+#ifdef NEED_MEMORY_H
+#include <memory.h>
#endif
#else
#include <strings.h>
+#ifdef __STDC__
+extern void *memchr();
+#else
+extern char *memchr();
+#endif
+#define strrchr rindex
#endif
#ifdef HAVE_UNISTD_H
+#include <sys/types.h>
+#include <fcntl.h>
#include <unistd.h>
+#else
+#define O_RDONLY 0
+extern int open(), read(), close();
#endif
-#ifndef STDC_HEADERS
-extern char *getenv();
-#endif
-extern int errno;
+#include "getpagesize.h"
+#include "grep.h"
-extern char *sys_errlist[];
+#undef MAX
+#define MAX(A,B) ((A) > (B) ? (A) : (B))
-#include "dfa.h"
-#include "regex.h"
-#include "getopt.h"
+/* Provide missing ANSI features if necessary. */
-/* Used by -w */
-#define WCHAR(C) (ISALNUM(C) || (C) == '_')
+#ifndef HAVE_STRERROR
+extern int sys_nerr;
+extern char *sys_errlist[];
+#define strerror(E) ((E) < sys_nerr ? sys_errlist[(E)] : "bogus error number")
+#endif
-#define MAX(a, b) ((a) > (b) ? (a) : (b))
+#ifndef HAVE_MEMCHR
+#ifdef __STDC__
+#define VOID void
+#else
+#define VOID char
+#endif
+VOID *
+memchr(vp, c, n)
+ VOID *vp;
+ int c;
+ size_t n;
+{
+ unsigned char *p;
-/* Exit status codes. */
-#define MATCHES_FOUND 0 /* Exit 0 if no errors and matches found. */
-#define NO_MATCHES_FOUND 1 /* Exit 1 if no matches were found. */
-#define ERROR 2 /* Exit 2 if some error occurred. */
+ for (p = (unsigned char *) vp; n--; ++p)
+ if (*p == c)
+ return (VOID *) p;
+ return 0;
+}
+#endif
+
+/* Define flags declared in grep.h. */
+char *matcher;
+int match_icase;
+int match_words;
+int match_lines;
-/* Error is set true if something awful happened. */
-static int error;
+/* Functions we'll use to search. */
+static void (*compile)();
+static char *(*execute)();
-/* The program name for error messages. */
+/* For error messages. */
static char *prog;
+static char *filename;
+static int errseen;
-/* We do all our own buffering by hand for efficiency. */
-static char *buffer; /* The buffer itself, grown as needed. */
-static bufbytes; /* Number of bytes in the buffer. */
-static size_t bufalloc; /* Number of bytes allocated to the buffer. */
-static bufprev; /* Number of bytes that have been forgotten.
- This is used to get byte offsets from the
- beginning of the file. */
-static bufread; /* Number of bytes to get with each read(). */
-
+/* Print a message and possibly an error string. Remember
+ that something awful happened. */
static void
-initialize_buffer()
+error(mesg, errnum)
+#ifdef __STDC__
+ const
+#endif
+ char *mesg;
+ int errnum;
+{
+ if (errnum)
+ fprintf(stderr, "%s: %s: %s\n", prog, mesg, strerror(errnum));
+ else
+ fprintf(stderr, "%s: %s\n", prog, mesg);
+ errseen = 1;
+}
+
+/* Like error(), but die horribly after printing. */
+void
+fatal(mesg, errnum)
+#ifdef __STDC__
+ const
+#endif
+ char *mesg;
+ int errnum;
{
- bufread = 8192;
- bufalloc = bufread + bufread / 2;
- buffer = malloc(bufalloc);
- if (! buffer)
+ error(mesg, errnum);
+ exit(2);
+}
+
+/* Interface to handle errors and fix library lossage. */
+char *
+xmalloc(size)
+ size_t size;
+{
+ char *result;
+
+ result = malloc(size);
+ if (size && !result)
+ fatal("memory exhausted", 0);
+ return result;
+}
+
+/* Interface to handle errors and fix some library lossage. */
+char *
+xrealloc(ptr, size)
+ char *ptr;
+ size_t size;
+{
+ char *result;
+
+ if (ptr)
+ result = realloc(ptr, size);
+ else
+ result = malloc(size);
+ if (size && !result)
+ fatal("memory exhausted", 0);
+ return result;
+}
+
+#if !defined(HAVE_VALLOC)
+#define valloc malloc
+#else
+#ifdef __STDC__
+extern void *valloc(size_t);
+#else
+extern char *valloc();
+#endif
+#endif
+
+/* Hairy buffering mechanism for grep. The intent is to keep
+ all reads aligned on a page boundary and multiples of the
+ page size. */
+
+static char *buffer; /* Base of buffer. */
+static size_t bufsalloc; /* Allocated size of buffer save region. */
+static size_t bufalloc; /* Total buffer size. */
+static int bufdesc; /* File descriptor. */
+static char *bufbeg; /* Beginning of user-visible stuff. */
+static char *buflim; /* Limit of user-visible stuff. */
+
+#if defined(HAVE_WORKING_MMAP)
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+
+static int bufmapped; /* True for ordinary files. */
+static struct stat bufstat; /* From fstat(). */
+static off_t bufoffset; /* What read() normally remembers. */
+#endif
+
+/* Reset the buffer for a new file. Initialize
+ on the first time through. */
+void
+reset(fd)
+ int fd;
+{
+ static int initialized;
+
+ if (!initialized)
{
- fprintf(stderr, "%s: Memory exhausted (%s)\n", prog,
- sys_errlist[errno]);
- exit(ERROR);
+ initialized = 1;
+#ifndef BUFSALLOC
+ bufsalloc = MAX(8192, getpagesize());
+#else
+ bufsalloc = BUFSALLOC;
+#endif
+ bufalloc = 5 * bufsalloc;
+ /* The 1 byte of overflow is a kludge for dfaexec(), which
+ inserts a sentinel newline at the end of the buffer
+ being searched. There's gotta be a better way... */
+ buffer = valloc(bufalloc + 1);
+ if (!buffer)
+ fatal("memory exhausted", 0);
+ bufbeg = buffer;
+ buflim = buffer;
}
+ bufdesc = fd;
+#if defined(HAVE_WORKING_MMAP)
+ if (fstat(fd, &bufstat) < 0 || !S_ISREG(bufstat.st_mode))
+ bufmapped = 0;
+ else
+ {
+ bufmapped = 1;
+ bufoffset = lseek(fd, 0, 1);
+ }
+#endif
}
-/* The current input file. */
-static fd;
-static char *filename;
-static eof;
-
-/* Fill the buffer retaining the last n bytes at the beginning of the
- newly filled buffer (for backward context). Returns the number of new
- bytes read from disk. */
+/* Read new stuff into the buffer, saving the specified
+ amount of old stuff. When we're done, 'bufbeg' points
+ to the beginning of the buffer contents, and 'buflim'
+ points just after the end. Return count of new stuff. */
static int
-fill_buffer_retaining(n)
- int n;
+fillbuf(save)
+ size_t save;
{
- char *p, *q;
- int i;
+ char *nbuffer, *dp, *sp;
+ int cc;
+#if defined(HAVE_WORKING_MMAP)
+ caddr_t maddr;
+#endif
+ static int pagesize;
+
+ if (pagesize == 0 && (pagesize = getpagesize()) == 0)
+ abort();
- /* See if we need to grow the buffer. */
- if (bufalloc - n <= bufread)
+ if (save > bufsalloc)
{
- while (bufalloc - n <= bufread)
+ while (save > bufsalloc)
+ bufsalloc *= 2;
+ bufalloc = 5 * bufsalloc;
+ nbuffer = valloc(bufalloc + 1);
+ if (!nbuffer)
+ fatal("memory exhausted", 0);
+ }
+ else
+ nbuffer = buffer;
+
+ sp = buflim - save;
+ dp = nbuffer + bufsalloc - save;
+ bufbeg = dp;
+ while (save--)
+ *dp++ = *sp++;
+
+ /* We may have allocated a new, larger buffer. Since
+ there is no portable vfree(), we just have to forget
+ about the old one. Sorry. */
+ buffer = nbuffer;
+
+#if defined(HAVE_WORKING_MMAP)
+ if (bufmapped && bufoffset % pagesize == 0
+ && bufstat.st_size - bufoffset >= bufalloc - bufsalloc)
+ {
+ maddr = buffer + bufsalloc;
+ maddr = mmap(maddr, bufalloc - bufsalloc, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_FIXED, bufdesc, bufoffset);
+ if (maddr == (caddr_t) -1)
{
- bufalloc *= 2;
- bufread *= 2;
+ fprintf(stderr, "%s: warning: %s: %s\n", filename,
+ strerror(errno));
+ goto tryread;
}
- buffer = realloc(buffer, bufalloc);
- if (! buffer)
+#if 0
+ /* You might thing this (or MADV_WILLNEED) would help,
+ but it doesn't, at least not on a Sun running 4.1.
+ In fact, it actually slows us down about 30%! */
+ madvise(maddr, bufalloc - bufsalloc, MADV_SEQUENTIAL);
+#endif
+ cc = bufalloc - bufsalloc;
+ bufoffset += cc;
+ }
+ else
+ {
+ tryread:
+ /* We come here when we're not going to use mmap() any more.
+ Note that we need to synchronize the file offset the
+ first time through. */
+ if (bufmapped)
{
- fprintf(stderr, "%s: Memory exhausted (%s)\n", prog,
- sys_errlist[errno]);
- exit(ERROR);
+ bufmapped = 0;
+ lseek(bufdesc, bufoffset, 0);
}
+ cc = read(bufdesc, buffer + bufsalloc, bufalloc - bufsalloc);
}
+#else
+ cc = read(bufdesc, buffer + bufsalloc, bufalloc - bufsalloc);
+#endif
+ if (cc > 0)
+ buflim = buffer + bufsalloc + cc;
+ else
+ buflim = buffer + bufsalloc;
+ return cc;
+}
- bufprev += bufbytes - n;
+/* Flags controlling the style of output. */
+static int out_quiet; /* Suppress all normal output. */
+static int out_invert; /* Print nonmatching stuff. */
+static int out_file; /* Print filenames. */
+static int out_line; /* Print line numbers. */
+static int out_byte; /* Print byte offsets. */
+static int out_before; /* Lines of leading context. */
+static int out_after; /* Lines of trailing context. */
+
+/* Internal variables to keep track of byte count, context, etc. */
+static size_t totalcc; /* Total character count before bufbeg. */
+static char *lastnl; /* Pointer after last newline counted. */
+static char *lastout; /* Pointer after last character output;
+ NULL if no character has been output
+ or if it's conceptually before bufbeg. */
+static size_t totalnl; /* Total newline count before lastnl. */
+static int pending; /* Pending lines of output. */
- /* Shift stuff down. */
- for (i = n, p = buffer, q = p + bufbytes - n; i--; )
- *p++ = *q++;
- bufbytes = n;
+static void
+nlscan(lim)
+ char *lim;
+{
+ char *beg;
- if (eof)
- return 0;
+ for (beg = lastnl; beg < lim; ++beg)
+ if (*beg == '\n')
+ ++totalnl;
+ lastnl = beg;
+}
- /* Read in new stuff. */
- i = read(fd, buffer + bufbytes, bufread);
- if (i < 0)
+static void
+prline(beg, lim, sep)
+ char *beg;
+ char *lim;
+ char sep;
+{
+ if (out_file)
+ printf("%s%c", filename, sep);
+ if (out_line)
{
- fprintf(stderr, "%s: read on %s failed (%s)\n", prog,
- filename ? filename : "<stdin>", sys_errlist[errno]);
- error = 1;
+ nlscan(beg);
+ printf("%d%c", ++totalnl, sep);
+ lastnl = lim;
}
+ if (out_byte)
+ printf("%lu%c", totalcc + (beg - bufbeg), sep);
+ fwrite(beg, 1, lim - beg, stdout);
+ if (ferror(stdout))
+ error("writing output", errno);
+ lastout = lim;
+}
- /* Kludge to pretend every nonempty file ends with a newline. */
- if (i == 0 && bufbytes > 0 && buffer[bufbytes - 1] != '\n')
+/* Print pending lines of trailing context prior to LIM. */
+static void
+prpending(lim)
+ char *lim;
+{
+ char *nl;
+
+ if (!lastout)
+ lastout = bufbeg;
+ while (pending > 0 && lastout < lim)
{
- eof = i = 1;
- buffer[bufbytes] = '\n';
+ --pending;
+ if ((nl = memchr(lastout, '\n', lim - lastout)) != 0)
+ ++nl;
+ else
+ nl = lim;
+ prline(lastout, nl, '-');
}
-
- bufbytes += i;
- return i;
}
-
-/* Various flags set according to the argument switches. */
-static trailing_context; /* Lines of context to show after matches. */
-static leading_context; /* Lines of context to show before matches. */
-static byte_count; /* Precede output lines the byte count of the
- first character on the line. */
-static no_filenames; /* Do not display filenames. */
-static line_numbers; /* Precede output lines with line numbers. */
-static silent; /* Produce no output at all. This switch
- is bogus, ever hear of /dev/null? */
-static int whole_word; /* Match only whole words. Note that if
- backreferences are used this depends on
- the regex routines getting leftmost-longest
- right, which they don't right now if |
- is also used. */
-static int whole_line; /* Match only whole lines. Backreference
- caveat applies here too. */
-static nonmatching_lines; /* Print lines that don't match the regexp. */
-
-static bmgexec; /* Invoke Boyer-Moore-Gosper routines */
-
-/* The compiled regular expression lives here. */
-static struct regexp reg;
-
-/* The compiled regular expression for the backtracking matcher lives here. */
-static struct re_pattern_buffer regex;
-
-/* Pointer in the buffer after the last character printed. */
-static char *printed_limit;
-
-/* True when printed_limit has been artifically advanced without printing
- anything. */
-static int printed_limit_fake;
-
-/* Print a line at the given line number, returning the number of
- characters actually printed. Matching is true if the line is to
- be considered a "matching line". This is only meaningful if
- surrounding context is turned on. */
-static int
-print_line(p, number, matching)
- char *p;
- int number;
- int matching;
+
+/* Print the lines between BEG and LIM. Deal with context crap.
+ If NLINESP is non-null, store a count of lines between BEG and LIM. */
+static void
+prtext(beg, lim, nlinesp)
+ char *beg;
+ char *lim;
+ int *nlinesp;
{
- int count = 0;
+ static int used; /* avoid printing "--" before any output */
+ char *bp, *p, *nl;
+ int i, n;
+
+ if (!out_quiet && pending > 0)
+ prpending(beg);
- if (silent)
+ p = beg;
+
+ if (!out_quiet)
{
- do
- ++count;
- while (*p++ != '\n');
- printed_limit_fake = 0;
- printed_limit = p;
- return count;
+ /* Deal with leading context crap. */
+
+ bp = lastout ? lastout : bufbeg;
+ for (i = 0; i < out_before; ++i)
+ if (p > bp)
+ do
+ --p;
+ while (p > bp && p[-1] != '\n');
+
+ /* We only print the "--" separator if our output is
+ discontiguous from the last output in the file. */
+ if ((out_before || out_after) && used && p != lastout)
+ puts("--");
+
+ while (p < beg)
+ {
+ nl = memchr(p, '\n', beg - p);
+ prline(p, nl + 1, '-');
+ p = nl + 1;
+ }
}
- if (filename && !no_filenames)
- printf("%s%c", filename, matching ? ':' : '-');
- if (byte_count)
- printf("%d%c", p - buffer + bufprev, matching ? ':' : '-');
- if (line_numbers)
- printf("%d%c", number, matching ? ':' : '-');
- do
+ if (nlinesp)
{
- ++count;
- putchar(*p);
+ /* Caller wants a line count. */
+ for (n = 0; p < lim; ++n)
+ {
+ if ((nl = memchr(p, '\n', lim - p)) != 0)
+ ++nl;
+ else
+ nl = lim;
+ if (!out_quiet)
+ prline(p, nl, ':');
+ p = nl;
+ }
+ *nlinesp = n;
}
- while (*p++ != '\n');
- printed_limit_fake = 0;
- printed_limit = p;
- return count;
+ else
+ if (!out_quiet)
+ prline(beg, lim, ':');
+
+ pending = out_after;
+ used = 1;
}
-/* Print matching or nonmatching lines from the current file. Returns a
- count of matching or nonmatching lines. */
+/* Scan the specified portion of the buffer, matching lines (or
+ between matching lines if OUT_INVERT is true). Return a count of
+ lines printed. */
static int
-grep()
+grepbuf(beg, lim)
+ char *beg;
+ char *lim;
{
- int retain = 0; /* Number of bytes to retain on next call
- to fill_buffer_retaining(). */
- char *search_limit; /* Pointer to the character after the last
- newline in the buffer. */
- char saved_char; /* Character after the last newline. */
- char *resume; /* Pointer to where to resume search. */
- int resume_index = 0; /* Count of characters to ignore after
- refilling the buffer. */
- int line_count = 1; /* Line number. */
- int try_backref; /* Set to true if we need to verify the
- match with a backtracking matcher. */
- int initial_line_count; /* Line count at beginning of last search. */
- char *match; /* Pointer to the first character after the
- string matching the regexp. */
- int match_count = 0; /* Count of matching lines. */
- char *matching_line; /* Pointer to first character of the matching
- line, or of the first line of context to
- print if context is turned on. */
- char *real_matching_line; /* Pointer to the first character of the
- real matching line. */
- char *next_line; /* Pointer to first character of the line
- following the matching line. */
- char *last_match_limit; /* Pointer after last matched line. */
- int pending_lines = 0; /* Lines of context left over from last match
- that we have to print. */
- static first_match = 1; /* True when nothing has been printed. */
- int i;
- char *tmp;
- char *execute();
+ int nlines, n;
+ register char *p, *b;
+ char *endp;
- printed_limit_fake = 0;
-
- while (fill_buffer_retaining(retain) > 0)
+ nlines = 0;
+ p = beg;
+ while ((b = (*execute)(p, lim - p, &endp)) != 0)
{
- /* Find the last newline in the buffer. */
- search_limit = buffer + bufbytes;
- while (search_limit > buffer && search_limit[-1] != '\n')
- --search_limit;
- if (search_limit == buffer)
+ /* Avoid matching the empty line at the end of the buffer. */
+ if (b == lim && ((b > beg && b[-1] == '\n') || b == beg))
+ break;
+ if (!out_invert)
{
- retain = bufbytes;
- continue;
+ prtext(b, endp, (int *) 0);
+ nlines += 1;
}
+ else if (p < b)
+ {
+ prtext(p, b, &n);
+ nlines += n;
+ }
+ p = endp;
+ }
+ if (out_invert && p < lim)
+ {
+ prtext(p, lim, &n);
+ nlines += n;
+ }
+ return nlines;
+}
- /* Save the character after the last newline so regexecute can write
- its own sentinel newline. */
- saved_char = *search_limit;
-
- /* Search the buffer for a match. */
- printed_limit = buffer;
- resume = buffer + resume_index;
- last_match_limit = resume;
- initial_line_count = line_count;
+/* Search a given file. Return a count of lines printed. */
+static int
+grep(fd)
+ int fd;
+{
+ int nlines, i;
+ size_t residue, save;
+ char *beg, *lim;
+ reset(fd);
- /* In retrospect, I have to say that the following code sucks.
- For an example of how to do this right, see the fgrep
- driver program that I wrote around a year later. I'm
- too lazy to retrofit that to egrep right now (the
- pattern matchers have different needs). */
+ totalcc = 0;
+ lastout = 0;
+ totalnl = 0;
+ pending = 0;
+ nlines = 0;
+ residue = 0;
+ save = 0;
- while (match = execute(&reg, resume, search_limit, 0, &line_count, &try_backref))
+ for (;;)
+ {
+ if (fillbuf(save) < 0)
{
- /* Find the beginning of the matching line. */
- matching_line = match;
- while (matching_line > resume && matching_line[-1] != '\n')
- --matching_line;
- real_matching_line = matching_line;
-
- /* Find the beginning of the next line. */
- next_line = match;
- while (next_line < search_limit && *next_line++ != '\n')
- ;
-
- /* If a potential backreference is indicated, try it out with
- a backtracking matcher to make sure the line is a match.
- This is hairy because we need to handle whole_line and
- whole_word matches specially. The method was stolen from
- GNU fgrep. */
- if (try_backref)
- {
- struct re_registers regs;
- int beg, len, maxlen, ret;
-
- beg = 0;
- for (maxlen = next_line - matching_line - 1; beg <= maxlen; ++beg)
- {
- /* See if the matching line matches when backreferences
- are considered... */
- ret = re_search (&regex, matching_line, maxlen,
- beg, maxlen - beg, &regs);
- if (ret == -1)
- goto fail;
- beg = ret;
- len = regs.end[0] - beg;
- /* Ok, now check if it subsumed the whole line if -x */
- if (whole_line && (beg != 0 || len != maxlen))
- goto fail;
- /* If -w then check if the match aligns with word
- boundaries. We have to do this iteratively, because
- (a) The line may contain more than one occurence
- of the pattern, and;
- (b) Several alternatives in the pattern might
- be valid at a given point, and we may need to
- consider a shorter one in order to align with
- word boundaries. */
- else if (whole_word)
- while (len > 0)
- {
- /* If it's preceeded by a word constituent, then no go. */
- if (beg > 0
- && WCHAR((unsigned char) matching_line[beg - 1]))
- break;
- /* If it's followed by a word constituent, look for
- a shorter match. */
- else if (beg + len < maxlen
- && WCHAR((unsigned char) matching_line[beg + len]))
- /* This is sheer incest. */
- len = re_match_2 (&regex, (unsigned char *) 0, 0,
- matching_line, maxlen,
- beg, &regs, beg + len - 1);
- else
- goto succeed;
- }
- else
- goto succeed;
- }
- fail:
- resume = next_line;
- if (resume == search_limit)
- break;
- else
- continue;
- }
-
- succeed:
- /* Print out the matching or nonmatching lines as necessary. */
- if (! nonmatching_lines)
- {
- /* Not -v, so nothing hairy... */
- ++match_count;
-
- /* Print leftover trailing context from last time around. */
- while (pending_lines && last_match_limit < matching_line)
- {
- last_match_limit += print_line(last_match_limit,
- initial_line_count++,
- 0);
- --pending_lines;
- }
-
- /* Back up over leading context if necessary. */
- for (i = leading_context;
- i > 0 && matching_line > printed_limit;
- --i)
- {
- while (matching_line > printed_limit
- && (--matching_line)[-1] != '\n')
- ;
- --line_count;
- }
-
- /* If context is enabled, we may have to print a separator. */
- if ((leading_context || trailing_context) && !silent
- && !first_match && (printed_limit_fake
- || matching_line > printed_limit))
- printf("----------\n");
- first_match = 0;
-
- /* Print the matching line and its leading context. */
- while (matching_line < real_matching_line)
- matching_line += print_line(matching_line, line_count++, 0);
- matching_line += print_line(matching_line, line_count++, 1);
-
- /* If there's trailing context, leave some lines pending until
- next time. */
- pending_lines = trailing_context;
- }
- else if (matching_line == last_match_limit)
- {
- /* In the -v case, this is where we deal with leftover
- trailing context from last time... */
- if (pending_lines > 0)
- {
- --pending_lines;
- print_line(matching_line, line_count, 0);
- }
- ++line_count;
- }
- else if (matching_line > last_match_limit)
- {
- char *start = last_match_limit;
-
- /* Back up over leading context if necessary. */
- for (i = leading_context; start > printed_limit && i; --i)
- {
- while (start > printed_limit && (--start)[-1] != '\n')
- ;
- --initial_line_count;
- }
-
- /* If context is enabled, we may have to print a separator. */
- if ((leading_context || trailing_context) && !silent
- && !first_match && (printed_limit_fake
- || start > printed_limit))
- printf("----------\n");
- first_match = 0;
-
- /* Print out the presumably matching leading context. */
- while (start < last_match_limit)
- start += print_line(start, initial_line_count++, 0);
-
- /* Print out the nonmatching lines prior to the matching line. */
- while (start < matching_line)
- {
- /* This counts as a "matching line" in -v. */
- ++match_count;
- start += print_line(start, initial_line_count++, 1);
- }
-
- /* Deal with trailing context. In -v what this means is
- we print the current (matching) line, marked as a non
- matching line. */
- if (trailing_context)
- {
- print_line(matching_line, line_count, 0);
- pending_lines = trailing_context - 1;
- }
-
- /* Count the current line. */
- ++line_count;
- }
- else
- /* Let us pray this never happens... */
- abort();
-
- /* Resume searching at the beginning of the next line. */
- initial_line_count = line_count;
- resume = next_line;
- last_match_limit = next_line;
-
- if (resume == search_limit)
- break;
+ error(filename, errno);
+ return nlines;
}
-
- /* Restore the saved character. */
- *search_limit = saved_char;
-
- if (! nonmatching_lines)
+ lastnl = bufbeg;
+ if (lastout)
+ lastout = bufbeg;
+ if (buflim - bufbeg == save)
+ break;
+ beg = bufbeg + save - residue;
+ for (lim = buflim; lim > beg && lim[-1] != '\n'; --lim)
+ ;
+ residue = buflim - lim;
+ if (beg < lim)
{
- while (last_match_limit < search_limit && pending_lines)
- {
- last_match_limit += print_line(last_match_limit,
- initial_line_count++,
- 0);
- --pending_lines;
- }
+ nlines += grepbuf(beg, lim);
+ if (pending)
+ prpending(lim);
}
- else if (search_limit > last_match_limit)
+ i = 0;
+ beg = lim;
+ while (i < out_before && beg > bufbeg && beg != lastout)
{
- char *start = last_match_limit;
-
- /* Back up over leading context if necessary. */
- for (i = leading_context; start > printed_limit && i; --i)
- {
- while (start > printed_limit && (--start)[-1] != '\n')
- ;
- --initial_line_count;
- }
-
- /* If context is enabled, we may have to print a separator. */
- if ((leading_context || trailing_context) && !silent
- && !first_match && (printed_limit_fake
- || start > printed_limit))
- printf("----------\n");
- first_match = 0;
-
- /* Print out all the nonmatching lines up to the search limit. */
- while (start < last_match_limit)
- start += print_line(start, initial_line_count++, 0);
- while (start < search_limit)
- {
- ++match_count;
- start += print_line(start, initial_line_count++, 1);
- }
-
- pending_lines = trailing_context;
- resume_index = 0;
- retain = bufbytes - (search_limit - buffer);
- continue;
+ ++i;
+ do
+ --beg;
+ while (beg > bufbeg && beg[-1] != '\n');
}
-
- /* Save the trailing end of the buffer for possible use as leading
- context in the future. */
- i = leading_context;
- tmp = search_limit;
- while (tmp > printed_limit && i--)
- while (tmp > printed_limit && (--tmp)[-1] != '\n')
- ;
- resume_index = search_limit - tmp;
- retain = bufbytes - (tmp - buffer);
- if (tmp > printed_limit)
- printed_limit_fake = 1;
+ if (beg != lastout)
+ lastout = 0;
+ save = residue + lim - beg;
+ totalcc += buflim - bufbeg - save;
+ if (out_line)
+ nlscan(beg);
}
-
- return match_count;
+ if (residue)
+ {
+ nlines += grepbuf(bufbeg + save - residue, buflim);
+ if (pending)
+ prpending(buflim);
+ }
+ return nlines;
}
-
-void
-usage_and_die()
+
+static char version[] = "GNU grep version 2.0";
+
+#define USAGE \
+ "usage: %s [-[[AB] ]<num>] [-[CEFGVchilnqsvwx]] [-[ef]] <expr> [<files...>]\n"
+
+static void
+usage()
{
- fprintf(stderr, "\
-Usage: %s [-CVbchilnsvwx] [-num] [-A num] [-B num] [-f file]\n\
- [-e] expr [file...]\n", prog);
- exit(ERROR);
+ fprintf(stderr, USAGE, prog);
+ exit(2);
}
-static char version[] = "GNU e?grep, version 1.6";
+/* Go through the matchers vector and look for the specified matcher.
+ If we find it, install it in compile and execute, and return 1. */
+int
+setmatcher(name)
+ char *name;
+{
+ int i;
+
+ for (i = 0; matchers[i].name; ++i)
+ if (strcmp(name, matchers[i].name) == 0)
+ {
+ compile = matchers[i].compile;
+ execute = matchers[i].execute;
+ return 1;
+ }
+ return 0;
+}
int
main(argc, argv)
int argc;
- char **argv;
+ char *argv[];
{
- int c;
- int ignore_case = 0; /* Compile the regexp to ignore case. */
- char *the_regexp = 0; /* The regular expression. */
- int regexp_len; /* Length of the regular expression. */
- char *regexp_file = 0; /* File containing parallel regexps. */
- int count_lines = 0; /* Display only a count of matching lines. */
- int list_files = 0; /* Display only the names of matching files. */
- int line_count = 0; /* Count of matching lines for a file. */
- int matches_found = 0; /* True if matches were found. */
- char *regex_errmesg; /* Error message from regex routines. */
- char translate[_NOTCHAR]; /* Translate table for case conversion
- (needed by the backtracking matcher). */
-
- if (prog = index(argv[0], '/'))
- ++prog;
- else
- prog = argv[0];
-
- opterr = 0;
- while ((c = getopt(argc, argv, "0123456789A:B:CVbce:f:hilnsvwx")) != EOF)
- switch (c)
+ char *keys;
+ size_t keycc, oldcc, keyalloc;
+ int keyfound, count_matches, no_filenames, list_files, suppress_errors;
+ int opt, cc, desc, count, status;
+ FILE *fp;
+ extern char *optarg;
+ extern int optind;
+
+ prog = argv[0];
+ if (prog && strrchr(prog, '/'))
+ prog = strrchr(prog, '/') + 1;
+
+ keys = NULL;
+ keycc = 0;
+ keyfound = 0;
+ count_matches = 0;
+ no_filenames = 0;
+ list_files = 0;
+ suppress_errors = 0;
+ matcher = NULL;
+
+ while ((opt = getopt(argc, argv, "0123456789A:B:CEFGVX:bce:f:hiLlnqsvwxy"))
+ != EOF)
+ switch (opt)
{
- case '?':
- usage_and_die();
- break;
-
case '0':
case '1':
case '2':
@@ -586,422 +633,194 @@ main(argc, argv)
case '7':
case '8':
case '9':
- trailing_context = 10 * trailing_context + c - '0';
- leading_context = 10 * leading_context + c - '0';
+ out_before = 10 * out_before + opt - '0';
+ out_after = 10 * out_after + opt - '0';
break;
-
case 'A':
- if (! sscanf(optarg, "%d", &trailing_context)
- || trailing_context < 0)
- usage_and_die();
+ out_after = atoi(optarg);
+ if (out_after < 0)
+ usage();
break;
-
case 'B':
- if (! sscanf(optarg, "%d", &leading_context)
- || leading_context < 0)
- usage_and_die();
+ out_before = atoi(optarg);
+ if (out_before < 0)
+ usage();
break;
-
case 'C':
- trailing_context = leading_context = 2;
+ out_before = out_after = 2;
+ break;
+ case 'E':
+ if (matcher && strcmp(matcher, "egrep") != 0)
+ fatal("you may specify only one of -E, -F, or -G", 0);
+ matcher = "posix-egrep";
+ break;
+ case 'F':
+ if (matcher && strcmp(matcher, "fgrep") != 0)
+ fatal("you may specify only one of -E, -F, or -G", 0);;
+ matcher = "fgrep";
+ break;
+ case 'G':
+ if (matcher && strcmp(matcher, "grep") != 0)
+ fatal("you may specify only one of -E, -F, or -G", 0);
+ matcher = "grep";
break;
-
case 'V':
fprintf(stderr, "%s\n", version);
break;
-
+ case 'X':
+ if (matcher)
+ fatal("matcher already specified", 0);
+ matcher = optarg;
+ break;
case 'b':
- byte_count = 1;
+ out_byte = 1;
break;
-
case 'c':
- count_lines = 1;
- silent = 1;
+ out_quiet = 1;
+ count_matches = 1;
break;
-
case 'e':
- /* It doesn't make sense to mix -f and -e. */
- if (regexp_file)
- usage_and_die();
- the_regexp = optarg;
+ cc = strlen(optarg);
+ keys = xrealloc(keys, keycc + cc + 1);
+ if (keyfound)
+ keys[keycc++] = '\n';
+ strcpy(&keys[keycc], optarg);
+ keycc += cc;
+ keyfound = 1;
break;
-
case 'f':
- /* It doesn't make sense to mix -f and -e. */
- if (the_regexp)
- usage_and_die();
- regexp_file = optarg;
+ fp = strcmp(optarg, "-") != 0 ? fopen(optarg, "r") : stdin;
+ if (!fp)
+ fatal(optarg, errno);
+ for (keyalloc = 1; keyalloc <= keycc; keyalloc *= 2)
+ ;
+ keys = xrealloc(keys, keyalloc);
+ oldcc = keycc;
+ if (keyfound)
+ keys[keycc++] = '\n';
+ while (!feof(fp)
+ && (cc = fread(keys + keycc, 1, keyalloc - keycc, fp)) > 0)
+ {
+ keycc += cc;
+ if (keycc == keyalloc)
+ keys = xrealloc(keys, keyalloc *= 2);
+ }
+ if (fp != stdin)
+ fclose(fp);
+ /* Nuke the final newline to avoid matching a null string. */
+ if (keycc - oldcc > 0 && keys[keycc - 1] == '\n')
+ --keycc;
+ keyfound = 1;
break;
-
case 'h':
no_filenames = 1;
break;
-
case 'i':
- ignore_case = 1;
- for (c = 0; c < _NOTCHAR; ++c)
- if (isupper(c))
- translate[c] = tolower(c);
- else
- translate[c] = c;
- regex.translate = translate;
+ case 'y': /* For old-timers . . . */
+ match_icase = 1;
+ break;
+ case 'L':
+ /* Like -l, except list files that don't contain matches.
+ Inspired by the same option in Hume's gre. */
+ out_quiet = 1;
+ list_files = -1;
break;
-
case 'l':
+ out_quiet = 1;
list_files = 1;
- silent = 1;
break;
-
case 'n':
- line_numbers = 1;
+ out_line = 1;
+ break;
+ case 'q':
+ out_quiet = 1;
break;
-
case 's':
- silent = 1;
+ suppress_errors = 1;
break;
-
case 'v':
- nonmatching_lines = 1;
+ out_invert = 1;
break;
-
case 'w':
- whole_word = 1;
+ match_words = 1;
break;
-
case 'x':
- whole_line = 1;
+ match_lines = 1;
break;
-
default:
- /* This can't happen. */
- fprintf(stderr, "%s: getopt(3) let one by!\n", prog);
- usage_and_die();
+ usage();
break;
}
- /* Set the syntax depending on whether we are EGREP or not. */
-#ifdef EGREP
- regsyntax(RE_SYNTAX_EGREP, ignore_case);
- re_set_syntax(RE_SYNTAX_EGREP);
-#else
- regsyntax(RE_SYNTAX_GREP, ignore_case);
- re_set_syntax(RE_SYNTAX_GREP);
-#endif
+ if (!keyfound)
+ if (optind < argc)
+ {
+ keys = argv[optind++];
+ keycc = strlen(keys);
+ }
+ else
+ usage();
- /* Compile the regexp according to all the options. */
- if (regexp_file)
- {
- FILE *fp = fopen(regexp_file, "r");
- int len = 256;
- int i = 0;
+ if (!matcher)
+ matcher = prog;
- if (! fp)
- {
- fprintf(stderr, "%s: %s: %s\n", prog, regexp_file,
- sys_errlist[errno]);
- exit(ERROR);
- }
+ if (!setmatcher(matcher) && !setmatcher("default"))
+ abort();
- the_regexp = malloc(len);
- while ((c = getc(fp)) != EOF)
- {
- the_regexp[i++] = c;
- if (i == len)
- the_regexp = realloc(the_regexp, len *= 2);
- }
- fclose(fp);
- /* Nuke the concluding newline so we won't match the empty string. */
- if (i > 0 && the_regexp[i - 1] == '\n')
- --i;
- regexp_len = i;
- }
- else if (! the_regexp)
- {
- if (optind >= argc)
- usage_and_die();
- the_regexp = argv[optind++];
- regexp_len = strlen(the_regexp);
- }
- else
- regexp_len = strlen(the_regexp);
-
- if (whole_word || whole_line)
- {
- /* In the whole-word case, we use the pattern:
- (^|[^A-Za-z_])(userpattern)([^A-Za-z_]|$).
- In the whole-line case, we use the pattern:
- ^(userpattern)$.
- BUG: Using [A-Za-z_] is locale-dependent! */
-
- char *n = malloc(regexp_len + 50);
- int i = 0;
-
-#ifdef EGREP
- if (whole_word)
- strcpy(n, "(^|[^A-Za-z_])(");
- else
- strcpy(n, "^(");
-#else
- /* Todo: Make *sure* this is the right syntax. Down with grep! */
- if (whole_word)
- strcpy(n, "\\(^\\|[^A-Za-z_]\\)\\(");
- else
- strcpy(n, "^\\(");
-#endif
- i = strlen(n);
- bcopy(the_regexp, n + i, regexp_len);
- i += regexp_len;
-#ifdef EGREP
- if (whole_word)
- strcpy(n + i, ")([^A-Za-z_]|$)");
- else
- strcpy(n + i, ")$");
-#else
- if (whole_word)
- strcpy(n + i, "\\)\\([^A-Za-z_]\\|$\\)");
- else
- strcpy(n + i, "\\)$");
-#endif
- i += strlen(n + i);
- regcompile(n, i, &reg, 1);
- }
- else
- regcompile(the_regexp, regexp_len, &reg, 1);
-
-
- if (regex_errmesg = re_compile_pattern(the_regexp, regexp_len, &regex))
- regerror(regex_errmesg);
-
- /*
- Find the longest metacharacter-free string which must occur in the
- regexpr, before short-circuiting regexecute() with Boyer-Moore-Gosper.
- (Conjecture: The problem in general is NP-complete.) If there is no
- such string (like for many alternations), then default to full automaton
- search. regmust() code and heuristics [see dfa.c] courtesy
- Arthur David Olson.
- */
- if (line_numbers == 0 && nonmatching_lines == 0)
- {
- if (reg.mustn == 0 || reg.mustn == MUST_MAX ||
- index(reg.must, '\0') != reg.must + reg.mustn)
- bmgexec = 0;
- else
- {
- reg.must[reg.mustn] = '\0';
- if (getenv("MUSTDEBUG") != NULL)
- (void) printf("must have: \"%s\"\n", reg.must);
- bmg_setup(reg.must, ignore_case);
- bmgexec = 1;
- }
- }
-
- if (argc - optind < 2)
- no_filenames = 1;
+ (*compile)(keys, keycc);
+
+ if (argc - optind > 1 && !no_filenames)
+ out_file = 1;
- initialize_buffer();
+ status = 1;
- if (argc > optind)
+ if (optind < argc)
while (optind < argc)
{
- bufprev = eof = 0;
- filename = argv[optind++];
- fd = open(filename, 0, 0);
- if (fd < 0)
+ desc = strcmp(argv[optind], "-") ? open(argv[optind], O_RDONLY) : 0;
+ if (desc < 0)
{
- fprintf(stderr, "%s: %s: %s\n", prog, filename,
- sys_errlist[errno]);
- error = 1;
- continue;
+ if (!suppress_errors)
+ error(argv[optind], errno);
}
- if (line_count = grep())
- matches_found = 1;
- close(fd);
- if (count_lines)
- if (!no_filenames)
- printf("%s:%d\n", filename, line_count);
- else
- printf("%d\n", line_count);
- else if (list_files && line_count)
- printf("%s\n", filename);
+ else
+ {
+ filename = desc == 0 ? "(standard input)" : argv[optind];
+ count = grep(desc);
+ if (count_matches)
+ {
+ if (out_file)
+ printf("%s:", filename);
+ printf("%d\n", count);
+ }
+ if (count)
+ {
+ status = 0;
+ if (list_files == 1)
+ printf("%s\n", filename);
+ }
+ else if (list_files == -1)
+ printf("%s\n", filename);
+ }
+ if (desc != 0)
+ close(desc);
+ ++optind;
}
else
{
- if (line_count = grep())
- matches_found = 1;
- if (count_lines)
- printf("%d\n", line_count);
- else if (list_files && line_count)
- printf("<stdin>\n");
- }
-
- if (error)
- exit(ERROR);
- if (matches_found)
- exit(MATCHES_FOUND);
- exit(NO_MATCHES_FOUND);
- return NO_MATCHES_FOUND;
-}
-
-/* Needed by the regexp routines. This could be fancier, especially when
- dealing with parallel regexps in files. */
-void
-regerror(s)
- const char *s;
-{
- fprintf(stderr, "%s: %s\n", prog, s);
- exit(ERROR);
-}
-
-/*
- bmg_setup() and bmg_search() adapted from:
- Boyer/Moore/Gosper-assisted 'egrep' search, with delta0 table as in
- original paper (CACM, October, 1977). No delta1 or delta2. According to
- experiment (Horspool, Soft. Prac. Exp., 1982), delta2 is of minimal
- practical value. However, to improve for worst case input, integrating
- the improved Galil strategies (Apostolico/Giancarlo, Siam. J. Comput.,
- February 1986) deserves consideration.
-
- James A. Woods Copyleft (C) 1986, 1988
- NASA Ames Research Center
-*/
-
-char *
-execute(r, begin, end, newline, count, try_backref)
- struct regexp *r;
- char *begin;
- char *end;
- int newline;
- int *count;
- int *try_backref;
-{
- register char *p, *s;
- char *match;
- char *start = begin;
- char save; /* regexecute() sentinel */
- int len;
- char *bmg_search();
-
- if (!bmgexec) /* full automaton search */
- return(regexecute(r, begin, end, newline, count, try_backref));
- else
- {
- len = end - begin;
- while ((match = bmg_search((unsigned char *) start, len)) != NULL)
- {
- p = match; /* narrow search range to submatch line */
- while (p > begin && *p != '\n')
- p--;
- s = match;
- while (s < end && *s != '\n')
- s++;
- s++;
-
- save = *s;
- *s = '\0';
- match = regexecute(r, p, s, newline, count, try_backref);
- *s = save;
-
- if (match != NULL)
- return((char *) match);
- else
- {
- start = s;
- len = end - start;
- }
- }
- return(NULL);
- }
-}
-
-int delta0[256];
-unsigned char cmap[256]; /* (un)folded characters */
-unsigned char pattern[5000];
-int patlen;
-
-char *
-bmg_search(buffer, buflen)
- unsigned char *buffer;
- int buflen;
-{
- register unsigned char *k, *strend, *s, *buflim;
- register int t;
- int j;
-
- if (patlen > buflen)
- return NULL;
-
- buflim = buffer + buflen;
- if (buflen > patlen * 4)
- strend = buflim - patlen * 4;
- else
- strend = buffer;
-
- s = buffer;
- k = buffer + patlen - 1;
-
- for (;;)
- {
- /* The dreaded inner loop, revisited. */
- while (k < strend && (t = delta0[*k]))
+ filename = "(standard input)";
+ count = grep(0);
+ if (count_matches)
+ printf("%d\n", count);
+ if (count)
{
- k += t;
- k += delta0[*k];
- k += delta0[*k];
+ status = 0;
+ if (list_files == 1)
+ printf("(standard input)\n");
}
- while (k < buflim && delta0[*k])
- ++k;
- if (k == buflim)
- break;
-
- j = patlen - 1;
- s = k;
- while (--j >= 0 && cmap[*--s] == pattern[j])
- ;
- /*
- delta-less shortcut for literati, but
- short shrift for genetic engineers.
- */
- if (j >= 0)
- k++;
- else /* submatch */
- return ((char *)k);
+ else if (list_files == -1)
+ printf("(standard input)\n");
}
- return(NULL);
-}
-int
-bmg_setup(pat, folded) /* compute "boyer-moore" delta table */
- char *pat;
- int folded;
-{ /* ... HAKMEM lives ... */
- int j;
-
- patlen = strlen(pat);
-
- if (folded) /* fold case while saving pattern */
- for (j = 0; j < patlen; j++)
- pattern[j] = (isupper((int) pat[j]) ?
- (char) tolower((int) pat[j]) : pat[j]);
- else
- bcopy(pat, pattern, patlen);
-
- for (j = 0; j < 256; j++)
- {
- delta0[j] = patlen;
- cmap[j] = (char) j; /* could be done at compile time */
- }
- for (j = 0; j < patlen - 1; j++)
- delta0[pattern[j]] = patlen - j - 1;
- delta0[pattern[patlen - 1]] = 0;
-
- if (folded)
- {
- for (j = 0; j < patlen - 1; j++)
- if (islower((int) pattern[j]))
- delta0[toupper((int) pattern[j])] = patlen - j - 1;
- if (islower((int) pattern[patlen - 1]))
- delta0[toupper((int) pattern[patlen - 1])] = 0;
- for (j = 'A'; j <= 'Z'; j++)
- cmap[j] = (char) tolower((int) j);
- }
+ exit(errseen ? 2 : status);
}
diff --git a/gnu/usr.bin/grep/grep.h b/gnu/usr.bin/grep/grep.h
new file mode 100644
index 0000000..a3316c5
--- /dev/null
+++ b/gnu/usr.bin/grep/grep.h
@@ -0,0 +1,53 @@
+/* grep.h - interface to grep driver for searching subroutines.
+ Copyright (C) 1992 Free Software Foundation, Inc.
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2, or (at your option)
+ any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
+
+#if __STDC__
+
+extern void fatal(const char *, int);
+
+/* Grep.c expects the matchers vector to be terminated
+ by an entry with a NULL name, and to contain at least
+ an entry named "default". */
+
+extern struct matcher
+{
+ char *name;
+ void (*compile)(char *, size_t);
+ char *(*execute)(char *, size_t, char **);
+} matchers[];
+
+#else
+
+extern void fatal();
+
+extern struct matcher
+{
+ char *name;
+ void (*compile)();
+ char *(*execute)();
+} matchers[];
+
+#endif
+
+/* Exported from grep.c. */
+extern char *matcher;
+
+/* The following flags are exported from grep for the matchers
+ to look at. */
+extern int match_icase; /* -i */
+extern int match_words; /* -w */
+extern int match_lines; /* -x */
diff --git a/gnu/usr.bin/grep/kwset.c b/gnu/usr.bin/grep/kwset.c
new file mode 100644
index 0000000..9b09071
--- /dev/null
+++ b/gnu/usr.bin/grep/kwset.c
@@ -0,0 +1,805 @@
+/* kwset.c - search for any of a set of keywords.
+ Copyright 1989 Free Software Foundation
+ Written August 1989 by Mike Haertel.
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 1, or (at your option)
+ any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ The author may be reached (Email) at the address mike@ai.mit.edu,
+ or (US mail) as Mike Haertel c/o Free Software Foundation. */
+
+/* The algorithm implemented by these routines bears a startling resemblence
+ to one discovered by Beate Commentz-Walter, although it is not identical.
+ See "A String Matching Algorithm Fast on the Average," Technical Report,
+ IBM-Germany, Scientific Center Heidelberg, Tiergartenstrasse 15, D-6900
+ Heidelberg, Germany. See also Aho, A.V., and M. Corasick, "Efficient
+ String Matching: An Aid to Bibliographic Search," CACM June 1975,
+ Vol. 18, No. 6, which describes the failure function used below. */
+
+
+#ifdef STDC_HEADERS
+#include <limits.h>
+#include <stdlib.h>
+#else
+#define INT_MAX 2147483647
+#define UCHAR_MAX 255
+#ifdef __STDC__
+#include <stddef.h>
+#else
+#include <sys/types.h>
+#endif
+extern char *malloc();
+extern void free();
+#endif
+
+#ifdef HAVE_MEMCHR
+#include <string.h>
+#ifdef NEED_MEMORY_H
+#include <memory.h>
+#endif
+#else
+#ifdef __STDC__
+extern void *memchr();
+#else
+extern char *memchr();
+#endif
+#endif
+
+#ifdef GREP
+extern char *xmalloc();
+#define malloc xmalloc
+#endif
+
+#include "kwset.h"
+#include "obstack.h"
+
+#define NCHAR (UCHAR_MAX + 1)
+#define obstack_chunk_alloc malloc
+#define obstack_chunk_free free
+
+/* Balanced tree of edges and labels leaving a given trie node. */
+struct tree
+{
+ struct tree *llink; /* Left link; MUST be first field. */
+ struct tree *rlink; /* Right link (to larger labels). */
+ struct trie *trie; /* Trie node pointed to by this edge. */
+ unsigned char label; /* Label on this edge. */
+ char balance; /* Difference in depths of subtrees. */
+};
+
+/* Node of a trie representing a set of reversed keywords. */
+struct trie
+{
+ unsigned int accepting; /* Word index of accepted word, or zero. */
+ struct tree *links; /* Tree of edges leaving this node. */
+ struct trie *parent; /* Parent of this node. */
+ struct trie *next; /* List of all trie nodes in level order. */
+ struct trie *fail; /* Aho-Corasick failure function. */
+ int depth; /* Depth of this node from the root. */
+ int shift; /* Shift function for search failures. */
+ int maxshift; /* Max shift of self and descendents. */
+};
+
+/* Structure returned opaquely to the caller, containing everything. */
+struct kwset
+{
+ struct obstack obstack; /* Obstack for node allocation. */
+ int words; /* Number of words in the trie. */
+ struct trie *trie; /* The trie itself. */
+ int mind; /* Minimum depth of an accepting node. */
+ int maxd; /* Maximum depth of any node. */
+ unsigned char delta[NCHAR]; /* Delta table for rapid search. */
+ struct trie *next[NCHAR]; /* Table of children of the root. */
+ char *target; /* Target string if there's only one. */
+ int mind2; /* Used in Boyer-Moore search for one string. */
+ char *trans; /* Character translation table. */
+};
+
+/* Allocate and initialize a keyword set object, returning an opaque
+ pointer to it. Return NULL if memory is not available. */
+kwset_t
+kwsalloc(trans)
+ char *trans;
+{
+ struct kwset *kwset;
+
+ kwset = (struct kwset *) malloc(sizeof (struct kwset));
+ if (!kwset)
+ return 0;
+
+ obstack_init(&kwset->obstack);
+ kwset->words = 0;
+ kwset->trie
+ = (struct trie *) obstack_alloc(&kwset->obstack, sizeof (struct trie));
+ if (!kwset->trie)
+ {
+ kwsfree((kwset_t) kwset);
+ return 0;
+ }
+ kwset->trie->accepting = 0;
+ kwset->trie->links = 0;
+ kwset->trie->parent = 0;
+ kwset->trie->next = 0;
+ kwset->trie->fail = 0;
+ kwset->trie->depth = 0;
+ kwset->trie->shift = 0;
+ kwset->mind = INT_MAX;
+ kwset->maxd = -1;
+ kwset->target = 0;
+ kwset->trans = trans;
+
+ return (kwset_t) kwset;
+}
+
+/* Add the given string to the contents of the keyword set. Return NULL
+ for success, an error message otherwise. */
+char *
+kwsincr(kws, text, len)
+ kwset_t kws;
+ char *text;
+ size_t len;
+{
+ struct kwset *kwset;
+ register struct trie *trie;
+ register unsigned char label;
+ register struct tree *link;
+ register int depth;
+ struct tree *links[12];
+ enum { L, R } dirs[12];
+ struct tree *t, *r, *l, *rl, *lr;
+
+ kwset = (struct kwset *) kws;
+ trie = kwset->trie;
+ text += len;
+
+ /* Descend the trie (built of reversed keywords) character-by-character,
+ installing new nodes when necessary. */
+ while (len--)
+ {
+ label = kwset->trans ? kwset->trans[(unsigned char) *--text] : *--text;
+
+ /* Descend the tree of outgoing links for this trie node,
+ looking for the current character and keeping track
+ of the path followed. */
+ link = trie->links;
+ links[0] = (struct tree *) &trie->links;
+ dirs[0] = L;
+ depth = 1;
+
+ while (link && label != link->label)
+ {
+ links[depth] = link;
+ if (label < link->label)
+ dirs[depth++] = L, link = link->llink;
+ else
+ dirs[depth++] = R, link = link->rlink;
+ }
+
+ /* The current character doesn't have an outgoing link at
+ this trie node, so build a new trie node and install
+ a link in the current trie node's tree. */
+ if (!link)
+ {
+ link = (struct tree *) obstack_alloc(&kwset->obstack,
+ sizeof (struct tree));
+ if (!link)
+ return "memory exhausted";
+ link->llink = 0;
+ link->rlink = 0;
+ link->trie = (struct trie *) obstack_alloc(&kwset->obstack,
+ sizeof (struct trie));
+ if (!link->trie)
+ return "memory exhausted";
+ link->trie->accepting = 0;
+ link->trie->links = 0;
+ link->trie->parent = trie;
+ link->trie->next = 0;
+ link->trie->fail = 0;
+ link->trie->depth = trie->depth + 1;
+ link->trie->shift = 0;
+ link->label = label;
+ link->balance = 0;
+
+ /* Install the new tree node in its parent. */
+ if (dirs[--depth] == L)
+ links[depth]->llink = link;
+ else
+ links[depth]->rlink = link;
+
+ /* Back up the tree fixing the balance flags. */
+ while (depth && !links[depth]->balance)
+ {
+ if (dirs[depth] == L)
+ --links[depth]->balance;
+ else
+ ++links[depth]->balance;
+ --depth;
+ }
+
+ /* Rebalance the tree by pointer rotations if necessary. */
+ if (depth && ((dirs[depth] == L && --links[depth]->balance)
+ || (dirs[depth] == R && ++links[depth]->balance)))
+ {
+ switch (links[depth]->balance)
+ {
+ case (char) -2:
+ switch (dirs[depth + 1])
+ {
+ case L:
+ r = links[depth], t = r->llink, rl = t->rlink;
+ t->rlink = r, r->llink = rl;
+ t->balance = r->balance = 0;
+ break;
+ case R:
+ r = links[depth], l = r->llink, t = l->rlink;
+ rl = t->rlink, lr = t->llink;
+ t->llink = l, l->rlink = lr, t->rlink = r, r->llink = rl;
+ l->balance = t->balance != 1 ? 0 : -1;
+ r->balance = t->balance != (char) -1 ? 0 : 1;
+ t->balance = 0;
+ break;
+ }
+ break;
+ case 2:
+ switch (dirs[depth + 1])
+ {
+ case R:
+ l = links[depth], t = l->rlink, lr = t->llink;
+ t->llink = l, l->rlink = lr;
+ t->balance = l->balance = 0;
+ break;
+ case L:
+ l = links[depth], r = l->rlink, t = r->llink;
+ lr = t->llink, rl = t->rlink;
+ t->llink = l, l->rlink = lr, t->rlink = r, r->llink = rl;
+ l->balance = t->balance != 1 ? 0 : -1;
+ r->balance = t->balance != (char) -1 ? 0 : 1;
+ t->balance = 0;
+ break;
+ }
+ break;
+ }
+
+ if (dirs[depth - 1] == L)
+ links[depth - 1]->llink = t;
+ else
+ links[depth - 1]->rlink = t;
+ }
+ }
+
+ trie = link->trie;
+ }
+
+ /* Mark the node we finally reached as accepting, encoding the
+ index number of this word in the keyword set so far. */
+ if (!trie->accepting)
+ trie->accepting = 1 + 2 * kwset->words;
+ ++kwset->words;
+
+ /* Keep track of the longest and shortest string of the keyword set. */
+ if (trie->depth < kwset->mind)
+ kwset->mind = trie->depth;
+ if (trie->depth > kwset->maxd)
+ kwset->maxd = trie->depth;
+
+ return 0;
+}
+
+/* Enqueue the trie nodes referenced from the given tree in the
+ given queue. */
+static void
+enqueue(tree, last)
+ struct tree *tree;
+ struct trie **last;
+{
+ if (!tree)
+ return;
+ enqueue(tree->llink, last);
+ enqueue(tree->rlink, last);
+ (*last) = (*last)->next = tree->trie;
+}
+
+/* Compute the Aho-Corasick failure function for the trie nodes referenced
+ from the given tree, given the failure function for their parent as
+ well as a last resort failure node. */
+static void
+treefails(tree, fail, recourse)
+ register struct tree *tree;
+ struct trie *fail;
+ struct trie *recourse;
+{
+ register struct tree *link;
+
+ if (!tree)
+ return;
+
+ treefails(tree->llink, fail, recourse);
+ treefails(tree->rlink, fail, recourse);
+
+ /* Find, in the chain of fails going back to the root, the first
+ node that has a descendent on the current label. */
+ while (fail)
+ {
+ link = fail->links;
+ while (link && tree->label != link->label)
+ if (tree->label < link->label)
+ link = link->llink;
+ else
+ link = link->rlink;
+ if (link)
+ {
+ tree->trie->fail = link->trie;
+ return;
+ }
+ fail = fail->fail;
+ }
+
+ tree->trie->fail = recourse;
+}
+
+/* Set delta entries for the links of the given tree such that
+ the preexisting delta value is larger than the current depth. */
+static void
+treedelta(tree, depth, delta)
+ register struct tree *tree;
+ register unsigned int depth;
+ unsigned char delta[];
+{
+ if (!tree)
+ return;
+ treedelta(tree->llink, depth, delta);
+ treedelta(tree->rlink, depth, delta);
+ if (depth < delta[tree->label])
+ delta[tree->label] = depth;
+}
+
+/* Return true if A has every label in B. */
+static int
+hasevery(a, b)
+ register struct tree *a;
+ register struct tree *b;
+{
+ if (!b)
+ return 1;
+ if (!hasevery(a, b->llink))
+ return 0;
+ if (!hasevery(a, b->rlink))
+ return 0;
+ while (a && b->label != a->label)
+ if (b->label < a->label)
+ a = a->llink;
+ else
+ a = a->rlink;
+ return !!a;
+}
+
+/* Compute a vector, indexed by character code, of the trie nodes
+ referenced from the given tree. */
+static void
+treenext(tree, next)
+ struct tree *tree;
+ struct trie *next[];
+{
+ if (!tree)
+ return;
+ treenext(tree->llink, next);
+ treenext(tree->rlink, next);
+ next[tree->label] = tree->trie;
+}
+
+/* Compute the shift for each trie node, as well as the delta
+ table and next cache for the given keyword set. */
+char *
+kwsprep(kws)
+ kwset_t kws;
+{
+ register struct kwset *kwset;
+ register int i;
+ register struct trie *curr, *fail;
+ register char *trans;
+ unsigned char delta[NCHAR];
+ struct trie *last, *next[NCHAR];
+
+ kwset = (struct kwset *) kws;
+
+ /* Initial values for the delta table; will be changed later. The
+ delta entry for a given character is the smallest depth of any
+ node at which an outgoing edge is labeled by that character. */
+ if (kwset->mind < 256)
+ for (i = 0; i < NCHAR; ++i)
+ delta[i] = kwset->mind;
+ else
+ for (i = 0; i < NCHAR; ++i)
+ delta[i] = 255;
+
+ /* Check if we can use the simple boyer-moore algorithm, instead
+ of the hairy commentz-walter algorithm. */
+ if (kwset->words == 1 && kwset->trans == 0)
+ {
+ /* Looking for just one string. Extract it from the trie. */
+ kwset->target = obstack_alloc(&kwset->obstack, kwset->mind);
+ for (i = kwset->mind - 1, curr = kwset->trie; i >= 0; --i)
+ {
+ kwset->target[i] = curr->links->label;
+ curr = curr->links->trie;
+ }
+ /* Build the Boyer Moore delta. Boy that's easy compared to CW. */
+ for (i = 0; i < kwset->mind; ++i)
+ delta[(unsigned char) kwset->target[i]] = kwset->mind - (i + 1);
+ kwset->mind2 = kwset->mind;
+ /* Find the minimal delta2 shift that we might make after
+ a backwards match has failed. */
+ for (i = 0; i < kwset->mind - 1; ++i)
+ if (kwset->target[i] == kwset->target[kwset->mind - 1])
+ kwset->mind2 = kwset->mind - (i + 1);
+ }
+ else
+ {
+ /* Traverse the nodes of the trie in level order, simultaneously
+ computing the delta table, failure function, and shift function. */
+ for (curr = last = kwset->trie; curr; curr = curr->next)
+ {
+ /* Enqueue the immediate descendents in the level order queue. */
+ enqueue(curr->links, &last);
+
+ curr->shift = kwset->mind;
+ curr->maxshift = kwset->mind;
+
+ /* Update the delta table for the descendents of this node. */
+ treedelta(curr->links, curr->depth, delta);
+
+ /* Compute the failure function for the decendents of this node. */
+ treefails(curr->links, curr->fail, kwset->trie);
+
+ /* Update the shifts at each node in the current node's chain
+ of fails back to the root. */
+ for (fail = curr->fail; fail; fail = fail->fail)
+ {
+ /* If the current node has some outgoing edge that the fail
+ doesn't, then the shift at the fail should be no larger
+ than the difference of their depths. */
+ if (!hasevery(fail->links, curr->links))
+ if (curr->depth - fail->depth < fail->shift)
+ fail->shift = curr->depth - fail->depth;
+
+ /* If the current node is accepting then the shift at the
+ fail and its descendents should be no larger than the
+ difference of their depths. */
+ if (curr->accepting && fail->maxshift > curr->depth - fail->depth)
+ fail->maxshift = curr->depth - fail->depth;
+ }
+ }
+
+ /* Traverse the trie in level order again, fixing up all nodes whose
+ shift exceeds their inherited maxshift. */
+ for (curr = kwset->trie->next; curr; curr = curr->next)
+ {
+ if (curr->maxshift > curr->parent->maxshift)
+ curr->maxshift = curr->parent->maxshift;
+ if (curr->shift > curr->maxshift)
+ curr->shift = curr->maxshift;
+ }
+
+ /* Create a vector, indexed by character code, of the outgoing links
+ from the root node. */
+ for (i = 0; i < NCHAR; ++i)
+ next[i] = 0;
+ treenext(kwset->trie->links, next);
+
+ if ((trans = kwset->trans) != 0)
+ for (i = 0; i < NCHAR; ++i)
+ kwset->next[i] = next[(unsigned char) trans[i]];
+ else
+ for (i = 0; i < NCHAR; ++i)
+ kwset->next[i] = next[i];
+ }
+
+ /* Fix things up for any translation table. */
+ if ((trans = kwset->trans) != 0)
+ for (i = 0; i < NCHAR; ++i)
+ kwset->delta[i] = delta[(unsigned char) trans[i]];
+ else
+ for (i = 0; i < NCHAR; ++i)
+ kwset->delta[i] = delta[i];
+
+ return 0;
+}
+
+#define U(C) ((unsigned char) (C))
+
+/* Fast boyer-moore search. */
+static char *
+bmexec(kws, text, size)
+ kwset_t kws;
+ char *text;
+ size_t size;
+{
+ struct kwset *kwset;
+ register unsigned char *d1;
+ register char *ep, *sp, *tp;
+ register int d, gc, i, len, md2;
+
+ kwset = (struct kwset *) kws;
+ len = kwset->mind;
+
+ if (len == 0)
+ return text;
+ if (len > size)
+ return 0;
+ if (len == 1)
+ return memchr(text, kwset->target[0], size);
+
+ d1 = kwset->delta;
+ sp = kwset->target + len;
+ gc = U(sp[-2]);
+ md2 = kwset->mind2;
+ tp = text + len;
+
+ /* Significance of 12: 1 (initial offset) + 10 (skip loop) + 1 (md2). */
+ if (size > 12 * len)
+ /* 11 is not a bug, the initial offset happens only once. */
+ for (ep = text + size - 11 * len;;)
+ {
+ while (tp <= ep)
+ {
+ d = d1[U(tp[-1])], tp += d;
+ d = d1[U(tp[-1])], tp += d;
+ if (d == 0)
+ goto found;
+ d = d1[U(tp[-1])], tp += d;
+ d = d1[U(tp[-1])], tp += d;
+ d = d1[U(tp[-1])], tp += d;
+ if (d == 0)
+ goto found;
+ d = d1[U(tp[-1])], tp += d;
+ d = d1[U(tp[-1])], tp += d;
+ d = d1[U(tp[-1])], tp += d;
+ if (d == 0)
+ goto found;
+ d = d1[U(tp[-1])], tp += d;
+ d = d1[U(tp[-1])], tp += d;
+ }
+ break;
+ found:
+ if (U(tp[-2]) == gc)
+ {
+ for (i = 3; i <= len && U(tp[-i]) == U(sp[-i]); ++i)
+ ;
+ if (i > len)
+ return tp - len;
+ }
+ tp += md2;
+ }
+
+ /* Now we have only a few characters left to search. We
+ carefully avoid ever producing an out-of-bounds pointer. */
+ ep = text + size;
+ d = d1[U(tp[-1])];
+ while (d <= ep - tp)
+ {
+ d = d1[U((tp += d)[-1])];
+ if (d != 0)
+ continue;
+ if (tp[-2] == gc)
+ {
+ for (i = 3; i <= len && U(tp[-i]) == U(sp[-i]); ++i)
+ ;
+ if (i > len)
+ return tp - len;
+ }
+ d = md2;
+ }
+
+ return 0;
+}
+
+/* Hairy multiple string search. */
+static char *
+cwexec(kws, text, len, kwsmatch)
+ kwset_t kws;
+ char *text;
+ size_t len;
+ struct kwsmatch *kwsmatch;
+{
+ struct kwset *kwset;
+ struct trie **next, *trie, *accept;
+ char *beg, *lim, *mch, *lmch;
+ register unsigned char c, *delta;
+ register int d;
+ register char *end, *qlim;
+ register struct tree *tree;
+ register char *trans;
+
+ /* Initialize register copies and look for easy ways out. */
+ kwset = (struct kwset *) kws;
+ if (len < kwset->mind)
+ return 0;
+ next = kwset->next;
+ delta = kwset->delta;
+ trans = kwset->trans;
+ lim = text + len;
+ end = text;
+ if ((d = kwset->mind) != 0)
+ mch = 0;
+ else
+ {
+ mch = text, accept = kwset->trie;
+ goto match;
+ }
+
+ if (len >= 4 * kwset->mind)
+ qlim = lim - 4 * kwset->mind;
+ else
+ qlim = 0;
+
+ while (lim - end >= d)
+ {
+ if (qlim && end <= qlim)
+ {
+ end += d - 1;
+ while ((d = delta[c = *end]) && end < qlim)
+ {
+ end += d;
+ end += delta[(unsigned char) *end];
+ end += delta[(unsigned char) *end];
+ }
+ ++end;
+ }
+ else
+ d = delta[c = (end += d)[-1]];
+ if (d)
+ continue;
+ beg = end - 1;
+ trie = next[c];
+ if (trie->accepting)
+ {
+ mch = beg;
+ accept = trie;
+ }
+ d = trie->shift;
+ while (beg > text)
+ {
+ c = trans ? trans[(unsigned char) *--beg] : *--beg;
+ tree = trie->links;
+ while (tree && c != tree->label)
+ if (c < tree->label)
+ tree = tree->llink;
+ else
+ tree = tree->rlink;
+ if (tree)
+ {
+ trie = tree->trie;
+ if (trie->accepting)
+ {
+ mch = beg;
+ accept = trie;
+ }
+ }
+ else
+ break;
+ d = trie->shift;
+ }
+ if (mch)
+ goto match;
+ }
+ return 0;
+
+ match:
+ /* Given a known match, find the longest possible match anchored
+ at or before its starting point. This is nearly a verbatim
+ copy of the preceding main search loops. */
+ if (lim - mch > kwset->maxd)
+ lim = mch + kwset->maxd;
+ lmch = 0;
+ d = 1;
+ while (lim - end >= d)
+ {
+ if ((d = delta[c = (end += d)[-1]]) != 0)
+ continue;
+ beg = end - 1;
+ if (!(trie = next[c]))
+ {
+ d = 1;
+ continue;
+ }
+ if (trie->accepting && beg <= mch)
+ {
+ lmch = beg;
+ accept = trie;
+ }
+ d = trie->shift;
+ while (beg > text)
+ {
+ c = trans ? trans[(unsigned char) *--beg] : *--beg;
+ tree = trie->links;
+ while (tree && c != tree->label)
+ if (c < tree->label)
+ tree = tree->llink;
+ else
+ tree = tree->rlink;
+ if (tree)
+ {
+ trie = tree->trie;
+ if (trie->accepting && beg <= mch)
+ {
+ lmch = beg;
+ accept = trie;
+ }
+ }
+ else
+ break;
+ d = trie->shift;
+ }
+ if (lmch)
+ {
+ mch = lmch;
+ goto match;
+ }
+ if (!d)
+ d = 1;
+ }
+
+ if (kwsmatch)
+ {
+ kwsmatch->index = accept->accepting / 2;
+ kwsmatch->beg[0] = mch;
+ kwsmatch->size[0] = accept->depth;
+ }
+ return mch;
+}
+
+/* Search through the given text for a match of any member of the
+ given keyword set. Return a pointer to the first character of
+ the matching substring, or NULL if no match is found. If FOUNDLEN
+ is non-NULL store in the referenced location the length of the
+ matching substring. Similarly, if FOUNDIDX is non-NULL, store
+ in the referenced location the index number of the particular
+ keyword matched. */
+char *
+kwsexec(kws, text, size, kwsmatch)
+ kwset_t kws;
+ char *text;
+ size_t size;
+ struct kwsmatch *kwsmatch;
+{
+ struct kwset *kwset;
+ char *ret;
+
+ kwset = (struct kwset *) kws;
+ if (kwset->words == 1 && kwset->trans == 0)
+ {
+ ret = bmexec(kws, text, size);
+ if (kwsmatch != 0 && ret != 0)
+ {
+ kwsmatch->index = 0;
+ kwsmatch->beg[0] = ret;
+ kwsmatch->size[0] = kwset->mind;
+ }
+ return ret;
+ }
+ else
+ return cwexec(kws, text, size, kwsmatch);
+}
+
+/* Free the components of the given keyword set. */
+void
+kwsfree(kws)
+ kwset_t kws;
+{
+ struct kwset *kwset;
+
+ kwset = (struct kwset *) kws;
+ obstack_free(&kwset->obstack, 0);
+ free(kws);
+}
diff --git a/gnu/usr.bin/grep/kwset.h b/gnu/usr.bin/grep/kwset.h
new file mode 100644
index 0000000..95f62e7
--- /dev/null
+++ b/gnu/usr.bin/grep/kwset.h
@@ -0,0 +1,69 @@
+/* kwset.h - header declaring the keyword set library.
+ Copyright 1989 Free Software Foundation
+ Written August 1989 by Mike Haertel.
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 1, or (at your option)
+ any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ The author may be reached (Email) at the address mike@ai.mit.edu,
+ or (US mail) as Mike Haertel c/o Free Software Foundation. */
+
+struct kwsmatch
+{
+ int index; /* Index number of matching keyword. */
+ char *beg[1]; /* Begin pointer for each submatch. */
+ size_t size[1]; /* Length of each submatch. */
+};
+
+#if __STDC__
+
+typedef void *kwset_t;
+
+/* Return an opaque pointer to a newly allocated keyword set, or NULL
+ if enough memory cannot be obtained. The argument if non-NULL
+ specifies a table of character translations to be applied to all
+ pattern and search text. */
+extern kwset_t kwsalloc(char *);
+
+/* Incrementally extend the keyword set to include the given string.
+ Return NULL for success, or an error message. Remember an index
+ number for each keyword included in the set. */
+extern char *kwsincr(kwset_t, char *, size_t);
+
+/* When the keyword set has been completely built, prepare it for
+ use. Return NULL for success, or an error message. */
+extern char *kwsprep(kwset_t);
+
+/* Search through the given buffer for a member of the keyword set.
+ Return a pointer to the leftmost longest match found, or NULL if
+ no match is found. If foundlen is non-NULL, store the length of
+ the matching substring in the integer it points to. Similarly,
+ if foundindex is non-NULL, store the index of the particular
+ keyword found therein. */
+extern char *kwsexec(kwset_t, char *, size_t, struct kwsmatch *);
+
+/* Deallocate the given keyword set and all its associated storage. */
+extern void kwsfree(kwset_t);
+
+#else
+
+typedef char *kwset_t;
+
+extern kwset_t kwsalloc();
+extern char *kwsincr();
+extern char *kwsprep();
+extern char *kwsexec();
+extern void kwsfree();
+
+#endif
diff --git a/gnu/usr.bin/grep/obstack.c b/gnu/usr.bin/grep/obstack.c
new file mode 100644
index 0000000..7b9d3b9
--- /dev/null
+++ b/gnu/usr.bin/grep/obstack.c
@@ -0,0 +1,454 @@
+/* obstack.c - subroutines used implicitly by object stack macros
+ Copyright (C) 1988, 1993 Free Software Foundation, Inc.
+
+This program is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 2, or (at your option) any
+later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, 675 Mass Ave, Cambridge, MA 02139, USA. */
+
+#include "obstack.h"
+
+/* This is just to get __GNU_LIBRARY__ defined. */
+#include <stdio.h>
+
+/* Comment out all this code if we are using the GNU C Library, and are not
+ actually compiling the library itself. This code is part of the GNU C
+ Library, but also included in many other GNU distributions. Compiling
+ and linking in this code is a waste when using the GNU C library
+ (especially if it is a shared library). Rather than having every GNU
+ program understand `configure --with-gnu-libc' and omit the object files,
+ it is simpler to just do this in the source for each such file. */
+
+#if defined (_LIBC) || !defined (__GNU_LIBRARY__)
+
+
+#ifdef __STDC__
+#define POINTER void *
+#else
+#define POINTER char *
+#endif
+
+/* Determine default alignment. */
+struct fooalign {char x; double d;};
+#define DEFAULT_ALIGNMENT \
+ ((PTR_INT_TYPE) ((char *)&((struct fooalign *) 0)->d - (char *)0))
+/* If malloc were really smart, it would round addresses to DEFAULT_ALIGNMENT.
+ But in fact it might be less smart and round addresses to as much as
+ DEFAULT_ROUNDING. So we prepare for it to do that. */
+union fooround {long x; double d;};
+#define DEFAULT_ROUNDING (sizeof (union fooround))
+
+/* When we copy a long block of data, this is the unit to do it with.
+ On some machines, copying successive ints does not work;
+ in such a case, redefine COPYING_UNIT to `long' (if that works)
+ or `char' as a last resort. */
+#ifndef COPYING_UNIT
+#define COPYING_UNIT int
+#endif
+
+/* The non-GNU-C macros copy the obstack into this global variable
+ to avoid multiple evaluation. */
+
+struct obstack *_obstack;
+
+/* Define a macro that either calls functions with the traditional malloc/free
+ calling interface, or calls functions with the mmalloc/mfree interface
+ (that adds an extra first argument), based on the state of use_extra_arg.
+ For free, do not use ?:, since some compilers, like the MIPS compilers,
+ do not allow (expr) ? void : void. */
+
+#define CALL_CHUNKFUN(h, size) \
+ (((h) -> use_extra_arg) \
+ ? (*(h)->chunkfun) ((h)->extra_arg, (size)) \
+ : (*(h)->chunkfun) ((size)))
+
+#define CALL_FREEFUN(h, old_chunk) \
+ do { \
+ if ((h) -> use_extra_arg) \
+ (*(h)->freefun) ((h)->extra_arg, (old_chunk)); \
+ else \
+ (*(h)->freefun) ((old_chunk)); \
+ } while (0)
+
+
+/* Initialize an obstack H for use. Specify chunk size SIZE (0 means default).
+ Objects start on multiples of ALIGNMENT (0 means use default).
+ CHUNKFUN is the function to use to allocate chunks,
+ and FREEFUN the function to free them. */
+
+void
+_obstack_begin (h, size, alignment, chunkfun, freefun)
+ struct obstack *h;
+ int size;
+ int alignment;
+ POINTER (*chunkfun) ();
+ void (*freefun) ();
+{
+ register struct _obstack_chunk* chunk; /* points to new chunk */
+
+ if (alignment == 0)
+ alignment = DEFAULT_ALIGNMENT;
+ if (size == 0)
+ /* Default size is what GNU malloc can fit in a 4096-byte block. */
+ {
+ /* 12 is sizeof (mhead) and 4 is EXTRA from GNU malloc.
+ Use the values for range checking, because if range checking is off,
+ the extra bytes won't be missed terribly, but if range checking is on
+ and we used a larger request, a whole extra 4096 bytes would be
+ allocated.
+
+ These number are irrelevant to the new GNU malloc. I suspect it is
+ less sensitive to the size of the request. */
+ int extra = ((((12 + DEFAULT_ROUNDING - 1) & ~(DEFAULT_ROUNDING - 1))
+ + 4 + DEFAULT_ROUNDING - 1)
+ & ~(DEFAULT_ROUNDING - 1));
+ size = 4096 - extra;
+ }
+
+ h->chunkfun = (struct _obstack_chunk * (*)()) chunkfun;
+ h->freefun = freefun;
+ h->chunk_size = size;
+ h->alignment_mask = alignment - 1;
+ h->use_extra_arg = 0;
+
+ chunk = h->chunk = CALL_CHUNKFUN (h, h -> chunk_size);
+ h->next_free = h->object_base = chunk->contents;
+ h->chunk_limit = chunk->limit
+ = (char *) chunk + h->chunk_size;
+ chunk->prev = 0;
+ /* The initial chunk now contains no empty object. */
+ h->maybe_empty_object = 0;
+}
+
+void
+_obstack_begin_1 (h, size, alignment, chunkfun, freefun, arg)
+ struct obstack *h;
+ int size;
+ int alignment;
+ POINTER (*chunkfun) ();
+ void (*freefun) ();
+ POINTER arg;
+{
+ register struct _obstack_chunk* chunk; /* points to new chunk */
+
+ if (alignment == 0)
+ alignment = DEFAULT_ALIGNMENT;
+ if (size == 0)
+ /* Default size is what GNU malloc can fit in a 4096-byte block. */
+ {
+ /* 12 is sizeof (mhead) and 4 is EXTRA from GNU malloc.
+ Use the values for range checking, because if range checking is off,
+ the extra bytes won't be missed terribly, but if range checking is on
+ and we used a larger request, a whole extra 4096 bytes would be
+ allocated.
+
+ These number are irrelevant to the new GNU malloc. I suspect it is
+ less sensitive to the size of the request. */
+ int extra = ((((12 + DEFAULT_ROUNDING - 1) & ~(DEFAULT_ROUNDING - 1))
+ + 4 + DEFAULT_ROUNDING - 1)
+ & ~(DEFAULT_ROUNDING - 1));
+ size = 4096 - extra;
+ }
+
+ h->chunkfun = (struct _obstack_chunk * (*)()) chunkfun;
+ h->freefun = freefun;
+ h->chunk_size = size;
+ h->alignment_mask = alignment - 1;
+ h->extra_arg = arg;
+ h->use_extra_arg = 1;
+
+ chunk = h->chunk = CALL_CHUNKFUN (h, h -> chunk_size);
+ h->next_free = h->object_base = chunk->contents;
+ h->chunk_limit = chunk->limit
+ = (char *) chunk + h->chunk_size;
+ chunk->prev = 0;
+ /* The initial chunk now contains no empty object. */
+ h->maybe_empty_object = 0;
+}
+
+/* Allocate a new current chunk for the obstack *H
+ on the assumption that LENGTH bytes need to be added
+ to the current object, or a new object of length LENGTH allocated.
+ Copies any partial object from the end of the old chunk
+ to the beginning of the new one. */
+
+void
+_obstack_newchunk (h, length)
+ struct obstack *h;
+ int length;
+{
+ register struct _obstack_chunk* old_chunk = h->chunk;
+ register struct _obstack_chunk* new_chunk;
+ register long new_size;
+ register int obj_size = h->next_free - h->object_base;
+ register int i;
+ int already;
+
+ /* Compute size for new chunk. */
+ new_size = (obj_size + length) + (obj_size >> 3) + 100;
+ if (new_size < h->chunk_size)
+ new_size = h->chunk_size;
+
+ /* Allocate and initialize the new chunk. */
+ new_chunk = h->chunk = CALL_CHUNKFUN (h, new_size);
+ new_chunk->prev = old_chunk;
+ new_chunk->limit = h->chunk_limit = (char *) new_chunk + new_size;
+
+ /* Move the existing object to the new chunk.
+ Word at a time is fast and is safe if the object
+ is sufficiently aligned. */
+ if (h->alignment_mask + 1 >= DEFAULT_ALIGNMENT)
+ {
+ for (i = obj_size / sizeof (COPYING_UNIT) - 1;
+ i >= 0; i--)
+ ((COPYING_UNIT *)new_chunk->contents)[i]
+ = ((COPYING_UNIT *)h->object_base)[i];
+ /* We used to copy the odd few remaining bytes as one extra COPYING_UNIT,
+ but that can cross a page boundary on a machine
+ which does not do strict alignment for COPYING_UNITS. */
+ already = obj_size / sizeof (COPYING_UNIT) * sizeof (COPYING_UNIT);
+ }
+ else
+ already = 0;
+ /* Copy remaining bytes one by one. */
+ for (i = already; i < obj_size; i++)
+ new_chunk->contents[i] = h->object_base[i];
+
+ /* If the object just copied was the only data in OLD_CHUNK,
+ free that chunk and remove it from the chain.
+ But not if that chunk might contain an empty object. */
+ if (h->object_base == old_chunk->contents && ! h->maybe_empty_object)
+ {
+ new_chunk->prev = old_chunk->prev;
+ CALL_FREEFUN (h, old_chunk);
+ }
+
+ h->object_base = new_chunk->contents;
+ h->next_free = h->object_base + obj_size;
+ /* The new chunk certainly contains no empty object yet. */
+ h->maybe_empty_object = 0;
+}
+
+/* Return nonzero if object OBJ has been allocated from obstack H.
+ This is here for debugging.
+ If you use it in a program, you are probably losing. */
+
+int
+_obstack_allocated_p (h, obj)
+ struct obstack *h;
+ POINTER obj;
+{
+ register struct _obstack_chunk* lp; /* below addr of any objects in this chunk */
+ register struct _obstack_chunk* plp; /* point to previous chunk if any */
+
+ lp = (h)->chunk;
+ /* We use >= rather than > since the object cannot be exactly at
+ the beginning of the chunk but might be an empty object exactly
+ at the end of an adjacent chunk. */
+ while (lp != 0 && ((POINTER)lp >= obj || (POINTER)(lp)->limit < obj))
+ {
+ plp = lp->prev;
+ lp = plp;
+ }
+ return lp != 0;
+}
+
+/* Free objects in obstack H, including OBJ and everything allocate
+ more recently than OBJ. If OBJ is zero, free everything in H. */
+
+#undef obstack_free
+
+/* This function has two names with identical definitions.
+ This is the first one, called from non-ANSI code. */
+
+void
+_obstack_free (h, obj)
+ struct obstack *h;
+ POINTER obj;
+{
+ register struct _obstack_chunk* lp; /* below addr of any objects in this chunk */
+ register struct _obstack_chunk* plp; /* point to previous chunk if any */
+
+ lp = h->chunk;
+ /* We use >= because there cannot be an object at the beginning of a chunk.
+ But there can be an empty object at that address
+ at the end of another chunk. */
+ while (lp != 0 && ((POINTER)lp >= obj || (POINTER)(lp)->limit < obj))
+ {
+ plp = lp->prev;
+ CALL_FREEFUN (h, lp);
+ lp = plp;
+ /* If we switch chunks, we can't tell whether the new current
+ chunk contains an empty object, so assume that it may. */
+ h->maybe_empty_object = 1;
+ }
+ if (lp)
+ {
+ h->object_base = h->next_free = (char *)(obj);
+ h->chunk_limit = lp->limit;
+ h->chunk = lp;
+ }
+ else if (obj != 0)
+ /* obj is not in any of the chunks! */
+ abort ();
+}
+
+/* This function is used from ANSI code. */
+
+void
+obstack_free (h, obj)
+ struct obstack *h;
+ POINTER obj;
+{
+ register struct _obstack_chunk* lp; /* below addr of any objects in this chunk */
+ register struct _obstack_chunk* plp; /* point to previous chunk if any */
+
+ lp = h->chunk;
+ /* We use >= because there cannot be an object at the beginning of a chunk.
+ But there can be an empty object at that address
+ at the end of another chunk. */
+ while (lp != 0 && ((POINTER)lp >= obj || (POINTER)(lp)->limit < obj))
+ {
+ plp = lp->prev;
+ CALL_FREEFUN (h, lp);
+ lp = plp;
+ /* If we switch chunks, we can't tell whether the new current
+ chunk contains an empty object, so assume that it may. */
+ h->maybe_empty_object = 1;
+ }
+ if (lp)
+ {
+ h->object_base = h->next_free = (char *)(obj);
+ h->chunk_limit = lp->limit;
+ h->chunk = lp;
+ }
+ else if (obj != 0)
+ /* obj is not in any of the chunks! */
+ abort ();
+}
+
+#if 0
+/* These are now turned off because the applications do not use it
+ and it uses bcopy via obstack_grow, which causes trouble on sysV. */
+
+/* Now define the functional versions of the obstack macros.
+ Define them to simply use the corresponding macros to do the job. */
+
+#ifdef __STDC__
+/* These function definitions do not work with non-ANSI preprocessors;
+ they won't pass through the macro names in parentheses. */
+
+/* The function names appear in parentheses in order to prevent
+ the macro-definitions of the names from being expanded there. */
+
+POINTER (obstack_base) (obstack)
+ struct obstack *obstack;
+{
+ return obstack_base (obstack);
+}
+
+POINTER (obstack_next_free) (obstack)
+ struct obstack *obstack;
+{
+ return obstack_next_free (obstack);
+}
+
+int (obstack_object_size) (obstack)
+ struct obstack *obstack;
+{
+ return obstack_object_size (obstack);
+}
+
+int (obstack_room) (obstack)
+ struct obstack *obstack;
+{
+ return obstack_room (obstack);
+}
+
+void (obstack_grow) (obstack, pointer, length)
+ struct obstack *obstack;
+ POINTER pointer;
+ int length;
+{
+ obstack_grow (obstack, pointer, length);
+}
+
+void (obstack_grow0) (obstack, pointer, length)
+ struct obstack *obstack;
+ POINTER pointer;
+ int length;
+{
+ obstack_grow0 (obstack, pointer, length);
+}
+
+void (obstack_1grow) (obstack, character)
+ struct obstack *obstack;
+ int character;
+{
+ obstack_1grow (obstack, character);
+}
+
+void (obstack_blank) (obstack, length)
+ struct obstack *obstack;
+ int length;
+{
+ obstack_blank (obstack, length);
+}
+
+void (obstack_1grow_fast) (obstack, character)
+ struct obstack *obstack;
+ int character;
+{
+ obstack_1grow_fast (obstack, character);
+}
+
+void (obstack_blank_fast) (obstack, length)
+ struct obstack *obstack;
+ int length;
+{
+ obstack_blank_fast (obstack, length);
+}
+
+POINTER (obstack_finish) (obstack)
+ struct obstack *obstack;
+{
+ return obstack_finish (obstack);
+}
+
+POINTER (obstack_alloc) (obstack, length)
+ struct obstack *obstack;
+ int length;
+{
+ return obstack_alloc (obstack, length);
+}
+
+POINTER (obstack_copy) (obstack, pointer, length)
+ struct obstack *obstack;
+ POINTER pointer;
+ int length;
+{
+ return obstack_copy (obstack, pointer, length);
+}
+
+POINTER (obstack_copy0) (obstack, pointer, length)
+ struct obstack *obstack;
+ POINTER pointer;
+ int length;
+{
+ return obstack_copy0 (obstack, pointer, length);
+}
+
+#endif /* __STDC__ */
+
+#endif /* 0 */
+
+#endif /* _LIBC or not __GNU_LIBRARY__. */
diff --git a/gnu/usr.bin/grep/obstack.h b/gnu/usr.bin/grep/obstack.h
new file mode 100644
index 0000000..8a18e45
--- /dev/null
+++ b/gnu/usr.bin/grep/obstack.h
@@ -0,0 +1,484 @@
+/* obstack.h - object stack macros
+ Copyright (C) 1988, 1992 Free Software Foundation, Inc.
+
+This program is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 2, or (at your option) any
+later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, 675 Mass Ave, Cambridge, MA 02139, USA. */
+
+/* Summary:
+
+All the apparent functions defined here are macros. The idea
+is that you would use these pre-tested macros to solve a
+very specific set of problems, and they would run fast.
+Caution: no side-effects in arguments please!! They may be
+evaluated MANY times!!
+
+These macros operate a stack of objects. Each object starts life
+small, and may grow to maturity. (Consider building a word syllable
+by syllable.) An object can move while it is growing. Once it has
+been "finished" it never changes address again. So the "top of the
+stack" is typically an immature growing object, while the rest of the
+stack is of mature, fixed size and fixed address objects.
+
+These routines grab large chunks of memory, using a function you
+supply, called `obstack_chunk_alloc'. On occasion, they free chunks,
+by calling `obstack_chunk_free'. You must define them and declare
+them before using any obstack macros.
+
+Each independent stack is represented by a `struct obstack'.
+Each of the obstack macros expects a pointer to such a structure
+as the first argument.
+
+One motivation for this package is the problem of growing char strings
+in symbol tables. Unless you are "fascist pig with a read-only mind"
+--Gosper's immortal quote from HAKMEM item 154, out of context--you
+would not like to put any arbitrary upper limit on the length of your
+symbols.
+
+In practice this often means you will build many short symbols and a
+few long symbols. At the time you are reading a symbol you don't know
+how long it is. One traditional method is to read a symbol into a
+buffer, realloc()ating the buffer every time you try to read a symbol
+that is longer than the buffer. This is beaut, but you still will
+want to copy the symbol from the buffer to a more permanent
+symbol-table entry say about half the time.
+
+With obstacks, you can work differently. Use one obstack for all symbol
+names. As you read a symbol, grow the name in the obstack gradually.
+When the name is complete, finalize it. Then, if the symbol exists already,
+free the newly read name.
+
+The way we do this is to take a large chunk, allocating memory from
+low addresses. When you want to build a symbol in the chunk you just
+add chars above the current "high water mark" in the chunk. When you
+have finished adding chars, because you got to the end of the symbol,
+you know how long the chars are, and you can create a new object.
+Mostly the chars will not burst over the highest address of the chunk,
+because you would typically expect a chunk to be (say) 100 times as
+long as an average object.
+
+In case that isn't clear, when we have enough chars to make up
+the object, THEY ARE ALREADY CONTIGUOUS IN THE CHUNK (guaranteed)
+so we just point to it where it lies. No moving of chars is
+needed and this is the second win: potentially long strings need
+never be explicitly shuffled. Once an object is formed, it does not
+change its address during its lifetime.
+
+When the chars burst over a chunk boundary, we allocate a larger
+chunk, and then copy the partly formed object from the end of the old
+chunk to the beginning of the new larger chunk. We then carry on
+accreting characters to the end of the object as we normally would.
+
+A special macro is provided to add a single char at a time to a
+growing object. This allows the use of register variables, which
+break the ordinary 'growth' macro.
+
+Summary:
+ We allocate large chunks.
+ We carve out one object at a time from the current chunk.
+ Once carved, an object never moves.
+ We are free to append data of any size to the currently
+ growing object.
+ Exactly one object is growing in an obstack at any one time.
+ You can run one obstack per control block.
+ You may have as many control blocks as you dare.
+ Because of the way we do it, you can `unwind' an obstack
+ back to a previous state. (You may remove objects much
+ as you would with a stack.)
+*/
+
+
+/* Don't do the contents of this file more than once. */
+
+#ifndef __OBSTACKS__
+#define __OBSTACKS__
+
+/* We use subtraction of (char *)0 instead of casting to int
+ because on word-addressable machines a simple cast to int
+ may ignore the byte-within-word field of the pointer. */
+
+#ifndef __PTR_TO_INT
+#define __PTR_TO_INT(P) ((P) - (char *)0)
+#endif
+
+#ifndef __INT_TO_PTR
+#define __INT_TO_PTR(P) ((P) + (char *)0)
+#endif
+
+/* We need the type of the resulting object. In ANSI C it is ptrdiff_t
+ but in traditional C it is usually long. If we are in ANSI C and
+ don't already have ptrdiff_t get it. */
+
+#if defined (__STDC__) && ! defined (offsetof)
+#if defined (__GNUC__) && defined (IN_GCC)
+/* On Next machine, the system's stddef.h screws up if included
+ after we have defined just ptrdiff_t, so include all of gstddef.h.
+ Otherwise, define just ptrdiff_t, which is all we need. */
+#ifndef __NeXT__
+#define __need_ptrdiff_t
+#endif
+
+/* While building GCC, the stddef.h that goes with GCC has this name. */
+#include "gstddef.h"
+#else
+#include <stddef.h>
+#endif
+#endif
+
+#ifdef __STDC__
+#define PTR_INT_TYPE ptrdiff_t
+#else
+#define PTR_INT_TYPE long
+#endif
+
+struct _obstack_chunk /* Lives at front of each chunk. */
+{
+ char *limit; /* 1 past end of this chunk */
+ struct _obstack_chunk *prev; /* address of prior chunk or NULL */
+ char contents[4]; /* objects begin here */
+};
+
+struct obstack /* control current object in current chunk */
+{
+ long chunk_size; /* preferred size to allocate chunks in */
+ struct _obstack_chunk* chunk; /* address of current struct obstack_chunk */
+ char *object_base; /* address of object we are building */
+ char *next_free; /* where to add next char to current object */
+ char *chunk_limit; /* address of char after current chunk */
+ PTR_INT_TYPE temp; /* Temporary for some macros. */
+ int alignment_mask; /* Mask of alignment for each object. */
+ struct _obstack_chunk *(*chunkfun) (); /* User's fcn to allocate a chunk. */
+ void (*freefun) (); /* User's function to free a chunk. */
+ char *extra_arg; /* first arg for chunk alloc/dealloc funcs */
+ unsigned use_extra_arg:1; /* chunk alloc/dealloc funcs take extra arg */
+ unsigned maybe_empty_object:1;/* There is a possibility that the current
+ chunk contains a zero-length object. This
+ prevents freeing the chunk if we allocate
+ a bigger chunk to replace it. */
+};
+
+/* Declare the external functions we use; they are in obstack.c. */
+
+#ifdef __STDC__
+extern void _obstack_newchunk (struct obstack *, int);
+extern void _obstack_free (struct obstack *, void *);
+extern void _obstack_begin (struct obstack *, int, int,
+ void *(*) (), void (*) ());
+extern void _obstack_begin_1 (struct obstack *, int, int,
+ void *(*) (), void (*) (), void *);
+#else
+extern void _obstack_newchunk ();
+extern void _obstack_free ();
+extern void _obstack_begin ();
+extern void _obstack_begin_1 ();
+#endif
+
+#ifdef __STDC__
+
+/* Do the function-declarations after the structs
+ but before defining the macros. */
+
+void obstack_init (struct obstack *obstack);
+
+void * obstack_alloc (struct obstack *obstack, int size);
+
+void * obstack_copy (struct obstack *obstack, void *address, int size);
+void * obstack_copy0 (struct obstack *obstack, void *address, int size);
+
+void obstack_free (struct obstack *obstack, void *block);
+
+void obstack_blank (struct obstack *obstack, int size);
+
+void obstack_grow (struct obstack *obstack, void *data, int size);
+void obstack_grow0 (struct obstack *obstack, void *data, int size);
+
+void obstack_1grow (struct obstack *obstack, int data_char);
+void obstack_ptr_grow (struct obstack *obstack, void *data);
+void obstack_int_grow (struct obstack *obstack, int data);
+
+void * obstack_finish (struct obstack *obstack);
+
+int obstack_object_size (struct obstack *obstack);
+
+int obstack_room (struct obstack *obstack);
+void obstack_1grow_fast (struct obstack *obstack, int data_char);
+void obstack_ptr_grow_fast (struct obstack *obstack, void *data);
+void obstack_int_grow_fast (struct obstack *obstack, int data);
+void obstack_blank_fast (struct obstack *obstack, int size);
+
+void * obstack_base (struct obstack *obstack);
+void * obstack_next_free (struct obstack *obstack);
+int obstack_alignment_mask (struct obstack *obstack);
+int obstack_chunk_size (struct obstack *obstack);
+
+#endif /* __STDC__ */
+
+/* Non-ANSI C cannot really support alternative functions for these macros,
+ so we do not declare them. */
+
+/* Pointer to beginning of object being allocated or to be allocated next.
+ Note that this might not be the final address of the object
+ because a new chunk might be needed to hold the final size. */
+
+#define obstack_base(h) ((h)->object_base)
+
+/* Size for allocating ordinary chunks. */
+
+#define obstack_chunk_size(h) ((h)->chunk_size)
+
+/* Pointer to next byte not yet allocated in current chunk. */
+
+#define obstack_next_free(h) ((h)->next_free)
+
+/* Mask specifying low bits that should be clear in address of an object. */
+
+#define obstack_alignment_mask(h) ((h)->alignment_mask)
+
+#define obstack_init(h) \
+ _obstack_begin ((h), 0, 0, \
+ (void *(*) ()) obstack_chunk_alloc, (void (*) ()) obstack_chunk_free)
+
+#define obstack_begin(h, size) \
+ _obstack_begin ((h), (size), 0, \
+ (void *(*) ()) obstack_chunk_alloc, (void (*) ()) obstack_chunk_free)
+
+#define obstack_specify_allocation(h, size, alignment, chunkfun, freefun) \
+ _obstack_begin ((h), (size), (alignment), \
+ (void *(*) ()) (chunkfun), (void (*) ()) (freefun))
+
+#define obstack_specify_allocation_with_arg(h, size, alignment, chunkfun, freefun, arg) \
+ _obstack_begin_1 ((h), (size), (alignment), \
+ (void *(*) ()) (chunkfun), (void (*) ()) (freefun), (arg))
+
+#define obstack_1grow_fast(h,achar) (*((h)->next_free)++ = achar)
+
+#define obstack_blank_fast(h,n) ((h)->next_free += (n))
+
+#if defined (__GNUC__) && defined (__STDC__)
+#if __GNUC__ < 2 || defined(NeXT)
+#define __extension__
+#endif
+
+/* For GNU C, if not -traditional,
+ we can define these macros to compute all args only once
+ without using a global variable.
+ Also, we can avoid using the `temp' slot, to make faster code. */
+
+#define obstack_object_size(OBSTACK) \
+ __extension__ \
+ ({ struct obstack *__o = (OBSTACK); \
+ (unsigned) (__o->next_free - __o->object_base); })
+
+#define obstack_room(OBSTACK) \
+ __extension__ \
+ ({ struct obstack *__o = (OBSTACK); \
+ (unsigned) (__o->chunk_limit - __o->next_free); })
+
+/* Note that the call to _obstack_newchunk is enclosed in (..., 0)
+ so that we can avoid having void expressions
+ in the arms of the conditional expression.
+ Casting the third operand to void was tried before,
+ but some compilers won't accept it. */
+#define obstack_grow(OBSTACK,where,length) \
+__extension__ \
+({ struct obstack *__o = (OBSTACK); \
+ int __len = (length); \
+ ((__o->next_free + __len > __o->chunk_limit) \
+ ? (_obstack_newchunk (__o, __len), 0) : 0); \
+ bcopy (where, __o->next_free, __len); \
+ __o->next_free += __len; \
+ (void) 0; })
+
+#define obstack_grow0(OBSTACK,where,length) \
+__extension__ \
+({ struct obstack *__o = (OBSTACK); \
+ int __len = (length); \
+ ((__o->next_free + __len + 1 > __o->chunk_limit) \
+ ? (_obstack_newchunk (__o, __len + 1), 0) : 0), \
+ bcopy (where, __o->next_free, __len), \
+ __o->next_free += __len, \
+ *(__o->next_free)++ = 0; \
+ (void) 0; })
+
+#define obstack_1grow(OBSTACK,datum) \
+__extension__ \
+({ struct obstack *__o = (OBSTACK); \
+ ((__o->next_free + 1 > __o->chunk_limit) \
+ ? (_obstack_newchunk (__o, 1), 0) : 0), \
+ *(__o->next_free)++ = (datum); \
+ (void) 0; })
+
+/* These assume that the obstack alignment is good enough for pointers or ints,
+ and that the data added so far to the current object
+ shares that much alignment. */
+
+#define obstack_ptr_grow(OBSTACK,datum) \
+__extension__ \
+({ struct obstack *__o = (OBSTACK); \
+ ((__o->next_free + sizeof (void *) > __o->chunk_limit) \
+ ? (_obstack_newchunk (__o, sizeof (void *)), 0) : 0), \
+ *((void **)__o->next_free)++ = ((void *)datum); \
+ (void) 0; })
+
+#define obstack_int_grow(OBSTACK,datum) \
+__extension__ \
+({ struct obstack *__o = (OBSTACK); \
+ ((__o->next_free + sizeof (int) > __o->chunk_limit) \
+ ? (_obstack_newchunk (__o, sizeof (int)), 0) : 0), \
+ *((int *)__o->next_free)++ = ((int)datum); \
+ (void) 0; })
+
+#define obstack_ptr_grow_fast(h,aptr) (*((void **)(h)->next_free)++ = (void *)aptr)
+#define obstack_int_grow_fast(h,aint) (*((int *)(h)->next_free)++ = (int)aint)
+
+#define obstack_blank(OBSTACK,length) \
+__extension__ \
+({ struct obstack *__o = (OBSTACK); \
+ int __len = (length); \
+ ((__o->chunk_limit - __o->next_free < __len) \
+ ? (_obstack_newchunk (__o, __len), 0) : 0); \
+ __o->next_free += __len; \
+ (void) 0; })
+
+#define obstack_alloc(OBSTACK,length) \
+__extension__ \
+({ struct obstack *__h = (OBSTACK); \
+ obstack_blank (__h, (length)); \
+ obstack_finish (__h); })
+
+#define obstack_copy(OBSTACK,where,length) \
+__extension__ \
+({ struct obstack *__h = (OBSTACK); \
+ obstack_grow (__h, (where), (length)); \
+ obstack_finish (__h); })
+
+#define obstack_copy0(OBSTACK,where,length) \
+__extension__ \
+({ struct obstack *__h = (OBSTACK); \
+ obstack_grow0 (__h, (where), (length)); \
+ obstack_finish (__h); })
+
+/* The local variable is named __o1 to avoid a name conflict
+ when obstack_blank is called. */
+#define obstack_finish(OBSTACK) \
+__extension__ \
+({ struct obstack *__o1 = (OBSTACK); \
+ void *value = (void *) __o1->object_base; \
+ if (__o1->next_free == value) \
+ __o1->maybe_empty_object = 1; \
+ __o1->next_free \
+ = __INT_TO_PTR ((__PTR_TO_INT (__o1->next_free)+__o1->alignment_mask)\
+ & ~ (__o1->alignment_mask)); \
+ ((__o1->next_free - (char *)__o1->chunk \
+ > __o1->chunk_limit - (char *)__o1->chunk) \
+ ? (__o1->next_free = __o1->chunk_limit) : 0); \
+ __o1->object_base = __o1->next_free; \
+ value; })
+
+#define obstack_free(OBSTACK, OBJ) \
+__extension__ \
+({ struct obstack *__o = (OBSTACK); \
+ void *__obj = (OBJ); \
+ if (__obj > (void *)__o->chunk && __obj < (void *)__o->chunk_limit) \
+ __o->next_free = __o->object_base = __obj; \
+ else (obstack_free) (__o, __obj); })
+
+#else /* not __GNUC__ or not __STDC__ */
+
+#define obstack_object_size(h) \
+ (unsigned) ((h)->next_free - (h)->object_base)
+
+#define obstack_room(h) \
+ (unsigned) ((h)->chunk_limit - (h)->next_free)
+
+#define obstack_grow(h,where,length) \
+( (h)->temp = (length), \
+ (((h)->next_free + (h)->temp > (h)->chunk_limit) \
+ ? (_obstack_newchunk ((h), (h)->temp), 0) : 0), \
+ bcopy (where, (h)->next_free, (h)->temp), \
+ (h)->next_free += (h)->temp)
+
+#define obstack_grow0(h,where,length) \
+( (h)->temp = (length), \
+ (((h)->next_free + (h)->temp + 1 > (h)->chunk_limit) \
+ ? (_obstack_newchunk ((h), (h)->temp + 1), 0) : 0), \
+ bcopy (where, (h)->next_free, (h)->temp), \
+ (h)->next_free += (h)->temp, \
+ *((h)->next_free)++ = 0)
+
+#define obstack_1grow(h,datum) \
+( (((h)->next_free + 1 > (h)->chunk_limit) \
+ ? (_obstack_newchunk ((h), 1), 0) : 0), \
+ *((h)->next_free)++ = (datum))
+
+#define obstack_ptr_grow(h,datum) \
+( (((h)->next_free + sizeof (char *) > (h)->chunk_limit) \
+ ? (_obstack_newchunk ((h), sizeof (char *)), 0) : 0), \
+ *((char **)(((h)->next_free+=sizeof(char *))-sizeof(char *))) = ((char *)datum))
+
+#define obstack_int_grow(h,datum) \
+( (((h)->next_free + sizeof (int) > (h)->chunk_limit) \
+ ? (_obstack_newchunk ((h), sizeof (int)), 0) : 0), \
+ *((int *)(((h)->next_free+=sizeof(int))-sizeof(int))) = ((int)datum))
+
+#define obstack_ptr_grow_fast(h,aptr) (*((char **)(h)->next_free)++ = (char *)aptr)
+#define obstack_int_grow_fast(h,aint) (*((int *)(h)->next_free)++ = (int)aint)
+
+#define obstack_blank(h,length) \
+( (h)->temp = (length), \
+ (((h)->chunk_limit - (h)->next_free < (h)->temp) \
+ ? (_obstack_newchunk ((h), (h)->temp), 0) : 0), \
+ (h)->next_free += (h)->temp)
+
+#define obstack_alloc(h,length) \
+ (obstack_blank ((h), (length)), obstack_finish ((h)))
+
+#define obstack_copy(h,where,length) \
+ (obstack_grow ((h), (where), (length)), obstack_finish ((h)))
+
+#define obstack_copy0(h,where,length) \
+ (obstack_grow0 ((h), (where), (length)), obstack_finish ((h)))
+
+#define obstack_finish(h) \
+( ((h)->next_free == (h)->object_base \
+ ? (((h)->maybe_empty_object = 1), 0) \
+ : 0), \
+ (h)->temp = __PTR_TO_INT ((h)->object_base), \
+ (h)->next_free \
+ = __INT_TO_PTR ((__PTR_TO_INT ((h)->next_free)+(h)->alignment_mask) \
+ & ~ ((h)->alignment_mask)), \
+ (((h)->next_free - (char *)(h)->chunk \
+ > (h)->chunk_limit - (char *)(h)->chunk) \
+ ? ((h)->next_free = (h)->chunk_limit) : 0), \
+ (h)->object_base = (h)->next_free, \
+ __INT_TO_PTR ((h)->temp))
+
+#ifdef __STDC__
+#define obstack_free(h,obj) \
+( (h)->temp = (char *)(obj) - (char *) (h)->chunk, \
+ (((h)->temp > 0 && (h)->temp < (h)->chunk_limit - (char *) (h)->chunk)\
+ ? (int) ((h)->next_free = (h)->object_base \
+ = (h)->temp + (char *) (h)->chunk) \
+ : (((obstack_free) ((h), (h)->temp + (char *) (h)->chunk), 0), 0)))
+#else
+#define obstack_free(h,obj) \
+( (h)->temp = (char *)(obj) - (char *) (h)->chunk, \
+ (((h)->temp > 0 && (h)->temp < (h)->chunk_limit - (char *) (h)->chunk)\
+ ? (int) ((h)->next_free = (h)->object_base \
+ = (h)->temp + (char *) (h)->chunk) \
+ : (_obstack_free ((h), (h)->temp + (char *) (h)->chunk), 0)))
+#endif
+
+#endif /* not __GNUC__ or not __STDC__ */
+
+#endif /* not __OBSTACKS__ */
diff --git a/gnu/usr.bin/grep/regex.c b/gnu/usr.bin/grep/regex.c
index d69e784..e8b5882 100644
--- a/gnu/usr.bin/grep/regex.c
+++ b/gnu/usr.bin/grep/regex.c
@@ -1,5 +1,9 @@
-/* Extended regular expression matching and search library.
- Copyright (C) 1985, 1989 Free Software Foundation, Inc.
+/* Extended regular expression matching and search library,
+ version 0.12.
+ (Implements POSIX draft P10003.2/D11.2, except for
+ internationalization features.)
+
+ Copyright (C) 1993 Free Software Foundation, Inc.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -13,84 +17,78 @@
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
- Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
-
+ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
- In other words, you are welcome to use, share and improve this program.
- You are forbidden to forbid anyone else to use, share and improve
- what you give them. Help stamp out software-hoarding! */
+/* AIX requires this to be the first thing in the file. */
+#if defined (_AIX) && !defined (REGEX_MALLOC)
+ #pragma alloca
+#endif
+#define _GNU_SOURCE
-/* To test, compile with -Dtest.
- This Dtestable feature turns this into a self-contained program
- which reads a pattern, describes how it compiles,
- then reads a string and searches for it. */
+/* We need this for `regex.h', and perhaps for the Emacs include files. */
+#include <sys/types.h>
-/* AIX requires this to be the first thing in the file. */
-#ifdef __GNUC__
-#undef alloca
-#define alloca __builtin_alloca
-#else /* not __GNUC__ */
-#if defined(sparc) && !defined(USG) && !defined(SVR4) && !defined(__svr4__)
-#include <alloca.h>
-#else
-#ifdef _AIX
- #pragma alloca
-#else
-char *alloca ();
+#ifdef HAVE_CONFIG_H
+#include "config.h"
#endif
-#endif /* sparc */
-#endif /* not __GNUC__ */
-
+/* The `emacs' switch turns on certain matching commands
+ that make sense only in Emacs. */
#ifdef emacs
-/* The `emacs' switch turns on certain special matching commands
- that make sense only in emacs. */
-
-#include "config.h"
#include "lisp.h"
#include "buffer.h"
#include "syntax.h"
+/* Emacs uses `NULL' as a predicate. */
+#undef NULL
+
#else /* not emacs */
-#if defined(USG) || defined(STDC_HEADERS)
+/* We used to test for `BSTRING' here, but only GCC and Emacs define
+ `BSTRING', as far as I know, and neither of them use this code. */
+#if HAVE_STRING_H || STDC_HEADERS
#include <string.h>
-#ifndef bcopy
-#define bcopy(s,d,n) memcpy((d),(s),(n))
-#endif
#ifndef bcmp
-#define bcmp(s1,s2,n) memcmp((s1),(s2),(n))
+#define bcmp(s1, s2, n) memcmp ((s1), (s2), (n))
+#endif
+#ifndef bcopy
+#define bcopy(s, d, n) memcpy ((d), (s), (n))
#endif
#ifndef bzero
-#define bzero(s,n) memset((s),0,(n))
+#define bzero(s, n) memset ((s), 0, (n))
#endif
+#else
+#include <strings.h>
#endif
#ifdef STDC_HEADERS
#include <stdlib.h>
#else
char *malloc ();
+char *realloc ();
#endif
-/*
- * Define the syntax stuff, so we can do the \<...\> things.
- */
-#ifndef Sword /* must be non-zero in some of the tests below... */
+/* Define the syntax stuff for \<, \>, etc. */
+
+/* This must be nonzero for the wordchar and notwordchar pattern
+ commands in re_match_2. */
+#ifndef Sword
#define Sword 1
#endif
-#define SYNTAX(c) re_syntax_table[c]
-
#ifdef SYNTAX_TABLE
-char *re_syntax_table;
+extern char *re_syntax_table;
-#else
+#else /* not SYNTAX_TABLE */
+
+/* How many characters in the character set. */
+#define CHAR_SET_SIZE 256
-static char re_syntax_table[256];
+static char re_syntax_table[CHAR_SET_SIZE];
static void
init_syntax_once ()
@@ -112,731 +110,2549 @@ init_syntax_once ()
for (c = '0'; c <= '9'; c++)
re_syntax_table[c] = Sword;
+ re_syntax_table['_'] = Sword;
+
done = 1;
}
-#endif /* SYNTAX_TABLE */
-#endif /* not emacs */
-
-#include "regex.h"
-
-/* Number of failure points to allocate space for initially,
- when matching. If this number is exceeded, more space is allocated,
- so it is not a hard limit. */
+#endif /* not SYNTAX_TABLE */
-#ifndef NFAILURES
-#define NFAILURES 80
-#endif /* NFAILURES */
+#define SYNTAX(c) re_syntax_table[c]
-/* width of a byte in bits */
+#endif /* not emacs */
+
+/* Get the interface, including the syntax bits. */
+#include "regex.h"
-#define BYTEWIDTH 8
+/* isalpha etc. are used for the character classes. */
+#include <ctype.h>
+
+/* Jim Meyering writes:
+
+ "... Some ctype macros are valid only for character codes that
+ isascii says are ASCII (SGI's IRIX-4.0.5 is one such system --when
+ using /bin/cc or gcc but without giving an ansi option). So, all
+ ctype uses should be through macros like ISPRINT... If
+ STDC_HEADERS is defined, then autoconf has verified that the ctype
+ macros don't need to be guarded with references to isascii. ...
+ Defining isascii to 1 should let any compiler worth its salt
+ eliminate the && through constant folding." */
+#if ! defined (isascii) || defined (STDC_HEADERS)
+#undef isascii
+#define isascii(c) 1
+#endif
-#ifndef SIGN_EXTEND_CHAR
-#ifdef __CHAR_UNSIGNED__
-#define SIGN_EXTEND_CHAR(c) ((c)>(char)127?(c)-256:(c))
+#ifdef isblank
+#define ISBLANK(c) (isascii (c) && isblank (c))
+#else
+#define ISBLANK(c) ((c) == ' ' || (c) == '\t')
+#endif
+#ifdef isgraph
+#define ISGRAPH(c) (isascii (c) && isgraph (c))
#else
-#define SIGN_EXTEND_CHAR(x) (x)
+#define ISGRAPH(c) (isascii (c) && isprint (c) && !isspace (c))
+#endif
+
+#define ISPRINT(c) (isascii (c) && isprint (c))
+#define ISDIGIT(c) (isascii (c) && isdigit (c))
+#define ISALNUM(c) (isascii (c) && isalnum (c))
+#define ISALPHA(c) (isascii (c) && isalpha (c))
+#define ISCNTRL(c) (isascii (c) && iscntrl (c))
+#define ISLOWER(c) (isascii (c) && islower (c))
+#define ISPUNCT(c) (isascii (c) && ispunct (c))
+#define ISSPACE(c) (isascii (c) && isspace (c))
+#define ISUPPER(c) (isascii (c) && isupper (c))
+#define ISXDIGIT(c) (isascii (c) && isxdigit (c))
+
+#ifndef NULL
+#define NULL 0
#endif
+
+/* We remove any previous definition of `SIGN_EXTEND_CHAR',
+ since ours (we hope) works properly with all combinations of
+ machines, compilers, `char' and `unsigned char' argument types.
+ (Per Bothner suggested the basic approach.) */
+#undef SIGN_EXTEND_CHAR
+#if __STDC__
+#define SIGN_EXTEND_CHAR(c) ((signed char) (c))
+#else /* not __STDC__ */
+/* As in Harbison and Steele. */
+#define SIGN_EXTEND_CHAR(c) ((((unsigned char) (c)) ^ 128) - 128)
#endif
-static int obscure_syntax = 0;
+/* Should we use malloc or alloca? If REGEX_MALLOC is not defined, we
+ use `alloca' instead of `malloc'. This is because using malloc in
+ re_search* or re_match* could cause memory leaks when C-g is used in
+ Emacs; also, malloc is slower and causes storage fragmentation. On
+ the other hand, malloc is more portable, and easier to debug.
+
+ Because we sometimes use alloca, some routines have to be macros,
+ not functions -- `alloca'-allocated space disappears at the end of the
+ function it is called in. */
-/* Specify the precise syntax of regexp for compilation.
- This provides for compatibility for various utilities
- which historically have different, incompatible syntaxes.
+#ifdef REGEX_MALLOC
- The argument SYNTAX is a bit-mask containing the two bits
- RE_NO_BK_PARENS and RE_NO_BK_VBAR. */
+#define REGEX_ALLOCATE malloc
+#define REGEX_REALLOCATE(source, osize, nsize) realloc (source, nsize)
-int
-re_set_syntax (syntax)
- int syntax;
+#else /* not REGEX_MALLOC */
+
+/* Emacs already defines alloca, sometimes. */
+#ifndef alloca
+
+/* Make alloca work the best possible way. */
+#ifdef __GNUC__
+#define alloca __builtin_alloca
+#else /* not __GNUC__ */
+#if HAVE_ALLOCA_H
+#include <alloca.h>
+#else /* not __GNUC__ or HAVE_ALLOCA_H */
+#ifndef _AIX /* Already did AIX, up at the top. */
+char *alloca ();
+#endif /* not _AIX */
+#endif /* not HAVE_ALLOCA_H */
+#endif /* not __GNUC__ */
+
+#endif /* not alloca */
+
+#define REGEX_ALLOCATE alloca
+
+/* Assumes a `char *destination' variable. */
+#define REGEX_REALLOCATE(source, osize, nsize) \
+ (destination = (char *) alloca (nsize), \
+ bcopy (source, destination, osize), \
+ destination)
+
+#endif /* not REGEX_MALLOC */
+
+
+/* True if `size1' is non-NULL and PTR is pointing anywhere inside
+ `string1' or just past its end. This works if PTR is NULL, which is
+ a good thing. */
+#define FIRST_STRING_P(ptr) \
+ (size1 && string1 <= (ptr) && (ptr) <= string1 + size1)
+
+/* (Re)Allocate N items of type T using malloc, or fail. */
+#define TALLOC(n, t) ((t *) malloc ((n) * sizeof (t)))
+#define RETALLOC(addr, n, t) ((addr) = (t *) realloc (addr, (n) * sizeof (t)))
+#define REGEX_TALLOC(n, t) ((t *) REGEX_ALLOCATE ((n) * sizeof (t)))
+
+#define BYTEWIDTH 8 /* In bits. */
+
+#define STREQ(s1, s2) ((strcmp (s1, s2) == 0))
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+#define MIN(a, b) ((a) < (b) ? (a) : (b))
+
+typedef char boolean;
+#define false 0
+#define true 1
+
+/* These are the command codes that appear in compiled regular
+ expressions. Some opcodes are followed by argument bytes. A
+ command code can specify any interpretation whatsoever for its
+ arguments. Zero bytes may appear in the compiled regular expression.
+
+ The value of `exactn' is needed in search.c (search_buffer) in Emacs.
+ So regex.h defines a symbol `RE_EXACTN_VALUE' to be 1; the value of
+ `exactn' we use here must also be 1. */
+
+typedef enum
{
- int ret;
+ no_op = 0,
+
+ /* Followed by one byte giving n, then by n literal bytes. */
+ exactn = 1,
+
+ /* Matches any (more or less) character. */
+ anychar,
+
+ /* Matches any one char belonging to specified set. First
+ following byte is number of bitmap bytes. Then come bytes
+ for a bitmap saying which chars are in. Bits in each byte
+ are ordered low-bit-first. A character is in the set if its
+ bit is 1. A character too large to have a bit in the map is
+ automatically not in the set. */
+ charset,
+
+ /* Same parameters as charset, but match any character that is
+ not one of those specified. */
+ charset_not,
+
+ /* Start remembering the text that is matched, for storing in a
+ register. Followed by one byte with the register number, in
+ the range 0 to one less than the pattern buffer's re_nsub
+ field. Then followed by one byte with the number of groups
+ inner to this one. (This last has to be part of the
+ start_memory only because we need it in the on_failure_jump
+ of re_match_2.) */
+ start_memory,
+
+ /* Stop remembering the text that is matched and store it in a
+ memory register. Followed by one byte with the register
+ number, in the range 0 to one less than `re_nsub' in the
+ pattern buffer, and one byte with the number of inner groups,
+ just like `start_memory'. (We need the number of inner
+ groups here because we don't have any easy way of finding the
+ corresponding start_memory when we're at a stop_memory.) */
+ stop_memory,
+
+ /* Match a duplicate of something remembered. Followed by one
+ byte containing the register number. */
+ duplicate,
+
+ /* Fail unless at beginning of line. */
+ begline,
+
+ /* Fail unless at end of line. */
+ endline,
+
+ /* Succeeds if at beginning of buffer (if emacs) or at beginning
+ of string to be matched (if not). */
+ begbuf,
+
+ /* Analogously, for end of buffer/string. */
+ endbuf,
+
+ /* Followed by two byte relative address to which to jump. */
+ jump,
+
+ /* Same as jump, but marks the end of an alternative. */
+ jump_past_alt,
+
+ /* Followed by two-byte relative address of place to resume at
+ in case of failure. */
+ on_failure_jump,
+
+ /* Like on_failure_jump, but pushes a placeholder instead of the
+ current string position when executed. */
+ on_failure_keep_string_jump,
+
+ /* Throw away latest failure point and then jump to following
+ two-byte relative address. */
+ pop_failure_jump,
+
+ /* Change to pop_failure_jump if know won't have to backtrack to
+ match; otherwise change to jump. This is used to jump
+ back to the beginning of a repeat. If what follows this jump
+ clearly won't match what the repeat does, such that we can be
+ sure that there is no use backtracking out of repetitions
+ already matched, then we change it to a pop_failure_jump.
+ Followed by two-byte address. */
+ maybe_pop_jump,
+
+ /* Jump to following two-byte address, and push a dummy failure
+ point. This failure point will be thrown away if an attempt
+ is made to use it for a failure. A `+' construct makes this
+ before the first repeat. Also used as an intermediary kind
+ of jump when compiling an alternative. */
+ dummy_failure_jump,
+
+ /* Push a dummy failure point and continue. Used at the end of
+ alternatives. */
+ push_dummy_failure,
+
+ /* Followed by two-byte relative address and two-byte number n.
+ After matching N times, jump to the address upon failure. */
+ succeed_n,
+
+ /* Followed by two-byte relative address, and two-byte number n.
+ Jump to the address N times, then fail. */
+ jump_n,
+
+ /* Set the following two-byte relative address to the
+ subsequent two-byte number. The address *includes* the two
+ bytes of number. */
+ set_number_at,
+
+ wordchar, /* Matches any word-constituent character. */
+ notwordchar, /* Matches any char that is not a word-constituent. */
+
+ wordbeg, /* Succeeds if at word beginning. */
+ wordend, /* Succeeds if at word end. */
+
+ wordbound, /* Succeeds if at a word boundary. */
+ notwordbound /* Succeeds if not at a word boundary. */
- ret = obscure_syntax;
- obscure_syntax = syntax;
- return ret;
-}
+#ifdef emacs
+ ,before_dot, /* Succeeds if before point. */
+ at_dot, /* Succeeds if at point. */
+ after_dot, /* Succeeds if after point. */
+
+ /* Matches any character whose syntax is specified. Followed by
+ a byte which contains a syntax code, e.g., Sword. */
+ syntaxspec,
+
+ /* Matches any character whose syntax is not that specified. */
+ notsyntaxspec
+#endif /* emacs */
+} re_opcode_t;
-/* re_compile_pattern takes a regular-expression string
- and converts it into a buffer full of byte commands for matching.
-
- PATTERN is the address of the pattern string
- SIZE is the length of it.
- BUFP is a struct re_pattern_buffer * which points to the info
- on where to store the byte commands.
- This structure contains a char * which points to the
- actual space, which should have been obtained with malloc.
- re_compile_pattern may use realloc to grow the buffer space.
-
- The number of bytes of commands can be found out by looking in
- the struct re_pattern_buffer that bufp pointed to,
- after re_compile_pattern returns.
-*/
+/* Common operations on the compiled pattern. */
-#define PATPUSH(ch) (*b++ = (char) (ch))
+/* Store NUMBER in two contiguous bytes starting at DESTINATION. */
-#define PATFETCH(c) \
- {if (p == pend) goto end_of_pattern; \
- c = * (unsigned char *) p++; \
- if (translate) c = translate[c]; }
+#define STORE_NUMBER(destination, number) \
+ do { \
+ (destination)[0] = (number) & 0377; \
+ (destination)[1] = (number) >> 8; \
+ } while (0)
-#define PATFETCH_RAW(c) \
- {if (p == pend) goto end_of_pattern; \
- c = * (unsigned char *) p++; }
+/* Same as STORE_NUMBER, except increment DESTINATION to
+ the byte after where the number is stored. Therefore, DESTINATION
+ must be an lvalue. */
-#define PATUNFETCH p--
+#define STORE_NUMBER_AND_INCR(destination, number) \
+ do { \
+ STORE_NUMBER (destination, number); \
+ (destination) += 2; \
+ } while (0)
-#define EXTEND_BUFFER \
- { char *old_buffer = bufp->buffer; \
- if (bufp->allocated == (1<<16)) goto too_big; \
- bufp->allocated *= 2; \
- if (bufp->allocated > (1<<16)) bufp->allocated = (1<<16); \
- if (!(bufp->buffer = (char *) realloc (bufp->buffer, bufp->allocated))) \
- goto memory_exhausted; \
- c = bufp->buffer - old_buffer; \
- b += c; \
- if (fixup_jump) \
- fixup_jump += c; \
- if (laststart) \
- laststart += c; \
- begalt += c; \
- if (pending_exact) \
- pending_exact += c; \
- }
+/* Put into DESTINATION a number stored in two contiguous bytes starting
+ at SOURCE. */
-static void store_jump (), insert_jump ();
+#define EXTRACT_NUMBER(destination, source) \
+ do { \
+ (destination) = *(source) & 0377; \
+ (destination) += SIGN_EXTEND_CHAR (*((source) + 1)) << 8; \
+ } while (0)
-char *
-re_compile_pattern (pattern, size, bufp)
- char *pattern;
- int size;
- struct re_pattern_buffer *bufp;
+#ifdef DEBUG
+static void
+extract_number (dest, source)
+ int *dest;
+ unsigned char *source;
{
- register char *b = bufp->buffer;
- register char *p = pattern;
- char *pend = pattern + size;
- register unsigned c, c1;
- char *p1;
- unsigned char *translate = (unsigned char *) bufp->translate;
+ int temp = SIGN_EXTEND_CHAR (*(source + 1));
+ *dest = *source & 0377;
+ *dest += temp << 8;
+}
- /* address of the count-byte of the most recently inserted "exactn" command.
- This makes it possible to tell whether a new exact-match character
- can be added to that command or requires a new "exactn" command. */
-
- char *pending_exact = 0;
+#ifndef EXTRACT_MACROS /* To debug the macros. */
+#undef EXTRACT_NUMBER
+#define EXTRACT_NUMBER(dest, src) extract_number (&dest, src)
+#endif /* not EXTRACT_MACROS */
- /* address of the place where a forward-jump should go
- to the end of the containing expression.
- Each alternative of an "or", except the last, ends with a forward-jump
- of this sort. */
+#endif /* DEBUG */
- char *fixup_jump = 0;
+/* Same as EXTRACT_NUMBER, except increment SOURCE to after the number.
+ SOURCE must be an lvalue. */
- /* address of start of the most recently finished expression.
- This tells postfix * where to find the start of its operand. */
+#define EXTRACT_NUMBER_AND_INCR(destination, source) \
+ do { \
+ EXTRACT_NUMBER (destination, source); \
+ (source) += 2; \
+ } while (0)
- char *laststart = 0;
+#ifdef DEBUG
+static void
+extract_number_and_incr (destination, source)
+ int *destination;
+ unsigned char **source;
+{
+ extract_number (destination, *source);
+ *source += 2;
+}
- /* In processing a repeat, 1 means zero matches is allowed */
+#ifndef EXTRACT_MACROS
+#undef EXTRACT_NUMBER_AND_INCR
+#define EXTRACT_NUMBER_AND_INCR(dest, src) \
+ extract_number_and_incr (&dest, &src)
+#endif /* not EXTRACT_MACROS */
- char zero_times_ok;
+#endif /* DEBUG */
+
+/* If DEBUG is defined, Regex prints many voluminous messages about what
+ it is doing (if the variable `debug' is nonzero). If linked with the
+ main program in `iregex.c', you can enter patterns and strings
+ interactively. And if linked with the main program in `main.c' and
+ the other test files, you can run the already-written tests. */
- /* In processing a repeat, 1 means many matches is allowed */
+#ifdef DEBUG
- char many_times_ok;
+/* We use standard I/O for debugging. */
+#include <stdio.h>
- /* address of beginning of regexp, or inside of last \( */
+/* It is useful to test things that ``must'' be true when debugging. */
+#include <assert.h>
- char *begalt = b;
+static int debug = 0;
- /* Stack of information saved by \( and restored by \).
- Four stack elements are pushed by each \(:
- First, the value of b.
- Second, the value of fixup_jump.
- Third, the value of regnum.
- Fourth, the value of begalt. */
+#define DEBUG_STATEMENT(e) e
+#define DEBUG_PRINT1(x) if (debug) printf (x)
+#define DEBUG_PRINT2(x1, x2) if (debug) printf (x1, x2)
+#define DEBUG_PRINT3(x1, x2, x3) if (debug) printf (x1, x2, x3)
+#define DEBUG_PRINT4(x1, x2, x3, x4) if (debug) printf (x1, x2, x3, x4)
+#define DEBUG_PRINT_COMPILED_PATTERN(p, s, e) \
+ if (debug) print_partial_compiled_pattern (s, e)
+#define DEBUG_PRINT_DOUBLE_STRING(w, s1, sz1, s2, sz2) \
+ if (debug) print_double_string (w, s1, sz1, s2, sz2)
- int stackb[40];
- int *stackp = stackb;
- int *stacke = stackb + 40;
- int *stackt;
- /* Counts \('s as they are encountered. Remembered for the matching \),
- where it becomes the "register number" to put in the stop_memory command */
+extern void printchar ();
- int regnum = 1;
+/* Print the fastmap in human-readable form. */
- bufp->fastmap_accurate = 0;
+void
+print_fastmap (fastmap)
+ char *fastmap;
+{
+ unsigned was_a_range = 0;
+ unsigned i = 0;
+
+ while (i < (1 << BYTEWIDTH))
+ {
+ if (fastmap[i++])
+ {
+ was_a_range = 0;
+ printchar (i - 1);
+ while (i < (1 << BYTEWIDTH) && fastmap[i])
+ {
+ was_a_range = 1;
+ i++;
+ }
+ if (was_a_range)
+ {
+ printf ("-");
+ printchar (i - 1);
+ }
+ }
+ }
+ putchar ('\n');
+}
-#ifndef emacs
-#ifndef SYNTAX_TABLE
- /*
- * Initialize the syntax table.
- */
- init_syntax_once();
-#endif
-#endif
- if (bufp->allocated == 0)
+/* Print a compiled pattern string in human-readable form, starting at
+ the START pointer into it and ending just before the pointer END. */
+
+void
+print_partial_compiled_pattern (start, end)
+ unsigned char *start;
+ unsigned char *end;
+{
+ int mcnt, mcnt2;
+ unsigned char *p = start;
+ unsigned char *pend = end;
+
+ if (start == NULL)
{
- bufp->allocated = 28;
- if (bufp->buffer)
- /* EXTEND_BUFFER loses when bufp->allocated is 0 */
- bufp->buffer = (char *) realloc (bufp->buffer, 28);
- else
- /* Caller did not allocate a buffer. Do it for him */
- bufp->buffer = (char *) malloc (28);
- if (!bufp->buffer) goto memory_exhausted;
- begalt = b = bufp->buffer;
+ printf ("(null)\n");
+ return;
}
-
- while (p != pend)
+
+ /* Loop over pattern commands. */
+ while (p < pend)
{
- if (b - bufp->buffer > bufp->allocated - 10)
- /* Note that EXTEND_BUFFER clobbers c */
- EXTEND_BUFFER;
-
- PATFETCH (c);
+ printf ("%d:\t", p - start);
- switch (c)
+ switch ((re_opcode_t) *p++)
{
- case '$':
- if (obscure_syntax & RE_TIGHT_VBAR)
- {
- if (! (obscure_syntax & RE_CONTEXT_INDEP_OPS) && p != pend)
- goto normal_char;
- /* Make operand of last vbar end before this `$'. */
- if (fixup_jump)
- store_jump (fixup_jump, jump, b);
- fixup_jump = 0;
- PATPUSH (endline);
- break;
- }
+ case no_op:
+ printf ("/no_op");
+ break;
- /* $ means succeed if at end of line, but only in special contexts.
- If randomly in the middle of a pattern, it is a normal character. */
- if (p == pend || *p == '\n'
- || (obscure_syntax & RE_CONTEXT_INDEP_OPS)
- || (obscure_syntax & RE_NO_BK_PARENS
- ? *p == ')'
- : *p == '\\' && p[1] == ')')
- || (obscure_syntax & RE_NO_BK_VBAR
- ? *p == '|'
- : *p == '\\' && p[1] == '|'))
+ case exactn:
+ mcnt = *p++;
+ printf ("/exactn/%d", mcnt);
+ do
{
- PATPUSH (endline);
- break;
- }
- goto normal_char;
+ putchar ('/');
+ printchar (*p++);
+ }
+ while (--mcnt);
+ break;
- case '^':
- /* ^ means succeed if at beg of line, but only if no preceding pattern. */
+ case start_memory:
+ mcnt = *p++;
+ printf ("/start_memory/%d/%d", mcnt, *p++);
+ break;
- if (laststart && p[-2] != '\n'
- && ! (obscure_syntax & RE_CONTEXT_INDEP_OPS))
- goto normal_char;
- if (obscure_syntax & RE_TIGHT_VBAR)
- {
- if (p != pattern + 1
- && ! (obscure_syntax & RE_CONTEXT_INDEP_OPS))
- goto normal_char;
- PATPUSH (begline);
- begalt = b;
- }
- else
- PATPUSH (begline);
+ case stop_memory:
+ mcnt = *p++;
+ printf ("/stop_memory/%d/%d", mcnt, *p++);
+ break;
+
+ case duplicate:
+ printf ("/duplicate/%d", *p++);
break;
- case '+':
- case '?':
- if (obscure_syntax & RE_BK_PLUS_QM)
- goto normal_char;
- handle_plus:
- case '*':
- /* If there is no previous pattern, char not special. */
- if (!laststart && ! (obscure_syntax & RE_CONTEXT_INDEP_OPS))
- goto normal_char;
- /* If there is a sequence of repetition chars,
- collapse it down to equivalent to just one. */
- zero_times_ok = 0;
- many_times_ok = 0;
- while (1)
- {
- zero_times_ok |= c != '+';
- many_times_ok |= c != '?';
- if (p == pend)
- break;
- PATFETCH (c);
- if (c == '*')
- ;
- else if (!(obscure_syntax & RE_BK_PLUS_QM)
- && (c == '+' || c == '?'))
- ;
- else if ((obscure_syntax & RE_BK_PLUS_QM)
- && c == '\\')
+ case anychar:
+ printf ("/anychar");
+ break;
+
+ case charset:
+ case charset_not:
+ {
+ register int c, last = -100;
+ register int in_range = 0;
+
+ printf ("/charset [%s",
+ (re_opcode_t) *(p - 1) == charset_not ? "^" : "");
+
+ assert (p + *p < pend);
+
+ for (c = 0; c < 256; c++)
+ if (c / 8 < *p
+ && (p[1 + (c/8)] & (1 << (c % 8))))
{
- int c1;
- PATFETCH (c1);
- if (!(c1 == '+' || c1 == '?'))
+ /* Are we starting a range? */
+ if (last + 1 == c && ! in_range)
{
- PATUNFETCH;
- PATUNFETCH;
- break;
+ putchar ('-');
+ in_range = 1;
}
- c = c1;
- }
- else
- {
- PATUNFETCH;
- break;
- }
- }
+ /* Have we broken a range? */
+ else if (last + 1 != c && in_range)
+ {
+ printchar (last);
+ in_range = 0;
+ }
+
+ if (! in_range)
+ printchar (c);
- /* Star, etc. applied to an empty pattern is equivalent
- to an empty pattern. */
- if (!laststart)
- break;
+ last = c;
+ }
- /* Now we know whether 0 matches is allowed,
- and whether 2 or more matches is allowed. */
- if (many_times_ok)
- {
- /* If more than one repetition is allowed,
- put in a backward jump at the end. */
- store_jump (b, maybe_finalize_jump, laststart - 3);
- b += 3;
- }
- insert_jump (on_failure_jump, laststart, b + 3, b);
- pending_exact = 0;
- b += 3;
- if (!zero_times_ok)
- {
- /* At least one repetition required: insert before the loop
- a skip over the initial on-failure-jump instruction */
- insert_jump (dummy_failure_jump, laststart, laststart + 6, b);
- b += 3;
- }
+ if (in_range)
+ printchar (last);
+
+ putchar (']');
+
+ p += 1 + *p;
+ }
break;
- case '.':
- laststart = b;
- PATPUSH (anychar);
+ case begline:
+ printf ("/begline");
+ break;
+
+ case endline:
+ printf ("/endline");
+ break;
+
+ case on_failure_jump:
+ extract_number_and_incr (&mcnt, &p);
+ printf ("/on_failure_jump to %d", p + mcnt - start);
+ break;
+
+ case on_failure_keep_string_jump:
+ extract_number_and_incr (&mcnt, &p);
+ printf ("/on_failure_keep_string_jump to %d", p + mcnt - start);
+ break;
+
+ case dummy_failure_jump:
+ extract_number_and_incr (&mcnt, &p);
+ printf ("/dummy_failure_jump to %d", p + mcnt - start);
+ break;
+
+ case push_dummy_failure:
+ printf ("/push_dummy_failure");
+ break;
+
+ case maybe_pop_jump:
+ extract_number_and_incr (&mcnt, &p);
+ printf ("/maybe_pop_jump to %d", p + mcnt - start);
break;
- case '[':
- while (b - bufp->buffer
- > bufp->allocated - 3 - (1 << BYTEWIDTH) / BYTEWIDTH)
- /* Note that EXTEND_BUFFER clobbers c */
- EXTEND_BUFFER;
+ case pop_failure_jump:
+ extract_number_and_incr (&mcnt, &p);
+ printf ("/pop_failure_jump to %d", p + mcnt - start);
+ break;
+
+ case jump_past_alt:
+ extract_number_and_incr (&mcnt, &p);
+ printf ("/jump_past_alt to %d", p + mcnt - start);
+ break;
+
+ case jump:
+ extract_number_and_incr (&mcnt, &p);
+ printf ("/jump to %d", p + mcnt - start);
+ break;
- laststart = b;
- if (*p == '^')
- PATPUSH (charset_not), p++;
- else
- PATPUSH (charset);
- p1 = p;
-
- PATPUSH ((1 << BYTEWIDTH) / BYTEWIDTH);
- /* Clear the whole map */
- bzero (b, (1 << BYTEWIDTH) / BYTEWIDTH);
- /* Read in characters and ranges, setting map bits */
- while (1)
- {
- PATFETCH (c);
- if (c == ']' && p != p1 + 1) break;
- if (*p == '-' && p[1] != ']')
- {
- PATFETCH (c1);
- PATFETCH (c1);
- while (c <= c1)
- b[c / BYTEWIDTH] |= 1 << (c % BYTEWIDTH), c++;
- }
- else
- {
- b[c / BYTEWIDTH] |= 1 << (c % BYTEWIDTH);
- }
- }
- /* Discard any bitmap bytes that are all 0 at the end of the map.
- Decrement the map-length byte too. */
- while ((int) b[-1] > 0 && b[b[-1] - 1] == 0)
- b[-1]--;
- b += b[-1];
+ case succeed_n:
+ extract_number_and_incr (&mcnt, &p);
+ extract_number_and_incr (&mcnt2, &p);
+ printf ("/succeed_n to %d, %d times", p + mcnt - start, mcnt2);
+ break;
+
+ case jump_n:
+ extract_number_and_incr (&mcnt, &p);
+ extract_number_and_incr (&mcnt2, &p);
+ printf ("/jump_n to %d, %d times", p + mcnt - start, mcnt2);
+ break;
+
+ case set_number_at:
+ extract_number_and_incr (&mcnt, &p);
+ extract_number_and_incr (&mcnt2, &p);
+ printf ("/set_number_at location %d to %d", p + mcnt - start, mcnt2);
+ break;
+
+ case wordbound:
+ printf ("/wordbound");
+ break;
+
+ case notwordbound:
+ printf ("/notwordbound");
+ break;
+
+ case wordbeg:
+ printf ("/wordbeg");
+ break;
+
+ case wordend:
+ printf ("/wordend");
+
+#ifdef emacs
+ case before_dot:
+ printf ("/before_dot");
+ break;
+
+ case at_dot:
+ printf ("/at_dot");
+ break;
+
+ case after_dot:
+ printf ("/after_dot");
+ break;
+
+ case syntaxspec:
+ printf ("/syntaxspec");
+ mcnt = *p++;
+ printf ("/%d", mcnt);
+ break;
+
+ case notsyntaxspec:
+ printf ("/notsyntaxspec");
+ mcnt = *p++;
+ printf ("/%d", mcnt);
+ break;
+#endif /* emacs */
+
+ case wordchar:
+ printf ("/wordchar");
+ break;
+
+ case notwordchar:
+ printf ("/notwordchar");
+ break;
+
+ case begbuf:
+ printf ("/begbuf");
+ break;
+
+ case endbuf:
+ printf ("/endbuf");
+ break;
+
+ default:
+ printf ("?%d", *(p-1));
+ }
+
+ putchar ('\n');
+ }
+
+ printf ("%d:\tend of pattern.\n", p - start);
+}
+
+
+void
+print_compiled_pattern (bufp)
+ struct re_pattern_buffer *bufp;
+{
+ unsigned char *buffer = bufp->buffer;
+
+ print_partial_compiled_pattern (buffer, buffer + bufp->used);
+ printf ("%d bytes used/%d bytes allocated.\n", bufp->used, bufp->allocated);
+
+ if (bufp->fastmap_accurate && bufp->fastmap)
+ {
+ printf ("fastmap: ");
+ print_fastmap (bufp->fastmap);
+ }
+
+ printf ("re_nsub: %d\t", bufp->re_nsub);
+ printf ("regs_alloc: %d\t", bufp->regs_allocated);
+ printf ("can_be_null: %d\t", bufp->can_be_null);
+ printf ("newline_anchor: %d\n", bufp->newline_anchor);
+ printf ("no_sub: %d\t", bufp->no_sub);
+ printf ("not_bol: %d\t", bufp->not_bol);
+ printf ("not_eol: %d\t", bufp->not_eol);
+ printf ("syntax: %d\n", bufp->syntax);
+ /* Perhaps we should print the translate table? */
+}
+
+
+void
+print_double_string (where, string1, size1, string2, size2)
+ const char *where;
+ const char *string1;
+ const char *string2;
+ int size1;
+ int size2;
+{
+ unsigned this_char;
+
+ if (where == NULL)
+ printf ("(null)");
+ else
+ {
+ if (FIRST_STRING_P (where))
+ {
+ for (this_char = where - string1; this_char < size1; this_char++)
+ printchar (string1[this_char]);
+
+ where = string2;
+ }
+
+ for (this_char = where - string2; this_char < size2; this_char++)
+ printchar (string2[this_char]);
+ }
+}
+
+#else /* not DEBUG */
+
+#undef assert
+#define assert(e)
+
+#define DEBUG_STATEMENT(e)
+#define DEBUG_PRINT1(x)
+#define DEBUG_PRINT2(x1, x2)
+#define DEBUG_PRINT3(x1, x2, x3)
+#define DEBUG_PRINT4(x1, x2, x3, x4)
+#define DEBUG_PRINT_COMPILED_PATTERN(p, s, e)
+#define DEBUG_PRINT_DOUBLE_STRING(w, s1, sz1, s2, sz2)
+
+#endif /* not DEBUG */
+
+/* Set by `re_set_syntax' to the current regexp syntax to recognize. Can
+ also be assigned to arbitrarily: each pattern buffer stores its own
+ syntax, so it can be changed between regex compilations. */
+reg_syntax_t re_syntax_options = RE_SYNTAX_EMACS;
+
+
+/* Specify the precise syntax of regexps for compilation. This provides
+ for compatibility for various utilities which historically have
+ different, incompatible syntaxes.
+
+ The argument SYNTAX is a bit mask comprised of the various bits
+ defined in regex.h. We return the old syntax. */
+
+reg_syntax_t
+re_set_syntax (syntax)
+ reg_syntax_t syntax;
+{
+ reg_syntax_t ret = re_syntax_options;
+
+ re_syntax_options = syntax;
+ return ret;
+}
+
+/* This table gives an error message for each of the error codes listed
+ in regex.h. Obviously the order here has to be same as there. */
+
+static const char *re_error_msg[] =
+ { NULL, /* REG_NOERROR */
+ "No match", /* REG_NOMATCH */
+ "Invalid regular expression", /* REG_BADPAT */
+ "Invalid collation character", /* REG_ECOLLATE */
+ "Invalid character class name", /* REG_ECTYPE */
+ "Trailing backslash", /* REG_EESCAPE */
+ "Invalid back reference", /* REG_ESUBREG */
+ "Unmatched [ or [^", /* REG_EBRACK */
+ "Unmatched ( or \\(", /* REG_EPAREN */
+ "Unmatched \\{", /* REG_EBRACE */
+ "Invalid content of \\{\\}", /* REG_BADBR */
+ "Invalid range end", /* REG_ERANGE */
+ "Memory exhausted", /* REG_ESPACE */
+ "Invalid preceding regular expression", /* REG_BADRPT */
+ "Premature end of regular expression", /* REG_EEND */
+ "Regular expression too big", /* REG_ESIZE */
+ "Unmatched ) or \\)", /* REG_ERPAREN */
+ };
+
+/* Subroutine declarations and macros for regex_compile. */
+
+static void store_op1 (), store_op2 ();
+static void insert_op1 (), insert_op2 ();
+static boolean at_begline_loc_p (), at_endline_loc_p ();
+static boolean group_in_compile_stack ();
+static reg_errcode_t compile_range ();
+
+/* Fetch the next character in the uncompiled pattern---translating it
+ if necessary. Also cast from a signed character in the constant
+ string passed to us by the user to an unsigned char that we can use
+ as an array index (in, e.g., `translate'). */
+#define PATFETCH(c) \
+ do {if (p == pend) return REG_EEND; \
+ c = (unsigned char) *p++; \
+ if (translate) c = translate[c]; \
+ } while (0)
+
+/* Fetch the next character in the uncompiled pattern, with no
+ translation. */
+#define PATFETCH_RAW(c) \
+ do {if (p == pend) return REG_EEND; \
+ c = (unsigned char) *p++; \
+ } while (0)
+
+/* Go backwards one character in the pattern. */
+#define PATUNFETCH p--
+
+
+/* If `translate' is non-null, return translate[D], else just D. We
+ cast the subscript to translate because some data is declared as
+ `char *', to avoid warnings when a string constant is passed. But
+ when we use a character as a subscript we must make it unsigned. */
+#define TRANSLATE(d) (translate ? translate[(unsigned char) (d)] : (d))
+
+
+/* Macros for outputting the compiled pattern into `buffer'. */
+
+/* If the buffer isn't allocated when it comes in, use this. */
+#define INIT_BUF_SIZE 32
+
+/* Make sure we have at least N more bytes of space in buffer. */
+#define GET_BUFFER_SPACE(n) \
+ while (b - bufp->buffer + (n) > bufp->allocated) \
+ EXTEND_BUFFER ()
+
+/* Make sure we have one more byte of buffer space and then add C to it. */
+#define BUF_PUSH(c) \
+ do { \
+ GET_BUFFER_SPACE (1); \
+ *b++ = (unsigned char) (c); \
+ } while (0)
+
+
+/* Ensure we have two more bytes of buffer space and then append C1 and C2. */
+#define BUF_PUSH_2(c1, c2) \
+ do { \
+ GET_BUFFER_SPACE (2); \
+ *b++ = (unsigned char) (c1); \
+ *b++ = (unsigned char) (c2); \
+ } while (0)
+
+
+/* As with BUF_PUSH_2, except for three bytes. */
+#define BUF_PUSH_3(c1, c2, c3) \
+ do { \
+ GET_BUFFER_SPACE (3); \
+ *b++ = (unsigned char) (c1); \
+ *b++ = (unsigned char) (c2); \
+ *b++ = (unsigned char) (c3); \
+ } while (0)
+
+
+/* Store a jump with opcode OP at LOC to location TO. We store a
+ relative address offset by the three bytes the jump itself occupies. */
+#define STORE_JUMP(op, loc, to) \
+ store_op1 (op, loc, (to) - (loc) - 3)
+
+/* Likewise, for a two-argument jump. */
+#define STORE_JUMP2(op, loc, to, arg) \
+ store_op2 (op, loc, (to) - (loc) - 3, arg)
+
+/* Like `STORE_JUMP', but for inserting. Assume `b' is the buffer end. */
+#define INSERT_JUMP(op, loc, to) \
+ insert_op1 (op, loc, (to) - (loc) - 3, b)
+
+/* Like `STORE_JUMP2', but for inserting. Assume `b' is the buffer end. */
+#define INSERT_JUMP2(op, loc, to, arg) \
+ insert_op2 (op, loc, (to) - (loc) - 3, arg, b)
+
+
+/* This is not an arbitrary limit: the arguments which represent offsets
+ into the pattern are two bytes long. So if 2^16 bytes turns out to
+ be too small, many things would have to change. */
+#define MAX_BUF_SIZE (1L << 16)
+
+
+/* Extend the buffer by twice its current size via realloc and
+ reset the pointers that pointed into the old block to point to the
+ correct places in the new one. If extending the buffer results in it
+ being larger than MAX_BUF_SIZE, then flag memory exhausted. */
+#define EXTEND_BUFFER() \
+ do { \
+ unsigned char *old_buffer = bufp->buffer; \
+ if (bufp->allocated == MAX_BUF_SIZE) \
+ return REG_ESIZE; \
+ bufp->allocated <<= 1; \
+ if (bufp->allocated > MAX_BUF_SIZE) \
+ bufp->allocated = MAX_BUF_SIZE; \
+ bufp->buffer = (unsigned char *) realloc (bufp->buffer, bufp->allocated);\
+ if (bufp->buffer == NULL) \
+ return REG_ESPACE; \
+ /* If the buffer moved, move all the pointers into it. */ \
+ if (old_buffer != bufp->buffer) \
+ { \
+ b = (b - old_buffer) + bufp->buffer; \
+ begalt = (begalt - old_buffer) + bufp->buffer; \
+ if (fixup_alt_jump) \
+ fixup_alt_jump = (fixup_alt_jump - old_buffer) + bufp->buffer;\
+ if (laststart) \
+ laststart = (laststart - old_buffer) + bufp->buffer; \
+ if (pending_exact) \
+ pending_exact = (pending_exact - old_buffer) + bufp->buffer; \
+ } \
+ } while (0)
+
+
+/* Since we have one byte reserved for the register number argument to
+ {start,stop}_memory, the maximum number of groups we can report
+ things about is what fits in that byte. */
+#define MAX_REGNUM 255
+
+/* But patterns can have more than `MAX_REGNUM' registers. We just
+ ignore the excess. */
+typedef unsigned regnum_t;
+
+
+/* Macros for the compile stack. */
+
+/* Since offsets can go either forwards or backwards, this type needs to
+ be able to hold values from -(MAX_BUF_SIZE - 1) to MAX_BUF_SIZE - 1. */
+typedef int pattern_offset_t;
+
+typedef struct
+{
+ pattern_offset_t begalt_offset;
+ pattern_offset_t fixup_alt_jump;
+ pattern_offset_t inner_group_offset;
+ pattern_offset_t laststart_offset;
+ regnum_t regnum;
+} compile_stack_elt_t;
+
+
+typedef struct
+{
+ compile_stack_elt_t *stack;
+ unsigned size;
+ unsigned avail; /* Offset of next open position. */
+} compile_stack_type;
+
+
+#define INIT_COMPILE_STACK_SIZE 32
+
+#define COMPILE_STACK_EMPTY (compile_stack.avail == 0)
+#define COMPILE_STACK_FULL (compile_stack.avail == compile_stack.size)
+
+/* The next available element. */
+#define COMPILE_STACK_TOP (compile_stack.stack[compile_stack.avail])
+
+
+/* Set the bit for character C in a list. */
+#define SET_LIST_BIT(c) \
+ (b[((unsigned char) (c)) / BYTEWIDTH] \
+ |= 1 << (((unsigned char) c) % BYTEWIDTH))
+
+
+/* Get the next unsigned number in the uncompiled pattern. */
+#define GET_UNSIGNED_NUMBER(num) \
+ { if (p != pend) \
+ { \
+ PATFETCH (c); \
+ while (ISDIGIT (c)) \
+ { \
+ if (num < 0) \
+ num = 0; \
+ num = num * 10 + c - '0'; \
+ if (p == pend) \
+ break; \
+ PATFETCH (c); \
+ } \
+ } \
+ }
+
+#define CHAR_CLASS_MAX_LENGTH 6 /* Namely, `xdigit'. */
+
+#define IS_CHAR_CLASS(string) \
+ (STREQ (string, "alpha") || STREQ (string, "upper") \
+ || STREQ (string, "lower") || STREQ (string, "digit") \
+ || STREQ (string, "alnum") || STREQ (string, "xdigit") \
+ || STREQ (string, "space") || STREQ (string, "print") \
+ || STREQ (string, "punct") || STREQ (string, "graph") \
+ || STREQ (string, "cntrl") || STREQ (string, "blank"))
+
+/* `regex_compile' compiles PATTERN (of length SIZE) according to SYNTAX.
+ Returns one of error codes defined in `regex.h', or zero for success.
+
+ Assumes the `allocated' (and perhaps `buffer') and `translate'
+ fields are set in BUFP on entry.
+
+ If it succeeds, results are put in BUFP (if it returns an error, the
+ contents of BUFP are undefined):
+ `buffer' is the compiled pattern;
+ `syntax' is set to SYNTAX;
+ `used' is set to the length of the compiled pattern;
+ `fastmap_accurate' is zero;
+ `re_nsub' is the number of subexpressions in PATTERN;
+ `not_bol' and `not_eol' are zero;
+
+ The `fastmap' and `newline_anchor' fields are neither
+ examined nor set. */
+
+static reg_errcode_t
+regex_compile (pattern, size, syntax, bufp)
+ const char *pattern;
+ int size;
+ reg_syntax_t syntax;
+ struct re_pattern_buffer *bufp;
+{
+ /* We fetch characters from PATTERN here. Even though PATTERN is
+ `char *' (i.e., signed), we declare these variables as unsigned, so
+ they can be reliably used as array indices. */
+ register unsigned char c, c1;
+
+ /* A random tempory spot in PATTERN. */
+ const char *p1;
+
+ /* Points to the end of the buffer, where we should append. */
+ register unsigned char *b;
+
+ /* Keeps track of unclosed groups. */
+ compile_stack_type compile_stack;
+
+ /* Points to the current (ending) position in the pattern. */
+ const char *p = pattern;
+ const char *pend = pattern + size;
+
+ /* How to translate the characters in the pattern. */
+ char *translate = bufp->translate;
+
+ /* Address of the count-byte of the most recently inserted `exactn'
+ command. This makes it possible to tell if a new exact-match
+ character can be added to that command or if the character requires
+ a new `exactn' command. */
+ unsigned char *pending_exact = 0;
+
+ /* Address of start of the most recently finished expression.
+ This tells, e.g., postfix * where to find the start of its
+ operand. Reset at the beginning of groups and alternatives. */
+ unsigned char *laststart = 0;
+
+ /* Address of beginning of regexp, or inside of last group. */
+ unsigned char *begalt;
+
+ /* Place in the uncompiled pattern (i.e., the {) to
+ which to go back if the interval is invalid. */
+ const char *beg_interval;
+
+ /* Address of the place where a forward jump should go to the end of
+ the containing expression. Each alternative of an `or' -- except the
+ last -- ends with a forward jump of this sort. */
+ unsigned char *fixup_alt_jump = 0;
+
+ /* Counts open-groups as they are encountered. Remembered for the
+ matching close-group on the compile stack, so the same register
+ number is put in the stop_memory as the start_memory. */
+ regnum_t regnum = 0;
+
+#ifdef DEBUG
+ DEBUG_PRINT1 ("\nCompiling pattern: ");
+ if (debug)
+ {
+ unsigned debug_count;
+
+ for (debug_count = 0; debug_count < size; debug_count++)
+ printchar (pattern[debug_count]);
+ putchar ('\n');
+ }
+#endif /* DEBUG */
+
+ /* Initialize the compile stack. */
+ compile_stack.stack = TALLOC (INIT_COMPILE_STACK_SIZE, compile_stack_elt_t);
+ if (compile_stack.stack == NULL)
+ return REG_ESPACE;
+
+ compile_stack.size = INIT_COMPILE_STACK_SIZE;
+ compile_stack.avail = 0;
+
+ /* Initialize the pattern buffer. */
+ bufp->syntax = syntax;
+ bufp->fastmap_accurate = 0;
+ bufp->not_bol = bufp->not_eol = 0;
+
+ /* Set `used' to zero, so that if we return an error, the pattern
+ printer (for debugging) will think there's no pattern. We reset it
+ at the end. */
+ bufp->used = 0;
+
+ /* Always count groups, whether or not bufp->no_sub is set. */
+ bufp->re_nsub = 0;
+
+#if !defined (emacs) && !defined (SYNTAX_TABLE)
+ /* Initialize the syntax table. */
+ init_syntax_once ();
+#endif
+
+ if (bufp->allocated == 0)
+ {
+ if (bufp->buffer)
+ { /* If zero allocated, but buffer is non-null, try to realloc
+ enough space. This loses if buffer's address is bogus, but
+ that is the user's responsibility. */
+ RETALLOC (bufp->buffer, INIT_BUF_SIZE, unsigned char);
+ }
+ else
+ { /* Caller did not allocate a buffer. Do it for them. */
+ bufp->buffer = TALLOC (INIT_BUF_SIZE, unsigned char);
+ }
+ if (!bufp->buffer) return REG_ESPACE;
+
+ bufp->allocated = INIT_BUF_SIZE;
+ }
+
+ begalt = b = bufp->buffer;
+
+ /* Loop through the uncompiled pattern until we're at the end. */
+ while (p != pend)
+ {
+ PATFETCH (c);
+
+ switch (c)
+ {
+ case '^':
+ {
+ if ( /* If at start of pattern, it's an operator. */
+ p == pattern + 1
+ /* If context independent, it's an operator. */
+ || syntax & RE_CONTEXT_INDEP_ANCHORS
+ /* Otherwise, depends on what's come before. */
+ || at_begline_loc_p (pattern, p, syntax))
+ BUF_PUSH (begline);
+ else
+ goto normal_char;
+ }
+ break;
+
+
+ case '$':
+ {
+ if ( /* If at end of pattern, it's an operator. */
+ p == pend
+ /* If context independent, it's an operator. */
+ || syntax & RE_CONTEXT_INDEP_ANCHORS
+ /* Otherwise, depends on what's next. */
+ || at_endline_loc_p (p, pend, syntax))
+ BUF_PUSH (endline);
+ else
+ goto normal_char;
+ }
+ break;
+
+
+ case '+':
+ case '?':
+ if ((syntax & RE_BK_PLUS_QM)
+ || (syntax & RE_LIMITED_OPS))
+ goto normal_char;
+ handle_plus:
+ case '*':
+ /* If there is no previous pattern... */
+ if (!laststart)
+ {
+ if (syntax & RE_CONTEXT_INVALID_OPS)
+ return REG_BADRPT;
+ else if (!(syntax & RE_CONTEXT_INDEP_OPS))
+ goto normal_char;
+ }
+
+ {
+ /* Are we optimizing this jump? */
+ boolean keep_string_p = false;
+
+ /* 1 means zero (many) matches is allowed. */
+ char zero_times_ok = 0, many_times_ok = 0;
+
+ /* If there is a sequence of repetition chars, collapse it
+ down to just one (the right one). We can't combine
+ interval operators with these because of, e.g., `a{2}*',
+ which should only match an even number of `a's. */
+
+ for (;;)
+ {
+ zero_times_ok |= c != '+';
+ many_times_ok |= c != '?';
+
+ if (p == pend)
+ break;
+
+ PATFETCH (c);
+
+ if (c == '*'
+ || (!(syntax & RE_BK_PLUS_QM) && (c == '+' || c == '?')))
+ ;
+
+ else if (syntax & RE_BK_PLUS_QM && c == '\\')
+ {
+ if (p == pend) return REG_EESCAPE;
+
+ PATFETCH (c1);
+ if (!(c1 == '+' || c1 == '?'))
+ {
+ PATUNFETCH;
+ PATUNFETCH;
+ break;
+ }
+
+ c = c1;
+ }
+ else
+ {
+ PATUNFETCH;
+ break;
+ }
+
+ /* If we get here, we found another repeat character. */
+ }
+
+ /* Star, etc. applied to an empty pattern is equivalent
+ to an empty pattern. */
+ if (!laststart)
+ break;
+
+ /* Now we know whether or not zero matches is allowed
+ and also whether or not two or more matches is allowed. */
+ if (many_times_ok)
+ { /* More than one repetition is allowed, so put in at the
+ end a backward relative jump from `b' to before the next
+ jump we're going to put in below (which jumps from
+ laststart to after this jump).
+
+ But if we are at the `*' in the exact sequence `.*\n',
+ insert an unconditional jump backwards to the .,
+ instead of the beginning of the loop. This way we only
+ push a failure point once, instead of every time
+ through the loop. */
+ assert (p - 1 > pattern);
+
+ /* Allocate the space for the jump. */
+ GET_BUFFER_SPACE (3);
+
+ /* We know we are not at the first character of the pattern,
+ because laststart was nonzero. And we've already
+ incremented `p', by the way, to be the character after
+ the `*'. Do we have to do something analogous here
+ for null bytes, because of RE_DOT_NOT_NULL? */
+ if (TRANSLATE (*(p - 2)) == TRANSLATE ('.')
+ && zero_times_ok
+ && p < pend && TRANSLATE (*p) == TRANSLATE ('\n')
+ && !(syntax & RE_DOT_NEWLINE))
+ { /* We have .*\n. */
+ STORE_JUMP (jump, b, laststart);
+ keep_string_p = true;
+ }
+ else
+ /* Anything else. */
+ STORE_JUMP (maybe_pop_jump, b, laststart - 3);
+
+ /* We've added more stuff to the buffer. */
+ b += 3;
+ }
+
+ /* On failure, jump from laststart to b + 3, which will be the
+ end of the buffer after this jump is inserted. */
+ GET_BUFFER_SPACE (3);
+ INSERT_JUMP (keep_string_p ? on_failure_keep_string_jump
+ : on_failure_jump,
+ laststart, b + 3);
+ pending_exact = 0;
+ b += 3;
+
+ if (!zero_times_ok)
+ {
+ /* At least one repetition is required, so insert a
+ `dummy_failure_jump' before the initial
+ `on_failure_jump' instruction of the loop. This
+ effects a skip over that instruction the first time
+ we hit that loop. */
+ GET_BUFFER_SPACE (3);
+ INSERT_JUMP (dummy_failure_jump, laststart, laststart + 6);
+ b += 3;
+ }
+ }
break;
+
+ case '.':
+ laststart = b;
+ BUF_PUSH (anychar);
+ break;
+
+
+ case '[':
+ {
+ boolean had_char_class = false;
+
+ if (p == pend) return REG_EBRACK;
+
+ /* Ensure that we have enough space to push a charset: the
+ opcode, the length count, and the bitset; 34 bytes in all. */
+ GET_BUFFER_SPACE (34);
+
+ laststart = b;
+
+ /* We test `*p == '^' twice, instead of using an if
+ statement, so we only need one BUF_PUSH. */
+ BUF_PUSH (*p == '^' ? charset_not : charset);
+ if (*p == '^')
+ p++;
+
+ /* Remember the first position in the bracket expression. */
+ p1 = p;
+
+ /* Push the number of bytes in the bitmap. */
+ BUF_PUSH ((1 << BYTEWIDTH) / BYTEWIDTH);
+
+ /* Clear the whole map. */
+ bzero (b, (1 << BYTEWIDTH) / BYTEWIDTH);
+
+ /* charset_not matches newline according to a syntax bit. */
+ if ((re_opcode_t) b[-2] == charset_not
+ && (syntax & RE_HAT_LISTS_NOT_NEWLINE))
+ SET_LIST_BIT ('\n');
+
+ /* Read in characters and ranges, setting map bits. */
+ for (;;)
+ {
+ if (p == pend) return REG_EBRACK;
+
+ PATFETCH (c);
+
+ /* \ might escape characters inside [...] and [^...]. */
+ if ((syntax & RE_BACKSLASH_ESCAPE_IN_LISTS) && c == '\\')
+ {
+ if (p == pend) return REG_EESCAPE;
+
+ PATFETCH (c1);
+ SET_LIST_BIT (c1);
+ continue;
+ }
+
+ /* Could be the end of the bracket expression. If it's
+ not (i.e., when the bracket expression is `[]' so
+ far), the ']' character bit gets set way below. */
+ if (c == ']' && p != p1 + 1)
+ break;
+
+ /* Look ahead to see if it's a range when the last thing
+ was a character class. */
+ if (had_char_class && c == '-' && *p != ']')
+ return REG_ERANGE;
+
+ /* Look ahead to see if it's a range when the last thing
+ was a character: if this is a hyphen not at the
+ beginning or the end of a list, then it's the range
+ operator. */
+ if (c == '-'
+ && !(p - 2 >= pattern && p[-2] == '[')
+ && !(p - 3 >= pattern && p[-3] == '[' && p[-2] == '^')
+ && *p != ']')
+ {
+ reg_errcode_t ret
+ = compile_range (&p, pend, translate, syntax, b);
+ if (ret != REG_NOERROR) return ret;
+ }
+
+ else if (p[0] == '-' && p[1] != ']')
+ { /* This handles ranges made up of characters only. */
+ reg_errcode_t ret;
+
+ /* Move past the `-'. */
+ PATFETCH (c1);
+
+ ret = compile_range (&p, pend, translate, syntax, b);
+ if (ret != REG_NOERROR) return ret;
+ }
+
+ /* See if we're at the beginning of a possible character
+ class. */
+
+ else if (syntax & RE_CHAR_CLASSES && c == '[' && *p == ':')
+ { /* Leave room for the null. */
+ char str[CHAR_CLASS_MAX_LENGTH + 1];
+
+ PATFETCH (c);
+ c1 = 0;
+
+ /* If pattern is `[[:'. */
+ if (p == pend) return REG_EBRACK;
+
+ for (;;)
+ {
+ PATFETCH (c);
+ if (c == ':' || c == ']' || p == pend
+ || c1 == CHAR_CLASS_MAX_LENGTH)
+ break;
+ str[c1++] = c;
+ }
+ str[c1] = '\0';
+
+ /* If isn't a word bracketed by `[:' and:`]':
+ undo the ending character, the letters, and leave
+ the leading `:' and `[' (but set bits for them). */
+ if (c == ':' && *p == ']')
+ {
+ int ch;
+ boolean is_alnum = STREQ (str, "alnum");
+ boolean is_alpha = STREQ (str, "alpha");
+ boolean is_blank = STREQ (str, "blank");
+ boolean is_cntrl = STREQ (str, "cntrl");
+ boolean is_digit = STREQ (str, "digit");
+ boolean is_graph = STREQ (str, "graph");
+ boolean is_lower = STREQ (str, "lower");
+ boolean is_print = STREQ (str, "print");
+ boolean is_punct = STREQ (str, "punct");
+ boolean is_space = STREQ (str, "space");
+ boolean is_upper = STREQ (str, "upper");
+ boolean is_xdigit = STREQ (str, "xdigit");
+
+ if (!IS_CHAR_CLASS (str)) return REG_ECTYPE;
+
+ /* Throw away the ] at the end of the character
+ class. */
+ PATFETCH (c);
+
+ if (p == pend) return REG_EBRACK;
+
+ for (ch = 0; ch < 1 << BYTEWIDTH; ch++)
+ {
+ if ( (is_alnum && ISALNUM (ch))
+ || (is_alpha && ISALPHA (ch))
+ || (is_blank && ISBLANK (ch))
+ || (is_cntrl && ISCNTRL (ch))
+ || (is_digit && ISDIGIT (ch))
+ || (is_graph && ISGRAPH (ch))
+ || (is_lower && ISLOWER (ch))
+ || (is_print && ISPRINT (ch))
+ || (is_punct && ISPUNCT (ch))
+ || (is_space && ISSPACE (ch))
+ || (is_upper && ISUPPER (ch))
+ || (is_xdigit && ISXDIGIT (ch)))
+ SET_LIST_BIT (ch);
+ }
+ had_char_class = true;
+ }
+ else
+ {
+ c1++;
+ while (c1--)
+ PATUNFETCH;
+ SET_LIST_BIT ('[');
+ SET_LIST_BIT (':');
+ had_char_class = false;
+ }
+ }
+ else
+ {
+ had_char_class = false;
+ SET_LIST_BIT (c);
+ }
+ }
+
+ /* Discard any (non)matching list bytes that are all 0 at the
+ end of the map. Decrease the map-length byte too. */
+ while ((int) b[-1] > 0 && b[b[-1] - 1] == 0)
+ b[-1]--;
+ b += b[-1];
+ }
+ break;
+
+
case '(':
- if (! (obscure_syntax & RE_NO_BK_PARENS))
- goto normal_char;
- else
- goto handle_open;
+ if (syntax & RE_NO_BK_PARENS)
+ goto handle_open;
+ else
+ goto normal_char;
- case ')':
- if (! (obscure_syntax & RE_NO_BK_PARENS))
- goto normal_char;
- else
- goto handle_close;
- case '\n':
- if (! (obscure_syntax & RE_NEWLINE_OR))
- goto normal_char;
- else
- goto handle_bar;
+ case ')':
+ if (syntax & RE_NO_BK_PARENS)
+ goto handle_close;
+ else
+ goto normal_char;
+
+
+ case '\n':
+ if (syntax & RE_NEWLINE_ALT)
+ goto handle_alt;
+ else
+ goto normal_char;
+
case '|':
- if (! (obscure_syntax & RE_NO_BK_VBAR))
- goto normal_char;
- else
- goto handle_bar;
+ if (syntax & RE_NO_BK_VBAR)
+ goto handle_alt;
+ else
+ goto normal_char;
+
+
+ case '{':
+ if (syntax & RE_INTERVALS && syntax & RE_NO_BK_BRACES)
+ goto handle_interval;
+ else
+ goto normal_char;
+
case '\\':
- if (p == pend) goto invalid_pattern;
- PATFETCH_RAW (c);
- switch (c)
- {
- case '(':
- if (obscure_syntax & RE_NO_BK_PARENS)
- goto normal_backsl;
- handle_open:
- if (stackp == stacke) goto nesting_too_deep;
- if (regnum < RE_NREGS)
- {
- PATPUSH (start_memory);
- PATPUSH (regnum);
- }
- *stackp++ = b - bufp->buffer;
- *stackp++ = fixup_jump ? fixup_jump - bufp->buffer + 1 : 0;
- *stackp++ = regnum++;
- *stackp++ = begalt - bufp->buffer;
- fixup_jump = 0;
- laststart = 0;
- begalt = b;
- break;
-
- case ')':
- if (obscure_syntax & RE_NO_BK_PARENS)
- goto normal_backsl;
- handle_close:
- if (stackp == stackb) goto unmatched_close;
- begalt = *--stackp + bufp->buffer;
- if (fixup_jump)
- store_jump (fixup_jump, jump, b);
- if (stackp[-1] < RE_NREGS)
- {
- PATPUSH (stop_memory);
- PATPUSH (stackp[-1]);
- }
- stackp -= 2;
- fixup_jump = 0;
- if (*stackp)
- fixup_jump = *stackp + bufp->buffer - 1;
- laststart = *--stackp + bufp->buffer;
- break;
-
- case '|':
- if (obscure_syntax & RE_NO_BK_VBAR)
- goto normal_backsl;
- handle_bar:
- insert_jump (on_failure_jump, begalt, b + 6, b);
+ if (p == pend) return REG_EESCAPE;
+
+ /* Do not translate the character after the \, so that we can
+ distinguish, e.g., \B from \b, even if we normally would
+ translate, e.g., B to b. */
+ PATFETCH_RAW (c);
+
+ switch (c)
+ {
+ case '(':
+ if (syntax & RE_NO_BK_PARENS)
+ goto normal_backslash;
+
+ handle_open:
+ bufp->re_nsub++;
+ regnum++;
+
+ if (COMPILE_STACK_FULL)
+ {
+ RETALLOC (compile_stack.stack, compile_stack.size << 1,
+ compile_stack_elt_t);
+ if (compile_stack.stack == NULL) return REG_ESPACE;
+
+ compile_stack.size <<= 1;
+ }
+
+ /* These are the values to restore when we hit end of this
+ group. They are all relative offsets, so that if the
+ whole pattern moves because of realloc, they will still
+ be valid. */
+ COMPILE_STACK_TOP.begalt_offset = begalt - bufp->buffer;
+ COMPILE_STACK_TOP.fixup_alt_jump
+ = fixup_alt_jump ? fixup_alt_jump - bufp->buffer + 1 : 0;
+ COMPILE_STACK_TOP.laststart_offset = b - bufp->buffer;
+ COMPILE_STACK_TOP.regnum = regnum;
+
+ /* We will eventually replace the 0 with the number of
+ groups inner to this one. But do not push a
+ start_memory for groups beyond the last one we can
+ represent in the compiled pattern. */
+ if (regnum <= MAX_REGNUM)
+ {
+ COMPILE_STACK_TOP.inner_group_offset = b - bufp->buffer + 2;
+ BUF_PUSH_3 (start_memory, regnum, 0);
+ }
+
+ compile_stack.avail++;
+
+ fixup_alt_jump = 0;
+ laststart = 0;
+ begalt = b;
+ /* If we've reached MAX_REGNUM groups, then this open
+ won't actually generate any code, so we'll have to
+ clear pending_exact explicitly. */
pending_exact = 0;
- b += 3;
- if (fixup_jump)
- store_jump (fixup_jump, jump, b);
- fixup_jump = b;
- b += 3;
- laststart = 0;
- begalt = b;
- break;
+ break;
+
+
+ case ')':
+ if (syntax & RE_NO_BK_PARENS) goto normal_backslash;
+
+ if (COMPILE_STACK_EMPTY)
+ if (syntax & RE_UNMATCHED_RIGHT_PAREN_ORD)
+ goto normal_backslash;
+ else
+ return REG_ERPAREN;
+
+ handle_close:
+ if (fixup_alt_jump)
+ { /* Push a dummy failure point at the end of the
+ alternative for a possible future
+ `pop_failure_jump' to pop. See comments at
+ `push_dummy_failure' in `re_match_2'. */
+ BUF_PUSH (push_dummy_failure);
+
+ /* We allocated space for this jump when we assigned
+ to `fixup_alt_jump', in the `handle_alt' case below. */
+ STORE_JUMP (jump_past_alt, fixup_alt_jump, b - 1);
+ }
+
+ /* See similar code for backslashed left paren above. */
+ if (COMPILE_STACK_EMPTY)
+ if (syntax & RE_UNMATCHED_RIGHT_PAREN_ORD)
+ goto normal_char;
+ else
+ return REG_ERPAREN;
+
+ /* Since we just checked for an empty stack above, this
+ ``can't happen''. */
+ assert (compile_stack.avail != 0);
+ {
+ /* We don't just want to restore into `regnum', because
+ later groups should continue to be numbered higher,
+ as in `(ab)c(de)' -- the second group is #2. */
+ regnum_t this_group_regnum;
+
+ compile_stack.avail--;
+ begalt = bufp->buffer + COMPILE_STACK_TOP.begalt_offset;
+ fixup_alt_jump
+ = COMPILE_STACK_TOP.fixup_alt_jump
+ ? bufp->buffer + COMPILE_STACK_TOP.fixup_alt_jump - 1
+ : 0;
+ laststart = bufp->buffer + COMPILE_STACK_TOP.laststart_offset;
+ this_group_regnum = COMPILE_STACK_TOP.regnum;
+ /* If we've reached MAX_REGNUM groups, then this open
+ won't actually generate any code, so we'll have to
+ clear pending_exact explicitly. */
+ pending_exact = 0;
+
+ /* We're at the end of the group, so now we know how many
+ groups were inside this one. */
+ if (this_group_regnum <= MAX_REGNUM)
+ {
+ unsigned char *inner_group_loc
+ = bufp->buffer + COMPILE_STACK_TOP.inner_group_offset;
+
+ *inner_group_loc = regnum - this_group_regnum;
+ BUF_PUSH_3 (stop_memory, this_group_regnum,
+ regnum - this_group_regnum);
+ }
+ }
+ break;
+
+
+ case '|': /* `\|'. */
+ if (syntax & RE_LIMITED_OPS || syntax & RE_NO_BK_VBAR)
+ goto normal_backslash;
+ handle_alt:
+ if (syntax & RE_LIMITED_OPS)
+ goto normal_char;
+
+ /* Insert before the previous alternative a jump which
+ jumps to this alternative if the former fails. */
+ GET_BUFFER_SPACE (3);
+ INSERT_JUMP (on_failure_jump, begalt, b + 6);
+ pending_exact = 0;
+ b += 3;
+
+ /* The alternative before this one has a jump after it
+ which gets executed if it gets matched. Adjust that
+ jump so it will jump to this alternative's analogous
+ jump (put in below, which in turn will jump to the next
+ (if any) alternative's such jump, etc.). The last such
+ jump jumps to the correct final destination. A picture:
+ _____ _____
+ | | | |
+ | v | v
+ a | b | c
+
+ If we are at `b', then fixup_alt_jump right now points to a
+ three-byte space after `a'. We'll put in the jump, set
+ fixup_alt_jump to right after `b', and leave behind three
+ bytes which we'll fill in when we get to after `c'. */
+
+ if (fixup_alt_jump)
+ STORE_JUMP (jump_past_alt, fixup_alt_jump, b);
+
+ /* Mark and leave space for a jump after this alternative,
+ to be filled in later either by next alternative or
+ when know we're at the end of a series of alternatives. */
+ fixup_alt_jump = b;
+ GET_BUFFER_SPACE (3);
+ b += 3;
+
+ laststart = 0;
+ begalt = b;
+ break;
+
+
+ case '{':
+ /* If \{ is a literal. */
+ if (!(syntax & RE_INTERVALS)
+ /* If we're at `\{' and it's not the open-interval
+ operator. */
+ || ((syntax & RE_INTERVALS) && (syntax & RE_NO_BK_BRACES))
+ || (p - 2 == pattern && p == pend))
+ goto normal_backslash;
+
+ handle_interval:
+ {
+ /* If got here, then the syntax allows intervals. */
+
+ /* At least (most) this many matches must be made. */
+ int lower_bound = -1, upper_bound = -1;
+
+ beg_interval = p - 1;
+
+ if (p == pend)
+ {
+ if (syntax & RE_NO_BK_BRACES)
+ goto unfetch_interval;
+ else
+ return REG_EBRACE;
+ }
+
+ GET_UNSIGNED_NUMBER (lower_bound);
+
+ if (c == ',')
+ {
+ GET_UNSIGNED_NUMBER (upper_bound);
+ if (upper_bound < 0) upper_bound = RE_DUP_MAX;
+ }
+ else
+ /* Interval such as `{1}' => match exactly once. */
+ upper_bound = lower_bound;
+
+ if (lower_bound < 0 || upper_bound > RE_DUP_MAX
+ || lower_bound > upper_bound)
+ {
+ if (syntax & RE_NO_BK_BRACES)
+ goto unfetch_interval;
+ else
+ return REG_BADBR;
+ }
+
+ if (!(syntax & RE_NO_BK_BRACES))
+ {
+ if (c != '\\') return REG_EBRACE;
+
+ PATFETCH (c);
+ }
+
+ if (c != '}')
+ {
+ if (syntax & RE_NO_BK_BRACES)
+ goto unfetch_interval;
+ else
+ return REG_BADBR;
+ }
+
+ /* We just parsed a valid interval. */
+
+ /* If it's invalid to have no preceding re. */
+ if (!laststart)
+ {
+ if (syntax & RE_CONTEXT_INVALID_OPS)
+ return REG_BADRPT;
+ else if (syntax & RE_CONTEXT_INDEP_OPS)
+ laststart = b;
+ else
+ goto unfetch_interval;
+ }
+
+ /* If the upper bound is zero, don't want to succeed at
+ all; jump from `laststart' to `b + 3', which will be
+ the end of the buffer after we insert the jump. */
+ if (upper_bound == 0)
+ {
+ GET_BUFFER_SPACE (3);
+ INSERT_JUMP (jump, laststart, b + 3);
+ b += 3;
+ }
+
+ /* Otherwise, we have a nontrivial interval. When
+ we're all done, the pattern will look like:
+ set_number_at <jump count> <upper bound>
+ set_number_at <succeed_n count> <lower bound>
+ succeed_n <after jump addr> <succed_n count>
+ <body of loop>
+ jump_n <succeed_n addr> <jump count>
+ (The upper bound and `jump_n' are omitted if
+ `upper_bound' is 1, though.) */
+ else
+ { /* If the upper bound is > 1, we need to insert
+ more at the end of the loop. */
+ unsigned nbytes = 10 + (upper_bound > 1) * 10;
+
+ GET_BUFFER_SPACE (nbytes);
+
+ /* Initialize lower bound of the `succeed_n', even
+ though it will be set during matching by its
+ attendant `set_number_at' (inserted next),
+ because `re_compile_fastmap' needs to know.
+ Jump to the `jump_n' we might insert below. */
+ INSERT_JUMP2 (succeed_n, laststart,
+ b + 5 + (upper_bound > 1) * 5,
+ lower_bound);
+ b += 5;
+
+ /* Code to initialize the lower bound. Insert
+ before the `succeed_n'. The `5' is the last two
+ bytes of this `set_number_at', plus 3 bytes of
+ the following `succeed_n'. */
+ insert_op2 (set_number_at, laststart, 5, lower_bound, b);
+ b += 5;
+
+ if (upper_bound > 1)
+ { /* More than one repetition is allowed, so
+ append a backward jump to the `succeed_n'
+ that starts this interval.
+
+ When we've reached this during matching,
+ we'll have matched the interval once, so
+ jump back only `upper_bound - 1' times. */
+ STORE_JUMP2 (jump_n, b, laststart + 5,
+ upper_bound - 1);
+ b += 5;
+
+ /* The location we want to set is the second
+ parameter of the `jump_n'; that is `b-2' as
+ an absolute address. `laststart' will be
+ the `set_number_at' we're about to insert;
+ `laststart+3' the number to set, the source
+ for the relative address. But we are
+ inserting into the middle of the pattern --
+ so everything is getting moved up by 5.
+ Conclusion: (b - 2) - (laststart + 3) + 5,
+ i.e., b - laststart.
+
+ We insert this at the beginning of the loop
+ so that if we fail during matching, we'll
+ reinitialize the bounds. */
+ insert_op2 (set_number_at, laststart, b - laststart,
+ upper_bound - 1, b);
+ b += 5;
+ }
+ }
+ pending_exact = 0;
+ beg_interval = NULL;
+ }
+ break;
+
+ unfetch_interval:
+ /* If an invalid interval, match the characters as literals. */
+ assert (beg_interval);
+ p = beg_interval;
+ beg_interval = NULL;
+
+ /* normal_char and normal_backslash need `c'. */
+ PATFETCH (c);
+
+ if (!(syntax & RE_NO_BK_BRACES))
+ {
+ if (p > pattern && p[-1] == '\\')
+ goto normal_backslash;
+ }
+ goto normal_char;
#ifdef emacs
- case '=':
- PATPUSH (at_dot);
- break;
-
- case 's':
- laststart = b;
- PATPUSH (syntaxspec);
- PATFETCH (c);
- PATPUSH (syntax_spec_code[c]);
- break;
-
- case 'S':
- laststart = b;
- PATPUSH (notsyntaxspec);
- PATFETCH (c);
- PATPUSH (syntax_spec_code[c]);
- break;
+ /* There is no way to specify the before_dot and after_dot
+ operators. rms says this is ok. --karl */
+ case '=':
+ BUF_PUSH (at_dot);
+ break;
+
+ case 's':
+ laststart = b;
+ PATFETCH (c);
+ BUF_PUSH_2 (syntaxspec, syntax_spec_code[c]);
+ break;
+
+ case 'S':
+ laststart = b;
+ PATFETCH (c);
+ BUF_PUSH_2 (notsyntaxspec, syntax_spec_code[c]);
+ break;
#endif /* emacs */
- case 'w':
- laststart = b;
- PATPUSH (wordchar);
- break;
-
- case 'W':
- laststart = b;
- PATPUSH (notwordchar);
- break;
-
- case '<':
- PATPUSH (wordbeg);
- break;
-
- case '>':
- PATPUSH (wordend);
- break;
-
- case 'b':
- PATPUSH (wordbound);
- break;
-
- case 'B':
- PATPUSH (notwordbound);
- break;
-
- case '`':
- PATPUSH (begbuf);
- break;
-
- case '\'':
- PATPUSH (endbuf);
- break;
-
- case '1':
- case '2':
- case '3':
- case '4':
- case '5':
- case '6':
- case '7':
- case '8':
- case '9':
- c1 = c - '0';
- if (c1 >= regnum)
- goto normal_char;
- for (stackt = stackp - 2; stackt > stackb; stackt -= 4)
- if (*stackt == c1)
- goto normal_char;
- laststart = b;
- PATPUSH (duplicate);
- PATPUSH (c1);
- break;
-
- case '+':
- case '?':
- if (obscure_syntax & RE_BK_PLUS_QM)
- goto handle_plus;
-
- default:
- normal_backsl:
- /* You might think it would be useful for \ to mean
- not to translate; but if we don't translate it
- it will never match anything. */
- if (translate) c = translate[c];
- goto normal_char;
- }
- break;
+
+ case 'w':
+ laststart = b;
+ BUF_PUSH (wordchar);
+ break;
+
+
+ case 'W':
+ laststart = b;
+ BUF_PUSH (notwordchar);
+ break;
+
+
+ case '<':
+ BUF_PUSH (wordbeg);
+ break;
+
+ case '>':
+ BUF_PUSH (wordend);
+ break;
+
+ case 'b':
+ BUF_PUSH (wordbound);
+ break;
+
+ case 'B':
+ BUF_PUSH (notwordbound);
+ break;
+
+ case '`':
+ BUF_PUSH (begbuf);
+ break;
+
+ case '\'':
+ BUF_PUSH (endbuf);
+ break;
+
+ case '1': case '2': case '3': case '4': case '5':
+ case '6': case '7': case '8': case '9':
+ if (syntax & RE_NO_BK_REFS)
+ goto normal_char;
+
+ c1 = c - '0';
+
+ if (c1 > regnum)
+ return REG_ESUBREG;
+
+ /* Can't back reference to a subexpression if inside of it. */
+ if (group_in_compile_stack (compile_stack, c1))
+ goto normal_char;
+
+ laststart = b;
+ BUF_PUSH_2 (duplicate, c1);
+ break;
+
+
+ case '+':
+ case '?':
+ if (syntax & RE_BK_PLUS_QM)
+ goto handle_plus;
+ else
+ goto normal_backslash;
+
+ default:
+ normal_backslash:
+ /* You might think it would be useful for \ to mean
+ not to translate; but if we don't translate it
+ it will never match anything. */
+ c = TRANSLATE (c);
+ goto normal_char;
+ }
+ break;
+
default:
+ /* Expects the character in `c'. */
normal_char:
- if (!pending_exact || pending_exact + *pending_exact + 1 != b
- || *pending_exact == 0177 || *p == '*' || *p == '^'
- || ((obscure_syntax & RE_BK_PLUS_QM)
+ /* If no exactn currently being built. */
+ if (!pending_exact
+
+ /* If last exactn not at current position. */
+ || pending_exact + *pending_exact + 1 != b
+
+ /* We have only one byte following the exactn for the count. */
+ || *pending_exact == (1 << BYTEWIDTH) - 1
+
+ /* If followed by a repetition operator. */
+ || *p == '*' || *p == '^'
+ || ((syntax & RE_BK_PLUS_QM)
? *p == '\\' && (p[1] == '+' || p[1] == '?')
- : (*p == '+' || *p == '?')))
+ : (*p == '+' || *p == '?'))
+ || ((syntax & RE_INTERVALS)
+ && ((syntax & RE_NO_BK_BRACES)
+ ? *p == '{'
+ : (p[0] == '\\' && p[1] == '{'))))
{
- laststart = b;
- PATPUSH (exactn);
- pending_exact = b;
- PATPUSH (0);
- }
- PATPUSH (c);
- (*pending_exact)++;
- }
- }
+ /* Start building a new exactn. */
+
+ laststart = b;
+
+ BUF_PUSH_2 (exactn, 0);
+ pending_exact = b - 1;
+ }
+
+ BUF_PUSH (c);
+ (*pending_exact)++;
+ break;
+ } /* switch (c) */
+ } /* while p != pend */
- if (fixup_jump)
- store_jump (fixup_jump, jump, b);
+
+ /* Through the pattern now. */
+
+ if (fixup_alt_jump)
+ STORE_JUMP (jump_past_alt, fixup_alt_jump, b);
+
+ if (!COMPILE_STACK_EMPTY)
+ return REG_EPAREN;
- if (stackp != stackb) goto unmatched_open;
+ free (compile_stack.stack);
+ /* We have succeeded; set the length of the buffer. */
bufp->used = b - bufp->buffer;
- return 0;
- invalid_pattern:
- return "Invalid regular expression";
+#ifdef DEBUG
+ if (debug)
+ {
+ DEBUG_PRINT1 ("\nCompiled pattern: \n");
+ print_compiled_pattern (bufp);
+ }
+#endif /* DEBUG */
- unmatched_open:
- return "Unmatched \\(";
+ return REG_NOERROR;
+} /* regex_compile */
+
+/* Subroutines for `regex_compile'. */
- unmatched_close:
- return "Unmatched \\)";
+/* Store OP at LOC followed by two-byte integer parameter ARG. */
- end_of_pattern:
- return "Premature end of regular expression";
+static void
+store_op1 (op, loc, arg)
+ re_opcode_t op;
+ unsigned char *loc;
+ int arg;
+{
+ *loc = (unsigned char) op;
+ STORE_NUMBER (loc + 1, arg);
+}
- nesting_too_deep:
- return "Nesting too deep";
- too_big:
- return "Regular expression too big";
+/* Like `store_op1', but for two two-byte parameters ARG1 and ARG2. */
- memory_exhausted:
- return "Memory exhausted";
+static void
+store_op2 (op, loc, arg1, arg2)
+ re_opcode_t op;
+ unsigned char *loc;
+ int arg1, arg2;
+{
+ *loc = (unsigned char) op;
+ STORE_NUMBER (loc + 1, arg1);
+ STORE_NUMBER (loc + 3, arg2);
}
-/* Store where `from' points a jump operation to jump to where `to' points.
- `opcode' is the opcode to store. */
+
+/* Copy the bytes from LOC to END to open up three bytes of space at LOC
+ for OP followed by two-byte integer parameter ARG. */
static void
-store_jump (from, opcode, to)
- char *from, *to;
- char opcode;
+insert_op1 (op, loc, arg, end)
+ re_opcode_t op;
+ unsigned char *loc;
+ int arg;
+ unsigned char *end;
{
- from[0] = opcode;
- from[1] = (to - (from + 3)) & 0377;
- from[2] = (to - (from + 3)) >> 8;
+ register unsigned char *pfrom = end;
+ register unsigned char *pto = end + 3;
+
+ while (pfrom != loc)
+ *--pto = *--pfrom;
+
+ store_op1 (op, loc, arg);
}
-/* Open up space at char FROM, and insert there a jump to TO.
- CURRENT_END gives te end of the storage no in use,
- so we know how much data to copy up.
- OP is the opcode of the jump to insert.
- If you call this function, you must zero out pending_exact. */
+/* Like `insert_op1', but for two two-byte parameters ARG1 and ARG2. */
static void
-insert_jump (op, from, to, current_end)
- char op;
- char *from, *to, *current_end;
+insert_op2 (op, loc, arg1, arg2, end)
+ re_opcode_t op;
+ unsigned char *loc;
+ int arg1, arg2;
+ unsigned char *end;
{
- register char *pto = current_end + 3;
- register char *pfrom = current_end;
- while (pfrom != from)
+ register unsigned char *pfrom = end;
+ register unsigned char *pto = end + 5;
+
+ while (pfrom != loc)
*--pto = *--pfrom;
- store_jump (from, op, to);
+
+ store_op2 (op, loc, arg1, arg2);
+}
+
+
+/* P points to just after a ^ in PATTERN. Return true if that ^ comes
+ after an alternative or a begin-subexpression. We assume there is at
+ least one character before the ^. */
+
+static boolean
+at_begline_loc_p (pattern, p, syntax)
+ const char *pattern, *p;
+ reg_syntax_t syntax;
+{
+ const char *prev = p - 2;
+ boolean prev_prev_backslash = prev > pattern && prev[-1] == '\\';
+
+ return
+ /* After a subexpression? */
+ (*prev == '(' && (syntax & RE_NO_BK_PARENS || prev_prev_backslash))
+ /* After an alternative? */
+ || (*prev == '|' && (syntax & RE_NO_BK_VBAR || prev_prev_backslash));
+}
+
+
+/* The dual of at_begline_loc_p. This one is for $. We assume there is
+ at least one character after the $, i.e., `P < PEND'. */
+
+static boolean
+at_endline_loc_p (p, pend, syntax)
+ const char *p, *pend;
+ int syntax;
+{
+ const char *next = p;
+ boolean next_backslash = *next == '\\';
+ const char *next_next = p + 1 < pend ? p + 1 : NULL;
+
+ return
+ /* Before a subexpression? */
+ (syntax & RE_NO_BK_PARENS ? *next == ')'
+ : next_backslash && next_next && *next_next == ')')
+ /* Before an alternative? */
+ || (syntax & RE_NO_BK_VBAR ? *next == '|'
+ : next_backslash && next_next && *next_next == '|');
+}
+
+
+/* Returns true if REGNUM is in one of COMPILE_STACK's elements and
+ false if it's not. */
+
+static boolean
+group_in_compile_stack (compile_stack, regnum)
+ compile_stack_type compile_stack;
+ regnum_t regnum;
+{
+ int this_element;
+
+ for (this_element = compile_stack.avail - 1;
+ this_element >= 0;
+ this_element--)
+ if (compile_stack.stack[this_element].regnum == regnum)
+ return true;
+
+ return false;
+}
+
+
+/* Read the ending character of a range (in a bracket expression) from the
+ uncompiled pattern *P_PTR (which ends at PEND). We assume the
+ starting character is in `P[-2]'. (`P[-1]' is the character `-'.)
+ Then we set the translation of all bits between the starting and
+ ending characters (inclusive) in the compiled pattern B.
+
+ Return an error code.
+
+ We use these short variable names so we can use the same macros as
+ `regex_compile' itself. */
+
+static reg_errcode_t
+compile_range (p_ptr, pend, translate, syntax, b)
+ const char **p_ptr, *pend;
+ char *translate;
+ reg_syntax_t syntax;
+ unsigned char *b;
+{
+ unsigned this_char;
+
+ const char *p = *p_ptr;
+ int range_start, range_end;
+
+ if (p == pend)
+ return REG_ERANGE;
+
+ /* Even though the pattern is a signed `char *', we need to fetch
+ with unsigned char *'s; if the high bit of the pattern character
+ is set, the range endpoints will be negative if we fetch using a
+ signed char *.
+
+ We also want to fetch the endpoints without translating them; the
+ appropriate translation is done in the bit-setting loop below. */
+ range_start = ((unsigned char *) p)[-2];
+ range_end = ((unsigned char *) p)[0];
+
+ /* Have to increment the pointer into the pattern string, so the
+ caller isn't still at the ending character. */
+ (*p_ptr)++;
+
+ /* If the start is after the end, the range is empty. */
+ if (range_start > range_end)
+ return syntax & RE_NO_EMPTY_RANGES ? REG_ERANGE : REG_NOERROR;
+
+ /* Here we see why `this_char' has to be larger than an `unsigned
+ char' -- the range is inclusive, so if `range_end' == 0xff
+ (assuming 8-bit characters), we would otherwise go into an infinite
+ loop, since all characters <= 0xff. */
+ for (this_char = range_start; this_char <= range_end; this_char++)
+ {
+ SET_LIST_BIT (TRANSLATE (this_char));
+ }
+
+ return REG_NOERROR;
}
-/* Given a pattern, compute a fastmap from it.
- The fastmap records which of the (1 << BYTEWIDTH) possible characters
- can start a string that matches the pattern.
- This fastmap is used by re_search to skip quickly over totally implausible text.
+/* Failure stack declarations and macros; both re_compile_fastmap and
+ re_match_2 use a failure stack. These have to be macros because of
+ REGEX_ALLOCATE. */
+
+
+/* Number of failure points for which to initially allocate space
+ when matching. If this number is exceeded, we allocate more
+ space, so it is not a hard limit. */
+#ifndef INIT_FAILURE_ALLOC
+#define INIT_FAILURE_ALLOC 5
+#endif
- The caller must supply the address of a (1 << BYTEWIDTH)-byte data area
- as bufp->fastmap.
- The other components of bufp describe the pattern to be used. */
+/* Roughly the maximum number of failure points on the stack. Would be
+ exactly that if always used MAX_FAILURE_SPACE each time we failed.
+ This is a variable only so users of regex can assign to it; we never
+ change it ourselves. */
+int re_max_failures = 2000;
-void
+typedef const unsigned char *fail_stack_elt_t;
+
+typedef struct
+{
+ fail_stack_elt_t *stack;
+ unsigned size;
+ unsigned avail; /* Offset of next open position. */
+} fail_stack_type;
+
+#define FAIL_STACK_EMPTY() (fail_stack.avail == 0)
+#define FAIL_STACK_PTR_EMPTY() (fail_stack_ptr->avail == 0)
+#define FAIL_STACK_FULL() (fail_stack.avail == fail_stack.size)
+#define FAIL_STACK_TOP() (fail_stack.stack[fail_stack.avail])
+
+
+/* Initialize `fail_stack'. Do `return -2' if the alloc fails. */
+
+#define INIT_FAIL_STACK() \
+ do { \
+ fail_stack.stack = (fail_stack_elt_t *) \
+ REGEX_ALLOCATE (INIT_FAILURE_ALLOC * sizeof (fail_stack_elt_t)); \
+ \
+ if (fail_stack.stack == NULL) \
+ return -2; \
+ \
+ fail_stack.size = INIT_FAILURE_ALLOC; \
+ fail_stack.avail = 0; \
+ } while (0)
+
+
+/* Double the size of FAIL_STACK, up to approximately `re_max_failures' items.
+
+ Return 1 if succeeds, and 0 if either ran out of memory
+ allocating space for it or it was already too large.
+
+ REGEX_REALLOCATE requires `destination' be declared. */
+
+#define DOUBLE_FAIL_STACK(fail_stack) \
+ ((fail_stack).size > re_max_failures * MAX_FAILURE_ITEMS \
+ ? 0 \
+ : ((fail_stack).stack = (fail_stack_elt_t *) \
+ REGEX_REALLOCATE ((fail_stack).stack, \
+ (fail_stack).size * sizeof (fail_stack_elt_t), \
+ ((fail_stack).size << 1) * sizeof (fail_stack_elt_t)), \
+ \
+ (fail_stack).stack == NULL \
+ ? 0 \
+ : ((fail_stack).size <<= 1, \
+ 1)))
+
+
+/* Push PATTERN_OP on FAIL_STACK.
+
+ Return 1 if was able to do so and 0 if ran out of memory allocating
+ space to do so. */
+#define PUSH_PATTERN_OP(pattern_op, fail_stack) \
+ ((FAIL_STACK_FULL () \
+ && !DOUBLE_FAIL_STACK (fail_stack)) \
+ ? 0 \
+ : ((fail_stack).stack[(fail_stack).avail++] = pattern_op, \
+ 1))
+
+/* This pushes an item onto the failure stack. Must be a four-byte
+ value. Assumes the variable `fail_stack'. Probably should only
+ be called from within `PUSH_FAILURE_POINT'. */
+#define PUSH_FAILURE_ITEM(item) \
+ fail_stack.stack[fail_stack.avail++] = (fail_stack_elt_t) item
+
+/* The complement operation. Assumes `fail_stack' is nonempty. */
+#define POP_FAILURE_ITEM() fail_stack.stack[--fail_stack.avail]
+
+/* Used to omit pushing failure point id's when we're not debugging. */
+#ifdef DEBUG
+#define DEBUG_PUSH PUSH_FAILURE_ITEM
+#define DEBUG_POP(item_addr) *(item_addr) = POP_FAILURE_ITEM ()
+#else
+#define DEBUG_PUSH(item)
+#define DEBUG_POP(item_addr)
+#endif
+
+
+/* Push the information about the state we will need
+ if we ever fail back to it.
+
+ Requires variables fail_stack, regstart, regend, reg_info, and
+ num_regs be declared. DOUBLE_FAIL_STACK requires `destination' be
+ declared.
+
+ Does `return FAILURE_CODE' if runs out of memory. */
+
+#define PUSH_FAILURE_POINT(pattern_place, string_place, failure_code) \
+ do { \
+ char *destination; \
+ /* Must be int, so when we don't save any registers, the arithmetic \
+ of 0 + -1 isn't done as unsigned. */ \
+ int this_reg; \
+ \
+ DEBUG_STATEMENT (failure_id++); \
+ DEBUG_STATEMENT (nfailure_points_pushed++); \
+ DEBUG_PRINT2 ("\nPUSH_FAILURE_POINT #%u:\n", failure_id); \
+ DEBUG_PRINT2 (" Before push, next avail: %d\n", (fail_stack).avail);\
+ DEBUG_PRINT2 (" size: %d\n", (fail_stack).size);\
+ \
+ DEBUG_PRINT2 (" slots needed: %d\n", NUM_FAILURE_ITEMS); \
+ DEBUG_PRINT2 (" available: %d\n", REMAINING_AVAIL_SLOTS); \
+ \
+ /* Ensure we have enough space allocated for what we will push. */ \
+ while (REMAINING_AVAIL_SLOTS < NUM_FAILURE_ITEMS) \
+ { \
+ if (!DOUBLE_FAIL_STACK (fail_stack)) \
+ return failure_code; \
+ \
+ DEBUG_PRINT2 ("\n Doubled stack; size now: %d\n", \
+ (fail_stack).size); \
+ DEBUG_PRINT2 (" slots available: %d\n", REMAINING_AVAIL_SLOTS);\
+ } \
+ \
+ /* Push the info, starting with the registers. */ \
+ DEBUG_PRINT1 ("\n"); \
+ \
+ for (this_reg = lowest_active_reg; this_reg <= highest_active_reg; \
+ this_reg++) \
+ { \
+ DEBUG_PRINT2 (" Pushing reg: %d\n", this_reg); \
+ DEBUG_STATEMENT (num_regs_pushed++); \
+ \
+ DEBUG_PRINT2 (" start: 0x%x\n", regstart[this_reg]); \
+ PUSH_FAILURE_ITEM (regstart[this_reg]); \
+ \
+ DEBUG_PRINT2 (" end: 0x%x\n", regend[this_reg]); \
+ PUSH_FAILURE_ITEM (regend[this_reg]); \
+ \
+ DEBUG_PRINT2 (" info: 0x%x\n ", reg_info[this_reg]); \
+ DEBUG_PRINT2 (" match_null=%d", \
+ REG_MATCH_NULL_STRING_P (reg_info[this_reg])); \
+ DEBUG_PRINT2 (" active=%d", IS_ACTIVE (reg_info[this_reg])); \
+ DEBUG_PRINT2 (" matched_something=%d", \
+ MATCHED_SOMETHING (reg_info[this_reg])); \
+ DEBUG_PRINT2 (" ever_matched=%d", \
+ EVER_MATCHED_SOMETHING (reg_info[this_reg])); \
+ DEBUG_PRINT1 ("\n"); \
+ PUSH_FAILURE_ITEM (reg_info[this_reg].word); \
+ } \
+ \
+ DEBUG_PRINT2 (" Pushing low active reg: %d\n", lowest_active_reg);\
+ PUSH_FAILURE_ITEM (lowest_active_reg); \
+ \
+ DEBUG_PRINT2 (" Pushing high active reg: %d\n", highest_active_reg);\
+ PUSH_FAILURE_ITEM (highest_active_reg); \
+ \
+ DEBUG_PRINT2 (" Pushing pattern 0x%x: ", pattern_place); \
+ DEBUG_PRINT_COMPILED_PATTERN (bufp, pattern_place, pend); \
+ PUSH_FAILURE_ITEM (pattern_place); \
+ \
+ DEBUG_PRINT2 (" Pushing string 0x%x: `", string_place); \
+ DEBUG_PRINT_DOUBLE_STRING (string_place, string1, size1, string2, \
+ size2); \
+ DEBUG_PRINT1 ("'\n"); \
+ PUSH_FAILURE_ITEM (string_place); \
+ \
+ DEBUG_PRINT2 (" Pushing failure id: %u\n", failure_id); \
+ DEBUG_PUSH (failure_id); \
+ } while (0)
+
+/* This is the number of items that are pushed and popped on the stack
+ for each register. */
+#define NUM_REG_ITEMS 3
+
+/* Individual items aside from the registers. */
+#ifdef DEBUG
+#define NUM_NONREG_ITEMS 5 /* Includes failure point id. */
+#else
+#define NUM_NONREG_ITEMS 4
+#endif
+
+/* We push at most this many items on the stack. */
+#define MAX_FAILURE_ITEMS ((num_regs - 1) * NUM_REG_ITEMS + NUM_NONREG_ITEMS)
+
+/* We actually push this many items. */
+#define NUM_FAILURE_ITEMS \
+ ((highest_active_reg - lowest_active_reg + 1) * NUM_REG_ITEMS \
+ + NUM_NONREG_ITEMS)
+
+/* How many items can still be added to the stack without overflowing it. */
+#define REMAINING_AVAIL_SLOTS ((fail_stack).size - (fail_stack).avail)
+
+
+/* Pops what PUSH_FAIL_STACK pushes.
+
+ We restore into the parameters, all of which should be lvalues:
+ STR -- the saved data position.
+ PAT -- the saved pattern position.
+ LOW_REG, HIGH_REG -- the highest and lowest active registers.
+ REGSTART, REGEND -- arrays of string positions.
+ REG_INFO -- array of information about each subexpression.
+
+ Also assumes the variables `fail_stack' and (if debugging), `bufp',
+ `pend', `string1', `size1', `string2', and `size2'. */
+
+#define POP_FAILURE_POINT(str, pat, low_reg, high_reg, regstart, regend, reg_info)\
+{ \
+ DEBUG_STATEMENT (fail_stack_elt_t failure_id;) \
+ int this_reg; \
+ const unsigned char *string_temp; \
+ \
+ assert (!FAIL_STACK_EMPTY ()); \
+ \
+ /* Remove failure points and point to how many regs pushed. */ \
+ DEBUG_PRINT1 ("POP_FAILURE_POINT:\n"); \
+ DEBUG_PRINT2 (" Before pop, next avail: %d\n", fail_stack.avail); \
+ DEBUG_PRINT2 (" size: %d\n", fail_stack.size); \
+ \
+ assert (fail_stack.avail >= NUM_NONREG_ITEMS); \
+ \
+ DEBUG_POP (&failure_id); \
+ DEBUG_PRINT2 (" Popping failure id: %u\n", failure_id); \
+ \
+ /* If the saved string location is NULL, it came from an \
+ on_failure_keep_string_jump opcode, and we want to throw away the \
+ saved NULL, thus retaining our current position in the string. */ \
+ string_temp = POP_FAILURE_ITEM (); \
+ if (string_temp != NULL) \
+ str = (const char *) string_temp; \
+ \
+ DEBUG_PRINT2 (" Popping string 0x%x: `", str); \
+ DEBUG_PRINT_DOUBLE_STRING (str, string1, size1, string2, size2); \
+ DEBUG_PRINT1 ("'\n"); \
+ \
+ pat = (unsigned char *) POP_FAILURE_ITEM (); \
+ DEBUG_PRINT2 (" Popping pattern 0x%x: ", pat); \
+ DEBUG_PRINT_COMPILED_PATTERN (bufp, pat, pend); \
+ \
+ /* Restore register info. */ \
+ high_reg = (unsigned) POP_FAILURE_ITEM (); \
+ DEBUG_PRINT2 (" Popping high active reg: %d\n", high_reg); \
+ \
+ low_reg = (unsigned) POP_FAILURE_ITEM (); \
+ DEBUG_PRINT2 (" Popping low active reg: %d\n", low_reg); \
+ \
+ for (this_reg = high_reg; this_reg >= low_reg; this_reg--) \
+ { \
+ DEBUG_PRINT2 (" Popping reg: %d\n", this_reg); \
+ \
+ reg_info[this_reg].word = POP_FAILURE_ITEM (); \
+ DEBUG_PRINT2 (" info: 0x%x\n", reg_info[this_reg]); \
+ \
+ regend[this_reg] = (const char *) POP_FAILURE_ITEM (); \
+ DEBUG_PRINT2 (" end: 0x%x\n", regend[this_reg]); \
+ \
+ regstart[this_reg] = (const char *) POP_FAILURE_ITEM (); \
+ DEBUG_PRINT2 (" start: 0x%x\n", regstart[this_reg]); \
+ } \
+ \
+ DEBUG_STATEMENT (nfailure_points_popped++); \
+} /* POP_FAILURE_POINT */
+
+/* re_compile_fastmap computes a ``fastmap'' for the compiled pattern in
+ BUFP. A fastmap records which of the (1 << BYTEWIDTH) possible
+ characters can start a string that matches the pattern. This fastmap
+ is used by re_search to skip quickly over impossible starting points.
+
+ The caller must supply the address of a (1 << BYTEWIDTH)-byte data
+ area as BUFP->fastmap.
+
+ We set the `fastmap', `fastmap_accurate', and `can_be_null' fields in
+ the pattern buffer.
+
+ Returns 0 if we succeed, -2 if an internal error. */
+
+int
re_compile_fastmap (bufp)
struct re_pattern_buffer *bufp;
{
- unsigned char *pattern = (unsigned char *) bufp->buffer;
- int size = bufp->used;
+ int j, k;
+ fail_stack_type fail_stack;
+#ifndef REGEX_MALLOC
+ char *destination;
+#endif
+ /* We don't push any register information onto the failure stack. */
+ unsigned num_regs = 0;
+
register char *fastmap = bufp->fastmap;
- register unsigned char *p = pattern;
+ unsigned char *pattern = bufp->buffer;
+ unsigned long size = bufp->used;
+ const unsigned char *p = pattern;
register unsigned char *pend = pattern + size;
- register int j, k;
- unsigned char *translate = (unsigned char *) bufp->translate;
- unsigned char *stackb[NFAILURES];
- unsigned char **stackp = stackb;
+ /* Assume that each path through the pattern can be null until
+ proven otherwise. We set this false at the bottom of switch
+ statement, to which we get only if a particular path doesn't
+ match the empty string. */
+ boolean path_can_be_null = true;
+
+ /* We aren't doing a `succeed_n' to begin with. */
+ boolean succeed_n_p = false;
- bzero (fastmap, (1 << BYTEWIDTH));
- bufp->fastmap_accurate = 1;
+ assert (fastmap != NULL && p != NULL);
+
+ INIT_FAIL_STACK ();
+ bzero (fastmap, 1 << BYTEWIDTH); /* Assume nothing's valid. */
+ bufp->fastmap_accurate = 1; /* It will be when we're done. */
bufp->can_be_null = 0;
- while (p)
+ while (p != pend || !FAIL_STACK_EMPTY ())
{
if (p == pend)
- {
- bufp->can_be_null = 1;
- break;
+ {
+ bufp->can_be_null |= path_can_be_null;
+
+ /* Reset for next path. */
+ path_can_be_null = true;
+
+ p = fail_stack.stack[--fail_stack.avail];
}
+
+ /* We should never be about to go beyond the end of the pattern. */
+ assert (p < pend);
+
#ifdef SWITCH_ENUM_BUG
- switch ((int) ((enum regexpcode) *p++))
+ switch ((int) ((re_opcode_t) *p++))
#else
- switch ((enum regexpcode) *p++)
+ switch ((re_opcode_t) *p++)
#endif
{
+
+ /* I guess the idea here is to simply not bother with a fastmap
+ if a backreference is used, since it's too hard to figure out
+ the fastmap for the corresponding group. Setting
+ `can_be_null' stops `re_search_2' from using the fastmap, so
+ that is all we do. */
+ case duplicate:
+ bufp->can_be_null = 1;
+ return 0;
+
+
+ /* Following are the cases which match a character. These end
+ with `break'. */
+
case exactn:
- if (translate)
- fastmap[translate[p[1]]] = 1;
- else
- fastmap[p[1]] = 1;
+ fastmap[p[1]] = 1;
break;
- case begline:
- case before_dot:
- case at_dot:
- case after_dot:
- case begbuf:
- case endbuf:
- case wordbound:
- case notwordbound:
- case wordbeg:
- case wordend:
- continue;
- case endline:
- if (translate)
- fastmap[translate['\n']] = 1;
- else
- fastmap['\n'] = 1;
- if (bufp->can_be_null != 1)
- bufp->can_be_null = 2;
+ case charset:
+ for (j = *p++ * BYTEWIDTH - 1; j >= 0; j--)
+ if (p[j / BYTEWIDTH] & (1 << (j % BYTEWIDTH)))
+ fastmap[j] = 1;
break;
- case finalize_jump:
- case maybe_finalize_jump:
- case jump:
- case dummy_failure_jump:
- bufp->can_be_null = 1;
- j = *p++ & 0377;
- j += SIGN_EXTEND_CHAR (*(char *)p) << 8;
- p += j + 1; /* The 1 compensates for missing ++ above */
- if (j > 0)
- continue;
- /* Jump backward reached implies we just went through
- the body of a loop and matched nothing.
- Opcode jumped to should be an on_failure_jump.
- Just treat it like an ordinary jump.
- For a * loop, it has pushed its failure point already;
- if so, discard that as redundant. */
- if ((enum regexpcode) *p != on_failure_jump)
- continue;
- p++;
- j = *p++ & 0377;
- j += SIGN_EXTEND_CHAR (*(char *)p) << 8;
- p += j + 1; /* The 1 compensates for missing ++ above */
- if (stackp != stackb && *stackp == p)
- stackp--;
- continue;
-
- case on_failure_jump:
- j = *p++ & 0377;
- j += SIGN_EXTEND_CHAR (*(char *)p) << 8;
- p++;
- *++stackp = p + j;
- continue;
- case start_memory:
- case stop_memory:
- p++;
- continue;
+ case charset_not:
+ /* Chars beyond end of map must be allowed. */
+ for (j = *p * BYTEWIDTH; j < (1 << BYTEWIDTH); j++)
+ fastmap[j] = 1;
+
+ for (j = *p++ * BYTEWIDTH - 1; j >= 0; j--)
+ if (!(p[j / BYTEWIDTH] & (1 << (j % BYTEWIDTH))))
+ fastmap[j] = 1;
+ break;
- case duplicate:
- bufp->can_be_null = 1;
- fastmap['\n'] = 1;
- case anychar:
- for (j = 0; j < (1 << BYTEWIDTH); j++)
- if (j != '\n')
- fastmap[j] = 1;
- if (bufp->can_be_null)
- return;
- /* Don't return; check the alternative paths
- so we can set can_be_null if appropriate. */
- break;
case wordchar:
for (j = 0; j < (1 << BYTEWIDTH); j++)
@@ -844,118 +2660,283 @@ re_compile_fastmap (bufp)
fastmap[j] = 1;
break;
+
case notwordchar:
for (j = 0; j < (1 << BYTEWIDTH); j++)
if (SYNTAX (j) != Sword)
fastmap[j] = 1;
break;
+
+ case anychar:
+ /* `.' matches anything ... */
+ for (j = 0; j < (1 << BYTEWIDTH); j++)
+ fastmap[j] = 1;
+
+ /* ... except perhaps newline. */
+ if (!(bufp->syntax & RE_DOT_NEWLINE))
+ fastmap['\n'] = 0;
+
+ /* Return if we have already set `can_be_null'; if we have,
+ then the fastmap is irrelevant. Something's wrong here. */
+ else if (bufp->can_be_null)
+ return 0;
+
+ /* Otherwise, have to check alternative paths. */
+ break;
+
+
#ifdef emacs
- case syntaxspec:
+ case syntaxspec:
k = *p++;
for (j = 0; j < (1 << BYTEWIDTH); j++)
if (SYNTAX (j) == (enum syntaxcode) k)
fastmap[j] = 1;
break;
+
case notsyntaxspec:
k = *p++;
for (j = 0; j < (1 << BYTEWIDTH); j++)
if (SYNTAX (j) != (enum syntaxcode) k)
fastmap[j] = 1;
break;
-#endif /* emacs */
- case charset:
- for (j = *p++ * BYTEWIDTH - 1; j >= 0; j--)
- if (p[j / BYTEWIDTH] & (1 << (j % BYTEWIDTH)))
- {
- if (translate)
- fastmap[translate[j]] = 1;
- else
- fastmap[j] = 1;
- }
- break;
- case charset_not:
- /* Chars beyond end of map must be allowed */
- for (j = *p * BYTEWIDTH; j < (1 << BYTEWIDTH); j++)
- if (translate)
- fastmap[translate[j]] = 1;
- else
- fastmap[j] = 1;
+ /* All cases after this match the empty string. These end with
+ `continue'. */
+
+
+ case before_dot:
+ case at_dot:
+ case after_dot:
+ continue;
+#endif /* not emacs */
+
+
+ case no_op:
+ case begline:
+ case endline:
+ case begbuf:
+ case endbuf:
+ case wordbound:
+ case notwordbound:
+ case wordbeg:
+ case wordend:
+ case push_dummy_failure:
+ continue;
+
+
+ case jump_n:
+ case pop_failure_jump:
+ case maybe_pop_jump:
+ case jump:
+ case jump_past_alt:
+ case dummy_failure_jump:
+ EXTRACT_NUMBER_AND_INCR (j, p);
+ p += j;
+ if (j > 0)
+ continue;
+
+ /* Jump backward implies we just went through the body of a
+ loop and matched nothing. Opcode jumped to should be
+ `on_failure_jump' or `succeed_n'. Just treat it like an
+ ordinary jump. For a * loop, it has pushed its failure
+ point already; if so, discard that as redundant. */
+ if ((re_opcode_t) *p != on_failure_jump
+ && (re_opcode_t) *p != succeed_n)
+ continue;
+
+ p++;
+ EXTRACT_NUMBER_AND_INCR (j, p);
+ p += j;
+
+ /* If what's on the stack is where we are now, pop it. */
+ if (!FAIL_STACK_EMPTY ()
+ && fail_stack.stack[fail_stack.avail - 1] == p)
+ fail_stack.avail--;
+
+ continue;
+
+
+ case on_failure_jump:
+ case on_failure_keep_string_jump:
+ handle_on_failure_jump:
+ EXTRACT_NUMBER_AND_INCR (j, p);
+
+ /* For some patterns, e.g., `(a?)?', `p+j' here points to the
+ end of the pattern. We don't want to push such a point,
+ since when we restore it above, entering the switch will
+ increment `p' past the end of the pattern. We don't need
+ to push such a point since we obviously won't find any more
+ fastmap entries beyond `pend'. Such a pattern can match
+ the null string, though. */
+ if (p + j < pend)
+ {
+ if (!PUSH_PATTERN_OP (p + j, fail_stack))
+ return -2;
+ }
+ else
+ bufp->can_be_null = 1;
+
+ if (succeed_n_p)
+ {
+ EXTRACT_NUMBER_AND_INCR (k, p); /* Skip the n. */
+ succeed_n_p = false;
+ }
+
+ continue;
+
+
+ case succeed_n:
+ /* Get to the number of times to succeed. */
+ p += 2;
+
+ /* Increment p past the n for when k != 0. */
+ EXTRACT_NUMBER_AND_INCR (k, p);
+ if (k == 0)
+ {
+ p -= 4;
+ succeed_n_p = true; /* Spaghetti code alert. */
+ goto handle_on_failure_jump;
+ }
+ continue;
+
+
+ case set_number_at:
+ p += 4;
+ continue;
+
+
+ case start_memory:
+ case stop_memory:
+ p += 2;
+ continue;
- for (j = *p++ * BYTEWIDTH - 1; j >= 0; j--)
- if (!(p[j / BYTEWIDTH] & (1 << (j % BYTEWIDTH))))
- {
- if (translate)
- fastmap[translate[j]] = 1;
- else
- fastmap[j] = 1;
- }
- break;
default:
- break;
- }
+ abort (); /* We have listed all the cases. */
+ } /* switch *p++ */
+
+ /* Getting here means we have found the possible starting
+ characters for one path of the pattern -- and that the empty
+ string does not match. We need not follow this path further.
+ Instead, look at the next alternative (remembered on the
+ stack), or quit if no more. The test at the top of the loop
+ does these things. */
+ path_can_be_null = false;
+ p = pend;
+ } /* while p */
+
+ /* Set `can_be_null' for the last path (also the first path, if the
+ pattern is empty). */
+ bufp->can_be_null |= path_can_be_null;
+ return 0;
+} /* re_compile_fastmap */
+
+/* Set REGS to hold NUM_REGS registers, storing them in STARTS and
+ ENDS. Subsequent matches using PATTERN_BUFFER and REGS will use
+ this memory for recording register information. STARTS and ENDS
+ must be allocated using the malloc library routine, and must each
+ be at least NUM_REGS * sizeof (regoff_t) bytes long.
- /* Get here means we have successfully found the possible starting characters
- of one path of the pattern. We need not follow this path any farther.
- Instead, look at the next alternative remembered in the stack. */
- if (stackp != stackb)
- p = *stackp--;
- else
- break;
+ If NUM_REGS == 0, then subsequent matches should allocate their own
+ register data.
+
+ Unless this function is called, the first search or match using
+ PATTERN_BUFFER will allocate its own register data, without
+ freeing the old data. */
+
+void
+re_set_registers (bufp, regs, num_regs, starts, ends)
+ struct re_pattern_buffer *bufp;
+ struct re_registers *regs;
+ unsigned num_regs;
+ regoff_t *starts, *ends;
+{
+ if (num_regs)
+ {
+ bufp->regs_allocated = REGS_REALLOCATE;
+ regs->num_regs = num_regs;
+ regs->start = starts;
+ regs->end = ends;
+ }
+ else
+ {
+ bufp->regs_allocated = REGS_UNALLOCATED;
+ regs->num_regs = 0;
+ regs->start = regs->end = (regoff_t) 0;
}
}
-/* Like re_search_2, below, but only one string is specified. */
+/* Searching routines. */
+
+/* Like re_search_2, below, but only one string is specified, and
+ doesn't let you say where to stop matching. */
int
-re_search (pbufp, string, size, startpos, range, regs)
- struct re_pattern_buffer *pbufp;
- char *string;
+re_search (bufp, string, size, startpos, range, regs)
+ struct re_pattern_buffer *bufp;
+ const char *string;
int size, startpos, range;
struct re_registers *regs;
{
- return re_search_2 (pbufp, 0, 0, string, size, startpos, range, regs, size);
+ return re_search_2 (bufp, NULL, 0, string, size, startpos, range,
+ regs, size);
}
-/* Like re_match_2 but tries first a match starting at index STARTPOS,
- then at STARTPOS + 1, and so on.
- RANGE is the number of places to try before giving up.
- If RANGE is negative, the starting positions tried are
- STARTPOS, STARTPOS - 1, etc.
- It is up to the caller to make sure that range is not so large
- as to take the starting position outside of the input strings.
-The value returned is the position at which the match was found,
- or -1 if no match was found,
- or -2 if error (such as failure stack overflow). */
+/* Using the compiled pattern in BUFP->buffer, first tries to match the
+ virtual concatenation of STRING1 and STRING2, starting first at index
+ STARTPOS, then at STARTPOS + 1, and so on.
+
+ STRING1 and STRING2 have length SIZE1 and SIZE2, respectively.
+
+ RANGE is how far to scan while trying to match. RANGE = 0 means try
+ only at STARTPOS; in general, the last start tried is STARTPOS +
+ RANGE.
+
+ In REGS, return the indices of the virtual concatenation of STRING1
+ and STRING2 that matched the entire BUFP->buffer and its contained
+ subexpressions.
+
+ Do not consider matching one past the index STOP in the virtual
+ concatenation of STRING1 and STRING2.
+
+ We return either the position in the strings at which the match was
+ found, -1 if no match, or -2 if error (such as failure
+ stack overflow). */
int
-re_search_2 (pbufp, string1, size1, string2, size2, startpos, range, regs, mstop)
- struct re_pattern_buffer *pbufp;
- char *string1, *string2;
+re_search_2 (bufp, string1, size1, string2, size2, startpos, range, regs, stop)
+ struct re_pattern_buffer *bufp;
+ const char *string1, *string2;
int size1, size2;
int startpos;
- register int range;
+ int range;
struct re_registers *regs;
- int mstop;
+ int stop;
{
- register char *fastmap = pbufp->fastmap;
- register unsigned char *translate = (unsigned char *) pbufp->translate;
- int total = size1 + size2;
int val;
-
- /* Update the fastmap now if not correct already */
- if (fastmap && !pbufp->fastmap_accurate)
- re_compile_fastmap (pbufp);
-
- /* Don't waste time in a long search for a pattern
- that says it is anchored. */
- if (pbufp->used > 0 && (enum regexpcode) pbufp->buffer[0] == begbuf
- && range > 0)
+ register char *fastmap = bufp->fastmap;
+ register char *translate = bufp->translate;
+ int total_size = size1 + size2;
+ int endpos = startpos + range;
+
+ /* Check for out-of-range STARTPOS. */
+ if (startpos < 0 || startpos > total_size)
+ return -1;
+
+ /* Fix up RANGE if it might eventually take us outside
+ the virtual concatenation of STRING1 and STRING2. */
+ if (endpos < -1)
+ range = -1 - startpos;
+ else if (endpos > total_size)
+ range = total_size - startpos;
+
+ /* If the search isn't to be a backwards one, don't waste time in a
+ search for a pattern that must be anchored. */
+ if (bufp->used > 0 && (re_opcode_t) bufp->buffer[0] == begbuf && range > 0)
{
if (startpos > 0)
return -1;
@@ -963,159 +2944,420 @@ re_search_2 (pbufp, string1, size1, string2, size2, startpos, range, regs, mstop
range = 1;
}
- while (1)
- {
- /* If a fastmap is supplied, skip quickly over characters
- that cannot possibly be the start of a match.
- Note, however, that if the pattern can possibly match
- the null string, we must test it at each starting point
- so that we take the first null string we get. */
-
- if (fastmap && startpos < total && pbufp->can_be_null != 1)
+ /* Update the fastmap now if not correct already. */
+ if (fastmap && !bufp->fastmap_accurate)
+ if (re_compile_fastmap (bufp) == -2)
+ return -2;
+
+ /* Loop through the string, looking for a place to start matching. */
+ for (;;)
+ {
+ /* If a fastmap is supplied, skip quickly over characters that
+ cannot be the start of a match. If the pattern can match the
+ null string, however, we don't need to skip characters; we want
+ the first null string. */
+ if (fastmap && startpos < total_size && !bufp->can_be_null)
{
- if (range > 0)
+ if (range > 0) /* Searching forwards. */
{
+ register const char *d;
register int lim = 0;
- register unsigned char *p;
int irange = range;
- if (startpos < size1 && startpos + range >= size1)
- lim = range - (size1 - startpos);
- p = ((unsigned char *)
- &(startpos >= size1 ? string2 - size1 : string1)[startpos]);
+ if (startpos < size1 && startpos + range >= size1)
+ lim = range - (size1 - startpos);
+ d = (startpos >= size1 ? string2 - size1 : string1) + startpos;
+
+ /* Written out as an if-else to avoid testing `translate'
+ inside the loop. */
if (translate)
- {
- while (range > lim && !fastmap[translate[*p++]])
- range--;
- }
+ while (range > lim
+ && !fastmap[(unsigned char)
+ translate[(unsigned char) *d++]])
+ range--;
else
- {
- while (range > lim && !fastmap[*p++])
- range--;
- }
+ while (range > lim && !fastmap[(unsigned char) *d++])
+ range--;
+
startpos += irange - range;
}
- else
+ else /* Searching backwards. */
{
- register unsigned char c;
- if (startpos >= size1)
- c = string2[startpos - size1];
- else
- c = string1[startpos];
- c &= 0xff;
- if (translate ? !fastmap[translate[c]] : !fastmap[c])
+ register char c = (size1 == 0 || startpos >= size1
+ ? string2[startpos - size1]
+ : string1[startpos]);
+
+ if (!fastmap[(unsigned char) TRANSLATE (c)])
goto advance;
}
}
- if (range >= 0 && startpos == total
- && fastmap && pbufp->can_be_null == 0)
+ /* If can't match the null string, and that's all we have left, fail. */
+ if (range >= 0 && startpos == total_size && fastmap
+ && !bufp->can_be_null)
return -1;
- val = re_match_2 (pbufp, string1, size1, string2, size2, startpos, regs, mstop);
- if (0 <= val)
- {
- if (val == -2)
- return -2;
- return startpos;
- }
-
-#ifdef C_ALLOCA
- alloca (0);
-#endif /* C_ALLOCA */
+ val = re_match_2 (bufp, string1, size1, string2, size2,
+ startpos, regs, stop);
+ if (val >= 0)
+ return startpos;
+
+ if (val == -2)
+ return -2;
advance:
- if (!range) break;
- if (range > 0) range--, startpos++; else range++, startpos--;
+ if (!range)
+ break;
+ else if (range > 0)
+ {
+ range--;
+ startpos++;
+ }
+ else
+ {
+ range++;
+ startpos--;
+ }
}
return -1;
-}
+} /* re_search_2 */
-#ifndef emacs /* emacs never uses this */
-int
-re_match (pbufp, string, size, pos, regs)
- struct re_pattern_buffer *pbufp;
- char *string;
- int size, pos;
- struct re_registers *regs;
+/* Declarations and macros for re_match_2. */
+
+static int bcmp_translate ();
+static boolean alt_match_null_string_p (),
+ common_op_match_null_string_p (),
+ group_match_null_string_p ();
+
+/* Structure for per-register (a.k.a. per-group) information.
+ This must not be longer than one word, because we push this value
+ onto the failure stack. Other register information, such as the
+ starting and ending positions (which are addresses), and the list of
+ inner groups (which is a bits list) are maintained in separate
+ variables.
+
+ We are making a (strictly speaking) nonportable assumption here: that
+ the compiler will pack our bit fields into something that fits into
+ the type of `word', i.e., is something that fits into one item on the
+ failure stack. */
+typedef union
{
- return re_match_2 (pbufp, 0, 0, string, size, pos, regs, size);
-}
-#endif /* emacs */
+ fail_stack_elt_t word;
+ struct
+ {
+ /* This field is one if this group can match the empty string,
+ zero if not. If not yet determined, `MATCH_NULL_UNSET_VALUE'. */
+#define MATCH_NULL_UNSET_VALUE 3
+ unsigned match_null_string_p : 2;
+ unsigned is_active : 1;
+ unsigned matched_something : 1;
+ unsigned ever_matched_something : 1;
+ } bits;
+} register_info_type;
+
+#define REG_MATCH_NULL_STRING_P(R) ((R).bits.match_null_string_p)
+#define IS_ACTIVE(R) ((R).bits.is_active)
+#define MATCHED_SOMETHING(R) ((R).bits.matched_something)
+#define EVER_MATCHED_SOMETHING(R) ((R).bits.ever_matched_something)
+
+
+/* Call this when have matched a real character; it sets `matched' flags
+ for the subexpressions which we are currently inside. Also records
+ that those subexprs have matched. */
+#define SET_REGS_MATCHED() \
+ do \
+ { \
+ unsigned r; \
+ for (r = lowest_active_reg; r <= highest_active_reg; r++) \
+ { \
+ MATCHED_SOMETHING (reg_info[r]) \
+ = EVER_MATCHED_SOMETHING (reg_info[r]) \
+ = 1; \
+ } \
+ } \
+ while (0)
+
+
+/* This converts PTR, a pointer into one of the search strings `string1'
+ and `string2' into an offset from the beginning of that string. */
+#define POINTER_TO_OFFSET(ptr) \
+ (FIRST_STRING_P (ptr) ? (ptr) - string1 : (ptr) - string2 + size1)
+
+/* Registers are set to a sentinel when they haven't yet matched. */
+#define REG_UNSET_VALUE ((char *) -1)
+#define REG_UNSET(e) ((e) == REG_UNSET_VALUE)
+
+
+/* Macros for dealing with the split strings in re_match_2. */
+
+#define MATCHING_IN_FIRST_STRING (dend == end_match_1)
+
+/* Call before fetching a character with *d. This switches over to
+ string2 if necessary. */
+#define PREFETCH() \
+ while (d == dend) \
+ { \
+ /* End of string2 => fail. */ \
+ if (dend == end_match_2) \
+ goto fail; \
+ /* End of string1 => advance to string2. */ \
+ d = string2; \
+ dend = end_match_2; \
+ }
-/* Maximum size of failure stack. Beyond this, overflow is an error. */
-int re_max_failures = 2000;
+/* Test if at very beginning or at very end of the virtual concatenation
+ of `string1' and `string2'. If only one string, it's `string2'. */
+#define AT_STRINGS_BEG(d) ((d) == (size1 ? string1 : string2) || !size2)
+#define AT_STRINGS_END(d) ((d) == end2)
+
+
+/* Test if D points to a character which is word-constituent. We have
+ two special cases to check for: if past the end of string1, look at
+ the first character in string2; and if before the beginning of
+ string2, look at the last character in string1. */
+#define WORDCHAR_P(d) \
+ (SYNTAX ((d) == end1 ? *string2 \
+ : (d) == string2 - 1 ? *(end1 - 1) : *(d)) \
+ == Sword)
+
+/* Test if the character before D and the one at D differ with respect
+ to being word-constituent. */
+#define AT_WORD_BOUNDARY(d) \
+ (AT_STRINGS_BEG (d) || AT_STRINGS_END (d) \
+ || WORDCHAR_P (d - 1) != WORDCHAR_P (d))
+
+
+/* Free everything we malloc. */
+#ifdef REGEX_MALLOC
+#define FREE_VAR(var) if (var) free (var); var = NULL
+#define FREE_VARIABLES() \
+ do { \
+ FREE_VAR (fail_stack.stack); \
+ FREE_VAR (regstart); \
+ FREE_VAR (regend); \
+ FREE_VAR (old_regstart); \
+ FREE_VAR (old_regend); \
+ FREE_VAR (best_regstart); \
+ FREE_VAR (best_regend); \
+ FREE_VAR (reg_info); \
+ FREE_VAR (reg_dummy); \
+ FREE_VAR (reg_info_dummy); \
+ } while (0)
+#else /* not REGEX_MALLOC */
+/* Some MIPS systems (at least) want this to free alloca'd storage. */
+#define FREE_VARIABLES() alloca (0)
+#endif /* not REGEX_MALLOC */
+
+
+/* These values must meet several constraints. They must not be valid
+ register values; since we have a limit of 255 registers (because
+ we use only one byte in the pattern for the register number), we can
+ use numbers larger than 255. They must differ by 1, because of
+ NUM_FAILURE_ITEMS above. And the value for the lowest register must
+ be larger than the value for the highest register, so we do not try
+ to actually save any registers when none are active. */
+#define NO_HIGHEST_ACTIVE_REG (1 << BYTEWIDTH)
+#define NO_LOWEST_ACTIVE_REG (NO_HIGHEST_ACTIVE_REG + 1)
+
+/* Matching routines. */
-static int bcmp_translate();
-/* Match the pattern described by PBUFP
- against data which is the virtual concatenation of STRING1 and STRING2.
- SIZE1 and SIZE2 are the sizes of the two data strings.
- Start the match at position POS.
- Do not consider matching past the position MSTOP.
+#ifndef emacs /* Emacs never uses this. */
+/* re_match is like re_match_2 except it takes only a single string. */
- If pbufp->fastmap is nonzero, then it had better be up to date.
+int
+re_match (bufp, string, size, pos, regs)
+ struct re_pattern_buffer *bufp;
+ const char *string;
+ int size, pos;
+ struct re_registers *regs;
+ {
+ return re_match_2 (bufp, NULL, 0, string, size, pos, regs, size);
+}
+#endif /* not emacs */
- The reason that the data to match are specified as two components
- which are to be regarded as concatenated
- is so this function can be used directly on the contents of an Emacs buffer.
- -1 is returned if there is no match. -2 is returned if there is
- an error (such as match stack overflow). Otherwise the value is the length
- of the substring which was matched. */
+/* re_match_2 matches the compiled pattern in BUFP against the
+ the (virtual) concatenation of STRING1 and STRING2 (of length SIZE1
+ and SIZE2, respectively). We start matching at POS, and stop
+ matching at STOP.
+
+ If REGS is non-null and the `no_sub' field of BUFP is nonzero, we
+ store offsets for the substring each group matched in REGS. See the
+ documentation for exactly how many groups we fill.
+
+ We return -1 if no match, -2 if an internal error (such as the
+ failure stack overflowing). Otherwise, we return the length of the
+ matched substring. */
int
-re_match_2 (pbufp, string1, size1, string2, size2, pos, regs, mstop)
- struct re_pattern_buffer *pbufp;
- unsigned char *string1, *string2;
+re_match_2 (bufp, string1, size1, string2, size2, pos, regs, stop)
+ struct re_pattern_buffer *bufp;
+ const char *string1, *string2;
int size1, size2;
int pos;
struct re_registers *regs;
- int mstop;
+ int stop;
{
- register unsigned char *p = (unsigned char *) pbufp->buffer;
- register unsigned char *pend = p + pbufp->used;
- /* End of first string */
- unsigned char *end1;
- /* End of second string */
- unsigned char *end2;
- /* Pointer just past last char to consider matching */
- unsigned char *end_match_1, *end_match_2;
- register unsigned char *d, *dend;
- register int mcnt;
- unsigned char *translate = (unsigned char *) pbufp->translate;
-
- /* Failure point stack. Each place that can handle a failure further down the line
- pushes a failure point on this stack. It consists of two char *'s.
- The first one pushed is where to resume scanning the pattern;
- the second pushed is where to resume scanning the strings.
- If the latter is zero, the failure point is a "dummy".
- If a failure happens and the innermost failure point is dormant,
- it discards that failure point and tries the next one. */
-
- unsigned char *initial_stack[2 * NFAILURES];
- unsigned char **stackb = initial_stack;
- unsigned char **stackp = stackb, **stacke = &stackb[2 * NFAILURES];
-
- /* Information on the "contents" of registers.
- These are pointers into the input strings; they record
- just what was matched (on this attempt) by some part of the pattern.
- The start_memory command stores the start of a register's contents
- and the stop_memory command stores the end.
-
- At that point, regstart[regnum] points to the first character in the register,
- regend[regnum] points to the first character beyond the end of the register,
- regstart_seg1[regnum] is true iff regstart[regnum] points into string1,
- and regend_seg1[regnum] is true iff regend[regnum] points into string1. */
-
- unsigned char *regstart[RE_NREGS];
- unsigned char *regend[RE_NREGS];
- unsigned char regstart_seg1[RE_NREGS], regend_seg1[RE_NREGS];
-
- /* Set up pointers to ends of strings.
- Don't allow the second string to be empty unless both are empty. */
- if (!size2)
+ /* General temporaries. */
+ int mcnt;
+ unsigned char *p1;
+
+ /* Just past the end of the corresponding string. */
+ const char *end1, *end2;
+
+ /* Pointers into string1 and string2, just past the last characters in
+ each to consider matching. */
+ const char *end_match_1, *end_match_2;
+
+ /* Where we are in the data, and the end of the current string. */
+ const char *d, *dend;
+
+ /* Where we are in the pattern, and the end of the pattern. */
+ unsigned char *p = bufp->buffer;
+ register unsigned char *pend = p + bufp->used;
+
+ /* We use this to map every character in the string. */
+ char *translate = bufp->translate;
+
+ /* Failure point stack. Each place that can handle a failure further
+ down the line pushes a failure point on this stack. It consists of
+ restart, regend, and reg_info for all registers corresponding to
+ the subexpressions we're currently inside, plus the number of such
+ registers, and, finally, two char *'s. The first char * is where
+ to resume scanning the pattern; the second one is where to resume
+ scanning the strings. If the latter is zero, the failure point is
+ a ``dummy''; if a failure happens and the failure point is a dummy,
+ it gets discarded and the next next one is tried. */
+ fail_stack_type fail_stack;
+#ifdef DEBUG
+ static unsigned failure_id = 0;
+ unsigned nfailure_points_pushed = 0, nfailure_points_popped = 0;
+#endif
+
+ /* We fill all the registers internally, independent of what we
+ return, for use in backreferences. The number here includes
+ an element for register zero. */
+ unsigned num_regs = bufp->re_nsub + 1;
+
+ /* The currently active registers. */
+ unsigned lowest_active_reg = NO_LOWEST_ACTIVE_REG;
+ unsigned highest_active_reg = NO_HIGHEST_ACTIVE_REG;
+
+ /* Information on the contents of registers. These are pointers into
+ the input strings; they record just what was matched (on this
+ attempt) by a subexpression part of the pattern, that is, the
+ regnum-th regstart pointer points to where in the pattern we began
+ matching and the regnum-th regend points to right after where we
+ stopped matching the regnum-th subexpression. (The zeroth register
+ keeps track of what the whole pattern matches.) */
+ const char **regstart, **regend;
+
+ /* If a group that's operated upon by a repetition operator fails to
+ match anything, then the register for its start will need to be
+ restored because it will have been set to wherever in the string we
+ are when we last see its open-group operator. Similarly for a
+ register's end. */
+ const char **old_regstart, **old_regend;
+
+ /* The is_active field of reg_info helps us keep track of which (possibly
+ nested) subexpressions we are currently in. The matched_something
+ field of reg_info[reg_num] helps us tell whether or not we have
+ matched any of the pattern so far this time through the reg_num-th
+ subexpression. These two fields get reset each time through any
+ loop their register is in. */
+ register_info_type *reg_info;
+
+ /* The following record the register info as found in the above
+ variables when we find a match better than any we've seen before.
+ This happens as we backtrack through the failure points, which in
+ turn happens only if we have not yet matched the entire string. */
+ unsigned best_regs_set = false;
+ const char **best_regstart, **best_regend;
+
+ /* Logically, this is `best_regend[0]'. But we don't want to have to
+ allocate space for that if we're not allocating space for anything
+ else (see below). Also, we never need info about register 0 for
+ any of the other register vectors, and it seems rather a kludge to
+ treat `best_regend' differently than the rest. So we keep track of
+ the end of the best match so far in a separate variable. We
+ initialize this to NULL so that when we backtrack the first time
+ and need to test it, it's not garbage. */
+ const char *match_end = NULL;
+
+ /* Used when we pop values we don't care about. */
+ const char **reg_dummy;
+ register_info_type *reg_info_dummy;
+
+#ifdef DEBUG
+ /* Counts the total number of registers pushed. */
+ unsigned num_regs_pushed = 0;
+#endif
+
+ DEBUG_PRINT1 ("\n\nEntering re_match_2.\n");
+
+ INIT_FAIL_STACK ();
+
+ /* Do not bother to initialize all the register variables if there are
+ no groups in the pattern, as it takes a fair amount of time. If
+ there are groups, we include space for register 0 (the whole
+ pattern), even though we never use it, since it simplifies the
+ array indexing. We should fix this. */
+ if (bufp->re_nsub)
+ {
+ regstart = REGEX_TALLOC (num_regs, const char *);
+ regend = REGEX_TALLOC (num_regs, const char *);
+ old_regstart = REGEX_TALLOC (num_regs, const char *);
+ old_regend = REGEX_TALLOC (num_regs, const char *);
+ best_regstart = REGEX_TALLOC (num_regs, const char *);
+ best_regend = REGEX_TALLOC (num_regs, const char *);
+ reg_info = REGEX_TALLOC (num_regs, register_info_type);
+ reg_dummy = REGEX_TALLOC (num_regs, const char *);
+ reg_info_dummy = REGEX_TALLOC (num_regs, register_info_type);
+
+ if (!(regstart && regend && old_regstart && old_regend && reg_info
+ && best_regstart && best_regend && reg_dummy && reg_info_dummy))
+ {
+ FREE_VARIABLES ();
+ return -2;
+ }
+ }
+#ifdef REGEX_MALLOC
+ else
+ {
+ /* We must initialize all our variables to NULL, so that
+ `FREE_VARIABLES' doesn't try to free them. */
+ regstart = regend = old_regstart = old_regend = best_regstart
+ = best_regend = reg_dummy = NULL;
+ reg_info = reg_info_dummy = (register_info_type *) NULL;
+ }
+#endif /* REGEX_MALLOC */
+
+ /* The starting position is bogus. */
+ if (pos < 0 || pos > size1 + size2)
+ {
+ FREE_VARIABLES ();
+ return -1;
+ }
+
+ /* Initialize subexpression text positions to -1 to mark ones that no
+ start_memory/stop_memory has been seen for. Also initialize the
+ register information struct. */
+ for (mcnt = 1; mcnt < num_regs; mcnt++)
+ {
+ regstart[mcnt] = regend[mcnt]
+ = old_regstart[mcnt] = old_regend[mcnt] = REG_UNSET_VALUE;
+
+ REG_MATCH_NULL_STRING_P (reg_info[mcnt]) = MATCH_NULL_UNSET_VALUE;
+ IS_ACTIVE (reg_info[mcnt]) = 0;
+ MATCHED_SOMETHING (reg_info[mcnt]) = 0;
+ EVER_MATCHED_SOMETHING (reg_info[mcnt]) = 0;
+ }
+
+ /* We move `string1' into `string2' if the latter's empty -- but not if
+ `string1' is null. */
+ if (size2 == 0 && string1 != NULL)
{
string2 = string1;
size2 = size1;
@@ -1125,482 +3367,1329 @@ re_match_2 (pbufp, string1, size1, string2, size2, pos, regs, mstop)
end1 = string1 + size1;
end2 = string2 + size2;
- /* Compute where to stop matching, within the two strings */
- if (mstop <= size1)
+ /* Compute where to stop matching, within the two strings. */
+ if (stop <= size1)
{
- end_match_1 = string1 + mstop;
+ end_match_1 = string1 + stop;
end_match_2 = string2;
}
else
{
end_match_1 = end1;
- end_match_2 = string2 + mstop - size1;
+ end_match_2 = string2 + stop - size1;
}
- /* Initialize \) text positions to -1
- to mark ones that no \( or \) has been seen for. */
-
- for (mcnt = 0; mcnt < sizeof (regend) / sizeof (*regend); mcnt++)
- regend[mcnt] = (unsigned char *) -1;
-
- /* `p' scans through the pattern as `d' scans through the data.
- `dend' is the end of the input string that `d' points within.
- `d' is advanced into the following input string whenever necessary,
- but this happens before fetching;
- therefore, at the beginning of the loop,
- `d' can be pointing at the end of a string,
- but it cannot equal string2. */
-
- if (pos <= size1)
- d = string1 + pos, dend = end_match_1;
+ /* `p' scans through the pattern as `d' scans through the data.
+ `dend' is the end of the input string that `d' points within. `d'
+ is advanced into the following input string whenever necessary, but
+ this happens before fetching; therefore, at the beginning of the
+ loop, `d' can be pointing at the end of a string, but it cannot
+ equal `string2'. */
+ if (size1 > 0 && pos <= size1)
+ {
+ d = string1 + pos;
+ dend = end_match_1;
+ }
else
- d = string2 + pos - size1, dend = end_match_2;
-
-/* Write PREFETCH; just before fetching a character with *d. */
-#define PREFETCH \
- while (d == dend) \
- { if (dend == end_match_2) goto fail; /* end of string2 => failure */ \
- d = string2; /* end of string1 => advance to string2. */ \
- dend = end_match_2; }
-
- /* This loop loops over pattern commands.
- It exits by returning from the function if match is complete,
- or it drops through if match fails at this starting point in the input data. */
+ {
+ d = string2 + pos - size1;
+ dend = end_match_2;
+ }
- while (1)
+ DEBUG_PRINT1 ("The compiled pattern is: ");
+ DEBUG_PRINT_COMPILED_PATTERN (bufp, p, pend);
+ DEBUG_PRINT1 ("The string to match is: `");
+ DEBUG_PRINT_DOUBLE_STRING (d, string1, size1, string2, size2);
+ DEBUG_PRINT1 ("'\n");
+
+ /* This loops over pattern commands. It exits by returning from the
+ function if the match is complete, or it drops through if the match
+ fails at this starting point in the input data. */
+ for (;;)
{
+ DEBUG_PRINT2 ("\n0x%x: ", p);
+
if (p == pend)
- /* End of pattern means we have succeeded! */
- {
- /* If caller wants register contents data back, convert it to indices */
- if (regs)
+ { /* End of pattern means we might have succeeded. */
+ DEBUG_PRINT1 ("end of pattern ... ");
+
+ /* If we haven't matched the entire string, and we want the
+ longest match, try backtracking. */
+ if (d != end_match_2)
{
- regs->start[0] = pos;
- if (dend == end_match_1)
- regs->end[0] = d - string1;
- else
- regs->end[0] = d - string2 + size1;
- for (mcnt = 1; mcnt < RE_NREGS; mcnt++)
- {
- if (regend[mcnt] == (unsigned char *) -1)
+ DEBUG_PRINT1 ("backtracking.\n");
+
+ if (!FAIL_STACK_EMPTY ())
+ { /* More failure points to try. */
+ boolean same_str_p = (FIRST_STRING_P (match_end)
+ == MATCHING_IN_FIRST_STRING);
+
+ /* If exceeds best match so far, save it. */
+ if (!best_regs_set
+ || (same_str_p && d > match_end)
+ || (!same_str_p && !MATCHING_IN_FIRST_STRING))
+ {
+ best_regs_set = true;
+ match_end = d;
+
+ DEBUG_PRINT1 ("\nSAVING match as best so far.\n");
+
+ for (mcnt = 1; mcnt < num_regs; mcnt++)
+ {
+ best_regstart[mcnt] = regstart[mcnt];
+ best_regend[mcnt] = regend[mcnt];
+ }
+ }
+ goto fail;
+ }
+
+ /* If no failure points, don't restore garbage. */
+ else if (best_regs_set)
+ {
+ restore_best_regs:
+ /* Restore best match. It may happen that `dend ==
+ end_match_1' while the restored d is in string2.
+ For example, the pattern `x.*y.*z' against the
+ strings `x-' and `y-z-', if the two strings are
+ not consecutive in memory. */
+ DEBUG_PRINT1 ("Restoring best registers.\n");
+
+ d = match_end;
+ dend = ((d >= string1 && d <= end1)
+ ? end_match_1 : end_match_2);
+
+ for (mcnt = 1; mcnt < num_regs; mcnt++)
{
- regs->start[mcnt] = -1;
- regs->end[mcnt] = -1;
- continue;
+ regstart[mcnt] = best_regstart[mcnt];
+ regend[mcnt] = best_regend[mcnt];
}
- if (regstart_seg1[mcnt])
- regs->start[mcnt] = regstart[mcnt] - string1;
- else
- regs->start[mcnt] = regstart[mcnt] - string2 + size1;
- if (regend_seg1[mcnt])
- regs->end[mcnt] = regend[mcnt] - string1;
- else
- regs->end[mcnt] = regend[mcnt] - string2 + size1;
+ }
+ } /* d != end_match_2 */
+
+ DEBUG_PRINT1 ("Accepting match.\n");
+
+ /* If caller wants register contents data back, do it. */
+ if (regs && !bufp->no_sub)
+ {
+ /* Have the register data arrays been allocated? */
+ if (bufp->regs_allocated == REGS_UNALLOCATED)
+ { /* No. So allocate them with malloc. We need one
+ extra element beyond `num_regs' for the `-1' marker
+ GNU code uses. */
+ regs->num_regs = MAX (RE_NREGS, num_regs + 1);
+ regs->start = TALLOC (regs->num_regs, regoff_t);
+ regs->end = TALLOC (regs->num_regs, regoff_t);
+ if (regs->start == NULL || regs->end == NULL)
+ return -2;
+ bufp->regs_allocated = REGS_REALLOCATE;
+ }
+ else if (bufp->regs_allocated == REGS_REALLOCATE)
+ { /* Yes. If we need more elements than were already
+ allocated, reallocate them. If we need fewer, just
+ leave it alone. */
+ if (regs->num_regs < num_regs + 1)
+ {
+ regs->num_regs = num_regs + 1;
+ RETALLOC (regs->start, regs->num_regs, regoff_t);
+ RETALLOC (regs->end, regs->num_regs, regoff_t);
+ if (regs->start == NULL || regs->end == NULL)
+ return -2;
+ }
+ }
+ else
+ {
+ /* These braces fend off a "empty body in an else-statement"
+ warning under GCC when assert expands to nothing. */
+ assert (bufp->regs_allocated == REGS_FIXED);
}
- }
- if (dend == end_match_1)
- return (d - string1 - pos);
- else
- return d - string2 + size1 - pos;
- }
- /* Otherwise match next pattern command */
+ /* Convert the pointer data in `regstart' and `regend' to
+ indices. Register zero has to be set differently,
+ since we haven't kept track of any info for it. */
+ if (regs->num_regs > 0)
+ {
+ regs->start[0] = pos;
+ regs->end[0] = (MATCHING_IN_FIRST_STRING ? d - string1
+ : d - string2 + size1);
+ }
+
+ /* Go through the first `min (num_regs, regs->num_regs)'
+ registers, since that is all we initialized. */
+ for (mcnt = 1; mcnt < MIN (num_regs, regs->num_regs); mcnt++)
+ {
+ if (REG_UNSET (regstart[mcnt]) || REG_UNSET (regend[mcnt]))
+ regs->start[mcnt] = regs->end[mcnt] = -1;
+ else
+ {
+ regs->start[mcnt] = POINTER_TO_OFFSET (regstart[mcnt]);
+ regs->end[mcnt] = POINTER_TO_OFFSET (regend[mcnt]);
+ }
+ }
+
+ /* If the regs structure we return has more elements than
+ were in the pattern, set the extra elements to -1. If
+ we (re)allocated the registers, this is the case,
+ because we always allocate enough to have at least one
+ -1 at the end. */
+ for (mcnt = num_regs; mcnt < regs->num_regs; mcnt++)
+ regs->start[mcnt] = regs->end[mcnt] = -1;
+ } /* regs && !bufp->no_sub */
+
+ FREE_VARIABLES ();
+ DEBUG_PRINT4 ("%u failure points pushed, %u popped (%u remain).\n",
+ nfailure_points_pushed, nfailure_points_popped,
+ nfailure_points_pushed - nfailure_points_popped);
+ DEBUG_PRINT2 ("%u registers pushed.\n", num_regs_pushed);
+
+ mcnt = d - pos - (MATCHING_IN_FIRST_STRING
+ ? string1
+ : string2 - size1);
+
+ DEBUG_PRINT2 ("Returning %d from re_match_2.\n", mcnt);
+
+ return mcnt;
+ }
+
+ /* Otherwise match next pattern command. */
#ifdef SWITCH_ENUM_BUG
- switch ((int) ((enum regexpcode) *p++))
+ switch ((int) ((re_opcode_t) *p++))
#else
- switch ((enum regexpcode) *p++)
+ switch ((re_opcode_t) *p++)
#endif
{
+ /* Ignore these. Used to ignore the n of succeed_n's which
+ currently have n == 0. */
+ case no_op:
+ DEBUG_PRINT1 ("EXECUTING no_op.\n");
+ break;
- /* \( is represented by a start_memory, \) by a stop_memory.
- Both of those commands contain a "register number" argument.
- The text matched within the \( and \) is recorded under that number.
- Then, \<digit> turns into a `duplicate' command which
- is followed by the numeric value of <digit> as the register number. */
- case start_memory:
- regstart[*p] = d;
- regstart_seg1[*p++] = (dend == end_match_1);
- break;
-
- case stop_memory:
- regend[*p] = d;
- regend_seg1[*p++] = (dend == end_match_1);
- break;
+ /* Match the next n pattern characters exactly. The following
+ byte in the pattern defines n, and the n bytes after that
+ are the characters to match. */
+ case exactn:
+ mcnt = *p++;
+ DEBUG_PRINT2 ("EXECUTING exactn %d.\n", mcnt);
- case duplicate:
- {
- int regno = *p++; /* Get which register to match against */
- register unsigned char *d2, *dend2;
+ /* This is written out as an if-else so we don't waste time
+ testing `translate' inside the loop. */
+ if (translate)
+ {
+ do
+ {
+ PREFETCH ();
+ if (translate[(unsigned char) *d++] != (char) *p++)
+ goto fail;
+ }
+ while (--mcnt);
+ }
+ else
+ {
+ do
+ {
+ PREFETCH ();
+ if (*d++ != (char) *p++) goto fail;
+ }
+ while (--mcnt);
+ }
+ SET_REGS_MATCHED ();
+ break;
- d2 = regstart[regno];
- dend2 = ((regstart_seg1[regno] == regend_seg1[regno])
- ? regend[regno] : end_match_1);
- while (1)
- {
- /* Advance to next segment in register contents, if necessary */
- while (d2 == dend2)
- {
- if (dend2 == end_match_2) break;
- if (dend2 == regend[regno]) break;
- d2 = string2, dend2 = regend[regno]; /* end of string1 => advance to string2. */
- }
- /* At end of register contents => success */
- if (d2 == dend2) break;
- /* Advance to next segment in data being matched, if necessary */
- PREFETCH;
+ /* Match any character except possibly a newline or a null. */
+ case anychar:
+ DEBUG_PRINT1 ("EXECUTING anychar.\n");
- /* mcnt gets # consecutive chars to compare */
- mcnt = dend - d;
- if (mcnt > dend2 - d2)
- mcnt = dend2 - d2;
- /* Compare that many; failure if mismatch, else skip them. */
- if (translate ? bcmp_translate (d, d2, mcnt, translate) : bcmp (d, d2, mcnt))
- goto fail;
- d += mcnt, d2 += mcnt;
- }
- }
- break;
+ PREFETCH ();
- case anychar:
- /* fetch a data character */
- PREFETCH;
- /* Match anything but a newline. */
- if ((translate ? translate[*d++] : *d++) == '\n')
+ if ((!(bufp->syntax & RE_DOT_NEWLINE) && TRANSLATE (*d) == '\n')
+ || (bufp->syntax & RE_DOT_NOT_NULL && TRANSLATE (*d) == '\000'))
goto fail;
+
+ SET_REGS_MATCHED ();
+ DEBUG_PRINT2 (" Matched `%d'.\n", *d);
+ d++;
break;
+
case charset:
case charset_not:
{
- /* Nonzero for charset_not */
- int not = 0;
- register int c;
- if (*(p - 1) == (unsigned char) charset_not)
- not = 1;
+ register unsigned char c;
+ boolean not = (re_opcode_t) *(p - 1) == charset_not;
- /* fetch a data character */
- PREFETCH;
+ DEBUG_PRINT2 ("EXECUTING charset%s.\n", not ? "_not" : "");
- if (translate)
- c = translate [*d];
- else
- c = *d;
+ PREFETCH ();
+ c = TRANSLATE (*d); /* The character to match. */
- if (c < *p * BYTEWIDTH
+ /* Cast to `unsigned' instead of `unsigned char' in case the
+ bit list is a full 32 bytes long. */
+ if (c < (unsigned) (*p * BYTEWIDTH)
&& p[1 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
not = !not;
p += 1 + *p;
if (!not) goto fail;
- d++;
+
+ SET_REGS_MATCHED ();
+ d++;
break;
}
- case begline:
- if (d == string1 || d[-1] == '\n')
- break;
- goto fail;
-
- case endline:
- if (d == end2
- || (d == end1 ? (size2 == 0 || *string2 == '\n') : *d == '\n'))
- break;
- goto fail;
- /* "or" constructs ("|") are handled by starting each alternative
- with an on_failure_jump that points to the start of the next alternative.
- Each alternative except the last ends with a jump to the joining point.
- (Actually, each jump except for the last one really jumps
- to the following jump, because tensioning the jumps is a hassle.) */
+ /* The beginning of a group is represented by start_memory.
+ The arguments are the register number in the next byte, and the
+ number of groups inner to this one in the next. The text
+ matched within the group is recorded (in the internal
+ registers data structure) under the register number. */
+ case start_memory:
+ DEBUG_PRINT3 ("EXECUTING start_memory %d (%d):\n", *p, p[1]);
+
+ /* Find out if this group can match the empty string. */
+ p1 = p; /* To send to group_match_null_string_p. */
+
+ if (REG_MATCH_NULL_STRING_P (reg_info[*p]) == MATCH_NULL_UNSET_VALUE)
+ REG_MATCH_NULL_STRING_P (reg_info[*p])
+ = group_match_null_string_p (&p1, pend, reg_info);
+
+ /* Save the position in the string where we were the last time
+ we were at this open-group operator in case the group is
+ operated upon by a repetition operator, e.g., with `(a*)*b'
+ against `ab'; then we want to ignore where we are now in
+ the string in case this attempt to match fails. */
+ old_regstart[*p] = REG_MATCH_NULL_STRING_P (reg_info[*p])
+ ? REG_UNSET (regstart[*p]) ? d : regstart[*p]
+ : regstart[*p];
+ DEBUG_PRINT2 (" old_regstart: %d\n",
+ POINTER_TO_OFFSET (old_regstart[*p]));
+
+ regstart[*p] = d;
+ DEBUG_PRINT2 (" regstart: %d\n", POINTER_TO_OFFSET (regstart[*p]));
+
+ IS_ACTIVE (reg_info[*p]) = 1;
+ MATCHED_SOMETHING (reg_info[*p]) = 0;
+
+ /* This is the new highest active register. */
+ highest_active_reg = *p;
+
+ /* If nothing was active before, this is the new lowest active
+ register. */
+ if (lowest_active_reg == NO_LOWEST_ACTIVE_REG)
+ lowest_active_reg = *p;
+
+ /* Move past the register number and inner group count. */
+ p += 2;
+ break;
+
+
+ /* The stop_memory opcode represents the end of a group. Its
+ arguments are the same as start_memory's: the register
+ number, and the number of inner groups. */
+ case stop_memory:
+ DEBUG_PRINT3 ("EXECUTING stop_memory %d (%d):\n", *p, p[1]);
+
+ /* We need to save the string position the last time we were at
+ this close-group operator in case the group is operated
+ upon by a repetition operator, e.g., with `((a*)*(b*)*)*'
+ against `aba'; then we want to ignore where we are now in
+ the string in case this attempt to match fails. */
+ old_regend[*p] = REG_MATCH_NULL_STRING_P (reg_info[*p])
+ ? REG_UNSET (regend[*p]) ? d : regend[*p]
+ : regend[*p];
+ DEBUG_PRINT2 (" old_regend: %d\n",
+ POINTER_TO_OFFSET (old_regend[*p]));
+
+ regend[*p] = d;
+ DEBUG_PRINT2 (" regend: %d\n", POINTER_TO_OFFSET (regend[*p]));
+
+ /* This register isn't active anymore. */
+ IS_ACTIVE (reg_info[*p]) = 0;
+
+ /* If this was the only register active, nothing is active
+ anymore. */
+ if (lowest_active_reg == highest_active_reg)
+ {
+ lowest_active_reg = NO_LOWEST_ACTIVE_REG;
+ highest_active_reg = NO_HIGHEST_ACTIVE_REG;
+ }
+ else
+ { /* We must scan for the new highest active register, since
+ it isn't necessarily one less than now: consider
+ (a(b)c(d(e)f)g). When group 3 ends, after the f), the
+ new highest active register is 1. */
+ unsigned char r = *p - 1;
+ while (r > 0 && !IS_ACTIVE (reg_info[r]))
+ r--;
+
+ /* If we end up at register zero, that means that we saved
+ the registers as the result of an `on_failure_jump', not
+ a `start_memory', and we jumped to past the innermost
+ `stop_memory'. For example, in ((.)*) we save
+ registers 1 and 2 as a result of the *, but when we pop
+ back to the second ), we are at the stop_memory 1.
+ Thus, nothing is active. */
+ if (r == 0)
+ {
+ lowest_active_reg = NO_LOWEST_ACTIVE_REG;
+ highest_active_reg = NO_HIGHEST_ACTIVE_REG;
+ }
+ else
+ highest_active_reg = r;
+ }
+
+ /* If just failed to match something this time around with a
+ group that's operated on by a repetition operator, try to
+ force exit from the ``loop'', and restore the register
+ information for this group that we had before trying this
+ last match. */
+ if ((!MATCHED_SOMETHING (reg_info[*p])
+ || (re_opcode_t) p[-3] == start_memory)
+ && (p + 2) < pend)
+ {
+ boolean is_a_jump_n = false;
+
+ p1 = p + 2;
+ mcnt = 0;
+ switch ((re_opcode_t) *p1++)
+ {
+ case jump_n:
+ is_a_jump_n = true;
+ case pop_failure_jump:
+ case maybe_pop_jump:
+ case jump:
+ case dummy_failure_jump:
+ EXTRACT_NUMBER_AND_INCR (mcnt, p1);
+ if (is_a_jump_n)
+ p1 += 2;
+ break;
+
+ default:
+ /* do nothing */ ;
+ }
+ p1 += mcnt;
+
+ /* If the next operation is a jump backwards in the pattern
+ to an on_failure_jump right before the start_memory
+ corresponding to this stop_memory, exit from the loop
+ by forcing a failure after pushing on the stack the
+ on_failure_jump's jump in the pattern, and d. */
+ if (mcnt < 0 && (re_opcode_t) *p1 == on_failure_jump
+ && (re_opcode_t) p1[3] == start_memory && p1[4] == *p)
+ {
+ /* If this group ever matched anything, then restore
+ what its registers were before trying this last
+ failed match, e.g., with `(a*)*b' against `ab' for
+ regstart[1], and, e.g., with `((a*)*(b*)*)*'
+ against `aba' for regend[3].
+
+ Also restore the registers for inner groups for,
+ e.g., `((a*)(b*))*' against `aba' (register 3 would
+ otherwise get trashed). */
+
+ if (EVER_MATCHED_SOMETHING (reg_info[*p]))
+ {
+ unsigned r;
+
+ EVER_MATCHED_SOMETHING (reg_info[*p]) = 0;
+
+ /* Restore this and inner groups' (if any) registers. */
+ for (r = *p; r < *p + *(p + 1); r++)
+ {
+ regstart[r] = old_regstart[r];
+
+ /* xx why this test? */
+ if ((int) old_regend[r] >= (int) regstart[r])
+ regend[r] = old_regend[r];
+ }
+ }
+ p1++;
+ EXTRACT_NUMBER_AND_INCR (mcnt, p1);
+ PUSH_FAILURE_POINT (p1 + mcnt, d, -2);
+
+ goto fail;
+ }
+ }
+
+ /* Move past the register number and the inner group count. */
+ p += 2;
+ break;
+
+
+ /* \<digit> has been turned into a `duplicate' command which is
+ followed by the numeric value of <digit> as the register number. */
+ case duplicate:
+ {
+ register const char *d2, *dend2;
+ int regno = *p++; /* Get which register to match against. */
+ DEBUG_PRINT2 ("EXECUTING duplicate %d.\n", regno);
+
+ /* Can't back reference a group which we've never matched. */
+ if (REG_UNSET (regstart[regno]) || REG_UNSET (regend[regno]))
+ goto fail;
+
+ /* Where in input to try to start matching. */
+ d2 = regstart[regno];
+
+ /* Where to stop matching; if both the place to start and
+ the place to stop matching are in the same string, then
+ set to the place to stop, otherwise, for now have to use
+ the end of the first string. */
+
+ dend2 = ((FIRST_STRING_P (regstart[regno])
+ == FIRST_STRING_P (regend[regno]))
+ ? regend[regno] : end_match_1);
+ for (;;)
+ {
+ /* If necessary, advance to next segment in register
+ contents. */
+ while (d2 == dend2)
+ {
+ if (dend2 == end_match_2) break;
+ if (dend2 == regend[regno]) break;
- /* The start of a stupid repeat has an on_failure_jump that points
- past the end of the repeat text.
- This makes a failure point so that, on failure to match a repetition,
- matching restarts past as many repetitions have been found
- with no way to fail and look for another one. */
+ /* End of string1 => advance to string2. */
+ d2 = string2;
+ dend2 = regend[regno];
+ }
+ /* At end of register contents => success */
+ if (d2 == dend2) break;
- /* A smart repeat is similar but loops back to the on_failure_jump
- so that each repetition makes another failure point. */
+ /* If necessary, advance to next segment in data. */
+ PREFETCH ();
- case on_failure_jump:
- if (stackp == stacke)
- {
- unsigned char **stackx;
- if (stacke - stackb > re_max_failures * 2)
- return -2;
- stackx = (unsigned char **) alloca (2 * (stacke - stackb)
- * sizeof (char *));
- bcopy (stackb, stackx, (stacke - stackb) * sizeof (char *));
- stackp = stackx + (stackp - stackb);
- stacke = stackx + 2 * (stacke - stackb);
- stackb = stackx;
- }
- mcnt = *p++ & 0377;
- mcnt += SIGN_EXTEND_CHAR (*(char *)p) << 8;
- p++;
- *stackp++ = mcnt + p;
- *stackp++ = d;
+ /* How many characters left in this segment to match. */
+ mcnt = dend - d;
+
+ /* Want how many consecutive characters we can match in
+ one shot, so, if necessary, adjust the count. */
+ if (mcnt > dend2 - d2)
+ mcnt = dend2 - d2;
+
+ /* Compare that many; failure if mismatch, else move
+ past them. */
+ if (translate
+ ? bcmp_translate (d, d2, mcnt, translate)
+ : bcmp (d, d2, mcnt))
+ goto fail;
+ d += mcnt, d2 += mcnt;
+ }
+ }
break;
- /* The end of a smart repeat has an maybe_finalize_jump back.
- Change it either to a finalize_jump or an ordinary jump. */
- case maybe_finalize_jump:
- mcnt = *p++ & 0377;
- mcnt += SIGN_EXTEND_CHAR (*(char *)p) << 8;
- p++;
- {
+ /* begline matches the empty string at the beginning of the string
+ (unless `not_bol' is set in `bufp'), and, if
+ `newline_anchor' is set, after newlines. */
+ case begline:
+ DEBUG_PRINT1 ("EXECUTING begline.\n");
+
+ if (AT_STRINGS_BEG (d))
+ {
+ if (!bufp->not_bol) break;
+ }
+ else if (d[-1] == '\n' && bufp->newline_anchor)
+ {
+ break;
+ }
+ /* In all other cases, we fail. */
+ goto fail;
+
+
+ /* endline is the dual of begline. */
+ case endline:
+ DEBUG_PRINT1 ("EXECUTING endline.\n");
+
+ if (AT_STRINGS_END (d))
+ {
+ if (!bufp->not_eol) break;
+ }
+
+ /* We have to ``prefetch'' the next character. */
+ else if ((d == end1 ? *string2 : *d) == '\n'
+ && bufp->newline_anchor)
+ {
+ break;
+ }
+ goto fail;
+
+
+ /* Match at the very beginning of the data. */
+ case begbuf:
+ DEBUG_PRINT1 ("EXECUTING begbuf.\n");
+ if (AT_STRINGS_BEG (d))
+ break;
+ goto fail;
+
+
+ /* Match at the very end of the data. */
+ case endbuf:
+ DEBUG_PRINT1 ("EXECUTING endbuf.\n");
+ if (AT_STRINGS_END (d))
+ break;
+ goto fail;
+
+
+ /* on_failure_keep_string_jump is used to optimize `.*\n'. It
+ pushes NULL as the value for the string on the stack. Then
+ `pop_failure_point' will keep the current value for the
+ string, instead of restoring it. To see why, consider
+ matching `foo\nbar' against `.*\n'. The .* matches the foo;
+ then the . fails against the \n. But the next thing we want
+ to do is match the \n against the \n; if we restored the
+ string value, we would be back at the foo.
+
+ Because this is used only in specific cases, we don't need to
+ check all the things that `on_failure_jump' does, to make
+ sure the right things get saved on the stack. Hence we don't
+ share its code. The only reason to push anything on the
+ stack at all is that otherwise we would have to change
+ `anychar's code to do something besides goto fail in this
+ case; that seems worse than this. */
+ case on_failure_keep_string_jump:
+ DEBUG_PRINT1 ("EXECUTING on_failure_keep_string_jump");
+
+ EXTRACT_NUMBER_AND_INCR (mcnt, p);
+ DEBUG_PRINT3 (" %d (to 0x%x):\n", mcnt, p + mcnt);
+
+ PUSH_FAILURE_POINT (p + mcnt, NULL, -2);
+ break;
+
+
+ /* Uses of on_failure_jump:
+
+ Each alternative starts with an on_failure_jump that points
+ to the beginning of the next alternative. Each alternative
+ except the last ends with a jump that in effect jumps past
+ the rest of the alternatives. (They really jump to the
+ ending jump of the following alternative, because tensioning
+ these jumps is a hassle.)
+
+ Repeats start with an on_failure_jump that points past both
+ the repetition text and either the following jump or
+ pop_failure_jump back to this on_failure_jump. */
+ case on_failure_jump:
+ on_failure:
+ DEBUG_PRINT1 ("EXECUTING on_failure_jump");
+
+ EXTRACT_NUMBER_AND_INCR (mcnt, p);
+ DEBUG_PRINT3 (" %d (to 0x%x)", mcnt, p + mcnt);
+
+ /* If this on_failure_jump comes right before a group (i.e.,
+ the original * applied to a group), save the information
+ for that group and all inner ones, so that if we fail back
+ to this point, the group's information will be correct.
+ For example, in \(a*\)*\1, we need the preceding group,
+ and in \(\(a*\)b*\)\2, we need the inner group. */
+
+ /* We can't use `p' to check ahead because we push
+ a failure point to `p + mcnt' after we do this. */
+ p1 = p;
+
+ /* We need to skip no_op's before we look for the
+ start_memory in case this on_failure_jump is happening as
+ the result of a completed succeed_n, as in \(a\)\{1,3\}b\1
+ against aba. */
+ while (p1 < pend && (re_opcode_t) *p1 == no_op)
+ p1++;
+
+ if (p1 < pend && (re_opcode_t) *p1 == start_memory)
+ {
+ /* We have a new highest active register now. This will
+ get reset at the start_memory we are about to get to,
+ but we will have saved all the registers relevant to
+ this repetition op, as described above. */
+ highest_active_reg = *(p1 + 1) + *(p1 + 2);
+ if (lowest_active_reg == NO_LOWEST_ACTIVE_REG)
+ lowest_active_reg = *(p1 + 1);
+ }
+
+ DEBUG_PRINT1 (":\n");
+ PUSH_FAILURE_POINT (p + mcnt, d, -2);
+ break;
+
+
+ /* A smart repeat ends with `maybe_pop_jump'.
+ We change it to either `pop_failure_jump' or `jump'. */
+ case maybe_pop_jump:
+ EXTRACT_NUMBER_AND_INCR (mcnt, p);
+ DEBUG_PRINT2 ("EXECUTING maybe_pop_jump %d.\n", mcnt);
+ {
register unsigned char *p2 = p;
- /* Compare what follows with the begining of the repeat.
- If we can establish that there is nothing that they would
- both match, we can change to finalize_jump */
- while (p2 != pend
- && (*p2 == (unsigned char) stop_memory
- || *p2 == (unsigned char) start_memory))
- p2++;
- if (p2 == pend)
- p[-3] = (unsigned char) finalize_jump;
- else if (*p2 == (unsigned char) exactn
- || *p2 == (unsigned char) endline)
+
+ /* Compare the beginning of the repeat with what in the
+ pattern follows its end. If we can establish that there
+ is nothing that they would both match, i.e., that we
+ would have to backtrack because of (as in, e.g., `a*a')
+ then we can change to pop_failure_jump, because we'll
+ never have to backtrack.
+
+ This is not true in the case of alternatives: in
+ `(a|ab)*' we do need to backtrack to the `ab' alternative
+ (e.g., if the string was `ab'). But instead of trying to
+ detect that here, the alternative has put on a dummy
+ failure point which is what we will end up popping. */
+
+ /* Skip over open/close-group commands. */
+ while (p2 + 2 < pend
+ && ((re_opcode_t) *p2 == stop_memory
+ || (re_opcode_t) *p2 == start_memory))
+ p2 += 3; /* Skip over args, too. */
+
+ /* If we're at the end of the pattern, we can change. */
+ if (p2 == pend)
+ {
+ /* Consider what happens when matching ":\(.*\)"
+ against ":/". I don't really understand this code
+ yet. */
+ p[-3] = (unsigned char) pop_failure_jump;
+ DEBUG_PRINT1
+ (" End of pattern: change to `pop_failure_jump'.\n");
+ }
+
+ else if ((re_opcode_t) *p2 == exactn
+ || (bufp->newline_anchor && (re_opcode_t) *p2 == endline))
{
- register int c = *p2 == (unsigned char) endline ? '\n' : p2[2];
- register unsigned char *p1 = p + mcnt;
- /* p1[0] ... p1[2] are an on_failure_jump.
- Examine what follows that */
- if (p1[3] == (unsigned char) exactn && p1[5] != c)
- p[-3] = (unsigned char) finalize_jump;
- else if (p1[3] == (unsigned char) charset
- || p1[3] == (unsigned char) charset_not)
+ register unsigned char c
+ = *p2 == (unsigned char) endline ? '\n' : p2[2];
+ p1 = p + mcnt;
+
+ /* p1[0] ... p1[2] are the `on_failure_jump' corresponding
+ to the `maybe_finalize_jump' of this case. Examine what
+ follows. */
+ if ((re_opcode_t) p1[3] == exactn && p1[5] != c)
+ {
+ p[-3] = (unsigned char) pop_failure_jump;
+ DEBUG_PRINT3 (" %c != %c => pop_failure_jump.\n",
+ c, p1[5]);
+ }
+
+ else if ((re_opcode_t) p1[3] == charset
+ || (re_opcode_t) p1[3] == charset_not)
{
- int not = p1[3] == (unsigned char) charset_not;
- if (c < p1[4] * BYTEWIDTH
+ int not = (re_opcode_t) p1[3] == charset_not;
+
+ if (c < (unsigned char) (p1[4] * BYTEWIDTH)
&& p1[5 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
not = !not;
- /* not is 1 if c would match */
- /* That means it is not safe to finalize */
+
+ /* `not' is equal to 1 if c would match, which means
+ that we can't change to pop_failure_jump. */
if (!not)
- p[-3] = (unsigned char) finalize_jump;
+ {
+ p[-3] = (unsigned char) pop_failure_jump;
+ DEBUG_PRINT1 (" No match => pop_failure_jump.\n");
+ }
}
}
}
- p -= 2;
- if (p[-1] != (unsigned char) finalize_jump)
+ p -= 2; /* Point at relative address again. */
+ if ((re_opcode_t) p[-1] != pop_failure_jump)
{
p[-1] = (unsigned char) jump;
- goto nofinalize;
+ DEBUG_PRINT1 (" Match => jump.\n");
+ goto unconditional_jump;
}
-
- /* The end of a stupid repeat has a finalize-jump
- back to the start, where another failure point will be made
- which will point after all the repetitions found so far. */
-
- case finalize_jump:
- stackp -= 2;
-
- case jump:
- nofinalize:
- mcnt = *p++ & 0377;
- mcnt += SIGN_EXTEND_CHAR (*(char *)p) << 8;
- p += mcnt + 1; /* The 1 compensates for missing ++ above */
+ /* Note fall through. */
+
+
+ /* The end of a simple repeat has a pop_failure_jump back to
+ its matching on_failure_jump, where the latter will push a
+ failure point. The pop_failure_jump takes off failure
+ points put on by this pop_failure_jump's matching
+ on_failure_jump; we got through the pattern to here from the
+ matching on_failure_jump, so didn't fail. */
+ case pop_failure_jump:
+ {
+ /* We need to pass separate storage for the lowest and
+ highest registers, even though we don't care about the
+ actual values. Otherwise, we will restore only one
+ register from the stack, since lowest will == highest in
+ `pop_failure_point'. */
+ unsigned dummy_low_reg, dummy_high_reg;
+ unsigned char *pdummy;
+ const char *sdummy;
+
+ DEBUG_PRINT1 ("EXECUTING pop_failure_jump.\n");
+ POP_FAILURE_POINT (sdummy, pdummy,
+ dummy_low_reg, dummy_high_reg,
+ reg_dummy, reg_dummy, reg_info_dummy);
+ }
+ /* Note fall through. */
+
+
+ /* Unconditionally jump (without popping any failure points). */
+ case jump:
+ unconditional_jump:
+ EXTRACT_NUMBER_AND_INCR (mcnt, p); /* Get the amount to jump. */
+ DEBUG_PRINT2 ("EXECUTING jump %d ", mcnt);
+ p += mcnt; /* Do the jump. */
+ DEBUG_PRINT2 ("(to 0x%x).\n", p);
break;
- case dummy_failure_jump:
- if (stackp == stacke)
- {
- unsigned char **stackx
- = (unsigned char **) alloca (2 * (stacke - stackb)
- * sizeof (char *));
- bcopy (stackb, stackx, (stacke - stackb) * sizeof (char *));
- stackp = stackx + (stackp - stackb);
- stacke = stackx + 2 * (stacke - stackb);
- stackb = stackx;
- }
- *stackp++ = 0;
- *stackp++ = 0;
- goto nofinalize;
-
- case wordbound:
- if (d == string1 /* Points to first char */
- || d == end2 /* Points to end */
- || (d == end1 && size2 == 0)) /* Points to end */
- break;
- if ((SYNTAX (d[-1]) == Sword)
- != (SYNTAX (d == end1 ? *string2 : *d) == Sword))
+
+ /* We need this opcode so we can detect where alternatives end
+ in `group_match_null_string_p' et al. */
+ case jump_past_alt:
+ DEBUG_PRINT1 ("EXECUTING jump_past_alt.\n");
+ goto unconditional_jump;
+
+
+ /* Normally, the on_failure_jump pushes a failure point, which
+ then gets popped at pop_failure_jump. We will end up at
+ pop_failure_jump, also, and with a pattern of, say, `a+', we
+ are skipping over the on_failure_jump, so we have to push
+ something meaningless for pop_failure_jump to pop. */
+ case dummy_failure_jump:
+ DEBUG_PRINT1 ("EXECUTING dummy_failure_jump.\n");
+ /* It doesn't matter what we push for the string here. What
+ the code at `fail' tests is the value for the pattern. */
+ PUSH_FAILURE_POINT (0, 0, -2);
+ goto unconditional_jump;
+
+
+ /* At the end of an alternative, we need to push a dummy failure
+ point in case we are followed by a `pop_failure_jump', because
+ we don't want the failure point for the alternative to be
+ popped. For example, matching `(a|ab)*' against `aab'
+ requires that we match the `ab' alternative. */
+ case push_dummy_failure:
+ DEBUG_PRINT1 ("EXECUTING push_dummy_failure.\n");
+ /* See comments just above at `dummy_failure_jump' about the
+ two zeroes. */
+ PUSH_FAILURE_POINT (0, 0, -2);
+ break;
+
+ /* Have to succeed matching what follows at least n times.
+ After that, handle like `on_failure_jump'. */
+ case succeed_n:
+ EXTRACT_NUMBER (mcnt, p + 2);
+ DEBUG_PRINT2 ("EXECUTING succeed_n %d.\n", mcnt);
+
+ assert (mcnt >= 0);
+ /* Originally, this is how many times we HAVE to succeed. */
+ if (mcnt > 0)
+ {
+ mcnt--;
+ p += 2;
+ STORE_NUMBER_AND_INCR (p, mcnt);
+ DEBUG_PRINT3 (" Setting 0x%x to %d.\n", p, mcnt);
+ }
+ else if (mcnt == 0)
+ {
+ DEBUG_PRINT2 (" Setting two bytes from 0x%x to no_op.\n", p+2);
+ p[2] = (unsigned char) no_op;
+ p[3] = (unsigned char) no_op;
+ goto on_failure;
+ }
+ break;
+
+ case jump_n:
+ EXTRACT_NUMBER (mcnt, p + 2);
+ DEBUG_PRINT2 ("EXECUTING jump_n %d.\n", mcnt);
+
+ /* Originally, this is how many times we CAN jump. */
+ if (mcnt)
+ {
+ mcnt--;
+ STORE_NUMBER (p + 2, mcnt);
+ goto unconditional_jump;
+ }
+ /* If don't have to jump any more, skip over the rest of command. */
+ else
+ p += 4;
+ break;
+
+ case set_number_at:
+ {
+ DEBUG_PRINT1 ("EXECUTING set_number_at.\n");
+
+ EXTRACT_NUMBER_AND_INCR (mcnt, p);
+ p1 = p + mcnt;
+ EXTRACT_NUMBER_AND_INCR (mcnt, p);
+ DEBUG_PRINT3 (" Setting 0x%x to %d.\n", p1, mcnt);
+ STORE_NUMBER (p1, mcnt);
+ break;
+ }
+
+ case wordbound:
+ DEBUG_PRINT1 ("EXECUTING wordbound.\n");
+ if (AT_WORD_BOUNDARY (d))
break;
- goto fail;
+ goto fail;
case notwordbound:
- if (d == string1 /* Points to first char */
- || d == end2 /* Points to end */
- || (d == end1 && size2 == 0)) /* Points to end */
- goto fail;
- if ((SYNTAX (d[-1]) == Sword)
- != (SYNTAX (d == end1 ? *string2 : *d) == Sword))
+ DEBUG_PRINT1 ("EXECUTING notwordbound.\n");
+ if (AT_WORD_BOUNDARY (d))
goto fail;
- break;
+ break;
case wordbeg:
- if (d == end2 /* Points to end */
- || (d == end1 && size2 == 0) /* Points to end */
- || SYNTAX (* (d == end1 ? string2 : d)) != Sword) /* Next char not a letter */
- goto fail;
- if (d == string1 /* Points to first char */
- || SYNTAX (d[-1]) != Sword) /* prev char not letter */
+ DEBUG_PRINT1 ("EXECUTING wordbeg.\n");
+ if (WORDCHAR_P (d) && (AT_STRINGS_BEG (d) || !WORDCHAR_P (d - 1)))
break;
- goto fail;
+ goto fail;
case wordend:
- if (d == string1 /* Points to first char */
- || SYNTAX (d[-1]) != Sword) /* prev char not letter */
- goto fail;
- if (d == end2 /* Points to end */
- || (d == end1 && size2 == 0) /* Points to end */
- || SYNTAX (d == end1 ? *string2 : *d) != Sword) /* Next char not a letter */
+ DEBUG_PRINT1 ("EXECUTING wordend.\n");
+ if (!AT_STRINGS_BEG (d) && WORDCHAR_P (d - 1)
+ && (!WORDCHAR_P (d) || AT_STRINGS_END (d)))
break;
- goto fail;
+ goto fail;
#ifdef emacs
- case before_dot:
- if (((d - string2 <= (unsigned) size2)
- ? d - bf_p2 : d - bf_p1)
- <= point)
- goto fail;
- break;
-
+#ifdef emacs19
+ case before_dot:
+ DEBUG_PRINT1 ("EXECUTING before_dot.\n");
+ if (PTR_CHAR_POS ((unsigned char *) d) >= point)
+ goto fail;
+ break;
+
+ case at_dot:
+ DEBUG_PRINT1 ("EXECUTING at_dot.\n");
+ if (PTR_CHAR_POS ((unsigned char *) d) != point)
+ goto fail;
+ break;
+
+ case after_dot:
+ DEBUG_PRINT1 ("EXECUTING after_dot.\n");
+ if (PTR_CHAR_POS ((unsigned char *) d) <= point)
+ goto fail;
+ break;
+#else /* not emacs19 */
case at_dot:
- if (((d - string2 <= (unsigned) size2)
- ? d - bf_p2 : d - bf_p1)
- == point)
+ DEBUG_PRINT1 ("EXECUTING at_dot.\n");
+ if (PTR_CHAR_POS ((unsigned char *) d) + 1 != point)
goto fail;
break;
-
- case after_dot:
- if (((d - string2 <= (unsigned) size2)
- ? d - bf_p2 : d - bf_p1)
- >= point)
- goto fail;
- break;
-
- case wordchar:
- mcnt = (int) Sword;
- goto matchsyntax;
+#endif /* not emacs19 */
case syntaxspec:
+ DEBUG_PRINT2 ("EXECUTING syntaxspec %d.\n", mcnt);
mcnt = *p++;
- matchsyntax:
- PREFETCH;
- if (SYNTAX (*d++) != (enum syntaxcode) mcnt) goto fail;
- break;
-
- case notwordchar:
+ goto matchsyntax;
+
+ case wordchar:
+ DEBUG_PRINT1 ("EXECUTING Emacs wordchar.\n");
mcnt = (int) Sword;
- goto matchnotsyntax;
+ matchsyntax:
+ PREFETCH ();
+ if (SYNTAX (*d++) != (enum syntaxcode) mcnt)
+ goto fail;
+ SET_REGS_MATCHED ();
+ break;
case notsyntaxspec:
+ DEBUG_PRINT2 ("EXECUTING notsyntaxspec %d.\n", mcnt);
mcnt = *p++;
- matchnotsyntax:
- PREFETCH;
- if (SYNTAX (*d++) == (enum syntaxcode) mcnt) goto fail;
- break;
-#else
+ goto matchnotsyntax;
+
+ case notwordchar:
+ DEBUG_PRINT1 ("EXECUTING Emacs notwordchar.\n");
+ mcnt = (int) Sword;
+ matchnotsyntax:
+ PREFETCH ();
+ if (SYNTAX (*d++) == (enum syntaxcode) mcnt)
+ goto fail;
+ SET_REGS_MATCHED ();
+ break;
+
+#else /* not emacs */
case wordchar:
- PREFETCH;
- if (SYNTAX (*d++) == 0) goto fail;
+ DEBUG_PRINT1 ("EXECUTING non-Emacs wordchar.\n");
+ PREFETCH ();
+ if (!WORDCHAR_P (d))
+ goto fail;
+ SET_REGS_MATCHED ();
+ d++;
break;
case notwordchar:
- PREFETCH;
- if (SYNTAX (*d++) != 0) goto fail;
+ DEBUG_PRINT1 ("EXECUTING non-Emacs notwordchar.\n");
+ PREFETCH ();
+ if (WORDCHAR_P (d))
+ goto fail;
+ SET_REGS_MATCHED ();
+ d++;
break;
#endif /* not emacs */
+
+ default:
+ abort ();
+ }
+ continue; /* Successfully executed one pattern command; keep going. */
- case begbuf:
- if (d == string1) /* Note, d cannot equal string2 */
- break; /* unless string1 == string2. */
- goto fail;
- case endbuf:
- if (d == end2 || (d == end1 && size2 == 0))
- break;
- goto fail;
+ /* We goto here if a matching operation fails. */
+ fail:
+ if (!FAIL_STACK_EMPTY ())
+ { /* A restart point is known. Restore to that state. */
+ DEBUG_PRINT1 ("\nFAIL:\n");
+ POP_FAILURE_POINT (d, p,
+ lowest_active_reg, highest_active_reg,
+ regstart, regend, reg_info);
+
+ /* If this failure point is a dummy, try the next one. */
+ if (!p)
+ goto fail;
- case exactn:
- /* Match the next few pattern characters exactly.
- mcnt is how many characters to match. */
- mcnt = *p++;
- if (translate)
- {
- do
- {
- PREFETCH;
- if (translate[*d++] != *p++) goto fail;
- }
- while (--mcnt);
- }
- else
+ /* If we failed to the end of the pattern, don't examine *p. */
+ assert (p <= pend);
+ if (p < pend)
+ {
+ boolean is_a_jump_n = false;
+
+ /* If failed to a backwards jump that's part of a repetition
+ loop, need to pop this failure point and use the next one. */
+ switch ((re_opcode_t) *p)
+ {
+ case jump_n:
+ is_a_jump_n = true;
+ case maybe_pop_jump:
+ case pop_failure_jump:
+ case jump:
+ p1 = p + 1;
+ EXTRACT_NUMBER_AND_INCR (mcnt, p1);
+ p1 += mcnt;
+
+ if ((is_a_jump_n && (re_opcode_t) *p1 == succeed_n)
+ || (!is_a_jump_n
+ && (re_opcode_t) *p1 == on_failure_jump))
+ goto fail;
+ break;
+ default:
+ /* do nothing */ ;
+ }
+ }
+
+ if (d >= string1 && d <= end1)
+ dend = end_match_1;
+ }
+ else
+ break; /* Matching at this starting point really fails. */
+ } /* for (;;) */
+
+ if (best_regs_set)
+ goto restore_best_regs;
+
+ FREE_VARIABLES ();
+
+ return -1; /* Failure to match. */
+} /* re_match_2 */
+
+/* Subroutine definitions for re_match_2. */
+
+
+/* We are passed P pointing to a register number after a start_memory.
+
+ Return true if the pattern up to the corresponding stop_memory can
+ match the empty string, and false otherwise.
+
+ If we find the matching stop_memory, sets P to point to one past its number.
+ Otherwise, sets P to an undefined byte less than or equal to END.
+
+ We don't handle duplicates properly (yet). */
+
+static boolean
+group_match_null_string_p (p, end, reg_info)
+ unsigned char **p, *end;
+ register_info_type *reg_info;
+{
+ int mcnt;
+ /* Point to after the args to the start_memory. */
+ unsigned char *p1 = *p + 2;
+
+ while (p1 < end)
+ {
+ /* Skip over opcodes that can match nothing, and return true or
+ false, as appropriate, when we get to one that can't, or to the
+ matching stop_memory. */
+
+ switch ((re_opcode_t) *p1)
+ {
+ /* Could be either a loop or a series of alternatives. */
+ case on_failure_jump:
+ p1++;
+ EXTRACT_NUMBER_AND_INCR (mcnt, p1);
+
+ /* If the next operation is not a jump backwards in the
+ pattern. */
+
+ if (mcnt >= 0)
{
- do
- {
- PREFETCH;
- if (*d++ != *p++) goto fail;
- }
- while (--mcnt);
- }
- break;
+ /* Go through the on_failure_jumps of the alternatives,
+ seeing if any of the alternatives cannot match nothing.
+ The last alternative starts with only a jump,
+ whereas the rest start with on_failure_jump and end
+ with a jump, e.g., here is the pattern for `a|b|c':
+
+ /on_failure_jump/0/6/exactn/1/a/jump_past_alt/0/6
+ /on_failure_jump/0/6/exactn/1/b/jump_past_alt/0/3
+ /exactn/1/c
+
+ So, we have to first go through the first (n-1)
+ alternatives and then deal with the last one separately. */
+
+
+ /* Deal with the first (n-1) alternatives, which start
+ with an on_failure_jump (see above) that jumps to right
+ past a jump_past_alt. */
+
+ while ((re_opcode_t) p1[mcnt-3] == jump_past_alt)
+ {
+ /* `mcnt' holds how many bytes long the alternative
+ is, including the ending `jump_past_alt' and
+ its number. */
+
+ if (!alt_match_null_string_p (p1, p1 + mcnt - 3,
+ reg_info))
+ return false;
+
+ /* Move to right after this alternative, including the
+ jump_past_alt. */
+ p1 += mcnt;
+
+ /* Break if it's the beginning of an n-th alternative
+ that doesn't begin with an on_failure_jump. */
+ if ((re_opcode_t) *p1 != on_failure_jump)
+ break;
+
+ /* Still have to check that it's not an n-th
+ alternative that starts with an on_failure_jump. */
+ p1++;
+ EXTRACT_NUMBER_AND_INCR (mcnt, p1);
+ if ((re_opcode_t) p1[mcnt-3] != jump_past_alt)
+ {
+ /* Get to the beginning of the n-th alternative. */
+ p1 -= 3;
+ break;
+ }
+ }
+
+ /* Deal with the last alternative: go back and get number
+ of the `jump_past_alt' just before it. `mcnt' contains
+ the length of the alternative. */
+ EXTRACT_NUMBER (mcnt, p1 - 2);
+
+ if (!alt_match_null_string_p (p1, p1 + mcnt, reg_info))
+ return false;
+
+ p1 += mcnt; /* Get past the n-th alternative. */
+ } /* if mcnt > 0 */
+ break;
+
+
+ case stop_memory:
+ assert (p1[1] == **p);
+ *p = p1 + 2;
+ return true;
+
+
+ default:
+ if (!common_op_match_null_string_p (&p1, end, reg_info))
+ return false;
+ }
+ } /* while p1 < end */
+
+ return false;
+} /* group_match_null_string_p */
+
+
+/* Similar to group_match_null_string_p, but doesn't deal with alternatives:
+ It expects P to be the first byte of a single alternative and END one
+ byte past the last. The alternative can contain groups. */
+
+static boolean
+alt_match_null_string_p (p, end, reg_info)
+ unsigned char *p, *end;
+ register_info_type *reg_info;
+{
+ int mcnt;
+ unsigned char *p1 = p;
+
+ while (p1 < end)
+ {
+ /* Skip over opcodes that can match nothing, and break when we get
+ to one that can't. */
+
+ switch ((re_opcode_t) *p1)
+ {
+ /* It's a loop. */
+ case on_failure_jump:
+ p1++;
+ EXTRACT_NUMBER_AND_INCR (mcnt, p1);
+ p1 += mcnt;
+ break;
+
+ default:
+ if (!common_op_match_null_string_p (&p1, end, reg_info))
+ return false;
+ }
+ } /* while p1 < end */
+
+ return true;
+} /* alt_match_null_string_p */
+
+
+/* Deals with the ops common to group_match_null_string_p and
+ alt_match_null_string_p.
+
+ Sets P to one after the op and its arguments, if any. */
+
+static boolean
+common_op_match_null_string_p (p, end, reg_info)
+ unsigned char **p, *end;
+ register_info_type *reg_info;
+{
+ int mcnt;
+ boolean ret;
+ int reg_no;
+ unsigned char *p1 = *p;
- default:
- break;
- }
- continue; /* Successfully matched one pattern command; keep matching */
+ switch ((re_opcode_t) *p1++)
+ {
+ case no_op:
+ case begline:
+ case endline:
+ case begbuf:
+ case endbuf:
+ case wordbeg:
+ case wordend:
+ case wordbound:
+ case notwordbound:
+#ifdef emacs
+ case before_dot:
+ case at_dot:
+ case after_dot:
+#endif
+ break;
- /* Jump here if any matching operation fails. */
- fail:
- if (stackp != stackb)
- /* A restart point is known. Restart there and pop it. */
- {
- if (!stackp[-2])
- { /* If innermost failure point is dormant, flush it and keep looking */
- stackp -= 2;
- goto fail;
- }
- d = *--stackp;
- p = *--stackp;
- if (d >= string1 && d <= end1)
- dend = end_match_1;
- }
- else break; /* Matching at this starting point really fails! */
- }
- return -1; /* Failure to match */
-}
+ case start_memory:
+ reg_no = *p1;
+ assert (reg_no > 0 && reg_no <= MAX_REGNUM);
+ ret = group_match_null_string_p (&p1, end, reg_info);
+
+ /* Have to set this here in case we're checking a group which
+ contains a group and a back reference to it. */
+
+ if (REG_MATCH_NULL_STRING_P (reg_info[reg_no]) == MATCH_NULL_UNSET_VALUE)
+ REG_MATCH_NULL_STRING_P (reg_info[reg_no]) = ret;
+
+ if (!ret)
+ return false;
+ break;
+
+ /* If this is an optimized succeed_n for zero times, make the jump. */
+ case jump:
+ EXTRACT_NUMBER_AND_INCR (mcnt, p1);
+ if (mcnt >= 0)
+ p1 += mcnt;
+ else
+ return false;
+ break;
+
+ case succeed_n:
+ /* Get to the number of times to succeed. */
+ p1 += 2;
+ EXTRACT_NUMBER_AND_INCR (mcnt, p1);
+
+ if (mcnt == 0)
+ {
+ p1 -= 4;
+ EXTRACT_NUMBER_AND_INCR (mcnt, p1);
+ p1 += mcnt;
+ }
+ else
+ return false;
+ break;
+
+ case duplicate:
+ if (!REG_MATCH_NULL_STRING_P (reg_info[*p1]))
+ return false;
+ break;
+
+ case set_number_at:
+ p1 += 4;
+ default:
+ /* All other opcodes mean we cannot match the empty string. */
+ return false;
+ }
+
+ *p = p1;
+ return true;
+} /* common_op_match_null_string_p */
+
+
+/* Return zero if TRANSLATE[S1] and TRANSLATE[S2] are identical for LEN
+ bytes; nonzero otherwise. */
+
static int
bcmp_translate (s1, s2, len, translate)
unsigned char *s1, *s2;
register int len;
- unsigned char *translate;
+ char *translate;
{
register unsigned char *p1 = s1, *p2 = s2;
while (len)
{
- if (translate [*p1++] != translate [*p2++]) return 1;
+ if (translate[*p1++] != translate[*p2++]) return 1;
len--;
}
return 0;
}
-/* Entry points compatible with bsd4.2 regex library */
+/* Entry points for GNU code. */
+
+/* re_compile_pattern is the GNU regular expression compiler: it
+ compiles PATTERN (of length SIZE) and puts the result in BUFP.
+ Returns 0 if the pattern was valid, otherwise an error string.
+
+ Assumes the `allocated' (and perhaps `buffer') and `translate' fields
+ are set in BUFP on entry.
+
+ We call regex_compile to do the actual compilation. */
+
+const char *
+re_compile_pattern (pattern, length, bufp)
+ const char *pattern;
+ int length;
+ struct re_pattern_buffer *bufp;
+{
+ reg_errcode_t ret;
+
+ /* GNU code is written to assume at least RE_NREGS registers will be set
+ (and at least one extra will be -1). */
+ bufp->regs_allocated = REGS_UNALLOCATED;
+
+ /* And GNU code determines whether or not to get register information
+ by passing null for the REGS argument to re_match, etc., not by
+ setting no_sub. */
+ bufp->no_sub = 0;
+
+ /* Match anchors at newline. */
+ bufp->newline_anchor = 1;
+
+ ret = regex_compile (pattern, length, re_syntax_options, bufp);
-#ifndef emacs
+ return re_error_msg[(int) ret];
+}
+
+/* Entry points compatible with 4.2 BSD regex library. We don't define
+ them if this is an Emacs or POSIX compilation. */
+
+#if !defined (emacs) && !defined (_POSIX_SOURCE)
+/* BSD has one and only one pattern buffer. */
static struct re_pattern_buffer re_comp_buf;
char *
re_comp (s)
- char *s;
+ const char *s;
{
+ reg_errcode_t ret;
+
if (!s)
{
if (!re_comp_buf.buffer)
@@ -1610,159 +4699,289 @@ re_comp (s)
if (!re_comp_buf.buffer)
{
- if (!(re_comp_buf.buffer = (char *) malloc (200)))
- return "Memory exhausted";
+ re_comp_buf.buffer = (unsigned char *) malloc (200);
+ if (re_comp_buf.buffer == NULL)
+ return "Memory exhausted";
re_comp_buf.allocated = 200;
- if (!(re_comp_buf.fastmap = (char *) malloc (1 << BYTEWIDTH)))
+
+ re_comp_buf.fastmap = (char *) malloc (1 << BYTEWIDTH);
+ if (re_comp_buf.fastmap == NULL)
return "Memory exhausted";
}
- return re_compile_pattern (s, strlen (s), &re_comp_buf);
+
+ /* Since `re_exec' always passes NULL for the `regs' argument, we
+ don't need to initialize the pattern buffer fields which affect it. */
+
+ /* Match anchors at newlines. */
+ re_comp_buf.newline_anchor = 1;
+
+ ret = regex_compile (s, strlen (s), re_syntax_options, &re_comp_buf);
+
+ /* Yes, we're discarding `const' here. */
+ return (char *) re_error_msg[(int) ret];
}
+
int
re_exec (s)
- char *s;
+ const char *s;
{
- int len = strlen (s);
- return 0 <= re_search (&re_comp_buf, s, len, 0, len, 0);
+ const int len = strlen (s);
+ return
+ 0 <= re_search (&re_comp_buf, s, len, 0, len, (struct re_registers *) 0);
}
-
-#endif /* emacs */
+#endif /* not emacs and not _POSIX_SOURCE */
-#ifdef test
+/* POSIX.2 functions. Don't define these for Emacs. */
-#include <stdio.h>
+#ifndef emacs
-/* Indexed by a character, gives the upper case equivalent of the character */
-
-static char upcase[0400] =
- { 000, 001, 002, 003, 004, 005, 006, 007,
- 010, 011, 012, 013, 014, 015, 016, 017,
- 020, 021, 022, 023, 024, 025, 026, 027,
- 030, 031, 032, 033, 034, 035, 036, 037,
- 040, 041, 042, 043, 044, 045, 046, 047,
- 050, 051, 052, 053, 054, 055, 056, 057,
- 060, 061, 062, 063, 064, 065, 066, 067,
- 070, 071, 072, 073, 074, 075, 076, 077,
- 0100, 0101, 0102, 0103, 0104, 0105, 0106, 0107,
- 0110, 0111, 0112, 0113, 0114, 0115, 0116, 0117,
- 0120, 0121, 0122, 0123, 0124, 0125, 0126, 0127,
- 0130, 0131, 0132, 0133, 0134, 0135, 0136, 0137,
- 0140, 0101, 0102, 0103, 0104, 0105, 0106, 0107,
- 0110, 0111, 0112, 0113, 0114, 0115, 0116, 0117,
- 0120, 0121, 0122, 0123, 0124, 0125, 0126, 0127,
- 0130, 0131, 0132, 0173, 0174, 0175, 0176, 0177,
- 0200, 0201, 0202, 0203, 0204, 0205, 0206, 0207,
- 0210, 0211, 0212, 0213, 0214, 0215, 0216, 0217,
- 0220, 0221, 0222, 0223, 0224, 0225, 0226, 0227,
- 0230, 0231, 0232, 0233, 0234, 0235, 0236, 0237,
- 0240, 0241, 0242, 0243, 0244, 0245, 0246, 0247,
- 0250, 0251, 0252, 0253, 0254, 0255, 0256, 0257,
- 0260, 0261, 0262, 0263, 0264, 0265, 0266, 0267,
- 0270, 0271, 0272, 0273, 0274, 0275, 0276, 0277,
- 0300, 0301, 0302, 0303, 0304, 0305, 0306, 0307,
- 0310, 0311, 0312, 0313, 0314, 0315, 0316, 0317,
- 0320, 0321, 0322, 0323, 0324, 0325, 0326, 0327,
- 0330, 0331, 0332, 0333, 0334, 0335, 0336, 0337,
- 0340, 0341, 0342, 0343, 0344, 0345, 0346, 0347,
- 0350, 0351, 0352, 0353, 0354, 0355, 0356, 0357,
- 0360, 0361, 0362, 0363, 0364, 0365, 0366, 0367,
- 0370, 0371, 0372, 0373, 0374, 0375, 0376, 0377
- };
+/* regcomp takes a regular expression as a string and compiles it.
-main (argc, argv)
- int argc;
- char **argv;
-{
- char pat[80];
- struct re_pattern_buffer buf;
- int i;
- char c;
- char fastmap[(1 << BYTEWIDTH)];
-
- /* Allow a command argument to specify the style of syntax. */
- if (argc > 1)
- obscure_syntax = atoi (argv[1]);
-
- buf.allocated = 40;
- buf.buffer = (char *) malloc (buf.allocated);
- buf.fastmap = fastmap;
- buf.translate = upcase;
-
- while (1)
- {
- gets (pat);
+ PREG is a regex_t *. We do not expect any fields to be initialized,
+ since POSIX says we shouldn't. Thus, we set
- if (*pat)
- {
- re_compile_pattern (pat, strlen(pat), &buf);
+ `buffer' to the compiled pattern;
+ `used' to the length of the compiled pattern;
+ `syntax' to RE_SYNTAX_POSIX_EXTENDED if the
+ REG_EXTENDED bit in CFLAGS is set; otherwise, to
+ RE_SYNTAX_POSIX_BASIC;
+ `newline_anchor' to REG_NEWLINE being set in CFLAGS;
+ `fastmap' and `fastmap_accurate' to zero;
+ `re_nsub' to the number of subexpressions in PATTERN.
- for (i = 0; i < buf.used; i++)
- printchar (buf.buffer[i]);
+ PATTERN is the address of the pattern string.
- putchar ('\n');
+ CFLAGS is a series of bits which affect compilation.
- printf ("%d allocated, %d used.\n", buf.allocated, buf.used);
+ If REG_EXTENDED is set, we use POSIX extended syntax; otherwise, we
+ use POSIX basic syntax.
- re_compile_fastmap (&buf);
- printf ("Allowed by fastmap: ");
- for (i = 0; i < (1 << BYTEWIDTH); i++)
- if (fastmap[i]) printchar (i);
- putchar ('\n');
- }
+ If REG_NEWLINE is set, then . and [^...] don't match newline.
+ Also, regexec will try a match beginning after every newline.
+
+ If REG_ICASE is set, then we considers upper- and lowercase
+ versions of letters to be equivalent when matching.
+
+ If REG_NOSUB is set, then when PREG is passed to regexec, that
+ routine will report only success or failure, and nothing about the
+ registers.
+
+ It returns 0 if it succeeds, nonzero if it doesn't. (See regex.h for
+ the return codes and their meanings.) */
- gets (pat); /* Now read the string to match against */
+int
+regcomp (preg, pattern, cflags)
+ regex_t *preg;
+ const char *pattern;
+ int cflags;
+{
+ reg_errcode_t ret;
+ unsigned syntax
+ = (cflags & REG_EXTENDED) ?
+ RE_SYNTAX_POSIX_EXTENDED : RE_SYNTAX_POSIX_BASIC;
+
+ /* regex_compile will allocate the space for the compiled pattern. */
+ preg->buffer = 0;
+ preg->allocated = 0;
+ preg->used = 0;
+
+ /* Don't bother to use a fastmap when searching. This simplifies the
+ REG_NEWLINE case: if we used a fastmap, we'd have to put all the
+ characters after newlines into the fastmap. This way, we just try
+ every character. */
+ preg->fastmap = 0;
+
+ if (cflags & REG_ICASE)
+ {
+ unsigned i;
+
+ preg->translate = (char *) malloc (CHAR_SET_SIZE);
+ if (preg->translate == NULL)
+ return (int) REG_ESPACE;
- i = re_match (&buf, pat, strlen (pat), 0, 0);
- printf ("Match value %d.\n", i);
+ /* Map uppercase characters to corresponding lowercase ones. */
+ for (i = 0; i < CHAR_SET_SIZE; i++)
+ preg->translate[i] = ISUPPER (i) ? tolower (i) : i;
}
+ else
+ preg->translate = NULL;
+
+ /* If REG_NEWLINE is set, newlines are treated differently. */
+ if (cflags & REG_NEWLINE)
+ { /* REG_NEWLINE implies neither . nor [^...] match newline. */
+ syntax &= ~RE_DOT_NEWLINE;
+ syntax |= RE_HAT_LISTS_NOT_NEWLINE;
+ /* It also changes the matching behavior. */
+ preg->newline_anchor = 1;
+ }
+ else
+ preg->newline_anchor = 0;
+
+ preg->no_sub = !!(cflags & REG_NOSUB);
+
+ /* POSIX says a null character in the pattern terminates it, so we
+ can use strlen here in compiling the pattern. */
+ ret = regex_compile (pattern, strlen (pattern), syntax, preg);
+
+ /* POSIX doesn't distinguish between an unmatched open-group and an
+ unmatched close-group: both are REG_EPAREN. */
+ if (ret == REG_ERPAREN) ret = REG_EPAREN;
+
+ return (int) ret;
}
-#ifdef NOTDEF
-print_buf (bufp)
- struct re_pattern_buffer *bufp;
+
+/* regexec searches for a given pattern, specified by PREG, in the
+ string STRING.
+
+ If NMATCH is zero or REG_NOSUB was set in the cflags argument to
+ `regcomp', we ignore PMATCH. Otherwise, we assume PMATCH has at
+ least NMATCH elements, and we set them to the offsets of the
+ corresponding matched substrings.
+
+ EFLAGS specifies `execution flags' which affect matching: if
+ REG_NOTBOL is set, then ^ does not match at the beginning of the
+ string; if REG_NOTEOL is set, then $ does not match at the end.
+
+ We return 0 if we find a match and REG_NOMATCH if not. */
+
+int
+regexec (preg, string, nmatch, pmatch, eflags)
+ const regex_t *preg;
+ const char *string;
+ size_t nmatch;
+ regmatch_t pmatch[];
+ int eflags;
{
- int i;
+ int ret;
+ struct re_registers regs;
+ regex_t private_preg;
+ int len = strlen (string);
+ boolean want_reg_info = !preg->no_sub && nmatch > 0;
- printf ("buf is :\n----------------\n");
- for (i = 0; i < bufp->used; i++)
- printchar (bufp->buffer[i]);
+ private_preg = *preg;
+
+ private_preg.not_bol = !!(eflags & REG_NOTBOL);
+ private_preg.not_eol = !!(eflags & REG_NOTEOL);
- printf ("\n%d allocated, %d used.\n", bufp->allocated, bufp->used);
+ /* The user has told us exactly how many registers to return
+ information about, via `nmatch'. We have to pass that on to the
+ matching routines. */
+ private_preg.regs_allocated = REGS_FIXED;
- printf ("Allowed by fastmap: ");
- for (i = 0; i < (1 << BYTEWIDTH); i++)
- if (bufp->fastmap[i])
- printchar (i);
- printf ("\nAllowed by translate: ");
- if (bufp->translate)
- for (i = 0; i < (1 << BYTEWIDTH); i++)
- if (bufp->translate[i])
- printchar (i);
- printf ("\nfastmap is%s accurate\n", bufp->fastmap_accurate ? "" : "n't");
- printf ("can %s be null\n----------", bufp->can_be_null ? "" : "not");
+ if (want_reg_info)
+ {
+ regs.num_regs = nmatch;
+ regs.start = TALLOC (nmatch, regoff_t);
+ regs.end = TALLOC (nmatch, regoff_t);
+ if (regs.start == NULL || regs.end == NULL)
+ return (int) REG_NOMATCH;
+ }
+
+ /* Perform the searching operation. */
+ ret = re_search (&private_preg, string, len,
+ /* start: */ 0, /* range: */ len,
+ want_reg_info ? &regs : (struct re_registers *) 0);
+
+ /* Copy the register information to the POSIX structure. */
+ if (want_reg_info)
+ {
+ if (ret >= 0)
+ {
+ unsigned r;
+
+ for (r = 0; r < nmatch; r++)
+ {
+ pmatch[r].rm_so = regs.start[r];
+ pmatch[r].rm_eo = regs.end[r];
+ }
+ }
+
+ /* If we needed the temporary register info, free the space now. */
+ free (regs.start);
+ free (regs.end);
+ }
+
+ /* We want zero return to mean success, unlike `re_search'. */
+ return ret >= 0 ? (int) REG_NOERROR : (int) REG_NOMATCH;
}
-#endif
-printchar (c)
- char c;
+
+/* Returns a message corresponding to an error code, ERRCODE, returned
+ from either regcomp or regexec. We don't use PREG here. */
+
+size_t
+regerror (errcode, preg, errbuf, errbuf_size)
+ int errcode;
+ const regex_t *preg;
+ char *errbuf;
+ size_t errbuf_size;
{
- if (c < 041 || c >= 0177)
+ const char *msg;
+ size_t msg_size;
+
+ if (errcode < 0
+ || errcode >= (sizeof (re_error_msg) / sizeof (re_error_msg[0])))
+ /* Only error codes returned by the rest of the code should be passed
+ to this routine. If we are given anything else, or if other regex
+ code generates an invalid error code, then the program has a bug.
+ Dump core so we can fix it. */
+ abort ();
+
+ msg = re_error_msg[errcode];
+
+ /* POSIX doesn't require that we do anything in this case, but why
+ not be nice. */
+ if (! msg)
+ msg = "Success";
+
+ msg_size = strlen (msg) + 1; /* Includes the null. */
+
+ if (errbuf_size != 0)
{
- putchar ('\\');
- putchar (((c >> 6) & 3) + '0');
- putchar (((c >> 3) & 7) + '0');
- putchar ((c & 7) + '0');
+ if (msg_size > errbuf_size)
+ {
+ strncpy (errbuf, msg, errbuf_size - 1);
+ errbuf[errbuf_size - 1] = 0;
+ }
+ else
+ strcpy (errbuf, msg);
}
- else
- putchar (c);
+
+ return msg_size;
}
-error (string)
- char *string;
+
+/* Free dynamically allocated space used by PREG. */
+
+void
+regfree (preg)
+ regex_t *preg;
{
- puts (string);
- exit (1);
+ if (preg->buffer != NULL)
+ free (preg->buffer);
+ preg->buffer = NULL;
+
+ preg->allocated = 0;
+ preg->used = 0;
+
+ if (preg->fastmap != NULL)
+ free (preg->fastmap);
+ preg->fastmap = NULL;
+ preg->fastmap_accurate = 0;
+
+ if (preg->translate != NULL)
+ free (preg->translate);
+ preg->translate = NULL;
}
-#endif /* test */
+#endif /* not emacs */
+
+/*
+Local variables:
+make-backup-files: t
+version-control: t
+trim-versions-without-asking: nil
+End:
+*/
diff --git a/gnu/usr.bin/grep/regex.h b/gnu/usr.bin/grep/regex.h
index a5e89d3..408dd21 100644
--- a/gnu/usr.bin/grep/regex.h
+++ b/gnu/usr.bin/grep/regex.h
@@ -1,5 +1,7 @@
-/* Definitions for data structures callers pass the regex library.
- Copyright (C) 1985, 1989 Free Software Foundation, Inc.
+/* Definitions for data structures and routines for the regular
+ expression library, version 0.12.
+
+ Copyright (C) 1985, 1989, 1990, 1991, 1992, 1993 Free Software Foundation, Inc.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -13,173 +15,476 @@
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
- Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
-
+ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */
- In other words, you are welcome to use, share and improve this program.
- You are forbidden to forbid anyone else to use, share and improve
- what you give them. Help stamp out software-hoarding! */
+#ifndef __REGEXP_LIBRARY_H__
+#define __REGEXP_LIBRARY_H__
+/* POSIX says that <sys/types.h> must be included (by the caller) before
+ <regex.h>. */
-/* Define number of parens for which we record the beginnings and ends.
- This affects how much space the `struct re_registers' type takes up. */
-#ifndef RE_NREGS
-#define RE_NREGS 10
+#ifdef VMS
+/* VMS doesn't have `size_t' in <sys/types.h>, even though POSIX says it
+ should be there. */
+#include <stddef.h>
#endif
-/* These bits are used in the obscure_syntax variable to choose among
- alternative regexp syntaxes. */
-
-/* 1 means plain parentheses serve as grouping, and backslash
- parentheses are needed for literal searching.
- 0 means backslash-parentheses are grouping, and plain parentheses
- are for literal searching. */
-#define RE_NO_BK_PARENS 1
-
-/* 1 means plain | serves as the "or"-operator, and \| is a literal.
- 0 means \| serves as the "or"-operator, and | is a literal. */
-#define RE_NO_BK_VBAR 2
-
-/* 0 means plain + or ? serves as an operator, and \+, \? are literals.
- 1 means \+, \? are operators and plain +, ? are literals. */
-#define RE_BK_PLUS_QM 4
-
-/* 1 means | binds tighter than ^ or $.
- 0 means the contrary. */
-#define RE_TIGHT_VBAR 8
-
-/* 1 means treat \n as an _OR operator
- 0 means treat it as a normal character */
-#define RE_NEWLINE_OR 16
-
-/* 0 means that a special characters (such as *, ^, and $) always have
- their special meaning regardless of the surrounding context.
- 1 means that special characters may act as normal characters in some
- contexts. Specifically, this applies to:
- ^ - only special at the beginning, or after ( or |
- $ - only special at the end, or before ) or |
- *, +, ? - only special when not after the beginning, (, or | */
-#define RE_CONTEXT_INDEP_OPS 32
-
-/* Now define combinations of bits for the standard possibilities. */
-#define RE_SYNTAX_AWK (RE_NO_BK_PARENS | RE_NO_BK_VBAR | RE_CONTEXT_INDEP_OPS)
-#define RE_SYNTAX_EGREP (RE_SYNTAX_AWK | RE_NEWLINE_OR)
-#define RE_SYNTAX_GREP (RE_BK_PLUS_QM | RE_NEWLINE_OR)
+
+/* The following bits are used to determine the regexp syntax we
+ recognize. The set/not-set meanings are chosen so that Emacs syntax
+ remains the value 0. The bits are given in alphabetical order, and
+ the definitions shifted by one from the previous bit; thus, when we
+ add or remove a bit, only one other definition need change. */
+typedef unsigned reg_syntax_t;
+
+/* If this bit is not set, then \ inside a bracket expression is literal.
+ If set, then such a \ quotes the following character. */
+#define RE_BACKSLASH_ESCAPE_IN_LISTS (1)
+
+/* If this bit is not set, then + and ? are operators, and \+ and \? are
+ literals.
+ If set, then \+ and \? are operators and + and ? are literals. */
+#define RE_BK_PLUS_QM (RE_BACKSLASH_ESCAPE_IN_LISTS << 1)
+
+/* If this bit is set, then character classes are supported. They are:
+ [:alpha:], [:upper:], [:lower:], [:digit:], [:alnum:], [:xdigit:],
+ [:space:], [:print:], [:punct:], [:graph:], and [:cntrl:].
+ If not set, then character classes are not supported. */
+#define RE_CHAR_CLASSES (RE_BK_PLUS_QM << 1)
+
+/* If this bit is set, then ^ and $ are always anchors (outside bracket
+ expressions, of course).
+ If this bit is not set, then it depends:
+ ^ is an anchor if it is at the beginning of a regular
+ expression or after an open-group or an alternation operator;
+ $ is an anchor if it is at the end of a regular expression, or
+ before a close-group or an alternation operator.
+
+ This bit could be (re)combined with RE_CONTEXT_INDEP_OPS, because
+ POSIX draft 11.2 says that * etc. in leading positions is undefined.
+ We already implemented a previous draft which made those constructs
+ invalid, though, so we haven't changed the code back. */
+#define RE_CONTEXT_INDEP_ANCHORS (RE_CHAR_CLASSES << 1)
+
+/* If this bit is set, then special characters are always special
+ regardless of where they are in the pattern.
+ If this bit is not set, then special characters are special only in
+ some contexts; otherwise they are ordinary. Specifically,
+ * + ? and intervals are only special when not after the beginning,
+ open-group, or alternation operator. */
+#define RE_CONTEXT_INDEP_OPS (RE_CONTEXT_INDEP_ANCHORS << 1)
+
+/* If this bit is set, then *, +, ?, and { cannot be first in an re or
+ immediately after an alternation or begin-group operator. */
+#define RE_CONTEXT_INVALID_OPS (RE_CONTEXT_INDEP_OPS << 1)
+
+/* If this bit is set, then . matches newline.
+ If not set, then it doesn't. */
+#define RE_DOT_NEWLINE (RE_CONTEXT_INVALID_OPS << 1)
+
+/* If this bit is set, then . doesn't match NUL.
+ If not set, then it does. */
+#define RE_DOT_NOT_NULL (RE_DOT_NEWLINE << 1)
+
+/* If this bit is set, nonmatching lists [^...] do not match newline.
+ If not set, they do. */
+#define RE_HAT_LISTS_NOT_NEWLINE (RE_DOT_NOT_NULL << 1)
+
+/* If this bit is set, either \{...\} or {...} defines an
+ interval, depending on RE_NO_BK_BRACES.
+ If not set, \{, \}, {, and } are literals. */
+#define RE_INTERVALS (RE_HAT_LISTS_NOT_NEWLINE << 1)
+
+/* If this bit is set, +, ? and | aren't recognized as operators.
+ If not set, they are. */
+#define RE_LIMITED_OPS (RE_INTERVALS << 1)
+
+/* If this bit is set, newline is an alternation operator.
+ If not set, newline is literal. */
+#define RE_NEWLINE_ALT (RE_LIMITED_OPS << 1)
+
+/* If this bit is set, then `{...}' defines an interval, and \{ and \}
+ are literals.
+ If not set, then `\{...\}' defines an interval. */
+#define RE_NO_BK_BRACES (RE_NEWLINE_ALT << 1)
+
+/* If this bit is set, (...) defines a group, and \( and \) are literals.
+ If not set, \(...\) defines a group, and ( and ) are literals. */
+#define RE_NO_BK_PARENS (RE_NO_BK_BRACES << 1)
+
+/* If this bit is set, then \<digit> matches <digit>.
+ If not set, then \<digit> is a back-reference. */
+#define RE_NO_BK_REFS (RE_NO_BK_PARENS << 1)
+
+/* If this bit is set, then | is an alternation operator, and \| is literal.
+ If not set, then \| is an alternation operator, and | is literal. */
+#define RE_NO_BK_VBAR (RE_NO_BK_REFS << 1)
+
+/* If this bit is set, then an ending range point collating higher
+ than the starting range point, as in [z-a], is invalid.
+ If not set, then when ending range point collates higher than the
+ starting range point, the range is ignored. */
+#define RE_NO_EMPTY_RANGES (RE_NO_BK_VBAR << 1)
+
+/* If this bit is set, then an unmatched ) is ordinary.
+ If not set, then an unmatched ) is invalid. */
+#define RE_UNMATCHED_RIGHT_PAREN_ORD (RE_NO_EMPTY_RANGES << 1)
+
+/* This global variable defines the particular regexp syntax to use (for
+ some interfaces). When a regexp is compiled, the syntax used is
+ stored in the pattern buffer, so changing this does not affect
+ already-compiled regexps. */
+extern reg_syntax_t re_syntax_options;
+
+/* Define combinations of the above bits for the standard possibilities.
+ (The [[[ comments delimit what gets put into the Texinfo file, so
+ don't delete them!) */
+/* [[[begin syntaxes]]] */
#define RE_SYNTAX_EMACS 0
-/* This data structure is used to represent a compiled pattern. */
+#define RE_SYNTAX_AWK \
+ (RE_BACKSLASH_ESCAPE_IN_LISTS | RE_DOT_NOT_NULL \
+ | RE_NO_BK_PARENS | RE_NO_BK_REFS \
+ | RE_NO_BK_VBAR | RE_NO_EMPTY_RANGES \
+ | RE_UNMATCHED_RIGHT_PAREN_ORD)
+
+#define RE_SYNTAX_POSIX_AWK \
+ (RE_SYNTAX_POSIX_EXTENDED | RE_BACKSLASH_ESCAPE_IN_LISTS)
+
+#define RE_SYNTAX_GREP \
+ (RE_BK_PLUS_QM | RE_CHAR_CLASSES \
+ | RE_HAT_LISTS_NOT_NEWLINE | RE_INTERVALS \
+ | RE_NEWLINE_ALT)
+
+#define RE_SYNTAX_EGREP \
+ (RE_CHAR_CLASSES | RE_CONTEXT_INDEP_ANCHORS \
+ | RE_CONTEXT_INDEP_OPS | RE_HAT_LISTS_NOT_NEWLINE \
+ | RE_NEWLINE_ALT | RE_NO_BK_PARENS \
+ | RE_NO_BK_VBAR)
+
+#define RE_SYNTAX_POSIX_EGREP \
+ (RE_SYNTAX_EGREP | RE_INTERVALS | RE_NO_BK_BRACES)
+
+/* P1003.2/D11.2, section 4.20.7.1, lines 5078ff. */
+#define RE_SYNTAX_ED RE_SYNTAX_POSIX_BASIC
+
+#define RE_SYNTAX_SED RE_SYNTAX_POSIX_BASIC
+
+/* Syntax bits common to both basic and extended POSIX regex syntax. */
+#define _RE_SYNTAX_POSIX_COMMON \
+ (RE_CHAR_CLASSES | RE_DOT_NEWLINE | RE_DOT_NOT_NULL \
+ | RE_INTERVALS | RE_NO_EMPTY_RANGES)
+
+#define RE_SYNTAX_POSIX_BASIC \
+ (_RE_SYNTAX_POSIX_COMMON | RE_BK_PLUS_QM)
+
+/* Differs from ..._POSIX_BASIC only in that RE_BK_PLUS_QM becomes
+ RE_LIMITED_OPS, i.e., \? \+ \| are not recognized. Actually, this
+ isn't minimal, since other operators, such as \`, aren't disabled. */
+#define RE_SYNTAX_POSIX_MINIMAL_BASIC \
+ (_RE_SYNTAX_POSIX_COMMON | RE_LIMITED_OPS)
+
+#define RE_SYNTAX_POSIX_EXTENDED \
+ (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \
+ | RE_CONTEXT_INDEP_OPS | RE_NO_BK_BRACES \
+ | RE_NO_BK_PARENS | RE_NO_BK_VBAR \
+ | RE_UNMATCHED_RIGHT_PAREN_ORD)
+
+/* Differs from ..._POSIX_EXTENDED in that RE_CONTEXT_INVALID_OPS
+ replaces RE_CONTEXT_INDEP_OPS and RE_NO_BK_REFS is added. */
+#define RE_SYNTAX_POSIX_MINIMAL_EXTENDED \
+ (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \
+ | RE_CONTEXT_INVALID_OPS | RE_NO_BK_BRACES \
+ | RE_NO_BK_PARENS | RE_NO_BK_REFS \
+ | RE_NO_BK_VBAR | RE_UNMATCHED_RIGHT_PAREN_ORD)
+/* [[[end syntaxes]]] */
+
+/* Maximum number of duplicates an interval can allow. Some systems
+ (erroneously) define this in other header files, but we want our
+ value, so remove any previous define. */
+#ifdef RE_DUP_MAX
+#undef RE_DUP_MAX
+#endif
+#define RE_DUP_MAX ((1 << 15) - 1)
+
+
+/* POSIX `cflags' bits (i.e., information for `regcomp'). */
+
+/* If this bit is set, then use extended regular expression syntax.
+ If not set, then use basic regular expression syntax. */
+#define REG_EXTENDED 1
+
+/* If this bit is set, then ignore case when matching.
+ If not set, then case is significant. */
+#define REG_ICASE (REG_EXTENDED << 1)
+
+/* If this bit is set, then anchors do not match at newline
+ characters in the string.
+ If not set, then anchors do match at newlines. */
+#define REG_NEWLINE (REG_ICASE << 1)
+
+/* If this bit is set, then report only success or fail in regexec.
+ If not set, then returns differ between not matching and errors. */
+#define REG_NOSUB (REG_NEWLINE << 1)
+
+
+/* POSIX `eflags' bits (i.e., information for regexec). */
+
+/* If this bit is set, then the beginning-of-line operator doesn't match
+ the beginning of the string (presumably because it's not the
+ beginning of a line).
+ If not set, then the beginning-of-line operator does match the
+ beginning of the string. */
+#define REG_NOTBOL 1
+
+/* Like REG_NOTBOL, except for the end-of-line. */
+#define REG_NOTEOL (1 << 1)
+
+
+/* If any error codes are removed, changed, or added, update the
+ `re_error_msg' table in regex.c. */
+typedef enum
+{
+ REG_NOERROR = 0, /* Success. */
+ REG_NOMATCH, /* Didn't find a match (for regexec). */
+
+ /* POSIX regcomp return error codes. (In the order listed in the
+ standard.) */
+ REG_BADPAT, /* Invalid pattern. */
+ REG_ECOLLATE, /* Not implemented. */
+ REG_ECTYPE, /* Invalid character class name. */
+ REG_EESCAPE, /* Trailing backslash. */
+ REG_ESUBREG, /* Invalid back reference. */
+ REG_EBRACK, /* Unmatched left bracket. */
+ REG_EPAREN, /* Parenthesis imbalance. */
+ REG_EBRACE, /* Unmatched \{. */
+ REG_BADBR, /* Invalid contents of \{\}. */
+ REG_ERANGE, /* Invalid range end. */
+ REG_ESPACE, /* Ran out of memory. */
+ REG_BADRPT, /* No preceding re for repetition op. */
+
+ /* Error codes we've added. */
+ REG_EEND, /* Premature end. */
+ REG_ESIZE, /* Compiled pattern bigger than 2^16 bytes. */
+ REG_ERPAREN /* Unmatched ) or \); not returned from regcomp. */
+} reg_errcode_t;
+
+/* This data structure represents a compiled pattern. Before calling
+ the pattern compiler, the fields `buffer', `allocated', `fastmap',
+ `translate', and `no_sub' can be set. After the pattern has been
+ compiled, the `re_nsub' field is available. All other fields are
+ private to the regex routines. */
struct re_pattern_buffer
- {
- char *buffer; /* Space holding the compiled pattern commands. */
- int allocated; /* Size of space that buffer points to */
- int used; /* Length of portion of buffer actually occupied */
- char *fastmap; /* Pointer to fastmap, if any, or zero if none. */
- /* re_search uses the fastmap, if there is one,
- to skip quickly over totally implausible characters */
- char *translate; /* Translate table to apply to all characters before comparing.
- Or zero for no translation.
- The translation is applied to a pattern when it is compiled
- and to data when it is matched. */
- char fastmap_accurate;
- /* Set to zero when a new pattern is stored,
- set to one when the fastmap is updated from it. */
- char can_be_null; /* Set to one by compiling fastmap
- if this pattern might match the null string.
- It does not necessarily match the null string
- in that case, but if this is zero, it cannot.
- 2 as value means can match null string
- but at end of range or before a character
- listed in the fastmap. */
- };
-
-/* Structure to store "register" contents data in.
-
- Pass the address of such a structure as an argument to re_match, etc.,
- if you want this information back.
-
- start[i] and end[i] record the string matched by \( ... \) grouping i,
- for i from 1 to RE_NREGS - 1.
- start[0] and end[0] record the entire string matched. */
+{
+/* [[[begin pattern_buffer]]] */
+ /* Space that holds the compiled pattern. It is declared as
+ `unsigned char *' because its elements are
+ sometimes used as array indexes. */
+ unsigned char *buffer;
-struct re_registers
- {
- int start[RE_NREGS];
- int end[RE_NREGS];
- };
-
-/* These are the command codes that appear in compiled regular expressions, one per byte.
- Some command codes are followed by argument bytes.
- A command code can specify any interpretation whatever for its arguments.
- Zero-bytes may appear in the compiled regular expression. */
-
-enum regexpcode
- {
- unused,
- exactn, /* followed by one byte giving n, and then by n literal bytes */
- begline, /* fails unless at beginning of line */
- endline, /* fails unless at end of line */
- jump, /* followed by two bytes giving relative address to jump to */
- on_failure_jump, /* followed by two bytes giving relative address of place
- to resume at in case of failure. */
- finalize_jump, /* Throw away latest failure point and then jump to address. */
- maybe_finalize_jump, /* Like jump but finalize if safe to do so.
- This is used to jump back to the beginning
- of a repeat. If the command that follows
- this jump is clearly incompatible with the
- one at the beginning of the repeat, such that
- we can be sure that there is no use backtracking
- out of repetitions already completed,
- then we finalize. */
- dummy_failure_jump, /* jump, and push a dummy failure point.
- This failure point will be thrown away
- if an attempt is made to use it for a failure.
- A + construct makes this before the first repeat. */
- anychar, /* matches any one character */
- charset, /* matches any one char belonging to specified set.
- First following byte is # bitmap bytes.
- Then come bytes for a bit-map saying which chars are in.
- Bits in each byte are ordered low-bit-first.
- A character is in the set if its bit is 1.
- A character too large to have a bit in the map
- is automatically not in the set */
- charset_not, /* similar but match any character that is NOT one of those specified */
- start_memory, /* starts remembering the text that is matched
- and stores it in a memory register.
- followed by one byte containing the register number.
- Register numbers must be in the range 0 through NREGS. */
- stop_memory, /* stops remembering the text that is matched
- and stores it in a memory register.
- followed by one byte containing the register number.
- Register numbers must be in the range 0 through NREGS. */
- duplicate, /* match a duplicate of something remembered.
- Followed by one byte containing the index of the memory register. */
- before_dot, /* Succeeds if before dot */
- at_dot, /* Succeeds if at dot */
- after_dot, /* Succeeds if after dot */
- begbuf, /* Succeeds if at beginning of buffer */
- endbuf, /* Succeeds if at end of buffer */
- wordchar, /* Matches any word-constituent character */
- notwordchar, /* Matches any char that is not a word-constituent */
- wordbeg, /* Succeeds if at word beginning */
- wordend, /* Succeeds if at word end */
- wordbound, /* Succeeds if at a word boundary */
- notwordbound, /* Succeeds if not at a word boundary */
- syntaxspec, /* Matches any character whose syntax is specified.
- followed by a byte which contains a syntax code, Sword or such like */
- notsyntaxspec /* Matches any character whose syntax differs from the specified. */
- };
+ /* Number of bytes to which `buffer' points. */
+ unsigned long allocated;
+
+ /* Number of bytes actually used in `buffer'. */
+ unsigned long used;
+
+ /* Syntax setting with which the pattern was compiled. */
+ reg_syntax_t syntax;
+
+ /* Pointer to a fastmap, if any, otherwise zero. re_search uses
+ the fastmap, if there is one, to skip over impossible
+ starting points for matches. */
+ char *fastmap;
+
+ /* Either a translate table to apply to all characters before
+ comparing them, or zero for no translation. The translation
+ is applied to a pattern when it is compiled and to a string
+ when it is matched. */
+ char *translate;
+
+ /* Number of subexpressions found by the compiler. */
+ size_t re_nsub;
+
+ /* Zero if this pattern cannot match the empty string, one else.
+ Well, in truth it's used only in `re_search_2', to see
+ whether or not we should use the fastmap, so we don't set
+ this absolutely perfectly; see `re_compile_fastmap' (the
+ `duplicate' case). */
+ unsigned can_be_null : 1;
+
+ /* If REGS_UNALLOCATED, allocate space in the `regs' structure
+ for `max (RE_NREGS, re_nsub + 1)' groups.
+ If REGS_REALLOCATE, reallocate space if necessary.
+ If REGS_FIXED, use what's there. */
+#define REGS_UNALLOCATED 0
+#define REGS_REALLOCATE 1
+#define REGS_FIXED 2
+ unsigned regs_allocated : 2;
+
+ /* Set to zero when `regex_compile' compiles a pattern; set to one
+ by `re_compile_fastmap' if it updates the fastmap. */
+ unsigned fastmap_accurate : 1;
+
+ /* If set, `re_match_2' does not return information about
+ subexpressions. */
+ unsigned no_sub : 1;
+
+ /* If set, a beginning-of-line anchor doesn't match at the
+ beginning of the string. */
+ unsigned not_bol : 1;
+
+ /* Similarly for an end-of-line anchor. */
+ unsigned not_eol : 1;
+
+ /* If true, an anchor at a newline matches. */
+ unsigned newline_anchor : 1;
+
+/* [[[end pattern_buffer]]] */
+};
+
+typedef struct re_pattern_buffer regex_t;
+
+
+/* search.c (search_buffer) in Emacs needs this one opcode value. It is
+ defined both in `regex.c' and here. */
+#define RE_EXACTN_VALUE 1
-extern char *re_compile_pattern ();
-/* Is this really advertised? */
-extern void re_compile_fastmap ();
-extern int re_search (), re_search_2 ();
-extern int re_match (), re_match_2 ();
-
-/* 4.2 bsd compatibility (yuck) */
-extern char *re_comp ();
-extern int re_exec ();
-
-#ifdef SYNTAX_TABLE
-extern char *re_syntax_table;
+/* Type for byte offsets within the string. POSIX mandates this. */
+typedef int regoff_t;
+
+
+/* This is the structure we store register match data in. See
+ regex.texinfo for a full description of what registers match. */
+struct re_registers
+{
+ unsigned num_regs;
+ regoff_t *start;
+ regoff_t *end;
+};
+
+
+/* If `regs_allocated' is REGS_UNALLOCATED in the pattern buffer,
+ `re_match_2' returns information about at least this many registers
+ the first time a `regs' structure is passed. */
+#ifndef RE_NREGS
+#define RE_NREGS 30
#endif
+
+
+/* POSIX specification for registers. Aside from the different names than
+ `re_registers', POSIX uses an array of structures, instead of a
+ structure of arrays. */
+typedef struct
+{
+ regoff_t rm_so; /* Byte offset from string's start to substring's start. */
+ regoff_t rm_eo; /* Byte offset from string's start to substring's end. */
+} regmatch_t;
+
+/* Declarations for routines. */
+
+/* To avoid duplicating every routine declaration -- once with a
+ prototype (if we are ANSI), and once without (if we aren't) -- we
+ use the following macro to declare argument types. This
+ unfortunately clutters up the declarations a bit, but I think it's
+ worth it. */
+
+#if __STDC__
+
+#define _RE_ARGS(args) args
+
+#else /* not __STDC__ */
+
+#define _RE_ARGS(args) ()
+
+#endif /* not __STDC__ */
+
+/* Sets the current default syntax to SYNTAX, and return the old syntax.
+ You can also simply assign to the `re_syntax_options' variable. */
+extern reg_syntax_t re_set_syntax _RE_ARGS ((reg_syntax_t syntax));
+
+/* Compile the regular expression PATTERN, with length LENGTH
+ and syntax given by the global `re_syntax_options', into the buffer
+ BUFFER. Return NULL if successful, and an error string if not. */
+extern const char *re_compile_pattern
+ _RE_ARGS ((const char *pattern, int length,
+ struct re_pattern_buffer *buffer));
+
+
+/* Compile a fastmap for the compiled pattern in BUFFER; used to
+ accelerate searches. Return 0 if successful and -2 if was an
+ internal error. */
+extern int re_compile_fastmap _RE_ARGS ((struct re_pattern_buffer *buffer));
+
+
+/* Search in the string STRING (with length LENGTH) for the pattern
+ compiled into BUFFER. Start searching at position START, for RANGE
+ characters. Return the starting position of the match, -1 for no
+ match, or -2 for an internal error. Also return register
+ information in REGS (if REGS and BUFFER->no_sub are nonzero). */
+extern int re_search
+ _RE_ARGS ((struct re_pattern_buffer *buffer, const char *string,
+ int length, int start, int range, struct re_registers *regs));
+
+
+/* Like `re_search', but search in the concatenation of STRING1 and
+ STRING2. Also, stop searching at index START + STOP. */
+extern int re_search_2
+ _RE_ARGS ((struct re_pattern_buffer *buffer, const char *string1,
+ int length1, const char *string2, int length2,
+ int start, int range, struct re_registers *regs, int stop));
+
+
+/* Like `re_search', but return how many characters in STRING the regexp
+ in BUFFER matched, starting at position START. */
+extern int re_match
+ _RE_ARGS ((struct re_pattern_buffer *buffer, const char *string,
+ int length, int start, struct re_registers *regs));
+
+
+/* Relates to `re_match' as `re_search_2' relates to `re_search'. */
+extern int re_match_2
+ _RE_ARGS ((struct re_pattern_buffer *buffer, const char *string1,
+ int length1, const char *string2, int length2,
+ int start, struct re_registers *regs, int stop));
+
+
+/* Set REGS to hold NUM_REGS registers, storing them in STARTS and
+ ENDS. Subsequent matches using BUFFER and REGS will use this memory
+ for recording register information. STARTS and ENDS must be
+ allocated with malloc, and must each be at least `NUM_REGS * sizeof
+ (regoff_t)' bytes long.
+
+ If NUM_REGS == 0, then subsequent matches should allocate their own
+ register data.
+
+ Unless this function is called, the first search or match using
+ PATTERN_BUFFER will allocate its own register data, without
+ freeing the old data. */
+extern void re_set_registers
+ _RE_ARGS ((struct re_pattern_buffer *buffer, struct re_registers *regs,
+ unsigned num_regs, regoff_t *starts, regoff_t *ends));
+
+/* 4.2 bsd compatibility. */
+extern char *re_comp _RE_ARGS ((const char *));
+extern int re_exec _RE_ARGS ((const char *));
+
+/* POSIX compatibility. */
+extern int regcomp _RE_ARGS ((regex_t *preg, const char *pattern, int cflags));
+extern int regexec
+ _RE_ARGS ((const regex_t *preg, const char *string, size_t nmatch,
+ regmatch_t pmatch[], int eflags));
+extern size_t regerror
+ _RE_ARGS ((int errcode, const regex_t *preg, char *errbuf,
+ size_t errbuf_size));
+extern void regfree _RE_ARGS ((regex_t *preg));
+
+#endif /* not __REGEXP_LIBRARY_H__ */
+
+/*
+Local variables:
+make-backup-files: t
+version-control: t
+trim-versions-without-asking: nil
+End:
+*/
diff --git a/gnu/usr.bin/grep/search.c b/gnu/usr.bin/grep/search.c
new file mode 100644
index 0000000..d2be489
--- /dev/null
+++ b/gnu/usr.bin/grep/search.c
@@ -0,0 +1,481 @@
+/* search.c - searching subroutines using dfa, kwset and regex for grep.
+ Copyright (C) 1992 Free Software Foundation, Inc.
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2, or (at your option)
+ any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+ Written August 1992 by Mike Haertel. */
+
+#include <ctype.h>
+
+#ifdef STDC_HEADERS
+#include <limits.h>
+#include <stdlib.h>
+#else
+#define UCHAR_MAX 255
+#include <sys/types.h>
+extern char *malloc();
+#endif
+
+#ifdef HAVE_MEMCHR
+#include <string.h>
+#ifdef NEED_MEMORY_H
+#include <memory.h>
+#endif
+#else
+#ifdef __STDC__
+extern void *memchr();
+#else
+extern char *memchr();
+#endif
+#endif
+
+#if defined(HAVE_STRING_H) || defined(STDC_HEADERS)
+#undef bcopy
+#define bcopy(s, d, n) memcpy((d), (s), (n))
+#endif
+
+#ifdef isascii
+#define ISALNUM(C) (isascii(C) && isalnum(C))
+#define ISUPPER(C) (isascii(C) && isupper(C))
+#else
+#define ISALNUM(C) isalnum(C)
+#define ISUPPER(C) isupper(C)
+#endif
+
+#define TOLOWER(C) (ISUPPER(C) ? tolower(C) : (C))
+
+#include "grep.h"
+#include "dfa.h"
+#include "kwset.h"
+#include "regex.h"
+
+#define NCHAR (UCHAR_MAX + 1)
+
+#if __STDC__
+static void Gcompile(char *, size_t);
+static void Ecompile(char *, size_t);
+static char *EGexecute(char *, size_t, char **);
+static void Fcompile(char *, size_t);
+static char *Fexecute(char *, size_t, char **);
+#else
+static void Gcompile();
+static void Ecompile();
+static char *EGexecute();
+static void Fcompile();
+static char *Fexecute();
+#endif
+
+/* Here is the matchers vector for the main program. */
+struct matcher matchers[] = {
+ { "default", Gcompile, EGexecute },
+ { "grep", Gcompile, EGexecute },
+ { "ggrep", Gcompile, EGexecute },
+ { "egrep", Ecompile, EGexecute },
+ { "posix-egrep", Ecompile, EGexecute },
+ { "gegrep", Ecompile, EGexecute },
+ { "fgrep", Fcompile, Fexecute },
+ { "gfgrep", Fcompile, Fexecute },
+ { 0, 0, 0 },
+};
+
+/* For -w, we also consider _ to be word constituent. */
+#define WCHAR(C) (ISALNUM(C) || (C) == '_')
+
+/* DFA compiled regexp. */
+static struct dfa dfa;
+
+/* Regex compiled regexp. */
+static struct re_pattern_buffer regex;
+
+/* KWset compiled pattern. For Ecompile and Gcompile, we compile
+ a list of strings, at least one of which is known to occur in
+ any string matching the regexp. */
+static kwset_t kwset;
+
+/* Last compiled fixed string known to exactly match the regexp.
+ If kwsexec() returns < lastexact, then we don't need to
+ call the regexp matcher at all. */
+static int lastexact;
+
+void
+dfaerror(mesg)
+ char *mesg;
+{
+ fatal(mesg, 0);
+}
+
+static void
+kwsinit()
+{
+ static char trans[NCHAR];
+ int i;
+
+ if (match_icase)
+ for (i = 0; i < NCHAR; ++i)
+ trans[i] = TOLOWER(i);
+
+ if (!(kwset = kwsalloc(match_icase ? trans : (char *) 0)))
+ fatal("memory exhausted", 0);
+}
+
+/* If the DFA turns out to have some set of fixed strings one of
+ which must occur in the match, then we build a kwset matcher
+ to find those strings, and thus quickly filter out impossible
+ matches. */
+static void
+kwsmusts()
+{
+ struct dfamust *dm;
+ char *err;
+
+ if (dfa.musts)
+ {
+ kwsinit();
+ /* First, we compile in the substrings known to be exact
+ matches. The kwset matcher will return the index
+ of the matching string that it chooses. */
+ for (dm = dfa.musts; dm; dm = dm->next)
+ {
+ if (!dm->exact)
+ continue;
+ ++lastexact;
+ if ((err = kwsincr(kwset, dm->must, strlen(dm->must))) != 0)
+ fatal(err, 0);
+ }
+ /* Now, we compile the substrings that will require
+ the use of the regexp matcher. */
+ for (dm = dfa.musts; dm; dm = dm->next)
+ {
+ if (dm->exact)
+ continue;
+ if ((err = kwsincr(kwset, dm->must, strlen(dm->must))) != 0)
+ fatal(err, 0);
+ }
+ if ((err = kwsprep(kwset)) != 0)
+ fatal(err, 0);
+ }
+}
+
+static void
+Gcompile(pattern, size)
+ char *pattern;
+ size_t size;
+{
+#ifdef __STDC__
+ const
+#endif
+ char *err;
+
+ re_set_syntax(RE_SYNTAX_GREP | RE_HAT_LISTS_NOT_NEWLINE);
+ dfasyntax(RE_SYNTAX_GREP | RE_HAT_LISTS_NOT_NEWLINE, match_icase);
+
+ if ((err = re_compile_pattern(pattern, size, &regex)) != 0)
+ fatal(err, 0);
+
+ dfainit(&dfa);
+
+ /* In the match_words and match_lines cases, we use a different pattern
+ for the DFA matcher that will quickly throw out cases that won't work.
+ Then if DFA succeeds we do some hairy stuff using the regex matcher
+ to decide whether the match should really count. */
+ if (match_words || match_lines)
+ {
+ /* In the whole-word case, we use the pattern:
+ (^|[^A-Za-z_])(userpattern)([^A-Za-z_]|$).
+ In the whole-line case, we use the pattern:
+ ^(userpattern)$.
+ BUG: Using [A-Za-z_] is locale-dependent! */
+
+ char *n = malloc(size + 50);
+ int i = 0;
+
+ strcpy(n, "");
+
+ if (match_lines)
+ strcpy(n, "^\\(");
+ if (match_words)
+ strcpy(n, "\\(^\\|[^0-9A-Za-z_]\\)\\(");
+
+ i = strlen(n);
+ bcopy(pattern, n + i, size);
+ i += size;
+
+ if (match_words)
+ strcpy(n + i, "\\)\\([^0-9A-Za-z_]\\|$\\)");
+ if (match_lines)
+ strcpy(n + i, "\\)$");
+
+ i += strlen(n + i);
+ dfacomp(n, i, &dfa, 1);
+ }
+ else
+ dfacomp(pattern, size, &dfa, 1);
+
+ kwsmusts();
+}
+
+static void
+Ecompile(pattern, size)
+ char *pattern;
+ size_t size;
+{
+#ifdef __STDC__
+ const
+#endif
+ char *err;
+
+ if (strcmp(matcher, "posix-egrep") == 0)
+ {
+ re_set_syntax(RE_SYNTAX_POSIX_EGREP);
+ dfasyntax(RE_SYNTAX_POSIX_EGREP, match_icase);
+ }
+ else
+ {
+ re_set_syntax(RE_SYNTAX_EGREP);
+ dfasyntax(RE_SYNTAX_EGREP, match_icase);
+ }
+
+ if ((err = re_compile_pattern(pattern, size, &regex)) != 0)
+ fatal(err, 0);
+
+ dfainit(&dfa);
+
+ /* In the match_words and match_lines cases, we use a different pattern
+ for the DFA matcher that will quickly throw out cases that won't work.
+ Then if DFA succeeds we do some hairy stuff using the regex matcher
+ to decide whether the match should really count. */
+ if (match_words || match_lines)
+ {
+ /* In the whole-word case, we use the pattern:
+ (^|[^A-Za-z_])(userpattern)([^A-Za-z_]|$).
+ In the whole-line case, we use the pattern:
+ ^(userpattern)$.
+ BUG: Using [A-Za-z_] is locale-dependent! */
+
+ char *n = malloc(size + 50);
+ int i = 0;
+
+ strcpy(n, "");
+
+ if (match_lines)
+ strcpy(n, "^(");
+ if (match_words)
+ strcpy(n, "(^|[^0-9A-Za-z_])(");
+
+ i = strlen(n);
+ bcopy(pattern, n + i, size);
+ i += size;
+
+ if (match_words)
+ strcpy(n + i, ")([^0-9A-Za-z_]|$)");
+ if (match_lines)
+ strcpy(n + i, ")$");
+
+ i += strlen(n + i);
+ dfacomp(n, i, &dfa, 1);
+ }
+ else
+ dfacomp(pattern, size, &dfa, 1);
+
+ kwsmusts();
+}
+
+static char *
+EGexecute(buf, size, endp)
+ char *buf;
+ size_t size;
+ char **endp;
+{
+ register char *buflim, *beg, *end, save;
+ int backref, start, len;
+ struct kwsmatch kwsm;
+ static struct re_registers regs; /* This is static on account of a BRAIN-DEAD
+ Q@#%!# library interface in regex.c. */
+
+ buflim = buf + size;
+
+ for (beg = end = buf; end < buflim; beg = end + 1)
+ {
+ if (kwset)
+ {
+ /* Find a possible match using the KWset matcher. */
+ beg = kwsexec(kwset, beg, buflim - beg, &kwsm);
+ if (!beg)
+ goto failure;
+ /* Narrow down to the line containing the candidate, and
+ run it through DFA. */
+ end = memchr(beg, '\n', buflim - beg);
+ if (!end)
+ end = buflim;
+ while (beg > buf && beg[-1] != '\n')
+ --beg;
+ save = *end;
+ if (kwsm.index < lastexact)
+ goto success;
+ if (!dfaexec(&dfa, beg, end, 0, (int *) 0, &backref))
+ {
+ *end = save;
+ continue;
+ }
+ *end = save;
+ /* Successful, no backreferences encountered. */
+ if (!backref)
+ goto success;
+ }
+ else
+ {
+ /* No good fixed strings; start with DFA. */
+ save = *buflim;
+ beg = dfaexec(&dfa, beg, buflim, 0, (int *) 0, &backref);
+ *buflim = save;
+ if (!beg)
+ goto failure;
+ /* Narrow down to the line we've found. */
+ end = memchr(beg, '\n', buflim - beg);
+ if (!end)
+ end = buflim;
+ while (beg > buf && beg[-1] != '\n')
+ --beg;
+ /* Successful, no backreferences encountered! */
+ if (!backref)
+ goto success;
+ }
+ /* If we've made it to this point, this means DFA has seen
+ a probable match, and we need to run it through Regex. */
+ regex.not_eol = 0;
+ if ((start = re_search(&regex, beg, end - beg, 0, end - beg, &regs)) >= 0)
+ {
+ len = regs.end[0] - start;
+ if (!match_lines && !match_words || match_lines && len == end - beg)
+ goto success;
+ /* If -w, check if the match aligns with word boundaries.
+ We do this iteratively because:
+ (a) the line may contain more than one occurence of the pattern, and
+ (b) Several alternatives in the pattern might be valid at a given
+ point, and we may need to consider a shorter one to find a word
+ boundary. */
+ if (match_words)
+ while (start >= 0)
+ {
+ if ((start == 0 || !WCHAR(beg[start - 1]))
+ && (len == end - beg || !WCHAR(beg[start + len])))
+ goto success;
+ if (len > 0)
+ {
+ /* Try a shorter length anchored at the same place. */
+ --len;
+ regex.not_eol = 1;
+ len = re_match(&regex, beg, start + len, start, &regs);
+ }
+ if (len <= 0)
+ {
+ /* Try looking further on. */
+ if (start == end - beg)
+ break;
+ ++start;
+ regex.not_eol = 0;
+ start = re_search(&regex, beg, end - beg,
+ start, end - beg - start, &regs);
+ len = regs.end[0] - start;
+ }
+ }
+ }
+ }
+
+ failure:
+ return 0;
+
+ success:
+ *endp = end < buflim ? end + 1 : end;
+ return beg;
+}
+
+static void
+Fcompile(pattern, size)
+ char *pattern;
+ size_t size;
+{
+ char *beg, *lim, *err;
+
+ kwsinit();
+ beg = pattern;
+ do
+ {
+ for (lim = beg; lim < pattern + size && *lim != '\n'; ++lim)
+ ;
+ if ((err = kwsincr(kwset, beg, lim - beg)) != 0)
+ fatal(err, 0);
+ if (lim < pattern + size)
+ ++lim;
+ beg = lim;
+ }
+ while (beg < pattern + size);
+
+ if ((err = kwsprep(kwset)) != 0)
+ fatal(err, 0);
+}
+
+static char *
+Fexecute(buf, size, endp)
+ char *buf;
+ size_t size;
+ char **endp;
+{
+ register char *beg, *try, *end;
+ register size_t len;
+ struct kwsmatch kwsmatch;
+
+ for (beg = buf; beg <= buf + size; ++beg)
+ {
+ if (!(beg = kwsexec(kwset, beg, buf + size - beg, &kwsmatch)))
+ return 0;
+ len = kwsmatch.size[0];
+ if (match_lines)
+ {
+ if (beg > buf && beg[-1] != '\n')
+ continue;
+ if (beg + len < buf + size && beg[len] != '\n')
+ continue;
+ goto success;
+ }
+ else if (match_words)
+ for (try = beg; len && try;)
+ {
+ if (try > buf && WCHAR((unsigned char) try[-1]))
+ break;
+ if (try + len < buf + size && WCHAR((unsigned char) try[len]))
+ {
+ try = kwsexec(kwset, beg, --len, &kwsmatch);
+ len = kwsmatch.size[0];
+ }
+ else
+ goto success;
+ }
+ else
+ goto success;
+ }
+
+ return 0;
+
+ success:
+ if ((end = memchr(beg + len, '\n', (buf + size) - (beg + len))) != 0)
+ ++end;
+ else
+ end = buf + size;
+ *endp = end;
+ while (beg > buf && beg[-1] != '\n')
+ --beg;
+ return beg;
+}
diff --git a/gnu/usr.bin/grep/tests/check.sh b/gnu/usr.bin/grep/tests/check.sh
new file mode 100644
index 0000000..d2c8fdb
--- /dev/null
+++ b/gnu/usr.bin/grep/tests/check.sh
@@ -0,0 +1,24 @@
+#! /bin/sh
+# Regression test for GNU grep.
+# Usage: regress.sh [testdir]
+
+testdir=${1-tests}
+
+failures=0
+
+# The Khadafy test is brought to you by Scott Anderson . . .
+./grep -E -f $testdir/khadafy.regexp $testdir/khadafy.lines > khadafy.out
+if cmp $testdir/khadafy.lines khadafy.out
+then
+ :
+else
+ echo Khadafy test failed -- output left on khadafy.out
+ failures=1
+fi
+
+# . . . and the following by Henry Spencer.
+
+${AWK-awk} -F: -f $testdir/scriptgen.awk $testdir/spencer.tests > tmp.script
+
+sh tmp.script && exit $failures
+exit 1
diff --git a/gnu/usr.bin/grep/tests/scriptgen.awk b/gnu/usr.bin/grep/tests/scriptgen.awk
index cdadc91..44ef4df 100644
--- a/gnu/usr.bin/grep/tests/scriptgen.awk
+++ b/gnu/usr.bin/grep/tests/scriptgen.awk
@@ -1,6 +1,6 @@
BEGIN { print "failures=0"; }
-!/^#/ && NF == 3 {
- print "echo '" $3 "' | $1/egrep -e '" $2 "' > /dev/null 2>&1";
+$0 !~ /^#/ && NF == 3 {
+ print "echo '" $3 "' | ./grep -E -e '" $2 "' > /dev/null 2>&1";
print "if [ $? != " $1 " ]"
print "then"
printf "\techo Spencer test \\#%d failed\n", ++n
diff --git a/gnu/usr.bin/grep/tests/spencer.tests b/gnu/usr.bin/grep/tests/spencer.tests
index c7eee9b..913f198 100644
--- a/gnu/usr.bin/grep/tests/spencer.tests
+++ b/gnu/usr.bin/grep/tests/spencer.tests
@@ -33,7 +33,7 @@
0:a[b-d]e:ace
0:a[b-d]:aac
0:a[-b]:a-
-2:a[b-]:a-
+0:a[b-]:a-
1:a[b-a]:-
2:a[]b:-
2:a[:-
OpenPOWER on IntegriCloud