Revert "awk: Merge upstream 2nd Edition Awk Book"

The pre-push testing I did turned out to be testing the old version with
the old testsuite (for reasons I don't understand). There are issues with
the new version, the new test in the suite, or (likely) both. Revert
until they can be chased down.

This should also fix the GitHub CI that has been red since this commit.

This reverts commit 3fd60a6b73, reversing
changes made to 194df014fe.

Sponsored by:		Netflix
main
Warner Losh 3 months ago
parent 6f38d2e7b0
commit b2376a5f1e

.gitattributes vendored

@ -5,4 +5,3 @@
*.py diff=python
. svn-properties=svn:keywords=tools/build/options/WITHOUT_LOADER_ZFS
.clang-format svn-properties=svn:keywords=FreeBSD=%H
contrib/one-true-awk/bugs-fixed/unicode-null-match.bad binary

@ -27,11 +27,6 @@ NOTE TO PEOPLE WHO THINK THAT FreeBSD 15.x IS SLOW:
world, or to merely disable the most expensive debugging functionality
at runtime, run "ln -s 'abort:false,junk:false' /etc/malloc.conf".)
20231114:
One True Awk updated to the Second Edition as of 20231102 (254b979f32df)
Notable features include UTF-8 support and --csv to support comma
separated data.
20231113:
The WITHOUT_LLD_IS_LD option has been removed. When LLD is enabled
it is always installed as /usr/bin/ld.

File diff suppressed because it is too large

File diff suppressed because it is too large

@ -1,38 +1,8 @@
# The One True Awk
This is the version of `awk` described in _The AWK Programming Language_,
Second Edition, by Al Aho, Brian Kernighan, and Peter Weinberger
(Addison-Wesley, 2024, ISBN-13 978-0138269722, ISBN-10 0138269726).
## What's New? ##
This version of Awk handles UTF-8 and comma-separated values (CSV) input.
### Strings ###
Functions that process strings now count Unicode code points, not bytes;
this affects `length`, `substr`, `index`, `match`, `split`,
`sub`, `gsub`, and others. Note that code
points are not necessarily characters.
UTF-8 sequences may appear in literal strings and regular expressions.
Arbitrary characters may be included with `\u` followed by 1 to 8 hexadecimal digits.
### Regular expressions ###
Regular expressions may include UTF-8 code points, including `\u`.
Character classes are likely to be limited to about 256 characters
when expanded.
### CSV ###
The option `--csv` turns on CSV processing of input:
fields are separated by commas, fields may be quoted with
double-quote (`"`) characters, quoted fields may contain embedded newlines.
In CSV mode, `FS` is ignored.
If no explicit separator argument is provided,
field-splitting in `split` is determined by CSV mode.
by Al Aho, Brian Kernighan, and Peter Weinberger
(Addison-Wesley, 1988, ISBN 0-201-07981-X).
## Copyright
@ -65,7 +35,7 @@ in `FIXES`. If you distribute this code further, please please please
distribute `FIXES` with it.
If you find errors, please report them
to the current maintainer, ozan.yigit@gmail.com.
to bwk@cs.princeton.edu.
Please _also_ open an issue in the GitHub issue tracker, to make
it easy to track issues.
Thanks.
@ -97,22 +67,22 @@ The program itself is created by
which should produce a sequence of messages roughly like this:
bison -d awkgram.y
awkgram.y: warning: 44 shift/reduce conflicts [-Wconflicts-sr]
awkgram.y: warning: 85 reduce/reduce conflicts [-Wconflicts-rr]
awkgram.y: note: rerun with option '-Wcounterexamples' to generate conflict counterexamples
gcc -g -Wall -pedantic -Wcast-qual -O2 -c -o awkgram.tab.o awkgram.tab.c
gcc -g -Wall -pedantic -Wcast-qual -O2 -c -o b.o b.c
gcc -g -Wall -pedantic -Wcast-qual -O2 -c -o main.o main.c
gcc -g -Wall -pedantic -Wcast-qual -O2 -c -o parse.o parse.c
gcc -g -Wall -pedantic -Wcast-qual -O2 maketab.c -o maketab
./maketab awkgram.tab.h >proctab.c
gcc -g -Wall -pedantic -Wcast-qual -O2 -c -o proctab.o proctab.c
gcc -g -Wall -pedantic -Wcast-qual -O2 -c -o tran.o tran.c
gcc -g -Wall -pedantic -Wcast-qual -O2 -c -o lib.o lib.c
gcc -g -Wall -pedantic -Wcast-qual -O2 -c -o run.o run.c
gcc -g -Wall -pedantic -Wcast-qual -O2 -c -o lex.o lex.c
gcc -g -Wall -pedantic -Wcast-qual -O2 awkgram.tab.o b.o main.o parse.o proctab.o tran.o lib.o run.o lex.o -lm
yacc -d awkgram.y
conflicts: 43 shift/reduce, 85 reduce/reduce
mv y.tab.c ytab.c
mv y.tab.h ytab.h
cc -c ytab.c
cc -c b.c
cc -c main.c
cc -c parse.c
cc maketab.c -o maketab
./maketab >proctab.c
cc -c proctab.c
cc -c tran.c
cc -c lib.c
cc -c run.c
cc -c lex.c
cc ytab.o b.o main.o parse.o proctab.o tran.o lib.o run.o lex.o -lm
This produces an executable `a.out`; you will eventually want to
move this to some place like `/usr/bin/awk`.
@ -120,7 +90,7 @@ move this to some place like `/usr/bin/awk`.
If your system does not have `yacc` or `bison` (the GNU
equivalent), you need to install one of them first.
NOTE: This version uses ISO/IEC C99, as you should also. We have
NOTE: This version uses ANSI C (C 99), as you should also. We have
compiled this without any changes using `gcc -Wall` and/or local C
compilers on a variety of systems, but new systems or compilers
may raise some new complaint; reports of difficulties are
@ -132,9 +102,14 @@ the standard developer tools.
You can also use `make CC=g++` to build with the GNU C++ compiler,
should you choose to do so.
The version of `malloc` that comes with some systems is sometimes
astonishingly slow. If `awk` seems slow, you might try fixing that.
More generally, turning on optimization can significantly improve
`awk`'s speed, perhaps by 1/3 for highest levels.
## A Note About Releases
We don't usually do releases.
We don't do releases.
## A Note About Maintenance
@ -145,4 +120,4 @@ is not at the top of our priority list.
#### Last Updated
Sun 15 Oct 2023 06:28:36 IDT
Sat Jul 25 14:00:07 EDT 2021
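
Reviewer note on the README hunk above: the Second Edition text that this revert removes says string functions count Unicode code points rather than bytes. The sketch below illustrates that difference in C. `count_code_points` is a hypothetical helper written for this note, not a function from one-true-awk, and it assumes the input is valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper, not part of one-true-awk: count the code points in a
 * NUL-terminated string, assuming it is valid UTF-8. Continuation bytes have
 * the bit pattern 10xxxxxx, so every other byte starts a new code point. */
size_t count_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the string "あ" used in one of the removed tests (bytes E3 81 82), `strlen` reports 3 while this helper reports 1. A byte-oriented awk would give the first answer for `length`, and the reverted code was meant to give the second.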

@ -20,8 +20,6 @@ awk \- pattern-directed scanning and processing language
[
.BI \-F
.I fs
|
.B \-\^\-csv
]
[
.BI \-v
@ -78,12 +76,6 @@ The
.I fs
option defines the input field separator to be the regular expression
.IR fs .
The
.B \-\^\-csv
option causes
.I awk
to process records using (more or less) standard comma-separated values
(CSV) format.
.PP
An input line is normally made up of fields separated by white space,
or by the regular expression
@ -210,9 +202,9 @@ and
.B sqrt
are built in.
Other built-in functions:
.TF "\fBlength(\fR[\fIv\^\fR]\fB)\fR"
.TF length
.TP
\fBlength(\fR[\fIv\^\fR]\fB)\fR
.B length
the length of its argument
taken as a string,
number of elements in an array for an array argument,
@ -220,15 +212,15 @@ or length of
.B $0
if no argument.
.TP
.B rand()
.B rand
random number on [0,1).
.TP
\fBsrand(\fR[\fIs\^\fR]\fB)\fR
.B srand
sets seed for
.B rand
and returns the previous seed.
.TP
.BI int( x\^ )
.B int
truncates to an integer value.
.TP
\fBsubstr(\fIs\fB, \fIm\fR [\fB, \fIn\^\fR]\fB)\fR
@ -449,7 +441,7 @@ in a pattern.
A pattern may consist of two patterns separated by a comma;
in this case, the action is performed for all lines
from an occurrence of the first pattern
through an occurrence of the second, inclusive.
though an occurrence of the second.
.PP
A relational expression is one of the following:
.IP
@ -459,7 +451,7 @@ A relational expression is one of the following:
.br
.IB expression " in " array-name
.br
.BI ( expr ,\| expr ,\| ... ") in " array-name
.BI ( expr , expr,... ") in " array-name
.PP
where a
.I relop
@ -559,7 +551,7 @@ separates multiple subscripts (default 034).
Functions may be defined (at the position of a pattern-action statement) thus:
.IP
.B
function foo(a, b, c) { ... }
function foo(a, b, c) { ...; return x }
.PP
Parameters are passed by value if scalar and by reference if array name;
functions may be called recursively.
@ -625,8 +617,8 @@ BEGIN { # Simulate echo(1)
.IR sed (1)
.br
A. V. Aho, B. W. Kernighan, P. J. Weinberger,
.IR "The AWK Programming Language, Second Edition" ,
Addison-Wesley, 2024. ISBN 978-0-13-826972-2, 0-13-826972-6.
.IR "The AWK Programming Language" ,
Addison-Wesley, 1988. ISBN 0-201-07981-X.
.SH BUGS
There are no explicit conversions between numbers and strings.
To force an expression to be treated as a number add 0 to it;
@ -636,8 +628,7 @@ to force it to be treated as a string concatenate
The scope rules for variables in functions are a botch;
the syntax is worse.
.PP
Input is expected to be UTF-8 encoded. Other multibyte
character sets are not handled.
Only eight-bit character sets are handled correctly.
.SH UNUSUAL FLOATING-POINT VALUES
.I Awk
was designed before IEEE 754 arithmetic defined Not-A-Number (NaN)
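
Reviewer note on the `--csv` text removed from the man page and README above: fields are separated by commas, may be quoted with double quotes, and quoted fields may contain embedded newlines. The sketch below shows one way those rules could be applied to a single record. `csv_split` and `MAXFIELD` are hypothetical names, not code from one-true-awk, and treating a doubled quote inside a quoted field as one literal quote is an assumption borrowed from the usual CSV convention rather than something the diff states.

```c
#include <stddef.h>

#define MAXFIELD 64    /* illustrative limit on field length, including the NUL */

/* Hypothetical helper, not taken from one-true-awk: split one CSV record into
 * at most maxf fields. Commas separate fields; a field may be wrapped in
 * double quotes, in which case commas and newlines inside it are data.
 * Assumption: "" inside a quoted field stands for one literal quote.
 * Over-long fields are silently truncated. Returns the number of fields. */
int csv_split(const char *rec, char fields[][MAXFIELD], int maxf)
{
    int nf = 0;

    if (maxf <= 0)
        return 0;
    for (;;) {
        size_t len = 0;
        int quoted = 0;

        if (*rec == '"') {
            quoted = 1;
            rec++;
        }
        while (*rec != '\0') {
            if (quoted && *rec == '"') {
                if (rec[1] == '"') {
                    rec++;          /* doubled quote: keep one */
                } else {
                    rec++;          /* closing quote */
                    quoted = 0;
                    continue;
                }
            } else if (!quoted && *rec == ',') {
                break;              /* end of this field */
            }
            if (len < MAXFIELD - 1)
                fields[nf][len++] = *rec;
            rec++;
        }
        fields[nf++][len] = '\0';
        if (*rec != ',' || nf >= maxf)
            break;
        rec++;                      /* skip the separating comma */
    }
    return nf;
}
```

A record such as `a,"b,c",d` splits into three fields with this sketch, the second being `b,c`. The real implementation would also have to read across line boundaries to collect a quoted field with an embedded newline, which is not shown here.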

@ -37,7 +37,7 @@ typedef double Awkfloat;
typedef unsigned char uschar;
#define xfree(a) { free((void *)(intptr_t)(a)); (a) = NULL; }
#define xfree(a) { if ((a) != NULL) { free((void *)(intptr_t)(a)); (a) = NULL; } }
/*
* We sometimes cheat writing read-only pointers to NUL-terminate them
* and then put back the original value
@ -64,8 +64,6 @@ extern bool safe; /* false => unsafe, true => safe */
#define RECSIZE (8 * 1024) /* sets limit on records, fields, etc., etc. */
extern int recsize; /* size of current record, orig RECSIZE */
extern size_t awk_mb_cur_max; /* max size of a multi-byte character */
extern char EMPTY[]; /* this avoid -Wwritable-strings issues */
extern char **FS;
extern char **RS;
@ -80,8 +78,6 @@ extern char **SUBSEP;
extern Awkfloat *RSTART;
extern Awkfloat *RLENGTH;
extern bool CSV; /* true for csv input */
extern char *record; /* points to $0 */
extern int lineno; /* line number in awk program */
extern int errorflag; /* 1 if error has occurred */
@ -237,8 +233,7 @@ extern int pairstack[], paircnt;
/* structures used by regular expression matching machinery, mostly b.c: */
#define NCHARS (1256+3) /* 256 handles 8-bit chars; 128 does 7-bit */
/* BUG: some overflows (caught) if we use 256 */
#define NCHARS (256+3) /* 256 handles 8-bit chars; 128 does 7-bit */
/* watch out in match(), etc. */
#define HAT (NCHARS+2) /* matches ^ in regular expr */
#define NSTATES 32
@ -249,19 +244,12 @@ typedef struct rrow {
int i;
Node *np;
uschar *up;
int *rp; /* rune representation of char class */
} lval; /* because Al stores a pointer in it! */
int *lfollow;
} rrow;
typedef struct gtt { /* gototab entry */
unsigned int ch;
unsigned int state;
} gtt;
typedef struct fa {
gtt **gototab;
int gototab_len;
unsigned int **gototab;
uschar *out;
uschar *restr;
int **posns;

@ -204,12 +204,11 @@ ppattern:
{ $$ = op2(BOR, notnull($1), notnull($3)); }
| ppattern and ppattern %prec AND
{ $$ = op2(AND, notnull($1), notnull($3)); }
| ppattern MATCHOP reg_expr { $$ = op3($2, NIL, $1, (Node*)makedfa($3, 0)); free($3); }
| ppattern MATCHOP reg_expr { $$ = op3($2, NIL, $1, (Node*)makedfa($3, 0)); }
| ppattern MATCHOP ppattern
{ if (constnode($3)) {
{ if (constnode($3))
$$ = op3($2, NIL, $1, (Node*)makedfa(strnode($3), 0));
free($3);
} else
else
$$ = op3($2, (Node *)1, $1, $3); }
| ppattern IN varname { $$ = op2(INTEST, $1, makearr($3)); }
| '(' plist ')' IN varname { $$ = op2(INTEST, $2, makearr($5)); }
@ -232,12 +231,11 @@ pattern:
| pattern LE pattern { $$ = op2($2, $1, $3); }
| pattern LT pattern { $$ = op2($2, $1, $3); }
| pattern NE pattern { $$ = op2($2, $1, $3); }
| pattern MATCHOP reg_expr { $$ = op3($2, NIL, $1, (Node*)makedfa($3, 0)); free($3); }
| pattern MATCHOP reg_expr { $$ = op3($2, NIL, $1, (Node*)makedfa($3, 0)); }
| pattern MATCHOP pattern
{ if (constnode($3)) {
{ if (constnode($3))
$$ = op3($2, NIL, $1, (Node*)makedfa(strnode($3), 0));
free($3);
} else
else
$$ = op3($2, (Node *)1, $1, $3); }
| pattern IN varname { $$ = op2(INTEST, $1, makearr($3)); }
| '(' plist ')' IN varname { $$ = op2(INTEST, $2, makearr($5)); }
@ -282,7 +280,7 @@ rbrace:
re:
reg_expr
{ $$ = op3(MATCH, NIL, rectonode(), (Node*)makedfa($1, 0)); free($1); }
{ $$ = op3(MATCH, NIL, rectonode(), (Node*)makedfa($1, 0)); }
| NOT re { $$ = op1(NOT, notnull($2)); }
;
@ -380,19 +378,17 @@ term:
| GENSUB '(' reg_expr comma pattern comma pattern ')'
{ $$ = op5(GENSUB, NIL, (Node*)makedfa($3, 1), $5, $7, rectonode()); }
| GENSUB '(' pattern comma pattern comma pattern ')'
{ if (constnode($3)) {
{ if (constnode($3))
$$ = op5(GENSUB, NIL, (Node *)makedfa(strnode($3), 1), $5, $7, rectonode());
free($3);
} else
else
$$ = op5(GENSUB, (Node *)1, $3, $5, $7, rectonode());
}
| GENSUB '(' reg_expr comma pattern comma pattern comma pattern ')'
{ $$ = op5(GENSUB, NIL, (Node*)makedfa($3, 1), $5, $7, $9); }
| GENSUB '(' pattern comma pattern comma pattern comma pattern ')'
{ if (constnode($3)) {
{ if (constnode($3))
$$ = op5(GENSUB, NIL, (Node *)makedfa(strnode($3),1), $5,$7,$9);
free($3);
} else
else
$$ = op5(GENSUB, (Node *)1, $3, $5, $7, $9);
}
| GETLINE var LT term { $$ = op3(GETLINE, $2, itonp($3), $4); }
@ -406,37 +402,34 @@ term:
$$ = op2(INDEX, $3, (Node*)$5); }
| '(' pattern ')' { $$ = $2; }
| MATCHFCN '(' pattern comma reg_expr ')'
{ $$ = op3(MATCHFCN, NIL, $3, (Node*)makedfa($5, 1)); free($5); }
{ $$ = op3(MATCHFCN, NIL, $3, (Node*)makedfa($5, 1)); }
| MATCHFCN '(' pattern comma pattern ')'
{ if (constnode($5)) {
{ if (constnode($5))
$$ = op3(MATCHFCN, NIL, $3, (Node*)makedfa(strnode($5), 1));
free($5);
} else
else
$$ = op3(MATCHFCN, (Node *)1, $3, $5); }
| NUMBER { $$ = celltonode($1, CCON); }
| SPLIT '(' pattern comma varname comma pattern ')' /* string */
{ $$ = op4(SPLIT, $3, makearr($5), $7, (Node*)STRING); }
| SPLIT '(' pattern comma varname comma reg_expr ')' /* const /regexp/ */
{ $$ = op4(SPLIT, $3, makearr($5), (Node*)makedfa($7, 1), (Node *)REGEXPR); free($7); }
{ $$ = op4(SPLIT, $3, makearr($5), (Node*)makedfa($7, 1), (Node *)REGEXPR); }
| SPLIT '(' pattern comma varname ')'
{ $$ = op4(SPLIT, $3, makearr($5), NIL, (Node*)STRING); } /* default */
| SPRINTF '(' patlist ')' { $$ = op1($1, $3); }
| string { $$ = celltonode($1, CCON); }
| subop '(' reg_expr comma pattern ')'
{ $$ = op4($1, NIL, (Node*)makedfa($3, 1), $5, rectonode()); free($3); }
{ $$ = op4($1, NIL, (Node*)makedfa($3, 1), $5, rectonode()); }
| subop '(' pattern comma pattern ')'
{ if (constnode($3)) {
{ if (constnode($3))
$$ = op4($1, NIL, (Node*)makedfa(strnode($3), 1), $5, rectonode());
free($3);
} else
else
$$ = op4($1, (Node *)1, $3, $5, rectonode()); }
| subop '(' reg_expr comma pattern comma var ')'
{ $$ = op4($1, NIL, (Node*)makedfa($3, 1), $5, $7); free($3); }
{ $$ = op4($1, NIL, (Node*)makedfa($3, 1), $5, $7); }
| subop '(' pattern comma pattern comma var ')'
{ if (constnode($3)) {
{ if (constnode($3))
$$ = op4($1, NIL, (Node*)makedfa(strnode($3), 1), $5, $7);
free($3);
} else
else
$$ = op4($1, (Node *)1, $3, $5, $7); }
| SUBSTR '(' pattern comma pattern comma pattern ')'
{ $$ = op3(SUBSTR, $3, $5, $7); }

@ -80,43 +80,6 @@ int patlen;
fa *fatab[NFA];
int nfatab = 0; /* entries in fatab */
extern int u8_nextlen(const char *s);
/* utf-8 mechanism:
For most of Awk, utf-8 strings just "work", since they look like
null-terminated sequences of 8-bit bytes.
Functions like length(), index(), and substr() have to operate
in units of utf-8 characters. The u8_* functions in run.c
handle this.
Regular expressions are more complicated, since the basic
mechanism of the goto table used 8-bit byte indices into the
gototab entries to compute the next state. Unicode is a lot
bigger, so the gototab entries are now structs with a character
and a next state, and there is a linear search of the characters
to find the state. (Yes, this is slower, by a significant
amount. Tough.)
Throughout the RE mechanism in b.c, utf-8 characters are
converted to their utf-32 value. This mostly shows up in
cclenter, which expands character class ranges like a-z and now
alpha-omega. The size of a gototab array is still about 256.
This should be dynamic, but for now things work ok for a single
code page of Unicode, which is the most likely case.
The code changes are localized in run.c and b.c. I have added a
handful of functions to somewhat better hide the implementation,
but a lot more could be done.
*/
static int get_gototab(fa*, int, int);
static int set_gototab(fa*, int, int, int);
extern int u8_rune(int *, const uschar *);
static int *
intalloc(size_t n, const char *f)
{
@ -142,7 +105,7 @@ resizesetvec(const char *f)
static void
resize_state(fa *f, int state)
{
gtt **p;
unsigned int **p;
uschar *p2;
int **p3;
int i, new_count;
@ -152,7 +115,7 @@ resize_state(fa *f, int state)
new_count = state + 10; /* needs to be tuned */
p = (gtt **) realloc(f->gototab, new_count * sizeof(f->gototab[0]));
p = (unsigned int **) realloc(f->gototab, new_count * sizeof(f->gototab[0]));
if (p == NULL)
goto out;
f->gototab = p;
@ -168,13 +131,12 @@ resize_state(fa *f, int state)
f->posns = p3;
for (i = f->state_count; i < new_count; ++i) {
f->gototab[i] = (gtt *) calloc(NCHARS, sizeof(**f->gototab));
f->gototab[i] = (unsigned int *) calloc(NCHARS, sizeof(**f->gototab));
if (f->gototab[i] == NULL)
goto out;
f->out[i] = 0;
f->posns[i] = NULL;
}
f->gototab_len = NCHARS; /* should be variable, growable */
f->state_count = new_count;
return;
out:
@ -269,7 +231,7 @@ int makeinit(fa *f, bool anchor)
if ((f->posns[2])[1] == f->accept)
f->out[2] = 1;
for (i = 0; i < NCHARS; i++)
set_gototab(f, 2, 0, 0); /* f->gototab[2][i] = 0; */
f->gototab[2][i] = 0;
f->curstat = cgoto(f, 2, HAT);
if (anchor) {
*f->posns[2] = k-1; /* leave out position 0 */
@ -338,13 +300,13 @@ void freetr(Node *p) /* free parse tree */
/* in the parsing of regular expressions, metacharacters like . have */
/* to be seen literally; \056 is not a metacharacter. */
int hexstr(const uschar **pp, int max) /* find and eval hex string at pp, return new p */
int hexstr(const uschar **pp) /* find and eval hex string at pp, return new p */
{ /* only pick up one 8-bit byte (2 chars) */
const uschar *p;
int n = 0;
int i;
for (i = 0, p = *pp; i < max && isxdigit(*p); i++, p++) {
for (i = 0, p = *pp; i < 2 && isxdigit(*p); i++, p++) {
if (isdigit(*p))
n = 16 * n + *p - '0';
else if (*p >= 'a' && *p <= 'f')
@ -356,8 +318,6 @@ int hexstr(const uschar **pp, int max) /* find and eval hex string at pp, return
return n;
}
#define isoctdigit(c) ((c) >= '0' && (c) <= '7') /* multiple use of arg */
int quoted(const uschar **pp) /* pick up next thing after a \\ */
@ -366,28 +326,24 @@ int quoted(const uschar **pp) /* pick up next thing after a \\ */
const uschar *p = *pp;
int c;
/* BUG: should advance by utf-8 char even if makes no sense */
if ((c = *p++) == 't') {
if ((c = *p++) == 't')
c = '\t';
} else if (c == 'n') {
else if (c == 'n')
c = '\n';
} else if (c == 'f') {
else if (c == 'f')
c = '\f';
} else if (c == 'r') {
else if (c == 'r')
c = '\r';
} else if (c == 'b') {
else if (c == 'b')
c = '\b';
} else if (c == 'v') {
else if (c == 'v')
c = '\v';
} else if (c == 'a') {
else if (c == 'a')
c = '\a';
} else if (c == '\\') {
else if (c == '\\')
c = '\\';
} else if (c == 'x') { /* 2 hex digits follow */
c = hexstr(&p, 2); /* this adds a null if number is invalid */
} else if (c == 'u') { /* unicode char number up to 8 hex digits */
c = hexstr(&p, 8);
else if (c == 'x') { /* hexadecimal goo follows */
c = hexstr(&p); /* this adds a null if number is invalid */
} else if (isoctdigit(c)) { /* \d \dd \ddd */
int n = c - '0';
if (isoctdigit(*p)) {
@ -402,67 +358,50 @@ int quoted(const uschar **pp) /* pick up next thing after a \\ */
return c;
}
int *cclenter(const char *argp) /* add a character class */
char *cclenter(const char *argp) /* add a character class */
{
int i, c, c2;
int n;
const uschar *p = (const uschar *) argp;
int *bp, *retp;
static int *buf = NULL;
const uschar *op, *p = (const uschar *) argp;
uschar *bp;
static uschar *buf = NULL;
static int bufsz = 100;
if (buf == NULL && (buf = (int *) calloc(bufsz, sizeof(int))) == NULL)
op = p;
if (buf == NULL && (buf = (uschar *) malloc(bufsz)) == NULL)
FATAL("out of space for character class [%.10s...] 1", p);
bp = buf;
for (i = 0; *p != 0; ) {
n = u8_rune(&c, p);
p += n;
for (i = 0; (c = *p++) != 0; ) {
if (c == '\\') {
c = quoted(&p);
} else if (c == '-' && i > 0 && bp[-1] != 0) {
if (*p != 0) {
c = bp[-1];
/* c2 = *p++; */
n = u8_rune(&c2, p);
p += n;
c2 = *p++;
if (c2 == '\\')
c2 = quoted(&p); /* BUG: sets p, has to be u8 size */
c2 = quoted(&p);
if (c > c2) { /* empty; ignore */
bp--;
i--;
continue;
}
while (c < c2) {
if (i >= bufsz) {
bufsz *= 2;
buf = (int *) realloc(buf, bufsz * sizeof(int));
if (buf == NULL)
FATAL("out of space for character class [%.10s...] 2", p);
bp = buf + i;
}
if (!adjbuf((char **) &buf, &bufsz, bp-buf+2, 100, (char **) &bp, "cclenter1"))
FATAL("out of space for character class [%.10s...] 2", p);
*bp++ = ++c;
i++;
}
continue;
}
}
if (i >= bufsz) {
bufsz *= 2;
buf = (int *) realloc(buf, bufsz * sizeof(int));
if (buf == NULL)
FATAL("out of space for character class [%.10s...] 2", p);
bp = buf + i;
}
if (!adjbuf((char **) &buf, &bufsz, bp-buf+2, 100, (char **) &bp, "cclenter2"))
FATAL("out of space for character class [%.10s...] 3", p);
*bp++ = c;
i++;
}
*bp = 0;
/* DPRINTF("cclenter: in = |%s|, out = |%s|\n", op, buf); BUG: can't print array of int */
/* xfree(op); BUG: what are we freeing here? */
retp = (int *) calloc(bp-buf+1, sizeof(int));
for (i = 0; i < bp-buf+1; i++)
retp[i] = buf[i];
return retp;
DPRINTF("cclenter: in = |%s|, out = |%s|\n", op, buf);
xfree(op);
return (char *) tostring((char *) buf);
}
void overflo(const char *s)
@ -529,7 +468,7 @@ int first(Node *p) /* collects initially active leaves of p into setvec */
setvec[lp] = 1;
setcnt++;
}
if (type(p) == CCL && (*(int *) right(p)) == 0)
if (type(p) == CCL && (*(char *) right(p)) == '\0')
return(0); /* empty CCL */
return(1);
case PLUS:
@ -585,9 +524,9 @@ void follow(Node *v) /* collects leaves that can follow v into setvec */
}
}
int member(int c, int *sarg) /* is c in s? */
int member(int c, const char *sarg) /* is c in s? */
{
int *s = (int *) sarg;
const uschar *s = (const uschar *) sarg;
while (*s)
if (c == *s++)
@ -595,41 +534,11 @@ int member(int c, int *sarg) /* is c in s? */
return(0);
}
static int get_gototab(fa *f, int state, int ch) /* hide gototab inplementation */
{
int i;
for (i = 0; i < f->gototab_len; i++) {
if (f->gototab[state][i].ch == 0)
break;
if (f->gototab[state][i].ch == ch)
return f->gototab[state][i].state;
}
return 0;
}
static int set_gototab(fa *f, int state, int ch, int val) /* hide gototab inplementation */
{
int i;
for (i = 0; i < f->gototab_len; i++) {
if (f->gototab[state][i].ch == 0 || f->gototab[state][i].ch == ch) {
f->gototab[state][i].ch = ch;
f->gototab[state][i].state = val;
return val;
}
}
overflo(__func__);
return val; /* not used anywhere at the moment */
}
int match(fa *f, const char *p0) /* shortest match ? */
{
int s, ns;
int n;
int rune;
const uschar *p = (const uschar *) p0;
/* return pmatch(f, p0); does it matter whether longest or shortest? */
s = f->initstat;
assert (s < f->state_count);
@ -637,25 +546,19 @@ int match(fa *f, const char *p0) /* shortest match ? */
return(1);
do {
/* assert(*p < NCHARS); */
n = u8_rune(&rune, p);
if ((ns = get_gototab(f, s, rune)) != 0)
if ((ns = f->gototab[s][*p]) != 0)
s = ns;
else
s = cgoto(f, s, rune);
s = cgoto(f, s, *p);
if (f->out[s])
return(1);
if (*p == 0)
break;
p += n;
} while (1); /* was *p++ != 0 */
} while (*p++ != 0);
return(0);
}
int pmatch(fa *f, const char *p0) /* longest match, for sub */
{
int s, ns;
int n;
int rune;
const uschar *p = (const uschar *) p0;
const uschar *q;
@ -670,11 +573,10 @@ int pmatch(fa *f, const char *p0) /* longest match, for sub */
if (f->out[s]) /* final state */
patlen = q-p;
/* assert(*q < NCHARS); */
n = u8_rune(&rune, q);
if ((ns = get_gototab(f, s, rune)) != 0)
if ((ns = f->gototab[s][*q]) != 0)
s = ns;
else
s = cgoto(f, s, rune);
s = cgoto(f, s, *q);
assert(s < f->state_count);
@ -686,11 +588,7 @@ int pmatch(fa *f, const char *p0) /* longest match, for sub */
else
goto nextin; /* no match */
}
if (*q == 0)
break;
q += n;
} while (1);
q++; /* was *q++ */
} while (*q++ != 0);
if (f->out[s])
patlen = q-p-1; /* don't count $ */
if (patlen >= 0) {
@ -699,19 +597,13 @@ int pmatch(fa *f, const char *p0) /* longest match, for sub */
}
nextin:
s = 2;
if (*p == 0)
break;
n = u8_rune(&rune, p);
p += n;
} while (1); /* was *p++ */
} while (*p++);
return (0);
}
int nematch(fa *f, const char *p0) /* non-empty match, for sub */
{
int s, ns;
int n;
int rune;
const uschar *p = (const uschar *) p0;
const uschar *q;
@ -726,11 +618,10 @@ int nematch(fa *f, const char *p0) /* non-empty match, for sub */
if (f->out[s]) /* final state */
patlen = q-p;
/* assert(*q < NCHARS); */
n = u8_rune(&rune, q);
if ((ns = get_gototab(f, s, rune)) != 0)
if ((ns = f->gototab[s][*q]) != 0)
s = ns;
else
s = cgoto(f, s, rune);
s = cgoto(f, s, *q);
if (s == 1) { /* no transition */
if (patlen > 0) {
patbeg = (const char *) p;
@ -738,11 +629,7 @@ int nematch(fa *f, const char *p0) /* non-empty match, for sub */
} else
goto nnextin; /* no nonempty match */
}
if (*q == 0)
break;
q += n;
} while (1);
q++;
} while (*q++ != 0);
if (f->out[s])
patlen = q-p-1; /* don't count $ */
if (patlen > 0 ) {
@ -757,61 +644,6 @@ int nematch(fa *f, const char *p0) /* non-empty match, for sub */
}
#define MAX_UTF_BYTES 4 // UTF-8 is up to 4 bytes long
// Read one rune at a time from the given FILE*. Return both
// the bytes and the actual rune.
struct runedata {
int rune;
size_t len;
char bytes[6];
};
struct runedata getrune(FILE *fp)
{
struct runedata result;
int c, next;
memset(&result, 0, sizeof(result));
c = getc(fp);
if (c == EOF)
return result; // result.rune == 0 --> EOF
else if (c < 128 || awk_mb_cur_max == 1) {
result.bytes[0] = c;
result.len = 1;
result.rune = c;
return result;
}
// need to get bytes and fill things in
result.bytes[0] = c;
result.len = 1;
next = 1;
for (int i = 1; i < MAX_UTF_BYTES; i++) {
c = getc(fp);
if (c == EOF)
break;
result.bytes[next++] = c;
result.len++;
}
// put back any extra input bytes
int actual_len = u8_nextlen(result.bytes);
while (result.len > actual_len) {
ungetc(result.bytes[--result.len], fp);
}
result.bytes[result.len] = '\0';
(void) u8_rune(& result.rune, (uschar *) result.bytes);
return result;
}
/*
* NAME
* fnematch
@ -831,8 +663,7 @@ bool fnematch(fa *pfa, FILE *f, char **pbuf, int *pbufsize, int quantum)
{
char *buf = *pbuf;
int bufsize = *pbufsize;
int i, j, k, ns, s;
struct runedata r;
int c, i, j, k, ns, s;
s = pfa->initstat;
patlen = 0;
@ -841,38 +672,35 @@ bool fnematch(fa *pfa, FILE *f, char **pbuf, int *pbufsize, int quantum)
* All indices relative to buf.
* i <= j <= k <= bufsize
*
* i: origin of active substring (first byte of first character)
* j: current character (last byte of current character)
* i: origin of active substring
* j: current character
* k: destination of next getc()
*/
i = -1, k = 0;
do {
j = i++;
do {
r = getrune(f);
if ((++j + r.len) >= k) {
if (k >= bufsize)
if (++j == k) {
if (k == bufsize)
if (!adjbuf((char **) &buf, &bufsize, bufsize+1, quantum, 0, "fnematch"))
FATAL("stream '%.30s...' too long", buf);
buf[k++] = (c = getc(f)) != EOF ? c : 0;
}
memcpy(buf + k, r.bytes, r.len);
j += r.len - 1; // incremented next time around the loop
k += r.len;
c = (uschar)buf[j];
/* assert(c < NCHARS); */
if ((ns = get_gototab(pfa, s, r.rune)) != 0)
if ((ns = pfa->gototab[s][c]) != 0)
s = ns;
else
s = cgoto(pfa, s, r.rune);
s = cgoto(pfa, s, c);
if (pfa->out[s]) { /* final state */
patlen = j - i + 1;
if (r.rune == 0) /* don't count $ */
if (c == 0) /* don't count $ */
patlen--;
}
} while (buf[j] && s != 1);
s = 2;
if (r.len > 1)
i += r.len - 1; // i incremented around the loop
} while (buf[i] && !patlen);
/* adjbuf() may have relocated a resized buffer. Inform the world. */
@ -893,9 +721,8 @@ bool fnematch(fa *pfa, FILE *f, char **pbuf, int *pbufsize, int quantum)
* terminate the buffer.
*/
do
for (int ii = r.len; ii > 0; ii--)
if (buf[--k] && ungetc(buf[k], f) == EOF)
FATAL("unable to ungetc '%c'", buf[k]);
if (buf[--k] && ungetc(buf[k], f) == EOF)
FATAL("unable to ungetc '%c'", buf[k]);
while (k > i + patlen);
buf[k] = '\0';
return true;
@ -970,7 +797,7 @@ Node *primary(void)
rtok = relex();
if (rtok == ')') { /* special pleading for () */
rtok = relex();
return unary(op2(CCL, NIL, (Node *) cclenter("")));
return unary(op2(CCL, NIL, (Node *) tostring("")));
}
np = regexp();
if (rtok == ')') {
@ -993,7 +820,7 @@ Node *concat(Node *np)
return (concat(op2(CAT, np, primary())));
case EMPTYRE:
rtok = relex();
return (concat(op2(CAT, op2(CCL, NIL, (Node *) cclenter("")),
return (concat(op2(CAT, op2(CCL, NIL, (Node *) tostring("")),
primary())));
}
return (np);
@ -1192,8 +1019,6 @@ static int repeat(const uschar *reptok, int reptoklen, const uschar *atom,
return 0;
}
extern int u8_rune(int *, const uschar *); /* run.c; should be in header file */
int relex(void) /* lexical analyzer for reparse */
{
int c, n;
@ -1211,12 +1036,6 @@ int relex(void) /* lexical analyzer for reparse */
rescan:
starttok = prestr;
if ((n = u8_rune(&rlxval, prestr)) > 1) {
prestr += n;
starttok = prestr;
return CHAR;
}
switch (c = *prestr++) {
case '|': return OR;
case '*': return STAR;
@ -1254,15 +1073,10 @@ rescan:
}
else
cflag = 0;
n = 5 * strlen((const char *) prestr)+1; /* BUG: was 2. what value? */
n = 2 * strlen((const char *) prestr)+1;
if (!adjbuf((char **) &buf, &bufsz, n, n, (char **) &bp, "relex1"))
FATAL("out of space for reg expr %.10s...", lastre);
for (; ; ) {
if ((n = u8_rune(&rlxval, prestr)) > 1) {
for (i = 0; i < n; i++)
*bp++ = *prestr++;
continue;
}
if ((c = *prestr++) == '\\') {
*bp++ = '\\';
if ((c = *prestr++) == '\0')
@ -1287,7 +1101,7 @@ rescan:
* program to track each string's length.
*/
for (i = 1; i <= UCHAR_MAX; i++) {
if (!adjbuf((char **) &buf, &bufsz, bp-buf+2, 100, (char **) &bp, "relex2"))
if (!adjbuf((char **) &buf, &bufsz, bp-buf+1, 100, (char **) &bp, "relex2"))
FATAL("out of space for reg expr %.10s...", lastre);
if (cc->cc_func(i)) {
/* escape backslash */
@ -1429,7 +1243,7 @@ int cgoto(fa *f, int s, int c)
int *p, *q;
int i, j, k;
/* assert(c == HAT || c < NCHARS); BUG: seg fault if disable test */
assert(c == HAT || c < NCHARS);
while (f->accept >= maxsetvec) { /* guessing here! */
resizesetvec(__func__);
}
@ -1445,8 +1259,8 @@ int cgoto(fa *f, int s, int c)
|| (k == DOT && c != 0 && c != HAT)
|| (k == ALL && c != 0)
|| (k == EMPTYRE && c != 0)
|| (k == CCL && member(c, (int *) f->re[p[i]].lval.rp))
|| (k == NCCL && !member(c, (int *) f->re[p[i]].lval.rp) && c != 0 && c != HAT)) {
|| (k == CCL && member(c, (char *) f->re[p[i]].lval.up))
|| (k == NCCL && !member(c, (char *) f->re[p[i]].lval.up) && c != 0 && c != HAT)) {
q = f->re[p[i]].lfollow;
for (j = 1; j <= *q; j++) {
if (q[j] >= maxsetvec) {
@ -1478,7 +1292,7 @@ int cgoto(fa *f, int s, int c)
goto different;
/* setvec is state i */
if (c != HAT)
set_gototab(f, s, c, i);
f->gototab[s][c] = i;
return i;
different:;
}
@ -1487,13 +1301,13 @@ int cgoto(fa *f, int s, int c)
++(f->curstat);
resize_state(f, f->curstat);
for (i = 0; i < NCHARS; i++)
set_gototab(f, f->curstat, 0, 0);
f->gototab[f->curstat][i] = 0;
xfree(f->posns[f->curstat]);
p = intalloc(setcnt + 1, __func__);
f->posns[f->curstat] = p;
if (c != HAT)
set_gototab(f, s, c, f->curstat);
f->gototab[s][c] = f->curstat;
for (i = 0; i <= setcnt; i++)
p[i] = tmpset[i];
if (setvec[f->accept])
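
Reviewer note on the b.c hunks above: the removed comment says UTF-8 characters are converted to their UTF-32 value throughout the regular expression code, and the removed matching loops call `u8_rune()` (defined in run.c, not shown in this diff) to do that. The sketch below shows what such a decoding step might look like. `decode_utf8` is a hypothetical stand-in, and its handling of malformed input (fall back to a single byte) is an assumption, not the behavior of the real function.

```c
/* Hypothetical decoder, not the u8_rune() from run.c: store the code point
 * that starts at s in *rune and return the number of bytes consumed.
 * A malformed or truncated sequence is treated as a single byte.
 * Overlong forms and surrogates are not rejected, to keep the sketch short. */
int decode_utf8(int *rune, const unsigned char *s)
{
    int c = s[0];
    int len, i;

    if (c < 0x80) {                     /* 0xxxxxxx: ASCII */
        *rune = c;
        return 1;
    } else if ((c & 0xE0) == 0xC0) {    /* 110xxxxx: 2 bytes */
        len = 2;
        c &= 0x1F;
    } else if ((c & 0xF0) == 0xE0) {    /* 1110xxxx: 3 bytes */
        len = 3;
        c &= 0x0F;
    } else if ((c & 0xF8) == 0xF0) {    /* 11110xxx: 4 bytes */
        len = 4;
        c &= 0x07;
    } else {                            /* stray continuation or bad lead byte */
        *rune = s[0];
        return 1;
    }
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) {    /* also catches the terminating NUL */
            *rune = s[0];
            return 1;
        }
        c = (c << 6) | (s[i] & 0x3F);
    }
    *rune = c;
    return len;
}
```

The byte-indexed `f->gototab[s][*p]` lookups that this revert restores only work while the transition key fits in `NCHARS`. Once the key is a decoded code point like the one produced here, a direct array index no longer fits, which is why the reverted code switched to a searched table of (character, state) pairs.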

@ -1,5 +0,0 @@
BEGIN {
getline l
getline l
print (s=substr(l,1,10)) " len=" length(s)
}

@ -1,10 +0,0 @@
BEGIN {
str="\342\200\257"
print length(str)
match(str,/^/)
print RSTART, RLENGTH
match(str,/.+/)
print RSTART, RLENGTH
match(str,/$/)
print RSTART, RLENGTH
}

@ -1,6 +0,0 @@
BEGIN {
FS="␟"
RS="␞"
OFS=","
}
{ print $1, $2, $3 }

@ -1,2 +0,0 @@
id␟name␟age␞1␟Bob "Billy" Smith␟42␞2␟Jane
Brown␟37

@ -1,5 +0,0 @@
id,name,age
1,Bob "Billy" Smith,42
2,Jane
Brown,37

@ -1,7 +0,0 @@
BEGIN {
FS = "א"
RS = "בב"
OFS = ","
}
{ print $1, $2, $3 }

@ -1,2 +0,0 @@
idאnameאageא1אBob "Billy" Smithא42א2בבJane
Brownא37

@ -1,6 +0,0 @@
BEGIN {
# str = "\342\200\257"
str = "あ"
n = gsub(//, "X", str)
print n, str
}

@ -377,8 +377,6 @@ int yylex(void)
}
}
extern int runetochar(char *str, int c);
int string(void)
{
int c, n;
@ -426,50 +424,20 @@ int string(void)
*bp++ = n;
break;
case 'x': /* hex \x0-9a-fA-F (exactly two) */
{
int i;
n = 0;
for (i = 1; i <= 2; i++) {
c = input();
if (c == 0)
break;
if (isxdigit(c)) {
c = tolower(c);
n *= 16;
if (isdigit(c))
n += (c - '0');
else
n += 10 + (c - 'a');
} else
break;
}
if (n)
*bp++ = n;
else
unput(c);
break;
}
case 'u': /* utf \u0-9a-fA-F (1..8) */
{
int i;
n = 0;
for (i = 0; i < 8; i++) {
c = input();
if (!isxdigit(c) || c == 0)
break;
c = tolower(c);
n *= 16;
if (isdigit(c))
n += (c - '0');
case 'x': /* hex \x0-9a-fA-F + */
{ char xbuf[100], *px;
for (px = xbuf; (c = input()) != 0 && px-xbuf < 100-2; ) {
if (isdigit(c)
|| (c >= 'a' && c <= 'f')
|| (c >= 'A' && c <= 'F'))
*px++ = c;
else
n += 10 + (c - 'a');
break;
}
*px = 0;
unput(c);
bp += runetochar(bp, n);
sscanf(xbuf, "%x", (unsigned int *) &n);
*bp++ = n;
break;
}
@ -566,7 +534,7 @@ int regexpr(void)
char *bp;
if (buf == NULL && (buf = (char *) malloc(bufsz)) == NULL)
FATAL("out of space for reg expr");
FATAL("out of space for rex expr");
bp = buf;
for ( ; (c = input()) != '/' && c != 0; ) {
if (!adjbuf(&buf, &bufsz, bp-buf+3, 500, &bp, "regexpr"))
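
Reviewer note on the lex.c hunks above: the removed `\u` case collects up to 8 hexadecimal digits into `n` and then calls `runetochar(bp, n)` to append the UTF-8 bytes to the string buffer. That function lives in run.c and is not part of this diff, so the sketch below is only an illustration of the encoding step. `encode_utf8` is a hypothetical name, and returning 0 for out-of-range values is an assumption made for this note.

```c
/* Hypothetical encoder, not the runetochar() from run.c: write code point cp
 * to out as UTF-8 and return the number of bytes written (1 to 4), or 0 if cp
 * is above U+10FFFF. out must have room for 4 bytes; no NUL is appended.
 * Surrogates (U+D800..U+DFFF) are not rejected in this sketch. */
int encode_utf8(char *out, unsigned int cp)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this sketch, the escape `\u3042` would expand to the three bytes E3 81 82. The restored `\x` case, by contrast, stores a single byte with `*bp++ = n`.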

@ -34,8 +34,6 @@ THIS SOFTWARE.
#include <math.h>
#include "awk.h"
extern int u8_nextlen(const char *s);
char EMPTY[] = { '\0' };
FILE *infile = NULL;
bool innew; /* true = infile has not been read by readrec */
@ -152,6 +150,11 @@ int getrec(char **pbuf, int *pbufsize, bool isrecord) /* get next input record *
}
DPRINTF("RS=<%s>, FS=<%s>, ARGC=%g, FILENAME=%s\n",
*RS, *FS, *ARGC, *FILENAME);
if (isrecord) {
donefld = false;
donerec = true;
savefs();
}
saveb0 = buf[0];
buf[0] = 0;
while (argno < *ARGC || infile == stdin) {
@ -191,9 +194,6 @@ int getrec(char **pbuf, int *pbufsize, bool isrecord) /* get next input record *
fldtab[0]->fval = result;
fldtab[0]->tval |= NUM;
}
donefld = false;
donerec = true;
savefs();
}
setfval(nrloc, nrloc->fval+1);
setfval(fnrloc, fnrloc->fval+1);
@ -221,22 +221,16 @@ void nextfile(void)
argno++;
}
extern int readcsvrec(char **pbuf, int *pbufsize, FILE *inf, bool newflag);
int readrec(char **pbuf, int *pbufsize, FILE *inf, bool newflag) /* read one record into buf */
{
int sep, c, isrec; // POTENTIAL BUG? isrec is a macro in awk.h
char *rr = *pbuf, *buf = *pbuf;
int sep, c, isrec;
char *rr, *buf = *pbuf;
int bufsize = *pbufsize;
char *rs = getsval(rsloc);
if (CSV) {
c = readcsvrec(pbuf, pbufsize, inf, newflag);
isrec = (c == EOF && rr == buf) ? false : true;
} else if (*rs && rs[1]) {
if (*rs && rs[1]) {
bool found;
memset(buf, 0, bufsize);
fa *pfa = makedfa(rs, 1);
if (newflag)
found = fnematch(pfa, inf, &buf, &bufsize, recsize);
@ -249,7 +243,6 @@ int readrec(char **pbuf, int *pbufsize, FILE *inf, bool newflag) /* read one rec