Node: BP character constants, Next: Compiler directives internally, Previous: Lexer problems, Up: Lexical analyzer
Borland-style character constants of the form ^M need special care. For example, look at the following type declaration:

  type
    X = Integer;
    Y = ^X;        { pointer type }
    Z = ^X .. ^Y;  { subrange type }
One way to resolve this is to let the parser tell the lexer (via a global flag) whether a character constant or the symbol ^ (to create pointer types or to dereference pointer expressions) is suitable in the current context. This was done in previous versions, but it had a number of disadvantages. First, any dependency of the lexer on the parser (see Lexical Tie-Ins) is problematic in itself, since it must be taken care of manually in each relevant parser rule. Furthermore, the parser's read-ahead must be taken into account, so the flag must usually be changed seemingly one token too early. Using a more powerful parsing algorithm such as GLR (see GLR Parsers) aggravates this problem, since the parser may read many tokens while it is split before it can perform any semantic action (which is where the flag could be modified). Second, as the example above shows, there are contexts in which both meanings are acceptable, so further look-ahead (within the lexer) was needed to resolve the problem.
Therefore, the current version of GPC uses another approach. When seeing ^X, the lexer returns two tokens, a regular ^ and a special token LEX_CARET_LETTER with semantic value X. The parser then accepts LEX_CARET_LETTER wherever an identifier is accepted (and turns it into the identifier X via the nonterminal caret_letter). Furthermore, it accepts the sequence ^, LEX_CARET_LETTER as a string constant (whose value is a one-character string). Since LEX_CARET_LETTER is only produced by the lexer immediately after ^ (no white-space in between), this works. (In general, pasting tokens in the parser is not reliable because of white-space: e.g., the token sequence : and = could stand for := (if := weren't a token by itself), but also for : = with a space in between.) With this trick, we can handle ^ followed by a single letter or underscore. The fact that this doesn't cause any conflicts in the bison parser tells us that this method works.
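The two-token scheme can be illustrated with a small Python sketch. This is not GPC's actual lexer; the function name lex_caret, the tuple representation of tokens and the rule that a letter followed by further identifier characters is left alone are assumptions made for the illustration. Only the token name LEX_CARET_LETTER comes from the text above.

```python
import string

# Characters that may start / continue an identifier (ASCII only, an assumption).
IDENT_START = set(string.ascii_letters + "_")
IDENT_CHARS = IDENT_START | set(string.digits)


def lex_caret(src, pos):
    """Lex the '^' at src[pos]; return (tokens, position after them).

    A token is a (name, semantic_value) pair.
    """
    if src[pos] != "^":
        raise ValueError("expected '^' at position %d" % pos)
    tokens = [("^", None)]
    nxt = src[pos + 1:pos + 2]    # "" at end of input
    after = src[pos + 2:pos + 3]  # "" at end of input
    # Emit LEX_CARET_LETTER only for a single letter or underscore that
    # directly follows the caret (no white-space in between).
    if nxt in IDENT_START and after not in IDENT_CHARS:
        tokens.append(("LEX_CARET_LETTER", nxt))
        return tokens, pos + 2
    return tokens, pos + 1
```

For the input ^X; this yields the pair ^, LEX_CARET_LETTER, which a parser could accept either as a pointer to X or as a one-character string constant. For ^ X (with a space) only the plain ^ is produced, so no pasting can happen.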
However, BP even allows any other character after ^ as a char constant. E.g., ^) could be a pointer dereference after an expression and followed by a closing parenthesis, or the character i (sic!).
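The text does not spell out how the value of such a constant is computed. The following Python sketch assumes the rule is "upper-case an ASCII letter, then flip bit 6 of the character code", which is at least consistent with ^M being a carriage return and with ^) yielding i. The function name bp_caret_value is hypothetical.

```python
def bp_caret_value(c):
    """Value of the constant ^c under the ASSUMED rule described above."""
    if len(c) != 1:
        raise ValueError("expected exactly one character")
    # Assumption: lower-case ASCII letters behave like their upper-case form.
    if "a" <= c <= "z":
        c = c.upper()
    # Assumption: the value is the character code with bit 6 (0x40) flipped.
    return chr(ord(c) ^ 0x40)
```

Under this assumption bp_caret_value("M") is the carriage return character and bp_caret_value(")") is the letter i, matching the example in the text.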
Some characters are unproblematic because they can never occur after a ^ in its regular meaning, so the sequence can be lexed as a char constant directly. These are all characters that are not part of any Pascal tokens at all (which includes all control characters except white-space, all non-ASCII characters and the characters ! & % ? \ ` | ~ and }; the last one occurs at the end of comments, but within a comment this issue doesn't occur anyway) and those characters that can only start constants, because a constant can never follow a ^ in Pascal (which are # $ ' " and the digits).
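The two groups of unproblematic characters can be written down as a predicate. This is a minimal Python sketch, not GPC code; the function name and the exact set of control characters counted as white-space (tab, line feed, vertical tab, form feed, carriage return) are assumptions.

```python
# Characters that are not part of any Pascal token (from the list above).
NOT_IN_ANY_TOKEN = set("!&%?\\`|~}")
# Characters that can only start a constant (from the list above).
ONLY_STARTS_CONSTANT = set("#$'\"") | set("0123456789")
# Assumption: these control characters count as white-space.
WHITE_CONTROL = set("\t\n\v\f\r")


def is_direct_char_constant(c):
    """True if ^c can be lexed as a char constant without parser help."""
    code = ord(c)
    if code > 127:                      # non-ASCII character
        return True
    if code < 32 or code == 127:        # control character
        return c not in WHITE_CONTROL
    return c in NOT_IN_ANY_TOKEN or c in ONLY_STARTS_CONSTANT
```

Characters such as ) or X are rejected by this predicate because they can legitimately follow a ^ in its regular meaning, so they need one of the other mechanisms.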
For ^ followed by white-space, we return the token LEX_CARET_WHITE, which the parser accepts either as a string constant or as equivalent to ^ (because in the regular meaning, the white-space is meaningless).
If ^ is followed by one of the tokens , . : ; ( ) [ ] + - * / < = > @ ^ the lexer just returns the tokens regularly, and the parser accepts these sequences as a char constant (besides the normal meaning of the tokens). (Again, since white-space after ^ is already dealt with, this token pasting works here.)
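The white-space case and the single-character token case can be summarised as a small decision function. This Python sketch only labels which mechanism applies to the character after ^; the function name, the returned labels (other than LEX_CARET_WHITE) and the set of white-space characters are assumptions for illustration.

```python
# Single-character tokens after which the parser pastes ^ and the token.
PASTED_IN_PARSER = set(",.:;()[]+-*/<=>@^")
# Assumption: these characters count as white-space.
WHITE = set(" \t\n\v\f\r")


def caret_followup(c):
    """Describe how '^' followed by the single character c is handled."""
    if c in WHITE:
        # One token; the parser reads it as a string constant or as a plain ^.
        return "LEX_CARET_WHITE"
    if c in PASTED_IN_PARSER:
        # Two ordinary tokens; the parser also accepts the pair as a constant.
        return "regular tokens"
    return "other mechanism"
```

Because white-space is caught first, a pair of regular tokens reaching the parser is known to have been adjacent in the source, which is what makes the pasting safe.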
But ^ can also be followed by a multi-character alphanumeric sequence such as ^cto, which might be read as ^ cto or ^c to (since BP also allows omitting white-space after constants), or by a multi-character token such as ^<=, which could be ^ <= or ^< =. Both could be solved with extra tokens, e.g. lexing ^<= as ^, LEX_CARET_LESS, = and accepting ^, LEX_CARET_LESS in the parser as a string constant and LEX_CARET_LESS, = as equivalent to <= (relying on the fact that the lexer doesn't produce LEX_CARET_LESS if there's white-space after the <, because then the simple ^, < will work, which justifies the token pasting once again). This has not been done yet (in the alphanumeric case, this might add a lot of special tokens because of keywords etc., and it's doubtful whether that's worth it).
Finally, we have ^{ and ^(*. This is so incredibly stupid (e.g., think of the construct type c = Integer; foo = ^{ .. ^|; bar = {} c; which would become ambiguous then) that perhaps we should not attempt to handle this ...
(As a side note, BP itself doesn't handle ^ character constants in many situations, including many that GPC does handle with the mechanisms described above, which is probably the clearest sign of a design bug. But if we support them at all, we might just as well do it better than BP ... :-)