Extended Pascal allows arbitrary expressions as the lower bounds of subrange types. This leads to some following parsing difficulties:
type a = (expr1) .. expr2;
(if expr1
starts with an identifier) vs.
type a = (enum1, enum2);
If the enum type contains at least two items, we get no real conflicts, but what the bison manual calls “mystery conflicts”, i.e. our grammer is LR(1), but not LALR(1) which bison requires, Mystery Conflicts.
Our solution is the one suggested in the bison manual, to add a
bogus rule. For that we add a new terminal BOGUS
which is
never used and a new nonterminal conflict_id
which contains
the identifiers that are responsible for the six conflicts.
It gets more difficult if the enum type has only one item, i.e.:
type a = (enum1);
If further expr1
consists of a single identifier, the
conflict cannot be resolved without reading the token following the
right parenthesis. (This is inherent in the EP language.)
But already after reading the identifier (expr1
or
enum1
), our parser has to decide whether to reduce it to an
expression or to keep it as an identifier. (The alternative would be
to define an expression which is anything but a single identifier,
and parse (
identifier)
as a distinct thing, but this
would get quite hairy.)
We resolve it with a lexer hack. The lexer turns a right parenthesis
which is followed by ..
into the new token LEX_RPAR
.
Most places in the parser treat LEX_RPAR
and )
as
equivalent (nonterminal rpar
). However, enum types allow only
)
(they can never be followed by ..
), and the lower
bound of a subrange allows only LEX_RPAR
(it is always
followed by ..
). This resolves the conflict.
But there are more conflicts if the lower bound is not enclosed in parentheses:
type a = Foo (42) .. expr2;
(where Foo
can be one of certain built-in functions such as
Sqr
, or a type name for a type-cast) vs.
type a = Bar (42);
(where Bar
is an undiscriminated schema type).
To resolve this, we let the lexer return a special token
LEX_SCHEMA
for identifiers which correspond to
undiscriminated schema types. The parser accepts this token in
new_identifier
(so schema identifiers can be redefined) and
typename
(e.g. for parameters), but not in id
(which
appears in expressions) where undiscriminated schema types are
invalid.
The last conflict:
type a = @Foo = (expr) .. expr2;
(where @
is the BP address operator – the
= (
expr)
is there to create an ordinal (namely,
Boolean) expression that starts with the address operator) vs.
type a = @Bar = (expr);
(where @
is a lexical alternative for ^
, according to
the standards).
The conflict arises already with the @
token. The =
(as a comparison operator in the first case, and for a type
initializer – EP extension, combined with a BP extension of using
=
instead of value
) just adds to the problem. Since
expr can be arbitrary long, the conflict is in fact not
solvable with any fixed number of lookup tokens.
This conflict is decided using parser precedence rules, in favour of the latter interpretation. (BP itself can't parse the supposedly BP compatible former syntax.)
Another parsing problem is the innocent-looking (BP compatible)
New
call for objects with constructors:
New (ObjectPointer, ConstructorCall (Arguments))
This conflicts with:
New (SchemaPointer, FunctionCall (Arguments))
The same goes for Dispose
with destructors. Since the
constructor name is used without naming the object type, it needs
special handling. Trying to solve it using
pascal_shadow_record_fields
(which is used to handle
with
statements and the implicit Self
parameter in
object methods), fails for two reasons: First, only constructors (or
destructors, respectively), no other methods, must be recognized.
This could be solved with a special flag to
pascal_shadow_record_fields
.
But secondly, the constructor must be recognized only in the first
position of the second argument (see fjf915[ab].pas
for some
evil examples). This means the same identifier (here: Init
)
can have a different meaning in different positions of the same
statement. This is completely contrary to the usual Pascal scoping
rules, but actually BP's behaviour.
Since the two forms of New
shown above look completely the
same syntactically (the difference is given by the type, i.e. the
semantic value, of the first argument of New
), we can't
distinguish them during parsing without further tricks.
Using a GLR parser, the %merge
feature looked promising, but
it also does not help here since it first evaluates both
alternatives and then merges them (in a user-defined way, which
would be fine since it could look at the type). But trying to
evaluate the respective wrong alternative would already produce
error messages which can't be undone later.
So we resort to a lexical tie-in. After parsing the first argument
(start_of_new
), we store it in the global variable
current_structor_object_type
which the lexer recognizes and
returns the special token LEX_STRUCTOR
for the first
following occurrence of an identifier that denotes a
constructor/destructor of this object type. This token is then
required in the syntax rules for an object New
call with a
constructor (whereas an object New
call without constructor
is parsed the same way as a non-object New
call).
To avoid mis-parsing, we accept LEX_STRUCTOR
nowhere else in
the grammar, and in the rule for New
call with further
arguments which are not a constructor, we verify that the first
argument is no object (otherwise something like New
(ObjectPointer, SomeFunction)
could be accepted syntactially and
cause havoc in the further stages).
This method has all the usual drawbacks of lexer tie-ins, in
particular it would break if the parser had already read ahead too
far when the first argument of New
is handled (which is in
principle not impossible with GLR). On the other hand, that's the
same when the lexer looks at declarations, which include identifiers
made available by with
, so this new kludge is at least no
worse than a with
based alternative (if the latter would work
at all).