Node: Parsing subranges, Previous: Parsing keywords, Up: Language definition



Expressions as lower bounds of subranges

Extended Pascal allows arbitrary expressions as the lower bounds of subrange types. This leads to the following parsing difficulties:

     type
       a = (expr1) .. expr2;

(if expr1 starts with an identifier) vs.

     type
       a = (enum1, enum2);

If the enum type contains at least two items, we get no real conflicts, but what the bison manual calls “mystery conflicts”, i.e. our grammar is LR(1), but not LALR(1), which bison requires (see the node “Mystery Conflicts” in the bison manual).

Our solution is the one suggested in the bison manual: adding a bogus rule. For that we add a new terminal BOGUS, which is never used, and a new nonterminal conflict_id, which contains the identifiers that are responsible for the six conflicts.
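The bison manual's trick works by adding a rule that can never match. A hypothetical sketch of how this might look in the grammar (the actual rule names and placement in GPC's parser may differ):

```yacc
%token BOGUS  /* never returned by the lexer */

/* conflict_id collects the identifiers responsible for the
   conflicts.  The second alternative can never match, since the
   lexer never returns BOGUS, but its mere presence makes bison
   keep the conflicting contexts in separate states, so the
   LR(1)-but-not-LALR(1) "mystery conflict" disappears. */
conflict_id:
     identifier
   | identifier BOGUS
   ;
```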

It gets more difficult if the enum type has only one item, i.e.:

     type
       a = (enum1);

If, furthermore, expr1 consists of a single identifier, the conflict cannot be resolved without reading the token that follows the right parenthesis. (This is inherent in the EP language.)

But already after reading the identifier (expr1 or enum1), our parser has to decide whether to reduce it to an expression or to keep it as an identifier. (The alternative would be to define an expression which is anything but a single identifier, and parse (identifier) as a distinct thing, but this would get quite hairy.)

We resolve it with a lexer hack. The lexer turns a right parenthesis which is followed by .. into the new token LEX_RPAR. Most places in the parser treat LEX_RPAR and ) as equivalent (nonterminal rpar). However, enum types allow only ) (they can never be followed by ..), and the lower bound of a subrange allows only LEX_RPAR (it is always followed by ..). This resolves the conflict.

But there are more conflicts if the lower bound is not enclosed in parentheses:

     type
       a = Foo (42) .. expr2;

(where Foo can be one of certain built-in functions such as Sqr, or a type name for a type-cast) vs.

     type
       a = Bar (42);

(where Bar is an undiscriminated schema type).

To resolve this, we let the lexer return a special token LEX_SCHEMA for identifiers which correspond to undiscriminated schema types. The parser accepts this token in new_identifier (so schema identifiers can be redefined) and typename (e.g. for parameters), but not in id (which appears in expressions) where undiscriminated schema types are invalid.

The last conflict:

     type
       a = @Foo = (expr) .. expr2;

(where @ is the BP address operator – the = (expr) is there to create an ordinal (namely, Boolean) expression that starts with the address operator) vs.

     type
       a = @Bar = (expr);

(where @ is a lexical alternative for ^, according to the standards).

The conflict arises already with the @ token. The = (as a comparison operator in the first case, and for a type initializer – an EP extension, combined with the BP extension of using = instead of value) just adds to the problem. Since expr can be arbitrarily long, the conflict is in fact not solvable with any fixed number of lookahead tokens.

This conflict is decided using parser precedence rules, in favour of the latter interpretation. (BP itself can't parse the supposedly BP-compatible former syntax.)
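One mechanism bison offers for deciding between two such readings is the %dprec declaration of its GLR mode. The following hypothetical fragment (not GPC's actual grammar; the rule names are invented) illustrates expressing a preference for the pointer-type reading:

```yacc
%glr-parser

type_denoter:
     '@' type_name %dprec 2     /* '@' as lexical alternative for '^':
                                   the preferred reading */
   | subrange_type %dprec 1     /* subrange whose lower bound starts
                                   with the address operator */
   ;
```

When both alternatives can parse the same input, the GLR parser keeps the one with the higher %dprec value and discards the other.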

Another parsing problem is the innocent-looking (BP compatible) New call for objects with constructors:

     New (ObjectPointer, ConstructorCall (Arguments))

This conflicts with:

     New (SchemaPointer, FunctionCall (Arguments))

The same goes for Dispose with destructors. Since the constructor name is used without naming the object type, it needs special handling. Trying to solve this using pascal_shadow_record_fields (which is used to handle with statements and the implicit Self parameter in object methods) fails for two reasons: First, only constructors (or destructors, respectively) must be recognized, not other methods. This could be solved with a special flag to pascal_shadow_record_fields.

But secondly, the constructor must be recognized only in the first position of the second argument (see fjf915[ab].pas for some evil examples). This means the same identifier (here: Init) can have a different meaning in different positions of the same statement. This is completely contrary to the usual Pascal scoping rules, but actually BP's behaviour.

Since the two forms of New shown above look completely the same syntactically (the difference is given by the type, i.e. the semantic value, of the first argument of New), we can't distinguish them during parsing without further tricks.

Using a GLR parser, the %merge feature looked promising, but it does not help here either, since it first evaluates both alternatives and then merges them (in a user-defined way, which would be fine, since the merge function could look at the type). But evaluating the respective wrong alternative would already produce error messages which can't be undone later.

So we resort to a lexical tie-in. After parsing the first argument (start_of_new), we store it in the global variable current_structor_object_type which the lexer recognizes and returns the special token LEX_STRUCTOR for the first following occurrence of an identifier that denotes a constructor/destructor of this object type. This token is then required in the syntax rules for an object New call with a constructor (whereas an object New call without constructor is parsed the same way as a non-object New call).

To avoid mis-parsing, we accept LEX_STRUCTOR nowhere else in the grammar, and in the rule for a New call with further arguments which are not a constructor, we verify that the first argument is not an object (otherwise something like New (ObjectPointer, SomeFunction) could be accepted syntactically and cause havoc in the later stages).

This method has all the usual drawbacks of lexical tie-ins; in particular, it would break if the parser had already read too far ahead when the first argument of New is handled (which is, in principle, possible with GLR). On the other hand, the same applies when the lexer looks at declarations, which include identifiers made available by with, so this new kludge is at least no worse than a with-based alternative (if the latter would work at all).