Parsing Prolog with ANTLR


Introduction

Parsing Prolog terms and theories is the purpose of the parser-core and parser-theory modules, respectively. Both modules rely on ANTLR, behind the scenes, for parsing strings, but they only expose an ANTLR-agnostic public API. Users of 2P-Kt only need to rely on types and methods from these modules to parse their Prolog strings. In principle, if the ANTLR technology is replaced in the future, the public API of the parser-core and parser-theory modules should not change at all.

ANTLR is a parser generator capable of producing Java or JavaScript (JS) code, i.e., platform-specific sources. Any parser generated by ANTLR requires a runtime library to actually function in production. There are many platform-specific runtime libraries (or simply "runtimes", henceforth). Of course, Java-based parsers require the Java runtime (that is, a Java library), whereas JS-based parsers require the JS runtime (that is, a JS library).

To handle such complexity, we introduced two more modules, namely, parser-jvm and parser-js. They are two Kotlin platform-specific projects. In particular, parser-jvm is a Kotlin/JVM project, whereas parser-js is a Kotlin/JS project. They are to be considered implementation modules and they must only depend on their respective ANTLR runtime. Accordingly, they must NOT depend on any other module in 2P-Kt.


ANTLR Grammar for Prolog

The core of the ANTLR grammar is the expression rule. Its purpose is to parse an expression, such as 3 + 4 * 5, while respecting operator associativity and priority. Prolog has an internal representation of priority: a number in the range between 0 and 1200, where 0 identifies the highest priority.

When parsing an expression, we first have to determine which of the following operator specifiers we are facing: XFY, YFX, YF, XF, XFX, FX, or FY. The action to perform differs for each case. With YFX, for instance, we need to memorize the left term, then parse the operator, making sure it is not among the disabled operators (for example, when parsing a list whose elements are separated by ',', the ',' operator must be disabled, because in that context it identifies the separation between elements), and finally memorize the right expressions (there can be more than one). In this case, the right expressions must have a higher priority, so we parse them with a lower priority level, as mentioned above. An excerpt of the expression rule of the grammar follows.

expression[int priority, String[] disabled]   // plus some local utility variables
    : ( left=term
            ( operators+=op[YFX] right+=expression[$op.priority - 1, $disabled]
                (operators+=op[YFX] right+=expression[$op.priority - 1, $disabled])*
            | operators+=op[XFY] right+=expression[$op.priority, $disabled]
                (operators+=op[XFY] right+=expression[$op.priority, $disabled])*
            | operators+=op[XFX] right+=expression[$op.priority - 1, $disabled]
            | operators+=op[YF] (operators+=op[YF])*
            | operators+=op[XF]
            )?
      | operators+=op[FX] right+=expression[$op.priority - 1, $disabled]
      | operators+=op[FY] (operators+=op[FY])* right+=expression[$op.priority, $disabled]
      ) (outers+=outer[$priority, $bottom, $disabled])*
    ;

Looking at the last line of the rule, we can see an "outers" accumulator and an "outer" rule. The best way to understand why this rule was introduced is an example. All the cases above treat the right expression as a higher-priority term (except XFY and FY, which treat it as having the same priority), so in all those cases the priority level decreases: the expression rule is invoked with priority - 1. But what happens with an expression like 3 + 4 * 5 - 1? We start with a left expression at a high priority level (i.e., low priority), 3 + expression; we then descend to a lower priority level (i.e., higher priority) for left * expression. At that point, however, we meet an expression at a higher priority level (i.e., lower priority), left - expression, and we have to handle this increase of the priority level. To do so, we introduce the "outer" rule, in which we memorize the expressions that go back up in priority level. The parse tree of this expression thus becomes:

.
+-- EOF
+-- singletonExpr
    |   +-- "3"
    |   +-- "+"
    +-- expr
        | +-- "4"
        | +-- "*"
        | +-- "5"
        +-- outer
            | +-- "-"
            | +-- "1"

We can see how the expression that goes back up in priority level is turned into an "outer" branch, which preserves all the original information. Nevertheless, this is not the correct parse tree: the '-' operation should be the outermost one. The correct corresponding term is '-'('+'(3, '*'(4, 5)), 1). To obtain it, all outer branches are rotated in the order in which they appear. The steps are:

  • We build the tree up to the outer branch: '+'(3, '*'(4, 5)).
  • We parse the outer branch: '-'(X, 1).
  • We substitute the previously built expression for X, obtaining the tree '-'('+'(3, '*'(4, 5)), 1). The same steps are repeated when there is more than one outer branch, as sketched below.
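
The following Kotlin sketch illustrates this rotation; the names (Struct, Outer, foldOuters) are hypothetical and only serve to show how each outer branch is folded, in order of appearance, on top of the expression built so far.

// A minimal sketch (hypothetical names) of the rotation of outer branches:
// each outer branch carries an operator and its right operand, and is applied
// on top of the expression built so far.
data class Struct(val functor: String, val args: List<Struct> = emptyList()) {
    override fun toString(): String =
        if (args.isEmpty()) functor else "'$functor'(${args.joinToString(", ")})"
}

data class Outer(val operator: String, val right: Struct)

// Folds the outer branches, left to right, around the base expression.
fun foldOuters(base: Struct, outers: List<Outer>): Struct =
    outers.fold(base) { acc, outer -> Struct(outer.operator, listOf(acc, outer.right)) }

fun main() {
    // base expression: '+'(3, '*'(4, 5)); outer branch: "- 1"
    val base = Struct("+", listOf(Struct("3"), Struct("*", listOf(Struct("4"), Struct("5")))))
    println(foldOuters(base, listOf(Outer("-", Struct("1")))))  // '-'('+'(3, '*'(4, 5)), 1)
}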

parser-jvm

JVM-specific module whose aim is to generate a JVM-specific parser with ANTLR. It depends only on the ANTLR 4 runtime and the JDK libraries.

When ANTLR processes the grammar for the JVM target, it generates:

  • PrologLexer.java: contains the token-recognition logic, all the lexer support methods and other minor features.
  • PrologParser.java: contains the main parser features, plus a Context class for every node/rule of the grammar.
  • PrologParserBaseVisitor: implements the PrologParserVisitor interface and provides a default visit method for every node and context; it must be extended to add the desired semantic actions.
  • PrologParserBaseListener: implements the PrologParserListener interface and provides callbacks for the events occurring during parsing (enter/exit rule, etc.); it works like an event handler.

To implement the semantics in the right way, it was necessary to define custom methods and attach them to the generated lexer and parser. To do so, as introduced in the ANTLR paragraph, we created a class for each of them (DynamicLexer for the lexer, DynamicParser for the parser) implementing the dynamic features they need. For example, DynamicLexer memorizes the operators found so far and defines support methods such as isOperator and addOperators; DynamicParser must also track each operator's associativity and priority, so it needs several support methods to handle every case. With the option { superClass = DynamicParser } in PrologParser.g4 we specify that the generated PrologParser.java extends DynamicParser.
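
To give an idea of this mechanism, here is a minimal Kotlin sketch of a DynamicLexer-like base class; the actual class in parser-jvm is written in Java and its members may differ, so the name and signatures below are only assumptions illustrating the operator bookkeeping described above.

import org.antlr.v4.runtime.CharStream
import org.antlr.v4.runtime.Lexer

// Hypothetical sketch: a lexer base class memorizing the operators seen so far,
// so that the generated lexer (via the superClass option) can query them.
abstract class DynamicLexerSketch(input: CharStream) : Lexer(input) {
    private val operators = mutableSetOf<String>()

    fun addOperators(vararg ops: String) {
        operators += ops
    }

    fun isOperator(symbol: String): Boolean = symbol in operators
}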

parser-js

JS-specific module whose aim is to generate a JS-specific parser with ANTLR. It depends only on the ANTLR 4 JS runtime and JS features.

To give dynamic behavior to PrologLexer in parser-jvm, as mentioned above, a DynamicLexer.java was created and made the superclass of PrologLexer.java (thanks to the superClass ANTLR option). Unfortunately, the same class-based extension mechanism is not available here: the code generated by ANTLR for the JS target is prototype-based. JavaScript is a prototype-based language, meaning object properties and methods can be shared through generalized objects that can be cloned and extended. This is known as prototype inheritance and differs from class inheritance. To tackle this problem, the solution is:

  • Create a DynamicLexer.js that contains a constructor (DynamicLexer) and implements all the behavior that must be added to PrologLexer.
  • Use the @header annotation (in PrologLexer.g4) to "import" this constructor into the generated PrologLexer (@header {var DynamicLexer = require("./DynamicLexer").DynamicLexer}).
  • Use the @members annotation (in PrologLexer.g4) to establish the prototype inheritance between PrologLexer and DynamicLexer through the call method: @members {DynamicLexer.call(this);}. This means that this code is injected into the PrologLexer constructor, so that all the functionality defined in DynamicLexer can be used from PrologLexer (just like a class extension).

The same procedure was used to develop PrologParser (with DynamicParser). After that, JS-specific tests were implemented to ensure proper functionality (which required copying the JS test files into the right build directory). While this was enough for the JVM module, the JS module still lacks an important feature: how do we map the JS-specific sources to the common world? Kotlin helps us with external class mapping, in a few simple steps (a sketch follows this list):

  • Export the necessary JS modules (for example, exports.Associativity = Associativity;).
  • Create a Kotlin file with the header @file:JsModule("./Associativity") (followed by @file:JsNonModule).
  • Create classes/vals/funs with the same signatures as their JS counterparts, marked with the "external" keyword (for example, external enum class Associativity). Now these declarations can be used from the Kotlin code of the JS module. It was necessary to create PrologParser.kt, PrologLexer.kt, Associativity.kt, StringType.kt, PrologParserVisitor.kt, plus external declarations for most of the antlr4 runtime sources.
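
For instance, the Kotlin side of this mapping might look roughly as follows; this is only a sketch of an external declaration file for the generated lexer, and the member signatures are assumptions mirroring the behavior added via DynamicLexer, not the actual generated API.

@file:JsModule("./PrologLexer")
@file:JsNonModule

// Sketch only: hypothetical external declarations letting Kotlin/JS code
// call into the JS module generated by ANTLR.
external class PrologLexer(input: dynamic) {
    fun addOperators(vararg operators: String)
    fun isOperator(symbol: String): Boolean
}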

parser-core

Bridge-like module whose aim is to link the platform-specific world to the common world. See "Rationale and Architecture/parsing.md".

It was necessary to define common methods in commonMain: the TermParser interface was introduced (with parse extension methods on the main project classes, and also on Kotlin's String/Integer classes, to simplify parsing). This interface had to be implemented in both platform-specific submodules (jvm and js), so expect methods were defined (each needing an actual counterpart in both the jvm and js submodules) to simplify the creation of a TermParser. After that, common tests were prepared, with the aim of running them in both the jvm and js submodules. Since parser-core/common does not depend on any platform, there were complications: JUnit cannot be used. To overcome this inconvenience, two support classes were developed: one to simplify the assertion that a given string parses to the expected term, the other to provide an exhaustive test set (written with the Prolog DSL developed in Kotlin in the dslX modules).
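
As a rough illustration of this setup, here is a minimal sketch of the expect/actual pattern described above; the names (TermParser, termParser, parseAsTerm) are hypothetical, and only the Term type is assumed to come from the core module.

import it.unibo.tuprolog.core.Term

// commonMain (hypothetical names): an ANTLR-agnostic entry point for parsing.
interface TermParser {
    fun parseTerm(input: String): Term
}

// Each platform-specific submodule provides an actual implementation,
// backed by the ANTLR parser generated for that platform.
expect fun termParser(): TermParser

// Extension method simplifying parsing directly from strings.
fun String.parseAsTerm(): Term = termParser().parseTerm(this)

// jvmMain / jsMain (sketch):
// actual fun termParser(): TermParser = TermParserImpl()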

parser-core/jvm

Platform-specific submodule for the JVM. Building on the parser-jvm module, it was necessary to extend PrologParserBaseVisitor and PrologParserBaseListener, and to simplify parser creation:

  • Implemented PrologExpressionVisitor, which extends PrologParserBaseVisitor and maps parse trees onto the structures of the core project, such as List or Scope (which takes care of memory management).
  • Added the PrologParserFactory interface in jvmMain, with parse methods and the Singleton pattern.
  • Implemented PrologParserFactory with a singleton Impl and Kotlin's methods and structures: this class is very useful to simplify the initialization of parsing.
  • Implemented DynamicOpListener, which extends PrologParserBaseListener: it adds the exitClause functionality.

As a result, to trigger the parsing (in TermParserImpl) we just need to do:
PrologParserFactory().parseExpression(input, operators).accept(PrologExpressionVisitor())
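
For instance, the JVM-side TermParserImpl could be imagined roughly as follows; this is a hypothetical sketch rather than the actual implementation, assuming the core module's Term and OperatorSet types and the hypothetical TermParser interface sketched earlier.

import it.unibo.tuprolog.core.Term
import it.unibo.tuprolog.core.operators.OperatorSet

// Hypothetical sketch: delegates to the ANTLR-backed factory and visitor above.
class TermParserImpl(private val operators: OperatorSet) : TermParser {
    override fun parseTerm(input: String): Term =
        PrologParserFactory().parseExpression(input, operators).accept(PrologExpressionVisitor())
}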

parser-core/js

Platform-specific submodule for JS. Building on the parser-js module, it was necessary to extend PrologParserVisitor.kt to implement the required semantics. Then, like parser-core/jvm, a PrologParserFactory was developed to simplify parser creation and parsing. Since in the parser-js module the ANTLR grammar does not generate a listener, DynamicOpListener was not developed.

  • PrologVisitor: extends PrologParserVisitor<Term> and implements the visit logic for each node.
  • PrologParserFactory, with a singleton Impl and Kotlin's methods and structures, to simplify the initialization of parsing.

Now, to trigger the parsing (in TermParserImpl) we just need to do:
PrologParserFactory().parseExpression(input, operators).accept(PrologVisitor())

Main issues:

  • The JS-specific code embedded in the .g4 files does not handle the ';' character correctly, probably due to an ANTLR 4 issue. The workaround is to use ';;' instead.
  • The JS-specific tests derived from commonTest gave problems: there is no way to add bare JS sources to the build folder (currently a Gradle issue), so to launch the tests it was necessary to copy/paste the JS sources into the build folder every time. [fixed]