1 Modelling and Parsing Free Word Order Languages: A Brief State of Affairs

1.1 The Challenge of Free Word Order

Since Kashket (1986)’s seminal contribution, developing models and parsing techniques for free word order languages has been an ongoing challenge for computational linguists. Whilst free word order phenomena are largely absent from modern Western languages such as English, they are frequent in ancient Indo-European languages such as Sanskrit (Schaufele 1991), Greek and Latin (Conrad 1965; Devine and Stephens 2006; Spevak 2010), in Finno-Ugric languages such as Hungarian (Kiss 1981) or Finnish (Kay and Karttunen 1984), but also in Australian (Kashket 1986; Austin 2001), Turkic (Hoffman 1995), and, to a certain extent, Slavic (Siewierska and Uhlirova 1998) and Germanic (Reape 1994) idioms. In morphologically rich languages, a certain level of word order freedom is generally present, which can range from a simple relaxation of linear ordering constraints to genuine non-configurationality. Additionally, loosening of common word order constraints is a frequent feature of literary, especially metrical, texts, in which prosodic, stylistic and expressive factors favor alternative and unusual word orderings.

At this point, it is worth mentioning that even the notion of free word order is, in itself, rather imprecise. Three different phenomena are generally qualified as such: (i) freedom in linear reordering of grammatical constituents, as in Today I walk vs. I walk today; (ii) discontinuous constituents that may span a whole sentence; this common feature of e.g. German can also be demonstrated with English phrasal verbs in sentences such as I checked this out; (iii) hyperbaton, i.e. interleaving of grammatical constituents, as frequently occurs for instance in Classical Latin: cetera labuntur celeri caelestia motu (‘the other heavenly [bodies] move quickly’, litt. ‘the-other move quick heavenly movement’). This last and, in most respects, most complex phenomenon produces crossing dependencies between constituents. Classical Latin, which provides innumerable examples of this, will serve as a reference for further investigation, but similar patterns can also be exhibited in Ancient Greek, Sanskrit, Old Norse, Slavic and Finno-Ugric languages, among others.

Context-free grammars (CFGs), introduced by Noam Chomsky in the 1950s, can be considered the de facto baseline of most generative grammar formalisms in both computer science and linguistics. Nevertheless, CFGs, unlike many dependency grammar formalisms, turned out to be unable to describe certain syntactic phenomena occurring in the grammar of natural languages, especially those involving free constituent order or discontinuous constituents. These limitations fostered the development of new, non context-free formalisms better suited to describe natural language: indexed grammars (Aho 1968), immediate dominance/linear precedence grammars (ID/LP) (Pullum 1982; Shieber 1984), tree-adjoining grammars (TAG) (Vijayashanker and Joshi 1988), parallel multiple context-free grammars (PMCFG) (Seki et al. 1991), affix grammars over a fixed lattice (AGFL) (Koster 1991), positive range concatenation grammars (PRCG) (Boullier 1998), among others. Following Chomsky (1956), these formalisms can all be classified as Type-1 grammars, and the languages they generate are generally referred to as context-sensitive. Most of the effort focussed on the development of so-called mildly context-sensitive formalisms. A complete survey of the most common non context-free formalisms and their use in computational linguistics can be found in Kallmeyer (2010).

With rather strict word order languages accounting for a significant part of the available digital corpora and potential application fields, computational linguists, many of whom are native speakers of one of these idioms, may have been tempted to address the grammatical modelling of free word order languages with tools chiefly designed to describe English or similar languages. These tools rarely integrate an operator allowing for arbitrary constituent order, let alone for interleaving, since such operators come with a high computational cost that can and must be avoided when parsing fixed word order languages. This is especially true as regards multilingual parsing, translation or text generation systems that would have added support for some of the above languages at a later stage of their development. Of all the grammatical formalisms described above, only ID/LP can easily encode free constituent order, but it provides no support for discontinuous constituents or hyperbaton.

Note that this paper does not make a theoretical claim that none of the existing mildly context-sensitive formalisms is expressive enough, from a theoretical viewpoint, to encode free word order phenomena observed in natural languages. There are in fact good reasons to think that some of them are. In practice, if we assume that interleaving phenomena always have a finite depth, we can encode hyperbatic phenomena through a finite, yet exponential, number of context-free rules; recent theoretical results (Ho 2018) have shown that even without a finite-depth assumption, hyperbaton without copy is still mildly context-sensitive. What this paper does observe, however, is that we lack a general grammar description framework with built-in support for free word order phenomena, in which describing e.g. Classical Latin syntax requires neither an exponential inflation in the number of rules compared to the fixed word order case nor a complex conversion process. We lack a framework that would allow us to describe free word order syntax as linguists or grammarians would do, e.g. by defining single attachment rules that do not necessarily impose ordering constraints.

Early attempts to design grammatical formalisms for free word order languages have not led to the development of general-purpose tools; nor were they designed to provide cross-lingual interoperability with fixed word order languages. Covington (1990)’s approach, whose applications to parsing a “tiny subset of Latin” were explored by Koch (1993), relies on dependency rather than phrase structure grammar, which both authors consider less suited to addressing free word order phenomena. Dependency- and constraints-based methods have also been implemented by Bharati and Sangal (1993) for Indian languages, building on notions from Pāṇinian grammar. Though underlying dependency relations between words are indeed the real issue when describing the syntax of free word order languages, we do not believe that this point of view should be deemed irreconcilable with the traditional structured approaches to grammar writing, which involve clear-cut constituents.

We are indeed looking for a formalism that would allow us to conveniently describe the syntax of free word order languages, and that could be used to produce wide-coverage, modular grammars in the style of the Resource Grammar Library (Ranta et al. 2009). In addition to providing native support for free word order languages, the new framework would still be able to encode standard fixed order rules; ideally, it would be built as a “free word order extension” of an existing framework, in order to capitalize on past efforts and guarantee compatibility with existing fixed word order grammars. A new formalism fulfilling these requirements, which we will introduce and study in Sect. 2, is called Interleave-Disjunction-Lock parallel multiple context-free grammars or IDL-PMCFG.

Another essential factor to take into account when designing a grammatical formalism is its suitability for practical implementation of wide-coverage grammars. One way to ensure that users can easily define and use their own grammar models is to provide a complete front-end syntax for grammatical description in the form of a special-purpose programming language. In this regard, we built on Ranta (2011)’s Grammatical Framework (GF) and Nederhof and Satta (2004)’s IDL expressions to elaborate our own grammar description system, COMPĀ, whose syntax extends so-called context-free GF (Ljunglöf 2004) with some new operators to encode interleaving, disjunction and locking of constituents. High-level COMPĀ code is compiled into a low-level IDL-PMCF grammar that can be used directly for parsing. COMPĀ and its compiler are introduced in Sect. 3; the parsing algorithm itself is presented and studied in Sect. 4.

Before we proceed with the description of our formalism, a short look at precise linguistic facts behind extensive free word order can help us identify the exact features we are looking for.

1.2 Towards a Natural Account of Free Word Order Syntax: The Case of Classical Latin

A language with considerable freedom of word order, Classical Latin presents many syntactic phenomena alien to most modern Western European languages. By looking at a few typical aspects of Latin syntax, we shall see in this section which kind of features our desired framework should have in order to be able to concisely encode the syntactic phenomena at play in free word order languages in general, and in Classical Latin in particular.

1.2.1 Hyperbaton and Interleaved Constituents

As Devine and Stephens (2006) put it, “[p]hrasal discontinuity, traditionally called hyperbaton in Classical studies, is perhaps the most distinctively alien feature of Latin word order”. Hyperbaton is a very general, transcategorial phenomenon that can occur whenever a syntactic constituent is non-contiguous. Danckaert (2017) emphasizes that modern research has shifted away from the opposition of regular vs. exceptional word orders as it is found for example in Marouzeau (1922); still, recent transformational approaches have relied on some kind of default word order to distinguish emphatic from non-emphatic word orders. This might be totally justified when pragmatic information is available, provided that, in the words of Devine and Stephens (2006), “[t]he syntax is massaged to provide for a simple and direct translation into a pragmatically structured meaning”. Unfortunately, such information is generally not available in usual parsing contexts.

In particular, the ante- or postposition of adjectives and genitive modifiers in Classical Latin does not obey general syntactic rules (Devine and Stephens 2006). Statistical patterns may vary from word to word, and rarely spread uniformly over whole semantic lexical categories. Not surprisingly, discontinuous adjective and genitive attachment represents an overwhelming majority of all instances of hyperbata. In verse, where discontinuity is the standard rather than the exception, Conrad (1965) has shown it to be a characteristic feature of a long Greco-Roman poetic tradition dating back to the oral tradition of Homeric times, influenced by the Roman taste for phenomena such as the clash of ictus and accent on the fourth foot of the hexameter. Latin poets made such an extensive use of the device that in Horace, we find stanzas with three crossing attribute dependencies.

In this context, there is no reason to deny hyperbaton its status as a standard, independent feature of Classical Latin; as parsing systems do not have access to pragmatic information, and since hyperbaton is extremely common even in simple prose, we need to be able to formulate general adjective attachment rules that, within the clause, relax all constraints on both linear order and intervention of other constituents.

1.2.2 Locking of Clauses and Prepositional Phrases

One seemingly absolute constraint on word reordering in Classical Latin concerns the impossibility of so-called ‘long hyperbata’ between finite clauses. ‘Long hyperbata’ are defined in Devine and Stephens (2006) as hyperbata that involve the extraction of a word from one clause to another; ‘short hyperbata’, on the other hand, are hyperbata that allow for interleaving words only within the bounds of a given clause. We must be able to express that finite clauses generally need to be ‘locked’, i.e. protected against interleaving with other clauses.

We only say ‘generally’, since verse texts provide well-known counter-examples to this rule, showing that mixing of material from different finite clauses was not altogether impossible in poetic contexts. Moreover, it must be noted that this general exclusion of long hyperbata in finite clauses does not generalize to non-finite (infinitive and participial) clauses, which can be freely interleaved.

Another important issue, especially in verse, is that of the position of the subordinator not at the beginning, but within the clause, which has been extensively studied by Marouzeau (1949). Bortolussi (2006) has emphasized the high occurrence frequency and expressive value of this so-called traiectio, which leads to the subordinator appearing (at least) second in the clause. Yet, an almost absolute rule that opposes rightward movement of subordinators is that a subordinator cannot stand last within the clause it introduces. Therefore, we still need to be able to restrict (linear) freedom of word order in certain cases.

Finally, another instance of locking with an additional constraint on word order occurs in the context of prepositional phrases: while all but one element of the prepositional phrase might be arbitrarily interleaved within the clause, at least one element (not necessarily the head) must be placed directly after the preposition. To account for this type of syntactic limitation, a combination of mostly free word order with targeted locking and linear constraints is again required.

1.2.3 Multiple Fields and Features

General-purpose grammar description systems such as Grammatical Framework (Ranta 2004) make extensive use of records and fields in order to store the various forms of a word or the parts of discontinuous grammatical constituents, or to handle reduplication phenomena. As our goal is to be able to describe Classical Latin syntax as generally as possible and we may want to keep some interoperability with existing frameworks, records and fields are required in practice.

There is, however, no obvious reason why we should require copying to be available in order to describe Classical Latin. Allowing copy in our framework can be desirable in order to account for specific syntactic phenomena in other natural languages (see below), because copying is a general phenomenon in language (Kobele 2006), or to preserve compatibility with existing tools such as Grammatical Framework. But this is a design decision independent of the specific characteristics of Latin syntax: its goal is not to stick closely to the formal requirements of Latin, but rather to preserve some general linguistic expressiveness. As it makes sense to think of a new framework as having to match the needs of free word order languages in general, and not of Classical Latin exclusively, we will want to allow copy operations in our formalism.

1.2.4 Summary

The above discussion suggests five characteristics that a grammatical formalism designed to enable a straightforward description of the syntax of free word order languages such as Classical Latin should have: operators to interleave grammatical constituents, to lock phrases and to restrict reorderings of constituents; a system of records and fields; and, finally, and maybe less importantly, support for copy operations.

Notations

We will use the following conventions:

  • \({\mathbb {N}}^+\) denotes \({\mathbb {N}}{\setminus }\left\{ 0 \right\} = \left\{ 1, 2, \dots \right\} \);

  • Symbol \(\varepsilon \) denotes the empty word, while \(\underline{\varepsilon }\) and \(\diamond \) (‘diamond’) are special symbols;

  • All alphabets \(\varSigma \) used in this paper are assumed not to contain the symbols \(\underline{\varepsilon }\) and \(\diamond \);

  • For \(\left( a,b\right) \in {\mathbb {N}}^2\), \([\![a,b]\!]\) denotes the set \(\left\{ a, a + 1, \dots , b - 1, b \right\} \) and \([\![a,b[\![\) the set \(\left\{ a, a + 1, \dots , b - 1 \right\} \);

  • For any set S, \({\mathscr {P}}\left( S\right) \) denotes the power set (set of subsets) of S and \({\mathscr {P}}_f\left( S\right) \) the set of finite subsets of S;

  • For any set S, \(\left|S \right|\) denotes the cardinality of S;

  • For any sets S, T, \(S \rightharpoonup T\) denotes a partial function from S to T;

  • For all \(f: S \rightharpoonup T\), \({\mathscr {D}}\left( f\right) \) denotes the domain of f;

  • Let \(\varSigma \) be an alphabet and \(t \in \varSigma ^*\) a word on this alphabet. Then \(\left|t \right|\) denotes the length of t. Furthermore, for all \(p \in {\mathscr {P}}\left( [\![1,\left|t \right|]\!]\right) \), \(t_p\) denotes the subword of t formed by extracting the symbols at positions p in t. For example, on \(\varSigma = \left\{ a, \dots , z \right\} \):

    $$\begin{aligned} alphabet _{\left\{ 1, 8 \right\} }&= at\\ alphabet _{\left\{ 2, 5, 6 \right\} }&= lab\\ alphabet _{\left\{ 6, 7, 8 \right\} }&= bet; \end{aligned}$$
  • Rules in grammars are written using the following functional notation

    $$\begin{aligned} A_1 \rightarrow \dots \rightarrow A_n \rightarrow B: a_1, \dots , a_n \mapsto e \end{aligned}$$

    which reads “given an item \(a_1\) of category \(A_1\), ..., an item \(a_n\) of category \(A_n\), an item of category B can be produced which is equal to expression e”; expression e depends on the current formalism but will usually contain instances of \(a_1\), ..., \(a_n\);
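As a quick sanity check on the subword-extraction notation \(t_p\), it can be mirrored in a few lines of Python (an illustrative sketch of our own; the function name is an assumption, not part of any tool discussed here):

```python
def subword(t, positions):
    """Extract the subword t_p: keep the symbols of t whose 1-based
    positions belong to the set p, preserving their original order."""
    return "".join(c for i, c in enumerate(t, start=1) if i in positions)

# The three examples given above:
print(subword("alphabet", {1, 8}))     # at
print(subword("alphabet", {2, 5, 6}))  # lab
print(subword("alphabet", {6, 7, 8}))  # bet
```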

2 Introducing IDL Parallel Multiple Context-Free Grammars (IDL-PMCFG)

2.1 IDL Expressions

IDL (Interleave-Disjunction-Lock) expressions, introduced by Nederhof and Satta (2004), are a family of regular expressions tailored to describe and parse natural language sentences. Since they do not allow for the use of nonterminal symbols, Nederhof and Satta’s original IDL expressions are not grammars and can therefore only be used to describe specific (finite) families of utterances; a single IDL expression cannot encode a complex language model. However, they already include everything needed to account for free constituent order, hyperbata and their respective limitations. The definitions below closely follow those of the original paper.

Definition 1

(IDL expression) Let \(\varSigma \) be a finite alphabet. An IDL expression e over \(\varSigma \) is defined inductively as follows:

$$\begin{aligned} e&:= a \quad \forall a \in \varSigma \cup \left\{ \underline{\varepsilon }\right\} \\&\mid e' \cdot e'' \\&\mid \times \left( e'\right) \\&\mid \vee \left( e_1, \dots , e_n\right) \quad \forall n \in {\mathbb {N}}^+ \\&\mid ||\left( e_1, \dots , e_n\right) \quad \forall n \in {\mathbb {N}}^+. \end{aligned}$$

Note that, unlike usual regular expressions dealing with character strings, IDL expressions used in typical computational linguistics applications use an alphabet \(\varSigma \) composed of full words (tokens), which are to be combined into grammatical constituents and sentences. Therefore, throughout this document, the word string should be understood as a shortcut for ‘token list’, and the phrase set of strings as a shortcut for ‘set of token lists’. The informal semantics of the constructors, which all act on sets of strings, is as follows:

  • The dot represents standard concatenation;

  • Disjunction has its usual semantics as set union;

  • The interleave operator \(||\) allows for arbitrarily mixing tokens contained in n strings, as long as the relative ordering within each initial string is preserved in the final string;

  • The lock operator \(\times \) prevents a string from being divided into several substrings by an instance of the interleave operator.

This can be illustrated by the following variations on the Latin sentence ciuis Romanus sum (‘I am a Roman citizen’, litt. ‘citizen Roman am’):

$$\begin{aligned} \texttt {ciuis} \cdot \texttt {Romanus} \cdot \texttt {sum} \longrightarrow \left\{ \right.&\left. \texttt {"ciuis Romanus sum"}~\right\} \\ \vee \left( \texttt {ciuis}, \texttt {Romanus}, \texttt {sum}\right) \longrightarrow \left\{ \right.&\left. \texttt {"ciuis"},\texttt {"Romanus"},\texttt {"sum"}~\right\} \\ ||\left( \texttt {ciuis}, \texttt {Romanus}, \texttt {sum}\right) \longrightarrow \left\{ \right.&\left. \texttt {"ciuis Romanus sum"}, \texttt {"ciuis sum Romanus"},\right. \\&\left. \texttt {"Romanus ciuis sum"}, \texttt {"Romanus sum ciuis"} \right. \\&\left. \texttt {"sum ciuis Romanus"}, \texttt {"sum Romanus ciuis"}~\right\} \\ ||\left( \texttt {ciuis} \cdot \texttt {Romanus}, \texttt {sum}\right) \longrightarrow \left\{ \right.&\left. \texttt {"ciuis Romanus sum"}, \texttt {"ciuis sum Romanus"},\right. \\&\left. \texttt {"sum ciuis Romanus"}~\right\} \\ ||\left( \times \left( \texttt {ciuis} \cdot \texttt {Romanus}\right) , \texttt {sum}\right) \longrightarrow \left\{ \right.&\left. \texttt {"ciuis Romanus sum"},\texttt {"sum ciuis Romanus"}~\right\} \end{aligned}$$

To formally define the language of an IDL expression, we first need to introduce the primitives lock and comb as done in Nederhof and Satta (2004):

Definition 2

(Primitives lock and comb)

  1.

    Let \(\mathsf {lock}\) be the only monoid homomorphism over \(\left( \left( \varSigma \cup \left\{ \diamond \right\} \right) ^*, \cdot \right) \) such that \(\mathsf {lock}_{\mid \varSigma } = \mathrm {id}_{\mid \varSigma }\) and \(\mathsf {lock}\left( \diamond \right) = \varepsilon \).

  2.

    Let \({\mathsf {comb}}\) and \({\mathsf {comb}}'\) be functions from \(\left( \left( \varSigma \cup \left\{ \diamond \right\} \right) ^*\right) ^2\) to \({\mathscr {P}}\left( \left( \varSigma \cup \left\{ \diamond \right\} \right) ^*\right) \) defined inductively by:

    $$\begin{aligned} {\mathsf {comb}}\left( x,y\right)&= {\mathsf {comb}}'\left( x,y\right) \cup {\mathsf {comb}}'\left( y,x\right) \\ {\mathsf {comb}}'\left( x,y\right)&= \left\{ \begin{array}{ll} \left\{ x \diamond y \right\} &{} \text {if there is no }\diamond \text { in } x \\ \left\{ x' \diamond y' \mid y' \in {\mathsf {comb}}\left( x'',y\right) \right\} &{} \text {if }x\text { is of the form } x' \diamond x'' \text { with no }\diamond \text { in }x' \end{array} \right. . \end{aligned}$$

Informally, the symbol \(\diamond \) represents positions at which words can be interleaved into the current substring. Such \(\diamond \) symbols are inserted into the current string by each concatenation or interleave operation: by default, every word boundary is a place where a new word can be inserted. The \(\mathsf {lock}\) primitive erases such symbols in each string of the input set, thus preventing any interleaving within the enclosed substrings. The \({\mathsf {comb}}\) primitive produces the set of all strings that can be obtained by interleaving contiguous substrings of the two input strings at positions marked by a \(\diamond \).

Since \({\mathsf {comb}}\) produces all possible interleavings of two input strings, it is clearly associative and commutative. We can thus see \({\mathsf {comb}}\) as an \(n\)-ary operator for all \(n\), and write \({\mathsf {comb}}_{i=1}^n a_i := {\mathsf {comb}}\left( a_1, {\mathsf {comb}}\left( a_2, \dots {\mathsf {comb}}\left( a_{n-1},a_n\right) \dots \right) \right) \).

We can now define the language of an IDL expression, which exactly matches the mechanics exposed and demonstrated above:

Definition 3

(Language of an IDL expression) Let \(\varSigma \) be an alphabet. For any IDL expression e over \(\varSigma \), language \(\mathrm {L}\left( e\right) \) is given by

$$\begin{aligned} \mathrm {L}\left( e\right) = \sigma \left( \times \left( e\right) \right) \end{aligned}$$

where for any IDL expression e over \(\varSigma \), \(\sigma \left( e\right) \) is an element of \({\mathscr {P}}\left( \left( \varSigma \cup \left\{ \diamond \right\} \right) ^*\right) \), i.e. a set of marked strings, defined inductively as follows:

$$\begin{aligned} \sigma \left( \underline{\varepsilon }\right)&= \left\{ \varepsilon \right\} \\ \sigma \left( a\right)&= \left\{ a \right\}&\forall a \in \varSigma \\ \sigma \left( e' \cdot e''\right)&= \left\{ w' \diamond w'' \mid \left( w', w''\right) \in \sigma \left( e'\right) \times \sigma \left( e''\right) \right\} \\ \sigma \left( \times \left( e'\right) \right)&= \mathsf {lock}\left( \sigma \left( e'\right) \right) \\ \sigma \left( \vee \left( e_1, \dots , e_n\right) \right)&= \bigcup _{i=1}^n \sigma \left( e_i\right) \\ \sigma \left( ||\left( e_1, \dots , e_n\right) \right)&= {\mathsf {comb}}_{i=1}^n \sigma \left( e_i\right) . \end{aligned}$$

To see why IDL expressions are well-suited to describe grammatically valid reorderings of utterances in free word order languages, consider the following example from Latin: Marcus cum amico caro ambulat (‘Marcus walks with his dear friend’, litt. ‘Marcus with friend dear walks’). For a permutation of the five above words to be considered valid in Classical Latin verse, the only condition to be met is that cum (‘with’) must stand immediately before either amico (‘friend’, ablative singular) or caro (‘dear’, ablative singular masculine). Besides this single constraint, the order of constituents is free, and the verb modifier cum amico caro might even be disjoint. This means that even heavily reordered utterances such as amico Marcus ambulat cum caro should be considered grammatical, as similar structures are, indeed, well documented. Now, this seemingly unusual syntactic constraint is surprisingly easy to express in terms of an IDL expression:

$$\begin{aligned} ||\left( \texttt {Marcus}, \vee \left( ||\left( \times \left( \texttt {cum}\cdot \texttt {amico}\right) , \texttt {caro}\right) , ||\left( \times \left( \texttt {cum}\cdot \texttt {caro}\right) , \texttt {amico}\right) \right) , \texttt {ambulat}\right) . \end{aligned}$$
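This machinery can be checked mechanically. The following Python sketch implements the \(\diamond \)-marker semantics of Definitions 2 and 3 (illustrative code of our own; all function names are assumptions, not part of COMPĀ or any existing tool): it recomputes the locked variation of ciuis Romanus sum shown earlier, and confirms that the heavily reordered amico Marcus ambulat cum caro belongs to the language of the expression above.

```python
from functools import reduce

DIAMOND = "<>"  # stands for the special interleaving marker (the diamond)

# A "string" is a tuple of tokens, possibly containing DIAMOND markers.
# Each constructor below computes sigma(e) (Definition 3) for the
# corresponding IDL operator, as a set of such marked strings.

def atom(tok):
    return {(tok,)}

def cat(x_set, y_set):
    """Concatenation e'.e'': a fresh diamond marks the word boundary."""
    return {x + (DIAMOND,) + y for x in x_set for y in y_set}

def disj(*sets):
    """Disjunction: plain set union."""
    return set().union(*sets)

def lock(x_set):
    """Lock: erase all diamonds, forbidding further interleaving."""
    return {tuple(t for t in x if t != DIAMOND) for x in x_set}

def comb_prime(x, y):
    """The comb' primitive from Definition 2."""
    if DIAMOND not in x:
        return {x + (DIAMOND,) + y}
    i = x.index(DIAMOND)  # x = x' <> x'' with no diamond in x'
    return {x[:i] + (DIAMOND,) + z for z in comb(x[i + 1:], y)}

def comb(x, y):
    return comb_prime(x, y) | comb_prime(y, x)

def interleave(*sets):
    """||(e1, ..., en): fold the binary comb over sets of strings."""
    def comb_sets(xs, ys):
        return set().union(*(comb(x, y) for x in xs for y in ys))
    return reduce(comb_sets, sets)

def language(x_set):
    """L(e) = sigma(x(e)): apply a top-level lock, then render tokens."""
    return {" ".join(x) for x in lock(x_set)}

# The locked example from the table above: only two orderings survive.
e = interleave(lock(cat(atom("ciuis"), atom("Romanus"))), atom("sum"))
print(sorted(language(e)))  # ['ciuis Romanus sum', 'sum ciuis Romanus']

# The prepositional-phrase expression: cum must immediately precede
# either amico or caro; everything else is freely interleaved.
pp = disj(interleave(lock(cat(atom("cum"), atom("amico"))), atom("caro")),
          interleave(lock(cat(atom("cum"), atom("caro"))), atom("amico")))
marcus = interleave(atom("Marcus"), pp, atom("ambulat"))
print("amico Marcus ambulat cum caro" in language(marcus))  # True
```

This naive construction materializes whole sets of strings and is therefore exponential; it is only meant to illustrate the semantics, not to suggest a parsing strategy.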

Of course, IDL expressions alone cannot provide much more than ad-hoc solutions for a set of specific utterances. In order for their expressive power to be used for general language description, they must be integrated into a complete grammatical formalism.

2.2 Parallel Multiple Context-Free Grammars

It was precisely to deal with discontinuous constituents in natural language that Seki et al. (1991) defined parallel multiple context-free grammars (PMCFG), whose definition is given below. Parallel multiple context-free grammars extend context-free grammars by manipulating tuples of strings instead of strings. Each category of a PMCFG is assigned a dimension (a tuple size). Every production consumes a number of named argument tuples of fixed categories and produces a new tuple. Each element of this tuple is the concatenation of an arbitrary number of terminals and (nonterminal, index) pairs, which uniquely identify a field of one of the arguments. The start category, usually denoted by S, defines the grammar’s language, and therefore has dimension 1.

We have the following typical example:

Lemma 1

Language \(\mathrm {L}_{3n} = \left\{ a^n b^n c^n \mid n \in {\mathbb {N}} \right\} \) on \(\varSigma _3 = \left\{ a, b, c \right\} \) is in \({\mathrm {PMCFL}}\).

Proof

The following PMCFG grammar matches \(\mathrm {L}_{3n}\)

$$\begin{aligned} T \rightarrow S&: t \mapsto \left\langle t\left[ 0\right] \cdot t\left[ 1\right] \cdot t\left[ 2\right] \right\rangle \\ T \rightarrow T&: t \mapsto \left\langle a\cdot t\left[ 0\right] , b\cdot t\left[ 1\right] , c\cdot t\left[ 2\right] \right\rangle \\ T&: \left\langle \underline{\varepsilon }, \underline{\varepsilon }, \underline{\varepsilon }\right\rangle \end{aligned}$$

where, following usual programming conventions, we start indexing at 0 and write (nonterminal, index) pairs as nonterminal[index]. \(\square \)

How does this grammar define the above language? First, it states that a (one-dimensional) tuple of type S can be produced from a (three-dimensional) tuple t of type T by concatenating the three fields of t; then, that a tuple of type T can be produced in either of two ways: either it is generated from another tuple t of type T by prepending a, b and c respectively to the three fields, or it is equal to \(\left\langle \varepsilon , \varepsilon , \varepsilon \right\rangle \). It is straightforward to see that T generates exactly the tuples \(\left\langle a^n, b^n, c^n\right\rangle \), thus yielding the expected behavior for S.
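The derivation mechanics of this toy grammar can be simulated directly; the following sketch (our own illustration; the function names are assumptions) encodes each rule as a function on tuples:

```python
def rule_base():
    """T : <eps, eps, eps> -- the base tuple of empty fields."""
    return ("", "", "")

def rule_step(t):
    """T -> T : t |-> <a.t[0], b.t[1], c.t[2]>"""
    return ("a" + t[0], "b" + t[1], "c" + t[2])

def rule_start(t):
    """T -> S : t |-> <t[0].t[1].t[2]> -- concatenate the three fields."""
    return t[0] + t[1] + t[2]

def derive(n):
    """Apply the T -> T rule n times to the base tuple, then go to S."""
    t = rule_base()
    for _ in range(n):
        t = rule_step(t)
    return rule_start(t)

print(derive(0))  # the empty word
print(derive(3))  # aaabbbccc
```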

The formal definition of PMCFG, slightly adapted from Seki et al. (1991), is as follows:

Definition 4

(PMCF grammar) A PMCF grammar (or PMCFG) is a sextuple

$$\begin{aligned} G = (N,\delta , \varSigma , F, P, S) \end{aligned}$$

where

  1.

    N is a finite set of nonterminal symbols (also called categories);

  2.

    \(\delta : N \rightarrow {\mathbb {N}}\) maps each nonterminal symbol A to its dimension \(\delta \left( A\right) \);

  3.

    \(\varSigma \) is a finite set of terminal symbols disjoint with N;

  4.

    F is a finite set of functions such that for all \(f \in F\), there exists \(a\left( f\right) \in {\mathbb {N}}\), called arity of f, as well as a(f) integers \(d_1(f), \dots , d_{a(f)}(f)\) encoding the dimensions of the a(f) arguments of f, and an integer r(f) encoding the dimension of the image of f, such that the signature of f is

    $$\begin{aligned} \left( \varSigma ^*\right) ^{d_1\left( f\right) } \times \dots \times \left( \varSigma ^*\right) ^{d_{a(f)}\left( f\right) } \rightarrow \left( \varSigma ^*\right) ^{r(f)}; \end{aligned}$$
  5.

    For any \(f \in F\), letting \(\rho := r\left( f\right) \), f is of the form

    $$\begin{aligned} s_1,\dots ,s_{a\left( f\right) } \mapsto \left\langle \alpha _{11} s_{\beta _{11}\gamma _{11}}\alpha _{12}s_{\beta _{12}\gamma _{12}}\dots \alpha _{1\delta _1}, \dots , \alpha _{\rho 1}s_{\beta _{\rho 1}\gamma _{\rho 1}}\alpha _{\rho 2} s_{\beta _{\rho 2}\gamma _{\rho 2}}\dots \alpha _{\rho \delta _{\rho }}\right\rangle \end{aligned}$$

    where all \(\delta _i\), \(\beta _{ij}\) and \(\gamma _{ij}\) are integers with \(\beta _{ij} \le a\left( f\right) \) and \(\gamma _{ij} \le d_{\beta _{ij}}\left( f\right) \), and \(\alpha _{ij} \in \varSigma ^*\) for all ij. In other terms, every component of the tuple produced by f is obtained via arbitrary concatenation of symbols from \(\varSigma \) and components of f’s arguments;

  6.

    For \(q \in {\mathbb {N}}\), let \(F_q\) denote the subset of all functions of arity q in F;

  7.

    P, called the set of productions or rules, is a finite subset of \(\bigcup _{q \in {\mathbb {N}}} \left( F_q \times N^{q+1}\right) \) such that for all \(q \in {\mathbb {N}}\) and \(\left( f, A_1, \dots , A_{q+1}\right) \in P\), we have \(d_k\left( f\right) = \delta \left( A_k\right) \) for every \(k \in \left\{ 1, \dots , q \right\} \) as well as \(r(f) = \delta \left( A_{q+1}\right) \), i.e. the dimensions of the arguments (resp. of the image) of f match the dimensions of the categories on the left (resp. right) side of the production;

  8.

    \(S \in N\) is the start symbol, of dimension \(\delta \left( S\right) =1\).

Note that according to the previous definition, a PMCFG \(G = \left( N,\delta ,\varSigma ,F,P,S\right) \) such that \(\delta \left( N\right) = \left\{ 1 \right\} \) (i.e. a PMCFG whose categories all have dimension 1) is exactly a CFG.

Finally, we define the language of a parallel multiple context-free grammar as follows:

Definition 5

(Language of a PMCFG) Let \(G = \left( N, \delta , \varSigma , F, P, S\right) \) be a PMCF grammar. Let \(m = \max _{A \in N} \delta \left( A\right) \). We define a big-step derivation relation \({\hat{\rightarrow }}\) on \(N \times \bigcup _{i=0}^m \left( \varSigma ^*\right) ^i\) inductively as follows:

For all \(\left( A, \left\langle t_1, \dots , t_{\delta \left( A\right) }\right\rangle \right) \in N \times \left( \varSigma ^*\right) ^{\delta \left( A\right) }\), we have \(A {\hat{\rightarrow }} \left\langle t_1, \dots , t_{\delta \left( A\right) } \right\rangle \) if, and only if, there exists a production \(\left( f,A_1, \dots , A_{a\left( f\right) },A\right) \in P\) and strings \(\left( s_{ij}\right) _{i \le a\left( f\right) , j \le d_i\left( f\right) }\) such that the two following conditions are met:

  1. 1.

    For all integers \(i \le a\left( f\right) \), \(A_i {\hat{\rightarrow }} \left\langle s_{i1}, \dots , s_{id_i\left( f\right) }\right\rangle \);

  2. 2.

    \(f\left( s_1,\dots ,s_{a\left( f\right) }\right) = \left\langle t_1, \dots , t_{\delta \left( A\right) }\right\rangle .\)

The language recognized by G is defined as

$$\begin{aligned} \mathrm {L}(G) = \left\{ s \in \varSigma ^* \mid S {\hat{\rightarrow }} s \right\} \end{aligned}$$

and we call \(\mathrm {PMCFL}\) (resp. \(\mathrm {CFL}\)) the set of all languages that are recognized by at least one PMCFG (resp. CFG).

Let us give an example of how PMCFG extends the expressivity of CFG. We will use the following well-known example:

Lemma 2

Language \(\mathrm {L}_{3n}\) defined in Lemma 5 is not context-free.

Proof

This is a classical result whose proof (usually using the pumping lemma) can be found e.g. in Hopcroft et al. (2013). \(\square \)

This lemma, combined with Lemma 5, results in the following strict inclusion:

Proposition 1

\(\mathrm {CFL} \subsetneq \mathrm {PMCFL}\).
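The strict inclusion can also be witnessed through copying. The following sketch (ours, not from the paper) simulates the big-step derivations of a one-category grammar with base production \(S : \left\langle \texttt {a}\right\rangle \) and copy rule \(S \rightarrow S : s \mapsto \left\langle s[0] \cdot s[0]\right\rangle \), which generates the classical non-context-free language \(\left\{ a^{2^n} \mid n \in {\mathbb {N}} \right\} \):

```python
# Sketch (not from the paper): big-step derivations of the PMCFG with
# productions  S : <"a">  and  S -> S : s |-> <s[0] . s[0]>.
# The copy rule uses its argument twice, which no CFG can express.

def derive(max_steps):
    """Enumerate strings derivable for S in at most max_steps big steps."""
    results = {"a"}                  # base production S : <"a">
    current = "a"
    for _ in range(max_steps):
        current = current + current  # copy rule: s[0] is concatenated with itself
        results.add(current)
    return results

L = derive(4)                        # { a, aa, aaaa, a^8, a^16 }
assert "a" * 4 in L                  # a^(2^2) is derivable
assert "a" * 3 not in L              # 3 is not a power of two
```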

Through its use of tuples, PMCFG provides a handy way to handle discontinuous constituents. Various parts of the linearization of a constituent can be stored in different fields, and later on integrated into a larger phrase. Since the same argument can appear an arbitrary number of times in the right-hand side of any production, general PMCFGs can also define reduplication phenomena as encountered e.g. in short Swiss German verbs (Lötscher 1993), Indonesian plurals (Dalrymple and Mofu 2011) or Telugu distributives (Balusu 2006). The formalism has proved efficient as a parsing front-end for context-free GF (Ljunglöf 2004; Angelov et al. 2009; Ljunglöf 2012). Nevertheless, expressing free order of constituents or interleaving of constituents is not easy in PMCFG. Until 2018, it was not even known whether this was possible. Although the answer is now known to be positive (Ho 2018), there is still no convenient way to concisely express the interleaving of groups, since PMCFG lacks a specific operator for this type of reordering. One way to overcome this difficulty is to define all legal orderings manually and pass them as arguments to the corresponding rule; this technique has been demonstrated by Lange (2017) in the case of Latin.
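The tuple mechanism can be illustrated with the English phrasal verb from Sect. 1.1 (our own sketch, not from the paper): the two parts of the discontinuous verb live in separate fields and are only put in linear order by the sentence rule.

```python
# Sketch (not from the paper): a discontinuous constituent as a category
# of dimension 2.  The particle verb "check ... out" is stored as a
# 2-tuple; the sentence rule interleaves the object between its fields.

verb = ("checked", "out")            # V : <"checked", "out">

def sentence(subj, v, obj):
    # S rule: subj, v, obj |-> subj . v[0] . obj . v[1]
    return " ".join([subj, v[0], obj, v[1]])

assert sentence("I", verb, "this") == "I checked this out"
```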

2.3 Bringing Together IDL Expressions and PMCFG: IDL-PMCFG

The conclusions of the last two subsections suggest that we combine IDL expressions and parallel multiple context-free grammars into a single formalism that can handle discontinuous constituents and copying (as PMCFG does) as well as free constituent order and hyperbaton (as IDL expressions do). This formalism is that of IDL-PMCF grammars (IDL-PMCFG), whose definition is given below. Although the complexity of the membership problem of both IDL expressions and PMCF grammars is polynomial, this is not the case for their combination: we will see in this subsection that parsing IDL-PMCF grammars is NP-hard, which, as an important corollary (Theorem 1), implies that IDL-PMCFG provides a strict extension of PMCFG unless \(\mathrm {P} = \mathrm {NP}\).

In IDL-PMCF grammars, productions are defined not as concatenations, but as IDL expressions of terminals and (nonterminal, index) pairs. Instead of tuples of strings, tuples of sets of strings are now the basic data type manipulated by the various rules. The \(\diamond \) symbol, which marks positions at which new words can be interleaved into the current string, is added to the alphabet.

Definition 6

(IDL-PMCF grammar) An IDL-PMCF grammar (or IDL-PMCFG) is a sextuple

$$\begin{aligned} G = (N,\delta , \varSigma , F, P, S) \end{aligned}$$

where

  1. 1.

    N is a finite set of nonterminal symbols (also called categories);

  2. 2.

    \(\delta : N \rightarrow {\mathbb {N}}\) maps each nonterminal symbol A to its dimension \(\delta \left( A\right) \);

  3. 3.

    \(\varSigma \) is a finite set of terminal symbols disjoint with N;

  4. 4.

    F is a finite set of functions such that for all \(f \in F\), there exists \(a\left( f\right) \in {\mathbb {N}}\), called arity of f, as well as a(f) integers \(d_1(f), \dots , d_{a(f)}(f)\) encoding the dimensions of the a(f) arguments of f, and an integer r(f) encoding the dimension of the image of f, such that the signature of f is

    $$\begin{aligned} \left( {\mathscr {P}}\left( {\overline{\varSigma }}\right) \right) ^{d_1\left( f\right) } \times \dots \times \left( {\mathscr {P}}\left( {\overline{\varSigma }}\right) \right) ^{d_{a(f)}\left( f\right) } \rightarrow \left( {\mathscr {P}}\left( {\overline{\varSigma }}\right) \right) ^{r(f)} \end{aligned}$$

    where \({\overline{\varSigma }} = \left( \varSigma \cup \left\{ \diamond \right\} \right) ^*\);

  5. 5.

    For any \(f \in F\), there exist IDL expressions \(e_1, \dots , e_{r\left( f\right) }\) over \(\varSigma \cup \left\{ X_{ij} \mid i \le a\left( f\right) , j \le d_i\left( f\right) \right\} \), where the \(X_{ij}\) are fresh variable symbols, such that f is of the form

    $$\begin{aligned}&s_1,\dots ,s_{a\left( f\right) } \mapsto \bigcup _{\left( w_1,\dots ,w_{a\left( f\right) }\right) \in s_1 \times \dots \times s_{a\left( f\right) }}\\&\quad \left\{ \left\langle x_1\left[ X_{ij}:=w_{ij}~\forall i,j\right] , \dots , x_{r\left( f\right) }\left[ X_{ij}:=w_{ij}~\forall i,j\right] \right\rangle \right. \\&\quad \left. \mid (x_1, \dots , x_{r\left( f\right) }) \in \sigma \left( e_1\right) \times \dots \times \sigma \left( e_{r\left( f\right) }\right) \right\} . \end{aligned}$$

    Function f now produces a set of tuples of length \(r\left( f\right) \), which are derived in three steps: (1) for each \(i \le a\left( f\right) \), choose one tuple \(w_i\) in each set \(s_i\); (2) for each \(k \le r\left( f\right) \), choose an \(x_k\) in each \(\sigma \left( e_k\right) \); (3) for all i, j, substitute \(w_{ij}\) for \(X_{ij}\) in each \(x_k\).

  6. 6.

    For \(q \in {\mathbb {N}}\), let \(F_q\) denote the subset of all functions of arity q in F;

  7. 7.

    P, called the set of productions or rules, is a finite subset of \(\bigcup _{q \in {\mathbb {N}}} \left( F_q \times N^{q+1}\right) \) such that for all \(q \in {\mathbb {N}}\), for every function \(f \in F\) and categories \(A_1, \dots , A_{q+1} \in N\) such that \(\left( f, A_1, \dots , A_{q+1}\right) \in P\), we have \(d_k\left( f\right) = \delta \left( A_k\right) \) for every \(k \in \left\{ 1, \dots , q \right\} \) as well as \(r(f) = \delta \left( A_{q+1}\right) \), i.e. the dimensions of the arguments (resp. of the image) of f match those of the categories on the left (resp. right) side of the production;

  8. 8.

    \(S \in N\) is the start symbol, of dimension \(\delta \left( S\right) =1\).

Just as a parallel multiple context-free grammar with \(\delta \left( N\right) = \left\{ 1 \right\} \) is a context-free grammar, we define IDL-CFGs as follows:

Definition 7

(IDL-CF grammar) An IDL-PMCF grammar \(G = \left( N,\delta ,\varSigma ,F,P,S\right) \) such that \(\delta \left( N\right) = \left\{ 1 \right\} \) is called an IDL-CF grammar (or IDL-CFG).

IDL context-free grammars are of essential theoretical interest. As we come to evaluate the expressivity gained by replacing simple concatenation with IDL expressions, it will be important to single out the respective contributions of the PMCFG formalism and of IDL expressions to the extension of the class of languages that can be described by IDL-PMCFGs. Hence, comparing the expressivity of PMCFG to that of IDL-(PM)CFG, as we will do in Sect. 2.4, shall inform us more thoroughly about the complementarity of the two approaches we combined.

Finally, the language matched by a given IDL-PMCFG can now be defined:

Definition 8

(Language of an IDL-PMCFG) Let \(G = \left( N, \delta , \varSigma , F, P, S\right) \) be an IDL-PMCF grammar. Let \(m = \max _{A \in N} \delta \left( A\right) \). We define a big-step derivation relation \(\rightarrow \) on \(N \times \bigcup _{i=0}^m \left( {\mathscr {P}}\left( {\overline{\varSigma }}\right) \right) ^i\) inductively as follows:

For all \(\left( A, \left\langle t_1, \dots , t_{\delta \left( A\right) }\right\rangle \right) \in N \times \left( {\mathscr {P}}\left( {\overline{\varSigma }}\right) \right) ^{\delta \left( A\right) }\), we have \(A \rightarrow \left\langle t_1, \dots , t_{\delta \left( A\right) } \right\rangle \) if, and only if, there exists a production \(\left( f,A_1, \dots , A_{a\left( f\right) },A\right) \in P\) and sets of strings \(\left( s_{ij}\right) _{i \le a\left( f\right) , j \le d_i\left( f\right) }\) such that the two following conditions are met:

  1. 1.

    For all integers \(i \le a\left( f\right) \), \(A_i \rightarrow \left\langle s_{i1}, \dots , s_{id_i\left( f\right) }\right\rangle \);

  2. 2.

    \(f\left( s_1,\dots ,s_{a\left( f\right) }\right) = \left\langle t_1, \dots , t_{\delta \left( A\right) }\right\rangle .\)

Finally, we use the \(\mathsf {lock}\) function introduced in Definition 3 to define the language recognized by G as

$$\begin{aligned} \mathrm {L}(G) = \bigcup _{\underset{S \rightarrow t}{t \in {\mathscr {P}}\left( {\overline{\varSigma }}\right) }}\left\{ \mathsf {lock}\left( s\right) \mid s\in t\right\} \end{aligned}$$

and we call \(\mathrm {IDLPMCFL}\) (resp. \(\mathrm {IDLCFL}\)) the set of all languages that are recognized by at least one IDL-PMCFG (resp. IDL-CFG).

Note that the \(\diamond \) symbols are only erased from the output in the last step of the definition of \(\mathrm {L}\left( G\right) \), after the set of all strings that can be derived from the start symbol S has been retrieved. If they had been erased each time a production was used, interleaving constituents that are not arguments of the same rule would have been impossible, and constituents obtained from any production would have been locked.

Let us give an example of this. Consider the following IDL-CFG \(G_{abcd}\) on \(\varSigma = \left\{ a, b, c, d \right\} \) (when describing IDL-CFG grammars, we shall omit the [0] indexes identifying the first field of every argument):

$$\begin{aligned} Q \rightarrow R \rightarrow S&\,{:}\, q, r \mapsto ||\left( q, r\right) \\ Q&\,{:}\,a \cdot b \\ R&\,{:}\,c \cdot d. \end{aligned}$$

Clearly, \(\mathrm {L}\left( G_{abcd}\right) = \left\{ abcd, acbd, acdb, cabd, cadb, cdab \right\} \). Consider the derivation tree for acbd:

figure a

For the first derivation to be possible, the diamonds in \(a \diamond b\) and \(c \diamond d\) are still required; otherwise, we would have \({\mathsf {comb}}\left( \left\{ ab \right\} , \left\{ cd \right\} \right) = \left\{ abcd, cdab \right\} \), which does not contain acbd. Keeping the diamonds in place until the end of the derivation process is therefore essential.
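The role of the diamonds can be checked mechanically. In the sketch below (ours, not from the paper's implementation), each linearization is split at its diamond positions, and the \(||\) operator is rendered as the set of all order-preserving interleavings of the resulting segment lists; erasing the diamonds at the very end reproduces \(\mathrm {L}\left( G_{abcd}\right) \) exactly.

```python
from itertools import combinations

# Sketch (not from the paper): the diamond marks the only positions where
# material from the other constituent may be inserted.  Splitting on it
# and shuffling the segment lists yields the six strings of L(G_abcd).

def shuffles(xs, ys):
    """All order-preserving interleavings of the segment lists xs and ys."""
    n = len(xs) + len(ys)
    out = set()
    for pos in combinations(range(n), len(xs)):
        slots = [None] * n
        for p, x in zip(pos, xs):
            slots[p] = x
        it = iter(ys)
        out.add("".join(next(it) if s is None else s for s in slots))
    return out

q, r = "a\u25cab", "c\u25cad"      # the linearizations a<>b and c<>d
lang = shuffles(q.split("\u25ca"), r.split("\u25ca"))  # then erase diamonds
assert lang == {"abcd", "acbd", "acdb", "cabd", "cadb", "cdab"}
assert "acbd" in lang              # the derivation discussed above
```

Without the diamonds, the segment lists collapse to the single segments ab and cd, and only abcd and cdab remain, as noted above.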

The following result comes “for free”:

Proposition 2

\(\mathrm {PMCFL} \subset \mathrm {IDLPMCFL}\).

Finally, we give a simple example of how this formalism can be used to implement a small grammar. Suppose that we want to encode a small subset of Latin that contains sentences composed of a final verb and an optional subject. This subject is a noun phrase, i.e. a noun to which an arbitrary number of optional adjectives may be attached. The following IDL-CFG describes exactly this:

$$\begin{aligned} NP \rightarrow V \rightarrow S&:np, v \mapsto np \cdot v \\ V \rightarrow S&: v \mapsto v \\ N \rightarrow NP&: n \mapsto n \\ NP \rightarrow A \rightarrow NP&: np, a \mapsto ||\left( np,a\right) . \end{aligned}$$

Note that while we use the \(||\) operator for building a new NP from an NP and an adjective (meaning that the adjective may appear before, within or after the NP it is attached to), we resort to simple concatenation for building a sentence from an NP and a verb, as we want the verb to appear at the end of the sentence, after the subject NP.
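The behaviour of this small grammar can be sketched as follows (our own illustration with a hypothetical lexicon, not from the paper; we assume, as the \(||\) semantics permits here, that each adjective may be inserted at any position of the current NP, while the verb stays sentence-final):

```python
# Sketch (not from the paper): each application of NP -> A -> NP with ||
# inserts the adjective anywhere into the NP; the sentence rule then
# concatenates the final verb.  Lexicon ("puella", "parva", "pulchra",
# "cantat") is a hypothetical example, not taken from the paper.

def attach(np_words, adj):
    """NP -> A -> NP: insert the adjective at every possible position."""
    return [np_words[:i] + [adj] + np_words[i:] for i in range(len(np_words) + 1)]

nps = [["puella"]]                        # noun alone
for adj in ["parva", "pulchra"]:          # attach two optional adjectives
    nps = [w for np in nps for w in attach(np, adj)]

sentences = {" ".join(np + ["cantat"]) for np in nps}  # verb stays final
assert len(sentences) == 6                # all 3! orders of the NP words
assert all(s.endswith("cantat") for s in sentences)
```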

2.4 Expressivity

We shall now investigate the expressivity of IDL-(PM)CFGs and try to locate the corresponding class of languages within the hierarchy of polynomial languages. Recall the following series of inclusions

$$\begin{aligned} \mathrm {CFL} \subsetneq \mathrm {TAL} \subsetneq \mathrm {PMCFL} \subsetneq \mathrm {PRCL} = \mathrm {P} \end{aligned}$$

where

  • \(\mathrm {CFL}\) is the class of context-free languages;

  • \(\mathrm {TAL}\) is the class of tree-adjoining languages (Vijayashanker and Joshi 1988);

  • \(\mathrm {PMCFL}\) is the class of parallel multiple context-free languages;

  • \(\mathrm {PRCL}\) is the class of positive range concatenation languages (Boullier 1998);

  • \(\mathrm {P}\) is the class of languages recognizable in polynomial time.

The important equality \(\mathrm {PRCL} = \mathrm {P}\) is proved in Boullier (1998).

The main contribution of this subsection is a proof that IDL-PMCFGs can be located strictly above PRCGs in the hierarchy (Theorem 1). We also show that IDL-CFGs are strictly more expressive than TAGs and not more expressive than PMCFGs, and we raise the question whether \(\mathrm {IDLCFL} \subset \mathrm {PMCFL}\) as a natural generalization of a recently solved classification problem.

We first observe that IDL-CFG allows us to define in a very compact manner the nMIX language family:

Proposition 3

For all \(n \in {\mathbb {N}}^+\), the \(n\mathrm {MIX}\) language defined as

$$\begin{aligned} n\mathrm {MIX} = \left\{ x \in \left\{ a_1, \dots , a_n \right\} ^* \mid \left|x \right|_{a_1} = \dots = \left|x \right|_{a_n} \right\} \end{aligned}$$

is in \(\mathrm {IDLCFL}\).

Proof

Let \(n \in {\mathbb {N}}^+\). The following IDL-CFG

$$\begin{aligned} S&: \varepsilon \\ S \rightarrow S&: s \mapsto ||\left( s, a_1, \dots , a_n\right) \end{aligned}$$

defines \(n\mathrm {MIX}\). \(\square \)
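The effect of this two-rule grammar can be simulated directly (our own sketch, not from the paper): each application of the \(||\) rule inserts one occurrence of every letter anywhere into the current string, so k applications yield exactly the strings with k occurrences of each letter.

```python
from itertools import product

# Sketch (not from the paper): simulate  S -> S : s |-> ||(s, a_1, ..., a_n)
# by inserting one copy of each letter at every possible position, and
# check for n = 2, k = 2 that this yields exactly the equal-count strings.

def step(strings, letters):
    """Apply the || rule once: insert one copy of each letter anywhere."""
    for letter in letters:
        strings = {s[:i] + letter + s[i:] for s in strings for i in range(len(s) + 1)}
    return strings

letters = ("a", "b")
strings = {""}                    # base production S : epsilon
for _ in range(2):                # two applications of the || rule
    strings = step(strings, letters)

expected = {"".join(w) for w in product("ab", repeat=4)
            if "".join(w).count("a") == "".join(w).count("b")}
assert strings == expected        # the six strings with two a's and two b's
```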

The position of the \(n\mathrm {MIX}\) languages within the hierarchy has been intensively studied during the last decade. Language \(2\mathrm {MIX}\) is context-free. For \(n \ge 3\), the problem turns out to be much more difficult. The original \(\mathrm {MIX}\) language (or Bach language), i.e. \(3\mathrm {MIX}\), was proven to be a \(\mathrm {PMCFL}\) by Salvati (2015). Kanazawa and Salvati (2012) also proved that \(\mathrm {MIX}\) is not a tree-adjoining language. Together with the fact that IDL-CFGs generate all \(n\mathrm {MIX}\) languages, this result provides us with the following corollary:

Proposition 4

\(\mathrm {IDLCFL} \not \subset \mathrm {TAL}\).

For many years, no general classification results were available for \(n > 3\). Only very recently did Ho (2018) prove that for all n, the word problem of \({\mathbb {Z}}^n\) is in \(\mathrm {PMCFL}\). Since the word problem of \({\mathbb {Z}}^n\) and nMIX are rationally equivalent (Salvati 2015), this yields the inclusion of the whole \(n\mathrm {MIX}\) family within \(\mathrm {PMCFL}\).

Proving that \(\mathrm {IDLCFL} \subset \mathrm {PMCFL}\) would be an even stronger result, given that \(n\mathrm {MIX} \in \mathrm {IDLCFL}\) for all n; the inclusion might even appear likely in the light of Ho’s proof. Nevertheless, the amount of work needed to address the “specific” case of \(n\mathrm {MIX}\) languages suggests that this will be anything but easy, and we will leave this for future work.

On the other hand, it is clear that IDL-CFG does not contain PMCFG.

Proposition 5

Language \(\mathrm {L}_{3n}\) above is not in \(\mathrm {IDLCFL}\).

Proof

By contradiction, let G be an IDL-CFG matching \(\mathrm {L} = \mathrm {L}_{3n}\). Consider the context-free grammar \(G'\) that is obtained from G by replacing every IDL expression e over some alphabet \(\varSigma '\) in the right-hand side of a rule by a string \(s\left( e\right) \) defined inductively as follows:

$$\begin{aligned} s\left( a\right)&= a&\forall a \in \varSigma ' \cup \left\{ \underline{\varepsilon }\right\} \\ s\left( e\cdot e'\right)&= s\left( e\right) \cdot s\left( e'\right) \\ s\left( \times \left( e\right) \right)&= s\left( e\right) \\ s\left( \vee \left( e_1,\dots ,e_n\right) \right)&= \vee \left( s\left( e_1\right) ,\dots ,s\left( e_n\right) \right)&\forall n \in {\mathbb {N}} \\ s\left( ||\left( e_1,\dots ,e_n\right) \right)&= s\left( e_1\right) \cdot \dots \cdot s\left( e_n\right)&\forall n \in {\mathbb {N}}. \end{aligned}$$

Note that an equivalent CFG in a more canonical form can easily be obtained by removing the disjunction nodes in exchange for an increase in the number of rules. Now, it is straightforward that the language \(\mathrm {L}'\) generated by \(G'\) is a subset of \(\mathrm {L}\). Moreover, for any string w generated by G, there exists a string \(w' \in \mathrm {L}' \subset \mathrm {L}\) such that \(\left|w \right|= \left|w' \right|\), and such that \(w'\) is obtained by applying the same rules in \(G'\) that were used to produce w in G. By construction, \(w'\) is a permutation of w. Let \(x \in \mathrm {L}\), and \(n = \left|x \right|\). By definition of \(\mathrm {L}\), x is the only word of length n in \(\mathrm {L}\). As a consequence, \(x \in \mathrm {L}'\). This means that \(\mathrm {L}' = \mathrm {L}\) and that the CFG \(G'\) recognizes \(\mathrm {L}\), which is impossible since \(\mathrm {L}\) is not context-free. \(\square \)
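The flattening s can be sketched as follows (our own Python illustration, not from the paper's text; disjunctions are expanded into sets of alternatives, as the remark on the canonical form permits). IDL expressions are represented here as nested tuples.

```python
# Sketch (not from the paper): the flattening s(e) from the proof above,
# with disjunction nodes expanded into sets of alternative strings.
# An IDL expression is a nested tuple: ("lit", a), ("cat", e1, ..., en),
# ("lock", e), ("or", e1, ..., en), ("par", e1, ..., en) for ||.

def s(e):
    op, *args = e
    if op == "lit":                       # terminal symbol (or epsilon)
        return {args[0]}
    if op in ("cat", "par"):              # || is flattened to concatenation
        out = {""}
        for sub in args:
            out = {x + y for x in out for y in s(sub)}
        return out
    if op == "lock":                      # s(x(e)) = s(e)
        return s(args[0])
    if op == "or":                        # keep one string per disjunct
        return set().union(*map(s, args))

# ||(a . b, c) flattens to the single string abc ...
assert s(("par", ("cat", ("lit", "a"), ("lit", "b")), ("lit", "c"))) == {"abc"}
# ... while a disjunction keeps both alternatives:
assert s(("cat", ("or", ("lit", "a"), ("lit", "b")), ("lit", "c"))) == {"ac", "bc"}
```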

Again, this yields a classification result:

Proposition 6

\(\mathrm {PMCFL} \not \subset \mathrm {IDLCFL}\).

This has the following corollary:

Proposition 7

\(\mathrm {IDLCFL} \subsetneq \mathrm {IDLPMCFL}\).

Proof

By Propositions 2 and 6. \(\square \)

An essential result is that unless \(\mathrm {P} = \mathrm {NP}\), IDL-PMCFGs can define some non-polynomial languages. This is in line with Kirman and Salvati (2013)’s findings that even classes of grammars that are “close to [...] mildly context sensitive” may have NP-hard membership problems as soon as commutation is allowed. In the case of IDL-PMCFG, we will prove this in three steps. First, we will recall the definition of the \(\mathrm {NP}\)-complete problem \(\mathrm {3SAT}\) and propose a polynomial encoding of it over a finite alphabet. Second, we will construct an IDL-PMCFG that recognizes the language of satisfiable 3-CNF formulae in this encoding. A final step will then lead us to the result.

The 3SAT problem is one of Karp (1972)’s 21 \(\mathrm {NP}\)-complete problems. It asks whether a finite boolean formula on a potentially infinite set of variables \(\left\{ x_n\right\} _{n \in {\mathbb {N}}}\), input in conjunctive normal form (CNF) with at most three literals per clause, is satisfiable. Consider for example the 3-CNF formulae

$$\begin{aligned} f_1&=\left( x_1 \vee x_2 \vee \lnot x_3\right) \wedge \left( \lnot x_2 \vee \lnot x_3\right) \wedge \left( \lnot x_1 \vee x_2 \vee x_3\right) \\ f_2&=\left( x_1 \vee \lnot x_2\right) \wedge \left( x_2 \vee x_3\right) \wedge \lnot x_1 \wedge \lnot x_3. \end{aligned}$$

Formula \(f_1\) is satisfiable because the valuation \(x_1 \mapsto \top , x_2 \mapsto \bot , x_3 \mapsto \top \) results in the formula reducing to \(\top \). Formula \(f_2\) is not satisfiable: the last two unary clauses impose \(x_1 \mapsto \bot , x_3 \mapsto \bot \), but then the first clause requires \(x_2 \mapsto \bot \) to be satisfied whereas the second one needs \(x_2 \mapsto \top \), a contradiction.

The size of 3-CNFs is measured by the number of their clauses, without regard to the number of variables. In the two instances above, this gives \(\left|f_1 \right|= 3\), \(\left|f_2 \right|= 4\).
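The satisfiability reasoning above can be replayed by exhaustive search (our own sketch, not from the paper), representing a clause as a list of signed variable indices:

```python
from itertools import product

# Sketch (not from the paper): brute-force satisfiability of a 3-CNF
# given as a list of clauses, each clause a list of signed indices
# (positive = x_i, negative = not x_i).  Exponential, for illustration only.

def satisfiable(clauses):
    variables = sorted({abs(l) for c in clauses for l in c})
    for bits in product([False, True], repeat=len(variables)):
        v = dict(zip(variables, bits))
        if all(any(v[abs(l)] == (l > 0) for l in c) for c in clauses):
            return True
    return False

f1 = [[1, 2, -3], [-2, -3], [-1, 2, 3]]   # the formula f_1 above
assert satisfiable(f1)                     # e.g. x1 = T, x2 = F, x3 = T
assert not satisfiable([[1], [-1]])        # x1 and (not x1): a contradiction
assert len(f1) == 3                        # size = number of clauses
```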

So far, our description of 3-CNF formulae, unlike the grammars we study in this paper, used an infinite alphabet to encode variables. We now introduce an encoding of 3-CNF logical formulae on a finite alphabet.

Definition 9

Let \(\varSigma = \left\{ \texttt {[}, \texttt {]}, \texttt {(}, \texttt {)}, \texttt {1}, \texttt {!} \right\} \). Let \(V = \left( x_n\right) _{n \in {\mathbb {N}}^+}\) be the set of variables and \(SV = V \cup \lnot V\) the set of optionally negated variables. We define

  • A mapping \(\nu : V \rightarrow \varSigma ^*\) encoding variables as the unary representation of their index:

    $$\begin{aligned} \forall n \in {\mathbb {N}}^+,\; \nu \left( x_n\right) = \texttt {1}^n; \end{aligned}$$
  • A mapping \(\mu : SV \rightarrow \varSigma ^*\) encoding optionally negated variables by appending the character \(\mathtt {!}\) in front of negated variables:

    $$\begin{aligned} \forall v \in V,\left\{ \begin{array}{l} \mu \left( v\right) = \nu \left( v\right) \\ \mu \left( \lnot v\right) = \mathtt {!}\cdot \nu \left( v\right) \end{array}; \right. \end{aligned}$$
  • A mapping \(\pi : SV^3 \rightarrow \varSigma ^*\) encoding ternary clauses in the following way:

    $$\begin{aligned} \forall \left( u,v,w\right) \in SV^3,\;\pi \left( u \vee v \vee w\right) = \texttt {(} \cdot \mu \left( u\right) \cdot \texttt {)} \cdot \texttt {(} \cdot \mu \left( v\right) \cdot \texttt {)} \cdot \texttt {(} \cdot \mu \left( w\right) \cdot \texttt {)}; \end{aligned}$$
  • A mapping \(\tau : \bigcup _{n \in {\mathbb {N}}} \left( SV^3\right) ^n \rightarrow \varSigma ^*\) encoding 3-CNF formulae as follows:

    $$\begin{aligned}&\forall n \in {\mathbb {N}}, \forall \left( C_1, \dots , C_n\right) \in \left( SV^3\right) ^n,\\&\tau \left( C_1 \wedge \dots \wedge C_n\right) = \texttt {[} \cdot \pi \left( C_1\right) \cdot \texttt {]} \cdot \dots \cdot \texttt {[} \cdot \pi \left( C_n\right) \cdot \texttt {]}. \end{aligned}$$

Mappings \(\nu \), \(\mu \), \(\pi \) and \(\tau \) are clearly bijective.

The above encoding is only applicable to formulae in 3-CNF where every clause contains exactly three literals. A straightforward observation makes this restriction largely irrelevant and will simplify the discussion later on:

Proposition 8

Let f be a logical formula in 3-CNF. There exists another logical formula \({\hat{f}}\) in 3-CNF such that:

  1. 1.

    Formulae f and \({\hat{f}}\) are equisatisfiable and \(\left|{\hat{f}} \right|= \left|f \right|\);

  2. 2.

    All clauses in \({\hat{f}}\) have exactly three literals;

  3. 3.

    The set \({\hat{W}}\) of variables used in \({\hat{f}}\) is equal to \(\left( x_n\right) _{n \in [\![1,N]\!]}\) for some \(N \in {\mathbb {N}}\) such that \(N \le 3\left|f \right|\).

Moreover, for all such f, a formula \({\hat{f}}\) matching the three above conditions can be computed from f in time \({\mathscr {O}}\left( \left|f \right|\right) \).

Proof

Let f be a logical formula in 3-CNF and W the set of its variables. We derive \({\hat{f}}\) from f as follows:

  1. 1.

    Rename the variables in f to produce a new formula g with variable set X such that \(X = \left( x_n\right) _{n \in [\![1,\left|W \right|]\!]}\). One convenient way to achieve this is to process the formula from left to right, keeping track of the index of the smallest currently unused variable in the new (partial) formula, as well as of the correspondence between variables in f and their renamings in g. This is done in time linear in \(\left|f \right|\) and would e.g. convert \(f_3 = \left( x_3 \vee x_1 \vee x_{17}\right) \wedge \left( x_4 \vee \lnot x_3 \vee \lnot x_{16} \right) \) into \(g_3 = \left( x_1 \vee x_2 \vee x_3\right) \wedge \left( x_4 \vee \lnot x_1 \vee \lnot x_5\right) \).

  2. 2.

    Now, for every clause that has only one (resp. two) literals, triplicate (resp. duplicate) its first literal. This can be done in linear time by scanning the input from left to right. For instance, formula \(f_2\) would be converted into \(g_2 = \left( x_1 \vee \lnot x_2 \vee x_1\right) \wedge \left( x_2 \vee x_3 \vee x_2 \right) \wedge \left( \lnot x_1 \vee \lnot x_1 \vee \lnot x_1\right) \wedge \left( \lnot x_3 \vee \lnot x_3 \vee \lnot x_3\right) \).

Since the total number of variables in a 3-CNF formula cannot exceed 3 times the number of clauses, the resulting formula \({\hat{f}}\) clearly satisfies the above conditions. \(\square \)

Applying the operations described in Proposition 8 followed by the mapping \(\tau \) described in Definition 9 leads to the following encoding of formulae \(f_1\) and \(f_2\):

$$\begin{aligned} \tau \left( \hat{f_1}\right)&= \texttt {[(1)(11)(!111)][(!11)(!111)(!11)][(!1)(11)(111)]} \\ \tau \left( \hat{f_2}\right)&= \texttt {[(1)(!11)(1)][(11)(111)(11)][(!1)(!1)(!1)]}\\&\quad \times \texttt {[(!111)(!111)(!111)]}. \end{aligned}$$
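The whole pipeline (the normalization of Proposition 8 followed by the encoding \(\tau \) of Definition 9) is small enough to be sketched directly; the sketch below is ours, not the paper's code, and is checked against the encoding of \(\hat{f_1}\) shown above.

```python
# Sketch (not from the paper): variable renaming and literal duplication
# (Proposition 8), then the encoding tau of Definition 9, checked against
# the encoding of f_1-hat given in the text.

def rename(clauses):
    """Rename variables to x_1, x_2, ... in order of first occurrence."""
    names = {}
    return [[(1 if l > 0 else -1) * names.setdefault(abs(l), len(names) + 1)
             for l in c] for c in clauses]

def pad(clauses):
    """Duplicate the first literal until every clause has three literals."""
    return [c + c[:1] * (3 - len(c)) for c in clauses]

def tau(clauses):
    mu = lambda l: ("!" if l < 0 else "") + "1" * abs(l)   # literal encoding
    pi = lambda c: "".join("(" + mu(l) + ")" for l in c)   # clause encoding
    return "".join("[" + pi(c) + "]" for c in clauses)

f3 = [[3, 1, 17], [4, -3, -16]]
assert rename(f3) == [[1, 2, 3], [4, -1, -5]]              # g_3 above

f1 = [[1, 2, -3], [-2, -3], [-1, 2, 3]]
assert tau(pad(rename(f1))) == \
    "[(1)(11)(!111)][(!11)(!111)(!11)][(!1)(11)(111)]"
```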

The following lemma provides the key argument:

Lemma 3

(Language of satisfiable 3-CNF formulae) Let \(\mathrm {3SATL}\) be the language of satisfiable 3-CNF formulae. There exists an IDL-PMCF grammar G such that, for every 3-CNF formula f, \(f \in \mathrm {3SATL}\) iff \(\tau \left( {\hat{f}}\right) \in \mathrm {L}\left( G\right) \).

Proof

We build an IDL-PMCF grammar G that recognizes satisfiable 3-CNF formulae encoded as in Definition 9. First, we define variables (of category V and arity 1) as sequences of \(\texttt {1}\)s:

$$\begin{aligned} V&: \left\langle \texttt {1} \right\rangle \\ V \rightarrow V&: v \mapsto \left\langle v[0] \cdot \texttt {1} \right\rangle . \end{aligned}$$

We proceed by defining literals (of category L, arity 1) as variables preceded by the optional negation symbol \(\texttt {!}\):

$$\begin{aligned} V \rightarrow L : v \mapsto \left\langle \vee \left( \underline{\varepsilon }, \texttt {!}\right) \cdot v[0] \right\rangle . \end{aligned}$$

A satisfiable formula and the valuation satisfying it are produced in parallel through a number of double-steps, each of them consisting of:

  1. 1.

    A selection step, where a new variable \(x_i\) is selected and its boolean value \(v_i\) in the valuation is chosen;

  2. 2.

    An insertion step, where an arbitrary number of ternary clauses containing \(x_i\) (if \(v_i = \top \)) or \(\lnot x_i\) (if \(v_i = \bot \)) are added at arbitrary positions in the already produced formula.

Each double-step uses a category F of arity 3 that stores the current formula \(f_i\) as well as the current literal \(\sigma _i x_i\) (where \(\sigma _i \in \left\{ \varepsilon , \texttt {!}\right\} \) and \(\sigma _i = \varepsilon \Leftrightarrow v_i = \top \)) as \(\left\langle f_i, x_i, \sigma _i \right\rangle \).Footnote 4

The selection step retrieves the next variable and chooses its value by

$$\begin{aligned} F \rightarrow F&: f \mapsto \left\langle f[0], f[1] \cdot \texttt {1}, \underline{\varepsilon }\right\rangle \\ F \rightarrow F&: f \mapsto \left\langle f[0], f[1] \cdot \texttt {1}, \texttt {!} \right\rangle . \end{aligned}$$

The insertion step adds arbitrary clauses containing the current literal to the current formula:

$$\begin{aligned} L&\rightarrow L \rightarrow F \rightarrow F: v,w,f \mapsto \langle ||(f[0], \times (\texttt {[}\cdot ||( \times (\texttt {(} \cdot v[0] \cdot \texttt {)} ),\\&\quad \times (\texttt {(} \cdot w[0] \cdot \texttt {)}), \times (\texttt {(} \cdot f[2] \cdot f[1] \cdot \texttt {)}) ) \cdot \texttt {]}) ), f[1], f[2] \rangle . \end{aligned}$$

This rule can be described informally as follows: interleave into the current formula a (locked) clause consisting of three interleaved (locked) literals, the first two of which are arbitrary while the third one is equal to the current literal; literals are enclosed in parentheses while clauses are enclosed in square brackets.

Finally, we have to indicate that the start category can be produced from any F and that the empty formula is satisfiable (with \(\texttt {1}\) as first variable to consider):

$$\begin{aligned} F&: \left\langle \underline{\varepsilon }, \texttt {1}, \underline{\varepsilon }\right\rangle \\ F&: \left\langle \underline{\varepsilon }, \texttt {1}, \texttt {!}\right\rangle \\ F \rightarrow S&: f \mapsto \left\langle f[0] \right\rangle . \end{aligned}$$

Let f be a 3-CNF formula such that \(f \in \mathrm {3SATL}\). As f and \({\hat{f}}\) are equisatisfiable, \({\hat{f}} \in \mathrm {3SATL}\). Let N be such that the set \({\hat{W}}\) of variables of \({\hat{f}}\) is equal to \(\left( x_n\right) _{n \in [\![1,N]\!]}\). Let v be a valuation of \({\hat{W}}\) satisfying \({\hat{f}} =: \left( C_1, \dots , C_{\left|f \right|} \right) \). Let \(\left( E_i\right) _{i \in [\![1,N]\!]} \in {\mathscr {P}}\left( [\![1,\left|f \right|]\!]\right) ^N\) be such that \(\sqcup _{i=1}^N E_i = [\![1,\left|f \right|]\!]\) and for all \(i \in [\![1,N]\!]\), \(n \in [\![1,\left|f \right|]\!]\),

$$\begin{aligned} n \in E_i \Rightarrow \left( v\left( x_i\right) = \top \wedge x_i \in C_n \right) \vee \left( v\left( x_i\right) = \bot \wedge \lnot x_i \in C_n \right) . \end{aligned}$$

In other words, \(E_i\) is a set of clauses such that the current value of \(x_i\) in v makes all clauses in the set reduce to true. This decomposition necessarily exists since \({\hat{f}}\) is satisfiable, but it is in general not unique. Let \(\kappa \left( \top \right) = \underline{\varepsilon }\) and \(\kappa \left( \bot \right) = \texttt {!}\). The string \(\tau \left( {\hat{f}}\right) \) is recognized by G using the following derivations for \(i = 1\) up to N, starting with \(\left\langle \underline{\varepsilon }, \texttt {1}, \kappa \left( v\left( x_1\right) \right) \right\rangle \) of category F:

  • Consider the available item \(\phi = \left\langle f, \texttt {1}^i, \kappa \left( v\left( x_i\right) \right) \right\rangle \) of category F;

  • For all \(n \in E_i\):

    • Without loss of generality, suppose \(v\left( x_i\right) = \top \) and therefore \(\kappa \left( v\left( x_i\right) \right) = \underline{\varepsilon }\),

    • Up to reordering of i, j and k, \(C_n = x_i \vee \sigma _j x_j \vee \sigma _k x_k\),

    • Produce two literals of category L containing \(\sigma _jx_j\) and \(\sigma _kx_k\) respectively,

    • Use them along with \(\phi \) to produce \(\phi ' = \left\langle f', \texttt {1}^i, \underline{\varepsilon }\right\rangle \) where \(f'\) is equal to f up to a clause

      $$\begin{aligned} \texttt {[(}{} \texttt {1}^i\texttt {)(}\kappa \left( \sigma _j\right) \texttt {1}^j\texttt {)(}\kappa \left( \sigma _k\right) \texttt {1}^k\texttt {)]} \end{aligned}$$

      that has been added at a position compatible with the final reordering in \(\tau \left( {\hat{f}}\right) \),

    • Do \(\phi := \phi '\), \(f := f'\);

  • If \(i < N\), produce an item \(\phi := \left\langle f, \texttt {1}^{i+1}, \kappa \left( v\left( x_{i+1}\right) \right) \right\rangle \) of category F;

  • Else, \(i = N\): produce \(\left\langle f \right\rangle \) of category S; terminate.

Conversely, let f be a 3-CNF formula such that \(\tau \left( {\hat{f}}\right) \in \mathrm {L}\left( G\right) \). As f and \({\hat{f}}\) are equisatisfiable, it suffices to prove that \({\hat{f}} \in \mathrm {3SATL}\). Now, it is straightforward to see that subsets \(E_i\) of \([\![1,\left|f \right|]\!]\) verifying the same properties as in the first half of the proof can be constructed by considering the clauses added in f at iteration i. The existence of these subsets guarantees that \({\hat{f}} \in \mathrm {3SATL}\), which concludes the proof. \(\square \)

Our main classification result follows:

Theorem 1

Unless \(\mathrm {P} = \mathrm {NP}\), \(\mathrm {IDLPMCFL} \not \subset \mathrm {PRCL} = \mathrm {P}\).

Proof

By contradiction, suppose that \(\mathrm {IDLPMCFL} \subset \mathrm {P}\). We will now prove that \(\mathrm {3SATL} \in \mathrm {P}\).

Let G be the grammar defined in Lemma 3. Thanks to our hypothesis that \(\mathrm {IDLPMCFL} \subset \mathrm {P}\), the language \(\mathrm {L}\left( G\right) \) recognized by G is in \(\mathrm {P}\). Let T be a (deterministic) Turing machine that recognizes \(\mathrm {L}\left( G\right) \) in polynomial time. Consider the following procedure Poly3SAT:

figure b
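The procedure can be rendered as follows (our own sketch, not the paper's figure): the polynomial-time machine T for \(\mathrm {L}\left( G\right) \) is passed in as a parameter. For testing only, we use a brute-force stand-in for T that decodes the input string back into a 3-CNF and checks satisfiability directly; this stand-in is of course not polynomial.

```python
import re
from itertools import product

# Sketch (not from the paper): Poly3SAT composes the normalization of
# Proposition 8, the encoding tau of Definition 9, and a decision
# procedure T for L(G) supplied as a parameter.

def normalize(clauses):                        # Proposition 8
    names = {}
    clauses = [[(1 if l > 0 else -1) * names.setdefault(abs(l), len(names) + 1)
                for l in c] for c in clauses]
    return [c + c[:1] * (3 - len(c)) for c in clauses]

def tau(clauses):                              # Definition 9
    mu = lambda l: ("!" if l < 0 else "") + "1" * abs(l)
    return "".join("[" + "".join("(%s)" % mu(l) for l in c) + "]"
                   for c in clauses)

def poly3sat(clauses, T):
    return T(tau(normalize(clauses)))          # run T on tau(f-hat)

def standin_T(h):
    """Brute-force stand-in for T, for testing only (NOT polynomial)."""
    decode = lambda lit: (-1 if lit.startswith("!") else 1) * lit.count("1")
    clauses = [[decode(lit) for lit in re.findall(r"\(([^)]*)\)", block)]
               for block in re.findall(r"\[([^\]]*)\]", h)]
    variables = sorted({abs(l) for c in clauses for l in c})
    for bits in product([False, True], repeat=len(variables)):
        v = dict(zip(variables, bits))
        if all(any(v[abs(l)] == (l > 0) for l in c) for c in clauses):
            return True
    return False

assert poly3sat([[1, 2, -3], [-2, -3], [-1, 2, 3]], standin_T)   # f_1
assert not poly3sat([[1], [-1]], standin_T)
```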

First, this procedure runs in polynomial time in the size \(\left|f \right|\) of the input:

  1. 1.

    Computing \({\hat{f}}\) takes time \({\mathscr {O}}\left( \left|f \right|\right) \) according to Proposition 8, and \(\left|{\hat{f}} \right|= \left|f \right|\).

  2. 2.

    Computing \(\tau \left( g\right) \) is also clearly linear in the size \(\left|g \right|= \left|f \right|\) of its input. The size of \(\tau \left( g\right) \in \varSigma ^*\) is given by its length. The set W of variables appearing in g is included in \(\left( x_i\right) _{i \in [\![1,3\left|g \right|]\!]}\) according to Proposition 8. Therefore, \(\left|\nu \left( w\right) \right|\le 3\left|g \right|= 3 \left|f \right|\) for all \(w \in W\). Following Definition 9, we get

    $$\begin{aligned} \left|\tau \left( g\right) \right|\le \left|g \right|\left( 8 + 3 \times \left( 1 + 3\left|f \right|\right) \right) = \left|f \right|\left( 8 + 3 \times \left( 1 + 3 \left|f \right|\right) \right) = {\mathscr {O}}\left( \left|f \right|^2\right) ; \end{aligned}$$
  3.

    Finally, computing \(T\left( h\right) \) takes, by assumption, time \({\mathscr {O}}\left( \left|h \right|^{\alpha }\right) \) for some \(\alpha \in {\mathbb {N}}^+\). As \(\left|h \right|= \left|\tau \left( g\right) \right|= {\mathscr {O}}\left( \left|f \right|^2\right) \), computing \(T\left( h\right) \) takes time \({\mathscr {O}}\left( \left|f \right|^{2\alpha }\right) \).
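As a quick sanity check of the bound computed in step 2, the displayed inequality can be evaluated numerically; this is a throwaway sketch, not part of the proof:

```python
# Numeric sanity check of the bound |tau(g)| <= |f| (8 + 3 (1 + 3 |f|)),
# which expands to 9|f|^2 + 11|f| and is therefore O(|f|^2).
def tau_bound(n):
    """Upper bound on |tau(g)| for an input formula of size n = |f|."""
    return n * (8 + 3 * (1 + 3 * n))

# 9 n^2 + 11 n <= 20 n^2 for every n >= 1, confirming the O(n^2) claim.
assert all(tau_bound(n) == 9 * n**2 + 11 * n for n in range(1, 1000))
assert all(tau_bound(n) <= 20 * n**2 for n in range(1, 1000))
```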

Second, it recognizes \(\mathrm {3SATL}\). This is a direct consequence of Lemma 3: \(f \in \mathrm {3SATL}\) iff \(h = \tau \left( {\hat{f}}\right) \in \mathrm {L}\left( G\right) = \mathrm {L}\left( T\right) \), iff Poly3SAT returns true when applied to f.

We have proved that Poly3SAT recognizes \(\mathrm {3SATL}\) in polynomial time. Hence, \(\mathrm {3SATL} \in \mathrm {P}\). As \(\mathrm {3SAT}\) is \(\mathrm {NP}\)-complete, this yields \(\mathrm {P} = \mathrm {NP}\). \(\square \)

This theorem, combined with results obtained by Ljunglöf (2005), admits a useful corollary:

Theorem 2

Unless \(\mathrm {P} = \mathrm {NP}\), \(\mathrm {PMCFL} \subsetneq \mathrm {IDLPMCFL}\).

Proof

Ljunglöf (2005) proves that \(\mathrm {PMCFL} \subsetneq \mathrm {PRCL} = \mathrm {P}\). According to Theorem 1, unless \(\mathrm {P} = \mathrm {NP}\), we have \(\mathrm {IDLPMCFL} \not \subset \mathrm {P}\). Clearly, \(\mathrm {PMCFL} \subset \mathrm {IDLPMCFL}\), so unless \(\mathrm {P} = \mathrm {NP}\), \(\mathrm {PMCFL} \subsetneq \mathrm {IDLPMCFL}\). \(\square \)

Our results are summarized in Fig. 1. The following questions remain open:

  1.

    Is \(\mathrm {IDLCFL} \subset \mathrm {PMCFL}\)? We have suggested above that the answer is likely to be positive. This is displayed in Fig. 1 by the question mark and the fact that the language \(\tau \left( \widehat{\mathrm {3SATL}}\right) \in \mathrm {IDLPMCFL}{\setminus }\mathrm {PMCFL}\) is placed at the border between \(\mathrm {IDLPMCFL}{\setminus } \left( \mathrm {PMCFL} \cup \mathrm {IDLCFL}\right) \) and \(\mathrm {IDLCFL} \cap \left( \mathrm {IDLPMCFL}{\setminus }\mathrm {PMCFL}\right) \).

  2.

    Is \(\mathrm {IDLPMCFL} \subset \mathrm {CSL}\), i.e., are all \(\mathrm {IDLPMCFL}\)s recognized by some linear bounded automaton?

Independently of the answers to the previous questions, it is already clear that the presence, in a \(\mathrm {CFL}\)-based formalism, of all three \(||\), \(\cdot \) and \(\times \) operators, together with tuples of size at least 2 and copying, is sufficient to leave the realm of \(\mathrm {P}\). As noted in Sect. 1.2, interleaving, linear constraints, locking, records and copying are reasonable requirements for a grammatical formalism designed to describe the syntax of free word order languages in general, and of Classical Latin in particular. This, of course, does not mean that Classical Latin itself is non-polynomial: the reduction presented above is not linguistically motivated, and it involves copying, which Latin does not require. It simply means that a grammatical formalism for free word order languages offering all the features above leads to worst-case non-polynomial scenarios which are not necessarily linguistically relevant.

Fig. 1: \(\mathrm {PMCFL}\), \(\mathrm {IDLCFL}\) and \(\mathrm {IDLPMCFL}\), unless \(\mathrm {P}=\mathrm {NP}\)

3 COMPĀ: A Programming Language for Describing Free Word Order Syntax

3.1 Grammatical Framework and COMPĀ

Grammatical Framework (GF) (Ranta 2004), developed by Ranta et al. since 1998, is a special-purpose programming language aimed at writing grammars of natural languages. Practically, GF serves as the natural-language counterpart of tools such as YACC (Johnson et al. 1975) or Menhir (Pottier and Régis-Gianas 2016) for programming languages. From a logical point of view, Grammatical Framework is a logical framework relying on Martin-Löf type theory (Martin-Löf and Sambin 1984). A functional programming language, GF also offers extensive support for modularity and enforces conventions and standards simplifying the development of multilingual applications. Its community actively contributes to the Resource Grammar Library (Ranta et al. 2009), which unites ‘concrete’ wide-coverage grammars for over 30 individual languages around a common ‘abstract’ grammar. Over the course of the last 20 years, GF, which remains fully open-source, has been used in several experimental as well as industrial contexts, for applications ranging from morphological generation to natural-language transcription of formal (mathematical, proof, technical) languages, and from multilingual translation of ‘controlled language’ to language-learning tools.

This chapter describes COMPĀ, a GF-like programming language tailored to encode the grammatical syntax of free word order languages. Though it has been primarily conceived to model and study the syntax of the Latin language, its design as well as the description we will give are both language-agnostic. The name COMPĀ stands for COMPĀgēs Grammaticālis Latīna, which means ‘Latin Grammatical Framework’ in Latin.

The syntax and semantics of COMPĀ are largely borrowed from standard GF: it is a Haskell-style functional programming language manipulating sets of words/terminals, and providing records and tables over finite types, as well as finite lambda functions (Ranta 2011). As an experimental language focussing on the syntactic description of individual languages, COMPĀ does not implement structures and operators mainly directed at handling morphology, semantics and multilingualism or providing additional modularity, such as abstract grammars, dependent types, token-level gluing or general lambda functions. More precisely, COMPĀ extends a subset of GF known as context-free GF (Ljunglöf 2004). In addition, it provides three operators absent from standard GF, viz. the interleave (\(||\)), disjunction (\(\vee \)) and lock (\(\times \)) operators. While standard GF compiles (mostly for parsing purposes) into Angelov et al. (2009)’s PMCFG-equivalent PGF, COMPĀ can be transcompiled into IDL-PMCFG.

In this section, we will focus on aspects of COMPĀ’s design that differ from standard GF, and show how it can be used as an efficient front-end for writing practical grammars of free word order languages. For a detailed presentation of the syntax of standard GF, the reader can refer to the GF reference manual (Ranta 2011).

3.2 Operations and Types

3.2.1 Data Types

As a language designed to model free word order languages, COMPĀ relies on one fundamental data type, Set, the type of sets of token lists (short: ‘sets’), which replaces the standard GF Str. The basic operators described below take as input, and return, only data of type Set.

Besides the fundamental type Set, each grammar may define an arbitrary number of parameter types. These finite types are often used to encode specific grammatical features (e.g. case, number or gender).

Record types can be built from a list of (distinct) identifier names and a list of types, each of which may be either the type Set or any finite type. Records store structured information and allow for an accurate representation of grammatical constituents (storing some of their features as well as one or several sets that represent their linearization).

Tables are finite functions that map every value of a finite type to a value of some (unique) other type.

Given a set of finite (i.e. enumerated) types \(\varPi \) and the set of admissible string identifiers S, the syntax of admissible types is formally defined as follows (Fig. 2):

Fig. 2: COMPĀ’s types

3.2.2 Operations on Sets

Sets are introduced by means of the standard syntax for strings. Thus, in COMPĀ, the expression consul does not represent the singleton token list \(\left[ \texttt {consul} \right] \) as in GF, but it instead stands for the singleton set \(\left\{ \left[ \texttt {consul} \right] \right\} \). Similarly, COMPĀ’s \(\texttt {[]}\) does not stand as in GF for the empty list of tokens (the empty string), but for the singleton containing the empty list of tokens. Another more practical way to put it is to see this set as the set of possible phrases that can be derived from the expression consul: there is only one, containing one word, the word consul, hence the singleton set above. Note that the empty set of strings has its own syntax, \(\texttt {variants \{\}}\), that is also borrowed from standard GF.
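These denotations can be modelled with ordinary Python sets of tuples; the following is a minimal sketch with illustrative names, not part of the COMPĀ implementation:

```python
# A possible Python model of COMPA's Set type: finite sets of token lists,
# encoded as frozensets of tuples of strings.
consul = frozenset({("consul",)})   # the expression `consul`: one one-word phrase
empty_list = frozenset({()})        # COMPA's [] : the singleton of the empty token list
no_variants = frozenset()           # `variants {}` : the empty set of token lists

# [] denotes one (empty) phrase, whereas variants {} denotes no phrase at all:
assert empty_list != no_variants
assert len(consul) == 1 and ("consul",) in consul
```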

COMPĀ defines four basic operations on sets, which are the exact counterparts of those defined in Nederhof and Satta’s IDL expression formalism (Fig. 3):

Fig. 3: Operations on sets in COMPĀ
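On the flat-set model above, the four operations can be sketched in Python. Names are illustrative; modelling lock as the identity on denoted sets is our simplifying assumption (locking constrains how a set may be matched contiguously during parsing, not which token lists it denotes):

```python
from itertools import product

def cat(a, b):                      # concatenation  e' . e''
    return frozenset(x + y for x, y in product(a, b))

def disj(*es):                      # disjunction    \/(e1, ..., en)
    return frozenset().union(*es)

def shuffle(u, v):                  # all interleavings of two token lists
    if not u or not v:
        return {u + v}
    return ({(u[0],) + w for w in shuffle(u[1:], v)}
            | {(v[0],) + w for w in shuffle(u, v[1:])})

def interleave(a, b):               # interleave     ||(e', e'')
    return frozenset(w for x in a for y in b for w in shuffle(x, y))

def lock(a):                        # lock x(e): denotation unchanged here
    return a

cum_amico = frozenset({("cum", "amico")})
caro = frozenset({("caro",)})
# "caro" may precede, split, or follow the phrase "cum amico":
assert len(interleave(cum_amico, caro)) == 3
```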

3.3 Syntax

3.3.1 Structure of Programs

Each COMPĀ program, called a grammar, is enclosed in a file, with each COMPĀ file defining exactly one such grammar. The standard extension for COMPĀ files is .cp. The syntax of programs is as follows (note that all whitespace and line breaks are ignored) (Fig. 4):

The identifier following the concrete keyword is an (arbitrary) name. We will now go through each of the four sections of the grammar and consider their individual syntaxes.

Fig. 4: Syntax of COMPĀ identifiers and grammars

3.3.2 Including GF Lexica

The first section is used to import existing standard GF lexica into COMPĀ. Its syntax is extremely simple (Fig. 5):

where filename is the name of a concrete GF file functioning as a lexicon. When an include is read, the corresponding GF file is retrieved and all words it defines are automatically extracted. The include section thus provides some compatibility with standard GF as well as support (through the use of GF itself) for efficient morphological analysis.

Fig. 5: Syntax of include

3.3.3 Parameters

Parameters are declared as follows (Fig. 6):

In the above description, \({{\textit{ident}}_0}\) is the name of the new finite type, and \(\left( {\textit{ident}}_k\right) _{k\ge 1}\) its values. Both type and value parameter identifiers must be unique throughout the whole grammar, and are usually (but not necessarily) capitalized.

Fig. 6: Syntax of param

3.3.4 Categories

Categories are introduced in the lincat section according to the syntax below, where paramType is any parameter type defined in the param section (Fig. 7):

Fig. 7: Syntax of lincat

3.3.5 Linearization Functions

The lin section collects the functional rules that are the heart of any GF or COMPĀ grammar. Each linearization rule describes a way to combine several arguments (of given input categories) into a new item (of a given output category). Unlike GF, which separates the type-annotated declaration of linearization functions in abstract syntax files from the non-type-annotated definition of linearization functions in concrete syntax files, COMPĀ uses only a single (concrete) syntax, which is directly annotated with types. COMPĀ includes a complete type-checker.

Let us first formally describe the syntax of linearization functions. In Fig. 8, the non-terminals \( paramType \) and \( paramValue \) match parameter types and values introduced in the param section, whereas the non-terminal \( lincatName \) matches the name of any category defined in the lincat section. In the definition of \( lin \), \( ident_0 \) and \( lincatName_0 \) are respectively the name of the linearization function and its output category, while

$$\begin{aligned} \left( {{\textit{ident}}_k},{{\textit{lincatName}}_k}\right) _{k \ge 1} \end{aligned}$$

are the names and categories of the function’s arguments (Fig. 8).

Fig. 8: Syntax of lin

3.3.6 Iterating over Finite Types with \(\mathtt {for}\)

To handle those cases where similar rules must be constructed for all possible values of a given parameter, COMPĀ provides a loop structure absent from standard GF: the \(\texttt {for-do}\) construction.

Suppose that verb category V has type \(\left\{ s : \mathsf {Tense} \Rightarrow \mathsf {Set} \right\} \) where \(\mathsf {Tense}\) is a finite type enumerating the available tenses in the language, and that we want to write a rule that takes a verb of category V and produces a conjugated verb of category \({\textit{ConjugV}}\) and type \(\left\{ s : \mathsf {Set}; {\textit{tense}} : \mathsf {Tense} \right\} \) that stores a conjugated verb and keeps track of its tense. This can be achieved in COMPĀ as follows:

$$\begin{aligned}&\texttt {conjugateVerb (v : V) : ConjugV}\\&\quad \texttt {= for t : Tense do \{ s = v.s ! t; tense = t \};} \end{aligned}$$

When translated into low-level IDL-PMCFG, this results in several parallel rules being constructed, one for each available value of the bound variable. This is especially useful when a parameter (e.g. a verb tense or mood) provides different linearizations without playing any part in the syntactic structure itself, or when another parameter (e.g. number) can be arbitrarily chosen at some syntactic level before being propagated downwards into the tree.
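The expansion can be illustrated in Python; the parameter values and the rule representation below are illustrative only, not the actual IDL-PMCFG output of the compiler:

```python
# Expanding a COMPA-style `for t : Tense do ...` rule into one low-level rule
# per parameter value, as in the conjugateVerb example above.
TENSE = ["Pres", "Impf", "Fut"]          # a hypothetical finite Tense type

def expand_for(param_values, make_rule):
    """Produce one specialized rule per value of the bound parameter."""
    return [make_rule(v) for v in param_values]

rules = expand_for(
    TENSE,
    # each specialized rule selects field s!t and records tense = t:
    lambda t: {"name": f"conjugateVerb_{t}", "select": ("s", t), "tense": t},
)
assert len(rules) == len(TENSE)
assert rules[0]["name"] == "conjugateVerb_Pres"
```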

3.4 The COMPĀ (Trans)Compiler

Just as standard GF must be compiled into low-level PMCFG for parsing purposes, the COMPĀ language is used as a grammar description front-end that has to be translated into IDL-PMCFG before parsing.

Using OCaml, we implemented a lightweight transcompiler that type-checks and converts a COMPĀ grammar into an equivalent IDL-PMCFG grammar. The essential conversion step employs finite function resolution techniques similar to those presented by Ljunglöf (2004): tables and parameter fields are replaced by new fields and categories, and new rules are finally created between new categories.

The compiler’s source code can be found in the corresponding GitHub repository.Footnote 5

4 A Parsing Algorithm for COMPĀ

In this section, we present a parsing algorithm for IDL-PMCF grammars and provide an analysis of its complexity. This algorithm, for which we also provide a complete OCaml implementation, is inspired by the works of Ljunglöf (2012) and Angelov (2009) on GF parsing, while building on techniques introduced by Nederhof and Satta (2004) to parse IDL expressions. We extend Nederhof and Satta’s graph-based finite-state approach, enriching it by decorating active nodes with sets of word positions.

4.1 Parsing COMPĀ’s IDL Expressions

Nederhof and Satta (2004) present a parsing algorithm for IDL expressions relying on left-to-right scanning of the input and a representation of the current parsing state as a cut (a set of nodes verifying certain properties, that does not necessarily match the traditional graph-theoretical definition of a cut—see below) within a so-called IDL graph. Each IDL expression is compiled to a single IDL graph. Transitions from one state/cut to another state/cut are encoded in the IDL graph; edges are labelled with words that must be read to transition from one cut to another. The input is parsed successfully if and only if the final state is reached after all characters have been read.

Unlike in the original publication, where IDL expressions were used as autonomous regular expressions rather than within a grammar, the edges of COMPĀ’s IDL expressions may be annotated both by terminals, i.e. words, and by (nonterminal, index) pairs. The latter labelling corresponds to the case where we want to match a field of one of the arguments of the current rule.

Let us now define the IDL graph associated with a given IDL expression. Note that this definition, though closely following the lines of Nederhof and Satta’s contribution, does not encode the lock operator in the same way. This different encoding has proved more practical for the parsing of full IDL-PMCF grammars, as will become apparent when we discuss our algorithm.

Definition 10

(IDL graph) Let \(G = \left( N, \delta , \varSigma , F, P, S\right) \) be an IDL-PMCFG.

Let \(\left( f, A_1, \dots , A_q, A\right) \in P\). Let e be an IDL expression over \(\varSigma ' = \varSigma \cup \left\{ X_{ij} \mid i \le a\left( f\right) , j \le d_i\left( f\right) \right\} \).

The IDL graph \(\gamma _e\) associated with e is defined by induction as follows:

  • If \(e = a \in \varSigma ' \cup \left\{ \underline{\varepsilon }\right\} \): figure c;

  • If \(e = e'\cdot e''\): figure d;

  • If \(e = \times \left( e'\right) \): figure e;

  • If \(e = \vee \left( e_1,\dots ,e_n\right) \): figure f;

  • If \(e = ||\left( e_1,\dots ,e_n\right) \): figure g.

As an example, the IDL graph associated with the IDL expression describing the valid permutations of Marcus cum amico caro ambulat is:Footnote 6

Fig. 9: IDL graph for Marcus cum amico caro ambulat

The parsing process of sentence Caro cum amico Marcus ambulat with the IDL graph presented in Fig. 9 is given in Fig. 10.

Given an IDL expression and its IDL graph, we also define the set of its cuts, that will serve as states in the parsing algorithm.

Definition 11

(Cuts of an IDL expression) Let \(G = \left( N, \delta , \varSigma , F, P, S\right) \) be an IDL-PMCFG.

Let \(\left( f, A_1, \dots , A_q, A\right) \in P\). Let e be an IDL expression over \(\varSigma ' = \varSigma \cup \left\{ X_{ij} \mid i \le a\left( f\right) , j \le d_i\left( f\right) \right\} \), and \(\gamma _e\) its IDL graph.

The set of cuts of e, \(C_e \subset {\mathscr {P}}\left( V(\gamma _e)\right) \), where \(V(\gamma _e)\) denotes the set of vertices of \(\gamma _e\), is defined by induction as follows:

  • If \(e = a \in \varSigma ' \cup \left\{ \underline{\varepsilon }\right\} \), \(C_e = \left\{ \left\{ s_e \right\} , \left\{ f_e \right\} \right\} \);

  • If \(e = e'\cdot e''\), \(C_e = \left\{ \left\{ s_e \right\} , \left\{ f_e \right\} \right\} \cup C_{e'} \cup C_{e''}\);

  • If \(e = \times \left( e'\right) \), \(C_e = \left\{ \left\{ s_e \right\} , \left\{ f_e \right\} \right\} \cup C_{e'}\);

  • If \(e = \vee \left( e_1,\dots ,e_n\right) \), \(C_e = \left\{ \left\{ s_e \right\} , \left\{ f_e \right\} \right\} \cup \bigcup _{k=1}^n C_{e_k}\);

  • If \(e = ||\left( e_1,\dots ,e_n\right) \), \(C_e= \left\{ \left\{ s_e \right\} , \left\{ f_e \right\} \right\} \cup \prod _{k=1}^n C_{e_k}\), where \(\prod \) denotes the n-ary Cartesian product.
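The induction above translates directly into code. In the following Python sketch (our toy representation, not the OCaml implementation), expressions are tagged tuples, fresh start/final vertices play the roles of \(s_e\) and \(f_e\), and the tuples of the Cartesian product in the \(||\) case are flattened into node sets, which is an implementation convenience:

```python
from itertools import product, count

_fresh = count()                             # generator of fresh vertex ids

def cuts(e):
    """Compute the set of cuts of a toy IDL expression, per Definition 11."""
    s, f = next(_fresh), next(_fresh)        # start/final vertices s_e, f_e
    base = {frozenset({s}), frozenset({f})}
    kind = e[0]
    if kind == "atom":                       # e = a
        return base
    if kind == "cat":                        # e = e' . e''
        return base | cuts(e[1]) | cuts(e[2])
    if kind == "lock":                       # e = x(e')
        return base | cuts(e[1])
    if kind == "or":                         # e = \/(e1, ..., en)
        return base.union(*(cuts(x) for x in e[1]))
    if kind == "par":                        # e = ||(e1, ..., en): n-ary product
        subs = [cuts(x) for x in e[1]]
        return base | {frozenset().union(*combo) for combo in product(*subs)}
    raise ValueError(kind)

# ||(a, b): the 2 x 2 product cuts plus the outer start/final cuts = 6 cuts.
assert len(cuts(("par", [("atom", "a"), ("atom", "b")]))) == 6
```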

Fig. 10: Example of parsing Caro cum amico Marcus ambulat with the IDL graph from Fig. 9

IDL graphs can be regarded as automata recognizing a given regular expression by reading it left-to-right, allowing for parallel reading of several interleaved substrings. The initial and final cuts are composed of a single node. Split edges marked by \(\vdash _n\) cause several branches to be explored in parallel (thus increasing the cardinality of the current cut by \(n-1\) elements), while merge edges marked by \(\dashv _n\) allow n nodes in the old cut to be replaced by one single node in the new cut. Labelled edges can be used to replace the node on the left-hand side of the edge by the node on the right-hand side of the edge in the current cut, provided that the terminal or nonterminal labelling the edge can be read at the current position. Epsilon-labelled edges (also known as \(\varepsilon \)-transitions) can be taken under no specific assumption, provided that the left-hand-side node of the edge is in the current cut. They are especially used to encode disjunction nodes, which do not result in several branches being taken at the same time, but in only one of them being chosen. The special lock edges, which were absent from Nederhof and Satta’s original publication, will be discussed later.

An additional degree of complexity has to be dealt with in the context of IDL-PMCFG: we have to check that the substrings matched by the various nonterminals previously read are compatible with the constraints imposed on word order or interleaving. Therefore, throughout the execution of the algorithm, the current state of the parsing process within each IDL graph must be carefully maintained. Any field of any input category can match an arbitrary (and not necessarily contiguous) substring of the input. Moreover, given the ability of the formalism to encode nested lock constructions, an arbitrary number of such position sets must be remembered. This suggests the state space presented in Definition 13.

We first formally define a notion of stacks over an arbitrary set.

Definition 12

(Stack over a set) For any set S, \(\mathsf {Stack}\left( S\right) \) is the set of stacks over S endowed with the two canonical primitives

$$\begin{aligned} \mathsf {pop}: \mathsf {Stack}\left( S\right)&\rightarrow \left( \mathsf {Stack}\left( S\right) \times S\right) , \\ h::t&\mapsto \left( t, h\right) \\ \mathsf {push}: \left( \mathsf {Stack}\left( S\right) \times S\right)&\rightarrow \mathsf {Stack}\left( S\right) \\ \left( s, e\right)&\mapsto e::s \end{aligned}$$

and an additional primitive defined only on non-empty stacks (see also Fig. 11).

$$\begin{aligned} \mathsf {applyHead}: \left( \mathsf {Stack}\left( S\right) \times \left( S \rightarrow S \right) \right)&\rightarrow \mathsf {Stack}\left( S\right) \\ \left( h::t, f\right)&\mapsto f(h)::t \end{aligned}$$
Fig. 11: Effect of primitive applyHead
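The three primitives of Definition 12 can be sketched in Python with lists whose head is the first element; this toy model is ours and is not the OCaml implementation:

```python
# Stacks over position sets: Python lists with the head at index 0.
def pop(stack):
    """pop: return (tail, head)."""
    return (stack[1:], stack[0])

def push(stack, e):
    """push: add e on top of the stack."""
    return [e] + stack

def apply_head(stack, f):
    """applyHead: rewrite the head of a non-empty stack, keep the tail."""
    return [f(stack[0])] + stack[1:]

s = push(push([], {1, 2}), set())           # stack: head {} above {1, 2}
s = apply_head(s, lambda h: h | {3})        # record position 3 in current branch
assert s == [{3}, {1, 2}]
assert pop(s) == ([{1, 2}], {3})
```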

The state space of an IDL expression can now be defined.

Definition 13

(State space of an IDL expression) Let \(G = \left( N, \delta , \varSigma , F, P, S\right) \) be an IDL-PMCFG.

Let \(\left( f, A_1, \dots , A_q, A\right) \in P\). Let e be an IDL expression over \(\varSigma ' = \varSigma \cup \left\{ X_{ij} \mid i \le a\left( f\right) , j \le d_i\left( f\right) \right\} \) and \(C_e\) its cuts.

The general state space of e is defined as

$$\begin{aligned} {\mathscr {S}}_{e} := \bigcup _{c \in C_e} \left\{ \left( c, \vec {s}\right) \mid \vec {s} \in \left( \mathsf {Stack}\left( {\mathscr {P}}_f\left( {\mathbb {N}}\right) \right) \right) ^c \right\} \end{aligned}$$

and, for \(t \in \varSigma ^*\), the t-specific state space of e as

$$\begin{aligned} {\mathscr {S}}_{e,t} := \bigcup _{c \in C_e} \left\{ \left( c, \vec {s}\right) \mid \vec {s} \in \left( \mathsf {Stack}\left( {\mathscr {P}}_f\left( [\![1,\left|t \right|]\!]\right) \right) \right) ^c \right\} \end{aligned}$$

where \({\mathscr {P}}_f\left( {\mathbb {N}}\right) \) denotes the set of finite subsets of \({\mathbb {N}}\).

Informally, each state of an IDL parsing item is a pair \(\left( c,\vec {\sigma }\right) \) where c is a cut and \(\vec {\sigma }\) is a map from the nodes of this cut to stacks of position sets. These stacks store the positions of the terminals (words) that have already been read by the automaton when the current state is reached. Using a stack allows us to distinguish word positions that were matched in the current branch or at the current level of nested locks, as opposed to words matched before the last split or outside of the current level of nested locks. When a set of split transitions is taken, each of the new nodes added to the cut stores an independent copy of the previous stack, extended with an \(\emptyset \) head. During the processing of the current branch, positions matched in the same branch are added to the head of the stack, while non-head elements store information from previous branches. When a set of merge transitions is taken, the heads of the various stacks are first merged together (ensuring that no contradiction occurs) and then merged with the second element of all stacks (to take into account the closing of the current parallel processing and to check, again, that no impossibility arises). With this technique, we can also give a simple semantics to the \(\uparrow \) and \(\downarrow \) edges: when an \(\uparrow \) transition is taken, an \(\emptyset \) is pushed onto the current stack; when a \(\downarrow \) transition is taken, we check whether the head element of the current stack is an interval and, if it is the case (and no incompatibility arises), we merge it with the second element.
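The merge step described above can be sketched as follows. This toy version (our own, built on list stacks with the head first) checks only the disjointness condition and deliberately omits the linear-precedence test:

```python
# Merging the stacks of n parallel branches at a merge edge: the stacks share
# a common tail; their heads must be pairwise disjoint and are folded into
# the head of the common tail (the second element of each branch stack).
def merge_branches(stacks):
    heads = [s[0] for s in stacks]
    total = len(set().union(*heads))
    if total != sum(len(h) for h in heads):
        return None                          # overlapping branches: reject
    tail = stacks[0][1:]                     # common tail below the heads
    return [tail[0] | set().union(*heads)] + tail[1:]

left  = [{1, 3}, {0}]                        # branch 1 read positions 1 and 3
right = [{2}, {0}]                           # branch 2 read position 2
assert merge_branches([left, right]) == [{0, 1, 2, 3}]
assert merge_branches([left, [{3}, {0}]]) is None   # position 3 read twice
```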

The incompatibilities mentioned above can be of two kinds: either the same positions have been read in two different branches, which can therefore not be merged; or what has been parsed does not respect the principle that an IDL graph, within the same branch, parses its input from left to right.

To formalize this, we introduce a partial order on sets of positions as well as a useful predicate:

Definition 14

The relation \(\prec \) is defined by

$$\begin{aligned} \forall \left( A, B\right) \in {\mathscr {P}}_f\left( {\mathbb {N}}\right) ^2, A \prec B \Leftrightarrow \forall \left( a,b\right) \in A \times B, a < b. \end{aligned}$$

Note that for any \(A \in {\mathscr {P}}_f\left( {\mathbb {N}}\right) \), \(\emptyset \prec A\) and \(A \prec \emptyset \); in particular, \(\emptyset \prec \emptyset \).

Definition 15

The predicate \(\mathsf {interval}\) is defined by

$$\begin{aligned} \forall A \in {\mathscr {P}}_f\left( {\mathbb {N}}\right) , \mathsf {interval}\left( A\right) \Leftrightarrow \forall \left( a,b\right) \in A^2, \forall c \in {\mathbb {N}}, a \le c \le b \Rightarrow c \in A. \end{aligned}$$

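Both notions translate directly into code; a small Python sketch (treating the empty set as vacuously preceding everything, and as an interval, follows the conventions just stated):

```python
# Sketches of the precedence relation (Definition 14) and the connexity
# (interval) predicate (Definition 15) on finite sets of positions.
def prec(a, b):
    """A < B iff every position of A lies left of every position of B."""
    return all(x < y for x in a for y in b)   # vacuously true if A or B is empty

def interval(a):
    """A is an interval iff it contains every integer between its min and max."""
    return not a or set(range(min(a), max(a) + 1)) == set(a)

assert prec({1, 2}, {4, 5}) and not prec({1, 5}, {4})
assert prec(set(), {7}) and prec({7}, set()) and prec(set(), set())
assert interval({2, 3, 4}) and not interval({2, 4})
```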
We can finally define a transition relation between states:

Definition 16

(Transition relation of an IDL expression) Let \(G = \left( N, \delta , \varSigma , F, P, S\right) \) be an IDL-PMCFG.

Let \(\left( f, A_1, \dots , A_q, A\right) \in P\). Let e be an IDL expression over \(\varSigma ' = \varSigma \cup \left\{ X_{ij} \mid i \le a\left( f\right) , j \le d_i\left( f\right) \right\} \), \(\gamma _e = \left( V_e, E_e\right) \) its IDL graph, \(C_e\) its cuts and \({\mathscr {S}}_e\) its states. The relation is the smallest relation verifying the following axioms:

  1.

    For all \(\left( s = \left( c,\vec {s}\right) , a, \pi \right) \in \varOmega \), if there exists \(\left( v_1, v_2, r\right) \in V_e^2 \times {\mathscr {P}}\left( V_e\right) \) such that \(c = r \sqcup \left\{ v_1 \right\} \), \(v_1 \overset{a}{\longrightarrow } v_2 \in E_e\) and \(\mathsf {pop}\left( \vec {s}\left( v_1\right) \right) _1 \prec \pi \), then \(\varDelta _e\left( s, a, \pi , \left( r \cup \left\{ v_2 \right\} , \vec {s}'\right) \right) \) where \(\vec {s}' \in \mathsf {Stack}\left( {\mathscr {P}}_f\left( {\mathbb {N}}\right) \right) ^{r \cup \left\{ v_2 \right\} }\), \(\vec {s}'\mid _r = \vec {s}\mid _r\) and \(\vec {s}'\left( v_2\right) = \mathsf {applyHead}\left( \vec {s}\left( v_1\right) , \pi ' \mapsto \pi \cup \pi '\right) \); graphically:

    figure h

    this axiom encodes the fact that, when reading the (non)terminal a at position set \(\pi \succ \pi '\) (meaning that \(\pi \) is located right of the previously read position set \(\pi '\)), we can update the current cut by replacing the node on the LHS of any edge labelled with a by the node on its RHS and appending the position set \(\pi \) to the positions stored on the top of the stack.

  2.

    For all \(\left( s = \left( c,\vec {s}\right) , \underline{\varepsilon }, \emptyset \right) \in \varOmega \), if there exists \(\left( v_1, v_2, r\right) \in V_e^2 \times {\mathscr {P}}\left( V_e\right) \) such that \(c = r \sqcup \left\{ v_1 \right\} \) and \(v_1 \overset{\uparrow }{\longrightarrow } v_2 \in E_e\), then \(\varDelta _e\left( s, \underline{\varepsilon }, \emptyset , \left( r \cup \left\{ v_2 \right\} , \vec {s}'\right) \right) \) where \(\vec {s}' \in \mathsf {Stack}\left( {\mathscr {P}}_f\left( {\mathbb {N}}\right) \right) ^{r \cup \left\{ v_2 \right\} }\), \(\vec {s}'\mid _r = \vec {s}\mid _r\) and \(\vec {s}'\left( v_2\right) = \mathsf {push}\left( \vec {s}\left( v_1\right) ,\emptyset \right) \); graphically:

    figure i

    an \(\uparrow \)-edge can always be used to replace the node on the LHS of the edge by the node on its RHS, pushing an empty position set on the top of the corresponding stack —this is used to isolate the parsing of locked subexpressions, which are finally tested for connexity through a \(\downarrow \)-edge;

  3.

    For all \(\left( s = \left( c,\vec {s}\right) , \underline{\varepsilon }, \emptyset \right) \in \varOmega \), if there exists \(\left( v_1, v_2, r\right) \in V_e^2 \times {\mathscr {P}}\left( V_e\right) \) such that \(c = r \sqcup \left\{ v_1 \right\} \), \(v_1 \overset{\downarrow }{\longrightarrow } v_2 \in E_e\), \(\mathsf {interval}\left( \mathsf {pop}\left( \vec {s}\left( v_1\right) \right) _1\right) \) and \(\mathsf {pop}\left( \mathsf {pop}\left( \vec {s}\left( v_1\right) \right) _0\right) _1 \prec \mathsf {pop}\left( \vec {s}\left( v_1\right) \right) _1\), then \(\varDelta _e\left( s, \underline{\varepsilon }, \emptyset , \left( r \cup \left\{ v_2 \right\} , \vec {s}'\right) \right) \) where \(\vec {s}' \in \mathsf {Stack}\left( {\mathscr {P}}_f\left( {\mathbb {N}}\right) \right) ^{r \cup \left\{ v_2 \right\} }\), \(\vec {s}'\mid _r = \vec {s}\mid _r\) and

    $$\begin{aligned} \vec {s}'\left( v_2\right) = \mathsf {push}\left( \mathsf {pop}\left( \mathsf {pop}\left( \vec {s}\left( v_1\right) \right) _0\right) _0, \mathsf {pop}\left( \mathsf {pop}\left( \vec {s}\left( v_1\right) \right) _0\right) _1 \cup \mathsf {pop}\left( \vec {s}\left( v_1\right) \right) _1\right) ; \end{aligned}$$

    graphically:

    figure j

    the \(\downarrow \)-edges are used at the end of locked subexpressions: the position set on the top of the stack, which stores the positions used in the current locked branch, is tested for connexity (with the \(\mathsf {interval}\) primitive) and linear precedence (the newly closed locked branch at positions \(\pi \) must be located right of the previously read positions \(\pi ''\)); if both tests succeed, the node on the LHS of the edge is replaced by its RHS and both position sets are merged;

  4.

    For all \(\left( s = \left( c,\vec {s}\right) , \underline{\varepsilon }, \emptyset \right) \in \varOmega \), if there exists \(n \in {\mathbb {N}}\), \(\left( v_0, v_1, \dots , v_n, r\right) \in V_e^{n+1} \times {\mathscr {P}}\left( V_e\right) \) such that the \(v_i\) are distinct, \(c = r \sqcup \left\{ v_0 \right\} \) and \(\forall i \in \left\{ 1, \dots , n \right\} , v_0 \overset{\vdash _n}{\longrightarrow } v_i \in E_e\), then \(\varDelta _e\left( s, \underline{\varepsilon }, \emptyset , \left( r \cup \left\{ v_1, \dots , v_n \right\} , \vec {s}'\right) \right) \) where \(\vec {s}' \in \mathsf {Stack}\left( {\mathscr {P}}_f\left( {\mathbb {N}}\right) \right) ^{r \cup \left\{ v_1, \dots , v_n \right\} }\), \(\vec {s}'\mid _r = \vec {s}\mid _r\) and \(\forall i \in \left\{ 1, \dots , n \right\} , \vec {s}'\left( v_i\right) = \mathsf {push}\left( \vec {s}\left( v_0\right) ,\emptyset \right) \); graphically:

    figure k

    as soon as the current cut contains the LHS of a set of split (i.e. \(\vdash _n\)) edges, this axiom opens n parallel (interleaved) branches, replacing the LHS node by n RHS nodes, all of which come with the same stack as previously, except for an additional empty position set on top, which will later isolate the positions read in the various parallel branches;

  5.

    For all \(\left( s = \left( c,\vec {s}\right) , \underline{\varepsilon }, \emptyset \right) \in \varOmega \), if there exists \(n \in {\mathbb {N}}\), \(\left( v_0, v_1, \dots , v_n, r\right) \in V_e^{n+1} \times {\mathscr {P}}\left( V_e\right) \) such that:

    • The \(v_i\) are distinct,

    • We have \(c = r \sqcup \left\{ v_1, \dots , v_n \right\} \),

    • For all \(i \in \left\{ 1, \dots , n \right\} \), \(v_i \overset{\dashv _n}{\longrightarrow } v_0 \in E_e\),

    • For all \(\left( i,j\right) \in \left\{ 1, \dots , n \right\} ^2\), \(i \ne j \Rightarrow \mathsf {pop}\left( \vec {s}\left( v_i\right) \right) _1 \cap \mathsf {pop}\left( \vec {s}\left( v_j\right) \right) _1 = \emptyset \),

    • We have \(\mathsf {pop}\left( \mathsf {pop}\left( \vec {s}\left( v_1\right) \right) _0\right) _1 \prec \bigsqcup _{i=1}^n \mathsf {pop}\left( \vec {s}\left( v_i\right) \right) _1 =: m\);

    then \(\varDelta _e\left( s, \underline{\varepsilon }, \emptyset , \left( r \cup \left\{ v_0 \right\} , \vec {s}'\right) \right) \) where \(\vec {s}' \in \mathsf {Stack}\left( {\mathscr {P}}_f\left( {\mathbb {N}}\right) \right) ^{r \cup \left\{ v_0 \right\} }\), \(\vec {s}'\mid _r = \vec {s}\mid _r\) and

    $$\begin{aligned} \vec {s}'\left( v_0\right) = \mathsf {applyHead}\left( \mathsf {pop}\left( \vec {s}\left( v_1\right) \right) _0, \pi \mapsto \pi \cup m \right) ; \end{aligned}$$

    graphically:

    figure l

    at the end of a series of parallel branches marked with merge (i.e. \(\dashv _n\)) edges (closing the parsing of an \(||\) node), this axiom checks that the position sets matched by the various parallel branches are compatible (disjoint) and that these positions, when merged, are compatible with previously matched positions (i.e. located right of them); in this case, it replaces the set of nodes on the LHS by the single node on the RHS of the merge edges.

Although these rules would essentially suffice to describe the parsing algorithm if every non-terminal appearing in a rule appeared exactly once, the fact that the same non-terminal may appear several times (copying) or not appear at all (erasure) requires us to keep track of partial parsing contexts in which each argument may or may not have already been identified. We do this by introducing so-called context tables. A context table is a partial function that associates with some arguments of the rule a partial mapping between some fields of these arguments and position sets. It helps us remember which arguments have already been fixed and which ones can still be chosen freely. For each argument that has already been fixed, it retains which of its fields are available and what positions in the input string are matched by each available field.
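A context table can be sketched as a nested partial map. The helper below (our own representation, not the formal definition that follows) illustrates only the disjointness check performed when a fresh argument is matched:

```python
# A context table: argument index -> (field index -> set of input positions).
def disjoint_from_table(gamma, positions):
    """Check a candidate position set against every position already recorded
    in the context table (the 'new argument' case)."""
    used = set().union(
        *(p for fields in gamma.values() for p in fields.values()), set())
    return not (positions & used)

gamma = {1: {0: {2, 3}}, 2: {1: {5}}}   # arg 1, field 0 -> {2,3}; arg 2, field 1 -> {5}
assert disjoint_from_table(gamma, {0, 1})       # fresh positions: compatible
assert not disjoint_from_table(gamma, {3, 4})   # position 3 already used
```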

Definition 17

(Context table) Let \(G = \left( N, \delta , \varSigma , F, P, S\right) \) be an IDL-PMCFG.

Let \(\left( f, A_1, \dots , A_q, A\right) \in P\). Let e be an IDL expression over \(\varSigma ' = \varSigma \cup \left\{ X_{ij} \mid i \le a\left( f\right) , j \le d_i\left( f\right) \right\} \), \(\gamma _e = \left( V_e, E_e\right) \) its IDL graph, \(C_e\) its cuts and \({\mathscr {S}}_e\) its states.

We call context table for e any \(\varGamma \in [\![1,q]\!] \rightharpoonup \left( {\mathbb {N}} \rightharpoonup {\mathscr {P}}_f\left( {\mathbb {N}}\right) \right) \) such that for all \(i \in {\mathscr {D}}\left( \varGamma \right) \), \({\mathscr {D}}\left( \varGamma \left( i\right) \right) \subset [\![1,\delta \left( A_i\right) ]\!]\). The set of context tables for e is denoted by \({\mathscr {G}}_e\).

The set of context tables is equipped with three primitives defined as follows:

  • For any input string s, the relation \({\mathsf {compat}}_s \subset \varXi := {\mathscr {G}}_e \times [\![1,q]\!] \times {\mathbb {N}} \times \left( {\mathbb {N}} \rightharpoonup {\mathscr {P}}_f\left( {\mathbb {N}}\right) \right) \) is such that

    $$\begin{aligned}&\forall \left( \varGamma , k, \ell , \vec {r}\right) \in \varXi ,~{\mathsf {compat}}_s\left( \varGamma ,k,\ell ,\vec {r}\right) \\&\quad \Leftrightarrow \left\{ \begin{array}{ll} \forall u \in {\mathscr {D}}\left( \vec {r}\right) , \forall v \in {\mathscr {D}}\left( \varGamma \right) , \forall w \in {\mathscr {D}}\left( \varGamma \left( v\right) \right) , \vec {r}_u \cap \varGamma \left( v\right) _w = \emptyset &{} \text {if}~k \not \in {\mathscr {D}}\left( \varGamma \right) \\ \ell \in {\mathscr {D}}\left( \varGamma \left( k\right) \right) \cap {\mathscr {D}}\left( \vec {r}\right) ~\text {and}~s_{\varGamma \left( k\right) _{\ell }} = s_{\vec {r}_{\ell }} &{} \text {otherwise} \end{array} \right. ; \end{aligned}$$

    in other words, \({\mathsf {compat}}\) checks that the current context table \(\varGamma \) is compatible with mapping field \(\ell \) of argument k to position set \(\vec {r}_{\ell } =: \pi \), which is the case iff either (i) [new nonterminal \(\left( k,\ell \right) \) matched] k is not in the context and \(\pi \) does not intersect any of the position sets stored in \(\varGamma \), or (ii) [copy of an already matched nonterminal] k is already in the context, mapped to a partial function \(\varGamma \left( k\right) \), which itself maps field index \(\ell \) to a position set \(\varGamma \left( k\right) _{\ell } =: \pi '\) such that the substring \(s_{\pi }\) is the same as \(s_{\pi '}\);

  • The map \({\mathsf {reserve}} : \varPsi \rightarrow {\mathscr {G}}_e\), where \(\varPsi := {\mathscr {G}}_e \times [\![1,q]\!] \times \left( {\mathbb {N}} \rightharpoonup {\mathscr {P}}_f\left( {\mathbb {N}}\right) \right) \), is defined as

    $$\begin{aligned}&\forall \left( \varGamma , k, \vec {r}\right) \in \varPsi , {\mathsf {reserve}}\left( \varGamma ,k,\vec {r}\right) \\&\quad = \left\{ \begin{array}{ll} \left[ \bigcup _{u \in {\mathscr {D}}\left( \varGamma \right) } \left\{ u \mapsto \varGamma \left( u\right) \right\} \right] \cup \left\{ k \mapsto \vec {r}\right\} &{} \text {if}\;k \not \in {\mathscr {D}}\left( \varGamma \right) \\ \varGamma &{} \text {otherwise} \end{array} \right. ; \end{aligned}$$

    this map registers k in \(\varGamma \), mapping it to \(\vec {r}\), iff k is not yet in the context;

  • For any input string s, \({\mathsf {unify}}_s\subset {\mathscr {G}}_e^3\) is such that

    $$\begin{aligned}&\forall \left( \varGamma , \varGamma ', \varGamma ''\right) \in {\mathscr {G}}_e^3,~{\mathsf {unify}}_s\left( \varGamma ,\varGamma ',\varGamma ''\right) \\&\quad \Leftrightarrow \forall k \in {\mathscr {D}}\left( \varGamma \right) , \forall \ell \in {\mathscr {D}}\left( \varGamma \left( k\right) \right) , {\mathsf {compat}}_s\left( \varGamma ', k, \ell , \varGamma \left( k\right) \right) \\&\quad \wedge \left[ \varGamma '' = \bigcup _{k \in {\mathscr {D}}\left( \varGamma '\right) } {\mathsf {reserve}}\left( \varGamma ,k,\varGamma '\left( k\right) \right) \right] ; \end{aligned}$$

    primitive \({\mathsf {unify}}\) identifies triplets of contexts \(\varGamma \), \(\varGamma '\) and \(\varGamma ''\) such that \(\varGamma ''\) can be obtained from \(\varGamma \) and \(\varGamma '\) by first (i) adding to \(\varGamma \) all matchings \(k \mapsto \vec {r}\) from \(\varGamma '\) for which k is outside of the domain of \(\varGamma \) and then (ii) checking that, for all k such that there exist \(k \mapsto \vec {r} \in \varGamma \) and \(k \mapsto \vec {r}' \in \varGamma '\), \(\vec {r}\) and \(\vec {r}'\) define the same fields and map them to identical substrings.

The semantics of the three primitives are rather natural. First, \({\mathsf {compat}}\) indicates whether the assertion “the \(\ell \)th field of argument k can be identified in the input string at position set \(\vec {r}_{\ell }\)” is compatible with all prior decisions stored in the current context. This is possible iff either the kth argument has never been matched before, or the \(\ell \)th field of the kth argument already recorded in the current context matches the same substring as position set \(\vec {r}_{\ell }\). Once compatibility has been established, \({\mathsf {reserve}}\) is used to update the context table with the newly matched item. Finally, \({\mathsf {unify}}\) allows us to compute the union of two contexts that do not interfere with each other.
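To make these definitions concrete, here is a minimal Python sketch of the three primitives. The representation choices are ours, not part of the COMPĀ implementation: a context table is a nested dict mapping argument indices to field-index-to-position-set maps, position sets are frozensets of input positions, and `unify` returns the merged context or `None` instead of being a ternary relation.

```python
def substring(s, positions):
    """Substring of s selected by a (possibly non-contiguous) position set."""
    return tuple(s[i] for i in sorted(positions))

def compat(s, ctx, k, l, r):
    """Is mapping field l of argument k to position set r[l] compatible with ctx?"""
    if k not in ctx:
        # new argument: its position sets must not overlap anything stored in ctx
        return all(ru.isdisjoint(pw)
                   for ru in r.values()
                   for fields in ctx.values()
                   for pw in fields.values())
    # copy of an already matched argument: field l must denote the same substring
    return (l in ctx[k] and l in r
            and substring(s, ctx[k][l]) == substring(s, r[l]))

def reserve(ctx, k, r):
    """Register argument k with field map r, unless k is already in ctx."""
    if k in ctx:
        return ctx
    new = dict(ctx)
    new[k] = r
    return new

def unify(s, ctx1, ctx2):
    """Union of two non-interfering contexts, or None if they clash."""
    for k, fields in ctx1.items():
        for l in fields:
            if not compat(s, ctx2, k, l, fields):
                return None
    out = ctx1
    for k, r in ctx2.items():
        out = reserve(out, k, r)
    return out
```

Note how the copy case compares substrings, not positions: two disjoint occurrences of the same token sequence are compatible, which is exactly what copying in PMCFG requires.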

4.2 Parsing COMPĀ Grammars

Our algorithm is inspired by Earley-style parsers designed to parse context-free GF or PMCFG grammars (Angelov et al. 2009; Ljunglöf 2012). The input is read from left to right and three different kinds of items are built bottom-up. The structure of the above context tables, which contain all essential information about the arguments of each rule and their positions, makes it easier to recursively reconstruct all valid parse trees of a given input.

The three types of items we use are:

Definition 18

(Parsing items) Let \(G = \left( N, \delta , \varSigma , F, P, S\right) \) be an IDL-PMCFG.

  • Active items are items of the form \(\left[ \phi ;e;s;\varGamma \right] \) where \(\phi = \left( f, A_1, \dots , A_q, A\right) \in P\), e is an IDL expression over \(\varSigma ' = \varSigma \cup \bigcup _{i=1}^q \left\{ A_{ij} \mid j \in \left\{ 1, \dots , d_i\left( f\right) \right\} \right\} \), \(s \in {\mathscr {S}}_e\) and \(\varGamma \in {\mathscr {G}}_e\);

  • Passive items are items of the form \(\left[ \phi ;A_i;r;\varGamma \right] _{\mathrm {P}}\) where \(\phi = \left( f, A_1, \dots , A_q, A\right) \in P\) and \(\varGamma \in {\mathscr {G}}_e\), with e any IDL expression over \(\varSigma ' = \varSigma \cup \bigcup _{i=1}^q \left\{ A_{ij} \mid j \in \left\{ 1, \dots , d_i\left( f\right) \right\} \right\} \);

  • Completed items are items of the form \(\left[ \phi ;A;\vec {r};\varGamma \right] _{\mathrm {C}}\) where \(\phi = \left( f, A_1, \dots , A_q, A\right) \in P\), \(\vec {r} \in [\![1,q]\!] \rightharpoonup {\mathscr {P}}_f\left( {\mathbb {N}}\right) \) and \(\varGamma \in {\mathscr {G}}_e\), with e any IDL expression over \(\varSigma ' = \varSigma \cup \bigcup _{i=1}^q \left\{ A_{ij} \mid j \in \left\{ 1, \dots , d_i\left( f\right) \right\} \right\} \).

While active items store the current parsing status of a given IDL expression and passive items memorize successful parsing of a given IDL expression, completed items unify parsing results for different fields of the same category, checking that the various contexts are compatible with each other. Passive items are not strictly necessary; they are essentially syntactic sugar for active items whose current cut is reduced to the final node.
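A possible concrete rendering of these three item types, assuming simple Python representations of our own choosing (plain objects for rules, expressions and categories; hashable values for context tables), might look as follows. The dataclasses are frozen so that items can be deduplicated in a chart set:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActiveItem:
    """[phi; e; s; Gamma]: rule phi, IDL expression e, current state s, context Gamma."""
    rule: object
    expr: object
    state: object
    ctx: object

@dataclass(frozen=True)
class PassiveItem:
    """[phi; A_i; r; Gamma]_P: one field of A_i fully matched at position set r."""
    rule: object
    field: object
    positions: frozenset
    ctx: object

@dataclass(frozen=True)
class CompletedItem:
    """[phi; A; r_vec; Gamma]_C: compatible position sets for several fields of A."""
    rule: object
    category: object
    fields: tuple
    ctx: object
```

Equality and hashing come for free from the frozen dataclasses, so inserting the same item twice leaves the chart unchanged.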

Fig. 12
figure 12

Parsing rules

Fig. 13
figure 13

Item types and rules

The deduction-style rules that make up the core of our algorithm are presented in Fig. 12. Predict, Scan and Combine have their usual semantics from bottom-up parsing algorithms, and make extensive use of the context tables and transition relations defined in Sect. 4.1. Step explores \(\varepsilon \)-transitions. Save produces a passive item from an active item that has reached its final state; this passive item is immediately converted into a completed item with only one activated field by Singleton. Finally, when a passive item can be used to extend the domain of a preexisting completed item, Unify performs this operation and returns a new completed item.

Practical implementation of the parsing algorithm requires that an (efficient) ordering be defined on the rules to apply. This ordering must guarantee correctness (i.e. that all possible syntax trees can be output) as well as an acceptable running time.

The graph from Fig. 13 displays the seven deduction rules from Fig. 12 as functions from and to the sets of active (A), passive (P) and completed items (C), as well as products thereof. Dashed arrows are added between two sets X and Y whenever there exists Z such that \(Y = X \times Z\) (red) or \(Y = Z \times X\) (green).

This graph can be viewed as a kind of “recursive control flow diagram” for our algorithm. Each edge labelled with a rule name corresponds to a recursive call to the associated function, which tries to apply the rule to the item output by the previous successful call; each dashed edge corresponds to a fold operation through the item set matching the right-hand side of the destination type (for red arrows) or the left-hand side of the destination type (for green arrows). The rule Predict is applied only at the first iteration. At each iteration, the parameters j and \(a = t_j\) used by Scan are updated, reading the input from left to right, and a sequence of recursive calls takes place, building new items that are appended to the existing parsing environment.

In fact, due to the structure of our parsing system, one of the arrows above is redundant: the red arrow r from C to \(C \times P\) can be removed without altering the correctness of the algorithm, as long as, when handling a passive item, the fold operation suggested by the green arrow g from P to \(C \times P\) is executed before the recursive call encoded in the Singleton arrow. Let us consider the first iteration where the red arrow r is taken. This can only occur when a new completed item c has just been created; this completed item has itself been generated by either the Singleton or the Unify rule. If it has been generated by Singleton, say from a passive item p, a recursive fold through the set C has already taken place via the green arrow g. That fold has added to the environment all completed items that can be computed from p and any other available completed item. Now, for any available passive item \(p'\), a completed item \(c'\) has been derived from \(p'\) at some point in the past. The fold operation triggered by g has already, if possible, derived a new completed item from p and \(c'\) that contains exactly the same information as the item to be created from c and \(p'\). If c has been created by Unify, a fold has already been triggered through g (the only possible path to reach \(C\times P\) was through g, because of our hypothesis that r has not been taken before) and the same reasoning applies by considering the items \(c'\) and p used to produce c.

The resulting pseudocode is presented in Algorithm 2.

figure m
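As a rough, heavily simplified sketch of the control regime described above (not the authors' actual implementation: all rule functions below are empty placeholders standing in for the deduction rules of Fig. 12, and their names and signatures are our own invention):

```python
# Placeholder deduction rules; each maps existing items to a set of new items.
def predict(grammar):                 return set()
def scan(j, a, chart, grammar):       return set()
def step(item, chart, grammar):       return set()
def combine(item, chart, grammar):    return set()
def save(item, chart, grammar):       return set()
def singleton(item, chart, grammar):  return set()
def unify_rule(item, chart, grammar): return set()

def parse(tokens, grammar):
    """Top-level control flow: Predict once, then Scan each token left to
    right, closing the chart under the remaining rules after every step."""
    chart = set()

    def close(agenda):
        # apply the deduction rules recursively until no new item appears
        while agenda:
            item = agenda.pop()
            if item in chart:
                continue
            chart.add(item)
            for rule in (step, combine, save, singleton, unify_rule):
                agenda |= rule(item, chart, grammar)

    close(predict(grammar))                # Predict: first iteration only
    for j, a in enumerate(tokens):         # parameters j and a = t_j
        close(scan(j, a, chart, grammar))  # Scan, then recursive closure
    return chart
```

The fixpoint loop in `close` plays the role of the recursive calls and folds of the control-flow diagram: every newly derived item is fed back to all remaining rules until the chart is saturated.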

4.3 Complexity

The goal of the final part of this paper is to provide an upper bound on the complexity of our parsing algorithm under some practically reasonable assumptions; this bound is stated as Theorem 4. The detailed proof of this theorem can be found in Appendix A.1.

Before addressing the actual complexity problem, three remarks must be made.

First, \(\vee \) nodes are not absolutely needed in the IDL-PMCFG formalism. By creating some new rules and introducing intermediate categories, it is easy to transform any grammar into an equivalent one without any \(\vee \) node. In the following discussion, we will therefore often exclude \(\vee \) nodes and give upper bounds only for the case where they are not used. Practical experience showed that disjunction, being redundant with the creation of two separate rules, is a useful but less frequently used feature of the formalism. Nevertheless, we shall give some insights in Appendix A.1.3 about how to take disjunction into account in the final estimate.

Second, we introduce a notion of G-density of a language:

Definition 19

(G -density of a language) Let \(m \in {\mathbb {N}}\). Let \(G = \left( N, \delta , \varSigma , F, P, S\right) \) be an m-parallel IDL-MCF grammar. Let \(T \subset \varSigma ^*\) such that \(\varepsilon \not \in T\). The G-density of T is defined as

$$\begin{aligned} \rho _{T,G} := \sup _{t \in T} \frac{1}{\left| t \right| } \max _{A \in N,\, j \in [\![1,\delta \left( A\right) ]\!]} \left| \left\{ t_p \mid p \subset [\![1,\left| t \right| ]\!],~t_p~\text {is matched by the } j\text {th field of } A \right\} \right| , \end{aligned}$$
where \(t_p\) denotes the substring of t composed of the tokens at positions p in t (see notations).

The G-density of a language serves as a proxy for the amount of ambiguity that this language contains from the point of view of grammar G. It answers the question: ‘How many different substrings of any string in the language can be matched by the same field of the same category?’. This ‘how many’ is quantified as a quotient of the number of different matches over the length of any input string. Introducing G-density will allow us to discuss the worst-case complexity of parsing on reasonable sets of inputs, i.e. those for which \(\rho _T\) is finite, or, equivalently, for which the number of matched substrings grows at most linearly in the size of the string.

Consider the case where we want to describe adjective-noun attachment in a natural language where adjectives can be placed arbitrarily before or after the noun they modify. We are given an (arbitrarily large) lexicon with a number of terminal rules producing adjectives (of category A) and nouns (of category N). These terminals are stored in an alphabet we denote by \(\varSigma \). The part of the grammar building noun phrases (of category S) in our toy IDL-CF grammar looks like this:

$$\begin{aligned} N \rightarrow S&: n \mapsto n \\ S \rightarrow A \rightarrow S&: np,a \mapsto ||\left( np,a\right) . \end{aligned}$$

Now, how ambiguous can noun phrases be? If we take \(T = \varOmega := \varSigma ^+\), then considering a string with a noun and n arbitrary adjectives in any order results in all substrings containing the noun being valid noun phrases; in this case, \(\rho _{T,G} \ge \frac{2^n}{n+1}\) for all \(n \in {\mathbb {N}}^+\), and therefore \(\rho _{T,G} = +\infty \). But if we now consider the (more practical) case where the number of adjectives to be attached to the same noun is less than some reasonable constant M, and call U the corresponding sublanguage of \(\varOmega \), then we get no more than \(2^{\min \left( k-1,M\right) }\) different matching substrings for any input of length \(k \ge 1\); as a consequence, \(\rho _{U,G} = \sup _{k \in {\mathbb {N}}^+} \frac{2^{\min \left( k-1,M\right) }}{k} = \frac{2^M}{M+1} < +\infty \).
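The combinatorics of this toy example can be checked directly. The sketch below (with our own encoding: the noun at position 0, adjectives at positions 1 to n) enumerates the position sets that the \(||\)-rule allows to parse as noun phrases, namely any subset of adjective positions together with the noun:

```python
from itertools import combinations

def noun_phrase_matches(n):
    """Position sets parsed as S in a string made of one noun (position 0)
    and n adjectives (positions 1..n): the noun plus any subset of
    adjective positions, since || allows arbitrary interleaving."""
    adjectives = range(1, n + 1)
    return [frozenset({0, *c})
            for k in range(n + 1)
            for c in combinations(adjectives, k)]
```

Since `len(noun_phrase_matches(n))` equals `2**n`, the density quotient for a string of length n + 1 is `2**n / (n + 1)`, which is unbounded in n; capping the number of attachable adjectives at M caps the count at `2**min(k - 1, M)` for inputs of length k, as in the text.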

Third, we need to keep in mind the following fact, that is an immediate consequence of Theorem 1:

Theorem 3

IDL-PMCFG parsing is NP-complete. Therefore, unless \(\mathrm {P}=\mathrm {NP}\), general IDL-PMCFG parsing is not polynomial in the size of the input string.

Proof

Theorem 1 provides a reduction from the \(\mathrm {NP}\)-complete problem \(\mathrm {3SAT}\) to parsing with the IDL-PMCF grammar G from Lemma 3. \(\square \)

4.3.1 Measuring IDL Graphs

We now introduce two measures of the complexity of IDL graphs, encoded in two primitives height and width. While height was already defined in Nederhof and Satta’s paper (though it was there called width, and defined in a slightly different manner), width plays a new and complementary role that we shall emphasize later. The informal interpretation of these metrics is simple: height measures the maximal number of branches that can be traversed in parallel, while width quantifies the maximal number of edges labelled with a terminal or \(\underline{\varepsilon }\) on any left-to-right path from the start to the end node.

Definition 20

(Height and width of an IDL expression graph) Let \(\varSigma \) be a set of symbols that does not contain \(\underline{\varepsilon }\) and \(\diamond \) and \({\mathfrak {E}}\) the set of IDL expressions over \(\varSigma \). The height and width of an IDL expression \(e \in {\mathfrak {E}}\) are defined inductively as follows:

$$\begin{aligned} {\mathsf {height}}\left( a\right)&= 1&\forall a \in \varSigma \cup \left\{ \underline{\varepsilon }\right\} \\ {\mathsf {height}}\left( e' \cdot e''\right)&= \max \left( {\mathsf {height}}\left( e'\right) ,{\mathsf {height}}\left( e''\right) \right) \\ {\mathsf {height}}\left( \times \left( e\right) \right)&= {\mathsf {height}}\left( e\right) \\ {\mathsf {height}}\left( \vee \left( e_1, \dots , e_n\right) \right)&= \sum _{i=1}^n{\mathsf {height}}\left( e_i\right) \\ {\mathsf {height}}\left( ||\left( e_1, \dots , e_n\right) \right)&= \sum _{i=1}^n{\mathsf {height}}\left( e_i\right) \\ {\mathsf {height}}\left( ||\left( e'\right) \right)&= {\mathsf {height}}\left( e'\right) ;\\ {\mathsf {width}}\left( a\right)&= 1&\forall a \in \varSigma \cup \left\{ \underline{\varepsilon }\right\} \\ {\mathsf {width}}\left( e' \cdot e''\right)&= {\mathsf {width}}\left( e'\right) + {\mathsf {width}}\left( e''\right) \\ {\mathsf {width}}\left( \times \left( e\right) \right)&= {\mathsf {width}}\left( e\right) \\ {\mathsf {width}}\left( \vee \left( e_1, \dots , e_n\right) \right)&= \max _{i=1}^n {\mathsf {width}}\left( e_i\right) \\ {\mathsf {width}}\left( ||\left( e_1, \dots , e_n\right) \right)&= \max _{i=1}^n {\mathsf {width}}\left( e_i\right) . \end{aligned}$$

In the graph \(\gamma \) of Fig. 9, we have \({\mathsf {width}}\left( \gamma \right) = 3\) (a left-to-right path in the graph contains at most three edges labelled by terminals or \(\underline{\varepsilon }\)) and \({\mathsf {height}}\left( \gamma \right) = 6\) (there are at most six nodes in a cut, or equivalently six branches traversed in parallel).
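Definition 20 translates directly into two recursive functions. In the sketch below, IDL expressions are encoded as nested tuples with our own operator tags (`"cat"` for \(\cdot \), `"lock"` for \(\times \), `"or"` for \(\vee \), `"par"` for \(||\)), with plain strings standing for terminals or \(\underline{\varepsilon }\):

```python
def height(e):
    """Maximal number of branches traversed in parallel (Definition 20)."""
    if isinstance(e, str):                   # terminal or epsilon
        return 1
    op, *args = e
    if op == "cat":                          # concatenation e' . e''
        return max(height(a) for a in args)
    if op == "lock":                         # lock x(e)
        return height(args[0])
    if op in ("or", "par"):                  # disjunction and interleave
        return sum(height(a) for a in args)
    raise ValueError(f"unknown operator {op!r}")

def width(e):
    """Maximal number of labelled edges on a left-to-right path."""
    if isinstance(e, str):
        return 1
    op, *args = e
    if op == "cat":
        return sum(width(a) for a in args)
    if op == "lock":
        return width(args[0])
    if op in ("or", "par"):
        return max(width(a) for a in args)
    raise ValueError(f"unknown operator {op!r}")
```

For instance, the expression \(||\left( a \cdot b, c \cdot d, e\right) \), i.e. `("par", ("cat", "a", "b"), ("cat", "c", "d"), "e")`, has height 3 (three parallel branches) and width 2 (at most two labelled edges on a path).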

4.3.2 Final Complexity Estimate

Based on the previous definitions of height and width, we can prove

Theorem 4

Let \(m \in {\mathbb {N}}\). Let \(G = \left( N, \delta , \varSigma , F, P, S\right) \) be an m-parallel IDL-MCF grammar. Let E be the set of IDL expressions used in G. Assume that for all \(e \in E\), e does not contain any \(\vee \) node. Let \(T \subset \varSigma ^*\) and put \(\rho := \rho _T\). Let \(w := \max _{e \in E} {\mathsf {width}}\left( e\right) \), \(h := \max _{e \in E} {\mathsf {height}}\left( e\right) \), \(R = \left|P \right|\), \(\alpha = \max _{f \in F} a\left( f\right) \), \(M = \max _{e \in E} \left|e \right|\). Finally, let \(t \in T\) and \(n := \left|t \right|\). Assume that \(w \ge 6\) and \(n \ge w\). An upper bound of the complexity of the parsing algorithm described in Algorithm 2 is given by

$$\begin{aligned} {\mathscr {O}}\left( \alpha m^2Rh\rho ^m\left( \left( \frac{\rho w^2}{h^2}\right) ^w\frac{1}{\left( w-1\right) !}\right) ^h h^n n^{2hw+m+2} \right) . \end{aligned}$$

Proof

See A.1. \(\square \)

5 Conclusion

In this paper, we have presented and studied IDL-PMCFG, a new grammatical formalism that extends PMCFG with Nederhof and Satta's IDL expressions. This formalism, along with its GF-like experimental front-end COMPĀ, was designed as a tool to formally encode the syntax of free word order languages. COMPĀ, its IDL-PMCFG backbone and the associated parsing algorithm have been implemented in an experimental setup, focussing on the parsing of Classical Latin. The corresponding code can be found in our GitHub repository. To our knowledge, this formalism is the only one to this day to allow for a straightforward, wide-coverage syntactic description of Classical Latin and similar languages for rule-based parsing purposes. The fact that IDL-PMCFG extends PMCFG with only two new operators should make extending existing tools to support hyperbatic constructions comparatively smoother than if an ad hoc approach to Latin syntax had been chosen.

In order to be able to easily encode the kind of extensive free word order encountered in Classical Latin, an operator allowing grammatical constituents to be swapped and intertwined, the \(||\) operator, is required; no less required for conciseness is the ability, in particular instances, to impose fixed constituent order (through the \(\cdot \) operator) or non-interleaving, or locking, of constituents (through the \(\times \) operator). Since these operators are virtually unavoidable when it comes to providing a linguistically intuitive description of the actual syntactic constraints in the language, it is reasonable to think of IDL-PMCFG as the “smallest extension of PMCFG with built-in support for free word order as observed in Classical Latin”. Note that this does not mean that IDL-PMCFG would be the smallest extension of CFG with this same property, since copying is not required to encode hyperbatic constructions. Besides the design and analysis of the formalism itself, one of the main contributions of this paper is the classification result of Theorem 1. This theorem has two main consequences.

The first one is mainly theoretical: whenever a CFG-derived grammatical formalism is coupled with IDL expressions and includes a record system that does not restrict copying, parsing in this formalism must, in the worst case, be non-polynomial in the size of the input. As an immediate corollary, such formalisms cannot be mildly context-sensitive. In fact, even if we disallowed copying, there is no way a formalism able to generate Latin hyperbatic structures in a linguistically meaningful way could be mildly context sensitive: Becker et al. (1992) showed that scrambling as it occurs in German—a kind of free word order that is strictly less general than Latin hyperbata—is not mildly context-sensitive.

The second, more practical consequence is that the corresponding parsing algorithms will not be polynomial. This of course does not mean that parsing will be intractable altogether, since practical linguistic settings rarely present the level of ambiguity that leads to theoretical worst cases; our first experiments would rather suggest the opposite. Studying the complexity of the IDL-PMCFG parsing algorithm in the particular case of IDL-MCF grammars without copying would be an interesting path for further research.

Since a majority of works in formal NLP draw most of their examples from fixed word order languages (most notably English, but also to a certain extent German, French, or Chinese) in which hyperbata are almost always ungrammatical, it might be tempting to think that mild context sensitivity, and in particular polynomial-time parsing, is sufficiently expressive to account for syntactic phenomena in a vast majority of instances and for almost all natural languages. This view indeed seems to be widely accepted,Footnote 7 and it has proved practical in many cases, often preventing unnecessary computational explosion.

Becker et al. (1992) showed that its theoretical accuracy should be questioned. Indeed, in the light of this study, the alleged minimality of mildly context-sensitive languages, while not contradicted by the grammar of English and similar languages, appears to have somewhat underestimated the complexity of (very) free word order: Classical Latin and Greek, Sanskrit etc. present hyperbatic constructions that are considerably more complex than German scrambling. These may well be specific languages, and, in one sense, they are: Classical Latin and Greek, as well as Sanskrit, belong to a rather extremal subset in the (somewhat imprecise) galaxy of so-called free word order languages. In these languages, audacious interleavings and permutations have become part of a canon of refined rhetorical and prosodic effects, further enhancing the natural syntactic flexibility of a morphologically rich linguistic system. Many other idioms, such as English, are not concerned by this kind of phenomena, and it would be equally unsatisfying to impose free word order formalisms upon them. What is at stake is not so much the pertinence of mild context sensitivity for the vast majority of formal NLP applications, but rather its universality throughout natural languages.

Two important questions remain open. On the formal side, the position of IDL-CFGs and IDL-MCFGs (without copying) in the language hierarchy is still unclear, and so is the complexity of their respective parsing algorithms. On the linguistic side, the level of expressivity needed to account for hyperbaton and locking of clauses is not precisely known. Answering these two questions would provide a more complete insight into the level of syntactic complexity of free word order in natural language, while paving the way for the development of more efficient description and parsing systems.