🔠 NLP: PGF Specification 2.1

September 5, 2025 by CryptoPatrick Grammar NLP

Note: This is not an official specification. Version 2.1 was yanked by its author and it’s unclear what the current status of the project is. In the meantime, PGF version 1.0 is (what I assume) the only official version of PGF at this time of writing.

Basic Types

The Portable Grammar Format is a binary format where the structures of the grammar are serialized as a sequence of bytes. Every structure is a list of sequentially serialized fields, where every field is either another structure or has a basic type. The allowed basic types are:

Int8 - 8 bits integer, with sign, represented as a single byte.
Int16 - 16 bits integer, with sign, represented as a sequence of two bytes where the most significant byte is stored first.
Int32 - a 32bits integer with sign encoded as a sequence of bytes with variable length. The last bit of every byte is an indication for whether there are more bytes left. If 127 the bit is 1, then there is at least one more byte to be read, otherwise this is the last byte in the sequence. The other 7 bits are parts of the stored integer. We store the bits from the least significant to the most significant.
String - a string in UTF-8 encoding. We first store as Int (a variable length integer) the length of the string in number of Unicode characters and after that we add the UTF-8 encoding of the string itself.
Float - A double precision floating point number serialized in a big-endian format following the IEEE754 standard.
List - Many of the object fields are lists of other objects. We say that the field is of type [Object] if it contains a list of objects of type Object. The list is serialized as a variable length integer indicating the length of the list in number of objects, followed by the serialization of the elements of the list.

PGF

The whole PGF file contains only one structure which is based upon the abstract structure as documented in G from Definition 1 in Section 2.1 of Krasimir Angelov’s thesis “The Mechanics of the Grammatical Framework” and further detailed in Appendix A.

The PDF describes version 1.0 of this format. This document intendeds to describe the most recent changes as seen in the gf-core repository.

Name	Type	Description
majorVersion	Int16	major PGF version
minorVersion	Int16	minor PGF version
flags	[Flag]	global flags
abstract	Abstract	abstract syntax
concretes	[Concrete]	list of concrete syntaxes

Flag

The flags are pairs of a name and a literal and store different configuration parameters. They are generated by the compiler from the flags statements in the grammar and can be accessed with the runtime API. By using flags it is possible to add new settings without changing the format.

Name	Type	Description
name	String	flag name
literal	Literal	flag value

Literal

The Literal object represents the built-in kinds of literal constants. It starts with a tag which encodes the type of the constant:

LiteralFloat

float

Name	Type	Description
tag	2	a tag
value	Float	float value

LiteralStr

string

Name	Type	Description
tag	0	has a value of 0 for this type of literal
value	String	string value

LiteralInt

integer

Name	Type	Description
tag	1	a tag
value	Int32	int value

Abstract

This is the object that represents the abstract syntax A (Definition 2, Section 2.1) of a grammar. The name of the abstract syntax is the name of the top-level abstract module in the grammar. The start category is specified with the flag startcat.

Name	Type	Description
name	String	the name of the abstract syntax
flags	[Flag]	a list of flags
absFunctions	[AbsFun]	a list of abstract functions
absCats	[AbsCat]	a list of abstract categories

AbsFun

Every abstract function is represented with one AbsFun object.

The constructor tag distinguishes between constructors and computable functions, i.e. we can distinguish between this two judgements: 0 = constructor: data f : T 1 = function: fun f : T

If this is a function, then we also include a list of definitional equations. The list can be empty which means that the function is an axiom. In the cases, when we have at least one equation then the arity is the number of arguments that have to be known in order to do pattern matching. For constructors or axioms the arity is zero.

Name	Type	Description
name	String	the name of the function
type	Type	function’s type signature
arity	Int32	function’s arity
constructorTag	ConstructorTag	a constructor tag: 0- constructor; 1- function
equations	[Equation]	definitional equations for this function if it is not a constructor
probability	Float	the probability of the function

Type

This is the description of an abstract syntax type.
Since the types are monomorphic and in normal form, they have the general form: $$(x_1 : T_1) → (x_2 : T_2) → … → (x_n : T_n) → C e_1…e_n$$

The list of hypotheses $(x_i : T_i)$ is stored as a list of Hypo objects and the indices $e_1 …e_n$ are stored as a list of expressions.

Name	Type	Description
hypotheses	[Hypo]	a list of hypotheses
categoryName	String	the name of the category in the return type
expressions	[Expression]	indices in the return type

Hypo

Every Hypo object represents an argument in some function type. Since we support implicit and explicit arguments, the first field tells us whether we have explicit argument i.e. (x : T) or implicit i.e. ({x} : T). The next two fields are the name of the bound variable and its type. If no variable is bound then the name is ’_’.

Name	Type	Description
bindType	BindType	the binding type i.e. implicit/explicit argument
name	String	a variable name or ’_’ if no variable is bound
type	Type	the type of the variable

BindType

The bind type is a tag which encodes whether we have an explicit or an implicit argument. Tag 0 is for explicit, and tag 1 is for implicit.

Expression

This is the encoding of an abstract syntax expression (tree).

ExpressionAbstraction

a lambda abstraction (i.e. \x → …)

Name	Type	Description
tag	0	has a value of 0 for this type of expression
bindType	BindType	a tag for implicit/explicit argument
name	String	the variable name
expression	Expression	the body of the lambda abstraction

ExpressionApplication

application (i.e. f x)

Name	Type	Description
tag	1	has a value of 1 for this type of expression
left	Expression	the left-hand expression
right	Expression	the right-hand expression

ExpressionLiteral

a literal value i.e. string, integer or float

Name	Type	Description
tag	2	a tag
literal	Literal	the value of the literal

ExpressionMetaVar

a metavariable (i.e. ?0,?1,…)

Name	Type	Description
tag	3	has a value of 3 for this type of expression
id	Int32	the id of the metavariable

ExpressionFunction

an abstract syntax function

Name	Type	Description
tag	4	has a value of 4 for this type of expression
name	String	the function name

ExpressionVar

a variable, De Bruijn index reference

Name	Type	Description
tag	5	has a value of 5 for this type of expression
deBruijnIndex	Int32	the de Bruijn index of the variable

ExpressionTypeAnnotation

an expression with a type annotation (i.e. e : t )

Name	Type	Description
tag	6	has a value of 6 for this type of expression
expression	Expression	the annotated expression
type	Type	the type of the expression

ExpressionImplicitArg

an implicit argument (i.e. {e})

Name	Type	Description
tag	7	has a value of 7 for this type of expression
expression	Expression	the expression for the argument

Equation

Every computable function is represented with a list of equations where the equation is a pair of list of patterns and an expression. All equations must have the same number of patterns which is equal to the arity of the function.

Name	Type	Description
patterns	[Pattern]	a sequence of patterns
expression	Expression	an expression

Pattern

This is the representation of a single pattern in a definitional equation for computable function. The first field is a tag which encodes the kind of pattern.

PatternApplication

pattern matching on constructor application (i.e. c p1 p2 … pn)

Name	Type	Description
tag	0	has a value of 0 for this type of pattern
name	String	the name of the constructor
patterns	[Pattern]	a list of nested patterns for the arguments

PatternVar

A variable

Name	Type	Description
tag	1	has a value of 1 for this type of pattern
name	String	the variable name

PatternVarAndNested

a pattern which binds a variable but also does nested pattern matching (i.e. x@p)

Name	Type	Description
tag	2	a tag
name	String	the variable name
pattern	Pattern	a nested pattern

PatternWildcard

A wildcard (i.e. ).

Name	Type	Description
tag	3	has a value of 3 for this type of pattern

PatternLiteral

matching a literal i.e. string, integer or float

Name	Type	Description
tag	4	a tag
literal	Literal	the value of the literal

PatternImplicitArg

pattern matching on an implicit argument (i.e. {p})

Name	Type	Description
tag	5	has a value of 5 for this type of pattern
pattern	Pattern	a nested pattern

PatternInaccessible

an inaccessible pattern (∼ p)

Name	Type	Description
tag	6	has a value of 6 for this type of pattern
expression	Expression	the nested pattern

AbsCat

very abstract category is represented with one AbsCat object. The object includes the name and the type information for the category plus a list of all functions whose return type is this category. The functions are listed in the decreasing probability order. Equivalently, since the probability is represented as a negated probability they are actually sorted in the increasing probability order.

Name	Type	Description
name	String	the name of the category
hypotheses	[Hypo]	a list of hypotheses
weightIdent	[WeightedIdent]	guessing on this from Java and C code
probability	Float	the probability of the category

WeightedIdent

Name	Type	Description
weight	Float	The negated logarithm of the function probability.
name	String	The name of the function.

Concrete

Every concrete syntax C (Definition 3, Section 2.1), in the grammar, is represented with an object. The name of the concrete syntax is the name of the top-level concrete module in the grammar.

Note: The lists Flag, PrintName and CncCat are sorted by name which makes it easy to do binary search. Note: The total number of concrete categories is used by the parser to determine whether a given category is part of the grammar, i.e. member of NC, or it was created during the parsing. This is the way to decide when to put metavariables during the tree extraction (Section 2.3.7).

Note: Sequences are stored first. Thanks to that, when cncFuns, linDefs and linRefs are deserialized the sequence ids can be turned into direct pointers.

Name	Type	Description
name	String	the name of the concrete syntax
flags	[Flag]	a list of flags
printNames	[PrintName]	a list of print names
sequences	[Sequence]	a table with sequences (Section 2.8.1)
cncFuns	[CncFun]	a list of concrete functions
linDefs	[LinDef]	a list of functions for default linearization
linRefs	[LinRef]	a list of lin refs
cCats	[CCat]	A list of CCats
cncCats	[CncCat]	a list of concrete categories
concreteCategoryCount	Int32	the total number of concrete categories allocated for the grammar

PrintName

Every function or category can have a print name which is a user friendly name that can be displayed in the user interface instead of the real one. The print names are defined in the concrete syntax which makes it easier to localize the user interface to different languages.

This is still supported both in the GF source language and in the compiler but none of the recent UIs use the feature.

Name	Type	Description
name	String	the name of the function or the category
printName	String	the printable name

Sequence

This is the representation of a single sequence in PMCFG, produced during the common subexpression optimization (Section 2.8.1).

Name	Type	Description
symbols	[PgfSymbol]	a list of symbols

PgfSymbol

The Symbol(Definition 4, Section 2.1) [Renamed here as GFSymbol] represents either a terminal or a function argument in some sequence. The representation starts with a tag encoding the type of the symbol:

PgfSymbolCat

his is the representation of an argument, i.e. a pair <k;l> where k is the argument index and l is the constituent index.

Name	Type	Description
tag	0	has a value of 0 for this type of symbol
argumentIndex	Int32	argument index
constituentIndex	Int32	constituent index

PgfSymbolLiteral

This is again an argument but we use different tag to indicate that the target can be a literal category (see Section 2.6). If the target category is not a new fresh category, generated by the parser, then it is treated as a literal category. In the pgf_pretty format, we print this kind of symbol as {d;r} instead of <d;r>.

Name	Type	Description
tag	1	has a value of 1 for this type of symbol
argumentIndex	Int32	argument index
constituentIndex	Int32	constituent index

PgfSymbolVar

A high-order argument i.e. <d;$r> (Section 2.7).

Name	Type	Description
tag	2	has a value of 2 for this type of symbol
argumentIndex	Int32	argument index
variableIndex	Int32	variable index

PgfSymbolTerminalString

This is a terminal symbol and represents a list of tokens.

Name	Type	Description
tag	3	has a value of 3 for this type of symbol
tokens	String	a single token

PgfSymbolTerminalPhrase

An alternative terminal symbol representing phrase, whose form depends on the prefix of the next token. It corresponds to the pre construction in GF and encodes variations like a/an in English.

Name	Type	Description
tag	4	has a value of 4 for this type of symbol
defaultForm	[PgfSymbol]	the default form
alternative	[Alternative]	a sequence of alternatives

Alternative

Every Alternative represents one possible form of a phrase which is dependent on the prefix of the next token. For example when the construction: pre {”beau”;”bel”/”ami”} is compiled then the alternative “bel” / “ami” will be represented by the pair ([”bel”],[”ami”]).

Name	Type	Description
form	[PgfSymbol]	The tokens to use if the prefix matches (should be renamed when we understand it better)
prefixes	[string]	The prefix matched with the following tokens

PgfSymbolBind

Name	Type	Description
tag	5	has a value of 5 for this type of symbol

PgfSymbolSoftBind

Name	Type	Description
tag	6	has a value of 6 for this type of symbol

PgfSymbolNE

Name	Type	Description
tag	7	has a value of 7 for this type of symbol

PgfSymbolSoftSpace

Name	Type	Description
tag	8	has a value of 8 for this type of symbol

PgfSymbolCapital

Name	Type	Description
tag	9	has a value of 9 for this type of symbol

PgfSymbolAllCapital

Name	Type	Description
tag	10	has a value of 10 for this type of symbol

CncFun

This is the definition of a single concrete function (Definition 4, Section 2.1). The first field is the name of the corresponding abstract function which gives us the direct definition of the ψF mapping. The second field is the function definition given as a list of indices pointing to the sequences table (see the Concrete object).

Name	Type	Description
abstractFunctionName	String	the name of the corresponding abstract function
sequenceIndicies	[Int32]	list of indices into the sequences array

LinDef

The LinDef object stores the list of all concrete functions that can be used for the default linearization of some concrete category (Section 2.5).

Name	Type	Description
concreteCategory	Int32	the concrete category
concreteFunctions	[Int32]	a list of concrete functions

LinRef

Name	Type	Description
concreteCategory	Int32	the concrete category
concreteFunctions	[Int32]	a list of concrete functions

CCat

Name	Type	Description
id	Int32	the name of the corresponding (by ψN) abstract category
productions	[Production]

Production

The production can be either an application of some function or a coercion.

ProductionApply

The production is an application (Definition 4, Section 2.1):

Name	Type	Description
tag	0	has a value of 0 for this type of production
concreteFunction	Int32	the concrete function
args	[PArg]	a list of arguments

PArg

An argument in a production.

Name	Type	Description
categories	[Int32]	the categories of the high-order arguments (Section 2.7)
concreteCategory	Int32	a concrete category

ProductionCoerce

The production is a coercion (Section 2.8.1):

Name	Type	Description
tag	1	has a value of 1 for this type of production
concreteCategory	Int32	the concrete category

CncCat

This is the representation of a set of concrete categories which map to the same abstract category. Since all concrete categories generated from the same abstract category are always represented as consequtive integers, here we store only the first and the last category. The compiler also generates a name for every constituent so here we have the list of names. The length of the list is equal to the dimension of the category.

Name	Type	Description
name	String	the name of the corresponding (by ψN) abstract category
firstConcreteCategory	Int32	the first concrete category
lastConcreteCategory	Int32	the last concrete category
constituentNames	[string]	a list of constituent names

PGF File Format Schematic

The PGF file is a binary serialization of a PGF structure. It contains an abstract syntax (PgfAbstr) and one or more concrete syntaxes (PgfConcr). Below is a schematic breakdown of the format, with byte-level representations based on pgf/reader.c and the debug output (e.g., [1, 3, 3, 77, 117, 109, 1, 4, 5, 87, …] and offsets like 912, 913, 1008, 1184). Variable-length integers are used extensively in PGF as varint, and are encoded as a sequence of bytes, where each byte has the high bit set (0x80) except the last, with the lower 7 bits contributing to the value.

Offset	Bytes	Description
0	2	major_version (uint16, big-endian)
2	2	minor_version (uint16, big-endian)
4	varint	gflags length (number of global flags)
4+x	…	gflags (map of PgfCId to GuString, see below)
…	abstr (PgfAbstr, see below)
varint	n_concretes (number of concrete syntaxes)
…	concretes (array of PgfConcr, see below)

'I write to understand as much as to be understood.' —Elie Wiesel
(c) 2024 CryptoPatrick