Syntax

This section describes all syntactic constructs in Zwerg language. You may want to check out Tutorial, which introduces many (but not all) of these concepts on an example, and Core vocabulary or Dwarf vocabulary to learn about the actual function words that you can use.

Comments

Form:

// Ignored until the end of the line.
# This one as well.
/* C-like comments work as well.  */

Comments are ignored. Nesting comments is not allowed.

Empty expression

Empty expression yields exactly the stacks that get to its input.

For example, the following is an empty expression within negative sub-expression assertion. This expression never yields anything:

!()     # An expression that never yields.

As another example, it may be useful to bind the initial Dwarf value to a name, typically in cases when one needs to refer to it several times. One way to do this is this:

let Dw := ;     # There's an empty expression between := and ;.

Words

Form:

foo
?foo
!foo
@foo    # and more.

Most functional parts of Zwerg expressions are expressed by words. Words consume input stacks and may yield possibly several stacks of arbitrary composition. A typical word would inspect or pop one or more words on each incoming stack, carry out some computation, and produce zero or more stacks with the results of computation pushed to them.

For example, the word entry expects a stack with a Dwarf value on top. It then examines the Dwarf data, and for each debug info entry (DIE), it produces a stack with that DIE pushed on top (instead of the consumed Dwarf value). E.g.:

$ dwgrep ./tests/nontrivial-types.o -e ''
<Dwarf "./tests/nontrivial-types.o">

$ dwgrep ./tests/nontrivial-types.o -e 'entry'
[b]     compile_unit
        producer (strp) GNU C 4.6.3 20120306 (Red Hat 4.6.3-2);
        [... more attributes ...]
[2d]    structure_type
        name (string)   foo;
        [... more attributes ...]
[35]    subprogram
        external (flag) true;
        name (strp)     blah;
        [... more attributes ...]
[... more output ...]

Some words (actually quite a few of them) will change their behavior depending on what values are on the stack. E.g. entry is besides Dwarfs applicable to, as of this writing, compilation units and abbreviation units.

The words that start with ? and ! are assertions. While what was written about words in general applies to assertions as well, their mode of operation is more restricted. Assertions either yield the incoming stack unchanged, or yield nothing at all.

The assertions come in pairs, where the ?-flavored one tests the positive condition (e.g. ?root tests whether a DIE on top of stack is a root DIE), and the !-flavored one tests the opposite condition (e.g. !root tests whether a DIE on top of stack is not a root DIE).

Concatenation

Form:

EXPR₁ EXPR₂ …

The resulting expression passes all input stacks to EXPR₁, takes the output of that, passes it to EXPR₂, and so on. The result of concatenation is then whatever the last constituent expression yields.

The constituent expressions are evaluated in plain context.

Parenthesizing

Form:

(EXPR₁)

Concatenation has a high precedence, so parentheses can be used liberally to adjust how an expression should be chopped into sub-expression. The parenthesized expression is evaluated in plain context.

For example:

foo bar, baz    # these two ...
(foo bar), baz  # ... mean the same thing
foo (bar, baz)  # this one is different

Integer literals

Form:

“-”?(“0x”|“0o”|“0b”|“0”|“”){digits}

Integer literals are words that start with a decimal digit, or by a minus sign followed by a decimal digit. For integer literal to be valid, the word needs to be composed of a prefix that determines radix of the literal, followed by one or more digits from a set specified as follows:

  • 0x and 0X are hexadecimal prefixes. Valid digits are [0-9a-fA-F].
  • 0o, 0O and 0 are octal prefixes. Valid digits are [0-7].
  • 0b and 0B are binary prefixes. Valid digits are [0-1].
  • Without prefix, decimal base is assumed. Valid digits are [0-9].
  • An initial - means the number is negative.

Zwerg integers can hold any 64-bit signed or unsigned number:

$ dwgrep '0xffffffffffffffff -0x7fffffffffffffff add'
0x8000000000000000

Invalid integer literals are diagnosed at query compilation time:

$ dwgrep '123foo'
dwgrep: Invalid integer literal: `123foo'

Named constants

Form:

DW_AT_name
DW_TAG_base_type
DW_FORM_strp
DW_LANG_C

Zwerg has support for named constants. They aren't merely aliases for numbers--Zwerg remembers their domain and uses it when the value needs to be displayed or converted to a string. It is still possible to extract the underlying numerical value:

$ dwgrep 'DW_TAG_base_type value'
36
$ dwgrep 'DW_TAG_base_type hex'
0x24

Numbers in non-decimal bases use the same trick, so if you use a hexadecimal number, it will keep its hexadecimal-ness throughout the script:

$ dwgrep '10 0x10 0o10 0b10'
---
0b10
010
0x10
10

Similarly values decoded from attribute will have the appropriate "skin"--they will be named constants, hexadecimal or decimal numbers, depending on what is deemed the best fit:

$ dwgrep ... -e 'entry @AT_language'
DW_LANG_C89

$ dwgrep ... -e 'entry @AT_decl_line'
4
6
11

$ dwgrep ... -e 'entry @AT_low_pc'
0x80483f0
0x80482f0

Lexical scopes ((|A|…))

Form:

(|ID₁ ID₂ …| EXPR₁)

A presence of binding block in a parenthesized construct converts the whole form into a function that pops as many arguments as there are identifiers in the binding block, and binds them to these identifiers such that the rightmost one gets the value on TOS and then progressively lefter ID's get progressively deeper stack values. The expression is then evaluated as usual, except one can use the bound names.

Sometimes it's useful to enclose the whole expression into a scope and pop and bound the initial Dwarf value:

(|Dw| Dw entry (name == Dw symtab name))

Another example shows how to implement set union even without first-class Zwerg support:

[foo] [bar] (|A B| A [B elem !(== A elem)] add)

ALT-lists (,)

Form:

EXPR₁ “,” EXPR₂ …

The resulting expression passes all input stacks to all of EXPR₁, EXPR₂, etc. It then yields any and all stacks that any of the constituent expressions yields.

All constituent expressions shall have the same overall stack effect (the number of slots pushed minus number of slots popped will be the same for each branch).

The constituent expressions are evaluated in plain context.

E.g. to compare a TOS value to one of the fixed list of values, you would do:

(== (1, 2, 3))

You would also use ALT-lists to create a sequence value:

[1, 2, 3]

OR-lists (||)

Form:

EXPR₁ “||” EXPR₂ …

Each input stack is passed first to EXPR₁. If that yields anything, that's what the overall expression yields. Otherwise the same original input stack is passed to EXPR₂, and so in this fashion. If neither constituent expression yields anything, the overall expression doesn't yield anything either. The mode of operation is very similar to shell-like OR lists, hence the name.

All constituent expressions shall have the same overall stack effect (the number of slots pushed minus number of slots popped will be the same for each branch).

The constituent expressions are evaluated in plain context.

An important use of an OR list is to implement case-like conditions and fallback cases:

let name := (@DW_AT_linkage_name || @DW_AT_MIPS_linkage_name
             || @DW_AT_name || "???");

let has_loc := (?(@DW_AT_location) true || false);

Infix assertions (==)

Form:

EXPR₁ OP₁ EXPR₂

Zwerg has support for infix binary assertions. These forms are evaluated as follows:

?(let .tmp1 := EXPR₁; let .tmp2 := EXPR₂; .tmp1 .tmp2 OP₂)

In plain English, EXPR₁ and EXPR₂ are each evaluated in sub-expression context. The TOS's of the stacks that this produces are then put to stack and a certain operator OP₂ is evaluated. This shows that the two expressions are independent of each other.

The word OP₂ is here used as a stand-in to explain the syntax. The actual word that is looked up is again OP₁--the operators are just words with a bit of syntactical support on Zwerg side (and with funny names). So which operators are available depends on which vocabularies any particular wrapper brings in.

Zwerg core defines the following OP₁'s, and for each of them there is currently an associated OP₂ as follows:

EXPR₁ “==” EXPR₂        # ?eq
EXPR₁ “!=” EXPR₂        # ?ne
EXPR₁ “<” EXPR₂         # ?lt
EXPR₁ “<=” EXPR₂        # ?le
EXPR₁ “>” EXPR₂         # ?gt
EXPR₁ “>=” EXPR₂        # ?ge
EXPR₁ “=~” EXPR₂        # ?match
EXPR₁ “!~” EXPR₂        # !match

(Again, the association to OP₂ is for explanation only. Overriding that symbol has no effect on actual execution.)

Another thing worth noticing is that the whole form behaves as an assertion. No side effects leak from the constituent EXPR's or from the operator itself.

See ?eq to learn about comparison operators. See ?match to learn about regular expression matching.

For example:

$ dwgrep ./tests/typedef.o -e 'entry (offset == 0x1d)'
[1d]    typedef
        name (strp)     int_t;
        decl_file (data1)       /home/petr/proj/dwgrep/typedef.c;
        decl_line (data1)       1;
        type (ref4)     [28];

Iteration (*)

Form:

EXPR₁ “*”
EXPR₁ “+”
EXPR₁ “?”

The effect of expressions EXPR₁* is that the input stack is yielded unchanged, and also passed to EXPR₁. The output is collected and yielded, and then passed back to EXPR₁. This process is repeated until EXPR₁ stops yielding more stacks.

This is typically used for forming closures--peeling uninteresting types, forming sets of nodes, etc. E.g.:

child*  # nodes of tree rooted in node that's on TOS

One can also (ab)use this to create an explicit for loop:

0 (1 add ?(10 ?lt))*

EXPR₁ shall push exactly the same number of stack slots as it pops.

The effect of EXPR₁+ is the same as that of EXPR₁ EXPR₁*.

The effect of EXPR₁? is the same as that of (, EXPR₁).

For example:

A B swap? ?gt drop    # "max" -- keep the greater number on stack
A B swap? ?lt drop    # "min"

Formatting strings

Form:

“r”? “"” (formatting string) “"” “\”?

Zwerg string literals are functions that push themselves to an incoming stack, which they then yield. The literal is enclosed in quotes and if a quote itself should be part of the string, it should be escaped with a backslash (\). Should a backslash be part of the string and not considered an escape character, it should be escaped by another backslash:

"a simple string"
"a string \"with\" quotes"
"a string with backslash: \\"

Other recognized escapes include:

escape meaning
\\

Escaped backslash stands for a backslash:

$ dwgrep '"\\a\\"'
\a\
\123

123 stands for octal number. The escape is replaced with an ASCII character the code of which is the octal number:

$ dwgrep '"\110\145\154\154\157"'
Hello
\xab

ab stands for hexadecimal number. The meaning is analogous to that of \123:

$ dwgrep '"\x77\x6f\x72\x6c\x64\x21"'
world!
\t Stands for the tab character.

\n

<EOL>

Both a literal end-of-line character, as well as the \n escape stand for new line:

$ dwgrep '"foo\nbar"'
foo
bar

$ dwgrep '"foo
bar"'
foo
bar
\<EOL>

Escaped newline is ignored:

$ dwgrep '"foo\
bar"'
foobar

There are more escape sequences (essentially those mentioned in man ascii are supported as well).

If a string literal is prefixed by r, it's actually a raw string literal. These work the same as normal formatting strings, but escape sequences are left intact in the string:

$ dwgrep '"foo \"bar\\\""'
foo "bar\"

$ dwgrep 'r"foo \"bar\\\""'
foo \"bar\\\"

String literals can be split, provided that all but the last segment end not with a mere quote, but "\. The following two examples produce equivalent programs:

"a long string "\ "that continues here"
"a long string that continues here"

Any whitespace (but only whitespace) is allowed between "\ and the following ".

String literals can contain formatting directives. Formatting directive are two-character sequences whose first character is a %. Such strings are not merely literals anymore, they behave more as template strings. They are functions that take values from the stack and splice their string representation together with the non-formatting-directive parts of the string.

A pseudo-directive %% is not really a directive at all, it's an escape that stands for a single percent character.

The simplest formatting directive is %s:

$ dwgrep ... -e 'entry ?AT_decl_column !AT_decl_line
                 "%s has DW_AT_decl_column, but NOT DW_AT_decl_line"'
[58] formal_parameter has DW_AT_decl_column, but NOT DW_AT_decl_line

$ dwgrep ... -e 'entry attribute (form "%s" =~ "DW_FORM_ref.*")'
type (ref4)     [65];
sibling (ref4)  [65];
type (ref4)     [2d];
[... etc ...]

The most general formatting directive is a pair of %( and %), which encloses an expression. The formatting string passes input stacks to that expression, which is evaluated in plain context. TOS of yielded stacks is then popped, converted to a string, spliced, and the resulting string is pushed to stack. For example:

$ dwgrep ... -f /dev/stdin <<"EOF"
(|Dw|
 let T := Dw entry ?TAG_typedef ;
 let U := T @AT_type (?TAG_typedef @AT_type)* !TAG_typedef ;

 "[%( T offset %)] typedef %( T @AT_name %) %( U @AT_name || "???" %) "\
 "(%( U @AT_encoding || "???" %), %( U @AT_byte_size || "???" %) bytes)"
)
EOF
[0x1d] typedef int_t int (DW_ATE_signed, 4 bytes)
[0x2f] typedef int_t_t int (DW_ATE_signed, 4 bytes)
[0x3a] typedef int_t_t_t int (DW_ATE_signed, 4 bytes)

If there's more than one formatting directive in a given string, they are resolved in the order from right to left (sic!): rightmost formatting directive gets TOS value:

$ dwgrep 'dwgrep '1 2 3 "{%s [%s (%s)]}"'
{1 [2 (3)]}

This principle is applied throughout Zwerg: when a binding block contains several names the rightmost name is bound to value on TOS. The reason should be apparent from the above example: because the strings are parsed from left to right, rightmost word produces the topmost stack value.

Now that we have introduced %( %), it is easy to use it to define the others:

  • %s stands for %( %)
  • %d stands for %( value %)
  • %x stands for %( value hex %)
  • %o stands for %( value oct %)
  • %b stands for %( value bin %)

Sequence capture ([])

Form:

“[” EXPR₁ “]”

Sequences offer a way to gather values yielded by another expression. If you think of ALT-list as a fork, then capture is a join.

The resulting expression passes any input stacks to EXPR₁. For each input stack, it gathers the stacks produced by EXPR₁, takes the top value off each of them, and collects these values in a sequence. This sequence it then pushes to TOS.

EXPR₁ is evaluated in sub-expression context. That means that the sequence is pushed to TOS in addition to whatever was already there--nothing is removed.

Of particular interest is interplay between ALT-list and capture, which allows easy and syntactically familiar construction of sequence literals:

$ dwgrep '[1, 2, 3]'
[1, 2, 3]

Or you can use this construct to e.g. collect children of a node on TOS:

$ dwgrep ./tests/typedef.o -e 'entry ?root [child]'
---
[[1d] typedef, [28] base_type, [2f] typedef, [3a] typedef, [45] variable]
[b]     compile_unit
        producer (strp) GNU C 4.6.3 20120306 (Red Hat 4.6.3-2);
        language (data1)        DW_LANG_C89;
        name (strp)     typedef.c;
        comp_dir (strp) /home/petr/proj/dwgrep;
        stmt_list (data4)       0;

Or the whole tree rooted at node on TOS:

dwgrep ./tests/typedef.o -e 'entry ?root [child*]'
---
[[b] compile_unit, [45] variable, [3a] typedef, [2f] typedef, [28] base_type, [1d] typedef]
[b]     compile_unit
        producer (strp) GNU C 4.6.3 20120306 (Red Hat 4.6.3-2);
        language (data1)        DW_LANG_C89;
        name (strp)     typedef.c;
        comp_dir (strp) /home/petr/proj/dwgrep;
        stmt_list (data4)       0;

Form:

“[” “]”

This produces an empty list. Alternatively, one could also do:

[!()]

But that's somewhat awkward.

To capture an empty expression, one would need to explicitly parenthesize it:

$ dwgrep '1 [()]'
---
[1]
1

Capture with binding ([|A|…])

Form:

“[” “|” ID₁ ID₂ … “|” EXPR₁ “]”

Sub-expression capture allows a binding block. A presence of such block converts the whole form into a function that pops as many arguments as there are identifiers in the binding block, and binds them to these identifiers such that the rightmost one get the value on TOS and then progressively lefter ID's get progressively deeper stack values. The expression is then evaluated as usual, except one can use the bound names:

$ dwgrep ./tests/typedef.o -e 'entry ?root [|A| A child]'
[[1d] typedef, [28] base_type, [2f] typedef, [3a] typedef, [45] variable]

$ dwgrep '1 [|A| A]'    # or you could just write '[1]' ;)
[1]

Name binding (let)

Form:

“let” ID₁ ID₂ … “:=” EXPR₁ “;”

This form introduces a new name into the current scope. It passes the input stack to EXPR₁, which is evaluated in sub-expression context. Then values near TOS are bound to given identifiers and exported into surrounding context. Then the original stack is yielded as many times as EXPR₁ yields, each time with a possibly different set of bindings. Bound names may be mentioned later, and they push the bound value to the stack.

Consider the following examples:

$ dwgrep 'let X := 1; X'
1

$ dwgrep 'let X := 1, 2; X'
1
2

Ordering of ID's is such that the rightmost is bound to TOS, the next one to the left to one below TOS, etc. The mnemonic for this is that the list of variables describes stack layout, with TOS being on the right. E.g.:

$ dwgrep 'let A B := 1 2, 3 4; A B'
---
2
1
---
4
3

The new bindings are introduced into the most enclosing scope. The following constructs comprise a scope:

  • Any sub-expression context has a scope of its own:

    $ dwgrep '?(let A := 1;) A'
    dwgrep: Attempt to read an unbound name `A'
    
    $ dwgrep '(let A := 1; 1 == 1) A'
    dwgrep: Attempt to read an unbound name `A'
    
  • Each branch of an OR-list or ALT-list has a scope of its own:

    $ dwgrep '(let A := 1;, let A := 2;) A'
    dwgrep: Attempt to read an unbound name `A'
    
  • Parenthesized expressions with name binding blocks are a scope:

    $ dwgrep '"" (|X| let A := 1;) A'
    dwgrep: Attempt to read an unbound name `A'
    
  • then and else branches of an if-then-else form each introduce a scope:

    $ dwgrep 'if ?() then let A := 1; else let A := 2; A'
    dwgrep: Attempt to read an unbound name `A'
    

    Typically these constructs can be rewritten as follows:

    $ dwgrep 'let A := if ?() then 1 else 2; A'
    1
    

On the other hand, simple parenthesizing does not introduce a scope:

$ dwgrep '(let A := 1;) A'
1

It is not allowed to rebind an once-bound name within the same scope:

$ dwgrep 'let A := 1; let A := 1;'
Error: Name `A' rebound.

It is also not possible to access names from outer scopes if they are shadowed by the same name in an inner scope. In the following, the inner reference to A will always resolve to 2, and there is no way to access the outer A of 1:

$ dwgrep 'let A := 1; 2 [|A| A]'
[2]

Sub-expression assertions (?(), !())

Form:

“?(” EXPR₁ “)”
“!(” EXPR₁ “)”

The form ?(...) sends an input stack to EXPR₁, which is then evaluated in sub-expression context. If it yields anything, the overall assertion succeeds and yields the original stack. The form !(...) has the inverse semantics: it succeeds when EXPR₁ yields nothing.

The behavior of these two forms is equivalent to the following:

([ EXPR₁ ] != [])       # ?( EXPR₁ )
([ EXPR₁ ] == [])       # !( EXPR₁ )

For example, to select leaf DIE's of a Dwarf graph:

!(child)

To test whether one of the DIE's contains an empty location expression:

entry ?(@AT_location !(elem))

Like other parenthesized forms, sub-expression assertions allow a binding block as well:

1 2 ?(|A B| A B ?lt)

Conditionals (if-then-else)

Form:

“if” EXPR₀ “then” EXPR₁ “else” EXPR₂

Input stack is passed to EXPR₀, which is evaluated in sub-expression context. Next a body expression is evaluated, which is either EXPR₁ if EXPR₀ yielded anything. Or, if EXPR₀ yielded nothing, EXPR₂ is the body expression.

The input stack is then passed to the body expression, which is evaluated in plain context. The overall form then yields anything that the body expression yields.

Both EXPR₁ and EXPR₂ shall have the same overall stack effect (the number of slots pushed minus number of slots popped will be the same for each branch).

This form could be circumscribed by the following snippet:

(?(EXPR₀) (EXPR₁), !(EXPR₀) (EXPR₂))

As an example, consider the following snippet from a script:

if ?DW_TAG_formal_parameter then (
  // Of formal parameters we ignore those that are children of
  // subprograms that are themselves declarations.
  ?(parent !DW_TAG_subroutine_type !(@DW_AT_declaration == true))
) else (
  ?DW_TAG_variable
)

In particular, note the following surprising example:

$ dwgrep 'if false then "yes" else "no"'        # WARNING!
yes

The reason is that false is a named constant, which means the tested expression always yields a stack: namely a stack with false on top. This is how Booleans should be tested in Zwerg:

if (@AT_declaration == true) then … else …