.. _syntax: Syntax ====== This section describes all syntactic constructs in Zwerg language. You may want to check out :ref:`tutorial`, which introduces many (but not all) of these concepts on an example, and :ref:`zw_vocabulary_core` or :ref:`zw_vocabulary_dwarf` to learn about the actual function words that you can use. Comments -------- Form:: // Ignored until the end of the line. # This one as well. /* C-like comments work as well. */ Comments are ignored. Nesting comments is not allowed. Empty expression ---------------- Empty expression yields exactly the stacks that get to its input. For example, the following is an empty expression within negative sub-expression assertion. This expression never yields anything:: !() # An expression that never yields. As another example, it may be useful to bind the initial Dwarf value to a name, typically in cases when one needs to refer to it several times. One way to do this is this:: let Dw := ; # There's an empty expression between := and ;. Words ----- Form:: foo ?foo !foo @foo # and more. Most functional parts of Zwerg expressions are expressed by words. Words consume input stacks and may yield possibly several stacks of arbitrary composition. A typical word would inspect or pop one or more words on each incoming stack, carry out some computation, and produce zero or more stacks with the results of computation pushed to them. For example, the word ``entry`` expects a stack with a Dwarf value on top. It then examines the Dwarf data, and for each debug info entry (DIE), it produces a stack with that DIE pushed on top (instead of the consumed Dwarf value). E.g.:: $ dwgrep ./tests/nontrivial-types.o -e '' $ dwgrep ./tests/nontrivial-types.o -e 'entry' [b] compile_unit producer (strp) GNU C 4.6.3 20120306 (Red Hat 4.6.3-2); [... more attributes ...] [2d] structure_type name (string) foo; [... more attributes ...] [35] subprogram external (flag) true; name (strp) blah; [... more attributes ...] [... more output ...] Some words (actually quite a few of them) will change their behavior depending on what values are on the stack. E.g. ``entry`` is besides Dwarfs applicable to, as of this writing, compilation units and abbreviation units. The words that start with ``?`` and ``!`` are assertions. While what was written about words in general applies to assertions as well, their mode of operation is more restricted. Assertions either yield the incoming stack unchanged, or yield nothing at all. The assertions come in pairs, where the ``?``-flavored one tests the positive condition (e.g. ``?root`` tests whether a DIE on top of stack is a root DIE), and the ``!``-flavored one tests the opposite condition (e.g. ``!root`` tests whether a DIE on top of stack is not a root DIE). Concatenation ------------- Form:: EXPR₁ EXPR₂ … The resulting expression passes all input stacks to *EXPR₁*, takes the output of that, passes it to *EXPR₂*, and so on. The result of concatenation is then whatever the last constituent expression yields. The constituent expressions are evaluated in plain context. Parenthesizing -------------- Form:: (EXPR₁) Concatenation has a high precedence, so parentheses can be used liberally to adjust how an expression should be chopped into sub-expression. The parenthesized expression is evaluated in plain context. For example:: foo bar, baz # these two ... (foo bar), baz # ... mean the same thing foo (bar, baz) # this one is different Integer literals ---------------- Form:: “-”?(“0x”|“0o”|“0b”|“0”|“”){digits} Integer literals are words that start with a decimal digit, or by a minus sign followed by a decimal digit. For integer literal to be valid, the word needs to be composed of a prefix that determines radix of the literal, followed by one or more digits from a set specified as follows: - ``0x`` and ``0X`` are hexadecimal prefixes. Valid digits are ``[0-9a-fA-F]``. - ``0o``, ``0O`` and ``0`` are octal prefixes. Valid digits are ``[0-7]``. - ``0b`` and ``0B`` are binary prefixes. Valid digits are ``[0-1]``. - Without prefix, decimal base is assumed. Valid digits are ``[0-9]``. - An initial ``-`` means the number is negative. Zwerg integers can hold any 64-bit signed or unsigned number:: $ dwgrep '0xffffffffffffffff -0x7fffffffffffffff add' 0x8000000000000000 Invalid integer literals are diagnosed at query compilation time:: $ dwgrep '123foo' dwgrep: Invalid integer literal: `123foo' Named constants --------------- Form:: DW_AT_name DW_TAG_base_type DW_FORM_strp DW_LANG_C Zwerg has support for named constants. They aren't merely aliases for numbers--Zwerg remembers their domain and uses it when the value needs to be displayed or converted to a string. It is still possible to extract the underlying numerical value:: $ dwgrep 'DW_TAG_base_type value' 36 $ dwgrep 'DW_TAG_base_type hex' 0x24 Numbers in non-decimal bases use the same trick, so if you use a hexadecimal number, it will keep its hexadecimal-ness throughout the script:: $ dwgrep '10 0x10 0o10 0b10' --- 0b10 010 0x10 10 Similarly values decoded from attribute will have the appropriate "skin"--they will be named constants, hexadecimal or decimal numbers, depending on what is deemed the best fit:: $ dwgrep ... -e 'entry @AT_language' DW_LANG_C89 $ dwgrep ... -e 'entry @AT_decl_line' 4 6 11 $ dwgrep ... -e 'entry @AT_low_pc' 0x80483f0 0x80482f0 Lexical scopes (``(|A|…)``) ------------------------------ Form:: (|ID₁ ID₂ …| EXPR₁) A presence of binding block in a parenthesized construct converts the whole form into a function that pops as many arguments as there are identifiers in the binding block, and binds them to these identifiers such that the rightmost one gets the value on TOS and then progressively lefter *ID*'s get progressively deeper stack values. The expression is then evaluated as usual, except one can use the bound names. Sometimes it's useful to enclose the whole expression into a scope and pop and bound the initial Dwarf value:: (|Dw| Dw entry (name == Dw symtab name)) Another example shows how to implement set union even without first-class Zwerg support:: [foo] [bar] (|A B| A [B elem !(== A elem)] add) ALT-lists (``,``) ----------------- Form:: EXPR₁ “,” EXPR₂ … The resulting expression passes all input stacks to all of *EXPR₁*, *EXPR₂*, etc. It then yields any and all stacks that any of the constituent expressions yields. All constituent expressions shall have the same overall stack effect (the number of slots pushed minus number of slots popped will be the same for each branch). The constituent expressions are evaluated in plain context. E.g. to compare a TOS value to one of the fixed list of values, you would do:: (== (1, 2, 3)) You would also use ALT-lists to create a sequence value:: [1, 2, 3] OR-lists (``||``) ----------------- Form:: EXPR₁ “||” EXPR₂ … Each input stack is passed first to *EXPR₁*. If that yields anything, that's what the overall expression yields. Otherwise the same original input stack is passed to *EXPR₂*, and so in this fashion. If neither constituent expression yields anything, the overall expression doesn't yield anything either. The mode of operation is very similar to shell-like OR lists, hence the name. All constituent expressions shall have the same overall stack effect (the number of slots pushed minus number of slots popped will be the same for each branch). The constituent expressions are evaluated in plain context. An important use of an OR list is to implement case-like conditions and fallback cases:: let name := (@DW_AT_linkage_name || @DW_AT_MIPS_linkage_name || @DW_AT_name || "???"); let has_loc := (?(@DW_AT_location) true || false); Infix assertions (``==``) ------------------------- Form:: EXPR₁ OP₁ EXPR₂ Zwerg has support for infix binary assertions. These forms are evaluated as follows:: ?(let .tmp1 := EXPR₁; let .tmp2 := EXPR₂; .tmp1 .tmp2 OP₂) In plain English, *EXPR₁* and *EXPR₂* are each evaluated in sub-expression context. The TOS's of the stacks that this produces are then put to stack and a certain operator *OP₂* is evaluated. This shows that the two expressions are independent of each other. The word *OP₂* is here used as a stand-in to explain the syntax. The actual word that is looked up is again *OP₁*--the operators are just words with a bit of syntactical support on Zwerg side (and with funny names). So which operators are available depends on which vocabularies any particular wrapper brings in. Zwerg core defines the following *OP₁*'s, and for each of them there is currently an associated *OP₂* as follows:: EXPR₁ “==” EXPR₂ # ?eq EXPR₁ “!=” EXPR₂ # ?ne EXPR₁ “<” EXPR₂ # ?lt EXPR₁ “<=” EXPR₂ # ?le EXPR₁ “>” EXPR₂ # ?gt EXPR₁ “>=” EXPR₂ # ?ge EXPR₁ “=~” EXPR₂ # ?match EXPR₁ “!~” EXPR₂ # !match (Again, the association to *OP₂* is for explanation only. Overriding that symbol has no effect on actual execution.) Another thing worth noticing is that the whole form behaves as an assertion. No side effects leak from the constituent *EXPR*'s or from the operator itself. See :ref:`?eq ` to learn about comparison operators. See :ref:`?match ` to learn about regular expression matching. For example:: $ dwgrep ./tests/typedef.o -e 'entry (offset == 0x1d)' [1d] typedef name (strp) int_t; decl_file (data1) /home/petr/proj/dwgrep/typedef.c; decl_line (data1) 1; type (ref4) [28]; Iteration (``*``) ----------------- Form:: EXPR₁ “*” EXPR₁ “+” EXPR₁ “?” The effect of expressions ``EXPR₁*`` is that the input stack is yielded unchanged, and also passed to *EXPR₁*. The output is collected and yielded, and then passed back to *EXPR₁*. This process is repeated until *EXPR₁* stops yielding more stacks. This is typically used for forming closures--peeling uninteresting types, forming sets of nodes, etc. E.g.:: child* # nodes of tree rooted in node that's on TOS One can also (ab)use this to create an explicit for loop:: 0 (1 add ?(10 ?lt))* *EXPR₁* shall push exactly the same number of stack slots as it pops. The effect of ``EXPR₁+`` is the same as that of ``EXPR₁ EXPR₁*``. The effect of ``EXPR₁?`` is the same as that of ``(, EXPR₁)``. For example:: A B swap? ?gt drop # "max" -- keep the greater number on stack A B swap? ?lt drop # "min" Formatting strings ------------------ Form:: “r”? “"” (formatting string) “"” “\”? Zwerg string literals are functions that push themselves to an incoming stack, which they then yield. The literal is enclosed in quotes and if a quote itself should be part of the string, it should be escaped with a backslash (``\``). Should a backslash be part of the string and not considered an escape character, it should be escaped by another backslash:: "a simple string" "a string \"with\" quotes" "a string with backslash: \\" Other recognized escapes include: +-----------+---------------------------------------------------------+ | escape | meaning | +===========+=========================================================+ | ``\\`` | Escaped backslash stands for a backslash:: | | | | | | $ dwgrep '"\\a\\"' | | | \a\ | +-----------+---------------------------------------------------------+ | ``\123`` | *123* stands for octal number. The escape is replaced | | | with an ASCII character the code of which is the octal | | | number:: | | | | | | $ dwgrep '"\110\145\154\154\157"' | | | Hello | +-----------+---------------------------------------------------------+ | ``\xab`` | *ab* stands for hexadecimal number. The meaning is | | | analogous to that of \\123:: | | | | | | $ dwgrep '"\x77\x6f\x72\x6c\x64\x21"' | | | world! | +-----------+---------------------------------------------------------+ | ``\t`` | Stands for the tab character. | +-----------+---------------------------------------------------------+ | ``\n`` | Both a literal end-of-line character, as well as the | | | ``\n`` escape stand for new line:: | | | | | ```` | $ dwgrep '"foo\nbar"' | | | foo | | | bar | | | | | | $ dwgrep '"foo | | | bar"' | | | foo | | | bar | +-----------+---------------------------------------------------------+ | ``\``| Escaped newline is ignored:: | | | | | | $ dwgrep '"foo\ | | | bar"' | | | foobar | +-----------+---------------------------------------------------------+ There are more escape sequences (essentially those mentioned in ``man ascii`` are supported as well). If a string literal is prefixed by ``r``, it's actually a raw string literal. These work the same as normal formatting strings, but escape sequences are left intact in the string:: $ dwgrep '"foo \"bar\\\""' foo "bar\" $ dwgrep 'r"foo \"bar\\\""' foo \"bar\\\" String literals can be split, provided that all but the last segment end not with a mere quote, but ``"\``. The following two examples produce equivalent programs:: "a long string "\ "that continues here" "a long string that continues here" Any whitespace (but only whitespace) is allowed between ``"\`` and the following ``"``. String literals can contain formatting directives. Formatting directive are two-character sequences whose first character is a ``%``. Such strings are not merely literals anymore, they behave more as template strings. They are functions that take values from the stack and splice their string representation together with the non-formatting-directive parts of the string. A pseudo-directive ``%%`` is not really a directive at all, it's an escape that stands for a single percent character. The simplest formatting directive is ``%s``:: $ dwgrep ... -e 'entry ?AT_decl_column !AT_decl_line "%s has DW_AT_decl_column, but NOT DW_AT_decl_line"' [58] formal_parameter has DW_AT_decl_column, but NOT DW_AT_decl_line $ dwgrep ... -e 'entry attribute (form "%s" =~ "DW_FORM_ref.*")' type (ref4) [65]; sibling (ref4) [65]; type (ref4) [2d]; [... etc ...] The most general formatting directive is a pair of ``%(`` and ``%)``, which encloses an expression. The formatting string passes input stacks to that expression, which is evaluated in plain context. TOS of yielded stacks is then popped, converted to a string, spliced, and the resulting string is pushed to stack. For example:: $ dwgrep ... -f /dev/stdin <<"EOF" (|Dw| let T := Dw entry ?TAG_typedef ; let U := T @AT_type (?TAG_typedef @AT_type)* !TAG_typedef ; "[%( T offset %)] typedef %( T @AT_name %) %( U @AT_name || "???" %) "\ "(%( U @AT_encoding || "???" %), %( U @AT_byte_size || "???" %) bytes)" ) EOF [0x1d] typedef int_t int (DW_ATE_signed, 4 bytes) [0x2f] typedef int_t_t int (DW_ATE_signed, 4 bytes) [0x3a] typedef int_t_t_t int (DW_ATE_signed, 4 bytes) If there's more than one formatting directive in a given string, they are resolved in the order from right to left (sic!): rightmost formatting directive gets TOS value:: $ dwgrep 'dwgrep '1 2 3 "{%s [%s (%s)]}"' {1 [2 (3)]} This principle is applied throughout Zwerg: when a binding block contains several names the rightmost name is bound to value on TOS. The reason should be apparent from the above example: because the strings are parsed from left to right, rightmost word produces the topmost stack value. Now that we have introduced ``%( %)``, it is easy to use it to define the others: - ``%s`` stands for ``%( %)`` - ``%d`` stands for ``%( value %)`` - ``%x`` stands for ``%( value hex %)`` - ``%o`` stands for ``%( value oct %)`` - ``%b`` stands for ``%( value bin %)`` Sequence capture (``[]``) ------------------------- Form:: “[” EXPR₁ “]” Sequences offer a way to gather values yielded by another expression. If you think of ALT-list as a fork, then capture is a join. The resulting expression passes any input stacks to *EXPR₁*. For each input stack, it gathers the stacks produced by *EXPR₁*, takes the top value off each of them, and collects these values in a sequence. This sequence it then pushes to TOS. *EXPR₁* is evaluated in sub-expression context. That means that the sequence is pushed to TOS *in addition* to whatever was already there--nothing is removed. Of particular interest is interplay between ALT-list and capture, which allows easy and syntactically familiar construction of sequence literals:: $ dwgrep '[1, 2, 3]' [1, 2, 3] Or you can use this construct to e.g. collect children of a node on TOS:: $ dwgrep ./tests/typedef.o -e 'entry ?root [child]' --- [[1d] typedef, [28] base_type, [2f] typedef, [3a] typedef, [45] variable] [b] compile_unit producer (strp) GNU C 4.6.3 20120306 (Red Hat 4.6.3-2); language (data1) DW_LANG_C89; name (strp) typedef.c; comp_dir (strp) /home/petr/proj/dwgrep; stmt_list (data4) 0; Or the whole tree rooted at node on TOS:: dwgrep ./tests/typedef.o -e 'entry ?root [child*]' --- [[b] compile_unit, [45] variable, [3a] typedef, [2f] typedef, [28] base_type, [1d] typedef] [b] compile_unit producer (strp) GNU C 4.6.3 20120306 (Red Hat 4.6.3-2); language (data1) DW_LANG_C89; name (strp) typedef.c; comp_dir (strp) /home/petr/proj/dwgrep; stmt_list (data4) 0; Form:: “[” “]” This produces an empty list. Alternatively, one could also do:: [!()] But that's somewhat awkward. To capture an empty expression, one would need to explicitly parenthesize it:: $ dwgrep '1 [()]' --- [1] 1 Capture with binding (``[|A|…]``) --------------------------------- Form:: “[” “|” ID₁ ID₂ … “|” EXPR₁ “]” Sub-expression capture allows a binding block. A presence of such block converts the whole form into a function that pops as many arguments as there are identifiers in the binding block, and binds them to these identifiers such that the rightmost one get the value on TOS and then progressively lefter *ID*'s get progressively deeper stack values. The expression is then evaluated as usual, except one can use the bound names:: $ dwgrep ./tests/typedef.o -e 'entry ?root [|A| A child]' [[1d] typedef, [28] base_type, [2f] typedef, [3a] typedef, [45] variable] $ dwgrep '1 [|A| A]' # or you could just write '[1]' ;) [1] Name binding (``let``) ---------------------- Form:: “let” ID₁ ID₂ … “:=” EXPR₁ “;” This form introduces a new name into the current scope. It passes the input stack to *EXPR₁*, which is evaluated in sub-expression context. Then values near TOS are bound to given identifiers and exported into surrounding context. Then the original stack is yielded as many times as *EXPR₁* yields, each time with a possibly different set of bindings. Bound names may be mentioned later, and they push the bound value to the stack. Consider the following examples:: $ dwgrep 'let X := 1; X' 1 $ dwgrep 'let X := 1, 2; X' 1 2 Ordering of ID's is such that the rightmost is bound to TOS, the next one to the left to one below TOS, etc. The mnemonic for this is that the list of variables describes stack layout, with TOS being on the right. E.g.:: $ dwgrep 'let A B := 1 2, 3 4; A B' --- 2 1 --- 4 3 The new bindings are introduced into the most enclosing scope. The following constructs comprise a scope: - Any sub-expression context has a scope of its own:: $ dwgrep '?(let A := 1;) A' dwgrep: Attempt to read an unbound name `A' $ dwgrep '(let A := 1; 1 == 1) A' dwgrep: Attempt to read an unbound name `A' - Each branch of an OR-list or ALT-list has a scope of its own:: $ dwgrep '(let A := 1;, let A := 2;) A' dwgrep: Attempt to read an unbound name `A' - Parenthesized expressions with name binding blocks are a scope:: $ dwgrep '"" (|X| let A := 1;) A' dwgrep: Attempt to read an unbound name `A' - ``then`` and ``else`` branches of an ``if-then-else`` form each introduce a scope:: $ dwgrep 'if ?() then let A := 1; else let A := 2; A' dwgrep: Attempt to read an unbound name `A' Typically these constructs can be rewritten as follows:: $ dwgrep 'let A := if ?() then 1 else 2; A' 1 On the other hand, simple parenthesizing does not introduce a scope:: $ dwgrep '(let A := 1;) A' 1 It is not allowed to rebind an once-bound name within the same scope:: $ dwgrep 'let A := 1; let A := 1;' Error: Name `A' rebound. It is also not possible to access names from outer scopes if they are shadowed by the same name in an inner scope. In the following, the inner reference to ``A`` will always resolve to ``2``, and there is no way to access the outer ``A`` of ``1``:: $ dwgrep 'let A := 1; 2 [|A| A]' [2] Sub-expression assertions (``?()``, ``!()``) -------------------------------------------- Form:: “?(” EXPR₁ “)” “!(” EXPR₁ “)” The form ``?(...)`` sends an input stack to *EXPR₁*, which is then evaluated in sub-expression context. If it yields anything, the overall assertion succeeds and yields the original stack. The form ``!(...)`` has the inverse semantics: it succeeds when *EXPR₁* yields nothing. The behavior of these two forms is equivalent to the following:: ([ EXPR₁ ] != []) # ?( EXPR₁ ) ([ EXPR₁ ] == []) # !( EXPR₁ ) For example, to select leaf DIE's of a Dwarf graph:: !(child) To test whether one of the DIE's contains an empty location expression:: entry ?(@AT_location !(elem)) Like other parenthesized forms, sub-expression assertions allow a binding block as well:: 1 2 ?(|A B| A B ?lt) Conditionals (``if-then-else``) ------------------------------- Form:: “if” EXPR₀ “then” EXPR₁ “else” EXPR₂ Input stack is passed to *EXPR₀*, which is evaluated in sub-expression context. Next a body expression is evaluated, which is either *EXPR₁* if *EXPR₀* yielded anything. Or, if *EXPR₀* yielded nothing, *EXPR₂* is the body expression. The input stack is then passed to the body expression, which is evaluated in plain context. The overall form then yields anything that the body expression yields. Both *EXPR₁* and *EXPR₂* shall have the same overall stack effect (the number of slots pushed minus number of slots popped will be the same for each branch). This form could be circumscribed by the following snippet:: (?(EXPR₀) (EXPR₁), !(EXPR₀) (EXPR₂)) As an example, consider the following snippet from a script:: if ?DW_TAG_formal_parameter then ( // Of formal parameters we ignore those that are children of // subprograms that are themselves declarations. ?(parent !DW_TAG_subroutine_type !(@DW_AT_declaration == true)) ) else ( ?DW_TAG_variable ) In particular, note the following surprising example:: $ dwgrep 'if false then "yes" else "no"' # WARNING! yes The reason is that ``false`` is a named constant, which means the tested expression always yields a stack: namely a stack with ``false`` on top. This is how Booleans should be tested in Zwerg:: if (@AT_declaration == true) then … else …