# The lexer has three modes: ARITHMETIC, COMMAND, and STRING.
# ARITHMETIC mode is operand-based: all symbols, keywords, and constant parsing
# are enabled.
# COMMAND mode is word-based: only a subset of symbols is enabled, no keyword
# or constant parsing is performed, and more liberal word formations and
# substitutions are allowed.
# STRING mode is used to read string literals (i.e. those strings that DON'T
# support variable substitutions). All chars read are appended to the resulting
# string, with no further parsing performed.

# Initially, the lexer mode is unspecified, until either:
# a) The lexer reads a character, from which the correct mode is deduced.
# b) The parser manually switches the lexer's mode.
# Lexer state supports nesting.

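# A minimal sketch, in Python, of how the three modes and their nesting could
# be modelled. The names Mode and ModeStack are hypothetical and are not taken
# from the lexer itself; None stands for the initial "unspecified" mode.

```python
from enum import Enum, auto


class Mode(Enum):
    ARITHMETIC = auto()
    COMMAND = auto()
    STRING = auto()


class ModeStack:
    """Hypothetical holder for nested lexer modes; None means 'unspecified'."""

    def __init__(self):
        self._stack = [None]  # the initial mode is unspecified

    @property
    def current(self):
        return self._stack[-1]

    def set(self, mode):
        # Fix the mode of the innermost entry, either because the lexer
        # deduced it from a character (a) or the parser switched it (b).
        self._stack[-1] = mode

    def push(self, mode=None):
        # Enter a nested context; its mode starts unspecified by default.
        self._stack.append(mode)

    def pop(self):
        if len(self._stack) == 1:
            raise IndexError("cannot pop the outermost lexer mode")
        return self._stack.pop()


modes = ModeStack()
modes.set(Mode.COMMAND)     # e.g. the parser saw SYMBOL(%)
modes.push()                # nested context, mode not yet known
modes.set(Mode.ARITHMETIC)
modes.pop()
print(modes.current)        # Mode.COMMAND
```

# Keeping the modes on a stack means leaving a nested context restores the
# enclosing mode without the parser having to remember it.
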
# ARITHMETIC
# both of these are equivalent (whitespace around '=' is optional)
$a = 2
# VAR(a)
# SYMBOL(=)
# INT(2)

$b=4
# VAR(b)
# SYMBOL(=)
# INT(4)

# ARITHMETIC
# this is a syntax error (there should be an operator between the two vars)
$a$b
# VAR(a)
# VAR(b)

# When the parser encounters SYMBOL(%) it should switch the lexer to COMMAND
# mode, which allows the following word construction to be used.
# This executes the command whose name is the concatenation of the values
# of $a and $b (in this case, '24').
% $a$b
# SYMBOL(%)
# WORD_START
# VAR(a)
# VAR(b)
# WORD_END

# executes the command with the name 'a+2b'. because the first char encountered
# by the lexer is alphabetic, it reads a regular word in COMMAND mode.
a+2b
# WORD(a+2b)

# executes the command with the name '-no$a' ($a is not substituted).
# the first char encountered is a symbol, which is read as a word in COMMAND
# mode.
-no$a
# WORD(-no$a)

# returns the result of applying the NOT operator to the value of $a.
# the first char encountered is a symbol, which is read as a word in COMMAND
# mode. as characters are read, they are compared against registered operators.
# if a match is found, the operator is emitted, and the parser will switch
# the lexer to ARITHMETIC mode.
-not$a
# OP(not)
# VAR(a)

# executes the command with the name '-not$a' ($a is NOT substituted):
# because of the leading hyphen, variable substitution is not performed.
% -not$a
# SYMBOL(%)
# WORD(-not$a)

# executes the command with the name '-not2' ($a IS substituted).
# variable substitution IS performed in dquote strings regardless of the hyphen.
% "-not$a"
# SYMBOL(%)
# STR_START
# STRING(-not)
# VAR(a)
# STR_END

# interpreted as a command with args ['a', '+b', '/c']
# the first char encountered is alphabetic, so the expression is parsed in
# COMMAND mode
a +b /c
# WORD(a)
# WORD(+b)
# WORD(/c)

# interpreted as an arithmetic expression (but not a well-formed one)
+b /c
# SYMBOL(+)
# WORD(b)
# SYMBOL(/)
# WORD(c)

# interpreted as a command with name '%+'
%+
# WORD(%+)

# interpreted as a command with args ['%', '+']
% +
# WORD(%)
# WORD(+)

# interpreted as a command with name '%'
%;
# WORD(%)
# SYMBOL(;)

# interpreted as a command with name '+'
&+
# SYMBOL(&)
# WORD(+)

# interpreted as a string, which triggers the parser to enter ARITHMETIC mode
'hello world'
# STRING(hello world)

# interpreted as a command with args ['echo', 'hello world']
echo 'hello world'
# WORD(echo)
# STRING(hello world)

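# A minimal Python sketch of STRING-mode reading, as used for the single-quoted
# literals above. The function name read_squote is hypothetical. The sketch
# assumes a literal ends at the next single quote and has no escape sequences,
# which these notes do not specify.

```python
def read_squote(text, start):
    """Read a single-quoted literal that begins at text[start].

    Returns (value, next_index). Every character between the quotes is
    appended as-is: no keyword, symbol, or variable parsing is performed.
    """
    if start >= len(text) or text[start] != "'":
        raise ValueError("expected an opening single quote")
    # Assumption: no escape sequences, so the next quote closes the literal.
    end = text.find("'", start + 1)
    if end == -1:
        raise ValueError("unterminated string literal")
    return text[start + 1:end], end + 1


value, next_index = read_squote("echo 'hello world'", 5)
print(value)       # hello world
print(next_index)  # 18
```

# Because nothing inside the quotes is interpreted, a literal such as '$a' would
# yield the two characters "$a" rather than the value of the variable.
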
# interpreted as an interpolated string
"Hello $(if ($x -lt 5) { echo 'yes' } else {echo 'no'})"


###############################################################################
# The lexer operates as a state machine, moving between states as different
# characters are encountered.
# The states are stored in a stack, to allow recursive parsing.
# The lexer has the following states:
#   STATEMENT: A generic statement, could be a command, keyword, arithmetic
#     expression, etc. The next char or symbol encountered will cause the
#     lexer to switch to the appropriate state type:
#       Letters, word-symbols -> COMMAND
#       squote -> ARITHMETIC
#       dquote -> ARITHMETIC, FSTRING
#       Digits, vars, var-splats, keywords, all other symbols -> ARITHMETIC
#   EXPRESSION: Similar to STATEMENT, but only allows a single command or
#     arithmetic expression. CANNOT use keywords or statement terminators.
#       Letters, word-symbols -> COMMAND
#       squote -> ARITHMETIC
#       dquote -> ARITHMETIC, FSTRING
#       Digits, vars, var-splats, keywords, all other symbols -> ARITHMETIC
#   COMMAND: Only words, (f)strings, vars, var-splats, and a subset of
#     symbols are parsed.
#   ARITHMETIC: Words, strings, vars, var-splats, all symbols, and keywords
#     are parsed.
#   STRING: Only a subset of symbols are parsed; all other characters are
#     appended to the resulting string.
#
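# A rough Python sketch of the first-character dispatch described above for
# STATEMENT and EXPRESSION. The function name classify is hypothetical, as are
# the default word-symbol set ("-") and keyword set ("if", "else"): these notes
# do not list either. "dquote -> ARITHMETIC, FSTRING" is read here as entering
# both state types, which is also an assumption.

```python
def classify(text, word_symbols="-", keywords=("if", "else")):
    """Return the state type(s) chosen from the start of `text`."""
    stripped = text.lstrip()
    if not stripped:
        raise ValueError("nothing to classify")
    first_char = stripped[0]
    first_word = stripped.split(None, 1)[0]

    if first_char == '"':
        return ["ARITHMETIC", "FSTRING"]
    if first_char == "'":
        return ["ARITHMETIC"]
    if first_word in keywords:
        # Keywords start with letters, so they are checked before words.
        return ["ARITHMETIC"]
    if first_char.isalpha() or first_char in word_symbols:
        return ["COMMAND"]
    # Digits, vars, var-splats, and all other symbols.
    return ["ARITHMETIC"]


for source in ["a +b /c", "+b /c", "$a = 2", "'hello world'", '"-not$a"']:
    print(source, "->", classify(source))
```

# The sketch only looks at the start of the input. It does not model the later
# refinement where a word such as -not is matched against registered operators
# and the lexer is switched to ARITHMETIC mode.
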
# Once a state has changed from EXPRESSION to one of the other three state
# types, certain characters will result in the current state either changing
# type or being popped from the stack:
#   STATEMENT:  semicolon -> STATEMENT
#               left-paren, left-brace -> POP
#   EXPRESSION: semicolon -> POP
#               left-paren, left-brace -> POP
#   COMMAND:    semicolon -> STATEMENT
#               left-paren, left-brace -> POP
#   ARITHMETIC: semicolon -> STATEMENT
#               left-paren, left-brace -> POP
#
# Certain symbols require recursive parsing:
# - dquote strings allow string interpolation, so expressions within the string
#   may be parsed in a different state. Once the expression is complete, the
#   lexer returns to the previous state.
# - in most cases, $(...) can be used to delimit sub-expressions (including in
#   strings). When '$(' is encountered, a new state entry of type EXPRESSION is
#   pushed onto the stack. When the corresponding ')' is encountered, that state
#   entry is popped from the stack.
# - similarly to $(...), (...) can be used to group expressions, just like in
#   mathematical expressions.
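
# The push/pop behaviour of $(...) and (...) can be sketched in Python as
# follows. This is an illustration only: the function name trace_parens and the
# "GROUP" entry pushed for a bare '(' are hypothetical, and the sketch ignores
# quoting and braces entirely.

```python
def trace_parens(text):
    """Return the list of ("push" | "pop", state) events for `text`."""
    stack = ["STATEMENT"]
    events = []
    i = 0
    while i < len(text):
        if text.startswith("$(", i):
            # '$(' opens a sub-expression: push an EXPRESSION entry.
            stack.append("EXPRESSION")
            events.append(("push", "EXPRESSION"))
            i += 2
            continue
        ch = text[i]
        if ch == "(":
            # Assumption: a bare '(' pushes a hypothetical GROUP entry.
            stack.append("GROUP")
            events.append(("push", "GROUP"))
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError(f"unmatched ')' at offset {i}")
            # The corresponding ')' pops the entry that was pushed.
            events.append(("pop", stack.pop()))
        i += 1
    if len(stack) != 1:
        raise ValueError("unclosed '(' or '$('")
    return events


for event in trace_parens("\"Hello $(if ($x -lt 5) { echo 'yes' })\""):
    print(event)
```

# Each ')' pops whichever entry was pushed most recently, so nested groups
# inside a $(...) sub-expression unwind in the right order.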