# The lexer has three modes: ARITHMETIC, COMMAND, and STRING.
# ARITHMETIC mode is operand-based: all symbols, keywords, and constant parsing
# are enabled.
# COMMAND mode is word-based: only a subset of symbols is enabled, no keyword
# or constant parsing is performed, and more liberal word formations and
# substitutions are allowed.
# STRING mode is used to read string literals (i.e. those strings that DON'T
# support variable substitutions). All chars read are appended to the resulting
# string, with no further parsing performed.

# Initially, the lexer mode is unspecified, until:
#   a) The lexer reads a character, from which the correct mode is deduced.
#   b) The parser manually switches the lexer's mode.
# Lexer state supports nesting.

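# A minimal sketch (in Python, not part of the lexer itself) of how the
# "unspecified until deduced or set" mode and its nesting could be tracked.
# ModeStack and its method names are hypothetical.

```python
from enum import Enum, auto


class Mode(Enum):
    ARITHMETIC = auto()
    COMMAND = auto()
    STRING = auto()


class ModeStack:
    """Hypothetical helper: tracks nested lexer modes; None means 'unspecified'."""

    def __init__(self):
        self._stack = [None]  # the outermost mode starts out unspecified

    @property
    def current(self):
        return self._stack[-1]

    def set(self, mode):
        # Used when the lexer deduces a mode and when the parser forces one.
        self._stack[-1] = mode

    def push(self, mode=None):
        # Enter a nested construct; its mode is unspecified unless given.
        self._stack.append(mode)

    def pop(self):
        # Leave a nested construct and return the mode it had.
        if len(self._stack) == 1:
            raise IndexError("cannot pop the outermost lexer mode")
        return self._stack.pop()
```

# In this sketch set() covers both a) and b), and push()/pop() bracket a
# nested construct so the enclosing mode is restored afterwards.
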
# ARITHMETIC
# both of these are equivalent
$a = 2
# VAR(a)
# SYMBOL(=)
# INT(2)

$b=4
# VAR(b)
# SYMBOL(=)
# INT(4)

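# A small sketch of why whitespace does not matter in ARITHMETIC mode: each
# token is recognised by its own pattern and leading whitespace is skipped.
# The regular expression and lex_arithmetic are assumptions that cover only
# variables, integers, and a few symbols.

```python
import re

# One alternative per token kind; \s* skips whitespace before each token.
TOKEN_RE = re.compile(
    r"\s*(?:\$(?P<VAR>[A-Za-z_]\w*)|(?P<INT>\d+)|(?P<SYMBOL>[=+\-*/]))"
)


def lex_arithmetic(src):
    """Return a list of (kind, text) tokens for a tiny ARITHMETIC subset."""
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        kind = m.lastgroup  # name of the alternative that matched
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens
```

# With this sketch, '$a = 2' and '$b=4' produce the same sequence of token
# kinds. An input such as '$a$b' also lexes cleanly into two VAR tokens;
# rejecting it is left to the parser.
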
# ARITHMETIC
# this is a syntax error (there should be an operator between the two vars)
$a$b
# VAR(a)
# VAR(b)

# When the parser encounters SYMBOL(%) it should switch the lexer to COMMAND
# mode, which will allow the following word construction to be used.
# this executes the command whose name is equal to concatenating the values
# of $a and $b (in this case, '24')
% $a$b
# SYMBOL(%)
# WORD_START
# VAR(a)
# VAR(b)
# WORD_END

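# A sketch of how one COMMAND-mode word could be split into the tokens listed
# above. lex_command_word is hypothetical; emitting WORD for literal text
# inside a compound word is an assumption, and the hyphen check follows the
# rule described further down in these notes.

```python
import re

VAR_RE = re.compile(r"\$([A-Za-z_]\w*)")


def lex_command_word(word):
    """Split a single COMMAND-mode word into (kind, text) tokens."""
    if "$" not in word or word.startswith("-"):
        # Plain word, or a hyphen-prefixed word where substitution is suppressed.
        return [("WORD", word)]
    tokens, pos = [("WORD_START", None)], 0
    for m in VAR_RE.finditer(word):
        if m.start() > pos:
            # Literal text between variables (assumed token kind).
            tokens.append(("WORD", word[pos:m.start()]))
        tokens.append(("VAR", m.group(1)))
        pos = m.end()
    if pos < len(word):
        tokens.append(("WORD", word[pos:]))
    tokens.append(("WORD_END", None))
    return tokens
```

# The WORD_START/WORD_END pair lets the parser concatenate the enclosed parts
# into one command name or argument.
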
# executes the command with the name 'a+2b'. because the first char encountered
# by the lexer is alphabetic, it reads a regular word in COMMAND mode.
a+2b
# WORD(a+2b)

# executes the command with the name '-no$a' ($a is not substituted).
# the first char encountered is a symbol, which is read as a word in COMMAND
# mode
-no$a
# WORD(-no$a)

# returns the result of applying the NOT operator to the value of $a.
# the first char encountered is a symbol, which is read as a word in COMMAND
# mode. as characters are read, they are compared against registered operators.
# if a match is found, the operator is emitted, and the parser will switch
# the lexer to ARITHMETIC mode
-not$a
# OP(not)
# VAR(a)

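# A sketch of the operator check described above: the characters after the
# hyphen are compared against a registry as they are read. OPERATORS and
# lex_hyphen_word are hypothetical, and the registry contents are an
# assumption. The function returns the tokens it emitted plus the unread rest.

```python
OPERATORS = {"not", "lt"}  # assumed registry of hyphen-prefixed operators


def lex_hyphen_word(src):
    """Return (tokens, rest) for input that starts with '-'."""
    if not src.startswith("-"):
        raise ValueError("expected a leading hyphen")
    word = src.split()[0]  # the first whitespace-delimited word
    for end in range(2, len(word) + 1):
        if word[1:end] in OPERATORS:
            # Operator found: emit it; the parser would now switch the
            # lexer to ARITHMETIC mode to read the rest.
            return [("OP", word[1:end])], src[end:]
    # No operator matched: the whole word stays a COMMAND-mode word.
    return [("WORD", word)], src[len(word):]
```

# Because the match is made while reading, '-no$a' never matches and stays a
# word, while '-not$a' emits the operator and leaves '$a' for ARITHMETIC mode.
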
# executes the command with the name '-not$a' ($a is NOT substituted)
# because of the preceding hyphen, variable substitution is not performed.
% -not$a
# SYMBOL(%)
# WORD(-not$a)

# executes the command with the name '-not2' ($a IS substituted)
# variable substitution IS performed in dquote strings regardless of the hyphen.
% "-not$a"
# SYMBOL(%)
# STR_START
# STRING(-not)
# VAR(a)
# STR_END

# interpreted as a command with args ['a', '+b', '/c']
# the first char encountered is alphabetic, so the expression is parsed in
# COMMAND mode
a +b /c
# WORD(a)
# WORD(+b)
# WORD(/c)

# interpreted as an arithmetic expression (but not a well-formed one)
+b /c
# SYMBOL(+)
# WORD(b)
# SYMBOL(/)
# WORD(c)

# interpreted as a command with name '%+'
%+
# WORD(%+)

# interpreted as a command with args ['%', '+']
% +
# WORD(%)
# WORD(+)

# interpreted as a command with name '%'
%;
# WORD(%)
# SYMBOL(;)

# interpreted as a command with name '+'
&+
# SYMBOL(&)
# WORD(+)

# interpreted as a string, which triggers the parser to enter ARITHMETIC mode
'hello world'
# STRING(hello world)

# interpreted as a command with args ['echo', 'hello world']
echo 'hello world'
# WORD(echo)
# STRING(hello world)

# interpreted as an interpolated string
"Hello $(if ($x -lt 5) { echo 'yes' } else {echo 'no'})"


###############################################################################
# The lexer operates as a state machine, moving between different states as
# different characters are encountered.
# The states are stored in a stack, to allow recursive parsing.
# The lexer has the following states:
#   STATEMENT: A generic statement, could be a command, keyword, arithmetic
#     expression, etc. The next char or symbol encountered will cause the
#     lexer to switch to the appropriate state type:
#       Letters, word-symbols -> COMMAND
#       squote -> ARITHMETIC
#       dquote -> ARITHMETIC, FSTRING
#       Digits, vars, var-splats, keywords, all other symbols -> ARITHMETIC
#   EXPRESSION: Similar to STATEMENT, but only allows a single command or
#     arithmetic expression. CANNOT use keywords or statement terminators.
#       Letters, word-symbols -> COMMAND
#       squote -> ARITHMETIC
#       dquote -> ARITHMETIC, FSTRING
#       Digits, vars, var-splats, keywords, all other symbols -> ARITHMETIC
#   COMMAND: Only words, (f)strings, vars, var-splats, and a subset of symbols
#     are parsed.
#   ARITHMETIC: Words, strings, vars, var-splats, all symbols, and keywords are
#     parsed.
#   STRING: Only a subset of symbols are parsed, all other characters are
#     appended to the resulting string.
#
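# A sketch of the first-character dispatch listed for STATEMENT and
# EXPRESSION. initial_states is hypothetical, and WORD_SYMBOLS is an
# assumption, since these notes do not enumerate the word-symbols. Keywords
# begin with a letter, so a real lexer would need a whole-word lookup before
# settling on COMMAND; that step is omitted here.

```python
WORD_SYMBOLS = frozenset("-%&")  # assumption: not specified in the notes


def initial_states(first_char):
    """Return the state types to push, outermost first, for a first character."""
    if first_char.isalpha() or first_char in WORD_SYMBOLS:
        return ["COMMAND"]
    if first_char == '"':
        # dquote strings are read as an FSTRING nested inside ARITHMETIC.
        return ["ARITHMETIC", "FSTRING"]
    # squote, digits, '$' variables, and every other symbol
    return ["ARITHMETIC"]
```

# Under these assumptions 'a +b /c' starts in COMMAND, while '+b /c' starts in
# ARITHMETIC, matching the examples earlier in these notes.
#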
# Once a state has changed from EXPRESSION to one of the other three state
# types, certain characters will result in the current state either changing
# type or being popped from the stack:
#   STATEMENT:  semicolon -> STATEMENT
#               right-paren, right-brace -> POP
#   EXPRESSION: semicolon -> POP
#               right-paren, right-brace -> POP
#   COMMAND:    semicolon -> STATEMENT
#               right-paren, right-brace -> POP
#   ARITHMETIC: semicolon -> STATEMENT
#               right-paren, right-brace -> POP
#
# Certain symbols require recursive parsing:
# - dquote strings allow string interpolation, so expressions within the string
#   may be parsed in a different state. Once the expression is complete, the
#   lexer returns to the previous state.
# - in most cases, $(...) can be used to delimit sub-expressions (including in
#   strings). When '$(' is encountered, a new state entry of type EXPRESSION is
#   pushed onto the stack. When the corresponding ')' is encountered, that state
#   entry is popped from the stack.
# - similarly to $(...), (...) can be used to group expressions, just like in
#   mathematical expressions.
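#
# A sketch of the push/pop behaviour for '$(' and '(' described above.
# nesting_trace is hypothetical, GROUP is an invented name for the entry
# pushed by a plain '(', and quoting is ignored, so parentheses inside string
# literals would be miscounted.

```python
def nesting_trace(src):
    """Return the state stack depth recorded after each push or pop."""
    stack = ["STATEMENT"]  # outermost state
    trace = []
    i = 0
    while i < len(src):
        if src.startswith("$(", i):
            stack.append("EXPRESSION")  # sub-expression delimiter
            trace.append(len(stack))
            i += 2
            continue
        if src[i] == "(":
            stack.append("GROUP")  # assumed name for plain (...) grouping
            trace.append(len(stack))
        elif src[i] == ")":
            if len(stack) == 1:
                raise SyntaxError(f"unmatched ')' at offset {i}")
            stack.pop()  # return to the enclosing state
            trace.append(len(stack))
        i += 1
    if len(stack) != 1:
        raise SyntaxError("unclosed '(' or '$('")
    return trace
```

# For '$(if ($x -lt 5) { echo yes })' the depth goes 2, 3, 2, 1: one push for
# '$(', one for '(', and a pop for each ')'.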