# The lexer has three modes: ARITHMETIC, COMMAND, and STRING.
# ARITHMETIC mode is operand-based: all symbols, keywords, and constant parsing
# are enabled.
# COMMAND mode is word-based: only a subset of symbols is enabled, no keyword
# or constant parsing is performed, and more liberal word formations and
# substitutions are allowed.
# STRING mode is used to read string literals (i.e. those strings that DON'T
# support variable substitutions). All chars read are appended to the resulting
# string, with no further parsing performed.

# Initially, the lexer mode is unspecified, until either:
# a) The lexer reads a character, from which the correct mode is deduced.
# b) The parser manually switches the lexer's mode.

# Lexer state supports nesting.

# ARITHMETIC
# both of these are equivalent
$a = 2
# VAR(a)
# SYMBOL(=)
# INT(2)
$b=4
# VAR(b)
# SYMBOL(=)
# INT(4)

# ARITHMETIC
# this is a syntax error (there should be an operator between the two vars)
$a$b
# VAR(a)
# VAR(b)

# When the parser encounters SYMBOL(%) it should switch the lexer to COMMAND
# mode, which will allow the following word construction to be used.
# this executes the command whose name is equal to concatenating the values
# of $a and $b (in this case, '24')
% $a$b
# SYMBOL(%)
# WORD_START
# VAR(a)
# VAR(b)
# WORD_END

# executes the command with the name 'a+2b'. because the first char encountered
# by the lexer is alphabetic, it reads a regular word in COMMAND mode.
a+2b
# WORD(a+2b)

# executes the command with the name '-no$a' ($a is not substituted).
# the first char encountered is a symbol, which is read as a word in COMMAND
# mode
-no$a
# WORD(-no$a)

# returns the result of applying the NOT operator to the value of $a.
# the first char encountered is a symbol, which is read as a word in COMMAND
# mode. as characters are read, they are compared against registered operators.
# if a match is found, the operator is emitted, and the parser will switch
# the lexer to ARITHMETIC mode
-not$a
# OP(not)
# VAR(a)

# executes the command with the name '-not$a' ($a is NOT substituted)
# because of the preceding hyphen, variable substitution is not performed.
% -not$a
# SYMBOL(%)
# WORD(-not$a)

# executes the command with the name '-not2' ($a IS substituted)
# variable substitution IS performed in dquote strings regardless of the hyphen.
% "-not$a"
# SYMBOL(%)
# STR_START
# STRING(-not)
# VAR(a)
# STR_END

# interpreted as a command with args ['a', '+b', '/c']
# the first char encountered is alphabetic, so the expression is parsed in
# COMMAND mode
a +b /c
# WORD(a)
# WORD(+b)
# WORD(/c)

# interpreted as an arithmetic expression (but not a well-formed one)
+b /c
# SYMBOL(+)
# WORD(b)
# SYMBOL(/)
# WORD(c)

# interpreted as a command with name '%+'
%+
# WORD(%+)

# interpreted as a command with args ['%', '+']
% +
# WORD(%)
# WORD(+)

# interpreted as a command with name '%'
%;
# WORD(%)
# SYMBOL(;)

# interpreted as a command with name '+'
&+
# SYMBOL(&)
# WORD(+)

# interpreted as a string, which triggers the parser to enter ARITHMETIC mode
'hello world'
# STRING(hello world)

# interpreted as a command with args ['echo', 'hello world']
echo 'hello world'
# WORD(echo)
# STRING(hello world)

# interpreted as an interpolated string
"Hello $(if ($x -lt 5) { echo 'yes' } else {echo 'no'})"

###############################################################################
# The lexer operates as a state machine, moving between different states as
# different characters are encountered.
# The states are stored in a stack, to allow recursive parsing.
# The lexer has the following states:
#   STATEMENT: A generic statement, could be a command, keyword, arithmetic
#       expression, etc.
#       The next char or symbol encountered will cause the lexer to switch
#       to the appropriate state type:
#         letters, word-symbols -> COMMAND
#         squote -> ARITHMETIC
#         dquote -> ARITHMETIC, FSTRING
#         digits, vars, var-splats, keywords, all other symbols -> ARITHMETIC
#   EXPRESSION: Similar to STATEMENT, but only allows a single command or
#       arithmetic expression. CANNOT use keywords or statement terminators.
#         letters, word-symbols -> COMMAND
#         squote -> ARITHMETIC
#         dquote -> ARITHMETIC, FSTRING
#         digits, vars, var-splats, keywords, all other symbols -> ARITHMETIC
#   COMMAND: Only words, (f)strings, vars, var-splats, and a subset of symbols
#       are parsed.
#   ARITHMETIC: Words, strings, vars, var-splats, all symbols, and keywords are
#       parsed.
#   STRING: Only a subset of symbols is parsed; all other characters are
#       appended to the resulting string.
#
# Once a state has changed from EXPRESSION to one of the other three state
# types, certain characters will result in the current state either changing
# type or being popped from the stack:
#   STATEMENT:  semicolon -> STATEMENT
#               right-paren, right-brace -> POP
#   EXPRESSION: semicolon -> POP
#               right-paren, right-brace -> POP
#   COMMAND:    semicolon -> STATEMENT
#               right-paren, right-brace -> POP
#   ARITHMETIC: semicolon -> STATEMENT
#               right-paren, right-brace -> POP
#
# Certain symbols require recursive parsing:
# - dquote strings allow string interpolation, so expressions within the string
#   may be parsed in a different state. Once the expression is complete, the
#   lexer returns to the previous state.
# - in most cases, $(...) can be used to delimit sub-expressions (including in
#   strings). When '$(' is encountered, a new state entry of type EXPRESSION is
#   pushed onto the stack. When the corresponding ')' is encountered, that state
#   entry is popped from the stack.
# - similarly to $(...), (...) can be used to group expressions, just like in
#   mathematical expressions.
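
# The state stack described above can be sketched in Python. This is only an
# illustrative sketch, not the lexer's actual implementation. The names State,
# WORD_SYMBOLS, UNDECIDED, deduce and trace are hypothetical. The set of
# word-symbols ('-' and '%') is an assumption inferred from the examples, and
# a plain '(' is assumed to push an EXPRESSION entry the same way '$(' does.
# Quotes, braces and actual token emission are left out.

```python
from enum import Enum, auto


class State(Enum):
    STATEMENT = auto()
    EXPRESSION = auto()
    COMMAND = auto()
    ARITHMETIC = auto()


# Assumption: the notes never enumerate the word-symbols; '-' and '%' are
# inferred from the examples '-no$a' and '%+'.
WORD_SYMBOLS = frozenset("-%")

# Entry types that still have to be resolved by the next char.
UNDECIDED = (State.STATEMENT, State.EXPRESSION)


def deduce(ch: str) -> State:
    """Pick the type an undecided entry switches to, based on one char."""
    if ch.isalpha() or ch in WORD_SYMBOLS:
        return State.COMMAND
    # Quotes, digits, '$' (vars) and every other symbol lead to ARITHMETIC.
    return State.ARITHMETIC


def trace(source: str) -> list[tuple[str, list[str]]]:
    """Return (trigger, stack snapshot) for every change to the state stack."""
    stack = [State.STATEMENT]
    changes = []
    i = 0
    while i < len(source):
        ch = source[i]
        # Treat '$(' as a single two-char trigger; everything else is one char.
        trigger = "$(" if source.startswith("$(", i) else ch
        before = list(stack)
        if ch.isspace():
            pass
        elif trigger in ("$(", "("):
            # An undecided entry is resolved first, then a new entry is pushed.
            if stack[-1] in UNDECIDED:
                stack[-1] = deduce(ch)
            stack.append(State.EXPRESSION)
        elif ch == ")":
            # Never pop the bottom entry.
            if len(stack) > 1:
                stack.pop()
        elif ch == ";":
            # EXPRESSION is never the bottom entry, so this pop is safe.
            if stack[-1] is State.EXPRESSION:
                stack.pop()
            else:
                stack[-1] = State.STATEMENT
        elif stack[-1] in UNDECIDED:
            stack[-1] = deduce(ch)
        if stack != before:
            changes.append((trigger, [s.name for s in stack]))
        i += len(trigger)
    return changes


if __name__ == "__main__":
    for trigger, snapshot in trace("echo $(1 + 2); $a"):
        print(repr(trigger), snapshot)
```

# Mutating the top entry in place (rather than pushing a second entry) follows
# the wording "the current state either changing type or being popped".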