# The lexer has three modes: ARITHMETIC, COMMAND, and STRING
# ARITHMETIC mode is operand-based: all symbols, keywords, and constant parsing
# are enabled.
# COMMAND mode is word-based: only a subset of symbols is enabled, no keyword
# or constant parsing is performed, and more liberal word formations and
# substitutions are allowed.
# STRING mode is used to read string literals (i.e. those strings that DON'T
# support variable substitutions). All chars read are appended to the resulting
# string, with no further parsing performed.
# Initially, the lexer mode is unspecified, until:
# a) The lexer reads a character, from which the correct mode is deduced.
# b) The parser manually switches the lexer's mode
# Lexer state supports nesting.
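# A minimal sketch of rules (a) and (b) and of mode nesting, written in Python
# purely for illustration. The names Mode, WORD_SYMBOLS, deduce_mode and
# ModeStack are hypothetical, and the set of word-symbols is an assumption
# (these notes do not enumerate them).

```python
from enum import Enum, auto


class Mode(Enum):
    ARITHMETIC = auto()
    COMMAND = auto()
    STRING = auto()


# Assumption: '-' is used here only because a leading hyphen starts a
# COMMAND word in the examples; the real set of word-symbols may differ.
WORD_SYMBOLS = {"-"}


def deduce_mode(first_char):
    """Rule (a): pick a mode from the first character read."""
    if first_char.isalpha() or first_char in WORD_SYMBOLS:
        return Mode.COMMAND
    # Digits, '$', quotes and every other symbol start an arithmetic expression.
    return Mode.ARITHMETIC


class ModeStack:
    """Nested lexer modes; an empty stack means the mode is still unspecified."""

    def __init__(self):
        self._modes = []

    @property
    def current(self):
        return self._modes[-1] if self._modes else None

    def see(self, first_char):
        # Only deduce a mode while none has been chosen yet.
        if self.current is None:
            self._modes.append(deduce_mode(first_char))
        return self.current

    def switch(self, mode):
        """Rule (b): the parser overrides the mode of the innermost level."""
        if self._modes:
            self._modes[-1] = mode
        else:
            self._modes.append(mode)

    def push(self, mode):
        self._modes.append(mode)

    def pop(self):
        return self._modes.pop()
```

# In this sketch the parser would call switch(Mode.COMMAND) after seeing
# SYMBOL(%), and push/pop would bracket any nested construct.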
# ARITHMETIC
# both of these are equivalent
$a = 2
# VAR(a)
# SYMBOL(=)
# INT(2)
$b=4
# VAR(b)
# SYMBOL(=)
# INT(4)
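# The token streams above could be produced by a sketch like the following.
# It is an illustration only: TOKEN_RE and lex_arithmetic are hypothetical
# names, and the symbol set is a small assumed subset of the real one.

```python
import re

# One alternative per token kind; leading whitespace is skipped, which is why
# '$a = 2' and '$b=4' yield the same shape of token stream.
TOKEN_RE = re.compile(
    r"\s*(?:\$(?P<VAR>[A-Za-z_]\w*)|(?P<INT>\d+)|(?P<SYMBOL>[=+\-*/%]))"
)


def lex_arithmetic(src):
    """Split an arithmetic expression into (kind, text) tokens."""
    src = src.rstrip()
    tokens = []
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}: {src[pos]!r}")
        kind = m.lastgroup  # name of the alternative that matched
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens
```

# Note that such a lexer accepts '$a$b' and emits two VAR tokens; rejecting
# the missing operator is left to the parser.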
# ARITHMETIC
# this is a syntax error (there should be an operator between the two vars)
$a$b
# VAR(a)
# VAR(b)
# When the parser encounters SYMBOL(%) it should switch the lexer to COMMAND
# mode, which will allow the following word construction to be used.
# this executes the command whose name is equal to concatenating the values
# of $a and $b (in this case, '24')
% $a$b
# SYMBOL(%)
# WORD_START
# VAR(a)
# VAR(b)
# WORD_END
# executes the command with the name 'a+2b'. because the first char encountered
# by the lexer is alphabetic, it reads a regular word in COMMAND mode.
a+2b
# WORD(a+2b)
# executes the command with the name '-no$a' ($a is not substituted).
# the first char encountered is a symbol, which is read as a word in COMMAND
# mode
-no$a
# WORD(-no$a)
# returns the result of applying the NOT operator to the value of $a.
# the first char encountered is a symbol, which is read as a word in COMMAND
# mode. as characters are read, they are compared against registered operators.
# if a match is found, the operator is emitted, and the parser will switch
# the lexer to ARITHMETIC mode
-not$a
# OP(not)
# VAR(a)
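# A rough sketch of the operator check described above. REGISTERED_OPERATORS,
# HYPHEN_WORD_RE and lex_hyphen_word are hypothetical names. As a
# simplification, the sketch compares the whole alphabetic run after the
# hyphen against the registry instead of comparing character by character,
# so that a word such as '-note' is not mistaken for the operator 'not'.

```python
import re

# Hypothetical operator registry; only 'not' and 'lt' appear in these notes.
REGISTERED_OPERATORS = {"not", "lt"}

HYPHEN_WORD_RE = re.compile(r"-([A-Za-z]+)")


def lex_hyphen_word(src):
    """Return (tokens, rest) for text that starts with a hyphen.

    rest is the unread remainder, to be lexed in ARITHMETIC mode once the
    parser has switched the lexer over.
    """
    m = HYPHEN_WORD_RE.match(src)
    if m and m.group(1) in REGISTERED_OPERATORS:
        return [("OP", m.group(1))], src[m.end():]
    # No operator matched: the whole text stays a single COMMAND word.
    return [("WORD", src)], ""
```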
# executes the command with the name '-not$a' ($a is NOT substituted)
# because of the preceding hyphen, variable substitution is not performed.
% -not$a
# SYMBOL(%)
# WORD(-not$a)
# executes the command with the name '-not2' ($a IS substituted)
# variable substitution IS performed in dquote strings regardless of the hyphen.
% "-not$a"
# SYMBOL(%)
# STR_START
# STRING(-not)
# VAR(a)
# STR_END
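# The dquote token stream above can be sketched as follows. VAR_RE and
# lex_dquote are hypothetical names, and the sketch handles only plain $name
# substitutions; $(...) sub-expressions and escapes are deliberately left out.

```python
import re

VAR_RE = re.compile(r"\$([A-Za-z_]\w*)")


def lex_dquote(src):
    """Tokenize a double-quoted string, splitting out $name substitutions."""
    if len(src) < 2 or src[0] != '"' or src[-1] != '"':
        raise SyntaxError("expected a double-quoted string")
    body = src[1:-1]
    tokens = [("STR_START", None)]
    pos = 0
    for m in VAR_RE.finditer(body):
        if m.start() > pos:
            # Literal text between the previous token and this variable.
            tokens.append(("STRING", body[pos:m.start()]))
        tokens.append(("VAR", m.group(1)))
        pos = m.end()
    if pos < len(body):
        tokens.append(("STRING", body[pos:]))
    tokens.append(("STR_END", None))
    return tokens
```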
# interpreted as a command with args ['a', '+b', '/c']
# the first char encountered is alphabetic, so the expression is parsed in
# COMMAND mode
a +b /c
# WORD(a)
# WORD(+b)
# WORD(/c)
# interpreted as an arithmetic expression (but not a well-formed one)
+b /c
# SYMBOL(+)
# WORD(b)
# SYMBOL(/)
# WORD(c)
# interpreted as a command with name '%+'
%+
# WORD(%+)
# interpreted as a command with args ['%', '+']
% +
# WORD(%)
# WORD(+)
# interpreted as a command with name '%'
%;
# WORD(%)
# SYMBOL(;)
# interpreted as a command with name '+'
&+
# SYMBOL(&)
# WORD(+)
# interpreted as a string, which triggers the parser to enter ARITHMETIC mode
'hello world'
# STRING(hello world)
# interpreted as a command with args ['echo', 'hello world']
echo 'hello world'
# WORD(echo)
# STRING(hello world)
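# A small sketch of COMMAND-mode word splitting with squote literals, for
# illustration only. lex_command is a hypothetical name; variables, dquote
# strings and symbols such as ';' are not handled here.

```python
def lex_command(src):
    """Split a command line into WORD and STRING tokens."""
    tokens = []
    i = 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
            continue
        if ch == "'":
            # STRING mode: every char up to the closing quote is taken verbatim.
            end = src.find("'", i + 1)
            if end == -1:
                raise SyntaxError("unterminated string literal")
            tokens.append(("STRING", src[i + 1:end]))
            i = end + 1
            continue
        # COMMAND mode: a word runs until whitespace or the next quote.
        start = i
        while i < len(src) and not src[i].isspace() and src[i] != "'":
            i += 1
        tokens.append(("WORD", src[start:i]))
    return tokens
```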
# interpreted as an interpolated string
"Hello $(if ($x -lt 5) { echo 'yes' } else {echo 'no'})"
###############################################################################
# The lexer operates as a state machine, moving between different states as
# different characters are encountered
# The states are stored in a stack, to allow recursive parsing.
# The lexer has the following states:
# STATEMENT: A generic statement, could be a command, keyword, arithmetic
#   expression, etc. The next char or symbol encountered will cause the
#   lexer to switch to the appropriate state type:
#     letters, word-symbols -> COMMAND
#     squote -> ARITHMETIC
#     dquote -> ARITHMETIC, FSTRING
#     digits, vars, var-splats, keywords, all other symbols -> ARITHMETIC
# EXPRESSION: Similar to STATEMENT, but only allows a single command or
#   arithmetic expression. CANNOT use keywords or statement terminators.
#     letters, word-symbols -> COMMAND
#     squote -> ARITHMETIC
#     dquote -> ARITHMETIC, FSTRING
#     digits, vars, var-splats, keywords, all other symbols -> ARITHMETIC
# COMMAND: Only words, (f)strings, vars, var-splats, and a subset of symbols are
#   parsed.
# ARITHMETIC: Words, strings, vars, var-splats, all symbols, keywords are parsed.
# STRING: Only a subset of symbols are parsed, all other characters are appended
#   to the resulting string.
#
# Once a state has changed from EXPRESSION to one of the other three state
# types, certain characters will result in the current state either changing
# type or being popped from the stack:
# STATEMENT: semicolon -> STATEMENT
#   right-paren, right-brace -> POP
# EXPRESSION: semicolon -> POP
#   right-paren, right-brace -> POP
# COMMAND: semicolon -> STATEMENT
#   right-paren, right-brace -> POP
# ARITHMETIC: semicolon -> STATEMENT
#   right-paren, right-brace -> POP
#
# Certain symbols require recursive parsing:
# - dquote strings allow string interpolation, so expressions within the string
#   may be parsed in a different state. Once the expression is complete, the
#   lexer returns to the previous state.
# - in most cases, $(...) can be used to delimit sub-expressions (including in
#   strings). When '$(' is encountered, a new state entry of type EXPRESSION is
#   pushed onto the stack. When the corresponding ')' is encountered, that state
#   entry is popped from the stack.
# - similarly to $(...), (...) can be used to group expressions, just like in
#   mathematical expressions.
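# The push/pop behaviour for $(...) and (...) can be sketched as below. This
# is an illustration under assumptions: trace_depths is a hypothetical name,
# plain (...) is assumed to push an EXPRESSION entry just like $(...), and
# quotes, braces and semicolons are ignored.

```python
def trace_depths(src):
    """Return the state-stack depth after every push or pop."""
    stack = ["STATEMENT"]  # the outermost state
    depths = []
    i = 0
    while i < len(src):
        if src.startswith("$(", i):
            stack.append("EXPRESSION")
            depths.append(len(stack))
            i += 2
        elif src[i] == "(":
            # Assumption: plain grouping also pushes an EXPRESSION entry.
            stack.append("EXPRESSION")
            depths.append(len(stack))
            i += 1
        elif src[i] == ")":
            if len(stack) == 1:
                raise SyntaxError(f"unmatched ')' at offset {i}")
            stack.pop()
            depths.append(len(stack))
            i += 1
        else:
            i += 1
    if len(stack) != 1:
        raise SyntaxError("unclosed '(' or '$('")
    return depths
```

# Once every sub-expression has been closed, the depth is back at 1, i.e. the
# lexer has returned to the state it was in before the first '$(' or '('.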