Thursday 18 April 2019

Expression Parser

Modus Operandi

Previously:
Differentiating f(x)^g(x)
The Differentiator
Graphing Derivatives
The Complexities Of The Simplifier
Formula Translation: The Error Function
The Differentiator's fluent expression-building syntax is great when you have source code access, but it's easy to build a standard, recursive-descent parser for everyone else to enjoy. As you'll see below, the best thing about that is the sheer pleasure of creating your own precedence and associativity rules.

Our little language of arithmetic expressions in a single real variable x has quite the simple grammar, which in mostly-EBNF, looks like this:

    Parameter       = "x" ;

    Number          = ? \d*\.?\d*([eE][+-]?\d+)? ? ;

    NamedConstant   = "e" | "π" | "pi" | "ϕ" | "phi" ;

    UnaryOperator   = "+" | "-" | "!" | "~" | "√" ;

    BinaryOperator  = "||" | "&&" | "|" | "&" | "=" | "==" | "≠" | "<>" | "!=" | "<" | ">" | "≮ " | "≯ " | "<=" | ">="
                    | "+" | "-" | "*" | "/" | "^" ;

    Function        = "Abs" | "Acos" | "Acosh" | "Acot" | "Acoth" | "Acsc" | "Acsch" | "Asec" | "Asech" | "Asin"
                    | "Asinh" | "Atan" | "Atanh" | "Ceiling" | "Cos" | "Cosh" | "Cot" | "Coth" | "Csc" | "Csch"
                    | "Erf" | "Exp" | "Floor" | "Ln" | "Log10" | "Round" | "Sec" | "Sech" | "Sign" | "Sin"
                    | "Sinh" | "Sqrt" | "Step" | "Tan" | "Tanh" ;

    Operand         = Parameter
                    | Number
                    | NamedConstant
                    | Operand, "'" (* Form the derivative *)
                    | "(", Expression, ")"
                    | Function, Operand
                    | UnaryOperator, Operand ;

    Expression      = Operand, { BinaryOperator, Operand }
                    | Expression, "?", Expression, ":", Expression ;

The mutual recursion visible between the above Expression and Operand nonterminals is the hallmark of a classic infix-notation, parentheses-laden syntax.

Notes on the Grammar
  • The Parameter 'x' is case insensitive, but will be converted to lower case during the parsing process.
  • The Number element is defined above using a regular expression pattern. This pattern overreaches, since for simplicity's sake, none of its character groups are mandatory. So the empty string is regarded as a valid number; as is, a single dot; or for example, the sequence '.E-'. But that's OK, as long as the tokeniser is just greedy enough, but doesn't steal characters from subsequent tokens. The actual parsing of constants will be done later by the double.Parse() method, so if the tokeniser passes through any such malformed number, the error will be detected.
  • A Number starts with either a digit or a decimal point; any preceding '+' or '-' signs will be treated as unary operators (although the parser will later collapse all such operators into a single Constant node).
  • The NamedConstant class does allow you to type the names of (some of) the few supported values as Greek letters (π, ϕ)... but you're wasting your time! This code collapses (combines) constants eagerly, and everything soon ends up as a double-precision, floating point number. NamedConstant elements are case insensitive.
  • The UnaryOperator elements '!' (nothing to do with factorials) and '~' both represent a logical not operation. The number zero is regarded as false, anything else as true. So if x=0, then !x (or equivalently ~x) will return 1; otherwise 0.
  • UnaryOperator '√' performs the same function as 'Sqrt'.
  • The BinaryOperator elements '&' and '&&' both perform a logical And operation, likewise '|' and '||' a logical Or. They differ only in their levels of precedence: as in C#, the order of evaluation is '&', '|', '&&', '||' (obviously you can just stick to one preferred style, as the same effect can be achieved using parentheses). Similarly, '=' and '==' perform the same equality test; '≠', '<>' and '!=' the same inequality test; and so on for the comparison operators.
  • The available Function terminal elements are just the names of the 35 functions made available in the previous article.
  • Function names are case insensitive, but will be converted to Title Case (initial capital) during the parsing process. Notice that the Function syntax doesn't require parentheses; abs x works just as well as Abs(x).
  • The apostrophe "'" is the only postfix operator in the grammar. It has the effect of differentiating the preceding Operand, so for example, (Sin(x))' evaluates to Cos(x).
Turning Japanese

It's always at this point, turning from the syntax diagram to the semantics of coding, that I inevitably get seduced by this cool feature, first encountered in my first couple of pocket computers: the Sharp PC-1211 and PC-1500. The cool feature is implied products. TL;DR: every memory category was in such short supply back then (see: millennium bug), even BASIC variables were meted out individually, and their names were A..Z. The unexpected benefit of this rationing was that products of variables (variables which individually could only have single-character names like A or B) could now be expressed unambiguously in the form AB, saving a whole byte by omitting the multiplication symbol.

At first glimpse this implied multiplication seems no big deal, but wait. Does it necessarily have the same associativity and precedence as normal multiplication? If so, then
A/BC would be parsed as (A/B)*C = (A*C)/B.
But this seems wrong! The expression looks so much more like a quantity A being divided by a quantity BC, in other words,
A/BC = A/(B*C).
Now the C has landed below the line; before, it was above. Which is correct?

Utility is monarch, and the solution turns out to be the introduction of a new kind of implied multiplication with a higher precedence than the usual one. By analogy: the associativity of '^' was once settled to be right, unlike its additive and multiplicative colleagues which all tended left.
So, while subtraction goes like this: A-B-C = (A-B)-C = A-(B+C),

by contrast exponentiation does this: A^B^C = A^(B^C)

rather than the much less useful (A^B)^C,
because the real action's in the exponent. In the case of the implicit multiplier, the new operator was given a higher precedence than its peers, so for example
A^BC = A^(B*C), rather than (A^B)*C.
In this way multiplication, if it just remain hidden, can leapfrog other more powerful operators, perhaps even unary operators, perhaps even function calls:
sin AB = sin (A * B), compared to sin A * B = (sin A) * B.

White Feather

So why did I hesitate to implement such an operator in Parser?

A recurring subject in the C# language designers' blogs and comment responses is cost. Nothing comes for free! There's always at least a test budget to consider, and almost always, myriad further hidden design and development costs. Implied products are almost too attractive a feature. Sure, (x+2)(x-2) and its ilk are compelling, but to get the full story, you must go back to the language's grammar definition and look at all newly emerging cases.

What about
2 sin x cos x
for example. The familiar sine of a double angle formula, obviously and unambiguously intended to be read as
2*sin(x)*cos(x).
But should the precedence of "implicit multiply" exceed even unary and function calls, the interpretation becomes
2*sin(x*cos(x)),
which, simply, ouch.

Then again, why all this detail, if I never intended to implement such an operator? Because I knew I would. I knew that about myself, and so wanted all this background material to be available for review, when I finally revisited the code to smoosh it up a bit. Also, the very inevitability of that revisit is why I made this code as clean and readily understandable as I could. Part of this process is of course eliminating all but the most vital of comments from the source; hence their home in this article.

That Precedence enumeration was a proper godsend when smooshing time arrived! Anyway, here comes the parser code:

namespace FormulaBuilder
{
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Linq.Expressions;
    using System.Text.RegularExpressions;

    public class Parser
    {
        private enum Precedence
        {
            RightParenthesis,
            Additive,
            Multiplicative,
            Exponential,
            Functional,
            ImpliedProduct,
            SuperscriptPower
        }

        const char
            SquareRoot = '√';

        const string
            ImpliedProduct = "i*",
            SuperscriptPower = "s^",
            UnaryMinus = "u-",
            UnaryPlus = "u+";

        private string Formula;
        private int Index;
        private Stack<Expression> Operands;
        private Stack<string> Operators;

        public Expression Parse(string formula)
        {
            Formula = formula;
            Index = 0;
            Operands = new Stack<Expression>();
            Operators = new Stack<string>(new[] { "(" });
            ParseExpression();
            if (Operators.Any())
                throw new FormatException(
                    $"Unexpected end of expression, input='{Formula}'");
            return Operands.Peek();
        }

        public bool TryParse(string formula, out object result)
        {
            try
            {
                result = Parse(formula);
                return true;
            }
            catch (Exception e)
            {
                result = e.Message;
                return false;
            }
        }

        private static double GetNamedConstantValue(string constant)
        {
            switch (constant.ToLower())
            {
                case "e":
                    return Math.E;
                case "π":
                case "pi":
                    return Math.PI;
                case "ϕ":
                case "phi":
                    return (1 + Math.Sqrt(5)) / 2;
            }
            return 0;
        }

        private static ExpressionType GetExpressionType(string op)
        {
            switch (op)
            {
                case "+":
                    return ExpressionType.Add;
                case "-":
                    return ExpressionType.Subtract;
                case "*":
                case "i*":
                    return ExpressionType.Multiply;
                case "/":
                    return ExpressionType.Divide;
                case "^":
                case "s^":
                    return ExpressionType.Power;
                case UnaryPlus:
                    return ExpressionType.UnaryPlus;
                case UnaryMinus:
                    return ExpressionType.Negate;
            }
            throw new FormatException();
        }

        private static Precedence GetPrecedence(string op)
        {
            switch (op)
            {
                case ")":
                    return Precedence.RightParenthesis;
                case "+":
                case "-":
                    return Precedence.Additive;
                case "*":
                case "/":
                    return Precedence.Multiplicative;
                case "^":
                    return Precedence.Exponential;
                case ImpliedProduct:
                    return Precedence.ImpliedProduct;
                case SuperscriptPower:
                    return Precedence.SuperscriptPower;
            }
            return Precedence.Functional;
        }

        private Expression MakeBinary(string op, Expression lhs, Expression rhs) =>
            Expression.MakeBinary(GetExpressionType(op), lhs, rhs);

        private Expression MakeFunction(string f, Expression operand)
        {
            f = $"{char.ToUpper(f[0])}{f.ToLower().Substring(1)}";
            var result = Expressions.Function(f, operand);
            if (operand is ConstantExpression c)
                return result.AsDouble((double)c.Value).Constant();
            return result;
        }

        private Expression MakeUnary(string op, Expression operand)
        {
            if (operand is ConstantExpression c)
                switch (op)
                {
                    case UnaryPlus: return operand;
                    case UnaryMinus: return (-(double)c.Value).Constant();
                }
            return Expression.MakeUnary(GetExpressionType(op), operand, null);
        }

        private string MatchFunction() => MatchRegex(@"^[\p{Lu}\p{Ll}\d]+").ToLower();

        private string MatchNumber() => MatchRegex(@"^\d*\.?\d*([eE][+-]?\d+)?");

        private string MatchRegex(string pattern)
        {
            var match = Regex.Match(Formula.Substring(Index), pattern);
            return Formula.Substring(Index + match.Index, match.Length);
        }

        private string MatchSubscript() => MatchRegex($"^[{StringUtilities.Subscripts}]+");
        private string MatchSuperscript() => MatchRegex($"^[{StringUtilities.Superscripts}]+");

        private char NextChar()
        {
            var count = Formula.Length;
            while (Index < count && Formula[Index] == ' ')
                Index++;
            return Index < count ? Formula[Index] : Index == count ? ')' : '$';
        }

        private string NextToken()
        {
            var nextChar = NextChar();
            switch (nextChar)
            {
                case '+':
                case '-':
                case '*':
                case '/':
                case '^':
                case '(':
                case ')':
                case SquareRoot:
                    return nextChar.ToString();
                case char c when char.IsDigit(c):
                case '.':
                    return MatchNumber();
                case char c when c.IsSuperscript():
                    return MatchSuperscript();
                case char c when char.IsLetter(c):
                    return MatchFunction();
            }
            throw new FormatException(
                $"Unexpected character '{nextChar}', input='{Formula}', index={Index}");
        }

        private void ParseExpression()
        {
            do
            {
                ParseOperand();
                var op = NextChar();
                switch (op)
                {
                    case '+':
                    case '-':
                    case '*':
                    case '/':
                    case '^':
                    case ')':
                        ParseOperator(op.ToString());
                        if (op == ')')
                            return;
                        break;
                    case '$' when Index == Formula.Length + 2: // End of input (normal)
                        return;
                    case '$' when Index < Formula.Length + 2: // End of input (unexpected)
                        throw new FormatException(
                            $"Unexpected end of text, input='{Formula}', index={Index}");
                    case char c when c.IsSuperscript():
                        ParseOperator(SuperscriptPower); //
                        break;
                    default:
                        if (Operands.Peek() is ConstantExpression)
                        {
                            ParseOperator(ImpliedProduct); // Implied multiplication
                            break;
                        }
                        throw new FormatException(
                            $"Unexpected character '{op}', input='{Formula}', index={Index}");
                }
            }
            while (true);
        }

        private void ParseFunction(string function)
        {
            Operators.Push(function);
            ReadPast(function);
        }

        private bool ParseNamedConstant(string constant)
        {
            var value = GetNamedConstantValue(constant);
            if (value == 0)
                return false;
            Operands.Push(value.Constant());
            ReadPast(constant);
            return true;
        }

        private void ParseNumber(string number)
        {
            try
            {
                Operands.Push(double.Parse(number).Constant());
            }
            catch (FormatException)
            {
                throw new FormatException(
                    $"Invalid number format '{number}', input='{Formula}', index={Index}");
            }
            catch (OverflowException)
            {
                throw new FormatException(
                    $"Numerical overflow '{number}', input='{Formula}', index={Index}");
            }
            ReadPast(number);
        }

        private void ParseOperand()
        {
            var token = NextToken();
            switch (char.ToLower(token[0]))
            {
                case 'x' when token.Length == 1:
                    ParseParameter(token);
                    break;
                case char c when char.IsDigit(c):
                case '.':
                    ParseNumber(token);
                    break;
                case char c when c.IsSuperscript():
                    ParseSuperscript(token);
                    break;
                case '(':
                    Operators.Push(token);
                    ReadPast(token);
                    ParseExpression();
                    break;
                case '+':
                case '-':
                    ParseUnary(token);
                    ParseOperand();
                    break;
                case char c when char.IsLetter(c):
                    if (!ParseNamedConstant(token))
                    {
                        ParseFunction(token);
                        ParseOperand();
                    }
                    break;
                case SquareRoot:
                    ParseSquareRoot();
                    ParseOperand();
                    break;
                default:
                    throw new FormatException(
                        $"Missing operand, input='{Formula}', index={Index}");
            }
        }

        private void ParseOperator(string op)
        {
            do
            {
                var ours = GetPrecedence(op);
                var pending = Operators.Peek();
                if (pending == "(")
                    break;
                var theirs = GetPrecedence(pending);
                // Operator '^' is right associative: a^b^c = a^(b^c).
                if (theirs > ours || theirs == ours && op != "^")
                {
                    Operators.Pop();
                    var operand = Operands.Pop();
                    if (theirs == Precedence.Functional)
                        switch (pending)
                        {
                            case UnaryPlus:
                                break;
                            case UnaryMinus:
                                operand = MakeUnary(pending, operand);
                                break;
                            default:
                                try
                                {
                                    operand = MakeFunction(pending, operand);
                                }
                                catch (ArgumentNullException)
                                {
                                    throw new FormatException(
                                        $"Unrecognised function '{pending}', input='{Formula}'");
                                };
                                break;
                        }
                    else
                    {
                        if (pending == ImpliedProduct)
                            pending = "*";
                        operand = MakeBinary(pending, Operands.Pop(), operand);
                    }
                    Operands.Push(operand);
                }
                else
                    break;
            }
            while (true);
            if (op == ")")
                Operators.Pop();
            else
                Operators.Push(op);
            if (op != ImpliedProduct && op != SuperscriptPower)
                ReadPast(op);
        }

        private void ParseParameter(string token)
        {
            Operands.Push(Expressions.x);
            ReadPast(token);
        }

        private void ParseSquareRoot()
        {
            Operators.Push("Sqrt");
            ReadPast(SquareRoot.ToString());
        }

        private void ParseSuperscript(string superscript)
        {
            Operands.Push(new Parser().Parse(superscript.SuperscriptToNormal()));
            ReadPast(superscript);
        }

        private void ParseUnary(string unary)
        {
            Operators.Push($"u{unary}");
            ReadPast(unary);
        }

        private void ReadPast(string token) => Index += token.Length;
    }
}

Going Wild!

It's a fair cop. If you've read through the source code above, you'll have spotted vestiges of not only implied multiplication, but also (implied) superscript exponentiation. Like a lot of subsequent developments, they're not visible in the grammar spec at the head of this article. But to start with, you can omit the multiplication sign after any constant number, and before the next operand, which may be "x", or a function, or "(" introducing a nested expression, or even another number (separated from the first by at least one space).

And what of the over-eager precedence problem? This new implicit multiplication operator needs its own, new precedence level, Implied, above Unary, so stuff like sin 2x works properly. But that's also higher than Exponential precedence, so 2x^3 becomes (2*x)^3, whereas in the arena of polynomials we'd rather see 2*(x^3). This can be fixed, luckily, quickly, and quirkily: introduce Unicode superscripts, and give them their own implicit exponentiation operator! This has the highest precedence of all, and causes 2x³ to get dressed as the wanted 2*(x^3).

Just how far can we go with superscripts? Unicode characters include numbers and common mathematical symbols ⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾, a full superscript Latin lowercase alphabet ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛʷˣʸᶻ (well, full except for 'q'), a limited uppercase Latin alphabet ᴬᴮᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴿᵀᵁⱽᵂ, and some Greek letters ᵅᵝᵞᵟᵋᶿᶥᶲᵠᵡ. These limitations already suggest we should use some other type of markup language - remember, this series started with a look at LaTeX - and the fact that even these available glyphs come from different ranges, so depending on your typeface can vary in size and position, surely cements that opinion. But until we make that switch, we can enjoy the novelty of chucking expressions like x⁴-4x³+6x²-4x+1, or eᶜᵒˢ⁽ˣ⁾, into our mathematical stockpot.

You don't have superscripts on your keyboard? Sorry, I'm just a code monkey. That's definitely a hardware problem.

namespace FormulaBuilder
{
    using System;
    using System.Text;

    public static class StringUtilities
    {
        public const string
            Subscripts = "₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎ₐₑₕᵢⱼₖₗₘₙₒₚᵣₛₜᵤᵥₓᵦᵧᵨᵩᵪ",
            Transcripts = "0123456789+-=()aehijklmnoprstuvxβγρψχbcdfgwyzABDEGHIJKLMNOPRTUVWαδεθιφ",
            Superscripts = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ᵃᵉʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛˣᵝᵞρᵠᵡᵇᶜᵈᶠᵍʷʸᶻᴬᴮᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴿᵀᵁⱽᵂᵅᵟᵋᶿᶥᶲ";

        public static bool IsSubscript(this char c) => Subscripts.IndexOf(c) >= 0;
        public static bool IsSuperscript(this char c) => Superscripts.IndexOf(c) >= 0;

        public static string NormalToSubscript(this string number) => Transcribe(number, Transcripts, Subscripts);
        public static string NormalToSuperscript(this string number) => Transcribe(number, Transcripts, Superscripts);
        public static string SubscriptToNormal(this string number) => Transcribe(number, Subscripts, Transcripts);
        public static string SubscriptToSuperscript(this string number) => Transcribe(number, Subscripts, Superscripts);
        public static string SuperscriptToNormal(this string number) => Transcribe(number, Superscripts, Transcripts);
        public static string SuperscriptToSubscript(this string number) => Transcribe(number, Superscripts, Subscripts);

        public static string Transcribe(this string number, string source, string target)
        {
            var stringBuilder = new StringBuilder(number);
            for (var index = 0; index < Math.Min(source.Length, target.Length); index++)
                stringBuilder.Replace(source[index], target[index]);
            return stringBuilder.ToString();
        }
    }
}

But Does It Parse The Test?

Obviously we need some tests for all this, and in fact the following lines have just appeared in Expressions.Tests.cs. Despite giving the visual appearance of comparing one string with another one rather like it, these tests actually embody quite the round trip. The left string is pumped into the parser, generating an expression tree as its output. This is then fed through a stage 2 amplifier, namely the AsString() extension method, where we hope to see a similarly structured string emerge, but sporting at the most binary operations, and all the other minor effects and transformations expected.

        public static void TestParse(string input, string output) =>
            Check(output, new Parser().Parse(input).AsString());

        public static void TestParseFail(string input, string error)
        {
            try
            {
                new Parser().Parse(input);
                Check("Exception thrown", "no Exception thrown");
            }
            catch (Exception ex)
            {
                Check(error, ex.Message);
            }
        }

        public static void TestParser()
        {
            TestParserFailure();
            TestParserSuccess();
        }

        public static void TestParserFailure()
        {
            TestParseFail("x~2", "Unexpected character '~', input='x~2', index=1");
            TestParseFail("x+123,456", "Unexpected character ',', input='x+123,456', index=5");
            TestParseFail("x+1$2", "Unexpected end of text, input='x+1$2', index=3");
            TestParseFail("x+1e999", "Numerical overflow '1e999', input='x+1e999', index=2");
            TestParseFail("x+.", "Invalid number format '.', input='x+.', index=2");
            TestParseFail("x+.E+1", "Invalid number format '.E+1', input='x+.E+1', index=2");
            TestParseFail("x+", "Missing operand, input='x+', index=2");
            TestParseFail("(x+(2*(x+(3)))", "Unexpected end of text, input='(x+(2*(x+(3)))', index=15");
        }

        public static void TestParserSuccess()
        {
            TestParse("0", "0");
            TestParse("e", "2.71828182845905");
            TestParse("π", "3.14159265358979");
            TestParse("pi", "3.14159265358979");
            TestParse("ϕ", "1.61803398874989");
            TestParse("phi", "1.61803398874989");
            TestParse("X+1", "(x+1)");
            TestParse("((x+1))", "(x+1)");
            TestParse("x+x*x^x/x-x", "((x+((x*(x^x))/x))-x)");
            TestParse("3*x+x/5", "((3*x)+(x/5))");
            TestParse("x-2-x", "((x-2)-x)");                             // Subtraction is left associative
            TestParse("x^2^x", "(x^(2^x))");                             // Exponentiation is right associative
            TestParse("(x-3)*(5-x)/10", "(((x-3)*(5-x))/10)");
            TestParse("sin x * cos x", "(Sin(x)*Cos(x))");
            TestParse("Ln(sin x - tanh(x)) - 1", "(Ln((Sin(x)-Tanh(x)))-1)");
            TestParse("Abs Cos Sin Tan 1.5", "0.540839774154307");
            TestParse("Abs Cos Sin Tan (x/2)", "Abs(Cos(Sin(Tan((x/2)))))");
            TestParse("2*(sin x + cos x ^ 3 - tan(x^3))/3", "((2*((Sin(x)+(Cos(x)^3))-Tan((x^3))))/3)");
            TestParse("2*(x+3*(x-4^x)-5)/6", "((2*((x+(3*(x-(4^x))))-5))/6)");
            TestParse("1/5x", "(1/(5*x))");                              // Implied products have higher precedence
            TestParse("1/2sqrt(x)", "(1/(2*Sqrt(x)))");
            TestParse("2 sin 3x", "(2*Sin((3*x)))");
            TestParse("2 sin 3x * 5 cos 7x", "((2*Sin((3*x)))*(5*Cos((7*x))))");
            TestParse("2 3", "(2*3)");
            TestParse("2(x+3)", "(2*(x+3))");
            TestParse("2x^3)", "((2*x)^3)");
            TestParse("2(x^3)", "(2*(x^3))");
            TestParse("x⁴-4x³+6x²-4x+1", "(((((x^4)-(4*(x^3)))+(6*(x^2)))-(4*x))+1)");
            TestParse("1/2√(1-x²)", "(1/(2*Sqrt((1-(x^2)))))");
            TestParse("eˣ", "(2.71828182845905^x)");
            TestParse("eᶜᵒˢ⁽ˣ⁾", "(2.71828182845905^Cos(x))");
        }

No comments:

Post a Comment