Parsing Math Expressions in C#. Tokenization

The science is constantly evolving, the technology is rapidly breaking into our lives, new tools and opportunities are appearing, the only one thing that is permanent is the gloomy expression of the face of your interviewer. And since we’ve figured out how to read the emotions of the interviewer, then we still have problems with reading mathematical expressions. Therefore, in this post, we will analyze the simplest topic in the context of this task – splitting a source string into separate tokens.

As has been mentioned in the previous posts about the postfix notation and shunting-yard algorithm. We have the following types of tokens:

  1. Arithmetic signs: +, -, *, /;
  2. Round brackets (, );
  3. Real numbers; for simplicity let’s use only integer numbers and the numbers with a dot delimiter;
  4. Space is the only delimiter for tokens.

Let’s create a body for the main method in C#. We’re going to read an expression character by character and process them in-place. For our case, it’s enough while other lexical analyzers can require more complex approaches. In the very book, this topic is described quite well.

public class Tokenizer
{
    public IEnumerable<IToken> Parse(string expression)
    {
        foreach (char next in expression)
        {
            FeedCharacter(next);
        }
        return GetResult();
    }
}

First, we’re implementing the simplest case: an input line contains only one number, there are neither spaces no operators:

  • While there are characters read a character and add it to special storage;
  • If the storage is not empty then create a number based on the characters and add the number to the resulting sequence.
public class Tokenizer
{
    public Tokenizer()
    {
        _valueTokenBuilder = new StringBuilder();
        _infixNotationTokens = new List<IToken>();
    }

    public IEnumerable<IToken> Parse(string expression)
    {
        foreach (char next in expression)
        {
            FeedCharacter(next);
        }
        return GetResult();
    }

    private void FeedCharacter(char next)
    {
        _valueTokenBuilder.Append(next);
    }

    private IToken CreateOperandToken(string raw)
    {
        if (double.TryParse(raw, out double result))
        {
            return new OperandToken(result);
        }

        throw new SyntaxException($"The operand {raw} has an invalid format.");
    }

    private IEnumerable<IToken> GetResult()
    {
        if (_valueTokenBuilder.Length > 0)
        {
            var token = CreateOperandToken(_valueTokenBuilder.ToString());
            _valueTokenBuilder.Clear();
            _infixNotationTokens.Add(token);
        }

        return _infixNotationTokens.ToList();
    }

    private readonly StringBuilder _valueTokenBuilder;
    private readonly List<IToken> _infixNotationTokens;
}

Let’s complicate the task a bit by allowing the use of several numbers separated by spaces. Then the conversion into tokens can be represented as follows:

  • While there are characters take a character:
    • If the character is a separator (a space in our case), then convert the previously saved characters in the storage to a number and add the number to the resulting sequence;
    • Otherwise, add the character to the numbers storage.
  • If the storage contains characters then convert the characters to a number and add the number to the resulting sequence.
public class Tokenizer
{
    public Tokenizer()
    {
        _valueTokenBuilder = new StringBuilder();
        _infixNotationTokens = new List<IToken>();
    }

    public IEnumerable<IToken> Parse(string expression)
    {
        foreach (char next in expression)
        {
            FeedCharacter(next);
        }
        return GetResult();
    }

    private void FeedCharacter(char next)
    {
        if (IsSpacingCharacter(next))
        {
            if (_valueTokenBuilder.Length > 0)
            {
                var token = CreateOperandToken(_valueTokenBuilder.ToString());
                _valueTokenBuilder.Clear();
                _infixNotationTokens.Add(token);
            }
        }
        else
        {
            _valueTokenBuilder.Append(next);
        }
    }

    private bool IsSpacingCharacter(char c)
    {
        switch (c)
        {
            case ' ':
                return true;
            default:
                return false;
        }
    }

    private IToken CreateOperandToken(string raw)
    {
        if (double.TryParse(raw, out double result))
        {
            return new OperandToken(result);
        }

        throw new SyntaxException($"The operand {raw} has an invalid format.");
    }

    private IEnumerable<IToken> GetResult()
    {
        if (_valueTokenBuilder.Length > 0)
        {
            var token = CreateOperandToken(_valueTokenBuilder.ToString());
            _valueTokenBuilder.Clear();
            _infixNotationTokens.Add(token);
        }

        return _infixNotationTokens.ToList();
    }

    private readonly StringBuilder _valueTokenBuilder;
    private readonly List<IToken> _infixNotationTokens;
}

The last thing we need to add is a processor for operators. As in our problem statement, every operator can be represented by only a single character then when we meet an operator character during the processing we can convert it and put it to the resulting sequence at once. Then the final algorithm looks like this:

  • While there are characters take a character:
    • If the character is a separator (a space in our case), then convert the previously saved characters in the numbers storage to a number and add the number to the resulting sequence;
    • If the character is an operator then convert the previously saved characters in the numbers storage to a number and add the number to the resulting sequence, after which add the operator token to the resulting sequence;
    • Otherwise, add the character to the numbers storage.
  • If the storage contains characters then convert the characters to a number and add the number to the resulting sequence.
public class Tokenizer
{
    public Tokenizer()
    {
        _valueTokenBuilder = new StringBuilder();
        _infixNotationTokens = new List<IToken>();
    }

    public IEnumerable<IToken> Parse(string expression)
    {
        foreach (char next in expression)
        {
            FeedCharacter(next);
        }
        return GetResult();
    }

    private void FeedCharacter(char next)
    {
        if (IsSpacingCharacter(next))
        {
            if (_valueTokenBuilder.Length > 0)
            {
                var token = CreateOperandToken(_valueTokenBuilder.ToString());
                _valueTokenBuilder.Clear();
                _infixNotationTokens.Add(token);
            }
        }
        else if (IsOperatorCharacter(next))
        {
            if (_valueTokenBuilder.Length > 0)
            {
                var token = CreateOperandToken(_valueTokenBuilder.ToString());
                _valueTokenBuilder.Clear();
                _infixNotationTokens.Add(token);
            }

            var operatorToken = CreateOperatorToken(next);
            _infixNotationTokens.Add(operatorToken);
        }
        else
        {
            _valueTokenBuilder.Append(next);
        }
    }

    private bool IsOperatorCharacter(char c)
    {
        switch (c)
        {
            case '(':
            case ')':
            case '+':
            case '-':
            case '*':
            case '/':
                return true;
            default:
                return false;
        }
    }

    private bool IsSpacingCharacter(char c)
    {
        switch (c)
        {
            case ' ':
                return true;
            default:
                return false;
        }
    }

    private IToken CreateOperandToken(string raw)
    {
        if (double.TryParse(raw, out double result))
        {
            return new OperandToken(result);
        }

        throw new SyntaxException($"The operand {raw} has an invalid format.");
    }

    private OperatorToken CreateOperatorToken(char c)
    {
        switch (c)
        {
            case '(': return new OperatorToken(OperatorType.OpeningBracket);
            case ')': return new OperatorToken(OperatorType.ClosingBracket);
            case '+': return new OperatorToken(OperatorType.Addition);
            case '-': return new OperatorToken(OperatorType.Subtraction);
            case '*': return new OperatorToken(OperatorType.Multiplication);
            case '/': return new OperatorToken(OperatorType.Division);
            default:
                throw new SyntaxException($"There's no a suitable operator for the char {c}");
        }
    }

    private IEnumerable<IToken> GetResult()
    {
        if (_valueTokenBuilder.Length > 0)
        {
            var token = CreateOperandToken(_valueTokenBuilder.ToString());
            _valueTokenBuilder.Clear();
            _infixNotationTokens.Add(token);
        }

        return _infixNotationTokens.ToList();
    }

    private readonly StringBuilder _valueTokenBuilder;
    private readonly List<IToken> _infixNotationTokens;
}

As always, the full implementation can be found on the GitHub of the coding-interview project, while with the post we are finishing the topic of parsing mathematical expressions. I hope that everything has been informative and interesting. See you!

Leave a Reply