5
\$\begingroup\$

I am trying to write a simple tokenizer for a basic arithmetic calculator. Here's the code:

// Example program
#include <iostream>
#include <regex>
#include <string>
#include <vector>
#include <map>
#include <utility>

std::string operandRegex = "^(-?)(0|([1-9][0-9]*))(\\.[0-9]+)?";
enum Operator { ADD = 0, SUBTRACT = 1, MULTIPLY = 2, DIVIDE = 3 };
static const std::map<std::string, enum Operator> operatorRegexMap = {
    std::make_pair("^\\+", Operator::ADD),
    std::make_pair("^\\-", Operator::SUBTRACT),
    std::make_pair("^\\*", Operator::MULTIPLY),
    std::make_pair("^/", Operator::DIVIDE),
};

std::vector<std::string> tokenize(const std::string &expression) {
  std::vector<std::string> result;

  std::string::const_iterator searchStart(expression.cbegin());

  while (searchStart != expression.cend()) {
    std::regex re = std::regex{operandRegex};
    std::smatch match;
    // Operand
    if (regex_search(searchStart, expression.cend(), match, re)) {
      searchStart = match.suffix().first;
      result.push_back(match.str(0));
    } else {
      // Operator
      bool noMatch = true;
      for (const auto &re_op : operatorRegexMap) {
        std::regex re = std::regex{re_op.first};
        if (regex_search(searchStart, expression.cend(), match, re)) {
          searchStart = match.suffix().first;
          result.push_back(match.str(0));
          noMatch = false;
          break;
        }
      }
      if (noMatch) {
        break;
      }
    }
  }

  return result;
}

void print(const std::vector<std::string> &tokens) {
    for (const auto &token : tokens) {
        std::cout << token << std::endl;
    }
}

int main() {
    std::string expression;
    std::cout << "Input Expression: ";
    std::getline(std::cin, expression);
    std::vector<std::string> tokens = tokenize(expression);
    std::cout << "Parsed: " << std::endl;
    print(tokens);
    return 0;
}

I am getting issue with associativity of operator vs operand:

Input Expression: 2-3+5
Parsed: 
2
-3
+
5

These are unary operators, and I could fix it. But what if I wanted to experiment with more complex operators, say, pow() function or integral(function, low, high). How do I change the code to make it into a better design, with emphasis on SOLID principles?

New contributor
user2338150 is a new contributor to this site. Take care in asking for clarification, commenting, and answering. Check out our Code of Conduct.
\$\endgroup\$

2 Answers 2

4
\$\begingroup\$

There's a reason many folks have found lex / yacc (flex / bison) to be helpful. But let's assume we wish to reinvent some aspects of such compiler technology.

specification

The most important part of any program's design, especially for a parser, is to write down exactly what it does. Here, it's necessary to describe a particular language (a grammar) that it accepts.

Many designers will choose to give a concise and precise description using EBNF. With that accomplished, you'd be in a good position to mechanically translate by hand to an RD parser, and could easily trace from functions in the C++ implementation back to particular EBNF production rules, and vice versa.

automated test suite

Writing up a formal specification certainly has instructional value to end users of your language, at an abstract level. But test cases, which are known to run Green, also have enormous instructional value, at a more concrete level. The OP would benefit from the addition of such a test suite.

unusual inputs

This is a valid input to the OP code, though it's not clear that all developers who read the Review Context prose would similarly agree it should be valid.

  • -0 (different from the quite usual -0.0)

Absent a Specification document, it's not clear these should be marked "invalid":

  • 07
  • 007
  • 7.

We also prohibit +7, which is perhaps nice enough; certainly it makes x+y simpler to parse. Arguably the + is redundant; but then, so is the - in -0, given that we just get 0.

Summary: if there are input language aspects you wish to enforce, you might prefer to do that at a higher level than a regex or EBNF production.

regex vs. strcmp

    std::make_pair("^\\+", Operator::ADD),
    std::make_pair("^\\-", Operator::SUBTRACT),
    std::make_pair("^\\*", Operator::MULTIPLY),
    std::make_pair("^/", Operator::DIVIDE),

All of the OP operators are represented with a single character, which is easily compared. If you feel you'll need to support longer operators, you might as well introduce ** exponentiation now, to at least justify the character versus string decision. Even then the use of regex seems a bit heavy handed.

Presumably the integral(fn, low, high) use case motivates the use of regexes here. But if we listen to YAGNI, it would be better to get some unit tests showing Green right now with a simple technique, and refactor when the time comes to support such function call syntax. By then, you may find that RD or another approach is the better fit for your project's evolving needs.

\$\endgroup\$
3
\$\begingroup\$

Set operandRegex as constexpr. It does not benefit from being a std::string and can simply be a const char *. Refer to the constructor for regex.

operatorRegexMap may also be constexpr as of C++26.

re should not be constructed on the inside of the loop. It should be constructed outside of the loop, either in function scope or as a const global. This applies to both token pattern kinds.

For Operator use enum class instead of enum.

The logic of tokenize does not seem correct to me. It tries, in a stateless manner, to match either an operand or an operator on any token. Instead, you should alternate between matching an operand and matching an operator, assuming that all operators are binary. If you also need to support unary operators, you need exactly one binary operator between operands, and zero or more unary operators before each operand.

Since your module doesn't export anything, all global symbols other than main() should be in an anonymous namespace { }. It's a long story, but I find it less repetitive than declaring static on everything.

\$\endgroup\$

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.