How I Built My Own Interpreter : A Journey to Creating a Programming Language

How I Built My Own Interpreter : A Journey to Creating a Programming Language

A Deep Dive into the World of Interpreters and How They Transform Code into Action

Have you ever wondered what happens behind the scenes when you write code in your favorite programming language? How does the computer understand the instructions you've written, and what's the process of executing them?

Introduction

In this article, we're going to take a fascinating journey into the world of interpreters, the unsung heroes of the programming world. But we won't just stop at theoretical concepts - we'll put our knowledge into practice by building our very own programming language from scratch! Using our custom language, we'll be able to write code that can print messages, read user input, perform basic arithmetic operations like addition and subtraction, and return the results.

The twist? We'll be building this interpreter using an interpreted language itself - PHP. Get comfortable, grab a snack, and let's dive into the world of interpreters and language creation.

“If you don’t know how compilers work, then you don’t know how computers work. If you’re not 100% sure whether you know how compilers work, then you don’t know how they work.” — Steve Yegge

what are interpreters and compilers?

The goal of an interpreter or a compiler is to translate a source program in some high-level language into some other form.

Compiler

A compiler translates the entire source code into machine code before executing it.

Imagine you're a chef, and you have a recipe written in a language that the kitchen staff can't understand. You give the recipe to a translator, who translates the entire recipe into a language that the kitchen staff can understand. The translator writes down the translated recipe on a piece of paper and then hands it to the kitchen staff.

The kitchen staff can then follow the translated recipe to cook the meal without needing to understand the original language. The translator's job is done, and the kitchen staff can work independently.

In this example:

  • The recipe is like the source code written by the programmer.

  • The translator is like the compiler, which translates the source code into machine code that the computer's processor can understand.

  • The translated recipe is like the machine code, which is executed directly by the computer's processor.

Example: C & C++ or compiled programming language.

Interpreter

An interpreter reads the source code line by line and executes it directly, without translating the entire code at once.

Now, imagine the same scenario, but this time, the translator doesn't translate the entire recipe at once. Instead, the translator stands in the kitchen and reads the original recipe, line by line, and tells the kitchen staff what to do next.

As the translator reads each line, the kitchen staff performs the corresponding action. If the translator comes across a line that says "add salt", the translator tells the kitchen staff to add salt, and they do it immediately.

In this example:

  • The recipe is still the source code written by the programmer.

  • The translator is like the interpreter, which reads the source code line by line and executes it directly.

  • The kitchen staff is like the computer's processor, which performs the actions as instructed by the interpreter.

Example: Python & PHP are Interpreted languages.


📓
In this article, we are going to learn only about Interpreters. How does our code have been interpreted & executed?

Introducing Our Programming Language: SimpleScript

let's define the scope of our programming language. We'll call it SimpleScript, a language that will allow users to perform basic operations like reading input, performing arithmetic, and printing output.

SimpleScript will have the following features:

  • Reading input from users: Our language will allow users to input data, which will be stored in memory for later use.

  • Performing addition and subtraction: SimpleScript will support basic arithmetic operations, enabling users to perform calculations and store the results.

  • Printing text and results: Our language will provide a way to print output to the console, making it easy to display results and messages.

To manage memory, we'll rely on two fundamental data structures:

  • Stack: A Last-In-First-Out (LIFO) data structure that will help us keep track of the order in which variables and values are accessed.

  • Hashmap: A data structure that will enable us to store and retrieve values efficiently, using keys to identify and access specific data.

Here's a sneak peek at what our SimpleScript interpreter will look like in action:

As you can see, our interpreter will take in user input, execute the instructions, and print the output. In the next sections, we'll dive deeper into the implementation details. With SimpleScript, we'll create a solid foundation for understanding the inner workings of programming languages and interpreters.

Understanding User Instructions with Parsers

The idea of interpreters may seem straightforward - identify user inputs and execute the corresponding instructions. However, turning user inputs into actionable instructions is where the magic happens. In this section, we'll delve into the first crucial step of building an interpreter. To understand user instructions, we need to learn about parsers.

Parser

A parser is a program that analyzes the user input, breaks it down into its constituent parts, and identifies the structure and meaning of the input.

Think of a parser as a linguistic expert who takes a sentence and identifies the individual words, their meanings, and how they relate to each other.

In the context of our interpreter, the parser will take the user's input, such as a line of code, and break it down into its components, such as keywords, variables, and operators. This process is called lexical analysis or tokenization.

In the provided code, the parser is implemented in the Parser class. The parser reads the source code from a file, line by line, and breaks each line into individual tokens using the explode function. The tokens are then stored in an array, which is returned by the parse method.

For example, if the source code is:

PRINT Type-number-one
READ
PRINT Type-number-two
READ
SUM #total
PRINT #total
EXIT

The parser will break it down into the following tokens:

class Parser
{
    private  $file;
    private array $tokens;

    public function __construct($file)
    {
        $this->file = $file;
        $this->tokens = [];
    }

    public function parse(): array
    {
        $handle = fopen($this->file, 'r');
        if (!$handle) {
            throw new Exception("Could not open file");
        }

        while (($line = fgets($handle)) !== false) {
            $this->parseLine($line);
        }

        fclose($handle);
        return $this->tokens;
    }

    private function parseLine($line): void
    {
        $line = trim($line);
        $tokens = explode(' ', $line);
        foreach ($tokens as $token) {
            $this->tokens[] = $token;
        }
    }
}
// output of tokens
[
  'PRINT',
  'Type-number-one',
  'READ',
  'PRINT',
  'Type-number-two',
  'READ',
  'SUM',
  '#total',
  'PRINT',
  '#total',
  'EXIT'
]

In this simple parser, tokenization is achieved by splitting the input line into individual words using spaces as delimiters. This is a basic form of tokenization, and in a real-world parser, you would likely want to handle more complex cases, such as:

  • Handling multiple spaces or tabs between tokens

  • Identifying keywords and operators

  • Handling strings and literals

  • Ignoring comments

In real world, the process is called lexical analysis. where parsing the instruction and making it as raw tokens.

Tokenizing

The next step is tokenizing, which is the process of converting the tokens into a format that the interpreter can execute. This is done by a tokenizer, which is a program that takes the tokens generated by the parser and converts them into a sequence of tokens that can be executed.

In the provided code, the tokenizer is implemented in the Tokenizer class. The tokenizer takes the tokens generated by the parser and converts them into a sequence of TokenStreams objects, which contain the token type, token value, and other metadata.

The tokenizer will convert them into the following TokenStreams objects:


class Tokenizer
{
    public array $tokens;
    public array $map;
    public int $tokenCounter;

    public function __construct()
    {
        $this->tokens = [];
        $this->tokenCounter = 0;
        $this->map = [];
    }

    public function tokenize(string $token): void
    {
        if ($this->isValidToken($token)) {
            $this->tokens[] = TokenStreams::makeToken($token);
            $this->tokenCounter++;
            return;
        }
        if ($this->isVariable($token)) {
            $this->map[$token] = 0;
            $this->tokens[] = TokenStreams::makeVariableToken($token);
        }
        else if(is_numeric($token)){
            $this->tokens[] = TokenStreams::makeNumericToken($token);
        }
        else {
            $this->tokens[] = TokenStreams::makeStringToken($token);
        }
    }

    public function isVariable(string $value)
    {
        return str_starts_with($value, '#');
    }
    public static function isValidToken(string $value): bool
    {
        $tokens = [
            TOKEN::PUSH->value,
            TOKEN::POP->value,
            TOKEN::SUM->value,
            TOKEN::SUB->value,
            TOKEN::EXIT->value,
            TOKEN::PRINT->value,
            TOKEN::READ->value,
            TOKEN::SUB->value
        ];
        return in_array($value, $tokens);
    }
}
// output 
[
  TokenStreams { token: PRINT, tokenType: PRINTABLE, tokenValue: null },
  TokenStreams { token: STRING, tokenType: PRINTABLE, tokenValue: 'Type-number-one' },
  TokenStreams { token: READ, tokenType: PRINTABLE, tokenValue: null },
  TokenStreams { token: PRINT, tokenType: PRINTABLE, tokenValue: null },
  TokenStreams { token: STRING, tokenType: PRINTABLE, tokenValue: 'Type-number-two' },
  TokenStreams { token: READ, tokenType: PRINTABLE, tokenValue: null },
  TokenStreams { token: SUM, tokenType: MEASURABLE, tokenValue: null },
  TokenStreams { token: STRING, tokenType: VARIABLE, tokenValue: '#total' },
  TokenStreams { token: PRINT, tokenType: PRINTABLE, tokenValue: null },
  TokenStreams { token: STRING, tokenType: VARIABLE, tokenValue: '#total' },
  TokenStreams { token: EXIT, tokenType: PRINTABLE, tokenValue: null }
]

The following tokens can be considered abstract syntax trees. We going to execute the instruction based on these token stream objects.

Real-World Interpreter: Python

Let's compare this with how a real-world interpreter like Python handles parsing and tokenization.

  1. Lexical Analysis: Python uses a lexer to convert the source code into tokens. The lexer reads the source code character by character and groups them into meaningful sequences called tokens. These tokens include keywords, identifiers, literals, operators, and delimiters.

  2. Syntax Analysis: After lexical analysis, Python's parser takes the tokens and arranges them into a syntax tree. This tree represents the grammatical structure of the code. The parser checks for syntax errors and ensures that the code follows the language's grammar rules.

  3. Semantic Analysis: Python performs semantic analysis to ensure that the code makes sense. It checks for things like variable declarations, type checking, and scope resolution.

  4. Execution: Once the code passes all the analysis phases, Python's interpreter executes the code line by line. It translates each line into machine code and executes it immediately.

    you could explore the following topics:

  5. Lexical Analysis: Delve deeper into the process of tokenization, discussing different techniques and tools used in lexical analysis.

  6. Token Types: Introduce the different types of tokens that can be encountered in a programming language, such as keywords, identifiers, literals, and operators.

  7. Parser Generators: Discuss tools like ANTLR, yacc, or lex, which can generate parsers from grammar definitions.

  8. Top-Down vs. Bottom-Up Parsing: Explain the differences between these two parsing approaches and provide examples of each.

  9. Error Handling: Discuss how parsers handle errors, such as syntax errors or unexpected input.

  10. Parser Optimization: Explore techniques for optimizing parser performance, such as caching or memoization.

Some recommended resources for further reading:

  • "The Dragon Book" by Alfred Aho, Monica Lam, Ravi Sethi, and Jeffrey Ullman (a classic textbook on compiler design)

  • "Parser Generators" by Charles N. Fischer, Richard J. LeBlanc, and Richard C. Dybvig (a book on parser generators)

  • "ANTLR" by Terence Parr (a book on the ANTLR parser generator)

  • "Parsing" by Grune and Jacobs (a book on parsing techniques)

Executing

The final step is executing the tokenized code. This is done by an interpreter, which is a program that takes the tokenized code and executes it.

class Interpreter
{

    public int $programCounter = 0;
    public Tokenizer $tokenizer;
    public  SplStack $stack;

    public function __construct(Tokenizer $tokenizer)
    {
        $this->tokenizer = $tokenizer;
        $this->stack = new SplStack();
    }

    public function execute(): void
    {

        while (TOKEN::EXIT != $this->tokenizer->tokens[$this->programCounter]->token) {

            $opCode = $this->tokenizer->tokens[$this->programCounter]->token;
            $this->programCounter++;
            if ($opCode == TOKEN::PRINT) {
                $instruction = $this->tokenizer->tokens[$this->programCounter];
                if ($instruction->tokenType == TOKEN_TYPE::PRINTABLE) {
                    echo $instruction->tokenValue . "\n";
                }
                if ($instruction->tokenType == TOKEN_TYPE::VARIABLE) {
                    $val = $this->tokenizer->map[$instruction->tokenValue];
                    echo $val . "\n";
                }
                $this->programCounter++;
            }

            if($opCode == TOKEN::PUSH){
                $this->stack->push($this->tokenizer->tokens[$this->programCounter]->tokenValue);
                $this->programCounter++;
            }

            if($opCode == TOKEN::SUM){
                $token = $this->tokenizer->tokens[$this->programCounter];
                $operandOne = $this->stack->pop();
                $operandTwo = $this->stack->pop();
                $sum = $operandOne + $operandTwo;
                $this->tokenizer->map[$token->tokenValue] = $sum;
                $this->programCounter++;
            }
            if($opCode == TOKEN::SUB){
                $token = $this->tokenizer->tokens[$this->programCounter];
                $operandOne = $this->stack->pop();
                $operandTwo = $this->stack->pop();
                $sum = $operandTwo - $operandOne;
                $this->tokenizer->map[$token->tokenValue] = $sum;
                $this->programCounter++;
            }
            if($opCode == TOKEN::READ){
                $value = (int)trim(fgets(STDIN));
                $this->stack->push($value);
            }
        }
    }
}

In the provided code, the interpreter is implemented in the Interpreter class. The interpreter takes the tokenized code and executes it by iterating through the TokenStreams objects and performing the corresponding actions.

For example, when the interpreter encounters a PRINT token, it will print the value of the token to the console. When it encounters a READ token, it will read input from the user and push it onto the stack. When it encounters a SUM token, it will pop two values from the stack, add them together, and push the result back onto the stack.

The interpreter will continue executing the tokenized code until it encounters a EXIT token, at which point it will terminate.

The Interpreter class is responsible for executing the instructions in the program. It takes an Tokenizer object as input, which has already broken down the program into individual tokens.

Properties

The class has three properties:

  • $programCounter: an integer that keeps track of the current position in the program.

  • $tokenizer: the Tokenizer object that provides the tokens for the program.

  • $stack: an SplStack an object that is used to store values during execution [It’s a simple stack data structure ].

Execute Method

The execute method is the main entry point for the interpreter. It loops through the tokens in the program and executes each instruction. The loop continues until it reaches the TOKEN::EXIT token, which marks the end of the program.

The interpreter handles the following instructions:

  • TOKEN::PRINT: prints a value to the console. If the value is a string, it prints the string directly. If the value is a variable, it looks up the variable in the Tokenizer's map and prints its value.

  • TOKEN::PUSH: pushes a value onto the stack.

  • TOKEN::SUM and TOKEN::SUB: pops two values from the stack, performs the specified arithmetic operation, and stores the result in a variable.

  • TOKEN::READ: reads an integer value from the standard input and pushes it onto the stack.

Conclusion

In summary, By creating SimpleScript, we have delved into the core concepts of how programming languages work, from parsing user input to executing instructions.This journey has provided us with a deeper understanding of interpreters, tokenization, and the execution process.

Sources

Build your interpreter complete series - complete real-world series.
Source code - Github link to take a look over a code.

Did you find this article valuable?

Support Saravana sai blog by becoming a sponsor. Any amount is appreciated!