Tuesday | 29 OCT 2024
[ previous ]
[ next ]

Crafting Interpreters - VM

Title:
Date: 2024-10-19
Tags:  

I'm going to work through the crafting interpreters book in the hopes that at the end of it is I'll be able to write a BASIC compiler that is much faster than the one I already worked on.

The goal is to write the BASIC side and then get started on building my own version of Pick backed by sqlite. For what reason you might ask, because I want to. It is sitting in my head and the idea demands to be brought out. I'm sure it's not a good idea but I can't help it. I really want to have this thing in me run my editor and my shell and from there maybe it'll be able to run something else.

At least I have a good amount programs to test against it but we'll see if this project really holds up. There are quite a few pieces to it.

I'm going to once again stream of consiousness write while working through the book.

Chapter 1 - Chunks of Bytecode

While working through the book, I had a sense of deja vu and then realized that I had actually already done the beginning of the compiler section. I checked my shell history and indeed found the compilation commands I had used. I was shocked to see how far I had actually gotten as I could not remember doing it. I knew I had started it before but it looks like I got a pretty good way in. None of it stuck unfortunately.

This time I decided that I'll try to do everything in a single file. Usually that helps me keep the entire thing in my head. I'll also try to build out the BASIC language while working through the book so it's a bit more relevant to me. These two things I'm hoping let me learn better.

A thing to note here is that there is a large chunk of code due to display debugging information and another large part is the logic to grow arrays. This is due to C and I can see the language code being much smaller when you can ignore these things when using a language that has dynamic arrays.

The first chapter is now done and I'm not exactly thrilled with it. It's still too far removed from anything useful but I can see why its an important chapter. It lays out the plan and sets the foundation for things to come. I just don't see the connection between what a chunk is and what the source code will look like.

I can understand that a chunk is a set of bytes that map to op codes and the opcode dictate how big the operands are. I can see how the OP_CONSTANT op code works and how it will be useful but how does a source line become a chunk and a chunk implies that there are many chunks to be processed. Each chunk has a constant pool that holds constants of variable size. How does this relate to the constants inside an entire program?

This chapter seems to be give me more questions than answers but I'm sure it'll become more clear as I go on. The big thing is that I remember the interpreter part of the book being much more satisfcatory at the end of each chapter. Every chapter ended with success and a clear path forward.

Let's see how chapter 2 goes.

Chapter 2 - A Virtual Machine

This chapter was a bit better but still not very clear about where we're headed. This chapter builds out the virtual machine and that code that will ultimately execute the language. This chapter introduces the stack and pushing and popping values. It also adds the arithmetic operations and shows how the stack will hold them an execute them.

I think I would have started much higher up before getting to this chapter. Converting a line of code into byte code and then executing that might be the better way for me to learn. I think that top down approach would make it feel much more real.

I've built a stack machine before and the core logic of it is so simple. The harder part is parsing and generating the bytecode. The logic of function calls and actually running code I think falls out of the logic of a stack very nicely.

This chapter doesn't add too many lines to the program. Currently the single file is sitting at 300 lines of code. I'm very curious to see how big this program gets by the end.

I would also love to be able to make this language very simple and small but I don't think thats possible with something like BASIC and my own lack of skill.

Chapter 3 - Scanning on Demand

The scanning logic is pretty straightforward. It rips through and breaks on the various characters like parenthesis, keywords and numbers. It builds the tokens for strings and numbers. It also

The logic of using a trie to identify if a keyword is a variable name or a builtin is the complex part. I wonder how much faster it really is versus using something like a hashmap. As the chapter explains, even a hashmap still requires every character to be processed whereas with a trie, it will only check as many character as it needs to and many words would be skipped over.

This chapter is one that I should implement from scratch as I think it would help quite a bit to really learn parsing. It is straightforward but I've probably made it more complex in previous versions of writing small languages.

This chapter also makes it clear why programmers end up writing language generators as this code for parsing can be made automatically as it depends on just what your keywords are going to be and what special character your are using.

This chapter is better than the previous ones but we still aren't where I want to be. This step is the parsing where the language gets split into tokens. The next chapter will probably be the key where the tokens get converted into bytecode.

I hope.

Chapter 4 - Compiling Expressions

This chapter is most definitely the most important so far. The chapter goes over how to do handle the precdence of operators and how tokens are used to generate the op codes.

I'll need to walk through the logic and map out examples because it uses a table of functions instead of converting an infix string to postfix. I used a function to do the conversion as it made things easier and I understood it more but this might be the better way to do things once I can intuit why it works.

This is one of those times where I may need to split out the compilation logic to a seperate file so I can see just how those functions interconnect.

The program is now sitting at 800 lines.

Chapter 5 - Types of Values

This chapter adds more operators and also adds typing information to the values. For the most part this was a breather chapter and I ended up fixing a bug that I had in my parsing logic.

I'm slowly starting to work on understanding the parsing logic and how the chain of functions is being called. The indirection is a bit frustrating and I wonder if there is a better way to do it. It also doesn't help that C doesn't display enough information easily and so I need to write logic to print enough information to trace things out.

I think this aspect especially might be helped with another language. For example printing a struct in rust will output all the fields. I can also see javascript being much easier for me to understand as it's the language I'm most familiar with. Here I'm trying to figure out if the bugs are in my code or my misunderstanding of C.

I usually learn things deeper when I explain them so I may write out my process for creating the BASIC compiler. Originally the plan was to modify the resulting compiler from the book but starting from scratch might be better. I haven't decided yet but something to think about.

The next chapter is going to be interesting as that will handle strings but really I'm waiting to get to variables, control flow and functions. Those are what make a programming language a programming language.

Chapter 6 - Strings

Strings was a fun thing to implement. So far everything has been packed into the Value struct which was then added to the chunks constant array.

Strings however need variable sizing and so the Value struct contains a pointer to the string instead of holding it inside itself. This opens up the gates to having things be on the heap that we need to still access.

I can see strings being much simpler in another language besides C, alot of the code is due to the fact that you need to manually allocate and copy over strings from the source code into memory. I can also see how this will be handy when other things like functions need to live on the heap or in some dynamically access memory.

I also liked the implementation of the clean up as now we have things that need to get traced to get deallocated. Before the ValueArray was getting freed and all the values were by default freed because everything lived inside the Value struct. Now the strings would need to be followed to their locations in the heap to be freed.

The current solution this chapter uses is to store a list of objects that get created throughout the life of the program and it then cleans them up at the end. This is a very cool solution as it uses my favourite data structure, the linked list.

The code at this point is sitting at around 1150 lines. I think the macros isn't really helping and I haven't seen much use for it yet. I think making things more explicit might not be a bad thing. I am curious how a redo from scratch in another language would go. Some things might be simpler but I think I need to be more comfortable with the other languages to make things optimized.

Chapter 7 - Hash Tables

This is a bit of a boring chapter, I've implemented hash maps multiple times at this point and there is much new content here. The tombstones was nice and seeing the full delete function was useful but ultimately I'd rather reach for an alreayd working hashmap implementation than write my own.

The string interning part was very cool. I can see that being very useful and the speedup sounds like it is worth doing. The idea is to save every string to a set of strings even if its never really used more than once. This way you can save all the strings and then re-use them as the program needs. The goal is to have only one place where you do a full character by character string check. I can see this being massively useful. This kind of optimization I think could be done after the fact.

Both the previous chapter on strings and this chapter on hashmaps I beleive disappears in another language where you get these things automatically.

Next up will be global variables and now that we have hashmaps, I can see how globals will come into play. There will probably be a symbol table in the VM. I imagine this where assignment will also finally come into play and wiring up the identifier logic. This will be the first chapter that will do something really programming related rather than calculator related. So far everything has been very calculator centric.

Chapter 8 - Global Variables

This was the most enjoyable chapter thus far. The variables logic nicely fits in with the hashmaps. When I wrote my own programming language, I directly looked up variables and set variables as I hit them rather than emitting bytecode. I can see now that I had the basic idea of using a stack machine and op codes but I didn't commit fully to it. Some things were done by the meta language while others were done by the byte code runner.

I'm guessing having everything translate to bytecode would be better as it gives you a single place to process things in.

Good to see and I've definitely learned something here.

Another thing I would change is the advance function that gets the next token. I see why it is split out into its own function. It's because you need to advance the token based on many things and so removing it from a loop makes sense. However it also makes things harder for me to understand. I wonder if there's a way to write the scanner and parser such that you can just use loops. One solution might be to fully scan the programmer and then parse it and start executing. That does seem wasteful as the whole point is that you only need a couple tokens in reality to do anything.

As of this chapter, the program sits at about 1500 lines of code.

Chapter 9 - Local Variables

Local variables is a great chapter and it uses the stack data structure to load up variables and then keeps track of where it is on the stack to get things. This is a beautifully simply way of doing scoping and I like this more than the environment hashmap.

I wonder why you can't extend the logic to hold globals as well, as currently globals get looked up in a hashmap. Local variables seem to be a search through all the scopes to find the variable which would be the downside.

But you have to do this search before getting the global anyway so I'm not sure if it actually matters if the globals are on the stack versus the hashmap.

The block logic is also insanely simple as all it does is loop around reading everything and compilng everything in between the braces.

Now that that logic exists, I'm interesting in seeing how conditionals get handled. My own block logic felt fragile and cumbersome but this code is elegant. I'll need to see where I went wrong but likely its in the fact that the scanning and parsing logic are so much more robust here.

Next chapter is the long awaited control flow, this is where this toy calculator will become a real programming language, cannot wait.

Chapter 10 - Jumping Back and Forth

This was a fun chapter and once that makes it clear that I was on the wrong path when writing my own languages. I was trying to use both a stack machine and an AST style design when I should have gone with just one. I can see now that my conditionals and looping logic would have been much easier to handle if I had added jumps as opcodes.

This was a very cool chapter and I think this will need to sit with me for a bit before I feel comfortable using it. The core idea is starting to hit home though, everything should be done through the stack machine and I should be putting things on and manipulating the stack rather than using the meta language's features. This would simplify and make things easier I think.

This chapter is also what transforms the calculator into a programming language. At this point you can do quite a bit with the language. I think adding some operating system stuff like file system access and network access is the next most important thing. Currently the program can still only do calculations and doesn't have any side effects in the world.

The current clox implementation sits at 1.8k lines of code.

The next chapter is on calls and functions which should be useful. I'll also need to start seeing how to save the compiled form of a program so that I don't have to go through the entire parsing and scanning step.

Chapter 11 - Calls and Functions

This was the longest chapter by far and also the most complex. It also was the one where you can't really see the fruits of your labor until the end. The fact that everything actually worked is a testament to how good the book is as I wasn't super careful with copying the code. I didn't have much trouble but there were a few things that I had forgotten or had changed earlier that I had to take into account. Ultimately I think it was a good thing to try and write everything in a single file because opening up different files is a pain.

The chapter goes over adding function calls and the payoff is great. You conver t the compiler to actually compile a chunk of code and then return the compiled form for execution. By doing this, you can then compile a function as well as a mainline program. This was set up in chapter 1 but you don't actually need it until now. I wonder if there's a better way to highlight that fact. I've realized alot of the things are done so future chapters are easier to handle.

The major thing this chapter does is that the bytecode for the function is generation and saved with the function name. When the function is called, that bytecode that is generated executed and then the value is returned and the calling program will pick up where it was. I'm starting to see how you can write an entire language using a stack machine and it is quite cool. I think implementing functions from scratch would be an illumintating project.

I do wish it was more top down though. I would love to have started with an example of a function call in Lox and then slowly work through the front end to get the scanner and parser generating the right code. Then it spits out the bytecode for the function and the function name. Then finally have the bytecode get wired into the backend part of the stack machine. Then it would just execute.

This chapter isn't super useful as I won't be implementing functions in BASIC or at least not planning to. Though it would be supremely useful. The key idea though is that I could save the bytecode that gets returned by the compile function to a file in BP.O. This would then get read in and added if my mainline program calls another program. I can see how these pieces come together now so you can create subroutine calls.

This is definitely going to be a huge project as I'm starting to see.

This chapter also goes over adding native functions which is probably the most important thing so far. I need file access and network access so that BASIC can actually do things.

This chapter and variable assignment so far have been my favourite chapters. This one is also my most difficult and I'm sure I don't fully understand it yet.

I was also suprised to see that the fibonacci sequence program still took awhile to run. I was expecting it to run way faster but it didn't. I'm now worried that there is going to be some issues with BASIC. But we'll burn that bridge when we get to it.

The program is currently at 2000 lines of code and we still have quite a few chapters to go.

Chapter 12 - Closures

Another long chapter and this one is less relevant to what I'm planning to do but this still crystallized some ideas about programming. The core idea is that a function should not only be a function that contains bytecode to execute, but that it should also have access to variables declared outside of itself and that it should hold references to these variables that may have left the scope.

This means that a function should really always be a closure. A closure could then contain variables that live on the heap for longer than wherever they came from.

From this idea we build out the logic and it is a long process to add it in.

I understand why using the bytecode ops to handle everything makes sense but having the C language just do things while also parsing makes sense to me. I think that's where I went with my own language implemenations. I would do both executing and parsing at the same time. This did result in a mess of code though so that might be enough reason to keep things seperate.

The next chapter is going to be an important one as it's on garbage collection. That is probably going to be useful in general and I'm looking forward to it.

The code is currently sitting at about 2300 lines.

Chapter 13 - Garbage Collection

A fun chapter and relatively short in comparison to the previous chapters but it packs quite a punch. The core idea of garbage collection is simple. The VM already keeps track of all the objects allocated to the heap. We need to then figure out what the root variables are which are the variables that are currently on the stack and we also need to find our if these variables point to things on the heap. We go through these references and mark them.

At the end of this process, we should have 2 lists, one of all the objects and another that we have marked as still reachable by the program. The objects that aren't marked are therefore eligable to be reclaimed as the program won't be using them anymore.

There was some wiring up but luckily and by design it was concentrated in one area of the language. The focus was on the reallocate function. This meant that anytime memory was requested or increased, the garbage collection could be run.

It was also cool to see the criteria where the garbage collection is triggered and I wonder how widespread that the logic in the book is. It is used in Lua but probably a few languages use the same idea. The garbage collection is triggered when there a limit is hit and then that limit is used to decide the next run of the garbage collector. This results in more memory usage triggering the garbage collection after a longer amount of time while less memory usage results in the garbage collector being called more often.

This chapter also explains throughput and latency which was handy. Throughput is how much of the run time was spent in the garbage collector. Latency is how long the garbage collector actually runs for in one run. This is something you want to balance as you don't want to call the garabge collector too often but if you go too long without calling it, it will spend more time doing the garbage collection.

The other thing I learned was that most of the time is spent marking objects as being reachable. This makes sense now but I had never thought of it that way. The garbage collector spends most of its time picking out the stuff that isn't garbage.

This chapter is definitely going to be helpful in the long term. The next chapter though looks to be less as BASIC doesn't have classes. Though like functions it might be something I want to implement anyway.

Chapter 14 - Classes and Instances

This is the chapter that I am giving up on. I lost motivation and I don't really care for the object oriented logic. I trudged through this chapter but I can tell that it's the set up the language needs to implement the real core which is the class methods.

I think I can safely stop here as I've gotten the main ideas at this point and really now I have to actually try creating my own language. That will really cement the learning and hopefully I'll do better this time around than the previous times.