Stories by Maxim Zaks on Medium

When “Magic” Becomes Explicit: Mojo’s Take on Metaprogramming

Maxim Zaks — Fri, 06 Feb 2026 11:33:54 GMT

What Swift Hides, Mojo Exposes

Mojo is a new programming language by Chris Lattner, the same Chris Lattner who developed Swift, LLVM, MLIR, and tons of other cool stuff.

Mojo means “a magical charm” or “magical powers.”
https://docs.modular.com/mojo/faq#why-is-it-called-mojo

Ironically, while Mojo’s name means “magical power,” one of its central design goals is to minimize compiler magic. Recent additions such as compile-time reflection and default trait method implementations reinforce this direction.

If you’ve worked with Swift, you know that when a type conforms to protocols like Codable, Sendable, Hashable, or Equatable, the compiler can synthesize the conformance for you. In practice, this means the compiler generates derived implementations based on the conformances of the type’s stored properties. This is powerful, but also entirely compiler-driven and opaque to developers. The behavior is hard-coded for a limited set of standard protocols, and there is no general mechanism for user-defined protocols to opt into the same synthesis model. As a result, authors of custom protocols must require manual conformance implementations from their users.

In Mojo, this is different. Thanks to compile-time reflection and default trait method implementations, any trait can supply this “magical” behavior itself. For example:

trait Hashable:
    fn __hash__[H: Hasher](self, mut hasher: H):
        comptime names = struct_field_names[Self]()
        comptime types = struct_field_types[Self]()
        @parameter
        for i in range(names.size):
            comptime T = types[i]
            _constrained_field_conforms_to[
                conforms_to(T, Hashable),
                Parent=Self,
                FieldIndex=i,
                ParentConformsTo="Hashable",
            ]()
            hasher.update(trait_downcast[Hashable](__struct_field_ref(i, self)))

This feature is extremely powerful because it generalizes behavior that was previously restricted to compiler intrinsics and makes it available to all developers. However, to fully realize this model, another upcoming feature is needed: struct extensions. Struct extensions will allow developers to retroactively implement new traits for existing types, including those defined in the standard library. In practice, this enables library authors to introduce new abstractions and have pre-existing types participate in them without requiring modifications to the original type definitions.
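
For readers who do not know Mojo, the shape of the idea can be loosely imitated in Python. The sketch below is only an analogy and not Mojo's mechanism: it reflects over fields at runtime with dataclasses.fields, whereas the Mojo trait above does so at compile time. FieldHashable, field_hash, Point, and Tagged are hypothetical names introduced purely for illustration.

```python
from dataclasses import dataclass, fields


class FieldHashable:
    """Mixin that derives a hash from the fields of the concrete dataclass."""

    def field_hash(self) -> int:
        values = []
        for field in fields(self):  # runtime reflection over the declared fields
            value = getattr(self, field.name)
            try:
                hash(value)  # runtime stand-in for the per-field conformance check
            except TypeError as error:
                raise TypeError(
                    f"field {field.name!r} of {type(self).__name__} is not hashable"
                ) from error
            values.append(value)
        return hash(tuple(values))


@dataclass
class Point(FieldHashable):
    x: int
    y: int


@dataclass
class Tagged(FieldHashable):
    tags: list  # lists are unhashable, so field_hash() refuses this type
```

Point(1, 2).field_hash() gives the same value as hash((1, 2)), while Tagged(["a"]).field_hash() raises a TypeError that names the offending field. The behavior is written once by the author of the mixin, and conforming types only have to opt in.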

Was 2025 the year we all became 10x engineers?

Maxim Zaks — Wed, 17 Dec 2025 15:53:22 GMT

Or did we just become sloppy?

This blog post was triggered by a discussion on LinkedIn, a YouTube video, and an audiobook. Well, okay, it is also that time of the year…

For me personally, 2025 kicked off with a new job in an exciting field: I started working on an AI coding assistant. The field is wild — tons of competition, even more uncertainty, and at the end of the day, as the German saying goes:

Alle kochen nur mit Wasser

Where the “Wasser” in this case is a handful of LLMs (services) from a few big companies. I will not go into detail about my journey in this space, I just want to emphasize that I am no stranger to AI-assisted coding. I even have insight into how the sausage is made, and I have used multiple tools throughout 2025.

As mentioned above, one of the triggers for this post was a YouTube video that resonated with me quite a bit, maybe because the author had a similar experience to mine: doing more with less has its limits.

https://medium.com/media/0e8999b3a585221e88596fbbacf7407c/href

Doing more with less is the philosophy behind 10x engineering. And as the author of the video mentioned, after the honeymoon period, reality will hit you like a brick wall. Being sloppy does not pay off in the long run.

Don’t get me wrong: I do use LLMs for my work. They have become an important tool, but you need to understand that what an LLM does is very similar to a cargo cult. An LLM is trained on tons of code, and when you ask it to build something for you, it simply tries to replicate what it was trained on, without actually understanding the inner workings.

Here is an example from my current project. I asked ChatGPT about an intricacy of a publicly available specification. Here is an excerpt from its answer:

So far, so good. The model gave me an in-depth analysis and grounded it with a citation from the RFC, including a paragraph number. The problem is that the citation is hallucinated. The paragraph number is correct, the paragraph discusses the SCIM PATCH path specification, but there is no such sentence in it. The model generated a citation using appropriate language, looking absolutely legitimate.

After I confronted my assistant, it responded as follows:

Here we can see how masterful ChatGPT is at consulting, but is it reasonable to generate false citations just because most people interpret the spec differently? If a person did something like this, we would consider it fraudulent. Or is it the equivalent of everybody lying on their dating profile?

Now let’s go back to the claims of unlocking x times the productivity with AI tools. I see people writing that it takes them one or two days to implement a feature that previously took two to three weeks. And yes, if we think in person-days, they moved from 10–15 days to 1–2 days, which is a 10x speedup. So in a year, they should deliver over 100 features instead of 10.

How much trust has to be placed in our devoted consultant in order to keep such a pace? What happens if 10% of these features have a bug or undesired behaviour? How fast will you be able to fix a bug in the future that you only spent a couple of hours working on?

It is a known fact that in order to get good at something, we need to put a certain amount of hours into it. It is also a known fact that doing boring work is where our mind comes up with creative solutions. Going 10x faster than your normal speed means you have to use a 100% reliable tool for a 100% deterministic process, like using a calculator instead of doing math in your head, or digging a grave with an excavator instead of a shovel.

Speaking of digging a grave, you know what else is very dangerous in the x-times productivity debate? If you have a 10x engineer, how many of them do you need on a team? Do you need a team at all?

From a profit-oriented perspective, you do not. So it is okay to have only one person hammering out over 100 features a year, replacing a full team. That said, what is your bus factor in this scenario? Right… it is also 100%.

But fear not: the next person will be able to replace them using the same tool, right? Well, although the context window of an LLM is growing and it is legitimate to explore legacy codebases with the help of an LLM, it is still not enough to understand the why behind the decisions in a codebase. Although, if the previous person just vibe-coded their way to this point, they probably do not understand the why either.

In conclusion, I think the x-times productivity boost narrative should be considered harmful. Most of it is grounded in anecdotal evidence, novelty, and wishful thinking — sometimes even greed. AI tools are helpful and they are getting better, but we need to understand what sustainable throughput and team size should be based on long-term observations.

As the Navy SEAL saying goes:

Slow is smooth and smooth is fast

Wish you all great holidays and a happy new year!

What do AI Agents and Retained GUI Frameworks Have in Common?

Maxim Zaks — Sat, 24 May 2025 14:23:01 GMT

No, this is not the start of a bad joke. Though it could be:

“An AI agent and a GUI framework walk into a bar. The bartender says, ‘Why the long stack trace?’”

But I digress.

If you’re not familiar with the term retained GUI framework, don’t worry — you’re in good company. Let’s break it down: in a retained-mode GUI, the UI state is stored and managed by the framework. When the data changes, the framework has to go, “Ah! The button needs to be blue now!” and update the UI accordingly. Sounds simple? Sure. Until you realize you’re suddenly debugging why your toggle button has trust issues and won’t commit to being toggled.

One of the main pain points of retained GUIs is state management. Keeping the UI and your actual data in sync is like trying to get toddlers to nap at the same time — possible in theory, but rarely achieved without chaos.

You’ve got to:

Know when your data changes.
Notify the UI about it.
Hope the UI updates the right things.
Pray it didn’t break something else in the process.

Now here’s the kicker: AI agents have the same problem.

Wait… what? Isn’t AI just fancy autocomplete? “You type stuff in, it types stuff out” — that sort of deal?

Yes, for simple use cases. But AI agents are different beasts.

They’re called agents because they act. They don’t just respond — they do. They’re given tools (check out the MCP protocol if you want to dive deep), and they’re supposed to use them to perform tasks. Sometimes these tools just look up data or do basic computations. But sometimes… they change the world (or at least, your app’s state).

Now imagine a conversation-based flow. You ask the agent something, it runs a tool, then continues the conversation. Sounds fun — until the agent’s internal world is out of sync with reality.

Here’s what can happen:

You asked for “the current temperature” five steps ago.
The agent called a tool and told you it was 20°C.
But then the tool updated something, or another agent came along (👀 looking at you, A2A protocol), and the temperature changed.
Now you’re talking about old data — and the agent doesn’t even know it’s outdated.

This is where things get weirdly GUI-esque.

You need state awareness. Like in retained-mode GUIs, where widgets have to update when the data changes, agents need a way to know when something they believe is true… isn’t anymore.

So what do we do? Just… re-check everything all the time like a paranoid squirrel?

Maybe.

Or maybe we need to build in mechanisms for state change notification, just like GUI frameworks do. Imagine the MCP or A2A protocols being able to say:

“Hey buddy, I just changed the temperature value. Might wanna update your internal map of the world.”

That way, agents can respond intelligently to changes and avoid hallucinating outdated realities. You know, like a properly-behaved UI widget that actually listens when you tell it to change color. (Looking at you, custom toggle button from 2017.)
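
A minimal sketch of what such a notification could look like, written as a plain observer pattern in Python. This is not an API of MCP or A2A; SharedState, Agent, and every method name here are hypothetical and only illustrate the idea of invalidating a stale belief when the underlying value changes.

```python
class SharedState:
    """Hypothetical tool-side state that tells listeners about every change."""

    def __init__(self):
        self._values = {}
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def get(self, key):
        return self._values[key]

    def set(self, key, value):
        self._values[key] = value
        for listener in self._listeners:
            listener(key)  # "Hey buddy, I just changed this value."


class Agent:
    """Hypothetical agent that caches what it has read so far."""

    def __init__(self, state):
        self._state = state
        self.beliefs = {}
        state.subscribe(self._on_change)

    def read(self, key):
        # Only call the "tool" when there is no cached belief for this key.
        if key not in self.beliefs:
            self.beliefs[key] = self._state.get(key)
        return self.beliefs[key]

    def _on_change(self, key):
        # Drop the stale belief so the next read fetches the fresh value.
        self.beliefs.pop(key, None)


state = SharedState()
state.set("temperature", 20)
agent = Agent(state)
first = agent.read("temperature")   # the agent now believes it is 20
state.set("temperature", 25)        # somebody else changes the world
second = agent.read("temperature")  # the belief was invalidated and is re-read
```

Without the subscription, the second read would return the cached 20, which is exactly the outdated-temperature scenario described above.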

TL;DR

Retained GUIs and AI agents both suffer from the same fundamental challenge: managing and reacting to shared mutable state.

So next time you see an agent doing something dumb, remember — it’s not (always) the AI’s fault. Sometimes, it’s just trying to deal with state like the rest of us: in confusion, with outdated facts, and no idea that everything changed two messages ago.

Welcome to the future. It’s weird, and full of shared mutable state.

What do AI Agents and Retained GUI Frameworks Have in Common? was originally published in The Startup on Medium, where people are continuing the conversation by highlighting and responding to this story.

CrazyString

Maxim Zaks — Sun, 25 Aug 2024 07:48:48 GMT

An experiment in using small string optimization and efficient Unicode code point indexing

Over the last year I have done a bunch of experiments with the 🔥 Mojo programming language.

One of the latest experiments is CrazyString. Yesterday I recorded a video in which I explain some of the concepts I employed in this experiment.

https://medium.com/media/912ef26f001e8fbc7fa684d9162060b5/href

Poor person’s package management in Mojo

Maxim Zaks — Tue, 31 Oct 2023 11:41:41 GMT

Photo by Claudio Schwarz on Unsplash

As you might have noticed, if you followed my last few posts, I am currently exploring the new programming language called Mojo.

I created multiple small repos in which I implement a few fundamental elements: sorting and hashing functions, a hash map, some tree data structures, and a CSV builder and parser.

The CSV package in particular is quite useful, as I like to collect the results of a benchmark in tabular form, and CSV is the simplest and most portable format for it.

So it makes sense to reference the CSV package from other projects in order to use it in benchmark scripts. Sadly Mojo does not provide any package management tools yet. This is why I decided to invest a few hours into a simple solution until the official package manager makes its debut.

https://medium.com/media/a079616a63394a6ef3b41d0006f32181/href

What you see above is a bash script which lets you check out specific folders from a git repository.

check_out_remote_module "https://github.com/mzaks/mojo-csv" "csv"

Here we say that we would like to check out the csv folder, which represents a Mojo module, from a GitHub repository. The csv folder will end up in your root project folder, which makes it accessible in your code. The script uses sparse-checkout, which means that only the desired folders are checked out. This is useful because, in the case of mojo-csv, the repo also contains large CSV files for benchmarking, which we don’t want to check out into our projects.
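
For readers who have not used sparse-checkout before, here is a rough sketch of the kind of git invocations involved. It is not the author's script (which lives in the embed above); the function name sparse_checkout_commands and the destination folder tmp_checkout are made up for illustration, and the sketch only builds and prints the commands instead of running them.

```python
import shlex


def sparse_checkout_commands(url, folders, dest="tmp_checkout"):
    """Build git commands that fetch only the given folders of a repository."""
    # --sparse starts with a minimal working tree, --filter=blob:none avoids
    # downloading file contents that are never checked out.
    clone = ["git", "clone", "--depth", "1", "--filter=blob:none", "--sparse", url, dest]
    # Select the folders that should actually appear in the working tree.
    select = ["git", "-C", dest, "sparse-checkout", "set", *folders]
    return [clone, select]


for command in sparse_checkout_commands("https://github.com/mzaks/mojo-csv", ["csv"]):
    print(shlex.join(command))
```

A real script would still have to copy the selected folders into the project root and clean up the temporary clone.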

After the code is checked out, the script generates a .checkoutinfo file in every folder, which contains a timestamp, the source URL, and the path to the folder:

Sun Oct 29 20:07:58 CET 2023
URL: https://github.com/mzaks/mojo-csv
Path: csv

If a repo contains multiple modules, it is possible to check out multiple folders:

check_out_remote_module "https://github.com/mzaks/mojo-trees" "fiby_tree" "left_child_right_sibling=lcrs_tree"

As you can see above, I am checking out fiby_tree and left_child_right_sibling from the mojo-trees repo. In the case of the left_child_right_sibling module I also decided to rename the module, so I append =lcrs_tree to the source folder name, which means that I want the target folder to be named lcrs_tree.

It is also possible to pick a nested module, as you can see in the following example:

check_out_remote_module "https://github.com/tairov/llama2.mojo" "read/libc/stdio=file_io"
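
The source=target convention used in the calls above can be sketched in a few lines of Python. This is an illustration, not the script's actual code: parse_module_spec is a hypothetical name, and the fallback of using the source path as the target when no = is given is an assumption.

```python
def parse_module_spec(spec):
    """Split 'source_folder=target_name' into (source, target)."""
    source, separator, target = spec.partition("=")
    if not separator:
        # Assumption: without a rename, the folder keeps its source path.
        target = source
    return source, target


specs = ["csv", "left_child_right_sibling=lcrs_tree", "read/libc/stdio=file_io"]
parsed = [parse_module_spec(spec) for spec in specs]
```

Here parsed holds ("csv", "csv"), ("left_child_right_sibling", "lcrs_tree"), and ("read/libc/stdio", "file_io").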

And last but not least, it is also very simple to pick a module from a specific branch or tag, as you can see below:

check_out_remote_module "-b v0.2.0 https://github.com/gabrieldemarmiesse/mojo-stdlib-extensions" "stdlib_extensions/builtins=list"

While I was implementing this solution, I had some ideas for an official Mojo package manager, but this is a topic for another blog post.

Please let me know if you find this script useful, and if there is a need for another blog post to explain things like how to structure your repo to increase module accessibility.

Faster prefix sum computation with SIMD and Mojo

Maxim Zaks — Thu, 19 Oct 2023 11:16:02 GMT

For a couple of months now I have been experimenting with the new programming language called Mojo. The language is still at a very early stage, and so is the standard library. In order to try things out, I currently concentrate on very basic functionality like hash functions, sorting and some tree data structures.

While I was working on sorting, I implemented counting sort and radix sort, both of which incorporate a prefix sum algorithm. A prefix sum algorithm is generally very easy to implement, but I found an article where the author claims to make it a couple of times faster by employing SIMD operations.

One of the key features of Mojo is first-class SIMD support, so I decided to go down the rabbit hole and check if I could implement a faster prefix sum algorithm in Mojo.

Let’s start from the beginning: what is a prefix sum?

For an in-depth understanding I think it is best to follow the Wikipedia link above, but the simple implementation of the algorithm in Mojo is actually self-explanatory:

var element = array[0]
for i in range(1, len(array)):
    array[i] += element
    element = array[i]

So after the loop, the element at index i in the array is equal to its original value plus the (already updated) element at index i - 1.

Given an array: 1, 1, 1, 1, 1, 1, 1, 1

A prefix sum of this array is: 1, 2, 3, 4, 5, 6, 7, 8

Which is computed by 7 additions:

1,
(1 + 1) = 2,
(2 + 1) = 3,
(3 + 1) = 4,
(4 + 1) = 5,
(5 + 1) = 6,
(6 + 1) = 7,
(7 + 1) = 8
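
The same scalar loop can be written in Python and checked against itertools.accumulate from the standard library, which computes exactly this running sum. The function name prefix_sum is only chosen for this sketch.

```python
from itertools import accumulate


def prefix_sum(values):
    """Scalar prefix sum: each element becomes the sum of itself and all before it."""
    result = list(values)
    for i in range(1, len(result)):
        result[i] += result[i - 1]  # depends on the previous, already updated element
    return result


ones = [1, 1, 1, 1, 1, 1, 1, 1]
print(prefix_sum(ones))
print(prefix_sum(ones) == list(accumulate(ones)))
```

The dependency on result[i - 1] inside the loop is what makes the scalar version inherently sequential.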

For me, it was hard to imagine how one would implement it with SIMD as every iteration is based on the previous one. But there is a way:

  1, 1, 1, 1, 1, 1, 1, 1
+ 0, 1, 1, 1, 1, 1, 1, 1
= 1, 2, 2, 2, 2, 2, 2, 2
+ 0, 0, 1, 2, 2, 2, 2, 2
= 1, 2, 3, 4, 4, 4, 4, 4
+ 0, 0, 0, 0, 1, 2, 3, 4 
= 1, 2, 3, 4, 5, 6, 7, 8

So it is possible to compute a prefix sum of an 8 element vector in 3 (log2(8)) steps, where each step combines a vector shift right with a vector addition. In Mojo the code looks as follows:

var v1 = SIMD[DType.uint8, 8](1, 1, 1, 1, 1, 1, 1, 1)
print(v1) # [1, 1, 1, 1, 1, 1, 1, 1]
v1 += v1.shift_right[1]()
print(v1) # [1, 2, 2, 2, 2, 2, 2, 2]
v1 += v1.shift_right[2]()
print(v1) # [1, 2, 3, 4, 4, 4, 4, 4]
v1 += v1.shift_right[4]()
print(v1) # [1, 2, 3, 4, 5, 6, 7, 8]

On the first line we define a SIMD vector that is 8 elements wide with numeric values of type uint8. Then we print the vector and proceed with a shift right and an in-place addition (+=) to mutate the vector 3 times, printing the intermediate and final results along the way.

You might be surprised by the syntax of the shift_right method call. The number of places we want to shift is passed in square brackets instead of parentheses. This means that the value is passed not at runtime, but at compile time, which also means that the value needs to be known at compile time. For more info on this topic please consult the Mojo Programming Manual.
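
The shift-and-add scheme itself is independent of Mojo and can be emulated with plain Python lists, which makes the doubling of the shift distance easy to see. The helper names below are made up for this sketch, and of course nothing here runs as actual SIMD instructions.

```python
def shift_right(vec, places):
    """Emulate a vector shift right: zeros come in from the left."""
    return [0] * places + vec[: len(vec) - places]


def doubling_prefix_sum(vec):
    """Prefix sum in log2(len(vec)) shift-and-add steps (length must be a power of two)."""
    width = len(vec)
    assert width > 0 and width & (width - 1) == 0, "width must be a power of two"
    result = list(vec)
    steps = []
    shift = 1
    while shift < width:
        shifted = shift_right(result, shift)
        result = [a + b for a, b in zip(result, shifted)]  # element-wise addition
        steps.append(list(result))
        shift *= 2  # 1, 2, 4, ...
    return result, steps


result, steps = doubling_prefix_sum([1] * 8)
for step in steps:
    print(step)
```

For eight ones this produces three intermediate vectors, matching the three steps shown in the Mojo snippet.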

How can we compute a prefix sum for a generic array whose size is only known at run time?

In order to do this we need to break the array, whose size is only known at runtime, into chunks and perform the operations with compile-time-known sizes on those chunks.

In order to perform the SIMD prefix sum on chunks we actually need to change the algorithm a bit.

Say we still have an array: 1, 1, 1, 1, 1, 1, 1, 1

But now we want to break it down into two 4 element chunks. The computation should look as follows:

First chunk:
  1, 1, 1, 1
+ 0, 1, 1, 1
= 1, 2, 2, 2
+ 0, 0, 1, 2
= 1, 2, 3, 4
+ 0, 0, 0, 0
= 1, 2, 3, 4

Second chunk:
  1, 1, 1, 1
+ 0, 1, 1, 1
= 1, 2, 2, 2
+ 0, 0, 1, 2
= 1, 2, 3, 4
+ 4, 4, 4, 4
= 5, 6, 7, 8

Which is reflected in the following Mojo code:

var v1 = SIMD[DType.uint8, 4](1, 1, 1, 1)
var v2 = SIMD[DType.uint8, 4](1, 1, 1, 1)
print(v1) # [1, 1, 1, 1]
v1 += v1.shift_right[1]()
print(v1) # [1, 2, 2, 2]
v1 += v1.shift_right[2]()
print(v1) # [1, 2, 3, 4]
v1 += 0
print(v1) # [1, 2, 3, 4]
print(v2) # [1, 1, 1, 1]
v2 += v2.shift_right[1]()
print(v2) # [1, 2, 2, 2]
v2 += v2.shift_right[2]()
print(v2) # [1, 2, 3, 4]
v2 += v1[3]
print(v2) # [5, 6, 7, 8]

As you can see above, we need to carry over the last value from the previous chunk in order to increment all vector values in the current chunk by it.
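
The carry-over logic can be sketched in Python independently of SIMD. In this illustration (the name chunked_prefix_sum is made up), each chunk is scanned on its own with itertools.accumulate as a stand-in for the vectorized in-chunk scan, and then shifted by the last value of the previous chunk.

```python
from itertools import accumulate


def chunked_prefix_sum(values, chunk_size=4):
    """Prefix sum computed chunk by chunk with a carried-over offset."""
    result = []
    carry_over = 0  # the first chunk has nothing to carry
    for start in range(0, len(values), chunk_size):
        chunk = values[start : start + chunk_size]
        scanned = list(accumulate(chunk))           # scan within the chunk
        scanned = [x + carry_over for x in scanned]  # add the previous chunk's total
        carry_over = scanned[-1]                     # becomes the next chunk's offset
        result.extend(scanned)
    return result


print(chunked_prefix_sum([1, 1, 1, 1, 1, 1, 1, 1]))
```

Only the single carry_over value flows from one chunk to the next, which is why the work inside each chunk can be vectorized.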

So given the chunk of size n, we need to perform:

result += result.shift_right[1 << i]()

log2(n) times, where i runs from 0 to log2(n) - 1

My first instinct was to put the above statement in a for loop:

var v1 = SIMD[DType.uint8, 4](1, 1, 1, 1)
for i in range(0, 2):
    v1 += v1.shift_right[1 << i]()
print(v1)

This, however, will not compile. As I already mentioned, the shift_right method expects a value known at compile time, and the i in the for loop is a runtime value (although the range is known at compile time). But no worries, the Mojo standard library has our backs. It provides a function which allows us to perform loop unrolling in a more functional way:

from algorithm import unroll

fn prefix_sum_on_chunk(inout v1: SIMD[DType.uint8, 4], carry_over: UInt8):
    @parameter
    fn add[i: Int]():
        v1 += v1.shift_right[1 << i]()
    unroll[2, add]()
    v1 += carry_over

var v1 = SIMD[DType.uint8, 4](1, 1, 1, 1)
print(v1) # [1, 1, 1, 1]
prefix_sum_on_chunk(v1, 0)
print(v1) # [1, 2, 3, 4]

This way the compiler emits code which is similar to what we wrote by hand above (without runtime branching and condition checks).

Next question, how big should the chunks be?

This depends on the type of the elements in the array (how many bytes one element occupies) and on the hardware we are running the algorithm on. The standard library does provide an autotune function which should automate this kind of decision, but to be honest with you, I did some manual tuning and came to the conclusion that on my laptop (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz) the following vector widths perform best:

1 byte elements (int8, uint8), with 256 wide SIMD vector
2 byte elements (int16, uint16, float16), with 128 wide SIMD vector
4 byte elements (int32, uint32, float32), with 64 wide SIMD vector
8 byte elements (int64, uint64, float64), with 32 wide SIMD vector

This implies that if the array is smaller than the preferred vector width, or not an exact multiple of it, we need to compute the rest with a smaller vector width.

To make it clear: say we have an array of uint64 with 80 elements in it. As I mentioned before, we prefer to take chunks of 32 elements for uint64 arrays, which means that we can compute the first 64 elements by taking two chunks with a 32 wide SIMD vector, and then we need to reduce the vector width to 16 in order to compute the remaining 16 elements.
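
The four measured widths all amount to 256 bytes per vector (1 x 256, 2 x 128, 4 x 64, 8 x 32), so the width selection and the fallback for the remainder can be sketched as below. This is an illustration under assumptions: the function names are made up, the 256 byte figure only restates the measurements from this one laptop, and halving the width until it fits is one plausible fallback strategy, not necessarily what the linked implementation does.

```python
def preferred_width(element_bytes, vector_bytes=256):
    """Elements per SIMD vector; 256 bytes matches the measurements listed above."""
    return vector_bytes // element_bytes


def chunk_widths(length, preferred):
    """Split an array length into chunk widths, halving the width when it no longer fits."""
    widths = []
    width = preferred
    remaining = length
    while remaining > 0:
        while width > remaining:
            width //= 2  # fall back to the next smaller power of two
        widths.append(width)
        remaining -= width
    return widths


print(chunk_widths(80, preferred_width(8)))
```

For the uint64 example this yields two chunks of width 32 followed by one chunk of width 16; an array of 83 elements would additionally need chunks of width 2 and 1.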

You can find the complete simd_prefix_sum implementation if you follow the link.

Given my previous explanation, you should be able to follow the code. That said, please don’t hesitate to write a comment if you have questions or suggestions.

Last but not least, I would like to talk about the runtime characteristics of the scalar and SIMD prefix sum functions, but first another disclaimer.

After I implemented the SIMD prefix sum and pushed it to the GitHub repository, I announced it on the Mojo Discord server, where a user pointed out that there is already a prefix sum function in the standard library, which I had missed. So for the benchmark comparison I included the runtime characteristics of the standard library function as well.

Table and chart based on benchmark results

The table and chart above show that the SIMD prefix sum takes from 0.03 to 0.2 nanoseconds per array element, depending on the element size and array size, whereas the scalar prefix sum is very stable at around 0.5 nanoseconds per element. The SIMD speedup is between 2.5x and 15x, which is quite great. The results also show that there is something strange going on with the prefix sum function in the standard library: it is sometimes comparable with my SIMD implementation but, in some cases, slower than the scalar prefix sum.

You can find the code I used for benchmarks here.

Thank you for reading and leave a clap or two if you will.

FibyTree vs. Set and SortedSet

Maxim Zaks — Thu, 24 Aug 2023 04:22:36 GMT

Photo by Kelly Sikkema on Unsplash

A couple of days ago, I wrote an article about FibyTree, a data structure I designed and implemented in Mojo. Today I would like to reveal some benchmarks.

A word of caution: these benchmarks are preliminary. At the current point in time Mojo code runs only in a Jupyter Notebook, so I ran both the Mojo and the Python code in a Jupyter Notebook. This is also the first performance evaluation I have done on FibyTree. I did design the data structure with performance in mind, but I did not have the time to think about and try out optimisations. BTW if you have ideas on how I can improve FibyTree, please let me know!

For the benchmark I chose seven set operations:

Insert an element to the set
Check if an element is in the set
Delete an element from the set
Build a union of two sets
Build an intersection of two sets
Build a difference of two sets
Build a symmetric difference of two sets
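
As a quick reminder of what these seven operations do, here they are on Python's built-in set, one of the two Python baselines in this benchmark. The values are arbitrary and only serve as an illustration.

```python
a = {1, 2, 3}
b = {3, 4, 5}

a.add(4)              # insert an element
has_three = 3 in a    # membership check
a.discard(1)          # delete an element (no error if it is missing)

union = a | b                 # elements in either set
intersection = a & b          # elements in both sets
difference = a - b            # elements only in a
symmetric_difference = a ^ b  # elements in exactly one of the sets
```

After the first three statements a holds 2, 3 and 4, so the four combined results are {2, 3, 4, 5}, {3, 4}, {2} and {2, 5}.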

I chose ten different sizes for the sets (10, 100, 300, 500, 1K, 3K, 9K, 15K, 30K and 50K) in order to understand how size correlates with performance, which is especially important for a tree-based data structure.

The time unit throughout the benchmarks is nanoseconds.

Insert an element to the set

Here is the Python script I use to populate the Set/SortedSet:

import random
import time

# Assumption: SortedSet comes from the third-party sortedcontainers package.
from sortedcontainers import SortedSet

def perf_test_random_add_p(size, min=-30000, max=30000, sorted=True):
    total = 0

    s = SortedSet() if sorted else set()
    for i in range(size):
        v = random.randint(min, max)
        # Time only the add call itself, not the random number generation.
        tik = time.time_ns()
        s.add(v)
        tok = time.time_ns()
        total += (tok - tik)

    # Average nanoseconds per insert.
    return total / size

def perf_test_ordered_add_p(size, sorted=True):
    total = 0
    s = SortedSet() if sorted else set()
    for i in range(size):
        tik = time.time_ns()
        s.add(i)
        tok = time.time_ns()
        total += (tok - tik)

    return total / size

As you can see I am measuring only the time it takes to add an element to a set. I am also considering two possibilities:

Adding random values to a set
Adding elements in ascending order

I made this distinction mainly because adding sorted elements to a binary search tree is the worst-case scenario, but I was also surprised to see that the Python Set implementation has quite a preference as well.

Here is the corresponding Mojo code:

fn perf_test_random_add(size: Int, min: Int = -30000, max: Int = 30000) -> Float64:
    var total = 0
    
    var tik = now()
    var tok = now()
    var f = fiby()
    
    for _ in range(size):
        let i = random_si64(min, max).to_int()
        tik = now()
        f.add(i)
        tok = now()
        total += (tok - tik)
    
    return total / size
    
fn perf_test_ordered_add(size: Int) -> Float64:
    var total = 0
    var tik = now()
    var f = fiby()
    var tok = now()
    total += tok - tik
    for i in range(size):
        tik = now()
        f.add(i)
        tok = now()
        total += (tok - tik)
        
    tik = now()
    f.balance()
    tok = now()
    total += (tok - tik)
    
    return total / size

As I mentioned before, it was very surprising for me to see that adding elements in ascending order to a Python Set is so much faster than adding random values. Generally, adding random elements to a FibyTree seems to have very nice performance characteristics. Thanks to the randomness, the balancing does not have to kick in that often, and the average difference per insert between 100 elements and 50K elements is just 1.5x.

As expected, adding elements in ascending order to FibyTree shows exponential growth, simply because we need to balance the tree so often.
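
Why ascending input is so unfriendly to a binary search tree can be seen with a deliberately naive, never-rebalanced tree in Python. This is not FibyTree and says nothing about its internals; Node and bst_height_after_inserts are hypothetical names used only to show how the tree degenerates into a list when no balancing happens.

```python
import random


class Node:
    __slots__ = ("value", "left", "right")

    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None


def bst_height_after_inserts(values):
    """Insert values into an unbalanced BST and return its height in levels."""
    root = None
    height = 0
    for value in values:
        if root is None:
            root = Node(value)
            height = 1
            continue
        node, depth = root, 1
        while True:
            if value == node.value:
                break  # duplicates are ignored, as in a set
            side = "left" if value < node.value else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, Node(value))
                height = max(height, depth + 1)
                break
            node, depth = child, depth + 1
    return height


ordered = list(range(200))
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)

ordered_height = bst_height_after_inserts(ordered)    # one level per element
shuffled_height = bst_height_after_inserts(shuffled)  # far fewer levels
```

With ascending input every new element becomes the right child of the previous one, so the tree is as tall as the number of elements. Any tree that wants to stay shallow has to rebalance again and again on such input.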

Adding random elements to a FibyTree of up to 10K elements seems to be about 10x more efficient than adding to a Python Set or SortedSet.

Check if an element is in the set

Here is the Python code, where we measure the time to call __contains__ method:

def perf_test_contains_p(size, sorted=True):
    total = 0

    s = SortedSet() if sorted else set()
    for i in range(size):
        s.add(random.randint(-size, size))
    
    for i in range(size):
        tik = time.time_ns()
        r = s.__contains__(i)
        tok = time.time_ns()
        total += (tok - tik)

    return total / size

And its Mojo counterpart:

fn perf_test_contains(size: Int, inout found: Int) -> Float64:
    var f = fiby()
    for _ in range(size):
        let i = random_si64(-size, size).to_int()
        f.add(i)
    
    var total = 0
    
    var tik = now()
    var tok = now()

    var res = DynamicVector[Bool](size)
    for i in range(size):
        tik = now()
        let r = f.__contains__(i)
        tok = now()
        res.push_back(r)
        total += (tok - tik)
    
    var count = 0
    for i in  range(len(res)):
        if res[i]:
            count += 1
    found = count
    
    return total / size

There is also perf_test_contains_on_balanced function, where we call f.balance() after all elements are added.

As we can see, the search in FibyTree gets steadily slower as the set grows. That said, finding elements in a balanced FibyTree is ~5x faster than in SortedSet and about 3x faster than in Set.

Delete an element from the set

The operation is quite similar to contains, as we first need to find the element and then delete it.

Below you can find the corresponding Python and Mojo code:

def perf_test_delete_p(size, sorted=True):
    total = 0

    s = SortedSet() if sorted else set()
    for i in range(size):
        s.add(random.randint(-size, size))
    
    for i in range(size):
        tik = time.time_ns()
        r = s.discard(i)
        tok = time.time_ns()
        total += (tok -tik)

    return total / size

fn perf_test_delete(size: Int, inout found: Int) -> Float64:
    var f = fiby()
    for _ in range(size):
        let i = random_si64(-size, size).to_int()
        f.add(i)
    
    var total = 0
    
    var tik = now()
    var tok = now()

    var res = DynamicVector[Bool](size)
    for i in range(size):
        tik = now()
        let r = f.delete(i)
        tok = now()
        res.push_back(r)
        total += (tok - tik)
    
    var count = 0
    for i in  range(len(res)):
        if res[i]:
            count += 1
    found = count
    
    return total / size

A balanced FibyTree is about 4x faster than Set and a whopping 20–30x faster than SortedSet. It’s also interesting to notice that the delete operation in FibyTree is on average faster than contains, which kind of makes sense as the tree gets smaller through the process. This is however not the case for Python Set.

Build a union of two sets

For this operation we create two random sets and apply the union operation. The time measured for operation execution is divided by the set size.

Python code:

def perf_test_union_p(size, sorted=True):
    total = 0

    s1 = SortedSet() if sorted else set()
    s2 = SortedSet() if sorted else set()
    for i in range(size):
        s1.add(random.randint(-size, size))
        s2.add(random.randint(-size, size))
    
    tik = time.time_ns()
    s3 = s1.union(s2)
    tok = time.time_ns()

    return (tok - tik) / size

Mojo code:

fn perf_test_union(size: Int) -> Float64:
    var f1 = fiby()
    var f2 = fiby()
    for _ in range(size):
        let i = random_si64(-size, size).to_int()
        # Note: the same value i goes into both trees, so f1 and f2 hold identical elements.
        f1.add(i)
        f2.add(i)
    
    let tik = now()
    f1.union_inplace(f2)
    let tok = now()
    
    return (tok - tik) / Float64(size)

We use the union_inplace method in Mojo, which does not create a new instance but mutates the tree in place, because of a bug I stumbled upon in the Mojo compiler and reported. From a performance point of view this should not make a big difference though.

The growth of execution time for FibyTree seems to be quite linear in the set size, whereas Set and SortedSet show somewhat degrading performance. It also shows that Set is 1–7x slower than FibyTree and SortedSet 6–12x slower, with a spike of 26x for the 10-element set.

Build an intersection of two sets

Similar procedure as for union, Python and Mojo code below:

def perf_test_intersection_p(size, sorted=True):
    total = 0

    s1 = SortedSet() if sorted else set()
    s2 = SortedSet() if sorted else set()
    for i in range(size):
        s1.add(random.randint(-size, size))
        s2.add(random.randint(-size, size))
    
    tik = time.time_ns()
    s3 = s1.intersection(s2)
    tok = time.time_ns()

    return (tok - tik) / size

fn perf_test_intersection(size: Int) -> Float64:
    var f1 = fiby()
    var f2 = fiby()
    for _ in range(size):
        let i = random_si64(-size, size).to_int()
        f1.add(i)
        f2.add(i)
    
    let tik = now()
    f1.intersection_inplace(f2)
    let tok = now()
    
    return (tok - tik) / Float64(size)

It seems like Python Set has a very efficient way to compute the intersection, at least for smaller sets; for larger sets it is still ~2x slower than FibyTree. SortedSet is 3–5x slower, with a spike of 17x for the 10-element set. I am not sure what is happening there, but the spike was consistent throughout multiple runs.

Build a difference of two sets

Similar procedure, Python and Mojo code below:

def perf_test_difference_p(size, sorted=True):
    total = 0

    s1 = SortedSet() if sorted else set()
    s2 = SortedSet() if sorted else set()
    for i in range(size):
        s1.add(random.randint(-size, size))
        s2.add(random.randint(-size, size))
    
    tik = time.time_ns()
    s3 = s1.difference(s2)
    tok = time.time_ns()

    return (tok - tik) / size

fn perf_test_difference(size: Int) -> Float64:
    var f1 = fiby()
    var f2 = fiby()
    for _ in range(size):
        let i = random_si64(-size, size).to_int()
        f1.add(i)
        f2.add(i)
    
    let tik = now()
    f1.difference_inplace(f2)
    let tok = now()
    
    return (tok - tik) / Float64(size)

We observe Set being 1–5x slower and SortedSet 5–9x slower with a 19x spike for 10 elements set.

Build a symmetric difference of two sets

Similar procedure, Python and Mojo code below:

def perf_test_symmetric_difference_p(size, sorted=True):
    total = 0

    s1 = SortedSet() if sorted else set()
    s2 = SortedSet() if sorted else set()
    for i in range(size):
        s1.add(random.randint(-size, size))
        s2.add(random.randint(-size, size))
    
    tik = time.time_ns()
    s3 = s1.symmetric_difference(s2)
    tok = time.time_ns()

    return (tok - tik) / size

fn perf_test_symmetric_difference(size: Int) -> Float64:
    var f1 = fiby()
    var f2 = fiby()
    for _ in range(size):
        let i = random_si64(-size, size).to_int()
        f1.add(i)
        f2.add(i)
    
    let tik = now()
    f1.symmetric_difference_inplace(f2)
    let tok = now()
    
    return (tok - tik) / Float64(size)

We observe Set being 1–6x slower and SortedSet 7–11x slower with a 20x spike for 10 elements set.

All in all, I think FibyTree is a viable data structure to investigate further. As I mentioned at the beginning of the article, these benchmarks should be seen as preliminary, but they already establish some baselines for what we can expect from the data structure. As next steps, I will consider whether there are any vectorisation and parallelisation techniques I could utilise to make the operations run faster, and whether I can reduce some memory management overhead by utilising stack allocations and direct heap allocations (dropping DynamicVector in favour of direct memory allocation and manual capacity management).

Benchmark results summarised

Insert an element to the set:

Set: 4–13x

SortedSet: 6–13x

Check if an element is in the set:

Set: 3–4x, 9x spike for 10 elements set

SortedSet: 4–6x, 10x spike for 10 elements set

Delete an element from the set:

Set: 4x

SortedSet: 14–36x

Build a union of two sets:

Set: 1–7x

SortedSet: 6–12x, 26x spike for 10 elements set

Build an intersection of two sets:

Set: 1–2x

SortedSet: 3–5x, 17x spike for 10 elements set

Build a difference of two sets:

Set: 1–5x

SortedSet: 5–9x, 19x spike for 10 elements set

Build a symmetric difference of two sets:

Set: 1–6x

SortedSet: 7–11x, 20x spike for 10 elements set


🔥 FibyTree vs. 🐍 Set and SortedSet was originally published in AI Mind on Medium, where people are continuing the conversation by highlighting and responding to this story.

A high level introduction to FibyTree

Maxim Zaks — Sat, 19 Aug 2023 18:17:49 GMT

Photo by Todd Quackenbush on Unsplash

FibyTree is an implicit binary search tree that can optionally be made fully balanced and complete, and it has the same semantics as a sorted set. I wrote a reference implementation of this data structure in Mojo, but fear not if you have no idea about Mojo: this article is very high level. So, without further ado, let's start from the beginning.

What is a tree?

A tree is a connected directed graph where one node has 0 incoming edges and all the others have exactly one. The node with 0 incoming edges is called the root of the tree.

What is a binary tree?

A binary tree is a tree where every node has at most 2 outgoing edges.

Normally we define a node of a binary tree as follows:

class Node:
  def __init__(self):
    self.left = None
    self.right = None

Where the two outgoing edges are represented with left and right fields. These fields are either set to None or point to another Node instance.

This example is a typical reference-based tree data structure. As a Node instance stores two pointers, it occupies 16 bytes on a machine with a 64-bit address space. The left and right fields point to Node instances that can live at arbitrary addresses, hence this representation does not guarantee data locality and will probably incur CPU cache misses during traversal.

This leads us to the idea of implicit, or space-efficient, data structures. Here is how FibyTree stores its outgoing edges:

struct FibyTree:
  var left: DynamicVector[UInt16]
  var right: DynamicVector[UInt16]
  
  fn __init__(inout self):
    self.left = DynamicVector[UInt16]()
    self.right = DynamicVector[UInt16]()

In FibyTree we represent the edges as indices. We decided to use an unsigned 16-bit integer as the index, meaning that our tree can have at most 2¹⁶ = 65,536 nodes. However, this also means that our overhead is only 4 bytes per node. When the left and right vectors are empty, we have an empty tree. The root node is always placed at index 0. If a node does not have a child, it stores its own index in place of that child. So for example:

left: [0]
right: [0]

Means that we have only one root node in the tree.

left: [1, 1]
right: [0, 1]

Represents a tree with a root and only a left child.

left: [0, 1]
right: [1, 1]

Is the opposite case: a root with only a right child.

left: [1, 1, 2]
right: [2, 1, 2]

Is a tree where the root node has a left and a right child

left: [2, 1, 2]
right: [1, 1, 2]

Is logically equivalent to the previous tree; the only difference is that we flipped the children's indices, so the left child is now at index 2 and the right child at index 1.

left: [1, 2, 3, 3]
right: [0, 1, 2, 3]

Is a tree of 4 nodes, where every node except the last one has only a left child. The last node is a leaf and therefore stores its own index.

BTW, representing 4 nodes in this way takes 16 bytes, the same number of bytes it takes to represent just one node of a reference-based tree.
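
To make the encoding a bit more tangible, here is a minimal Python sketch that decodes the two index vectors from the examples above. The children helper is a name I made up for this illustration; it is not part of the FibyTree implementation.

```python
def children(left, right, i):
    """Return (left_child, right_child) of node i; None marks a missing child."""
    # A node that stores its own index in a slot has no child on that side.
    left_child = left[i] if left[i] != i else None
    right_child = right[i] if right[i] != i else None
    return left_child, right_child


# Root with a left and a right child, as in the example above.
left = [1, 1, 2]
right = [2, 1, 2]

root_children = children(left, right, 0)
leaf_children = children(left, right, 1)

# Root with only a left child.
only_left = children([1, 1], [0, 1], 0)

print(root_children, leaf_children, only_left)
```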

The way we currently represent nodes is kind of pointless, as the nodes do not carry any data; they just encode the tree structure. In order for a node to represent data, we need to associate data with the node:

class Node:
  def __init__(self, data):
    self.data = data
    self.left = None
    self.right = None

In the Node class, the data property is another pointer, which points to an arbitrary location in memory and raises the size of a Node instance to 24 bytes. It is important to notice that this size is the pure overhead cost of organising the data as a tree; the size of the data itself is unknown.

struct FibyTree[T: AnyType]:
  var elements: DynamicVector[T]
  var left: DynamicVector[UInt16]
  var right: DynamicVector[UInt16]
  
  fn __init__(inout self):
    self.elements = DynamicVector[T]()
    self.left = DynamicVector[UInt16]()
    self.right = DynamicVector[UInt16]()

In order to associate data with a FibyTree, we introduce an elements field, which is a vector of generic type T. This way FibyTree owns the data, the data is stored in contiguous memory, and there is no additional overhead per node.

Let's talk about binary search trees now

A binary search tree is a binary tree which stores comparable elements in such a way that the left subtree of a node contains only data smaller than the node's data, and the right subtree contains only data bigger than the node's data. This way, while traversing the tree, we can identify which branch to take in order to find the node with the desired element. Because of this property, a binary search tree can be seen as a sorted set: when we add an element which is not represented by a node yet, a new node is created as a child of the node where our search ended unsuccessfully. I will not go into further details about the properties of binary search trees, as there is enough material online.

In order to make FibyTree a binary search tree and a semantic equivalent of a sorted set, we provide the following API:

struct FibyTree[T: AnyType, cmp: fn(T, T)->Int]:
  var elements: DynamicVector[T]
  var left: DynamicVector[UInt16]
  var right: DynamicVector[UInt16]
  var deleted: Int

  fn __init__(inout self):
    self.elements = DynamicVector[T]()
    self.left = DynamicVector[UInt16]()
    self.right = DynamicVector[UInt16]()
    self.deleted = 0

  fn add(inout self, element: T):
  fn delete(inout self, element: T) -> Bool:
  fn sorted_elements(self) -> UnsafeFixedVector[T]:
  fn clear(inout self):
  fn union(self, other: Self) -> Self:
  fn intersection(self, other: Self) -> Self:
  fn difference(self, other: Self) -> Self:
  fn symetric_difference(self, other: Self) -> Self:
  fn is_subset(self, other: Self) -> Bool:
  fn is_superset(self, other: Self) -> Bool:
  fn is_disjoint(self, other: Self) -> Bool:
  fn __len__(self) -> Int:
  fn __contains__(self, element: T) -> Bool:

With this API we can add, delete and search for elements in the FibyTree. We can return all elements as a sorted vector, produce a new tree which is a union, intersection, difference or symmetric difference of two FibyTrees. We can check if one tree is a subset, superset, or disjoint from another. We can get a length of the tree and also clear it.
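
To show how adding and searching can work on top of the index vectors, here is a simplified Python sketch. TinyIndexTree and its helper are hypothetical names for this illustration only; the sketch ignores deletion, balancing and the 16-bit index limit, and it is not the actual FibyTree code.

```python
class TinyIndexTree:
    """Hypothetical, simplified index-based BST; not the FibyTree implementation."""

    def __init__(self):
        self.elements = []  # node payloads, stored contiguously
        self.left = []      # left[i]: index of the left child, or i if there is none
        self.right = []     # right[i]: index of the right child, or i if there is none

    def _new_node(self, element):
        index = len(self.elements)
        self.elements.append(element)
        self.left.append(index)   # a fresh node is a leaf: both slots hold its own index
        self.right.append(index)
        return index

    def add(self, element):
        if not self.elements:
            self._new_node(element)  # the first element becomes the root at index 0
            return
        i = 0
        while True:
            if element == self.elements[i]:
                return  # set semantics: duplicates are ignored
            edges = self.left if element < self.elements[i] else self.right
            if edges[i] == i:  # the search ended unsuccessfully: attach the new node here
                edges[i] = self._new_node(element)
                return
            i = edges[i]

    def __contains__(self, element):
        if not self.elements:
            return False
        i = 0
        while True:
            if element == self.elements[i]:
                return True
            edges = self.left if element < self.elements[i] else self.right
            if edges[i] == i:
                return False
            i = edges[i]

    def __len__(self):
        return len(self.elements)


tree = TinyIndexTree()
for value in [5, 3, 8, 3]:
    tree.add(value)

print(tree.elements, tree.left, tree.right)
```

Adding 5, 3 and 8 reproduces the "root with a left and a right child" vectors shown earlier, and the second 3 is ignored.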

Let's talk about the last but not least feature of FibyTree: we can turn it into a complete, fully balanced binary search tree.

What does it mean though?

When we add elements to a binary search tree, we add new nodes at the bottom, which can lead to tree degradation. In the worst-case scenario we add sorted elements one after another, which leads to a single-branch tree, which is basically a linked list. Balancing the tree means that we strive to minimise the difference between the depths of the subtrees. A complete binary tree is a binary tree where each level is fully filled, except possibly the last one. A good example of a complete binary tree is a heap. A complete, fully balanced binary search tree with n elements guarantees a depth of ceil(log2(n + 1)) and therefore at most ceil(log2(n + 1)) comparisons per search.
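
To get a feeling for the difference, the following Python sketch compares the depth of the degenerate single-branch tree with the bound quoted above. Both function names are made up for this illustration.

```python
import math


def balanced_depth(n):
    """Depth bound quoted above for a complete tree with n elements."""
    return math.ceil(math.log2(n + 1))


def degenerate_depth(n):
    """Depth when n sorted elements are added one after another (a linked list)."""
    return n


for n in (1, 7, 8, 1000, 65536):
    print(n, degenerate_depth(n), balanced_depth(n))
```

For the maximum of 65,536 nodes a search in the balanced tree needs at most 17 comparisons, while the degenerate tree may need to visit all 65,536 nodes.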

In order to balance the FibyTree, we sort the elements and the left / right indices in Eytzinger order (BTW, if you were wondering, this is where FibyTree gets its y from), which has O(n) runtime complexity. When a FibyTree is fully balanced, we don't even need to look up the left and right children by index; we can use the Eytzinger formula:

left = (n + 1) * 2 - 1
right = (n + 1) * 2

to look up the left / right child of the node at index n.
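
Here is a small Python sketch of the idea, assuming we start from an already sorted list. The functions eytzinger and contains are illustrative names, and the recursive fill is just one simple way to produce the layout; the Mojo implementation may well do it differently.

```python
def eytzinger(sorted_items):
    """Lay out a sorted list in Eytzinger (breadth-first) order."""
    out = [None] * len(sorted_items)
    items = iter(sorted_items)

    def fill(i):
        # In-order walk over the implicit tree: left subtree, node, right subtree.
        if i < len(out):
            fill((i + 1) * 2 - 1)
            out[i] = next(items)
            fill((i + 1) * 2)

    fill(0)
    return out


def contains(layout, element):
    """Search without left / right vectors, using only the index formula."""
    i = 0
    while i < len(layout):
        if element == layout[i]:
            return True
        i = (i + 1) * 2 - 1 if element < layout[i] else (i + 1) * 2
    return False


layout = eytzinger([1, 2, 3, 4, 5, 6, 7])
print(layout)
```

Each element is written exactly once, which matches the O(n) cost mentioned above once the elements are available in sorted order.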

This article is only the high level introduction. I will write other articles where I will discuss performance benchmarks and implementation details.

You can find the very early (not fully optimised and with some copy-paste code) but already complete implementation of FibyTree in Mojo below.

Thank you for reading.

Please leave a 👏 if you liked it.

https://medium.com/media/dd2dc057e5d3ff6c5dba043dcd7888f0/href


A high level introduction to FibyTree was originally published in AI Mind on Medium, where people are continuing the conversation by highlighting and responding to this story.

Simple CSV parser in Mojo

Maxim Zaks — Sun, 28 May 2023 13:17:10 GMT

Parsing 400MB per second with less than 80 lines of 🔥

Photo by Mika Baumeister on Unsplash

So what is so hard about parsing CSV? Isn't it just .split("\n") and .split(",")? It can be, if you don't have to comply with RFC 4180.

RFC 4180 states that a field can be placed between quotes (") and that it is then legal to have new lines and commas as part of the field. So .split("\n") and .split(",") will not work anymore.
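
A quick Python illustration of the problem, using the standard csv module as a reference parser that understands quoted fields. The sample text is made up for this example.

```python
import csv
import io

text = 'name,comment\n"Doe, Jane","line one\nline two"\n'

# Naive splitting breaks the quoted fields apart.
naive = [line.split(",") for line in text.split("\n")]

# The csv module understands quoted fields.
# newline="" keeps embedded line breaks intact, as the csv docs recommend.
rows = list(csv.reader(io.StringIO(text, newline="")))

print(len(naive), naive)
print(len(rows), rows)
```

The naive version reports four "rows" and tears both quoted fields apart, while the csv reader returns the two rows with two fields each.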

Back in 2019 I ported FlexBuffers to C# (FlexBuffers-CSharp) as I wanted to use it for a project I was working on. And since the initial data came as CSV, I also implemented a CSV to FlexBuffers converter.

Nowadays I am exploring a new programming language called Mojo. After taking it for a spin by implementing a function to count chars in a UTF-8 string, I decided to go a bit further, brush up my knowledge of CSV parsing and see how well it would fly in Mojo.

First, I ported my converter to Mojo with just some small adaptations:

from String import *
from Vector import DynamicVector

struct CsvTable:
    var inner_string: String
    var starts: DynamicVector[Int]
    var ends: DynamicVector[Int]
    var column_count: Int
    
    fn __init__(inout self, owned s: String):
        self.inner_string = s
        self.starts = DynamicVector[Int](10)
        self.ends = DynamicVector[Int](10)
        self.column_count = -1
        self.parse()
    
    @always_inline
    fn parse(inout self):
        let QUOTE = ord('"')
        let COMMA = ord(',')
        let LF = ord('\n')
        let CR = ord('\r')
        let length = len(self.inner_string.buffer)
        var offset = 0
        var in_double_quotes = False
        self.starts.push_back(offset)
        while offset < length:
            let c = self.inner_string.buffer[offset]
            if c == QUOTE:
                in_double_quotes = not in_double_quotes
                offset += 1
            elif not in_double_quotes and c == COMMA:
                self.ends.push_back(offset)
                offset += 1
                self.starts.push_back(offset)
            elif not in_double_quotes and c == LF:
                self.ends.push_back(offset)
                if self.column_count == -1:
                    self.column_count = len(self.ends)
                offset += 1
                self.starts.push_back(offset)
            elif not in_double_quotes and c == CR and length > offset + 1 and self.inner_string.buffer[offset + 1] == LF:
                self.ends.push_back(offset)
                if self.column_count == -1:
                    self.column_count = len(self.ends)
                offset += 2
                self.starts.push_back(offset)
            else:
                offset += 1

        self.ends.push_back(length)
        
    fn get(self, row: Int, column: Int) -> String:
        if column >= self.column_count:
            return ""
        let index = self.column_count * row + column
        if index >= len(self.ends):
            return ""
        return self.inner_string[self.starts[index]:self.ends[index]]

The code is fairly straightforward. I define a structure called CsvTable, which contains a string holding the CSV text. Upon creating an instance of the structure, it determines the starting and ending positions for each field in the table, as well as the number of columns. The calculation occurs byte by byte within the parse function, where it checks for " , \n, or \r characters as specified in RFC 4180 and responds accordingly. The parse function assumes that the CSV is well-formed, meaning that every row has the same number of columns. Additionally, there is a get function that retrieves a field based on the provided row and column indices. If invalid indices are provided, the function returns an empty string, although some may argue that raising an exception would be more suitable.
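
To see what the starts and ends vectors look like for a concrete input, here is a Python transcription of the same scan. field_bounds is a name I chose for this sketch, and it works on a str instead of a byte buffer, so it is an illustration rather than a port.

```python
def field_bounds(text):
    """Python transcription of the scan above, on str instead of a byte buffer."""
    starts, ends = [0], []
    column_count = -1
    in_quotes = False
    offset = 0
    while offset < len(text):
        c = text[offset]
        step = 1
        if c == '"':
            in_quotes = not in_quotes
        elif not in_quotes and (c == "," or c == "\n" or text[offset:offset + 2] == "\r\n"):
            ends.append(offset)
            if c != "," and column_count == -1:
                column_count = len(ends)  # the first line break fixes the column count
            if c == "\r":
                step = 2  # skip the LF of a CRLF pair
            starts.append(offset + step)
        offset += step
    ends.append(len(text))
    return starts, ends, column_count


text = 'a,"b,c"\r\nd,e'
starts, ends, column_count = field_bounds(text)
fields = [text[s:e] for s, e in zip(starts, ends)]

print(starts, ends, column_count)
print(fields)
```

Note that the quoted field keeps its quotation marks, which is one of the shortcomings listed below.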

The CsvTable implementation has certain shortcomings and lacks several evident features, including:

Type inference for fields
Removal of quotation marks when returning a field using the get function
Provision of a flag to deviate from RFC-4180 and ignore quotation marks in fields
Offering the ability to set a different character as the column separator

While it is possible that I may implement these features in the future, my current focus is on SIMD (Single Instruction, Multiple Data) and the resulting performance enhancements.

Back in 2019, a paper called “Parsing Gigabytes of JSON per Second” by Geoff Langdale and Daniel Lemire created quite a buzz. They showed how using SIMD instructions could significantly speed up JSON parsing. The same authors also had a solution for parsing CSV files. I got curious and wanted to try implementing their approach using Mojo. Unfortunately, I ran into some issues because the current Mojo standard library didn’t have all the necessary intrinsics exposed. I could have used MLIR inter-op, but instead, I decided to build the parser using only the standard library’s own structs and functions. Despite the limitations, I managed to create a pretty efficient and, in my opinion, simple and elegant parser.

In the following sections, I will demonstrate my implementation, provide performance benchmarks, and highlight the distinctions between my solution and the one presented by Geoff Langdale and Daniel Lemire.

So, let's start by looking at the SimdCsvTable:

from DType import DType
from Functional import vectorize
from Intrinsics import compressed_store
from Math import iota, any_true, reduce_bit_count
from Memory import *
from Pointer import DTypePointer
from String import String, ord
from TargetInfo import dtype_simd_width
from Vector import DynamicVector

alias simd_width_u8 = dtype_simd_width[DType.ui8]()

struct SimdCsvTable:
    var inner_string: String
    var starts: DynamicVector[Int]
    var ends: DynamicVector[Int]
    var column_count: Int
    
    fn __init__(inout self, owned s: String):
        self.inner_string = s
        self.starts = DynamicVector[Int](10)
        self.ends = DynamicVector[Int](10)
        self.column_count = -1
        self.parse()
    
    @always_inline
    fn parse(inout self):
        let QUOTE = ord('"')
        let COMMA = ord(',')
        let LF = ord('\n')
        let CR = ord('\r')
        let p = DTypePointer[DType.si8](self.inner_string.buffer.data)
        let string_byte_length = len(self.inner_string)
        var in_quotes = False
        self.starts.push_back(0)
        
        @always_inline
        @parameter
        fn find_indexies[simd_width: Int](offset: Int):
            let chars = p.simd_load[simd_width](offset)
            let quotes = chars == QUOTE
            let commas = chars == COMMA
            let lfs = chars == LF
            let all_bits = quotes | commas | lfs
            
            let offsets = iota[simd_width, DType.ui8]()
            let sp: DTypePointer[DType.ui8] = stack_allocation[simd_width, UI8, simd_width]()
            compressed_store(offsets, sp,  all_bits)
            let all_len = reduce_bit_count(all_bits)
            
            let crs_ui8 = (chars == CR).cast[DType.ui8]()
            let lfs_ui8 = lfs.cast[DType.ui8]()

            for i in range(all_len):
                let index = sp.load(i).to_int()
                if quotes[index]:
                    in_quotes = not in_quotes
                    continue
                if in_quotes:
                    continue
                let current_offset = index + offset
                self.ends.push_back(current_offset - (lfs_ui8[index] * crs_ui8[index - 1]).to_int())
                self.starts.push_back(current_offset + 1)
                if self.column_count == -1 and lfs[index]:
                    self.column_count = len(self.ends)
            
        vectorize[simd_width_u8, find_indexies](string_byte_length)
        self.ends.push_back(string_byte_length)
    
    fn get(self, row: Int, column: Int) -> String:
        if column >= self.column_count:
            return ""
        let index = self.column_count * row + column
        if index >= len(self.ends):
            return ""
        return self.inner_string[self.starts[index]:self.ends[index]]

The struct closely resembles CsvTable, with the main distinction being the implementation of the parse function. To access the underlying string byte buffer with SIMD, we need to represent it as a DTypePointer[DType.si8]. This allows us to use the simd_load method, enabling SIMD operations on vectors of bytes. The find_indexies function plays a crucial role in our parsing process. With the help of the vectorize function, we can apply find_indexies to chunks of the inner_string, which are loaded as SIMD vectors. Subsequently, we create three bit-sets that indicate whether an element in the vector is equal to ", , or \n, respectively:

let chars = p.simd_load[simd_width](offset)
let quotes = chars == QUOTE
let commas = chars == COMMA
let lfs = chars == LF
let all_bits = quotes | commas | lfs

The all_bits bit-set (a SIMD vector of bools) represents the combination of the markers.

Now we need to transform a SIMD vector of bools into a list of offsets, which we can do with the following four lines:

let offsets = iota[simd_width, DType.ui8]()
let sp: DTypePointer[DType.ui8] = stack_allocation[simd_width, UI8, simd_width]()
compressed_store(offsets, sp,  all_bits)
let all_len = reduce_bit_count(all_bits)

First we use the iota function to produce a vector of increasing ints starting from zero. Side note: AFAIK the iota function has its roots in APL and is also called an index generator.

Moving forward, our objective is to “compress” the offsets using the all_bits vector. This involves removing all instances in the offsets where the corresponding values in all_bits are False. The resulting list’s size will correspond to the number of True values in the all_bits vector. We allocate memory on the stack for the compressed_store function’s result and determine its length by invoking the reduce_bit_count(all_bits) function.

Given the length and the compressed list, we can iterate over the special characters in the current text chunk and push start and end offsets if they are not in quotes.
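
The following Python sketch emulates these steps with plain lists, just to show the data flow. The helpers iota and compress are stand-ins I wrote for this illustration; they only mimic what the Mojo iota and compressed_store calls are used for here, and the chunk is made up.

```python
def iota(width):
    """Increasing integers starting from zero, like the iota call above."""
    return list(range(width))


def compress(values, mask):
    """Keep only the values whose mask entry is True."""
    return [value for value, keep in zip(values, mask) if keep]


chunk = 'ab,"c,d"\n'  # one 9-character "SIMD vector" for illustration
quotes = [c == '"' for c in chunk]
commas = [c == "," for c in chunk]
lfs = [c == "\n" for c in chunk]
all_bits = [q or c or l for q, c, l in zip(quotes, commas, lfs)]

special = compress(iota(len(chunk)), all_bits)
all_len = sum(all_bits)  # plays the role of reduce_bit_count

print(special, all_len)
```

The loop over the special characters then only has to visit the 5 offsets in special instead of all 9 characters of the chunk.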

Now it is time to do some benchmarking. I picked the same CSV files that are provided in the simdcsv repo.

For nfl.csv, I got the following result:

SIMD min parse time in nanoseconds:
3135722.0
Non SIMD min parse time in nanoseconds:
4992006.0
Difference
1.5919797737171855
SIMD bytes parsed per nanosecond
0.43519738038002093
Non SIMD bytes parsed per nanosecond
0.27336866181651226

And for EDW.TEST_CAL_DT.csv the numbers are a bit lower:

SIMD min time in nanoseconds:
2007207.0
Non SIMD min time in nanoseconds:
2446398.0
Difference
1.2188070288714616
SIMD byte per nanosecond
0.25521333873387247
Non SIMD byte per nanosecond
0.20939601814586178

It’s worth noting that at present, Mojo code can only be run through a cloud-based Jupyter Notebook. This means that the absolute performance numbers may not carry as much weight, but the relative difference between the SIMD and non-SIMD parsers should still hold true. In our analysis, we found that the SIMD parser was around 60% faster for the nfl dataset, while for the second dataset, it showed a modest improvement of about 20%.
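
For anyone who wants to retrace how the reported numbers relate to each other, here is the arithmetic for the nfl.csv run in Python. The inputs are copied from the output above; the variable names are mine.

```python
simd_ns = 3135722.0       # nfl.csv, SIMD min parse time
non_simd_ns = 4992006.0   # nfl.csv, non-SIMD min parse time
simd_bytes_per_ns = 0.43519738038002093

speedup = non_simd_ns / simd_ns
# 1 byte/ns equals 10**9 bytes/s, i.e. 1000 MB/s with MB = 10**6 bytes.
mb_per_second = simd_bytes_per_ns * 1000
# Throughput times duration gives back the approximate file size in bytes.
file_size_bytes = simd_bytes_per_ns * simd_ns

print(round(speedup, 2), round(mb_per_second), round(file_size_bytes))
```

So the roughly 60% improvement and the "400MB per second" from the subtitle both come straight from these three numbers.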

To be frank, the performance boost achieved was not as substantial as I had anticipated. However, it still represents a notable improvement, which I consider a win.

Now, let’s discuss why I had to deviate from the solution proposed by Geoff Langdale and Daniel Lemire. In Mojo, when we compare a SIMD vector of integers with an integer, we obtain a SIMD vector of booleans, which is indeed a useful feature. However, I encountered a challenge when trying to perform a shift operation on the boolean vector. In contrast, performing a shift on a bit-set represented as a 64-bit unsigned integer is a straightforward operation. This limitation hindered my ability to replicate certain aspects of the original solution.

Another obstacle I encountered was the need to check if a separator is within quotation marks, which requires the use of carry-less multiplication. Geoff Langdale wrote a blog post explaining the advantages of this approach. Unfortunately, the Mojo standard library does not currently provide access to this rather specialized operation. While it may be possible to expose carry-less multiplication through MLIR inter-op, I opted not to pursue that route for now.
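
To give an idea of what the carry-less multiplication buys you, here is a Python sketch of the underlying trick on a bit-set stored in an integer. Multiplying the quote bit-set carry-lessly by an all-ones value yields a prefix XOR, which marks every position between an opening and a closing quote. The sketch computes the same prefix XOR with shifts; prefix_xor and the sample text are my own illustration and not code from the paper or from simdcsv.

```python
def prefix_xor(bits, width=64):
    """XOR of all bits at or below each position; the low bits of clmul(bits, all-ones)."""
    mask = (1 << width) - 1
    shift = 1
    while shift < width:
        bits ^= (bits << shift) & mask
        shift *= 2
    return bits


text = 'a,"b,c",d'
quote_bits = sum(1 << i for i, c in enumerate(text) if c == '"')
comma_bits = sum(1 << i for i, c in enumerate(text) if c == ",")

inside = prefix_xor(quote_bits)     # 1 from an opening quote up to just before its closing quote
separators = comma_bits & ~inside   # commas outside of quotes

separator_offsets = [i for i in range(len(text)) if (separators >> i) & 1]
print(separator_offsets)
```

With such a mask there is no need for the in_quotes branch inside the loop, which is exactly the branching my implementation could not avoid.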

Therefore, due to these limitations in the Mojo standard library, I had to find alternative approaches to address these specific requirements in my implementation, which introduced some branching in the code.

That’s all from my side. Thank you for reading until the end. Feel free to leave a comment. Take care, and until next time!

Counting chars with SIMD in Mojo

Maxim Zaks — Thu, 18 May 2023 14:05:05 GMT

Photo by Nathaniel Shuman on Unsplash

Mojo is a very young (actually a work in progress) programming language designed and developed by a new company called Modular. Here is a blurb from their website:

Mojo combines the usability of Python with the performance of C, unlocking unparalleled programmability of AI hardware and extensibility of AI models.

So, the language is a Python superset, which should have full Python inter-op and allow developers with the proper know-how to develop efficient modules. This sounds a little bit like Cython; however, Mojo has much more ambitious goals. If you are interested in more details, browse through the FAQ compiled by the Modular team.

To enable low-level programming, Mojo introduces extra keywords and concepts to Python. For instance, it introduces a struct keyword for defining efficient value types and an fn keyword for defining efficient functions. If you want to delve into more specific information, I recommend referring to the Mojo programming manual.

At some point in the future, Mojo is planned to be released as an open-source project. However, currently, external developers can only try out the language by requesting access to a Jupyter Notebook provided by the Modular team.

Yours truly was granted access a couple of days ago, and after spending some time studying the documentation, I became curious about the low-level implementation of Strings in Mojo.

Mojo has three types to express a string:

StringLiteral a built in type
StringRef also a built in type
String a type defined in the standard library

As the name implies StringLiteral represents a string we write in the code:

let s = "hello" # s is of type StringLiteral

In Mojo, a StringRef can be created from a StringLiteral and it appears to primarily serve as a type for ABI (Application Binary Interface) inter-operation.

let s: StringRef = "hello" 
# StringLiteral "hello" is converted to StringRef and assigned to s

A StringLiteral has a data method, which lets us get raw pointer to the underlying data. However, the type of the pointer, pointer>, is quite mysterious since it’s not explained in the docs at all. At first, I thought it could be related to MLIR, because Mojo allows referencing MLIR types directly (see Low-level IR in Mojo), but even after searching, I couldn’t find any information about it on the MLIR side either. It’s definitely an intriguing aspect of Mojo’s implementation!

However, despite the mystery surrounding the pointer> type, it’s worth noting that the Mojo standard library does include a module called Pointer. This module defines a Pointer struct that can be initialized with a pointer. Therefore, after some experimentation and tinkering, I was able to write the following code:

let s = "hello"

let p = Pointer(s.data())

for i in range(len(s)):
    print(p[i])

Which returned the internal byte stream of the string:

And what do we learn from this? Mojo uses UTF-8 encoding to store strings!

Ok, but what about the actual String type from the standard library?

If we want to use the String struct from the standard library, we can simply write:

from String import String

let s: String = "hello"

In order to output the chars from a string, I figured, I should be able to write following:

from String import String

let s: String = "hello"

for c in s:
    print(c)

But this ends up in an error: ‘String’ does not implement the ‘__iter__’ method. So let's try something else:

from String import String

let s: String = "hello"

for i in range(len(s)):
    print(s[i])

And we get the characters printed:

h
e
l
l
o

That is great, but what about a string with characters which are longer than one byte?

from String import String

let s: String = "hello 🔥"

for i in range(len(s)):
    print(s[i])

While this code compiles successfully, it doesn’t produce any output in the Jupyter Notebook. It appears to result in a runtime error. Even attempting to print a specific character, such as print(s[6]), doesn't work as expected. To troubleshoot the issue, let's examine the underlying byte stream:

let s = "hello 🔥"
let p = Pointer(s.data())

for i in range(len(s)):
    print(p[i])

# returns 
# 104
# 101
# 108
# 108
# 111
# 32
# -16
# -97
# -108
# -91

This printed output highlights one small oddity, which caught my eye earlier. The byte stream is typed as si8 which is a signed 1-byte integer. I think it is more logical to type sequence of bytes as an unsigned byte integer ui8. But 🤷, that is not super important. What is important, we see that the byte sequence is 10 bytes long, if we execute print(len(s)) we also get 10 as a result. If we execute print(s[6:10]) we get 🔥 as the result.

What did we learn? In the current implementation, the String struct is a lightweight wrapper around the UTF-8 byte sequence. The length corresponds to the number of bytes, not the number of characters, and if we try to access a multi-byte character with an incorrect range, we get a runtime error.
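
For comparison, the same observations can be reproduced in Python by looking at the encoded bytes explicitly. This is only a cross-check of the byte values, not a statement about how Mojo implements String.

```python
s = "hello 🔥"
raw = s.encode("utf-8")

char_count = len(s)    # number of code points
byte_count = len(raw)  # number of UTF-8 bytes

# Reinterpret each byte as a signed 8-bit integer, like the si8 output above.
signed = [b - 256 if b > 127 else b for b in raw]

# Slicing the bytes on a character boundary decodes fine ...
fire = raw[6:10].decode("utf-8")

# ... while cutting into the middle of the 4-byte sequence does not.
try:
    raw[6:8].decode("utf-8")
    partial_ok = True
except UnicodeDecodeError:
    partial_ok = False

print(char_count, byte_count, signed, fire, partial_ok)
```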

To hone our Mojo skills, let’s create our own function that takes a StringLiteral as input and returns the number of characters. However, before we proceed, it’s essential to grasp how UTF-8 encoding works and how we can identify when multiple bytes form a single character. Fortunately, all the necessary information about UTF-8 is available in this Wikipedia article.

Looking at the table I copied from the Wikipedia article, we can observe that every Unicode character can be encoded in UTF-8 using 1 to 4 bytes. The first byte of the sequence representing a character has a special leading-bits pattern that indicates the length of the sequence. Any byte that isn’t in the first position always starts with the bits 10. To determine whether a byte represents the start of a character, we can use the following check on each byte:

(byte >> 6) != 0b10

As described above, in Mojo, we have easy access to the underlying bytes of the string literal, so we can simply loop over each byte and apply the check. If the condition (byte >> 6) != 0b10 is true, we increment the character count. While this is a straightforward solution, we can further optimize it in Mojo.
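
Before moving on to SIMD, here is what the straightforward loop looks like in Python. It only demonstrates the check itself; the function name mirrors the Mojo function shown further down.

```python
def chars_len(text):
    """Count characters by counting bytes that are not UTF-8 continuation bytes."""
    count = 0
    for byte in text.encode("utf-8"):
        if (byte >> 6) != 0b10:
            count += 1
    return count


fire_count = chars_len("hello 🔥")
mixed_count = chars_len("äö€")

print(fire_count, mixed_count)
```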

Mojo offers excellent support for SIMD (Single Instruction Multiple Data) operations, which allow us to execute a single operation on a vector of values. In our case, we need to perform a right shift and an equality comparison on a sequence of 1-byte values. With SIMD, we can group these values into SIMD vectors and carry out both operations as follows:

let p = DTypePointer[DType.si8](string_literal.data()).bitcast[DType.ui8]()
(p.simd_load[64]() >> 6) != 0b10

On the first line, we extract the data from the string literal and encapsulate it within a DTypePointer. A DTypePointer represents a pointer to DType values, which is necessary for invoking the simd_load method. This method, in turn, generates a SIMD type that enables us to execute vectorized operations.

You might be curious about the .bitcast[DType.ui8]() operation. This is required in Mojo, because the string literal data is initially typed as si8. By applying the .bitcast[DType.ui8]() operation, we rectify this issue, ensuring compatibility with the >> operator that we intend to use.

On the second line, we load 64 bytes from the pointer as a SIMD vector, shift all elements to the right by 6 bits, and then compare all elements with 0b10. The result of this comparison is a SIMD vector of booleans, which we can cast to a SIMD vector of ui8:

.cast[DType.ui8]()

Now we can sum all the elements of the SIMD vector to get the number of characters:

.reduce_add().to_int()
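To make the four steps (shift, compare, cast, sum) easier to follow, here is a lane-by-lane emulation in plain Python. It is only an illustration of the data flow. Nothing in it is vectorized, and the function name is an assumption.

```python
def count_in_chunk(chunk: bytes) -> int:
    """Emulate the SIMD pipeline one lane at a time on a single chunk."""
    shifted = [b >> 6 for b in chunk]            # shift every lane right by 6 bits
    is_start = [s != 0b10 for s in shifted]      # lane-wise comparison, gives booleans
    as_ints = [int(flag) for flag in is_start]   # cast booleans to 0 or 1
    return sum(as_ints)                          # horizontal sum, like reduce_add


print(count_in_chunk("añ🔥".encode("utf-8")))  # 3 characters stored in 7 bytes
```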

Below is my first implementation of a function that calculates the number of characters in a string literal:

fn chars_len(s: StringLiteral) -> Int:
    let p = DTypePointer[DType.si8](s.data()).bitcast[DType.ui8]()
    let l = len(s)
    var offset = 0
    var result = 0
    while l - offset >= 64:
        result += ((p.simd_load[64](offset) >> 6) != 0b10).cast[DType.ui8]().reduce_add().to_int()
        offset += 64
    while l - offset >= 32:
        result += ((p.simd_load[32](offset) >> 6) != 0b10).cast[DType.ui8]().reduce_add().to_int()
        offset += 32
    while l - offset >= 16:
        result += ((p.simd_load[16](offset) >> 6) != 0b10).cast[DType.ui8]().reduce_add().to_int()
        offset += 16
    while l - offset >= 8:
        result += ((p.simd_load[8](offset) >> 6) != 0b10).cast[DType.ui8]().reduce_add().to_int()
        offset += 8
    while l - offset >= 4:
        result += ((p.simd_load[4](offset) >> 6) != 0b10).cast[DType.ui8]().reduce_add().to_int()
        offset += 4
    while l - offset >= 2:
        result += ((p.simd_load[2](offset) >> 6) != 0b10).cast[DType.ui8]().reduce_add().to_int()
        offset += 2
    while l - offset >= 1:
        result += ((p.simd_load[1](offset) >> 6) != 0b10).cast[DType.ui8]().reduce_add().to_int()
        offset += 1
        
    return result

Oh boy, this looks like a lot of copy pasta! How come?

Well, the length of the string literal is only known at runtime. The more bytes we can consume in a large SIMD vector, the better, so we “fall through” the different sizes of SIMD vectors.
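The "fall through" can be made visible by listing which vector widths would be used for a given byte length. The Python sketch below mirrors the chain of while loops above. The function name and the returned list are illustrative and not part of the Mojo code.

```python
def chunk_plan(length: int, widths=(64, 32, 16, 8, 4, 2, 1)) -> list:
    """Return the sequence of vector widths used to consume `length` bytes."""
    plan = []
    offset = 0
    for width in widths:
        # Same condition as the Mojo loops: keep going while a full vector fits.
        while length - offset >= width:
            plan.append(width)
            offset += width
    return plan


print(chunk_plan(100))  # [64, 32, 4]
print(chunk_plan(130))  # [64, 64, 2]
```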

Is there a better way?

After the first implementation, I came up with another one, which is better in the sense that it has fewer conditions, but it needs a memcpy:

from Bit import *
from Memory import *
from Pointer import *

fn chars_len[simd_width: Int](s: StringLiteral) -> Int:
    let p = DTypePointer[DType.si8](s.data()).bitcast[DType.ui8]()
    let l = len(s)
    var offset = 0
    var result = 0
    while l - offset >= simd_width:
        result += ((p.simd_load[simd_width](offset) >> 6) != 0b10).cast[DType.ui8]().reduce_add().to_int()
        offset += simd_width
    
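    if offset < l:
        # Fewer than simd_width bytes remain: process them through a zeroed scratch buffer.
        let rest_p: DTypePointer[DType.ui8] = stack_allocation[simd_width, UI8, 1]()
        memset_zero(rest_p, simd_width)
        # Copy only the unprocessed tail; the remaining lanes stay 0.
        memcpy(rest_p, p.offset(offset), l - offset)
        result += ((rest_p.simd_load[simd_width]() >> 6) != 0b10).cast[DType.ui8]().reduce_add().to_int()
        # Each padding 0 was counted as a character start, so subtract the padding.
        result -= simd_width - (l - offset)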
    
    return result

In this implementation, we let the user provide the SIMD vector size. They should probably use the autotune feature, but if they know that the string is short, they can pass a more suitable vector size themselves.

We run the algorithm explained above on as many bytes as possible for the given vector size. Then we check whether the vector loop consumed all the bytes. If not, we allocate memory on the stack for another pass and copy the unprocessed bytes from the string literal’s underlying byte sequence to the newly allocated stack region. The rest of the rest_p region is filled with 0 bytes that do not belong to the string. Because a 0 byte passes the (byte >> 6) != 0b10 check, each padding byte is counted as a character, so we need to subtract the padding from the result:

result -= simd_width - (l - offset)
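The padding correction can be checked with a small Python model of the same control flow. This is a sketch under the assumption that the input is valid UTF-8. The names are illustrative, and a zero-extended bytes object stands in for the zeroed stack buffer.

```python
def count_with_padded_tail(data: bytes, simd_width: int) -> int:
    """Model of the second implementation: full chunks, then one zero-padded tail."""

    def count_chunk(chunk: bytes) -> int:
        return sum(1 for b in chunk if (b >> 6) != 0b10)

    length = len(data)
    offset = 0
    result = 0
    while length - offset >= simd_width:
        result += count_chunk(data[offset:offset + simd_width])
        offset += simd_width

    if offset < length:
        rest = length - offset
        # Zero-filled buffer of exactly simd_width bytes holding the tail.
        padded = data[offset:] + bytes(simd_width - rest)
        result += count_chunk(padded)
        # Every padding zero was counted as a character start; remove them.
        result -= simd_width - rest
    return result


text = "Mojo 🔥 counts ñ and € correctly"
print(count_with_padded_tail(text.encode("utf-8"), 16) == len(text))  # True
```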

And that concludes this blog post. However, I’m contemplating writing another one where we not only count the number of characters but also the number of graphemes in a string. Additionally, I plan to provide a function for safely truncating strings based on the number of bytes. If you find these topics interesting, please let me know in the comments section. Your feedback and suggestions are greatly appreciated!

Update 19th of May 2023

Every new technology needs a great community behind it and Mojo already has it!

Thanks to the community feedback, I was able to produce a third version of the code, which is very elegant, without sacrificing the performance:

from DType import DType
from Functional import vectorize
from Pointer import DTypePointer
from TargetInfo import dtype_simd_width

alias simd_width_u8 = dtype_simd_width[DType.ui8]()

fn chars_count(s: StringLiteral) -> Int:
    let p = DTypePointer[DType.si8](s.data()).bitcast[DType.ui8]()
    let string_byte_length = len(s)
    var result = 0
    
    @parameter
    fn count[simd_width: Int](offset: Int):
        result += ((p.simd_load[simd_width](offset) >> 6) != 0b10).cast[DType.ui8]().reduce_add().to_int()
    
    vectorize[simd_width_u8, count](string_byte_length)
    return result

There are a few concepts in this solution which I did not touch in my blog post before, so let me give a brief explanation.

alias simd_width_u8 = dtype_simd_width[DType.ui8]()

With the alias keyword we declare that simd_width_u8 is a constant whose value is computed at compile time. This value tells us how many entries a SIMD vector can hold for the ui8 type. It can be computed at compile time based on the target architecture and the kind of SIMD support it has. In the Notebook, this value evaluates to 64, because the system that runs the Notebook has AVX512 support.
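The value 64 follows from simple arithmetic: an AVX-512 register is 512 bits wide and a ui8 element takes 8 bits. The Python sketch below shows that calculation. Whether dtype_simd_width derives its result exactly this way is an assumption, and the function name is illustrative.

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of the given size that fit into one SIMD register."""
    return register_bits // element_bits


print(simd_lanes(512, 8))  # 64 lanes of 8-bit values in a 512-bit register
print(simd_lanes(256, 8))  # 32 lanes in a 256-bit register
print(simd_lanes(128, 8))  # 16 lanes in a 128-bit register
```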

    @parameter
    fn count[simd_width: Int](offset: Int):
        result += ((p.simd_load[simd_width](offset) >> 6) != 0b10).cast[DType.ui8]().reduce_add().to_int()

This is an inner function, which will be called with an offset to perform the count. The @parameter decorator is needed, because we are capturing the result variable. For more details please read the documentation.

vectorize[simd_width_u8, count](string_byte_length)

By calling the vectorize function, I avoid the copy and paste frenzy I introduced in the first solution. This function executes the count function for us, based on the compile-time parameters (the SIMD vector width and the function for the loop body) and the argument (the total loop count) we provide. My guess is that it does something similar to what I did manually in solution one, but reading and writing that code is no longer our burden. Which is great!
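To make that guess concrete, here is one plausible shape of such a helper, modeled in Python. It is not Mojo's actual implementation. The remainder strategy (falling back to width 1) is an assumption, and in Mojo the width is a compile-time parameter rather than a runtime argument.

```python
def vectorize(simd_width: int, body, size: int) -> None:
    """Call body(width, offset) over the range [0, size) in SIMD-sized steps."""
    offset = 0
    # Main loop: full-width chunks.
    while size - offset >= simd_width:
        body(simd_width, offset)
        offset += simd_width
    # Remainder: assumed here to be handled one element at a time.
    while offset < size:
        body(1, offset)
        offset += 1


def chars_count(data: bytes, simd_width: int = 64) -> int:
    result = 0

    def count(width: int, offset: int) -> None:
        nonlocal result  # the closure captures and updates result
        chunk = data[offset:offset + width]
        result += sum(1 for b in chunk if (b >> 6) != 0b10)

    vectorize(simd_width, count, len(data))
    return result


print(chars_count("Mojo 🔥".encode("utf-8")))  # 6
```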