state machines -> async/await -> durable workflows -> 'agents'
Log of personal notes about async/await and durable workflows and state machines, and how they are all ways to model asynchronous execution. Not very readable, I need to make it more coherent.
Async/await is a rotation of the state machine idea
- The primary problem you want to solve when writing concurrent code is keeping state consistent across functions/classes/objects.
- You could solve it by modeling each “entity” in your code as a state machine - define all the possible states, and transitions between them. Something similar to XState.
- But that's too tedious. It may be more precise, but we like writing imperative code. Most often, there are only a few states that we care about, and making them all explicit is overkill.
- Async/await solves it by defining specific points in your code where you can pause execution and resume it later. So, effectively (assuming the process doesn't crash), each await point is a "state", and the code running between two await points is a transition. Like an atomic transaction in a database, it isn't interrupted by other concurrently running code.
- The language compiler/interpreter manages this state machine for you, so you don't have to.
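As a sketch of that idea (all names are made up for illustration), here is the same two-step flow written first as an explicit hand-rolled state machine, then with async/await, where the runtime tracks the state for you:

```python
import asyncio

# Explicit version: every state and transition spelled out by hand.
class Download:
    def __init__(self):
        self.state = "start"

    def step(self, event):
        if self.state == "start" and event == "connected":
            self.state = "fetching"
        elif self.state == "fetching" and event == "body_received":
            self.state = "done"

# Async version: each await point *is* a state; the compiler keeps track.
async def download(connect, fetch):
    await connect()   # suspended in state "start" until connected
    await fetch()     # suspended in state "fetching" until the body arrives
    return "done"

m = Download()
m.step("connected")
m.step("body_received")
assert m.state == "done"

async def noop():
    pass

assert asyncio.run(download(noop, noop)) == "done"
```

The two are equivalent, but the async version reads top to bottom, which is the whole appeal.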
Durable workflows are persistent async/await
- Async/await is a way to model a single entity’s state and transitions. However, if your process crashes, or you don’t want to keep the machine running in memory, you need to persist the state somewhere.
- This is typically solved using databases and queues. You store the state in a database and return from your event handler once some work is done. Any incoming events which would progress the state are stored in a queue, and a separate process picks them up and resumes the workflow.
- I don't mean a separate queue dependency like Kafka or RabbitMQ, just an abstraction that logs function call requests. For a simple Python FastAPI server, it'd be uvicorn.
- While explicit state is useful, for example when modeling a long-term business process, you can often do without it. Async/await solves the same problem for ephemeral processes! We just need to make sure that at the next run, execution begins at the last suspend point.
- Durable execution solves this. Assume your workflow is modeled with an async function.
- You keep a persistent log, say in a database, recording the result every time an await point is encountered in the function.
- Whenever you suspend the process, the function raises an exception, so its execution is terminated.
- When running the function again, it runs the full code again. However, it doesn’t evaluate the await points already present in the log, and instead just reads their results from the log.
- Effectively, you get resume behavior without keeping the process in memory the whole time.
- However, this introduces the drawback that your function execution must be deterministic. If you change the code between two runs, the function won’t be able to read the log correctly.
Event loops and durable execution
- The universe is an event loop. All processes are going on concurrently, there are just some synchronization points when two entities interact.
Regular apps decouple the event loop for user events and state management
- When writing, say, a SaaS application, I am trying to model my specific business/software domain in code.
- The regular way of making apps combines a stateless api, and a database.
- This puts the two processes (user events and state management) in separate coroutines, not running under the same event loop. They interact through events (user actions).
- However, this splits a user's journey through your application across these two "event loops". That's why a user's "journey", as mapped out when thinking about the product, doesn't translate directly to your code, which is split across database schemas and API logic.
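For concreteness, a toy version of that split (hypothetical handler names; a dict stands in for the database):

```python
DB = {}  # stand-in for a database table keyed by user id

def handle_signup(user_id: str, email: str) -> None:
    # Handler 1: writes state and returns; the process then forgets everything.
    DB[user_id] = {"email": email, "status": "awaiting_confirmation"}

def handle_confirm(user_id: str) -> None:
    # Handler 2: must reload state and re-check where the journey left off.
    row = DB[user_id]
    if row["status"] != "awaiting_confirmation":
        raise ValueError("unexpected state")
    row["status"] = "active"

handle_signup("u1", "a@example.com")
handle_confirm("u1")
assert DB["u1"]["status"] == "active"
```

The "journey" (sign up, then confirm) only exists implicitly, reconstructed from status checks scattered across handlers.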
Durable workflows combine them
- Durable execution combines your state and application logic in a coroutine.
- When you pull the object state in a durable workflow, you are creating a single coroutine to manage your object.
- Workflows make reasoning about the object’s lifecycle easier since actions and state are colocated.
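A sketch of the same toy signup journey as a single coroutine (hypothetical names; `asyncio.Event` stands in for a durable suspend point):

```python
import asyncio

async def signup_workflow(email: str, confirmed: asyncio.Event) -> str:
    status = "awaiting_confirmation"  # state lives in the coroutine frame
    await confirmed.wait()            # suspend until the "confirm" event arrives
    status = "active"
    return status

async def main() -> str:
    confirmed = asyncio.Event()
    task = asyncio.create_task(signup_workflow("a@example.com", confirmed))
    confirmed.set()                   # the user clicks "confirm"
    return await task

assert asyncio.run(main()) == "active"
```

The whole journey reads top to bottom in one function; no status column, no scattered checks.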
But this makes state migration difficult
- When your workflow instance owns the state, different instances might be running at different stages in the flow. For workflows to be durable, their execution history can't change, so you cannot make code updates that might interfere with it.
- Hence, migrations are harder.
- In comparison, migrations are much easier with traditional apps: you can pause the app, update the state and app code separately, and restart.
Virtual objects (or State machines) as a middle ground
- By using durable handlers in a virtual object, you ensure the “object” is always in a consistent state. Same as a state machine.
- So your workflows just ensure consistent transitions and actual state doesn’t live in them.
- However, as your workflows get complex, state gets richer, you can’t keep everything inside a single virtual object. So, you start to get concurrent objects, hierarchical objects, etc.
- So, in the limit, you end up with database tables + stateless methods again. However, this code style feels simpler to evolve over time than starting with a database schema?
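A rough sketch of the virtual object idea (my own toy version, not any engine's actual API): state keyed by object id, with handlers serialized per key so nothing ever observes a half-finished transition:

```python
import asyncio

class Counter:
    # One lock and one state entry per object key.
    _locks: dict[str, asyncio.Lock] = {}
    _state: dict[str, int] = {}

    @classmethod
    async def add(cls, key: str, n: int) -> int:
        lock = cls._locks.setdefault(key, asyncio.Lock())
        async with lock:  # handlers run one at a time per key
            cls._state[key] = cls._state.get(key, 0) + n
            return cls._state[key]

async def main() -> int:
    # Ten concurrent increments against the same object key.
    await asyncio.gather(*(Counter.add("c1", 1) for _ in range(10)))
    return Counter._state["c1"]

print(asyncio.run(main()))  # → 10
```

A real engine would additionally persist `_state` and route calls across machines, but the consistency guarantee is the same: serialized handlers per object.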
Virtual objects are a good way to model AI agents
- We want to model AI agents similar to external humans/entities that you can interact with multiple times, over long time periods.
- Frameworks like Langgraph model them as explicit state machines. Virtual objects would be a less explicit but more friendly way to model them.
- Your state transition code would look more linear.
- It also feels like a nicer target when generating code using an LLM. User journeys can be mapped directly to workflows/virtual objects.
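As a sketch, an agent modeled this way is just a long-lived coroutine that awaits messages in a loop (illustrative names; the echo reply stands in for an LLM call):

```python
import asyncio

async def agent(inbox: asyncio.Queue) -> list[str]:
    history: list[str] = []             # the agent's state, kept in its frame
    while True:
        msg = await inbox.get()         # suspend until the next user message
        if msg is None:                 # a sentinel ends the conversation
            return history
        history.append(f"echo: {msg}")  # stand-in for an LLM call over history

async def main() -> list[str]:
    inbox: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(agent(inbox))
    for msg in ["hi", "bye", None]:
        await inbox.put(msg)
    return await task

assert asyncio.run(main()) == ["echo: hi", "echo: bye"]
```

With durable execution underneath, that loop could span days between messages without the process staying alive.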
You should follow Sunil Pai, who's all about modeling AI agents using Cloudflare's Durable Objects.
Durable Objects are the same abstraction as virtual objects in workflow engines (or actors in other frameworks), but with serverless compute baked in. With a typical workflow engine, you deploy the handler functions separately and make them accessible for the engine to call. With Durable Objects, Cloudflare manages this for you, leveraging their Workers platform.
It is pretty nice, but my brain works in Python. While Cloudflare Workers does support running Python (Durable Objects probably will soon), they use a modded runtime (compiling Python to WebAssembly using Pyodide). This means third-party dependencies might not work out of the box. I don't know, I am not fond of it; third-party dependencies working out of the box is the entire USP of Python.
To be continued
Sources
- State machines. Probably middle school course using GW-BASIC and hand-drawn flowcharts. A bunch more during undergrad working with digital circuit projects, but I never really wrote code until graduation.
- Notes on structured concurrency by Nathaniel Smith. I think that's where I first discovered that async/await is a thing, and exactly how concurrency is different from parallelism. 2019 or thereabouts.
- Temporal framework docs. Around 2020, when it first came out, I found out that durable execution is a thing. Lots of code written in Golang, so I couldn't really figure out how it worked. Still, the mental model that all business processes can be approximated as workflows was a revelation.
- Async Rust - 2023. Async/Await is a way to model state machines. Funny enough, I haven’t even written non-trivial rust.
- Restate blog in 2024. I finally learnt to see all execution as a durable event log combined with function calls. Also, the virtual object abstraction, which is an easier translation from state machines.
- DBOS in 2024. Durable execution at its core is just putting rows in a workflows table and an events table.