The Saga Pattern in Elixir

Welcome to another blog post! First, an excerpt from my life:

A bike in town keeps running me over.

It’s a vicious cycle.

Alrighty, now let’s dive into today’s topic: Using the Saga Pattern in Elixir.

The Problem

If you build software long enough, you will eventually encounter use cases that require multiple steps to succeed before the overall process can be considered “complete”. If any of the steps fail, you’ll want to abort any further steps, roll back the state, and inform someone about the failure. So, either all steps succeed or none do. This is called an atomic transaction, and guaranteeing this Atomicity is harder than you’d expect.

For most use cases, wrapping the steps in a single Ecto.Multi transaction is enough to guarantee Atomicity. If a step fails, Ecto rolls back every database change made inside the transaction and returns an error. Usually, you’ll pass back the error message and let the caller handle it. However, sometimes you might encounter use cases in which you can’t wrap all steps in a single transaction. Usually, they include steps that are separated either by time or space.

For example, if a customer rents a car through your software, you might want to mark the order as “pending” until you get the confirmation from your garage that the requested car is indeed available. If the car isn’t available, you shouldn’t charge the customer’s credit card. Instead, you should inform them that they have to pick another car or cancel the order. You couldn’t keep the database transaction open until you hear back from the garage, possibly hours or days later. This is an example of a use case separated by “time”.

For another example, imagine that if a customer rents a luxury car, you reserve the car with another car rental provider through an API call because you don’t own such cars yourself. You decide to reserve the car even before you charge the customer’s credit card because you want to avoid the embarrassment of your high-profile customer not getting the car although they’ve just paid for it. But this means that you’ll need to cancel the car reservation if the payment fails. Since the reservation record lives in the other car rental provider’s database, you can’t wrap it in your transaction and have to call their API again to cancel the reservation. This is an example of a use case separated by “space”.

In both situations, you could resolve the error cases manually. Just expose a “cancel order” button in the UI which refunds all payments and cancels the reservation and let the support employee click it whenever necessary. However, we are software engineers and automating this stuff is our job. So, let’s see how we could do that using the Saga pattern.

The Solution

The Saga pattern targets exactly this issue. It is sometimes referred to as a Long-lived transaction (LLT) and was proposed in 1987 (that’s even before the World Wide Web was invented by Tim Berners-Lee in 1989!). As with all software patterns, it only offers a blueprint and not a ready-made solution you can simply copy and paste. The blueprint is basically this:

Each Saga is a sequence of transactions.
Each transaction executes a state change and triggers the next transaction.
Every transaction has a compensation which rolls back the state change of the transaction.
If a transaction fails, the Saga executes the compensations of all previous transactions to roll back the state.
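
To make the blueprint above a bit more tangible, here is a minimal sketch of a generic Saga runner. Everything in it is hypothetical: the SagaSketch module, the three-element step tuples, and the in-memory map that stands in for real state are my own assumptions for illustration, not part of any library.

```elixir
defmodule SagaSketch do
  @moduledoc "Hypothetical, minimal Saga runner; not a library API."

  # A step is {name, transaction, compensation}.
  #   transaction:  fn state -> {:ok, new_state} | {:error, reason}
  #   compensation: fn state -> state (undoes the step's state change)
  def run(steps, state), do: do_run(steps, state, [])

  # No steps left: the Saga completed.
  defp do_run([], state, _completed), do: {:ok, state}

  defp do_run([{name, transaction, compensation} | rest], state, completed) do
    case transaction.(state) do
      {:ok, new_state} ->
        # Remember the compensation. Prepending keeps the newest step first.
        do_run(rest, new_state, [compensation | completed])

      {:error, reason} ->
        # Run the compensations of all previous steps, newest first.
        rolled_back = Enum.reduce(completed, state, fn comp, acc -> comp.(acc) end)
        {:error, name, reason, rolled_back}
    end
  end
end

steps = [
  {:order, fn s -> {:ok, Map.put(s, :order, :pending)} end, fn s -> Map.delete(s, :order) end},
  {:car, fn s -> {:ok, Map.put(s, :car, :reserved)} end, fn s -> Map.delete(s, :car) end},
  {:charge, fn _s -> {:error, :card_declined} end, fn s -> s end}
]

SagaSketch.run(steps, %{})
# => {:error, :charge, :card_declined, %{}}
```

The failing :charge step triggers the compensations for :car and :order in reverse order, so the state ends up empty again. A real Saga would compensate with database writes or API calls instead of map updates.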

The Saga pattern usually comes in one of two flavours: Choreography-based and Orchestration-based. Let’s look at each.

Choreography-based

The Choreography-based Saga pattern is common if your transaction spans multiple microservices and no single microservice controls all steps of the Saga. In this approach, every transaction notifies the other microservices that it has completed, usually through events. Other microservices listen to these “completed” events and trigger their own transactions. If a transaction fails, it emits a “compensation” event, which notifies the other microservices that they need to roll back their own transactions.

For example, you might have an OrderService that handles the initial customer order, a ReservationService that reserves the luxury car with an external vendor, and a PaymentService that charges the customer’s credit card after the reservation succeeds.

Below is an example of a choreography-based Saga pattern that uses events to execute a long-lived transaction between microservices. If you don’t work with microservices, this choreography-based approach might still be interesting to you. It could also occur as function calls between different modules in your application. The essence here is that no single module owns all transactions of the Saga; instead, they are distributed across many places.

The arrows are events that are published by one service and handled by the other.

In this use-case, the OrderService publishes an OrderCreated event to which the ReservationService reacts by trying to reserve the car. If the reservation fails, the ReservationService publishes a ReservationFailed event, which triggers the OrderService to cancel the order. If the reservation succeeds, the PaymentService is notified and attempts to charge the customer. If the charge fails, both the ReservationService and the OrderService react to the PaymentFailed event and cancel the reservation and order. If the payment succeeds, the OrderService confirms the order, which completes the Saga.

Now, how would we implement this in Elixir?

In my recent blog post Resurrect your Phoenix Context, I argue against a “One module per use-case” approach and suggest that you consolidate all use-cases that “belong together” into one module instead. I think that event handlers should be treated the same unless you have hundreds of them, in which case you could group them by their “entity” (e.g. Order, Reservation, Payment).

Below is roughly how I’d implement the choreography-based Saga pattern above in Elixir. I’ll focus on the OrderService; the other services would have a similar setup.

defmodule OrderService.EventHandler do
  def handle_event(%Event{name: "ReservationFailed", data: data}) do
    OrderService.Orders.handle_reservation_failed(data)
  end

  def handle_event(%Event{name: "PaymentFailed", data: data}) do
    OrderService.Orders.handle_payment_failed(data)
  end

  def handle_event(%Event{name: "PaymentSucceeded", data: data}) do
    OrderService.Orders.handle_payment_succeeded(data)
  end
end

defmodule OrderService.Orders do
  def create_order(attrs) do
    # You'll want to make inserting the record and
    # emitting the event atomic too.
    Ecto.Multi.new()
    |> Ecto.Multi.insert(:order, Order.changeset(attrs))
    |> Ecto.Multi.run(:event, fn _, %{order: order} ->
      OrderService.Events.publish(:order_created, order)
    end)
    |> Repo.transaction()
    |> case do
      {:ok, %{order: order}} -> {:ok, order}
      {:error, _operation, error, _changes} -> {:error, error}
    end
  end

  def handle_reservation_failed(%{order_id: order_id, reason: reason}) do
    with {:ok, order} <- get_order(order_id),
         {:ok, order} <- delete_order(order) do
      send_order_cancelled_email(order, reason)
    end
  end

  # I omitted the other handlers because they look quite similar.
end

This is a very rough implementation. The main point here is that you won’t have one function or module that handles the entire Saga, but rather multiple independent functions that each handle only a small part of the Saga. They all need to work together in a distributed manner to complete the use case.
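
If you want to see how such independent handlers add up to one Saga, a toy in-memory simulation can help. The sketch below is an assumption-heavy illustration: the ChoreographySketch module, the event tuples, and the car_available and card_valid options are all made up, and a real system would use a message broker instead of a recursive function.

```elixir
defmodule ChoreographySketch do
  @moduledoc "Toy in-memory event bus; all names are hypothetical."

  # Each service reacts to an event by returning the events it publishes.
  # `opts` lets us force failures so that we can watch the compensations.
  def reservation_service({:order_created, id}, opts) do
    if opts[:car_available] do
      [{:reservation_succeeded, id}]
    else
      [{:reservation_failed, id}]
    end
  end

  def reservation_service({:payment_failed, id}, _opts), do: [{:reservation_cancelled, id}]
  def reservation_service(_event, _opts), do: []

  def payment_service({:reservation_succeeded, id}, opts) do
    if opts[:card_valid] do
      [{:payment_succeeded, id}]
    else
      [{:payment_failed, id}]
    end
  end

  def payment_service(_event, _opts), do: []

  def order_service({:reservation_failed, id}, _opts), do: [{:order_cancelled, id}]
  def order_service({:payment_failed, id}, _opts), do: [{:order_cancelled, id}]
  def order_service({:payment_succeeded, id}, _opts), do: [{:order_confirmed, id}]
  def order_service(_event, _opts), do: []

  # Delivers every event to every service until no new events appear.
  # Returns the event log in delivery order.
  def run(order_id, opts), do: deliver([{:order_created, order_id}], opts, [])

  defp deliver([], _opts, log), do: Enum.reverse(log)

  defp deliver([event | queue], opts, log) do
    published =
      reservation_service(event, opts) ++
        payment_service(event, opts) ++
        order_service(event, opts)

    deliver(queue ++ published, opts, [event | log])
  end
end

ChoreographySketch.run(1, car_available: true, card_valid: false)
# => [{:order_created, 1}, {:reservation_succeeded, 1}, {:payment_failed, 1},
#     {:reservation_cancelled, 1}, {:order_cancelled, 1}]
```

Note that no function knows the whole flow. The Saga only emerges from the event log, which is exactly why this flavour needs good documentation.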

You also won’t have dedicated “transaction” and “compensation” actions, but rather generic functions that look alike. The Saga remains rather abstract in the choreography-based Saga pattern, and it’s up to you to document it well enough so that it becomes visible to other developers too. This is what the orchestration-based approach does better.

Orchestration-based

The orchestration-based Saga pattern consolidates all transactions and compensations into a central orchestrator instead of distributing them over multiple services or functions like the choreography-based approach does. This approach makes it easy to understand all facets of the Saga and allows for exhaustive testing. However, the orchestrator is also a single point of failure. Your Saga won’t complete if the server that runs it crashes or becomes unavailable.

This is a simplified sequence diagram for executing the “Renting a luxury Car” Saga using an Orchestrator:

This is how you could implement the Orchestrator using Elixir:

defmodule MyApp.Orders do
  # This repository keeps all CRUD functions
  # for the Order schema.
  alias MyApp.Orders.OrderRepo

  alias MyApp.Payments
  alias MyApp.Reservations
  alias MyApp.UserNotifier

  def create_order(attrs) do
    with {:ok, order} <- OrderRepo.create_order(attrs),
         :ok <- handle_order_created(order) do
      {:ok, order}
    end
  end

  defp handle_order_created(order) do
    # We tag each step so that the `else` clauses below
    # can tell which transaction failed.
    with {:car, {:ok, car}} <- {:car, Reservations.reserve_car(order)},
         # We enqueue a Background Job that charges the customer.
         # This is an example of a Saga that is time-separated
         # in which we can't execute all transactions at once
         # and need to wait for some to finish before we proceed.
         {:charge, :ok} <- {:charge, Payments.enqueue_charge_order_job(order, car)} do
      :ok
    else
      # These error clauses are our "compensations" and
      # they should roll back any previous state change.
      {:car, error} ->
        {:ok, _order} = OrderRepo.delete_order(order)
        error

      {:charge, error} ->
        :ok = Reservations.cancel_car(order)
        {:ok, _order} = OrderRepo.delete_order(order)
        error
    end
  end

  # This is the last transaction of the Saga which
  # is called by Payments when the "Charge Customer"
  # Background Job succeeds. You see that a Saga might
  # need to start and stop between transactions.
  def handle_payment_succeeded(order_id) do
    with {:ok, order} <- OrderRepo.get_order(order_id),
         {:ok, order} <- OrderRepo.update_order(order, %{status: :confirmed}) do
      UserNotifier.send_order_confirmed_email(order)
    end
  end

  # This is the compensation for the "Charge customer"
  # transaction. It needs to roll back the previous transactions
  # and notify someone that the order failed.
  def handle_payment_failed(order_id) do
    with {:ok, order} <- OrderRepo.get_order(order_id),
         :ok <- Reservations.cancel_car(order),
         {:ok, _order} <- OrderRepo.delete_order(order) do
      UserNotifier.send_order_failed_email(order)
    end
  end
end

Our orchestrator consolidates all transactions and compensations for the Saga in a single place. It might not be a single function, but a few related functions which handle different steps of the Saga. In our example above, the Saga is time-separated because we schedule a background job for charging our customer instead of charging them right away. This shows us that even if a Saga relies on synchronous function calls, it might not be able to execute all functions at once but rather has to start and stop in between steps.

One note of caution: Be extra careful when you design the compensations. Compensations must “clean up” your state, so they must not fail. Otherwise, your system will remain in an inconsistent state. That’s why you should make sure that your compensations succeed as close to 100% of the time as possible. This means they should have retry mechanisms and exhaustive error handling to ensure that they succeed eventually. And if they don’t succeed, they should at least fail with a big bang that alerts you and your team and allows you to clean up the state manually.
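
As a rough idea of what “retry, then fail loudly” could look like, here is a small sketch. The CompensationRetry module and its arguments are hypothetical, and I assume that a compensation returns either :ok or {:error, reason}. In a real application, you’d probably replace the message on stderr with your logging or alerting tool of choice.

```elixir
defmodule CompensationRetry do
  @moduledoc "Hypothetical retry helper for compensations; a sketch only."

  # Runs `fun` until it returns :ok or the attempts are used up.
  # `fun` must return :ok or {:error, reason}. Exceptions count as failures.
  def run(fun, attempts \\ 3, delay_ms \\ 100) when attempts > 0 do
    result =
      try do
        fun.()
      rescue
        exception -> {:error, exception}
      end

    case result do
      :ok ->
        :ok

      {:error, _reason} when attempts > 1 ->
        # Wait a moment, then try again with one attempt fewer.
        Process.sleep(delay_ms)
        run(fun, attempts - 1, delay_ms)

      {:error, reason} ->
        # Out of attempts: fail loudly so that a human can clean up.
        IO.puts(:stderr, "Compensation failed permanently: #{inspect(reason)}")
        {:error, {:compensation_failed, reason}}
    end
  end
end
```

You could then wrap a compensation like this: CompensationRetry.run(fn -> Reservations.cancel_car(order) end). The caller still has to decide what to do with the final {:error, {:compensation_failed, reason}}, for example by flagging the order for manual review.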

If our Saga can execute without having to wait too long, we could implement it as a short-lived transaction instead.

Short-lived Transactions

Although the Saga pattern is also called a long-lived transaction, you can use it for short-lived transactions as well, even if your Saga fits into a single database transaction. As long as you use the transaction-compensation pairing with rollbacks in the event of failures, you have a Saga pattern. Here’s an example of such a Saga:

defmodule MyApp.Orders do
  alias MyApp.{Reservations, Payments}

  def create_order(attrs) do
    Ecto.Multi.new()
    |> Ecto.Multi.insert(:order, Order.changeset(attrs))
    |> Ecto.Multi.run(:car, fn _, %{order: order} ->
      Reservations.reserve_car(order)
    end)
    |> Ecto.Multi.run(:charge, fn _, %{order: order, car: car} ->
      Payments.charge_order(order, car)
    end)
    |> Repo.transaction()
    |> case do
      {:ok, %{order: order}} ->
        {:ok, order}

      # These error handlers are the "compensations"
      # of the previous "transactions" of the Saga.

      {:error, :order, error, _changes} ->
        # No need to delete the order.
        # Ecto rolled back the db transaction.
        {:error, error}

      {:error, :car, error, _changes} ->
        # No need to delete the order.
        # Ecto will roll back the insertion for us.
        {:error, error}

      {:error, :charge, error, %{car: car}} ->
        # This compensation needs a retry mechanism
        # and exception handling. Otherwise, the compensation
        # might not clean up the reservation properly.
        :ok = Reservations.cancel_car(car)
        {:error, error}
    end
  end
end

Sage library

The Sage library offers an attractive framework for writing Sagas in Elixir. It comes with important features like transaction retries and exception handling. I haven’t used it in production yet, but it might be a good option if you have very complex and critical Sagas. Learn more about Sage in this excellent blog post.

Synchronous vs Asynchronous Communication in Sagas

In both the choreography-based and orchestration-based Saga patterns, you can trigger transactions either synchronously or asynchronously. You can either execute each transaction as a function or API call, or use an event-based system to trigger and return the result of each transaction.

The advantage of synchronous transactions is that they involve less code and are easier to test and reason about. However, your Saga might fail for reasons unrelated to its logic, for example, if another microservice or an external API is unavailable. In these situations, you might want to retry after a little while instead of rolling back the entire Saga.

An asynchronous, event-based Saga offers such a retry mechanism and even simple exception handling. If a transaction fails, our event handler will try to deliver the event again, possibly with an exponential backoff. This makes the overall Saga more resilient. But an event system requires much more code, and the Saga becomes a distributed procedure with code in many places.
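
The redelivery policy of such an event handler can be quite small. Below is a sketch of one possible policy, assuming a base delay of one second, a cap of one minute, and five attempts. The RedeliverySketch module and all of its numbers are made up for illustration, and Integer.pow/2 needs Elixir 1.12 or later.

```elixir
defmodule RedeliverySketch do
  @moduledoc "Hypothetical redelivery policy for failed events."

  @base_ms 1_000
  @max_ms 60_000
  @max_attempts 5

  # Delay before redelivery number `attempt` (1-based):
  # 1s, 2s, 4s, 8s, ... capped at @max_ms.
  def backoff_ms(attempt) when attempt >= 1 do
    min(@base_ms * Integer.pow(2, attempt - 1), @max_ms)
  end

  # Decides what to do with an event whose handler just failed
  # for the `attempt`-th time.
  def next_action(attempt) when attempt < @max_attempts do
    {:retry_in, backoff_ms(attempt)}
  end

  def next_action(_attempt), do: :give_up
end

RedeliverySketch.next_action(3)
# => {:retry_in, 4000}
```

Once the policy returns :give_up, the Saga can’t move forward anymore and should run its compensations.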

Which approach is best for you depends on your situation. I suggest that you opt for the synchronous, orchestration-based Saga pattern whenever you can, given its simplicity and the centralisation of business logic. However, if you have an event system in place already, the choreography-based or asynchronous approach might offer valuable benefits without much additional work.

Conclusion

And that’s it! I hope you enjoyed this article! If you want to support me, you can buy my book or video courses (one and two). Follow me on BlueSky or subscribe to my newsletter below if you want to get notified when I publish the next blog post. Until next time! Cheerio 👋