Back to list of talks

Why Turing.jl? A limited comparison of probabilistic programming frameworks

This talk was given at the Learn Bayes seminar series at Karolinska Institutet on 15 May 2025. The title that we publicly used was 'The Design of Turing.jl', mainly because the above title was too long :).

I later gave a reprise at The Alan Turing Institute's probabilistic programming study group on 4 June 2025.

For personal reasons, the recording of this talk is not publicly available. (If you are at the ATI, you can access the study group recording.) Thus I have endeavoured to provide a full text writeup of the talk here.

As the title suggests, this talk involves some comparison of Turing.jl with other PPLs, specifically Stan and PyMC. However, the main focus of the talk is really about the unique features of Turing.jl and how they arise from the design of the library and other technical aspects of Julia.

Reproducibility

The Julia examples in this page were last tested with

  [47edcb42] ADTypes v1.16.0
  [366bfd00] DynamicPPL v0.36.15
  [7da242da] Enzyme v0.13.66
  [6fdf6af0] LogDensityProblems v2.1.2
⌃ [234d2aa0] MicroCanonicalHMC v0.1.6
  [fce5fe82] Turing v0.39.10

Table of contents

'Eight schools' in various PPLs
Comparing PPL syntaxes
Comparing PPL speeds
The central thesis of this talk
(Some) unique Turing.jl features
What's difficult about this?
Choosing a PPL
How you can support your favourite PPL

'Eight schools' in various PPLs

Back to top

The 'eight schools' problem (Rubin, 1981) is a classic example of a Bayesian model (specifically, a hierarchical model). In this model, students at eight different schools were given coaching for the SAT examination, and the average effect on their scores (y) was measured for each school, along with the standard deviation (sigma).

The TensorFlow Probability website has more detailed information on the model. Here we provide only a short overview:
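
In brief: each school has its own true effect theta[i], and the observed effect y[i] is a noisy measurement of it. The theta[i] are in turn drawn from a shared distribution with mean mu and standard deviation tau:

theta[i] ~ Normal(mu, tau),            for i = 1, ..., J
y[i] ~ Normal(theta[i], sigma[i])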

In this way we allow each of the schools to have a different effect (theta[i]), while also sharing information across schools through the common parameters mu and tau.

When taking a Bayesian approach to this modelling, we need to set some kind of prior for the parameters mu and tau. Here, we choose fairly non-informative priors: mu is given a normal distribution with a large standard deviation of 5, and tau follows a Cauchy distribution, but truncated to be non-negative (i.e., it cannot be less than 0).

Putting all of this together, we get a Turing.jl model:

To run this...

Launch Julia with julia --project=. and then in the Julia REPL, enter ]. This will cause the REPL to enter package-manager mode, indicated by a prompt of pkg>. Then type add Turing to install the Turing.jl package.

Once that is done, press Backspace to re-enter the REPL mode, which has a prompt that looks like julia>. You can then run the commands in this file.

using Turing

J = 8
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

@model function eight_schools_centered(J, y, sigma)
    mu ~ Normal(0, 5)
    tau ~ truncated(Cauchy(0, 5); lower=0)
    theta = Vector{Float64}(undef, J)
    for i in 1:J
        theta[i] ~ Normal(mu, tau)
        y[i] ~ Normal(theta[i], sigma[i])
    end
end

model_esc = eight_schools_centered(J, y, sigma)

chain = sample(model_esc, NUTS(), 1000; num_warmup=10000, thinning=10)

The exact same model can be written in Stan as follows (with credit to PosteriorDB):

data {
  int<lower=0> J; // number of schools
  array[J] real y; // estimated treatment
  array[J] real<lower=0> sigma; // std of estimated effect
}
parameters {
  array[J] real theta; // treatment effect in school j
  real mu; // hyper-parameter of mean
  real<lower=0> tau; // hyper-parameter of sdv
}
model {
  tau ~ cauchy(0, 5); // a non-informative prior
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
  mu ~ normal(0, 5);
}

To sample from this model, you'll have to go via Python, R, or Julia. Here is a Python script that executes the model above.

To run this...

You will need to have the Stan model and this Python script saved in the same directory. The Stan model should be named eight_schools_centered.stan.

Then create a virtual environment with the cmdstanpy package installed. Inside this virtual environment, you can run the script with python eight_schools_centered_stan.py.

from cmdstanpy import CmdStanModel, install_cmdstan
from pathlib import Path
import time

install_cmdstan()

DATA = {
    "y": [28, 8, -3, 7, -1, 1, 18, 12],
    "sigma": [15, 10, 16, 11, 9, 11, 10, 18],
    "J": 8,
}

def main():
    stan_file = Path(__file__).parent / "eight_schools_centered.stan"
    model = CmdStanModel(stan_file=stan_file)
    x = time.time()
    fit = model.sample(data=DATA, chains=1,
                       iter_warmup=10000, save_warmup=False,
                       iter_sampling=10000, thin=10)
    y = time.time()
    print(fit.summary())
    print(f"Time taken: {y - x} seconds")


if __name__ == "__main__":
    main()

Finally, we'll look at the same model in PyMC.

To run this...

You will need a virtual environment with the pymc package installed.

import pymc as pm
import numpy as np
import time

J = 8
y = np.array([28, 8, -3, 7, -1, 1, 18, 12])
sigma = np.array([15, 10, 16, 11, 9, 11, 10, 18])

def main():
    with pm.Model() as eight_schools_centered:
        mu = pm.Normal("mu", 0, 5)
        tau = pm.HalfCauchy("tau", 5)
        theta = pm.Normal("theta", mu, tau, shape=J)
        obs = pm.Normal("obs", theta, sigma, observed=y)

        start = time.time()
        trace = pm.sample(draws=10000, tune=10000, chains=1)
        end = time.time()

    print(f"took {end - start} seconds")

if __name__ == "__main__":
    main()

Julia's other PPLs

Julia also has other PPLs apart from Turing.jl, such as Gen.jl and RxInfer.jl. (I've probably missed out about 5 more.)

Unfortunately I can't seem to access the Gen.jl website, so I shall refrain from fully commenting on it. My current impression (which seems supported by the paper) is that it contains a huge library of tools for programmable inference, i.e., operations that tend to be common in Bayesian inference. In some ways this is more principled, and it allows for an extreme level of customisation over inference algorithms. Turing, on the other hand, is much more like a 'framework' where you import a few functions, write a model, and sample, with less fuss over exactly how you're doing it. (Note that Stan takes this to an extreme: the model is simply a function that returns a number, and everything else stems from that.)

RxInfer.jl takes a rather different approach to modelling than Turing: it constructs a graph of all the variables in the model, which places some restrictions on what a model can contain. In contrast, Turing is more general-purpose, and Turing models are genuine functions which can contain arbitrary Julia code. In exchange for those restrictions, RxInfer can exploit the richer graph structure to perform inference more efficiently, for example when variables are conjugate.

Comparing PPL syntaxes

Back to top

Between the three PPLs, Turing.jl arguably has the 'nicest' syntax: writing a Turing.jl model is almost like writing the mathematical equations that define the model. PyMC has a fairly similar syntax although it requires a little bit more boilerplate (for example, parameter names get specified twice, once as a variable to assign to and once as a string for PyMC to keep track of).

Notably, both PyMC and Turing allow you to define parameters 'as you go along'. This is rather like a dynamically typed programming language. On the other hand, Stan requires you to separately define a list of parameters and their sizes before they can be used in the model. This in turn resembles a statically typed programming language.

It would not be unfair to say that Stan is more verbose. However, much like how both dynamic and static languages exist, there are benefits to being explicit sometimes.

For example, it makes it very clear which quantities are parameters and which ones are observed. In Turing.jl, data variables are specified in one of two ways: either by listing them as model arguments, or via an explicit conditioning syntax. However, there can be some ambiguity over this. For example, the following model (GitHub issue) does not have well-defined behaviour:

using Turing

@model function f(x)
    y = x + 1
    y ~ Normal()
end

sample(f(1.0), NUTS(), 1000)

It appears here that x is a data variable, since it is a model argument; y would then be entirely derived from data. However, if you actually perform the sampling, you will find that y is treated as a parameter.
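
For reference, here is a minimal sketch of the explicit conditioning syntax mentioned above, in which data is attached to a model after it is constructed rather than passed in as an argument (the | operator and condition are provided via DynamicPPL):

using Turing

@model function g()
    a ~ Normal(0, 1)
    b ~ Normal(a, 1)
end

# Attach observed data for b; a remains a parameter to be sampled.
conditioned_model = g() | (; b = 1.5)
# Equivalently: conditioned_model = condition(g(); b = 1.5)
chain = sample(conditioned_model, NUTS(), 1000)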

Comparing PPL speeds

Back to top

Speed is one of the topics which I would prefer not to go into. It always garners a lot of attention, and for good reason: people would like to get their results faster! However, any speed comparison is necessarily model-dependent.

For this specific model, we have that Stan > Turing > PyMC in terms of speed (that is, Stan is the fastest). You can verify this by running the code examples above yourself (the Stan and PyMC scripts include timing code; for Turing, you can wrap the sample call in @time). On my current laptop, I get: 0.7 seconds for Stan, 1.5 seconds for Turing, and 6 seconds for PyMC. However, the exact numbers can depend quite heavily on the model being tested as well as how it is implemented. For example, the Turing code can be sped up by using MvNormal instead of looping over several Normals. I also don't pretend that I have given 100% optimised versions for the other PPLs.
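
To illustrate the MvNormal point, here is an unbenchmarked sketch of how the loop over univariate Normals could be replaced with multivariate normal distributions (J, y, and sigma are as defined above):

using Turing, LinearAlgebra

@model function eight_schools_centered_mv(J, y, sigma)
    mu ~ Normal(0, 5)
    tau ~ truncated(Cauchy(0, 5); lower=0)
    # A single J-dimensional draw instead of J univariate draws.
    theta ~ MvNormal(fill(mu, J), tau^2 * I)
    # Independent observations with per-school standard deviations.
    y ~ MvNormal(theta, Diagonal(sigma .^ 2))
end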

Note that the first time you sample from a Turing model, it will be quite slow. This is mainly because of Julia's just-in-time compilation model: code is not compiled until it is called. (This is the origin of one of the most common complaints about Julia, the 'time to first X': it takes a long time to run anything for the first time.)

Stan has a similar 'feature', as the Stan model has to be compiled before it can be executed. In fact, you may well have noticed similar behaviour with the PyMC example, even though CPython is not a just-in-time implementation: this is again down to code compilation, since PyMC's backend compiles the model before sampling begins.

The central thesis of this talk

Back to top

The main difference that I want to highlight, though, is that Turing.jl has a very thin domain-specific layer. That is to say, the amount of Turing.jl-specific code that a Turing user writes is quite low, essentially limited to @model and ~.

For example, in a Turing model, we have to specify a number of different distributions such as Normal(mean, sd). These distributions are not defined in Turing.jl, but form part of a much more general package, Distributions.jl.
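
For instance, the Normal used above is an ordinary Distributions.jl object that works perfectly well outside of any model:

using Distributions

d = Normal(0.0, 5.0)   # the same object you would write inside a Turing model
rand(d)                # draw a random sample from it
logpdf(d, 1.0)         # evaluate its log probability density at 1.0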

In contrast, PyMC's distributions don't exist as distributions in their own right: they have a 'special' meaning in that instantiating them adds them to the model. That is why you have to import them from pymc. To obtain an object that behaves like a 'distribution', you need to call (for example) pm.Normal.dist(mean, sd).

Stan, of course, exists as a closed ecosystem: its distributions are defined directly in the Stan codebase and you cannot add new distributions from outside. (Indeed you do not really define new distributions per se; this can instead be accomplished by adding custom probability terms.)

The consequence of this is twofold: on the one hand, it unlocks a number of features that are unique to Turing.jl (the subject of the next section); on the other, it brings some maintenance challenges, which are discussed later in 'What's difficult about this?'.

(Some) unique Turing.jl features

Back to top

The bulk of this section focuses on some things that Turing.jl can do precisely because the DSL is quite thin.

External samplers

Back to top

Much like adding a custom distribution, to add a custom sampler, you only need to implement an interface that is defined outside of Turing.jl. Specifically, any MCMC sampler needs to subtype AbstractMCMC.AbstractSampler, and define two AbstractMCMC.step methods that, respectively, define how to generate the first sample, and how to generate the N-th sample given the previous one.

Caveat

This is actually not fully true, as of Turing v0.39. Although the main requirement is to fulfil the AbstractMCMC interface, there are some additional interface details that only come from Turing.jl, most notably Turing.Inference.getparams: see the Turing.jl documentation for more information. These are likely to be removed in the near future.
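
To give a flavour of the interface, here is a minimal sketch of a random-walk Metropolis sampler written against AbstractMCMC. The names RWMH and RWMHState are made up for this example, and (per the caveat above) this alone is not quite enough to plug into externalsampler:

using AbstractMCMC, LogDensityProblems, Random

# A hypothetical sampler type: a fixed-stepsize random-walk Metropolis.
struct RWMH <: AbstractMCMC.AbstractSampler
    stepsize::Float64
    dim::Int
end

# The sampler's state: the current position and its log density.
struct RWMHState
    x::Vector{Float64}
    logp::Float64
end

# How to generate the first sample (there is no previous state).
function AbstractMCMC.step(
    rng::Random.AbstractRNG, model::AbstractMCMC.LogDensityModel, spl::RWMH; kwargs...
)
    x = randn(rng, spl.dim)
    logp = LogDensityProblems.logdensity(model.logdensity, x)
    return x, RWMHState(x, logp)
end

# How to generate the next sample, given the previous state.
function AbstractMCMC.step(
    rng::Random.AbstractRNG, model::AbstractMCMC.LogDensityModel, spl::RWMH,
    state::RWMHState; kwargs...
)
    proposal = state.x .+ spl.stepsize .* randn(rng, spl.dim)
    logp = LogDensityProblems.logdensity(model.logdensity, proposal)
    # Metropolis accept/reject.
    if log(rand(rng)) < logp - state.logp
        state = RWMHState(proposal, logp)
    end
    return state.x, state
end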

The relative ease of defining samplers makes Turing.jl a rich ground for implementing new inference techniques. For example, the microcanonical Hamiltonian Monte Carlo (MCHMC) sampler is implemented in the MicroCanonicalHMC.jl package. Using it in Turing.jl is as simple as wrapping it in a call to externalsampler():

To run this...

Go back to the same directory that you ran the Julia code in. If you run julia --project=. it will activate the same environment as previously.

As before, you will need to install a dependency: press ], then enter add MicroCanonicalHMC, then press Backspace. You can then run all of the code below in the REPL.

using Turing, MicroCanonicalHMC

J = 8
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

@model function eight_schools_centered(J, y, sigma)
    mu ~ Normal(0, 5)
    tau ~ truncated(Cauchy(0, 5); lower=0)
    theta = Vector{Float64}(undef, J)
    for i in 1:J
        theta[i] ~ Normal(mu, tau)
        y[i] ~ Normal(theta[i], sigma[i])
    end
end

model_esc = eight_schools_centered(J, y, sigma)
sample(model_esc, NUTS(), 5000)

using MicroCanonicalHMC
mchmc_sampler = externalsampler(MCHMC(2000, 0.001))
sample(model_esc, mchmc_sampler, 5000)

Here is an example of the MCHMC sampler being used in conjunction with Turing.jl models to perform inference on astrophysical data.

Submodels

Back to top

Because Turing.jl models are essentially pure Julia code (with the exception of ~ statements), code inside models can be refactored and moved around much as one would expect with ordinary functions.

For example, consider an alternative ('non-centered') parameterisation of the eight-schools problem. Instead of writing theta[i] ~ Normal(mu, tau), we instead draw theta_trans[i] ~ Normal(0, 1), then multiply it by tau and add mu to obtain theta. This is an entirely equivalent model, but it leads to better sampling, especially when tau is small.

If you run this code example, you should find that NUTS gives much better sampling results, especially for tau (you can see this from the effective sample size, which is calculated by ess()).

using Turing

J = 8
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

@model function eight_schools_noncentered(J, y, sigma)
    mu ~ Normal(0, 5)
    tau ~ truncated(Cauchy(0, 5); lower=0)
    # We have to change the name of this vector
    theta_trans = Vector{Float64}(undef, J)
    for i in 1:J
        # This was our original model:
        # theta[i] ~ Normal(mu, tau)
        # We changed it to:
        theta_trans[i] ~ Normal(0, 1)
        theta_i = theta_trans[i] * tau + mu
        # This likelihood term remains the same.
        y[i] ~ Normal(theta_i, sigma[i])
    end
end

model_esnc = eight_schools_noncentered(J, y, sigma)

chain = sample(model_esnc, NUTS(), 1000; num_warmup=10000, thinning=10)
ess(chain)

However, writing out both this parameterisation and the centred one results in some repetition. For example, suppose we wanted to change the priors on mu and tau, and check whether there was still a difference between the two parameterisations: we'd have to change the priors in both models, which is not ideal.

To avoid this, Turing.jl provides a to_submodel function that allows you to define a submodel that can be used in multiple models. This is very similar to how one would use a function in ordinary Julia code: you would extract out common sections into a single function. This code example demonstrates how, and you can also check out the documentation page:

using Turing

J = 8
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

@model function priors()
    mu ~ Normal(0, 5)
    tau ~ truncated(Cauchy(0, 5); lower=0)
    return (mu=mu, tau=tau)
end

@model function eight_schools_centered(J, y, sigma)
    p ~ to_submodel(priors())
    theta = Vector{Float64}(undef, J)
    for i in 1:J
        theta[i] ~ Normal(p.mu, p.tau)
        y[i] ~ Normal(theta[i], sigma[i])
    end
end

@model function eight_schools_noncentered(J, y, sigma)
    p ~ to_submodel(priors())
    theta_trans = Vector{Float64}(undef, J)
    for i in 1:J
        theta_trans[i] ~ Normal(0, 1)
        theta = theta_trans[i] * p.tau + p.mu
        y[i] ~ Normal(theta, sigma[i])
    end
end

Notice how the priors are now defined only in the priors() model. In order for later models to 'see' the values of those parameters, we need to make sure to return them at the end of that submodel. Then, when we call p ~ to_submodel(priors()), we can access that return value as p.

Caveat

The submodel interface may change in the future. In particular, we would like you to not have to explicitly return the parameters. See this issue for more information.

Nested submodels also work as expected: this allows you to build up complex models from simple building blocks. This approach can be used, for example, in infectious disease modelling. In my opinion, the usefulness of this strategy has not been fully explored (nor has the design been fully optimised, as mentioned in the note above). It is a long-term aim of mine to build something similar to the excellent brms package using submodels.
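
As a small sketch of what nesting can look like (the model names here are made up for illustration), a submodel can itself contain a to_submodel call:

using Turing

@model function scale_prior()
    tau ~ truncated(Cauchy(0, 5); lower=0)
    return tau
end

@model function location_scale_prior()
    mu ~ Normal(0, 5)
    # A submodel used inside another submodel.
    tau ~ to_submodel(scale_prior())
    return (mu=mu, tau=tau)
end

@model function toy_model(y)
    p ~ to_submodel(location_scale_prior())
    y ~ Normal(p.mu, p.tau)
end

sample(toy_model(1.0), NUTS(), 1000)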

Automatic differentiation backends

Back to top

Gradient-based samplers, such as Hamiltonian Monte Carlo and its variants (like NUTS), require you to be able to evaluate the derivative of the model log probability density with respect to its parameters. This is usually accomplished using 'automatic differentiation', which does not calculate symbolic derivatives but rather provides you with the derivative of a function at a given point.

There are a number of different approaches to automatic differentiation, and the Julia ecosystem has quite a few AD packages, with varying levels of maturity. True to its spirit, Turing.jl does not define its own AD backend, but rather makes use of... all of them! This is made possible by the ADTypes.jl and DifferentiationInterface.jl packages, which provide a unified interface for interacting with AD backends.

For example, Enzyme is one of the most modern AD backends available in Julia. It works by differentiating code at the LLVM level, which generally leads to extremely high performance. Using Enzyme with Turing only requires one to add an adtype argument to the sampler:

Enzyme compatibility

Historically, perfect compatibility between Enzyme and Turing has not been guaranteed. As of the time of writing, though, Turing v0.39 (and DynamicPPL v0.36) do largely work with Enzyme! The most up-to-date status can be seen on the ADTests page.

If you run into any issues using Turing with Enzyme, please do feel free to post an issue. We can't guarantee that it will be fixed but we'll do our best (within reason).

using Turing, ADTypes
import Enzyme: set_runtime_activity, Reverse

J = 8
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

@model function eight_schools_noncentered(J, y, sigma)
    mu ~ Normal(0, 5)
    tau ~ truncated(Cauchy(0, 5); lower=0)
    theta_trans = Vector{Float64}(undef, J)
    for i in 1:J
        theta_trans[i] ~ Normal(0, 1)
        theta = theta_trans[i] * tau + mu
        y[i] ~ Normal(theta, sigma[i])
    end
end

model = eight_schools_noncentered(J, y, sigma)

# ForwardDiff.jl is the default AD backend in Turing.jl.
# If you don't specify an AD backend, it will do the same as this.
forwarddiff = AutoForwardDiff()
chain = sample(model, NUTS(; adtype=forwarddiff), 2000)

# You can switch AD backends by passing a different adtype argument.
enzyme_reverse = AutoEnzyme(; mode=set_runtime_activity(Reverse, true))
chain = sample(model, NUTS(; adtype=enzyme_reverse), 2000)

What about Mooncake?

Mooncake is a wonderful AD package! It works very nicely with most Turing models, and the code quality is second to none.

However, Mooncake's development has been quite closely tied to Turing's, at least in terms of the people working on it. In a traditional academic talk one might therefore like to highlight it. But I am not an academic, and my aim here is not to advertise my work but rather to make a point about the composability of the Julia ecosystem. I think using Enzyme as an example makes for an even stronger case.

The benefit of this is customisability and control. Different AD backends have different coverage of Julia features and performance characteristics, and this allows you to choose the one that works best for you.

But how do you choose an AD backend? To help with this, we publish a table of performance benchmarks for AD backends on various Turing models. Thus, you can see which of these your model might be similar to, and extrapolate from there. Turing's documentation also provides some general tips, and if you want to get more hands-on, it also describes how to benchmark AD backends yourself, so that you can tailor the choice to your specific model.

Manual derivatives

If you have analytic derivatives, you can specify this in a rather hacky way by overloading the LogDensityProblems.logdensity_and_gradient method. (Right now it is not possible to use DifferentiationInterface.jl for this.)

In this model, there is a single parameter which is normally distributed. So the log probability density is (const - x^2/2), and the gradient with respect to this parameter is -x. Because we take vectors as inputs and outputs, we need to index and wrap it back in a vector in the following implementation.

using Turing
import DynamicPPL, LogDensityProblems

@model f() = x ~ Normal()

function LogDensityProblems.logdensity_and_gradient(
    ldf::DynamicPPL.LogDensityFunction{<:DynamicPPL.Model{typeof(f)}},
    x::AbstractVector
)
    # This function must return the log probability density as the first
    # argument and the gradient as the second argument.
    # You could manually calculate the first argument too, but here we 
    # just defer to the existing logdensity function.
    return LogDensityProblems.logdensity(ldf, x), [-x[1]]
end

sample(f(), NUTS(), 1000)

All the other stuff

Back to top

There are a ton of other integrations here which I don't have time to go into, like differential equations, neural networks, optimisation, Gaussian processes, MCMC tempering, and variational inference (including Pathfinder). The selection of topics covered here is not intended to imply that any of these are less important. Sorry if your favourite topic was not mentioned.

What's difficult about this?

Back to top

So far I've presented a number of things that Turing.jl can do, by virtue of being a single package that fits into a larger Julia ecosystem. One might be prompted to ask if there is any downside to this.

I think that the numerical computing ecosystem in Julia is quite unique and that this modularity (distributions defined in Distributions, etc.) is a particular strength of it. This does, however, also mean that there are a lot of moving parts and it's very possible for changes in one package to unwittingly break other packages. It also means that it's not clear whose responsibility it is to fix bugs or to run tests when some interaction between two packages goes wrong.

Stan is a closed ecosystem, so if there's a bug in Stan's AD code, then the Stan developers know that it's their job to fix it. If a Turing model doesn't work with Enzyme, it's not immediately clear whose problem it is. So far things have worked out quite nicely: we try to create as minimal a working example as possible, usually one that does not involve Turing-specific code; then the Enzyme developers can fix the underlying bugs. This is quite a time-consuming process though (you can see an example of me doing this here) and it does somewhat rely on good faith on both sides to make it work.

At the same time, it's a huge relief (to me) that I don't need to understand deeply how to write my own AD backend in order to maintain Turing.jl. In a way, the fact that Turing.jl can be developed by a relatively small developer team is only possible because we mostly only need to worry about the actual probabilistic programming bits. Of course, we have at least a passing knowledge of all the things we interact with, but we don't need to be experts.

Another problem is that I've talked a lot about satisfying some interface, but interfaces in Julia are practically nonexistent. The only way to specify an interface is via documentation, and this has obvious problems: firstly things are often undocumented, and secondly even if they are, it's very easy for them to get out of date. (Note that Turing is hardly innocent in this regard: some recent bugs were introduced because the external sampler interface was not documented properly.)

In general, the only way to tell if you have satisfied an undocumented interface is to test out the code and see if it works. Of course, this is a hugely unreliable way to develop software. There's a very nice blog post by Invenia describing this problem, along with some strategies for overcoming it. I've long wanted to do this a bit better in Turing, though it's a long road ahead.

Julia's multiple dispatch system, along with an (inexplicable) allergy towards modules, can also make bugs quite frustrating to track down: things may unexpectedly fail (or unexpectedly work correctly!) because of some function definition or import 5 files away.

Choosing a PPL

Back to top

With all of this said, we should probably return to the question of which PPL is 'the best'. This is in itself quite a loaded question. I hope that I've convinced you that Turing.jl has some very cool features, but I don't believe that that alone means everybody should use Turing. Rather, I think it's important to understand the differences between the PPLs.

In my opinion, one of the biggest strengths of Turing is that you can write plain Julia code in your model, coupled with an extreme level of customisability. Thus, if you really want to control what goes on, Turing is a very good choice, especially if you are experimenting with new inference techniques. (In many ways this is really a strength of Julia, which Turing inherits by virtue of being a Julia package.)

Not every user wants to control their model in this way, though. If you simply want top-notch performance and you don't care about how your AD is being performed, then Stan provides a perfectly good solution with its in-house (reverse-mode) AD. Stan's documentation, and the number of existing (and up-to-date!) tutorials out there, far exceed what is available for Turing. In many ways Turing's documentation assumes that you already know what Bayesian modelling is and teaches you how to do it in Turing.jl, which can be quite unapproachable for beginners.

There are also a huge number of reasons why one might choose a PPL over another that have absolutely nothing to do with technical design choices. A big one is, of course, the language it is written in. I think it is not an uncommon experience in the Julia community to advertise a library to someone else, only to get the response 'but then I'd have to use Julia'. If you want to write your modelling code in pure(ish) Python, then PyMC lets you do this; and if you don't mind calling out to an external Stan model, everything else can stay in Python, letting you combine modelling with (for example) data processing without having to engineer a multi-language codebase.

How you can support your favourite PPL

Back to top

Regardless of which PPL you choose to use, I would love to encourage you to support it (and open-source software)!

  1. Use it: Open-source software generally does not pay bills. I'm very lucky to be able to work full-time on Turing.jl (for now), but at the very least, we need to demonstrate that people use what we write.
  2. Cite papers: Turing.jl is primarily an academic project so this really helps. You can find a list of papers in the Turing.jl README.
  3. Get in touch: In order for us to understand what the most important issues are, we do need to have bidirectional communication with the community. Please don't hesitate to open an issue even if it's just to discuss something: we're quite friendly (I think). I've been trying to put together an (approximately...) fortnightly newsletter for Turing.jl, which is posted on Slack, GitHub, and our website.
  4. Contribute code (?!): If you would like to get involved, we would love to have you on board as well. It can be quite difficult to navigate the internals of Turing but I am definitely more than happy to help guide new contributors. Of course, everyone is busy, so we hardly expect this. But if you do think that you might want to have a job in the future writing software, it's very impressive to put on your CV that you've contributed to a big open-source project :)