Reseller News

Get started with Numba

Want faster number-crunching in Python? You can speed up your existing Python code with the Numba JIT, often with only one instruction.

Python is not the fastest language, but lack of speed hasn’t prevented it from becoming a major force in analytics, machine learning, and other disciplines that require heavy number crunching. Its straightforward syntax and general ease of use make Python a graceful front end for libraries that do all the numerical heavy lifting.

Numba, created by the folks behind the Anaconda Python distribution, takes a different approach from most Python maths-and-stats libraries. Typically, such libraries (NumPy for scientific computing, for example) put a convenient Python wrapper around high-speed maths modules written in C, C++, or Fortran. Numba instead transforms your Python code into high-speed machine language by way of a just-in-time compiler, or JIT.

There are big advantages to this approach. For one, you’re less hidebound by the metaphors and limitations of a library. You can write exactly the code you want, and have it run at machine-native speeds, often with optimisations that aren’t possible with a library. What’s more, if you want to use NumPy in conjunction with Numba, you can do that as well, and get the best of both worlds.

Installing Numba

Numba works with Python 3.6 and nearly every major hardware platform supported by Python. Linux on x86 or PowerPC, Windows, and Mac OS X 10.9 are all supported.

To install Numba in a given Python instance, just use pip as you would any other package: pip install numba. Whenever you can, though, install Numba into a virtual environment, and not in your base Python installation.

Because Numba is a product of Anaconda, it can also be installed in an Anaconda installation with the conda tool: conda install numba.

The Numba JIT decorator

The simplest way to get started with Numba is to take some numerical code that needs accelerating and wrap it with the @jit decorator.

Let’s start with some example code to speed up. Here is an implementation of the Monte Carlo method for estimating the value of pi. It is not an efficient way to compute pi, but it is a good stress test for Numba.

import random
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples
print(monte_carlo_pi(10_000_000))

On a modern machine, this Python code returns results in about four or five seconds. Not bad, but we can do far better with little effort.

import numba
import random
@numba.jit()
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples
print(monte_carlo_pi(10_000_000))

This version wraps the monte_carlo_pi() function in Numba’s jit decorator, which in turn transforms the function into machine code (or as close to machine code as Numba can get given the limitations of our code). The compiled function runs more than an order of magnitude faster.

The best part about using the @jit decorator is the simplicity. We can achieve dramatic improvements with no other changes to our code. There may be other optimisations we could make to the code, and we’ll go into some of those below, but a good deal of “pure” numerical code in Python is highly optimisable as-is.

Note that the first time the function runs, there may be a perceptible delay as the JIT fires up and compiles the function. Every subsequent call to the function, however, should execute far faster. Keep this in mind if you plan to benchmark JITted functions against their un-JITted counterparts; the first call to the JITted function will always be slower.

Numba JIT options

The easiest way to use the jit() decorator is to apply it to your function and let Numba sort out the optimisations, just as we did above. But the decorator also takes several options that control its behaviour.

nopython

If you set nopython=True in the decorator, Numba will attempt to compile the code with no dependencies on the Python runtime. This is not always possible, but the more your code consists of pure numerical manipulation, the more likely the nopython option will work. The advantage to doing this is speed, since a no-Python JITted function doesn't have to slow down to talk to the Python runtime.

parallel

Set parallel=True in the decorator, and Numba will compile your Python code to make use of parallelism via multithreading, where possible. We’ll explore this option in detail later.

nogil

With nogil=True, Numba will release the Global Interpreter Lock (GIL) when running a JIT-compiled function. This means other Python threads in your application can run at the same time as the compiled function. Note that you can’t use nogil unless your code compiles in nopython mode.

cache

Set cache=True to save the compiled binary code to the cache directory for your script (typically __pycache__). On subsequent runs, Numba will skip the compilation phase and just reload the same code as before, assuming nothing has changed. Caching can speed the startup time of the script slightly.

fastmath

When enabled with fastmath=True, the fastmath option allows some faster but less safe floating-point transformations to be used. If you have floating-point code that you are certain will not generate NaN (not a number) or inf (infinity) values, you can safely enable fastmath for extra speed where floats are used — e.g., in floating-point comparison operations.

boundscheck

When enabled with boundscheck=True, the boundscheck option will ensure array accesses do not go out of bounds and potentially crash your application. Note that this slows down array access, so it should only be used for debugging.

Types and objects in Numba

By default Numba makes a best guess, or inference, about which types of variables JIT-decorated functions will take in and return. Sometimes, however, you’ll want to explicitly specify the types for the function. The JIT decorator lets you do this:

from numba import jit, int32
@jit(int32(int32))
def plusone(x):
    return x+1

Numba’s documentation has a full list of the available types.

Note that if you want to pass a list or a set into a JITted function, you may need to convert it first. For lists, Numba provides its own List() type, found in numba.typed, to handle this properly.

Using Numba and NumPy together

Numba and NumPy are meant to be collaborators, not competitors. NumPy works well on its own, but you can also wrap NumPy code with Numba to accelerate the Python portions of it. Numba’s documentation goes into detail about which NumPy features are supported in Numba, but the vast majority of existing code should work as-is. If it doesn’t, Numba will give you feedback in the form of an error message.

Parallel processing in Numba

What good are sixteen cores if you can use only one of them at a time? Especially when dealing with numerical work, a prime scenario for parallel processing?

Numba makes it possible to efficiently parallelise work across multiple cores, and can dramatically reduce the time needed to deliver results.

To enable parallelisation on your JITted code, add the parallel=True parameter to the jit() decorator. Numba will make a best effort to determine which tasks in the function can be parallelised. If it doesn’t work, you’ll get an error message that will give some hint of why the code couldn’t be sped up.

You can also make loops explicitly parallel by using Numba’s prange function. Here is a modified version of our earlier Monte Carlo pi program:

import numba
import random
@numba.jit(parallel=True)
def monte_carlo_pi(nsamples):
    acc = 0
    for i in numba.prange(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples
print(monte_carlo_pi(10_000_000))

Note that we’ve made only two changes: adding the parallel=True parameter, and swapping out the range function in the for loop for Numba’s prange (“parallel range”) function. This last change is a signal to Numba that we want to parallelise whatever happens in that loop. The results will be faster, although the exact speedup will depend on how many cores you have available.

Numba also comes with some utility functions to generate diagnostics for how effective parallelisation is on your functions. If you’re not getting a noticeable speedup from using parallel=True, you can dump out the details of Numba’s parallelisation efforts and see what might have gone wrong.