The Julia language is a very powerful parallel computing model that works across multiple cores and cluster nodes.

Parallel Julia: Part 2

In a previous article, I introduced the Julia language. Julia is designed for HPC, which makes it different and more exciting than most other programming languages. Julia is, among other things, open source, fast, scalable, easy to learn, and extensible. It fills a void in HPC that allows users to “tinker” with hardware and software. In this installment, I’ll look at some hands-on experience with Julia.

As stated, Julia is new, which means some aspects of the language are still developing. The extremely good documentation is worth consulting if you want to explore further. Because Julia is somewhat of a moving target, it is best to pull the latest release from the web. Currently, Julia builds support:

  • GNU/Linux: x86/64 (64-bit); x86 (32-bit).
  • Darwin/OS X: x86/64 (64-bit); x86 (32-bit).
  • FreeBSD: x86/64 (64-bit); x86 (32-bit).

The following examples were built and run on a Limulus personal cluster running Scientific Linux 6.2 on an Intel i5-2400S with 4GB of memory. If you don’t want to bother building Julia but would like to play with it on the web, check out the on-line Julia version (do not enter Your Name or Session Name, just click the Try Julia Now bar.)

To build Julia locally, move to a working directory with at least 2GB of available file space and enter:

$ git clone git://

After the download completes, you will have a new Julia directory. To build Julia, move to the Julia directory, enter make, and go grab a cup of coffee, head out to lunch, or walk the dog. The Julia build takes a while and will pull down any needed packages. It also uses LLVM, an open modular compiler framework as a back end for the language. When done building, you should find a Julia binary in your working directory. To start the Julia interpreter, enter ./julia:


   _       _ _(_)_     |
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  A fresh approach to technical computing
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.0.0+86921303.rc6cb
 _/ |\__'_|_|_|\__'_|  |  Commit c6cbcd11c8 (2012-05-25 00:27:29)
|__/                   |


If you don't want to see the title on subsequent start ups use julia -q to get the command line. It is also a good idea to set your PATH to include the new binary. Like many popular interactive tools, you can enter expressions like:

julia> sqrt(2*7)+(6/4)

As I mentioned before, both a user manual and library reference are available online, so you can explore more of the Julia language using these resources. Because Julia is pointed squarely at the HPC crowd and because parallel computing is an integral part of technical computing, I’ll jump right into how Julia expresses parallelism.

Diving Into the Deep End

Before I begin, however, let me provide my standard “MPI is still great” disclaimer. Higher level languages often try to hide the details of low-level parallel communication. With this “feature” comes some loss of efficiency, similar to writing Fortran instead of low-level machine code. The trade-off is often acceptable because the gains in convenience outweigh the loss of efficiency. Not to worry, MPI is here to stay and it is still great. Just remember higher level languages that hide low-level MPI coding let more people play in the HPC game.

As stated in the Julia manual, Julia provides a simple one-sided messaging model:

Julia's implementation of message passing is different from other environments such as MPI. Communication in Julia is generally “one-sided,” meaning that the programmer needs to explicitly manage only one processor in a two-processor operation. Furthermore, these operations typically do not look like “message send” and“message receive” but rather resemble higher level operations like calls to user functions.

The authors also state that Julia provides two built in primitives:

remote references and remote calls. A remote reference is an object that can be used from any processor to refer to an object stored on a particular processor. A remote call is a request by one processor to call a certain function on certain arguments on another (possibly the same) processor.

Before I explore parallel computation, I need to start Julia on a multicore machine with two processors with:

Julia -q -p 2

I will look at dynamically adding processors in the next section, but for now, I will use two cores on the same machine. It also makes sense not to oversubscribe the number of cores on your machine (i.e., the -p argument should not exceed the number of cores on your machine). In the example below, I use remote_call to add two numbers on another processor. The first argument is the processor number, the second is the function, and the remaining arguments are the arguments to the function (in this case, I am adding 2 + 2). Then I fetch the result. Continuing, I can also use a remote call to operate on previous results.

julia> r = remote_call(2, +, 2, 2)

Julia> fetch(r)

julia> s = remote_call(2, +, 1, r)

julia> fetch(s)

Remote calls return immediately and do not wait for the task to complete. The processor that made the call proceeds to its next operation while the remote call happens somewhere else. A fetch() call will wait until the result is available, however. Also, you can wait for a remote call to finish by issuing a wait() on its remote reference.

In the above example, the need to supply processor numbers is not very portable, and as such, these primitives are not used by most programmers. A Julia macro called @spawn removes this dependency. For example:

julia> r = @spawn 7-1

julia> fetch(r)

As I will explain below, a @parallel macro also is very helpful with loops. Because Julia is interactive, it is also important to keep in mind that it is possible to define and enter functions from the command line that are not automatically copied to remote cores (on the local or remote nodes). To make functions available on all processing cores, the load function must be used. This function will load Julia programs on all known cores associated with the current Julia instance. Alternatively, Julia will load the file startup.jl (if it exists) in your home directory.

Calling All Cores

Julia can use cores on the local machine and on remote machines (remote nodes will almost always be cluster nodes). The following is brief description of how to add cores dynamically to your Julia instance. First, I will start Julia with one core. The nprocs() function will report the number of cores available to the current instance. For example:

$ julia -q
julia> nprocs()1

If I want to add two local cores, I can use the addprocs_local() function as follows:

julia> addprocs_local(1)
ProcessGroup(1,{LocalProcess(), Worker("",9009,4,IOStream(),IOStream(),{}, 
{},2,false)},{Location("",0), Location("",9009)},2,{(1,0)=>WorkItem(bottom_func,(),false,

In this case, I added one core. Note that nprocs() now will show a total of two cores:

julia> nprocs()

Adding remote cores (those on other machines) can be done in two ways. The first is to add the nodes explicitly with the addprocs_ssh() function. In the example below, I am adding nodes named n0 and n2. It also should be noted that using remote nodes assumes that Julia has been installed in the same location on each node or is available via a shared file system. Make sure the PATH variable points to your Julia binary on the remote nodes. Again, note the number of processors (cores) has been increased – to four.

julia> addprocs_ssh({"n0","n2"})
ProcessGroup(1,{LocalProcess(), Worker("",9009,4,IOStream(),IOStream(),{},
{},2,false)  ...  },{Location("",0), Location("",9009)  ... Location("",9009)},4,
{(1,0)=>WorkItem(bottom_func,(),false,(thunk(AST(lambda({},{{#1, #2}, {{#1, Any, 2}, {#2, Any, 2}}, {}},
  #1 = top(Array)(top(Any),2)
  #2 = #1
  return addprocs_ssh(#2)

julia> nprocs()

To verify that the remotes nodes are indeed involved with the computation, I can devise a simple parallel loop using the @parallel macro. I also use the Julia run function to run hostname on each node. For example:

julia> @parallel for i=1:4

julia> limulus

By selecting i=1:4, I expect Julia to use all the available cores, and indeed, two local cores and the two remote nodes report in with their name (Note: the local host machine is named “limulus”). The nodes are used in a round-robin fashion, and if the loop count is increased, Julia will cycle through the available resources again.

julia> @parallel for i=1:8

julia> limulus

To get a better feel for parallel computation, I can run the example from the Julia documentation. First, see how it works on one node. The following program will generate a random bit (0 or1) and sum the result (the + represents a parallel reduction). The tic function starts a timer, and toc reports the elapsed time.

julia> tic();
       nheads = @parallel (+) for i=1:100000000
       println("Number of Heads: $nheads in $s seconds")
elapsed time: 11.234276056289673 seconds
Number of Heads: 50003873 in 11.234276056289673 seconds

The loop took 11.23 seconds to complete using one core. Next, if I run the exact same loop using two local and two remote nodes, I get 5.67 seconds. Finally, if I restart Julia with four local cores, I get 3.71 seconds. The speed-ups certainly indicate Julia is running in parallel (you can also observe the julia process with top on worker nodes). Note that the simple parallel loop required no information about the number or location of cores. This feature is extremely powerful because it allows applications to be written on a small number of cores (or a singe core) and then run in parallel when more cores are available. Although such a method does not guarantee efficiency, removing the parallel bookkeeping from the programmer is always helpful. Many other useful parallel functions are worth exploring, including things like myid(), which gives a unique processor number that can be used for identification.

The final way to add cores is through Grid Engine using the addprocs_sge() function. I assume other schedulers like Torque will be supported in the future. This feature, while seemingly mundane, can be very useful. It allows programs to scale dynamically over a whole cluster or cloud. Programs could be constructed to seek resources as needed and, if available, use them. Consider a user running a dynamic Julia program from their notebook or workstation. The program could request cores from a cluster or cloud; if the resources are not available, the request could be withdrawn or the user’s program could wait until the resources become available. Although this functionally is still under development, the parallel computing component of Julia is written in Julia, so improving and experimenting with the code is quite simple because there is no need to dig down into the code and modify a low-level routine. As noted, because Julia is new, some features are still under development, like removing cores or better error recovery.

Another interesting scenario for Julia is use on a desk-side HPC resource. For instance, the Limulus system I am using here is a fully functioning four-node cluster with one head/login node that is powered all the time (just like a workstation). When extra resources are needed, the user or the batch scheduler can power up nodes at any time. With the ability to request resources dynamically, a Julia program could request cores through the batch scheduler, which activates the needed nodes and runs the desired program. The user never has to think about nodes, batch queues, or administrative issues. The Julia program will manage itself and allow the user to focus more on the problem at hand rather than the details of parallel computing.

When Ever

Taking away the responsibility of managing explicit communication and synchronization from the user has advantages. How work is scheduled is now virtually transparent to the user. With Julia, multiple parallel computations are managed as tasks or co-routines. Whenever code performs a communication operation like fetch or wait, the current task is suspended and a scheduler picks another task to run. When the wait event completes (e.g., the data shows up), the task is restarted. This design has the advantage of a dynamic environment where explicit synchronization is not the responsibility of the user. In addition, dynamic scheduling allows easy implementation of master/worker divide-and-conquer algorithms.

I have only touched a small part of Julia’s parallel capability, and I hope this partial introduction has given you a feel for the power of Julia’s parallel computing model. Remember, all this goodness comes with blazingly fast execution, so creating parallel applications or algorithms is not just an academic exercise. In my next installment, I will discuss distributed arrays and other useful HPC features. In the meantime, many other aspects of Julia exist with which you can tinker to your heart’s delight.

Reminder: Check Out The CDP

In a previous article, I mentioned a new initiative called The Cluster Documentation Project (CDP) that is designed to document and publish High Performance Computing (HPC) Cluster information in a professional manner. If you have any interest in contributing financially or otherwise, please visit the CDP information page – Thanks!