BSP Variables

Registering, putting and getting

If we want to write more interesting EBSP programs, we need a way to communicate between the different Epiphany cores. In EBSP, communication happens in one of two ways: using message passing, which we will introduce later, or via registered variables. An EBSP variable exists on every processor, but does not necessarily have the same size on every Epiphany core.

Variable registration

We register a variable by calling bsp_push_reg:

int a = 0;
bsp_push_reg(&a, sizeof(int));
bsp_sync();

Here we declare an integer a, and initialize it with zero. Next we register the variable with the BSP system, by passing its local location, and its size.

To ensure that all cores have registered a variable, we perform a barrier synchronisation after the registration. Each Epiphany core halts execution until every other core has reached this point in the program, which synchronizes program execution between the cores. Only one variable may be registered between calls to bsp_sync!

Putting and getting values

Registered variables can be written to or be read from by other cores. In BSP this is referred to as putting something in a variable, or getting the value of a variable. To write for example our processor ID to the next core we can write:

int b = s;
bsp_put((s + 1) % p, &b, &a, 0, sizeof(int));
bsp_sync();

Let us explain this code line by line. As in the Hello World example, here we define s and p to be the processor id and the number of processors respectively. In our call to bsp_put we pass the following arguments (in order):

  1. The pid of the target (i.e. the receiving) processor.
  2. A pointer to the source data that we want to copy to the target processor.
  3. A pointer representing a registered variable. Note that this pointer refers to the registered variable on the sending processor; the EBSP system uses the registration to identify the corresponding variable on the target processor, so that it knows which remote address to write to.
  4. The offset (in bytes) from which we want to start writing at the target processor.
  5. The number of bytes to copy.

Before we want to use a communicated value on the target processor, we need to again perform a barrier synchronisation by calling bsp_sync. This ensures that all the outstanding communication gets resolved. After the call to bsp_sync returns, we can use the result of bsp_put on the target processor. The code between two consecutive calls to bsp_sync is called a superstep.
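To make the timing of a buffered put concrete, the following is a minimal host-side sketch in plain C that simulates the behaviour described above. It does not use the EBSP library: NCORES, sim_put, sim_sync, sim_ring_shift and the queue layout are hypothetical names chosen for illustration, and the real implementation may well differ.

```c
#include <assert.h>
#include <string.h>

#define NCORES 4          /* hypothetical core count for the simulation */
#define MAX_PENDING 16    /* hypothetical queue capacity */

/* One simulated registered int per core (the variable `a` from the text). */
static int sim_a[NCORES];

/* A queued put: the payload is copied at the moment the put is issued. */
struct pending_put {
    int target;
    int offset;           /* byte offset into the target variable */
    int nbytes;
    unsigned char payload[sizeof(int)];
};

static struct pending_put queue[MAX_PENDING];
static int queue_len = 0;

/* Simulated buffered put: snapshot the source now, deliver later. */
static void sim_put(int target, const void *src, int offset, int nbytes)
{
    assert(queue_len < MAX_PENDING);
    assert(offset >= 0 && nbytes >= 0 && offset + nbytes <= (int)sizeof(int));
    struct pending_put *p = &queue[queue_len++];
    p->target = target;
    p->offset = offset;
    p->nbytes = nbytes;
    memcpy(p->payload, src, (size_t)nbytes);
}

/* Simulated sync: deliver every queued put, then empty the queue. */
static void sim_sync(void)
{
    for (int i = 0; i < queue_len; ++i) {
        const struct pending_put *p = &queue[i];
        memcpy((unsigned char *)&sim_a[p->target] + p->offset,
               p->payload, (size_t)p->nbytes);
    }
    queue_len = 0;
}

/* Every simulated core s puts its id into `a` on core (s + 1) % NCORES. */
static void sim_ring_shift(void)
{
    for (int s = 0; s < NCORES; ++s) {
        int b = s;
        sim_put((s + 1) % NCORES, &b, 0, (int)sizeof(int));
        b = -1;   /* overwriting the source afterwards does not affect delivery */
        (void)b;
    }
    /* Nothing has been delivered yet: sim_a is still all zeros here. */
    sim_sync();
    /* Only now does sim_a[s] hold the id of the previous core. */
}
```

After calling sim_ring_shift(), sim_a[s] holds (s + NCORES - 1) % NCORES for every s, which mirrors the rule that the result of a put may only be used after the synchronisation.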

When we receive the data, we can for example write the result to the standard output. Below we give the complete program, which uses bsp_put to communicate with another processor. Here, and in the remainder of this post, we only show the code between the calls to bsp_begin and bsp_end; the other code is identical to the code in the Hello World example:

int s = bsp_pid();
int p = bsp_nprocs();

int a = 0;
bsp_push_reg(&a, sizeof(int));
bsp_sync();

int b = s;
bsp_put((s + 1) % p, &b, &a, 0, sizeof(int));
bsp_sync();

ebsp_message("received: %i", a);

This results in the following output:

$01: received: 0
$02: received: 1
$07: received: 6
$00: received: 15
...

Here we have suppressed the output from the other cores. As we can see, each core correctly receives the processor id of the previous core!

An alternative way of communication is getting the value of a registered variable from a remote core. The syntax is very similar:

a = s;
int b = 0;
bsp_get((s + 1) % p, &a, 0, &b, sizeof(int));
bsp_sync();

The arguments for bsp_get are:

  1. The pid of the processor we want to get the value from.
  2. The pointer representing a registered variable.
  3. The offset (in bytes) at the remote processor from which we want to start reading.
  4. A pointer to the local destination.
  5. The number of bytes to copy.

And again, we perform a barrier synchronisation to ensure the data has been transferred. If you are familiar with concurrent programming, then you might think we are at risk of a race condition! What if processor s reaches the bsp_get statement before processor (s + 1) % p has set the value of a equal to its processor id? Do we then obtain zero? In this case, we do not have to worry: no data transfer is initiated until each core has reached bsp_sync. Indeed we receive the correct output:

$01: received: 2
$03: received: 4
$11: received: 12
$14: received: 15
...
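The reason the buffered get is free of this race can be sketched with a small host-side simulation in plain C. This is not EBSP code: NCORES, sim_get, sim_sync and sim_get_from_next are hypothetical names, and the loop order simply models one unlucky interleaving in which most cores issue their get before the core they read from has assigned a.

```c
#include <assert.h>

#define NCORES 4   /* hypothetical core count for the simulation */

static int reg_a[NCORES];    /* the registered variable `a`, one per core */
static int local_b[NCORES];  /* the local destination `b`, one per core */

/* A get only records where to read from; no data moves yet. */
struct pending_get {
    int source_core;
    int dest_core;
};

static struct pending_get pending[NCORES];
static int n_pending = 0;

/* Simulated buffered get: queue the request. */
static void sim_get(int source_core, int dest_core)
{
    assert(n_pending < NCORES);
    pending[n_pending].source_core = source_core;
    pending[n_pending].dest_core = dest_core;
    ++n_pending;
}

/* Simulated sync: all cores have arrived, so resolve every queued get. */
static void sim_sync(void)
{
    for (int i = 0; i < n_pending; ++i)
        local_b[pending[i].dest_core] = reg_a[pending[i].source_core];
    n_pending = 0;
}

/* Core s sets a = s and requests `a` from core (s + 1) % NCORES. */
static void sim_get_from_next(void)
{
    for (int s = 0; s < NCORES; ++s) {
        reg_a[s] = s;
        /* For s < NCORES - 1 the source core has not assigned its `a` yet. */
        sim_get((s + 1) % NCORES, s);
    }
    sim_sync();   /* the values are read only here */
}
```

Because the read happens inside sim_sync(), every simulated core ends up with local_b[s] equal to (s + 1) % NCORES, regardless of the order in which the gets were issued.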

Unbuffered communication

So far we have discussed writing to, and reading from variables using bsp_put and bsp_get. These two functions are buffered. When calling bsp_put, for example, the source value at the time of the function call is guaranteed to be sent to the target processor, but it does not get sent until the next barrier synchronisation. Behind the scenes, the EBSP library therefore stores a copy of the data. The BSP standard was originally designed for distributed memory systems with very high latency, in which this design makes a lot of sense. On the Epiphany platform this gives a lot of unnecessary overhead since data is copied to external memory.

This problem is not unique to the Epiphany platform, however. Together with MulticoreBSP, which targets modern multicore processors, two additional BSP primitives were introduced that provide unbuffered variable communication: bsp_hpput and bsp_hpget. Here the hp... prefix stands for high performance.

However, although their function signatures are completely identical, these are not meant as drop-in replacements for bsp_put and bsp_get. They are unsafe in the sense that data transfer happens at once. This means that when using these functions you should be aware of possible race conditions, which are notorious for causing mistakes that are very hard to debug.

To facilitate writing code using only unbuffered communication we introduce an ebsp_barrier function that performs a barrier synchronisation without transferring any outstanding communication that has arisen from calls to bsp_put and bsp_get. Let us look at an example program using these unbuffered variants:

int s = bsp_pid();
int p = bsp_nprocs();

int a = 0;
bsp_push_reg(&a, sizeof(int));
bsp_sync();

int b = s;
// barrier ensures b has been written to on each core
ebsp_barrier();

bsp_hpput((s + 1) % p, &b, &a, 0, sizeof(int));

// barrier ensures data has been received
bsp_sync();
ebsp_message("received: %i", a);

When writing or reading large amounts of data in between different bsp_sync calls, the hp... functions are much more efficient in terms of local memory usage (which is very valuable because local memory is small) as well as running speed. However, extra care is needed to synchronize correctly between cores. For example, if we remove either the ebsp_barrier or the bsp_sync call in the previous example program, there will be a race condition.

We test the program, and see that the output is indeed identical to before:

$01: received: 0
$08: received: 7
$02: received: 1
$10: received: 9
...
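To illustrate the kind of hazard that unbuffered communication introduces, here is a small sketch in plain C that replays two possible orderings of the same operations on a single thread. It is a simulation under stated assumptions, not EBSP code: sim_hpput, race_without_barrier and order_with_barrier are hypothetical names, and the scenario (a target core that initialises its variable late) is only one example of an interleaving that a barrier would rule out.

```c
#include <assert.h>

#define NCORES 2   /* hypothetical: a sender (core 0) and a receiver (core 1) */

static int sim_a[NCORES];   /* the registered variable `a`, one per core */

/* Simulated unbuffered put: the write lands in the target immediately. */
static void sim_hpput(int target, const int *src)
{
    assert(target >= 0 && target < NCORES);
    sim_a[target] = *src;
}

/* An ordering that is possible when nothing separates the two steps:
 * core 0 runs ahead and writes before core 1 has initialised `a`. */
static int race_without_barrier(void)
{
    int b0 = 42;
    sim_hpput(1, &b0);   /* core 0: immediate remote write */
    sim_a[1] = 0;        /* core 1: late initialisation overwrites the value */
    return sim_a[1];
}

/* The ordering enforced when a barrier separates initialisation and write. */
static int order_with_barrier(void)
{
    int b0 = 42;
    sim_a[1] = 0;        /* core 1: initialisation */
    /* barrier: both cores have finished their initialisation */
    sim_hpput(1, &b0);   /* core 0: immediate remote write */
    /* barrier or sync: the transfer is complete before `a` is read */
    return sim_a[1];
}
```

In this sketch race_without_barrier() returns 0 because the transferred value is lost, while order_with_barrier() returns 42. On real hardware the outcome of the unsynchronised version depends on timing, which is what makes such bugs hard to reproduce.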

Interface (Variables)

Epiphany

void bsp_push_reg(const void *variable, const int nbytes)

Register a variable as available for remote access.

The operation takes effect after the next call to bsp_sync(). Only one registration is allowed in a single superstep. When a variable is registered, every core must do so.

Parameters
  • variable: A pointer to the local variable
  • nbytes: The size in bytes of the variable

The system maintains a stack of registered variables. Any variables registered in the same superstep are identified with each other. There is a maximum number of registered variables allowed at any given time; the specific number is platform dependent. This limit will be lifted in a future version.
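One way to picture how variables registered in the same superstep are identified with each other is a per-core table indexed by registration order. The sketch below is a plain C illustration under that assumption; reg_table, sim_push_reg_all, sim_translate, NCORES and MAX_REGS are hypothetical names and do not describe the actual EBSP data structures.

```c
#include <assert.h>
#include <stddef.h>

#define NCORES 4     /* hypothetical core count for the simulation */
#define MAX_REGS 8   /* hypothetical limit on registered variables */

/* reg_table[core][slot] holds the local address that `core` registered
 * in the slot-th registration superstep. */
static void *reg_table[NCORES][MAX_REGS];
static int reg_count = 0;

/* Simulate one registration superstep: every core registers one variable.
 * Returns the slot that now identifies these variables with each other. */
static int sim_push_reg_all(void *locals[NCORES])
{
    assert(reg_count < MAX_REGS);
    for (int core = 0; core < NCORES; ++core)
        reg_table[core][reg_count] = locals[core];
    return reg_count++;
}

/* Translate a registered address on core `from` into the address of the
 * matching variable on core `to`. Returns NULL if it was never registered. */
static void *sim_translate(int from, const void *local, int to)
{
    assert(from >= 0 && from < NCORES && to >= 0 && to < NCORES);
    for (int slot = 0; slot < reg_count; ++slot)
        if (reg_table[from][slot] == local)
            return reg_table[to][slot];
    return NULL;
}
```

With such a table, a caller can pass a pointer to its own copy of the variable, and the lookup yields the address of the corresponding variable on the remote core even when the two addresses differ.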

Registering a variable needs to be done before it can be used with the functions bsp_put(), bsp_hpput(), bsp_get(), bsp_hpget().

Usage example:

int a, b, c, p;
int x[16];

bsp_push_reg(&a, sizeof(int));
bsp_sync();
bsp_push_reg(&x, sizeof(x));
bsp_sync();

p = bsp_pid();

// Get the value of the `a` variable of core 0 and save it in `b`
bsp_get(0, &a, 0, &b, sizeof(int));

// Save the value of `c` into the array `x` on core 0, at array location p
bsp_put(0, &c, &x, p*sizeof(int), sizeof(int));

Remark
In the current implementation, the parameter nbytes is ignored. In future versions it will be used to make communication more efficient.
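The byte offset used in the usage example above (p*sizeof(int)) can be illustrated with a short host-side simulation in plain C. This is only a sketch: x_on_core0, sim_deliver and sim_gather_squares are hypothetical names, and the explicit bounds check is an addition for the simulation that the library functions are not described as performing.

```c
#include <assert.h>
#include <string.h>

#define NCORES 16   /* hypothetical core count, matching int x[16] above */

/* Stand-in for the registered array `x` as it exists on core 0. */
static int x_on_core0[NCORES];

/* Simulated delivery of one put: copy nbytes from src to dst_base + offset,
 * where offset is counted in bytes, not in elements. */
static void sim_deliver(void *dst_base, size_t dst_size, int offset,
                        const void *src, int nbytes)
{
    assert(offset >= 0 && nbytes >= 0);
    assert((size_t)offset + (size_t)nbytes <= dst_size);
    memcpy((unsigned char *)dst_base + offset, src, (size_t)nbytes);
}

/* Every simulated core p stores its value c in x[p] on core 0. */
static void sim_gather_squares(void)
{
    for (int p = 0; p < NCORES; ++p) {
        int c = p * p;   /* arbitrary per-core value for the illustration */
        sim_deliver(x_on_core0, sizeof(x_on_core0),
                    p * (int)sizeof(int), &c, (int)sizeof(int));
    }
}
```

After sim_gather_squares(), x_on_core0[p] holds p * p for every p, showing that an offset of p*sizeof(int) bytes selects array element p.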

void bsp_put(int pid, const void *src, void *dst, int offset, int nbytes)

Copy data to another processor (buffered).

The data in src is copied to a buffer (currently in the inefficient external memory) at the moment bsp_put is called. Therefore the caller can replace the data in src right after bsp_put returns. When bsp_sync() is called, the data will be transferred from the buffer to the destination at the other processor.

Parameters
  • pid: The pid of the target processor (this is allowed to be the id of the sending processor)
  • src: A pointer to the source data
  • dst: A variable location that was previously registered using bsp_push_reg()
  • offset: The offset in bytes to be added to the remote location corresponding to the variable location dst
  • nbytes: The number of bytes to be copied

Remark
No warning is thrown when nbytes exceeds the size of the variable src.
Remark
The current implementation uses external memory which restrains the performance of this function greatly. We suggest you use bsp_hpput() wherever possible to ensure good performance.

void bsp_get(int pid, const void *src, int offset, void *dst, int nbytes)

Copy data from another processor (buffered).

No data transaction takes place until the next call to bsp_sync, at which point the data will be copied from source to destination.

Parameters
  • pid: The pid of the remote processor to read from (this is allowed to be the id of the calling processor)
  • src: A variable that has been previously registered using bsp_push_reg()
  • dst: A pointer to a local destination
  • offset: The offset in bytes to be added to the remote location corresponding to the variable location src
  • nbytes: The number of bytes to be copied

Remark
The official BSP standard dictates that first all the data of all bsp_get() transactions is copied into a buffer, after which all the data is written to the proper destinations. This would allow one to use bsp_get to swap two variables in place. Because of memory constraints we do not comply with the standard. In our implementation, the bsp_get() transactions are all executed at the same time; therefore such a swap would result in undefined behaviour.
Remark
No warning is thrown when nbytes exceeds the size of the variable src.
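The remark about swapping with bsp_get can be made concrete with a small plain C sketch that contrasts the two resolution strategies. It is a single-threaded illustration, not EBSP code: swap_two_phase and swap_direct are hypothetical names, and swap_direct shows only one possible outcome, since the real transactions run concurrently and the result is undefined.

```c
#include <assert.h>

/* Two-phase resolution, as the remark describes for the BSP standard:
 * first read every source into a buffer, then write every destination. */
static void swap_two_phase(int *x, int *y)
{
    assert(x != 0 && y != 0);
    int buffered_y = *y;   /* phase 1: read all sources */
    int buffered_x = *x;
    *x = buffered_y;       /* phase 2: write all destinations */
    *y = buffered_x;
}

/* Direct resolution without an intermediate buffer, here with the two
 * gets executed one after the other. */
static void swap_direct(int *x, int *y)
{
    assert(x != 0 && y != 0);
    *x = *y;   /* get y into x */
    *y = *x;   /* get x into y, but x has already been overwritten */
}
```

Starting from x = 1 and y = 2, swap_two_phase leaves x = 2 and y = 1, whereas swap_direct in this ordering leaves both equal to 2. This is why a swap through two gets should not be relied upon here.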

void bsp_sync()

Denotes the end of a superstep, and performs all outstanding communications and registrations.

Serves as a blocking barrier which halts execution until all Epiphany cores are finished with the current superstep.

If only a synchronization is required, and you do not want the outstanding communications and registrations to be resolved, then we suggest you use the more efficient function ebsp_barrier().

void bsp_hpput(int pid, const void *src, void *dst, int offset, int nbytes)

Copy data to another processor, unbuffered.

The data is immediately copied into the destination at the remote processor, as opposed to bsp_put which first copies the data to a buffer. This means the programmer must make sure that the other processor is not using the destination at this moment. The data transfer is guaranteed to be complete after the next call to bsp_sync().

Parameters
  • pid: The pid of the target processor (this is allowed to be the id of the sending processor)
  • src: A pointer to local source data
  • dst: A variable location that was previously registered using bsp_push_reg()
  • offset: The offset in bytes to be added to the remote location corresponding to the variable location dst
  • nbytes: The number of bytes to be copied

Remark
No warning is thrown when nbytes exceeds the size of the variable src.

void bsp_hpget(int pid, const void *src, int offset, void *dst, int nbytes)

Copy data from another processor.

This function is the unbuffered version of bsp_get().

As opposed to bsp_get(), the data is transferred immediately when bsp_hpget() is called. When using this function you must make sure that the source data is available and prepared upon calling. For performance reasons, communication using this function should be preferred over buffered communication.

Parameters
  • pid: The pid of the remote processor to read from (this is allowed to be the id of the calling processor)
  • src: A variable that has been previously registered using bsp_push_reg()
  • dst: A pointer to a local destination
  • offset: The offset in bytes to be added to the remote location corresponding to the variable location src
  • nbytes: The number of bytes to be copied

Remark
No warning is thrown when nbytes exceeds the size of the variable src.

void ebsp_barrier()

Synchronizes cores without resolving outstanding communication.

This function is more efficient than bsp_sync().