Memory Management

Memory management on the Epiphany platform requires special care, so we discuss it in this separate section. The Epiphany cores have very little local (fast) memory, and access to the external (larger) memory space is very slow. Therefore one needs to pay close attention to where data is stored in order to write good programs for the Epiphany platform.

We provide some functions that aid in memory allocation. These are not part of the official BSP standard, but are meant as a utility library. This page covers these helper functions. If you are interested in the more technical details (specific to the Parallella), see Parallella memory details.

Introduction

In short, there are two types of memory:

  • local memory: 32 KB for each core, fast
  • external memory: 32 MB shared for all cores, slow

In principle, all computations should be performed on data in the fast local memory. However, 32 KB might not be enough for all your data. In this case you have to store the data in external memory and transfer the required parts to local memory to do the computations. Access to external memory can be up to a factor of 100 slower in some cases, so it should be avoided when possible.
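The staging pattern described above can be sketched on a host machine as follows. This is only an illustration under stated assumptions, not library code: plain arrays stand in for external and local memory, the standard memcpy stands in for the transfer, and the names CHUNK_FLOATS and sum_in_chunks are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk size: 256 floats = 1 KB out of the 32 KB local memory. */
#define CHUNK_FLOATS 256

/* Sums 'count' floats that live in (slow) external memory by staging them
 * through a small buffer that stands in for (fast) local memory. */
float sum_in_chunks(const float* external_data, size_t count) {
    float local_chunk[CHUNK_FLOATS]; /* stand-in for local memory */
    float total = 0.0f;

    assert(external_data != NULL || count == 0);

    for (size_t offset = 0; offset < count; offset += CHUNK_FLOATS) {
        size_t n = count - offset;
        if (n > CHUNK_FLOATS)
            n = CHUNK_FLOATS;

        /* One bulk transfer instead of many slow element-wise reads.
         * On the Epiphany this would be ebsp_memcpy() or a DMA transfer. */
        memcpy(local_chunk, external_data + offset, n * sizeof(float));

        /* All arithmetic touches only the local copy. */
        for (size_t i = 0; i < n; i++)
            total += local_chunk[i];
    }
    return total;
}
```

The chunk size is a trade-off: larger chunks mean fewer transfers, but they leave less of the 32 KB for code, stack and other data.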

How do you know in what type of memory your data is stored? Let us look at the following example:

int global_var;

int my_function(int argument) {
    int local_var;
    void* buffer1 = ebsp_malloc(10 * sizeof(int));
    void* buffer2 = ebsp_ext_malloc(10 * sizeof(int));


    // ....

    if (buffer1) ebsp_free(buffer1);
    if (buffer2) ebsp_free(buffer2);
    return 0;
}

In this example, the following will be stored in local memory:

  • global_var
  • local_var
  • argument
  • buffer1, as well as the data it points to
  • my_function (the machine code)

However, buffer2 points to data that is stored in the large external memory. The pointer itself is stored in local memory.

In general, global and local variables in your C source code will be stored in local memory, unless otherwise specified with some special gcc attributes. Code itself (i.e. the machine code) can be stored in either type of memory. Normal C code will be stored in local memory, unless specified otherwise using gcc attributes. Data allocated using ebsp_ext_malloc() is stored in external memory.

Data copying

Direct memcpy

In C you can copy data using memcpy(destination, source, nbytes). This function is available on the Epiphany as well, but its implementation (depending on the version of gcc and newlib) is not properly optimized for the Epiphany architecture. In particular the function itself is stored in external memory (unless you choose to save the complete C library in local memory) and it also does not perform 8-byte transfers. For this reason we have created ebsp_memcpy() which is stored in local memory and does transfers utilizing 8-byte read/write instructions when possible. It is therefore faster than memcpy and should be preferred.

DMA engine

Each Epiphany processor contains a so-called DMA engine which can be used to transfer data. This DMA engine can be viewed as a separate core that copies data while the normal Epiphany core does other things. The Epiphany core simply gives the DMA engine a task (a source and destination address along with some other options) and the DMA engine copies the data, so that the Epiphany core can continue with other operations. The advantage of the DMA engine over normal memory access is that it is faster and can transfer data while the CPU does other things. There are two DMA channels, meaning that two pairs of source/destination addresses can be set and the Epiphany core can continue while the DMA engine is transferring data.

We have provided some utility functions to make the use of the DMA engine easier. If you want to use the DMA engine using the e_dma_xxx functions from the ESDK you can do so, but only use E_DMA_0. The other DMA channel (E_DMA_1) is used internally by the library.

Warning

The DMA engine cannot transfer data from the local core to itself (i.e. to another memory location in the same core). Either the source or the destination (or both) must point to another core’s memory or to external memory.

The Epiphany BSP library provides the functions ebsp_dma_push() and ebsp_dma_wait(). They implement a queue of DMA tasks that are handled sequentially. With ebsp_dma_push() you can push a task to this queue and with ebsp_dma_wait() you can wait for the task to complete:

// A handle identifies the transfer task
ebsp_dma_handle descriptor_1;
ebsp_dma_handle descriptor_2;

// Start two transfers
ebsp_dma_push(&descriptor_1, destination_1, source_1, data_size_1);
ebsp_dma_push(&descriptor_2, destination_2, source_2, data_size_2);

// perform some computations
// ...

// Wait for them to finish
ebsp_dma_wait(&descriptor_1);
ebsp_dma_wait(&descriptor_2);

Pushing a new task will start the DMA engine if it was not started yet. If it was already running, the library will add the task to an internal queue and automatically point the DMA engine to the next task when it is finished. For those who are interested, this is implemented using interrupts.
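The queueing behaviour described above can be illustrated with a small host-side simulation. This is a sketch under stated assumptions and not the library's implementation: the sim_task and sim_dma_queue types and both functions are hypothetical, the layout of the real ebsp_dma_handle is not shown here, and an ordinary function call stands in for the "transfer complete" interrupt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical task record for the simulation. */
typedef struct sim_task {
    void* dst;
    const void* src;
    size_t nbytes;
    int done;              /* set to 1 once the transfer has finished */
    struct sim_task* next; /* next task in the chain, or NULL */
} sim_task;

typedef struct {
    sim_task* current; /* task the simulated engine is working on, or NULL */
    sim_task* last;    /* tail of the chain of pending tasks */
} sim_dma_queue;

/* Append a task; the engine starts immediately if it was idle. */
void sim_dma_push(sim_dma_queue* q, sim_task* t,
                  void* dst, const void* src, size_t nbytes) {
    t->dst = dst;
    t->src = src;
    t->nbytes = nbytes;
    t->done = 0;
    t->next = NULL;
    if (q->current == NULL)
        q->current = t;    /* idle: start right away */
    else
        q->last->next = t; /* busy: chain behind the last queued task */
    q->last = t;
}

/* Stands in for the completion interrupt: finish the current task and
 * point the engine at the next one, if any. */
void sim_dma_on_complete(sim_dma_queue* q) {
    sim_task* t = q->current;
    assert(t != NULL);
    memcpy(t->dst, t->src, t->nbytes);
    t->done = 1;
    q->current = t->next;
    if (q->current == NULL)
        q->last = NULL;
}
```

Because tasks are chained in the order they were pushed, a task can only be marked done after every task pushed before it.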

In order to use the DMA engine to write data to another core, one needs a memory address that points to the local memory of another core. For this we provide the function ebsp_get_direct_address():

// Some buffer
float data[16];

// Register it in the BSP system
bsp_push_reg(&data, sizeof(data));
bsp_sync();

// Get an address for the data buffer on the core with pid 3
float* remote_data = ebsp_get_direct_address(3, &data);

// Now we can pass 'remote_data' to the DMA engine, or use it directly
*remote_data = 1.0f;

The above example shows how to obtain an address of a variable on another core. This address can then be passed as source or destination to ebsp_dma_push().
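For readers wondering how one address can refer to memory on another core, the following sketch may help. It assumes the addressing scheme of the Epiphany architecture, in which the upper 12 bits of a 32-bit address hold the core id (6 bits for the row, 6 bits for the column) and the lower 20 bits hold the offset within that core. The function name is hypothetical, and how the library maps a pid to a row and column is not covered here.

```c
#include <assert.h>
#include <stdint.h>

/* Builds a 32-bit global address from a core's mesh coordinates and an
 * offset into that core's local memory (assumed scheme, see above). */
uint32_t sketch_global_address(unsigned row, unsigned col,
                               uint32_t local_offset) {
    assert(row < 64 && col < 64);        /* 6 bits each */
    assert(local_offset < (1u << 20));   /* 20-bit offset within a core */
    uint32_t core_id = ((uint32_t)row << 6) | col;
    return (core_id << 20) | local_offset;
}
```

In practice ebsp_get_direct_address() should be used instead, since it takes a pid and a registered variable and hides these details.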

Example

Memory allocation

The memory allocation functions work analogously to the normal C function malloc. Memory allocated by ebsp_ext_malloc() and by ebsp_malloc() can both be freed with the same function ebsp_free(), as in the following example:

// Allocate local memory
float* local_data = (float*)ebsp_malloc(16 * sizeof(float));
if (!local_data) {
    ebsp_message("Memory allocation failed!");
} else {
    for (int i = 0; i < 16; i++)
        local_data[i] = 2.0f;
    // Free the memory
    ebsp_free(local_data);
}

// Allocate external memory
float* external_data = (float*)ebsp_ext_malloc(16 * sizeof(float));
if (external_data) {
    do_computation();
    // Free the memory
    ebsp_free(external_data);
}

Note that calling ebsp_free() with a null pointer results in undefined behaviour, so the following is NOT allowed:

float* local_data = (float*)ebsp_malloc(16 * sizeof(float));
if (local_data) {
    for (int i = 0; i < 16; i++)
        local_data[i] = 2.0f;
}
// !!! WRONG: This will crash if local_data is NULL
ebsp_free(local_data);
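One way to avoid this mistake is to route every release through a small wrapper that performs the check. The sketch below runs on a host machine: stub_free is a stand-in for ebsp_free() so that the code compiles without the library, and free_if_allocated is a hypothetical helper name, not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Host-side stand-in for ebsp_free(). It counts calls and, like the
 * library function, must never be given a null pointer. */
static int stub_free_calls = 0;

static void stub_free(void* ptr) {
    assert(ptr != NULL);
    stub_free_calls++;
    free(ptr);
}

/* Hypothetical wrapper: releases the block only if allocation succeeded. */
void free_if_allocated(void* ptr) {
    if (ptr != NULL)
        stub_free(ptr);
}
```

On the Epiphany, the body of the wrapper would call ebsp_free() in place of the stub.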

External memory DMA transfers

The following example demonstrates the use of ebsp_dma_push() to write a buffer of local data to external memory and to read a buffer from external memory using the DMA engine:

// Allocate buffers
float* external_data_1 = (float*)ebsp_ext_malloc(16 * sizeof(float));
float* external_data_2 = (float*)ebsp_ext_malloc(16 * sizeof(float));
float* local_data_1 = (float*)ebsp_malloc(16 * sizeof(float));
float* local_data_2 = (float*)ebsp_malloc(16 * sizeof(float));

// Fill local buffer 1 with data
for (int i = 0; i < 16; i++)
    local_data_1[i] = 2.0f;

// Fill external buffer 2 with data
// Note that this is slow
for (int i = 0; i < 16; i++)
    external_data_2[i] = 2.0f;

// Two 'tasks' for the DMA engine:
// Copy local_data_1 to external_data_1 (write to external memory)
// Copy external_data_2 to local_data_2 (read from external memory)

// This corresponds to two handles
ebsp_dma_handle descriptor_1;
ebsp_dma_handle descriptor_2;

// Start the DMA with the writing task
// Note: the buffers are pointers, so sizeof(local_data_1) would give the
// size of the pointer; pass the size of the buffer explicitly
ebsp_dma_push(&descriptor_1, external_data_1, local_data_1, 16 * sizeof(float));

// We can 'push' the next task while the DMA already works on the first task
ebsp_dma_push(&descriptor_2, local_data_2, external_data_2, 16 * sizeof(float));

// Do lengthy computation in the mean time
do_computations();

// Wait for the DMA to finish the second task
ebsp_dma_wait(&descriptor_2);

// Because the DMA performs the tasks in order,
// we can be assured that the first task is completed as well

Core to core DMA transfers

To use the DMA engine to transfer data to another core, we need an address that points into that core's memory. This can be obtained using ebsp_get_direct_address(). In the following example we have two arrays: my_data and incoming_data. The idea is to copy the contents of my_data on the local core into incoming_data on the next core. To do this, we first register incoming_data. After this we can get the address of the corresponding array on a remote core. In this case we take the core with pid (s + 1) % p, where s is the pid of the local core and p is the number of cores, so that the last core wraps around to core 0. With this address we can now use ebsp_dma_push() to copy data using the DMA engine. During this transfer, other computations can be done. After this we use ebsp_dma_wait(), which blocks until the transfer is complete (or returns immediately if it has already completed):

int s = bsp_pid();
int p = bsp_nprocs();

// Data to be sent
float my_data[16];

// Buffer to receive data
float incoming_data[16];

// Register it in the BSP system
bsp_push_reg(&incoming_data, sizeof(incoming_data));
bsp_sync();

// Get an address for the incoming_data buffer on the core with pid s + 1.
float* remote_data = ebsp_get_direct_address((s+1)%p, &incoming_data);

// Start the DMA to copy the data from my_data on this core to incoming_data on the next core
ebsp_dma_handle descriptor;
ebsp_dma_push(&descriptor, remote_data, &my_data, sizeof(my_data));

// Do lengthy computation
do_computations();

// Wait for the DMA transfer to finish
ebsp_dma_wait(&descriptor);

// Done
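The communication pattern of this example, in which every core sends to its neighbour and the last core wraps around to core 0, can be checked with a host-side simulation. This is a sketch only: SIM_CORES, SIM_LEN and sim_ring_shift are hypothetical names, the rows of a two-dimensional array stand in for the per-core buffers, and memcpy stands in for the DMA transfer.

```c
#include <assert.h>
#include <string.h>

#define SIM_CORES 4 /* hypothetical number of cores in the simulation */
#define SIM_LEN 16  /* elements per buffer, as in the example above */

/* Each simulated core s copies its my_data into the incoming_data
 * buffer of core (s + 1) % p. */
void sim_ring_shift(float my_data[SIM_CORES][SIM_LEN],
                    float incoming_data[SIM_CORES][SIM_LEN]) {
    int p = SIM_CORES;
    for (int s = 0; s < p; s++) {
        int target = (s + 1) % p; /* the last core wraps around to core 0 */
        assert(target >= 0 && target < p);
        memcpy(incoming_data[target], my_data[s], sizeof(my_data[s]));
    }
}
```

After the shift, core s holds the data that was sent by core (s - 1 + p) % p.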

Interface

void *ebsp_ext_malloc(unsigned int nbytes)

Allocate external memory.

This function allocates memory in external RAM, meaning the memory is slow and should not be used with time critical computations.

Return
A pointer to the allocated memory, guaranteed to be 8-byte aligned to ensure fast transfers, or zero on error.
Parameters
  • nbytes: The size of the memory block

When no more space is available, the function will return zero. Note that it is not allowed to call ebsp_free() with a zero pointer so this should always be checked.

void *ebsp_malloc(unsigned int nbytes)

Allocate local memory.

This function allocates memory in local SRAM, meaning the memory is fast but extremely limited.

Return
A pointer to the allocated memory, guaranteed to be 8-byte aligned to ensure fast transfers, or zero on error.
Parameters
  • nbytes: The size of the memory block

When no more space is available, the function will return zero. Note that it is not allowed to call ebsp_free() with a zero pointer so this should always be checked.

void ebsp_free(void *ptr)

Free allocated external or local memory.

Note that the malloc functions can return null pointers on error, and ebsp_free will crash on null pointers.

Parameters
  • ptr: A pointer to memory that was allocated with ebsp_malloc() or ebsp_ext_malloc()

void ebsp_memcpy(void *dst, const void *src, size_t nbytes)

Performs a memory copy completely analogous to the standard C memcpy().

This function is provided because the default memcpy that comes with the epiphany-gcc compiler has some drawbacks. First, it is stored in external memory, unless you store the full C library (newlib) on the Epiphany cores. Second, it does not use 8-byte transfers, so it is slower than necessary.
Parameters
  • dst: Destination address
  • src: Source address
  • nbytes: Amount of bytes to be copied

This function resides in local core memory and does 8-byte transfers when possible, meaning if both dst and src are 8-byte aligned. In other cases, 4-byte or single byte transfers are used.
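The width selection described above can be sketched as follows. This is an illustration of the idea and not the source of ebsp_memcpy(): the function name is hypothetical, the may_alias attribute is a GCC/Clang extension used here so that the word-sized accesses are well defined, and, as with memcpy, the two regions are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: objects of these types may alias any other type. */
typedef uint64_t __attribute__((may_alias)) word64_t;
typedef uint32_t __attribute__((may_alias)) word32_t;

/* Hypothetical helper: copies nbytes using the widest transfer that the
 * alignment of BOTH pointers allows. */
void sketch_aligned_copy(void* dst, const void* src, size_t nbytes) {
    assert((dst != NULL && src != NULL) || nbytes == 0);

    /* OR-ing the addresses exposes the low bits set in either pointer. */
    uintptr_t bits = (uintptr_t)dst | (uintptr_t)src;
    unsigned char* d = dst;
    const unsigned char* s = src;

    if ((bits & 7u) == 0) {
        /* Both 8-byte aligned: 8-byte transfers. */
        for (; nbytes >= 8; nbytes -= 8, d += 8, s += 8)
            *(word64_t*)d = *(const word64_t*)s;
    } else if ((bits & 3u) == 0) {
        /* Both 4-byte aligned: 4-byte transfers. */
        for (; nbytes >= 4; nbytes -= 4, d += 4, s += 4)
            *(word32_t*)d = *(const word32_t*)s;
    }

    /* Remaining tail, or everything if the pointers are unaligned. */
    while (nbytes > 0) {
        *d++ = *s++;
        nbytes--;
    }
}
```

Since the allocation functions return 8-byte aligned pointers, copies between allocated buffers take the fastest path in this scheme.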

void ebsp_dma_push(ebsp_dma_handle *desc, void *dst, const void *src, size_t nbytes)

Push a new task to the DMA engine.

See the documentation on Memory Management for details on the DMA engine.

Warning

Source src and destination dst can NOT both be on the local core. Doing so results in undefined behaviour. At least one of the two addresses must be on another core or in external memory.

Parameters
  • desc: Used in combination with ebsp_dma_wait(). Should be seen as a handle to the task. Its contents are populated by this function.
  • dst: Destination address
  • src: Source address
  • nbytes: Amount of bytes to be copied

Assumes that the previous task in desc has completed (use ebsp_dma_wait() to ensure this).

The DMA (E_DMA_1) will be started if it was not started yet. If it was already started, this task will be pushed to a queue so that it will be done some time later. Use ebsp_dma_wait() to wait for the task to complete.

Usage example:

int s = bsp_pid();
int p = bsp_nprocs();

// Data to be sent
float mydata[16];

// Buffer to receive data
float incomingdata[16];

// Register it in the BSP system
bsp_push_reg(&incomingdata, sizeof(incomingdata));
bsp_sync();

// Get an address for the incomingdata buffer on the core with pid s + 1.
float* remotedata = ebsp_get_direct_address((s+1)%p, &incomingdata);

// Start the DMA to copy the data from mydata on this core to incomingdata on the next core
ebsp_dma_handle descriptor;
ebsp_dma_push(&descriptor, remotedata, &mydata, sizeof(mydata));

// Do lengthy computation
do_computations();

// Wait for the DMA transfer to finish
ebsp_dma_wait(&descriptor);

// Done

Remark
The desc pointer should be 8-byte aligned or behaviour is undefined. This should not be a problem, because the malloc functions always return 8-byte aligned pointers, and an ebsp_dma_handle struct declared as a local variable is 8-byte aligned as well.
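If a handle is obtained in some other way, for example as a member of a larger structure, its alignment can be verified with a debug-time check. The two helpers below are hypothetical and not part of the library; they only test whether the low three bits of the address are zero.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if ptr is a multiple of 8, 0 otherwise. */
int is_8byte_aligned(const void* ptr) {
    return ((uintptr_t)ptr & 7u) == 0;
}

/* Debug-time guard, e.g. to call right before ebsp_dma_push(). */
void require_8byte_aligned(const void* ptr) {
    assert(is_8byte_aligned(ptr));
}
```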

void ebsp_dma_wait(ebsp_dma_handle *desc)

Wait for the task to be completed.

This function blocks until the task in desc is completed. Use it somewhere after ebsp_dma_push(). See ebsp_dma_push() for example code.
Parameters
  • desc: Handle to the task, as populated by ebsp_dma_push()

void *ebsp_get_direct_address(int pid, const void *variable)

Get a raw remote memory address for a variable that was registered using bsp_push_reg()

The returned pointer (if nonzero) can be written to and read from directly. Note that the data will be transferred directly, as in bsp_hpput(), so synchronization issues should be considered.
Return
A pointer to the remote variable, or 0 if it was not registered
Parameters
  • pid: Remote core id
  • variable: An address that was registered using bsp_push_reg()

This function is meant to be used in combination with ebsp_dma_push() to transfer data between cores while doing computations at the same time.