
Time difference between L1 cache fetch and memory fetch

Aim

Measure how fast a fetch from L1 cache is when compared to a fetch from memory. Instead of writing pure assembly code, we will use Rust's inline assembly feature.

Brief Refresher on Cache and Memory

Memory is the working space of all processes. Any data that a process needs to store or read has to be present in memory. Compared to a processor's internal operations such as additions, multiplications, and address generation, fetching data from memory is a time-consuming operation. To keep it from becoming a bottleneck, CPUs have faster, smaller banks of memory called caches.

Once a memory address is accessed, if the processor expects that the address will be accessed again in the future, it keeps the fetched data in a cache so that subsequent fetches are faster. In practice, processors cache all accessed memory addresses.

In cases where there are multiple caches, their arrangement depends on the chip architecture. Usually, the cache closest to the processor is called the L1 cache, and in a multi-core architecture every core has its own L1 cache.

Measurement Scheme

  • We will use the rdtsc instruction to get the current timestamp. It returns the number of clock cycles elapsed since the processor was last reset.
  • To measure the time needed to fetch data from memory, we must ensure that the data is not present in any cache. This is done with the clflush instruction, which removes the requested data from every cache in the cache hierarchy.
  • Using these, we can measure required time periods with the following flow:
           +---------------------------------+
           |      Measure Time (T1)          |
           |                                 |
           |      Flush Data                 |
Section 1  |                                 |
           |      Load Data                  |
           |                                 |
           |      Measure Time (T2)          |
           +---------------------------------+
           |                                 |
Section 2  |      Subtract T1 from T2 (R1)   |
           +---------------------------------+
           |                                 |
           |      Measure Time (T3)          |
           |                                 |
Section 3  |      Load Data                  |
           |                                 |
           |      Flush Data                 |
           |                                 |
           |      Measure Time (T4)          |
           +---------------------------------+
           |                                 |
Section 4  |      Subtract T3 from T4 (R2)   |
           +---------------------------------+

In the first section, we flush the data so that the load which follows fetches it from memory. So the difference T2 - T1 computed in the second section represents the time taken to load data from memory. In the third section, we load the data before flushing it. Because we already loaded the data in the first section, the processor has cached it, so the load in this section is served from the L1 cache (the cache closest to the processor). Consequently, the difference T4 - T3 represents the time taken to load data from cache. Each section contains exactly one load and one flush, so the cost of the flush itself cancels out when the two results are compared.

Since caches are faster than memory, R2 should be smaller than R1. So R1 - R2 measures how much faster the cache is than memory.
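The flow above can also be sketched without hand-written assembly, using Rust's stable x86-64 intrinsics: _rdtsc, _mm_clflush, _mm_mfence, and _mm_lfence in core::arch::x86_64 map to the same instructions. This is a simplified sketch assuming an x86-64 target; the exact fencing discipline differs slightly from the assembly discussed below.

```rust
#[cfg(target_arch = "x86_64")]
fn measure() -> (u64, u64) {
    use core::arch::x86_64::{_mm_clflush, _mm_lfence, _mm_mfence, _rdtsc};
    use core::ptr::read_volatile;

    let data: u64 = 42;
    let p = &data as *const u64;

    unsafe {
        // Section 1: flush, then load -- the load is served from memory.
        _mm_mfence();
        _mm_lfence();
        let t1 = _rdtsc();
        _mm_lfence();
        _mm_clflush(p as *const u8);
        _mm_mfence(); // wait for the flush to complete before loading
        let _ = read_volatile(p);
        _mm_lfence();
        let t2 = _rdtsc();
        _mm_lfence();

        // Section 3: load (cached by Section 1's load), then flush.
        let t3 = _rdtsc();
        _mm_lfence();
        let _ = read_volatile(p);
        _mm_clflush(p as *const u8);
        _mm_mfence();
        _mm_lfence();
        let t4 = _rdtsc();

        (t2 - t1, t4 - t3) // (R1: memory path, R2: cache path)
    }
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        let (memory_cycles, cache_cycles) = measure();
        println!("memory: {memory_cycles} cycles, cache: {cache_cycles} cycles");
    }
}
```

The read_volatile calls stand in for the assembly's mov loads; they stop the compiler from optimizing away a load whose result is unused.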

Code

The rdtsc instruction puts the least significant 32 bits of the current timestamp in the EAX register and the most significant 32 bits in the EDX register. Instead of dealing with two registers every time, we shift the value in RDX left by 32 bits and add it to RAX, so that RAX holds the full 64-bit timestamp.

   mfence
   lfence
   rdtsc
   lfence
   shl $$32, %rdx
   add %rdx, %rax

The *fence instructions around rdtsc are needed to prevent the processor from reordering rdtsc with respect to the surrounding memory accesses; without them, the timestamp could be read before the loads and flushes being timed have actually happened.
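The shift-and-add in the last two lines is just 64-bit reassembly of the two 32-bit halves. In plain Rust the same combination would read:

```rust
// rdtsc leaves the low 32 bits of the timestamp in EAX and the high
// 32 bits in EDX; the full 64-bit value is (EDX << 32) | EAX.
fn combine(eax: u32, edx: u32) -> u64 {
    ((edx as u64) << 32) | eax as u64
}

fn main() {
    // 0x1 in the high half and 0xDEAD_BEEF in the low half
    // reassemble into 0x1_DEAD_BEEF.
    assert_eq!(combine(0xDEAD_BEEF, 0x1), 0x1_DEAD_BEEF);
    println!("{:#x}", combine(0xDEAD_BEEF, 0x1));
}
```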

We create a macro rdtsc!() that expands to the above code. Using it, the following code computes the time difference between an L1 fetch and a memory fetch.

let data = "some string";
let flush_and_load: u64;
let load_and_flush: u64;

unsafe {
    asm!(concat!(
    rdtsc!(),
    "mov %rax, %rbx
    clflush ($2)
    mov ($2), %rax",
    rdtsc!(),
    // End of Section 1

    "sub %rbx, %rax
    mov %rax, %rcx",
    // End of Section 2

    rdtsc!(),
    "mov %rax, %rbx
    mov ($2), %rax
    clflush ($2)",
    rdtsc!(),
    // End of Section 3

    "sub %rbx, %rax"
    // End of Section 4
    )
    : "=&{rax}" (load_and_flush), "=&{rcx}" (flush_and_load)
    : "r" (&data)
    : "rbx", "rcx", "rax", "rdx"
);
}

println!("{}", flush_and_load - load_and_flush);

General structure of the inline assembly in Rust is

unsafe {
    asm!(assembly template
       : output operands
       : input operands
       : clobbers
       : options
    );
}

We don't use the options argument in our code, so we have only four sections: the assembly template, output operands, input operands, and clobbers.
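As an aside: the constraint-based form shown here is the pre-2021 inline assembly syntax (later renamed llvm_asm! and removed). On current stable Rust, a serialized rdtsc read would be written with the newer asm! macro and named register operands, roughly like this sketch:

```rust
#[cfg(target_arch = "x86_64")]
fn rdtsc_serialized() -> u64 {
    let lo: u32;
    let hi: u32;
    unsafe {
        std::arch::asm!(
            "mfence",
            "lfence",
            "rdtsc", // low 32 bits -> EAX, high 32 bits -> EDX
            "lfence",
            out("eax") lo,
            out("edx") hi,
        );
    }
    ((hi as u64) << 32) | lo as u64
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        let t = rdtsc_serialized();
        assert!(t > 0); // the counter has been running since reset
        println!("{t}");
    }
}
```

In the newer syntax, naming a register with out(...) both binds the output variable and marks the register as clobbered, so the separate clobbers list disappears.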

Output Operands

We need two output variables to store the values of R1 and R2 as described in the Measurement Scheme section. In our assembly template, R1 ends up in the RCX register and R2 in RAX. The phrase "=&{rax}" (load_and_flush), "=&{rcx}" (flush_and_load) tells the compiler to store the value in RCX in the flush_and_load variable and the value in RAX in the load_and_flush variable.

Input Operands

Our code takes only one input: the address of the data to fetch. This can be any address accessible to the process. In the code, we store a static string "some string" in the variable data and pass its address &data as input to the assembly code. The r portion of "r" (&data) tells the compiler to place the value of &data in a register of its choosing.

Clobbers

In our assembly code, we make use of the registers RAX, RBX, RCX and RDX. We have to convey this to the compiler so that it doesn't use these registers for its own code generation. This is done by specifying these registers as clobbers.

Assembly Template

This closely follows the flow in Measurement Scheme section with the following quirks:

  • In the combined list of output and input operands, the &data input is the third operand, so we refer to it in the assembly code as $2 (operand indices are 0-based)
  • clflush takes a memory address as its argument and removes the data at that address from every cache in the hierarchy
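Note that clflush operates at cache-line granularity: it evicts the entire line containing the address, not just the referenced bytes. A sketch of the alignment arithmetic, assuming the 64-byte line size common on current x86-64 parts:

```rust
const LINE: usize = 64; // cache line size in bytes; 64 on most x86-64 CPUs

/// Start address of the cache line containing `addr`.
/// clflush on any address inside the line evicts the whole line.
fn line_base(addr: usize) -> usize {
    addr & !(LINE - 1)
}

fn main() {
    // Addresses 0x1000 and 0x103F share a line, so flushing one
    // also evicts the other; 0x1040 starts the next line.
    assert_eq!(line_base(0x1000), 0x1000);
    assert_eq!(line_base(0x103F), 0x1000);
    assert_eq!(line_base(0x1040), 0x1040);
    println!("ok");
}
```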

Results

Following is a box plot of the outputs of the above code, run 1000 times in debug and release mode on my personal computer:

/images/release-and-debug-both.svg

So a data access from cache saves roughly 190 processor cycles on my laptop. The results will of course vary with the processor architecture. We don't see much difference between debug and release mode because the compiler does not optimize the inline assembly.

Unanswered Questions

  1. How can the code be improved to reduce the interference by other processes running on the same processor?
  2. The measured difference in the debug build varies from 170 to 220 cycles. What causes this variance?
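For the first question, one common mitigation (a suggestion, not something the measurement above does) is to repeat each measurement many times and report the median, which discards the occasional huge outlier caused by an interrupt or a competing process landing inside a timed section; pinning the thread to a single core also helps. A minimal median helper might look like:

```rust
/// Median of a set of noisy cycle counts; robust against the
/// occasional huge outlier caused by an interrupted measurement.
fn median(mut samples: Vec<u64>) -> u64 {
    samples.sort_unstable();
    samples[samples.len() / 2]
}

fn main() {
    // Four plausible runs plus one interrupted run (5000 cycles);
    // the median ignores the outlier.
    let samples = vec![200, 190, 210, 5000, 195];
    assert_eq!(median(samples), 200);
    println!("ok");
}
```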

References

  1. Flush+Reload attack by Yarom Y., Falkner K. E. - https://eprint.iacr.org/2013/448.pdf

    This paper exploits the fact that clflush removes the data from all the caches to trace the execution of a victim process and read its data.

  2. http://embed.rs/articles/2016/arm-inline-assembly-rust/

    Good reference for Rust's inline assembly feature.