Time difference between L1 cache fetch and memory fetch
Aim
Measure how fast a fetch from the L1 cache is compared to a fetch from main memory. Instead of writing pure assembly code, we will use Rust's inline assembly feature.
Brief Refresher on Cache and Memory
Memory is the working space of all processes: any data that a process needs to store or read has to be present in memory. Compared to a processor's internal operations like additions, multiplications, and address generation, fetching data from memory is a time-consuming operation. To keep it from becoming a bottleneck, CPUs have smaller, faster banks of memory called caches.
Once a memory address is accessed, if the processor thinks that the address will be accessed again in the future, it keeps the fetched data in a cache so that subsequent fetches are faster. Usually, processors cache all accessed memory addresses.
In cases where there are multiple caches, their arrangement depends on the chip architecture. Usually, the cache closest to the processor is called the L1 cache, and in a multi-core architecture every core has its own L1 cache.
Measurement Scheme
We will use the `rdtsc` instruction to get the current timestamp. It returns the number of clock cycles that have passed since the processor started. To measure the time needed to fetch data from memory, we need to make sure that the data is not present in any cache; this is done with the `clflush` instruction, which removes the requested data from all the caches in the cache hierarchy. Using these two instructions, we can measure the required time periods with the following flow:
```
+---------------------------------+
|  Measure Time (T1)              |
|                                 |
|  Flush Data                     |  Section 1
|                                 |
|  Load Data                      |
|                                 |
|  Measure Time (T2)              |
+---------------------------------+
|                                 |  Section 2
|  Subtract T1 from T2 (R1)       |
+---------------------------------+
|                                 |
|  Measure Time (T3)              |
|                                 |  Section 3
|  Load Data                      |
|                                 |
|  Flush Data                     |
|                                 |
|  Measure Time (T4)              |
+---------------------------------+
|                                 |  Section 4
|  Subtract T3 from T4 (R2)       |
+---------------------------------+
```
In the first section, we flush the data so that, when we next load it, it is fetched from memory. So the subtracted value `T2 - T1` computed in the second section represents the time taken to load data from memory. In the third section, we load the data before flushing it. Because we already loaded the data in the first section, the processor has by default put it in the cache, so the load in this section is served from the L1 cache (the cache closest to the processor). Consequently, the subtracted value `T4 - T3` represents the time taken to load data from the cache.
Since caches are faster than memory, `R2` should be smaller than `R1`, and `R1 - R2` measures how much faster the cache is compared to memory.
Code
The `rdtsc` instruction puts the least significant 32 bits of the current timestamp in the `EAX` register and the most significant 32 bits in the `EDX` register. Instead of dealing with two registers every time, we shift the value in `RDX` 32 bits to the left and add it to `RAX`, so that `RAX` contains the final 64-bit timestamp (`rdtsc` clears the upper 32 bits of both `RAX` and `RDX`, so the shift-and-add yields the correct value).
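A sketch of the resulting sequence (AT&T syntax; the exact fence placement below follows the common `mfence`/`lfence` recipe and is my assumption):

```
mfence            # wait for earlier loads and stores to finish
lfence            # keep rdtsc from executing any earlier
rdtsc             # EDX:EAX <- timestamp counter
lfence            # keep later instructions from starting before rdtsc
shl $32, %rdx     # move the high 32 bits into position
add %rdx, %rax    # RAX <- full 64-bit timestamp
```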
The `*fence` instructions around `rdtsc` are needed to prevent the processor from executing these instructions out of order.
We create a macro `rdtsc!()` that expands to the above sequence. Using it, the following code computes the time difference between an L1 fetch and a memory fetch.
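What follows is a sketch reconstructing that code against the old nightly, LLVM-style `asm!` macro that this article targets (stable Rust has since replaced it with a different `asm!` syntax). The scratch-register choices and the clobber list are my assumptions; the timestamp sequence is written out inline rather than factored through `rdtsc!()`, and a literal `$` inside the template has to be escaped as `$$`:

```rust
#![feature(asm)] // old nightly-only, LLVM-style inline assembly

fn main() {
    let data = "some string";
    let load_and_flush: u64; // R2: load, then flush -> cache fetch
    let flush_and_load: u64; // R1: flush, then load -> memory fetch

    unsafe {
        asm!("
            # Section 1: T1, flush, load, T2
            mfence
            lfence
            rdtsc
            lfence
            shl $$32, %rdx
            add %rdx, %rax       # RAX = T1
            mov %rax, %rbx       # save T1 in RBX
            clflush ($2)         # evict the data from every cache level
            mov ($2), %rdx       # load the data (served from memory)
            mfence
            lfence
            rdtsc
            lfence
            shl $$32, %rdx
            add %rdx, %rax       # RAX = T2
            # Section 2: R1 = T2 - T1, kept in RCX
            sub %rbx, %rax
            mov %rax, %rcx
            # Section 3: T3, load, flush, T4
            mfence
            lfence
            rdtsc
            lfence
            shl $$32, %rdx
            add %rdx, %rax       # RAX = T3
            mov %rax, %rbx       # save T3 in RBX
            mov ($2), %rdx       # load the data (served from the L1 cache)
            clflush ($2)         # evict it again
            mfence
            lfence
            rdtsc
            lfence
            shl $$32, %rdx
            add %rdx, %rax       # RAX = T4
            # Section 4: R2 = T4 - T3, left in RAX
            sub %rbx, %rax"
            : "=&{rax}" (load_and_flush), "=&{rcx}" (flush_and_load)
            : "r" (&data)
            : "rbx", "rdx", "memory" // RAX/RCX are reserved by the outputs
        );
    }

    println!("memory fetch: {} cycles, cache fetch: {} cycles, difference: {}",
             flush_and_load, load_and_flush,
             flush_and_load as i64 - load_and_flush as i64);
}
```

Note that both timed sections contain one flush and one load, so the cost of `clflush` itself should cancel out in `R1 - R2`.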
The general structure of an inline assembly invocation (in the old nightly `asm!` macro used here) is:
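```rust
asm!(assembly template
    : output operands
    : input operands
    : clobbers
    : options
);
```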
We don't use the `options` argument in our code, so we have only four arguments: the assembly template, output operands, input operands, and clobbers.
Output Operands
We need two output variables to store the values of `R1` and `R2` as described in the Measurement Scheme section. In our assembly template, `R1` ends up in the `RCX` register and `R2` in `RAX`. The phrase `"=&{rax}" (load_and_flush), "=&{rcx}" (flush_and_load)` tells the compiler to store the value in `RCX` in the `flush_and_load` variable and the value in `RAX` in the `load_and_flush` variable.
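As an isolated illustration of output operands (a minimal sketch of my own, separate from the measurement code), the explicit-register constraints `={eax}` and `={edx}` read the two halves of the timestamp into two Rust variables:

```rust
#![feature(asm)]

fn main() {
    let lo: u32;
    let hi: u32;
    unsafe {
        // After the block runs, the compiler copies EAX into `lo`
        // and EDX into `hi`.
        asm!("rdtsc"
            : "={eax}" (lo), "={edx}" (hi));
    }
    println!("timestamp: {}", ((hi as u64) << 32) | (lo as u64));
}
```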
Input Operands
Our code takes only one input: the address of the data to fetch. This can be any address accessible to the process. In the code, we store a static string `"some string"` in the variable `data` and pass its address `&data` as input to the assembly code. The `r` portion of `"r" (&data)` tells the compiler to store the value of `&data` in a register.
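As a minimal, self-contained sketch of the `r` constraint (my own example, not from the article), the compiler picks a free general-purpose register for the input, and the template refers to it positionally:

```rust
#![feature(asm)]

fn main() {
    let x: u64 = 40;
    let y: u64;
    unsafe {
        // $0 is the output register, $1 the register holding `x`.
        asm!("lea 2($1), $0"
            : "=r" (y)
            : "r" (x));
    }
    assert_eq!(y, 42);
}
```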
Clobbers
In our assembly code, we make use of the registers `RAX`, `RBX`, `RCX`, and `RDX`. We have to convey this to the compiler so that it doesn't use these registers for its own code generation. This is done by specifying these registers as clobbers.
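For illustration, here is a minimal, self-contained clobber example in the old syntax (my own sketch, not from the article): the template scribbles on `RDX` without returning it, so `RDX` is declared as a clobber, while `RAX` is already conveyed through the output constraint:

```rust
#![feature(asm)]

fn main() {
    let t: u64;
    unsafe {
        asm!("rdtsc            # writes EAX and EDX
              shl $$32, %rdx   # shift the high half into place
              add %rdx, %rax   # RAX <- 64-bit timestamp"
            : "={rax}" (t) // output: RAX is reserved via the constraint
            :              // no inputs
            : "rdx"        // clobber: RDX is modified but not returned
        );
    }
    println!("timestamp: {}", t);
}
```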
Assembly Template
This closely follows the flow in the Measurement Scheme section, with the following quirks (see the short example after this list):

- In the combined list of outputs and inputs, the `&data` input is the third operand, so we refer to it in the assembly template as `$2` (operands are indexed from 0).
- `clflush` takes a memory address as its argument and removes the data at that address from all caches.
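A minimal sketch of my own illustrating both quirks: `$1` names the second operand (0-based across outputs and inputs), and `clflush` takes a memory operand, so the reload that follows it is served from memory:

```rust
#![feature(asm)]

fn main() {
    let data = "some string";
    let byte: u64;
    unsafe {
        // $0 is `byte`, $1 is the pointer to the string's bytes.
        asm!("clflush ($1)       # evict the line from every cache level
              movzbq ($1), $0    # reload: this fetch comes from memory"
            : "=&r" (byte)
            : "r" (data.as_ptr())
            : "memory");
    }
    println!("first byte: {}", byte);
}
```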
Results
Following is a box plot of the outputs of the above code, run 1000 times each in debug and release mode on my personal computer:
So a data access from the cache saves roughly 190 processor cycles on my laptop. The results will of course vary with the processor architecture. We don't see much difference between debug and release mode because the compiler does not optimize the inline assembly.
Unanswered Questions
How can the code be improved to reduce interference from other processes running on the same processor?
The difference in the debug build varies from 170 to 220 cycles. What causes this variance?
References
- Flush+Reload attack, by Yarom, Y. and Falkner, K.: https://eprint.iacr.org/2013/448.pdf
  This paper exploits the fact that `clflush` removes the data from all the caches to trace the execution of a victim process and read the process's data.
- http://embed.rs/articles/2016/arm-inline-assembly-rust/
  A good reference for Rust's inline assembly feature.