The age-old problem of the widening gap between microprocessor performance and memory performance, sometimes called the "Memory Wall", is getting much attention these days. To combat this problem, advanced memory hierarchies with multilevel caches are present in all modern microprocessors, and new memory technologies like DRDRAM (Rambus) and DDR-SDRAM are becoming common. What these new memory technologies have in common is that they provide increased peak memory bus bandwidth by enhancing the bus between the memory and the microprocessor. Both feature DDR signaling, the ability to transfer data twice per bus cycle, to increase performance. DRDRAM (PC800) uses a 16-bit (2 Byte) wide bus running at 400 MHz; with DDR signaling this equals 800 MHz transfers, giving a peak bandwidth of 1.6 GB/s. DDR-SDRAM (PC2100) uses a 64-bit bus running at 133 MHz, which gives us 266 MHz transfers and a 2.1 GB/s peak bandwidth.

Notice, however, that there is a difference between peak bus bandwidth and effective memory bandwidth. Where the peak bus bandwidth is just the product of the bus width and the bus frequency, the effective memory bandwidth includes addressing and everything else needed to perform a memory read or write. The bold figures of DDR-SDRAM and DRDRAM do not indicate how these new memory technologies perform in real life. In this article, we will look into the cause of the failed promises of these technologies by focusing on the most important part of memory performance: latency. What neither of these new memory technologies gives us is reduced memory latency, that is, the time it takes to look something up in memory. Latency is not so much an issue of the memory interface as of the memory cell itself, and since both of these new memories are based on DRAM, the latency is not improved.
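As a back-of-the-envelope check, the peak figures quoted above follow directly from bus width and transfer rate. A minimal sketch (the function name is my own, and MB here means 10^6 bytes as in the marketing figures):

```python
def peak_bandwidth_mb(bus_width_bits, bus_mhz, transfers_per_cycle):
    """Peak bus bandwidth in MB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits // 8
    return bytes_per_transfer * bus_mhz * transfers_per_cycle

# PC800 DRDRAM: 2 Bytes wide, 400 MHz, DDR -> 800 MHz transfers, 1.6 GB/s
drdram = peak_bandwidth_mb(16, 400, 2)     # 1600 MB/s

# PC2100 DDR-SDRAM: 8 Bytes wide, 133 MHz, DDR -> 266 MHz transfers, ~2.1 GB/s
ddr_sdram = peak_bandwidth_mb(64, 133, 2)  # 2128 MB/s
```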
As we will see in the following sections, latency is more important than peak bus bandwidth when it comes to providing effective memory bandwidth.
Before we can journey into the depth of memory bandwidth, it is essential that you understand the basics of caches.
A modern microprocessor uses caches to help reduce the actual latency of memory operations. By keeping often-used data and instructions in small, fast memories close to the processor core (the cache memory), the effective latency is drastically reduced.
Caches work on the principle of locality. Two types of locality exist: temporal locality and spatial locality. Temporal locality means that if a program has used a piece of data or an instruction, it is likely to use the same data again soon. This is clear for instructions if you consider program structures such as loops. Spatial locality means that if a program uses a piece of data or an instruction, it is likely to soon use data or instructions close to the previous access. For data, this can be thought of as consecutive accesses to the elements of an array; for instructions, it occurs because of the sequential nature of programs.
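A toy illustration of the two kinds of locality, assuming nothing beyond plain sequential and repeated accesses:

```python
data = list(range(1024))

# Spatial locality: consecutive elements are touched one after another,
# so each cache line fill also brings in the next few elements we need.
total = 0
for x in data:
    total += x

# Temporal locality: the same value is reused over and over within a
# short time window, so it stays resident in the cache.
hot = data[1]
for _ in range(1000):
    total += hot
```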
When data that is needed is not present in the cache, a cache miss occurs. The processor has to go to the slow main memory, fetch the data there, and load it into the cache. During this time the processor must sit idle and wait for the new data to arrive. (This is not entirely true for modern microprocessors with out-of-order execution, but in principle, that is how it works.) The new data arrives in a chunk called a cache block, or cache line. This block contains not only what was accessed but also the data close to it, so that we can benefit from spatial locality.
Another important aspect of cache memories is the write policy. Most modern microprocessors use write-back, write-allocate caches. Write-back means that when the processor performs a write, that write will only take place in the cache and not in memory. When the block that was written is evicted from the cache because some other block needs that space, the modified block must first be written to memory. Write-allocate means that if a write misses in the cache, the block is first loaded from memory into the cache, and then the write occurs in the cache. The result of this is that a simple memory copy operation does not take two main memory transfers, as one might imagine (one from the old location to the cache and one from the cache to the new location). In fact, unless it is done in some clever way, a memory copy operation results in three main memory transfers.
First, you read the data from memory into the cache; then you write the data to the new location. With write-allocate, this write will first read the old contents of that memory location into the cache, only to have them totally overwritten by the memory copy operation. The new data will then be written to main memory when the modified cache block is evicted from the cache. Thus a memory copy involves three main memory operations.
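A quick accounting makes this concrete. The function below is just bookkeeping for the three transfers described above (the naming is my own; a copy using clever tricks such as non-allocating streaming stores would avoid the middle term):

```python
def copy_bus_traffic(n_bytes):
    """Main memory traffic, in bytes, for a naive copy on a
    write-back, write-allocate cache."""
    return {
        "source_read": n_bytes,          # fetch source into the cache
        "dest_allocate_read": n_bytes,   # write-allocate read of destination
        "dest_writeback": n_bytes,       # eventual eviction of dirty blocks
        "total": 3 * n_bytes,
    }

# Copying 1 MB therefore moves 3 MB across the memory bus, so the best
# achievable copy rate is one third of the effective memory bandwidth.
traffic = copy_bus_traffic(1 << 20)
```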
This was not meant to be a course in cache memories, only a quick intro to a very important part of the memory system. Now we are ready to look at the memory access itself.
The Memory Access
The only time the processor accesses main memory during ordinary program execution is when a cache miss occurs. The processor then loads a whole cache block into the cache. Accessing main memory is split into several phases because data is stored in the memory in a big matrix. This matrix needs to be accessed with row and column addresses so that the desired data can be found. To understand this we need to define some terms.
- RAS – Row Access Strobe. A signal indicating that the row address is being transferred.
- CAS – Column Access Strobe. A signal indicating that the column address is being transferred.
- tRCD – Time between RAS and CAS.
- tRP – The RAS Precharge delay. Time to switch memory row.
- tCAC – Time to access a column.
Now that we have defined these terms, the normal SDRAM access works like this.
- The CPU addresses the memory row and bank during the time RAS is held (tRP).
- After a certain time, tRCD, the CPU addresses the memory with the column of interest during the time CAS is held (tCAC).
- The addressed data is now available for transfer over the 64-bit memory bus.
- The immediately following 64 bits are transferred in the next cycle, and so on for the whole cache block.
For SDRAM these times are usually presented as 3-2-2 or 2-2-2, where the numbers indicate tCAC, tRP, and tRCD. Thus for 2-2-2 memory, the first 64-bit chunk is transferred after 6 cycles.
It is worth noting that this is not a complete picture of how an SDRAM access works. Several special cases not covered here also exist. This is sufficient right now though.
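Under this simplified model (ignoring the special cases just mentioned), the first-word latency follows directly from the timing triple. A small sketch, with function names of my own choosing:

```python
def first_word_cycles(tcac, trp, trcd):
    """Bus cycles until the first 64-bit chunk, in the simplified model:
    precharge + row access + column access."""
    return trp + trcd + tcac

def first_word_ns(tcac, trp, trcd, bus_mhz):
    """Same latency expressed in nanoseconds for a given bus clock."""
    return first_word_cycles(tcac, trp, trcd) * 1000.0 / bus_mhz

# 2-2-2 SDRAM on a 133 MHz bus: 6 cycles, about 45 ns to the first chunk.
# 3-2-2 SDRAM: 7 cycles.
```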
Two Example Architectures
To shed more light on the inner workings of memory bandwidth, we will consider two architectures and see what we can expect from them. These two example architectures are the AMD K7 and the Intel P6. The example processors from these architectures are the Athlon Thunderbird and the Celeron Mendocino respectively.
| | AMD K7 (Athlon) | Intel P6 (Celeron) |
|---|---|---|
| Cache block size: | 64 Bytes | 32 Bytes |
| Memory bus width: | 64 bits (8 Bytes) | 64 bits (8 Bytes) |
| Memory frequency: | 133 MHz | 66 MHz |
| Peak bus bandwidth: | 1064 MB/s | 528 MB/s |
The above table specifies the cache and memory parameters we are interested in to be able to understand the memory bandwidth of these processors. Of course, there are many more parameters to consider in a real system like chipset type, chipset quality, and BIOS tweak settings. The following reasoning is only an approximation.
What Can We Expect
Using the knowledge gained in the previous sections, we can now calculate the peak effective memory bandwidth. A data streaming program with sequential reads and writes is the type of program we can expect to get the highest practical utilization of the memory bus; the memory benchmark STREAM is a program of this type. The majority of the data reads and writes will be satisfied by the cache, and only when a cache miss occurs will the processor load data from main memory. This main memory read fetches a whole cache block, and once the data is in the cache, the processor continues to read from the cache.

With 32 Byte cache lines, as in the P6 case, the main memory read will take 6 cycles for the first 64-bit chunk and 1 cycle each for the second, third, and fourth to fill the 32 Byte cache block. This is often referred to as 6-1-1-1. Thus a cache line fill takes 9 bus cycles. With a bus clock of 66 MHz, this 32 Byte read takes 136 ns, which makes the cache line fill bandwidth 235 MB/s, or less than half the theoretical peak bandwidth of the bus. With a 64 Byte cache line, the main memory read takes 6-1-1-1-1-1-1-1 cycles to fill the 64 Byte line. These 13 cycles take 98 ns on a 133 MHz bus, leading to a cache line fill bandwidth of 655 MB/s, slightly more than half the peak bus bandwidth. The reason the K7 can utilize more of its theoretical peak bus bandwidth is that it has a larger cache line. We summarize these calculations in the following table.
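The calculation above fits in a few lines. This is the same simplified model (chipset delays ignored), with a function name of my own:

```python
def fill_bandwidth_mb(line_bytes, bus_bytes, first_cycles, bus_mhz):
    """Effective cache line fill bandwidth in MB/s: first chunk after
    `first_cycles` bus cycles, then one chunk per cycle."""
    chunks = line_bytes // bus_bytes
    total_cycles = first_cycles + (chunks - 1)  # e.g. 6-1-1-1 for 4 chunks
    # cycles / MHz gives microseconds; bytes per microsecond equals MB/s
    return line_bytes * bus_mhz / total_cycles

p6 = fill_bandwidth_mb(32, 8, 6, 66)    # 9 cycles at 66 MHz -> ~235 MB/s
k7 = fill_bandwidth_mb(64, 8, 6, 133)   # 13 cycles at 133 MHz -> ~655 MB/s
```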
| | AMD K7 (Athlon) | Intel P6 (Celeron) |
|---|---|---|
| Peak bus bandwidth: | 1064 MB/s | 528 MB/s |
| Peak effective bandwidth: | 655 MB/s | 235 MB/s |
The Real Thing
Having seen what we can expect, how well do these calculations match reality? Remember that we did not take into account any delays the chipset or other parts of the memory system add to the results above, nor did we account for any tweaks to the memory access that the chipset might apply. Let us first examine the systems used in a little more detail.
| | AMD K7 (Athlon) | Intel P6 (Celeron) |
|---|---|---|
| Frequency: | 1100 MHz | 400 MHz |
| Chipset: | Via Apollo KT133 | Intel BX |
| Memory bus frequency: | 133 MHz | 66 MHz |
| Operating system: | Linux 2.4.1 | Linux 2.4.1 |
| Compiler flags: | g77 -O4 -march=i686 -funroll-loops | g77 -O4 -march=i686 -funroll-loops |
To measure the actual memory bandwidth we use STREAM, a streaming memory benchmark by John McCalpin available at http://www.streambench.org/. I have used the new Fortran version of this benchmark, compiled and run under Linux. The results reported here are for the Triad phase of the benchmark.
| | AMD K7 (Athlon) | Intel P6 (Celeron) |
|---|---|---|
| Estimated effective bandwidth: | 655 MB/s | 235 MB/s |
| Measured effective bandwidth: | 685 MB/s | 253 MB/s |
As you can see, the results match very well. In fact, the measured bandwidth is even higher than the peak effective bandwidth we calculated above. I am not certain why this is the case, but the reason must be that the model we have used is too simple to describe the fine details of memory performance. For example, the same row address can be reused across cache misses, which the model does not account for.
What will the future bring, and how will we be able to overcome the memory latency problem? What will DDR signaling give us, and what can features such as hardware prefetching, as in the Pentium 4, give us in memory performance? In the following sections, we will look into this.
The DDR Promise
DDR (Double Data Rate) memory seems to imply that the memory bandwidth will be doubled, and this is indeed the case if we only consider the peak bandwidth of the bus. Applying the lessons of the previous sections, we will see that this does not hold in reality. DDR-SDRAM can do nothing about the initial latency of the lookup in memory. In fact, the current crop of DDR-SDRAM is actually slower than regular good-quality SDRAM when it comes to memory latency. Current PC2100 DDR-SDRAM has 2.5-2-2 access timing. This means that the initial memory access takes 6.5 SDR bus cycles, or 13 DDR bus cycles. The rest of the cache block contents is transferred in a DDR fashion, so a cache line fill on a K7 with PC2100 DDR-SDRAM takes 6.5-0.5-0.5-0.5-0.5-0.5-0.5-0.5 SDR cycles, or 13-1-1-1-1-1-1-1 DDR cycles. With an SDR clock of 133 MHz, the cache line fill takes 10 cycles or 75 ns, giving a bandwidth of 851 MB/s. This is only 40% of the peak bus bandwidth, and only a 30% increase in effective bandwidth over 2-2-2 SDRAM. Indeed, the reported performance increases with DDR-SDRAM in bandwidth-starved applications are in the vicinity of 30%. That 30% is nothing to scoff at, though; any performance improvement is welcome, as long as we understand the implications of the new technology.
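The same line-fill model extends to DDR by allowing fractional chunk costs, counted in SDR cycles as in the text. A sketch, with names of my own choosing:

```python
def fill_bandwidth_mb(line_bytes, bus_bytes, first_cycles, per_chunk, bus_mhz):
    """Effective line fill bandwidth in MB/s; cycle counts are SDR bus cycles."""
    chunks = line_bytes // bus_bytes
    total_cycles = first_cycles + (chunks - 1) * per_chunk
    return line_bytes * bus_mhz / total_cycles  # bytes per microsecond == MB/s

# PC2100 DDR-SDRAM at 2.5-2-2: 6.5 cycles to the first chunk, then 0.5 each.
ddr = fill_bandwidth_mb(64, 8, 6.5, 0.5, 133)  # 10 SDR cycles -> ~851 MB/s

# 2-2-2 SDR SDRAM for comparison: 13 cycles -> ~655 MB/s, a ~30% difference.
sdr = fill_bandwidth_mb(64, 8, 6.0, 1.0, 133)
```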
The Mighty Pentium 4
The Pentium 4 with its i850 chipset sports peak and effective memory bandwidths never before seen in PCs. The Pentium 4 has an impressive quad-pumped 100 MHz front-side bus (the bus between the processor and the chipset), allowing 400 MHz transfers over its 64-bit width. This results in a peak front-side bus bandwidth of 3.2 GB/s. The i850 chipset connects this to two DRDRAM channels, each giving a peak bandwidth of 1.6 GB/s. Thus the peak memory bandwidth is 3.2 GB/s. What is astonishing is that on streaming applications, like the STREAM benchmark, the Pentium 4 achieves close to 50% of its peak bandwidth, or in real numbers more than 1.5 GB/s. This is much more than any PC processor before it, and generally more than most workstation processors. How is this possible?
Two features of the Pentium 4 make this possible: first, the fact that it has two individual DRDRAM channels, each connected to its own memory; second, the fact that the Pentium 4 supports hardware stride prefetching. Now, what does that mean?
Prefetching is a technique used to improve cache performance. By loading data into the cache before it is needed, much of the memory latency can be hidden. But how does the processor know what data to load into the cache before it needs it? As you might imagine, this is a matter of seeing into the future, and not even the Pentium 4 can do that. It has to guess, and this guess is based on the history of memory references. If the processor notices that the data being referenced follows a particular pattern, for instance lying consecutively in memory, it is likely that the pattern will hold for the next memory reference as well. This is how stride prefetching works: if data is being accessed in a consecutive pattern, perhaps with some fixed stride, the processor will start issuing prefetches to hide the memory latency. The prefetching, together with the independent memory buses that can each handle a memory request, gives the Pentium 4 its impressive effective bandwidth of 1.5 GB/s. It should be noted, however, that this is only possible if the prefetching hardware can find a pattern it supports to issue prefetches for, i.e., if the application has a streaming behavior. But on the other hand, nonstreaming applications might not need the high bandwidth anyway.
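To illustrate the idea (this is a toy sketch of the general stride-detection principle, not the actual Pentium 4 implementation), a minimal stride detector might look like this:

```python
def prefetch_candidates(addresses, lookahead=2):
    """If the last few miss addresses share a constant stride,
    predict the next `lookahead` addresses to prefetch."""
    if len(addresses) < 3:
        return []
    stride = addresses[-1] - addresses[-2]
    if stride != 0 and addresses[-2] - addresses[-3] == stride:
        # Stable pattern seen twice in a row: run ahead of the program.
        return [addresses[-1] + stride * i for i in range(1, lookahead + 1)]
    return []  # no stable pattern: issue no prefetches

# Sequential 64 Byte cache line misses trigger prefetches of the next lines;
# an irregular access pattern triggers nothing.
```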
I hope that this article has been able to get its point across. Memory bandwidth is more than the bus connecting the memory to the processor; in fact, the largest single factor in memory bandwidth is memory latency. This is something modern memory interfaces do not improve, and that is why effective memory bandwidth does not scale with the bus bandwidth. Hopefully, future memory standards like DDR-II will help solve this problem, but that is another story.